Detect the Domain Name


    This is a solid approach for extracting domain names from HTML content. I like how it uses regex to find URLs, cleans them by removing http(s) and www prefixes, and stores them in a set to avoid duplicates. Sorting at the end makes the output neat and easy to read.

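    import re

    n = int(input())
    htmls = "\n".join(input() for _ in range(n))

    # Match each URL up to a 2-3 letter suffix; the dot before it is escaped
    # so it matches a literal "." rather than any character.
    urls = re.findall(r"https?://[^\s\"'>?]+\.[a-z]{2,3}", htmls)
    domains = set()
    for url in urls:
        domain = re.sub(r"^https?://", "", url)  # strip the scheme
        domain = domain.split("/")[0]  # keep only the host part
        if "." in domain:
            domain = re.sub(r"^www\d*\.", "", domain)  # strip www., www2., ...
            domains.add(domain)
    print(";".join(sorted(domains)))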

    You can also test this against the HTML of a real website such as https://estateagentsilford.co.uk/ to see how it handles live URLs, including subdomains and different top-level domains. It’s a practical way to check your regex and parsing logic beyond the sample inputs.
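To make that kind of check repeatable without fetching a live page, here is a minimal sketch that wraps the same regex logic in a function and runs it on a small hand-written HTML snippet. The function name `extract_domains`, the `sample` markup, and the `example.com` / `example.org` hosts are my own illustrative assumptions, not part of the original solution.

```python
import re

# Same pattern as the solution above, with the dot before the suffix escaped.
URL_RE = re.compile(r"https?://[^\s\"'>?]+\.[a-z]{2,3}")


def extract_domains(html):
    """Return the sorted, de-duplicated domains found in html.

    Hypothetical helper: it mirrors the loop in the solution above.
    """
    domains = set()
    for url in URL_RE.findall(html):
        # Drop the scheme, then keep only the host part before the first "/".
        host = re.sub(r"^https?://", "", url).split("/")[0]
        if "." in host:
            # Strip a leading www., www2., and so on.
            domains.add(re.sub(r"^www\d*\.", "", host))
    return sorted(domains)


# Hand-written sample markup (an assumption, not scraped from any site).
sample = (
    '<a href="https://estateagentsilford.co.uk/">Home</a>\n'
    '<a href="http://www.example.com/about/index.html">About</a>\n'
    '<img src="https://cdn.example.org/logo.png">\n'
    '<a href="https://www2.example.com/contact?ref=nav">Contact</a>\n'
)

print(";".join(extract_domains(sample)))
# -> cdn.example.org;estateagentsilford.co.uk;example.com
```

Two things are worth noticing in the output: the `cdn.` subdomain is kept because only `www`-style prefixes are stripped, and `www.example.com` and `www2.example.com` collapse into one entry. The `{2,3}` bound also means a host ending in a longer suffix such as `.info`, with no path after it, would be cut short, so that is a good case to add when testing further.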