Detect HTML links

Sort by

recency

|

210 Discussions

|

  • + 0 comments

    For Python 3:

    import re
    A_TAG = re.compile(
        r'<a\s+[^>]*?href\s*=\s*' 
        r'([\'"])(.*?)\1'        
        r'[^>]*>'                
        r'(.*?)'                 
        r'</a>',                 
        flags=re.IGNORECASE | re.DOTALL
    )
    
    TAG_STRIP = re.compile(r'<[^>]+>')
    WS        = re.compile(r'\s+')
    
    def clean_text(raw):
        return WS.sub(' ', TAG_STRIP.sub('', raw)).strip()
    
    N = int(input())
    out_lines = []
    for line in range(N):
        line = input()
        for quote, url, txt in A_TAG.findall(line):
            out_lines.append(f"{url},{clean_text(txt)}")
    print("\n".join(out_lines))
    
  • + 0 comments

    Detecting HTML links using regular expressions is a common application in web scraping and data validation, though it's often discouraged for parsing complex HTML due to potential inaccuracies (source: Wikipedia – Regular expression).Detecting HTML links using regular expressions is a common application in web scraping and data validation, though it's often discouraged for parsing complex HTML due to potential inaccuracies

  • + 0 comments

    Detecting HTML links typically involves using regular expressions or DOM parsers to match anchor () tags and extract the href attribute. According to Wikipedia, anchor elements are core to hyperlinking in web documents, often requiring context-aware parsing to avoid false positives. Developers usually prefer regex for speed, but tools like BeautifulSoup or JS DOM methods offer more reliability in complex cases. I once practiced this by scraping real-world structured data and categorizing links from different sections of the restaurant industry. One fun example I used was the Olive Garden menu, which had nested item pages and link structures worth parsing.

  • + 0 comments

    Only Regex solution, no string modification except trim (which is given)

    const regex = <a .*?href="(.*?)".*?>([^<].*?)<.*?; const parseAll = new RegExp(regex, "g");
    const parseInv = new RegExp(regex); input.match(parseAll).forEach((str) => { const res = parseInv.exec(str); console.log(res[1] + "," + res[2].trim()); });

  • + 0 comments

    Python 3

    import re
    import sys
    
    html = sys.stdin.read()
    
    pattern=r'href=\"([^-]*?)\"[^>]*>(<[^>]*>)?\s?([^>]*?)<'
    
    matches = re.findall(pattern, html)
    
    for match in matches:
        print(",".join((match[0], match[2])))