Tag Content Extractor

  • + 62 comments

    Java solution - passes 100% of test cases

    From my HackerRank solutions.

    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    /* Solution assumes we can't have the symbol "<" as text between tags */
    public class Solution{
        public static void main(String[] args){
            Scanner scan = new Scanner(System.in);
            int testCases = Integer.parseInt(scan.nextLine());
            
            while (testCases-- > 0) {
                String line = scan.nextLine();
                
                boolean matchFound = false;
                Pattern r = Pattern.compile("<(.+)>([^<]+)</\\1>");
                Matcher m = r.matcher(line);
    
                while (m.find()) {
                    System.out.println(m.group(2));
                    matchFound = true;
                }
                if ( ! matchFound) {
                    System.out.println("None");
                }
            }
        }
    }
    

    Let me try to explain the regular expression:

    <(.+)>
    

    matches HTML start tags. The parentheses save the contents inside the brackets into Group #1.

    ([^<]+)
    

    matches all the text in between the HTML start and end tags. We place a special restriction on the text in that it can't have the "<" symbol. The characters inside the parenthesis are saved into Group #2.

    </\\1>
    

    is to match the HTML end brace that corresponds to our previous start brace. The \1 is here to match all text from Group #1.