Java Regex 2 - Duplicate Words

    Sure. The replaceAll call replaced smoo(the the) [the the]orem with smoo(the) [the]orem, instead of replacing smoothe {the the} theorem with smoothe {the} theorem (matching brackets highlight the replaced text).

    The pattern behind m.find() is delimited at word boundaries, so it finds, in turn, "smoothe", "the the" and "theorem". Each of these strings is then passed to replaceAll, so we see these calls in order:

    • "smoothe the the theorem".replaceAll("smoothe", "smoothe") => "smoothe the the theorem",
    • "smoothe the the theorem".replaceAll("the the", "the") => "smoothe theorem",
    • "smoothe theorem".replaceAll("theorem", "theorem") => "smoothe theorem".
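
    To make that sequence concrete, here is a minimal sketch of the loop. The pattern is my assumption, not necessarily the one from the original post: I picked it only because it produces the three matches listed above. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAllTrace {
    // Assumed pattern: a word followed by zero or more repeats of itself.
    // Chosen because it yields "smoothe", "the the" and "theorem" in turn.
    static final Pattern WORDS = Pattern.compile("\\b(\\w+)(\\s+\\1\\b)*");

    // Runs the find/replaceAll loop and records the string after each call.
    static List<String> trace(String input) {
        List<String> steps = new ArrayList<>();
        Matcher m = WORDS.matcher(input); // the matcher scans the ORIGINAL string
        while (m.find()) {
            // String.replaceAll compiles m.group() as a new, unanchored regex.
            input = input.replaceAll(m.group(), m.group(1));
            steps.add(input);
        }
        return steps;
    }

    public static void main(String[] args) {
        for (String step : trace("smoothe the the theorem")) {
            System.out.println(step);
        }
    }
}
```

    The second recorded step is already "smoothe theorem", which is where the output goes wrong.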

    So it was the second call that went wrong. replaceAll treats "the the" as an unanchored regex, so it also matches across "smoothe the" and "the theorem". Our original regexp didn't fail there because it used word boundary assertions. The second call should look like this (with the backslashes doubled, as a Java string literal requires): "smoothe the the theorem".replaceAll("\\bthe the\\b", "the") => "smoothe the theorem"
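
    A small sketch of the anchored call next to the unanchored one. The helper name replaceWholeWords is hypothetical. Pattern.quote and Matcher.quoteReplacement are my additions so that the matched text is treated literally; they are not part of the original exercise.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredReplace {
    // Replaces whole-word occurrences of `phrase` with `replacement`.
    // Pattern.quote makes the phrase literal; the \b assertions stop a match
    // from starting or ending inside a longer word. This assumes the phrase
    // itself begins and ends with a word character.
    static String replaceWholeWords(String input, String phrase, String replacement) {
        String regex = "\\b" + Pattern.quote(phrase) + "\\b";
        return input.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String s = "smoothe the the theorem";
        // Unanchored: also matches across "smoothe the" and "the theorem".
        System.out.println(s.replaceAll("the the", "the"));
        // Anchored: only the standalone "the the" is replaced.
        System.out.println(replaceWholeWords(s, "the the", "the"));
    }
}
```

    The first line printed is "smoothe theorem" and the second is "smoothe the theorem".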

    An aside: the whole structure of this exercise is suspect. If I use the custom input "a a a b a a a a a", then replacing the first repeated 'a' turns the string into "a b a a a". But m.find() is still scanning the original string, so it passes "a a a a a" to replaceAll, and that substring no longer exists... so the trailing three "a"s are left in the output. Calling replaceAll on the string inside the loop is the wrong approach.
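
    For comparison, here is a sketch of the stale-match problem next to one possible alternative: letting the Matcher do the replacement itself in a single pass, with "$1" standing for the first captured word. The pattern and the names staleLoop and collapse are my assumptions, not something from the exercise.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CollapseRepeats {
    // Assumed pattern: a word followed by one or more repeats of itself,
    // separated by whitespace.
    static final Pattern REPEATS = Pattern.compile("\\b(\\w+)(?:\\s+\\1\\b)+");

    // The loop from the exercise: matches come from the original string,
    // but each replacement is applied to the already modified string.
    static String staleLoop(String input) {
        Matcher m = REPEATS.matcher(input);
        while (m.find()) {
            input = input.replaceAll(m.group(), m.group(1));
        }
        return input;
    }

    // Single pass: Matcher.replaceAll rewrites each match where it was found.
    static String collapse(String input) {
        return REPEATS.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String s = "a a a b a a a a a";
        System.out.println(staleLoop(s)); // a b a a a
        System.out.println(collapse(s));  // a b a
    }
}
```

    Because the matcher rewrites its own matches, nothing is looked up again in a string that has already changed, and the word boundaries from the pattern stay in force.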