Tutorial: Linkify text with regular expressions in Java



Question:

I have a wysiwyg text area in a Java webapp. Users can input text and style it or paste some already HTML-formatted text.

What I am trying to do is linkify the text, i.e. convert every URL in the text into its "working counterpart" by wrapping it in <a href="...">...</a>.

This solution works when all I have is plain text:

String r = "http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
Pattern pattern = Pattern.compile(r, Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(comment);
comment = matcher.replaceAll("<a href=\"$0\">$0</a>"); // group 0 is the whole expression

But the problem arises when the text is already formatted, i.e. when it already contains <a href="...">...</a> tags.

So I am looking for a way to keep the pattern from matching a URL that is already part of an HTML anchor tag (<a>). I have read that this can be achieved with lookahead or lookbehind, but I still can't make it work. I am sure I am doing it wrong, because the regex still matches. And yes, I have been playing around with and debugging groups, changing $0 to $1, etc.

Any ideas?


Solution:1

You are close. You can use a "negative lookbehind" like so:

(?<!href=")http:// etc  

Any URL immediately preceded by href=" will be ignored.
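
As a minimal sketch of the lookbehind idea: the URL pattern below is deliberately simplified and is my own assumption, not the regex from the question, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class LookbehindLinkifier {
    // Simplified, illustrative URL pattern (an assumption, not the question's regex).
    // The negative lookbehind (?<!href=") rejects any URL directly preceded by href=".
    private static final Pattern URL = Pattern.compile(
            "(?<!href=\")https?://[\\w.-]+(?:/[\\w./?%&=+#~-]*)?",
            Pattern.CASE_INSENSITIVE);

    static String linkify(String text) {
        // $0 is the whole match, as in the question's code.
        return URL.matcher(text).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        // Plain-text URL: gets wrapped in a link.
        System.out.println(linkify("See http://example.com/docs for details"));
        // URL inside an href attribute: left untouched.
        System.out.println(linkify("<a href=\"http://example.com/\">home</a>"));
    }
}
```

Note that this lookbehind only protects the href attribute. A URL that appears as the visible link text is preceded by > rather than href=", so it would still be matched.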


Solution:2

If you want to use regex (though I think parsing the XML/HTML first is more robust), a look-ahead or look-behind makes sense. A first stab might be to add this at the end of your regex:

(?!</a>)  

Meaning: don't match if a closing </a> tag follows immediately. (This could be tweaked forever, of course.) This doesn't work well on its own, though, because given the string

<a href="...">http://example.com/</a>  

this regex will try to match "http://example.com/", fail due to the lookahead (as we hope), and then backtrack the greedy quantifier at the end and match "http://example.com" instead, which is not followed by </a>.

You can fix this by using possessive quantifiers in place of your +, * and ? operators: just append a + to each of them. This prevents them from backtracking, and is probably good for performance as well.

This works for me (note the three extra +'s):

String r = "http(s)?://([\\w+?\\.\\w+])++([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*+)?+(?!</a>)";
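
To see the backtracking difference in isolation, here is a small sketch comparing a greedy and a possessive version of the same pattern. Both patterns are shortened stand-ins that I made up for illustration, not the full regex above, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {
    // Greedy: the trailing character class can give characters back when the lookahead fails.
    static final Pattern GREEDY =
            Pattern.compile("https?://[\\w.]+[\\w/?=&%.-]*(?!</a>)");
    // Possessive: ++ and *+ keep everything they matched, so there is no shorter fallback.
    static final Pattern POSSESSIVE =
            Pattern.compile("https?://[\\w.]++[\\w/?=&%.-]*+(?!</a>)");

    // Returns the first match of p in input, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"#\">http://example.com/</a>";
        // Greedy backtracks past the final "/" and matches the shortened URL.
        System.out.println(firstMatch(GREEDY, html));
        // Possessive cannot backtrack, so the already-linked URL is not matched at all.
        System.out.println(firstMatch(POSSESSIVE, html));
        // A URL in plain text is still found.
        System.out.println(firstMatch(POSSESSIVE, "visit http://example.com/ now"));
    }
}
```

Keep in mind that the lookahead only protects link text. A URL inside an href attribute is followed by a quote rather than </a>, so it still matches unless you also add a lookbehind.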


Solution:3

If you really want to do it with regex, then:

String r = "(?<![=\"\\/>])http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";

i.e. check that the URL is not immediately preceded by =, ", / or >, as it would be after href=" or directly after a tag.
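
A quick sketch of how this lookbehind behaves, again with a simplified URL pattern that is my own assumption rather than the regex above (the class name is hypothetical too):

```java
import java.util.regex.Pattern;

class PrecedingCharLinkifier {
    // The lookbehind (?<![="/>]) rejects a URL whose preceding character is =, ", / or >.
    // The URL part is a simplified stand-in for illustration.
    private static final Pattern URL = Pattern.compile(
            "(?<![=\"/>])https?://[\\w.-]+(?:/[\\w./?%&=+#~-]*)?");

    static String linkify(String html) {
        return URL.matcher(html).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        // Both the href value (after ") and the link text (after >) are skipped.
        System.out.println(linkify("<a href=\"http://example.com/\">http://example.com/</a>"));
        // A URL after ordinary text is linkified.
        System.out.println(linkify("go to http://example.com/ now"));
        // Limitation: a URL directly after any tag is skipped as well.
        System.out.println(linkify("<p>http://example.com/</p>"));
    }
}
```

The last case is the trade-off of this approach: a bare URL that directly follows any tag, such as <p>, is also left unlinked.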


Solution:4

Perhaps HTML parsing would be more appropriate for you (htmlparser, for example). You would then have HTML nodes and could "linkify" URLs only in the text and not in the attributes.
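
Rather than guess at a particular parser library's API, here is a standard-library-only approximation of the same idea. To be clear, this is not a real HTML parser: it is a regex tokenizer that consumes existing <a> elements and other tags untouched and only wraps URLs found in between. The class name and the URL pattern are my own assumptions, and it will not cope with comments, scripts, malformed markup, or attribute values containing >.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextOnlyLinkifier {
    // Alternatives are tried left to right at each position:
    //   1. a complete <a ...>...</a> element (copied through unchanged)
    //   2. any other tag, including its attributes (copied through unchanged)
    //   3. a bare URL in text, captured as group 1 (wrapped in a link)
    private static final Pattern TOKEN = Pattern.compile(
            "<a\\b[^>]*>.*?</a>|<[^>]+>|(https?://[\\w.-]+(?:/[\\w./?%&=+#~-]*)?)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String linkify(String html) {
        Matcher m = TOKEN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group(1);
            // group(1) is null when a tag or an existing link matched.
            String replacement = (url == null)
                    ? m.group()
                    : "<a href=\"" + url + "\">" + url + "</a>";
            // quoteReplacement keeps any $ or \ in the text from being treated as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>See http://example.com/a and "
                + "<a href=\"http://example.com/b\">http://example.com/b</a> "
                + "<img src=\"http://example.com/c.png\"></p>";
        // Only the first URL is wrapped; the existing link and the img src are left alone.
        System.out.println(linkify(html));
    }
}
```

For anything beyond simple, well-formed input, a real parser remains the more robust choice, as this answer suggests.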


Solution:5

If you have to roll your own, at least look at the algorithms/patterns used in an Open Source implementation of Markdown, e.g., MarkdownJ.

