Tutorial :.NET RegExp engine search performance optimization



Question:

I have List collection with around 35,000 strings

Typical string looks like this:

"<i>füüs</i>ampri tähis;lüh ld-st<i>anno</i>, aastal;<i>maj</i> lüh pr-st<i>argent</i>, raha (kursisedelitel)"  

Basically this string contains bunch of words in Estonian :)

I need to allow user to perform RegExp search on 35,000 strings

If I perform search using /ab.*/ expression, then search takes less than a second

If I perform search using /.*ab/ expression, then search takes around 10 seconds

My question is: How can I make second search faster (less then 1.5 seconds)?

Thank You very much


Solution:1

Use compiled regular expressions for better performance

http://en.csharp-online.net/CSharp_Regular_Expression_Recipesâ€"Compiling_Regular_Expressions

copy paste the full url, looks like there is rendering problem with this link.


Solution:2

It’s how regular expressions are processed that makes them perform so different. To explain that based on your examples:

  • /.*ab/   This expression consists on two sub-expressions, the .* and the literal ab. This would be processed as follows: In the normal greedy mode, where every quantor and thus the match is expanded to its maximum, the .* will first match the whole string. Then it will try to match the following ab. But as it is not possible (we’re already at the end of the string) backtracking will be used to find the last point where both sub-expressions match. So the match of .* is reduced by one character and again the ab is tested. If the whole expression cannot be matched, the match of .* again will be reduced by one character until the whole expression is matched. In the worst case there is no ab in the string and the algorithm will do n+1 backtracks and additional tests for ab until it can determine that a match is impossible.

  • /ab.*/   This expression consists of two sub-expressions too. But here the order is changed, first the literal ab and than the .*. This is processed as follows: The algorithm first tries to find the literal ab by comparing one character by another. If there is a match, it tries to find a match for .* that is obvious easy.

The main difference between those two regular expressions is, that the second has the static part at the beginning and the variable part at the end. This makes no backtracking necessary.

Try some regular expression analysis tool such as RegexBuddy to see the difference visually.


Solution:3

There are two possible modifications I can suggest for the slow .*ab expression.

I performed my tests with this test string "1234567890 ab 1234567890" using the benchmarking feature in Regex Hero.

A. 5X faster than original

^.*ab  RegexOptions.None  

or

B. 8X faster than original

.*ab  RegexOptions.RightToLeft  

Sometimes experimentation pays off. The RightToLeft option was my "ah ha!" moment. That essentially returns the same performance as your other ab.* expression by preventing the massive backtracking from ever occurring.


Solution:4

I got this crazy idea that you could also store the strings in reverse order and search those strings with /ba.*/ if the user enter /.*ab/.


Solution:5

Your second expression will match 'ab' and all characters before it (except the new line). You can try searching only /ab/, get the index of the match and as a result concat the part of the string preceeding the match with match.


Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Previous
Next Post »