Tutorial :Regex optimization question



Question:

As a personal learning exercise, I wrote this regex to split a unary string into parts whose length is increasing powers of two (see also on ideone.com):

    for (String s :         new String(new char[500])         .split("(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2\\2.\\1)^.*)")      ) {         System.out.printf("%s ", s.length());      }      // prints "1 2 4 8 16 32 64 128 245 "  

This uses a combination of capturing during lookarounds, nested lookarounds, matching on backreferences, and infinite length lookbehind (which isn't officially supported in Java but works anyway). Properties of sums of powers of two and the fact that the string has a unary alphabet is also taken advantage of.

This solution is both unreadable and has a horrible performance.

My question is, how would you "optimize" this regex?

  • Can you make the regex more readable (it's okay if it performs worse)
  • Can you make the regex performs better (it's okay if it's less readable)


Solution:1

I am not a Java guy, therefore my answers is based on the .NET Regex implementation. I used

"(?<=(^.*)\\G.*)(?<=\\G\\1.)"  

based on the fact that \sum_{i=0}^{n} 2^n = 2^{n+1} - 1. It basically reads "Match every position, for which the part after the last match is one longer than the part before the last match."

Its about twice as fast as your original (on .NET, again), taking less than 2 seconds to split 10.000 characters, and I'd argue that its a little more readable. Well... less unreadable. =)

Cheers! Nice question! =)

Edit: Looking at your Regex again, it seems to me that you are using the same approach, but in a more complicated manner. I admit that I have not tried to read yours before trying to make my own solution, both because I like a challenge and because your regex is quite unreadable. =) Are these nested lookarounds neccessary because of the Java regex engine?


Solution:2

I really wouldn't. I'd throw the whole thing away and redo it as readable procedural code.

There are some things you really shouldn't do with regular expressions. This is one of them. I'm all for educating yourself but do you really think that's going to come in handy at some point?

Perhaps you'd be better off learning something that will actually be usable and maintainable :-)


Solution:3

These are patterns that have worked for me in Java. I'll eventually revise everything into one comprehensive answer with FULL explanations. These are all Java string literal representations.

  • P000: "(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2\\2.\\1)^.*)"
  • P001: "(?=(.*))(?<=(?<=(^.*))\\G.*)(?<=(?=\\2.\\1)\\G.*)"
    • Essentially the same as P000, but instead of ^\2\G\2.\1, we cut off the head to just \G\2.\1
    • 500 in 2.21s (ideone.com)
  • P002: "(?=(.*))(?<=(?<=(^.*))(?=\\2.\\1)\\G.*)"
    • Much slower than P000 but shorter
    • Essentially a "refactoring" of P001 now that both lookbehinds are anchored at \G
    • 400 in 3.05s (ideone.com)

Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Previous
Next Post »