Tutorial: Matching on repeated substrings in a regex



Question:

Is it possible for a regex to match based on other parts of the same regex?

For example, how would I match lines that begin and end with the same sequence of 3 characters, regardless of what the characters are?

Matches:

abcabc
xyz abc xyz

Doesn't Match:

abc123  

Undefined: (Can match or not, whichever is easiest)

ababa
a

Ideally, I'd like something in the Perl regex flavor. If that's not possible, I'd be interested to know whether any other flavors can do it.


Solution:1

Use capture groups and backreferences.

/^(.{3}).*\1$/  

The \1 refers back to whatever was matched by the first capture group (the part of the pattern inside the parentheses). Most regex flavors, Perl's included, support backreferences like this.
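As a rough check, the pattern can be run against the sample lines from the question. This is only a sketch: the helper name same_ends is made up for illustration and is not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a line begins and ends with the same 3 characters.
sub same_ends {
    my ($line) = @_;
    return $line =~ /^(.{3}).*\1$/ ? 1 : 0;
}

# Sample lines taken from the question, including the two "undefined" cases.
for my $line ('abcabc', 'xyz abc xyz', 'abc123', 'ababa', 'a') {
    printf "%-14s %s\n", "'$line'", same_ends($line) ? 'match' : 'no match';
}
```

With this pattern the two "undefined" lines do not match: 'a' is shorter than three characters, and in 'ababa' the leading and trailing "aba" would have to overlap, which a capture followed by a backreference does not allow.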


Solution:2

You need backreferences. The idea is to use a capturing group for the first bit, and then refer back to it when you're trying to match the last bit. Here's an example of matching a pair of HTML start and end tags (from the link given earlier):

<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>  

This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
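To try the tag-matching pattern in Perl, something like the following should work. It is a sketch under a few assumptions: the helper name first_tag_pair and the sample strings are invented here, and the m{...}/qr{...} delimiters are used only so that the / in the closing tag needs no escaping.

```perl
use strict;
use warnings;

# Braces as delimiters, so the "/" of the closing tag is an ordinary character.
my $tag_pair = qr{<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>};

# Hypothetical helper: returns the tag name of the first matched pair, or undef.
sub first_tag_pair {
    my ($html) = @_;
    return $html =~ $tag_pair ? $1 : undef;
}

for my $html ('<B>bold</B>', '<H1 class="x">Title</H1>', '<B>bold</I>') {
    my $tag = first_tag_pair($html);
    print "$html => ", defined $tag ? "pair of $tag tags" : 'no matching pair', "\n";
}
```

As written, the character class [A-Z] only accepts uppercase tag names; adding the /i modifier would be one way to accept lowercase tags as well.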

Applying this to your case:

/^(.{3}).*\1$/  

(Yes, that's the regex that Brian Carper posted. There just aren't that many ways to do this.)

A detailed explanation for posterity's sake (please don't be insulted if it's beneath you):

  • ^ matches the start of the line.
  • (.{3}) grabs any three characters (other than newlines) and saves them in a group for later reference.
  • .* matches anything for as long as possible. (You don't care what's in the middle of the line.)
  • \1 matches the same text that the group captured in step 2.
  • $ matches the end of the line.
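The breakdown above can also be written into the pattern itself with Perl's /x modifier, which ignores whitespace in the pattern and allows comments. A minimal sketch, assuming a made-up helper name (shared_ends):

```perl
use strict;
use warnings;

my $same_ends = qr/
    ^         # start of the line
    (.{3})    # any three characters, saved as group 1
    .*        # anything in the middle, as much as possible
    \1        # the same text that group 1 captured
    $         # end of the line
/x;

# Hypothetical helper: returns the shared 3-character sequence, or undef.
sub shared_ends {
    my ($line) = @_;
    return $line =~ $same_ends ? $1 : undef;
}

my $ends = shared_ends('xyz abc xyz');
print "group 1 captured '$ends'\n" if defined $ends;   # group 1 captured 'xyz'
```

After a successful match, $1 holds whatever the first group captured, which is the same text that \1 had to match again at the end of the line.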


Solution:3

For the same characters at the beginning and end:

/^(.{3}).*\1$/  

The \1 is a backreference to the first capture group.


Solution:4

This works:

my $test = 'abcabc';
print $test =~ m/^([a-z]{3}).*(\1)$/;  # list context: prints both captures, "abcabc"

The ^ and $ anchors are what tie the match to the beginning and the end of the line; without them, the repeated sequence could appear anywhere.
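The effect of the anchors can be seen by running the pattern with and without them on the same line. This is an illustrative sketch; the helper repeated_text and the sample line are assumptions, not part of the original answer.

```perl
use strict;
use warnings;

my $unanchored = qr/(.{3}).*\1/;     # repeat may sit anywhere in the line
my $anchored   = qr/^(.{3}).*\1$/;   # repeat must be the first and last 3 characters

# Hypothetical helper: returns the repeated text, or undef when there is no match.
sub repeated_text {
    my ($line, $pattern) = @_;
    return $line =~ $pattern ? $1 : undef;
}

my $line = 'xxabc123abcxx';
for my $case ([unanchored => $unanchored], [anchored => $anchored]) {
    my ($name, $pattern) = @$case;
    my $found = repeated_text($line, $pattern);
    print "$name: ", defined $found ? "match on '$found'" : 'no match', "\n";
}
```

Here the unanchored pattern finds the repeated "abc" in the middle of the line, while the anchored one fails because the line starts with "xxa" and ends with "cxx".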

