How do I write a regular expression for these path expressions?


I'm trying to write a helper method that breaks down path expressions and would love some help. Please consider path patterns like the following four (round brackets indicate predicates):

  1. item.sub_element.subsubelement(@key = string) ; or,
  2. item..subsub_element(@key = string) ; or,
  3. //subsub_element(@key = string) ; or,
  4. item(@key = string)

What would a regular expression look like that matches those?

What I have come up with is this:


I'm reading this as: "match one or more occurrences of a string that consists of two groups: group one consists of one or more words with optional underscores and an optional double forward slash prefix; group two is optional and consists of at least one word with all other characters optional; the groups are trailed by zero to two dots."

However, a test run on the fourth example with Matcher.matches() returns false. So, where's my error?

Any ideas?



Edit: from trying with http://www.regexplanet.com/simple/index.html it seems I wasn't aware of the difference between the matches() and find() methods of the Matcher object. I was trying to break down the input string into substrings that match my regex. Consequently I need to use find(), not matches().
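To make the difference concrete, here is a minimal sketch. The class name `MatchesVsFind` and the simple `\w+` token pattern are assumptions for illustration only; they are not the pattern from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // Hypothetical token pattern, used only for illustration.
    static final Pattern TOKEN = Pattern.compile("\\w+");

    // matches() succeeds only if the entire input matches the pattern.
    static boolean matchesWhole(String input) {
        return TOKEN.matcher(input).matches();
    }

    // find() scans the input for successive matching substrings.
    static List<String> findAll(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("item.sub_element")); // false: the dot is not a word character
        System.out.println(findAll("item.sub_element"));      // [item, sub_element]
    }
}
```

So matches() answers "is the whole string one match?", while repeated find() calls walk through the pieces.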

Edit2: This does the trick



You misunderstand character classes, I think. I've found that for testing regular expressions, http://gskinner.com/RegExr/ is of great help. As a tutorial for regular expressions, I'd recommend http://www.regular-expressions.info/tutorial.html.

I am not entirely sure how you want to group your strings. Your sentence seems to suggest that your first group is just the item part of item..subsub_element(@key = string), but then I am not sure what the second group should be. Judging from what I can deduce from your regex, I'll just group the part before the brackets into group one and the part in the brackets into group two. You can surely modify this if I misunderstood you.

I don't escape the expressions for Java string literals, so you'd have to do that yourself (for example, \. becomes \\.). =)

The first group should begin with an optional double slash. I use (?://)?. Here ?: means that this part should not be captured, and the last ? makes the group before it optional.

Following that, there are words, containing characters and underscores, grouped by dots. One such word (with trailing dots) can be represented as [a-zA-Z_]+\.{0,2}. The \w you use actually is a shortcut for [a-zA-Z0-9_], I think. It does NOT represent a word, but a "word character".

This last expression may be present multiple times, so the capturing expression for the first group looks like ((?://)?(?:[a-zA-Z_]+\.{0,2})+).


For the part in the brackets, one can use \([^)]*\), which means an opening bracket (escaped, since it has special meaning), followed by an arbitrary number of non-brackets (not escaped, since it has no special meaning inside a character class), and then a closing bracket.

Combined with ^ and $ to mark the beginning and end of line respectively, we arrive at
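As a sketch in Java (escaped for a string literal), the pieces described above could be assembled as follows. The class name `PathSplitter`, the `split` helper, and the choice to make the bracketed group optional are assumptions for illustration, not necessarily the exact expression intended here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathSplitter {
    // Assumed assembly of the pieces described above.
    static final Pattern PATH = Pattern.compile(
        "^((?://)?(?:[a-zA-Z_]+\\.{0,2})+)"  // group 1: optional "//", then words with up to two trailing dots
        + "(\\([^)]*\\))?$"                  // group 2: bracketed predicate (treating it as optional is an assumption)
    );

    // Returns {pathPart, predicateOrNull}, or null if the input does not match.
    static String[] split(String input) {
        Matcher m = PATH.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("item..subsub_element(@key = string)");
        System.out.println(parts[0] + " | " + parts[1]); // item..subsub_element | (@key = string)
    }
}
```

With matches() the ^ and $ anchors are redundant, but they keep the pattern usable with find() as well.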


If I misunderstood your requirements and you need help with those, please ask in the comments.


You may find this website useful for testing your regexes: http://www.fileformat.info/tool/regex.htm.

As a general approach, try building the regex up from one that handles a simple case, write some tests and get it to pass. Then make the regex more complicated to handle the other cases as well. Make sure it passes both the original and the new tests.
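A minimal sketch of that workflow, assuming two hypothetical steps (`STEP1`, `STEP2`) and a small `check` helper that are not part of the question:

```java
import java.util.regex.Pattern;

class IncrementalRegex {
    // Step 1: a single word such as "item".
    static final Pattern STEP1 = Pattern.compile("\\w+");
    // Step 2: extend step 1 to words separated by one or two dots.
    static final Pattern STEP2 = Pattern.compile("\\w+(?:\\.{1,2}\\w+)*");

    // Throws if the pattern's full-match result differs from the expectation.
    static void check(Pattern p, String input, boolean expected) {
        if (p.matcher(input).matches() != expected) {
            throw new AssertionError(p + " on \"" + input + "\": expected " + expected);
        }
    }

    public static void main(String[] args) {
        check(STEP1, "item", true);
        check(STEP2, "item", true);               // the original test must still pass
        check(STEP2, "item..sub_element", true);  // new case handled by step 2
        check(STEP2, "item...sub", false);        // three dots are rejected
        System.out.println("all checks passed");
    }
}
```

Each further step (double-slash prefix, bracketed predicates) would add its own checks while keeping the earlier ones.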


There are so many things wrong with your pattern:

/{2}?: what do you think ? means here? Because if you think it makes /{2} optional, you're wrong. Instead ? is a reluctant modifier for the {2} repetition. Perhaps something like (?:/{2})? is what you intend.
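A small sketch of the difference; the class name `OptionalSlashes` and the trailing `\w+` are assumptions added so that the two patterns can be compared on whole inputs.

```java
import java.util.regex.Pattern;

class OptionalSlashes {
    // "/{2}?" is a reluctant "exactly two" repetition, so two slashes are still required.
    static final Pattern RELUCTANT = Pattern.compile("/{2}?\\w+");

    // "(?:/{2})?" wraps the repetition in a non-capturing group and makes that group optional.
    static final Pattern OPTIONAL = Pattern.compile("(?:/{2})?\\w+");

    public static void main(String[] args) {
        System.out.println(RELUCTANT.matcher("item").matches());   // false
        System.out.println(RELUCTANT.matcher("//item").matches()); // true
        System.out.println(OPTIONAL.matcher("item").matches());    // true
        System.out.println(OPTIONAL.matcher("//item").matches());  // true
    }
}
```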

[\w+_*] : what do you think the + and * mean here? Because if you think they represent repetition, you're wrong. This is a character class definition, and + and * literally mean the characters + and *. Perhaps you intend... actually I'm not sure what you intend.
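This is easy to check with a short sketch; the class name `CharClassDemo` and its helper are hypothetical.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Inside [...] the + and * are literal characters, not quantifiers.
    static final Pattern CLASS = Pattern.compile("[\\w+_*]");

    // True if the input is exactly one character from the class.
    static boolean isOneClassChar(String s) {
        return CLASS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isOneClassChar("+"));  // true: literal plus
        System.out.println(isOneClassChar("*"));  // true: literal asterisk
        System.out.println(isOneClassChar("a"));  // true: word character
        System.out.println(isOneClassChar("ab")); // false: the class matches a single character only
    }
}
```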

Solution attempt

Here's an attempt at guessing what your spec is:

    String PART_REGEX =
        "(word)(?:<<@(word) = (word)>>)?"
            .replace("word", "\\w+")
            .replace(" ", "\\s*")
            .replace("<<", "\\(")
            .replace(">>", "\\)");

    Pattern entirePattern = Pattern.compile(
        "(?://)?part(?:\\.{1,2}part)*"
            .replace("part", PART_REGEX)
    );
    Pattern partPattern = Pattern.compile(PART_REGEX);

Then we can test it as follows:

    String[] tests = {
        "item.sub_element.subsubelement(@key = string)",
        "item..subsub_element(@key = string)",
        "//subsub_element(@key = string)",
        "item(@key = string)",
        "one.dot",
        "two..dots",
        "three...dots",
        "part1(@k1=v1)..part2(@k2=v2)",
        "whatisthis(@k=v1=v2)",
        "noslash",
        "/oneslash",
        "//twoslashes",
        "///threeslashes",
        "//multiple//double//slashes",
        "//multiple..double..dots",
        "..startingwithdots",
    };
    for (String test : tests) {
        System.out.println("[ " + test + " ]");
        if (entirePattern.matcher(test).matches()) {
            Matcher part = partPattern.matcher(test);
            while (part.find()) {
                System.out.printf("  [%s](%s => %s)%n",
                    part.group(1),
                    part.group(2),
                    part.group(3)
                );
            }
        }
    }

The above prints:

    [ item.sub_element.subsubelement(@key = string) ]
      [item](null => null)
      [sub_element](null => null)
      [subsubelement](key => string)
    [ item..subsub_element(@key = string) ]
      [item](null => null)
      [subsub_element](key => string)
    [ //subsub_element(@key = string) ]
      [subsub_element](key => string)
    [ item(@key = string) ]
      [item](key => string)
    [ one.dot ]
      [one](null => null)
      [dot](null => null)
    [ two..dots ]
      [two](null => null)
      [dots](null => null)
    [ three...dots ]
    [ part1(@k1=v1)..part2(@k2=v2) ]
      [part1](k1 => v1)
      [part2](k2 => v2)
    [ whatisthis(@k=v1=v2) ]
    [ noslash ]
      [noslash](null => null)
    [ /oneslash ]
    [ //twoslashes ]
      [twoslashes](null => null)
    [ ///threeslashes ]
    [ //multiple//double//slashes ]
    [ //multiple..double..dots ]
      [multiple](null => null)
      [double](null => null)
      [dots](null => null)
    [ ..startingwithdots ]

