Tutorial: regex for parsing SQL statements



Question:

I've got an IronPython script that executes a bunch of SQL statements against a SQL Server database. The statements are large strings that actually contain multiple statements, separated by the "GO" keyword. That works when they're run from SQL Server Management Studio and some other tools, but not in ADO. So I split up the strings using the Python 2.5 "re" module like so:

    splitter = re.compile(r'\bGO\b', re.IGNORECASE)
    for script in splitter.split(scriptBlob):
        if script:
            [... execute the query ...]

This breaks in the rare case that there's the word "go" in a comment or a string. How in the heck would I work around that? i.e. correctly parse this string into two scripts:

    -- this is a great database script!  go team go!
    INSERT INTO myTable(stringColumn) VALUES ('go away!')
    /*
      here are some comments that go with this script.
    */
    GO
    INSERT INTO myTable(stringColumn) VALUES ('this is the next script')

EDIT:

I searched more and found this SQL documentation: http://msdn.microsoft.com/en-us/library/ms188037(SQL.90).aspx

As it turns out, GO must be on its own line, as some answers suggested. However, it can be followed by a "count" integer, which will actually execute the statement batch that many times (has anybody actually used that before??), and it can be followed by a single-line comment on the same line (but not a multi-line one; I tested this). So the magic regex would look something like:

"(?m)^\s*GO\s*\d*\s*$"  

Except this doesn't account for:

  • a possible single-line comment ("--" followed by any character except a line break) at the end.
  • the whole line being inside a larger multi-line comment.

I'm not concerned about capturing the "count" argument and using it. Now that I have some technical documentation, I'm tantalizingly close to writing this "to spec" and never having to worry about it again.


Solution:1

Is "GO" always on a line by itself? You could just split on "^GO$".
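
A minimal sketch of that idea, assuming GO really is the only thing on its line (no surrounding whitespace, count, or trailing comment). The names `GO_LINE`, `split_on_go` and `script_blob` are made up for illustration. Note that a bare GO line inside a multi-line comment would still split the script.

```python
import re

# Matches a line consisting of exactly "GO" (any case). MULTILINE makes
# ^ and $ anchor at line boundaries instead of the ends of the string.
GO_LINE = re.compile(r'^GO$', re.IGNORECASE | re.MULTILINE)

def split_on_go(script_blob):
    """Return the non-empty batches found between bare GO lines."""
    return [part.strip() for part in GO_LINE.split(script_blob) if part.strip()]

script_blob = """-- go team go!
INSERT INTO myTable(stringColumn) VALUES ('go away!')
GO
INSERT INTO myTable(stringColumn) VALUES ('this is the next script')
"""

batches = split_on_go(script_blob)
for batch in batches:
    print(batch)
    print('-------')
```

The "go" inside the comment and inside the string literal is left alone here only because neither sits alone on a line.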


Solution:2

Since you can have comments inside comments, nested comments, comments inside queries, etc., there is no sane way to do it with regexes.

Just imagine the following script:

    INSERT INTO table (name) VALUES (
    -- GO NOW GO
    'GO to GO /* GO */ GO' +
    /* some comment 'go go go'
    -- */ 'GO GO' /*
    GO */
    )

And that is without mentioning:

INSERT INTO table (go) values ('xxx') GO  

The only way would be to build a stateful parser instead. One that reads a char at a time, and has a flag that will be set when it is inside a comment/quote-delimited string/etc and reset when it ends, so the code can ignore "GO" instances when inside those.
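
A rough sketch of such a parser, under stated assumptions: GO only counts when its line starts in plain code, block comments may nest, and a quote inside a string is escaped by doubling it. The names `GO_LINE` and `split_batches` are hypothetical, and bracketed identifiers such as [it's] are not handled. As a simplification, a regex is still used to recognize the GO line itself once the scanner knows the line does not start inside a comment or string.

```python
import re

# A GO line: optional whitespace, GO, optional count, optional -- comment.
GO_LINE = re.compile(r'^[ \t]*GO[ \t]*\d*[ \t]*(?:--.*)?$', re.IGNORECASE)

def split_batches(sql):
    """Yield the batches of sql, splitting only on GO lines found in plain code."""
    state = 'code'      # 'code', 'block_comment' or 'string'
    depth = 0           # nesting depth of /* ... */ comments
    quote = ''          # quote character that opened the current string
    batch = []
    for line in sql.splitlines(True):          # True keeps the line endings
        if state == 'code' and GO_LINE.match(line.rstrip('\r\n')):
            yield ''.join(batch)
            batch = []
            continue
        batch.append(line)
        i = 0
        while i < len(line):                   # scan one character at a time
            two = line[i:i + 2]
            if state == 'code':
                if two == '--':
                    break                      # rest of the line is a comment
                if two == '/*':
                    state, depth, i = 'block_comment', 1, i + 1
                elif line[i] in '\'"':
                    state, quote = 'string', line[i]
            elif state == 'block_comment':
                if two == '/*':
                    depth, i = depth + 1, i + 1
                elif two == '*/':
                    depth, i = depth - 1, i + 1
                    if depth == 0:
                        state = 'code'
            elif line[i] == quote:             # state == 'string'
                if line[i + 1:i + 2] == quote:
                    i += 1                     # doubled quote: still inside the string
                else:
                    state = 'code'
            i += 1
    yield ''.join(batch)

sql = """INSERT INTO t (name) VALUES (
-- GO NOW GO
'GO to GO /* GO */ GO' +
/* some comment 'go go go'
-- */ 'GO GO' /*
GO */
)
GO
INSERT INTO t (name) VALUES ('next')
"""

batches = list(split_batches(sql))
for batch in batches:
    print('=======')
    print(batch)
```

Run against the awkward script above (with the table renamed to `t` and a real GO line appended), only the final GO splits the text, because every other GO is seen while the scanner is inside a comment or a string.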


Solution:3

If GO is always on a line by itself you can use split like this:

    #!/usr/bin/python

    import re

    sql = """-- this is a great database script!  go team go!
    INSERT INTO myTable(stringColumn) VALUES ('go away!')
    /*
      here are some comments that go with this script.
    */
    GO 5 --this is a test
    INSERT INTO myTable(stringColumn) VALUES ('this is the next script')"""

    # Raw string, so the backslashes reach the regex engine unchanged.
    statements = re.split(r"(?m)^\s*GO\s*(?:[0-9]+)?\s*(?:--.*)?$", sql)

    for statement in statements:
        print("the statement is\n%s\n" % statement)

  • (?m) turns on multiline matching; that is, ^ and $ will match at the start and end of each line (instead of the start and end of the string).
  • ^ matches at the start of a line
  • \s* matches zero or more whitespace characters (space, tab, etc.)
  • GO matches a literal GO
  • \s* matches as before
  • (?:[0-9]+)? matches an optional integer number (with possible leading zeros)
  • \s* matches as before
  • (?:--.*)? matches an optional end-of-line comment
  • $ matches at the end of a line

The split will consume the GO line, so you won't have to worry about it. This will leave you with a list of statements.

This modified split has a problem: it will not give you back the number after the GO. If that is important, I would say it is time to move to a parser of some form.
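
Short of a full parser, one possible way to keep the number is to walk the matches with finditer and read a named group instead of calling split. This is only a sketch: the names `GO_LINE`, `split_with_counts` and the group name `count` are invented here, and treating a missing number as 1 is an assumption. It shares the limitation that a GO line inside a multi-line comment still splits the script.

```python
import re

# Same shape as the pattern above, but the optional number is captured.
GO_LINE = re.compile(r'^[ \t]*GO[ \t]*(?P<count>\d+)?[ \t]*(?:--.*)?$',
                     re.IGNORECASE | re.MULTILINE)

def split_with_counts(sql):
    """Return (batch, count) pairs; count is 1 when GO carries no number."""
    pairs = []
    last_end = 0
    for match in GO_LINE.finditer(sql):
        batch = sql[last_end:match.start()]
        pairs.append((batch, int(match.group('count') or 1)))
        last_end = match.end()
    tail = sql[last_end:]
    if tail.strip():
        pairs.append((tail, 1))    # trailing batch with no GO after it
    return pairs

sql = """INSERT INTO myTable(stringColumn) VALUES ('go away!')
GO 5 --this is a test
INSERT INTO myTable(stringColumn) VALUES ('this is the next script')"""

pairs = split_with_counts(sql)
for batch, count in pairs:
    print("run %d time(s):\n%s\n" % (count, batch.strip()))
```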


Solution:4

This won't detect if GO ever is used as a variable name inside some statement, but should take care of those inside comments or strings.

EDIT: This now works if GO is part of the statement, as long as it is not on its own line.

    import re

    # Comments and string literals are matched first so that any GO inside
    # them is consumed and never reaches the go_word alternative.
    line_comment = r'(?:--|#).*$'
    block_comment = r'/\*[\S\s]*?\*/'
    single_quote_string = r"'(?:\\.|[^'\\])*'"
    double_quote_string = r'"(?:\\.|[^"\\])*"'
    # GO alone on a line, optionally followed by a count and a line comment.
    go_word = r'^[^\S\n]*(?P<GO>GO)[^\S\n]*\d*[^\S\n]*(?:(?:--|#).*)?$'

    full_pattern = re.compile(r'|'.join((
        line_comment,
        block_comment,
        single_quote_string,
        double_quote_string,
        go_word,
    )), re.IGNORECASE | re.MULTILINE)

    def split_sql_statements(statement_string):
        last_end = 0
        for match in full_pattern.finditer(statement_string):
            # Only GO lines split the script; other matches are skipped.
            if match.group('GO'):
                yield statement_string[last_end:match.start()]
                last_end = match.end()
        yield statement_string[last_end:]

Example usage:

    statement_string = r"""
    -- this is a great database script!  go team go!
    INSERT INTO go(go) VALUES ('go away!')
    go 7 -- foo
    INSERT INTO go(go) VALUES (
        'I have to GO " with a /* comment to GO inside a /* GO string /*'
    )
    /*
      here are some comments that go with this script.
      */
      GO
      INSERT INTO go(go) VALUES ('this is the next script')
    """

    for statement in split_sql_statements(statement_string):
        print('=======')
        print(statement)

Output:

    =======

    -- this is a great database script!  go team go!
    INSERT INTO go(go) VALUES ('go away!')

    =======

    INSERT INTO go(go) VALUES (
        'I have to GO " with a /* comment to GO inside a /* GO string /*'
    )
    /*
      here are some comments that go with this script.
      */

    =======

      INSERT INTO go(go) VALUES ('this is the next script')
