Ubuntu: Problems with “+” in grep



Question:

I'm trying to write a grep command to find lines like the below in a large text file:

<div class="node_thumbnail" data-type="file" name="GOPR0036.MP4_frame000001.jpg" data="813334c25191468c9f1c57afc99fde60" aid="133948" rel="/Files/ToolTipView?fileId=813334c25191468c9f1c57afc99fde60&pageNo=1&NoCache=101016083044" rev="topMiddle">  

but the + symbol seems to be causing problems in the below commands:

 grep 'data=[a-z,0-9,\"]' file  

Lots of hits

 grep 'data=[a-z,0-9,\"]+' file  

No hits


Solution:1

If you want + to mean "one or more of the preceding atom", then you have to do one of:

  1. Use -E (Extended Regular Expressions) (or -P, PCRE):

    grep -E 'data=[a-z,0-9,\"]+' file

  2. Escape the + as \+ so that it is treated specially in the Basic Regular Expressions (BRE) that grep uses by default (\+ in BRE is a GNU extension):

    grep 'data=[a-z,0-9,"]\+' file  
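
The difference between the two modes can be checked with a small sketch. The two-line sample input and the variable names are made up for illustration, and GNU grep is assumed (for \+ in BRE); the commas are left out of the bracket expression to keep the example minimal:

```shell
# Hypothetical sample input: one line with a quoted value after data=, one without.
sample='data="813334c2" aid="1"
data= nothing here'

# BRE (default): an unescaped + is an ordinary character, so this
# searches for a literal "+" after one character from the set.
bre_plain=$(printf '%s\n' "$sample" | grep -c 'data=[a-z0-9"]+' || true)

# BRE with \+ (GNU extension): one or more characters from the set.
bre_escaped=$(printf '%s\n' "$sample" | grep -c 'data=[a-z0-9"]\+' || true)

# ERE via -E: + is a quantifier without any escaping.
ere=$(printf '%s\n' "$sample" | grep -cE 'data=[a-z0-9"]+' || true)

echo "BRE unescaped: $bre_plain, BRE escaped: $bre_escaped, ERE: $ere"
```

The `|| true` only keeps the sketch from aborting under `set -e`, since grep exits with status 1 when nothing matches.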


Solution:2

Points:

  • + is an ERE (Extended Regular Expression) token meaning "one or more of the preceding token". It can be used unescaped with grep's -E option; in BRE (Basic Regular Expressions, the default for plain grep) it has to be written escaped, as \+.

  • The character class [a-z,0-9,\"] matches any single character that is in a-z, in 0-9, a comma, a backslash, or ". Inside brackets the commas are literal rather than separators, and the backslash does not escape anything. This may not be what you want.

  • grep normally prints the whole matching line; if you want to print only the matched portion, use grep's -o option.
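
To see the effect of the commas together with -o, here is a rough sketch; the input line and variable names are hypothetical and only serve to illustrate the point:

```shell
# Hypothetical input: the value after data= contains a comma.
line='x data=ab,cd y'

# With commas inside the brackets, "," is itself a member of the set,
# so the match runs straight through the comma.
with_comma=$(printf '%s\n' "$line" | grep -oE 'data=[a-z,0-9,"]+')

# Without them, the match stops at the first comma.
without_comma=$(printf '%s\n' "$line" | grep -oE 'data=[a-z0-9"]+')

# -o prints only the matched portion, not the whole line.
echo "$with_comma"     # data=ab,cd
echo "$without_comma"  # data=ab
```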


Based on your example, you can do:

grep -E '\bdata=[a-z0-9"]+\b' file

  • -E enables ERE
  • \b matches at a word boundary (the edge of a word), zero width
  • data= matches data= literally
  • [a-z0-9"] matches any one character from a-z, 0-9, or "
  • + matches the previous token one or more times

Even once your current pattern is corrected, without \b it would still match false positives such as foo fdata=2322ab, data=12AB, and so on.
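
A quick way to check this is to count matching lines with and without \b. This is only a sketch: the three input lines are invented around the examples mentioned above, and \b is assumed to be available (it is in GNU grep):

```shell
# Hypothetical lines: one wanted hit plus the two false positives.
lines='data=2322ab
foo fdata=2322ab
data=12AB'

# Without \b, every line contains something that matches.
loose=$(printf '%s\n' "$lines" | grep -cE 'data=[a-z0-9"]+')

# With \b on both sides, "fdata=" fails the leading boundary and
# "data=12AB" fails the trailing one (no boundary between 2 and A).
strict=$(printf '%s\n' "$lines" | grep -cE '\bdata=[a-z0-9"]+\b')

echo "without \\b: $loose lines, with \\b: $strict line"
```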

Example:

% grep -oE '\bdata=[a-z0-9"]+\b' <<<'<div class="node_thumbnail" data-type="file" name="GOPR0036.MP4_frame000001.jpg" data="813334c25191468c9f1c57afc99fde60" aid="133948" rel="/Files/ToolTipView?fileId=813334c25191468c9f1c57afc99fde60&pageNo=1&NoCache=101016083044" rev="topMiddle"'
data="813334c25191468c9f1c57afc99fde60


Solution:3

Another option is to use egrep:

egrep 'data=[a-z,0-9,\"]+' file  

egrep is bundled with grep; it is just a wrapper around grep -E:

#!/bin/sh
exec grep -E "$@"

This is fine for interactive use. In scripts, however, I would use grep -E (newer GNU grep releases also warn that egrep is obsolescent).

