Ubuntu: Find multiple word-patterns in files



Question:

I have around 50000 files (.txt), plus other items, in the filesdir folder. The values 'fax', 'phone' and 'address' appear in these files in different combinations. I need to find all files that contain 'fax' AND 'phone' and do not contain 'address'. I tried a for loop with a few grep commands. ls gives 'too many arguments'. So I tried:

find /filesdir/ -maxdepth 1 -name '*.txt' -exec grep -l 'fax' \; grep -l 'phone' \; grep -l -v 'address'  

Why does it not work?


Solution:1

There are several reasons that would not work:

  • you have omitted the {} placeholder for the -exec
  • you are trying to -exec multiple grep commands with a single invocation
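  • grep -l -v 'address' does not exclude files that contain 'address': -v inverts the match per line, so the file is listed as soon as any single line lacks the word. To list files with no matching line at all, use -L. Chained tests in find are combined with a logical AND by default, which is what you want for fax AND phone AND not address.

I haven't fully tested it, but I think you want something more like

find /filesdir/ -maxdepth 1 -name '*.txt' -exec grep -q 'fax' {} \; -exec grep -q 'phone' {} \; -exec grep -L 'address' {} \;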
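A quick way to check a command like this against the stated requirement (contains 'fax' and 'phone', lacks 'address') is to run it on a few throwaway files first. The sketch below assumes a temporary directory and made-up file names (both.txt, fax_only.txt, all_three.txt) that are not part of the question. It chains one grep -q per required word and negates the test for 'address'.

```shell
#!/bin/sh
# Throwaway fixture: the directory and file names are hypothetical.
dir=$(mktemp -d)
printf 'fax: 1\nphone: 2\n'             > "$dir/both.txt"
printf 'fax: 1\n'                       > "$dir/fax_only.txt"
printf 'fax: 1\nphone: 2\naddress: 3\n' > "$dir/all_three.txt"

# Each -exec acts as a test. find evaluates them left to right and stops at
# the first one that fails, so -print runs only when all three conditions hold.
matches=$(find "$dir" -maxdepth 1 -name '*.txt' \
    -exec grep -q 'fax' {} \; \
    -exec grep -q 'phone' {} \; \
    ! -exec grep -q 'address' {} \; \
    -print)

echo "$matches"
rm -rf "$dir"
```

Only the file that has both words and no 'address' line should be printed. Once the output looks right, the same tests can be pointed at the real directory.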


Solution:2

git grep

You can use git grep for multiple patterns combined using Boolean expressions, e.g.:

git grep --all-match --no-index -e "fax" --and -e "phone" --and --not -e "address"  

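You can combine different patterns with Boolean expressions such as --and, --or and --not. Note that these operators combine patterns within a single line, so the command above reports lines that contain both fax and phone but not address. To require that patterns appear anywhere in the same file, list them with -e (implicitly joined by --or) together with --all-match.

--all-match When giving multiple pattern expressions combined with --or, this flag is specified to limit the match to files that have lines to match all of them.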

--no-index Search files in the current directory that is not managed by Git.

-l/--files-with-matches/--name-only Show only the names of files.

-e The next parameter is the pattern. Default is to use basic regexp.

Other params to consider:

--threads Number of grep worker threads to use.

-q/--quiet/--silent Do not output matched lines; exit with status 0 when there is a match.

To change the pattern type, you may also use -G/--basic-regexp (default), -F/--fixed-strings, -E/--extended-regexp, -P/--perl-regexp, -f file, and others.

Check man git-grep for further help.

grep

Here is a grep command that uses a chain of command substitutions:

grep -L "address" $(grep -l "phone" $(grep -rl "fax" .))  

Explanation:

  1. Find the filenames that contain the "fax" pattern (grep -rl "fax" .).
  2. Filter the found filenames down to those that also contain the "phone" pattern (grep -l "phone" $(cmd)).
  3. Filter further to keep only the files that do not contain "address" (grep -L "address" $(cmd)).

If you're working with large data, consider using ripgrep instead.
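With tens of thousands of files, the nested command substitutions above can exceed the argument-length limit, and they split file names that contain whitespace. One possible workaround, sketched here under the assumption of GNU grep and GNU xargs (for -Z, -0 and -r), passes NUL-terminated file names between the three stages instead. The directory and file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data; "with space.txt" checks whitespace handling.
dir=$(mktemp -d)
printf 'fax: 1\nphone: 2\n'             > "$dir/with space.txt"
printf 'fax: 1\nphone: 2\naddress: 3\n' > "$dir/all_three.txt"
printf 'phone: 2\n'                     > "$dir/phone_only.txt"

# Stage 1: files containing "fax", names NUL-terminated (-Z).
# Stage 2: of those, files that also contain "phone".
# Stage 3: of those, files with no "address" line (-L).
# xargs -0 reads NUL-terminated names; -r skips the run on empty input.
result=$(cd "$dir" &&
    grep -rlZ 'fax' . |
    xargs -0 -r grep -lZ 'phone' |
    xargs -0 -r grep -L 'address')

echo "$result"
rm -rf "$dir"
```

Because xargs splits long lists into several grep invocations, this form should not run into the 'too many arguments' error mentioned in the question.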

find

The example above may not work well with file names that contain whitespace, so here is a version using find:

find . -type f -name '*.txt' \
    -execdir bash -c 'grep -L "address" "$(grep -l "phone" "$(grep -l "fax" "{}")")"' ';' \
    2>/dev/null
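Substituting {} directly into the bash -c string can misbehave for names that contain quotes or $ characters. A commonly used alternative, shown here as a sketch rather than as the answer's own method, passes the file names as positional parameters and lets -exec ... {} + hand many files to each shell invocation. The sample directory and file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data, including names with a quote and a space.
dir=$(mktemp -d)
printf 'fax: 1\nphone: 2\n'    > "$dir/it's both.txt"
printf 'fax: 1\naddress: 3\n'  > "$dir/no_phone.txt"
printf 'fax\nphone\naddress\n' > "$dir/all three.txt"

# The inline script loops over its arguments ("$@"); the trailing "sh" fills $0.
# A name is printed only if both words are found and "address" is not.
result=$(find "$dir" -type f -name '*.txt' -exec sh -c '
    for f in "$@"; do
        grep -q "fax" "$f" &&
        grep -q "phone" "$f" &&
        ! grep -q "address" "$f" &&
        printf "%s\n" "$f"
    done' sh {} +)

echo "$result"
rm -rf "$dir"
```

Batching with + starts far fewer shells than one -execdir call per file, which may matter with around 50000 files.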

See also: Check if all of multiple strings or regexes exist in a file


Solution:3

Printing the file names and their content on one line for each file

I think this command line will do it:

find -maxdepth 1 -name "*.txt" -exec echo "{} :" \; -exec cat {} \; -exec echo EOF \;| tr '\n' ' '|sed 's/EOF /\n/g'|grep -iv 'address'|grep -i 'fax'|grep -i 'phone'  

Explanation:

  • for each file (which is found by find)

    • echo the file name
    • print the content
    • print an End Of File flag (it should be different from anything that can appear inside the files, so select this flag carefully! I use EOF, you may need something else)
  • for the whole output

    • convert the newlines to spaces to get everything on one line
    • convert the End Of File flags to newlines

    Now the content of each file is on one separate line, suitable for grep.

  • and finally

    • skip lines with 'address'
    • from the remaining output, select lines with 'fax'
    • from the remaining output, select lines with 'phone'
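To see what the merging steps above produce, the pipeline can be run on two small files and inspected before the grep filters are applied. This sketch assumes GNU find and GNU sed (for {} inside a longer argument and for \n in the replacement). The file names are hypothetical, and sort is added only to make the order predictable.

```shell
#!/bin/sh
# Hypothetical sample data.
dir=$(mktemp -d)
printf 'fax 1\nphone 2\n'   > "$dir/a.txt"
printf 'fax 1\naddress 3\n' > "$dir/b.txt"

# Merge each file into one line of the form "<name> : <content>".
merged=$(cd "$dir" &&
    find . -maxdepth 1 -name "*.txt" \
        -exec echo "{} :" \; -exec cat {} \; -exec echo EOF \; |
    tr '\n' ' ' | sed 's/EOF /\n/g' | sort)

# Apply the same three filters as the command above.
kept=$(printf '%s\n' "$merged" |
    grep -iv 'address' | grep -i 'fax' | grep -i 'phone')

printf '%s\n' "$merged"
echo "kept: $kept"
rm -rf "$dir"
```

Each file becomes a single line that starts with its name, so the three grep filters end up deciding per file rather than per line.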

Printing only the file names

The previous command line prints the file names and the file content (merged to one line), which is good for testing, but not for processing thousands of files.

The following command line prints only the file names. It uses ':::' to separate each file name from the content of the file.

find -maxdepth 1 -name "*.txt" -exec echo "{} :::" \; -exec cat {} \; -exec echo EOF \;| tr '\n' ' '|sed 's/EOF /\n/g'|grep -iv 'address'|grep -i 'fax'|grep -i 'phone' | sed 's/ :::.*//'  


Solution:4

To find files that do not contain the pattern address (this also works for file names containing whitespace or newlines):

find -type f ! -exec grep -q 'address' {} \; -print   

and to print only those that also contain the patterns fax and phone, in any order, anywhere in the file:

find -type f ! -exec grep -q 'address' {} \; \
    -exec grep -qzP '(?s)(?=.*?fax)(?=.*?phone)' {} \; -print

Or POSIXly:

find . -type f ! -exec grep -q 'address' {} \; \
    -exec grep -q 'fax' {} \; \
    -exec grep -q 'phone' {} \; -print

Or, assuming there are no newline characters in the file names:

grep -lzP '(?s)(?=.*?fax)(?=.*?phone)' * | xargs -d'\n' grep -L address

  • (?=pattern): positive lookahead. The construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign. It asserts that the pattern can match at the current position without consuming any characters.

  • (?s): the "dot-all" flag, which tells grep to allow the dot . to match newline characters as well. Because grep normally tests one line at a time, this only has an effect when the whole file is read as a single record, for example with GNU grep's -z option (assuming the file contains no NUL bytes).

  • .*? matches any character (.) zero or more times (*), as few times as possible: the ? after * makes the quantifier lazy, not optional. It is followed by the pattern to find (fax or phone).
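The lookahead pattern can be tried on a small file to see how it behaves. The sketch below assumes GNU grep built with PCRE support (-P) and uses -z so that the whole file is read as a single record, which is what lets (?s) span lines. The file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data.
dir=$(mktemp -d)
# Both words present, on separate lines and in reverse order.
printf 'phone: 2\nfax: 1\n' > "$dir/two_lines.txt"
printf 'fax only\n'         > "$dir/fax_only.txt"

pattern='(?s)(?=.*?fax)(?=.*?phone)'

# -z: treat NUL as the record separator, so a text file is one record.
# -l: print the names of files in which the pattern matches.
result=$(cd "$dir" && grep -lzP "$pattern" -- *.txt)

echo "$result"
rm -rf "$dir"
```

Since both lookaheads are tested from the same starting position, the order of fax and phone in the file does not matter.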

Further reading:

Regex lookahead, lookbehind and atomic groups

