Ubuntu: Extract one element from lines of a text file



Question:

The grep command prints the whole line when the line contains a string that matches the expression, which is not handy when searching for specific content.

For instance, I have vocabulary files formatted like this:

**word**
1. Definition:
2. Usage
3. Others

I'd like to retrieve all the words from the files to make a word list.

grep '\*\*[^*]*\*\*'  

This returns the bulk of the content.

How can I use grep to extract only the word?


Solution:1

Using awk:

awk -F'*\\*' 'NF>2{print $2}' infile  

Sample test input:

*wrd*
*woooord
**WRD
WORD**
woooooooooood*
**word**

Output:

word  
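To make the field logic of the awk command easier to follow, here is a minimal Python sketch of the same idea: split each line on the two-star separator and keep the second field only when there are more than two fields. The name second_field and the SAMPLE list are illustrative assumptions, not part of the original answer.

```python
def second_field(line, sep="**"):
    """Return field 2 of a line split on sep, or None when the line
    has two fields or fewer (the NF>2 test in the awk command)."""
    fields = line.rstrip("\n").split(sep)
    return fields[1] if len(fields) > 2 else None


# The sample test input from the answer, one entry per line.
SAMPLE = ["*wrd*", "*woooord", "**WRD", "WORD**", "woooooooooood*", "**word**"]

words = [w for w in map(second_field, SAMPLE) if w is not None]
print(words)
```

Only a line with a separator on both sides of the word yields three fields, which is why the half-marked entries are skipped.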


Solution:2

To extract the word, use a Perl-compatible regex (-P):

grep -oP '^\s*\*\*\K[^*]+(?=\*\*)' file  

Output :

word  

To extract the first word of each numbered item:

grep -oP '^\s*\d+\.\s*\K\w+' file  

Output :

Definition
Usage
Others
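In the grep patterns above, \K discards everything matched so far, so only the text after it is printed. Python's re module has no \K, but a capturing group gives the same result. The following is a rough sketch under that substitution; the names HEADWORD, ITEM_LABEL, extract and ENTRY are assumptions made for illustration.

```python
import re

# A capturing group stands in for grep's \K: only group 1 is kept.
HEADWORD = re.compile(r"\s*\*\*([^*]+)\*\*")
ITEM_LABEL = re.compile(r"\s*\d+\.\s*(\w+)")


def extract(pattern, lines):
    """Return group 1 for every line that matches at its start.
    pattern.match() anchors at the beginning of the line, like ^."""
    found = []
    for line in lines:
        m = pattern.match(line)
        if m:
            found.append(m.group(1))
    return found


# One vocabulary entry in the format shown in the question.
ENTRY = "**word**\n1. Definition:\n2. Usage\n3. Others\n".splitlines()

print(extract(HEADWORD, ENTRY))
print(extract(ITEM_LABEL, ENTRY))
```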


Solution:3

Several tools can be used to extract the word; here is a version implemented in sed:

sed '/^\*\*/!d' <your_file  

This command will match every line in your file that starts with ** and print it. The other lines will be deleted from the output. If you also want to remove the stars you can extend the command to this:

sed '/^\*\*/!d;s/\*//g' <your_file  

This command, in addition, will remove all * characters from the line before it is printed.
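The two sed steps (keep lines that start with **, then delete every star) can be sketched in Python as follows. This is only an illustration of the logic; the function name headword_lines is an assumption.

```python
def headword_lines(lines):
    """Yield lines that start with '**', with every '*' removed.
    Mirrors: sed '/^\\*\\*/!d;s/\\*//g'"""
    for line in lines:
        if line.startswith("**"):
            yield line.rstrip("\n").replace("*", "")


sample = ["**word**\n", "1. Definition:\n", "*Do*Not*Return*\n"]
print(list(headword_lines(sample)))
```

As with the sed command, any other text on a matching line is kept, so this approach relies on each word sitting on a line of its own.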


Solution:4

This is one of those questions where it is helpful to have a test input file and examples of the desired output.

Input File

Here is a test input file I copied from the Internet and modified to encase search words within ** pairs:

$ cat ~/Downloads/wordlist.txt
**Schadenfreude**
This is a German word, although used in English too, which is used to mean ‘malicious enjoyment of the misfortunes of others’. It comes from the joining of the words schaden meaning ‘harm’ and freude meaning ‘joy’.

**Waldeinsamkeit**
Ever found yourself wandering alone through a forest and wanting to express the emotion brought about by that wander? Look no further! In German, Waldeinsamkeit means ‘woodland solitude’.

**L’esprit de l’escalier**
We all know the feeling of walking away from an argument and instantly thinking of the ideal comeback, or leaving a conversation and remembering the perfect contribution to a no-longer relevant subject. In French, l’esprit de l’escalier is the term used to refer to that irritating feeling. It literally translates as ‘the spirit of the staircase’, more commonly known as ‘staircase wit’. It comes from the idea of thinking of a response as you’re leaving somebody’s house, via their staircase.

**Schlimazel**
The Mr Men series of books by Roger Hargreaves is a staple of many a British child’s bookshelves, and there is a word which could have been created for the character Mr Bump. Like Mr Bump, a Schlimazel is ‘a consistently unlucky, accident-prone person, a born loser’. It is a Yiddish word, coming from the Middle High German word slim meaning ‘crooked’ and the Hebrew mazzāl meaning ‘luck’.

**Depaysement**
Ever go on holiday, only to experience a strange sensation of disorientation at the change of scenery? Dépaysement is a French word which refers to that feeling of disorientation that specifically arises when you are not in your home country.

**Duende**
This Spanish term implies something magical or enchanting. It originally referred to a supernatural being or spirit
similar to an imp or pixie (and is occasionally borrowed in that sense into English with reference to Spanish and Latin American folklore).
Now, it has adapted to refer to the spirit of art or the power that a song or piece of art has to deeply move a person.

**Torschlusspanik**
Are you getting older? Scared of being left behind or ‘left on the shelf’? This British idiom has its own word in German: Torschlusspanik, which literally translates as ‘panic at the shutting of a gate’, is used frequently in a general sense meaning ‘last-minute panic’, of the type you might experience before a deadline.

*Do*Not*Return*these four star lines
*word***
***word*
word**

Using grep

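Using grep it's fairly straightforward to get a word list. The bounded quantifier caps the length of the text between the stars (GNU grep accepts the omitted lower bound; the POSIX form is {0,30}). The bound must be at least as long as the longest entry, and L’esprit de l’escalier is 22 characters:

$ grep -E -o '\*\*[^*]{,30}\*\*' ~/Downloads/wordlist.txt
**Schadenfreude**
**Waldeinsamkeit**
**L’esprit de l’escalier**
**Schlimazel**
**Depaysement**
**Duende**
**Torschlusspanik**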

If you want to remove the ** encasing the words, add a pipe to sed:

$ grep -E -o '\*\*[^*]{,30}\*\*' ~/Downloads/wordlist.txt | sed 's/*//g'
Schadenfreude
Waldeinsamkeit
L’esprit de l’escalier
Schlimazel
Depaysement
Duende
Torschlusspanik
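For readers who want the same extraction outside the shell, here is a minimal Python sketch of the grep and sed pipeline. The function name find_headwords and its max_len parameter are illustrative assumptions; the caller chooses the length bound explicitly.

```python
import re


def find_headwords(lines, max_len):
    """Return the text between '**' pairs, at most max_len characters long.
    Lines are scanned one at a time, as grep does, so a match never
    spans a line break."""
    pattern = re.compile(r"\*\*([^*]{0,%d})\*\*" % max_len)
    words = []
    for line in lines:
        words.extend(pattern.findall(line.rstrip("\n")))
    return words


sample = [
    "**Schadenfreude**",
    "This is a German word, although used in English too.",
    "*Do*Not*Return*these four star lines",
    "*word***",
    "***word*",
    "word**",
]
print(find_headwords(sample, 30))
```

An entry longer than max_len is skipped silently, so the bound should be at least the length of the longest entry you expect.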

Saving index of words to a file

If you want to save the grep and sed output, use the > redirection operator:

$ grep -E -o '\*\*[^*]{,30}\*\*' ~/Downloads/wordlist.txt | sed 's/*//g' > ~/Downloads/wordlist-index.txt

$ cat ~/Downloads/wordlist-index.txt
Schadenfreude
Waldeinsamkeit
L’esprit de l’escalier
Schlimazel
Depaysement
Duende
Torschlusspanik

Note: the original answer was later enhanced using muru's answer to a separate Q&A, "Use specified quantifier in grep to retrieve satisfied vocabulary".


Solution:5

If you don't mind using additional tools a very simple solution would be to post-filter the grep output with tr to delete all occurrences of the character *:

grep -x '\*\*[^*]*\*\*' | tr -d '*'  

I also recommend using grep's -x flag, as above, to match only whole lines, so that you don't accidentally catch a **word** surrounded by other text on the same line. This may also speed up pattern matching, since many potential matches can be discarded early on.

sed alternative

You can also take advantage of sed’s p flag to match, replace and print as a single command:

sed -nre 's/^\*\*([^*]*)\*\*$/\1/p'  
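Both commands in this answer accept a line only when the whole line is a starred word. A small Python sketch of that whole-line rule, using re.fullmatch, might look like this; the names WHOLE_LINE and whole_line_headwords are assumptions for illustration.

```python
import re

# Same pattern as the grep -x command, with the word in a capturing group.
WHOLE_LINE = re.compile(r"\*\*([^*]*)\*\*")


def whole_line_headwords(lines):
    """Yield the word from lines that consist of nothing but **word**."""
    for line in lines:
        m = WHOLE_LINE.fullmatch(line.rstrip("\n"))
        if m:
            yield m.group(1)


sample = ["**word**\n", "see **word** here\n", "**word** trailing text\n"]
print(list(whole_line_headwords(sample)))
```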


Solution:6

GNU grep

Your case amounts to extracting text between two patterns on a line. This was covered in the 2012 question "How to use sed/grep to extract text between two words?". In particular, as anishsane mentioned there, you can use lookahead and lookbehind patterns with the Perl-regex flag -P. In your case, the solution would be:

grep -o -P '(?<=\*\*).*(?=\*\*)' input.txt  

However, as ghoti mentioned, -P is specific to GNU grep. Keep that in mind if you are porting your scripts/commands between different *nix systems.
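One caveat worth checking before relying on the lookaround pattern: .* is greedy, so a line with more than one starred word is matched from the first ** to the last. Python's re module supports the same lookaround syntax, so the behaviour can be sketched there. The sample line and variable names below are assumptions, and the question's one-word-per-line format is not affected.

```python
import re

line = "**one** and **two**"  # hypothetical line with two starred words

# Greedy lookaround, as in the grep -P command: one long match.
greedy = re.findall(r"(?<=\*\*).*(?=\*\*)", line)

# Lazy lookaround: the stars are never consumed, so the text between
# a closing ** and the next opening ** is matched as well.
lazy = re.findall(r"(?<=\*\*).*?(?=\*\*)", line)

# Consuming the delimiters and capturing the word avoids both problems.
consumed = re.findall(r"\*\*(.*?)\*\*", line)

print(greedy)
print(lazy)
print(consumed)
```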


Perl

Instead of trying to use Perl regex, let's just use Perl itself:

$ perl -a -F\\*\\* -lane 'print $F[1] if /\*\*/' input.txt
word

This has two advantages. First, it specifies the delimiter for "fields", which means we can deal with individual items separated by **. Second, syntactically this is slightly less confusing than a lookahead/lookbehind pattern.


Python

Of course, there are other ways to do it, and one of them is Python. A script that runs under Python 2.7 (and also Python 3) would be:

#!/usr/bin/env python
from __future__ import print_function
import sys

for f in sys.argv[1:]:
    with open(f) as fd:
        for line in fd:
            # Headword lines start with '**'; splitting on '*' puts the word at index 2
            if line.startswith('**'):
                print(line.split('*')[2])

You could also make it a one-liner and take advantage of stdin redirection:

python -c 'import sys; print("\n".join([l.split("**")[1] for l in sys.stdin if "**" in l]))' < input.txt

Those who prefer regex may want to use the re module:

python -c 'import re,sys; print("\n".join([re.split(r"\*\*", l)[1] for l in sys.stdin if "**" in l]))' < input.txt
