Ubuntu: Split a text file into several files each time a pattern appears, from the Linux command line


I want to split a text file into several files, starting a new file every time a pattern appears. In this example, the pattern is PAT.

Original file content:

PAT --example html http://askubuntu.com/page01
ABC
DEF

PAT --example html http://askubuntu.com/page02
GHI
JKL

PAT --example html http://askubuntu.com/page03
MNO
PQR

(and so on)

The original file is called original.txt. I would like to get files like so:

$ cat page01.txt
ABC
DEF
$ cat page02.txt
GHI
JKL
$ cat page03.txt
MNO
PQR

(and so on)

Ideally with commands like grep or awk. Renaming the files is secondary, but would be a plus to help classify them. Thanks in advance.


You could use awk with some redirection:

awk -F/ '/^PAT/{file = $NF; next} /./{print >> file}' original.txt

The result:

$ head page0*
==> page01 <==
ABC
DEF

==> page02 <==
GHI
JKL

==> page03 <==
MNO
PQR

Essentially, for each line beginning with PAT, I'm saving the last field (using / as the field separator) in the variable file, and then appending every non-empty line (/./ matches lines with at least one character) to the file whose name is stored in file.
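
A slightly more defensive variant of the same idea, offered as a sketch rather than the exact command above: it appends a .txt suffix to match the names asked for in the question, closes each output file before opening the next one (so a long input should not run out of open file descriptors), and skips any text that appears before the first PAT line. The variable name out and the scratch directory are illustrative choices, not part of the original answer.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample input from the question.
printf '%s\n' \
  'PAT --example html http://askubuntu.com/page01' 'ABC' 'DEF' '' \
  'PAT --example html http://askubuntu.com/page02' 'GHI' 'JKL' '' \
  'PAT --example html http://askubuntu.com/page03' 'MNO' 'PQR' > original.txt

awk -F/ '
  # Header line: close the previous output and build the new name
  # from the last slash-separated field.
  /^PAT/ { if (out != "") close(out); out = $NF ".txt"; next }
  # Any other non-empty line goes to the current output, if there is one.
  out != "" && /./ { print > out }
' original.txt
```

Using > instead of >> means a rerun overwrites the earlier output instead of appending to it; within a single run awk keeps writing to the same open file until close is called.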


Since @muru beat me to the awk solution, here's a Perl approach (but use @muru's instead; it is simpler and more efficient):

perl -00ne 's#PAT.*/(.*)\n##; open($F,">","$1.txt"); s/\n\s*(\n|$)//g; print $F "$_\n"' original.txt

The -00 switch makes Perl read the input in paragraph mode: a "line" (a record) is now a paragraph, delimited by one or more empty lines. s#PAT.*/(.*)\n## removes the line starting with PAT from the record, and the parentheses capture whatever follows the last / as $1. Then we open $1.txt for writing (open($F,">","$1.txt")) with the file handle $F. The next step, s/\n\s*(\n|$)//g, removes the blank lines left at the end of the record and, finally, the current record is printed to the file handle $F with print $F "$_\n".

To use everything after the // as a name, try:

perl -00ne 's#PAT.*//(.*)\n##; $k=$1; $k=~s#[./]##g; open($F,">","$k.txt"); s/\n\s*(\n|$)//g; print $F "$_\n"' original.txt

On your example, that would result in the following files:

askubuntucompage01.txt
askubuntucompage02.txt
askubuntucompage03.txt
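
The paragraph-mode idea behind the Perl one-liners above can also be sketched in awk, where an empty RS plays the role of -00. This is an illustration under assumptions, not the answerer's method: it assumes every block starts with a PAT line and that blocks are separated by truly empty lines (a line holding only spaces would not end a paragraph). The names parts, n and out are made up for the example.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample input from the question.
printf '%s\n' \
  'PAT --example html http://askubuntu.com/page01' 'ABC' 'DEF' '' \
  'PAT --example html http://askubuntu.com/page02' 'GHI' 'JKL' '' \
  'PAT --example html http://askubuntu.com/page03' 'MNO' 'PQR' > original.txt

# RS set to the empty string switches awk to paragraph mode;
# FS set to a newline makes every line of the paragraph one field.
awk -v RS='' -v FS='\n' '
$1 ~ /^PAT/ {
  # $1 is the PAT header line; keep the text after its last slash.
  n = split($1, parts, "/")
  out = parts[n] ".txt"
  # The remaining fields are the body lines of this paragraph.
  for (i = 2; i <= NF; i++) print $i > out
  close(out)
}' original.txt
```

Because each record is a whole paragraph, the header and its body are handled together, so no state has to be carried from one line to the next.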


Also have a look at csplit(1):

csplit --suppress-matched --prefix page --suffix-format %02d.txt original.txt '/^PAT/' '{*}'
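
A sketch of how that csplit call might be used on the sample input, assuming GNU coreutils (the --suppress-matched option is a GNU extension). Two things to expect, as far as I can tell: because the file starts with a PAT line, the piece before the first match is an empty page00.txt, and the blank separator lines stay in each piece. The cleanup steps below are my additions, not part of the answer above.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample input from the question.
printf '%s\n' \
  'PAT --example html http://askubuntu.com/page01' 'ABC' 'DEF' '' \
  'PAT --example html http://askubuntu.com/page02' 'GHI' 'JKL' '' \
  'PAT --example html http://askubuntu.com/page03' 'MNO' 'PQR' > original.txt

# Same csplit call as above, with --quiet to silence the byte counts.
csplit --quiet --suppress-matched --prefix page --suffix-format %02d.txt \
  original.txt '/^PAT/' '{*}'

# The piece before the first PAT line is empty here; drop it if so.
[ -s page00.txt ] || rm -f page00.txt

# csplit keeps the blank separator lines; strip them from each piece.
for f in page??.txt; do
  grep . "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```

Note that csplit numbers the pieces sequentially and does not read the page name from the PAT line, so the file names only line up with the URLs here because the sample pages happen to be numbered 01, 02, 03 in order.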
