Ubuntu: How can I repeat the content of a file n times?



Question:

I'm trying to benchmark to compare two different ways of processing a file. I have a small amount of input data but in order to get good comparisons, I need to repeat the tests a number of times.

Rather than just repeating the tests, I would like to duplicate the input data a number of times (e.g. 1000) so a 3-line file becomes 3000 lines and I can run a much more meaningful test.

I'm passing the input data in via a filename:

mycommand input-data.txt  


Solution:1

You don't need input-duplicated.txt.

Try:

mycommand <(perl -0777pe '$_=$_ x 1000' input-data.txt)  

Explanation

  • -0777 : -0 sets the input record separator (the Perl special variable $/, which is a newline by default). Setting this to a value greater than 0400 causes Perl to slurp the entire input file into memory.
  • pe : the -p means "print each input record after applying the script given by -e to it".
  • $_=$_ x 1000 : $_ is the current input record. Since we're reading the entire file at once because of -0777, this means the entire file. The x 1000 results in 1000 copies of the entire file being printed (a small demonstration follows below).
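
As a quick sanity check (my example, not from the original answer), the same one-liner with a repeat count of 2 on a 3-line file shows that the whole file is repeated as one block:

$ printf 'a\nb\nc\n' > input-data.txt
$ perl -0777pe '$_=$_ x 2' input-data.txt
a
b
c
a
b
c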


Solution:2

I was originally thinking that I would have to generate a secondary file but I could just loop the original file in Bash and use some redirection to make it appear as a file.

There are probably a dozen different ways of doing the loop but here are four:

mycommand <( seq 1000 | xargs -i -- cat input-data.txt )
mycommand <( for _ in {1..1000}; do cat input-data.txt; done )
mycommand <( (for _ in {1..1000}; do echo input-data.txt; done) | xargs cat )
mycommand <( awk '{for(i=0; i<1000; i++)print}' input-data.txt )

The third method there is improvised from maru's comment below and builds a big list of input filenames for cat. xargs will split this into as many arguments as the system will allow. It's much faster than n separate cats.

The awk way (inspired by terdon's answer) is probably the most optimised, but it duplicates each line in turn, so every input line appears 1000 times in a row rather than the whole file being repeated 1000 times (see the short example below). This may or may not suit a particular application, but it's lightning fast and efficient.
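
Here is a minimal sketch of that ordering difference (my example, with a 2-line file and a repeat count of 2):

$ printf 'a\nb\n' > input-data.txt
$ awk '{for(i=0; i<2; i++)print}' input-data.txt
a
a
b
b

Repeating the whole file twice would instead give a, b, a, b.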


But this is generating the data on the fly. Bash is likely to output it much more slowly than your command can read it, so you should generate a new file for testing. Thankfully that's only a very simple extension:

(for _ in {1..1000}; do echo input-data.txt; done) | xargs cat > input-duplicated.txt
mycommand input-duplicated.txt
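
If you want to confirm the generated file before benchmarking (assuming the original has 3 lines), a simple line count is enough:

$ wc -l input-duplicated.txt
3000 input-duplicated.txt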


Solution:3

Here's an awk solution:

awk '{a[NR]=$0}END{for (i=0; i<1000; i++){for(k in a){print a[k]}}}' file   
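
One caveat (my addition, not from the original answer): awk's for (k in a) does not guarantee any particular iteration order, so the lines within each copy may come out reordered. If line order matters, an indexed loop is a safer variant:

awk '{a[NR]=$0} END{for (i=0; i<1000; i++) for (k=1; k<=NR; k++) print a[k]}' file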

It's essentially as fast as @Gnuc's Perl (I ran both 1000 times and got the average time):

$ for i in {1..1000}; do
    (time awk '{a[NR]=$0}END{for (i=0;i<1000;i++){for(k in a){print a[k]}}}' file > a) 2>&1 |
      grep -oP 'real.*?m\K[\d\.]+'; done | awk '{k+=$1}END{print k/1000}'
0.00426

$ for i in {1..1000}; do
    (time perl -0777pe '$_=$_ x 1000' file > a ) 2>&1 |
      grep -oP 'real.*?m\K[\d\.]+'; done | awk '{k+=$1}END{print k/1000}'
0.004076


Solution:4

I would just use a text editor.

vi input-data.txt
gg (move cursor to the beginning of the file)
yG (yank to the end of the file)
G (move the cursor to the last line of the file)
999p (paste the yanked text 999 times)
:wq (save the file and exit)

If you absolutely need to do it via the command-line (this requires you to have vim installed, as vi doesn't have the :normal command), you could use:

vim -es -u NONE "+normal ggyGG999p" +wq input-data.txt  

Here, -es (or -e -s) makes vim operate silently, so it shouldn't take over your terminal window, and -u NONE stops it from looking at your vimrc, which should make it run a little faster than it otherwise would (maybe much faster, if you use a lot of vim plugins).
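
One thing worth spelling out (my note, not the original answer's): unlike the process-substitution approaches above, this edits input-data.txt in place. A quick sanity check for a 3-line original:

$ wc -l input-data.txt
3000 input-data.txt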


Solution:5

Here is a simple one-liner, no scripting involved:

mycommand <(cat `yes input-data.txt | head -1000 | paste -s`)  

Explanation

  • `yes input-data.txt | head -1000 | paste -s` produces the text input-data.txt 1000 times, separated by whitespace (see the expansion sketch below)
  • The text is then passed to cat as a list of files
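
As a small sketch of what that backtick expansion produces (my example, shown with 3 instead of 1000; paste -s joins the lines with tab characters):

$ yes input-data.txt | head -3 | paste -s
input-data.txt  input-data.txt  input-data.txt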


Solution:6

While working on a completely different script, I've learned that with 29 million lines of text, using seek() and operating on the data bytewise is often faster than working line by line. The same idea is applied in the script below: we open the file once and, instead of repeatedly opening and closing it (which may add overhead, even if not significant), we keep it open and seek back to the beginning.

#!/usr/bin/env python3
from __future__ import print_function
import sys, os

def error_out(string):
    sys.stderr.write(string + "\n")
    sys.exit(1)

def read_bytewise(fp):
    # Read the already-open file in 1024-byte chunks and echo them to stdout
    data = fp.read(1024)
    print(data.decode(), end="", flush=True)
    while data:
        data = fp.read(1024)
        print(data.decode(), end="", flush=True)
    #fp.seek(0,1)

def main():
    howmany = int(sys.argv[1]) + 1
    if not os.path.isfile(sys.argv[2]):
        error_out("Needs a valid file")
    # Open the file once; each iteration just seeks back to the start
    fp = open(sys.argv[2], 'rb')
    for i in range(1, howmany):
        #print(i)
        fp.seek(0)
        read_bytewise(fp)
    fp.close()

if __name__ == '__main__':
    main()

The script itself is quite simple in usage:

./repeat_text.py <INT> <TEXT.txt>  

For a 3-line text file and 1000 iterations it does quite alright, taking about 0.1 seconds:

$ /usr/bin/time ./repeat_text.py 1000 input.txt > /dev/null
0.10user 0.00system 0:00.23elapsed 45%CPU (0avgtext+0avgdata 9172maxresident)k
0inputs+0outputs (0major+1033minor)pagefaults 0swaps

The script itself isn't the most elegant and could probably be shortened, but it does the job. Of course, I added a few extra bits here and there, like the error_out() function, which isn't strictly necessary - it's just a small user-friendly touch.


Solution:7

We can solve this without an additional file or special programs, in pure Bash (well, cat is a standard command).

Based on a feature of printf inside bash, we can generate a repeated string:

printf "test.file.txt %.0s\n" {1..1000}  

Then we can pipe that list of 1000 (repeated) filenames to xargs and call cat:

printf "test.file.txt %.0s" {1..1000} | xargs cat   

And finally, we can pass that repeated content to the command as an argument:

mycommand "$( printf "%.0sinput.txt\n" {1..1000} | xargs cat )"  

Or, if the command needs to receive the input on stdin:

mycommand < <( printf "%.0sinput.txt\n" {1..1000} | xargs cat )  

Yes, the double < is needed.
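
For what it's worth (my note): the `<( ... )` part is process substitution, which expands to a file path such as /dev/fd/63, and the leading `<` redirects that file to the command's stdin. You can see the expansion with a trivial command (the exact fd number varies):

$ echo <(true)
/dev/fd/63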


Solution:8

I would generate a new file using Unix for loop:

content=$(cat Alex.pgn); for i in {1..900000}; do echo "$content" >> new_file; done   
