Tutorial: How can I randomly sample the contents of a file?



Question:

I have a file with contents

abc
def
high
lmn
...
...

The file has more than 2 million lines. I want to randomly sample lines from it and output 50K of them. Any thoughts on how to approach this problem? I was thinking along the lines of Perl and its rand function (or a handy shell command would be neat).


Solution:1

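Assuming you only need roughly 2.5% of all lines (50K out of 2 million) rather than exactly 50K, this would do. Perl does not allow two statement modifiers (if and while) on one statement, so the loop is written out:

while (<$input>) { print if rand() < 0.025 }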


Solution:2

Shell way:

sort -R file | head -n 50000  


Solution:3

Perl way:

Use CPAN. The File::RandomLine module does exactly what you need.


Solution:4

From perlfaq5: "How do I select a random line from a file?"


Short of loading the file into a database or pre-indexing the lines in the file, there are a couple of things that you can do.

Here's a reservoir-sampling algorithm from the Camel Book:

srand;
rand($.) < 1 && ($line = $_) while <>;

This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

You can use the File::Random module which provides a function for that algorithm:

use File::Random qw/random_line/;
my $line = random_line($filename);
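
The Camel Book one-liner keeps a single line, while the question asks for 50K. The same reservoir idea extends to k lines; the following is a minimal sketch of that generalization, not code from perlfaq5. The name reservoir_sample and its arguments are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $k lines chosen uniformly at random
# from the open filehandle $fh, reading the input only once.
sub reservoir_sample {
    my ($fh, $k) = @_;
    my @sample;
    my $seen = 0;                        # number of lines read so far
    while (my $line = <$fh>) {
        $seen++;
        if (@sample < $k) {
            push @sample, $line;         # fill the reservoir first
        }
        else {
            # Keep the new line with probability $k / $seen by drawing
            # a slot in 0 .. $seen-1 and replacing it if it is in range.
            my $j = int rand $seen;
            $sample[$j] = $line if $j < $k;
        }
    }
    return @sample;
}
```

Calling print reservoir_sample($fh, 50_000) on an open filehandle would print the sample. Memory use is proportional to k rather than to the file size, and the lines come back in reservoir order, not necessarily in file order.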

Another way is to use the Tie::File module, which treats the entire file as an array. Simply access a random array element.
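
As a rough sketch of the Tie::File approach, assuming the file is opened read-only (the function name random_line_tied is hypothetical):

```perl
use strict;
use warnings;
use Tie::File;
use Fcntl 'O_RDONLY';

# Hypothetical helper: return one random line from $filename, or undef
# if the file is empty. Tie::File strips the line ending by default.
sub random_line_tied {
    my ($filename) = @_;
    tie my @lines, 'Tie::File', $filename, mode => O_RDONLY
        or die "Can't tie '$filename': $!\n";
    my $line = @lines ? $lines[ int rand @lines ] : undef;
    untie @lines;
    return $line;
}
```

Asking for the array length makes Tie::File scan the file to find every line boundary, so this is convenient rather than fast on a file with millions of lines.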


Solution:5

If you need to extract an exact number of lines:

use strict;
use warnings;

# Number of lines to pick and file to pick from
# Error checking omitted!
my ($pick, $file) = @ARGV;

open(my $fh, '<', $file)
    or die "Can't read file '$file' [$!]\n";

# count lines in file (start at 0 so an empty file does not trigger
# an "uninitialized value" warning below)
my ($lines, $buffer) = (0, '');
while (sysread $fh, $buffer, 4096) {
    $lines += ($buffer =~ tr/\n//);
}

# limit number of lines to pick to number of lines in file
$pick = $lines if $pick > $lines;

# build list of N lines to pick, use a hash to prevent picking the
# same line multiple times
my %picked;
for (1 .. $pick) {
    my $n = int(rand($lines)) + 1;
    redo if $picked{$n}++;
}

# loop over file extracting selected lines
seek($fh, 0, 0);
while (<$fh>) {
    print if $picked{$.};
}
close $fh;
