Tutorial :How can I group a large dataset


I have simple text file containing two columns, both integers

1 5  1 12  2 5  2 341  2 12  

and so on..

I need to group the dataset by second value, such that the output will be.

5 1 2  12 1 2  341 2  

Now the problem is that the file is very big around 34 Gb in size, I tried writing a python script to group them into a dictionary with value as an array of integers, still it takes way too long. (I guess a large time is taken for allocating the array('i') and extending them on append.

I am now planning to write a pig script which I am planning to run on a pseudo distributed hadoop machine (An Amazon EC3 High Memory Large instance).

data = load 'Net.txt';  gdata = Group data by $1; // I know it will lead to 5 (1,5) (2,5) but thats okay for this snippet  store gdata into 'res.txt';  

I wanted to know if there was any simpler way of doing this.

Update: keeping such a big file in memory is out of question, In case of python solution, what I planned was to conduct 4 runs in first run only second col values from 1 - 10 million are considered in next run 10 million to 20 million are considered and so on. but this turned out to be really slow.

The pig / hadoop solution is interesting because it keeps everything on disk [Well most of it].

For better understanding this dataset contains information about connectivity of ~45 Million twitter users and the format in file means that userid given by the second number is following the the first one.

Solution which I had used:

class AdjDict(dict):      """       A special Dictionary Class to hold adjecancy list      """      def __missing__(self, key):          """          Missing is changed such that when a key is not found an integer array is initialized          """          self.__setitem__(key,array.array('i'))          return self[key]    Adj= AdjDict()    for line in file("net.txt"):      entry =  line.strip().split('\t')      node = int(entry[1])      follower = int(entry[0])      if node < 10 ** 6:          Adj[node].append(follower)    # Code for writting Adj matrix to the file:  


Assuming you have ~17 characters per line (a number I picked randomly to make the math easier), you have about 2 billion records in this file. Unless you are running with much physical memory on a 64-bit system, you will thrash your pagefile to death trying to hold all this in memory in a single dict. And that's just to read it in as a data structure - one presumes that after this structure is built, you plan to actually do something with it.

With such a simple data format, I should think you'd be better off doing something in C instead of Python. Cracking this data shouldn't be difficult, and you'll have much less per-value overhead. At minimum, just to hold 2 billion 4-byte integers would be 8 Gb (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2 - if they will fit within a byte or a short, then you can use smaller int variables, which will be worth the trouble for a data set of this size).


If I had to solve this on my current hardware, I'd probably write a few small programs:

The first would work on 500-megabyte chunks of the file, swapping columns and writing the result to new files. (You'll get 70 or more.) (This won't take much memory.)

Then I'd call the OS-supplied sort(1) on each small file. (This might take a few gigs of memory.)

Then I'd write a merge-sort program that would merge together the lines from all 70-odd sub-files. (This won't take much memory.)

Then I'd write a program that would run through the large sorted list; you'll have a bunch of lines like:

5 1  5 2  12 1  12 2  

and you'll need to return:

5 1 2  12 1 2  

(This won't take much memory.)

By breaking it into smaller chunks, hopefully you can keep the RSS down to something that would fit a reasonable machine -- it will take more disk I/O, but on anything but astonishing hardware, swap use would kill attempts to handle this in one big program.


Maybe you can do a multi-pass through the file.

Do a range of keys each pass through the file, for example if you picked a range size of 100

1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
..and so on.

for your sample, the 1st pass would output

5 1 2  12 1 2  

and the 4th pass would output

341 2  

Choose the range size so that the dict you are creating fits into your RAM

I wouldn't bother using multiprocessing to try to speed it up by using multiple cores, unless you have a very fast harddrive this should be IO bound and you would just end up thrashing the disk


If you are working with a 34 GB file, I'm assuming that hard drive, both in terms of storage and access-time, is not a problem. How about reading the pairs sequentially and when you find pair (x,y), open file "x", append " y" and close file "x"? In the end, you will have one file per Twitter userid, and each file containing all users this one is connected to. You can then concatenate all those files if you want to have your result in the output format you specified.

THAT SAID HOWEVER, I really do think that: (a) for such a large data set, exact resolution is not appropriate and that (b) there is probably some better way to measure connectivity, so perhaps you'd like to tell us about your end goal.

Indeed, you have a very large graph and a lot of efficient techniques have been devised to study the shape and properties of huge graphs---most of these techniques are built to work as streaming, online algorithms.

For instance, a technique called triangle counting, coupled with probabilistic cardinality estimation algorithms, efficiently and speedily provides information on the cliques contained in your graph. For a better idea on the triangle counting aspect, and how it is relevant to graphs, see for example this (randomly chosen) article.


I had a similar requirement and you just require one more pig statement to remove the redundancies in 5 (1,5) (2,5).

a = LOAD 'edgelist' USING PigStorage('\t') AS (user:int,following:int);  b = GROUP a BY user;  x = FOREACH b GENERATE group.user, a.following;  store x INTO 'following-list';  

Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Next Post »