Tutorial :finding a number appearing again among numbers stored in a file


Say, i have 10 billions of numbers stored in a file. How would i find the number that has already appeared once previously?

Well i can't just populate billions of number at a stretch in array and then keep a simple nested loop to check if the number has appeared previously.

How would you approach this problem?

Thanks in advance :)


I had this as an interview question once.

Here is an algorithm that is O(N)

Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.

Author Edit:

Below, @Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.

Furthermore, using the bit array as described below can reduce memory consumption to 1/8th, 512mb. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.

Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.

Phimuemue Edit:

Of course one needs to take a bit "special" hash table:

Take a hashtable consisting of 2^32 bits. Since the question asks about 4-byte-integers, there are at most 2^32 different of them, i.e. one bit for each number. 2^32 bit = 512mb.

So now one has just to determine the location of the corresponding bit in the hashmap and set it. If one encounters a bit which already is set, the number occured in the sequence already.


The important question is whether you want to solve this problem efficiently, or whether you want accurately.

If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.

Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).

A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use (as kbrimington suggested you use a standard hash table) the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass, determine the elements you actually see twice.

And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).


Read the file once, create a hashtable storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item iself, for example the least significant digits, let's say 20 digits (1M items).

After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.

After the second pass, items with count > 1 in the second table are your duplicates.

This is still O(n), just twice as slow as a single pass.


How about:

  1. Sort input by using some algorith which allows only portion of input to be in RAM. Examples are there
  2. Seek duplicates in output of 1st step -- you'll need space for just 2 elements of input in RAM at a time to detect repetitions.


Finding duplicates

Noting that its a 32bit integer means that you're going to have a large number of duplicates, since a 32 bit int can only represent 4.3ish billion different numbers and you have "10 billions".

If you were to use a tightly packed set you could represent whether all the possibilities are in 512 MB, which can easily fit into current RAM values. This as a start pretty easily allows you to recognise the fact if a number is duplicated or not.

Counting Duplicates

If you need to know how many times a number is duplicated you're getting into having a hashmap that contains only duplicates (using the first 500MB of the ram to tell efficiently IF it should be in the map or not). At a worst case scenario with a large spread you're not going to be able fit that into ram.

Another approach if the numbers will have an even amount of duplicates is to use a tightly packed array with 2-8 bits per value, taking about 1-4GB of RAM allowing you to count up to 255 occurrances of each number.

Its going to be a hack, but its doable.


You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.

How? Oh, what language are you using?


You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.


If possible range of numbers in file is not too large then you can use some bit array to indicate if some of the number in range appeared.


If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.

With large range (like int) you need to read through the file every time. File layout may allow for more efficient lookups (i.e. binary search in case of sorted array).


If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.


I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4.294.967.296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur?
Still, it would need 536.870.912 bytes at least. (512 MB.) It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.


Had to do this a long time ago. What i did... i sorted the numbers as much as i could (had a time-constraint limit) and arranged them like this while sorting:

1 to 10, 12, 16, 20 to 50, 52 would become..

  [1,10], 12, 16, [20,50], 52, ...  

Since in my case i had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets i had a very low memory useage

p.s. another way to store them

  1, -9, 12, 16, 20, -30, 52,  

when i had no numbers lower than zero

After that i applied various algorithms (described by other posters) here on the reduced data set


#include <stdio.h>  #include <stdlib.h>  /* Macro is overly general but I left it 'cos it's convenient */  #define BITOP(a,b,op) \   ((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))  int main(void)  {      unsigned x=0;      size_t *seen = malloc(1<<8*sizeof(unsigned)-3);      while (scanf("%u", &x)>0 && !BITOP(seen,x,&)) BITOP(seen,x,|=);      if (BITOP(seen,x,&)) printf("duplicate is %u\n", x);      else printf("no duplicate\n");      return 0;  }  


This is a simple problem that can be solved very easily (several lines of code)
and very fast (several minutes of execution) with the right tools
my personal approach would be in using MapReduce
MapReduce: Simplified Data Processing on Large Clusters

i'm sorry for not going into more details but once getting familiar with the concept of MapReduce it is going to be very clear on how to target the solution
basicly we are going to implement two simple functions

  1. Map(key, value)
  2. Reduce(key, values[])

so all in all:

  • open file and iterate through the data
  • for each number -> Map(number, line_index)
  • in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
  • so in Reduce(key, values[]) if number of values > 1 than its a duplicate number
  • print the duplicates : number, line_index1, line_index2,...

    again this approach can result in a very fast execution depending on how your MapReduce framework is set, highly scalable and very reliable, there are many diffrent implementations for MapReduce in many languages

    there are several top companies presenting already built up cloud computing environments like Google, Microsoft azure, Amazon AWS, ...
    or you can build your own and set a cluster with any providers offering virtual computing environments paying very low costs by the hour

    good luck :)

    • Another more simple approach could be in using bloom filters



Implement a BitArray such that ith index of this array will correspond to the numbers 8*i +1 to 8*(i+1) -1. ie first bit of ith number is 1 if we already had seen 8*i+1. Second bit of ith number is 1 if we already have seen 8*i + 2 and so on.

Initialize this bit array with size Integer.Max/8 and whenever you saw a number k, Set the k%8 bit of k/8 index as 1 if this bit is already 1 means you have seen this number already.

Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Next Post »