Tutorial: Substring search algorithms (very large haystack, small needle)



Question:

I know there are already several similar questions here, but I need some recommendations for my case (I couldn't find anything that matches it exactly).

I have to search a very large amount of data for a substring about a billion times smaller (10 bytes in 10 billion bytes). The haystack doesn't change, so I can tolerate a large precomputation step if needed. I just need the searching part to be as fast as possible.

I found algorithms that take O(n + m) time (n = haystack size, m = needle size), while the naive search takes O(n·m) in the worst case. Since m in this particular case is very small, is there any other algorithm I could look into?

Edit: Thanks everyone for your suggestions! Some more info: the data can be treated as random bits, so I don't think any kind of indexing/sorting would be possible. The data to be searched can be anything, not English words or anything else predictable.


Solution:1

You are looking for the data structure called the Trie, or "prefix tree". In short, this data structure encodes every string prefix that occurs in your corpus.

Here is a paper that searches DNA sequences for a small substring using a prefix tree. I imagine it might help you, since your case sounds similar.

If you know a definite limit on the length of the input search string, you can limit the growth of your Trie so that it does not store any prefixes longer than this maximum length. In this way, you may be able to fit a Trie representing all 10 GB into less than 10 GB of memory. For highly repetitive data especially, any sort of Trie is a compressed data representation (or should be, if implemented sanely). Limiting the Trie depth to the maximum search-string length reduces the memory consumption still further.
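Here is a minimal sketch of a depth-limited Trie, assuming the haystack fits in memory as a bytes object and needles are at most 10 bytes. The dict-of-dicts representation and the names build_trie/trie_contains are illustrative only; it would be far too memory-hungry for a real 10 GB corpus, but it shows the idea:

    # Depth-limited trie over a byte haystack: insert the first MAX_NEEDLE
    # bytes of every suffix, after which membership testing is O(m).
    MAX_NEEDLE = 10  # assumed upper bound on needle length

    def build_trie(haystack: bytes, max_depth: int = MAX_NEEDLE) -> dict:
        root: dict = {}
        for i in range(len(haystack)):
            node = root
            for b in haystack[i:i + max_depth]:
                node = node.setdefault(b, {})
        return root

    def trie_contains(root: dict, needle: bytes) -> bool:
        node = root
        for b in needle:
            if b not in node:
                return False
            node = node[b]
        return True

    # Usage:
    # trie = build_trie(b"the quick brown fox")
    # trie_contains(trie, b"quick")   # True
    # trie_contains(trie, b"quack")   # False

A production version would use a compact node layout (or a compressed/radix trie) rather than Python dicts.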


Solution:2

You're not going to do much better than Knuth–Morris–Pratt or Boyer–Moore; what you should consider instead is parallelizing your search process.
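A hedged sketch of one way to parallelize, assuming the haystack is available as in-memory bytes (for a real 10 GB file you would mmap it and hand workers offsets instead of copies). Chunks overlap by len(needle) - 1 bytes so matches straddling a chunk boundary are not missed:

    from concurrent.futures import ProcessPoolExecutor

    def search_chunk(args):
        # Find all match positions inside one chunk; base converts them
        # back to absolute offsets in the haystack.
        chunk, needle, base = args
        hits, pos = [], chunk.find(needle)
        while pos != -1:
            hits.append(base + pos)
            pos = chunk.find(needle, pos + 1)
        return hits

    def parallel_search(haystack: bytes, needle: bytes, workers: int = 4):
        m = len(needle)
        size = max(m, len(haystack) // workers)
        jobs = []
        for base in range(0, len(haystack), size):
            # The m - 1 byte overlap catches boundary-straddling matches;
            # a match found in a chunk always starts in that chunk's own
            # region, so no position is reported twice.
            jobs.append((haystack[base:base + size + m - 1], needle, base))
        # (On platforms that spawn processes, guard the call with
        # `if __name__ == "__main__":`.)
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return sorted(p for hits in pool.map(search_chunk, jobs) for p in hits)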


Solution:3

It's worth looking at suffix arrays and suffix trees. Both require precomputation and significant memory, but they are better than inverted indexes in the sense that you can search for arbitrary substrings in O(m) (for suffix trees) or O(m + log n) (for suffix arrays with longest-common-prefix info).
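A minimal sketch of the suffix-array lookup, which answers "does the needle occur?" in O(m log n) by binary search. The naive builder below is O(n² log n) and for illustration only; a real 10 GB corpus would need a linear-time construction (e.g. SA-IS) and on-disk storage:

    def build_suffix_array(text: bytes) -> list:
        # Naive construction: sort suffix start positions lexicographically.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def sa_contains(text: bytes, sa: list, needle: bytes) -> bool:
        # Binary search for the first suffix whose m-byte prefix is >= needle.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + len(needle)] < needle:
                lo = mid + 1
            else:
                hi = mid
        return lo < len(sa) and text[sa[lo]:sa[lo] + len(needle)] == needle

    # Usage:
    # sa = build_suffix_array(b"mississippi")
    # sa_contains(b"mississippi", sa, b"ssi")   # True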

If you have a lot of time on your hands, you can look into compressed suffix arrays and succinct CSAs: compressed versions of your data that are also self-indexing (i.e. the data is its own index; there is no external index). This is really the best of all worlds, because not only do you have a compressed version of your data (you can throw the original away), it's also indexed! The hard part is understanding the research papers and translating them into code :)

If you do not need exact substring matching, but rather general search capabilities, check out Lucene.


Solution:4

Prefix/suffix trees are generally the standard, best, and safest solution for this sort of thing, in my opinion. You can't go wrong with them.

But here is a different idea, which uses Bloom filters. You probably know what these are, but just in case (and for other people reading this answer): Bloom filters are very small, very compact bit vectors that approximate set membership. If you have a set S and a Bloom filter for that set, B(S), then

    x ∈ S ⇒ x ∈ B(S)

but the converse does not hold. This is what is probabilistic about the structure: there is a (quantifiable) probability that the Bloom filter will return a false positive. But approximating membership with the Bloom filter is wildly faster than testing it exactly on the set.

(A simple use case: in a lot of applications, the Bloom filter is used, well, as a filter. Checking a cache is expensive, because it requires a hard drive access, so programs like Squid first check a small Bloom filter in memory; if the Bloom filter returns a positive result, Squid then goes to check the cache. A false positive is OK, because Squid will find that out when actually visiting the cache. The advantage is that the Bloom filter spares Squid a cache check for the many requests where it would have been useless.)
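A minimal Bloom filter sketch, with hypothetical names and default sizes chosen only for illustration (a real deployment would size num_bits and num_hashes from the expected item count and target false-positive rate). The hash positions are derived from one SHA-256 digest by double hashing:

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item: bytes):
            # Double hashing: h1 + i*h2 simulates num_hashes hash functions.
            digest = hashlib.sha256(item).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            for i in range(self.num_hashes):
                yield (h1 + i * h2) % self.num_bits

        def add(self, item: bytes) -> None:
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item: bytes) -> bool:
            # False positives are possible; false negatives are not.
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))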

Bloom filters have been used with some success in string search. Here is a sketch (I may remember some of the details wrong) of this application. A text file is a sequence of N lines. You are looking for a word composed of M letters (and no word can be spread across two lines). A preprocessing phase builds ONE Bloom filter for each line, by adding every substring of the line to that filter; for instance, for this line

 Touching this dreaded sight, twice seene of vs,  

the corresponding Bloom filter would be built from "T", "To", "Tou", ..., "o", "ou", ..., "vs,", "s", "s,", ",". (I may have this part wrong, or you might want to optimize it.)

Then, when searching for a subword of size M, simply do one very fast check on each Bloom filter, and wherever there is a hit, examine the line closely with the KMP algorithm, for instance. In practice, if you tune your Bloom filters well, the trade-off is remarkable: searching is incredibly fast because you eliminate almost all of the useless lines up front.

I believe you could derive a useful scheme for your situation from this concept. Right now, I see two obvious adaptations:

  • either cut your data set into many blocks of size K (each with its own Bloom filter, like the lines in the previous example);

  • or use a sort of dichotomy: split the set into two subsets, each with a Bloom filter, then split each subset into two sub-subsets with their own Bloom filters, and so on (if you are going to add all substrings as suggested by the method I described, this second idea would be somewhat prohibitive; except that you don't have to add all substrings, only substrings of sizes 1 through 10).

Both ideas can be combined in inventive ways to create multi-layered schemes. A sketch of the first adaptation follows.
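This hedged sketch of the block-based adaptation reuses the hypothetical BloomFilter class above. The block size, needle length bound, and filter sizes are all illustrative assumptions; blocks overlap by MAX_NEEDLE - 1 bytes so needles straddling a block boundary are still found:

    MAX_NEEDLE = 10  # assumed upper bound on needle length

    def index_blocks(haystack: bytes, block_size: int = 1 << 16):
        # One Bloom filter per block, fed with every substring of
        # length 1..MAX_NEEDLE (this precomputation is the expensive part).
        filters = []
        for base in range(0, len(haystack), block_size):
            block = haystack[base:base + block_size + MAX_NEEDLE - 1]
            bf = BloomFilter()
            for length in range(1, MAX_NEEDLE + 1):
                for i in range(len(block) - length + 1):
                    bf.add(block[i:i + length])
            filters.append((base, bf))
        return filters

    def bloom_search(haystack: bytes, filters, needle: bytes,
                     block_size: int = 1 << 16) -> int:
        for base, bf in filters:
            if bf.might_contain(needle):            # cheap approximate check
                block = haystack[base:base + block_size + MAX_NEEDLE - 1]
                pos = block.find(needle)            # exact verification
                if pos != -1:
                    return base + pos
        return -1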


Solution:5

If you can afford the space (a lot of space!) to create an index, it would definitely be worth your while to index small chunks (e.g. four-byte blocks) and store them with their offsets within the haystack. A search for 10 bytes then involves looking up the offsets of the four-byte block formed by the first four bytes of the search term and checking the next six bytes at each candidate offset.
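A minimal sketch of that scheme, assuming an in-memory haystack and needles of at least four bytes (for 10 GB of data the offset lists would dwarf the haystack itself, which is the "a lot of space!" caveat above):

    from collections import defaultdict

    def build_index(haystack: bytes):
        # Map every four-byte window to the list of offsets where it occurs.
        index = defaultdict(list)
        for i in range(len(haystack) - 3):
            index[haystack[i:i + 4]].append(i)
        return index

    def indexed_search(haystack: bytes, index, needle: bytes) -> int:
        # Look up candidates by the first four bytes, then verify the rest.
        for off in index.get(needle[:4], []):
            if haystack[off:off + len(needle)] == needle:
                return off
        return -1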

