Tutorial: What is the fastest way to find if an array of byte arrays contains another byte array?



Question:

I have some code that is really slow. I knew it would be and now it is. Basically, I am reading files from a bunch of directories. The file names change but the data does not. To determine if I have read the file, I am hashing its bytes and comparing that to a list of hashes of already processed files. There are about 1000 files in each directory, and figuring out what's new in each directory takes a good minute or so (and then the processing starts). Here's the basic code:

    using System.Collections.Generic;
    using System.IO;
    using System.Security.Cryptography;

    public static class ProgramExtensions
    {
        public static byte[] ToSHA256Hash(this FileInfo file)
        {
            using (FileStream fs = new FileStream(file.FullName, FileMode.Open))
            {
                using (SHA256 hasher = new SHA256Managed())
                {
                    return hasher.ComputeHash(fs);
                }
            }
        }

        // Formats the bytes as an upper-case hex string with a "0x" prefix.
        public static string ToHexString(this byte[] p)
        {
            char[] c = new char[p.Length * 2 + 2];
            byte b;
            c[0] = '0'; c[1] = 'x';
            for (int y = 0, x = 2; y < p.Length; ++y, ++x)
            {
                b = ((byte)(p[y] >> 4));
                c[x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
                b = ((byte)(p[y] & 0xF));
                c[++x] = (char)(b > 9 ? b + 0x37 : b + 0x30);
            }
            return new string(c);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            var allFiles = new DirectoryInfo("c:\\temp").GetFiles("*.*");
            List<string> readFileHashes = GetReadFileHashes(); // defined elsewhere (not shown)
            List<FileInfo> filesToRead = new List<FileInfo>();

            foreach (var file in allFiles)
            {
                // Keep only files whose hash has not been seen before
                if (!readFileHashes.Contains(file.ToSHA256Hash().ToHexString()))
                    filesToRead.Add(file);
            }

            //read new files
        }
    }

Is there any way I can speed this up?


Solution:1

I believe you can achieve the most significant performance improvement by simply checking the file size first: if the file size does not match, you can skip the file entirely without even opening it.

Instead of just saving a list of known hashes, you would also keep a list of known filesizes and only do a content comparison when filesizes match. When filesize doesn't match, you can save yourself from even looking at the file content.
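
The size check described above might look like the following sketch. It assumes the size of every processed file is stored next to its hash; the `KnownFiles` class and its members are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Hypothetical store of already-processed files: known sizes plus known hashes.
public class KnownFiles
{
    private readonly HashSet<long> sizes = new HashSet<long>();
    private readonly HashSet<string> hashes = new HashSet<string>();

    public void Add(FileInfo file)
    {
        sizes.Add(file.Length);
        hashes.Add(Hash(file));
    }

    // Returns true if a file with the same content was added before.
    public bool Contains(FileInfo file)
    {
        // Length comes from file system metadata, so no content is read here.
        if (!sizes.Contains(file.Length))
            return false;

        // Only files with a known size pay for the full read and hash.
        return hashes.Contains(Hash(file));
    }

    private static string Hash(FileInfo file)
    {
        using (FileStream fs = file.OpenRead())
        using (SHA256 hasher = SHA256.Create())
            return Convert.ToBase64String(hasher.ComputeHash(fs));
    }
}
```

How much this saves depends on how many distinct sizes your files have; if most files share a size, nearly every file is still hashed.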

Depending on how large your files typically are, a further improvement can be worthwhile:

  • Either do a binary compare with early abort at the first differing byte. Any hash algorithm has to read the entire file, whereas detecting that the first byte is different saves you from reading the rest, which can be a very significant improvement if your files are generally large. If your lookup list is likely to contain many files of the same size, so that you would have to do a binary comparison against several files, instead consider:

  • Hashing in blocks of, say, 1 MB each. First check only the first block against the precalculated first-block hash in your lookup. Only compare the second block if the first block is the same; for different files this saves reading beyond the first block in most cases.

Both of those options are only really worth the effort when your files are large.
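
A minimal sketch of the block-wise hashing idea, assuming the per-block hashes of each known file were stored when it was first processed and that the file sizes have already been found equal. `BlockHashing`, `HashBlock` and `MatchesKnown` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class BlockHashing
{
    public const int BlockSize = 1024 * 1024; // 1 MB, as suggested above

    // Hashes only the block with the given index; returns null past the end of the file.
    public static string HashBlock(string path, int blockIndex)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            long offset = (long)blockIndex * BlockSize;
            if (offset >= fs.Length)
                return null;

            fs.Seek(offset, SeekOrigin.Begin);
            byte[] buffer = new byte[BlockSize];
            int total = 0, read;
            // Read may return fewer bytes than requested, so loop until the block is full or EOF.
            while (total < buffer.Length && (read = fs.Read(buffer, total, buffer.Length - total)) > 0)
                total += read;

            using (SHA256 hasher = SHA256.Create())
                return Convert.ToBase64String(hasher.ComputeHash(buffer, 0, total));
        }
    }

    // Compares a candidate against the stored block hashes of one known file,
    // stopping at the first block that differs. Assumes equal file sizes.
    public static bool MatchesKnown(string path, IReadOnlyList<string> knownBlockHashes)
    {
        for (int i = 0; i < knownBlockHashes.Count; i++)
        {
            if (HashBlock(path, i) != knownBlockHashes[i])
                return false; // early abort: later blocks are never read
        }
        return true;
    }
}
```

For clarity the sketch reopens the file for every block; a real implementation would probably keep one stream open per candidate.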

I doubt that changing the hashing algorithm itself (e.g. doing a CRC check first, as suggested) would make any significant difference. Your bottleneck is likely disk IO, not CPU, so avoiding disk IO is what will give you the most improvement. But as always with performance, do measure.

Then, if this is still not enough (and only then), experiment with asynchronous IO. Remember, though, that sequential reads are generally faster than random access, so too much random asynchronous reading can hurt your performance.


Solution:2

  • Create a file list
  • Sort the list by filesize
  • Eliminate files with unique sizes from the list
  • Now do hashing (a fast hash first might improve performance as well)
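
The steps above can be sketched with LINQ as follows. This is only an illustration: `SizeGrouping` and `FindIdenticalFiles` are hypothetical names, and the list passed in is assumed to hold both the already-processed files and the candidates, so that a file with a unique size cannot match anything.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class SizeGrouping
{
    // Returns groups of paths with identical content, hashing only files
    // whose size is shared with at least one other file.
    public static List<List<string>> FindIdenticalFiles(IEnumerable<FileInfo> files)
    {
        return files
            .GroupBy(f => f.Length)                        // stands in for "sort by file size"
            .Where(sizeGroup => sizeGroup.Count() > 1)     // unique sizes cannot have a twin
            .SelectMany(sizeGroup => sizeGroup.GroupBy(f => Hash(f)))
            .Where(hashGroup => hashGroup.Count() > 1)     // keep only real content matches
            .Select(hashGroup => hashGroup.Select(f => f.FullName).ToList())
            .ToList();
    }

    private static string Hash(FileInfo file)
    {
        using (FileStream fs = file.OpenRead())
        using (SHA256 hasher = SHA256.Create())
            return Convert.ToBase64String(hasher.ComputeHash(fs));
    }
}
```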


Solution:3

  • Use a data structure for your readFileHashes store that has an efficient search capability (hashing or binary search). I think a HashSet or TreeSet would serve you better here (in .NET, HashSet<T> or SortedSet<T>).

  • Use an appropriate checksum (hash sum) function. SHA256 is a cryptographic hash that is probably overkill. CRC is less computationally expensive; it was originally intended for catching unintentional/random changes (transmission errors), but it is susceptible to changes that are designed to be hidden. Which fits the differences between the files you are scanning?

    See http://en.wikipedia.org/wiki/List_of_checksum_algorithms#Computational_costs_of_CRCs_vs_Hashes

    Would a really simple checksum via sampling (e.g. checksum = (first 10 bytes and last 10 bytes)) work?
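
As a rough illustration of the first point, the lookup could be moved from a List<string> to a HashSet<string>. In this sketch `NewFileFilter` and `FindNewFiles` are hypothetical names, and the `hashOf` parameter stands in for whatever hash or checksum you settle on (for example `f => f.ToSHA256Hash().ToHexString()` from the question).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class NewFileFilter
{
    // Returns the files whose hash is not among the already-read hashes.
    public static List<FileInfo> FindNewFiles(
        IEnumerable<FileInfo> files,
        IEnumerable<string> readFileHashes,
        Func<FileInfo, string> hashOf)
    {
        // HashSet lookups are O(1) on average; List.Contains scans the whole list.
        var seen = new HashSet<string>(readFileHashes, StringComparer.Ordinal);
        var result = new List<FileInfo>();

        foreach (FileInfo file in files)
        {
            // Add returns false when the hash is already present, so duplicates
            // within the same batch are skipped as well.
            if (seen.Add(hashOf(file)))
                result.Add(file);
        }
        return result;
    }
}
```

With roughly 1000 hashes per directory the lookup is unlikely to be the main cost, so measure before expecting a large gain from this change alone.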


Solution:4

I'd do a quick CRC check first, as it is less expensive. If the CRC matches a known file, continue on with a more "reliable" hash test such as SHA; if it does not match, the file is new and the more expensive hash can be skipped.
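
A two-tier check along these lines could be sketched as below. The CRC-32 routine is a plain table-driven implementation written out for the example (reflected polynomial 0xEDB88320), and `TwoTierIndex` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public class TwoTierIndex
{
    private static readonly uint[] Table = BuildTable();

    // CRC-32 value -> SHA-256 hashes of the known files that have this CRC
    private readonly Dictionary<uint, HashSet<string>> known = new Dictionary<uint, HashSet<string>>();

    public void Add(string path)
    {
        uint crc = Crc32(path);
        if (!known.TryGetValue(crc, out HashSet<string> shas))
            known[crc] = shas = new HashSet<string>();
        shas.Add(Sha256(path));
    }

    public bool Contains(string path)
    {
        // Cheap check first: an unknown CRC means the content is new.
        if (!known.TryGetValue(Crc32(path), out HashSet<string> shas))
            return false;
        // Different content can share a CRC, so confirm with the stronger hash.
        return shas.Contains(Sha256(path));
    }

    public static uint Crc32(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            uint crc = 0xFFFFFFFFu;
            byte[] buffer = new byte[81920];
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                for (int i = 0; i < read; i++)
                    crc = Table[(crc ^ buffer[i]) & 0xFF] ^ (crc >> 8);
            return crc ^ 0xFFFFFFFFu;
        }
    }

    private static string Sha256(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        using (SHA256 hasher = SHA256.Create())
            return Convert.ToBase64String(hasher.ComputeHash(fs));
    }

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }
}
```

Note that a file whose CRC matches is read twice here, and the CRC pass still reads the whole file, so this only helps if hashing rather than disk IO is the bottleneck.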


Solution:5

Your description of the problem still isn't clear enough.

The biggest problem is that you are doing a bunch of hashing. This is guaranteed to be slow.

You might want to try checking the modification time, which does not change when a file is renamed:

http://msdn.microsoft.com/en-us/library/ms724320(VS.85,loband).aspx

Or you might want to monitor the folder for new or changed files:

http://www.codeguru.com/forum/showthread.php?t=436716
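
In .NET, folder monitoring is available through the FileSystemWatcher class. The sketch below is one possible use of it, not taken from the linked thread; `NewFileMonitor` is a hypothetical wrapper name. It only sees files that arrive while the program is running.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;

public sealed class NewFileMonitor : IDisposable
{
    private readonly FileSystemWatcher watcher;

    // Paths of files created since monitoring started.
    public ConcurrentQueue<string> NewFiles { get; } = new ConcurrentQueue<string>();

    public NewFileMonitor(string directory)
    {
        watcher = new FileSystemWatcher(directory)
        {
            NotifyFilter = NotifyFilters.FileName
        };

        // Created fires when the file appears, possibly before it is fully written.
        // Renames raise the separate Renamed event and are ignored here, since
        // in the question a renamed file still has already-processed content.
        watcher.Created += (sender, e) => NewFiles.Enqueue(e.FullPath);
        watcher.EnableRaisingEvents = true;
    }

    public void Dispose()
    {
        watcher.Dispose();
    }
}
```

Events are raised on a background thread, which is why the paths go into a thread-safe queue for the main loop to drain.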


Solution:6

First group the files by file size; this will leave you with smaller groups of files. What to do next depends on the group size and file sizes. You could just start reading all files in a group in parallel until you find a difference. If there is a difference, split the group into smaller groups having the same value at the current position. If you have information about how the files differ, you can use it: start reading at the end, don't read and compare byte by byte if larger clusters change, or whatever else you know about the files. This solution might introduce I/O performance problems if you have to read many files in parallel, causing random disc access.

You could also calculate hash values for all files in each group and compare them. You need not necessarily process the whole files at once: just calculate the hash of a few bytes (maybe a 4 KiB cluster or whatever fits your file sizes) and check if there are already differences. If not, calculate the hashes of the next few bytes. This lets you process larger blocks of each file without having to keep one such large block in memory for each file in a group.

So it's all about a time-memory (disc I/O-memory) trade-off. You have to find your way between reading all files in a group into memory and comparing them byte by byte (high memory requirement, fast sequential access, but may read too much data) and reading the files byte by byte and comparing only the last byte read (low memory requirement, slow random access, reads only the required data). Further, if the groups are very large, comparing the files byte by byte will become slower (comparing one byte from n files is an O(n) operation) and it might become more efficient to calculate hash values first and then compare only the hash values.
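
The group-splitting idea can be sketched as below, under the assumption that all paths passed in have already been found to have the same size. For brevity the chunks are compared by their raw bytes rather than by a hash of each chunk, and the file is reopened for every chunk; `ChunkPartitioner` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ChunkPartitioner
{
    // Splits a group of same-size files into subgroups with identical content,
    // looking at one chunk at a time and reading on only while files still agree.
    public static List<List<string>> Partition(IEnumerable<string> sameSizePaths, int chunkSize = 4096)
    {
        var result = new List<List<string>>();
        var all = sameSizePaths.ToList();
        if (all.Count == 0)
            return result;

        long length = new FileInfo(all[0]).Length; // assumed equal for every file
        var pending = new Stack<(List<string> Group, long Offset)>();
        pending.Push((all, 0L));

        while (pending.Count > 0)
        {
            var (group, offset) = pending.Pop();

            // A single file, or a group that agreed up to the end, is finished.
            if (group.Count == 1 || offset >= length)
            {
                result.Add(group);
                continue;
            }

            // Split by the content of the current chunk and continue with each part.
            var parts = group.GroupBy(p => Convert.ToBase64String(ReadChunk(p, offset, chunkSize)));
            foreach (var part in parts)
                pending.Push((part.ToList(), offset + chunkSize));
        }
        return result;
    }

    private static byte[] ReadChunk(string path, long offset, int count)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            byte[] buffer = new byte[count];
            int total = 0, read;
            while (total < count && (read = fs.Read(buffer, total, count - total)) > 0)
                total += read;
            Array.Resize(ref buffer, total); // the last chunk may be shorter
            return buffer;
        }
    }
}
```

Only one chunk per file is held in memory at a time, which is the low-memory end of the trade-off described above; larger chunks mean fewer seeks but more data read after the first difference.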


Solution:7

Updated: Definitely DO NOT make file size your only check. If your OS version allows it, use FileInfo.LastWriteTime.

I've implemented something similar for an in-house project compiler/packager. We have over 8k files, so we store the last modified dates and hash data in a SQL database. Then on subsequent runs we query first against the modified date of each file, and only then against the hash data. That way we only calculate new hash data for those files that appear to be modified.

.NET has a way to check the last modified date, in the FileInfo class. I suggest you check it out. EDIT: the property is LastWriteTime.

Our packager takes about 20 secs to find out what files have been modified.
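
A minimal sketch of that approach, with an in-memory dictionary standing in for the SQL database. `HashCache` is a hypothetical name, and keying the cache by full path is an assumption that suits the packager scenario; if your file names change, as in the question, you would need a different key.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public class HashCache
{
    // Full path -> last write time and hash recorded when the file was last hashed
    private readonly Dictionary<string, (DateTime LastWriteUtc, string Hash)> entries =
        new Dictionary<string, (DateTime LastWriteUtc, string Hash)>();

    // Counts how often a file actually had to be read and hashed.
    public int HashComputations { get; private set; }

    public string GetHash(FileInfo file)
    {
        file.Refresh(); // FileInfo caches metadata, so reload it first

        // Unchanged timestamp: trust the stored hash and skip reading the file.
        if (entries.TryGetValue(file.FullName, out var entry) &&
            entry.LastWriteUtc == file.LastWriteTimeUtc)
        {
            return entry.Hash;
        }

        string hash;
        using (FileStream fs = file.OpenRead())
        using (SHA256 hasher = SHA256.Create())
            hash = Convert.ToBase64String(hasher.ComputeHash(fs));

        HashComputations++;
        entries[file.FullName] = (file.LastWriteTimeUtc, hash);
        return hash;
    }
}
```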

