How many threads does it take to make them a bad choice?


I have to write a not-so-large program in C++, using boost::thread.

The problem at hand is to process a large number of (possibly) large files — maybe thousands or tens of thousands; hundreds of thousands or millions are a possibility as well. Each file is independent of the others, and they all reside in the same directory. I'm thinking of using a multi-threaded approach, but the question is, how many threads should I use? I mean, what order of magnitude? 10, 500, 12400?

There are some synchronization issues: each thread should return a struct of values (accumulated per file), and those are added to a "global" struct to get the overall data. I realize that some threads could starve because of the synchronization, but if it's only an add operation, does it matter?

I was thinking of

for (each file f in directory) {
    if (N < max_threads)   // N is a static variable controlling the number of threads
        thread_process(f);
    else
        sleep();
}

This is on HP-UX, but I won't be able to test it often, since it's a remote and fairly inaccessible server.


According to Amdahl's law that was discussed by Herb Sutter in his article:

Some amount of a program's processing is fully "O(N)" parallelizable (call this portion p), and only that portion can scale directly on machines having more and more processor cores. The rest of the program's work is "O(1)" sequential (s). [1,2] Assuming perfect use of all available cores and no parallelization overhead, Amdahl's Law says that the best possible speedup of that program workload on a machine with N cores is given by
speedup(N) = (s + p) / (s + p/N)

In your case, I/O operations could take most of the time, as could synchronization. You could measure the time that will be spent blocking on slow I/O operations and approximately find the number of threads that will be suitable for your task.

A full list of concurrency-related articles by Herb Sutter can be found here.


I'm not too sure about HP/UX, but in the Windows world, we use thread pools to solve this sort of problem. Raymond Chen wrote about this a while back, in fact...

The skinny of it is that I would generally not expect anything to scale well on a CPU-bound load if the number of threads is more than about 2x the number of CPU cores you have in the system. For I/O bound loads, you might be able to get away with more, depending on how fast your disk subsystem is, but once you reach about 100 or so, I would seriously consider changing the model...


You said the files are all in one directory. Does that mean they are all on one physical drive?

If that is so, and assuming they are not already cached, then your job will be to keep the single read head busy, and no amount of threading will help it. In fact, if it has to hop between tracks due to parallelism, you could slow it down.

On the other hand, if the computation part takes significant time, causing the read head to have to wait, then it might make sense to have >1 thread.

Often, using threads for performance is missing the point unless it lets you get parallel pieces of hardware working at the same time.

More often, the value of threads is in, for example, keeping track of multiple simultaneous conversations, as when you have multiple users, where each thread can wait for its own Johnny or Suzy and not get confused.


To elaborate, it really depends on:

  • the I/O boundedness of the problem
      • how big the files are
      • how contiguous the files are
      • in what order they must be processed
      • whether you can determine the disk placement
  • how much concurrency you can get in the "global structure insert"
      • whether you can "silo" the data structure with a consolidation wrapper
  • the actual CPU cost of the "global structure insert"

For example, if your files reside on a 3-terabyte flash memory array, the solution is different than if they reside on a single disk (where, if the "global structure insert" takes less time than the read, the problem is I/O bound and you might just as well have a 2-stage pipe with 2 threads: the read stage feeding the insert stage).

But in both cases the architecture would probably be a vertical pipeline of 2 stages: n reading threads and m writing threads, with n and m determined by the "natural concurrency" of each stage.

Creating a thread per file will probably lead to disk thrashing. Just like you tailor the number of threads of a CPU bound process to the naturally achievable CPU concurrency (and going above that creates context switching overhead AKA thrashing) the same is true on the I/O side - in a sense you can think of the disk thrashing as "context switching on the disk".


If the workload is anywhere near as I/O bound as it sounds, then you're probably going to get maximum throughput with about as many threads as you have spindles. If all the data sits on a single disk, you probably don't want more than one thread. If more than one thread tries to access non-sequential parts of the disk, the OS must stop reading one file, even though it may be right under the head, and move to another part of the disk to service another thread, so that it doesn't starve. With only one thread, the disk need never stop reading to move the head.

Obviously that depends on the access patterns being very linear (such as with video recoding) and the data actually being unfragmented on disk, both of which depend a lot on your situation. If the workload is more CPU bound, then it won't matter quite as much and you can use more threads, since the disk will be twiddling its thumbs anyway.

As other posters suggest, profile first!


Use a thread pool instead of creating a thread for each file. You can easily adjust the number of threads once you have a working solution. If the jobs are independent of each other, I'd say the number of threads should be equal to the number of cores/CPUs.


Not to sound trite, but use as many threads as you need.

Basically you can draw a graph of the number of threads against the (real) time to completion. You can also draw one that is total threads to total thread time.

The first graph in particular will help you identify where the bottleneck in CPU power lies. At some point you will become either I/O bound (meaning the disk can't load the data fast enough) or the number of threads will become so large it will impact performance of the machine.

The second does happen. I saw one piece of code that ended up creating 30,000+ threads. It ended up being quicker by capping it to 1,000.

The other way to look at this is: how fast is fast enough? The point where I/O becomes a bottleneck is one thing but you may hit a point before that where it's "fast enough".


The answer depends somewhat on how CPU intensive the processing you need to perform on each file is.

At one extreme where the processing time dominates the I/O time, the benefit that threading gives you is just the ability to take advantage of multiple cores (and possibly hyperthreading) to make use of the maximum available processing power of your CPU. In this case you'd want to aim for a number of worker threads roughly equal to the number of logical cores on the system.

At the other extreme where I/O is your bottleneck you aren't going to see all that much benefit from multiple threads since they will spend most of their time sleeping waiting for I/O to complete. In that case you'd want to focus on maximizing your I/O throughput rather than your CPU utilization. On a single unfragmented hard drive or a DVD where you were I/O bound having multiple threads would likely hurt performance since you'd get maximum I/O throughput from sequential reads on a single thread. If the drive is fragmented or you have a RAID array or similar then having multiple I/O requests in flight simultaneously might boost your I/O throughput since the controller may be able to intelligently rearrange them to make more efficient reads.

I think it might be helpful to view this as really two separate problems. One is how to get maximum I/O throughput for your file reads, the other is how to make maximum use of your CPU for processing the files. You would probably get optimal throughput by having a small number of I/O threads kicking off I/O requests and a pool of worker threads roughly equal to the number of logical CPU cores processing the data as it becomes available. Whether it is worth the effort to implement a more complex setup like that depends on where the bottlenecks are in your particular problem though.


This might be a bit too old school sounding but have you considered simply forking processes? It sounds like you have highly independent work units with a small aggregation of return data. A process model would also free up virtual address space (which might be tight if you're on a 32-bit machine) allowing each worker room to say mmap() the whole file being processed.


There are a lot of variables that will affect performance (OS, filesystem, hard drive speed vs. CPU speed, data access patterns, how much processing is done on the data after it is read, etc.).

So your best bet is to simply try a test run for every possible thread count, on a representative data set (a big one if possible, so that filesystem caching won't skew the results too badly), and record how long it takes each time. Start with a single thread, then try it again with two threads, and so on until you feel you have enough data points. At the end you should have data that graphs into a nice curve that indicates where the "sweet spot" is. You should be able to do this in a loop so that the results are compiled automatically overnight.


More threads will not necessarily give you higher throughput. Threads have a non-trivial cost, both to create (in terms of CPU time and OS resources) and to run (in terms of memory and scheduling). And the more threads you have, the more potential for contention with other threads. Adding threads can sometimes even slow down execution. Each problem is subtly different, and you are best off writing a nice, flexible solution and experimenting with the parameters to see what works best.

Your example code, spawning a thread for each file, would almost immediately swamp the system for values of max_threads beyond around 10. As others have suggested, a thread pool with a work queue is what you probably want. The fact that each file is independent is nice, as that makes it almost embarrassingly parallel (save for the aggregation at the end of each unit of work).

Some factors that will affect your throughput:

  • Number of CPU cores
  • The number of disk channels (spindles, RAID devices, etc)
  • The processing algorithm, and whether the problem is CPU or I/O bound
  • Contention for the master statistics structure

Last year I wrote an application that does essentially the same as you describe. I ended up using Python and the pprocess library. It used a multi-process model with a pool of worker processes, communicating via pipes (rather than threads). A master process would read the work queue, chop up the input into chunks, and send chunk info to the workers. A worker would crunch the data, collecting stats, and when it was done send the results back to the master. The master would combine the results with the global totals and send another chunk to the worker. I found it scaled almost linearly up to 8 worker processes (on an 8-core box, which is pretty good), and beyond that it degraded.

Some things to consider:

  • Use a thread pool with a work queue, where the number of threads is likely around the number of cores in your system
  • Alternatively, use a multi-process setup, which communicates via pipes
  • Evaluate using mmap() (or equivalent) to memory map the input files, but only after you've profiled the baseline case
  • Read data in multiples of the block size (e.g. 4 kB), and chop it up into lines in memory
  • Build in detailed logging from the start, to aid debugging
  • Keep an eye on contention when updating the master statistics, although it will likely be swamped by the processing and read times of the data
  • Don't make assumptions - test, and measure
  • Set up a local dev environment that is as close as possible to the deployment system
  • Use a database (such as SQLite) for state data, processing results, etc
  • The database can keep track of which files have been processed, which lines had errors, warnings, etc
  • Only give your app read-only access to the original directory and files, and write your results elsewhere
  • Be careful not to try to process files that are open by another process (there's a few tricks here)
  • Careful you don't hit OS limits of the number of files per directory
  • Profile everything, but be sure to only change one thing at a time, and keep detailed records. Performance optimization is hard.
  • Set up scripts so you can consistently re-run tests. Having a DB helps here, as you can delete the records that flag a file as having been processed and re-run against the same data.

When you have a significant number of files in the one directory as you describe, aside from potentially hitting filesystem limits, the time to stat the directory and figure out which files you've already processed and which you still need to process goes up significantly. Consider breaking up the files into subdirectories by date, for example.

One more word on performance profiling: be careful when extrapolating performance from small test data sets to super-huge data sets. You can't. I found out the hard way that you can reach a certain point where regular assumptions about resources that we make every day in programming just don't hold any more. For example, I only found out the statement buffer in MySQL is 16MB when my app went way over it! And keeping 8 cores busy can take a lot of memory; you can easily chew up 2GB of RAM if you're not careful! At some point you have to test on real data on the production system, but give yourself a safe test sandbox to run in, so you don't munge production data or files.

Directly related to this discussion is a series of articles on Tim Bray's blog called the "Wide Finder" project. The problem was simply to parse logfiles and generate some simple statistics, but in the fastest manner possible on a multicore system. Many people contributed solutions, in a variety of languages. It is definitely worth reading.


I agree with everyone suggesting a thread pool: You schedule tasks with the pool, and the pool assigns threads to do the tasks.

If you're CPU-bound, simply keep adding threads as long as the CPU usage is below 100%. When you're I/O bound, disk thrashing might at some point prevent more threads from improving speed. That you'll have to find out yourself.

Have you seen Intel's Threading Building Blocks? Note that I cannot comment whether this is what you need. I only made a small toy project on Windows and that was a few years ago. (It was somewhat similar to yours, BTW: It recursively traverses a folder hierarchy and counts lines in source code files it finds.)


How expensive the simplest thread is depends on the OS (you may also need to tune some OS parameters to get past a certain number of threads). At minimum each has its own CPU state (registers/flags incl. floating point) and stack as well as any thread-specific heap storage.

If each individual thread doesn't need too much distinct state, then you can probably get them pretty cheaply by using a small stack size.

In the limit, you may end up needing to use a non-OS cooperative threading mechanism, or even multiplex events yourself using tiny "execution context" objects.

Just start with threads and worry about it later :)


As a ballpark number, you should probably keep the thread count between 10 and 100 to minimize lock contention and context switching overhead.


There are two problems here: the first is your question about the ideal number of threads to use for processing this large number of files, the second is how to achieve the best performance.

Let's start with the second problem. To begin with, I would not parallelize per file; instead, I would parallelize the processing done on one file at a time. This helps significantly in multiple parts of your environment:

  • The hard drive does not have to seek from one file to the n - 1 others
  • The operating system file cache will be warm with the data you need on all your threads, and you will not experience as much cache thrashing

I admit that the code to parallelize your application gets slightly more complex but the benefits you'll obtain are significant.

From this, the answer to your question is easy: you should run at most one thread per core present in your system. This will allow you to be respectful of your caches and ultimately achieve the best performance possible on your system.

The ultimate point, of course, is that with this type of processing your application will be more respectful of your system, as accessing n files simultaneously may make your OS unresponsive.
