Tutorial :How to best avoid Multiple Inserts?



Question:

I have a service application that monitors a directory changes.

The service application will monitor the directory closely. Here's how it does the thing:

  1. Sleep X minutes
  2. Scan through the directory, see whether there are new additions
  3. Launch one thread per one addition
  4. Repeat 1-3

One of the thing it does is that it will insert a record to a database. Now, since multiple threads can be running at the same time, it is entirely possible that two threads are inserting record at the same time.

I want to avoid this kind of simultaneous inserts; instead, I want those inserts to queue up, meaning that only 1 thread can access the database at a single point of time. Other service can't access the db when the there area already 1 thread that accesses the database.

The reason I want this is because in the case of a power surge, I lost only one transaction instead of many.

I am thinking about using lock statement that locks out the database transaction. Is this the best approach?

P/S: I am writing a .Net service that talks to MySQL database.


Solution:1

A database is meant to allow multiple processes to insert data at the same time, so I fail to see your problem. Do you get an error?

[EDIT] You fear a power surge. Where? On the server or the client? Here are your options:

  1. Server crash. Just have the clients check for the error code for "connection lost" and have them retry their insert until it succeeds. To figure out this error code, either shut down the server in the middle of an insert or pull out the network cable of your client (or both; sometimes, you get different errors for unexpected network problems).

  2. Client crash (much more likely, since clients are usually cheap PCs. If users have access to it, it will eventually be infected with a virus or the file system will become corrupted, it will run out of RAM, someone will install something "cool" which destroys a vital DLL, or whatnot). In this case, you have to start the client again, check what has already been inserted (using an application key) and continuing from there.

An application key is an ID for a row that is related to your business process. For example, if you sell cards, the app key could be the name of the car and the name of the customer plus the timestamp.


Solution:2

I would look first at the why you want to do this. Normally applications operate like that, there is no issue in 2 clients doing inserts at the same time (well, besides wrong approaches in the code).

Also the solution will vary with the scenario. One option is to have a microsoft message queue (MSMQ), and move the inserts out of those services, so the load of the inserts is controlled by the process that reads form the queue.

Update 1: I still fail to see why you want to avoid the inserts from running in parallel (and from what I read in the other responses I think others do to). I will quote 2 of your comments on this:

The reason I want this is because in the case of a power surge, I lost only one transaction instead of many.

Ant the other:

It's because before the insertion, there are other lengthy jobs that require parallelism. Only when during the insertion time, there is a need for sequential access.

If I read the first alone, I would actually think that's a reason to want them in parallel. You want to be done with them as fast as possible. Going sequential will actually increase the time to make the inserts, so there is more time where the power surge can occur.

Reading the second, it makes me think that you are probably more concerned in the effect of these processes running on parallel and the insert not getting into the db. This again means you want to be done with the inserts asap, so not a reason to do them sequentially.

For built-in support, you might have a case for a distributed transaction that also includes the file system (ms was looking at the file system thing, but I don't recall if they ever did something on the newer OS). Distributed transactions are kind of a pain to set up though (msdtc and access to the ports it uses).

A good path I have followed, is adding more info to the process to be able to tell where it failed. You might not even code an automatic recovery process yet, but at least you know you will have the info for sure to know something went wrong.

The simplest way is an insert at the beginning of the process and have a flag to signal when it was completed. If it is a long running process you might want to have something more like an status you keep updating to be able to tell in which step it failed. An alternative is writing the status to the file system.

In any case it will only tell you that the last step that completed successfully, not whether the current step was or not able to be completed. This is what makes the retry logic a bit more complex, as you can't just continue where it stopped, you have to check whether the last step was or not done, and that depends on each step.

Ps. if the above is the case, it is hard to tell from the question. You might want to open a different question about long running processes and/or automatic retries.


Solution:3

If you have multiple windows services installed watching the same folder or if you have a multi-threaded windows service then you will get multiple simultaneous inserts. Otherwise you don't need to worry.


Solution:4

Why do you create a new thread per insert? Why not simply loop over the inserts? Threads are only necessary when you need parallelism; in you case, this seems to be the exact opposite of what you want.

[EDIT]

It's because before the insertion, there are other lengthy jobs that require parallelism. Only when during the insertion time, there is a need for sequential access.

Then have those threads report back to the main thread who collects the results and inserts them in a loop.


Solution:5

As many said before: this is a bad thing to do. It will artificaly create a bottleneck.

But if you really want to do it, here are your options:

a) create a primary key on the table. You should do that anyway for a clean db design. In order to check the validity of the PK the database has to use locking which will force your inserts into a sequence.

b) if a) isn't enough of a bottleneck for your purpose you should use a separate lock. Depending on your environment there might be a special api for this (e.g. the java concurrency api).

c) If that isn't available or you don't like it, you can use the locking mechanisms of the database: create a table with just a single row in it. On each insert do the following: - read the special table 'for update'. This will (in most database setups) keep any other session from doing the same. - do the insert. - commit. This will release the lock and allow the next session to proceed.

And just as a reminder: This is exactly the opposite of what one tries to achieve normally: Allowing as many threads as possible to work without interference


Solution:6

I think the best option here is to follow Aaron's suggestion and have the threads report back after processing each file; then update the database from a single-threaded 'manager' class.

If that is not an option you could use lock/synclock around the code that updates the database.

However, one of your comments says there are multiple services running. Is that true? Multiple Windows Services watching the same directory AND multi-threading the processing? Why? You'll have to better defend that architecture to get better responses, I think.


Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Previous
Next Post »