Tutorial: Deleting data from a large table



Question:

I have a table with about 10 fields that stores GPS info for customers. Over time, as we have added more customers, that table has grown to about 14 million rows. As the GPS data comes in, a service constantly inserts a row into the table. 90% of the data is not relevant, i.e. the customer does not care where the vehicle was 3 months ago, but the most recent data is used to generate tracking reports. My goal is to write a SQL script to purge the data that is older than a month.

Here is my problem: I cannot use TRUNCATE TABLE, as I would lose everything. Yesterday I wrote a DELETE statement with a WHERE clause. When I ran it on a test system, it locked up my table and the simulated GPS inserts were intermittently failing. Also, my transaction log grew to over 6GB as it attempted to log each delete.

My first thought was to delete the data a little at a time, starting with the oldest first, but I was wondering if there was a better way.


Solution:1

Try this

WHILE EXISTS (SELECT * FROM Table WHERE (condition for deleting))
BEGIN
    SET ROWCOUNT 1000
    DELETE FROM Table WHERE (condition for deleting)
    SET ROWCOUNT 0
END

This will delete the rows in batches of 1,000, so locks are held and log records are written only for small chunks at a time.


Solution:2

My 2 cents:

If you are using SQL 2005 and above, you can consider partitioning your table based on the date field, so the whole table doesn't get locked when deleting old records.
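As a sketch of what that partitioning might look like, assuming a SQL Server 2005+ table keyed by an `UpdateDate` datetime column (the object names and boundary dates here are hypothetical):

```sql
-- Hypothetical sketch: monthly partitions on UpdateDate (adjust names and dates)
CREATE PARTITION FUNCTION pfGpsByMonth (datetime)
AS RANGE RIGHT FOR VALUES ('2009-03-01', '2009-04-01', '2009-05-01');

CREATE PARTITION SCHEME psGpsByMonth
AS PARTITION pfGpsByMonth ALL TO ([PRIMARY]);

-- A new (or rebuilt) table placed on the scheme, partitioned by the date column
CREATE TABLE dbo.tblGPSLogPartitioned (
    GPSVehicleInfoLogId uniqueidentifier NOT NULL,
    UpdateDate          datetime NOT NULL
    -- ... remaining columns ...
) ON psGpsByMonth (UpdateDate);
```

With this in place, old months can be removed per partition rather than row by row.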

If you are in a position to make DBA decisions, you can temporarily switch the database to the SIMPLE recovery model, so the log won't grow so fast. The log will still be written to, but it can be truncated at each checkpoint instead of retaining every operation until a log backup.
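For reference, switching the recovery model is a one-liner (the database name below is a placeholder); remember to switch back to FULL and take a fresh backup afterwards to restart the log backup chain:

```sql
-- Hypothetical database name; run before the purge
ALTER DATABASE GpsTracking SET RECOVERY SIMPLE;

-- ... perform the batched delete here ...

ALTER DATABASE GpsTracking SET RECOVERY FULL;
-- Take a full backup to re-establish the log backup chain
BACKUP DATABASE GpsTracking TO DISK = 'C:\backup\GpsTracking.bak';
```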


Solution:3

Better is to create a temporary table and insert only the data you want to keep. Then truncate your original table and copy the kept rows back.

Oracle syntax (SQL Server is similar)

create table keep as select * from source where data_is_good = 1;
truncate table source;
insert into source select * from keep;

You'll need to disable foreign keys, if there are any on the source table.

In Oracle, index names must be unique across the entire schema, not just per table. In SQL Server, you can optimize this further by simply renaming "keep" to "source", since you can create indexes of the same name on both tables.


Solution:4

If you're using SQL Server 2005 or 2008, sliding window partitioning is the perfect solution for this: instant archiving or purging without any perceptible locking.
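The core of the sliding-window technique is ALTER TABLE ... SWITCH, which detaches a whole partition into a staging table as a metadata-only operation. A minimal sketch, assuming a table already partitioned by month (all object names are hypothetical, and the staging table must match the partitioned table's structure and filegroup exactly):

```sql
-- Staging table: same columns, indexes and filegroup as dbo.tblGPSLog
CREATE TABLE dbo.tblGPSLog_Staging (
    GPSVehicleInfoLogId uniqueidentifier NOT NULL,
    UpdateDate          datetime NOT NULL
    -- ... remaining columns, matching dbo.tblGPSLog ...
);

-- Instantly detach the oldest partition (partition 1 here) into staging
ALTER TABLE dbo.tblGPSLog SWITCH PARTITION 1 TO dbo.tblGPSLog_Staging;

-- The old rows can now be dropped without touching the live table
TRUNCATE TABLE dbo.tblGPSLog_Staging;

-- Merge away the now-empty boundary
ALTER PARTITION FUNCTION pfGpsByMonth() MERGE RANGE ('2009-03-01');
```

The live table is never scanned, so inserts continue uninterrupted while a month of data disappears.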


Solution:5

Can you copy recent data to a new table, truncate the table, then copy it back?

Of course, then you're going to need to worry about doing that again in 6 months or a year.


Solution:6

I would do a manual delete by day/month (whatever is the largest unit you can get away with). Once you do that first one, write a stored proc to kick off every day that deletes the oldest data you don't need.

DELETE FROM TABLENAME
WHERE DATEDIFF(day, tableDateTime, GETDATE()) > 90

Personally, I hate doing stuff to production datasets where one missed key results in some really bad things happening.
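The daily job could be wrapped in a stored procedure along these lines (a sketch; the procedure and table names are placeholders). Note that the cutoff is computed with DATEADD and compared directly against the column, which, unlike wrapping the column in DATEDIFF, lets an index on tableDateTime be used:

```sql
CREATE PROCEDURE dbo.PurgeOldGpsData
AS
BEGIN
    SET NOCOUNT ON;
    -- Sargable predicate: an index on tableDateTime can seek to the cutoff
    DELETE FROM TABLENAME
    WHERE tableDateTime < DATEADD(day, -90, GETDATE());
END
```

Scheduling it as a daily SQL Server Agent job keeps each run small once the initial backlog is cleared.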


Solution:7

Welcome to Data Warehousing. You need to split your data into two parts.

  • The actual application, with current data only.

  • The history.

You need to write a little "ETL" job to move data from the current table to history, then delete the moved rows from the current table.

You need to run this periodically. Daily, weekly, monthly, quarterly: technically it doesn't matter. What matters is what use the history has and who uses it.
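A minimal sketch of such a job in T-SQL (the table and column names are assumptions), moving rows older than a month into a history table and then removing exactly those rows. Capturing the cutoff in a variable first ensures the INSERT and DELETE agree on which rows are affected:

```sql
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(month, -1, GETDATE());

BEGIN TRAN;

-- Copy aged rows into the history table
INSERT INTO dbo.tblGPSVehicleInfoLog_History
SELECT *
FROM dbo.tblGPSVehicleInfoLog
WHERE UpdateDate < @cutoff;

-- Remove only what was just copied
DELETE FROM dbo.tblGPSVehicleInfoLog
WHERE UpdateDate < @cutoff;

COMMIT;
```

In practice the DELETE would itself be batched, as discussed in the other answers, to keep lock times short.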


Solution:8

I would probably do it in batches as you have already come up with. Another option would be to insert the important data into another table, truncate the GPS table, then reinsert the important data. You would have a small window where you would be missing the recent historical data. How small that window is would depend on how much data you needed to reinsert. Also, you would need to be careful if the table uses autoincrementing numbers or other defaults so that you use the original values.
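The identity caveat above can be handled with IDENTITY_INSERT; a sketch assuming a hypothetical identity column LogId and a holding table tmpKeep (IDENTITY_INSERT requires an explicit column list):

```sql
-- Preserve the original identity values when reinserting the kept rows
SET IDENTITY_INSERT dbo.tblGPSVehicleInfoLog ON;

INSERT INTO dbo.tblGPSVehicleInfoLog (LogId, UpdateDate /* , other columns */)
SELECT LogId, UpdateDate /* , other columns */
FROM dbo.tmpKeep;

SET IDENTITY_INSERT dbo.tblGPSVehicleInfoLog OFF;
```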

Once you have the table cleaned up, a regular cleaning job should be scheduled. You might also want to look into partitioning depending on your RDBMS.


Solution:9

I assume you can't take the production system down (or queue up the GPS results for insertion after the purge is complete).

I'd go with your inclination of deleting a fraction of it at a time (perhaps 10%) depending on the performance you find on your test system.

Is your table indexed? That might help, but the indexing process may have similar effects on the system as doing the one great purge.


Solution:10

Keep in mind that most databases lock the neighboring records in an index during a transaction, so keeping your operations short will be helpful. I'm assuming that your insertions are failing on lock wait timeouts, so delete your data in small, bursty transactions. I'd suggest a single-threaded Perl script that loops through the oldest rows in chunks of 1,000. I hope your primary key (and hopefully your clustered index, in case they somehow ended up being two different things) can be correlated to time, as that would be the best thing to delete by.

PseudoSQL:

Select maxPrimId = max(primId) where created < 3_months_ago
Delete from table where primId < maxPrimId limit 1000

Now, here's the really fun part: All these deletions MAY make your indexes a mess and require that they be rebuilt to keep the machine from getting slow. In that case, you'll either have to swap in an up-to-date slave, or just suffer some downtime. Make sure you test for this possible case on your test machine.


Solution:11

If you are using Oracle, I would set up a partition by date on your tables and the indexes. Then you delete the data by dropping the partition... the data will magically go away with the partition.

This is an easy step - and doesn't clog up your redo logs etc.
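In Oracle syntax, the setup and the purge might look like this sketch (table, column, partition names and dates are all hypothetical):

```sql
-- Range-partitioned by month on the date column
CREATE TABLE gps_log (
    log_id      NUMBER,
    update_date DATE
)
PARTITION BY RANGE (update_date) (
    PARTITION p_2009_03 VALUES LESS THAN (DATE '2009-04-01'),
    PARTITION p_2009_04 VALUES LESS THAN (DATE '2009-05-01')
);

-- Purging a month is a dictionary operation, not millions of row deletes
ALTER TABLE gps_log DROP PARTITION p_2009_03;
```

Local indexes are dropped along with the partition; if the table has global indexes, add UPDATE GLOBAL INDEXES to the DROP PARTITION statement to keep them usable.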



Solution:12

Does the delete statement use any of the indexes on the table? Often a huge performance improvement can be obtained either by modifying the statement to use an existing index, or by adding an index to the table that helps the delete statement's WHERE clause.
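For example, if the purge filters on a date column, an index like the hypothetical one below lets the DELETE seek to the old rows instead of scanning the whole 14-million-row table:

```sql
-- Hypothetical supporting index for a date-based purge
CREATE INDEX IX_tblGPSVehicleInfoLog_UpdateDate
    ON dbo.tblGPSVehicleInfoLog (UpdateDate);

-- The delete can then seek on the index to find rows older than a month
DELETE FROM dbo.tblGPSVehicleInfoLog
WHERE UpdateDate < DATEADD(month, -1, GETDATE());
```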

Also, as others mentioned, the deletes should be done in multiple chunks instead of one huge statement. This prevents the table from being locked too long, and other processes from timing out waiting for the delete to finish.


Solution:13

Performance is pretty fast when dropping a table, even a very large one. So here is what I would do. Script out your table, complete with indexes, from Management Studio. Edit the script and run it to create a copy of your table; call it table2. Do a SELECT INSERT to park the data you want to retain in the new table2. Rename the old table to, say, tableOld, then rename table2 to the original name. Wait. If no one screams at you, drop tableOld.

There is some risk:

  • Check if there are triggers or constraints defined on the original table. They may not get included in the script generated by Management Studio.

  • If the original table has identity fields, you may have to turn on IDENTITY_INSERT before inserting into the new table.
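The rename-and-swap steps described above might be sketched as follows (object names are placeholders; SELECT ... INTO is used here for brevity, whereas scripting the table out first, as described above, also preserves its indexes):

```sql
-- 1. Create the copy and park the rows to keep
SELECT * INTO dbo.table2
FROM dbo.tblGPSVehicleInfoLog
WHERE UpdateDate >= DATEADD(month, -1, GETDATE());

-- 2. Swap the tables by renaming
EXEC sp_rename 'dbo.tblGPSVehicleInfoLog', 'tableOld';
EXEC sp_rename 'dbo.table2', 'tblGPSVehicleInfoLog';

-- 3. After verifying nothing is broken, drop the old table
-- DROP TABLE dbo.tableOld;
```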


Solution:14

I came up with the following T-SQL script which gets an arbitrary amount of recent data.

IF EXISTS (SELECT name FROM sys.tables WHERE name = 'tmp_xxx_tblGPSVehicleInfoLog')
BEGIN
    PRINT 'Dropping temp table tmp_xxx_tblGPSVehicleInfoLog'
    DROP TABLE tmp_xxx_tblGPSVehicleInfoLog
END
GO

PRINT 'Creating temp table tmp_xxx_tblGPSVehicleInfoLog'
CREATE TABLE [dbo].[tmp_xxx_tblGPSVehicleInfoLog](
    [GPSVehicleInfoLogId] [uniqueidentifier] NOT NULL,
    [GPSVehicleInfoId] [uniqueidentifier] NULL,
    [Longitude] [float] NULL,
    [Latitude] [float] NULL,
    [GroundSpeed] [float] NULL,
    [Altitude] [float] NULL,
    [Heading] [float] NULL,
    [GPSDeviceTimeStamp] [datetime] NULL,
    [Milliseconds] [float] NULL,
    [DistanceNext] [float] NULL,
    [UpdateDate] [datetime] NULL,
    [Stopped] [nvarchar](1) NULL,
    [StopTime] [datetime] NULL,
    [StartTime] [datetime] NULL,
    [TimeStopped] [nvarchar](100) NULL
) ON [PRIMARY]
GO

PRINT 'Inserting data from tblGPSVehicleInfoLog to tmp_xxx_tblGPSVehicleInfoLog'
-- INSERT INTO, not SELECT INTO: the target table was created explicitly above
INSERT INTO tmp_xxx_tblGPSVehicleInfoLog
SELECT * FROM tblGPSVehicleInfoLog
WHERE tblGPSVehicleInfoLog.UpdateDate BETWEEN '03/30/2009 23:59:59' AND '05/19/2009 00:00:00'
GO

PRINT 'Truncating table tblGPSVehicleInfoLog'
TRUNCATE TABLE tblGPSVehicleInfoLog
GO

PRINT 'Inserting data from tmp_xxx_tblGPSVehicleInfoLog to tblGPSVehicleInfoLog'
INSERT INTO tblGPSVehicleInfoLog
SELECT * FROM tmp_xxx_tblGPSVehicleInfoLog
GO


Solution:15

To keep the transaction log from growing out of control, modify it in the following way:

DECLARE @i INT
SET @i = 1
SET ROWCOUNT 1000

WHILE @i > 0
BEGIN
    BEGIN TRAN
        DELETE FROM dbo.SuperBigTable
        WHERE RowDate < '2009-01-01'
    COMMIT
    SELECT @i = @@ROWCOUNT
END

SET ROWCOUNT 0

And here is a version using the preferred TOP syntax for SQL 2005 and 2008:

DECLARE @i INT
SET @i = 1

WHILE @i > 0
BEGIN
    BEGIN TRAN
        DELETE TOP (1000) FROM dbo.SuperBigTable
        WHERE RowDate < '2009-01-01'
    COMMIT
    SELECT @i = @@ROWCOUNT
END
