Tutorial :Django with huge mysql database



Question:

What would be the best way to import multi-million record csv files into django.

Currently using python csv module, it takes 2-4 days for it process 1 million record file. It does some checking if the record already exists, and few others.

Can this process be achieved to execute in few hours.

Can memcache be used somehow.

Update: There are django ManyToManyField fields that get processed as well. How will these used with direct load.


Solution:1

I'm not sure about your case, but we had similar scenario with Django where ~30 million records took more than one day to import.

Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python, we took a radical strategy change and did the import(only) with Java and JDBC (+ some mysql tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).


Solution:2

I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.


Solution:3

Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.

As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.

If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).

And most importantly, BACKUP UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.


Solution:4

As mentioned you want to bypass the ORM and go directly to the database. Depending on what type of database you're using you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high speed data loading, and for mysql you can use the LOAD command. I'm sure there's something similar for Postgres as well.

Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into mysql running on a very load end machine in minutes using mysqldump.


Solution:5

Like Craig said, you'd better fill the db directly first. It implies creating django models that just fits the CSV cells (you can then create better models and scripts to move the data)

Then, db feedding : a tool of choice for doing this is Navicat, you can grab a functional 30 days demo on their site. It allows you to import CSV in MySQL, save the importation profile in XML...
Then I would launch the data control scripts from within Django, and when you're done, migrating your model with South to get what you want or , like I said earlier, create another set of models within your project and use scripts to convert/copy the data.


Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Previous
Next Post »