Tutorial: Increasing throughput in a Python script


I'm processing a list of thousands of domain names from a DNSBL through dig, creating a CSV of URLs and IPs. This is a very time-consuming process that can take several hours. My server's DNSBL updates every fifteen minutes. Is there a way I can increase throughput in my Python script to keep pace with the server's updates?

Edit: the script, as requested.

    import re
    import subprocess as sp

    text = open("domainslist", 'r')
    text = text.read()
    text = re.split("\n+", text)

    file = open('final.csv', 'w')

    for url in text:
        try:
            # Run dig once per domain; +short prints only the answer records.
            ip = sp.Popen(["dig", "+short", url], stdout=sp.PIPE, universal_newlines=True)
            # Keep only the first line of dig's output.
            ip = re.split("\n+", ip.stdout.read())
            file.write(url + "," + ip[0] + "\n")
        except Exception:
            pass

    file.close()


Well, it's probably the name resolution that's taking you so long. If you count that out (i.e., if somehow dig returned very quickly), Python should be able to deal with thousands of entries easily.

That said, you should try a threaded approach. That would (theoretically) resolve several addresses at the same time, instead of sequentially. You could keep using dig, and it should be trivial to modify my example code below to do so, but to make things interesting (and hopefully more pythonic), let's use an existing module instead: dnspython.

So, install it with (on a current setup, plain "pip install dnspython" should also work):

sudo pip install -f http://www.dnspython.org/kits/1.8.0/ dnspython  

And then try something like the following:

    import threading
    from dns import resolver

    class Resolver(threading.Thread):
        """Resolve one address in its own thread and store the result."""

        def __init__(self, address, result_dict):
            threading.Thread.__init__(self)
            self.address = address
            self.result_dict = result_dict

        def run(self):
            try:
                # dnspython 2.x deprecates query() in favour of resolve().
                result = resolver.query(self.address)[0].to_text()
                self.result_dict[self.address] = result
            except resolver.NXDOMAIN:
                pass  # skip names that do not exist

    def main():
        infile = open("domainslist", "r")
        intext = infile.readlines()
        infile.close()
        threads = []
        results = {}
        # Start one thread per non-empty line of the input file.
        for address in [address.strip() for address in intext if address.strip()]:
            resolver_thread = Resolver(address, results)
            threads.append(resolver_thread)
            resolver_thread.start()

        # Wait for every lookup to finish before writing the CSV.
        for thread in threads:
            thread.join()

        outfile = open('final.csv', 'w')
        outfile.write("\n".join("%s,%s" % (address, ip) for address, ip in results.items()))
        outfile.close()

    if __name__ == '__main__':
        main()

If that starts too many threads at once, you could resolve the list in batches or use a queue (see http://www.ibm.com/developerworks/aix/library/au-threadingpython/ for an example).
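As a rough illustration of the queue idea, here is a minimal sketch that feeds a fixed number of worker threads from a queue. The names resolve_with_workers, lookup and num_workers are made up for this example, and the default lookup uses the standard library's socket.gethostbyname rather than dnspython, so treat it as a starting point and not a drop-in replacement for the code above.

```python
import queue
import socket
import threading


def resolve_with_workers(domains, lookup=socket.gethostbyname, num_workers=20):
    """Resolve domains with a fixed number of worker threads fed by a queue.

    `lookup` is any callable taking a name and returning an address
    (an assumption of this sketch; swap in a dnspython call if preferred).
    """
    tasks = queue.Queue()
    for domain in domains:
        tasks.put(domain)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                domain = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, let this thread exit
            try:
                ip = lookup(domain)
            except Exception:
                continue  # skip names that fail to resolve
            with lock:
                results[domain] = ip

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The thread count stays at num_workers no matter how long the domain list is, which is the point of using a queue here.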


The vast majority of the time here is spent in the external calls to dig, so to improve that speed, you'll need to multithread. This will allow you to run multiple calls to dig at the same time. See, for example, the question "Python Subprocess.Popen from a thread". Alternatively, you can use Twisted (http://twistedmatrix.com/trac/).
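If you want to stay with dig, something along these lines might work. It is only a sketch: first_answer, dig_short and dig_many are hypothetical helper names, the pool size of 20 is an arbitrary guess, and it assumes dig is installed and on the PATH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def first_answer(dig_output):
    """Return the first non-empty line of `dig +short` output, or ''."""
    for line in dig_output.splitlines():
        if line.strip():
            return line.strip()
    return ""


def dig_short(domain):
    """Run `dig +short <domain>` and return its first answer line."""
    proc = subprocess.run(
        ["dig", "+short", domain],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return first_answer(proc.stdout)


def dig_many(domains, run=dig_short, max_workers=20):
    """Run the lookups concurrently; returns (domain, answer) pairs in input order."""
    domains = list(domains)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(zip(domains, pool.map(run, domains)))
```

Each thread just waits on its own dig process, so several lookups are in flight at once even though every one of them still pays the cost of starting a process.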

EDIT: You're correct, much of that was unnecessary.


I'd consider using a pure-Python library to do the DNS queries, rather than delegating to dig, because invoking another process can be relatively time-consuming. (Of course, looking up anything on the internet is also relatively time-consuming, so what gilesc said about multithreading still applies.) A Google search for "python dns" will give you some options to get started with.
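For comparison, here is a minimal sketch of an in-process lookup. It uses the standard library's socket module instead of a dedicated DNS library, so it goes through the system resolver and does not let you choose a nameserver or record type; lookup_ipv4 is a name invented for this example.

```python
import socket


def lookup_ipv4(domain):
    """Resolve a name without spawning a process; return None on failure."""
    try:
        return socket.gethostbyname(domain)
    except (OSError, UnicodeError):
        # OSError covers socket.gaierror; UnicodeError covers malformed names.
        return None
```

A call such as lookup_ipv4("example.com") avoids the process start-up cost of dig, though the network round trip is still there.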


In order to keep pace with the server updates, one must take less than 15 minutes to execute. Does your script take 15 minutes to run? If it doesn't take 15 minutes, you're done!
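To put a number on that 15-minute budget, a back-of-the-envelope sketch like the following may help. The function name min_workers and its parameters are hypothetical, and the per-lookup time is something you would have to measure yourself.

```python
import math


def min_workers(n_domains, seconds_per_lookup, window_seconds=15 * 60):
    """Smallest number of parallel workers that fits the run into the window.

    Assumes lookups are independent and take roughly the same time each.
    """
    total_lookup_time = n_domains * seconds_per_lookup
    return max(1, math.ceil(total_lookup_time / window_seconds))
```

With made-up figures of 9000 names at 0.5 seconds each, that is 4500 seconds of lookups, so at least 5 lookups would have to run in parallel to fit into 900 seconds.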

I would investigate caching and diffs from previous runs in order to increase performance.
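One possible shape for the caching and diff idea, as a sketch only: reuse the answers from the previous final.csv and resolve just the names that are new in the current list. The helpers load_previous and plan_update are invented for this example, and it assumes the CSV has the "domain,ip" layout from the question.

```python
import csv
import io


def load_previous(csv_text):
    """Parse the text of a previous 'domain,ip' CSV into a dict."""
    rows = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1] for row in rows if len(row) >= 2}


def plan_update(previous, current_domains):
    """Split the current list into cached results and names still to resolve."""
    current = set(current_domains)
    # Keep old answers only for names that are still on the list.
    cached = {d: ip for d, ip in previous.items() if d in current}
    # Anything not seen before needs a fresh lookup.
    to_resolve = sorted(current - set(previous))
    return cached, to_resolve
```

Cached addresses can go stale, so you would probably still want to re-resolve old entries every so often.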
