Tutorial: Most Efficient Way to Find Whether a Large List Contains a Specific String (Python)



Question:

I have a file containing roughly all the words in English (~60k words, ~500k characters). I want to test whether a certain word I receive as input is "in English" (i.e. if this exact word is in the list).

What would be the most efficient way to do this in Python?

The trivial solution is to load the file into a list and check whether the word is in that list. The list can be sorted, which I believe will shrink the complexity to O(log n). However, I'm not sure how Python implements searching through lists, and whether there's a performance penalty if such a large list is kept in memory. Can I "abuse" the fact that I can put a cap on the length of words? (e.g. say the longest one is 15 characters long).

Please note that I run the application on a machine with lots of memory, so I care less about memory consumption than about speed and CPU utilization.

Thanks


Solution:1

The Python set type is what you should try.

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.
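A minimal sketch of what that looks like for this problem (assuming the word list is in a file called words.txt, one word per line):

# Build the set once; each membership test is then O(1) on average.
with open('words.txt') as f:
    english = set(line.strip() for line in f)

word = input('word to check: ')
print(word in english)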


Solution:2

Sample Python code:

L = ['foo', 'bar', 'baz']   # your list
s = set(L)                  # converted to a set

print('foo' in s)    # True
print('blah' in s)   # False


Solution:3

A Trie structure would suit your purposes. There are undoubtedly Python implementations to be found out there...
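For illustration, here is a bare-bones trie sketch (a hand-rolled version; existing packages such as marisa-trie or datrie would be faster in practice, and words.txt is an assumed filename):

class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

trie = Trie()
with open('words.txt') as f:
    for line in f:
        trie.add(line.strip())

print('hello' in trie)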


Solution:4

Two things:

The Python 'mutable set' type has an 'add' method (s.add(item)), so you could go straight from reading each line of your big file into a set, without using a list as an intermediate data structure.

Python lets you 'pickle' a data structure, so you could save your big set to a file and avoid the cost of rebuilding it each time.
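A sketch of both ideas together (the file names words.txt and words.pickle are just placeholders):

import pickle

# Build the set directly from the file, one line at a time, no intermediate list.
words = set()
with open('words.txt') as f:
    for line in f:
        words.add(line.strip())

# Save the set so later runs can skip rebuilding it from the text file.
with open('words.pickle', 'wb') as out:
    pickle.dump(words, out)

# In a later run, load it back and test membership:
with open('words.pickle', 'rb') as inp:
    words = pickle.load(inp)
print('hello' in words)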

Second, I've been looking for a list of all the single-syllable words in English for my own amusement, but the ones I've found mentioned seem to be proprietary. If it isn't being intrusive, could I ask whether your list of English words can be obtained by others?


Solution:5

Others have given you the in-memory way using set(), and this is generally going to be the fastest approach; it should not tax your memory for a 60k-word dataset (a few MiB at most). You should be able to construct the set with:

f = open('words.txt')
s = set(word.strip() for word in f)

However, it does take some time to load the set into memory. If you are checking lots of words, this is no problem - the fast lookups will more than make up for it. However, if you're only going to check one word per command execution (e.g. this is a command-line app like "checkenglish [word]"), the startup time will be longer than it would have taken just to search through the file line by line.

If this is your situation, or you have a much bigger dataset, using an on-disk format may be better. The simplest way would be using the dbm module. Create such a database from a wordlist with:

import dbm

f = open('wordlist.txt')
db = dbm.open('words.db', 'c')
for word in f:
    db[word.strip()] = '1'   # store each word as a key; the value doesn't matter
f.close()
db.close()

Then your program can check membership with:

db = dbm.open('words.db', 'r')
if word in db:
    print("%s is english" % word)
else:
    print("%s is not english" % word)

This will be slower than a set lookup, since there will be disk access, but it will be faster than scanning the whole file, has low memory use, and has no significant initialisation time.

There are also other alternatives, such as using a SQL database (e.g. SQLite).


Solution:6

You're basically testing whether a member is in a set or not, right?

If so, and because you said you have lots of memory, why not just load all the words as keys in memcache, and then for every word, just check if it is present in memcache or not.

Or use the data structure that bash uses to autocomplete command names - it is fast and highly efficient in memory (I can't remember the name).
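If you go the memcache route, a rough sketch (this assumes a memcached server running on localhost and the third-party pymemcache package, neither of which is part of the original answer):

from pymemcache.client.base import Client

# Requires a local memcached server and: pip install pymemcache
client = Client(('localhost', 11211))

# Load every word as a key; the stored value itself doesn't matter.
with open('words.txt') as f:
    for line in f:
        client.set(line.strip(), '1')

# Lookup: get() returns None for a missing key.
word = 'hello'
print(client.get(word) is not None)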


Solution:7

If memory consumption isn't an issue and the words won't change, the fastest way to do this is to put everything in a hash table and search that way. In Python, that is the built-in set. You'll have constant-time lookups.


Solution:8

500k characters is not a large list. If the items in your list are unique and you need to do this search repeatedly, use a set, which lowers the lookup complexity to O(1) on average.


Solution:9

Converting the list to a set will only be helpful if you repeatedly run this kind of query against the data, as will sorting the list and doing a binary search. If you're only going to pull data out of the list once, a plain old linear search is your best bet:

if 'foo' in some_list:
    do_something()

Otherwise, your best bet is to use either a set as has been mentioned or a binary search. Which one you should choose depends largely on how big the data is and how much memory you can spare. I'm told that really large lists tend to benefit more from hashing, although the amount of memory that's taken up can be prohibitively expensive.
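For the binary-search option, here is a sketch using the standard-library bisect module (words.txt is an assumed filename):

import bisect

# Sorted list + binary search: O(log n) lookups, with less memory overhead
# than a hash-based set.
with open('words.txt') as f:
    words = sorted(line.strip() for line in f)

def is_english(word):
    i = bisect.bisect_left(words, word)
    return i < len(words) and words[i] == word

print(is_english('hello'))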

Finally, a third option is to import the data into a SQLite database and read directly from it. SQLite is very fast, and it may save you the trouble of loading the whole list from the file. Python has a very good built-in sqlite3 library.
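A sketch of that approach using the built-in sqlite3 module (words.db and words.txt are placeholder names):

import sqlite3

# One-time import of the word list into a SQLite database.
conn = sqlite3.connect('words.db')
conn.execute('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)')
with open('words.txt') as f:
    conn.executemany('INSERT OR IGNORE INTO words VALUES (?)',
                     ((line.strip(),) for line in f))
conn.commit()

# Later lookups use the primary-key index; no need to load the whole list.
def is_english(word):
    cur = conn.execute('SELECT 1 FROM words WHERE word = ?', (word,))
    return cur.fetchone() is not None

print(is_english('hello'))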

