Tutorial: Python - efficient method to remove all non-letters and replace them with underscores



Question:

def format_title(title):
    return ''.join(map(lambda x: x if (x.isupper() or x.islower()) else '_', title.strip()))

Anything faster?


Solution:1

The faster way to do it is to use str.translate(). With a Python 2 byte string and a 256-character translation table, this is ~50 times faster than your way:

# You only need to do this once
>>> title_trans=''.join(chr(c) if chr(c).isupper() or chr(c).islower() else '_' for c in range(256))

>>> "abcde1234!@%^".translate(title_trans)
'abcde________'

# Using map+lambda
$ python -m timeit '"".join(map(lambda x: x if (x.isupper() or x.islower()) else "_", "abcd1234!@#$".strip()))'
10000 loops, best of 3: 21.9 usec per loop

# Using str.translate
$ python -m timeit -s 'titletrans="".join(chr(c) if chr(c).isupper() or chr(c).islower() else "_" for c in range(256))' '"abcd1234!@#$".translate(titletrans)'
1000000 loops, best of 3: 0.422 usec per loop

# Here is regex for a comparison
$ python -m timeit -s 'import re;transre=re.compile("[\W\d]+")' 'transre.sub("_","abcd1234!@#$")'
100000 loops, best of 3: 3.17 usec per loop
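The session above is Python 2, where str.translate() accepts a 256-character string as its table. Under Python 3 the table is instead a mapping from code points to replacements. A minimal sketch of the same idea, assuming Python 3 (the names TITLE_TRANS and format_title_translate are illustrative and do not come from the answer):

```python
# Build the table once: every code point below 256 that is not a cased
# letter maps to "_". Code points missing from the table (including all
# characters >= 256) are left unchanged by str.translate().
TITLE_TRANS = {
    c: "_"
    for c in range(256)
    if not (chr(c).isupper() or chr(c).islower())
}


def format_title_translate(title):
    """Replace non-letters in the 0-255 range with underscores."""
    return title.translate(TITLE_TRANS)


print(format_title_translate("abcde1234!@%^"))  # abcde________
```

Because characters outside the table pass through untouched, this sketch only suits input known to stay within the first 256 code points.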

Here is a version for unicode strings (Python 2):

# coding: UTF-8

def format_title_unicode_translate(title):
    return title.translate(title_unicode_trans)

class TitleUnicodeTranslate(dict):
    def __missing__(self,item):
        uni = unichr(item)
        res = u"_"
        if uni.isupper() or uni.islower():
            res = uni
        self[item] = res
        return res
title_unicode_trans=TitleUnicodeTranslate()

print format_title_unicode_translate(u"Metallica Μεταλλικα")

Note that the Greek letters count as upper and lower, so they are not substituted. If they are to be substituted, simply change the condition to

        if item<256 and (uni.isupper() or uni.islower()):  
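For readers on Python 3, the lazily filled table can be sketched roughly as follows. This assumes Python 3, where unichr() is simply chr() and every str is Unicode. The `limit` parameter is a hypothetical addition used here to express the `item<256` variant; it is not part of the original answer.

```python
class TitleTranslate(dict):
    """Translate table filled on demand: cased letters are kept, the rest become "_"."""

    def __init__(self, limit=None):
        super().__init__()
        # Hypothetical knob: when set, only code points below `limit` may be kept.
        self.limit = limit

    def __missing__(self, code_point):
        char = chr(code_point)
        keep = char.isupper() or char.islower()
        if self.limit is not None and code_point >= self.limit:
            keep = False
        result = char if keep else "_"
        self[code_point] = result  # cache, so each character is classified once
        return result


def format_title_unicode(title, table):
    return title.translate(table)


print(format_title_unicode("Metallica Μεταλλικα", TitleTranslate()))
print(format_title_unicode("Metallica Μεταλλικα", TitleTranslate(limit=256)))
```

In real use the table would be created once at module level and reused, so that the cache built by `__missing__` actually pays off.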


Solution:2

import re
title = re.sub("[\W\d]", "_", title.strip())

should be faster.

If you want to replace a succession of adjacent non-letters with a single underscore, use

title = re.sub("[\W\d]+", "_", title.strip())  

instead, which is even faster (but note that the result then differs from the original function's output).

I just ran a time comparison:

C:\>python -m timeit -n 100 -s "data=open('test.txt').read().strip()" "''.join(map(lambda x: x if (x.isupper() or x.islower()) else '_', data))"
100 loops, best of 3: 4.51 msec per loop

C:\>python -m timeit -n 100 -s "import re; regex=re.compile('[\W\d]+'); data=open('test.txt').read().strip()" "title=regex.sub('_',data)"
100 loops, best of 3: 2.35 msec per loop

This will work on Unicode strings, too. Under Python 3, \W matches any character that is not a Unicode word character; under Python 2, you'd additionally have to set the UNICODE flag for this.
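To make the difference between the two patterns concrete, here is a small sketch assuming Python 3. It uses raw strings so the backslashes reach the regex engine untouched. The function names and the sample title are illustrative. Note that \w includes the underscore, so an existing "_" is never matched and simply stays in place.

```python
import re

NON_LETTER = re.compile(r"[\W\d]")       # one underscore per non-letter
NON_LETTER_RUN = re.compile(r"[\W\d]+")  # one underscore per run of non-letters


def format_title_re(title):
    return NON_LETTER.sub("_", title.strip())


def format_title_re_collapsed(title):
    return NON_LETTER_RUN.sub("_", title.strip())


title = "abc123def_$%^!FOO BAR*bazx-bif"
print(format_title_re(title))            # abc___def_____FOO_BAR_bazx_bif
print(format_title_re_collapsed(title))  # abc_def__FOO_BAR_bazx_bif
```

Only the first variant reproduces the character-by-character behaviour of the function in the question; the collapsed variant is a different transformation that happens to do less work.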


Solution:3

Instead of (x.isupper() or x.islower()) you should be able to use x.isalpha(). Note that '_'.isalpha() returns False, so an underscore is simply replaced with '_' and no harm is done. (Thanks for pointing that out, KennyTM.)
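One caveat worth checking before swapping the test: the two conditions agree for ASCII, but they can differ for letters that have no case. A short sketch assuming Python 3 (function names are illustrative):

```python
def format_title_isalpha(title):
    # Keeps anything str.isalpha() calls a letter, cased or not.
    return "".join(c if c.isalpha() else "_" for c in title.strip())


def format_title_cased(title):
    # Keeps only characters that are upper- or lowercase.
    return "".join(c if (c.isupper() or c.islower()) else "_" for c in title.strip())


print(format_title_isalpha("a_b 1"))     # a_b__
print(format_title_isalpha("中文 title"))  # 中文_title
print(format_title_cased("中文 title"))    # ___title
```

For plain ASCII titles the two functions return the same result, so the choice only matters when caseless scripts can appear in the input.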


Solution:4

Curious about this for my own reasons, I wrote a quick script to test the different approaches listed here, along with simply removing the lambda, which I expected (incorrectly) would speed up the original solution.

The short version is that the str.translate approach blows the other ones away. As an aside, the regex solution, while a close second, is incorrect as written above: the "+" collapses each run of non-letters into a single underscore, so its output differs from the original's.

Here is my test program:

import re
from time import time


def format_title(title):
    return ''.join(map(lambda x: x if (x.isupper() or x.islower()) else "_",
                       title.strip()))


def format_title_list_comp(title):
    return ''.join([x if x.isupper() or x.islower() else "_" for x in
                    title.strip()])


def format_title_list_comp_is_alpha(title):
    return ''.join([x if x.isalpha() else "_" for x in title.strip()])


def format_title_is_alpha(title):
    return ''.join(map(lambda x: x if x.isalpha() else '_', title.strip()))


def format_title_no_lambda(title):

    def trans(c):
        if c.isupper() or c.islower():
            return c
        return "_"

    return ''.join(map(trans, title.strip()))


def format_title_no_lambda_is_alpha(title):

    def trans(c):
        if c.isalpha():
            return c
        return "_"

    return ''.join(map(trans, title.strip()))


def format_title_re(title):
    return re.sub("[\W\d]+", "_", title.strip())


def format_title_re_corrected(title):
    return re.sub("[\W\d]", "_", title.strip())


TITLE_TRANS = ''.join(chr(c) if chr(c).isalpha() else '_' for c in range(256))


def format_title_with_translate(title):
    return title.translate(TITLE_TRANS)


ITERATIONS = 200000
EXAMPLE_TITLE = "abc123def_$%^!FOO BAR*bazx-bif"


def timetest(f):
    start = time()
    for i in xrange(ITERATIONS):
        result = f(EXAMPLE_TITLE)
    diff = time() - start
    return result, diff


baseline_result, baseline_time = timetest(format_title)


def print_result(f, result, time):
    if result == baseline_result:
        msg = "CORRECT"
    else:
        msg = "INCORRECT"
    diff = time - baseline_time
    if diff < 0:
        indicator = ""
    else:
        indicator = "+"
    pct = (diff / baseline_time) * 100
    print "%s: %0.3fs %s%0.3fs [%s%0.4f%%] (%s - %s)" % (
        f.__name__, time, indicator, diff, indicator, pct, result, msg)


print_result(format_title, baseline_result, baseline_time)

print "----"

for f in [format_title_is_alpha,
          format_title_list_comp,
          format_title_list_comp_is_alpha,
          format_title_no_lambda,
          format_title_no_lambda_is_alpha,
          format_title_re,
          format_title_re_corrected,
          format_title_with_translate]:
    alt_result, alt_time = timetest(f)
    print_result(f, alt_result, alt_time)

And here are the results:

format_title: 3.121s +0.000s [+0.0000%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
----
format_title_is_alpha: 2.336s -0.785s [-25.1470%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_list_comp: 2.369s -0.751s [-24.0773%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_list_comp_is_alpha: 1.735s -1.386s [-44.4021%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_no_lambda: 2.992s -0.129s [-4.1336%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_no_lambda_is_alpha: 2.377s -0.744s [-23.8314%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_re: 1.290s -1.831s [-58.6628%] (abc_def__FOO_BAR_bazx_bif - INCORRECT)
format_title_re_corrected: 1.338s -1.782s [-57.1165%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_with_translate: 0.098s -3.022s [-96.8447%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)

EDITED: I added variations showing that list comprehensions significantly improve on the original implementation, as well as a corrected regex implementation showing that it is still nearly as fast when correct. Of course, str.translate still wins hands down.
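The test program above is Python 2 (xrange, print statement, string translate table). A rough Python 3 counterpart could lean on the standard timeit module instead of a hand-rolled loop. The sketch below covers only three of the approaches, assumes input within the first 256 code points for the translate table, and uses illustrative names throughout. It reports whatever timings your machine produces and claims no particular numbers.

```python
import re
import timeit

EXAMPLE_TITLE = "abc123def_$%^!FOO BAR*bazx-bif"

# Table for code points 0-255 only (assumption: Latin-1 range input);
# characters outside the table pass through unchanged.
TITLE_TRANS = {c: "_" for c in range(256) if not chr(c).isalpha()}
NON_LETTER = re.compile(r"[\W\d]")


def with_join(title):
    return "".join(c if c.isalpha() else "_" for c in title.strip())


def with_regex(title):
    return NON_LETTER.sub("_", title.strip())


def with_translate(title):
    return title.strip().translate(TITLE_TRANS)


def compare(functions, title, number=10000):
    """Return {name: (result, seconds)} for each candidate function."""
    report = {}
    for func in functions:
        seconds = timeit.timeit(lambda: func(title), number=number)
        report[func.__name__] = (func(title), seconds)
    return report


if __name__ == "__main__":
    candidates = [with_join, with_regex, with_translate]
    for name, (result, seconds) in compare(candidates, EXAMPLE_TITLE).items():
        print(f"{name}: {seconds:.4f}s ({result})")
```

Returning the result alongside the timing keeps the correctness check from the original script: any candidate whose output differs from the others is doing a different job, however fast it runs.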


Solution:5

import string,sys
letters=string.letters
mystring = list("abc134#$@e##$%%$*&(()#def")
for n,c in enumerate(mystring):
  if not c in letters:
    mystring[n]="_"
print ''.join(mystring)
