Tutorial: Python - efficient method to remove all non-letters and replace them with underscores


def format_title(title):
    return ''.join(map(lambda x: x if (x.isupper() or x.islower()) else '_', title.strip()))

Anything faster?


The faster way to do it is to use str.translate(). This is ~50 times faster than your way.

# You only need to do this once
>>> title_trans=''.join(chr(c) if chr(c).isupper() or chr(c).islower() else '_' for c in range(256))

>>> "abcde1234!@%^".translate(title_trans)
'abcde________'

# Using map+lambda
$ python -m timeit '"".join(map(lambda x: x if (x.isupper() or x.islower()) else "_", "abcd1234!@#$".strip()))'
10000 loops, best of 3: 21.9 usec per loop

# Using str.translate
$ python -m timeit -s 'titletrans="".join(chr(c) if chr(c).isupper() or chr(c).islower() else "_" for c in range(256))' '"abcd1234!@#$".translate(titletrans)'
1000000 loops, best of 3: 0.422 usec per loop

# Here is regex for a comparison
$ python -m timeit -s 'import re;transre=re.compile("[\W\d]+")' 'transre.sub("_","abcd1234!@#$")'
100000 loops, best of 3: 3.17 usec per loop

Here is a version for Unicode:

# coding: UTF-8

def format_title_unicode_translate(title):
    return title.translate(title_unicode_trans)

class TitleUnicodeTranslate(dict):
    def __missing__(self,item):
        uni = unichr(item)
        res = u"_"
        if uni.isupper() or uni.islower():
            res = uni
        self[item] = res
        return res
title_unicode_trans=TitleUnicodeTranslate()

print format_title_unicode_translate(u"Metallica Μεταλλικα")

Note that the Greek letters count as upper and lower, so they are not substituted. If they are to be substituted, simply change the condition to

        if item<256 and (uni.isupper() or uni.islower()):  
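For anyone on Python 3, where unichr() is gone and every str is Unicode, the same lazy-table idea might look roughly like this. It is only a sketch: the names TitleTranslate, TITLE_TABLE and format_title_py3 are made up for illustration, and it has not been benchmarked against the timings above.

```python
class TitleTranslate(dict):
    """Lazily built table for str.translate: cased letters map to
    themselves, every other code point maps to '_'."""

    def __missing__(self, codepoint):
        ch = chr(codepoint)
        result = ch if (ch.isupper() or ch.islower()) else "_"
        self[codepoint] = result  # cache so each code point is computed once
        return result


# Hypothetical module-level table, shared between calls.
TITLE_TABLE = TitleTranslate()


def format_title_py3(title):
    # str.translate looks up each character's code point in the table.
    return title.translate(TITLE_TABLE)


if __name__ == "__main__":
    print(format_title_py3("abcde1234!@%^"))
    print(format_title_py3("Metallica Μεταλλικα"))
```

As in the Python 2 version, the Greek letters are kept because they count as upper or lower case; adding a codepoint < 256 check inside __missing__ would restrict the table the same way.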


import re
title = re.sub("[\W\d]", "_", title.strip())

should be faster.

If you want to replace a succession of adjacent non-letters with a single underscore, use

title = re.sub("[\W\d]+", "_", title.strip())  

instead, which is even faster.

I just ran a time comparison:

C:\>python -m timeit -n 100 -s "data=open('test.txt').read().strip()" "''.join(map(lambda x: x if (x.isupper() or x.islower()) else '_', data))"
100 loops, best of 3: 4.51 msec per loop

C:\>python -m timeit -n 100 -s "import re; regex=re.compile('[\W\d]+'); data=open('test.txt').read().strip()" "title=regex.sub('_',data)"
100 loops, best of 3: 2.35 msec per loop

This will work on Unicode strings, too. Under Python 3, \W matches any character that is not a Unicode word character; under Python 2, you'd additionally have to set the UNICODE flag for this.
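To make the difference between the two patterns concrete, here is a small Python 3 sketch. The function and constant names are hypothetical, and the sample strings are borrowed from the other answers in this thread.

```python
import re

# Compile once; raw strings keep the backslashes intact.
EACH_NON_LETTER = re.compile(r"[\W\d]")       # one underscore per character
RUNS_OF_NON_LETTERS = re.compile(r"[\W\d]+")  # one underscore per run


def replace_each(title):
    # Keeps the length of the stripped title unchanged.
    return EACH_NON_LETTER.sub("_", title.strip())


def collapse_runs(title):
    # Shortens the title wherever several non-letters are adjacent.
    # A literal '_' is a word character, so it is not matched and
    # is left in place next to any substituted underscore.
    return RUNS_OF_NON_LETTERS.sub("_", title.strip())


if __name__ == "__main__":
    sample = "abc123def_$%^!FOO BAR*bazx-bif"
    print(replace_each(sample))
    print(collapse_runs(sample))
    print(replace_each("Metallica Μεταλλικα"))
```

Only the first variant produces the same output as the original map/lambda version; the second is a different transformation, so pick whichever one the titles actually need.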


Instead of (x.isupper() or x.islower()) you should be able to use x.isalpha(). The isalpha() method returns False for '_', so an underscore is simply replaced with another '_' and no harm is done. (Thanks for pointing that out, KennyTM.)
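A quick way to check how far the two tests agree is to compare them directly. This is a Python 3 sketch with a made-up helper name, is_cased_letter; it shows that they agree on ASCII but can differ for letters that have no case.

```python
def is_cased_letter(ch):
    """The original test: the character is upper or lower case."""
    return ch.isupper() or ch.islower()


# For every ASCII character the two tests give the same answer.
ascii_chars = [chr(c) for c in range(128)]
ascii_agree = all(ch.isalpha() == is_cased_letter(ch) for ch in ascii_chars)

# The underscore is not alphabetic.
underscore_is_alpha = "_".isalpha()

# A caseless letter (a CJK ideograph) is alphabetic but neither upper
# nor lower case, so the two tests disagree on it.
caseless = "中"
caseless_differs = caseless.isalpha() and not is_cased_letter(caseless)

if __name__ == "__main__":
    print(ascii_agree, underscore_is_alpha, caseless_differs)
```

So for plain ASCII titles the swap is safe; for Unicode titles it depends on whether caseless letters should be kept.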


Curious about this for my own reasons, I wrote a quick script to test the different approaches listed here, along with just removing the lambda, which I expected (incorrectly) would speed up the original solution.

The short version is that the str.translate approach blows the other ones away. As an aside, the regex solution, while a close second, is incorrect as written above: the "+" collapses each run of non-letters into a single underscore, so its output differs from the original's.

Here is my test program:

import re
from time import time


def format_title(title):
    return ''.join(map(lambda x: x if (x.isupper() or x.islower()) else "_",
                       title.strip()))


def format_title_list_comp(title):
    return ''.join([x if x.isupper() or x.islower() else "_" for x in
                    title.strip()])


def format_title_list_comp_is_alpha(title):
    return ''.join([x if x.isalpha() else "_" for x in title.strip()])


def format_title_is_alpha(title):
    return ''.join(map(lambda x: x if x.isalpha() else '_', title.strip()))


def format_title_no_lambda(title):

    def trans(c):
        if c.isupper() or c.islower():
            return c
        return "_"

    return ''.join(map(trans, title.strip()))


def format_title_no_lambda_is_alpha(title):

    def trans(c):
        if c.isalpha():
            return c
        return "_"

    return ''.join(map(trans, title.strip()))


def format_title_re(title):
    return re.sub("[\W\d]+", "_", title.strip())


def format_title_re_corrected(title):
    return re.sub("[\W\d]", "_", title.strip())


TITLE_TRANS = ''.join(chr(c) if chr(c).isalpha() else '_' for c in range(256))


def format_title_with_translate(title):
    return title.translate(TITLE_TRANS)


ITERATIONS = 200000
EXAMPLE_TITLE = "abc123def_$%^!FOO BAR*bazx-bif"


def timetest(f):
    start = time()
    for i in xrange(ITERATIONS):
        result = f(EXAMPLE_TITLE)
    diff = time() - start
    return result, diff


baseline_result, baseline_time = timetest(format_title)


def print_result(f, result, time):
    if result == baseline_result:
        msg = "CORRECT"
    else:
        msg = "INCORRECT"
    diff = time - baseline_time
    if diff < 0:
        indicator = ""
    else:
        indicator = "+"
    pct = (diff / baseline_time) * 100
    print "%s: %0.3fs %s%0.3fs [%s%0.4f%%] (%s - %s)" % (
        f.__name__, time, indicator, diff, indicator, pct, result, msg)


print_result(format_title, baseline_result, baseline_time)

print "----"

for f in [format_title_is_alpha,
          format_title_list_comp,
          format_title_list_comp_is_alpha,
          format_title_no_lambda,
          format_title_no_lambda_is_alpha,
          format_title_re,
          format_title_re_corrected,
          format_title_with_translate]:
    alt_result, alt_time = timetest(f)
    print_result(f, alt_result, alt_time)

And here are the results:

format_title: 3.121s +0.000s [+0.0000%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
----
format_title_is_alpha: 2.336s -0.785s [-25.1470%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_list_comp: 2.369s -0.751s [-24.0773%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_list_comp_is_alpha: 1.735s -1.386s [-44.4021%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_no_lambda: 2.992s -0.129s [-4.1336%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_no_lambda_is_alpha: 2.377s -0.744s [-23.8314%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_re: 1.290s -1.831s [-58.6628%] (abc_def__FOO_BAR_bazx_bif - INCORRECT)
format_title_re_corrected: 1.338s -1.782s [-57.1165%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
format_title_with_translate: 0.098s -3.022s [-96.8447%] (abc___def_____FOO_BAR_bazx_bif - CORRECT)
  • EDITED: I added a variation that shows list comprehensions significantly improve the original implementation as well as a correct regex implementation that shows it's still nearly as fast when correct. Of course str.translate still wins hands down.
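If you want to repeat the comparison on Python 3, a cut-down harness along these lines should do. It is a sketch rather than a port of the full script: the names (LetterTable, run_comparison and the with_* functions) are invented here, it uses timeit instead of a hand-rolled loop, and no timings are claimed for it.

```python
import re
import timeit

EXAMPLE_TITLE = "abc123def_$%^!FOO BAR*bazx-bif"


class LetterTable(dict):
    """Hypothetical lazy table for str.translate: letters stay, the rest become '_'."""

    def __missing__(self, codepoint):
        ch = chr(codepoint)
        self[codepoint] = ch if ch.isalpha() else "_"
        return self[codepoint]


LETTER_TABLE = LetterTable()
NON_LETTER = re.compile(r"[\W\d]")


def with_map_lambda(title):
    return "".join(map(lambda x: x if (x.isupper() or x.islower()) else "_",
                       title.strip()))


def with_list_comp_isalpha(title):
    return "".join([x if x.isalpha() else "_" for x in title.strip()])


def with_regex(title):
    return NON_LETTER.sub("_", title.strip())


def with_translate(title):
    return title.strip().translate(LETTER_TABLE)


CANDIDATES = [with_map_lambda, with_list_comp_isalpha, with_regex, with_translate]


def run_comparison(number=20000):
    """Return (name, seconds, matches_baseline) for each candidate."""
    expected = with_map_lambda(EXAMPLE_TITLE)
    rows = []
    for func in CANDIDATES:
        seconds = timeit.timeit(lambda: func(EXAMPLE_TITLE), number=number)
        rows.append((func.__name__, seconds, func(EXAMPLE_TITLE) == expected))
    return rows


if __name__ == "__main__":
    for name, seconds, ok in run_comparison():
        print("%-24s %.3fs %s" % (name, seconds, "CORRECT" if ok else "INCORRECT"))
```

Checking each result against the baseline output, as the original script does, is what caught the "+" regex producing a different string, so it is worth keeping in any rerun.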


import string
letters=string.letters
mystring = list("abc134#$@e##$%%$*&(()#def")
for n,c in enumerate(mystring):
  if c not in letters:
    mystring[n]="_"
print ''.join(mystring)
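Note that string.letters is locale-dependent in Python 2 and no longer exists in Python 3. A rough Python 3 equivalent, assuming only ASCII letters should survive, could use string.ascii_letters with a set for fast membership tests. The function name replace_non_ascii_letters is hypothetical.

```python
import string

# A frozenset makes each membership test O(1) instead of scanning a string.
ASCII_LETTERS = frozenset(string.ascii_letters)


def replace_non_ascii_letters(text):
    # Build the result in one pass rather than mutating a list in place.
    return "".join(c if c in ASCII_LETTERS else "_" for c in text)


if __name__ == "__main__":
    print(replace_non_ascii_letters("abc134#$@e##$%%$*&(()#def"))
```

Unlike the isupper()/islower() versions, this one also replaces non-ASCII letters such as Greek characters.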
