Tutorial: Read multilanguage strings from HTML via Python 2.7



Question:

I am new to Python 2.7 and I am trying to extract some information from HTML files. More specifically, I want to read text that contains several languages. Here is my script, which I hope makes things clearer.

    import urllib2
    import BeautifulSoup

    url = 'http://www.bbc.co.uk/zhongwen/simp/'

    page = urllib2.urlopen(url).read().decode("utf-8")
    dom = BeautifulSoup.BeautifulSoup(page)
    data = dom.findAll('meta', {'name' : 'keywords'})

    print data[0]['content'].encode("utf-8")

The result I get is:

BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Νbbcchinese.com, email news, newsletter, subscription, full text  

The problem is in the first string. Is there any way to print exactly what I am reading? Also, is there any way to find out the exact encoding used for each language?

PS: I would like to mention that the site was selected at random, as it is representative of the problem I am encountering.

Thank you in advance!


Solution:1

The problem is with the terminal where you are printing the result. The script works fine, and if you write the data to a file you will get the correct text.

Example:

    import urllib2
    from bs4 import BeautifulSoup

    url = 'http://www.bbc.co.uk/zhongwen/simp/'

    page = urllib2.urlopen(url).read().decode("utf-8")
    dom = BeautifulSoup(page)
    data = dom.findAll('meta', {'name' : 'keywords'})

    with open("test.txt", "w") as myfile:
        myfile.write(data[0]['content'].encode("utf-8"))

test.txt:

BBC中文网，主页，bbcchinese.com, email news, newsletter, subscription, full text

Which OS and terminal are you using?
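
One way to answer that question from inside Python is to look at `sys.stdout.encoding`, which reports the encoding Python believes the terminal uses. The sketch below is only an illustration: the helper name `printable_for` is hypothetical, and falling back to ASCII when no encoding is reported is an assumption made here for safety.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import sys


def printable_for(text, encoding):
    """Return text with characters the encoding cannot represent replaced by '?'.

    Hypothetical helper for illustration only.
    """
    # encoding can be None (for example when output is piped);
    # falling back to ASCII is an assumption, not a rule.
    encoding = encoding or "ascii"
    return text.encode(encoding, "replace").decode(encoding)


# What Python thinks the terminal can display.
terminal_encoding = getattr(sys.stdout, "encoding", None)
print("terminal encoding:", terminal_encoding)

# Print the decoded text itself rather than its UTF-8 bytes, after
# replacing anything the terminal encoding cannot represent.
sample = "BBC中文网"
print(printable_for(sample, terminal_encoding))
```

On a UTF-8 terminal the sample should come out unchanged. On a terminal whose code page has no Chinese characters, it degrades to question marks instead of the misleading symbols shown in the question.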

