Tutorial: Beautiful Soup and uTidy


I want to pass the results of uTidy to Beautiful Soup, like so:

page = urllib2.urlopen(url)
options = dict(output_xhtml=1, add_xml_decl=0, indent=1, tidy_mark=0)
cleaned_html = tidy.parseString(page.read(), **options)
soup = BeautifulSoup(cleaned_html)

When run, the following error results:

Traceback (most recent call last):
  File "soup.py", line 34, in <module>
    soup = BeautifulSoup(cleaned_html)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1245, in _feed
    smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1751, in __init__
    self._detectEncoding(markup, isHTML)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1899, in _detectEncoding
    xml_encoding_match = re.compile(xml_encoding_re).match(xml_data)
TypeError: expected string or buffer

I gather utidy returns an XML document while BeautifulSoup wants a string. Is there a way to cast cleaned_html? Or am I doing it wrong and should take a different approach?


Just wrap str() around cleaned_html when passing it to BeautifulSoup.


Convert the value passed to BeautifulSoup into a string. In your case, change the last line to:

soup = BeautifulSoup(str(cleaned_html))  
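For what it's worth, here is a rough sketch of why the str() call helps. It does not use uTidy or BeautifulSoup at all: FakeTidyDocument and detect_xml_declaration are made-up stand-ins, on the assumption that tidy.parseString() hands back a document object that only turns into markup when converted to a string, and that the parser runs a regex over whatever it is given (as the last frame of the traceback suggests).

```python
import re


class FakeTidyDocument(object):
    """Hypothetical stand-in for the object returned by tidy.parseString()."""

    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        # Converting the wrapper to a string yields the cleaned markup.
        return self._markup


def detect_xml_declaration(markup):
    """Hypothetical stand-in for a parser's regex-based encoding sniffing."""
    # The re module raises TypeError for anything that is not a string/buffer.
    return re.compile(r"^\s*<\?xml").match(markup) is not None


cleaned_html = FakeTidyDocument("<html><body><p>Hello</p></body></html>")

# Passing the wrapper object directly fails, much like the traceback above.
try:
    detect_xml_declaration(cleaned_html)
    failed_without_str = False
except TypeError:
    failed_without_str = True

# Passing str(cleaned_html) gives the regex a real string to work on.
has_declaration = detect_xml_declaration(str(cleaned_html))
```

Here the first call raises TypeError because the regex receives an object instead of text, while the second call runs normally. The same reasoning would apply to the real objects if the assumption about uTidy's return value holds.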

