Tutorial: Need to selectively escape HTML entities (&)



Question:

I'm scraping an HTML page, then using xml.dom.minidom.parseString() to create a DOM object.

However, the HTML page contains a '&'. I can use cgi.escape to convert this into &amp;, but it also converts all my HTML <> tags into &lt;&gt;, which makes parseString() unhappy.

How do I go about this? I would rather not just hack it and straight replace the "&"s.

Thanks


Solution:1

For scraping, try to use a library that can handle such HTML "tag soup", like lxml, which has an HTML parser (as well as a dedicated HTML package in lxml.html), or BeautifulSoup. Besides handling ill-formed documents, these libraries include other features that make scraping and working with HTML easier: getting information out of forms, making hyperlinks absolute, using CSS selectors...


Solution:2

> i would rather not just hack it and straight replace the "&"s

Er, why? That's what cgi.escape is doing - effectively just a search and replace operation for certain characters that have to be escaped.

If you only want to replace a single character, just replace the single character:

newstring = yourstring.replace('&', '&amp;')

Don't beat around the bush.
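
As a minimal sketch of this approach, assuming Python 3 and a made-up sample page string, replacing only the ampersand leaves the tags intact, so minidom can parse the result. The variable names here are illustrative and not from the question.

```python
from xml.dom import minidom

# Hypothetical sample input: well-formed markup except for the bare '&'.
page = "<p>Fish & chips</p>"

# Escape only the ampersand; '<' and '>' are left untouched.
escaped = page.replace("&", "&amp;")

# The escaped string is now well-formed XML, so minidom accepts it.
dom = minidom.parseString(escaped)

# The parser turns '&amp;' back into '&' in the text node.
text = dom.documentElement.firstChild.data
print(escaped)
print(text)
```

This only works if the page is otherwise well-formed XML. Any ampersand that already starts an entity would be escaped a second time.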


Solution:3

If you want to make sure that you don't accidentally re-escape an already escaped & (i.e. you don't want to transform &amp; into &amp;amp; or &szlig; into &amp;szlig;), you could use a regular expression with a negative lookahead:

import re
newstring = re.sub(r"&(?![A-Za-z])", "&amp;", oldstring)

This will leave &s alone when they are followed by a letter.
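
One consequence of that lookahead is that text such as "AT&T" is also left alone, while a numeric reference such as &#38; is re-escaped because '#' is not a letter. A stricter variant, sketched here under the assumption that a reference always ends in a semicolon, skips only ampersands that begin a complete named, decimal, or hexadecimal reference. The names ENTITY_SAFE_AMP and escape_bare_ampersands are hypothetical.

```python
import re

# Match '&' unless it starts a complete reference:
#   named (&amp;), decimal (&#38;) or hexadecimal (&#x26;).
ENTITY_SAFE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(text):
    """Escape only the ampersands that are not part of a reference."""
    return ENTITY_SAFE_AMP.sub("&amp;", text)

sample = "AT&T &amp; &szlig; &#38; &#x26; a & b"
result = escape_bare_ampersands(sample)
print(result)
```

Note that an XML parser without a DTD knows only the five predefined entities (amp, lt, gt, apos, quot), so an HTML entity such as &szlig; that is left untouched may still be rejected by parseString().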


Solution:4

You shouldn't use an XML parser to parse data that isn't XML. Find an HTML parser instead; you'll be happier in the long run. The standard library has a few (HTMLParser and htmllib in Python 2; html.parser in Python 3), and BeautifulSoup is a well-loved third-party package.
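
To illustrate, here is a small sketch using the standard library's HTML parser, assuming Python 3 (where the module is html.parser). The TextCollector class and the sample markup are made up for this example. The parser accepts the bare '&' without any escaping step.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper that gathers the text content of a page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; a bare '&' arrives unchanged.
        self.chunks.append(data)

parser = TextCollector()
parser.feed("<p>Fish & chips <b>today</b></p>")
parser.close()

text = "".join(parser.chunks)
print(text)
```

This parser is event-based and does not build a DOM tree. If a tree is needed, lxml.html or BeautifulSoup may be a better fit.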

