
Question:
I have downloaded a page using urlopen. How do I remove all HTML tags from it? Is there a regexp that replaces all <*> tags?
Solution:1
A very simple regexp would be:
import re
notag = re.sub("<.*?>", " ", html)
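The drawback of this solution is that it removes only the tags themselves: the JavaScript and CSS inside <script> and <style> elements stay in the output. Also, because . does not match a newline by default, a tag that spans several lines is left in place.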
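One way to work around that drawback, as a rough sketch rather than a robust HTML parser: drop <script> and <style> elements together with their contents first, then drop the remaining tags. The names strip_tags, SCRIPT_STYLE_RE, COMMENT_RE and TAG_RE are made up for this example, and regexes can still be fooled by unusual markup (for instance a > inside an attribute value).

```python
import re

# Remove <script>/<style> elements together with their contents.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Remove HTML comments.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Remove any remaining tag; [^>] also matches newlines, so multi-line tags go too.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    text = SCRIPT_STYLE_RE.sub(" ", html)
    text = COMMENT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>alert('x');</script></body></html>"
)
print(strip_tags(sample))  # Hello world
```

Replacing each tag with a space and collapsing whitespace afterwards keeps words from adjacent elements from running together.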
Solution:2
I can also recommend BeautifulSoup, which is an easy-to-use HTML parser. There you would do something like:
from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 used "from BeautifulSoup import BeautifulSoup"
soup = BeautifulSoup(html, "html.parser")
all_text = ''.join(soup.find_all(string=True))
This way you get all the text from an HTML document.
Solution:3
There's a great Python library called bleach. The call below removes all HTML tags and leaves everything else in place, including the text inside elements that are not visible, such as <script> or <style>. Older bleach releases also accept styles=[]; newer ones no longer take that argument.
import bleach
bleach.clean(thestring, tags=[], attributes={}, strip=True)
Solution:4
If you need HTML parsing, Python has a module for you!
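The answer does not name the module; assuming it means the standard library's html.parser (HTMLParser in Python 2), a minimal text extractor might look like the sketch below. TextExtractor and html_to_text are names invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp; are decoded.
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical helper: feed the markup to the parser and return its text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<p>Fish &amp; chips</p><script>var a = 1 < 2;</script>"))
# Fish & chips
```

Unlike the regexp answers, this copes with a < inside a script and decodes character references, at the cost of a little more code.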
Solution:5
Try this:
import re

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
Solution:6
You could use html2text, which aims to produce a readable plain-text equivalent of an HTML source (programmatically from Python or as a command-line tool). That is my guess at what you need, extrapolating from your question.
Solution:7
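There are multiple ways to filter HTML tags out of data: you can use a regex or plain Python string handling. A simple way is a small helper function:
import re

def remove_tags(text):
    # Assumed helper (not a standard module): strip tags, then collapse whitespace.
    return ' '.join(re.sub(r'<[^>]+>', '', text).split())

data_to_remove = '<p>hello\t\t, \tworld\n</p>'
print(remove_tags(data_to_remove))
OUTPUT: hello , world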