Tutorial: Using BeautifulSoup to find an HTML tag that contains certain text



Question:

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>  

The text of the element above would be matched by:

soup('h2',text=re.compile(r' #\S{11}'))  

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']  

I'm able to get all the text that matches (see line above). But I want the parent element of the text to match, so I can use that as a starting point for traversing the document tree. In this case, I'd want all the h2 elements to return, not the text matches.

Ideas?


Solution:1

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Each match is a text node; .parent is the tag that contains it.
for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
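For readers on Python 3 with bs4, the same idea might look roughly like the sketch below. It assumes the bs4 package is installed (4.4.0 or later, where the argument is spelled string=) and uses the built-in html.parser. The helper name parents_of_matching_text and its tag_name parameter are made up for illustration. Because the search is on text alone, the h1 in the sample also matches, so the sketch optionally filters the parents by tag name.

```python
import re
from bs4 import BeautifulSoup

HTML_TEXT = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""


def parents_of_matching_text(html, pattern, tag_name=None):
    """Hypothetical helper: return the parent tags of text nodes matching pattern."""
    soup = BeautifulSoup(html, "html.parser")
    parents = []
    # With no tag name given, find_all(string=...) yields text nodes, not tags.
    for text_node in soup.find_all(string=pattern):
        parent = text_node.parent
        # Keep every parent, or only those with the requested tag name.
        if tag_name is None or parent.name == tag_name:
            parents.append(parent)
    return parents


if __name__ == "__main__":
    for tag in parents_of_matching_text(HTML_TEXT, re.compile(r" #\S{11}"), "h2"):
        print(tag)
```

Calling the helper without a tag_name returns the h1 as well, which mirrors what the unfiltered loop above does.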


Solution:2

When text= is used as a search criterion, BeautifulSoup search operations return BeautifulSoup.NavigableString objects (or a list of them), not the BeautifulSoup.Tag objects returned in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is favored over previous because of changes in BS4.

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
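The NavigableString versus Tag distinction can also be checked directly. The following is a small sketch for bs4 on Python 3, not the code above: it assumes bs4 4.4.0 or later (for string=) and the built-in html.parser, and the one-line HTML snippet is just an illustration. In bs4, a search by text alone gives back the text node, while a search that also names a tag gives back the tag.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
pattern = re.compile(r"cool")

# Searching on text alone returns the text node itself.
text_node = soup.find(string=pattern)
# Naming a tag as well returns the tag whose string matches.
tag = soup.find("h2", string=pattern)

print(isinstance(text_node, NavigableString))  # the match is a text node
print(isinstance(tag, Tag))                    # the match is the h2 tag
print(text_node.parent == tag)                 # .parent climbs from text to tag
```

Either way, .parent on the text node is the route back to the enclosing element.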


Solution:3

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2', text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
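Since the OP wanted the matching elements as a starting point for traversing the tree, here is a hedged sketch of that next step. The sample HTML with its p elements is invented for illustration, the parser is the built-in html.parser, and string= is the bs4 4.4.0+ spelling of the older text= argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document: only the first h2 carries a "#..." id.
html = """
<h2> this is cool #12345678901 </h2>
<p>first paragraph</p>
<h2> no id here </h2>
<p>second paragraph</p>
"""

soup = BeautifulSoup(html, "html.parser")

# With a tag name given, bs4 returns the h2 tags themselves.
headings = soup.find_all("h2", string=re.compile(r" #\S{11}"))

for heading in headings:
    # Walk from the matched heading to the next <p> at the same level.
    following = heading.find_next_sibling("p")
    print(heading.get_text(strip=True), "->", following.get_text())
```

From each returned tag, the usual navigation attributes and methods (parent, find_next_sibling, and so on) are available.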

