Tutorial: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?


Ever since I asked how to parse HTML with regex and got bashed a bit (rightfully so), I've been studying the HTML::TreeBuilder, HTML::Parser, HTML::TokeParser, and HTML::Element Perl modules.

I have HTML like this:

    <div id="listSubtitlesFilm">
      <dt id="a1">
        <a href="/45/subtitles-67624.aspx">
          .45 (2006)
        </a>
      </dt>
    </div>

I want to parse out the /45/subtitles-67624.aspx, but more importantly I want to know how to parse out the contents of the div.

I was given this example on a previous question:

    while ( my $anchor = $parser->get_tag('a') ) {
        if ( my $href = $anchor->get_attr('href') ) {
            # http://subscene.com/english/Sit-Down-Shut-Up-First-Season/subtitles-272112.aspx
            push @dnldLinks, $1 if $href =~ m!/subtitle-(\d{2,8})\.aspx!;
        }
    }

This worked perfectly for that, but when I tried to edit it a bit and use it on a div it didn't work. Here is the code I tried:

    while (my $anchor = $p->get_tag("dt")) {
        if($stuff = $anchor->get_attr('a1')) {
            print $stuff."\n";
        }
    }


To address your specific question, given the HTML you posted: I am assuming you are interested in the anchor text (".45 (2006)" in this case), but only if the anchor occurs in a div with id listSubtitlesFilm.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::TokeParser::Simple;

    my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);

    my @dnldLinks;

    while ( my $div = $parser->get_tag('div') ) {
        my $id = $div->get_attr('id');
        next unless defined($id) and $id eq 'listSubtitlesFilm';

        my $anchor = $parser->get_tag('a');
        my $href = $anchor->get_attr('href');
        next unless defined($href)
            and $href =~ m!/subtitles-(\d{2,8})\.aspx\z!;
        push @dnldLinks, [$parser->get_trimmed_text('/a'), $1];
    }

    use Data::Dumper;
    print Dumper \@dnldLinks;

    __DATA__
    <div id="listSubtitlesFilm">
      <dt id="a1">
        <a href="/45/subtitles-67624.aspx">
          .45 (2006)
        </a>
      </dt>
    </div>


Output:

    $VAR1 = [
              [
                '.45 (2006)',
                '67624'
              ]
            ];


You could use (yet another module!) HTML::TreeBuilder::XPath, which, as per its name, will let you use XPath on HTML::TreeBuilder objects.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::TreeBuilder::XPath;

    my $root = HTML::TreeBuilder::XPath->new_from_file( "my.html");

    # print $root->as_HTML; # useful to see how HTML::TreeBuilder
    # understands your HTML. For example it will wrap the implied
    # dl element around dt, which you need to take into account
    # when writing the XPath query below

    my $id= "a1";
    # you need the .//dt because of the extra dl
    my @divs= $root->findnodes( qq{//div[.//dt[\@id="$id"]]});

    print $divs[0]->as_HTML; # or as_text


Code using HTML::TreeBuilder:

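    use HTML::TreeBuilder;

    # $html is assumed to already hold the page source
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Note: the HTML in the question uses "subtitles-", so adjust the
    # pattern accordingly if you are matching those links.
    for my $link ($tree->look_down(
        _tag => 'a',
        href => qr{/subtitle-\d{2,8}\.aspx})
    ) {
        # List context plus a capture group yields the numeric ID
        # (a scalar-context match would only return true/false)
        my ($linkid) = $link->attr('href') =~ m!/subtitle-(\d{2,8})\.aspx!;
        # Scalar context gets the first, and the first is the nearest parent
        my $parent_div = $link->look_up(_tag => 'div');
        # Now the interesting bit of the link is in $linkid, the parent div ID
        # is $parent_div->id or $parent_div->attr('id'), and its text is e.g.
        # $parent_div->as_trimmed_text or you can do other stuff with its content.
    }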


You need to change get_attr("a1") to get_attr("id") here. get_attr(x) looks for an attribute with the name x, but you are giving it the value of the attribute, not its name.

Incidentally, the <dt> tag is not a <div>; it marks a term inside a <dl> (definition list).
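To make the name-versus-value point concrete, here is a minimal sketch using HTML::TokeParser::Simple on the HTML from the question. The variable names ($html, $p, @dt_ids) are mine, and the HTML is pasted in as a string purely for illustration; treat it as one possible way to do it rather than the definitive one.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# The snippet from the question, inlined so the example is self-contained
my $html = <<'HTML';
<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>
HTML

my $p = HTML::TokeParser::Simple->new(string => $html);

my @dt_ids;
while ( my $dt = $p->get_tag('dt') ) {
    # Ask for the attribute by NAME ('id'); the VALUE ('a1') is what comes back
    my $id = $dt->get_attr('id');
    push @dt_ids, $id if defined $id;
}

print "$_\n" for @dt_ids;
```

Calling get_attr('a1') on the same tag would return undef, because the <dt> has no attribute named a1.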


get_attr('a1') should probably have read get_attr('id'), in which case it would print "a1".

I think getting the text content would look like:

    while ( my $div = $parser->get_tag('div') ) {
        # Collect all text up to the closing </div>
        my $content = $parser->get_text('/div');
    }

Or if you meant the text content of the link it would be:

    while ( my $anchor = $parser->get_tag('a') ) {
        if ( my $href = $anchor->get_attr('href') ) {
            my $content = $parser->get_text('/a');
            # http://subscene.com/english/Sit-Down-Shut-Up-First-Season/subtitle-272112.aspx
            push @dnldLinks, $1 if $href =~ m!/subtitle-(\d{2,8})\.aspx!;
        }
    }
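For the div variant above, a self-contained sketch might look like the following. It uses plain HTML::TokeParser, where get_tag returns an array reference whose second element is the attribute hash, and it restricts the match to the div with id listSubtitlesFilm. The names $html, $p and $div_text are illustrative, and I am assuming the target div contains no nested divs.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# The snippet from the question, inlined so the example is self-contained
my $html = <<'HTML';
<div id="listSubtitlesFilm">
  <dt id="a1">
    <a href="/45/subtitles-67624.aspx">
      .45 (2006)
    </a>
  </dt>
</div>
HTML

my $p = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my $div_text;
while ( my $tag = $p->get_tag('div') ) {
    # For a start tag, get_tag returns [$tagname, \%attr, \@attrseq, $text]
    my $attr = $tag->[1];
    next unless defined $attr->{id} and $attr->{id} eq 'listSubtitlesFilm';

    # Gather every piece of text up to the next </div>,
    # with whitespace collapsed and trimmed
    $div_text = $p->get_trimmed_text('/div');
    last;
}

print "$div_text\n" if defined $div_text;
```

Because get_trimmed_text('/div') stops at the first closing </div> it meets, a div nested inside the target would cut the text short; a tree-based module such as HTML::TreeBuilder is likely the safer choice in that case.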

