Tutorial: Perl web scraper, extract content from a DIV that only has a “style” attribute?



Question:

I've been stuck on this all day. I'm still fairly new to parsing and scraping in Perl, but I thought I had it down until this. I have tried several Perl modules (HTML::TokeParser, HTML::TokeParser::Simple, Web::Scraper and some others). I have the following string (in reality it is an entire HTML page; this just shows the relevant part). I am trying to extract "test1" and "test1_a", and so on (the "test1" values are just placeholders). So I think I first need to extract this from each inner DIV:

"<span style="float: left;">test1</span>test1_a"  

Then I need to parse that to get the two values. I don't know why this is giving me so much trouble; I thought I could just do it with HTML::TokeParser::Simple, but I couldn't get it to return the value inside the DIV. I wonder if it's because the DIV contains another set of tags (the <span> tags).

String (represents the HTML web page):

    <div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
      <div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
      <div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
      <div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>

My attempt with the Web::Scraper module:

    my $uri = URI->new($theurl);

    my $proxyscraper = scraper {
        process 'div[style=~"width: 250px; text-align: right;"]', 'proxiesextracted[]' => scraper {
            process '.style', style => 'TEXT';
        };
        result 'proxiesextracted';

I'm just blindly trying to make sense of the Web::Scraper module, as there is essentially no documentation for it, so I pieced that together from the examples included with the module and one I found on the internet. Any advice is greatly appreciated.


Solution:1

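If you want a DOM parser (easier tree browsing, slightly slower), try HTML::TreeBuilder.

From the HTML::Element man page (the module is included with HTML::TreeBuilder):

    Note also that look_down considers "" (empty-string) and undef to be
    different things, in attribute values. So this:

      $h->look_down("alt", "")

In other words, passing "" matches only elements whose attribute is present with an empty value, while passing undef matches elements that lack the attribute entirely.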

Which leads us to your answer:

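    use HTML::TreeBuilder;

    # Check the HTML::TreeBuilder POD; there are a few constructors (file, filehandle, HTML string).
    # $html is assumed to hold the page source.
    my $tb = HTML::TreeBuilder->new_from_content($html);

    # The inner divs have a non-empty style, so match it with a pattern rather than ''.
    # In scalar context look_down returns the first match ("test1test1_a" here).
    print $tb->look_down( _tag => 'div', style => qr/text-align:\s*right/ )->as_text, "\n";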
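To get the span text and the trailing text as two separate values, one possible approach is to walk each inner div's children. This is a minimal sketch, not the answerer's code: it assumes the markup from the question (with the outer div closed, which the excerpt does not show), and the names `$container`, `$tail` and `@pairs` are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup from the question; the closing </div> of the outer
# container is an assumption, since the excerpt leaves it open.
my $html = <<'HTML';
<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
  <div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
  <div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
  <div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>
</div>
HTML

my $tb = HTML::TreeBuilder->new_from_content($html);

# Locate the container by id, then look only at its direct div children.
my $container = $tb->look_down( _tag => 'div', id => 'dataID' )
    or die "container div not found\n";

my @pairs;
for my $div ( grep { ref $_ && $_->tag eq 'div' } $container->content_list ) {
    my $span = $div->look_down( _tag => 'span' ) or next;

    # content_list returns element objects and plain strings;
    # the plain strings are the text that sits directly inside the div.
    my $tail = join '', grep { !ref $_ } $div->content_list;
    $tail =~ s/^\s+|\s+$//g;

    push @pairs, [ $span->as_text, $tail ];
}

print "$_->[0] => $_->[1]\n" for @pairs;

$tb->delete;    # release the parse tree
```

Anchoring on the id of the outer div avoids depending on the exact style string, which tends to be fragile if the page's inline CSS changes.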


Solution:2

Using Web::Scraper, try:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use Data::Dumper::Simple;
    use Web::Scraper;

    $Data::Dumper::Indent = 1;

    my $html = '<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
    <div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
    <div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
    <div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>';

    # For each inner div, grab the span text and the text node.
    my $proxyscraper = scraper {
        process '//div[@id="dataID"]/div', 'proxiesextracted[]' => scraper {
            process '//span', 'data1' => 'TEXT';
            process '//text()', 'data2' => 'TEXT';
        }
    };

    my $results = $proxyscraper->scrape( $html );

    print Dumper($results);

It gives:

    $results = {
      'proxiesextracted' => [
        {
          'data2' => 'test1_a',
          'data1' => 'test1'
        },
        {
          'data2' => 'test2_a',
          'data1' => 'test2'
        },
        {
          'data2' => 'test3_a',
          'data1' => 'test3'
        }
      ]
    };
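Since the result is an array reference of hash references, pulling out the two values per row is a short loop. The sketch below uses a literal copy of the structure printed above as a stand-in for the return value of `scrape`; the `@lines` array is only there for illustration.

```perl
use strict;
use warnings;

# Literal copy of the dumped structure, standing in for
# the value returned by $proxyscraper->scrape($html).
my $results = {
    proxiesextracted => [
        { data1 => 'test1', data2 => 'test1_a' },
        { data1 => 'test2', data2 => 'test2_a' },
        { data1 => 'test3', data2 => 'test3_a' },
    ],
};

# Dereference the array and read both fields from each row.
my @lines;
for my $row ( @{ $results->{proxiesextracted} } ) {
    push @lines, "$row->{data1} => $row->{data2}";
}

print "$_\n" for @lines;
```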

Hope this helps

