Tutorial: Use curl to parse XML, get an image's URL and download it



Question:

I want to write a shell script to get an image from an rss feed. Right now I have:

curl http://foo.com/rss.xml \
  | grep -E '<img src="http://www.foo.com/full/' \
  | head -1 \
  | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' \
  | sed 's/ //g'

I use this to grab the first occurrence of an image URL in the feed. Now I want to put this URL into a variable so I can use cURL again to download the image. Any help appreciated! (You might also give tips on how to better strip everything except the URL from the line. This is the line:

 <img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />  

There's probably a better regex to remove everything except the URL than my solution.) Thanks in advance!
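Instead of chaining sed substitutions to delete everything around the URL, grep -o can print only the part of the line that matches. A minimal sketch using the sample line from the question (the pattern `http://[^"]+` is an assumption that the URL sits inside double quotes):

```shell
# Print only the URL inside src="..." (sample line from the question)
line='<img src="http://www.nichtlustig.de/comics/full/100802.jpg" alt="" width="400" height="400" />'
echo "$line" | grep -oE 'http://[^"]+' | head -1
# → http://www.nichtlustig.de/comics/full/100802.jpg
```

head -1 keeps only the first match in case a line contains several URLs.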


Solution:1

Using a regexp to parse HTML/XML is a Bad Idea in general. Therefore I'd recommend that you use a proper parser.

If you don't object to using Perl, let Perl do the proper XML or HTML parsing for you using appropriate parser libraries:

HTML

curl http://BOGUS.com |& perl -e '{
    use HTML::TokeParser;
    $parser = HTML::TokeParser->new(\*STDIN);
    $img = $parser->get_tag("img");
    print "$img->[1]->{src}\n";
}'

Output:

/content02/groups/intranetcommon/documents/image/blk_logo.gif

XML

curl http://BOGUS.com/whdata0.xml | perl -e '{
    use XML::Twig;
    $twig = XML::Twig->new(twig_handlers => {
        img => sub { print $_[1]->att("src")."\n"; exit 0; }
    });
    open(my $fh, "-");
    $twig->parse($fh);
}'

Output:

/content02/groups/intranetcommon/documents/image/blk_logo.gif


Solution:2

I used wget instead of curl, but it's just the same:

#!/bin/bash
url='http://www.nichtlustig.de/rss/nichtrss.rss'
wget -O- -q "$url" | awk 'BEGIN{ RS="</a>" }
/<img src=/{
    gsub(/.*<img src=\"/,"")
    gsub(/\".[^>]*>/,"")
    print
}' | xargs -i wget "{}"


Solution:3

Use a DOM parser and extract all img elements using getElementsByTagName. Then add them to a list/array, loop through and separately fetch them.

I would suggest using Python, but any language would have a DOM library.
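If a full scripting language feels heavy, xmllint (from libxml2) offers this DOM/XPath route directly from the shell. A sketch, assuming xmllint is installed; the feed URL is the placeholder from the question:

```shell
# Extract the first img src via XPath instead of regex
# (--html makes the parser tolerate non-XML markup)
url=$(curl -s http://foo.com/rss.xml \
  | xmllint --html --xpath 'string(//img/@src)' - 2>/dev/null)
curl -O "$url"
```

`string(//img/@src)` returns the src attribute of the first matching img element; to fetch all of them, loop over `xmllint --xpath '//img/@src'` output instead.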


Solution:4

#!/bin/sh
URL=$(curl http://foo.com/rss.xml \
  | grep -E '<img src="http://www.foo.com/full/' \
  | head -1 \
  | sed -e 's/<img src="//' -e 's/" alt=""//' -e 's/width="400"//' -e 's/  height="400" \/>//' \
  | sed 's/ //g')
curl -C - -O "$URL"

This totally does the job! Any idea on the regex?


Solution:5

Here's a quick Python solution:

from BeautifulSoup import BeautifulSoup
import sys

soup = BeautifulSoup(sys.stdin.read())
print soup.findAll('img')[0]['src']

Usage:

$ curl http://www.google.com/`curl http://www.google.com | python get_img_src.py`  

This works like a charm and will not leave you trying to find the magical regex that will parse random HTML. (Hint: there is no such expression, especially not with a greedy matcher like sed.)

