Tutorial: Scrape FULL image src with PHP



Question:

I am trying to scrape img src values with PHP. I can get the src fine, but if it does not include the full path then I can't really reuse it. Is there a way to get the full path of the image using PHP? (Browsers can do it: the full URL is available from the right-click menu.)

i.e. how do I get a FULL path, including the domain, for either of the following two examples?

    src="../foo/logo.png"
    src="/images/logo.png"

Thanks,

Allan


Solution:1

You don't need a regex, just some patience. I don't really want to write the code for you, but the idea is simple: check whether the src starts with http:// (or https://), and if it doesn't, you have three different cases.

  1. If it begins with a /, prepend the scheme and host of the page (e.g. http://domain.com).
  2. If it begins with .., you'll have to split the page's full URL and hack off one directory for each .. until none are left, then append the rest of the src.
  3. Else (it begins with a letter), take the full URL of the page, strip it down to the last slash, then append the src.
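
As a rough illustration of the three cases above, here is a minimal sketch. The function name `resolveSrc` and the example.com page URL are made up for this example; it ignores query strings, fragments, and trailing slashes, and assumes the page URL has a scheme and host, so treat it as a starting point and not a complete resolver.

```php
<?php
// Hypothetical helper: resolve an img src against the URL of the page it came from.
function resolveSrc(string $pageUrl, string $src): string
{
    // Already absolute (has a scheme such as http:// or https://): use as-is.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $src)) {
        return $src;
    }

    $page = parse_url($pageUrl);
    $root = $page['scheme'] . '://' . $page['host']
          . (isset($page['port']) ? ':' . $page['port'] : '');

    // Protocol-relative (//host/path): only the scheme is missing.
    if (strncmp($src, '//', 2) === 0) {
        return $page['scheme'] . ':' . $src;
    }

    // Case 1: begins with a slash, so prepend scheme and host.
    if ($src !== '' && $src[0] === '/') {
        return $root . $src;
    }

    // Cases 2 and 3: strip the page path back to its last slash, append the src...
    $path = $page['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    // ...then walk the segments, letting each ".." remove the directory before it.
    $segments = [];
    foreach (explode('/', $dir . $src) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '' && $segment !== '.') {
            $segments[] = $segment;
        }
    }
    return $root . '/' . implode('/', $segments);
}

// The two src values from the question, on an assumed page URL.
$page = 'http://example.com/blog/post.html';
echo resolveSrc($page, '../foo/logo.png'), "\n";  // http://example.com/foo/logo.png
echo resolveSrc($page, '/images/logo.png'), "\n"; // http://example.com/images/logo.png
```

Collecting the segments on a stack means several leading `..` parts (e.g. `../../x.png`) each remove one directory, and extra ones beyond the root are simply ignored.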

Or... be lazy and steal this script:

    $url = "http://www.goat.com/money/dave.html";
    $rel = "../images/cheese.jpg";

    $com = InternetCombineUrl($url, $rel);
    // Returns http://www.goat.com/images/cheese.jpg

    function InternetCombineUrl($absolute, $relative) {
        // A "relative" URL that already has a scheme is absolute: return it as-is.
        $p = parse_url($relative);
        if (!empty($p["scheme"])) return $relative;

        $base = parse_url($absolute);
        // Directory of the base document, e.g. /money/dave.html -> /money
        $path = isset($base["path"]) ? dirname($base["path"]) : "";

        if ($relative !== "" && $relative[0] == '/') {
            // Root-relative: ignore the base path entirely.
            $parts = explode("/", $relative);
        } else {
            $parts = array_merge(explode("/", $path), explode("/", $relative));
        }

        // Walk the segments: skip "." and empty parts, let ".." drop the previous one.
        $cparts = array();
        foreach ($parts as $part) {
            if ($part === '..') {
                array_pop($cparts);
            } elseif ($part !== '' && $part !== '.') {
                $cparts[] = $part;
            }
        }
        $path = implode("/", $cparts);

        // Reassemble scheme, credentials, host (and port) around the new path.
        $url = "";
        if (!empty($base["scheme"])) {
            $url = $base["scheme"] . "://";
        }
        if (!empty($base["user"])) {
            $url .= $base["user"];
            if (!empty($base["pass"])) {
                $url .= ":" . $base["pass"];
            }
            $url .= "@";
        }
        if (!empty($base["host"])) {
            $url .= $base["host"] . (isset($base["port"]) ? ":" . $base["port"] : "") . "/";
        }
        $url .= $path;
        return $url;
    }

From http://www.web-max.ca/PHP/misc_24.php


Solution:2

On its own, a relative src is just a string: unless you have the URL of the site you're starting with (in which case you can resolve the src against it, e.g. by prepending it to the value of the src attribute), there is nothing to work out the full path from.

I'm assuming you don't have access to any additional information, of course. If you're parsing HTML, I'd assume you can get an absolute URL for at least the HTML page itself, but perhaps not.
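
If the page URL is known, a sketch along these lines could gather the raw src values together with the base URL to resolve them against. The function name `collectImageSources`, the sample HTML, and the example.com URL are hypothetical; it relies on PHP's DOM extension and assumes that any `<base href>` in the page is already an absolute URL.

```php
<?php
// Hypothetical helper: list the raw img src values in a page, together with
// the base URL they should be resolved against.
function collectImageSources(string $html, string $pageUrl): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // A <base href="..."> element, if present, replaces the page URL as the base.
    $base = $pageUrl;
    foreach ($doc->getElementsByTagName('base') as $el) {
        if ($el->getAttribute('href') !== '') {
            $base = $el->getAttribute('href');
            break;
        }
    }

    // Collect every non-empty src exactly as written in the markup.
    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return ['base' => $base, 'sources' => $sources];
}

$html = '<html><body><img src="../foo/logo.png"><img src="/images/logo.png"></body></html>';
$result = collectImageSources($html, 'http://example.com/blog/post.html');
print_r($result);
```

Each entry in `$result['sources']` can then be combined with `$result['base']` using a URL combiner such as the one in Solution 1.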

