Ubuntu: How can I extract all PDF links on a website?


This is a little off topic, but I hope you guys will help me. I've found a website full of articles I need, but those are mixed with a lot of useless files (mainly jpgs).

I would like to know if there is a way to find (not download) all PDFs on the server to make a list of links. Basically I would simply like to filter out everything that's not a PDF, in order to get a better view of what to download and what not.



Ok, here you go. This is a programmatic solution in the form of a script:

#!/bin/bash

# NAME:         pdflinkextractor
# AUTHOR:       Glutanimate (http://askubuntu.com/users/81372/), 2013
# LICENSE:      GNU GPL v2
# DEPENDENCIES: wget lynx
# DESCRIPTION:  extracts PDF links from websites and dumps them to the stdout and as a textfile
#               only works for links pointing to files with the ".pdf" extension
#
# USAGE:        pdflinkextractor "www.website.com"

WEBSITE="$1"

echo "Getting link list..."

lynx -cache=0 -dump -listonly "$WEBSITE" | grep ".*\.pdf$" | awk '{print $2}' | tee pdflinks.txt

# OPTIONAL
#
# DOWNLOAD PDF FILES
#
#echo "Downloading..."
#wget -P pdflinkextractor_files/ -i pdflinks.txt


You will need to have wget and lynx installed:

sudo apt-get install wget lynx  


The script will get a list of all the .pdf links on the given page and dump it to the command line output and to a textfile in the working directory. If you uncomment the "optional" wget command, the script will proceed to download all files to a new directory.
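As the script header notes, the filter only catches links that end in exactly ".pdf". Below is a minimal sketch of a more tolerant filter, assuming some links use an uppercase extension or carry a query string or fragment after it. The function name filter_pdf_links is hypothetical and not part of the original script.

```shell
#!/bin/bash

# Hypothetical helper (not part of the original script): reads link list
# lines on stdin (lynx style, e.g. "   1. http://...") and prints PDF URLs.
filter_pdf_links() {
    # 1. Keep only the last field of each line, which is the URL.
    # 2. Match ".pdf" case-insensitively, optionally followed by a
    #    ?query or #fragment.
    # 3. Drop duplicate URLs while preserving the original order.
    awk '{print $NF}' | grep -iE '\.pdf([?#].*)?$' | awk '!seen[$0]++'
}

# Possible usage in place of the grep/awk pair (assumes lynx is installed):
#   lynx -cache=0 -dump -listonly "$WEBSITE" | filter_pdf_links | tee pdflinks.txt
```

This still relies on the URL itself containing ".pdf", so PDFs served from links without that extension will be missed.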


$ ./pdflinkextractor http://www.pdfscripting.com/public/Free-Sample-PDF-Files-with-scripts.cfm
Getting link list...
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/JSPopupCalendar.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/ModifySubmit_Example.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/DynamicEmail_XFAForm_V2.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcquireMenuItemNames.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/BouncingButton.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/JavaScriptClock.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/Matrix2DOperations.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/RobotArm_3Ddemo2.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/SimpleFormCalculations.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/TheFlyv3_EN4Rdr.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/ImExportAttachSample.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcroForm_BasicToggle.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcroForm_ToggleButton_Sample.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcorXFA_BasicToggle.pdf
http://www.pdfscripting.com/public/FreeStuff/PDFSamples/ConditionalCalcScripts.pdf
Downloading...
--2013-12-24 13:31:25--  http://www.pdfscripting.com/public/FreeStuff/PDFSamples/JSPopupCalendar.pdf
Resolving www.pdfscripting.com (www.pdfscripting.com)...
Connecting to www.pdfscripting.com (www.pdfscripting.com)||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 176008 (172K) [application/pdf]
Saving to: `/Downloads/pdflinkextractor_files/JSPopupCalendar.pdf'

100%[===========================================================================================================================================================================>] 176.008      120K/s   in 1,4s

2013-12-24 13:31:29 (120 KB/s) - `/Downloads/pdflinkextractor_files/JSPopupCalendar.pdf' saved [176008/176008]

...


A simple JavaScript snippet can solve this too. (Note: I assume every PDF link contains ".pdf".)

Open your browser's JavaScript console, then copy the following code, paste it into the console, and run it:

// Get all link elements
var link_elements = document.querySelectorAll(":link");

// Extract all URIs
var link_uris = [];
for (var i = 0; i < link_elements.length; i++) {
    // Skip links that are already in the list
    if (link_uris.indexOf(link_elements[i].href) !== -1)
        continue;

    link_uris.push(link_elements[i].href);
}

// Keep only the links containing the ".pdf" string
var link_pdfs = link_uris.filter(function (lu) { return lu.indexOf(".pdf") !== -1; });

// Print all PDF links
for (var i = 0; i < link_pdfs.length; i++)
    console.log(link_pdfs[i]);
