Tutorial: Does Solr do web crawling?



Question:

I am interested in doing web crawling. I was looking at Solr.

Does Solr do web crawling, or what are the steps to do web crawling?


Solution:1

Solr 5+ DOES in fact now do web crawling! http://lucene.apache.org/solr/

Older Solr versions do not do web crawling alone, as historically it's a search server that provides full text search capabilities. It builds on top of Lucene.

If you need to crawl web pages using another Solr project then you have a number of options including:

If you want to make use of the search facilities provided by Lucene or SOLR you'll need to build indexes from the web crawl results.

See this also:

Lucene crawler (it needs to build lucene index)
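
As a rough illustration of building an index from crawl results, here is a minimal Python sketch that packages crawled pages as a JSON update request for Solr. The collection name "gettingstarted", the field names "title" and "content", and the sample page are assumptions for illustration; the fields must match whatever your own schema defines. The request is only built, not sent.

```python
import json
import urllib.request


def build_update_request(solr_url, collection, pages):
    """Build (but do not send) a JSON update request for crawled pages.

    pages: iterable of (url, title, text) tuples produced by some crawler.
    The field names "title" and "content" are assumptions; adjust them
    to match your Solr schema.
    """
    docs = [
        {"id": url, "title": title, "content": text}
        for url, title, text in pages
    ]
    endpoint = f"{solr_url.rstrip('/')}/{collection}/update?commit=true"
    return urllib.request.Request(
        endpoint,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request(
    "http://localhost:8983/solr",
    "gettingstarted",
    [("http://example.com/", "Example", "Example page text")],
)
# urllib.request.urlopen(request)  # uncomment to send to a running Solr
```

Using the page URL as the document id means that re-crawling the same page overwrites the earlier document instead of duplicating it.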


Solution:2

Solr does not in and of itself have a web crawling feature.

Nutch is the "de-facto" crawler (and then some) for Solr.


Solution:3

Solr 5 started supporting simple web crawling (Java Doc). If you want search, Solr is the tool; if you want to crawl, Nutch or Scrapy is better :)

To get it up and running, you can take a detailed look here. However, here is how to get it up and running with a single command:

java
  -classpath <pathtosolr>/dist/solr-core-5.4.1.jar
  -Dauto=yes
  -Dc=gettingstarted     -> collection: gettingstarted
  -Ddata=web             -> web crawling and indexing
  -Drecursive=3          -> go 3 levels deep
  -Ddelay=0              -> for the impatient use 10+ for production
  org.apache.solr.util.SimplePostTool   -> SimplePostTool
  http://datafireball.com/      -> a testing wordpress blog

The crawler here is very "naive"; you can find all the code in Apache Solr's GitHub repo.

Here is what the response looks like:

SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/gettingstarted/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=3, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://datafireball.com (depth: 0)
Entering crawl at level 1 (52 links total, 51 new)
POSTed web resource http://datafireball.com/2015/06 (depth: 1)
...
Entering crawl at level 2 (266 links total, 215 new)
...
POSTed web resource http://datafireball.com/2015/08/18/a-few-functions-about-python-path (depth: 2)
...
Entering crawl at level 3 (846 links total, 656 new)
POSTed web resource http://datafireball.com/2014/09/06/node-js-web-scraping-using-cheerio (depth: 3)
SimplePostTool: WARNING: The URL http://datafireball.com/2014/09/06/r-lattice-trellis-another-framework-for-data-visualization/?share=twitter returned a HTTP result status of 302
423 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update/extract...
Time spent: 0:05:55.059

In the end, you can see that all the data is indexed properly.
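
To make the "go N levels deep" idea concrete, here is a minimal Python sketch of a level-by-level crawl with a depth limit and a politeness delay. It is not SimplePostTool's actual code; the names crawl, fetch and LinkExtractor are hypothetical, and the tiny in-memory "site" stands in for real HTTP requests. It also ignores robots.txt and does not restrict itself to one host, which a real crawler should handle.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=3, delay=10.0):
    """Crawl level by level, visiting each URL at most once.

    fetch(url) must return the page HTML as a string, or None on failure.
    Returns a dict mapping each fetched URL to its HTML.
    """
    seen = {start_url}
    level = [start_url]
    pages = {}
    for _depth in range(max_depth + 1):
        next_level = []
        for url in level:
            html = fetch(url)
            if html is None:
                continue
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                # Resolve relative links and drop #fragments.
                link = urldefrag(urljoin(url, href)).url
                if link not in seen:
                    seen.add(link)
                    next_level.append(link)
            if delay:
                time.sleep(delay)  # be polite to the remote server
        level = next_level
    return pages


# A hypothetical in-memory site used instead of real HTTP requests.
site = {
    "http://example.test/": '<a href="/a">a</a> <a href="/b">b</a>',
    "http://example.test/a": '<a href="/c">c</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": '<a href="/d">d</a>',
    "http://example.test/d": "",
}
pages = crawl("http://example.test/", site.get, max_depth=2, delay=0)
```

With max_depth=2 the page /d is discovered but never fetched, because it sits three links away from the start page. For a real site, fetch would wrap an HTTP client and delay should stay at 10 seconds or more, as the warning in the log above suggests.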


Solution:4

You might also want to take a look at

http://www.crawl-anywhere.com/

Very powerful crawler that is compatible with Solr.


Solution:5

I have been using Nutch with Solr on my latest project and it seems to work quite nicely.

If you are using a Windows machine then I would strongly recommend following the 'No cygwin' instructions given by Jason Riffel too!


Solution:6

Yes, I agree with the other posts here: use Apache Nutch.

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

Note, however, that your Solr version has to match the correct version of Nutch, because older versions of Solr store the indices in a different format.

Its tutorial: http://wiki.apache.org/nutch/NutchTutorial


Solution:7

I know it's been a while, but in case someone else is searching for a Solr crawler like me, there is a new open-source crawler called Norconex HTTP Collector.


Solution:8

Definitely Nutch! Nutch also has a basic web front end which will let you query your search results. You might not even need to bother with SOLR, depending on your requirements. If you do a Nutch/SOLR combination, you should be able to take advantage of the recent work done to integrate SOLR and Nutch: http://issues.apache.org/jira/browse/NUTCH-442

