Your SlideShare is downloading. ×
Web scraping with nutch solr part 2
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Web scraping with nutch solr part 2


Published on

Part 2 of a three part presentation showing how nutch and solr may be used to crawl the web, extract data and prepare it for loading into a data warehouse.

Part 2 of a three part presentation showing how nutch and solr may be used to crawl the web, extract data and prepare it for loading into a data warehouse.

Published in: Technology, Design
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Web Scraping Using Nutch and Solr - Part 2 ● The following example assumes that you have – Watched “web scraping with nutch and solr” – The above movie identity is cAiYBD4BQeE – Set up Linux based Nutch/Solr environment – Run the web scrape in the above movie ● Now we will – Clean up that environment – Web scrape a parameterised url – View the urls in the data
  • 2. Empty Nutch Database ● Clean up the Nutch crawl database – Previously used apache-nutch-1.6/ – This contained -dir crawl option – This created apache-nutch-1.6/crawl directory – Which contains our Nutch data ● Clean this as – cd apache-nutch-1.6; rm -rf crawl ● Only because it contained dummy data ! ● Next run of script will create dir again
  • 3. Empty Solr Database ● Clean Solr database via a url – Book mark this url – Only use it if you need to empty your data ● Run the following ( with solr server running ) – http://localhost:8983/solr/update?commit=true -d '<delete><query>*:*</query></delete>'
  • 4. Set up Nutch ● Now we will do something more complex ● Web scrape a url that has parameters i.e. – http://<site>/<function>?var1=val1&var2=val2 ● This web scrape will – Have extra url characters '?=&' – Need greater search depth – Need better url filtering ● Remember that you need to get permission to scrape a third party web site
  • 5. Nutch Configuration ● Change seed file for Nutch ● apache-nutch-1.6/urls/seed.txt ● In this instance I will use a url of the form – – ( this is not a real url – just an example ) ● Change conf regex-urlfilter.txt entry i.e. – # skip URLs containing certain characters – -[*!@] – # accept anything else – +^http://([a-z0-9]*.)* ● This will only consider some site Search urls
  • 6. Run Nutch ● Now run nutch using start script – cd apache-nutch-1.6 ; ./nutch_start.bash ● Monitor for errors in solr admin log window ● The Nutch crawl should end with – crawl finished: crawl
  • 7. Checking Data ● Data should have been indexed in Solr ● In Solr Admin window – Set 'Core Selector' = collection1 – Click 'Query' – In Query window set fl field = url – Click Execute Query ● The result ( next ) shows the filtered list of urls in Solr
  • 8. Checking Data
  • 9. Results ● Congratulations you have completed your second crawl – With parameterised urls – More complex url filtering – With a Solr Query search
  • 10. Contact Us ● Feel free to contact us at – – ● We offer IT project consultancy ● We are happy to hear about your problems ● You can just pay for those hours that you need ● To solve your problems