Web Scraping Using Nutch and Solr - Part 2

Part 2 of a three part presentation showing how nutch and solr may be used to crawl the web, extract data and prepare it for loading into a data warehouse.

Published in: Technology, Design

  1. Web Scraping Using Nutch and Solr - Part 2
     ● The following example assumes that you have
       – Watched “web scraping with nutch and solr”
       – ( the id of that movie is cAiYBD4BQeE )
       – Set up a Linux based Nutch/Solr environment
       – Run the web scrape shown in that movie
     ● Now we will
       – Clean up that environment
       – Web scrape a parameterised url
       – View the urls in the data
  2. Empty Nutch Database
     ● Clean up the Nutch crawl database
       – Previously used apache-nutch-1.6/nutch_start.sh
       – This contained the -dir crawl option
       – This created the apache-nutch-1.6/crawl directory
       – Which contains our Nutch data
     ● Clean this with
       – cd apache-nutch-1.6; rm -rf crawl
     ● Only do this because it contained dummy data !
     ● The next run of the script will create the directory again
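The clean-up above can also be scripted. The following is a minimal Python sketch, assuming the crawl data lives in a directory named crawl under the Nutch install directory as described on the slide. The function name remove_crawl_dir is hypothetical and is not part of Nutch.

```python
import shutil
from pathlib import Path


def remove_crawl_dir(nutch_home: str, crawl_name: str = "crawl") -> bool:
    """Delete <nutch_home>/<crawl_name> if it exists.

    Returns True if a directory was removed, False if there was nothing to clean.
    """
    crawl_dir = Path(nutch_home) / crawl_name
    if not crawl_dir.is_dir():
        return False  # nothing to clean
    shutil.rmtree(crawl_dir)  # same effect as: rm -rf crawl
    return True
```

Calling remove_crawl_dir("apache-nutch-1.6") from the parent directory would mirror the rm -rf command. As with the shell version, only do this when the crawl data is disposable.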
  3. Empty Solr Database
     ● Clean the Solr database via its update url
       – Keep a note of this command
       – Only use it if you need to empty your data
     ● Run the following with curl ( with solr server running )
       – curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' -d '<delete><query>*:*</query></delete>'
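The same delete request can be built from Python. This is a minimal sketch, assuming Solr listens on localhost:8983 as on the slide. The function name build_delete_all_request is hypothetical, and the text/xml content type is an assumption about what the update handler expects for an XML body.

```python
import urllib.request

# Delete-by-query message that matches every document in the index.
DELETE_ALL_XML = "<delete><query>*:*</query></delete>"


def build_delete_all_request(
    solr_base: str = "http://localhost:8983/solr",
) -> urllib.request.Request:
    """Build (but do not send) a POST that deletes all documents and commits."""
    return urllib.request.Request(
        url=f"{solr_base}/update?commit=true",
        data=DELETE_ALL_XML.encode("utf-8"),
        headers={"Content-Type": "text/xml"},  # assumed content type for XML updates
        method="POST",
    )
```

To actually empty the index, pass the request to urllib.request.urlopen while the Solr server is running. Building the request separately makes it easy to inspect before anything is deleted.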
  4. Set up Nutch
     ● Now we will do something more complex
     ● Web scrape a url that has parameters i.e.
       – http://<site>/<function>?var1=val1&var2=val2
     ● This web scrape will
       – Have extra url characters '?=&'
       – Need greater search depth
       – Need better url filtering
     ● Remember that you need to get permission to scrape a third party web site
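To see what the extra characters '?', '=' and '&' mean in such a url, the sketch below splits a parameterised url into its path and its parameters using Python's standard library. The example.com address and the function name split_parameterised_url are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def split_parameterised_url(url: str):
    """Return (path, params) where params maps each name to its list of values."""
    parts = urlsplit(url)
    # '?' starts the query string, '&' separates pairs, '=' separates name and value.
    return parts.path, parse_qs(parts.query)


if __name__ == "__main__":
    path, params = split_parameterised_url(
        "http://example.com/function?var1=val1&var2=val2"
    )
    print(path)    # /function
    print(params)  # {'var1': ['val1'], 'var2': ['val2']}
```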
  5. Nutch Configuration
     ● Change the seed file for Nutch
       – apache-nutch-1.6/urls/seed.txt
     ● In this instance I will use a url of the form
       – http://somesite.co.nz/Search?DateRange=7&industry=62
       – ( this is not a real url, just an example )
     ● Change the conf/regex-urlfilter.txt entries i.e.
       – # skip URLs containing certain characters
       – -[*!@]
       – # accept anything else
       – +^http://([a-z0-9]*\.)*somesite\.co\.nz/Search
     ● This will only consider somesite Search urls
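The effect of these filter entries can be checked before crawling. The sketch below is a rough Python emulation, not Nutch code. It assumes that rules are tried in file order, that the first rule whose pattern is found anywhere in the url decides the outcome, and that a url matching no rule is dropped. The dots in the host name are escaped here so that they match literal dots. The names RULES and accepts are hypothetical.

```python
import re

# (sign, pattern) pairs in file order, mirroring the two entries on the slide.
RULES = [
    ("-", r"[*!@]"),
    ("+", r"^http://([a-z0-9]*\.)*somesite\.co\.nz/Search"),
]


def accepts(url: str, rules=RULES) -> bool:
    """Return True if the first rule that matches the url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the url is dropped


if __name__ == "__main__":
    for candidate in (
        "http://somesite.co.nz/Search?DateRange=7&industry=62",
        "http://somesite.co.nz/About",
    ):
        print(candidate, accepts(candidate))
```

Because '?' and '=' are not in the skip list, the parameterised Search url passes the first rule and is then accepted by the second one.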
  6. Run Nutch
     ● Now run nutch using start script
       – cd apache-nutch-1.6 ; ./nutch_start.bash
     ● Monitor for errors in solr admin log window
     ● The Nutch crawl should end with
       – crawl finished: crawl
  7. Checking Data
     ● Data should have been indexed in Solr
     ● In Solr Admin window
       – Set 'Core Selector' = collection1
       – Click 'Query'
       – In Query window set fl field = url
       – Click Execute Query
     ● The result ( next ) shows the filtered list of urls in Solr
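The same check can be made without the admin window by calling the core's select handler. The sketch below is a minimal Python version, assuming the core is collection1 on localhost:8983 and that the JSON response has the usual response/docs layout. The function names build_select_url and extract_urls are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_select_url(
    core: str = "collection1",
    solr_base: str = "http://localhost:8983/solr",
) -> str:
    """Build a query url that matches all documents and returns only the url field."""
    params = urlencode({"q": "*:*", "fl": "url", "wt": "json"})
    return f"{solr_base}/{core}/select?{params}"


def extract_urls(response_text: str) -> list:
    """Pull the url field out of each document in a Solr JSON response."""
    body = json.loads(response_text)
    return [doc["url"] for doc in body["response"]["docs"] if "url" in doc]
```

Fetching build_select_url() with urllib.request.urlopen while Solr is running, and passing the decoded body to extract_urls, should give the same filtered list of urls that the Query window shows.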
  8. Checking Data
     ● ( slide image: Solr query result showing the filtered list of urls )
  9. Results
     ● Congratulations you have completed your second crawl
       – With parameterised urls
       – More complex url filtering
       – With a Solr Query search
  10. Contact Us
      ● Feel free to contact us at
        – www.semtech-solutions.co.nz
        – info@semtech-solutions.co.nz
      ● We offer IT project consultancy
      ● We are happy to hear about your problems
      ● You can just pay for those hours that you need to solve your problems