Web Crawling and Data Gathering with Apache Nutch


Apache Nutch Presentation by Steve Watt at Data Day Austin 2011



1. Apache Nutch: Web Crawling and Data Gathering
   Steve Watt (@wattsteve), IBM Big Data Lead
   Data Day Austin
2. Topics
   - Introduction
   - The Big Data Analytics Ecosystem
   - Load Tooling
   - How is Crawl Data Being Used?
   - Web Crawling: Considerations
   - Apache Nutch Overview
   - Apache Nutch Crawl Lifecycle, Setup, and Demos
3. The Offline (Analytics) Big Data Ecosystem
   [Diagram: Load Tooling brings Web Content and Your Content into Hadoop; Data Catalogs, Analytics Tooling, and Export Tooling cover the Find, Analyze, Visualize, and Consume stages]
4. Load Tooling: Data Gathering Patterns and Enablers
   - Web Content
     - Downloading: Amazon Public Datasets / Infochimps
     - Stream harvesting: Collecta / roll your own with Twitter4J (see the sketch after this slide)
     - API harvesting: roll your own (e.g. Facebook REST query)
     - Web crawling: Nutch
   - Your Content
     - Copy from filesystem
     - Load from database: Sqoop
     - Event collection frameworks: Scribe and Flume
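   As a concrete illustration of the "roll your own" stream harvesting item above, here is a minimal Twitter4J sketch. It assumes OAuth credentials are supplied via a twitter4j.properties file on the classpath, and the println stands in for whatever storage (e.g. HDFS) a real harvester would write to:

      import twitter4j.Status;
      import twitter4j.StatusAdapter;
      import twitter4j.TwitterStream;
      import twitter4j.TwitterStreamFactory;

      public class StreamHarvester {
          public static void main(String[] args) {
              // Credentials are read from twitter4j.properties on the classpath
              TwitterStream stream = new TwitterStreamFactory().getInstance();
              // StatusAdapter supplies no-op implementations of the other callbacks
              stream.addListener(new StatusAdapter() {
                  @Override
                  public void onStatus(Status status) {
                      // Stand-in for a write to HDFS or local storage
                      System.out.println(status.getUser().getScreenName()
                              + ": " + status.getText());
                  }
              });
              stream.sample(); // start consuming the public sample stream
          }
      }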
5. How Is Crawl Data Being Used?
   - Build your own search engine
     - Built-in Lucene indexes for querying
     - Solr integration for multi-faceted search
   - Analytics
     - Selective filtering and extraction with data from a single provider
     - Joining datasets from multiple providers for further analytics
     - Event portal example
     - Is Austin really a startup town?
   - Extension of the mashup paradigm: "Content providers cannot predict how their data will be re-purposed"
6. Web Crawling: Considerations
   - robots.txt (an illustrative example follows this slide)
   - Facebook's lawsuit against an API harvester
   - "No crawling without written approval" in Mint.com's Terms of Use
   - What if the web had as many crawlers as Apache web servers?
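   For reference, robots.txt is a plain-text policy file served from a site's root that well-behaved crawlers (Nutch included) fetch and honor before crawling. An illustrative example; the paths and delay are invented:

      User-agent: *
      Crawl-delay: 5
      Disallow: /private/
      Disallow: /search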
7. Apache Nutch: What Is It?
   - Apache Nutch project: nutch.apache.org
     - Hadoop + web crawler + Lucene
   - A Hadoop-based web crawler? How does that work?
8. Apache Nutch Overview
   - Seeds and crawl filters
   - Crawl depths
   - Fetch lists and partitioning
   - Segments: segment reading using Hadoop
   - Indexing / Lucene
   - Web application for querying
9. Apache Nutch: Web Application
10. Crawl Lifecycle
    [Diagram: Inject -> Generate -> Fetch -> Update CrawlDB -> Invert LinkDB -> Index -> Dedup -> Merge; a command-line sketch of one pass follows]
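    Each lifecycle stage maps to a bin/nutch command. A sketch of a single pass, following the standard Nutch 1.x tutorial (the crawl/ directory layout is the conventional one; exact usage can vary by version):

       bin/nutch inject crawl/crawldb urls                     # seed the CrawlDB
       bin/nutch generate crawl/crawldb crawl/segments         # produce a fetch list in a new segment
       s=`ls -d crawl/segments/* | tail -1`                    # the segment just generated
       bin/nutch fetch $s                                      # fetch the pages
       bin/nutch updatedb crawl/crawldb $s                     # fold fetch results back into the CrawlDB
       bin/nutch invertlinks crawl/linkdb -dir crawl/segments  # build the LinkDB
       bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
       bin/nutch dedup crawl/indexes                           # remove duplicate documents
       bin/nutch merge crawl/index crawl/indexes               # merge into a single Lucene index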
11. Single Process Web Crawling
12. Single Process Web Crawling
    - Create the seed file and copy it into a "urls" directory
    - Export JAVA_HOME
    - Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (usually by domain)
    - Edit conf/nutch-site.xml and specify an http.agent.name (both edits are sketched after this slide)
    - bin/nutch crawl urls -dir crawl -depth 2
    - DEMO
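    To make the two config edits concrete, here is roughly what they look like; the domain and agent name are placeholders:

       # conf/crawl-urlfilter.txt: keep URLs in the target domain, drop everything else
       +^http://([a-z0-9]*\.)*example.org/
       -.

       <!-- conf/nutch-site.xml: identify your crawler to the sites it visits -->
       <configuration>
         <property>
           <name>http.agent.name</name>
           <value>MyTestCrawler</value>
         </property>
       </configuration>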
13. Distributed Web Crawling
14. Distributed Web Crawling
    - The Nutch distribution is overkill if you already have a Hadoop cluster. It's also not how you would really integrate with Hadoop these days, but there is some history to consider. The Nutch wiki documents the distributed setup.
    - Why orchestrate your crawl?
    - How? (the steps are sketched as commands after this slide)
      - Create the seed file and copy it into a "urls" directory, then copy the directory up to HDFS
      - Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (usually by domain)
      - Copy conf/nutch-site.xml, conf/nutch-default.xml, and conf/crawl-urlfilter.txt to the Hadoop conf directory
      - Restart Hadoop so the new files are picked up on the classpath
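    A sketch of those steps as shell commands. NUTCH_HOME and HADOOP_HOME are assumed to point at the respective installs, and the seed URL is a placeholder:

       mkdir urls
       echo "http://example.org/" > urls/seed.txt   # seed file
       bin/hadoop fs -put urls urls                 # copy the urls directory up to HDFS
       cp $NUTCH_HOME/conf/nutch-site.xml \
          $NUTCH_HOME/conf/nutch-default.xml \
          $NUTCH_HOME/conf/crawl-urlfilter.txt $HADOOP_HOME/conf/
       $HADOOP_HOME/bin/stop-all.sh && $HADOOP_HOME/bin/start-all.sh   # restart to pick up the conf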
15. Distributed Web Crawling
    - Code review: org.apache.nutch.crawl.Crawl
    - Orchestrated crawl example (step 1, Inject):
      bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls
    - DEMO (the steps after Inject are sketched below)
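    The later stages can be driven the same way. A sketch of the next steps (class names are from the Nutch 1.2 source; the segment name reuses the example from the following slides, and exact arguments may differ by version):

       bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments
       bin/hadoop jar nutch-1.2.0.job org.apache.nutch.fetcher.Fetcher crawl/segments/20110128170617
       bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.CrawlDb crawl/crawldb crawl/segments/20110128170617
       bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.LinkDb crawl/linkdb -dir crawl/segments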
16. Segment Reading
17. Segment Readers
    - The SegmentReader class is not all that useful, but here it is anyway:
      - bin/nutch readseg -list crawl/segments/20110128170617
      - bin/nutch readseg -dump crawl/segments/20110128170617 dumpdir
    - What you really want to do is process each crawled page as an individual record in MapReduce
      - SequenceFileInputFormat over Nutch HDFS segments FTW
      - The RecordReader returns Content objects as values
    - Code walkthrough (a sketch follows this slide)
    - DEMO
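    A minimal sketch of that kind of job, using the old mapred API from the Hadoop 0.20 / Nutch 1.2 era. The class name and the URL-to-content-type output are illustrative; the essential parts are the SequenceFileInputFormat pointed at a segment's content directory and the Content value handed to each map() call:

       import java.io.IOException;
       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapred.*;
       import org.apache.nutch.protocol.Content;

       public class SegmentContentReader {
           // Each map() call receives one crawled page: the URL as the key and
           // the fetched Content (bytes, headers, content type) as the value.
           public static class PageMapper extends MapReduceBase
                   implements Mapper<Text, Content, Text, Text> {
               public void map(Text url, Content content,
                       OutputCollector<Text, Text> out, Reporter reporter)
                       throws IOException {
                   // Stand-in for real filtering/extraction logic
                   out.collect(url, new Text(content.getContentType()));
               }
           }

           public static void main(String[] args) throws Exception {
               JobConf job = new JobConf(SegmentContentReader.class);
               job.setJobName("read-nutch-segment");
               // args[0]: e.g. crawl/segments/20110128170617/content
               FileInputFormat.addInputPath(job, new Path(args[0]));
               FileOutputFormat.setOutputPath(job, new Path(args[1]));
               job.setInputFormat(SequenceFileInputFormat.class);
               job.setMapperClass(PageMapper.class);
               job.setNumReduceTasks(0); // map-only: one output record per page
               job.setOutputKeyClass(Text.class);
               job.setOutputValueClass(Text.class);
               JobClient.runJob(job);
           }
       }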
18. Thanks
    - Questions?
    - Steve Watt - [email_address]
    - Twitter: @wattsteve
    - Blog: stevewatt.blogspot.com
    - austinhug.blogspot.com