Web Crawling and Data Gathering with Apache Nutch
Apache Nutch Presentation by Steve Watt at Data Day Austin 2011


  • 1. Apache Nutch – Web Crawling and Data Gathering. Steve Watt (@wattsteve), IBM Big Data Lead, Data Day Austin
  • 2. Topics
    • Introduction
    • The Big Data Analytics Ecosystem
    • Load Tooling
    • How is crawl data being used?
    • Web Crawling - Considerations
    • Apache Nutch Overview
    • Apache Nutch Crawl Lifecycle, Setup and Demos
  • 3. The Offline (Analytics) Big Data Ecosystem (diagram: Web Content and Your Content feed Load Tooling into Hadoop; Data Catalogs, Analytics Tooling, and Export Tooling support Find, Analyze, Visualize, Consume)
  • 4. Load Tooling - Data Gathering Patterns and Enablers
    • Web Content
      • Downloading – Amazon Public DataSets / InfoChimps
      • Stream Harvesting – Collecta / Roll-your-own (Twitter4J)
      • API Harvesting – Roll your own (Facebook REST Query)
      • Web Crawling – Nutch
    • Your Content
      • Copy from FileSystem
      • Load from Database - SQOOP
      • Event Collection Frameworks - Scribe and Flume
  • 5. How is crawl data being used?
    • Build your own search engine
      • Built-in Lucene indexes for querying
      • Solr integration for Multi-faceted search
    • Analytics
      • Selective filtering and extraction with data from a single provider
      • Joining datasets from multiple providers for further analytics
      • Event Portal Example
      • Is Austin really a startup town?
    • Extension of the mashup paradigm - “Content Providers cannot predict how their data will be re-purposed”
  • 6. Web Crawling - Considerations
    • Robots.txt
    • Facebook lawsuit against API Harvester
    • “No crawling without written approval” in Terms of Use
    • What if the web had as many crawlers as Apache Web Servers?
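Politeness is the first consideration above: before fetching, a well-behaved crawler (Nutch included) checks the site's robots.txt and honors its rules. A hypothetical robots.txt for a site, with illustrative paths and agent names:

```
# Hypothetical http://example.com/robots.txt
User-agent: *
Disallow: /private/
Crawl-delay: 5

# Ban one specific crawler (matched against its http.agent.name) entirely
User-agent: MyTestCrawler
Disallow: /
```

Note that robots.txt is advisory: ignoring it is what turns crawling into the Terms-of-Use and lawsuit territory the slide warns about.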
  • 7. Apache Nutch – What is it?
    • Apache Nutch Project –
      • Hadoop + Web Crawler + Lucene
    • A Hadoop-based web crawler? How does that work?
  • 8. Apache Nutch Overview
    • Seeds and Crawl Filters
    • Crawl Depths
    • Fetch Lists and Partitioning
    • Segments - Segment Reading using Hadoop
    • Indexing / Lucene
    • Web Application for Querying
  • 9. Apache Nutch - Web Application
  • 10. Crawl Lifecycle (diagram): Inject → Generate → Fetch → Update CrawlDB → LinkDB → Index → Dedup → Merge
  • 11. Single Process Web Crawling
  • 12. Single Process Web Crawling
    • Create the seed file and copy it into a “urls” directory
    • Export JAVA_HOME
    • Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (Usually via domain)
    • Edit the conf/nutch-site.xml and specify an agent name (http.agent.name)
    • bin/nutch crawl urls -dir crawl -depth 2
    • D E M O
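The steps above can be sketched as a shell session, assuming you are in the Nutch install directory; the seed URL (example.com) and agent name (MyTestCrawler) are placeholders:

```shell
# Minimal single-process crawl setup, run from the Nutch install directory.
mkdir -p urls conf

# 1. Seed file: one URL per line.
echo "http://example.com/" > urls/seed.txt

# 2. Constrain the crawl to the seed domain in conf/crawl-urlfilter.txt:
#    lines starting with + accept matching URLs; the final -. rejects the rest.
cat > conf/crawl-urlfilter.txt <<'EOF'
+^http://([a-z0-9]*\.)*example.com/
-.
EOF

# 3. Nutch refuses to fetch without an agent name in conf/nutch-site.xml.
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>
EOF

# 4. With JAVA_HOME exported, kick off a depth-2 crawl (requires a Nutch
#    install, so shown here rather than executed):
#    export JAVA_HOME=/usr/lib/jvm/default-java
#    bin/nutch crawl urls -dir crawl -depth 2
```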
  • 13. Distributed Web Crawling
  • 14. Distributed Web Crawling
    • The Nutch distribution is overkill if you already have a Hadoop cluster. It's also not how you would really integrate with Hadoop these days, but there is some history to consider. The Nutch wiki has a distributed setup guide.
    • Why orchestrate your crawl?
    • How?
      • Create the seed file and copy it into a “urls” directory. Then copy the directory up to the HDFS
      • Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (Usually via domain)
      • Copy conf/nutch-site.xml, conf/nutch-default.xml, conf/nutch-conf.xml & conf/crawl-urlfilter.txt to the Hadoop conf directory.
      • Restart Hadoop so the new files are picked up in the classpath
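The "How?" steps can be sketched in shell. The commands are guarded so the sketch runs without a live cluster (drop the guards on real Hadoop); the Hadoop conf path is an assumption, and the file list is trimmed to the standard Nutch 1.x config files:

```shell
# Push the seed directory up to HDFS (skipped when hadoop is not installed).
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -put urls urls
fi

# Copy the Nutch config files into Hadoop's conf directory so they land on
# the job classpath; Hadoop must then be restarted to pick them up.
HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/opt/hadoop/conf}   # assumed location
if [ -d "$HADOOP_CONF_DIR" ]; then
  cp conf/nutch-site.xml conf/nutch-default.xml conf/crawl-urlfilter.txt \
     "$HADOOP_CONF_DIR/"
fi
```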
  • 15. Distributed Web Crawling
    • Code Review: org.apache.nutch.crawl.Crawl
    • Orchestrated Crawl Example (Step 1 - Inject):
    • bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls
    • D E M O
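The Injector step above is the first of several per-step Hadoop jobs; orchestrating them yourself is what replaces the monolithic `bin/nutch crawl`. A sketch of the sequence follows, with class names as shipped in Nutch 1.2 (verify against your version); `run` merely echoes each command so the sequence can be read without a cluster, and SEGMENT stands for the newest timestamped directory under crawl/segments:

```shell
# Echo each command instead of executing it; on a real cluster,
# replace run with direct invocation.
run() { echo "$@"; }
NUTCH_JOB=nutch-1.2.0.job

run bin/hadoop jar $NUTCH_JOB org.apache.nutch.crawl.Injector   crawl/crawldb urls
run bin/hadoop jar $NUTCH_JOB org.apache.nutch.crawl.Generator  crawl/crawldb crawl/segments
run bin/hadoop jar $NUTCH_JOB org.apache.nutch.fetcher.Fetcher  crawl/segments/SEGMENT
run bin/hadoop jar $NUTCH_JOB org.apache.nutch.parse.ParseSegment crawl/segments/SEGMENT
run bin/hadoop jar $NUTCH_JOB org.apache.nutch.crawl.CrawlDb    crawl/crawldb crawl/segments/SEGMENT
run bin/hadoop jar $NUTCH_JOB org.apache.nutch.crawl.LinkDb     crawl/linkdb  crawl/segments/SEGMENT
```

Generate/Fetch/Parse/Update then repeat once per crawl depth, which is exactly the loop the Crawl class drives.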
  • 16. Segment Reading
  • 17. Segment Readers
    • The SegmentReader class is not all that useful. But here it is anyway:
      • bin/nutch readseg -list crawl/segments/20110128170617
      • bin/nutch readseg -dump crawl/segments/20110128170617 dumpdir
    • What you really want to do is process each crawled page in M/R as an individual record
      • SequenceFileInputFormatters over Nutch HDFS Segments FTW
      • RecordReader returns Content Objects as Value
    • Code Walkthrough
    • D E M O
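For orientation when writing that M/R job, a typical Nutch 1.x segment layout (directory names from the Nutch 1.x series; verify against your crawl output):

```
crawl/segments/20110128170617/
  content/         raw fetched content (the Content values your RecordReader emits)
  crawl_generate/  the fetch list this segment was generated from
  crawl_fetch/     fetch status per URL
  crawl_parse/     outlinks used to update the CrawlDB
  parse_data/      parsed metadata (titles, outlinks, headers)
  parse_text/      extracted plain text
```

Each part is stored as Hadoop MapFile/SequenceFile data, which is why pointing a SequenceFileInputFormat at a segment gives you one record per crawled page.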
  • 18. Thanks
    • Questions?
    • Steve Watt - [email_address]
    • Twitter: @wattsteve
    • Blog: