Web Crawling and Data Gathering with Apache Nutch
 

Apache Nutch Presentation by Steve Watt at Data Day Austin 2011


    Presentation Transcript

    • Apache Nutch: Web Crawling and Data Gathering. Steve Watt (@wattsteve), IBM Big Data Lead. Data Day Austin.
    • Topics
      • Introduction
      • The Big Data Analytics Ecosystem
      • Load Tooling
      • How is Crawl data being used?
      • Web Crawling - Considerations
      • Apache Nutch Overview
      • Apache Nutch Crawl Lifecycle, Setup and Demos
    • The Offline (Analytics) Big Data Ecosystem (diagram): Web Content and Your Content flow through Load Tooling into Hadoop; Data Catalogs, Analytics Tooling and Export Tooling support the Find, Analyze, Visualize and Consume steps
    • Load Tooling - Data Gathering Patterns and Enablers
      • Web Content
        • Downloading – Amazon Public DataSets / InfoChimps
        • Stream Harvesting – Collecta / Roll-your-own (Twitter4J)
        • API Harvesting – Roll your own (Facebook REST Query)
        • Web Crawling – Nutch
      • Your Content
        • Copy from FileSystem
        • Load from Database - SQOOP
        • Event Collection Frameworks - Scribe and Flume
    • How is Crawl data being used?
      • Build your own search engine
        • Built-in Lucene indexes for querying
        • Solr integration for Multi-faceted search
      • Analytics
        • Selective filtering and extraction with data from a single provider
        • Joining datasets from multiple providers for further analytics
        • Event Portal Example
        • Is Austin really a startup town?
      • Extension of the mashup paradigm - “Content Providers cannot predict how their data will be re-purposed”
    • Web Crawling - Considerations
      • Robots.txt
      • Facebook lawsuit against API Harvester
      • “No Crawling without written approval” in Mint.com Terms of Use
      • What if the web had as many crawlers as Apache Web Servers?
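      For context on the robots.txt point above: before fetching pages from a site, a well-behaved crawler (Nutch included) first requests /robots.txt and honors its rules. A minimal, hypothetical example (the paths and the Crawl-delay value are placeholders, not from the talk):

        User-agent: *
        Disallow: /private/
        Crawl-delay: 10

      Nutch identifies itself by the http.agent.name you configure, so sites can allow or block it by that name.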
    • Apache Nutch – What is it?
      • Apache Nutch Project – nutch.apache.org
        • Hadoop + Web Crawler + Lucene
      • A Hadoop-based web crawler? How does that work?
    • Apache Nutch Overview
      • Seeds and Crawl Filters
      • Crawl Depths
      • Fetch Lists and Partitioning
      • Segments - Segment Reading using Hadoop
      • Indexing / Lucene
      • Web Application for Querying
    • Apache Nutch - Web Application
    • Crawl Lifecycle (diagram): Inject → Generate → Fetch → Update CrawlDB → (repeat per depth) → LinkDB → Index → Dedup → Merge
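    A rough sketch of how those lifecycle phases map onto individual Nutch 1.2 sub-commands (paths are placeholders; the all-in-one bin/nutch crawl command shown later drives the same steps):

      bin/nutch inject crawl/crawldb urls
      bin/nutch generate crawl/crawldb crawl/segments
      bin/nutch fetch crawl/segments/<timestamp>
      bin/nutch updatedb crawl/crawldb crawl/segments/<timestamp>
      bin/nutch invertlinks crawl/linkdb -dir crawl/segments
      bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

    The generate/fetch/updatedb steps repeat once per crawl depth before link inversion and indexing (add bin/nutch parse after fetch if the fetcher is not configured to parse); dedup and merge then run over the resulting Lucene indexes.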
    • Single Process Web Crawling
    • Single Process Web Crawling
      • Create the seed file and copy it into a “urls” directory
      • Export JAVA_HOME
      • Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (Usually via domain)
      • Edit the conf/nutch-site.xml and specify an http.agent.name (both config edits are sketched after this slide)
      • bin/nutch crawl urls -dir crawl -depth 2
      • D E M O
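      A minimal sketch of the two config edits above, assuming the crawl is being constrained to a hypothetical example.com domain (the agent name is likewise a placeholder):

        # conf/crawl-urlfilter.txt – accept only URLs inside the target domain
        +^http://([a-z0-9]*\.)*example.com/

        <!-- conf/nutch-site.xml – Nutch refuses to fetch without an http.agent.name -->
        <configuration>
          <property>
            <name>http.agent.name</name>
            <value>my-test-crawler</value>
          </property>
        </configuration>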
    • Distributed Web Crawling
    • Distributed Web Crawling
      • The Nutch distribution is overkill if you already have a Hadoop cluster. It's also not how you would really integrate with Hadoop these days, but there is some history to consider. The Nutch wiki has a distributed setup guide.
      • Why orchestrate your crawl?
      • How?
        • Create the seed file and copy it into a “urls” directory, then copy the directory up to HDFS (commands sketched after this list)
        • Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (Usually via domain)
        • Copy conf/nutch-site.xml, conf/nutch-default.xml, conf/nutch-conf.xml and conf/crawl-urlfilter.txt to the Hadoop conf directory.
        • Restart Hadoop so the new files are picked up in the classpath
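      Roughly, the copy and restart steps above look like the following, assuming NUTCH_HOME and HADOOP_HOME point at the two install directories (those variable names are just for illustration):

        bin/hadoop fs -put urls urls
        cp $NUTCH_HOME/conf/nutch-*.xml $NUTCH_HOME/conf/crawl-urlfilter.txt $HADOOP_HOME/conf/
        bin/stop-all.sh && bin/start-all.sh    # restart so the new conf files land on the classpath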
    • Distributed Web Crawling
      • Code Review: org.apache.nutch.crawl.Crawl
      • Orchestrated Crawl Example (Step 1 - Inject):
      • bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls
      • D E M O
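      The remaining phases can be driven the same way as the Injector above. A rough sketch with placeholder segment paths – the exact arguments are best checked against the org.apache.nutch.crawl.Crawl source reviewed above:

        bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments
        bin/hadoop jar nutch-1.2.0.job org.apache.nutch.fetcher.Fetcher crawl/segments/<timestamp>
        bin/hadoop jar nutch-1.2.0.job org.apache.nutch.parse.ParseSegment crawl/segments/<timestamp>
        bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.CrawlDb crawl/crawldb crawl/segments/<timestamp>
        bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.LinkDb crawl/linkdb -dir crawl/segments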
    • Segment Reading
    • Segment Readers
      • The SegmentReader class is not all that useful. But here it is anyway:
        • bin/nutch readseg -list crawl/segments/20110128170617
        • bin/nutch readseg -dump crawl/segments/20110128170617 dumpdir
      • What you really want to do is process each crawled page in M/R as an individual record (see the sketch after this slide)
        • SequenceFileInputFormatters over Nutch HDFS Segments FTW
        • RecordReader returns Content Objects as Value
      • Code Walkthrough
      • D E M O
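      A minimal sketch of that SequenceFileInputFormat approach, assuming the Hadoop 0.20 "mapred" API that Nutch 1.2 uses; the class name, output format and emitted fields are illustrative, not the exact code from the demo. Run it with bin/hadoop jar against a jar that bundles the Nutch classes so org.apache.nutch.protocol.Content resolves:

        import java.io.IOException;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.*;
        import org.apache.nutch.protocol.Content;

        // Map-only job: each input record is one fetched page from a Nutch segment,
        // keyed by URL, with an org.apache.nutch.protocol.Content value.
        public class SegmentContentMapper extends MapReduceBase
            implements Mapper<Text, Content, Text, Text> {

          public void map(Text url, Content content,
                          OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            // Content carries the raw fetched bytes plus headers; emit mime type and size as an example.
            out.collect(url, new Text(content.getContentType() + "\t" + content.getContent().length));
          }

          public static void main(String[] args) throws IOException {
            JobConf job = new JobConf(SegmentContentMapper.class);
            job.setJobName("read-nutch-segment");
            // args[0] = a segment dir, e.g. crawl/segments/20110128170617; its "content"
            // subdirectory holds <Text, Content> records that SequenceFileInputFormat can read.
            FileInputFormat.addInputPath(job, new Path(args[0], "content"));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setInputFormat(SequenceFileInputFormat.class);
            job.setMapperClass(SegmentContentMapper.class);
            job.setNumReduceTasks(0);  // map-only: one output line per crawled page
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            JobClient.runJob(job);
          }
        }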
    • Thanks
      • Questions?
      • Steve Watt - [email_address]
      • Twitter: @wattsteve
      • Blog: stevewatt.blogspot.com
      • austinhug.blogspot.com