Web Scale Crawling with Apache Nutch
Speaker notes
  • I'll be talking about large-scale document processing and, more specifically, about Behemoth, an open source project based on Hadoop.
  • A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DigitalPebble is ... The main projects I am involved in are …
  • Note that I mention crawling and not web search -> Nutch is used not only for search. It used to do indexing and search using Lucene but now delegates this to SOLR.
  • Main steps in Nutch; more actions are available as shell wrappers around Hadoop commands.
  • Endpoints are called in various places: URL filters and normalisers in a lot of places, and the same for scoring filters.
  • Fetcher: multithreaded but polite.
  • Writable object – CrawlDatum.
  • What does this mean for Nutch?

Web Scale Crawling with Apache Nutch – Presentation Transcript

  • Web Scale Crawling with Apache Nutch – Julien Nioche, [email_address] – Berlin Buzzwords, 08/06/11
  • DigitalPebble Ltd
    • Based in Bristol (UK)
    • Specialised in Text Engineering
      • Web Crawling
      • Natural Language Processing
      • Information Retrieval
      • Data Mining
    • Strong focus on Open Source & Apache ecosystem
    • User | Contributor | Committer
      • Nutch , SOLR, Lucene
      • Tika
      • GATE, UIMA
      • Mahout
      • Behemoth
  • Outline
    • Overview
    • Features
    • Data Structures
    • Use cases
    • What's new in Nutch 1.3
    • Nutch 2.0
    • GORA
    • Conclusion
  • Nutch?
    • “ Distributed framework for large scale web crawling”
      • but does not have to be large scale at all
      • or even on the web (file-protocol)
    • Based on Apache Hadoop
    • Indexing and Search
    • Open Source – Apache 2.0 License
  • Short history
    • 2002/2003 : Started by Doug Cutting & Mike Cafarella
    • 2004 : sub-project of Lucene @Apache
    • 2005 : MapReduce implementation in Nutch
      • 2006 : Hadoop sub-project of Lucene @Apache
    • 2006/7 : Parser and MimeType in Tika
      • 2008 : Tika sub-project of Lucene @Apache
    • May 2010 : TLP project at Apache
    • June 2011 (?) : Nutch 1.3
    • Q4 2011 (?) : Nutch 2.0
  • In a Nutch Shell (1.3)
    • Inject -> populates CrawlDB from seed list
    • Generate -> Selects URLs to fetch into a segment
    • Fetch -> Fetches URLs from segment
    • Parse -> Parses content (text + metadata)
    • UpdateDB -> Updates CrawlDB (new URLs, new status...)
    • InvertLinks -> Builds the web graph
    • SOLRIndex -> Send docs to SOLR
    • SOLRDedup -> Remove duplicate docs based on signature
    • Step by step: repeat steps 2 to 8
    • Or use the all-in-one ' nutch crawl ' command
  • Frontier expansion
    • Manual “discovery”
      • Adding new URLs by hand, “seeding”
    • Automatic discovery of new resources (frontier expansion)
      • Not all outlinks are equally useful -> needs control
      • Requires content parsing and link extraction
    [Diagram: frontier expansion from the seed over iterations i = 1, 2, 3 – slide courtesy of A. Bialecki]
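The expansion loop above can be sketched as a breadth-first traversal over a toy link graph. The class, method, and URL names here are illustrative only, not Nutch's actual API; the real cycle runs as MapReduce jobs over the CrawlDB and segments.

```java
import java.util.*;

// Toy breadth-first frontier expansion: start from a seed list and
// discover new URLs iteration by iteration, as Nutch's
// generate/fetch/update cycle does at scale. Illustrative only.
public class FrontierExpansion {

    // Expand the frontier for a given number of iterations.
    public static Set<String> expand(Map<String, List<String>> outlinks,
                                     List<String> seeds, int iterations) {
        Set<String> known = new LinkedHashSet<>(seeds);
        List<String> frontier = new ArrayList<>(seeds);
        for (int i = 0; i < iterations && !frontier.isEmpty(); i++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                for (String out : outlinks.getOrDefault(url, List.of())) {
                    if (known.add(out)) {   // only genuinely new URLs
                        next.add(out);
                    }
                }
            }
            frontier = next;                // next round of fetching
        }
        return known;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "http://a", List.of("http://b", "http://c"),
            "http://b", List.of("http://d"),
            "http://d", List.of("http://e"));
        // i=1 discovers b and c, i=2 discovers d; e needs a third round
        System.out.println(expand(graph, List.of("http://a"), 2));
    }
}
```

In the real crawler, the "only genuinely new URLs" check is the UpdateDB step merging discovered outlinks into the CrawlDB, and URL filters decide which outlinks enter the frontier at all.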
  • Outline
    • Overview
    • Features
    • Data Structures
    • Use cases
    • What's new in Nutch 1.3
    • Nutch 2.0
    • GORA
    • Conclusion
  • An extensible framework
    • Endpoints
      • Protocol
      • Parser
      • HtmlParseFilter
      • ScoringFilter (used in various places)
      • URLFilter (ditto)
      • URLNormalizer (ditto)
      • IndexingFilter
    • Plugins
      • Activated with parameter ' plugin.includes'
      • Implement one or more endpoints
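To give a feel for the endpoint idea, here is a simplified sketch of a URLFilter-style plugin and a filter chain. The interface below is a stripped-down stand-in, not Nutch's real org.apache.nutch.net.URLFilter, and the filtering rules are made up for illustration; the real endpoint is similarly small, though (a filter method that returns null to reject a URL).

```java
// Minimal sketch of the URLFilter endpoint idea: each plugin sees a
// candidate URL and either passes it through (possibly rewritten) or
// rejects it by returning null. Simplified stand-in, not Nutch's API.
public class UrlFilterSketch {

    interface URLFilter {
        String filter(String url); // null => drop the URL
    }

    // Example plugin: only keep http(s) URLs and drop image files.
    static class BasicFilter implements URLFilter {
        public String filter(String url) {
            if (!url.startsWith("http://") && !url.startsWith("https://"))
                return null;
            if (url.endsWith(".jpg") || url.endsWith(".png"))
                return null;
            return url;
        }
    }

    // Chain of active filters, as selected via 'plugin.includes'.
    static String applyAll(java.util.List<URLFilter> filters, String url) {
        for (URLFilter f : filters) {
            if (url == null) return null;
            url = f.filter(url);
        }
        return url;
    }

    public static void main(String[] args) {
        java.util.List<URLFilter> filters =
            java.util.List.of(new BasicFilter());
        System.out.println(applyAll(filters, "http://example.com/page"));
        System.out.println(applyAll(filters, "ftp://example.com/file")); // null
    }
}
```

The same pattern (small interface, chain of activated implementations) applies to the other endpoints such as ScoringFilter and IndexingFilter.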
  • Features
    • Fetcher
      • Multi-threaded fetcher
      • Follows robots.txt
      • Groups URLs per hostname / domain / IP
      • Limits the number of URLs per round of fetching
      • Default values are polite but can be made more aggressive
    • Crawl Strategy
      • Breadth-first but can be depth-first
      • Configurable via custom scoring plugins
    • Scoring
      • OPIC (On-line Page Importance Calculation) by default
      • LinkRank
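The politeness described above comes largely from grouping: URLs are bucketed into queues per host (or domain, or IP), each queue is drained sequentially with a delay between requests, and the queues run in parallel across fetcher threads. A minimal sketch of just the grouping step, with illustrative names rather than Nutch's internal classes:

```java
import java.net.URI;
import java.util.*;

// Sketch of the fetcher's politeness grouping: bucket URLs into one
// queue per host so a host is hit by at most one thread at a time,
// while different hosts are fetched in parallel. Illustrative only.
public class FetchQueues {

    // Group a list of URLs into per-host queues.
    public static Map<String, Queue<String>> groupByHost(List<String> urls) {
        Map<String, Queue<String>> queues = new LinkedHashMap<>();
        for (String url : urls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                continue; // skip malformed URLs
            }
            if (host == null) continue;
            queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        return queues;
    }

    public static void main(String[] args) {
        Map<String, Queue<String>> q = groupByHost(List.of(
            "http://a.com/1", "http://b.com/1", "http://a.com/2"));
        System.out.println(q.keySet());          // [a.com, b.com]
        System.out.println(q.get("a.com").size()); // 2
    }
}
```

Making the crawl "more aggressive" amounts to shrinking the per-queue delay or allowing several parallel threads per queue; the defaults stay polite.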
  • Features (cont.)
    • Protocols
      • Http, file, ftp, https
    • Scheduling
      • Specified or adaptive
    • URL filters
      • Regex, FSA (automaton), TLD, domain, prefix, suffix
    • URL normalisers
      • Default, regex
  • Features (cont.)
    • Other plugins
      • CreativeCommons
      • Feeds
      • Language Identification
      • Rel tags
      • Arbitrary Metadata
    • Indexing to SOLR
      • Bespoke schema
    • Parsing with Apache Tika
      • But some legacy parsers as well
  • Outline
    • Overview
    • Features
    • Data Structures
    • Use cases
    • What's new in Nutch 1.3
    • Nutch 2.0
    • GORA
    • Conclusion
  • Data Structures
    • MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
    • CrawlDB => status of known pages
    • Input of : generate - index
    • Output of : inject - update
    CrawlDB MapFile : <Text,CrawlDatum>
      byte status;        // fetched? unfetched? failed? redir?
      long fetchTime;
      byte retries;
      int fetchInterval;
      float score = 1.0f;
      byte[] signature = null;
      long modifiedTime;
      org.apache.hadoop.io.MapWritable metaData;
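As a rough plain-Java analogue of the per-URL record above: the real CrawlDatum implements Hadoop's Writable interface for serialization, and the status constants and default interval below are illustrative stand-ins, not Nutch's actual values.

```java
// Plain-Java analogue of the CrawlDatum record stored per URL in the
// CrawlDB. The real class is a Hadoop Writable; status constants and
// the default fetch interval here are illustrative only.
public class CrawlDatumSketch {
    static final byte STATUS_UNFETCHED = 0;
    static final byte STATUS_FETCHED   = 1;

    byte status = STATUS_UNFETCHED;
    long fetchTime;                       // when the URL is next due
    byte retries;
    int fetchInterval = 30 * 24 * 3600;   // seconds between re-fetches
    float score = 1.0f;
    byte[] signature = null;              // content digest, for dedup
    long modifiedTime;

    // What the UpdateDB step does conceptually after a successful
    // fetch: mark as fetched and schedule the next fetch one interval
    // away.
    void markFetched(long nowMillis) {
        status = STATUS_FETCHED;
        fetchTime = nowMillis + fetchInterval * 1000L;
        retries = 0;
    }

    public static void main(String[] args) {
        CrawlDatumSketch d = new CrawlDatumSketch();
        d.markFetched(0L);
        System.out.println(d.status == STATUS_FETCHED); // true
    }
}
```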
  • Data Structures 2
    • Segment => round of fetching
    • Identified by a timestamp
    Segment
      /crawl_generate/ -> SequenceFile<Text,CrawlDatum>
      /crawl_fetch/    -> MapFile<Text,CrawlDatum>
      /content/        -> MapFile<Text,Content>
      /crawl_parse/    -> SequenceFile<Text,CrawlDatum>
      /parse_data/     -> MapFile<Text,ParseData>
      /parse_text/     -> MapFile<Text,ParseText>
    • Can have multiple versions of a page in different segments
  • Data Structures – 3
    • LinkDB => storage for the web graph
    • Output of : invertlinks
    • Input of : SOLRIndex
    LinkDB MapFile : <Text,Inlinks>
      Inlinks : HashSet<Inlink>
      Inlink : String fromUrl, String anchor
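Building the LinkDB is a link inversion: per-page outlinks (source -> targets) become per-page inlinks (target -> sources, with anchor text). In Nutch the InvertLinks step runs this as a MapReduce job; the toy in-memory version below uses made-up names to show the shape of the transformation.

```java
import java.util.*;

// Toy link inversion, as done conceptually by the InvertLinks step:
// turn outlinks (source -> target) into inlinks (target -> sources),
// keeping the anchor text. In-memory sketch, not the MapReduce job.
public class InvertLinksSketch {

    static class Outlink {
        final String from, to, anchor;
        Outlink(String from, String to, String anchor) {
            this.from = from; this.to = to; this.anchor = anchor;
        }
    }

    // Map each target URL to its list of (fromUrl, anchor) pairs.
    public static Map<String, List<String[]>> invert(List<Outlink> links) {
        Map<String, List<String[]>> inlinks = new LinkedHashMap<>();
        for (Outlink l : links) {
            inlinks.computeIfAbsent(l.to, k -> new ArrayList<>())
                   .add(new String[] { l.from, l.anchor });
        }
        return inlinks;
    }

    public static void main(String[] args) {
        List<Outlink> links = List.of(
            new Outlink("http://a", "http://c", "see C"),
            new Outlink("http://b", "http://c", "C here"));
        System.out.println(invert(links).get("http://c").size()); // 2
    }
}
```

The inverted view is what the indexing step consumes: anchor text pointing at a page is often a better description of it than the page itself.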
  • Outline
    • Overview
    • Features
    • Data Structures
    • Use cases
    • What's new in Nutch 1.3
    • Nutch 2.0
    • GORA
    • Conclusion
  • Use cases
    • Crawl for Search Systems
      • Web wide or vertical
      • Single node to large clusters
      • Legacy Lucene-based search or SOLR
    • … but not necessarily
      • NLP (e.g. Sentiment Analysis)
      • ML, Classification / Clustering
      • Data Mining
      • MAHOUT / UIMA / GATE
      • Use Behemoth as glueware ( http://github.com/jnioche/behemoth )
    • SimilarPages.com
      • Large cluster on Amazon EC2 (up to 400 nodes)
      • Fetched & parsed 3 billion pages
      • 10+ billion pages in crawlDB (~100TB data)
      • 200+ million lists of similarities
      • No indexing / search involved
  • Outline
    • Overview
    • Features
    • Data Structures
    • Use cases
    • What's new in Nutch 1.3
    • Nutch 2.0
    • GORA
    • Conclusion
  • NUTCH 1.3
    • Transition between 1.x and 2.0
    • http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/
    • 1.3-RC3 => imminent
    • Removed Lucene-based indexing and search webapp
      • delegate indexing / search remotely to SOLR
      • change of focus : “Web search application” -> “Crawler”
    • Removed deprecated parse plugins
      • delegate most parsing to Tika
    • Separate local / distributed runtimes
    • Ivy-based dependency management
  • NUTCH 2.0
    • Became trunk in 2010
    • Same features as 1.3
      • delegation to SOLR, TIKA, etc...
    • Moved to table-based architecture
      • Wealth of NoSQL projects in last 2 years
    • Preliminary version known as NutchBase (Doğacan Güney)
    • Moved storage layer to subproject in incubator -> GORA
  • GORA
    • http://incubator.apache.org/gora/
    • ORM for NoSQL databases
      • and limited SQL support
    • Serialization with Apache AVRO
    • Object-to-datastore mappings (backend-specific)
    • Backend implementations
      • HBase
      • Cassandra
      • SQL
      • Memory
    • 0.1 released in April 2011
  • GORA (cont.)
    • Atomic operations
      • Get
      • Put
      • Delete
    • Querying
      • Execute
      • deleteByQuery
    • Wrappers for Apache Hadoop
      • GORAInput|OutputFormat
      • GORAMapper|Reducer
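These atomic operations map naturally onto a key-value store interface. Below is a hypothetical in-memory stand-in for the DataStore idea, written to show the shape of get/put/delete/deleteByQuery; it is not GORA's actual API, which is generic over key and persistent-object types and pluggable across backends.

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical in-memory stand-in for the GORA DataStore idea:
// get/put/delete on a key, plus deleteByQuery over a predicate.
// Not GORA's real API, which is generic and backend-pluggable.
public class MemStore<V> {
    private final Map<String, V> rows = new LinkedHashMap<>();

    public V get(String key)          { return rows.get(key); }
    public void put(String key, V v)  { rows.put(key, v); }
    public boolean delete(String key) { return rows.remove(key) != null; }

    // Delete every row whose key matches the "query"; return the count.
    public int deleteByQuery(Predicate<String> query) {
        int before = rows.size();
        rows.keySet().removeIf(query);
        return before - rows.size();
    }

    public int size() { return rows.size(); }

    public static void main(String[] args) {
        MemStore<String> store = new MemStore<>();
        store.put("http://a.com/1", "page A1");
        store.put("http://a.com/2", "page A2");
        store.put("http://b.com/1", "page B1");
        // drop everything from one host in a single query
        System.out.println(store.deleteByQuery(
            k -> k.startsWith("http://a.com"))); // 2
        System.out.println(store.size());        // 1
    }
}
```

Swapping the LinkedHashMap for HBase or Cassandra while keeping this interface is, in spirit, what the backend implementations listed above provide.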
  • Benefits for Nutch
    • Storage still distributed and replicated
    • but one big table
      • status, metadata, content, text -> one place
    • Simplified logic in Nutch
      • Simpler code for updating / merging information
    • More efficient
      • No need to read / write entire structure to update records
      • e.g. update step in 1.x
    • Easier interaction with other resources
      • Third-party code just needs to use GORA and the schema
  • Status Nutch 2.0
    • Beta stage
      • debugging / testing required
    • Compare performance of GORA backends
    • Need to update documentation / WIKI
    • Enthusiasm from community
    • GORA – next great project coming out of Nutch?
  • Future
    • Delegate code to crawler-commons
    • (http://code.google.com/p/crawler-commons/)
      • Fetcher / protocol handling
      • Robots.txt parsing
      • URL normalisation / filtering
    • New functionalities
      • Sitemap
      • Canonical tag
      • More indexers (e.g. ElasticSearch) + pluggable indexers?
    • Definitive move to 2.0?
      • Contribute backends and functionalities to GORA
  • Outline
    • Overview
    • Features
    • Data Structures
    • Use cases
    • What's new in Nutch 1.3
    • Nutch 2.0
    • GORA
    • Conclusion
  • Where to find out more?
    • Project page : http://nutch.apache.org/
    • Wiki : http://wiki.apache.org/nutch/
    • Mailing lists :
      • [email_address]
      • [email_address]
    • Chapter in 'Hadoop the Definitive Guide' (T. White)
      • Understanding Hadoop is essential anyway...
    • Support / consulting :
      • http://wiki.apache.org/nutch/Support
      • [email_address]
  • Questions?