Web Scale Crawling with Apache Nutch

Speaker notes
  • I'll be talking about large-scale document processing, and more specifically about Behemoth, an open source project based on Hadoop.
  • A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DigitalPebble is ... The main projects I am involved in are …
  • Note that I mention crawling and not web search -> Nutch is used not only for search; it used to do indexing and search with Lucene but now delegates this to SOLR.
  • Main steps in Nutch; more actions are available; shell wrappers around Hadoop commands.
  • Endpoints are called in various places: URL filters and normalisers in a lot of places, same for scoring filters.
  • Fetcher: multithreaded but polite.
  • Writable object: CrawlDatum.
  • What does this mean for Nutch?
  • Transcript

    • 1. Web Scale Crawling with Apache Nutch. Julien Nioche [email_address]. Berlin Buzzwords, 08/06/11
    • 2. DigitalPebble Ltd
      • Based in Bristol (UK)
      • Specialised in Text Engineering
        • Web Crawling
        • Natural Language Processing
        • Information Retrieval
        • Data Mining
      • Strong focus on Open Source & Apache ecosystem
      • User | Contributor | Committer
        • Nutch , SOLR, Lucene
        • Tika
        • GATE, UIMA
        • Mahout
        • Behemoth
    • 3. Outline
      • Overview
      • Features
      • Data Structures
      • Use cases
      • What's new in Nutch 1.3
      • Nutch 2.0
      • GORA
      • Conclusion
    • 4. Nutch?
      • “Distributed framework for large scale web crawling”
        • but does not have to be large scale at all
        • or even on the web (file-protocol)
      • Based on Apache Hadoop
      • Indexing and Search
      • Open Source – Apache 2.0 License
    • 5. Short history
      • 2002/2003 : Started by Doug Cutting & Mike Cafarella
      • 2004 : sub-project of Lucene @Apache
      • 2005 : MapReduce implementation in Nutch
        • 2006 : Hadoop sub-project of Lucene @Apache
      • 2006/7 : Parser and MimeType in Tika
        • 2008 : Tika sub-project of Lucene @Apache
      • May 2010 : TLP project at Apache
      • June 2011 (?) : Nutch 1.3
      • Q4 2011 (?) : Nutch 2.0
    • 6. In a Nutch Shell (1.3)
      • Inject -> populates CrawlDB from seed list
      • Generate -> Selects URLs to fetch into a segment
      • Fetch -> Fetches URLs from segment
      • Parse -> Parses content (text + metadata)
      • UpdateDB -> Updates CrawlDB (new URLs, new status...)
      • InvertLinks -> Builds the web graph
      • SOLRIndex -> Send docs to SOLR
      • SOLRDedup -> Remove duplicate docs based on signature
      • Run the steps one by one (see the command sketch below), or use the all-in-one 'nutch crawl' command
      • Repeat steps 2 to 8 for each round of fetching
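      To make the step-by-step workflow above concrete, one round of crawling with Nutch 1.3 looks roughly like the sketch below. The paths (urls, crawl/crawldb, crawl/segments) and the local SOLR URL are illustrative assumptions, not values from the slides; run 'bin/nutch' without arguments to see the exact usage of each command.
        bin/nutch inject crawl/crawldb urls                          # 1. populate the CrawlDB from the seed list
        bin/nutch generate crawl/crawldb crawl/segments -topN 1000   # 2. select URLs to fetch into a new segment
        segment=`ls -d crawl/segments/2* | tail -1`                  #    pick up the newly created segment
        bin/nutch fetch $segment                                     # 3. fetch
        bin/nutch parse $segment                                     # 4. parse content (text + metadata)
        bin/nutch updatedb crawl/crawldb $segment                    # 5. update the CrawlDB (new URLs, new statuses)
        bin/nutch invertlinks crawl/linkdb -dir crawl/segments       # 6. build the web graph
        bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*   # 7. send docs to SOLR
        bin/nutch solrdedup http://localhost:8983/solr/              # 8. remove duplicates based on signature
      Or, as a single command:
        bin/nutch crawl urls -dir crawl -depth 3 -topN 1000 -solr http://localhost:8983/solr/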
    • 7. Frontier expansion
      • Manual “discovery”
        • Adding new URLs by hand, “seeding”
      • Automatic discovery of new resources (frontier expansion)
        • Not all outlinks are equally useful -> needs to be controlled (see the configuration sketch below)
        • Requires content parsing and link extraction
      [Diagram courtesy of A. Bialecki: the frontier expanding outwards from the seed at iterations i = 1, 2, 3]
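      Frontier expansion is typically kept under control with URL filters (later slides) and a couple of CrawlDB properties. A minimal sketch for conf/nutch-site.xml, with illustrative values:
        <property>
          <name>db.ignore.external.links</name>
          <value>true</value>   <!-- only follow outlinks pointing to the same host -->
        </property>
        <property>
          <name>db.max.outlinks.per.page</name>
          <value>100</value>    <!-- cap the number of outlinks kept per page -->
        </property>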
    • 8. Outline
      • Overview
      • Features
      • Data Structures
      • Use cases
      • What's new in Nutch 1.3
      • Nutch 2.0
      • GORA
      • Conclusion
    • 9. An extensible framework
      • Endpoints
        • Protocol
        • Parser
        • HtmlParseFilter
        • ScoringFilter (used in various places)
        • URLFilter (ditto)
        • URLNormalizer (ditto)
        • IndexingFilter
      • Plugins
        • Activated with the parameter 'plugin.includes' (example below)
        • Implement one or more endpoints
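      As an illustration, plugins are enabled by overriding 'plugin.includes' in conf/nutch-site.xml; the value below is a plausible selection rather than the exact default shipped with 1.3:
        <property>
          <name>plugin.includes</name>
          <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>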
    • 10. Features
      • Fetcher
        • Multi-threaded fetcher
        • Follows robots.txt
        • Groups URLs per hostname / domain / IP
        • Limits the number of URLs per round of fetching
        • Default values are polite but can be made more aggressive (see the configuration sketch below)
      • Crawl Strategy
        • Breadth-first but can be depth-first
        • Configurable via custom scoring plugins
      • Scoring
        • OPIC (On-line Page Importance Computation) by default
        • LinkRank
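      The politeness defaults mentioned above live in conf/nutch-default.xml and are overridden in conf/nutch-site.xml. A minimal sketch with illustrative values (check the nutch-default.xml of your version for exact property names and defaults):
        <property>
          <name>fetcher.threads.fetch</name>
          <value>10</value>    <!-- number of fetcher threads -->
        </property>
        <property>
          <name>fetcher.server.delay</name>
          <value>5.0</value>   <!-- seconds to wait between two requests to the same queue -->
        </property>
        <property>
          <name>generate.max.count</name>
          <value>-1</value>    <!-- cap on URLs per host/domain in a generated segment (-1 = no limit) -->
        </property>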
    • 11. Features (cont.)
      • Protocols
        • Http, file, ftp, https
      • Scheduling
        • Specified or adaptive
      • URL filters
        • Regex, FSA, TLD domain, prefix, suffix
      • URL normalisers
        • Default, regex
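      As an illustration of the regex-based URL filter, conf/regex-urlfilter.txt contains one rule per line; the first matching pattern wins, '+' accepts and '-' rejects. The patterns below are an assumed example for a vertical crawl restricted to example.org:
        # skip common binary formats
        -\.(gif|jpg|png|css|js|zip|gz|exe)$
        # stay within example.org
        +^https?://([a-z0-9-]+\.)*example\.org/
        # reject everything else
        -.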
    • 12. Features (cont.)
      • Other plugins
        • CreativeCommons
        • Feeds
        • Language Identification
        • Rel tags
        • Arbitrary Metadata
      • Indexing to SOLR
        • Bespoke schema
      • Parsing with Apache Tika
        • But some legacy parsers as well
    • 13. Outline
      • Overview
      • Features
      • Data Structures
      • Use cases
      • What's new in Nutch 1.3
      • Nutch 2.0
      • GORA
      • Conclusion
    • 14. Data Structures
      • MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
      • CrawlDB => status of known pages
      • Input of: generate, index
      • Output of: inject, update
      CrawlDB MapFile : <Text,CrawlDatum>
        byte status;        // fetched? unfetched? failed? redir?
        long fetchTime;
        byte retries;
        int fetchInterval;
        float score = 1.0f;
        byte[] signature = null;
        long modifiedTime;
        org.apache.hadoop.io.MapWritable metaData;
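      For a concrete look at these records, the CrawlDB can be inspected with the CrawlDB reader; a minimal sketch (the crawl/crawldb path and the URL are assumptions):
        bin/nutch readdb crawl/crawldb -stats                       # counts of pages per status
        bin/nutch readdb crawl/crawldb -url http://example.org/     # print the CrawlDatum of one URL
        bin/nutch readdb crawl/crawldb -dump crawldb_dump           # dump the whole CrawlDB as text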
    • 15. Data Structures 2
      • Segment => round of fetching
      • Identified by a timestamp
      Segment:
        /crawl_generate/ -> SequenceFile<Text,CrawlDatum>
        /crawl_fetch/    -> MapFile<Text,CrawlDatum>
        /content/        -> MapFile<Text,Content>
        /crawl_parse/    -> SequenceFile<Text,CrawlDatum>
        /parse_data/     -> MapFile<Text,ParseData>
        /parse_text/     -> MapFile<Text,ParseText>
      • Can have multiple versions of a page in different segments
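      Segments can be listed and dumped with the segment reader; a minimal sketch (the paths and the segment timestamp are placeholders):
        bin/nutch readseg -list -dir crawl/segments                          # one line of statistics per segment
        bin/nutch readseg -dump crawl/segments/20110608123456 segment_dump   # dump one segment as text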
    • 16. Data Structures – 3
      • LinkDB => storage for the web graph
      • Output of: invertlinks
      • Input of: SOLRIndex
      LinkDB MapFile : <Text,Inlinks>
        Inlinks : HashSet<Inlink>
        Inlink  : String fromUrl
                  String anchor
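      The LinkDB can be inspected in the same way; a minimal sketch (paths and URL are assumptions):
        bin/nutch readlinkdb crawl/linkdb -url http://example.org/   # inlinks and anchors for one URL
        bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump          # dump the whole LinkDB as text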
    • 17. Outline
      • Overview
      • Features
      • Data Structures
      • Use cases
      • What's new in Nutch 1.3
      • Nutch 2.0
      • GORA
      • Conclusion
    • 18. Use cases
      • Crawl for Search Systems
        • Web wide or vertical
        • Single node to large clusters
        • Legacy Lucene-based search or SOLR
      • … but not necessarily
        • NLP (e.g. Sentiment Analysis)
        • ML, Classification / Clustering
        • Data Mining
        • MAHOUT / UIMA / GATE
        • Use Behemoth as glueware ( http://github.com/jnioche/behemoth )
      • SimilarPages.com
        • Large cluster on Amazon EC2 (up to 400 nodes)
        • Fetched & parsed 3 billion pages
        • 10+ billion pages in crawlDB (~100TB data)
        • 200+ million lists of similarities
        • No indexing / search involved
    • 19. Outline
      • Overview
      • Features
      • Data Structures
      • Use cases
      • What's new in Nutch 1.3
      • Nutch 2.0
      • GORA
      • Conclusion
    • 20. NUTCH 1.3
      • Transition between 1.x and 2.0
      • http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/
      • 1.3-RC3 => imminent
      • Removed Lucene-based indexing and search webapp
        • delegate indexing / search remotely to SOLR
        • change of focus : “Web search application” -> “Crawler”
      • Removed deprecated parse plugins
        • delegate most parsing to Tika
      • Separate local / distributed runtimes
      • Ivy-based dependency management
    • 21. NUTCH 2.0
      • Became trunk in 2010
      • Same features as 1.3
        • delegation to SOLR, TIKA, etc...
      • Moved to table-based architecture
        • Wealth of NoSQL projects in the last 2 years
      • Preliminary version known as NutchBase (Doğacan Güney)
      • Moved storage layer to subproject in incubator -> GORA
    • 22. GORA
      • http://incubator.apache.org/gora/
      • ORM for NoSQL databases
        • and limited SQL support
      • Serialization with Apache AVRO
      • Object-to-datastore mappings (backend-specific)
      • Backend implementations
        • HBase
        • Cassandra
        • SQL
        • Memory
      • 0.1 released in April 2011
    • 23. GORA (cont.)
      • Atomic operations
        • Get
        • Put
        • Delete
      • Querying
        • Execute
        • deleteByQuery
      • Wrappers for Apache Hadoop
        • GORAInput|OutputFormat
        • GORAMapper|Reducer
    • 24. Benefits for Nutch
      • Storage still distributed and replicated
      • but one big table
        • status, metadata, content, text -> one place
      • Simplified logic in Nutch
        • Simpler code for updating / merging information
      • More efficient
        • No need to read / write entire structure to update records
        • e.g. update step in 1.x
      • Easier interaction with other resources
        • Third-party code just needs to use GORA and the schema
    • 25. Status Nutch 2.0
      • Beta stage
        • debugging / testing required
      • Compare performance of GORA backends
      • Need to update documentation / WIKI
      • Enthusiasm from community
      • GORA – next great project coming out of Nutch?
    • 26. Future
      • Delegate code to crawler-commons
      • (http://code.google.com/p/crawler-commons/)
        • Fetcher / protocol handling
        • Robots.txt parsing
        • URL normalisation / filtering
      • New functionalities
        • Sitemap
        • Canonical tag
        • More indexers (e.g. ElasticSearch) + pluggable indexers?
      • Definitive move to 2.0?
        • Contribute backends and functionalities to GORA
    • 27. Outline
      • Overview
      • Features
      • Data Structures
      • Use cases
      • What's new in Nutch 1.3
      • Nutch 2.0
      • GORA
      • Conclusion
    • 28. Where to find out more?
      • Project page : http://nutch.apache.org/
      • Wiki : http://wiki.apache.org/nutch/
      • Mailing lists :
        • [email_address]
        • [email_address]
      • Chapter in 'Hadoop the Definitive Guide' (T. White)
        • Understanding Hadoop is essential anyway...
      • Support / consulting :
        • http://wiki.apache.org/nutch/Support
        • [email_address]
    • 29. Questions?