Web Scale Crawling with Apache Nutch


Published in: Technology

  • I'll be talking about large scale document processing, and more specifically about Behemoth, which is an open source project based on Hadoop.
  • A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DP is The main projects I am involved in are …
  • Note that I mention crawling and not web search: Nutch is used not only for search. It used to do indexing and search with Lucene but now delegates this to SOLR.
  • Main steps in Nutch. More actions are available. Shell wrappers around Hadoop commands.
  • Endpoints are called in various places: URL filters and normalisers in a lot of places, and the same for Scoring Filters.
  • Fetcher: multithreaded but polite
  • Writable object – crawl datum
  • What does this mean for Nutch?

    1. Web Scale Crawling with Apache Nutch
         • Julien Nioche, [email_address]
         • Berlin Buzzwords, 08/06/11
    2. DigitalPebble Ltd
         • Based in Bristol (UK)
         • Specialised in Text Engineering
             • Web Crawling
             • Natural Language Processing
             • Information Retrieval
             • Data Mining
         • Strong focus on Open Source & Apache ecosystem
         • User | Contributor | Committer
             • Nutch, SOLR, Lucene
             • Tika
             • GATE, UIMA
             • Mahout
             • Behemoth
    3. Outline
         • Overview
         • Features
         • Data Structures
         • Use cases
         • What's new in Nutch 1.3
         • Nutch 2.0
         • GORA
         • Conclusion
    4. Nutch?
         • “Distributed framework for large scale web crawling”
             • but does not have to be large scale at all
             • or even on the web (file-protocol)
         • Based on Apache Hadoop
         • Indexing and Search
         • Open Source – Apache 2.0 License
    5. Short history
         • 2002/2003 : Started by Doug Cutting & Mike Cafarella
         • 2004 : sub-project of Lucene @Apache
         • 2005 : MapReduce implementation in Nutch
             • 2006 : Hadoop sub-project of Lucene @Apache
         • 2006/7 : Parser and MimeType in Tika
             • 2008 : Tika sub-project of Lucene @Apache
         • May 2010 : Top-Level Project (TLP) at Apache
         • June 2011 (?) : Nutch 1.3
         • Q4 2011 (?) : Nutch 2.0
    6. In a Nutch Shell (1.3)
         • Step by step:
             1. Inject -> populates CrawlDB from seed list
             2. Generate -> selects URLs to fetch in segment
             3. Fetch -> fetches URLs from segment
             4. Parse -> parses content (text + metadata)
             5. UpdateDB -> updates CrawlDB (new URLs, new status...)
             6. InvertLinks -> builds Webgraph
             7. SOLRIndex -> sends docs to SOLR
             8. SOLRDedup -> removes duplicate docs based on signature
         • Repeat steps 2 to 8
         • Or use the all-in-one 'nutch crawl' command
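The crawl cycle on this slide can be illustrated with a toy simulation. This is a minimal Python sketch, not Nutch code: the in-memory `WEB` graph, the status strings and all function names are assumptions made for illustration, and only the Inject, Generate, Fetch, Parse and UpdateDB steps are modelled.

```python
# Toy link graph standing in for the web: page -> outlinks (invented data).
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def inject(crawldb, seeds):
    """Inject: add seed URLs to the CrawlDB as unfetched."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Generate: select up to top_n unfetched URLs for a new segment."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]


def fetch_and_parse(segment):
    """Fetch + Parse: return the outlinks found for each URL in the segment."""
    return {url: WEB.get(url, []) for url in segment}


def updatedb(crawldb, parsed):
    """UpdateDB: mark fetched URLs and record newly discovered ones."""
    for url, outlinks in parsed.items():
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, rounds, top_n=10):
    """Run inject once, then repeat generate/fetch/parse/update."""
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment:
            break  # nothing left to fetch
        updatedb(crawldb, fetch_and_parse(segment))
    return crawldb
```

Calling `crawl(["http://a.example/"], rounds=1)` leaves the seed fetched and its two outlinks waiting as unfetched entries for the next round.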
    7. Frontier expansion
         • Manual “discovery”
             • Adding new URLs by hand, “seeding”
         • Automatic discovery of new resources (frontier expansion)
             • Not all outlinks are equally useful - control
             • Requires content parsing and link extraction
         • [Figure: frontier growing outwards from the seed over iterations i = 1, i = 2, i = 3]
         • [Slide courtesy of A. Bialecki]
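Frontier expansion by rounds can be sketched as a breadth-first walk over a link graph. The function below is a hypothetical illustration in Python, not part of Nutch: `expand_frontier`, the `outlinks` dictionary and the `keep` predicate (standing in for the "control" over which outlinks are followed) are all assumed names.

```python
def expand_frontier(seeds, outlinks, rounds, keep=lambda url: True):
    """Return the sorted list of URLs visited at each round i = 1..rounds.

    outlinks maps a URL to the links extracted from it; keep decides
    whether a newly discovered link is allowed into the frontier.
    """
    seen = set(seeds)
    frontier = set(seeds)
    history = []
    for _ in range(rounds):
        if not frontier:
            break  # the frontier stopped growing
        history.append(sorted(frontier))
        discovered = set()
        for url in frontier:
            for link in outlinks.get(url, []):
                if link not in seen and keep(link):
                    seen.add(link)
                    discovered.add(link)
        frontier = discovered
    return history
```

Passing a stricter `keep` predicate narrows the crawl, for example to a single domain.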
    9. An extensible framework
         • Endpoints
             • Protocol
             • Parser
             • HtmlParseFilter
             • ScoringFilter (used in various places)
             • URLFilter (ditto)
             • URLNormalizer (ditto)
             • IndexingFilter
         • Plugins
             • Activated with parameter 'plugin.includes'
             • Implement one or more endpoints
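The idea of plugins implementing an endpoint and being switched on by a pattern can be sketched in a few lines. This is a toy Python model, not the Nutch plugin system: the `REGISTRY`, the two filter functions and the helper names are assumptions for illustration. It assumes a filter either returns the URL or rejects it by returning `None`, and that activation works by matching plugin names against a regular expression.

```python
import re


def prefix_filter(url):
    """Keep only http(s) URLs; return None to reject."""
    return url if url.startswith(("http://", "https://")) else None


def suffix_filter(url):
    """Reject URLs that end in a few binary file extensions."""
    return None if url.lower().endswith((".jpg", ".png", ".zip")) else url


# Hypothetical registry: plugin name -> URL filter implementation.
REGISTRY = {
    "urlfilter-prefix": prefix_filter,
    "urlfilter-suffix": suffix_filter,
}


def active_filters(plugin_includes):
    """Select the filters whose plugin name fully matches the pattern."""
    pattern = re.compile(plugin_includes)
    return [f for name, f in REGISTRY.items() if pattern.fullmatch(name)]


def apply_filters(url, filters):
    """Run the URL through each filter in turn; stop at the first rejection."""
    for url_filter in filters:
        url = url_filter(url)
        if url is None:
            return None
    return url
```

With the pattern `urlfilter-(prefix|suffix)` both filters run; with `urlfilter-prefix` alone, image URLs are no longer rejected.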
    10. Features
         • Fetcher
             • Multi-threaded fetcher
             • Follows robots.txt
             • Groups URLs per hostname / domain / IP
             • Limits the number of URLs per round of fetching
             • Default values are polite but can be made more aggressive
         • Crawl Strategy
             • Breadth-first but can be depth-first
             • Configurable via custom scoring plugins
         • Scoring
             • OPIC (On-line Page Importance Calculation) by default
             • LinkRank
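Grouping URLs per host is what lets a multi-threaded fetcher stay polite: different hosts can be fetched in parallel while requests to the same host are spaced out. The sketch below only computes a fetch plan and does no network I/O. It is a hypothetical Python illustration, and the function names and the 5-second delay are assumptions rather than values taken from the slides.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Build one queue of URLs per hostname, preserving input order."""
    queues = defaultdict(list)
    for url in urls:
        queues[urlsplit(url).hostname].append(url)
    return dict(queues)


def schedule(urls, delay=5.0):
    """Return sorted (start_time, url) pairs.

    Hosts are independent of each other, but two URLs on the same host
    are always at least `delay` seconds apart.
    """
    plan = []
    for queue in group_by_host(urls).values():
        for position, url in enumerate(queue):
            plan.append((position * delay, url))
    return sorted(plan)
```

Lowering `delay` is the "more aggressive" setting the slide alludes to; the grouping logic stays the same.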
    11. Features (cont.)
         • Protocols
             • Http, file, ftp, https
         • Scheduling
             • Specified or adaptive
         • URL filters
             • Regex, FSA, TLD domain, prefix, suffix
         • URL normalisers
             • Default, regex
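Adaptive scheduling means the re-fetch interval of a page follows how often the page is seen to change. A minimal Python sketch of that idea follows; the function name, the rates and the bounds are illustrative assumptions and are not taken from the slides or presented as Nutch's actual settings.

```python
def next_interval(interval, changed, inc_rate=0.4, dec_rate=0.2,
                  min_interval=60.0, max_interval=31536000.0):
    """Return the next re-fetch interval in seconds.

    A page that changed since the last fetch is revisited sooner,
    an unchanged page later; the result is clamped to the bounds.
    """
    if changed:
        interval = interval * (1.0 - dec_rate)
    else:
        interval = interval * (1.0 + inc_rate)
    return max(min_interval, min(max_interval, interval))
```

A "specified" schedule would correspond to returning the same fixed interval regardless of `changed`.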
    12. Features (cont.)
         • Other plugins
             • CreativeCommons
             • Feeds
             • Language Identification
             • Rel tags
             • Arbitrary Metadata
         • Indexing to SOLR
             • Bespoke schema
         • Parsing with Apache Tika
             • But some legacy parsers as well
    14. Data Structures
         • MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
         • CrawlDB => status of known pages
             • Input of : generate - index
             • Output of : inject - update
         • CrawlDB MapFile : <Text,CrawlDatum>
             byte status; [fetched? Unfetched? Failed? Redir?]
             long fetchTime;
             byte retries;
             int fetchInterval;
             float score = 1.0f;
             byte[] signature = null;
             long modifiedTime;
             org.apache.hadoop.io.MapWritable metaData;
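The CrawlDB can be pictured as a map from URL to a small status record. The Python dataclass below mirrors the fields listed on the slide as a rough sketch. It is not the Hadoop `Writable` implementation: the status is a plain string instead of a byte, and `record_fetch` with its MD5 signature is a hypothetical helper added for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CrawlDatum:
    """Per-URL crawl state, modelled on the fields shown on the slide."""
    status: str = "unfetched"          # fetched / unfetched / failed / redirected
    fetch_time: int = 0
    retries: int = 0
    fetch_interval: int = 0
    score: float = 1.0
    signature: Optional[bytes] = None  # digest of the fetched content
    modified_time: int = 0
    metadata: Dict[str, str] = field(default_factory=dict)


def record_fetch(crawldb, url, now, content):
    """Hypothetical update: mark a URL as fetched and store a content digest."""
    datum = crawldb.setdefault(url, CrawlDatum())
    datum.status = "fetched"
    datum.fetch_time = now
    datum.retries = 0
    datum.signature = hashlib.md5(content).digest()
    return datum
```

Two pages with identical content end up with the same `signature`, which is the property a signature-based deduplication step relies on.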
    15. Data Structures 2
         • Segment => round of fetching
         • Identified by a timestamp
         • Segment
             /crawl_generate/ -> SequenceFile<Text,CrawlDatum>
             /crawl_fetch/ -> MapFile<Text,CrawlDatum>
             /content/ -> MapFile<Text,Content>
             /crawl_parse/ -> SequenceFile<Text,CrawlDatum>
             /parse_data/ -> MapFile<Text,ParseData>
             /parse_text/ -> MapFile<Text,ParseText>
         • Can have multiple versions of a page in different segments
    16. Data Structures – 3
         • LinkDB => storage for Web Graph
             • Output of : invertlinks
             • Input of : SOLRIndex
         • LinkDB MapFile : <Text,Inlinks>
             Inlinks : HashSet<Inlink>
             Inlink :
                 String fromUrl
                 String anchor
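Inverting links turns "page -> outlinks" into "page -> inlinks with anchor text", which is the shape of the LinkDB described on this slide. A minimal in-memory Python sketch, assuming outlinks are given as `(to_url, anchor)` pairs; the function name and input layout are illustrative, not the Nutch job itself:

```python
from collections import defaultdict


def invert_links(outlinks):
    """Invert {from_url: [(to_url, anchor), ...]}.

    Returns {to_url: {(from_url, anchor), ...}}, using a set so that
    duplicate inlinks collapse as they would in a HashSet.
    """
    linkdb = defaultdict(set)
    for from_url, links in outlinks.items():
        for to_url, anchor in links:
            linkdb[to_url].add((from_url, anchor))
    return dict(linkdb)
```

The anchor texts collected per target URL are what an indexing step can later attach to the document.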
    18. Use cases
         • Crawl for Search Systems
             • Web wide or vertical
             • Single node to large clusters
             • Legacy Lucene-based search or SOLR
         • … but not necessarily
             • NLP (e.g. Sentiment Analysis)
             • ML, Classification / Clustering
             • Data Mining
             • MAHOUT / UIMA / GATE
             • Use Behemoth as glueware ( http://github.com/jnioche/behemoth )
         • SimilarPages.com
             • Large cluster on Amazon EC2 (up to 400 nodes)
             • Fetched & parsed 3 billion pages
             • 10+ billion pages in crawlDB (~100TB data)
             • 200+ million lists of similarities
             • No indexing / search involved
    20. NUTCH 1.3
         • Transition between 1.x and 2.0
         • http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/
         • 1.3-RC3 => imminent
         • Removed Lucene-based indexing and search webapp
             • delegate indexing / search remotely to SOLR
             • change of focus : “Web search application” -> “Crawler”
         • Removed deprecated parse plugins
             • delegate most parsing to Tika
         • Separate local / distributed runtimes
         • Ivy-based dependency management
    21. NUTCH 2.0
         • Became trunk in 2010
         • Same features as 1.3
             • delegation to SOLR, TIKA, etc...
         • Moved to table-based architecture
             • Wealth of NoSQL projects in last 2 years
         • Preliminary version known as NutchBase (Doğacan Güney)
         • Moved storage layer to subproject in incubator -> GORA
    22. GORA
         • http://incubator.apache.org/gora/
         • ORM for NoSQL databases
             • and limited SQL support
         • Serialization with Apache AVRO
         • Object-to-datastore mappings (backend-specific)
         • Backend implementations
             • HBase
             • Cassandra
             • SQL
             • Memory
         • 0.1 released in April 2011
    23. GORA (cont.)
         • Atomic operations
             • Get
             • Put
             • Delete
         • Querying
             • Execute
             • deleteByQuery
         • Wrappers for Apache Hadoop
             • GORAInput|OutputFormat
             • GORAMapper|Reducer
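The operations named on this slide can be pictured with a toy in-memory store. The Python class below is only a sketch of the concept and is not GORA's API: `MemoryStore`, its method signatures and the key-range form of the query are assumptions made for illustration.

```python
class MemoryStore:
    """Toy key/value store offering get, put, delete and range queries."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        """Insert or overwrite a single row."""
        self._rows[key] = value

    def get(self, key):
        """Return the row for key, or None if it is absent."""
        return self._rows.get(key)

    def delete(self, key):
        """Remove one row; return True if it existed."""
        if key in self._rows:
            del self._rows[key]
            return True
        return False

    def execute(self, start_key=None, end_key=None):
        """Query: return (key, value) pairs with start_key <= key <= end_key."""
        return [
            (key, self._rows[key])
            for key in sorted(self._rows)
            if (start_key is None or key >= start_key)
            and (end_key is None or key <= end_key)
        ]

    def delete_by_query(self, start_key=None, end_key=None):
        """Delete every row matched by the query; return how many were removed."""
        keys = [key for key, _ in self.execute(start_key, end_key)]
        for key in keys:
            del self._rows[key]
        return len(keys)
```

Because each row is addressed by key, one record can be updated with a single `put` instead of rewriting a whole file.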
    24. Benefits for Nutch
         • Storage still distributed and replicated
         • but one big table
             • status, metadata, content, text -> one place
         • Simplified logic in Nutch
             • Simpler code for updating / merging information
         • More efficient
             • No need to read / write entire structure to update records
             • e.g. update step in 1.x
         • Easier interaction with other resources
             • Third-party code just needs to use GORA and schema
    25. Status Nutch 2.0
         • Beta stage
             • debugging / testing required
         • Compare performance of GORA backends
         • Need to update documentation / WIKI
         • Enthusiasm from community
         • GORA – next great project coming out of Nutch?
    26. Future
         • Delegate code to crawler-commons (http://code.google.com/p/crawler-commons/)
             • Fetcher / protocol handling
             • Robots.txt parsing
             • URL normalisation / filtering
         • New functionalities
             • Sitemap
             • Canonical tag
             • More indexers (e.g. ElasticSearch) + pluggable indexers?
         • Definitive move to 2.0?
             • Contribute backends and functionalities to GORA
    28. Where to find out more?
         • Project page : http://nutch.apache.org/
         • Wiki : http://wiki.apache.org/nutch/
         • Mailing lists :
             • [email_address]
             • [email_address]
         • Chapter in 'Hadoop the Definitive Guide' (T. White)
             • Understanding Hadoop is essential anyway...
         • Support / consulting :
             • http://wiki.apache.org/nutch/Support
             • [email_address]
    29. Questions?