Web Scale Crawling with Apache Nutch
  • I'll be talking about large scale document processing and more specifically about Behemoth which is an open source project based on Hadoop
  • A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DP is The main projects I am involved in are …
  • Note that I mention crawling and not web search: Nutch is used for more than search, and it used to do indexing and search with Lucene but now delegates this to SOLR
  • Main steps in Nutch. More actions are available. Shell wrappers around Hadoop commands
  • Endpoints are called in various places. URL filters and normalisers are used in a lot of places; same for Scoring Filters
  • Fetcher: multithreaded but polite
  • Writable object – crawl datum
  • What does this mean for Nutch?
  • Transcript

    • 1. Web Scale Crawling with Apache Nutch. Julien Nioche [email_address]. Berlin Buzzwords 08/06/11
    • 2. DigitalPebble Ltd
        • Based in Bristol (UK)
        • Specialised in Text Engineering
            • Web Crawling
            • Natural Language Processing
            • Information Retrieval
            • Data Mining
        • Strong focus on Open Source & Apache ecosystem
        • User | Contributor | Committer
            • Nutch, SOLR, Lucene
            • Tika
            • GATE, UIMA
            • Mahout
            • Behemoth
    • 3. Outline
        • Overview
        • Features
        • Data Structures
        • Use cases
        • What's new in Nutch 1.3
        • Nutch 2.0
        • GORA
        • Conclusion
    • 4. Nutch?
        • “Distributed framework for large scale web crawling”
            • but does not have to be large scale at all
            • or even on the web (file-protocol)
        • Based on Apache Hadoop
        • Indexing and Search
        • Open Source – Apache 2.0 License
    • 5. Short history
        • 2002/2003 : Started by Doug Cutting & Mike Cafarella
        • 2004 : sub-project of Lucene @Apache
        • 2005 : MapReduce implementation in Nutch
            • 2006 : Hadoop sub-project of Lucene @Apache
        • 2006/7 : Parser and MimeType in Tika
            • 2008 : Tika sub-project of Lucene @Apache
        • May 2010 : TLP project at Apache
        • June 2011 (?) : Nutch 1.3
        • Q4 2011 (?) : Nutch 2.0
    • 6. In a Nutch Shell (1.3)
        • Step by step :
            • 1. Inject -> populates CrawlDB from seed list
            • 2. Generate -> selects URLs to fetch in segment
            • 3. Fetch -> fetches URLs from segment
            • 4. Parse -> parses content (text + metadata)
            • 5. UpdateDB -> updates CrawlDB (new URLs, new status...)
            • 6. InvertLinks -> builds Webgraph
            • 7. SOLRIndex -> sends docs to SOLR
            • 8. SOLRDedup -> removes duplicate docs based on signature
        • Repeat steps 2 to 8
        • Or use the all-in-one 'nutch crawl' command
    • 7. Frontier expansion
        • Manual “discovery”
            • Adding new URLs by hand, “seeding”
        • Automatic discovery of new resources (frontier expansion)
            • Not all outlinks are equally useful - control
            • Requires content parsing and link extraction
        • [Figure: frontier growing outwards from the seed over iterations i = 1, i = 2, i = 3. Slide courtesy of A. Bialecki]
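The frontier expansion described on slide 7 can be sketched as a small in-memory loop. This is only a toy model under stated assumptions: the class name `FrontierSketch`, the method `expand` and the link graph are hypothetical, and nothing here is Nutch code. In Nutch the same idea is spread over the generate, fetch, parse and updatedb steps.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of frontier expansion (hypothetical, not Nutch code): each round
// "fetches" the URLs found in the previous round and adds their unseen
// outlinks to the frontier.
final class FrontierSketch {

    // Returns the newly discovered URLs per round; index 0 holds the seeds.
    static List<Set<String>> expand(Set<String> seeds,
                                    Map<String, List<String>> outlinks,
                                    int rounds) {
        Set<String> known = new LinkedHashSet<>(seeds);
        List<Set<String>> frontiers = new ArrayList<>();
        frontiers.add(new LinkedHashSet<>(seeds));
        Set<String> current = seeds;
        for (int i = 1; i <= rounds; i++) {
            Set<String> next = new LinkedHashSet<>();
            for (String url : current) {
                for (String target : outlinks.getOrDefault(url, List.of())) {
                    if (known.add(target)) { // true only for URLs not seen before
                        next.add(target);
                    }
                }
            }
            frontiers.add(next);
            current = next;
        }
        return frontiers;
    }

    public static void main(String[] args) {
        // Made-up link graph used purely for illustration.
        Map<String, List<String>> graph = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c", "d"),
            "c", List.of("a"),
            "d", List.of("e"));
        System.out.println(expand(Set.of("a"), graph, 3));
    }
}
```

Keeping a single set of known URLs is what stops the crawl from revisiting pages; the "control" bullet on the slide would correspond to filtering or scoring `target` before it is added.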
    • 9. An extensible framework
        • Endpoints
            • Protocol
            • Parser
            • HtmlParseFilter
            • ScoringFilter (used in various places)
            • URLFilter (ditto)
            • URLNormalizer (ditto)
            • IndexingFilter
        • Plugins
            • Activated with parameter 'plugin.includes'
            • Implement one or more endpoints
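To illustrate how several plugins implementing the same endpoint might be chained, here is a minimal sketch of a URL filter chain. The `UrlFilter` interface, the two filter classes and `applyAll` are simplified stand-ins written for this example; they are assumptions and not the actual Nutch plugin classes or their loading mechanism.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of chained URL filters (not the real Nutch classes).
final class FilterChainSketch {

    // Assumed contract: return the URL to keep it, or null to reject it.
    interface UrlFilter {
        String filter(String url);
    }

    // Rejects URLs ending with one of the given suffixes (case-insensitive).
    static final class SuffixFilter implements UrlFilter {
        private final List<String> rejectedSuffixes;

        SuffixFilter(List<String> rejectedSuffixes) {
            this.rejectedSuffixes = rejectedSuffixes;
        }

        public String filter(String url) {
            String lower = url.toLowerCase(Locale.ROOT);
            for (String suffix : rejectedSuffixes) {
                if (lower.endsWith(suffix)) {
                    return null;
                }
            }
            return url;
        }
    }

    // Keeps only URLs in which the regular expression is found.
    static final class RegexFilter implements UrlFilter {
        private final Pattern accept;

        RegexFilter(String regex) {
            this.accept = Pattern.compile(regex);
        }

        public String filter(String url) {
            return accept.matcher(url).find() ? url : null;
        }
    }

    // Runs the filters in order and stops at the first rejection.
    static String applyAll(List<UrlFilter> filters, String url) {
        String current = url;
        for (UrlFilter f : filters) {
            if (current == null) {
                return null;
            }
            current = f.filter(current);
        }
        return current;
    }
}
```

A chain built as `List.of(new SuffixFilter(List.of(".jpg")), new RegexFilter("^https?://"))` would drop image URLs and non-HTTP schemes before they reach the fetcher.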
    • 10. Features
        • Fetcher
            • Multi-threaded fetcher
            • Follows robots.txt
            • Groups URLs per hostname / domain / IP
            • Limits the number of URLs per round of fetching
            • Default values are polite but can be made more aggressive
        • Crawl Strategy
            • Breadth-first but can be depth-first
            • Configurable via custom scoring plugins
        • Scoring
            • OPIC (On-line Page Importance Calculation) by default
            • LinkRank
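The grouping of URLs per hostname mentioned on slide 10 could look roughly like the sketch below. It is an assumption-laden illustration: `HostQueueSketch`, `groupByHost` and the `maxPerHost` cap are hypothetical names, and the real fetcher additionally handles domains, IPs, crawl delays and robots.txt.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: bucket URLs by hostname so that each host can be
// fetched politely from its own queue, with a cap on URLs per host.
final class HostQueueSketch {

    static Map<String, List<String>> groupByHost(List<String> urls, int maxPerHost) {
        Map<String, List<String>> queues = new TreeMap<>();
        for (String url : urls) {
            // URI.create throws IllegalArgumentException on malformed input.
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // no hostname (e.g. a relative URL): skip it
            }
            List<String> queue = queues.computeIfAbsent(
                host.toLowerCase(Locale.ROOT), h -> new ArrayList<>());
            if (queue.size() < maxPerHost) {
                queue.add(url);
            }
        }
        return queues;
    }
}
```

With one queue per host, a multi-threaded fetcher can take work from different queues in parallel while still issuing requests to any single host one at a time.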
    • 11. Features (cont.)
        • Protocols
            • Http, file, ftp, https
        • Scheduling
            • Specified or adaptive
        • URL filters
            • Regex, FSA, TLD domain, prefix, suffix
        • URL normalisers
            • Default, regex
    • 12. Features (cont.)
        • Other plugins
            • CreativeCommons
            • Feeds
            • Language Identification
            • Rel tags
            • Arbitrary Metadata
        • Indexing to SOLR
            • Bespoke schema
        • Parsing with Apache Tika
            • But some legacy parsers as well
    • 14. Data Structures
        • MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
        • CrawlDB => status of known pages
            • Input of : generate - index
            • Output of : inject - update
        • CrawlDB MapFile : <Text,CrawlDatum>
            • byte status; [fetched? Unfetched? Failed? Redir?]
            • long fetchTime;
            • byte retries;
            • int fetchInterval;
            • float score = 1.0f;
            • byte[] signature = null;
            • long modifiedTime;
            • org.apache.hadoop.io.MapWritable metaData;
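As a rough illustration of how a generate step might use the `fetchTime` and `score` fields listed on slide 14, the sketch below selects the entries that are due and keeps the best-scored ones. `GenerateSketch`, `Entry` and `selectDue` are hypothetical names, and the selection rule (due entries, highest score first, top N) is a simplifying assumption rather than the actual Nutch generator logic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of selecting URLs for the next segment.
final class GenerateSketch {

    // Simplified record holding a subset of the fields shown on the slide.
    static final class Entry {
        final String url;
        final long fetchTime; // earliest time at which the URL is due
        final float score;

        Entry(String url, long fetchTime, float score) {
            this.url = url;
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    // Keeps entries whose fetchTime has passed, best score first, at most topN.
    static List<String> selectDue(List<Entry> crawlDb, long now, int topN) {
        List<Entry> due = new ArrayList<>();
        for (Entry e : crawlDb) {
            if (e.fetchTime <= now) {
                due.add(e);
            }
        }
        due.sort(Comparator.comparingDouble((Entry e) -> e.score).reversed());
        List<String> selected = new ArrayList<>();
        for (Entry e : due.subList(0, Math.min(topN, due.size()))) {
            selected.add(e.url);
        }
        return selected;
    }
}
```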
    • 15. Data Structures 2
        • Segment => round of fetching
        • Identified by a timestamp
        • Segment
            • /crawl_generate/ -> SequenceFile<Text,CrawlDatum>
            • /crawl_fetch/ -> MapFile<Text,CrawlDatum>
            • /content/ -> MapFile<Text,Content>
            • /crawl_parse/ -> SequenceFile<Text,CrawlDatum>
            • /parse_data/ -> MapFile<Text,ParseData>
            • /parse_text/ -> MapFile<Text,ParseText>
        • Can have multiple versions of a page in different segments
    • 16. Data Structures – 3
        • LinkDB => storage for Web Graph
            • Output of : invertlinks
            • Input of : SOLRIndex
        • LinkDB MapFile : <Text,Inlinks>
            • Inlinks : HashSet <Inlink>
            • Inlink : String fromUrl, String anchor
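The inversion that produces the LinkDB on slide 16 can be sketched in memory as follows. This is a hedged illustration only: `InvertLinksSketch`, `Outlink` and `invert` are hypothetical, the `Inlink` class here just mirrors the two fields named on the slide, and a list is used instead of a HashSet to keep the output order predictable. In Nutch the same transformation runs as a MapReduce job over the segments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of link inversion: from "page -> outlinks" to
// "target URL -> inlinks (source URL + anchor text)".
final class InvertLinksSketch {

    static final class Outlink {
        final String toUrl;
        final String anchor;

        Outlink(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    static final class Inlink {
        final String fromUrl;
        final String anchor;

        Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }

        @Override
        public String toString() {
            return fromUrl + " [" + anchor + "]";
        }
    }

    static Map<String, List<Inlink>> invert(Map<String, List<Outlink>> outlinksByPage) {
        Map<String, List<Inlink>> inlinks = new TreeMap<>();
        // Copy into a TreeMap so source pages are visited in sorted order.
        for (Map.Entry<String, List<Outlink>> page : new TreeMap<>(outlinksByPage).entrySet()) {
            for (Outlink out : page.getValue()) {
                inlinks.computeIfAbsent(out.toUrl, k -> new ArrayList<>())
                       .add(new Inlink(page.getKey(), out.anchor));
            }
        }
        return inlinks;
    }
}
```

Storing the anchor text with each inlink is what lets the indexing step add the words other pages use to describe a document.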
    • 18. Use cases
        • Crawl for Search Systems
            • Web wide or vertical
            • Single node to large clusters
            • Legacy Lucene-based search or SOLR
        • … but not necessarily
            • NLP (e.g. Sentiment Analysis)
            • ML, Classification / Clustering
            • Data Mining
            • MAHOUT / UIMA / GATE
            • Use Behemoth as glueware (http://github.com/jnioche/behemoth)
        • SimilarPages.com
            • Large cluster on Amazon EC2 (up to 400 nodes)
            • Fetched & parsed 3 billion pages
            • 10+ billion pages in crawlDB (~100TB data)
            • 200+ million lists of similarities
            • No indexing / search involved
    • 20. NUTCH 1.3
        • Transition between 1.x and 2.0
        • http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/
        • 1.3-RC3 => imminent
        • Removed Lucene-based indexing and search webapp
            • delegate indexing / search remotely to SOLR
            • change of focus : “Web search application” -> “Crawler”
        • Removed deprecated parse plugins
            • delegate most parsing to Tika
        • Separate local / distributed runtimes
        • Ivy-based dependency management
    • 21. NUTCH 2.0
        • Became trunk in 2010
        • Same features as 1.3
            • delegation to SOLR, TIKA, etc...
        • Moved to table-based architecture
            • Wealth of NoSQL projects in last 2 years
        • Preliminary version known as NutchBase (Doğacan Güney)
        • Moved storage layer to subproject in incubator -> GORA
    • 22. GORA
        • http://incubator.apache.org/gora/
        • ORM for NoSQL databases
            • and limited SQL support
        • Serialization with Apache AVRO
        • Object-to-datastore mappings (backend-specific)
        • Backend implementations
            • HBase
            • Cassandra
            • SQL
            • Memory
        • 0.1 released in April 2011
    • 23. GORA (cont.)
        • Atomic operations
            • Get
            • Put
            • Delete
        • Querying
            • Execute
            • deleteByQuery
        • Wrappers for Apache Hadoop
            • GORAInput|OutputFormat
            • GORAMapper|Reducer
    • 24. Benefits for Nutch
        • Storage still distributed and replicated
        • but one big table
            • status, metadata, content, text -> one place
        • Simplified logic in Nutch
            • Simpler code for updating / merging information
        • More efficient
            • No need to read / write entire structure to update records
            • e.g. update step in 1.x
        • Easier interaction with other resources
            • Third-party code just needs to use GORA and schema
    • 25. Status Nutch 2.0
        • Beta stage
            • debugging / testing required
        • Compare performance of GORA backends
        • Need to update documentation / WIKI
        • Enthusiasm from community
        • GORA – next great project coming out of Nutch?
    • 26. Future
        • Delegate code to crawler-commons (http://code.google.com/p/crawler-commons/)
            • Fetcher / protocol handling
            • Robots.txt parsing
            • URL normalisation / filtering
        • New functionalities
            • Sitemap
            • Canonical tag
            • More indexers (e.g. ElasticSearch) + pluggable indexers?
        • Definitive move to 2.0?
            • Contribute backends and functionalities to GORA
    • 28. Where to find out more?
        • Project page : http://nutch.apache.org/
        • Wiki : http://wiki.apache.org/nutch/
        • Mailing lists :
            • [email_address]
            • [email_address]
        • Chapter in 'Hadoop the Definitive Guide' (T. White)
            • Understanding Hadoop is essential anyway...
        • Support / consulting :
            • http://wiki.apache.org/nutch/Support
            • [email_address]
    • 29. Questions?