Large Scale Crawling with Apache Nutch and Friends

Presented by Julien Nioche, Director, DigitalPebble

This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika and HBase. The second part of the presentation will focus on the latest developments in Nutch, the differences between the 1.x and 2.x branches, and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point for crawling on a large scale with Apache Nutch and SOLR.

Transcript

  • 1. Large Scale Crawling with Apache Nutch and friends... Julien Nioche, julien@digitalpebble.com. LUCENE/SOLR REVOLUTION EU 2013
  • 2. About myself
    – DigitalPebble Ltd, Bristol (UK)
    – Specialised in Text Engineering: Web Crawling, Natural Language Processing, Information Retrieval, Machine Learning
    – Strong focus on Open Source & the Apache ecosystem
    – VP Apache Nutch
    – User | Contributor | Committer: Tika, SOLR, Lucene, GATE, UIMA, Mahout, Behemoth
  • 3. Outline
    – Overview
    – Installation and setup
    – Main steps
    – Nutch 2.x
    – Future developments
  • 4. Nutch?
    – “Distributed framework for large scale web crawling” (but does not have to be large scale at all)
    – Apache TLP since May 2010
    – Based on Apache Hadoop
    – Indexing and search by Apache SOLR
  • 5. A bit of history
    – 2002/2003: started by Doug Cutting & Mike Cafarella
    – 2005: MapReduce implementation in Nutch; 2006: Hadoop becomes a sub-project of Lucene @Apache
    – 2006/7: parser and MimeType work moved into Tika; 2008: Tika becomes a sub-project of Lucene @Apache
    – May 2010: Nutch becomes a TLP at Apache
    – Sept 2010: storage abstraction in Nutch 2.x; 2012: Gora becomes a TLP @Apache
  • 6. Recent releases
    [Timeline, 06/09 to 06/13: 1.x branch (trunk): 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7; 2.x branch: 2.0, 2.1, 2.2.1]
  • 7. Why use Nutch?
    – Usual reasons: open source with a business-friendly license, mature, community, ...
    – Scalability: tried and tested on a very large scale; standard Hadoop
    – Features: indexing with SOLR / ES / CloudSearch; PageRank implementation; loads of existing plugins; can easily be extended / customised
  • 8. Use cases
    – Crawl for search: generic or vertical; index and search with SOLR et al.; single node to large clusters on the Cloud
    – ... but also: data mining, NLP (e.g. sentiment analysis), ML
    – with MAHOUT / UIMA / GATE, using Behemoth as glueware (https://github.com/DigitalPebble/behemoth)
  • 9. Customer cases [plotted by specificity (verticality) vs. size]
    – BetterJobs.com (CareerBuilder): single server; aggregates content from job portals; extracts and normalizes structure (description, requirements, locations); ~2M pages total; feeds a SOLR index
    – SimilarPages.com: large cluster on Amazon EC2 (up to 400 nodes); fetched & parsed 3 billion pages; 10+ billion pages in the crawlDB (~100TB of data); 200+ million lists of similarities; no indexing / search involved
  • 10. CommonCrawl (http://commoncrawl.org/)
    – Open repository of web crawl data
    – 2012 dataset: 3.83 billion docs
    – ARC files on Amazon S3
    – Using Nutch 1.7, with a few modifications to the Nutch code (https://github.com/Aloisius/nutch)
    – Next release imminent
  • 11. Outline
    – Overview
    – Installation and setup
    – Main steps
    – Nutch 2.x
    – Future developments
  • 12. Installation
    – http://nutch.apache.org/downloads.html
    – 1.7 => src and bin distributions; 2.2.1 => src only
    – 'ant clean runtime' builds: runtime/local => local mode (test and debug); runtime/deploy => job jar for Hadoop + scripts
    – Binary distribution for 1.x == runtime/local
  • 13. Configuration and resources
    – Changes in $NUTCH_HOME/conf need recompiling with 'ant runtime'; in local mode they can be made directly in runtime/local/conf
    – Specify configuration in nutch-site.xml; leave nutch-default.xml alone!
    – At least:
        <property>
          <name>http.agent.name</name>
          <value>WhateverNameDescribesMyMightyCrawler</value>
        </property>
  • 14. Running it!
    – bin/crawl script: typical sequence of steps (see the sketch below)
    – bin/nutch: individual Nutch commands (inject / generate / fetch / parse / update ...)
    – Local mode: great for testing and debugging
    – Recommended: deploy + Hadoop in (pseudo-)distributed mode, for parallelism and the MapReduce UI to monitor the crawl, check logs and counters
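    A minimal local-mode invocation might look like this (a sketch: the seed directory, crawl directory, SOLR URL and number of rounds are assumptions, and the crawl script's exact arguments vary between releases):

        # seed list in urls/, data under crawl/, index into a local SOLR, 2 rounds
        bin/crawl urls crawl http://localhost:8983/solr/ 2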
  • 15. Monitor crawl with the MapReduce UI [screenshot]
  • 16. Counters and logs [screenshot]
  • 17. Outline
    – Overview
    – Installation and setup
    – Main steps
    – Nutch 2.x
    – Future developments
  • 18. Typical Nutch steps
    – Same in 1.x and 2.x: a sequence of batch operations
      1) Inject → populates the CrawlDB from the seed list
      2) Generate → selects URLs to fetch in a segment
      3) Fetch → fetches URLs from the segment
      4) Parse → parses content (text + metadata)
      5) UpdateDB → updates the CrawlDB (new URLs, new status, ...)
      6) InvertLinks → builds the Webgraph
      7) Index → sends docs to [SOLR | ES | CloudSearch | ...]
    – Repeat steps 2 to 7, or use the all-in-one crawl script (the individual commands are sketched below)
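    For illustration, one round of these steps as individual commands (a sketch, assuming a 1.x layout with the CrawlDB under crawl/ and a seed list in urls/; exact options vary between releases):

        bin/nutch inject crawl/crawldb urls
        bin/nutch generate crawl/crawldb crawl/segments
        s=`ls -d crawl/segments/* | tail -1`    # newest segment
        bin/nutch fetch $s
        bin/nutch parse $s
        bin/nutch updatedb crawl/crawldb $s
        bin/nutch invertlinks crawl/linkdb -dir crawl/segments
        bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s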
  • 19. Main steps from a data perspective
    [Diagram: the seed list is injected into the CrawlDB; each round writes a segment containing /crawl_generate/, /crawl_fetch/, /content/, /crawl_parse/, /parse_data/ and /parse_text/; links are inverted into the LinkDB]
  • 20. Frontier expansion
    – Manual “discovery”: adding new URLs by hand, “seeding”
    – Automatic discovery of new resources (frontier expansion): not all outlinks are equally useful, so control is needed; requires content parsing and link extraction
    [Diagram: crawl frontier growing from the seed over iterations i=1, i=2, i=3] [Slide courtesy of A. Bialecki]
  • 21. An extensible framework
    – Plugins: activated with the parameter 'plugin.includes' (see the example below); implement one or more endpoints
    – Endpoints:
      – Protocol
      – Parser
      – HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
      – ScoringFilter (used in various places)
      – URLFilter (ditto)
      – URLNormalizer (ditto)
      – IndexingFilter
      – IndexWriter (NEW IN 1.7!)
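    For illustration, a 'plugin.includes' value close to the 1.x defaults, set in nutch-site.xml (a sketch: the exact default list varies by release):

        <property>
          <name>plugin.includes</name>
          <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>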
  • 22. Features
    – Fetcher: multi-threaded; queues URLs per hostname / domain / IP; limits the number of URLs per round of fetching; default values are polite but can be made more aggressive (see the configuration sketch below)
    – Crawl strategy: breadth-first but can be depth-first; configurable via custom ScoringFilters
    – Scoring: OPIC (On-line Page Importance Calculation) by default; LinkRank
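    As an illustration of the politeness knobs, a few nutch-site.xml properties (a sketch: these property names exist in 1.x, but the values shown are arbitrary):

        <property>
          <name>fetcher.threads.fetch</name>
          <value>10</value>        <!-- fetcher threads per task -->
        </property>
        <property>
          <name>fetcher.queue.mode</name>
          <value>byHost</value>    <!-- queue by host, domain or IP -->
        </property>
        <property>
          <name>generate.max.count</name>
          <value>100</value>       <!-- cap on URLs selected per queue -->
        </property>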
  • 23. Features (cont.)
    – Protocols: http, file, ftp, https; respects robots.txt directives
    – Scheduling: fixed or adaptive
    – URL filters: regex, FSA, TLD, prefix, suffix (see the regex example below)
    – URL normalisers: default, regex
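    For example, the regex URL filter is driven by conf/regex-urlfilter.txt, where the first matching rule wins ('+' keeps a URL, '-' drops it). An illustrative file, the host restriction being an assumption:

        # skip non-http schemes and common binary resources
        -^(file|ftp|mailto):
        -\.(gif|jpg|png|ico|css|js|zip|gz|exe)$
        # stay within one site
        +^https?://([a-z0-9-]+\.)*example\.com/
        # drop everything else
        -.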
  • 24. Features (cont.)
    – Parsing with Apache Tika: hundreds of formats supported, but some legacy parsers as well
    – Other plugins: CreativeCommons, feeds, language identification, rel tags, arbitrary metadata
    – Pluggable indexing: SOLR | ES etc.
  • 25. Indexing
    – Apache SOLR: schema.xml in conf/; SOLR 3.4; JIRA issue for SOLRCloud (https://issues.apache.org/jira/browse/NUTCH-1377)
    – ElasticSearch: version 0.90.1
    – AWS CloudSearch: WIP (https://issues.apache.org/jira/browse/NUTCH-1517)
    – Easy to build your own: text, DB, etc. (an indexing command is sketched below)
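    With the pluggable IndexWriter introduced in 1.7, the generic indexing job can be pointed at SOLR roughly like this (a sketch: option names and paths are assumptions and vary by release):

        bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments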
  • 26. Typical Nutch document
    – Some of the fields (from IndexingFilters in plugins or core code): url, content, title, anchor, site, boost, digest, segment, host, type
    – Configurable ones: meta tags (keywords, description etc.), arbitrary metadata
  • 27. Outline
    – Overview
    – Installation and setup
    – Main steps
    – Nutch 2.x
    – Future developments
  • 28. NUTCH 2.x
    – 2.0 released in July 2012; 2.2.1 in July 2013
    – Same core features as 1.x: MapReduce, Tika, delegation to SOLR, etc.
    – Moved to a 'big table'-like architecture, following the wealth of NoSQL projects of the last few years
    – Abstraction over the storage layer → Apache GORA
  • 29. Apache GORA (http://gora.apache.org/)
    – ORM for NoSQL databases, with limited SQL support + file-based storage
    – Current version 0.3
    – DataStore implementations: Accumulo, Cassandra, HBase, Avro, DynamoDB, SQL (broken)
    – Serialization with Apache AVRO
    – Object-to-datastore mappings (backend-specific)
  • 30. AVRO schema => Java code
        {"name": "WebPage",
         "type": "record",
         "namespace": "org.apache.nutch.storage",
         "fields": [
            {"name": "baseUrl", "type": ["null", "string"]},
            {"name": "status", "type": "int"},
            {"name": "fetchTime", "type": "long"},
            {"name": "prevFetchTime", "type": "long"},
            {"name": "fetchInterval", "type": "int"},
            {"name": "retriesSinceFetch", "type": "int"},
            {"name": "modifiedTime", "type": "long"},
            {"name": "protocolStatus", "type": {
                "name": "ProtocolStatus",
                "type": "record",
                "namespace": "org.apache.nutch.storage",
                "fields": [
                   {"name": "code", "type": "int"},
                   {"name": "args", "type": {"type": "array", "items": "string"}},
                   {"name": "lastModified", "type": "long"}
                ]
            }},
            […]
  • 31. Mapping file (backend-specific; here HBase)
        <gora-orm>
          <table name="webpage">
            <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
            <family name="f" maxVersions="1"/>
            <family name="s" maxVersions="1"/>
            <family name="il" maxVersions="1"/>
            <family name="ol" maxVersions="1"/>
            <family name="h" maxVersions="1"/>
            <family name="mtdt" maxVersions="1"/>
            <family name="mk" maxVersions="1"/>
          </table>
          <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
            <!-- fetch fields -->
            <field name="baseUrl" family="f" qualifier="bas"/>
            <field name="status" family="f" qualifier="st"/>
            <field name="prevFetchTime" family="f" qualifier="pts"/>
            <field name="fetchTime" family="f" qualifier="ts"/>
            <field name="fetchInterval" family="f" qualifier="fi"/>
            <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  • 32. DataStore operations
    – Basic operations: get(K key), put(K key, T obj), delete(K key)
    – Querying: execute(Query<K, T> query) → Result<K, T>; deleteByQuery(Query<K, T> query)
    – Wrappers for Apache Hadoop: GoraInputFormat | GoraOutputFormat, GoraRecordReader | GoraRecordWriter, GoraMapper | GoraReducer
    (a usage sketch follows)
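    To make the API concrete, a minimal sketch in Java (assumptions: GORA 0.3 class names, a generated WebPage with getStatus(), and Nutch 2.x's reversed-URL row keys; error handling omitted):

        import org.apache.gora.query.Query;
        import org.apache.gora.query.Result;
        import org.apache.gora.store.DataStore;
        import org.apache.gora.store.DataStoreFactory;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.nutch.storage.WebPage;

        public class WebPageScan {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Backend (HBase, Cassandra, ...) is selected via gora.properties
            DataStore<String, WebPage> store =
                DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

            WebPage single = store.get("com.example.www:http/");  // lookup by key
            Query<String, WebPage> query = store.newQuery();
            query.setStartKey("com.example");                     // range scan over one domain
            Result<String, WebPage> result = store.execute(query);
            while (result.next()) {                               // iterate over matching rows
              System.out.println(result.getKey() + " -> " + result.get().getStatus());
            }
            store.close();
          }
        }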
  • 33. GORA in Nutch
    – AVRO schema provided and Java code pre-generated
    – Mapping files provided for backends; can be modified if necessary
    – Need to rebuild to get the dependencies for the chosen backend, hence the source-only distribution of Nutch 2.x (see the configuration sketch below)
    – http://wiki.apache.org/nutch/Nutch2Tutorial
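    Following the Nutch2Tutorial, selecting the HBase backend comes down to something like this (a sketch: key names as I recall them from the tutorial, to be checked against your release):

        # conf/gora.properties
        gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

        <!-- conf/nutch-site.xml -->
        <property>
          <name>storage.data.store.class</name>
          <value>org.apache.gora.hbase.store.HBaseStore</value>
        </property>

    plus uncommenting the gora-hbase dependency in ivy/ivy.xml before running 'ant runtime'.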
  • 34. Benefits
    – Storage still distributed and replicated...
    – ... but one big table: status, metadata, content, text → one place; no more segments
    – Resumable fetch and parse steps
    – Easier interaction with other resources: third-party code just needs to use GORA and the schema
    – Simplifies the Nutch code
    – Potentially faster (e.g. the update step)
  • 35. Drawbacks
    – More stuff to install and configure; higher hardware requirements
    – Current performance :-( (http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html)
      – N2 + HBase: 2.7x slower than 1.x
      – N2 + Cassandra: 4.4x slower than 1.x
      – due mostly to the GORA layer, not inherent to HBase or Cassandra; https://issues.apache.org/jira/browse/GORA-119 → filtered scans
      – not all backends provide data locality!
    – Not as stable as Nutch 1.x
  • 36. 2.x work in progress
    – Stabilise backend implementations: GORA-HBase the most reliable
    – Synchronise features with 1.x: e.g. missing LinkRank equivalent (GSOC 2013: use Apache Giraph); no pluggable indexers yet (NUTCH-1568)
    – Filter-enabled scans (GORA-119) => no need to de-serialize the whole dataset
  • 37. Outline
    – Overview
    – Installation and setup
    – Main steps
    – Nutch 2.x
    – Future developments
  • 38. Future
    – 1.x and 2.x to coexist in parallel: 2.x is not yet a replacement for 1.x
    – New functionality: support for SOLRCloud; sitemaps (from the CrawlerCommons library); canonical tag; generic deduplication (NUTCH-656)
    – Move to the new MapReduce API, to use Nutch on Hadoop 2.x
  • 39. More delegation
    – A great deal done in recent years (SOLR, Tika)
    – Share code with crawler-commons (http://code.google.com/p/crawler-commons/): fetcher / protocol handling; URL normalisation / filtering
    – Move PageRank-like computations to a graph library (Apache Giraph): should be more efficient + less code to maintain
  • 40. Longer term
    – Hadoop 2.x & YARN
    – Convergence of batch and streaming: Storm / Samza / Storm-YARN / ...
    – End of 100% batch operations? Fetch and parse as streaming? Always be fetching; generate / update / pagerank remain batch
    – See https://github.com/DigitalPebble/storm-crawler
  • 41. Where to find out more?
    – Project page: http://nutch.apache.org/
    – Wiki: http://wiki.apache.org/nutch/
    – Mailing lists: user@nutch.apache.org, dev@nutch.apache.org
    – Chapter in 'Hadoop: The Definitive Guide' (T. White); understanding Hadoop is essential anyway...
    – Support / consulting: http://wiki.apache.org/nutch/Support
  • 42. Questions?