This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will focus on the latest developments in Nutch, the differences between the 1.x and 2.x branches, and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point for crawling on a large scale with Apache Nutch and SOLR.
4. Nutch?
“Distributed framework for large scale web crawling”
(but does not have to be large scale at all)
Apache TLP since May 2010
Based on Apache Hadoop
Indexing and search delegated to Apache SOLR
5. A bit of history
2002/2003 : Started by Doug Cutting & Mike Cafarella
2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache
2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache
May 2010 : TLP project at Apache
Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache
7. Why use Nutch?
Usual reasons
– Open source with a business-friendly license, mature, community, ...
Scalability
– Tried and tested on very large scale
– Standard Hadoop
Features
– Index with SOLR / ES / CloudSearch
– PageRank implementation
– Loads of existing plugins
– Can easily be extended / customised
8. Use cases
Crawl for search
– Generic or vertical
– Index and search with SOLR et al.
– Single node to large clusters on Cloud
… but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML
with
– MAHOUT / UIMA / GATE
– Use Behemoth as glueware
(https://github.com/DigitalPebble/behemoth)
9. Customer cases
Two ends of the spectrum, specificity (verticality) vs size:
BetterJobs.com (CareerBuilder) → specificity
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~2M pages total
– Feeds SOLR index
SimilarPages.com → size
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved
10. CommonCrawl
http://commoncrawl.org/
Open repository of web crawl data
2012 dataset : 3.83 billion docs
ARC files on Amazon S3
Using Nutch 1.7
A few modifications to Nutch code
– https://github.com/Aloisius/nutch
Next release imminent
12. Installation
http://nutch.apache.org/downloads.html
1.7 => src and bin distributions
2.2.1 => src only
'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts
Binary distribution for 1.x == runtime/local
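A minimal build sketch, assuming the source distribution has been unpacked and is the current directory:
ant clean runtime
ls runtime/local   # local mode : test and debug
ls runtime/deploy  # job jar + scripts for Hadoop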
13. Configuration and resources
Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf
Specify configuration in nutch-site.xml
– Leave nutch-default alone!
At least :
<property>
<name>http.agent.name</name>
<value>WhateverNameDescribesMyMightyCrawler</value>
</property>
14. Running it!
bin/crawl script : typical sequence of steps
bin/nutch : individual Nutch commands
– Inject / generate / fetch / parse / update ….
Local mode : great for testing and debugging
Recommended : deploy + Hadoop in (pseudo-)distributed mode
– Parallelism
– MapReduce UI to monitor crawl, check logs, counters
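An illustrative all-in-one invocation (1.7-style arguments: seed dir, crawl dir, SOLR URL, number of rounds; the URL and paths are placeholders):
bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2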
18. Typical Nutch Steps
Same in 1.x and 2.x
Sequence of batch operations
1) Inject → populates CrawlDB from seed list
2) Generate → selects URLs to fetch into a segment
3) Fetch → fetches URLs from the segment
4) Parse → parses content (text + metadata)
5) UpdateDB → updates CrawlDB (new URLs, new status...)
6) InvertLinks → builds the web graph
7) Index → sends docs to [SOLR | ES | CloudSearch | … ]
Repeat steps 2 to 7
Or use the all-in-one crawl script (one round with individual commands is sketched below)
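A sketch of one round with individual commands (1.x layout; the segment variable and the -topN value are illustrative):
bin/nutch inject crawl/crawldb urls/
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s=`ls -d crawl/segments/2* | tail -1`  # newest segment
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s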
19. Main steps from a data perspective
[Diagram] Seed list → CrawlDB → Segment(s) → LinkDB
Each segment holds the data of one fetch round:
– crawl_generate/
– crawl_fetch/
– content/
– crawl_parse/
– parse_data/
– parse_text/
20. Frontier expansion
Manual “discovery”
– Adding new URLs by hand, “seeding”
Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful → control needed
– Requires content parsing and link extraction
[Diagram: the frontier expanding outwards from the seed over iterations i=1, 2, 3]
[Slide courtesy of A. Bialecki]
21. An extensible framework
Plugins
– Activated with the parameter 'plugin.includes' (example after the endpoint list)
– Implement one or more endpoints
Endpoints
– Protocol
– Parser
– HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
– IndexWriter (NEW IN 1.7!)
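For example, plugin selection in nutch-site.xml (the value below is close to the 1.x default; treat it as an illustration, not the exact shipped list):
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>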
22. Features
Fetcher
– Multi-threaded fetcher
– Queues URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive (see the config example after these lists)
Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom ScoringFilters
Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
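For instance, politeness can be tuned in nutch-site.xml (property names from the 1.x defaults; the values shown are the usual polite defaults):
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
</property>
<property>
<name>fetcher.server.delay</name>
<value>5.0</value>
</property>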
24. Features (cont.)
Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well
Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata
Pluggable indexing
– SOLR | ES etc...
25. Indexing
Apache SOLR
– schema.xml shipped in conf/ (see the wiring sketch below)
– SOLR 3.4
– JIRA issue for SOLRCloud
• https://issues.apache.org/jira/browse/NUTCH-1377
ElasticSearch
– Version 0.90.1
AWS CloudSearch
– WIP : https://issues.apache.org/jira/browse/NUTCH-1517
Easy to build your own
– Text, DB, etc...
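Typical SOLR wiring, assuming a stock SOLR 3.x example setup (paths are placeholders):
cp conf/schema.xml $SOLR_HOME/example/solr/conf/
# restart SOLR, then push documents from Nutch :
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/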
26. Typical Nutch document
Some of the fields (IndexingFilters in plugins or core code)
– url
– content
– title
– anchor
– site
– boost
– digest
– segment
– host
– type
Configurable ones
– meta tags (keywords, description etc...)
– arbitrary metadata
28. NUTCH 2.x
2.0 released in July 2012
2.2.1 in July 2013
Same core features as 1.x
– MapReduce, Tika, delegation to SOLR, etc...
Moved to 'big table'-like architecture
– Wealth of NoSQL projects in last few years
Abstraction over storage layer → Apache GORA
29. Apache GORA
http://gora.apache.org/
ORM for NoSQL databases
– and limited SQL support + file based storage
Current version 0.3
DataStore implementations:
– Accumulo
– Cassandra
– HBase
– Avro
– DynamoDB
– SQL (broken)
Serialization with Apache AVRO
Object-to-datastore mappings (backend-specific)
33. GORA in Nutch
AVRO schema provided and Java code pre-generated
Mapping files provided for backends
– can be modified if necessary
Need to rebuild to get dependencies for backend
– hence source only distribution of Nutch 2.x
http://wiki.apache.org/nutch/Nutch2Tutorial
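For example, selecting the HBase backend (as in the Nutch2Tutorial): enable the gora-hbase dependency in ivy/ivy.xml, rebuild, and set the store class in nutch-site.xml:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>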
34. Benefits
Storage still distributed and replicated
… but one big table
– status, metadata, content, text → one place
– no more segments
Resume-able fetch and parse steps
Easier interaction with other resources
– Third-party code just needs to use GORA and the schema
Simplifies the Nutch code
Potentially faster (e.g. update step)
35. Drawbacks
More stuff to install and configure
– Higher hardware requirements
Current performance :-(
– http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
– N2 + HBase : 2.7x slower than 1.x
– N2 + Cassandra : 4.4x slower than 1.x
– Due mostly to the GORA layer, not inherent to HBase or Cassandra
– https://issues.apache.org/jira/browse/GORA-119 → filtered scans
Not all backends provide data locality!
Not as stable as Nutch 1.x
36. 2.x Work in progress
Stabilise backend implementations
– GORA-HBase the most reliable
Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSOC 2013 – use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)
Filter-enabled scans
– GORA-119 → no need to de-serialize the whole dataset
38. Future
1.x and 2.x to coexist in parallel
– 2.x is not yet a replacement for 1.x
New functionalities
– Support for SOLRCloud
– Sitemaps (via the CrawlerCommons library)
– Canonical tag
– Generic deduplication (NUTCH-656)
Move to new MapReduce API
– Use Nutch on Hadoop 2.x
39. More delegation
A great deal already delegated in recent years (SOLR, Tika)
Share code with crawler-commons
(http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– URL normalisation / filtering
PageRank-like computations to graph library
– Apache Giraph
– Should be more efficient + less code to maintain
40. Longer term
Hadoop 2.x & YARN
Convergence of batch and streaming
– Storm / Samza / Storm-YARN / …
End of 100% batch operations ?
– Fetch and parse as streaming ?
– Always be fetching
– Generate / update / pagerank remain batch
See https://github.com/DigitalPebble/storm-crawler
41. Where to find out more?
Project page : http://nutch.apache.org/
Wiki : http://wiki.apache.org/nutch/
Mailing lists :
– user@nutch.apache.org
– dev@nutch.apache.org
Chapter in 'Hadoop the Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...
Support / consulting :
– http://wiki.apache.org/nutch/Support
Speaker note: I say “crawling” rather than “web search” → Nutch is used not only for search; it used to handle indexing and search itself with Lucene but now delegates this to SOLR.
Speaker note: endpoints are called in various places; URL filters and normalizers in many of them, and the same goes for scoring filters.
Speaker note: these are the main steps in Nutch; more actions are available, and the shell commands are wrappers around Hadoop commands.