Large Scale Crawling with Apache Nutch and friends...

Julien Nioche
julien@digitalpebble.com
LUCENE/SOLR REVOLUTION EU 2013
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning

 Strong focus on Open Source & Apache ecosystem
 VP Apache Nutch
 User | Contributor | Committer
– Tika
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth

2 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

3 / 43
Nutch?
 “Distributed framework for large scale web crawling”
(but does not have to be large scale at all)

 Apache TLP since May 2010
 Based on Apache Hadoop

 Indexing and Search by

4 / 43
A bit of history
 2002/2003 : Started By Doug Cutting & Mike Caffarella
 2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache

 2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache

 May 2010 : TLP project at Apache
 Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache

5 / 43
Recent Releases

[Timeline figure, 06/09 to 06/13. trunk: releases 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7. 2.x: releases 2.0, 2.1, 2.2.1]

6 / 43
Why use Nutch?
 Usual reasons
– Open source with a business-friendly license, mature, community, ...

 Scalability
– Tried and tested on very large scale
– Standard Hadoop

 Features
– Index with SOLR / ES / CloudSearch
– PageRank implementation
– Loads of existing plugins
– Can easily be extended / customised

7 / 43
Use cases
 Crawl for search
– Generic or vertical
– Index and Search with SOLR et al.
– Single node to large clusters on Cloud

 … but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML

 with
– MAHOUT / UIMA / GATE
– Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)

8 / 43
Customer cases
[Chart: customers positioned by Specificity (Verticality) against Size]

BetterJobs.com (CareerBuilder)
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~2M pages total
– Feeds SOLR index

SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved

9 / 43
CommonCrawl
http://commoncrawl.org/
 Open repository of web crawl data
 2012 dataset : 3.83 billion docs
 ARC files on Amazon S3
 Using Nutch 1.7
 A few modifications to Nutch code
– https://github.com/Aloisius/nutch

 Next release imminent
10 / 43
Installation
 http://nutch.apache.org/downloads.html
 1.7 => src and bin distributions
 2.2.1 => src only
 'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts

 Binary distribution for 1.x == runtime/local

12 / 43
Configuration and resources
 Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf

 Specify configuration in nutch-site.xml
– Leave nutch-default alone!

 At least :
<property>
  <name>http.agent.name</name>
  <value>WhateverNameDescribesMyMightyCrawler</value>
</property>
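
As a minimal sketch of the override mechanism above, the following Python writes a nutch-site.xml that holds only http.agent.name and reads it back. The helper names (write_nutch_site, read_properties), the temporary output path and the agent value are illustrative assumptions; only the configuration/property/name/value layout comes from the Hadoop-style configuration format that Nutch uses.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_nutch_site(path, overrides):
    """Write a Hadoop-style configuration file containing only the overrides.

    nutch-default.xml is left untouched; site-specific values live here.
    """
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def read_properties(path):
    """Read the name/value pairs back, e.g. to check what was written."""
    root = ET.parse(path).getroot()
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


# Hypothetical location and agent name, for illustration only.
site_path = os.path.join(tempfile.mkdtemp(), "nutch-site.xml")
write_nutch_site(site_path, {"http.agent.name": "WhateverNameDescribesMyMightyCrawler"})
site_properties = read_properties(site_path)
print(site_properties)
```

In local mode the resulting file would go in runtime/local/conf; in deploy mode the job jar has to be rebuilt after the change, as noted above.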

13 / 43
Running it!
 bin/crawl script : typical sequence of steps
 bin/nutch : individual Nutch commands
– Inject / generate / fetch / parse / update ….

 Local mode : great for testing and debugging
 Recommended : deploy + Hadoop (pseudo-)distributed mode
– Parallelism
– MapReduce UI to monitor crawl, check logs, counters

14 / 43
Monitor Crawl with MapReduce UI

15 / 43
Counters and logs

16 / 43
Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations
1) Inject → populates CrawlDB from seed list
2) Generate → selects URLs to fetch in segment
3) Fetch → fetches URLs from segment
4) Parse → parses content (text + metadata)
5) UpdateDB → updates CrawlDB (new URLs, new status...)
6) InvertLinks → builds Webgraph
7) Index → sends docs to [SOLR | ES | CloudSearch | … ]

 Repeat steps 2 to 7
 Or use the all-in-one crawl script
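
The batch cycle can be sketched in a few lines of Python. This is a toy, in-memory illustration and not Nutch code: TOY_WEB stands in for the network, the CrawlDB is a plain dict, the status names are made up, and the InvertLinks and Index steps are left out.

```python
# Toy, in-memory illustration of the batch cycle; this is not Nutch code.
# TOY_WEB stands in for the network: url -> outlinks found on that page.
TOY_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def inject(crawldb, seeds):
    """Step 1: add seed URLs to the CrawlDB as unfetched."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Step 2: select up to top_n unfetched URLs for the next segment."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]


def fetch(segment):
    """Step 3: fetch each URL in the segment (here: look it up in TOY_WEB)."""
    return {url: TOY_WEB.get(url) for url in segment}


def parse(fetched):
    """Step 4: keep the outlinks of the pages that could be fetched."""
    return {url: links for url, links in fetched.items() if links is not None}


def update_db(crawldb, segment, parsed):
    """Step 5: record new statuses and add newly discovered URLs."""
    for url in segment:
        crawldb[url] = "fetched" if url in parsed else "gone"
    for links in parsed.values():
        for link in links:
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, rounds, top_n=10):
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):  # repeat generate -> fetch -> parse -> update
        segment = generate(crawldb, top_n)
        if not segment:
            break
        update_db(crawldb, segment, parse(fetch(segment)))
    return crawldb


crawldb = crawl(["http://a.example/"], rounds=2)
for url, status in sorted(crawldb.items()):
    print(status, url)
```

Each round only touches the URLs selected by generate, which is why the real steps can run as independent batch jobs.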
18 / 43
Main steps from a data perspective
[Diagram: data flow between the Seed List, the CrawlDB, a Segment and the LinkDB. A segment contains the subdirectories crawl_generate/, crawl_fetch/, content/, crawl_parse/, parse_data/ and parse_text/]

19 / 43
Frontier expansion
 Manual “discovery”
– Adding new URLs by hand, “seeding”

 Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful, so control is needed
– Requires content parsing and link extraction

[Figure: frontier growing outward from the seed over iterations i=1, i=2, i=3]

[Slide courtesy of A. Bialecki]

20 / 43
An extensible framework
 Plugins
– Activated with parameter 'plugin.includes'
– Implement one or more endpoints

 Endpoints
– Protocol
– Parser
– HtmlParseFilter (a.k.a ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
– IndexWriter (NEW IN 1.7!)
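
To illustrate how an endpoint such as URLFilter behaves, here is a Python sketch of a filter chain in which each filter returns the URL to keep it or None to reject it. The classes, the REGISTRY dict and the way plugin_includes is matched are hypothetical stand-ins written for this example; they only mimic the idea of activating plugins through a 'plugin.includes' pattern and are not the Nutch plugin API.

```python
import re
from urllib.parse import urlsplit


class SuffixFilter:
    """Rejects URLs whose path ends with one of the given suffixes."""

    def __init__(self, suffixes):
        self.suffixes = tuple(suffixes)

    def filter(self, url):
        return None if urlsplit(url).path.lower().endswith(self.suffixes) else url


class RegexFilter:
    """Accepts a URL only if it matches the given pattern."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)

    def filter(self, url):
        return url if self.pattern.search(url) else None


def apply_filters(filters, url):
    """Run the URL through each active filter; None means 'rejected'."""
    for f in filters:
        url = f.filter(url)
        if url is None:
            return None
    return url


# Hypothetical registry and setting, loosely mimicking 'plugin.includes'.
REGISTRY = {
    "urlfilter-suffix": SuffixFilter([".jpg", ".png", ".css"]),
    "urlfilter-regex": RegexFilter(r"^https?://([a-z0-9-]+\.)*example\.org/"),
}
plugin_includes = r"urlfilter-(suffix|regex)"
active = [p for name, p in sorted(REGISTRY.items()) if re.fullmatch(plugin_includes, name)]

kept = apply_filters(active, "http://www.example.org/index.html")
dropped_image = apply_filters(active, "http://www.example.org/logo.PNG")
dropped_host = apply_filters(active, "http://other.example.com/")
print(kept, dropped_image, dropped_host)
```

Because every filter shares the same one-method contract, new behaviour can be added by registering another class rather than by changing the crawl steps.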

21 / 43
Features
 Fetcher
– Multi-threaded fetcher
– Queues URLs per hostname / domain / IP
– Limit the number of URLs per round of fetching
– Default values are polite but can be made more aggressive

 Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom ScoringFilters

 Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
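
The per-host queueing used by the fetcher can be sketched as follows. This is a simplified, single-threaded simulation under stated assumptions: URLs are grouped by hostname only, the 5-second delay is an illustrative value, the clock is simulated and nothing is actually fetched. The function names are invented for this example.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def build_queues(urls):
    """Group URLs into one FIFO queue per hostname."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname].append(url)
    return queues


def schedule(urls, delay=5.0):
    """Return (time, url) pairs so that a host is hit at most once per 'delay' seconds.

    Simulated clock only: nothing is fetched and nothing sleeps.
    """
    queues = build_queues(urls)
    next_allowed = {host: 0.0 for host in queues}
    plan = []
    while any(queues.values()):
        # Pick the non-empty queue whose host becomes available first.
        host = min((h for h in queues if queues[h]), key=lambda h: (next_allowed[h], h))
        when = next_allowed[host]
        plan.append((when, queues[host].popleft()))
        next_allowed[host] = when + delay
    return plan


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
]
plan = schedule(urls, delay=5.0)
for when, url in plan:
    print(f"t={when:4.1f}s  {url}")
```

Lowering the delay makes the crawl more aggressive towards each host; crawling many hosts in parallel is what keeps throughput high while staying polite.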

22 / 43
Features (cont.)
 Protocols
– Http, file, ftp, https
– Respects robots.txt directives

 Scheduling
– Fixed or adaptive

 URL filters
– Regex, FSA, TLD, prefix, suffix

 URL normalisers
– Default, regex
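
The difference between fixed and adaptive scheduling can be shown with a small Python sketch. It assumes a simplified rule: the re-fetch interval shrinks when a page has changed and grows when it has not, within bounds. The rates, bounds and function names are illustrative assumptions, not values taken from the slides or a statement of Nutch's exact implementation.

```python
def fixed_interval(interval, changed):
    """Fixed scheduling: the re-fetch interval never changes."""
    return interval


def next_interval(interval, changed, inc_rate=0.4, dec_rate=0.2,
                  min_interval=60.0, max_interval=30 * 24 * 3600.0):
    """Adaptive scheduling (simplified): re-fetch changing pages more often.

    All rates and bounds are assumed values for illustration; times in seconds.
    """
    if changed:
        interval *= (1.0 - dec_rate)   # page changed: come back sooner
    else:
        interval *= (1.0 + inc_rate)   # page unchanged: back off
    return max(min_interval, min(max_interval, interval))


# Start with a one-day interval and observe: unchanged, unchanged, changed.
interval = 86400.0
for changed in [False, False, True]:
    interval = next_interval(interval, changed)
print(round(interval, 1))
```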

23 / 43
Features (cont.)
 Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well

 Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata

 Pluggable indexing
– SOLR | ES etc...

24 / 43
Indexing
 Apache SOLR
– schema.xml in conf/
– SOLR 3.4
– JIRA issue for SOLRCloud
• https://issues.apache.org/jira/browse/NUTCH-1377

 ElasticSearch
– Version 0.90.1

 AWS CloudSearch
– WIP : https://issues.apache.org/jira/browse/NUTCH-1517

 Easy to build your own
– Text, DB, etc...

25 / 43
Typical Nutch document
 Some of the fields (IndexingFilters in plugins or core code)
– url
– content
– title
– anchor
– site
– boost
– digest
– segment
– host
– type

 Configurable ones
– meta tags (keywords, description etc...)
– arbitrary metadata
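
As a rough picture of what reaches the indexing backend, the sketch below assembles a document with the fields listed above and serialises it as JSON. Every value is made up, and how each field is derived (hostname for both site and host, an MD5 hex digest of the content, a timestamp-like segment name) is an assumption made for this example rather than a description of the actual IndexingFilters.

```python
import hashlib
import json
from urllib.parse import urlsplit


def build_document(url, title, content, anchors, segment, boost=1.0, mime="text/html"):
    """Assemble a dict with the field names listed on the slide (values illustrative)."""
    host = urlsplit(url).hostname
    return {
        "url": url,
        "content": content,
        "title": title,
        "anchor": list(anchors),           # anchor texts of inbound links
        "site": host,                      # assumed: same value as host
        "boost": boost,
        "digest": hashlib.md5(content.encode("utf-8")).hexdigest(),
        "segment": segment,
        "host": host,
        "type": mime,
    }


doc = build_document(
    url="http://nutch.apache.org/",
    title="Apache Nutch",
    content="Highly extensible and scalable open source web crawler",
    anchors=["Nutch", "Apache Nutch home"],
    segment="20131106120000",              # hypothetical segment name
)
payload = json.dumps(doc, indent=2, sort_keys=True)
print(payload)
```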

26 / 43
NUTCH 2.x
 2.0 released in July 2012
 2.2.1 in July 2013
 Features in common with 1.x
– MapReduce, Tika, delegation to SOLR, etc...

 Moved to 'big table'-like architecture
– Wealth of NoSQL projects in last few years

 Abstraction over storage layer → Apache GORA
28 / 43
Apache GORA
 http://gora.apache.org/
 ORM for NoSQL databases
– and limited SQL support + file based storage

 Current version 0.3
 DataStore implementations
● Accumulo
● Cassandra
● HBase
● Avro
● DynamoDB
● SQL (broken)

 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)
29 / 43
AVRO Schema => Java code
{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
   {"name": "baseUrl", "type": ["null", "string"] },
   {"name": "status", "type": "int"},
   {"name": "fetchTime", "type": "long"},
   {"name": "prevFetchTime", "type": "long"},
   {"name": "fetchInterval", "type": "int"},
   {"name": "retriesSinceFetch", "type": "int"},
   {"name": "modifiedTime", "type": "long"},
   {"name": "protocolStatus", "type": {
     "name": "ProtocolStatus",
     "type": "record",
     "namespace": "org.apache.nutch.storage",
     "fields": [
       {"name": "code", "type": "int"},
       {"name": "args", "type": {"type": "array", "items": "string"}},
       {"name": "lastModified", "type": "long"}
     ]
   }},
   […]
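
To give a feel for the schema-to-code step, the Python sketch below parses the fragment shown above and lists, for each field, the accessor name and type one would roughly expect on a generated class. It is not the Avro or GORA compiler: the closing brackets are added only to make the truncated fragment parseable, and the get<FieldName> naming is an assumption made for this example.

```python
import json

# Copy of the fragment shown above; the slide truncates the schema, so the
# closing brackets are added here only to make the fragment parseable.
SCHEMA_JSON = """
{"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage",
 "fields": [
  {"name": "baseUrl", "type": ["null", "string"]},
  {"name": "status", "type": "int"},
  {"name": "fetchTime", "type": "long"},
  {"name": "prevFetchTime", "type": "long"},
  {"name": "fetchInterval", "type": "int"},
  {"name": "retriesSinceFetch", "type": "int"},
  {"name": "modifiedTime", "type": "long"},
  {"name": "protocolStatus", "type": {
    "name": "ProtocolStatus", "type": "record",
    "namespace": "org.apache.nutch.storage",
    "fields": [
      {"name": "code", "type": "int"},
      {"name": "args", "type": {"type": "array", "items": "string"}},
      {"name": "lastModified", "type": "long"}
    ]}}
 ]}
"""


def describe(avro_type):
    """Render an Avro type declaration as a short description."""
    if isinstance(avro_type, str):
        return avro_type
    if isinstance(avro_type, list):  # union, e.g. ["null", "string"]
        return " | ".join(describe(t) for t in avro_type)
    if avro_type["type"] == "array":
        return "array<" + describe(avro_type["items"]) + ">"
    if avro_type["type"] == "record":
        namespace = avro_type.get("namespace")
        return f"{namespace}.{avro_type['name']}" if namespace else avro_type["name"]
    return avro_type["type"]


def accessors(record):
    """Map assumed getter names to field types for one record declaration."""
    return {
        "get" + f["name"][0].upper() + f["name"][1:]: describe(f["type"])
        for f in record["fields"]
    }


schema = json.loads(SCHEMA_JSON)
webpage_accessors = accessors(schema)
status_accessors = accessors(schema["fields"][7]["type"])
for name, type_name in webpage_accessors.items():
    print(f"{type_name} {name}()")
```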

30 / 43
Mapping file (backend specific – HBase)
<gora-orm>
  <table name="webpage">
    <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f" maxVersions="1"/>
    <family name="s" maxVersions="1"/>
    <family name="il" maxVersions="1"/>
    <family name="ol" maxVersions="1"/>
    <family name="h" maxVersions="1"/>
    <family name="mtdt" maxVersions="1"/>
    <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
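
The mapping file says where each object field lives in the backend. The Python sketch below reads a shortened copy of the mapping and resolves each field to a family:qualifier column. The closing tags are added because the slide shows only the start of the file, and load_mapping is a hypothetical helper written for this example, not part of GORA.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the mapping above; the closing tags are added here
# because the slide shows only the beginning of the file.
MAPPING_XML = """
<gora-orm>
  <table name="webpage">
    <family name="f" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
  </class>
</gora-orm>
"""


def load_mapping(xml_text, class_name):
    """Return (table, {field: 'family:qualifier'}) for one mapped class."""
    root = ET.fromstring(xml_text)
    for cls in root.findall("class"):
        if cls.get("name") == class_name:
            columns = {
                f.get("name"): f.get("family") + ":" + f.get("qualifier")
                for f in cls.findall("field")
            }
            return cls.get("table"), columns
    raise KeyError(class_name)


table, columns = load_mapping(MAPPING_XML, "org.apache.nutch.storage.WebPage")
print(table, columns)
```

Short qualifiers such as "ts" keep the stored column names compact, while the object model keeps the readable field names.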

31 / 43
DataStore operations
 Basic operations
– get(K key)
– put(K key, T obj)
– delete(K key)

 Querying
– execute(Query<K, T> query) → Result<K,T>
– deleteByQuery(Query<K, T> query)

 Wrappers for Apache Hadoop
– GoraInput|OutputFormat
– GoraRecordReader|Writer
– GoraMapper|Reducer
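
The shape of these operations can be mimicked with a toy in-memory store. The sketch below is not the GORA API: the class name, the use of a plain predicate function as a "query" and the reversed-host style keys are all assumptions chosen to keep the example short.

```python
class InMemoryDataStore:
    """Toy stand-in for a key/value DataStore; not the GORA API itself."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        """Return the object stored under key, or None."""
        return self._rows.get(key)

    def put(self, key, obj):
        self._rows[key] = obj

    def delete(self, key):
        """Remove the row; returns True if it existed."""
        return self._rows.pop(key, None) is not None

    def execute(self, query):
        """Return the (key, object) pairs matching the query, ordered by key."""
        return [(k, v) for k, v in sorted(self._rows.items()) if query(k, v)]

    def delete_by_query(self, query):
        """Delete every matching row and return how many were removed."""
        keys = [k for k, _ in self.execute(query)]
        for k in keys:
            del self._rows[k]
        return len(keys)


store = InMemoryDataStore()
# Illustrative keys and values only.
store.put("org.example.a:http/", {"status": 1})
store.put("org.example.a:http/about", {"status": 2})
store.put("org.example.b:http/", {"status": 1})

hits = store.execute(lambda k, v: k.startswith("org.example.a:"))
removed = store.delete_by_query(lambda k, v: v["status"] == 1)
remaining = store.execute(lambda k, v: True)
print(len(hits), removed, remaining)
```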

32 / 43
GORA in Nutch
 AVRO schema provided and java code pre-generated
 Mapping files provided for backends
– can be modified if necessary
 Need to rebuild to get dependencies for backend
– hence source only distribution of Nutch 2.x
 http://wiki.apache.org/nutch/Nutch2Tutorial

33 / 43
Benefits
 Storage still distributed and replicated
 … but one big table
– status, metadata, content, text → one place
– no more segments

 Resume-able fetch and parse steps
 Easier interaction with other resources
– Third-party code just needs to use GORA and schema

 Simplify the Nutch code
 Potentially faster (e.g. update step)
34 / 43
Drawbacks
 More stuff to install and configure
– Higher hardware requirements

 Current performance :-(
– http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
– N2+HBase : 2.7x slower than 1.x
– N2+Cassandra : 4.4x slower than 1.x
– due mostly to GORA layer : not inherent to HBase or Cassandra
– https://issues.apache.org/jira/browse/GORA-119 → filtered scans
– Not all backends provide data locality!

 Not as stable as Nutch 1.x

35 / 43
2.x Work in progress
 Stabilise backend implementations
– GORA-HBase most reliable

 Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSOC 2013 – use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)

 Filter enabled scans
– GORA-119
• => don't need to de-serialize the whole dataset

36 / 43
Future
 1.x and 2.x to coexist in parallel
– 2.x not yet a replacement of 1.x

 New functionalities
– Support for SOLRCloud
– Sitemap (from CrawlerCommons library)
– Canonical tag
– Generic deduplication (NUTCH-656)

 Move to new MapReduce API
– Use Nutch on Hadoop 2.x

38 / 43
More delegation
 Great deal done in recent years (SOLR, Tika)
 Share code with crawler-commons (http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– URL normalisation / filtering

 PageRank-like computations to graph library
– Apache Giraph
– Should be more efficient + less code to maintain

39 / 43
Longer term
 Hadoop 2.x & YARN
 Convergence of batch and streaming
– Storm / Samza / Storm-YARN / …

 End of 100% batch operations ?
– Fetch and parse as streaming ?
– Always be fetching
– Generate / update / pagerank remain batch

 See https://github.com/DigitalPebble/storm-crawler
40 / 43
Where to find out more?
 Project page : http://nutch.apache.org/
 Wiki : http://wiki.apache.org/nutch/
 Mailing lists :
– user@nutch.apache.org
– dev@nutch.apache.org

 Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...

 Support / consulting :
– http://wiki.apache.org/nutch/Support

41 / 43
Questions?
42 / 43
43 / 43

More Related Content

What's hot

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Chris Mattmann
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Mike Frampton
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wildJulien Nioche
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSSaumitra Srivastav
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaKnoldus Inc.
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solrMike Frampton
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosJoe Stein
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Sameer Tiwari
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
Making Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosJoe Stein
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN Jim Dowling
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomyDongmin Yu
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache MesosJoe Stein
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearchpmanvi
 

What's hot (20)

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Web scraping with nutch solr part 2
Web scraping with nutch solr part 2
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wild
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFS
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solr
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache Mesos
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Making Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache Mesos
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache Mesos
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 

Viewers also liked

Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...hannonhill
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platformmteutelink
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01David Smiley
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache TikaPaolo Mottadelli
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content TransformationAlfresco Software
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDavid Gil Sánchez
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaPaolo Mottadelli
 
Mejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrMejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrIván Campaña Naranjo
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolToni de la Fuente
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationAlfresco Software
 

Viewers also liked (20)

Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search engine
Search engineSearch engine
Search engine
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsqueda
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
Mejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrMejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache Solr
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en español
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
 
Introducción a Solr
Introducción a SolrIntroducción a Solr
Introducción a Solr
 
Conferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo SolrConferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo Solr
 

Similar to Large Scale Crawling with Apache Nutch and Friends

Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormJulien Nioche
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptxAndrew Lamb
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraJoe Stein
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Design and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web CrawlerDesign and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web CrawlerGeorge Ang
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Oleksiy Panchenko
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...DataStax Academy
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Leveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot ArchitecturesLeveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot ArchitecturesThanigai Vellore
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperRahul Jain
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin
 
Digital Pebble Behemoth
Digital Pebble BehemothDigital Pebble Behemoth
Digital Pebble BehemothSteve Loughran
 
Introducing Node.js in an Oracle technology environment (including hands-on)
Introducing Node.js in an Oracle technology environment (including hands-on)Introducing Node.js in an Oracle technology environment (including hands-on)
Introducing Node.js in an Oracle technology environment (including hands-on)Lucas Jellema
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OOVirtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OOPaolo Cristofaro
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 

Large Scale Crawling with Apache Nutch and Friends

  • 1. Large Scale Crawling with Apache Nutch and friends... Julien Nioche, julien@digitalpebble.com. LUCENE/SOLR REVOLUTION EU 2013
  • 2. About myself
      - DigitalPebble Ltd, Bristol (UK)
      - Specialised in Text Engineering
          - Web Crawling
          - Natural Language Processing
          - Information Retrieval
          - Machine Learning
      - Strong focus on Open Source & Apache ecosystem
      - VP Apache Nutch
      - User | Contributor | Committer
          - Tika
          - SOLR, Lucene
          - GATE, UIMA
          - Mahout
          - Behemoth
  • 3. Outline
      - Overview
      - Installation and setup
      - Main steps
      - Nutch 2.x
      - Future developments
  • 4. Nutch?
      - “Distributed framework for large scale web crawling” (but does not have to be large scale at all)
      - Apache TLP since May 2010
      - Based on Apache Hadoop
      - Indexing and Search by [logo image]
  • 5. A bit of history
      - 2002/2003: Started by Doug Cutting & Mike Cafarella
      - 2005: MapReduce implementation in Nutch
          - 2006: Hadoop sub-project of Lucene @Apache
      - 2006/7: Parser and MimeType in Tika
          - 2008: Tika sub-project of Lucene @Apache
      - May 2010: TLP project at Apache
      - Sept 2010: Storage abstraction in Nutch 2.x
          - 2012: Gora TLP @Apache
  • 6. Recent Releases (timeline figure, 06/09 to 06/13)
      - trunk: 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7
      - 2.x: 2.0, 2.1, 2.2.1
  • 7. Why use Nutch?
      - Usual reasons
          - Open source with a business-friendly license, mature, community, ...
      - Scalability
          - Tried and tested on very large scale
          - Standard Hadoop
      - Features
          - Index with SOLR / ES / CloudSearch
          - PageRank implementation
          - Loads of existing plugins
          - Can easily be extended / customised
  • 8. Use cases
      - Crawl for search
          - Generic or vertical
          - Index and Search with SOLR et al.
          - Single node to large clusters on Cloud
      - … but also
          - Data Mining
          - NLP (e.g. Sentiment Analysis)
          - ML
      - with
          - MAHOUT / UIMA / GATE
          - Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)
  • 9. Customer cases (chart axes: Specificity (Verticality) and Size)
      - BetterJobs.com (CareerBuilder)
          - Single server
          - Aggregates content from job portals
          - Extracts and normalizes structure (description, requirements, locations)
          - ~2M pages total
          - Feeds SOLR index
      - SimilarPages.com
          - Large cluster on Amazon EC2 (up to 400 nodes)
          - Fetched & parsed 3 billion pages
          - 10+ billion pages in crawlDB (~100TB data)
          - 200+ million lists of similarities
          - No indexing / search involved
  • 10. CommonCrawl
      - http://commoncrawl.org/
      - Open repository of web crawl data
      - 2012 dataset: 3.83 billion docs
      - ARC files on Amazon S3
      - Using Nutch 1.7
      - A few modifications to Nutch code
          - https://github.com/Aloisius/nutch
      - Next release imminent
  • 11. Outline
      - Overview
      - Installation and setup
      - Main steps
      - Nutch 2.x
      - Future developments
  • 12. Installation
      - http://nutch.apache.org/downloads.html
      - 1.7 => src and bin distributions
      - 2.2.1 => src only
      - 'ant clean runtime'
          - runtime/local => local mode (test and debug)
          - runtime/deploy => job jar for Hadoop + scripts
      - Binary distribution for 1.x == runtime/local
  • 13. Configuration and resources
      - Changes in $NUTCH_HOME/conf
          - Need recompiling with 'ant runtime'
          - Local mode => can be made directly in runtime/local/conf
      - Specify configuration in nutch-site.xml
          - Leave nutch-default alone!
      - At least:
          <property>
            <name>http.agent.name</name>
            <value>WhateverNameDescribesMyMightyCrawler</value>
          </property>
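As a minimal sketch of the configuration step above, the following Python snippet builds the text of a `nutch-site.xml` override containing only `http.agent.name`. The helper name `build_nutch_site` is hypothetical and not part of Nutch; the only assumption about the file format is the Hadoop-style `<configuration>` root wrapping `<property>` entries.

```python
import xml.etree.ElementTree as ET


def build_nutch_site(properties):
    """Return the XML text of a Hadoop-style configuration file.

    `properties` maps property names to values. This helper is purely
    illustrative; Nutch itself only reads the resulting file.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# The agent name is the value shown on the slide.
xml_text = build_nutch_site(
    {"http.agent.name": "WhateverNameDescribesMyMightyCrawler"}
)
print(xml_text)
```

The output would be saved as `conf/nutch-site.xml` (or `runtime/local/conf/nutch-site.xml` in local mode), leaving the defaults file untouched.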
  • 14. Running it!
      - bin/crawl script: typical sequence of steps
      - bin/nutch: individual Nutch commands
          - Inject / generate / fetch / parse / update ...
      - Local mode: great for testing and debugging
      - Recommended: deploy + Hadoop (pseudo-)distributed mode
          - Parallelism
          - MapReduce UI to monitor crawl, check logs, counters
  • 15. Monitor Crawl with MapReduce UI
  • 17. Outline
      - Overview
      - Installation and setup
      - Main steps
      - Nutch 2.x
      - Future developments
  • 18. Typical Nutch Steps
      - Same in 1.x and 2.x
      - Sequence of batch operations
          1) Inject → populates CrawlDB from seed list
          2) Generate → selects URLs to fetch in segment
          3) Fetch → fetches URLs from segment
          4) Parse → parses content (text + metadata)
          5) UpdateDB → updates CrawlDB (new URLs, new status...)
          6) InvertLinks → builds Webgraph
          7) Index → sends docs to [SOLR | ES | CloudSearch | …]
      - Repeat steps 2 to 7
      - Or use the all-in-one crawl script
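The batch cycle above can be illustrated with a toy in-memory model. This is only a sketch of the control flow, not the Nutch API: the link graph `TOY_WEB`, the status strings and every function name are hypothetical, fetch and parse are merged into one step, and the InvertLinks and Index steps are left out.

```python
# Hypothetical link graph standing in for the web: URL -> outlinks.
TOY_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def inject(crawldb, seeds):
    """Step 1: add seed URLs to the crawl DB as unfetched."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Step 2: select up to top_n unfetched URLs (the 'segment')."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]


def fetch_and_parse(segment):
    """Steps 3-4: return the outlinks found for each URL in the segment."""
    return {url: TOY_WEB.get(url, []) for url in segment}


def update_db(crawldb, parsed):
    """Step 5: mark fetched URLs and add newly discovered ones."""
    for url, outlinks in parsed.items():
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, rounds, top_n=10):
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment:
            break  # nothing left to fetch
        update_db(crawldb, fetch_and_parse(segment))
    return crawldb


crawldb = crawl(["http://a.example/"], rounds=2)
for url, status in sorted(crawldb.items()):
    print(status, url)
```

Each round only sees URLs discovered by earlier rounds, which is why the page two links away from the seed is known but still unfetched after two rounds.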
  • 19. Main steps from a data perspective (diagram)
      - Seed List
      - CrawlDB
      - Segment: /crawl_generate/, /crawl_fetch/, /content/, /crawl_parse/, /parse_data/, /parse_text/
      - LinkDB
  • 20. Frontier expansion
      - Manual “discovery”
          - Adding new URLs by hand, “seeding”
      - Automatic discovery of new resources (frontier expansion)
          - Not all outlinks are equally useful - control
          - Requires content parsing and link extraction
      - [Diagram: frontier growing from the seed over iterations i=1, i=2, i=3. Slide courtesy of A. Bialecki]
  • 21. An extensible framework
      - Plugins
          - Activated with parameter 'plugin.includes'
          - Implement one or more endpoints
      - Endpoints
          - Protocol
          - Parser
          - HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
          - ScoringFilter (used in various places)
          - URLFilter (ditto)
          - URLNormalizer (ditto)
          - IndexingFilter
          - IndexWriter (NEW IN 1.7!)
  • 22. Features
      - Fetcher
          - Multi-threaded fetcher
          - Queues URLs per hostname / domain / IP
          - Limit the number of URLs per round of fetching
          - Default values are polite but can be made more aggressive
      - Crawl Strategy
          - Breadth-first but can be depth-first
          - Configurable via custom ScoringFilters
      - Scoring
          - OPIC (On-line Page Importance Calculation) by default
          - LinkRank
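To illustrate the idea of queuing URLs per hostname, here is a small sketch that groups a fetch list by host and then takes one URL from each host queue in turn, so consecutive requests rarely hit the same server. The function names are hypothetical and this is not how the Nutch fetcher is implemented; in particular it ignores the per-queue delays and thread handling a polite fetcher needs.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def build_queues(urls):
    """Group URLs into one FIFO queue per hostname."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname].append(url)
    return queues


def round_robin(queues):
    """Yield URLs by taking one from each host queue in turn."""
    pending = deque(queues.items())
    while pending:
        host, queue = pending.popleft()
        yield queue.popleft()
        if queue:
            # Host still has work: send it to the back of the line.
            pending.append((host, queue))


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
]
order = list(round_robin(build_queues(urls)))
print(order)
```

Grouping by domain or IP instead of hostname would only change the key computed in `build_queues`.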
  • 23. Features (cont.)
      - Protocols
          - Http, file, ftp, https
          - Respects robots.txt directives
      - Scheduling
          - Fixed or adaptive
      - URL filters
          - Regex, FSA, TLD, prefix, suffix
      - URL normalisers
          - Default, regex
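A regex URL filter of the kind listed above can be sketched as follows. The rule syntax mimics the `+`/`-` prefixed patterns of Nutch's `regex-urlfilter.txt`, where the first matching rule decides and unmatched URLs are rejected; the rules themselves, the function names and the use of Python's `re.search` in place of Java regular expressions are assumptions made for illustration.

```python
import re

# Hypothetical rule set: '-' rejects, '+' accepts, '#' starts a comment.
RULES_TEXT = r"""
# skip some static resources
-\.(gif|jpg|png|css|js)$
# stay within one (made-up) domain
+^https?://([a-z0-9-]+\.)*example\.org/
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accept(url, rules):
    """First matching rule wins; URLs matching no rule are rejected."""
    for allowed, regex in rules:
        if regex.search(url):
            return allowed
    return False


rules = parse_rules(RULES_TEXT)
for candidate in (
    "http://www.example.org/page.html",
    "http://www.example.org/logo.png",
    "http://other.example.com/",
):
    print(accept(candidate, rules), candidate)
```

Rule order matters: placing the reject rule first is what keeps image URLs on the accepted domain out of the crawl.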
  • 24. Features (cont.)
      - Parsing with Apache Tika
          - Hundreds of formats supported
          - But some legacy parsers as well
      - Other plugins
          - CreativeCommons
          - Feeds
          - Language Identification
          - Rel tags
          - Arbitrary Metadata
      - Pluggable indexing
          - SOLR | ES etc...
  • 25. Indexing
      - Apache SOLR
          - schema.xml in conf/
          - SOLR 3.4
          - JIRA issue for SOLRCloud: https://issues.apache.org/jira/browse/NUTCH-1377
      - ElasticSearch
          - Version 0.90.1
      - AWS CloudSearch
          - WIP: https://issues.apache.org/jira/browse/NUTCH-1517
      - Easy to build your own
          - Text, DB, etc...
  • 26. Typical Nutch document
      - Some of the fields (IndexingFilters in plugins or core code)
          - url, content, title, anchor, site, boost, digest, segment, host, type
      - Configurable ones
          - meta tags (keywords, description etc...)
          - arbitrary metadata
  • 27. Outline
      - Overview
      - Installation and setup
      - Main steps
      - Nutch 2.x
      - Future developments
  • 28. NUTCH 2.x
      - 2.0 released in July 2012
      - 2.2.1 in July 2013
      - Features in common with 1.x
          - MapReduce, Tika, delegation to SOLR, etc...
      - Moved to 'big table'-like architecture
          - Wealth of NoSQL projects in last few years
      - Abstraction over storage layer → Apache GORA
  • 29. Apache GORA
      - http://gora.apache.org/
      - ORM for NoSQL databases
          - and limited SQL support + file based storage
      - Current version 0.3
      - DataStore implementations
          - Accumulo
          - Cassandra
          - HBase
          - Avro
          - DynamoDB
          - SQL (broken)
      - Serialization with Apache AVRO
      - Object-to-datastore mappings (backend-specific)
  • 30. AVRO Schema => Java code
      {"name": "WebPage",
       "type": "record",
       "namespace": "org.apache.nutch.storage",
       "fields": [
         {"name": "baseUrl", "type": ["null", "string"] },
         {"name": "status", "type": "int"},
         {"name": "fetchTime", "type": "long"},
         {"name": "prevFetchTime", "type": "long"},
         {"name": "fetchInterval", "type": "int"},
         {"name": "retriesSinceFetch", "type": "int"},
         {"name": "modifiedTime", "type": "long"},
         {"name": "protocolStatus", "type": {
           "name": "ProtocolStatus",
           "type": "record",
           "namespace": "org.apache.nutch.storage",
           "fields": [
             {"name": "code", "type": "int"},
             {"name": "args", "type": {"type": "array", "items": "string"}},
             {"name": "lastModified", "type": "long"}
           ]
         }},
         […]
  • 31. Mapping file (backend-specific, HBase)
      <gora-orm>
        <table name="webpage">
          <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
          <family name="f" maxVersions="1"/>
          <family name="s" maxVersions="1"/>
          <family name="il" maxVersions="1"/>
          <family name="ol" maxVersions="1"/>
          <family name="h" maxVersions="1"/>
          <family name="mtdt" maxVersions="1"/>
          <family name="mk" maxVersions="1"/>
        </table>
        <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
          <!-- fetch fields -->
          <field name="baseUrl" family="f" qualifier="bas"/>
          <field name="status" family="f" qualifier="st"/>
          <field name="prevFetchTime" family="f" qualifier="pts"/>
          <field name="fetchTime" family="f" qualifier="ts"/>
          <field name="fetchInterval" family="f" qualifier="fi"/>
          <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  • 32. DataStore operations
      - Basic operations
          - get(K key)
          - put(K key, T obj)
          - delete(K key)
      - Querying
          - execute(Query<K, T> query) → Result<K,T>
          - deleteByQuery(Query<K, T> query)
      - Wrappers for Apache Hadoop
          - GoraInputFormat | GoraOutputFormat
          - GoraRecordReader | GoraRecordWriter
          - GoraMapper | GoraReducer
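The shape of these operations can be shown with a toy in-memory stand-in. This is not the Gora API, which is a Java interface with `Query` and `Result` objects: the class `InMemoryDataStore`, the simplification of a query to an inclusive key range, and the sample keys are all assumptions made for illustration.

```python
class InMemoryDataStore:
    """Toy key/value store mirroring get / put / delete / execute / deleteByQuery."""

    def __init__(self):
        self._rows = {}

    def put(self, key, obj):
        self._rows[key] = obj

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        """Remove one row; return True if it existed."""
        if key in self._rows:
            del self._rows[key]
            return True
        return False

    def execute(self, start_key=None, end_key=None):
        """Yield (key, obj) in key order within the inclusive key range."""
        for key in sorted(self._rows):
            if start_key is not None and key < start_key:
                continue
            if end_key is not None and key > end_key:
                continue
            yield key, self._rows[key]

    def delete_by_query(self, start_key=None, end_key=None):
        """Delete every row in the key range; return how many were removed."""
        keys = [key for key, _ in self.execute(start_key, end_key)]
        for key in keys:
            del self._rows[key]
        return len(keys)


store = InMemoryDataStore()
store.put("org.example.a:http/", {"status": 1})
store.put("org.example.b:http/", {"status": 2})
store.put("org.example.c:http/", {"status": 2})
in_range = [k for k, _ in store.execute("org.example.a:http/", "org.example.b:http/")]
print(in_range)
```

A range scan like `execute` still visits every row in this toy version; a real backend would push the key range down to the storage layer.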
  • 33. GORA in Nutch
      - AVRO schema provided and Java code pre-generated
      - Mapping files provided for backends
          - can be modified if necessary
      - Need to rebuild to get dependencies for backend
          - hence source-only distribution of Nutch 2.x
      - http://wiki.apache.org/nutch/Nutch2Tutorial
  • 34. Benefits
      - Storage still distributed and replicated
      - … but one big table
          - status, metadata, content, text → one place
          - no more segments
      - Resumable fetch and parse steps
      - Easier interaction with other resources
          - Third-party code just needs to use GORA and schema
      - Simplifies the Nutch code
      - Potentially faster (e.g. update step)
  • 35. Drawbacks
      - More stuff to install and configure
          - Higher hardware requirements
      - Current performance :-(
          - http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
          - N2+HBase: 2.7x slower than 1.x
          - N2+Cassandra: 4.4x slower than 1.x
          - due mostly to GORA layer: not inherent to HBase or Cassandra
          - https://issues.apache.org/jira/browse/GORA-119 → filtered scans
          - Not all backends provide data locality!
      - Not as stable as Nutch 1.x
  • 36. 2.x Work in progress
      - Stabilise backend implementations
          - GORA-HBase most reliable
      - Synchronize features with 1.x
          - e.g. missing LinkRank equivalent (GSOC 2013, use Apache Giraph)
          - No pluggable indexers yet (NUTCH-1568)
      - Filter-enabled scans
          - GORA-119 => don't need to de-serialize the whole dataset
  • 37. Outline
      - Overview
      - Installation and setup
      - Main steps
      - Nutch 2.x
      - Future developments
  • 38. Future
      - 1.x and 2.x to coexist in parallel
          - 2.x not yet a replacement of 1.x
      - New functionalities
          - Support for SOLRCloud
          - Sitemap (from CrawlerCommons library)
          - Canonical tag
          - Generic deduplication (NUTCH-656)
      - Move to new MapReduce API
          - Use Nutch on Hadoop 2.x
  • 39. More delegation
      - Great deal done in recent years (SOLR, Tika)
      - Share code with crawler-commons (http://code.google.com/p/crawler-commons/)
          - Fetcher / protocol handling
          - URL normalisation / filtering
      - PageRank-like computations to graph library
          - Apache Giraph
          - Should be more efficient + less code to maintain
  • 40. Longer term
      - Hadoop 2.x & YARN
      - Convergence of batch and streaming
          - Storm / Samza / Storm-YARN / …
      - End of 100% batch operations?
          - Fetch and parse as streaming?
          - Always be fetching
          - Generate / update / pagerank remain batch
      - See https://github.com/DigitalPebble/storm-crawler
  • 41. Where to find out more?
      - Project page: http://nutch.apache.org/
      - Wiki: http://wiki.apache.org/nutch/
      - Mailing lists:
          - user@nutch.apache.org
          - dev@nutch.apache.org
      - Chapter in 'Hadoop the Definitive Guide' (T. White)
          - Understanding Hadoop is essential anyway...
      - Support / consulting:
          - http://wiki.apache.org/nutch/Support

Editor's Notes

  1. I'll be talking about large scale document processing and more specifically about Behemoth which is an open source project based on Hadoop
  2. A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DP is The main projects I am involved in are …
  3. Note that I mention crawling and not web search → Nutch is used not only for search. It used to do indexing and search using Lucene but now delegates this to SOLR.
  4. Endpoints are called in various places. URL filters and normalisers in a lot of places. Same for Scoring Filters.
  5. Main steps in Nutch. More actions available. Shell wrappers around Hadoop commands.
  6. Main steps in Nutch. More actions available. Shell wrappers around Hadoop commands.
  7. Endpoints are called in various places. URL filters and normalisers in a lot of places. Same for Scoring Filters.
  8. Fetcher: multithreaded but polite
  9. Fetcher: multithreaded but polite
  10. What does this mean for Nutch?
  11. What does this mean for Nutch?