SlideShare a Scribd company logo
Large Scale Crawling with

Apache
and friends...

Julien Nioche
julien@digitalpebble.com
LUCENE/SOLR REVOLUTION EU 2013
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
–
–
–
–

Web Crawling
Natural Language Processing
Information Retrieval
Machine Learning

 Strong focus on Open Source & Apache ecosystem
 VP Apache Nutch
 User | Contributor | Committer
–
–
–
–
–

Tika
SOLR, Lucene
GATE, UIMA
Mahout
Behemoth

2 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

3 / 43
Nutch?
 “Distributed framework for large scale web crawling”
(but does not have to be large scale at all)

 Apache TLP since May 2010
 Based on Apache Hadoop

 Indexing and Search by

4 / 43
A bit of history
 2002/2003 : Started By Doug Cutting & Mike Caffarella
 2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache

 2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache

 May 2010 : TLP project at Apache
 Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache

5 / 43
Recent Releases

trunk

1.0

1.1 1.2

1.3

1.4 1.5.1 1.6

1.7

2.x
2.0 2.1

06/09

06/10

06/11

06/12

2.2.1

06/13

6 / 43
Why use Nutch?
 Usual reasons
– Open source with a business-friendly license, mature, community, ...

 Scalability
– Tried and tested on very large scale
– Standard Hadoop

 Features
–
–
–
–

Index with SOLR / ES / CloudSearch
PageRank implementation
Loads of existing plugins
Can easily be extended / customised

7 / 43
Use cases
 Crawl for search
– Generic or vertical
– Index and Search with SOLR and al.
– Single node to large clusters on Cloud

 … but also
– Data Mining
– NLP (e.g.Sentiment Analysis)
– ML

 with
– MAHOUT / UIMA / GATE
– Use Behemoth as glueware
(https://github.com/DigitalPebble/behemoth)

8 / 43
Customer cases
Specificity (Verticality)
BetterJobs.com (CareerBuilder)
–
–
–
–
–

Single server
Aggregates content from job portals
Extracts and normalizes structure (description,
requirements, locations)
~2M pages total
Feeds SOLR index

SimilarPages.com
–
–
–
–
–

Large cluster on Amazon EC2 (up to 400
nodes)
Fetched & parsed 3 billion pages
10+ billion pages in crawlDB (~100TB data)
200+ million lists of similarities
No indexing / search involved

Size

9 / 43
CommonCrawl
http://commoncrawl.org/
 Open repository of web crawl data
 2012 dataset : 3.83 billion docs
 ARC files on Amazon S3
 Using Nutch 1.7
 A few modifications to Nutch code
– https://github.com/Aloisius/nutch

 Next release imminent
10 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

11 / 43
Installation
 http://nutch.apache.org/downloads.html
 1.7 => src and bin distributions
 2.2.1 => src only
 'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts

 Binary distribution for 1.x == runtime/local

12 / 43
Configuration and resources
 Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf

 Specify configuration in nutch-site.xml
– Leave nutch-default alone!

 At least :
<property>
  <name>http.agent.name</name>
  <value>WhateverNameDescribesMyMightyCrawler</value>
</property>

13 / 43
Running it!
 bin/crawl script : typical sequence of steps
 bin/nutch : individual Nutch commands
– Inject / generate / fetch / parse / update ….

 Local mode : great for testing and debugging
 Recommended : deploy + Hadoop (pseudo) distrib mode
– Parallelism
– MapReduce UI to monitor crawl, check logs, counters

14 / 43
Monitor Crawl with MapReduce UI

15 / 43
Counters and logs

16 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

17 / 43
Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations
1)
2)
3)
4)
5)
6)
7)

Inject → populates CrawlDB from seed list
Generate → Selects URLS to fetch in segment
Fetch → Fetches URLs from segment
Parse → Parses content (text + metadata)
UpdateDB → Updates CrawlDB (new URLs, new status...)
InvertLinks → Build Webgraph
Index → Send docs to [SOLR | ES | CloudSearch | … ]

 Repeat steps 2 to 7
 Or use the all-in-one crawl script
18 / 43
Main steps from a data perspective
Seed
List

Segment

CrawlDB

/
/crawl_fetch/
crawl_generate/
/content/
/crawl_parse/
/parse_data/
/parse_text/

LinkDB
19 / 43
Frontier expansion
 Manual “discovery”
– Adding new URLs by
hand, “seeding”

 Automatic discovery
of new resources
(frontier expansion)
– Not all outlinks are
equally useful - control
– Requires content
parsing and link
extraction

seed
i=1
i=2
i=3

[Slide courtesy of A. Bialecki]

20 / 43
An extensible framework
 Plugins
– Activated with parameter 'plugin.includes'
– Implement one or more endpoints

 Endpoints
–
–
–
–
–
–
–
–

Protocol
Parser
HtmlParseFilter (a.k.a ParseFilter in Nutch 2.x)
ScoringFilter (used in various places)
URLFilter (ditto)
URLNormalizer (ditto)
IndexingFilter
IndexWriter (NEW IN 1.7!)

21 / 43
Features
 Fetcher
–
–
–
–

Multi-threaded fetcher
Queues URLs per hostname / domain / IP
Limit the number of URLs for round of fetching
Default values are polite but can be made more aggressive

 Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom ScoringFilters

 Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank

22 / 43
Features (cont.)
 Protocols
– Http, file, ftp, https
– Respects robots.txt directives

 Scheduling
– Fixed or adaptive

 URL filters
– Regex, FSA, TLD, prefix, suffix

 URL normalisers
– Default, regex

23 / 43
Features (cont.)
 Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well

 Other plugins
–
–
–
–
–

CreativeCommons
Feeds
Language Identification
Rel tags
Arbitrary Metadata

 Pluggable indexing
– SOLR | ES etc...

24 / 43
Indexing
 Apache SOLR
– schema.xml in conf/
– SOLR 3.4
– JIRA issue for SOLRCloud
• https://issues.apache.org/jira/browse/NUTCH-1377

 ElasticSearch
– Version 0.90.1

 AWS CloudSearch
– WIP : https://issues.apache.org/jira/browse/NUTCH-1517

 Easy to build your own
– Text, DB, etc...

25 / 43
Typical Nutch document
 Some of the fields (IndexingFilters in plugins or core code)
–
–
–
–
–
–
–
–
–
–

url
content
title
anchor
site
boost
digest
segment
host
type

 Configurable ones
– meta tags (keywords, description etc...)
– arbitrary metadata

26 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

27 / 43
NUTCH 2.x
 2.0 released in July 2012
 2.2.1 in July 2013
 Common features as 1.x
– MapReduce, Tika, delegation to SOLR, etc...

 Moved to 'big table'-like architecture
– Wealth of NoSQL projects in last few years

 Abstraction over storage layer → Apache GORA
28 / 43
Apache GORA
 http://gora.apache.org/
 ORM for NoSQL databases
– and limited SQL support + file based storage

 Current version 0.3
 DataStore implementations
●
●
●

Accumulo
Cassandra
HBase

●
●
●

Avro
DynamoDB
SQL (broken)

 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)
29 / 43
AVRO Schema => Java code
{"name": "WebPage",
"type": "record",
"namespace": "org.apache.nutch.storage",
"fields": [
{"name": "baseUrl", "type": ["null", "string"] },
{"name": "status", "type": "int"},
{"name": "fetchTime", "type": "long"},
{"name": "prevFetchTime", "type": "long"},
{"name": "fetchInterval", "type": "int"},
{"name": "retriesSinceFetch", "type": "int"},
{"name": "modifiedTime", "type": "long"},
{"name": "protocolStatus", "type": {
"name": "ProtocolStatus",
"type": "record",
"namespace": "org.apache.nutch.storage",
"fields": [
{"name": "code", "type": "int"},
{"name": "args", "type": {"type": "array", "items": "string"}},
{"name": "lastModified", "type": "long"}
]
}},
[…]

30 / 43
Mapping file (backend specific – Hbase)
<gora-orm>
<table name="webpage">
<family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
<family name="f" maxVersions="1"/>
<family name="s" maxVersions="1"/>
<family name="il" maxVersions="1"/>
<family name="ol" maxVersions="1"/>
<family name="h" maxVersions="1"/>
<family name="mtdt" maxVersions="1"/>
<family name="mk" maxVersions="1"/>
</table>
<class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
<!-- fetch fields
-->
<field name="baseUrl" family="f" qualifier="bas"/>
<field name="status" family="f" qualifier="st"/>
<field name="prevFetchTime" family="f" qualifier="pts"/>
<field name="fetchTime" family="f" qualifier="ts"/>
<field name="fetchInterval" family="f" qualifier="fi"/>
<field name="retriesSinceFetch" family="f" qualifier="rsf"/>

31 / 43
DataStore operations
 Basic operations
– get(K key)
– put(K key, T obj)
– delete(K key)

 Querying
– execute(Query<K, T> query) → Result<K,T>
– deleteByQuery(Query<K, T> query)

 Wrappers for Apache Hadoop
– GORAInput|OutputFormat
– GoraRecordReader|Writer
– GORAMapper|Reducer

32 / 43
GORA in Nutch
 AVRO schema provided and java code pre-generated
 Mapping files provided for backends
– can be modified if necessary
 Need to rebuild to get dependencies for backend
– hence source only distribution of Nutch 2.x
 http://wiki.apache.org/nutch/Nutch2Tutorial

33 / 43
Benefits
 Storage still distributed and replicated
 … but one big table
– status, metadata, content, text → one place
– no more segments

 Resume-able fetch and parse steps
 Easier interaction with other resources
– Third-party code just need to use GORA and schema

 Simplify the Nutch code
 Potentially faster (e.g. update step)
34 / 43
Drawbacks
 More stuff to install and configure
– Higher hardware requirements

 Current performance :-(
–
–
–
–
–
–

http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
N2+HBase : 2.7x slower than 1.x
N2+Cassandra : 4.4x slower than 1.x
due mostly to GORA layer : not inherent to Hbase or Cassandra
https://issues.apache.org/jira/browse/GORA-119 → filtered scans
Not all backends provide data locality!

 Not as stable as Nutch 1.x

35 / 43
2.x Work in progress
 Stabilise backend implementations
– GORA-Hbase most reliable

 Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSOC 2013 – use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)

 Filter enabled scans
– GORA-119
• => don't need to de-serialize the whole dataset

36 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

37 / 43
Future
 1.x and 2.x to coexist in parallel
– 2.x not yet a replacement of 1.x

 New functionalities
–
–
–
–

Support for SOLRCloud
Sitemap (from CrawlerCommons library)
Canonical tag
Generic deduplication (NUTCH-656)

 Move to new MapReduce API
– Use Nutch on Hadoop 2.x

38 / 43
More delegation
 Great deal done in recent years (SOLR, Tika)
 Share code with crawler-commons
(http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– URL normalisation / filtering

 PageRank-like computations to graph library
– Apache Giraph
– Should be more efficient + less code to maintain

39 / 43
Longer term
 Hadoop 2.x & YARN
 Convergence of batch and streaming
– Storm / Samza / Storm-YARN / …

 End of 100% batch operations ?
– Fetch and parse as streaming ?
– Always be fetching
– Generate / update / pagerank remain batch

 See https://github.com/DigitalPebble/storm-crawler
40 / 43
Where to find out more?
 Project page : http://nutch.apache.org/
 Wiki : http://wiki.apache.org/nutch/
 Mailing lists :
– user@nutch.apache.org
– dev@nutch.apache.org

 Chapter in 'Hadoop the Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...

 Support / consulting :
– http://wiki.apache.org/nutch/Support

41 / 43
Questions

?
42 / 43
43 / 43

More Related Content

What's hot

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
Steve Watt
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Chris Mattmann
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
Julien Nioche
 
Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Web scraping with nutch solr part 2
Web scraping with nutch solr part 2
Mike Frampton
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wild
Julien Nioche
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFS
Saumitra Srivastav
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
Knoldus Inc.
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solr
Mike Frampton
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
Rahul Jain
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache Mesos
Joe Stein
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...
Sameer Tiwari
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
Rahul Jain
 
Making Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache Mesos
Joe Stein
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN
Jim Dowling
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
Jodok Batlogg
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
Dongmin Yu
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Jimmy Angelakos
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache Mesos
Joe Stein
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
Yahoo Developer Network
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
pmanvi
 

What's hot (20)

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Web scraping with nutch solr part 2
Web scraping with nutch solr part 2
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wild
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFS
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solr
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache Mesos
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Making Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache Mesos
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache Mesos
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 

Viewers also liked

Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
hannonhill
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
mteutelink
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
David Smiley
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
gagravarr
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
Paolo Mottadelli
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
Manish kumar
 
ProjectHub
ProjectHubProjectHub
Search engine
Search engineSearch engine
Search engine
Alisha Korpal
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
Alfresco Software
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
dnaber
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
Andy Jackson
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
Tommaso Teofili
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
Lucidworks (Archived)
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsqueda
David Gil Sánchez
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
Paolo Mottadelli
 
Mejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrMejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache Solr
Iván Campaña Naranjo
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en español
Toni de la Fuente
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
Alfresco Software
 
Introducción a Solr
Introducción a SolrIntroducción a Solr
Introducción a Solr
Jorge Luis Betancourt Gonzalez
 
Conferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo SolrConferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo Solr
Jorge Luis Betancourt Gonzalez
 

Viewers also liked (20)

Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search engine
Search engineSearch engine
Search engine
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsqueda
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
Mejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrMejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache Solr
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en español
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
 
Introducción a Solr
Introducción a SolrIntroducción a Solr
Introducción a Solr
 
Conferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo SolrConferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo Solr
 

Similar to Large Scale Crawling with Apache Nutch and Friends

Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
Julien Nioche
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
Andrew Lamb
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Joe Stein
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
jaxLondonConference
 
Hadoop_File_Formats_and_Data_Ingestion.pptx
Hadoop_File_Formats_and_Data_Ingestion.pptxHadoop_File_Formats_and_Data_Ingestion.pptx
Hadoop_File_Formats_and_Data_Ingestion.pptx
ShashankSahu34
 
Design and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web CrawlerDesign and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web CrawlerGeorge Ang
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
MapR Technologies
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Oleksiy Panchenko
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
Corley S.r.l.
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
DataStax Academy
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
prabakaranbrick
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
javier ramirez
 
Apache Marmotta - Introduction
Apache Marmotta - IntroductionApache Marmotta - Introduction
Apache Marmotta - Introduction
Sebastian Schaffert
 
Leveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot ArchitecturesLeveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot Architectures
Thanigai Vellore
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
Rahul Jain
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
Lester Martin
 
Digital Pebble Behemoth
Digital Pebble BehemothDigital Pebble Behemoth
Digital Pebble Behemoth
Steve Loughran
 
Introducing Node.js in an Oracle technology environment (including hands-on)
Introducing Node.js in an Oracle technology environment (including hands-on)Introducing Node.js in an Oracle technology environment (including hands-on)
Introducing Node.js in an Oracle technology environment (including hands-on)
Lucas Jellema
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
Sperasoft
 
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OOVirtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Paolo Cristofaro
 

Similar to Large Scale Crawling with Apache Nutch and Friends (20)

Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Hadoop_File_Formats_and_Data_Ingestion.pptx
Hadoop_File_Formats_and_Data_Ingestion.pptxHadoop_File_Formats_and_Data_Ingestion.pptx
Hadoop_File_Formats_and_Data_Ingestion.pptx
 
Design and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web CrawlerDesign and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web Crawler
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Apache Marmotta - Introduction
Apache Marmotta - IntroductionApache Marmotta - Introduction
Apache Marmotta - Introduction
 
Leveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot ArchitecturesLeveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot Architectures
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Digital Pebble Behemoth
Digital Pebble BehemothDigital Pebble Behemoth
Digital Pebble Behemoth
 
Introducing Node.js in an Oracle technology environment (including hands-on)
Introducing Node.js in an Oracle technology environment (including hands-on)Introducing Node.js in an Oracle technology environment (including hands-on)
Introducing Node.js in an Oracle technology environment (including hands-on)
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OOVirtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
 

Recently uploaded

一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
kgyxske
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
dakas1
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
Reetu63
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
VALiNTRY360
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
Alberto Brandolini
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
Alina Yurenko
 
Malibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed RoundMalibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed Round
sjcobrien
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
brainerhub1
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
Green Software Development
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
kalichargn70th171
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
Quickdice ERP
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Paul Brebner
 
What’s New in Odoo 17 – A Complete Roadmap
What’s New in Odoo 17 – A Complete RoadmapWhat’s New in Odoo 17 – A Complete Roadmap
What’s New in Odoo 17 – A Complete Roadmap
Envertis Software Solutions
 
The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024
Yara Milbes
 
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabhQuarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
aisafed42
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 

Recently uploaded (20)

一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
一比一原版(sdsu毕业证书)圣地亚哥州立大学毕业证如何办理
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
 
ppt on the brain chip neuralink.pptx
ppt  on   the brain  chip neuralink.pptxppt  on   the brain  chip neuralink.pptx
ppt on the brain chip neuralink.pptx
 
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfTop Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdf
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
 
Malibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed RoundMalibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed Round
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
 
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
 
What’s New in Odoo 17 – A Complete Roadmap
What’s New in Odoo 17 – A Complete RoadmapWhat’s New in Odoo 17 – A Complete Roadmap
What’s New in Odoo 17 – A Complete Roadmap
 
The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024
 
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabhQuarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
 
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdf
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 

Large Scale Crawling with Apache Nutch and Friends

  • 1. Large Scale Crawling with Apache and friends... Julien Nioche julien@digitalpebble.com LUCENE/SOLR REVOLUTION EU 2013
  • 2. About myself  DigitalPebble Ltd, Bristol (UK)  Specialised in Text Engineering – – – – Web Crawling Natural Language Processing Information Retrieval Machine Learning  Strong focus on Open Source & Apache ecosystem  VP Apache Nutch  User | Contributor | Committer – – – – – Tika SOLR, Lucene GATE, UIMA Mahout Behemoth 2 / 43
  • 3. Outline  Overview  Installation and setup  Main steps  Nutch 2.x  Future developments 3 / 43
  • 4. Nutch?  “Distributed framework for large scale web crawling” (but does not have to be large scale at all)  Apache TLP since May 2010  Based on Apache Hadoop  Indexing and Search by 4 / 43
  • 5. A bit of history  2002/2003 : Started By Doug Cutting & Mike Caffarella  2005 : MapReduce implementation in Nutch – 2006 : Hadoop sub-project of Lucene @Apache  2006/7 : Parser and MimeType in Tika – 2008 : Tika sub-project of Lucene @Apache  May 2010 : TLP project at Apache  Sept 2010 : Storage abstraction in Nutch 2.x – 2012 : Gora TLP @Apache 5 / 43
  • 6. Recent Releases trunk 1.0 1.1 1.2 1.3 1.4 1.5.1 1.6 1.7 2.x 2.0 2.1 06/09 06/10 06/11 06/12 2.2.1 06/13 6 / 43
  • 7. Why use Nutch?  Usual reasons – Open source with a business-friendly license, mature, community, ...  Scalability – Tried and tested on very large scale – Standard Hadoop  Features – – – – Index with SOLR / ES / CloudSearch PageRank implementation Loads of existing plugins Can easily be extended / customised 7 / 43
  • 8. Use cases  Crawl for search – Generic or vertical – Index and Search with SOLR and al. – Single node to large clusters on Cloud  … but also – Data Mining – NLP (e.g.Sentiment Analysis) – ML  with – MAHOUT / UIMA / GATE – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth) 8 / 43
  • 9. Customer cases Specificity (Verticality) BetterJobs.com (CareerBuilder) – – – – – Single server Aggregates content from job portals Extracts and normalizes structure (description, requirements, locations) ~2M pages total Feeds SOLR index SimilarPages.com – – – – – Large cluster on Amazon EC2 (up to 400 nodes) Fetched & parsed 3 billion pages 10+ billion pages in crawlDB (~100TB data) 200+ million lists of similarities No indexing / search involved Size 9 / 43
  • 10. CommonCrawl http://commoncrawl.org/  Open repository of web crawl data  2012 dataset : 3.83 billion docs  ARC files on Amazon S3  Using Nutch 1.7  A few modifications to Nutch code – https://github.com/Aloisius/nutch  Next release imminent 10 / 43
  • 11. Outline  Overview  Installation and setup  Main steps  Nutch 2.x  Future developments 11 / 43
  • 12. Installation  http://nutch.apache.org/downloads.html  1.7 => src and bin distributions  2.2.1 => src only  'ant clean runtime' – runtime/local => local mode (test and debug) – runtime/deploy => job jar for Hadoop + scripts  Binary distribution for 1.x == runtime/local 12 / 43
  • 13. Configuration and resources  Changes in $NUTCH_HOME/conf – Need recompiling with 'ant runtime' – Local mode => can be made directly in runtime/local/conf  Specify configuration in nutch-site.xml – Leave nutch-default alone!  At least : <property>   <name>http.agent.name</name>   <value>WhateverNameDescribesMyMightyCrawler</value> </property> 13 / 43
  • 14. Running it!  bin/crawl script : typical sequence of steps  bin/nutch : individual Nutch commands – Inject / generate / fetch / parse / update ….  Local mode : great for testing and debugging  Recommended : deploy + Hadoop (pseudo) distrib mode – Parallelism – MapReduce UI to monitor crawl, check logs, counters 14 / 43
  • 15. Monitor Crawl with MapReduce UI 15 / 43
  • 17. Outline  Overview  Installation and setup  Main steps  Nutch 2.x  Future developments 17 / 43
  • 18. Typical Nutch Steps  Same in 1.x and 2.x  Sequence of batch operations 1) 2) 3) 4) 5) 6) 7) Inject → populates CrawlDB from seed list Generate → Selects URLS to fetch in segment Fetch → Fetches URLs from segment Parse → Parses content (text + metadata) UpdateDB → Updates CrawlDB (new URLs, new status...) InvertLinks → Build Webgraph Index → Send docs to [SOLR | ES | CloudSearch | … ]  Repeat steps 2 to 7  Or use the all-in-one crawl script 18 / 43
  • 19. Main steps from a data perspective Seed List Segment CrawlDB / /crawl_fetch/ crawl_generate/ /content/ /crawl_parse/ /parse_data/ /parse_text/ LinkDB 19 / 43
  • 20. Frontier expansion  Manual “discovery” – Adding new URLs by hand, “seeding”  Automatic discovery of new resources (frontier expansion) – Not all outlinks are equally useful - control – Requires content parsing and link extraction seed i=1 i=2 i=3 [Slide courtesy of A. Bialecki] 20 / 43
  • 21. An extensible framework  Plugins – Activated with parameter 'plugin.includes' – Implement one or more endpoints  Endpoints – – – – – – – – Protocol Parser HtmlParseFilter (a.k.a ParseFilter in Nutch 2.x) ScoringFilter (used in various places) URLFilter (ditto) URLNormalizer (ditto) IndexingFilter IndexWriter (NEW IN 1.7!) 21 / 43
  • 22. Features  Fetcher – – – – Multi-threaded fetcher Queues URLs per hostname / domain / IP Limit the number of URLs for round of fetching Default values are polite but can be made more aggressive  Crawl Strategy – Breadth-first but can be depth-first – Configurable via custom ScoringFilters  Scoring – OPIC (On-line Page Importance Calculation) by default – LinkRank 22 / 43
  • 23. Features (cont.)  Protocols – Http, file, ftp, https – Respects robots.txt directives  Scheduling – Fixed or adaptive  URL filters – Regex, FSA, TLD, prefix, suffix  URL normalisers – Default, regex 23 / 43
  • 24. Features (cont.)  Parsing with Apache Tika – Hundreds of formats supported – But some legacy parsers as well  Other plugins – – – – – CreativeCommons Feeds Language Identification Rel tags Arbitrary Metadata  Pluggable indexing – SOLR | ES etc... 24 / 43
  • 25. Indexing  Apache SOLR – schema.xml in conf/ – SOLR 3.4 – JIRA issue for SOLRCloud • https://issues.apache.org/jira/browse/NUTCH-1377  ElasticSearch – Version 0.90.1  AWS CloudSearch – WIP : https://issues.apache.org/jira/browse/NUTCH-1517  Easy to build your own – Text, DB, etc... 25 / 43
  • 26. Typical Nutch document  Some of the fields (IndexingFilters in plugins or core code) – – – – – – – – – – url content title anchor site boost digest segment host type  Configurable ones – meta tags (keywords, description etc...) – arbitrary metadata 26 / 43
  • 27. Outline  Overview  Installation and setup  Main steps  Nutch 2.x  Future developments 27 / 43
  • 28. NUTCH 2.x  2.0 released in July 2012  2.2.1 in July 2013  Common features as 1.x – MapReduce, Tika, delegation to SOLR, etc...  Moved to 'big table'-like architecture – Wealth of NoSQL projects in last few years  Abstraction over storage layer → Apache GORA 28 / 43
  • 29. Apache GORA  http://gora.apache.org/  ORM for NoSQL databases – and limited SQL support + file based storage  Current version 0.3  DataStore implementations ● ● ● Accumulo Cassandra HBase ● ● ● Avro DynamoDB SQL (broken)  Serialization with Apache AVRO  Object-to-datastore mappings (backend-specific) 29 / 43
  • 30. AVRO Schema => Java code {"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": ["null", "string"] }, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }}, […] 30 / 43
  • 31. Mapping file (backend specific – Hbase) <gora-orm> <table name="webpage"> <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters --> <family name="f" maxVersions="1"/> <family name="s" maxVersions="1"/> <family name="il" maxVersions="1"/> <family name="ol" maxVersions="1"/> <family name="h" maxVersions="1"/> <family name="mtdt" maxVersions="1"/> <family name="mk" maxVersions="1"/> </table> <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"> <!-- fetch fields --> <field name="baseUrl" family="f" qualifier="bas"/> <field name="status" family="f" qualifier="st"/> <field name="prevFetchTime" family="f" qualifier="pts"/> <field name="fetchTime" family="f" qualifier="ts"/> <field name="fetchInterval" family="f" qualifier="fi"/> <field name="retriesSinceFetch" family="f" qualifier="rsf"/> 31 / 43
  • 32. DataStore operations  Basic operations – get(K key) – put(K key, T obj) – delete(K key)  Querying – execute(Query<K, T> query) → Result<K,T> – deleteByQuery(Query<K, T> query)  Wrappers for Apache Hadoop – GORAInput|OutputFormat – GoraRecordReader|Writer – GORAMapper|Reducer 32 / 43
  • 33. GORA in Nutch  AVRO schema provided and java code pre-generated  Mapping files provided for backends – can be modified if necessary  Need to rebuild to get dependencies for backend – hence source only distribution of Nutch 2.x  http://wiki.apache.org/nutch/Nutch2Tutorial 33 / 43
  • 34. Benefits  Storage still distributed and replicated  … but one big table – status, metadata, content, text → one place – no more segments  Resume-able fetch and parse steps  Easier interaction with other resources – Third-party code just need to use GORA and schema  Simplify the Nutch code  Potentially faster (e.g. update step) 34 / 43
  • 35. Drawbacks  More stuff to install and configure – Higher hardware requirements  Current performance :-( – – – – – – http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html N2+HBase : 2.7x slower than 1.x N2+Cassandra : 4.4x slower than 1.x due mostly to GORA layer : not inherent to Hbase or Cassandra https://issues.apache.org/jira/browse/GORA-119 → filtered scans Not all backends provide data locality!  Not as stable as Nutch 1.x 35 / 43
  • 36. 2.x Work in progress  Stabilise backend implementations – GORA-Hbase most reliable  Synchronize features with 1.x – e.g. missing LinkRank equivalent (GSOC 2013 – use Apache Giraph) – No pluggable indexers yet (NUTCH-1568)  Filter enabled scans – GORA-119 • => don't need to de-serialize the whole dataset 36 / 43
  • 37. Outline  Overview  Installation and setup  Main steps  Nutch 2.x  Future developments 37 / 43
  • 38. Future  1.x and 2.x to coexist in parallel – 2.x not yet a replacement of 1.x  New functionalities – – – – Support for SOLRCloud Sitemap (from CrawlerCommons library) Canonical tag Generic deduplication (NUTCH-656)  Move to new MapReduce API – Use Nutch on Hadoop 2.x 38 / 43
  • 39. More delegation  Great deal done in recent years (SOLR, Tika)  Share code with crawler-commons (http://code.google.com/p/crawler-commons/) – Fetcher / protocol handling – URL normalisation / filtering  PageRank-like computations to graph library – Apache Giraph – Should be more efficient + less code to maintain 39 / 43
  • 40. Longer term  Hadoop 2.x & YARN  Convergence of batch and streaming – Storm / Samza / Storm-YARN / …  End of 100% batch operations ? – Fetch and parse as streaming ? – Always be fetching – Generate / update / pagerank remain batch  See https://github.com/DigitalPebble/storm-crawler 40 / 43
  • 41. Where to find out more?  Project page : http://nutch.apache.org/  Wiki : http://wiki.apache.org/nutch/  Mailing lists : – user@nutch.apache.org – dev@nutch.apache.org  Chapter in 'Hadoop the Definitive Guide' (T. White) – Understanding Hadoop is essential anyway...  Support / consulting : – http://wiki.apache.org/nutch/Support 41 / 43

Editor's Notes

  1. I&apos;ll be talking about large scale document processing and more specifically about Behemoth which is an open source project based on Hadoop
  2. A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DP is The main projects I am involved in are …
  3. Note that I mention crawling and not web search → used not only for search + used to do indexing and search using Lucene but now delegate this to SOLR
  4. Endpoints are called in various places URL filters and normalisers in a lot of places Same for Soring Filters
  5. Main steps in Nutch More actions available Shell Wrappers around hadoop commands
  6. Main steps in Nutch More actions available Shell Wrappers around hadoop commands
  7. Endpoints are called in various places URL filters and normalisers in a lot of places Same for Soring Filters
  8. Fetcher . multithreaded but polite
  9. Fetcher . multithreaded but polite
  10. What does this mean for Nutch?
  11. What does this mean for Nutch?