SlideShare a Scribd company logo
1 of 35
Large Scale Crawling with

   Apache
Julien Nioche
julien@digitalpebble.com

ApacheCon Europe 2012
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
    –   Web Crawling
    –   Natural Language Processing
    –   Information Retrieval
    –   Data Mining
   Strong focus on Open Source & Apache ecosystem
   Apache Nutch VP
   Apache Tika committer
   User | Contributor
    –   SOLR, Lucene
    –   GATE, UIMA
    –   Mahout
    –   Behemoth


                                                     2 / 37
Objectives

 Overview of the project

 Nutch in a nutshell

 Nutch 2.x

 Future developments




                            3 / 37
Nutch?
 “Distributed framework for large scale web crawling”
   – but does not have to be large scale at all
   – or even on the web (file-protocol)


  Apache TLP since May 2010


  Based on Apache Hadoop

  Indexing and Search



                                                     4 / 37
Short history
 2002/2003 : Started By Doug Cutting & Mike Caffarella
 2004 : sub-project of Lucene @Apache
 2005 : MapReduce implementation in Nutch
   – 2006 : Hadoop sub-project of Lucene @Apache

 2006/7 : Parser and MimeType in Tika
   – 2008 : Tika sub-project of Lucene @Apache

 May 2010 : TLP project at Apache
 June 2012 : Nutch 1.5.1
 Oct 2012 : Nutch 2.1

                                                    5 / 37
Recent Releases


        1.0            1.1 1.2   1.3     1.4 1.5.1
trunk


               2.x
                                               2.0 2.1



              06/09   06/10      06/11       06/12




                                                         7 / 37
Community

 6 active committers / PMC members
   – 4 within the last 18 months


 Constant stream of new contributions & bug reports

 Steady numbers of mailing list subscribers and traffic

 Nutch is a very healthy 10-year old




                                                       9 / 37
Why use Nutch?

 Usual reasons
   – Mature, business-friendly license, community, ...


 Scalability
   – Tried and tested on very large scale
   – Hadoop cluster : installation and skills

 Features
   – e.g. Index with SOLR
   – PageRank implementation
   – Can be extended with plugins




                                                         10 / 37
Not the best option when ...

 Hadoop based == batch processing == high latency
   – No guarantee that a page will be fetched / parsed / indexed within X
     minutes|hours


 Javascript / Ajax not supported (yet)




                                                                      11 / 37
Use cases

 Crawl for IR
   – Generic or vertical
   – Index and Search with SOLR
   – Single node to large clusters on Cloud

 … but also
   – Data Mining
   – NLP (e.g.Sentiment Analysis)
   – ML


   – MAHOUT / UIMA / GATE
   – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)




                                                                   12 / 37
Customer cases
Specificity (Verticality)
                       Usecase : BetterJobs.com
                        –   Single server
                        –   Aggregates content from job portals
                        –   Extracts and normalizes structure (description,
                            requirements, locations)
                        –   ~1M pages total
                        –   Feeds SOLR index


          Usecase : SimilarPages.com
           –   Large cluster on Amazon EC2 (up to 400
               nodes)
           –   Fetched & parsed 3 billion pages
           –   10+ billion pages in crawlDB (~100TB data)
           –   200+ million lists of similarities
           –   No indexing / search involved
                                                                              Scale


                                                                                13 / 37
Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations
      1)   Inject → populates CrawlDB from seed list
      2)   Generate → Selects URLS to fetch in segment
      3)   Fetch → Fetches URLs from segment
      4)   Parse → Parses content (text + metadata)
      5)   UpdateDB → Updates CrawlDB (new URLs, new status...)
      6)   InvertLinks → Build Webgraph
      7)   SOLRIndex → Send docs to SOLR
      8)   SOLRDedup → Remove duplicate docs based on signature
 Repeat steps 2 to 8
 Or use the all-in-one crawl script

                                                          14 / 37
Main steps


 Seed
 List        CrawlDB                 Segment
                                /
                                /crawl_fetch/
                                crawl_generate/
                                /content/
                                /crawl_parse/
                                /parse_data/
                                /parse_text/




                       LinkDB


                                                  15 / 37
Frontier expansion

 Manual “discovery”
   – Adding new URLs by
     hand, “seeding”


 Automatic discovery
  of new resources
  (frontier expansion)
   – Not all outlinks are
     equally useful - control       seed
   – Requires content
                                           i=1
     parsing and link
     extraction
                                                 i=2
                                                        i=3

  [Slide courtesy of A. Bialecki]

                                                       16 / 37
An extensible framework
 Plugins
   – Activated with parameter 'plugin.includes'
   – Implement one or more endpoints

 Endpoints
   –   Protocol
   –   Parser
   –   HtmlParseFilter (ParseFilter in Nutch 2.x)
   –   ScoringFilter (used in various places)
   –   URLFilter (ditto)
   –   URLNormalizer (ditto)
   –   IndexingFilter




                                                    17 / 37
Features

 Fetcher
   –   Multi-threaded fetcher
   –   Follows robots.txt
   –   Groups URLs per hostname / domain / IP
   –   Limit the number of URLs for round of fetching
   –   Default values are polite but can be made more aggressive

 Crawl Strategy
   – Breadth-first but can be depth-first
   – Configurable via custom scoring plugins

 Scoring
   – OPIC (On-line Page Importance Calculation) by default
   – LinkRank


                                                                   18 / 37
Features (cont.)

 Protocols
   – Http, file, ftp, https

 Scheduling
   – Specified or adaptative

 URL filters
   – Regex, FSA, TLD, prefix, suffix


 URL normalisers
   – Default, regex




                                       19 / 37
Features (cont.)

 Parsing with Apache Tika
   – Hundreds of formats supported
   – But some legacy parsers as well
 Other plugins
   –   CreativeCommons
   –   Feeds
   –   Language Identification
   –   Rel tags
   –   Arbitrary Metadata

 Indexing to SOLR
   – Bespoke schema




                                       20 / 37
Data Structures in 1.x
 MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
 CrawlDB => status of known pages
                       MapFile : <Text,CrawlDatum>
                       byte status; [fetched? Unfetched? Failed? Redir?]
                       long fetchTime;
                       byte retries;
     CrawlDB           int fetchInterval;
                       float score = 1.0f;
                       byte[] signature = null;
                       long modifiedTime;
                       org.apache.hadoop.io.MapWritable metaData;


 Input of : generate - index
 Output of : inject - update


                                                                   21 / 37
Data Structures 1.x

 Segment => round of fetching
 Identified by a timestamp

 Segment
      /crawl_generate/ → SequenceFile<Text,CrawlDatum>
      /crawl_fetch/ → MapFile<Text,CrawlDatum>
      /content/ → MapFile<Text,Content>
      /crawl_parse/ → SequenceFile<Text,CrawlDatum>
      /parse_data/ → MapFile<Text,ParseData>
      /parse_text/ → MapFile<Text,ParseText>


 Can have multiple versions of a page in different
  segments


                                                         22 / 37
Data Structures – 1.x

 linkDB => storage for Web Graph

                        MapFile : <Text,Inlinks>
                        Inlinks : HashSet <Inlink>
     LinkDB             Inlink :
                                 String fromUrl
                                 String anchor


 Output of : invertlinks
 Input of : SOLRIndex




                                                     23 / 37
NUTCH 2.x

 2.0 released in July 2012

 2.1 in October 2012

 Common features as 1.x
   – delegation to SOLR, TIKA, MapReduce etc...


 Moved to table-based architecture
   – Wealth of NoSQL projects in last few years


 Abstraction over storage layer → Apache GORA


                                                  24 / 37
Apache GORA

 http://gora.apache.org/

 ORM for NoSQL databases
   – and limited SQL support + file based storage
 0.2.1 released in August 2012
 DataStore implementations
       ●   Accumulo             ●   Avro
       ●   Cassandra            ●   DynamoDB (soon)
       ●   HBase                ●   SQL

 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)

                                                      25 / 37
AVRO Schema => Java code
 {"name": "WebPage",
  "type": "record",
  "namespace": "org.apache.nutch.storage",
  "fields": [
        {"name": "baseUrl", "type": ["null", "string"] },
        {"name": "status", "type": "int"},
        {"name": "fetchTime", "type": "long"},
        {"name": "prevFetchTime", "type": "long"},
        {"name": "fetchInterval", "type": "int"},
        {"name": "retriesSinceFetch", "type": "int"},
        {"name": "modifiedTime", "type": "long"},
        {"name": "protocolStatus", "type": {
           "name": "ProtocolStatus",
           "type": "record",
           "namespace": "org.apache.nutch.storage",
           "fields": [
               {"name": "code", "type": "int"},
               {"name": "args", "type": {"type": "array", "items": "string"}},
               {"name": "lastModified", "type": "long"}
           ]
           }},
 […]

                                                                                 26 / 37
Mapping file (backend specific – Hbase)
<gora-orm>

  <table name="webpage">
     <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
     <family name="f" maxVersions="1"/>
     <family name="s" maxVersions="1"/>
     <family name="il" maxVersions="1"/>
     <family name="ol" maxVersions="1"/>
     <family name="h" maxVersions="1"/>
     <family name="mtdt" maxVersions="1"/>
     <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">

    <!-- fetch fields                        -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>



                                                                                            27 / 37
DataStore operations

 Atomic operations
   – get(K key)
   – put(K key, T obj)
   – delete(K key)

 Querying
   – execute(Query<K, T> query) → Result<K,T>
   – deleteByQuery(Query<K, T> query)

 Wrappers for Apache Hadoop
   – GORAInput|OutputFormat
   – GoraRecordReader|Writer
   – GORAMapper|Reducer



                                                28 / 37
GORA in Nutch

 AVRO schema provided and java code pre-generated


 Mapping files provided for backends
   – can be modified if necessary

 Need to rebuild to get dependencies for backend
   – No binary distribution of Nutch 2.x

 http://wiki.apache.org/nutch/Nutch2Tutorial



                                                    29 / 37
Benefits

 Storage still distributed and replicated
 but one big table
   – status, metadata, content, text → one place
 Simplified logic in Nutch
   – Simpler code for updating / merging information
 More efficient (?)
   – No need to read / write entire structure to update records
   – No comparison available yet + early days for GORA
 Easier interaction with other resources
   – Third-party code just need to use GORA and schema



                                                                  30 / 37
Drawbacks

 More stuff to install and configure :-)

 Not as stable as Nutch 1.x

 Dependent on success of Gora




                                            31 / 37
2.x Work in progress

 Stabilise backend implementations
   – GORA-Hbase most reliable


 Synchronize features with 1.x
   – e.g. has ElasticSearch but missing LinkRank equivalent


 Filter enabled scans (GORA-119)
   – Don't need to de-serialize the whole dataset




                                                              32 / 37
Future

 Both 1.x and 2.x in parallel
   – but more frequent releases for 2.x


 New functionalities
   –   Support for SOLRCloud
   –   Sitemap (from Crawler Commons library)
   –   Canonical tag
   –   More indexers (e.g. ElasticSearch) + pluggable indexers?




                                                                  33 / 37
More delegation
 Great deal done in recent years (SOLR, Tika)

 Share code with crawler-commons
  (http://code.google.com/p/crawler-commons/)
   – Fetcher / protocol handling
   – Robots.txt parsing
   – URL normalisation / filtering


 PageRank-like computations to graph library
   – e.g. Apache Giraph
   – Should be more efficient as well




                                                 34 / 37
Where to find out more?

  Project page : http://nutch.apache.org/
  Wiki : http://wiki.apache.org/nutch/
  Mailing lists :
    – user@nutch.apache.org
    – dev@nutch.apache.org


 Chapter in 'Hadoop the Definitive Guide' (T. White)
   – Understanding Hadoop is essential anyway...


 Support / consulting :
   – http://wiki.apache.org/nutch/Support



                                                        35 / 37
Questions




            ?

                36 / 37
37 / 37

More Related Content

What's hot

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Cenitpede: Analyzing Webcrawl
Cenitpede: Analyzing WebcrawlCenitpede: Analyzing Webcrawl
Cenitpede: Analyzing Webcrawl
Primal Pappachan
 

What's hot (20)

Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 
ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...
ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...
ElastiCache Deep Dive: Best Practices and Usage Patterns - March 2017 AWS Onl...
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
Elastic Stack Introduction
Elastic Stack IntroductionElastic Stack Introduction
Elastic Stack Introduction
 
Introduction to elasticsearch
Introduction to elasticsearchIntroduction to elasticsearch
Introduction to elasticsearch
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Cenitpede: Analyzing Webcrawl
Cenitpede: Analyzing WebcrawlCenitpede: Analyzing Webcrawl
Cenitpede: Analyzing Webcrawl
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
 
Log management with ELK
Log management with ELKLog management with ELK
Log management with ELK
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
ELK Stack
ELK StackELK Stack
ELK Stack
 
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DMUpgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
 
Elastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & KibanaElastic - ELK, Logstash & Kibana
Elastic - ELK, Logstash & Kibana
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginners
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 

Viewers also liked

Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platform
abial
 
Challenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingChallenges in Large-Scale Web Crawling
Challenges in Large-Scale Web Crawling
Nate Murray
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
mteutelink
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
Manish kumar
 

Viewers also liked (20)

Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platform
 
Introduction to apache nutch
Introduction to apache nutchIntroduction to apache nutch
Introduction to apache nutch
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Challenges in Large-Scale Web Crawling
Challenges in Large-Scale Web CrawlingChallenges in Large-Scale Web Crawling
Challenges in Large-Scale Web Crawling
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Search engine
Search engineSearch engine
Search engine
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Current challenges in web crawling
Current challenges in web crawlingCurrent challenges in web crawling
Current challenges in web crawling
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 

Similar to Large scale crawling with Apache Nutch

Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
S S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
elliando dias
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
DataWorks Summit
 

Similar to Large scale crawling with Apache Nutch (20)

Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Large scale crawling with Apache Nutch

  • 1. Large Scale Crawling with Apache Julien Nioche julien@digitalpebble.com ApacheCon Europe 2012
  • 2. About myself  DigitalPebble Ltd, Bristol (UK)  Specialised in Text Engineering – Web Crawling – Natural Language Processing – Information Retrieval – Data Mining  Strong focus on Open Source & Apache ecosystem  Apache Nutch VP  Apache Tika committer  User | Contributor – SOLR, Lucene – GATE, UIMA – Mahout – Behemoth 2 / 37
  • 3. Objectives  Overview of the project  Nutch in a nutshell  Nutch 2.x  Future developments 3 / 37
  • 4. Nutch?  “Distributed framework for large scale web crawling” – but does not have to be large scale at all – or even on the web (file-protocol)  Apache TLP since May 2010  Based on Apache Hadoop  Indexing and Search 4 / 37
  • 5. Short history  2002/2003 : Started By Doug Cutting & Mike Caffarella  2004 : sub-project of Lucene @Apache  2005 : MapReduce implementation in Nutch – 2006 : Hadoop sub-project of Lucene @Apache  2006/7 : Parser and MimeType in Tika – 2008 : Tika sub-project of Lucene @Apache  May 2010 : TLP project at Apache  June 2012 : Nutch 1.5.1  Oct 2012 : Nutch 2.1 5 / 37
  • 6. Recent Releases 1.0 1.1 1.2 1.3 1.4 1.5.1 trunk 2.x 2.0 2.1 06/09 06/10 06/11 06/12 7 / 37
  • 7. Community  6 active committers / PMC members – 4 within the last 18 months  Constant stream of new contributions & bug reports  Steady numbers of mailing list subscribers and traffic  Nutch is a very healthy 10-year old 9 / 37
  • 8. Why use Nutch?  Usual reasons – Mature, business-friendly license, community, ...  Scalability – Tried and tested on very large scale – Hadoop cluster : installation and skills  Features – e.g. Index with SOLR – PageRank implementation – Can be extended with plugins 10 / 37
  • 9. Not the best option when ...  Hadoop based == batch processing == high latency – No guarantee that a page will be fetched / parsed / indexed within X minutes|hours  Javascript / Ajax not supported (yet) 11 / 37
  • 10. Use cases  Crawl for IR – Generic or vertical – Index and Search with SOLR – Single node to large clusters on Cloud  … but also – Data Mining – NLP (e.g.Sentiment Analysis) – ML – MAHOUT / UIMA / GATE – Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth) 12 / 37
  • 11. Customer cases Specificity (Verticality) Usecase : BetterJobs.com – Single server – Aggregates content from job portals – Extracts and normalizes structure (description, requirements, locations) – ~1M pages total – Feeds SOLR index Usecase : SimilarPages.com – Large cluster on Amazon EC2 (up to 400 nodes) – Fetched & parsed 3 billion pages – 10+ billion pages in crawlDB (~100TB data) – 200+ million lists of similarities – No indexing / search involved Scale 13 / 37
  • 12. Typical Nutch Steps  Same in 1.x and 2.x  Sequence of batch operations 1) Inject → populates CrawlDB from seed list 2) Generate → Selects URLS to fetch in segment 3) Fetch → Fetches URLs from segment 4) Parse → Parses content (text + metadata) 5) UpdateDB → Updates CrawlDB (new URLs, new status...) 6) InvertLinks → Build Webgraph 7) SOLRIndex → Send docs to SOLR 8) SOLRDedup → Remove duplicate docs based on signature  Repeat steps 2 to 8  Or use the all-in-one crawl script 14 / 37
  • 13. Main steps Seed List CrawlDB Segment / /crawl_fetch/ crawl_generate/ /content/ /crawl_parse/ /parse_data/ /parse_text/ LinkDB 15 / 37
  • 14. Frontier expansion  Manual “discovery” – Adding new URLs by hand, “seeding”  Automatic discovery of new resources (frontier expansion) – Not all outlinks are equally useful - control seed – Requires content i=1 parsing and link extraction i=2 i=3 [Slide courtesy of A. Bialecki] 16 / 37
  • 15. An extensible framework  Plugins – Activated with parameter 'plugin.includes' – Implement one or more endpoints  Endpoints – Protocol – Parser – HtmlParseFilter (ParseFilter in Nutch 2.x) – ScoringFilter (used in various places) – URLFilter (ditto) – URLNormalizer (ditto) – IndexingFilter 17 / 37
  • 16. Features  Fetcher – Multi-threaded fetcher – Follows robots.txt – Groups URLs per hostname / domain / IP – Limit the number of URLs for round of fetching – Default values are polite but can be made more aggressive  Crawl Strategy – Breadth-first but can be depth-first – Configurable via custom scoring plugins  Scoring – OPIC (On-line Page Importance Calculation) by default – LinkRank 18 / 37
  • 17. Features (cont.)  Protocols – Http, file, ftp, https  Scheduling – Specified or adaptative  URL filters – Regex, FSA, TLD, prefix, suffix  URL normalisers – Default, regex 19 / 37
  • 18. Features (cont.)  Parsing with Apache Tika – Hundreds of formats supported – But some legacy parsers as well  Other plugins – CreativeCommons – Feeds – Language Identification – Rel tags – Arbitrary Metadata  Indexing to SOLR – Bespoke schema 20 / 37
  • 19. Data Structures in 1.x  MapReduce jobs => I/O : Hadoop [Sequence|Map]Files  CrawlDB => status of known pages MapFile : <Text,CrawlDatum> byte status; [fetched? Unfetched? Failed? Redir?] long fetchTime; byte retries; CrawlDB int fetchInterval; float score = 1.0f; byte[] signature = null; long modifiedTime; org.apache.hadoop.io.MapWritable metaData;  Input of : generate - index  Output of : inject - update 21 / 37
  • 20. Data Structures 1.x  Segment => round of fetching  Identified by a timestamp Segment /crawl_generate/ → SequenceFile<Text,CrawlDatum> /crawl_fetch/ → MapFile<Text,CrawlDatum> /content/ → MapFile<Text,Content> /crawl_parse/ → SequenceFile<Text,CrawlDatum> /parse_data/ → MapFile<Text,ParseData> /parse_text/ → MapFile<Text,ParseText>  Can have multiple versions of a page in different segments 22 / 37
  • 21. Data Structures – 1.x  linkDB => storage for Web Graph MapFile : <Text,Inlinks> Inlinks : HashSet <Inlink> LinkDB Inlink : String fromUrl String anchor  Output of : invertlinks  Input of : SOLRIndex 23 / 37
  • 22. NUTCH 2.x  2.0 released in July 2012  2.1 in October 2012  Common features as 1.x – delegation to SOLR, TIKA, MapReduce etc...  Moved to table-based architecture – Wealth of NoSQL projects in last few years  Abstraction over storage layer → Apache GORA 24 / 37
  • 23. Apache GORA  http://gora.apache.org/  ORM for NoSQL databases – and limited SQL support + file based storage  0.2.1 released in August 2012  DataStore implementations ● Accumulo ● Avro ● Cassandra ● DynamoDB (soon) ● HBase ● SQL  Serialization with Apache AVRO  Object-to-datastore mappings (backend-specific) 25 / 37
  • 24. AVRO Schema => Java code {"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": ["null", "string"] }, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }}, […] 26 / 37
  • 25. Mapping file (backend specific – Hbase) <gora-orm> <table name="webpage"> <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters --> <family name="f" maxVersions="1"/> <family name="s" maxVersions="1"/> <family name="il" maxVersions="1"/> <family name="ol" maxVersions="1"/> <family name="h" maxVersions="1"/> <family name="mtdt" maxVersions="1"/> <family name="mk" maxVersions="1"/> </table> <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage"> <!-- fetch fields --> <field name="baseUrl" family="f" qualifier="bas"/> <field name="status" family="f" qualifier="st"/> <field name="prevFetchTime" family="f" qualifier="pts"/> <field name="fetchTime" family="f" qualifier="ts"/> <field name="fetchInterval" family="f" qualifier="fi"/> <field name="retriesSinceFetch" family="f" qualifier="rsf"/> 27 / 37
  • 26. DataStore operations  Atomic operations – get(K key) – put(K key, T obj) – delete(K key)  Querying – execute(Query<K, T> query) → Result<K,T> – deleteByQuery(Query<K, T> query)  Wrappers for Apache Hadoop – GORAInput|OutputFormat – GoraRecordReader|Writer – GORAMapper|Reducer 28 / 37
  • 27. GORA in Nutch  AVRO schema provided and java code pre-generated  Mapping files provided for backends – can be modified if necessary  Need to rebuild to get dependencies for backend – No binary distribution of Nutch 2.x  http://wiki.apache.org/nutch/Nutch2Tutorial 29 / 37
  • 28. Benefits  Storage still distributed and replicated  but one big table – status, metadata, content, text → one place  Simplified logic in Nutch – Simpler code for updating / merging information  More efficient (?) – No need to read / write entire structure to update records – No comparison available yet + early days for GORA  Easier interaction with other resources – Third-party code just need to use GORA and schema 30 / 37
  • 29. Drawbacks  More stuff to install and configure :-)  Not as stable as Nutch 1.x  Dependent on success of Gora 31 / 37
  • 30. 2.x Work in progress  Stabilise backend implementations – GORA-Hbase most reliable  Synchronize features with 1.x – e.g. has ElasticSearch but missing LinkRank equivalent  Filter enabled scans (GORA-119) – Don't need to de-serialize the whole dataset 32 / 37
  • 31. Future  Both 1.x and 2.x in parallel – but more frequent releases for 2.x  New functionalities – Support for SOLRCloud – Sitemap (from Crawler Commons library) – Canonical tag – More indexers (e.g. ElasticSearch) + pluggable indexers? 33 / 37
  • 32. More delegation  Great deal done in recent years (SOLR, Tika)  Share code with crawler-commons (http://code.google.com/p/crawler-commons/) – Fetcher / protocol handling – Robots.txt parsing – URL normalisation / filtering  PageRank-like computations to graph library – e.g. Apache Giraph – Should be more efficient as well 34 / 37
  • 33. Where to find out more?  Project page : http://nutch.apache.org/  Wiki : http://wiki.apache.org/nutch/  Mailing lists : – user@nutch.apache.org – dev@nutch.apache.org  Chapter in 'Hadoop the Definitive Guide' (T. White) – Understanding Hadoop is essential anyway...  Support / consulting : – http://wiki.apache.org/nutch/Support 35 / 37
  • 34. Questions ? 36 / 37

Editor's Notes

  1. I&apos;ll be talking about large scale document processing and more specifically about Behemoth which is an open source project based on Hadoop
  2. A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DP is The main projects I am involved in are …
  3. Note that I mention crawling and not web search → used not only for search + used to do indexing and search using Lucene but now delegate this to SOLR
  4. Endpoints are called in various places URL filters and normalisers in a lot of places Same for Soring Filters
  5. Main steps in Nutch More actions available Shell Wrappers around hadoop commands
  6. Main steps in Nutch More actions available Shell Wrappers around hadoop commands
  7. Endpoints are called in various places URL filters and normalisers in a lot of places Same for Soring Filters
  8. Fetcher . multithreaded but polite
  9. Fetcher . multithreaded but polite
  10. Writable object – crawl datum
  11. What does this mean for Nutch?
  12. What does this mean for Nutch?