SlideShare a Scribd company logo
Low latency scalable web crawling on
Apache Storm
Julien Nioche
julien@digitalpebble.com
Berlin Buzzwords 01/06/2015
Storm Crawler
digitalpebble
2
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
 Strong focus on Open Source & Apache ecosystem
– Nutch
– Tika
– GATE, UIMA
– SOLR, Elasticearch
– Behemoth
3
 Collection of resources (SDK) for building web crawlers on Apache Storm
 https://github.com/DigitalPebble/storm-crawler
 Apache License v2
 Artefacts available from Maven Central
 Active and growing fast
 Version 0.5 just released!
Storm-Crawler : what is it?
=> Can scale
=> Can do low latency
=> Easy to use and extend
=> Nice features (politeness, scraping, sitemaps etc...)
4
What it is not
 A well packaged application
– It's a SDK : it requires some minimal programming and configuration
 No global processing of the pages e.g.PageRank
 No fancy UI, dashboards, etc...
– Build your own or integrate in your existing ones
– But can use Storm UI + metrics to various backends (Librato, ElasticSearch)
5
Comparison with Apache Nutch
6
StormCrawler vs Nutch
 Nutch is batch driven : little control on when URLs are fetched
– Potential issue for use cases where need sessions
– latency++
 Fetching only one of the steps in Nutch
– SC : 'always fetching' ; better use of resources
 More flexible
– Typical case : few custom classes (at least a Topology) the rest are just dependencies and standard
SC components
– Logical crawls : multiple crawlers with their own scheduling easier with SC via queues
 Not ready-to use as Nutch : it's a SDK
 Would not have existed without it
– Borrowed some code and concepts
– Contributed some stuff back
7
Apache Storm
8
Apache Storm
 Distributed real time computation system
 http://storm.apache.org/
 Scalable, fault-tolerant, polyglot and fun!
 Implemented in Clojure + Java
 “Realtime analytics, online machine learning, continuous computation,
distributed RPC, ETL, and more”
9
Topology = spouts + bolts + tuples + streams
10
What's in Storm Crawler?
https://www.flickr.com/photos/dipster1/1403240351/
11
Resources: core vs external
 Core
– Fetcher(s) Bolts
– JsoupParserBolt
– SitemapParserBolt
– ...
 External
– Metrics-related : inc. connector for Librato
– ElasticSearch : Indexer, Spout, StatusUpdater, connector for metrics
– Tika-based parser bolt
 User-maintained external resources
 Generic Storm resources (e.g. spouts)
12
FetcherBolt
 Multi-threaded
 Polite
– Puts incoming tuples into internal queues based on IP/domain/hostname
– Sets delay between requests from same queue
– Internal fetch threads
– Respects robots.txt
 Protocol-neutral
– Protocol implementations are pluggable
– Default HTTP implementation based on Apache HttpClient
 Also have a SimpleFetcherBolt
– Need handle politeness elsewhere (spout?)
13
JSoupParserBolt
 HTML only but have a Tika-based one in external
 Extracts texts and outlinks
 Calls URLFilters on outlinks
– normalize and / or blacklists URLs
 Calls ParseFilters on document
– e.g. scrape info with XpathFilter
– Enrich metadata content
14
ParseFilter
 Extracts information from document
– Called by *ParserBolt(s)
– Configured via JSON file
– Interface
• filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
– com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
• Xpath expressions
• Info extracted stored in metadata
• Used later for indexing
15
URLFilter
 Control crawl expansion
 Delete or rewrites URLs
 Configured in JSON file
 Interface
– String filter(URL sourceUrl, Metadata sourceMetadata,String urlToFilter)
 Basic
 MaxDepth
 Host
 RegexURLFilter
 RegexURLNormalizer
16
Basic Crawl Topology
Fetcher
Parser
URLPartitioner
Indexer
Some Spout (1) pull URLs from an external source 
(2) group per host / domain / IP
(3) fetch URLs
Default
stream
(4) parse content
(3) fetch URLs
(5) index text and metadata
17
Basic Topology : Tuple perspective
Fetcher
Parser
URLPartitioner
Indexer
Some Spout
(1) <String url, Metadata m>
(2) <String url, String key, Metadata m>
Default
stream
(4) <String url, byte[] content, Metadata m, String
text>
(3) <String url, byte[] content, Metadata m>
(1)
(2)
(3)
(4)
18
Basic scenario
 Spout : simple queue e.g. RabbitMQ, AWS SQS, etc...
 Indexer : any form of storage but often index e.g. SOLR, ElasticSearch
 Components in grey : default ones from SC
 Use case : non-recursive crawls (i.e. no links discovered), URLs known in
advance. Failures don't matter too much.
 “Great! What about recursive crawls and / or failure handling?”
19
Frontier expansion
 Manual “discovery”
– Adding new URLs by hand,
“seeding”
 Automatic discovery of new
resources (frontier
expansion)
– Not all outlinks are equally useful
- control
– Requires content parsing and link
extraction
seed
i = 1
i = 2
i = 3
[Slide courtesy of A. Bialecki]
20
Recursive Crawl Topology
Fetcher
Parser
URLPartitioner
Indexer
Some Spout
Default
stream
Status Updater
Status stream
Switch
DB
(unique
URLS)
21
Status stream
 Used for handling errors and update info about URLs
 Newly discovered URLs from parser
public enum Status {
DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;}
 StatusUpdater : writes to storage
 Can/should extend AbstractStatusUpdaterBolt
22
External : ElasticSearch
 IndexerBolt
– Indexes URL / metadata / text for search
– extends AbstractIndexerBolt
 StatusUpdaterBolt
– extends AbstractStatusUpdaterBolt
– URL / status / metadata / nextFetchDate in status index
 ElasticSearchSpout
– Reads from status index
– Sends URL + Metadata tuples down topology
 MetricsConsumer
– indexes Storm metrics for displaying e.g. with Kibana
23
How to use it?
24
How to use it?
 Write your own Topology class (or hack the example one)
 Put resources in src/main/resources
– URLFilters, ParseFilters etc...
 Build uber-jar with Maven
– mvn clean package
 Set the configuration
– External YAML file (e.g. crawler-conf.yaml)
 Call Storm
storm jar target/storm-crawler-core-0.5-SNAPSHOT-jar-with-dependencies.jar
com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
25
Topology in code
Grouping!
Politeness and data 
locality
26
27
Case Studies
 'No follow' crawl #1: existing list of URLs only – one off
– http://www.stolencamerafinder.com/ : [RabbitMQ + Elasticsearch]
 'No follow' crawl #2: Streams of URLs
– http://www.weborama.com : [queues? + HBase]
 Monitoring of finite set of URLs / non recursive crawl
– http://www.shopstyle.com : scraping + indexing [DynamoDB + AWS SQS]
– http://www.ontopic.io : [Redis + Kafka + ElasticSearch]
– http://www.careerbuilder.com : [RabbitMQ + Redis + ElasticSearch]
 Full recursive crawler
– http://www.shopstyle.com : discovery of product pages [DynamoDB]
28
What's next?
 Just released 0.5
– Improved WIKI documentation but can always do better
 All-in-one crawler based on ElasticSearch
– Also a good example of how to use SC
– Separate project
– Most resources already available in external
 Additional Parse/URLFilters
– Basic URL Normalizer #120
 Parse -> generate output for more than one document #117
– Change to the ParseFilter interface
 Selenium-based protocol implementation
– Handle AJAX pages
29
Resources
 https://storm.apache.org/
 https://github.com/DigitalPebble/storm-crawler/wiki
 http://nutch.apache.org/
 “Storm Applied”
http://www.manning.com/sallen/
30
Questions
?
31

More Related Content

What's hot

Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and Kafka
Andrew Montalenti
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
JAXLondon2014
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
Samuel Rash
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
P. Taylor Goetz
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Spark Summit
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
viirya
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache Storm
Lester Martin
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
Xin Wang
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
DataWorks Summit/Hadoop Summit
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Chris Mattmann
 
Document Similarity with Cloud Computing
Document Similarity with Cloud ComputingDocument Similarity with Cloud Computing
Document Similarity with Cloud Computing
Bryan Bende
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
Nati Shalom
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
Humoyun Ahmedov
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
P. Taylor Goetz
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Davorin Vukelic
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
Rahul Jain
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN
Jim Dowling
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
DECK36
 

What's hot (20)

Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and Kafka
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
 
2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache Storm
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
 
Document Similarity with Cloud Computing
Document Similarity with Cloud ComputingDocument Similarity with Cloud Computing
Document Similarity with Cloud Computing
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 

Similar to Low latency scalable web crawling on Apache Storm

An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
Julien Nioche
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
lucenerevolution
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
Julien Nioche
 
StormCrawler at Bristech
StormCrawler at BristechStormCrawler at Bristech
StormCrawler at Bristech
Julien Nioche
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
Lucidworks (Archived)
 
Nex clipper 1905_summary_eng
Nex clipper 1905_summary_engNex clipper 1905_summary_eng
Nex clipper 1905_summary_eng
Jinyong Kim
 
breed_python_tx_redacted
breed_python_tx_redactedbreed_python_tx_redacted
breed_python_tx_redactedRyan Breed
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
Steve Watt
 
Practical Chaos Engineering
Practical Chaos EngineeringPractical Chaos Engineering
Practical Chaos Engineering
SIGHUP
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
Databricks
 
Infrastructure as Code with Terraform
Infrastructure as Code with TerraformInfrastructure as Code with Terraform
Infrastructure as Code with Terraform
Mathieu Herbert
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
Flink Forward
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
Lucidworks
 
Attack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and KibanaAttack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and Kibana
Prajal Kulkarni
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
MapR Technologies
 
Proactive ops for container orchestration environments
Proactive ops for container orchestration environmentsProactive ops for container orchestration environments
Proactive ops for container orchestration environments
Docker, Inc.
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Infrastructure as Code - Terraform - Devfest 2018
Infrastructure as Code - Terraform - Devfest 2018Infrastructure as Code - Terraform - Devfest 2018
Infrastructure as Code - Terraform - Devfest 2018
Mathieu Herbert
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
Hao Chen
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
DataWorks Summit/Hadoop Summit
 

Similar to Low latency scalable web crawling on Apache Storm (20)

An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
StormCrawler at Bristech
StormCrawler at BristechStormCrawler at Bristech
StormCrawler at Bristech
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Nex clipper 1905_summary_eng
Nex clipper 1905_summary_engNex clipper 1905_summary_eng
Nex clipper 1905_summary_eng
 
breed_python_tx_redacted
breed_python_tx_redactedbreed_python_tx_redacted
breed_python_tx_redacted
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Practical Chaos Engineering
Practical Chaos EngineeringPractical Chaos Engineering
Practical Chaos Engineering
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Infrastructure as Code with Terraform
Infrastructure as Code with TerraformInfrastructure as Code with Terraform
Infrastructure as Code with Terraform
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
 
Attack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and KibanaAttack monitoring using ElasticSearch Logstash and Kibana
Attack monitoring using ElasticSearch Logstash and Kibana
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Proactive ops for container orchestration environments
Proactive ops for container orchestration environmentsProactive ops for container orchestration environments
Proactive ops for container orchestration environments
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Infrastructure as Code - Terraform - Devfest 2018
Infrastructure as Code - Terraform - Devfest 2018Infrastructure as Code - Terraform - Devfest 2018
Infrastructure as Code - Terraform - Devfest 2018
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 

Recently uploaded

Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
abdulrafaychaudhry
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 

Recently uploaded (20)

Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 

Low latency scalable web crawling on Apache Storm

  • 1. Low latency scalable web crawling on Apache Storm Julien Nioche julien@digitalpebble.com Berlin Buzzwords 01/06/2015 Storm Crawler digitalpebble
  • 2. 2 About myself  DigitalPebble Ltd, Bristol (UK)  Specialised in Text Engineering – Web Crawling – Natural Language Processing – Information Retrieval – Machine Learning  Strong focus on Open Source & Apache ecosystem – Nutch – Tika – GATE, UIMA – SOLR, Elasticearch – Behemoth
  • 3. 3  Collection of resources (SDK) for building web crawlers on Apache Storm  https://github.com/DigitalPebble/storm-crawler  Apache License v2  Artefacts available from Maven Central  Active and growing fast  Version 0.5 just released! Storm-Crawler : what is it? => Can scale => Can do low latency => Easy to use and extend => Nice features (politeness, scraping, sitemaps etc...)
  • 4. 4 What it is not  A well packaged application – It's a SDK : it requires some minimal programming and configuration  No global processing of the pages e.g.PageRank  No fancy UI, dashboards, etc... – Build your own or integrate in your existing ones – But can use Storm UI + metrics to various backends (Librato, ElasticSearch)
  • 6. 6 StormCrawler vs Nutch  Nutch is batch driven : little control on when URLs are fetched – Potential issue for use cases where need sessions – latency++  Fetching only one of the steps in Nutch – SC : 'always fetching' ; better use of resources  More flexible – Typical case : few custom classes (at least a Topology) the rest are just dependencies and standard SC components – Logical crawls : multiple crawlers with their own scheduling easier with SC via queues  Not ready-to use as Nutch : it's a SDK  Would not have existed without it – Borrowed some code and concepts – Contributed some stuff back
  • 8. 8 Apache Storm  Distributed real time computation system  http://storm.apache.org/  Scalable, fault-tolerant, polyglot and fun!  Implemented in Clojure + Java  “Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more”
  • 9. 9 Topology = spouts + bolts + tuples + streams
  • 10. 10 What's in Storm Crawler? https://www.flickr.com/photos/dipster1/1403240351/
  • 11. 11 Resources: core vs external  Core – Fetcher(s) Bolts – JsoupParserBolt – SitemapParserBolt – ...  External – Metrics-related : inc. connector for Librato – ElasticSearch : Indexer, Spout, StatusUpdater, connector for metrics – Tika-based parser bolt  User-maintained external resources  Generic Storm resources (e.g. spouts)
  • 12. 12 FetcherBolt  Multi-threaded  Polite – Puts incoming tuples into internal queues based on IP/domain/hostname – Sets delay between requests from same queue – Internal fetch threads – Respects robots.txt  Protocol-neutral – Protocol implementations are pluggable – Default HTTP implementation based on Apache HttpClient  Also have a SimpleFetcherBolt – Need handle politeness elsewhere (spout?)
  • 13. 13 JSoupParserBolt  HTML only but have a Tika-based one in external  Extracts texts and outlinks  Calls URLFilters on outlinks – normalize and / or blacklists URLs  Calls ParseFilters on document – e.g. scrape info with XpathFilter – Enrich metadata content
  • 14. 14 ParseFilter  Extracts information from document – Called by *ParserBolt(s) – Configured via JSON file – Interface • filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata) – com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java • Xpath expressions • Info extracted stored in metadata • Used later for indexing
  • 15. 15 URLFilter  Control crawl expansion  Delete or rewrites URLs  Configured in JSON file  Interface – String filter(URL sourceUrl, Metadata sourceMetadata,String urlToFilter)  Basic  MaxDepth  Host  RegexURLFilter  RegexURLNormalizer
  • 16. 16 Basic Crawl Topology Fetcher Parser URLPartitioner Indexer Some Spout (1) pull URLs from an external source  (2) group per host / domain / IP (3) fetch URLs Default stream (4) parse content (3) fetch URLs (5) index text and metadata
  • 17. 17 Basic Topology : Tuple perspective Fetcher Parser URLPartitioner Indexer Some Spout (1) <String url, Metadata m> (2) <String url, String key, Metadata m> Default stream (4) <String url, byte[] content, Metadata m, String text> (3) <String url, byte[] content, Metadata m> (1) (2) (3) (4)
  • 18. 18 Basic scenario  Spout : simple queue e.g. RabbitMQ, AWS SQS, etc...  Indexer : any form of storage but often index e.g. SOLR, ElasticSearch  Components in grey : default ones from SC  Use case : non-recursive crawls (i.e. no links discovered), URLs known in advance. Failures don't matter too much.  “Great! What about recursive crawls and / or failure handling?”
  • 19. 19 Frontier expansion  Manual “discovery” – Adding new URLs by hand, “seeding”  Automatic discovery of new resources (frontier expansion) – Not all outlinks are equally useful - control – Requires content parsing and link extraction seed i = 1 i = 2 i = 3 [Slide courtesy of A. Bialecki]
  • 21. 21 Status stream  Used for handling errors and update info about URLs  Newly discovered URLs from parser public enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;}  StatusUpdater : writes to storage  Can/should extend AbstractStatusUpdaterBolt
  • 22. 22 External : ElasticSearch  IndexerBolt – Indexes URL / metadata / text for search – extends AbstractIndexerBolt  StatusUpdaterBolt – extends AbstractStatusUpdaterBolt – URL / status / metadata / nextFetchDate in status index  ElasticSearchSpout – Reads from status index – Sends URL + Metadata tuples down topology  MetricsConsumer – indexes Storm metrics for displaying e.g. with Kibana
  • 24. 24 How to use it?  Write your own Topology class (or hack the example one)  Put resources in src/main/resources – URLFilters, ParseFilters etc...  Build uber-jar with Maven – mvn clean package  Set the configuration – External YAML file (e.g. crawler-conf.yaml)  Call Storm storm jar target/storm-crawler-core-0.5-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
  • 26. 26
  • 27. 27 Case Studies  'No follow' crawl #1: existing list of URLs only – one off – http://www.stolencamerafinder.com/ : [RabbitMQ + Elasticsearch]  'No follow' crawl #2: Streams of URLs – http://www.weborama.com : [queues? + HBase]  Monitoring of finite set of URLs / non recursive crawl – http://www.shopstyle.com : scraping + indexing [DynamoDB + AWS SQS] – http://www.ontopic.io : [Redis + Kafka + ElasticSearch] – http://www.careerbuilder.com : [RabbitMQ + Redis + ElasticSearch]  Full recursive crawler – http://www.shopstyle.com : discovery of product pages [DynamoDB]
  • 28. 28 What's next?  Just released 0.5 – Improved WIKI documentation but can always do better  All-in-one crawler based on ElasticSearch – Also a good example of how to use SC – Separate project – Most resources already available in external  Additional Parse/URLFilters – Basic URL Normalizer #120  Parse -> generate output for more than one document #117 – Change to the ParseFilter interface  Selenium-based protocol implementation – Handle AJAX pages
  • 29. 29 Resources  https://storm.apache.org/  https://github.com/DigitalPebble/storm-crawler/wiki  http://nutch.apache.org/  “Storm Applied” http://www.manning.com/sallen/
  • 31. 31