SlideShare a Scribd company logo
1 of 15
Download to read offline
A quick introduction to 
Storm Crawler 
Julien Nioche 
julien@digitalpebble.com 
@digitalpebble 
ApacheCon EU 2014 - Budapest
2 / 15 
About myself 
 DigitalPebble Ltd, Bristol (UK) 
 Specialised in Text Engineering 
– Web Crawling 
– Natural Language Processing 
– Information Retrieval 
– Machine Learning 
 Strong focus on Open Source & Apache ecosystem 
 PMC Chair Apache Nutch 
 User | Contributor | Committer 
– Tika 
– SOLR, Lucene 
– GATE, UIMA 
– Mahout 
– Behemoth
What is it? 
 Collection of resources (SDK) for building web crawlers on 
Apache Storm 
 https://github.com/DigitalPebble/storm-crawler 
 Artefacts available from Maven Central 
 Apache License v2 
3 / 15 
 Scalable 
 Low latency 
 Easily extensible
What it is not 
 A ready-to-use, feature-complete, recursive web crawler 
– Might be something like that as a separate project using S/C later 
4 / 15 
 e.g. no PageRank or explicit ranking of pages 
– Build your own 
 No fancy UI, dashboards, etc... 
– Build your own
Comparison with Nutch 
 Nutch is batch driven : little control on when URLs are 
fetched 
5 / 15 
– Potential issue for use cases where need sessions 
– latency++ 
 Fetching only one of the steps in Nutch 
– SC : 'always be fetching' (Ken Krugler); better use of resources 
 Make it even more flexible 
– Typical case : few custom classes (at least a Topology) the rest are just 
dependencies and standard S/C components 
 Not ready-to use as Nutch : it's a SDK 
 Would not have existed without it 
– Borrowed code and concepts
6 / 15 
Overview of resources 
https://www.flickr.com/photos/dipster1/1403240351/
7 / 15 
FetcherBolt 
 Multi-threaded 
 Polite 
– Puts incoming tuples into internal queues based on IP/domain/hostname 
– Sets delay between requests from same queue 
– Respects robots.txt 
 Protocol-neutral 
– Protocol implementations are pluggable 
– HTTP implementation taken from Nutch 
 Output 
– String URL 
– byte[] content 
– HashMap<String, String[]> metadata
8 / 15 
ParserBolt 
 Based on Apache Tika 
 Supports most commonly used doc formats 
– HTML, PDF, DOC etc... 
 Calls ParseFilters on document 
– e.g. scrape info with XPathFilter 
 Calls URLFilters on outlinks 
– e.g normalize and / or blacklists URLs based on RegExps 
 Output 
– String URL 
– byte[] content 
– HashMap<String, String[]> metadata 
– String text 
– Set<String> outlinks
9 / 15 
Other resources 
 ElasticSearchBolt 
– Sends fields to ElasticSearch for indexing 
– (deprecated by resources in elasticsearch-hadoop?) 
 URLPartitionerBolt 
– Generates a key based on the hostname / domain / IP of URL 
– Output : 
• String URL 
• String key 
• String metadata 
– Useful for fieldGrouping
10 / 15 
Other resources 
 ConfigurableTopology 
– Overrides config with local YAML file 
– Simple switch for running in local mode 
– Abstract class to be extended 
 Simple Spouts (for testing) 
– FileSpout / RandomURLSpout 
 Various Metrics-related stuff 
– Including a MetricsConsumer for https://www.librato.com/ 
 FetchQueue package 
– BlockingURLSpout and ShardedQueue abstraction
11 / 15 
Integrate it! 
 Write your the Spout for your usecase 
– Will work fine existing resources as long as it generates URL, metadata 
 Typical scenario 
– Group URLs to fetch into separate external queues based on host or 
domain (AWS SQS, Apache Kafka) 
– Write Spout for it and throttle with topology.max.spout.pending 
– So that can enforce politeness without getting timeout on Tuples → fail 
– Parse and extract 
– Send new URLs to queues 
 Can use various forms of persistence for URLs 
– ElasticSearch, DynamoDB, Hbase, etc...
12 / 15 
Some use cases (prototype stage) 
 Processing of streams of data (natural fit for Storm) 
– http://www.weborama.com 
 Monitoring of finite set of URLs 
– http://www.ontopic.io (more on them later) 
– http://www.shopstyle.com : scraping + indexing 
 One-off non-recursive crawling 
– http://www.stolencamerafinder.com/ : scraping + indexing 
 Recursive crawler 
– WIP
13 / 15 
What's next? 
 All-in-one crawler project built on SC 
– Also a good example of how to use SC 
 Additional Parse/URLFilters 
 More tests and documentation 
 A nice logo (this is an invitation) 
 A better name?
14 / 15 
Questions 
?
15 / 15

More Related Content

What's hot

Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Mike Frampton
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSSaumitra Srivastav
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaKnoldus Inc.
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Mark Kerzner
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Robert Evans
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosJoe Stein
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Spark Summit
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormLester Martin
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLCloudera, Inc.
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache MesosJoe Stein
 
Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaAndrew Montalenti
 
Elasticsearch presentation 1
Elasticsearch presentation 1Elasticsearch presentation 1
Elasticsearch presentation 1Maruf Hassan
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridDataWorks Summit
 
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...InfluxData
 
Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceRobert Evans
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungSpark Summit
 

What's hot (20)

Web scraping with nutch solr part 2
Web scraping with nutch solr part 2Web scraping with nutch solr part 2
Web scraping with nutch solr part 2
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFS
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache Mesos
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache Storm
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache Mesos
 
Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and Kafka
 
Elasticsearch presentation 1
Elasticsearch presentation 1Elasticsearch presentation 1
Elasticsearch presentation 1
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop Grid
 
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
RESTful API – How to Consume, Extract, Store and Visualize Data with InfluxDB...
 
Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a service
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
 

Viewers also liked

Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormJulien Nioche
 
Low latency Java apps
Low latency Java appsLow latency Java apps
Low latency Java appsSimon Ritter
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...JAXLondon2014
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
DDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkDDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkbanq jdon
 
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesUnderstanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesLightbend
 
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Spark Summit
 
Distributed stream processing with Apache Kafka
Distributed stream processing with Apache KafkaDistributed stream processing with Apache Kafka
Distributed stream processing with Apache KafkaReactivesummit
 
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...Reactivesummit
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseEdureka!
 
Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler Thamme Gowda
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 

Viewers also liked (17)

Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Low latency Java apps
Low latency Java appsLow latency Java apps
Low latency Java apps
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
DDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFrameworkDDD Framework for Java: JdonFramework
DDD Framework for Java: JdonFramework
 
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesUnderstanding Akka Streams, Back Pressure, and Asynchronous Architectures
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures
 
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
 
Distributed stream processing with Apache Kafka
Distributed stream processing with Apache KafkaDistributed stream processing with Apache Kafka
Distributed stream processing with Apache Kafka
 
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka...
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
 
Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 

Similar to A quick introduction to Storm Crawler

Digital Pebble Behemoth
Digital Pebble BehemothDigital Pebble Behemoth
Digital Pebble BehemothSteve Loughran
 
Everything you wanted to know about writing async, concurrent http apps in java
Everything you wanted to know about writing async, concurrent http apps in java Everything you wanted to know about writing async, concurrent http apps in java
Everything you wanted to know about writing async, concurrent http apps in java Baruch Sadogursky
 
EXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp frameworkEXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp frameworkFlorent Georges
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache PulsarTimothy Spann
 
Open Cloud Computing Interface Presentation
Open Cloud Computing Interface PresentationOpen Cloud Computing Interface Presentation
Open Cloud Computing Interface PresentationIntel Corporation
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataNaveen Korakoppa
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraJoe Stein
 
Webtech1b - hello 123 123
Webtech1b - hello 123 123Webtech1b - hello 123 123
Webtech1b - hello 123 123Gunjan Juyal
 

Similar to A quick introduction to Storm Crawler (20)

Apache Marmotta - Introduction
Apache Marmotta - IntroductionApache Marmotta - Introduction
Apache Marmotta - Introduction
 
Digital Pebble Behemoth
Digital Pebble BehemothDigital Pebble Behemoth
Digital Pebble Behemoth
 
Everything you wanted to know about writing async, concurrent http apps in java
Everything you wanted to know about writing async, concurrent http apps in java Everything you wanted to know about writing async, concurrent http apps in java
Everything you wanted to know about writing async, concurrent http apps in java
 
EXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp frameworkEXPath: the packaging system and the webapp framework
EXPath: the packaging system and the webapp framework
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
 
Open Cloud Computing Interface Presentation
Open Cloud Computing Interface PresentationOpen Cloud Computing Interface Presentation
Open Cloud Computing Interface Presentation
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
 
Webtech1b
Webtech1bWebtech1b
Webtech1b
 
Webtech1b
Webtech1bWebtech1b
Webtech1b
 
Webtech1b
Webtech1bWebtech1b
Webtech1b
 
Webtech1b
Webtech1bWebtech1b
Webtech1b
 
Webtech1b
Webtech1bWebtech1b
Webtech1b
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
 
Webtech1b
Webtech1bWebtech1b
Webtech1b
 
Webtech1b
Webtech1bWebtech1b
Webtech1b
 
webtech1b.ppt
webtech1b.pptwebtech1b.ppt
webtech1b.ppt
 
Webtech1b
Webtech1bWebtech1b
Webtech1b
 
Webtech1b
Webtech1bWebtech1b
Webtech1b
 
Presentation_1367055087514
Presentation_1367055087514Presentation_1367055087514
Presentation_1367055087514
 
Webtech1b - hello 123 123
Webtech1b - hello 123 123Webtech1b - hello 123 123
Webtech1b - hello 123 123
 

Recently uploaded

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxAnnaArtyushina1
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...chiefasafspells
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Bert Jan Schrijver
 

Recently uploaded (20)

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 

A quick introduction to Storm Crawler

  • 1. A quick introduction to Storm Crawler Julien Nioche julien@digitalpebble.com @digitalpebble ApacheCon EU 2014 - Budapest
  • 2. 2 / 15 About myself  DigitalPebble Ltd, Bristol (UK)  Specialised in Text Engineering – Web Crawling – Natural Language Processing – Information Retrieval – Machine Learning  Strong focus on Open Source & Apache ecosystem  PMC Chair Apache Nutch  User | Contributor | Committer – Tika – SOLR, Lucene – GATE, UIMA – Mahout – Behemoth
  • 3. What is it?  Collection of resources (SDK) for building web crawlers on Apache Storm  https://github.com/DigitalPebble/storm-crawler  Artefacts available from Maven Central  Apache License v2 3 / 15  Scalable  Low latency  Easily extensible
  • 4. What it is not  A ready-to-use, feature-complete, recursive web crawler – Might be something like that as a separate project using S/C later 4 / 15  e.g. no PageRank or explicit ranking of pages – Build your own  No fancy UI, dashboards, etc... – Build your own
  • 5. Comparison with Nutch  Nutch is batch driven : little control on when URLs are fetched 5 / 15 – Potential issue for use cases where need sessions – latency++  Fetching only one of the steps in Nutch – SC : 'always be fetching' (Ken Krugler); better use of resources  Make it even more flexible – Typical case : few custom classes (at least a Topology) the rest are just dependencies and standard S/C components  Not ready-to use as Nutch : it's a SDK  Would not have existed without it – Borrowed code and concepts
  • 6. 6 / 15 Overview of resources https://www.flickr.com/photos/dipster1/1403240351/
  • 7. 7 / 15 FetcherBolt  Multi-threaded  Polite – Puts incoming tuples into internal queues based on IP/domain/hostname – Sets delay between requests from same queue – Respects robots.txt  Protocol-neutral – Protocol implementations are pluggable – HTTP implementation taken from Nutch  Output – String URL – byte[] content – HashMap<String, String[]> metadata
  • 8. 8 / 15 ParserBolt  Based on Apache Tika  Supports most commonly used doc formats – HTML, PDF, DOC etc...  Calls ParseFilters on document – e.g. scrape info with XPathFilter  Calls URLFilters on outlinks – e.g normalize and / or blacklists URLs based on RegExps  Output – String URL – byte[] content – HashMap<String, String[]> metadata – String text – Set<String> outlinks
  • 9. 9 / 15 Other resources  ElasticSearchBolt – Sends fields to ElasticSearch for indexing – (deprecated by resources in elasticsearch-hadoop?)  URLPartitionerBolt – Generates a key based on the hostname / domain / IP of URL – Output : • String URL • String key • String metadata – Useful for fieldGrouping
  • 10. 10 / 15 Other resources  ConfigurableTopology – Overrides config with local YAML file – Simple switch for running in local mode – Abstract class to be extended  Simple Spouts (for testing) – FileSpout / RandomURLSpout  Various Metrics-related stuff – Including a MetricsConsumer for https://www.librato.com/  FetchQueue package – BlockingURLSpout and ShardedQueue abstraction
  • 11. 11 / 15 Integrate it!  Write your the Spout for your usecase – Will work fine existing resources as long as it generates URL, metadata  Typical scenario – Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka) – Write Spout for it and throttle with topology.max.spout.pending – So that can enforce politeness without getting timeout on Tuples → fail – Parse and extract – Send new URLs to queues  Can use various forms of persistence for URLs – ElasticSearch, DynamoDB, Hbase, etc...
  • 12. 12 / 15 Some use cases (prototype stage)  Processing of streams of data (natural fit for Storm) – http://www.weborama.com  Monitoring of finite set of URLs – http://www.ontopic.io (more on them later) – http://www.shopstyle.com : scraping + indexing  One-off non-recursive crawling – http://www.stolencamerafinder.com/ : scraping + indexing  Recursive crawler – WIP
  • 13. 13 / 15 What's next?  All-in-one crawler project built on SC – Also a good example of how to use SC  Additional Parse/URLFilters  More tests and documentation  A nice logo (this is an invitation)  A better name?
  • 14. 14 / 15 Questions ?