http://ontopic.io 1
Storm Crawler
A real-time distributed web crawling and
monitoring framework
Jake Dodd, co-founder
http://ontopic.io
jake@ontopic.io
ApacheCon North America 2015
Agenda
§  Overview
§  Continuous vs. Batch
§  Storm-Crawler Components
§  Integration
§  Use Cases
§  Demonstration
§  Q&A
Storm-Crawler (overview)
§  Software Development Kit (SDK) for building web
crawlers on Apache Storm
§  https://github.com/DigitalPebble/storm-crawler
§  Apache License v2
§  Project Director: Julien Nioche (DigitalPebble Ltd)
§  + 3 committers
Facts (overview)
§  Powered by the Apache Storm framework
§  Real-time, distributed, continuous crawling
§  Discovery to indexing with low latency
§  Java API
§  Available as a Maven dependency
The Old Way (continuous vs. batch)
§  Batch-oriented crawling
§  Generate a batch of URLs
§  batch fetch → batch parse → batch index → rinse & repeat
§  Benefits
§  Well-suited when data locality is paramount
§  Challenges
§  Inefficient use of resources—parsing when you could be
fetching, hard to allocate and scale resources for individual
tasks
§  High latency—at least several minutes, often hours,
sometimes days between discovery and indexing
Continuous Crawl (continuous vs. batch)
§  Treat crawling as a streaming problem
§  Feed the machine with a stream of URLs, receive a stream of
results ASAP
§  URL → fetch → parse → (other stuff) → index
§  Benefits
§  Low latency—discovery to indexing in mere moments
§  Efficient use of resources—always be fetching
§  Able to allocate resources to tasks on-the-fly (e.g. scale
fetchers while holding parsers constant)
§  Easily support stateful features (sessions and more)
§  Challenges
§  URL queuing and scheduling
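The difference between the two models can be sketched in a few lines of plain Java. This is only an illustration of stage ordering, not Storm-Crawler code: the class name, method names, and the string "log" are assumptions made for the example, and no real fetching or parsing happens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: records the order in which stages run.
class CrawlOrderSketch {
    // Batch style: each stage finishes for the whole batch before the next stage starts.
    static List<String> batch(List<String> urls) {
        List<String> log = new ArrayList<>();
        for (String u : urls) log.add("fetch " + u);
        for (String u : urls) log.add("parse " + u);
        for (String u : urls) log.add("index " + u);
        return log;
    }

    // Continuous style: each URL flows through every stage as soon as it arrives.
    static List<String> continuous(List<String> urls) {
        List<String> log = new ArrayList<>();
        for (String u : urls) {
            log.add("fetch " + u);
            log.add("parse " + u);
            log.add("index " + u);
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("a", "b");
        System.out.println(batch(urls));      // [fetch a, fetch b, parse a, parse b, index a, index b]
        System.out.println(continuous(urls)); // [fetch a, parse a, index a, fetch b, parse b, index b]
    }
}
```

In the continuous ordering, URL "a" is indexed before "b" is even fetched, which is where the low discovery-to-indexing latency comes from.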
The Static Web (continuous vs. batch)
§  The Old Model: the web as a collection of linked static
documents
§  Still a useful model…just ask Google, Yahoo, Bing, and friends
§  But the web has evolved—dynamism is the rule, not
the exception
The Web Stream (continuous vs. batch)
§  Dynamic resources produce a stream of links to new
documents
§  Applies to web pages, feeds, and social media
[Figure: two static documents and one dynamic resource; the dynamic resource links to three new documents]
Can we do both? (continuous vs. batch)
§  From a crawler’s perspective, there’s not much
difference between new and existing (but
newly-discovered) pages
§  Creating a web index from scratch can be modeled as
a streaming problem
§  Seed URLs → stream of discovered outlinks → rinse & repeat
§  Discovering and indexing new content is a streaming
problem
§  Batch vs. continuous: both methods work, but
continuous offers faster data availability
§  Often important for new content
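The "seed URLs → stream of discovered outlinks" loop can be sketched with a queue and a seen-set. This is a minimal, single-threaded illustration under stated assumptions: the web is faked as an in-memory map from URL to outlinks, and the class and method names are hypothetical, not part of Storm-Crawler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a crawl from scratch treated as a stream of URLs.
class FrontierSketch {
    static List<String> crawl(List<String> seeds, Map<String, List<String>> outlinks) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();          // URLs already queued once
        List<String> visited = new ArrayList<>();    // order in which URLs were "fetched"
        for (String s : seeds) {
            if (seen.add(s)) frontier.add(s);
        }
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            visited.add(url);
            // Newly discovered outlinks are fed straight back into the stream.
            for (String link : outlinks.getOrDefault(url, List.of())) {
                if (seen.add(link)) frontier.add(link);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "seed", List.of("a", "b"),
            "a", List.of("b", "c"),
            "c", List.of("seed"));
        System.out.println(crawl(List.of("seed"), web)); // [seed, a, b, c]
    }
}
```

A page that already existed and a page published a second ago go through exactly the same loop, which is the point of the slide.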
Conclusions (continuous vs. batch)
§  A modern web crawler should:
§  Use resources efficiently
§  Leverage the elasticity of modern cloud infrastructures
§  Be responsive—fetch and index new documents with low
latency
§  Elegantly handle streams of new content
§  The dynamic web requires a dynamic crawler
Storm-Crawler: What is it? (storm-crawler components)
§  A Software Development Kit (SDK) for building and
configuring continuous web crawlers
§  Storm components (spouts & bolts) that handle
primary web crawling operations
§  Fetching, parsing, and indexing
§  Some of the code has been borrowed (with much
gratitude) from Apache Nutch
§  High level of maturity
§  Organized into two sub-projects
§  Core (sc-core): components and utilities needed by all crawler
apps
§  External (sc-external): components that depend on external
technologies (Elasticsearch and more)
What is it not? (storm-crawler components)
§  Storm-Crawler is not a full-featured, ready-to-use web
crawler application
§  We’re in the process of building that separately—will use the
Storm-Crawler SDK
§  No explicit link & content management (such as linkdb
and crawldb with Nutch)
§  But components to support recursive crawls are being added quickly
§  No PageRank
Basic Topology (storm-crawler components)
[Figure: basic topology. Spouts: Spout. Bolts: URL Partitioner; Fetcher 1, Fetcher 2, ... Fetcher (n); Parser 1, ... Parser (n); Indexer]
Storm topologies consist of spouts and bolts
Spouts (storm-crawler components)
§  File spout
§  In sc-core
§  Reads URLs from a file
§  Elasticsearch spout
§  In sc-external
§  Reads URLs from an Elasticsearch index
§  Functioning, but we’re working on improvements
§  Other options (Redis, Kafka, etc.)
§  Will discuss later in presentation
Bolts (storm-crawler components)
§  The SDK includes several bolts that handle:
§  URL partitioning
§  Fetching
§  Parsing
§  Filtering
§  Indexing
§  We’ll briefly discuss each of these
Bolts: URL Partitioner (storm-crawler components)
§  Partitions incoming URLs by host, domain, or IP
address
§  Strategy is configurable in the topology configuration file
§  Creates a partition field in the tuple
§  Storm’s grouping feature can then be used to distribute
tuples according to requirements
§  localOrShuffleGrouping() to randomly distribute URLs to fetchers
§  or fieldsGrouping() to ensure all URLs with the same {host,
domain, IP} go to the same fetcher
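The effect of partitioning by host and then grouping on the partition field can be sketched as follows. This is an illustration in plain Java, not the SDK's partitioner: the class and method names are hypothetical, and the modulo routing only mimics the guarantee that fieldsGrouping() gives (equal keys reach the same task), not Storm's internal hashing.

```java
import java.net.URI;

// Hypothetical sketch of "partition by host, then group on the partition key".
class PartitionSketch {
    // Partition key for the "by host" strategy: the host part of the URL.
    static String partitionKey(String url) {
        return URI.create(url).getHost();
    }

    // Illustrative routing only: equal keys always map to the same task index.
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        String a = partitionKey("http://example.com/page1");
        String b = partitionKey("http://example.com/page2");
        // Both URLs share a host, so both lines name the same fetcher.
        System.out.println(a + " -> fetcher " + taskFor(a, 4));
        System.out.println(b + " -> fetcher " + taskFor(b, 4));
    }
}
```

Keeping every URL of a host on one fetcher is what makes a per-host crawl delay enforceable inside that fetcher.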
Bolts: Fetchers (storm-crawler components)
§  Two fetcher bolts provided in sc-core
§  Both respect robots.txt
§  FetcherBolt
§  Multithreaded (configurable number of threads)
§  Use with fieldsGrouping() on the partition key and a
configurable crawl delay to ensure your crawler is polite
§  SimpleFetcherBolt
§  No internal queues
§  Concurrency configured using parallelism hint and # of tasks
§  Politeness must be handled outside of the topology
§  Easier to reason about; requires additional work to enforce
politeness
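A configurable crawl delay per partition key can be sketched like this. It is a minimal, single-threaded illustration of the idea, not FetcherBolt's implementation: the class name, the reserve() method, and the use of caller-supplied timestamps are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: enforce a minimum delay between fetches for the same key.
class PolitenessSketch {
    private final long crawlDelayMillis;
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessSketch(long crawlDelayMillis) {
        this.crawlDelayMillis = crawlDelayMillis;
    }

    // Returns the earliest time (in ms) at which the next fetch for this key
    // may start, and reserves that slot so later calls are pushed back.
    long reserve(String key, long nowMillis) {
        long start = Math.max(nowMillis, nextAllowed.getOrDefault(key, 0L));
        nextAllowed.put(key, start + crawlDelayMillis);
        return start;
    }

    public static void main(String[] args) {
        PolitenessSketch politeness = new PolitenessSketch(1000);
        System.out.println(politeness.reserve("example.com", 0));   // 0
        System.out.println(politeness.reserve("example.com", 100)); // 1000
        System.out.println(politeness.reserve("other.org", 100));   // 100
    }
}
```

This only works if all URLs for a key pass through the same instance, which is why the slide pairs the crawl delay with fieldsGrouping() on the partition key.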
Bolts: Parsers (storm-crawler components)
§  Parser Bolt
§  Utilizes Apache Tika for parsing
§  Collects, filters, normalizes, and emits outlinks
§  Collects page metadata (HTTP headers, etc)
§  Parses the page’s content to a text representation
§  Sitemap Parser Bolt
§  Uses the Crawler-Commons sitemap parser
§  Collects, filters, normalizes, and emits outlinks
§  Requires a priori knowledge that a page is a sitemap
Bolts: Indexing (storm-crawler components)
§  Printer Bolt (in sc-core)
§  Prints output to stdout—useful for debugging
§  Elasticsearch Indexer Bolt (in sc-external)
§  Indexes parsed page content and metadata into Elasticsearch
§  Elasticsearch Status Bolt (in sc-external)
§  URLs and their status (discovered, fetched, error) are emitted
to a special status stream in the storm topology
§  This bolt indexes the URL, metadata, and its status into a
‘status’ Elasticsearch index
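What a consumer of the status stream might do with each update can be sketched with an in-memory map standing in for the 'status' index. The three status values come from the slide; everything else is an assumption for illustration, in particular the rule that a "discovered" update never overwrites an existing entry (so a page that was already fetched is not reset when another page links to it).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an in-memory stand-in for a URL status index.
class StatusSketch {
    enum Status { DISCOVERED, FETCHED, ERROR }

    private final Map<String, Status> index = new HashMap<>();

    // Assumption: DISCOVERED only creates new entries; FETCHED and ERROR
    // always overwrite, since they report the outcome of an actual fetch.
    void update(String url, Status status) {
        if (status == Status.DISCOVERED) {
            index.putIfAbsent(url, status);
        } else {
            index.put(url, status);
        }
    }

    Status get(String url) {
        return index.get(url);
    }

    public static void main(String[] args) {
        StatusSketch statuses = new StatusSketch();
        statuses.update("http://example.com/", Status.DISCOVERED);
        statuses.update("http://example.com/", Status.FETCHED);
        statuses.update("http://example.com/", Status.DISCOVERED); // rediscovered via an outlink
        System.out.println(statuses.get("http://example.com/"));   // FETCHED
    }
}
```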
Other components (storm-crawler components)
§  URL Filters & Normalizers
§  Configurable with a JSON file
§  Regex filter & normalizer borrowed from Nutch
§  HostURLFilter enables you to ignore outlinks from outside
domains or hosts
§  Parse Filters
§  Useful for scraping and extracting info from pages
§  XPath-based parse filter, more to come
§  Filters & Normalizers are easily pluggable
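The combined effect of a regex filter and a host filter on outlinks can be sketched in plain Java. This is not the SDK's filter API or its JSON configuration: the class name, method names, and the deny pattern are hypothetical and chosen only for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: drop outlinks that match a deny regex or leave the source host.
class OutlinkFilterSketch {
    static boolean sameHost(String sourceUrl, String outlink) {
        String a = URI.create(sourceUrl).getHost();
        String b = URI.create(outlink).getHost();
        return a != null && a.equalsIgnoreCase(b);
    }

    static List<String> filter(String sourceUrl, List<String> outlinks, Pattern deny) {
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            if (deny.matcher(link).find()) continue;  // regex filter
            if (!sameHost(sourceUrl, link)) continue; // host filter
            kept.add(link);
        }
        return kept;
    }

    public static void main(String[] args) {
        Pattern deny = Pattern.compile("\\.(png|jpg|gif)$");
        List<String> outlinks = List.of(
            "http://example.com/about",
            "http://other.org/page",
            "http://example.com/logo.png");
        System.out.println(filter("http://example.com/", outlinks, deny)); // [http://example.com/about]
    }
}
```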
Integrating Storm-Crawler (integration)
§  Because Storm-Crawler is an SDK, it needs to be
integrated with other technologies to build a
full-featured web crawler
§  At the very least, a database
§  For URLs, metadata, and maybe content
§  Some search engines can double as your core data store
(beware…research ‘Jepsen tests’ for caveats)
§  Probably a search engine
§  Solr, Elasticsearch, etc.
§  sc-external provides basic integration with Elasticsearch
§  Maybe some distributed system technologies for
crawl control
§  Redis, Kafka, ZooKeeper, etc.
Storm-Crawler at Ontopic (integration)
§  The storm-crawler SDK is our workhorse for web
monitoring
§  Integrated with Apache Kafka, Redis, and several
other technologies
§  Running on an EC2 cluster managed by Hortonworks
HDP 2.2
Architecture (integration)
[Figure: architecture diagram]
•  Redis: Seed List, Domain locks, Outlink List, Logstash events
•  URL Manager (Ruby app)
•  kafka: One topic, two partitions
•  storm-crawler: One topology, Seed stream and outlink stream
•  Elasticsearch
•  Arrow labels: manages; Publishes seeds and outlinks to Kafka; Kafka Spout with two executors (one for each topic partition); indexing; logstash
R&D Cluster (AWS) (integration)
[Figure: the architecture diagram annotated with AWS instance sizes]
•  Components: Redis, storm-crawler, kafka, Elasticsearch
•  Arrow labels: manages; Publishes seeds and outlinks to Kafka; Kafka Spout with two executors (one for each topic partition); indexing; logstash
•  Instance labels: 1 x r3.large; 1 x m1.small instance (Redis and Ruby app); Nimbus: 1 x r3.large; Supervisors: 3 x c3.large (in a placement group); 1 x c3.large
Integration Examples (integration)
§  Formal crawl metadata specification & serialization
with Avro
§  Kafka publishing bolt
§  Component to publish crawl data to Kafka (complex URL
status handling, for example, could be performed by another
topology)
§  Externally-stored transient crawl data
§  Components for storing shared crawl data (such as a
robots.txt cache) in a key-value store (Redis, Memcached, etc.)
Use Cases & Users (use cases)
§  Processing streams of URLs
§  http://www.weborama.com
§  Continuous URL monitoring
§  http://www.shopstyle.com
§  http://www.ontopic.io
§  One-off non-recursive crawling
§  http://www.stolencamerafinder.com
§  Recursive crawling
§  http://www.shopstyle.com
§  More in development & stealth mode
Demonstration
(live demo of Ontopic’s topology)
Q&A
Any questions?
Resources (q&a)
§  Project page
§  https://github.com/DigitalPebble/storm-crawler
§  Project documentation
§  https://github.com/DigitalPebble/storm-crawler/wiki
§  Previous presentations
§  http://www.slideshare.net/digitalpebble/j-nioche-bristoljavameetup20150310
§  http://www.slideshare.net/digitalpebble/storm-crawler-ontopic20141113?related=1
§  Other resources
§  http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf
Thank you!

More Related Content

What's hot

Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridDataWorks Summit
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormNati Shalom
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormLester Martin
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & StormOtto Mok
 
Streaming and Messaging
Streaming and MessagingStreaming and Messaging
Streaming and MessagingXin Wang
 
Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceRobert Evans
 
Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormRan Silberman
 
Scaling an ELK stack at bol.com
Scaling an ELK stack at bol.comScaling an ELK stack at bol.com
Scaling an ELK stack at bol.comRenzo Tomà
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Folio3 Software
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm ConceptsAndré Dias
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.DECK36
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.Prajal Kulkarni
 
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.Airat Khisamov
 

What's hot (20)

Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop Grid
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache Storm
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
 
Streaming and Messaging
Streaming and MessagingStreaming and Messaging
Streaming and Messaging
 
Multi-tenant Apache Storm as a service
Multi-tenant Apache Storm as a serviceMulti-tenant Apache Storm as a service
Multi-tenant Apache Storm as a service
 
Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & Storm
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Scaling an ELK stack at bol.com
Scaling an ELK stack at bol.comScaling an ELK stack at bol.com
Scaling an ELK stack at bol.com
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.Null Bachaav - May 07 Attack Monitoring workshop.
Null Bachaav - May 07 Attack Monitoring workshop.
 
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.
Central LogFile Storage. ELK stack Elasticsearch, Logstash and Kibana.
 
ELK Stack
ELK StackELK Stack
ELK Stack
 
Logstash
LogstashLogstash
Logstash
 

Similar to Storm crawler apachecon_na_2015

Mi Domain Wheel Slides
Mi Domain Wheel SlidesMi Domain Wheel Slides
Mi Domain Wheel Slideslancesfa
 
Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Yuval Itzchakov
 
Frontera-Open Source Large Scale Web Crawling Framework
Frontera-Open Source Large Scale Web Crawling FrameworkFrontera-Open Source Large Scale Web Crawling Framework
Frontera-Open Source Large Scale Web Crawling Frameworksixtyone
 
AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)Igor Talevski
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wildJulien Nioche
 
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyWeb scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyLITTINRAJAN
 
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...confluent
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm CrawlerJulien Nioche
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Henry S
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptMichael Nelson
 
ElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der CloudsElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der Cloudsinovex GmbH
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloJoe Stein
 
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit
 
Prototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.orgPrototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.orgPeter Skomoroch
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCSheetal Dolas
 
REST APIs for the Internet of Things
REST APIs for the Internet of ThingsREST APIs for the Internet of Things
REST APIs for the Internet of ThingsMichael Koster
 
REST APIs for an Internet of Things
REST APIs for an Internet of ThingsREST APIs for an Internet of Things
REST APIs for an Internet of ThingsMichael Koster
 
Meteor Day Athens (2014-11-07)
Meteor Day Athens (2014-11-07)Meteor Day Athens (2014-11-07)
Meteor Day Athens (2014-11-07)svub
 

Similar to Storm crawler apachecon_na_2015 (20)

Mi Domain Wheel Slides
Mi Domain Wheel SlidesMi Domain Wheel Slides
Mi Domain Wheel Slides
 
Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)
 
Frontera-Open Source Large Scale Web Crawling Framework
Frontera-Open Source Large Scale Web Crawling FrameworkFrontera-Open Source Large Scale Web Crawling Framework
Frontera-Open Source Large Scale Web Crawling Framework
 
AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)AngularJS 1.x - your first application (problems and solutions)
AngularJS 1.x - your first application (problems and solutions)
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wild
 
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and ScrapyWeb scraping with BeautifulSoup, LXML, RegEx and Scrapy
Web scraping with BeautifulSoup, LXML, RegEx and Scrapy
 
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1
 
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with JavascriptCombining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
 
ElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der CloudsElasticSearch - Suche im Zeitalter der Clouds
ElasticSearch - Suche im Zeitalter der Clouds
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
 
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache ...
 
WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Prototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.orgPrototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.org
 
Open Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOCOpen Security Operations Center - OpenSOC
Open Security Operations Center - OpenSOC
 
REST APIs for the Internet of Things
REST APIs for the Internet of ThingsREST APIs for the Internet of Things
REST APIs for the Internet of Things
 
REST APIs for an Internet of Things
REST APIs for an Internet of ThingsREST APIs for an Internet of Things
REST APIs for an Internet of Things
 
Meteor Day Athens (2014-11-07)
Meteor Day Athens (2014-11-07)Meteor Day Athens (2014-11-07)
Meteor Day Athens (2014-11-07)
 

Recently uploaded

Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 

Recently uploaded (20)

Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 

Storm crawler apachecon_na_2015

  • 1. Storm Crawler: A real-time distributed web crawling and monitoring framework
      • Jake Dodd, co-founder, http://ontopic.io, jake@ontopic.io
      • ApacheCon North America 2015
  • 2. Agenda
      • Overview
      • Continuous vs. Batch
      • Storm-Crawler Components
      • Integration
      • Use Cases
      • Demonstration
      • Q&A
  • 3. Storm-Crawler (overview)
      • Software Development Kit (SDK) for building web crawlers on Apache Storm
      • https://github.com/DigitalPebble/storm-crawler
      • Apache License v2
      • Project Director: Julien Nioche (DigitalPebble Ltd)
      • + 3 committers
  • 4. Facts (overview)
      • Powered by the Apache Storm framework
      • Real-time, distributed, continuous crawling
      • Discovery to indexing with low latency
      • Java API
      • Available as a Maven dependency
  • 5. The Old Way (continuous vs. batch)
      • Batch-oriented crawling
          • Generate a batch of URLs
          • batch fetch → batch parse → batch index → rinse & repeat
      • Benefits
          • Well-suited when data locality is paramount
      • Challenges
          • Inefficient use of resources: parsing when you could be fetching; hard to allocate and scale resources for individual tasks
          • High latency: at least several minutes, often hours, sometimes days between discovery and indexing
  • 6. Continuous Crawl (continuous vs. batch)
      • Treat crawling as a streaming problem
          • Feed the machine with a stream of URLs, receive a stream of results ASAP
          • URL → fetch → parse → (other stuff) → index
      • Benefits
          • Low latency: discovery to indexing in mere moments
          • Efficient use of resources: always be fetching
          • Able to allocate resources to tasks on-the-fly (e.g. scale fetchers while holding parsers constant)
          • Easily support stateful features (sessions and more)
      • Challenges
          • URL queuing and scheduling
  • 7. The Static Web (continuous vs. batch)
      • The Old Model: the web as a collection of linked static documents
      • Still a useful model… just ask Google, Yahoo, Bing, and friends
      • But the web has evolved: dynamism is the rule, not the exception
  • 8. The Web Stream (continuous vs. batch)
      • Dynamic resources produce a stream of links to new documents
      • Applies to web pages, feeds, and social media
      • [Diagram; node labels: static, static, dynamic, new, new, new]
  • 9. Can we do both? (continuous vs. batch)
      • From a crawler’s perspective, there’s not much difference between new and existing (but newly-discovered) pages
      • Creating a web index from scratch can be modeled as a streaming problem
          • Seed URLs → stream of discovered outlinks → rinse & repeat
      • Discovering and indexing new content is a streaming problem
      • Batch vs. continuous: both methods work, but continuous offers faster data availability
          • Often important for new content
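The "seed URLs → stream of discovered outlinks → rinse & repeat" loop on this slide can be sketched in plain Java. This is a minimal single-process illustration, not Storm-Crawler code: the class name `CrawlStreamSketch`, the `crawl` method, and the in-memory `outlinks` map (standing in for real fetching and parsing) are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a toy crawl loop over an in-memory link graph. */
class CrawlStreamSketch {

    /**
     * Processes seeds first, then every newly discovered outlink, each URL once.
     * The outlinks map is a hypothetical stand-in for fetch + parse.
     */
    static List<String> crawl(List<String> seeds, Map<String, List<String>> outlinks) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> indexed = new ArrayList<>();

        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(seed);
            }
        }
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            indexed.add(url); // stand-in for fetch, parse, index
            for (String link : outlinks.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(link)) {
                    frontier.add(link); // a discovered outlink re-enters the stream
                }
            }
        }
        return indexed;
    }
}
```

In a continuous crawler the frontier would never drain, because dynamic pages keep producing new links; the loop terminates here only because the toy graph is finite.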
  • 10. Conclusions (continuous vs. batch)
      • A modern web crawler should:
          • Use resources efficiently
          • Leverage the elasticity of modern cloud infrastructures
          • Be responsive: fetch and index new documents with low latency
          • Elegantly handle streams of new content
      • The dynamic web requires a dynamic crawler
  • 11. Storm-Crawler: What is it? (storm-crawler components)
      • A Software Development Kit (SDK) for building and configuring continuous web crawlers
      • Storm components (spouts & bolts) that handle primary web crawling operations
          • Fetching, parsing, and indexing
      • Some of the code has been borrowed (with much gratitude) from Apache Nutch
          • High level of maturity
      • Organized into two sub-projects
          • Core (sc-core): components and utilities needed by all crawler apps
          • External (sc-external): components that depend on external technologies (Elasticsearch and more)
  • 12. What is it not? (storm-crawler components)
      • Storm-Crawler is not a full-featured, ready-to-use web crawler application
          • We’re in the process of building that separately; it will use the Storm-Crawler SDK
      • No explicit link & content management (such as linkdb and crawldb with Nutch)
          • But quickly adding components to support recursive crawls
      • No PageRank
  • 13. Basic Topology (storm-crawler components)
      • Storm topologies consist of spouts and bolts
      • [Diagram labels. Spouts: Spout. Bolts: URL Partitioner, Fetcher 1, Fetcher 2, Fetcher (n), Parser 1, Parser (n), Indexer]
  • 14. Spouts (storm-crawler components)
      • File spout
          • In sc-core
          • Reads URLs from a file
      • Elasticsearch spout
          • In sc-external
          • Reads URLs from an Elasticsearch index
          • Functioning, but we’re working on improvements
      • Other options (Redis, Kafka, etc.)
          • Will discuss later in presentation
  • 15. Bolts (storm-crawler components)
      • The SDK includes several bolts that handle:
          • URL partitioning
          • Fetching
          • Parsing
          • Filtering
          • Indexing
      • We’ll briefly discuss each of these
  • 16. Bolts: URL Partitioner (storm-crawler components)
      • Partitions incoming URLs by host, domain, or IP address
          • Strategy is configurable in the topology configuration file
          • Creates a partition field in the tuple
      • Storm’s grouping feature can then be used to distribute tuples according to requirements
          • localOrShuffleGrouping() to randomly distribute URLs to fetchers
          • or fieldsGrouping() to ensure all URLs with the same {host, domain, IP} go to the same fetcher
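The idea behind partitioning followed by a fields grouping can be illustrated without Storm: derive a key from the URL, then map equal keys to the same fetcher index. The sketch below is plain Java and makes several assumptions: `PartitionSketch`, `partitionKey`, and `fetcherFor` are hypothetical names, only the by-host strategy is shown, and the hash-modulo assignment is in the spirit of a fields grouping rather than a copy of Storm's implementation.

```java
import java.net.URI;
import java.util.Locale;

/** Illustrative only: not the Storm-Crawler URL partitioner bolt. */
class PartitionSketch {

    /** Returns the lower-cased host of the URL, used here as the partition key. */
    static String partitionKey(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    /** Maps a key to one of numFetchers indices; equal keys always map to the same index. */
    static int fetcherFor(String key, int numFetchers) {
        return Math.floorMod(key.hashCode(), numFetchers);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a",
            "http://example.com/b",
            "http://example.org/"
        };
        for (String url : urls) {
            String key = partitionKey(url);
            System.out.println(url + " -> key=" + key + ", fetcher=" + fetcherFor(key, 4));
        }
    }
}
```

Because both `example.com` URLs share a key, they land on the same fetcher index, which is what makes per-host politeness enforceable inside a single fetcher.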
  • 17. Bolts: Fetchers (storm-crawler components)
      • Two fetcher bolts provided in sc-core
          • Both respect robots.txt
      • FetcherBolt
          • Multithreaded (configurable number of threads)
          • Use with fieldsGrouping() on the partition key and a configurable crawl delay to ensure your crawler is polite
      • SimpleFetcherBolt
          • No internal queues
          • Concurrency configured using parallelism hint and # of tasks
          • Politeness must be handled outside of the topology
          • Easier to reason about; requires additional work to enforce politeness
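One way to picture a crawl delay per partition key is a table of "next allowed fetch time" per host. The following is a rough sketch under stated assumptions, not the FetcherBolt's internal queueing: the class `PolitenessSketch` and its `reserveSlot` method are hypothetical, time is passed in explicitly as milliseconds, and the class is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: per-key crawl-delay bookkeeping for a single fetcher. */
class PolitenessSketch {
    private final long crawlDelayMs;
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessSketch(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /**
     * Returns the earliest time (in ms) at which a fetch for this key may start,
     * and reserves that slot so the next request for the same key waits one delay.
     */
    long reserveSlot(String key, long nowMs) {
        Long previous = nextAllowed.get(key);
        long earliest = (previous == null) ? nowMs : Math.max(nowMs, previous);
        nextAllowed.put(key, earliest + crawlDelayMs);
        return earliest;
    }
}
```

This bookkeeping is only meaningful when every URL for a given key reaches the same fetcher instance, which is why the slide pairs the crawl delay with fieldsGrouping() on the partition key.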
  • 18. Bolts: Parsers (storm-crawler components)
      • Parser Bolt
          • Utilizes Apache Tika for parsing
          • Collects, filters, normalizes, and emits outlinks
          • Collects page metadata (HTTP headers, etc.)
          • Parses the page’s content to a text representation
      • Sitemap Parser Bolt
          • Uses the Crawler-Commons sitemap parser
          • Collects, filters, normalizes, and emits outlinks
          • Requires a priori knowledge that a page is a sitemap
  • 19. Bolts: Indexing (storm-crawler components)
      • Printer Bolt (in sc-core)
          • Prints output to stdout; useful for debugging
      • Elasticsearch Indexer Bolt (in sc-external)
          • Indexes parsed page content and metadata into Elasticsearch
      • Elasticsearch Status Bolt (in sc-external)
          • URLs and their status (discovered, fetched, error) are emitted to a special status stream in the Storm topology
          • This bolt indexes the URL, metadata, and its status into a ‘status’ Elasticsearch index
  • 20. Other components (storm-crawler components)
      • URL Filters & Normalizers
          • Configurable with a JSON file
          • Regex filter & normalizer borrowed from Nutch
          • HostURLFilter enables you to ignore outlinks from outside domains or hosts
      • Parse Filters
          • Useful for scraping and extracting info from pages
          • XPath-based parse filter, more to come
      • Filters & Normalizers are easily pluggable
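A chain of pluggable URL filters and normalizers can be modelled as functions that return either a (possibly rewritten) URL or `null` to drop it. The sketch below is an assumption-laden illustration and does not reproduce Storm-Crawler's filter interfaces or its JSON configuration: `UrlFilterSketch` and all of its methods are hypothetical, and the same-host check only mimics the behaviour the slide attributes to HostURLFilter.

```java
import java.net.URI;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

/** Illustrative only: a tiny filter/normalizer chain; null means "drop this URL". */
class UrlFilterSketch {

    /** Normalizer: removes a trailing #fragment, if any. */
    static UnaryOperator<String> stripFragment() {
        return url -> {
            int i = url.indexOf('#');
            return i < 0 ? url : url.substring(0, i);
        };
    }

    /** Filter: drops URLs in which the regex is found. */
    static UnaryOperator<String> rejectMatching(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return url -> pattern.matcher(url).find() ? null : url;
    }

    /** Filter: keeps only outlinks on the same host as the source page. */
    static UnaryOperator<String> sameHostAs(String sourceUrl) {
        String sourceHost = URI.create(sourceUrl).getHost();
        return url -> {
            String host = URI.create(url).getHost();
            return sourceHost != null && sourceHost.equalsIgnoreCase(host) ? url : null;
        };
    }

    /** Applies each step in order and stops as soon as one step drops the URL. */
    static String apply(List<UnaryOperator<String>> chain, String url) {
        String current = url;
        for (UnaryOperator<String> step : chain) {
            if (current == null) {
                return null;
            }
            current = step.apply(current);
        }
        return current;
    }
}
```

Modelling each step as a plain function keeps the chain order explicit: normalizing before filtering means the regex and host checks see the cleaned-up URL.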
  • 21. Integrating Storm-Crawler (integration)
      • Because Storm-Crawler is an SDK, it needs to be integrated with other technologies to build a full-featured web crawler
      • At the very least, a database
          • For URLs, metadata, and maybe content
          • Some search engines can double as your core data store (beware… research ‘Jepsen tests’ for caveats)
      • Probably a search engine
          • Solr, Elasticsearch, etc.
          • sc-external provides basic integration with Elasticsearch
      • Maybe some distributed system technologies for crawl control
          • Redis, Kafka, ZooKeeper, etc.
  • 22. Storm-Crawler at Ontopic (integration)
      • The storm-crawler SDK is our workhorse for web monitoring
      • Integrated with Apache Kafka, Redis, and several other technologies
      • Running on an EC2 cluster managed by Hortonworks HDP 2.2
  • 23. Architecture (integration)
      • [Diagram labels]
          • Redis: Seed List, Domain locks, Outlink List, Logstash events
          • URL Manager (Ruby app)
          • storm-crawler: One topology; Seed stream and outlink stream
          • kafka: One topic, two partitions
          • Kafka Spout with two executors (one for each topic partition)
          • Elasticsearch
          • Arrow labels: manages; Publishes seeds and outlinks to Kafka; indexing; logstash
  • 24. R&D Cluster (AWS) (integration)
      • [Diagram labels]
          • Components: Redis, storm-crawler, kafka, Elasticsearch
          • Kafka Spout with two executors (one for each topic partition)
          • Arrow labels: manages; Publishes seeds and outlinks to Kafka; indexing; logstash
          • Instance labels: 1 x r3.large; 1 x m1.small instance (Redis and Ruby app); Nimbus: 1 x r3.large; Supervisors: 3 x c3.large (in a placement group); 1 x c3.large
  • 25. Integration Examples (integration)
      • Formal crawl metadata specification & serialization with Avro
      • Kafka publishing bolt
          • Component to publish crawl data to Kafka (complex URL status handling, for example, could be performed by another topology)
      • Externally-stored transient crawl data
          • Components for storing shared crawl data (such as a robots.txt cache) in a key-value store (Redis, Memcached, etc.)
  • 26. Use Cases & Users (use cases)
      • Processing streams of URLs
          • http://www.weborama.com
      • Continuous URL monitoring
          • http://www.shopstyle.com
          • http://www.ontopic.io
      • One-off non-recursive crawling
          • http://www.stolencamerafinder.com
      • Recursive crawling
          • http://www.shopstyle.com
      • More in development & stealth mode
  • 29. Resources (q&a)
      • Project page
          • https://github.com/DigitalPebble/storm-crawler
      • Project documentation
          • https://github.com/DigitalPebble/storm-crawler/wiki
      • Previous presentations
          • http://www.slideshare.net/digitalpebble/j-nioche-bristoljavameetup20150310
          • http://www.slideshare.net/digitalpebble/storm-crawler-ontopic20141113?related=1
      • Other resources
          • http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf
      • Thank you!