ONTOPIC 
Storm-crawler in the wild 
Jake K. Dodd 
jake@ontopic.io 
http://www.ontopic.io
Who We Are 
ONTOPIC 
• Ontopic is an early-stage FinTech startup 
located in Los Angeles, CA 
• We’re building an engine that empowers 
qualitative financial research by taming 
information overload
Our Requirements 
• Need to discover news as soon as it 
appears on the web 
• Involves monitoring several hundred 
thousand content sources 
• This is better described as web monitoring than as web crawling
What We Tried 
Apache Nutch 
• + Perhaps the gold standard for open-source web crawling 
• + Capable of handling millions of pages per day 
• - We decided that we were trying to force Nutch to do something it wasn't designed for: real-time monitoring 
Scrapy 
• + Open-source Python web-scraping framework 
• + Incredibly simple to get started 
• + Processing pipelines are dead simple to develop 
• - No built-in distributed mode; building an in-house distributed, continuous-crawl framework for Scrapy seemed like a fragile solution 
• - Designed, and primarily used, as a web scraper (again, not precisely the same as web monitoring)
storm-crawler at Ontopic 
• The storm-crawler project is our workhorse 
for web monitoring 
• Integrated with Apache Kafka, Redis, and 
several other technologies 
• Running on a cluster managed by 
Hortonworks HDP 2.2
High-Level Architecture 
• URL Manager (Ruby app): manages Redis and publishes seeds and outlinks to Kafka 
• Redis: holds the seed list, domain locks, the outlink list, and Logstash events 
• Kafka: one topic with two partitions 
• storm-crawler: one topology with a seed stream and an outlink stream; the Kafka spout runs with two executors, one per topic partition 
• Elasticsearch: receives the indexed documents 
• Logstash: reads events from Redis for monitoring
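As a concrete illustration of this wiring, here is a minimal topology sketch: a Kafka spout with two executors (one per topic partition) feeding storm-crawler's fetch and parse bolts. This is a hedged sketch, not our production code: the ZooKeeper host, topic name, parallelism values, and the adapter bolt are assumptions, and the package names match Storm 0.9.x (as shipped with HDP 2.2) and an early storm-crawler release.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

import com.digitalpebble.storm.crawler.Metadata;
import com.digitalpebble.storm.crawler.bolt.FetcherBolt;
import com.digitalpebble.storm.crawler.bolt.JSoupParserBolt;

public class MonitoringTopology {

    // Adapter bolt: maps the Kafka spout's "str" field onto the
    // ("url", "metadata") tuple layout that storm-crawler bolts expect.
    public static class UrlAdapterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getStringByField("str"), new Metadata()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url", "metadata"));
        }
    }

    public static void main(String[] args) throws Exception {
        // One Kafka topic ("urls") carries both seeds and outlinks.
        ZkHosts zkHosts = new ZkHosts("zk1:2181"); // hypothetical ZooKeeper host
        SpoutConfig spoutConf = new SpoutConfig(zkHosts, "urls", "/kafkaspout", "url-spout");
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        // Two spout executors, one per topic partition.
        builder.setSpout("spout", new KafkaSpout(spoutConf), 2);
        builder.setBolt("adapt", new UrlAdapterBolt(), 2).shuffleGrouping("spout");
        // A URLPartitionerBolt for per-host politeness is omitted for brevity.
        builder.setBolt("fetch", new FetcherBolt(), 4).shuffleGrouping("adapt");
        builder.setBolt("parse", new JSoupParserBolt(), 4).localOrShuffleGrouping("fetch");
        // An Elasticsearch indexing bolt and outlink handling (publishing
        // discovered outlinks back to Kafka) would follow here.

        Config conf = new Config();
        conf.setNumWorkers(3); // one worker per supervisor node
        // conf also needs the usual storm-crawler keys (http agent, etc.)
        StormSubmitter.submitTopology("news-monitor", conf, builder.createTopology());
    }
}
```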
R&D Environment (AWS) 
The same architecture, deployed on AWS: 
• Redis and the URL Manager (Ruby app): 1 x m1.small 
• Storm Nimbus: 1 x r3.large 
• Storm Supervisors: 3 x c3.large (in a placement group) 
• Elasticsearch and the remaining services (Kafka, Logstash): 1 x r3.large and 1 x c3.large
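The domain locks kept in Redis (see the architecture above) follow a simple pattern: acquire a short-lived, per-domain lock before scheduling a URL, so that revisits to the same host are spaced out. Our manager is a Ruby app, but the pattern is language-agnostic; below is a minimal sketch using the older Jedis 2.x API, with the key prefix and TTL as illustrative assumptions.

```java
import redis.clients.jedis.Jedis;

public class DomainLock {

    private final Jedis jedis;

    public DomainLock(Jedis jedis) {
        this.jedis = jedis;
    }

    /**
     * Try to acquire a short-lived lock on a domain. Returns true if this
     * caller won the lock; the lock expires automatically after ttlSeconds,
     * enforcing a minimum delay between fetches to the same domain.
     */
    public boolean tryLock(String domain, int ttlSeconds) {
        // SET key value NX EX ttl: atomic "set if absent, with expiry".
        // Jedis 2.x returns "OK" on success, null if the lock is held.
        String reply = jedis.set("lock:" + domain, "1", "NX", "EX", ttlSeconds);
        return "OK".equals(reply);
    }
}
```

A scheduler would call tryLock("example.com", 60) before emitting a URL for that domain, and skip or re-queue the URL if the lock is already held.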
Eye Candy (Ambari) 
[Screenshot: Ambari dashboard, showing cluster utilization of only ~7%]
Eye Candy (Storm) 
[Screenshot: Storm UI, showing ~800,000 URLs per day]
Eye Candy (Kibana) 
[Screenshot: Kibana dashboard, ~800,000 URLs per day] 
Metrics from storm-crawler, sent to Logstash, enable easy real-time monitoring.
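For reference, wiring metrics out of Storm is a one-liner on the topology config: register a metrics consumer, and every component's counters are delivered to it on an interval. The sketch below uses Storm's built-in LoggingMetricsConsumer as a stand-in; our actual Logstash forwarding is not shown in the deck, so treat the wiring as an assumption.

```java
import backtype.storm.Config;
import backtype.storm.metric.LoggingMetricsConsumer;

public class MetricsConfig {
    public static Config build() {
        Config conf = new Config();
        // Collect built-in (and storm-crawler) metrics every 60 seconds.
        conf.put(Config.TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, 60);
        // One consumer task. The built-in logger is a stand-in here; a
        // custom IMetricsConsumer that writes JSON events to Logstash
        // (or to the Redis list Logstash reads from) registers the same way.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
        return conf;
    }
}
```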
Conclusion 
• Storm-crawler has enabled us to build a reliable, distributed, web-scale news monitoring solution 
• The screen grabs are from our R&D environment, in which we've been able to monitor ~2,000 sources with a revisit time of one minute, at roughly 10% utilization on a small cluster 
• We have no scalability concerns with storm-crawler: increasing the number of tasks and nodes has demonstrated the ability to fetch hundreds of thousands of pages per minute 
• Ontopic is committed to open-sourcing our work on top of storm-crawler and to being a core contributor to the project 
• We're working to generalize our integration points with Redis, Kafka, and Logstash, and to provide tutorials so that users can easily leverage these technologies (or their equivalents) in their own storm-crawler projects
