ONTOPIC 
Storm-crawler in the wild 
Jake K. Dodd 
jake@ontopic.io 
http://www.ontopic.io
Who We Are 
ONTOPIC 
• Ontopic is an early-stage FinTech startup 
located in Los Angeles, CA 
• We’re building an engine that empowers 
qualitative financial research by taming 
information overload
Our Requirements 
• Need to discover news as soon as it 
appears on the web 
• Involves monitoring several hundred 
thousand content sources 
• This is better described as web monitoring than as web crawling
What We Tried 
Apache Nutch 
• + Perhaps the gold standard for open-source web crawling 
• + Capable of handling millions of pages per day 
• - We decided that we were trying to force Nutch to do something it wasn't designed for: real-time monitoring 
Scrapy 
• + Open-source Python web-scraping framework 
• + Incredibly simple to get started 
• + Processing pipelines are dead simple to develop 
• - No built-in distributed mode; building an in-house distributed, continuous-crawl framework for Scrapy seemed like a fragile solution 
• - Designed, and primarily used, as a web scraper (again, not precisely the same as web monitoring)
storm-crawler at Ontopic 
• The storm-crawler project is our workhorse 
for web monitoring 
• Integrated with Apache Kafka, Redis, and 
several other technologies 
• Running on a cluster managed by 
Hortonworks HDP 2.2
High-Level Architecture 
• URL Manager (Ruby app): manages Redis and publishes seeds and outlinks to Kafka 
• Redis: holds the seed list, domain locks, the outlink list, and Logstash events 
• Kafka: one topic with two partitions 
• storm-crawler: one topology with a seed stream and an outlink stream; the Kafka spout runs with two executors, one per topic partition 
• Elasticsearch: receives the indexed documents 
• Logstash: reads events from Redis for monitoring
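As a concrete illustration of this wiring, here is a minimal topology sketch: a Kafka spout with two executors (one per topic partition) feeding storm-crawler's fetch and parse bolts. This is a hedged sketch, not our production code: the ZooKeeper host, topic name, parallelism values, and the adapter bolt are assumptions, and the package names match Storm 0.9.x (as shipped with HDP 2.2) and an early storm-crawler release.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

import com.digitalpebble.storm.crawler.Metadata;
import com.digitalpebble.storm.crawler.bolt.FetcherBolt;
import com.digitalpebble.storm.crawler.bolt.JSoupParserBolt;

public class MonitoringTopology {

    // Adapter bolt: maps the Kafka spout's "str" field onto the
    // ("url", "metadata") tuple layout that storm-crawler bolts expect.
    public static class UrlAdapterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getStringByField("str"), new Metadata()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url", "metadata"));
        }
    }

    public static void main(String[] args) throws Exception {
        // One Kafka topic ("urls") carries both seeds and outlinks.
        ZkHosts zkHosts = new ZkHosts("zk1:2181"); // hypothetical ZooKeeper host
        SpoutConfig spoutConf = new SpoutConfig(zkHosts, "urls", "/kafkaspout", "url-spout");
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        // Two spout executors, one per topic partition.
        builder.setSpout("spout", new KafkaSpout(spoutConf), 2);
        builder.setBolt("adapt", new UrlAdapterBolt(), 2).shuffleGrouping("spout");
        // A URLPartitionerBolt for per-host politeness is omitted for brevity.
        builder.setBolt("fetch", new FetcherBolt(), 4).shuffleGrouping("adapt");
        builder.setBolt("parse", new JSoupParserBolt(), 4).localOrShuffleGrouping("fetch");
        // An Elasticsearch indexing bolt and outlink handling (publishing
        // discovered outlinks back to Kafka) would follow here.

        Config conf = new Config();
        conf.setNumWorkers(3); // one worker per supervisor node
        // conf also needs the usual storm-crawler keys (http agent, etc.)
        StormSubmitter.submitTopology("news-monitor", conf, builder.createTopology());
    }
}
```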
R&D Environment (AWS) 
The same architecture, deployed on AWS: 
• Redis and the URL Manager (Ruby app): 1 x m1.small 
• Storm Nimbus: 1 x r3.large 
• Storm Supervisors: 3 x c3.large (in a placement group) 
• Elasticsearch and the remaining services (Kafka, Logstash): 1 x r3.large and 1 x c3.large
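The domain locks kept in Redis (see the architecture above) follow a simple pattern: acquire a short-lived, per-domain lock before scheduling a URL, so that revisits to the same host are spaced out. Our manager is a Ruby app, but the pattern is language-agnostic; below is a minimal sketch using the older Jedis 2.x API, with the key prefix and TTL as illustrative assumptions.

```java
import redis.clients.jedis.Jedis;

public class DomainLock {

    private final Jedis jedis;

    public DomainLock(Jedis jedis) {
        this.jedis = jedis;
    }

    /**
     * Try to acquire a short-lived lock on a domain. Returns true if this
     * caller won the lock; the lock expires automatically after ttlSeconds,
     * enforcing a minimum delay between fetches to the same domain.
     */
    public boolean tryLock(String domain, int ttlSeconds) {
        // SET key value NX EX ttl: atomic "set if absent, with expiry".
        // Jedis 2.x returns "OK" on success, null if the lock is held.
        String reply = jedis.set("lock:" + domain, "1", "NX", "EX", ttlSeconds);
        return "OK".equals(reply);
    }
}
```

A scheduler would call tryLock("example.com", 60) before emitting a URL for that domain, and skip or re-queue the URL if the lock is already held.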
Eye Candy (Ambari) 
[Screenshot: Ambari dashboard, showing cluster utilization of only ~7%]
Eye Candy (Storm) 
[Screenshot: Storm UI, showing ~800,000 URLs per day]
Eye Candy (Kibana) 
[Screenshot: Kibana dashboard, ~800,000 URLs per day] 
Metrics from storm-crawler, sent to Logstash, enable easy real-time monitoring.
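For reference, wiring metrics out of Storm is a one-liner on the topology config: register a metrics consumer, and every component's counters are delivered to it on an interval. The sketch below uses Storm's built-in LoggingMetricsConsumer as a stand-in; our actual Logstash forwarding is not shown in the deck, so treat the wiring as an assumption.

```java
import backtype.storm.Config;
import backtype.storm.metric.LoggingMetricsConsumer;

public class MetricsConfig {
    public static Config build() {
        Config conf = new Config();
        // Collect built-in (and storm-crawler) metrics every 60 seconds.
        conf.put(Config.TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, 60);
        // One consumer task. The built-in logger is a stand-in here; a
        // custom IMetricsConsumer that writes JSON events to Logstash
        // (or to the Redis list Logstash reads from) registers the same way.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
        return conf;
    }
}
```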
Conclusion 
• Storm-crawler has enabled us to build a reliable, distributed, web-scale news monitoring solution 
• The screen grabs are from our R&D environment, in which we've been able to monitor ~2,000 sources with a revisit time of one minute, at roughly 10% utilization on a small cluster 
• We have no scalability concerns with storm-crawler: increasing the number of tasks and nodes has demonstrated the ability to fetch hundreds of thousands of pages per minute 
• Ontopic is committed to open-sourcing our work on top of storm-crawler and to being a core contributor to the project 
• We're working to generalize our integration points with Redis, Kafka, and Logstash, and to provide tutorials so that users can easily leverage these technologies (or their equivalents) in their own storm-crawler projects
