This talk provides an overview of the open source Storm system for processing Big Data in real time. It starts with an overview of the technology and its key components: Nimbus, ZooKeeper, topologies, tuples, and Trident. It looks at integration with Hadoop through YARN and recent improvements, then dives into the broader Big Data architectures in which Storm can be integrated. The result is a compelling stack of technologies spanning Hadoop clusters, MPP, and NoSQL databases.
After this, we look at example use cases for Storm: real-time advertising statistics, updating a machine-learned model for content popularity predictions, and financial compliance monitoring.
Real-time Big Data Analytics with Storm, by Ron Bodkin of Think Big Analytics
1. Ron Bodkin
Founder & President, Think Big
September 2014
Real-time Big Data Analytics with Storm
2. 2
Agenda
• Storm refresher
• When is Storm appropriate?
• API options
• Use cases
Copyright 2013-2014 Think Big, a Teradata Company
3. 3
Storm Overview
• DAG processing of never-ending streams of data
- Open source: https://github.com/nathanmarz/storm/wiki
- Used at Twitter plus dozens of other companies
- Reliable: at-least-once semantics
- Think MapReduce for data streams
- Java / Clojure based
- Bolts in Java and "shell bolts"
- Not a queue, but usually reads from a queue
• Related:
- Spark Streaming, Samza, S4, CEP
• Compromises
- Static topologies & cluster sizing – avoids messy dynamic rebalancing
- Reliability: relies on a reliable or durable data source; Nimbus is a SPOF
- State not built in
• Community support; Hortonworks commercial support
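At-least-once semantics mean a spout replays any tuple that is not acked, so downstream logic may see duplicates and must tolerate them. A minimal plain-Python sketch of that contract (not the Storm API; `ReplayingSpout` and `CountBolt` are hypothetical names):

```python
class ReplayingSpout:
    """Replays tuples until they are acked (at-least-once delivery)."""
    def __init__(self, messages):
        self.pending = list(messages)   # unacked tuples
        self.emitted = []               # everything ever delivered downstream

    def next_tuple(self):
        if self.pending:
            msg = self.pending[0]       # keep at head until acked
            self.emitted.append(msg)
            return msg
        return None

    def ack(self, msg):
        self.pending.remove(msg)


class CountBolt:
    """Counts words; fails its first tuple once to force a replay."""
    def __init__(self):
        self.counts = {}
        self.failed_once = False

    def execute(self, word, spout):
        if not self.failed_once:        # simulate a transient failure
            self.failed_once = True
            return                      # no ack -> the spout will replay
        self.counts[word] = self.counts.get(word, 0) + 1
        spout.ack(word)


spout = ReplayingSpout(["storm", "kafka"])
bolt = CountBolt()
for _ in range(10):                     # drive the "topology" to completion
    msg = spout.next_tuple()
    if msg is None:
        break
    bolt.execute(msg, spout)

# "storm" was delivered twice (replayed after the failure), counted once
print(bolt.counts)  # {'storm': 1, 'kafka': 1}
```

The duplicate delivery in `spout.emitted` is exactly why the later "Lessons" slide recommends de-duplication jobs or exactly-once semantics via Trident.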
5. 5
What is Real-Time?
• Low latency
- Query response
- Data refresh
- End-to-end response
• … nanoseconds, milliseconds, seconds, or minutes depending on
your problem
6. 6
Why Real-time?
• Better end-user experience
- Ex: view an ad, see the counter move
• Operational intelligence
- Low-latency analysis
• Event response
- Content half-life measured at 3 hours (H. Mason: http://bit.ly/nu7IDw)
- Path to additional real-time capabilities
- Scalable analysis
- Example: trend analysis to recommend 'hot' articles
7. 7
Why Storm?
• Scalable way to manage queue readers and the logic pipeline
• Much better than rolling your own
• Reliable (message guarantees, fault tolerant)
• Fast (over 1M messages per node per second in some tests)
• Multi-node scaling
• It works
• Open source community
• It runs on YARN
• See also: https://github.com/nathanmarz/storm/wiki/Rationale
8. 8
Programming Options
• Java
- Raw API (tuples and raw computation)
- Trident – Java DSL (SQL-like primitives), micro-batch, exactly-once
• Python
• Scala
• Clojure
• Ruby
• Trident-ML
• Storm-Esper
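Trident's exactly-once state updates come from processing micro-batches tagged with transaction ids and storing the last applied txid alongside the state, so a replayed batch is detected and skipped. A plain-Python sketch of that idea (not Trident's actual API; `TransactionalCount` is a hypothetical name):

```python
class TransactionalCount:
    """Idempotent counter state: a batch is applied at most once per key."""
    def __init__(self):
        self.state = {}   # key -> (last_txid, count)

    def apply_batch(self, txid, batch_counts):
        for key, n in batch_counts.items():
            last_txid, count = self.state.get(key, (None, 0))
            if last_txid == txid:
                continue              # replayed batch: already applied, skip
            self.state[key] = (txid, count + n)

    def get(self, key):
        return self.state.get(key, (None, 0))[1]


store = TransactionalCount()
store.apply_batch(txid=1, batch_counts={"ad42": 3})
store.apply_batch(txid=1, batch_counts={"ad42": 3})   # replay after a failure
store.apply_batch(txid=2, batch_counts={"ad42": 2})
print(store.get("ad42"))  # 5  (the replayed batch is counted once)
```

Note this trick requires the spout to replay a failed batch with identical contents, which is what Trident's transactional spouts guarantee.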
12. 12
Analytics Architecture
[Architecture diagram: an Impression Spout feeds a Storm topology of Annotator, Simple Bot filter, Time Series BucketBolt, NoSQL Bolt, and DFS Adapter; impressions land in Hadoop for Impression Bucketization and Predictive Model Training; Time Series Buckets (batch and realtime), Predictive Model Parameters, and Impression Predictions are served to the Web Server over time.]
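The BucketBolt in this topology rolls raw impression events into fixed-width time-series buckets. A plain-Python sketch of that bucketing logic (one-minute buckets and the class name are assumptions, not from the deck):

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # one-minute time-series buckets (assumed width)

class BucketBolt:
    """Aggregates (ad_id, timestamp) impressions into per-minute buckets."""
    def __init__(self, bucket_seconds=BUCKET_SECONDS):
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(int)   # (ad_id, bucket_start) -> count

    def execute(self, ad_id, ts):
        bucket_start = ts - (ts % self.bucket_seconds)  # floor to bucket edge
        self.buckets[(ad_id, bucket_start)] += 1


bolt = BucketBolt()
for ts in (0, 30, 59, 60, 90):            # epoch seconds
    bolt.execute("ad42", ts)

# 0, 30, 59 fall in bucket 0; 60 and 90 fall in bucket 60
print(dict(bolt.buckets))  # {('ad42', 0): 3, ('ad42', 60): 2}
```

The same bucketization runs in batch in Hadoop ("Impression Bucketization"), so realtime and batch buckets line up when merged.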
13. 13
Solution Approach
[Diagram: analyze a massive historical data set; analyze the recent past; make realtime predictions]
• Historical Data Set = S3
• Analyze = Hadoop + Pig + R
• Recent Past = Storm + NoSQL
• Analyze = R + Web Service
14. 14
Simple Real-Time Recommendation Updates
[Diagram: Web Servers read & update a NoSQL Database; a batch log feed loads a Hadoop Cluster, which sends a batch model feed back to the NoSQL Database and a batch export to an MPP Database for ad hoc queries and classic BI access]
• Requests read a single user profile
• May update that profile record
• Classic key/value store problem
• Can split transient state into read/write NoSQL with a batch-view NoSQL
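Splitting transient state into a read/write store plus a batch view means reads merge an immutable batch view (recomputed by Hadoop) with a small realtime delta. A sketch of that read path (function and variable names are illustrative):

```python
def read_profile(user_id, batch_view, realtime_delta):
    """Merge the Hadoop-built batch view with realtime updates on read."""
    profile = dict(batch_view.get(user_id, {}))      # copy; batch view is immutable
    profile.update(realtime_delta.get(user_id, {}))  # recent writes win
    return profile


batch_view = {"u1": {"segment": "sports", "visits": 40}}
realtime_delta = {"u1": {"visits": 42}}   # updated since the last batch run

print(read_profile("u1", batch_view, realtime_delta))
# {'segment': 'sports', 'visits': 42}
```

When the next batch run lands, the delta for covered keys can simply be discarded, which keeps the read/write store small.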
16. 16
Real-Time Recommendation Updates w/ Storm
[Diagram: Web Servers send updates through Storm, which reads & writes the NoSQL Database; a batch log feed loads a Hadoop Cluster, which sends a batch model feed back to the NoSQL Database and a batch export to an MPP Database for ad hoc queries and classic BI access]
• More complex recommendation logic
• Processing benefits from the distributed logic of streaming
• Parallelize updates
• Parallelize analysis
17. 17
Use of Storm in Recommendations
• More common: asynchronous updates of the model
- Time-sensitive
- E.g., updating many time series; a single event impacts many items
• More advanced: score many items
- Multi-dimensional models (item, user, context), so costly to precompute
- Distribute computation at read time: still blend history with recent activity
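"A single event impacts many items" means one impression fans out to several model counters (per item, per user, per context combination), which is the kind of parallel update Storm distributes across bolts. A plain-Python sketch of the fan-out (the key scheme is illustrative):

```python
from collections import defaultdict

counters = defaultdict(int)

def on_event(item, user, context):
    """One impression fans out to several model counters.

    In a topology, each key family could be a separately grouped bolt
    so the updates run in parallel.
    """
    for key in ((item,), (user,), (item, context), (user, context)):
        counters[key] += 1


on_event("article7", "u1", "mobile")
on_event("article7", "u2", "mobile")

print(counters[("article7",)])           # 2: item counter saw both events
print(counters[("article7", "mobile")])  # 2: item-context counter too
print(counters[("u1",)])                 # 1: per-user counter
```

Fields grouping on each key family keeps a given counter on one worker, so these increments need no cross-worker locking.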
18. 18
Use of Storm in “targeted user 360” use case
19. 19
Storm for Energy Grid Management
• Kafka – data ingestion
- Meter / sensor data from grid points and major end users
- GIS, CRM, lightning strike, and weather data
- Energy production data from solar & wind (transient sources)
- Forecast data for weather, energy production, energy markets
• Storm – stream processing
- Parse, filter, distribute data to
  • Search (Solr or Elasticsearch)
  • Archive
  • Model training (Hadoop)
  • Data serving (NoSQL)
- Topologies implement distributed predictive models
  • Computationally intensive
  • Predict outages
  • Predict grid stability problems
  • Predict market prices
- Publish alerts & warnings
• DRPC to query the complex grid for dashboards and investigation
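The parse/filter/distribute step above is a routing bolt: each reading is validated, then copied to the downstream systems that need it. A plain-Python sketch (sink names and the alert-only indexing rule are assumptions for illustration):

```python
# Hypothetical sinks; in the deck these are Solr/Elasticsearch, an archive,
# Hadoop model training, and a NoSQL serving store.
SINKS = {"search": [], "archive": [], "training": [], "serving": []}

def route(reading):
    """Parse/filter a meter reading and distribute it downstream."""
    if reading.get("value") is None:      # filter malformed readings
        return
    SINKS["archive"].append(reading)      # keep everything raw
    SINKS["training"].append(reading)     # feed model training
    SINKS["serving"].append(reading)      # low-latency serving store
    if reading.get("alert"):              # only alerts get indexed (assumed)
        SINKS["search"].append(reading)


route({"meter": "m1", "value": 5.2})
route({"meter": "m2", "value": None})     # dropped by the filter
route({"meter": "m3", "value": 9.9, "alert": True})

print({k: len(v) for k, v in SINKS.items()})
# {'search': 1, 'archive': 2, 'training': 2, 'serving': 2}
```

In a real topology each sink would be its own bolt subscribed to the router's streams, so the fan-out happens inside Storm rather than in one function.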
20. 20
Finance: Trade Compliance Monitoring
[Diagram: an Order Mgmt System and a Market Data Feed flow into Storm, which writes to a NoSQL Database; a batch log feed loads a Hadoop Cluster, which sends a batch model feed and a batch export to an MPP Database for ad hoc queries and classic BI access]
• Storm reduces market data to a usable subset
• Integrates high-, medium-, and low-frequency data
• A real-time Storm filter triggers deeper search via NoSQL
21. 21
Use of Storm in Trade Compliance Monitoring
• Market data ingestion and processing
- Up to millions of messages per second
- Requires filtering, averaging, aggregation
• Trade information
- Orders and executions – up to tens of thousands of messages per second
- Filtering and aggregation
• Reference and static data also incorporated
• Market and trade information have to be joined by security and time
• Models used to detect anomalous trading behavior over time
• Simple filtering may be used to detect restricted behavior
• Alerts and reports must be immediate to reduce economic impact
• Recent comparative analysis for financial data:
- Within 50% of costly commercial software's performance
- 25x better price/performance
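Joining market and trade information "by security and time" amounts to keying both streams on (symbol, time window) and matching within the window. A plain-Python sketch of that join (the one-second window and field names are assumptions):

```python
from collections import defaultdict

def window_key(symbol, ts, window=1):
    """Join key: security symbol plus a one-second time window (assumed)."""
    return (symbol, ts - ts % window)

def join_trades_quotes(trades, quotes):
    """Match each trade with the market quotes in its symbol/time window."""
    quotes_by_key = defaultdict(list)
    for q in quotes:
        quotes_by_key[window_key(q["symbol"], q["ts"])].append(q)
    return [(t, quotes_by_key[window_key(t["symbol"], t["ts"])])
            for t in trades]


quotes = [{"symbol": "IBM", "ts": 10, "bid": 100.0},
          {"symbol": "IBM", "ts": 11, "bid": 100.5}]
trades = [{"symbol": "IBM", "ts": 10, "px": 100.1}]

matched = join_trades_quotes(trades, quotes)
print(len(matched[0][1]))  # 1 quote falls in the trade's window
```

In Storm, a fields grouping on the symbol routes both streams for one security to the same bolt, which holds the short time window of quotes in memory.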
22. 23
Lessons
• Easy to develop, hard to debug – start locally (timeouts)
• Storm infinite loop of failures
- Use Memcached to count tuple failures
• At-least-once processing
- Hadoop-based read-repair job
• Performance counters not getting flushed
- Tick tuples
• Always ACK
• Batching to S3/HDFS
- Run a compaction & event de-duplication job in Hadoop
- Or use exactly-once semantics, e.g., with Trident
• Queues to connect topology parts with different time frames
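The "infinite loop of failures" happens because a poison tuple that always fails is replayed forever; counting failures (in Memcached, per the deck) lets you ack the tuple after a limit and sideline it. A plain-Python sketch of that circuit breaker (the threshold and a dead-letter list are illustrative; a local dict stands in for Memcached):

```python
MAX_FAILURES = 3      # give up (ack & sideline) after this many replays

failure_counts = {}   # tuple_id -> failures (Memcached in the deck's setup)
dead_letter = []      # sidelined tuples for offline inspection/repair

def on_fail(tuple_id, ack):
    """Called on each tuple failure; ack poison tuples to stop the replay loop."""
    failure_counts[tuple_id] = failure_counts.get(tuple_id, 0) + 1
    if failure_counts[tuple_id] >= MAX_FAILURES:
        dead_letter.append(tuple_id)
        ack(tuple_id)             # stop Storm from replaying it forever


acked = []
fails = 0
while not acked:                  # the same poison tuple keeps failing
    on_fail("t-99", acked.append)
    fails += 1

print(fails)  # 3: sidelined and acked on the third failure
```

The dead-letter records can then be fixed by the Hadoop-based read-repair job mentioned above.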
23. 24
Conclusions
• There are many kinds of real-time problems
• Hadoop and/or NoSQL can solve
- Low-latency queries
- Event response with localized intelligence
- Operational intelligence
• Storm is valuable for
- Ingesting data within seconds
- Complex real-time distributed logic
- Operational intelligence
- Use cases in many industries and functions
• We're hiring!