This talk provides an overview of the open source Storm system for processing Big Data in real time. It starts with an overview of the technology and its key components: Nimbus, ZooKeeper, topologies, tuples, and Trident. It looks at integration with Hadoop through YARN and recent improvements, then dives into the broader Big Data architectures in which Storm can be integrated. The result is a compelling stack of technologies spanning Hadoop clusters, MPP, and NoSQL databases.
After this, we look at example use cases for Storm: real-time advertising statistics, updating a machine-learned model for content popularity predictions, and financial compliance monitoring.
Real-time Big Data Analytics with Storm, by Ron Bodkin of Think Big Analytics
1. Ron Bodkin
Founder & President, Think Big
September 2014
Real-time Big Data Analytics with Storm
2. 2
Agenda
• Storm refresher
• When is Storm appropriate?
• API options
• Use cases
Copyright 2013-2014 Think Big, a Teradata Company
3. 3
Storm Overview
• DAG processing of never-ending streams of data
- Open source: https://github.com/nathanmarz/storm/wiki
- Used at Twitter plus dozens of other companies
- Reliable: at-least-once semantics
- Think MapReduce for data streams
- Java / Clojure based
- Bolts in Java and "shell bolts"
- Not a queue, but usually reads from a queue
• Related:
- Spark Streaming, Samza, S4, CEP
• Compromises
- Static topologies & cluster sizing – avoids messy dynamic rebalancing
- Reliability: relies on a reliable or durable data source; Nimbus is a SPOF
- State not built in
• Community support; Hortonworks commercial support
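At-least-once semantics mean a spout replays any tuple that is not acked, so downstream logic may see duplicates and must tolerate them. A minimal plain-Python sketch of that contract (not the Storm API; `ReplayingSpout` and `CountBolt` are hypothetical names):

```python
class ReplayingSpout:
    """Replays tuples until they are acked (at-least-once delivery)."""
    def __init__(self, messages):
        self.pending = list(messages)   # unacked tuples
        self.emitted = []               # everything ever delivered downstream

    def next_tuple(self):
        if self.pending:
            msg = self.pending[0]       # keep at head until acked
            self.emitted.append(msg)
            return msg
        return None

    def ack(self, msg):
        self.pending.remove(msg)


class CountBolt:
    """Counts words; fails its first tuple once to force a replay."""
    def __init__(self):
        self.counts = {}
        self.failed_once = False

    def execute(self, word, spout):
        if not self.failed_once:        # simulate a transient failure
            self.failed_once = True
            return                      # no ack -> the spout will replay
        self.counts[word] = self.counts.get(word, 0) + 1
        spout.ack(word)


spout = ReplayingSpout(["storm", "kafka"])
bolt = CountBolt()
for _ in range(10):                     # drive the "topology" to completion
    msg = spout.next_tuple()
    if msg is None:
        break
    bolt.execute(msg, spout)

# "storm" was delivered twice (replayed after the failure), counted once
print(bolt.counts)  # {'storm': 1, 'kafka': 1}
```

The duplicate delivery in `spout.emitted` is exactly why the later "Lessons" slide recommends de-duplication jobs or exactly-once semantics via Trident.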
5. 5
What is Real-Time?
• Low latency
- Query response
- Data refresh
- End-to-end response
• … nanoseconds, milliseconds, seconds, or minutes depending on
your problem
6. 6
Why Real-time?
• Better end-user experience
- Ex: view an ad, see the counter move
• Operational intelligence
- Low-latency analysis
• Event response
- Content half-life measured at 3 hours (H. Mason: http://bit.ly/nu7IDw)
- Path to additional real-time capabilities
- Scalable analysis
- Example: trend analysis to recommend 'hot' articles
7. 7
Why Storm?
• Scalable way to manage queue readers and the logic pipeline
• Much better than rolling your own
• Reliable (message guarantees, fault tolerant)
• Fast (over 1M messages per node per second in some tests)
• Multi-node scaling
• It works
• Open source community
• It runs on YARN
• See also: https://github.com/nathanmarz/storm/wiki/Rationale
8. 8
Programming Options
• Java
- Raw API (tuples and raw computation)
- Trident – Java DSL (SQL-like primitives), micro-batch, exactly-once
• Python
• Scala
• Clojure
• Ruby
• Trident-ML
• Storm-Esper
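Trident's exactly-once state updates come from processing micro-batches tagged with transaction ids and storing the last applied txid alongside the state, so a replayed batch is detected and skipped. A plain-Python sketch of that idea (not Trident's actual API; `TransactionalCount` is a hypothetical name):

```python
class TransactionalCount:
    """Idempotent counter state: a batch is applied at most once per key."""
    def __init__(self):
        self.state = {}   # key -> (last_txid, count)

    def apply_batch(self, txid, batch_counts):
        for key, n in batch_counts.items():
            last_txid, count = self.state.get(key, (None, 0))
            if last_txid == txid:
                continue              # replayed batch: already applied, skip
            self.state[key] = (txid, count + n)

    def get(self, key):
        return self.state.get(key, (None, 0))[1]


store = TransactionalCount()
store.apply_batch(txid=1, batch_counts={"ad42": 3})
store.apply_batch(txid=1, batch_counts={"ad42": 3})   # replay after a failure
store.apply_batch(txid=2, batch_counts={"ad42": 2})
print(store.get("ad42"))  # 5  (the replayed batch is counted once)
```

Note this trick requires the spout to replay a failed batch with identical contents, which is what Trident's transactional spouts guarantee.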
12. 12
Analytics Architecture
[Architecture diagram: an Impression Spout feeds a Storm topology of Annotator, Simple Bot filter, Time Series BucketBolt, NoSQL Bolt, and DFS Adapter; impressions land in Hadoop for Impression Bucketization and Predictive Model Training; Time Series Buckets (batch and realtime), Predictive Model Parameters, and Impression Predictions are served to the Web Server over time.]
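The BucketBolt in this topology rolls raw impression events into fixed-width time-series buckets. A plain-Python sketch of that bucketing logic (one-minute buckets and the class name are assumptions, not from the deck):

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # one-minute time-series buckets (assumed width)

class BucketBolt:
    """Aggregates (ad_id, timestamp) impressions into per-minute buckets."""
    def __init__(self, bucket_seconds=BUCKET_SECONDS):
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(int)   # (ad_id, bucket_start) -> count

    def execute(self, ad_id, ts):
        bucket_start = ts - (ts % self.bucket_seconds)  # floor to bucket edge
        self.buckets[(ad_id, bucket_start)] += 1


bolt = BucketBolt()
for ts in (0, 30, 59, 60, 90):            # epoch seconds
    bolt.execute("ad42", ts)

# 0, 30, 59 fall in bucket 0; 60 and 90 fall in bucket 60
print(dict(bolt.buckets))  # {('ad42', 0): 3, ('ad42', 60): 2}
```

The same bucketization runs in batch in Hadoop ("Impression Bucketization"), so realtime and batch buckets line up when merged.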
13. 13
Solution Approach
[Diagram: analyze a massive historical data set; analyze the recent past; make realtime predictions]
• Historical Data Set = S3
• Analyze = Hadoop + Pig + R
• Recent Past = Storm + NoSQL
• Analyze = R + Web Service
14. 14
Simple Real-Time Recommendation Updates
[Diagram: Web Servers read & update a NoSQL Database; a batch log feed loads a Hadoop Cluster, which sends a batch model feed back to the NoSQL Database and a batch export to an MPP Database for ad hoc queries and classic BI access]
• Requests read a single user profile
• May update that profile record
• Classic key/value store problem
• Can split transient state into read/write NoSQL with a batch-view NoSQL
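Splitting transient state into a read/write store plus a batch view means reads merge an immutable batch view (recomputed by Hadoop) with a small realtime delta. A sketch of that read path (function and variable names are illustrative):

```python
def read_profile(user_id, batch_view, realtime_delta):
    """Merge the Hadoop-built batch view with realtime updates on read."""
    profile = dict(batch_view.get(user_id, {}))      # copy; batch view is immutable
    profile.update(realtime_delta.get(user_id, {}))  # recent writes win
    return profile


batch_view = {"u1": {"segment": "sports", "visits": 40}}
realtime_delta = {"u1": {"visits": 42}}   # updated since the last batch run

print(read_profile("u1", batch_view, realtime_delta))
# {'segment': 'sports', 'visits': 42}
```

When the next batch run lands, the delta for covered keys can simply be discarded, which keeps the read/write store small.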
16. 16
Real-Time Recommendation Updates w/ Storm
[Diagram: Web Servers send updates through Storm, which reads & writes the NoSQL Database; a batch log feed loads a Hadoop Cluster, which sends a batch model feed back to the NoSQL Database and a batch export to an MPP Database for ad hoc queries and classic BI access]
• More complex recommendation logic
• Processing benefits from the distributed logic of streaming
• Parallelize updates
• Parallelize analysis
17. 17
Use of Storm in Recommendations
• More common: asynchronous updates of the model
- Time-sensitive
- E.g., updating many time series; a single event impacts many items
• More advanced: score many items
- Multi-dimensional models (item, user, context), so costly to precompute
- Distribute computation at read time: still blend history with recent activity
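"A single event impacts many items" means one impression fans out to several model counters (per item, per user, per context combination), which is the kind of parallel update Storm distributes across bolts. A plain-Python sketch of the fan-out (the key scheme is illustrative):

```python
from collections import defaultdict

counters = defaultdict(int)

def on_event(item, user, context):
    """One impression fans out to several model counters.

    In a topology, each key family could be a separately grouped bolt
    so the updates run in parallel.
    """
    for key in ((item,), (user,), (item, context), (user, context)):
        counters[key] += 1


on_event("article7", "u1", "mobile")
on_event("article7", "u2", "mobile")

print(counters[("article7",)])           # 2: item counter saw both events
print(counters[("article7", "mobile")])  # 2: item-context counter too
print(counters[("u1",)])                 # 1: per-user counter
```

Fields grouping on each key family keeps a given counter on one worker, so these increments need no cross-worker locking.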
18. 18
Use of Storm in “targeted user 360” use case
19. 19
Storm for Energy Grid Management
• Kafka – data ingestion
- Meter / sensor data from grid points and major end users
- GIS, CRM, lightning strike, and weather data
- Energy production data from solar & wind (transient sources)
- Forecast data for weather, energy production, energy markets
• Storm – stream processing
- Parse, filter, distribute data to
  • Search (Solr or Elasticsearch)
  • Archive
  • Model training (Hadoop)
  • Data serving (NoSQL)
- Topologies implement distributed predictive models
  • Computationally intensive
  • Predict outages
  • Predict grid stability problems
  • Predict market prices
- Publish alerts & warnings
• DRPC to query the complex grid for dashboards and investigation
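The parse/filter/distribute step above is a routing bolt: each reading is validated, then copied to the downstream systems that need it. A plain-Python sketch (sink names and the alert-only indexing rule are assumptions for illustration):

```python
# Hypothetical sinks; in the deck these are Solr/Elasticsearch, an archive,
# Hadoop model training, and a NoSQL serving store.
SINKS = {"search": [], "archive": [], "training": [], "serving": []}

def route(reading):
    """Parse/filter a meter reading and distribute it downstream."""
    if reading.get("value") is None:      # filter malformed readings
        return
    SINKS["archive"].append(reading)      # keep everything raw
    SINKS["training"].append(reading)     # feed model training
    SINKS["serving"].append(reading)      # low-latency serving store
    if reading.get("alert"):              # only alerts get indexed (assumed)
        SINKS["search"].append(reading)


route({"meter": "m1", "value": 5.2})
route({"meter": "m2", "value": None})     # dropped by the filter
route({"meter": "m3", "value": 9.9, "alert": True})

print({k: len(v) for k, v in SINKS.items()})
# {'search': 1, 'archive': 2, 'training': 2, 'serving': 2}
```

In a real topology each sink would be its own bolt subscribed to the router's streams, so the fan-out happens inside Storm rather than in one function.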
20. 20
Finance: Trade Compliance Monitoring
[Diagram: an Order Mgmt System and a Market Data Feed flow into Storm, which writes to a NoSQL Database; a batch log feed loads a Hadoop Cluster, which sends a batch model feed and a batch export to an MPP Database for ad hoc queries and classic BI access]
• Storm reduces market data to a usable subset
• Integrates high-, medium-, and low-frequency data
• A real-time Storm filter triggers deeper search via NoSQL
21. 21
Use of Storm in Trade Compliance Monitoring
• Market data ingestion and processing
- Up to millions of messages per second
- Requires filtering, averaging, aggregation
• Trade information
- Orders and executions – up to tens of thousands of messages per second
- Filtering and aggregation
• Reference and static data also incorporated
• Market and trade information have to be joined by security and time
• Models used to detect anomalous trading behavior over time
• Simple filtering may be used to detect restricted behavior
• Alerts and reports must be immediate to reduce economic impact
• Recent comparative analysis for financial data:
- Within 50% of costly commercial software's performance
- 25x better price/performance
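Joining market and trade information "by security and time" amounts to keying both streams on (symbol, time window) and matching within the window. A plain-Python sketch of that join (the one-second window and field names are assumptions):

```python
from collections import defaultdict

def window_key(symbol, ts, window=1):
    """Join key: security symbol plus a one-second time window (assumed)."""
    return (symbol, ts - ts % window)

def join_trades_quotes(trades, quotes):
    """Match each trade with the market quotes in its symbol/time window."""
    quotes_by_key = defaultdict(list)
    for q in quotes:
        quotes_by_key[window_key(q["symbol"], q["ts"])].append(q)
    return [(t, quotes_by_key[window_key(t["symbol"], t["ts"])])
            for t in trades]


quotes = [{"symbol": "IBM", "ts": 10, "bid": 100.0},
          {"symbol": "IBM", "ts": 11, "bid": 100.5}]
trades = [{"symbol": "IBM", "ts": 10, "px": 100.1}]

matched = join_trades_quotes(trades, quotes)
print(len(matched[0][1]))  # 1 quote falls in the trade's window
```

In Storm, a fields grouping on the symbol routes both streams for one security to the same bolt, which holds the short time window of quotes in memory.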
22. 23
Lessons
• Easy to develop, hard to debug – start locally (timeouts)
• Storm infinite loop of failures
- Use Memcached to count tuple failures
• At-least-once processing
- Hadoop-based read-repair job
• Performance counters not getting flushed
- Tick tuples
• Always ACK
• Batching to S3/HDFS
- Run a compaction & event de-duplication job in Hadoop
- Or use exactly-once semantics, e.g., with Trident
• Queues to connect topology parts with different time frames
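The "infinite loop of failures" happens because a poison tuple that always fails is replayed forever; counting failures (in Memcached, per the deck) lets you ack the tuple after a limit and sideline it. A plain-Python sketch of that circuit breaker (the threshold and a dead-letter list are illustrative; a local dict stands in for Memcached):

```python
MAX_FAILURES = 3      # give up (ack & sideline) after this many replays

failure_counts = {}   # tuple_id -> failures (Memcached in the deck's setup)
dead_letter = []      # sidelined tuples for offline inspection/repair

def on_fail(tuple_id, ack):
    """Called on each tuple failure; ack poison tuples to stop the replay loop."""
    failure_counts[tuple_id] = failure_counts.get(tuple_id, 0) + 1
    if failure_counts[tuple_id] >= MAX_FAILURES:
        dead_letter.append(tuple_id)
        ack(tuple_id)             # stop Storm from replaying it forever


acked = []
fails = 0
while not acked:                  # the same poison tuple keeps failing
    on_fail("t-99", acked.append)
    fails += 1

print(fails)  # 3: sidelined and acked on the third failure
```

The dead-letter records can then be fixed by the Hadoop-based read-repair job mentioned above.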
23. 24
Conclusions
• There are many kinds of real-time problems
• Hadoop and/or NoSQL can solve
- Low-latency queries
- Event response with localized intelligence
- Operational intelligence
• Storm is valuable for
- Ingesting data within seconds
- Complex real-time distributed logic
- Operational intelligence
- Use cases in many industries and functions
• We're hiring!