Storm overview & integration

Published in: Technology, Design

  1. STORM: Buckle up Dorothy!!!
  2. ABOUT: Distributed real-time computation. By Nathan Marz. Backtype => Twitter => Apache.
  3. WHAT IS IT GOOD FOR? Real-time analytics; Online machine learning; Continuous computation; Distributed RPC; ETL (Extract, Transform, Load); …
  4. PROMISES: No data loss; Fault-tolerant; Scalable; Robust.
  5. VIEW FROM ABOVE (diagram): Stream Source (Kafka, *MQ, …) -> Pull -> Storm Cluster (Topology) -> Read/Write -> Storage.
  6. PRIMITIVES (diagram): a Tuple is a set of field/value pairs (Field 1 / Value 1 … Field 5 / Value 5); a Stream is a sequence of Tuples.
  7. PRIMITIVES (diagram): a Topology connects Spouts and Bolts (shown: 2 Spouts, 4 Bolts).
  8. ABSTRACTION. PRIMITIVES: Tuples; Spouts; Bolts. Filters; Transformation; Functions; Joins; Chaining streams. EFFECTS: Incremental; Distributed; Scalable; Small components.
  9. CLUSTER (diagram): Nimbus; Zookeeper Cluster; three Worker Nodes, each with one Supervisor and three Executors.
  10. CLUSTER. NIMBUS / NODES: Small; No state; Robust; Kill / Restart easy. ZOOKEEPER: Communication; State.
  11. AS PROMISED? No data loss; Fault-tolerant; Scalable; Robust.
  12. GUARANTEES: Message transforms into a tuple tree. Storm tracks the tuple tree. Fully processed when the tree is exhausted.
  13. FAILURES: Task died – failed tuples are replayed. Acker task died – related tuples time out and are replayed. Spout task died – source replays, e.g. pending messages are placed back on the queue.
  14. WHAT DO I HAVE TO DO? Inform about new links in the tree. Inform when finished with a tuple. Every tuple must be acked or failed.
  15. TRIDENT: ANYTHING SIMPLER? High-level abstraction; Stateful persistence primitives; Exactly-once semantics.
  16. AS PROMISED? YES
  17. USER DASHBOARD. PROBLEM: Bad performance; Uses core storage. IDEA: Pre-compute; Customize; Fast; Isolate; Quarterly agg.
  18. ARCHITECTURE (diagram): Core -> Push Events -> Queue (Kafka: 4 Partitions, 2 Replicas) -> Pull -> Storm (4 Workers) -> Write -> MS SQL (4 Staging) -> Read -> Dashboard. State in source.
  19. KAFKA (diagram): a Topic as a stack of messages numbered 1 (Old) to 9 (New), with a Client reading at its Client offset. Properties: Stacked; Flushed; Replicated; Partitioned; Fast.
  20. TRANSFORMATION
      ORIGINAL: { id: df45er87c78df, sender: "Info", destination: "39345123456", parts: 2, price: 100, client: "Demo", time: "2014-06-02 14:47:58", country: "IT", network: "Wind", type: "SMS", … }
      COMPUTED: { client: "Demo", type: "SMS", country: "IT", network: "Wind", bucket: "2014-06-02 14:45:00", traffic: 2, expenses: 200 }
  21. CODE
      TridentState tridentState = topology
          .newStream("CoreEvents", buildKafkaSpout())
          .parallelismHint(4)
          .each(
              new Fields("bytes"),
              new CoreEventMessageParser(),
              new Fields("time", "client", "network", "country", "type", "parts", "price"))
          .each(
              new Fields("time"),
              new QuarterTimeBucket(),
              new Fields("bucket"))
          .project(new Fields("bucket", "client", "network", "country", "type", "traffic", "expenses"))
          .groupBy(new Fields("bucket", "client", "network", "country", "type"))
          .persistentAggregate(getStateFactory(), new Fields("traffic", "expenses"), new Sum(), new Fields("trafficExpenses"))
          .parallelismHint(8);
  22. PERFORMANCE (table, flattened in extraction): 1.500 PEAK REGULAR KAFKA 60.000 4.500 160.000 STORAGE 2.000 10.000 DASHBOARD 1 1
  23. TUNING STORAGE. 1st Issue - Storage. Random access – 1.500 w/s limit. Staged approach – 30.000 w/s limit. No locks – isolated. Scalable – each worker has its own stage. Main table indexes nicely. Doesn't affect reading.
  24. STAGED WRITES (diagram): Worker 1 -> Write -> Stage Table 1 -> Merge -> Main Table; Worker 2 -> Write -> Stage Table 2 -> Merge -> Main Table.
  25. TUNING TOPOLOGY. 2nd Issue - Serialization. (Chart: Raw/s, Expanded/s and Writes/s, on a scale of 0 to 160,000, for sizes 200 KB, 1 MB, 4 MB, 8 MB, 16 MB and 24 MB; annotated "Plateauing".)
  26. SERIALIZATION. (Chart: S [s], S [byte], S [% CPU], D [s] and D [% CPU], on a scale of 0 to 1,200, for CSV (Plain), CSV (Deflate), CSV (GZip), Jackson (Plain), Jackson (GZip), Jackson Smile, Java Object and Kryo.)
  27. MEASURE. AXIS: Max spout pending; SQL workers; DB batch size; Kafka fetch size; Serialization; … METRICS: Kafka fetch speed; DB write speed; Kafka / DB ratio; Capacity; Latency.
  28. MONITOR: STORM UI; TOPOLOGY
  29. METRICS: GRAPHITE
  30. GOTCHAS: Version 0.9.1; Partially in flux; Kafka integration; Message & topology versioning; Performance tuning.
  31. NEXT? Lambda Architecture (diagram): New Data; Batch Layer (Master Dataset); Serving Layer (Batch Views); Speed Layer (Real-time Views); Query; Query.
  32. RESOURCES: http://storm.incubator.apache.org http://lambda-architecture.net http://kafka.apache.org
  33. PRESENTATION TOOLS: http://www.gimp.org http://www.pictaculous.com http://www.colourlovers.com http://www.easycalculation.com http://paletton.com
  34. QUESTIONS?
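
The TRANSFORMATION slide (20) turns one raw event into a quarter-hour bucket with summed traffic and expenses. The following is a minimal sketch of that computation in plain Java, not the Trident code from slide 21. The names QuarterBucketSketch, bucketOf, add and totalsFor are hypothetical, and treating expenses as parts times price is an assumption inferred from the sample values (parts 2, price 100, expenses 200).

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; this is not the QuarterTimeBucket class named on slide 21.
final class QuarterBucketSketch {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Group key -> {traffic, expenses}.
    private final Map<String, long[]> totals = new HashMap<>();

    // Rounds a timestamp down to the start of its quarter-hour.
    static String bucketOf(String time) {
        LocalDateTime t = LocalDateTime.parse(time, FORMAT);
        int quarterMinute = (t.getMinute() / 15) * 15;
        return t.truncatedTo(ChronoUnit.HOURS).plusMinutes(quarterMinute).format(FORMAT);
    }

    private static String key(String bucket, String client, String network,
                              String country, String type) {
        return String.join("|", bucket, client, network, country, type);
    }

    // Adds one event to the running sums of its group.
    void add(String time, String client, String network, String country,
             String type, long parts, long price) {
        String k = key(bucketOf(time), client, network, country, type);
        long[] sums = totals.computeIfAbsent(k, unused -> new long[2]);
        sums[0] += parts;          // traffic
        sums[1] += parts * price;  // expenses (assumption: parts * price)
    }

    // Returns a copy of {traffic, expenses} for a group, or zeros if the group is unknown.
    long[] totalsFor(String bucket, String client, String network,
                     String country, String type) {
        long[] sums = totals.get(key(bucket, client, network, country, type));
        return sums == null ? new long[] {0, 0} : sums.clone();
    }

    public static void main(String[] args) {
        QuarterBucketSketch agg = new QuarterBucketSketch();
        agg.add("2014-06-02 14:47:58", "Demo", "Wind", "IT", "SMS", 2, 100);
        String bucket = bucketOf("2014-06-02 14:47:58");
        long[] sums = agg.totalsFor(bucket, "Demo", "Wind", "IT", "SMS");
        System.out.println(bucket + " traffic=" + sums[0] + " expenses=" + sums[1]);
    }
}
```

Grouping by a single joined key mirrors the groupBy over bucket, client, network, country and type on slide 21; a real topology would keep these sums in persistent state instead of an in-memory map.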
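
Slides 12 to 14 say that Storm tracks a tuple tree and treats a message as fully processed once the tree is exhausted. One way to picture this bookkeeping is a per-message XOR of tuple ids, the idea Storm's documentation describes for its acker tasks. The sketch below is a simplified, single-threaded illustration, not Storm's implementation: the class AckTrackerSketch and its methods are hypothetical, and the tuple ids are fixed small values where Storm uses random 64-bit ids.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of tuple-tree tracking; not Storm's acker code.
final class AckTrackerSketch {
    // Root message id -> XOR of every tuple id emitted or acked so far.
    private final Map<Long, Long> ackVals = new HashMap<>();

    // "Inform about new links in tree": register a tuple that belongs to rootId's tree.
    void emitted(long rootId, long tupleId) {
        ackVals.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    // "Inform when finished with a tuple": returns true once the whole tree is done.
    // Each id is XORed in twice (emit and ack), so the value returns to zero
    // exactly when every registered tuple has been acked.
    boolean acked(long rootId, long tupleId) {
        long value = ackVals.merge(rootId, tupleId, (a, b) -> a ^ b);
        if (value == 0L) {
            ackVals.remove(rootId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch();
        long root = 7L;
        tracker.emitted(root, 0x11L);  // spout tuple
        tracker.emitted(root, 0x22L);  // child emitted by a bolt
        tracker.emitted(root, 0x44L);  // second child
        System.out.println(tracker.acked(root, 0x11L));
        System.out.println(tracker.acked(root, 0x22L));
        System.out.println(tracker.acked(root, 0x44L));
    }
}
```

In this sketch children must be registered before their parent is acked, otherwise the value could reach zero too early. That ordering is the reason slide 14 lists "inform about new links" ahead of "inform when finished".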