Storm overview & integration
Presentation Transcript

  • STORM — Buckle up, Dorothy!
  • ABOUT — Distributed real-time computation. Created by Nathan Marz. BackType => Twitter => Apache.
  • WHAT IS IT GOOD FOR? — Real-time analytics, online machine learning, continuous computation, distributed RPC, ETL (Extract, Transform, Load), …
  • PROMISES — Scalable, robust, fault-tolerant, no data loss.
  • VIEW FROM ABOVE — Stream source (Kafka, *MQ, …) => pull => Storm cluster (topology) => read/write => storage.
  • PRIMITIVES — Tuple: a named list of values (field 1 / value 1, field 2 / value 2, …). Stream: an unbounded sequence of tuples.
  • PRIMITIVES — Topology: a graph of spouts (stream sources) and bolts (processing steps), e.g. spout => bolt => bolt, spout => bolt.
  • ABSTRACTION — Primitives: tuples, spouts, bolts. Operations: filters, transformations, functions, joins, chaining streams. Effects: small components, incremental, distributed, scalable.
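These primitives compose like an ordinary stream pipeline. As a rough single-process analogy (deliberately NOT the Storm API — the data and class name here are invented for illustration): a spout produces tuples, and chained bolts filter and transform them.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Single-process analogy of a topology: a "spout" produces a stream
// of tuples, and chained "bolts" filter and transform it. Storm runs
// the same style of pipeline distributed across a cluster.
public class PipelineAnalogy {
    public static void main(String[] args) {
        Stream<String> spout = Stream.of("SMS:2", "SMS:1", "MMS:3"); // source
        List<Integer> results = spout
            .filter(t -> t.startsWith("SMS"))            // filter bolt
            .map(t -> Integer.parseInt(t.split(":")[1])) // transform bolt
            .collect(Collectors.toList());
        System.out.println(results); // [2, 1]
    }
}
```

The analogy breaks down exactly where Storm adds value: in Storm each stage runs as many parallel tasks on many machines, with delivery guarantees.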
  • CLUSTER — Nimbus, a Zookeeper cluster, and worker nodes; each worker node runs a supervisor managing executors.
  • CLUSTER — Nimbus / nodes: small, no state, easy to kill / restart. Zookeeper: communication and state, robust.
  • AS PROMISED? — Scalable, robust, fault-tolerant, no data loss?
  • GUARANTEES — A message transforms into a tuple tree; Storm tracks the tuple tree; the message counts as fully processed when the tree is exhausted.
  • FAILURES — Task died: failed tuples are replayed. Acker task died: related tuples time out and are replayed. Spout task died: the source replays, e.g. pending messages are placed back on the queue.
  • WHAT DO I HAVE TO DO? — Inform Storm about new links in the tree (anchoring); inform it when finished with a tuple; every tuple must be acked or failed.
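The bookkeeping behind this is cheap: Storm's acker XORs random 64-bit tuple ids into a checksum — once when a tuple is anchored, once when it is acked — so the tree is known to be exhausted exactly when the checksum returns to zero. A toy sketch of that idea (the class is hypothetical, not Storm code):

```java
import java.util.Random;

// Toy model of Storm's acker bookkeeping: every edge in the tuple
// tree gets a random 64-bit id; each id is XOR-ed into the checksum
// once on emit (anchoring) and once on ack. The checksum is zero
// exactly when every emitted tuple has been acked.
public class AckerSketch {
    private long checksum = 0L;

    public void emit(long tupleId) { checksum ^= tupleId; }
    public void ack(long tupleId)  { checksum ^= tupleId; }
    public boolean fullyProcessed() { return checksum == 0L; }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        Random rnd = new Random(42);
        long a = rnd.nextLong(), b = rnd.nextLong(), c = rnd.nextLong();
        acker.emit(a); acker.emit(b); acker.emit(c); // tree grows
        acker.ack(a);  acker.ack(b);
        System.out.println(acker.fullyProcessed()); // c still pending
        acker.ack(c);
        System.out.println(acker.fullyProcessed()); // tree exhausted
    }
}
```

This is why the acker tracks arbitrarily large tuple trees in constant memory per message.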
  • ANYTHING SIMPLER? — Trident: a high-level abstraction with stateful persistence primitives and exactly-once semantics.
  • AS PROMISED? YES
  • USER DASHBOARD — Problem: bad performance, uses core storage. Idea: pre-compute (quarterly aggregation), isolate, customize, fast.
  • ARCHITECTURE — Core events => push => Kafka queue (4 partitions, 2 replicas) => pull => Storm (4 workers) => write => MS SQL (4 staging tables) => read => dashboard. State is kept in the source.
  • KAFKA — Partitioned, replicated, fast topic log: new messages are stacked at the head, old ones flushed from the tail; each client tracks its own offset.
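The traits on this slide can be modeled in a few lines (an in-memory toy, not the Kafka API): messages append to one of several partitions by key hash, the broker keeps no per-client state, and each consumer resumes from the offset it remembered itself.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a Kafka topic: partitioned append-only logs, with
// read position owned by the client rather than the broker.
public class PartitionedLog {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedLog(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Key-hash partitioning: the same key always lands in the same
    // partition, preserving per-key ordering. Returns the partition used.
    int append(String key, String message) {
        int p = Math.abs(key.hashCode() % partitions.size());
        partitions.get(p).add(message);
        return p;
    }

    // A client pulls everything from the offset it tracked itself.
    List<String> readFrom(int partition, int offset) {
        List<String> log = partitions.get(partition);
        return log.subList(Math.min(offset, log.size()), log.size());
    }

    public static void main(String[] args) {
        PartitionedLog topic = new PartitionedLog(4);
        int p = topic.append("Demo", "m1");
        topic.append("Demo", "m2");
        topic.append("Demo", "m3");
        // A client that already consumed offset 0 resumes at offset 1:
        System.out.println(topic.readFrom(p, 1)); // [m2, m3]
    }
}
```

Client-owned offsets are what make the spout failure mode above work: a restarted spout simply re-reads from its last committed offset.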
  • TRANSFORMATION

    ORIGINAL:
    { id: "df45er87c78df", sender: "Info", destination: "39345123456", parts: 2, price: 100, client: "Demo", time: "2014-06-02 14:47:58", country: "IT", network: "Wind", type: "SMS", … }

    COMPUTED:
    { client: "Demo", type: "SMS", country: "IT", network: "Wind", bucket: "2014-06-02 14:45:00", traffic: 2, expenses: 200 }
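The computed record can be derived as follows, assuming the quarter bucket floors the timestamp to 15 minutes and expenses = parts × price (both consistent with the numbers shown; the class and method names here are illustrative, not from the deck's codebase):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of the ORIGINAL -> COMPUTED transformation: the timestamp is
// floored to a 15-minute ("quarter") bucket, and each message
// contributes traffic = parts and expenses = parts * price.
public class QuarterBucketSketch {
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static String quarterBucket(String time) {
        LocalDateTime t = LocalDateTime.parse(time, FMT);
        int flooredMinute = (t.getMinute() / 15) * 15; // 47 -> 45
        return t.withMinute(flooredMinute).withSecond(0).format(FMT);
    }

    public static void main(String[] args) {
        int parts = 2, price = 100;               // from the ORIGINAL event
        System.out.println(quarterBucket("2014-06-02 14:47:58"));
        System.out.println("traffic=" + parts + " expenses=" + (parts * price));
    }
}
```

Grouping by (bucket, client, network, country, type) and summing these two fields then yields exactly one row per customer per quarter hour.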
  • CODE

    TridentState tridentState = topology
        .newStream("CoreEvents", buildKafkaSpout())
        .parallelismHint(4)
        .each(
            new Fields("bytes"),
            new CoreEventMessageParser(),
            new Fields("time", "client", "network", "country", "type", "parts", "price"))
        .each(
            new Fields("time"),
            new QuarterTimeBucket(),
            new Fields("bucket"))
        .project(new Fields("bucket", "client", "network", "country", "type", "traffic", "expenses"))
        .groupBy(new Fields("bucket", "client", "network", "country", "type"))
        .persistentAggregate(
            getStateFactory(),
            new Fields("traffic", "expenses"),
            new Sum(),
            new Fields("trafficExpenses"))
        .parallelismHint(8);
  • PERFORMANCE — Regular vs. peak rates per second: Kafka 1.500 / 60.000 and 4.500 / 160.000; storage 2.000 / 10.000; dashboard 1 / 1.
  • TUNING STORAGE — 1st issue: storage. Random access: 1.500 w/s limit. Staged approach: 30.000 w/s limit. No locks (isolated); scalable (each worker has its own stage); the main table indexes nicely; reading is unaffected.
  • STAGED WRITES — Worker 1 => write => Stage Table 1 => merge => Main Table; Worker 2 => write => Stage Table 2 => merge => Main Table.
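The staged-write pattern can be sketched in-process (in-memory lists stand in for the SQL stage and main tables; names are illustrative): each worker appends to its own stage without locks, and a periodic merge folds all stages into the main table in one batch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of staged writes: per-worker stage "tables" absorb random
// writes contention-free; a merge step moves rows to the main table
// in bulk, which is what lifts the write ceiling.
public class StagedWrites {
    static final Map<String, List<String>> stageTables = new HashMap<>();
    static final List<String> mainTable = new ArrayList<>();

    // Each worker writes only to its own stage -> no lock contention.
    static void write(String worker, String row) {
        stageTables.computeIfAbsent(worker, w -> new ArrayList<>()).add(row);
    }

    // Merge drains every stage into the main table as one batch.
    static void merge() {
        for (List<String> stage : stageTables.values()) {
            mainTable.addAll(stage);
            stage.clear();
        }
    }

    public static void main(String[] args) {
        write("worker-1", "rowA");
        write("worker-2", "rowB");
        write("worker-1", "rowC");
        merge();
        System.out.println(mainTable.size()); // 3
    }
}
```

In SQL terms, the stages take the random single-row inserts and the merge becomes one indexed bulk operation, so reads against the main table are not disturbed.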
  • TUNING TOPOLOGY — 2nd issue: serialization. [Chart: raw/s, expanded/s, and writes/s plateauing as batch size grows from 200 KB through 1, 4, 8, and 16 MB to 24 MB.]
  • SERIALIZATION — [Chart comparing serialization time, size, CPU, and deserialization time/CPU for CSV (Plain / Deflate / GZip), Jackson (Plain / GZip), Jackson Smile, Java Object, and Kryo.]
  • MEASURE — Axes (knobs to turn): max spout pending, SQL workers, DB batch size, Kafka fetch size. Metrics (what to watch): Kafka fetch speed, DB write speed, Kafka / DB ratio, capacity, latency, serialization, …
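On Storm 0.9.x some of these knobs map directly onto `backtype.storm.Config` setters; a configuration sketch (the values are illustrative, not recommendations, and the Kafka fetch size lives on the spout's own config rather than here):

```java
import backtype.storm.Config;

// Storm 0.9.x tuning knobs corresponding to the axes above.
Config conf = new Config();
conf.setNumWorkers(4);          // parallelism across the cluster
conf.setMaxSpoutPending(1000);  // cap on un-acked tuples per spout task
conf.setMessageTimeoutSecs(30); // replay tuples not fully acked in time
```

Max spout pending is the usual first lever: too low starves the topology, too high floods it and triggers timeout replays.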
  • MONITOR — Storm UI topology view.
  • METRICS — Graphite.
  • GOTCHAS — Version 0.9.1 is partially in flux: Kafka integration, message & topology versioning, performance tuning.
  • NEXT? — Lambda Architecture: new data flows into both a batch layer (master dataset => batch views) and a speed layer (real-time views); a serving layer merges both kinds of views to answer queries.
  • RESOURCES — http://storm.incubator.apache.org, http://lambda-architecture.net, http://kafka.apache.org
  • PRESENTATION TOOLS — http://www.gimp.org, http://www.pictaculous.com, http://www.colourlovers.com, http://www.easycalculation.com, http://paletton.com
  • QUESTIONS?