All Things Open 2015 - Spark & Storm: When & Where?

What happens when you need to process a billion events an hour and Apache Spark just can’t handle it? This talk investigates when you might decide to switch from one Apache project to another, from Spark to the tried-and-tested world of Storm.

Published in: Technology

  1. Spark & Storm: When & Where?
  2. The Leader in Big Data Consulting
      www.mammothdata.com | @mammothdataco
      ● BI/Data Strategy
        ○ Development of a business intelligence / data architecture strategy.
      ● Installation
        ○ Installation of Hadoop or relevant technology.
      ● Data Consolidation
        ○ Load data from diverse sources into a single scalable repository.
      ● Streaming
        ○ Mammoth will write ingestion and/or analytics that operate on the data as it comes in, and design dashboards, feeds or computer-driven decision-making processes to derive insights and make decisions.
      ● Visualization Tools
        ○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho, etc.). We will also create initial reports and provide training to the employees who will analyze the data.
      Mammoth Data, based in downtown Durham (right above Toast)
  3. Me!
      ● Lead Consultant on all things DevOps and Spark
      ● @carsondial on Twitter
  4. What This Talk Is About
      ● Quick overview of Spark Streaming
      ● Reasons why Spark Streaming can be tricky in practice
      ● Performance and tuning tips we’ve learnt over the past two years
      ● …and when to pack it all in and use Storm instead
  5. This IS WEB SCALE!
  6. Beyond Web Scale
      ● I kid, Rails!
      ● (mostly)
  7. Beyond Web Scale
      ● Spark & Storm - millions of requests / second on commodity hardware
      ● Different problems at different scales!
  8. Spark
      ● Directed Acyclic Graph Data Processing Engine
      ● Based around the Resilient Distributed Dataset (RDD) primitive
  9. Spark Streaming - Overview
  10. Spark Streaming - In Production?
      ● Yes!
      ● (Alibaba, AutoTrader, Cisco, Netflix, etc.)
  11. Spark Streaming - Overview
      ● Streaming by running batches very quickly!
      ● Batch length: can be as low as 0.5s / batch
      ● Every X seconds, get Y records (DStream/RDDs)
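To make the "every X seconds, get Y records" model concrete, here is a minimal sketch of a Spark Streaming job with a one-second batch interval. It is not taken from the talk: the socket source on `localhost:9999`, the application name, and the `local[2]` master are assumptions chosen so the example is small.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local setup: one core for the socket receiver, one for processing.
    val conf = new SparkConf().setAppName("micro-batch-sketch").setMaster("local[2]")

    // "X": the batch interval. A new RDD is produced every second.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical source: lines of text from a socket (e.g. `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)

    // "Y": print how many records arrived in each batch.
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Swapping `Seconds(1)` for `Milliseconds(500)` (also in `org.apache.spark.streaming`) gives the 0.5s batch length mentioned on the slide.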
  12. Spark Streaming - Good Things
      ● Using same implementation (mostly) for batch and stream processing (Lambda Architecture hipster points ahoy!)
      ● Access to rest of Spark - DataFrames, MLlib, GraphX, etc.
  13. Spark Streaming - Bad Things!
      ● What happens if you can’t process Y records in X seconds?
      ● What happens if you require sub-second latency?
  14. Spark Streaming - I’m so sorry.
  15. Spark Streaming - Bad Things!
      ● What happens if you can’t process Y records in X seconds?
      ● Data builds up in executors
      ● Executors run out of memory…
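The build-up described above is simple arithmetic, and a toy model may help. The sketch below is plain Scala, not Spark code: `backlogPerInterval` is a hypothetical helper, and it assumes a constant arrival rate and a constant processing capacity per batch interval, which real workloads will not have.

```scala
object BacklogSketch {
  /** Records still queued after each batch interval, assuming a fixed number
    * of arrivals and a fixed processing capacity per interval. */
  def backlogPerInterval(arrivalsPerInterval: Long,
                         capacityPerInterval: Long,
                         intervals: Int): Vector[Long] =
    Iterator
      .iterate(0L)(queued => math.max(0L, queued + arrivalsPerInterval - capacityPerInterval))
      .drop(1)          // skip the initial empty queue
      .take(intervals)
      .toVector

  def main(args: Array[String]): Unit = {
    // 120 records arrive per interval but only 100 can be processed.
    println(backlogPerInterval(120, 100, 3)) // prints Vector(20, 40, 60)
    // With spare capacity the queue stays empty.
    println(backlogPerInterval(80, 100, 3))  // prints Vector(0, 0, 0)
  }
}
```

In the first case the queue grows without bound; in a real job that queue is the data sitting in executor memory.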
  16. Spark Streaming - Bad Things!
      ● “Hey, we forgot to tell you Ops people that we have a major new client adding stuff into the firehose sometime today. That’s fine, right?”
  17. Spark Streaming - It Will Be Okay
  18. Spark Streaming - Bad Things!
      ● As a former Ops person:
      ● WE WILL REMEMBER.
  19. Spark Streaming - Tuning
      ● Do you need low-latency?
      ● If so, a 10-minute nap is advisable!
      ● Everybody else, let’s dive in…
  20. Spark Streaming - Tuning
  21. Spark Streaming - Down In The Hole
  22. Spark Streaming - Down In The Hole
  23. Spark Streaming - Down In The Hole
      ● Easiest method: alter the batch window until it’s all fine!
      ● Tiny batches provide tight execution times!
  24. Spark Streaming - Tuning
      ● Use Kafka.
      ● Data source with the most love (e.g. exactly-once semantics without Write Ahead Logs and receiver-less operation in 1.3+)
      ● (other sources get the features…eventually)
  25. Spark Streaming - Tuning
      ● Use Scala.
      ● CPython = slower in execution
      ● PyPy is much faster…but…
      ● New features always come to Scala first.
  26. Spark Streaming - Tuning
      ● (or Java if you really must)
  27. Spark Streaming - Cores
      ● Spark Streaming = data receivers + Spark
      ● spark.cores.max = x * number of receivers
      ● For Great Data Locality and Parallelism!
  28. Spark Streaming - Caching
      ● Are you using a foreachRDD loop?
          dstream.foreachRDD { rdd =>
            rdd.cache()
            …
            rdd.unpersist()
          }
  29. Spark Streaming - Caching
      ● If routing to multiple stores or iterating over an RDD multiple times, using cache() is a quick win
      ● It really shouldn’t work so well…
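A slightly fuller sketch of the caching pattern, for the case of routing each batch to two stores. The sinks `saveToStoreA` and `saveToStoreB` and the `fanOut` helper are hypothetical names invented for this illustration; only `foreachRDD`, `cache`, `foreachPartition`, and `unpersist` are Spark calls.

```scala
import org.apache.spark.streaming.dstream.DStream

object CachedFanOutSketch {
  // The two sinks are supplied by the caller and must be serializable,
  // because Spark ships them to the executors.
  def fanOut(events: DStream[String],
             saveToStoreA: Iterator[String] => Unit,
             saveToStoreB: Iterator[String] => Unit): Unit =
    events.foreachRDD { rdd =>
      rdd.cache()                        // mark the batch for in-memory reuse
      rdd.foreachPartition(saveToStoreA) // first action computes and caches the batch
      rdd.foreachPartition(saveToStoreB) // second action reads the cached partitions
      rdd.unpersist()                    // free the memory before the next batch
      ()
    }
}
```

Without the `cache()` call, the second action would recompute the batch from its lineage, which is the repeated work the slide is warning about.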
  30. Spark Streaming - Backpressure
      ● Hurrah for Spark 1.5!
      ● spark.streaming.backpressure.enabled = true
      ● Spark dynamically alters incoming data rates (keeping the data in Kafka rather than in the executors)
      ● Works for all data sources (for once!)
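As a configuration sketch, the backpressure property can be set on the `SparkConf` before the streaming context is created. The object name and the application name below are placeholders; the property name is the one given on the slide.

```scala
import org.apache.spark.SparkConf

object BackpressureConfSketch {
  // Placeholder app name; the property requires Spark 1.5 or later.
  val conf: SparkConf = new SparkConf()
    .setAppName("backpressure-sketch")
    .set("spark.streaming.backpressure.enabled", "true")
}
```

The same property can also be passed on the command line with `spark-submit --conf spark.streaming.backpressure.enabled=true`.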
  31. Storm
      ● I really need that low-latency response!
  32. Storm
      ● Directed Acyclic Graph Data Processing Engine
  33. Spark
      “Very Good, Sir”
  34. Storm
      “Here you go!”
  35. Storm Concepts
      ● Stream of tuples
      ● Bolts
      ● Spouts
      ● Topologies
  36. Storm - Streams
      ● Unbounded stream of tuples
      ● Tuples are defined via schema (usual base types plus custom serializers)
  37. Storm - Spouts
      ● Sources of tuples in a topology
      ● Read tuples from external sources (e.g. Kafka) and emit them
      ● Can emit multiple streams from a spout!
  38. Storm - Bolts
      ● Where your processing happens
      ● Roll your own aggregations / filtering / windowing
      ● Bolts can feed into other bolts
      ● Potentially easier to test than Spark Streaming
      ● Many Bolt connectors for external sources (e.g. Cassandra, Redis, Hive, etc.)
  39. Storm - Topologies
      ● The DAG of the spouts and bolts
      ● Built programmatically in code and submitted to the Storm cluster
      ● Flux - Do It In YAML (and then complain about whitespace)
  40. Storm - Tasks
      ● Each bolt or spout runs 'tasks' across the cluster
      ● How parallelism works in Storm
      ● Set in topology submission
  41. Storm - Workers
      ● Where the topology runs
      ● 1 worker = 1 JVM
      ● Tasks run as threads on a worker
      ● Storm distributes tasks evenly across cluster
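The topology, task, and worker slides can be tied together with a short wiring sketch. It is written in Scala against Storm's Java API and makes several assumptions: the `org.apache.storm` package names are those of Storm 1.0 and later (releases current at the time of the talk used `backtype.storm`), the spout and bolt implementations are supplied by the caller, and the component ids and all the numbers are invented for illustration.

```scala
import org.apache.storm.Config
import org.apache.storm.generated.StormTopology
import org.apache.storm.topology.{IRichBolt, IRichSpout, TopologyBuilder}

object ParallelismSketch {
  // Wires one spout into one bolt; "events" and "process" are made-up ids.
  def build(spout: IRichSpout, bolt: IRichBolt): StormTopology = {
    val builder = new TopologyBuilder
    builder.setSpout("events", spout, Int.box(2)) // parallelism hint: 2 threads
    builder
      .setBolt("process", bolt, Int.box(4))       // parallelism hint: 4 threads
      .setNumTasks(Int.box(8))                    // 8 tasks shared between them
      .shuffleGrouping("events")                  // tuples spread randomly over the tasks
    builder.createTopology()
  }

  // Worker count is part of the configuration passed at submission time.
  def workerConfig(): Config = {
    val conf = new Config
    conf.setNumWorkers(3) // 3 worker processes, i.e. 3 JVMs
    conf
  }
}
```

The resulting topology and configuration would then be handed to `StormSubmitter.submitTopology` together with a topology name, which is the "set in topology submission" step.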
  42. Storm - Good Things
      ● True Streaming
      ● Tuples processed as they enter topology - low latency
      ● Scales far beyond Spark Streaming (currently)
  43. Storm - Good Things
      ● Battle-tested at Twitter & Yahoo!
      ● Yahoo! has 300-node clusters and is working to support 1000+ nodes
      ● Single node clocked at over 1.5m tuples / second at Twitter
  44. Storm - Bad Things
      ● Very DIY (bring your own aggregations, ML, etc.)
      ● Your DAG construction may not be optimal
      ● Operationally more complex (and Storm WebUI is more primitive)
      ● Where’s Me REPL?
  45. Spark or Storm?
  46. Spark or Storm?
      ● SLA on latency?
  47. Spark or Storm?
      ● Storm!
      ● (though simply because it’s possible doesn’t mean you’ll get it!)
  48. Spark or Storm?
      ● Insane data needs (e.g. ~100m records/second?)
  49. Spark or Storm?
      ● Storm!
      ● (though, again, it’s not a magic bullet!)
  50. Spark or Storm?
      ● For almost anything else? Spark.
      ● High-level vs. Low-level
      ● Each new version of Spark delivers improvements!
  51. Other Listing Magazines Are Available
      ● Other frameworks that show promise:
        ○ Flink
        ○ Apex
        ○ Samza
        ○ Heron (Twitter’s not-public Storm replacement)
  52. Questions?
