
All Things Open - Spark & Storm - Where & When?

A quick tour through Spark Streaming and some of the debugging and tuning tips we've discovered over the past few years. Plus how to use Storm to get that low latency when you need it.

Published in: Software, Technology


  1. Spark & Storm: When & Where?
  2. www.mammothdata.com | @mammothdataco
     The Leader in Big Data Consulting
     ● BI/Data Strategy
       ○ Development of a business intelligence/data architecture strategy.
     ● Installation
       ○ Installation of Hadoop or relevant technology.
     ● Data Consolidation
       ○ Load data from diverse sources into a single scalable repository.
     ● Streaming
       ○ Mammoth will write ingestion and/or analytics which operate on the data as it comes in, as well as design dashboards, feeds or computer-driven decision-making processes to derive insights and make decisions.
     ● Visualization Tools
       ○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho, etc.). We will also create initial reports and provide training to the employees who will analyze the data.
     Mammoth Data, based in downtown Durham (right above Toast)
  3. Me!
     ● Lead Consultant on all things DevOps and Spark
     ● @carsondial on Twitter
  4. What This Talk Is About
     ● Quick overview of Spark Streaming
     ● Reasons why Spark Streaming can be tricky in practice
     ● Performance and tuning tips we’ve learnt over the past two years
     ● …and when to pack it all in and use Storm instead
  5. This IS WEB SCALE!
  6. Beyond Web Scale
     ● I kid, Rails!
     ● (mostly)
  7. Beyond Web Scale
     ● Spark & Storm: millions of requests / second on commodity hardware
     ● Different problems at different scales!
  8. Spark
     ● Directed Acyclic Graph Data Processing Engine
     ● Based around the Resilient Distributed Dataset (RDD) primitive
  9. Spark Streaming: Overview
  10. Spark Streaming: In Production?
     ● Yes!
     ● (Alibaba, AutoTrader, Cisco, Netflix, etc.)
  11. Spark Streaming: Overview
     ● Streaming by running batches very quickly!
     ● Batch length: can be as low as 0.5s / batch
     ● Every X seconds, get Y records (DStream/RDDs)
  12. Spark Streaming: Good Things
     ● Using same implementation (mostly) for batch and stream processing (Lambda Architecture hipster points ahoy!)
     ● Access to rest of Spark: DataFrames, MLlib, GraphX, etc.
  13. Spark Streaming: Bad Things!
     ● What happens if you can’t process Y records in X seconds?
     ● What happens if you require sub-second latency?
  14. Spark Streaming: I’m so sorry.
  15. Spark Streaming: Bad Things!
     ● What happens if you can’t process Y records in X seconds?
     ● Data builds up in executors
     ● Executors run out of memory…
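
To make slide 15's failure mode concrete, here is a toy model in plain Scala (no Spark involved). It assumes a fixed number of records arrives every batch interval and that the job can clear a fixed number per interval. The names `BacklogModel`, `arrivalsPerBatch` and `capacityPerBatch` are invented for this illustration and are not part of any Spark API.

```scala
object BacklogModel {
  /** Records still queued after each batch interval.
    *
    * arrivalsPerBatch: the "Y records" that arrive every interval (assumed constant)
    * capacityPerBatch: records the job can actually process per interval (assumed constant)
    * batches:          how many intervals to simulate
    */
  def backlog(arrivalsPerBatch: Long, capacityPerBatch: Long, batches: Int): Vector[Long] =
    (1 to batches)
      .scanLeft(0L) { (queued, _) =>
        // Whatever cannot be processed in this interval carries over to the next one.
        math.max(0L, queued + arrivalsPerBatch - capacityPerBatch)
      }
      .tail // drop the initial empty queue
      .toVector
}
```

With 1,000 records arriving and only 800 processed per interval, `BacklogModel.backlog(1000L, 800L, 3)` grows by 200 every batch. In a real job that growing queue is the data piling up in executor memory.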
  16. Spark Streaming: Bad Things!
     ● “Hey, we forgot to tell you Ops people that we have a major new client adding stuff into the firehose sometime today. That’s fine, right?”
  17. Spark Streaming: It Will Be Okay
  18. Spark Streaming: Bad Things!
     ● As a former Ops person:
     ● WE WILL REMEMBER.
  19. Spark Streaming: Tuning
     ● Do you need low latency?
     ● If so, a 10-minute nap is advisable!
     ● Everybody else, let’s dive in…
  20. Spark Streaming: Tuning
  21. Spark Streaming: Down In The Hole
  22. Spark Streaming: Down In The Hole
  23. Spark Streaming: Down In The Hole
     ● Easiest method: alter the batch window until it’s all fine!
     ● Tiny batches provide tight execution times!
  24. Spark Streaming: Tuning
     ● Use Kafka.
     ● Data source with the most love (e.g. exactly-once semantics without Write Ahead Logs and receiver-less operation in 1.3+)
     ● (other sources get the features…eventually)
  25. Spark Streaming: Tuning
     ● Use Scala.
     ● CPython = slower in execution
     ● PyPy is much faster…but…
     ● New features always come to Scala first.
  26. Spark Streaming: Tuning
     ● (or Java if you really must)
  27. Spark Streaming: Cores
     ● Spark Streaming = data receivers + Spark
     ● spark.cores.max = x * number of receivers
     ● For Great Data Locality and Parallelism!
  28. Spark Streaming: Caching
     ● Are you using a foreachRDD loop?
       dstream.foreachRDD { rdd => rdd.cache() … rdd.unpersist() }
  29. Spark Streaming: Caching
     ● If routing to multiple stores or iterating over an RDD multiple times, using cache() is a quick win
     ● It really shouldn’t work so well…
  30. Spark Streaming: Backpressure
     ● Hurrah for Spark 1.5!
     ● spark.streaming.backpressure.enabled = true
     ● Spark dynamically alters incoming data rates (keeping the data in Kafka rather than in the executors)
     ● Works for all data sources (for once!)
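
The two settings named on slides 27 and 30 can be collected as below. This is a minimal sketch in plain Scala so that it runs without Spark on the classpath. The object `StreamingSettings`, its method names, and the requirement that the multiplier be at least 2 are assumptions for illustration. The reasoning behind that last one is that each receiver occupies a core, so some must be left over for processing. The slides do not say what value of x to pick.

```scala
object StreamingSettings {
  /** Slide 27's rule of thumb: spark.cores.max = x * number of receivers.
    * `coresPerReceiver` plays the role of x (hypothetical parameter name).
    */
  def coresMax(receivers: Int, coresPerReceiver: Int): Int = {
    require(receivers > 0, "need at least one receiver")
    // Assumption: x >= 2, because a receiver ties up one core by itself.
    require(coresPerReceiver >= 2, "leave cores free for processing")
    receivers * coresPerReceiver
  }

  /** Property pairs for the core count plus the backpressure switch (Spark 1.5+). */
  def settings(receivers: Int, coresPerReceiver: Int): Seq[(String, String)] = Seq(
    "spark.cores.max" -> coresMax(receivers, coresPerReceiver).toString,
    "spark.streaming.backpressure.enabled" -> "true"
  )
}
```

In a real job these pairs could be handed to `new SparkConf().setAll(StreamingSettings.settings(2, 3))` before creating the streaming context. The values 2 and 3 are placeholders, not recommendations from the talk.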
  31. Storm
     ● I really need that low-latency response!
  32. Storm
     ● Directed Acyclic Graph Data Processing Engine
  33. Spark: “Very Good, Sir”
  34. Storm: “Here you go!”
  35. Storm Concepts
     ● Stream of tuples
     ● Bolts
     ● Spouts
     ● Topologies
  36. Storm: Streams
     ● Unbounded stream of tuples
     ● Tuples are defined via schema (usual base types plus custom serializers)
  37. Storm: Spouts
     ● Sources of tuples in a topology
     ● Read from external sources (e.g. Kafka) and emit the data as tuples
     ● Can emit multiple streams from a spout!
  38. Storm: Bolts
     ● Where your processing happens
     ● Roll your own aggregations / filtering / windowing
     ● Bolts can feed into other bolts
     ● Potentially easier to test than Spark Streaming
     ● Many Bolt connectors for external sources (e.g. Cassandra, Redis, Hive, etc.)
  39. Storm: Topologies
     ● The DAG of the spouts and bolts
     ● Built programmatically in code and submitted to the Storm cluster
     ● Flux: Do It In YAML (and then complain about whitespace)
  40. Storm: Tasks
     ● Each bolt or spout runs 'tasks' across the cluster
     ● How parallelism works in Storm
     ● Set in topology submission
  41. Storm: Workers
     ● Where the topology runs
     ● 1 worker = 1 JVM
     ● Tasks run as threads on a worker
     ● Storm distributes tasks evenly across cluster
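
As a rough picture of what "distributes tasks evenly" on slide 41 can mean, the sketch below spreads component tasks over workers round-robin. It is a toy model in plain Scala and not Storm's actual scheduler. `TaskPlacement`, `place` and the task labels are invented for this example.

```scala
object TaskPlacement {
  /** Round-robin placement of component tasks onto workers.
    *
    * components: (component name, number of tasks), in topology order
    * workers:    number of worker JVMs
    * Returns the task labels placed on each worker index that received any.
    */
  def place(components: Seq[(String, Int)], workers: Int): Map[Int, Seq[String]] = {
    require(workers > 0, "need at least one worker")
    // Expand each component into its individual tasks, e.g. "bolt-0", "bolt-1".
    val tasks = for {
      (name, count) <- components
      i <- 0 until count
    } yield s"$name-$i"
    // Deal tasks out like cards so worker loads differ by at most one.
    tasks.zipWithIndex
      .groupBy { case (_, idx) => idx % workers }
      .map { case (worker, assigned) => worker -> assigned.map(_._1) }
  }
}
```

For example, `TaskPlacement.place(Seq("spout" -> 2, "bolt" -> 4), 4)` puts two tasks on each of the first two workers and one on each of the others.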
  42. Storm: Good Things
     ● True Streaming
     ● Tuples processed as they enter topology: low latency
     ● Scales far beyond Spark Streaming (currently)
  43. Storm: Good Things
     ● Battle-tested at Twitter & Yahoo!
     ● Yahoo! has 300-node clusters and is working to support 1000+ nodes
     ● Single node clocked at over 1.5m tuples / second at Twitter
  44. Storm: Bad Things
     ● Very DIY (bring your own aggregations, ML, etc.)
     ● Your DAG construction may not be optimal
     ● Operationally more complex (and Storm WebUI is more primitive)
     ● Where’s Me REPL?
  45. Spark or Storm?
  46. Spark or Storm?
     ● SLA on latency?
  47. Spark or Storm?
     ● Storm!
     ● (though simply because it’s possible doesn’t mean you’ll get it!)
  48. Spark or Storm?
     ● Insane data needs (e.g. ~100m records/second?)
  49. Spark or Storm?
     ● Storm!
     ● (though, again, it’s not a magic bullet!)
  50. Spark or Storm?
     ● For almost anything else? Spark.
     ● High-level vs. Low-level
     ● Each new version of Spark delivers improvements!
  51. Other Listing Magazines Are Available
     ● Other frameworks that show promise:
       ○ Flink
       ○ Apex
       ○ Samza
       ○ Heron (Twitter’s not-public Storm replacement)
  52. Questions?
