
Spark Streaming into context

Some notes on Spark Streaming's positioning given the current players: Beam, Flink, Storm et al. Helpful if you have to choose a streaming engine for your project.



  1. Spark Streaming into context. David Martinez Rego, 20th of October 2016
  2. About me • PhD in ML, 2013: predictive maintenance of windmills • Lived in London since then • Postdoc @ UCL • Teaching and mentoring @ UCL, internships inside financial institutions • Consulting on data analytics • Early startup
  3. Plethora of options?
  4. Wishlist • Easy to compose complex pipelines • Easy scaling out • Interoperable with a large ecosystem • Low latency and high throughput • Monitoring
  5. Plethora of options?
  6. Flume • Its mechanism of scaling to different machines is managed in an ad hoc way
  7. Flume • Its mechanism of scaling to different machines is managed in an ad hoc way
  8. Flume • Its mechanism of scaling to different machines is managed in an ad hoc way • Nice for simple custom data gathering from the exterior, throwing it into the perimeter for further processing
  9. Plethora of options?
  10. Plethora of options?
  11. Plethora of options?
  12. Plethora of options?
  13. Lessons learnt • Each project added good ideas when they were most needed • Eventually, all platforms have absorbed the best ideas from their peers • It seems that we have a winner, for now?
  14. Comparison table (Time view, Pipelining, Composition): one at a time, spouts and bolts; RDD, one at a time, spouts and bolts
  15. Storm basic model (diagram: spouts feeding bolts through stream groupings, s.g., inside a topology)
  16. Guarantees and fault tolerance (diagram: ACK and anchoring)
  17. Anchoring (diagram: ACK, guarantees and fault tolerance)
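The anchoring/ack slides can be sketched in code. This is a simplified simulation, not Storm's actual API: Storm's acker tracks each spout tuple's tree with a single running XOR of random 64-bit tuple ids, so anchoring XORs an id in, acking XORs it out, and a value of zero means the whole tree was processed.

```python
# Simplified sketch of Storm's ack/anchor bookkeeping (not Storm's API).
# Anchoring a downstream tuple XORs its id into the spout tuple's
# running value; acking XORs it out. Zero means the tree is complete.

class Acker:
    def __init__(self):
        self.pending = {}  # spout tuple id -> running XOR of anchored ids

    def anchor(self, spout_id, tuple_id):
        self.pending[spout_id] = self.pending.get(spout_id, 0) ^ tuple_id

    def ack(self, spout_id, tuple_id):
        self.pending[spout_id] ^= tuple_id
        if self.pending[spout_id] == 0:
            del self.pending[spout_id]
            return True   # whole tuple tree processed: ack the spout tuple
        return False

acker = Acker()
t1, t2 = 0xA1, 0xB2                    # ids of two bolt-emitted tuples
acker.anchor("s0", t1)                 # first bolt anchors its output
acker.anchor("s0", t2)                 # second bolt anchors its output
assert acker.ack("s0", t1) is False    # tree not finished yet
assert acker.ack("s0", t2) is True     # XOR back to 0: spout tuple done
```

This is why Storm's tracking is constant-space per spout tuple, however large the downstream tuple tree grows.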
  18. Spout
  19. Bolt
  20. Topology
  21. Storm basic model (diagram: spouts feeding bolts through stream groupings, s.g., inside a topology)
  22. Lambda architecture
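A minimal sketch of the Lambda architecture's core idea, with toy in-memory data: a batch layer precomputes a view up to a cutoff, a speed layer covers events after it, and the serving layer merges both at query time.

```python
# Lambda architecture sketch: batch view + real-time (speed) view,
# merged at query time. All data and the cutoff are illustrative.
from collections import Counter

events = [("ads", t) for t in range(10)]   # toy event log, timestamps 0..9
cutoff = 7                                 # event time of the last batch run

batch_view = Counter(k for k, t in events if t < cutoff)    # batch layer
speed_view = Counter(k for k, t in events if t >= cutoff)   # speed layer

def query(key):
    return batch_view[key] + speed_view[key]   # serving-layer merge

assert query("ads") == 10
```

The cost the talk alludes to is that the same business logic must be written and kept in sync twice, once per layer.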
  23. Comparison table (Time view, Pipelining, Composition): one at a time, spouts and bolts; RDD, one at a time, system, stream, stream task
  24. Samza
  25. Samza
  26. Samza
  27. Samza
  28. Kappa architecture
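The Kappa architecture drops the separate batch layer: everything is a stream job over a durable log, and "reprocessing" means replaying the whole log through a new version of the job. A hedged sketch with a plain list standing in for the log:

```python
# Kappa architecture sketch: one streaming job; to change the logic,
# replay the retained log through the new job and swap views.
log = [("click", 1), ("click", 1), ("view", 1)]   # toy durable log

def streaming_job(stream, weight=1):
    table = {}
    for key, n in stream:
        table[key] = table.get(key, 0) + n * weight
    return table

v1 = streaming_job(iter(log))             # current production view
v2 = streaming_job(iter(log), weight=2)   # new logic: full log replay
assert v1 == {"click": 2, "view": 1}
assert v2 == {"click": 4, "view": 2}
```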
  29. Comparison table (Time view, Pipelining, Composition): one at a time, source, spouts, bolts and ack; RDD, one at a time, system, stream, stream task
  30. RDD
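The RDD idea, sketched rather than Spark's actual implementation: an immutable dataset that records its lineage (the transformation that produced it), so a lost partition can be recomputed from its parent instead of being replicated.

```python
# Toy RDD with lineage-based recomputation (illustrative, not Spark's API).
class RDD:
    def __init__(self, data=None, parent=None, fn=None):
        self._data, self._parent, self._fn = data, parent, fn

    def map(self, f):       # lazy: only records the lineage step
        return RDD(parent=self, fn=lambda xs: [f(x) for x in xs])

    def filter(self, p):
        return RDD(parent=self, fn=lambda xs: [x for x in xs if p(x)])

    def collect(self):      # recompute from lineage on demand
        if self._data is not None:
            return self._data
        return self._fn(self._parent.collect())

nums = RDD(data=[1, 2, 3, 4])
out = nums.map(lambda x: x * x).filter(lambda x: x > 4)
assert out.collect() == [9, 16]
```

Note that `map` and `filter` do no work up front; computation happens only when `collect` walks the lineage chain.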
  31. Microbatch
  32. Init + connect to source, pipeline computation + state mgmt.
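Micro-batching, sketched: chop the stream into fixed intervals, run a small batch job per interval, and carry state across batches. This is roughly the DStream model in spirit, not Spark's actual API; the data and batch size are illustrative.

```python
# Micro-batch sketch: per-interval batch jobs with carried state,
# in the spirit of updateStateByKey (not Spark's API).
from collections import Counter

stream = ["a", "b", "a", "c", "a", "b"]   # toy event stream
batch_size = 2                            # stands in for the batch interval
state = Counter()                         # running counts across batches

for i in range(0, len(stream), batch_size):
    batch = stream[i:i + batch_size]      # one micro-batch
    state.update(batch)                   # merge batch result into state

assert state == Counter({"a": 3, "b": 2, "c": 1})
```

The batch interval is also the latency floor the next slide complains about: no result can be emitted sooner than one interval after its event arrives.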
  33. Comparison table (Time view, Pipelining, Composition): one at a time, source, spouts, bolts and ack; RDD, one at a time, system, stream, stream task
  34. Much better, but still… • Remaining problems: 1. Still no full equivalence between batch and streaming 2. Out-of-order management and early reporting have to be hand-coded 3. Custom window code needs to be mixed with business logic 4. Micro-batches impose a lower limit on latency
  35. Spark: batch and streaming
  36. Spark: batch and streaming
  37. Lambda architecture?
  38. Out of order: latency is unpredictable
  39. Our aim
  40. Final Spark (1)
  41. Final Spark (2)
  42. Batch vs. Streaming: Data, Streaming
  43. Batch vs. Streaming: Data, Batch
  44. Batch vs. Streaming: Data, Batch. A batch pipeline IS a streaming pipeline applied to a finite stream!
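The claim on the slide can be made concrete: write the job once as an incremental fold over a stream, and "batch" is just running it over a finite stream and keeping the last emitted state. A minimal sketch:

```python
# One job, two modes: a streaming word count that emits an updated
# result per event; batch = the final state over a finite stream.
def wordcount(stream):
    counts = {}
    for w in stream:
        counts[w] = counts.get(w, 0) + 1
        yield dict(counts)            # emit an updated result per event

finite = ["a", "b", "a"]
*_, final = wordcount(iter(finite))   # batch answer = last streaming state
assert final == {"a": 2, "b": 1}
```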
  45. Event time + Processing time (diagram: business logic + event time + processing time)
  46. Event time + Processing time (diagram: business logic + event time + processing time)
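The event-time vs. processing-time distinction, sketched with toy data: events carry their own timestamps and arrive out of order; windows are keyed by event time, and a watermark (here an assumed fixed 5-second lateness allowance, purely illustrative) decides when a window can be closed.

```python
# Event-time windowing with a watermark (illustrative sketch).
from collections import defaultdict

WINDOW = 10                        # seconds of event time per window

def window_of(ts):
    return ts - ts % WINDOW

windows = defaultdict(int)         # open event-time windows
closed = []                        # (window start, count) once emitted
watermark = 0

# (event_time, value) in *arrival* (processing-time) order; note that
# the event at t=9 arrives after the one at t=12, i.e. out of order.
arrivals = [(3, 1), (7, 1), (12, 1), (9, 1), (15, 1)]

for ts, v in arrivals:
    windows[window_of(ts)] += v                       # assign by event time
    watermark = max(watermark, ts - 5)                # allow 5s of lateness
    for w in [w for w in windows if w + WINDOW <= watermark]:
        closed.append((w, windows.pop(w)))            # window is complete

# Window [0, 10) closes, including the late t=9 event, once the
# watermark passes 10; window [10, 20) is still open.
assert closed == [(0, 3)]
assert dict(windows) == {10: 2}
```

Separating window assignment and triggering from the per-event logic is exactly what the slide's "business logic + event time + processing time" decomposition is about.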
  47. Plethora of options?
  48. Beam/Dataflow
  49. Beam/Dataflow
  50. Beam/Dataflow
  51. Apache Beam: Streaming API, Execution engine
  52. Apache Beam: Streaming API, Execution engine. http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
  53. Apache Beam: Kostas Tzoumas, data Artisans; Tyler Akidau, Beam PMC
  54. Other considerations (comparison table: Maturity, Ecosystem, Community, Ops)
  55. Other considerations • Flow of the experiment: • Read an event from Kafka • Deserialize the JSON string • Filter out irrelevant events • Take a projection of the relevant fields • Join each event with its associated campaign (from Redis) • Take a windowed count of events per campaign and store each window in Redis along with a last-updated timestamp (with late events)
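The experiment flow above (the Yahoo streaming benchmark) can be sketched end to end with plain Python, dicts standing in for Kafka and Redis; the event fields and join-table names are illustrative, not the benchmark's exact schema.

```python
# Sketch of the benchmark pipeline: deserialize -> filter -> project
# -> join -> windowed count. Dicts stand in for Kafka and Redis.
import json
from collections import defaultdict

ad_to_campaign = {"ad1": "c1", "ad2": "c1", "ad3": "c2"}   # "Redis" join table

kafka = [                                                  # "Kafka" topic
    json.dumps({"ad_id": "ad1", "event_type": "view",  "event_time": 1}),
    json.dumps({"ad_id": "ad2", "event_type": "click", "event_time": 4}),
    json.dumps({"ad_id": "ad3", "event_type": "view",  "event_time": 12}),
]

WINDOW = 10
counts = defaultdict(int)          # "Redis": (campaign, window) -> count

for raw in kafka:
    e = json.loads(raw)                            # deserialize the JSON
    if e["event_type"] != "view":                  # filter irrelevant events
        continue
    ad, ts = e["ad_id"], e["event_time"]           # project relevant fields
    campaign = ad_to_campaign[ad]                  # join with its campaign
    counts[(campaign, ts - ts % WINDOW)] += 1      # windowed count

assert counts == {("c1", 0): 1, ("c2", 10): 1}
```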
  56. Resources • https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 • https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 • https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison • http://data-artisans.com/why-apache-beam/#more-710 • http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/ • http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html • https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
  57. Spark Streaming into context. Thanks for listening!
