
Distributed Stream Processing - Spark Summit East 2017


The demand for stream processing is growing rapidly. Immense amounts of data have to be processed quickly from a rapidly growing set of disparate data sources, which pushes traditional data processing infrastructures to their limits. Stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.

A number of powerful, easy-to-use open-source platforms have emerged to address this. But they solve the same problem in different ways, target different yet sometimes overlapping use cases, and use different vocabularies for similar concepts. This can lead to confusion, longer development time, or costly wrong decisions.


Distributed Stream Processing - Spark Summit East 2017

  1. 1. MANCHESTER LONDON NEW YORK
  2. 2. Petr Zapletal @petr_zapletal @spark_summit @cakesolutions Distributed Real-Time Stream Processing: Why and How
  3. 3. Agenda ● Motivation ● Stream Processing ● Available Frameworks ● Systems Comparison ● Recommendations
  4. 4. The Data Deluge ● New Sources and New Use Cases ● 8 Zettabytes (1 ZB = 1 trillion GB) created in 2015 ● Stream Processing to the Rescue
  5. 5. Distributed Stream Processing ● Continuous processing, aggregation and analysis of unbounded data
  6. 6. Points of Interest ➜ Runtime and Programming Model
  7. 7. Points of Interest ➜ Runtime and Programming Model ➜ Primitives
  8. 8. Points of Interest ➜ Runtime and Programming Model ➜ Primitives ➜ State Management
  9. 9. Points of Interest ➜ Runtime and Programming Model ➜ Primitives ➜ State Management ➜ Message Delivery Guarantees
  10. 10. Points of Interest ➜ Runtime and Programming Model ➜ Primitives ➜ State Management ➜ Message Delivery Guarantees ➜ Fault Tolerance & Low Overhead Recovery
  11. 11. Points of Interest ➜ Runtime and Programming Model ➜ Primitives ➜ State Management ➜ Message Delivery Guarantees ➜ Fault Tolerance & Low Overhead Recovery ➜ Latency, Throughput & Scalability
  12. 12. Points of Interest ➜ Runtime and Programming Model ➜ Primitives ➜ State Management ➜ Message Delivery Guarantees ➜ Fault Tolerance & Low Overhead Recovery ➜ Latency, Throughput & Scalability ➜ Maturity and Adoption Level
  13. 13. Points of Interest ➜ Runtime and Programming Model ➜ Primitives ➜ State Management ➜ Message Delivery Guarantees ➜ Fault Tolerance & Low Overhead Recovery ➜ Latency, Throughput & Scalability ➜ Maturity and Adoption Level ➜ Ease of Development and Operability
  14. 14. Runtime and Programming Model ● The most important trait of a stream processing system ● Defines expressiveness, possible operations and their limitations ● Therefore determines the system's capabilities and its use cases
  15. 15. Native Streaming [diagram: source operator → processing operators → sink operator; records processed one at a time]
  16. 16. Micro-batching [diagram: receiver groups incoming records into micro-batches; processing operators → sink operator; records processed in short batches]
  17. 17. Native Streaming ● Records are processed as they arrive Pros ⟹ Expressiveness ⟹ Low-latency ⟹ Stateful operations Cons ⟹ Fault-tolerance is expensive ⟹ Load-balancing
  18. 18. Micro-batching ● Splits incoming stream into small batches Pros ⟹ High-throughput ⟹ Easier fault tolerance ⟹ Simpler load-balancing Cons ⟹ Higher latency, depends on batch interval ⟹ Limited expressivity ⟹ Harder stateful operations
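A minimal, framework-agnostic sketch of the two runtime models above, in plain Scala (object and method names are illustrative only): a native engine applies its operator to every record as it arrives, while a micro-batching engine buffers records and applies the operator to each small batch.

    object ProcessingModels {

      // Native streaming: the operator is applied to every record as it arrives.
      def processNatively[A, B](records: Iterator[A])(op: A => B): Iterator[B] =
        records.map(op)

      // Micro-batching: records are buffered into small batches (here by count,
      // in real systems typically by time interval) and processed per batch.
      def processMicroBatched[A, B](records: Iterator[A], batchSize: Int)(op: Seq[A] => Seq[B]): Iterator[B] =
        records.grouped(batchSize).flatMap(batch => op(batch.toSeq))

      def main(args: Array[String]): Unit = {
        println(processNatively(Iterator("a", "b", "c"))(_.toUpperCase).toList)               // per record
        println(processMicroBatched(Iterator("a", "b", "c"), 2)(_.map(_.toUpperCase)).toList) // per batch
      }
    }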
  19. 19. Programming Model Compositional ⟹ Provides basic building blocks as operators or sources ⟹ Custom component definition ⟹ Manual topology definition & optimization ⟹ Advanced functionality often missing Declarative ⟹ High-level API ⟹ Operators as higher order functions ⟹ Abstract data types ⟹ Advanced operations like state management or windowing supported out of the box ⟹ Advanced optimizers
  20. 20. Apache Streaming Landscape [framework logos: Storm, Trident, Spark Streaming, Samza, Kafka Streams, Apex, Flink]
  21. 21. Storm ● Pioneer in large-scale stream processing
  22. 22. Trident ● Higher-level micro-batching system built atop Storm
  23. 23. Spark Streaming ● Unified batch and stream processing over a batch runtime [diagram: input data stream → Spark Streaming → batches of input data → Spark Engine → batches of processed data]
  24. 24. Samza ● Builds heavily on Kafka’s log-based philosophy [diagram: Task 1, Task 2, Task 3]
  25. 25. Kafka Streams ● Simple streaming library on top of Apache Kafka
  26. 26. Apex ● Processes massive amounts of real-time events natively in Hadoop [diagram: DAG of operators connected by streams of tuples, ending in an output operator]
  27. 27. Flink ● Native streaming & high-level API [diagram: stream data (Kafka, RabbitMQ, ...) and batch data (HDFS, JDBC, ...)]
  28. 28. Counting Words ● Input words: Spark Summit 2017 Apache Apache Spark Storm Apache Trident Flink Streaming Samza Scala 2017 Streaming ● Output counts: (Apache, 3) (Streaming, 2) (2017, 2) (Spark, 2) (Storm, 1) (Trident, 1) (Flink, 1) (Samza, 1) (Scala, 1) (Summit, 1)
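For reference, a minimal framework-free sketch of the word-count logic used as the running example in the following slides, in plain Scala (names are illustrative only):

    object WordCount {

      // Count occurrences of each whitespace-separated word.
      def count(text: String): Map[String, Int] =
        text.split("\\s+")
          .groupBy(identity)
          .map { case (word, ws) => word -> ws.length }

      def main(args: Array[String]): Unit = {
        val input = "Spark Summit 2017 Apache Apache Spark Storm Apache Trident " +
                    "Flink Streaming Samza Scala 2017 Streaming"
        count(input).toSeq.sortBy(-_._2).foreach { case (w, c) => println(s"($w, $c)") }
      }
    }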
  29. 29. Storm
      // Topology definition: spout emitting sentences, split bolt, word-count bolt
      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("spout", new RandomSentenceSpout(), 5);
      builder.setBolt("split", new Split(), 8).shuffleGrouping("spout");
      builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
      ...

      // Inside the WordCount bolt: keep a running count per word and emit updates
      Map<String, Integer> counts = new HashMap<String, Integer>();

      public void execute(Tuple tuple, BasicOutputCollector collector) {
          String word = tuple.getString(0);
          Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
          counts.put(word, count);
          collector.emit(new Values(word, count));
      }
  32. 32. Trident
      public static StormTopology buildTopology(LocalDRPC drpc) {
          FixedBatchSpout spout = ...
          TridentTopology topology = new TridentTopology();
          TridentState wordCounts = topology.newStream("spout1", spout)
              .each(new Fields("sentence"), new Split(), new Fields("word"))
              .groupBy(new Fields("word"))
              .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
          ...
      }
  35. 35. Spark Streaming
      val conf = new SparkConf().setAppName("wordcount")
      val ssc = new StreamingContext(conf, Seconds(1))

      val text = ...

      val counts = text.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)

      counts.print()

      ssc.start()
      ssc.awaitTermination()
  39. 39. Samza
      class WordCountTask extends StreamTask {

        override def process(envelope: IncomingMessageEnvelope,
                             collector: MessageCollector,
                             coordinator: TaskCoordinator) {

          val text = envelope.getMessage.asInstanceOf[String]

          val counts = text.split(" ").foldLeft(Map.empty[String, Int]) { (count, word) =>
            count + (word -> (count.getOrElse(word, 0) + 1))
          }

          collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), counts))
        }
      }
  42. 42. Kafka Streams
      final Serde<String> stringSerde = Serdes.String();
      final Serde<Long> longSerde = Serdes.Long();

      KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "streams-file-input");

      KStream<String, Long> wordCounts = textLines
          .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
          .groupBy((key, value) -> value)
          .count("Counts")
          .toStream();

      wordCounts.to(stringSerde, longSerde, "streams-wordcount-output");
  44. 44. Apex
      val input = dag.addOperator("input", new LineReader)
      val parser = dag.addOperator("parser", new Parser)
      val out = dag.addOperator("console", new ConsoleOutputOperator)
      // the word-count operator referenced below as `counter` is not shown on this slide

      dag.addStream[String]("lines", input.out, parser.in)
      dag.addStream[String]("words", parser.out, counter.data)

      class Parser extends BaseOperator {
        @transient val out = new DefaultOutputPort[String]()
        @transient val in = new DefaultInputPort[String]()

        override def process(t: String): Unit = {
          for (w <- t.split(" ")) out.emit(w)
        }
      }
  47. 47. Flink
      val env = ExecutionEnvironment.getExecutionEnvironment

      val text = env.fromElements(...)

      val counts = text.flatMap ( _.split(" ") )
        .map ( (_, 1) )
        .groupBy(0)
        .sum(1)

      counts.print()

      env.execute("wordcount")
  49. 49. Summary (Streaming Model / API / Guarantees / Fault Tolerance / State Management / Latency / Throughput / Maturity)
      Storm: Native / Compositional / At-least-once / Record ACKs / Not built-in / Very Low / Low / High
      Trident: Micro-batching / Compositional / Exactly-once / Record ACKs / Dedicated Operators / Medium / Medium / High
      Spark Streaming: Micro-batching / Declarative / Exactly-once* / RDD-based Checkpointing / Dedicated DStream / Medium / High / High
      Samza: Native / Compositional / At-least-once / Log-based / Stateful Operators / Low / High / Medium
      Kafka Streams: Native / Declarative* / At-least-once / Log-based / Stateful Operators / Low / High / Low
      Apex: Hybrid / Compositional* / Exactly-once / Checkpointing / Stateful Operators / Very Low / High / Medium
      Flink: Native / Declarative / Exactly-once / Checkpointing / Stateful Operators / Low / High / Medium
  50. 50. General Guidelines ● As always, it depends
  51. 51. General Guidelines ➜ Evaluate particular application needs
  52. 52. General Guidelines ➜ Evaluate particular application needs ➜ Programming model
  53. 53. General Guidelines ➜ Evaluate particular application needs ➜ Programming model ➜ Available delivery guarantees
  54. 54. General Guidelines ➜ Evaluate particular application needs ➜ Programming model ➜ Available delivery guarantees ➜ Almost all non-trivial jobs have state
  55. 55. General Guidelines ➜ Evaluate particular application needs ➜ Programming model ➜ Available delivery guarantees ➜ Almost all non-trivial jobs have state ➜ Fast recovery is critical
  56. 56. Recommendations [Storm & Trident] ● A good fit for small and fast tasks ● Very low (tens of milliseconds) latency ● State & fault tolerance degrade performance significantly ● Potential upgrade path to Heron ○ Keeps the API; according to Twitter, better in every single way ○ Open-sourced recently
  57. 57. Recommendations [Spark Streaming] ● Spark Ecosystem ● Data Exploration ● Latency is not critical ● Micro-batching limitations
  58. 58. Recommendations [Samza] ● Kafka is a cornerstone of your architecture ● Application requires large state ● Exactly-once delivery is not needed
  59. 59. Recommendations [Kafka Streams] ● Similar functionality to Samza, with nicer APIs ● Operated by Kafka itself ● Great learning curve ● At-least-once delivery ● May not support more advanced functionality
  60. 60. Recommendations [Apex] ● Prefer compositional approach ● Hadoop ● Great performance ● Dynamic DAG changes
  61. 61. Recommendations [Flink] ● Conceptually great, fits most use cases ● Take advantage of its batch processing capabilities ● Need functionality that is hard to implement in micro-batching ● Enough courage to use an emerging project
  62. 62. Dataflow and Apache Beam ● Dataflow Model & SDKs ● One pipeline, many runtimes: Direct Pipeline, Apache Flink, Apache Spark, Google Cloud Dataflow ● Multiple modes: stream processing and batch processing, local or cloud
  63. 63. Questions MANCHESTER LONDON NEW YORK
  64. 64. MANCHESTER LONDON NEW YORK @petr_zapletal @cakesolutions 347 708 1518 enquiries@cakesolutions.net We are hiring http://www.cakesolutions.net/careers
  65. 65. References ● http://storm.apache.org/ ● http://spark.apache.org/streaming/ ● http://samza.apache.org/ ● https://apex.apache.org/ ● https://flink.apache.org/ ● http://beam.incubator.apache.org/ ● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1 ● http://data-artisans.com/blog/ ● http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple ● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at ● http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem ● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 ● http://stackoverflow.com/questions/29111549/where-do-apache-samza-and-apache-storm-differ-in-their-use-cases ● https://www.dropbox.com/s/1s8pnjwgkkvik4v/GearPump%20Edge%20Topologies%20and%20Deployment.pptx?dl=0 ● ...
  66. 66. List of Figures ● https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html ● http://www.slideshare.net/ptgoetz/storm-hadoop-summit2014 ● https://databricks.com/wp-content/uploads/2015/07/image41-1024x602.png ● http://data-artisans.com/wp-content/uploads/2015/08/microbatching.png ● http://samza.apache.org/img/0.9/learn/documentation/container/stateful_job ● http://data-artisans.com/wp-content/uploads/2015/08/streambarrier.png ● https://cwiki.apache.org/confluence/display/FLINK/Stateful+Stream+Processing ● https://raw.githubusercontent.com/tweise/apex-samples/kafka-count-jdbc/exactly-once/docs/images/image00.png ● https://4.bp.blogspot.com/-RlLeDymI_mU/Vp-1cb3AxNI/AAAAAAAACSQ/5TphliHJA4w/s1600/dataflow%2BASF.png
  67. 67. Backup Slides MANCHESTER LONDON NEW YORK
  68. 68. Processing Architecture Evolution: Batch Pipeline [diagram: HDFS → Serving DB → Query]
  69. 69. Processing Architecture Evolution: Lambda Architecture [diagram: all your data → batch layer (Oozie) and stream layer → serving layer → query]
  70. 70. Processing Architecture Evolution: Standalone Stream Processing [diagram: stream processing]
  71. 71. Processing Architecture Evolution: Kappa Architecture [diagram: stream processing → query]
  72. 72. Streaming Applications ETL Operations ● Transformations, joining or filtering of incoming data
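As a rough illustration of such an ETL step, here is a per-batch sketch in plain Scala (types and names are hypothetical; a real job would use the chosen framework's stream operators instead):

    object EtlStep {

      final case class Click(userId: String, url: String)

      // Parse raw "userId,url" lines, drop non-HTTPS clicks, then enrich by
      // joining with a small user table (userId -> user name).
      def transform(rawLines: Seq[String], users: Map[String, String]): Seq[(String, String)] =
        rawLines
          .map(_.split(",", 2))
          .collect { case Array(id, url) => Click(id.trim, url.trim) }
          .filter(_.url.startsWith("https://"))
          .flatMap(c => users.get(c.userId).map(name => name -> c.url))

      def main(args: Array[String]): Unit = {
        val users = Map("1" -> "alice", "2" -> "bob")
        val lines = Seq("1,https://example.org", "2,http://insecure.example", "3,https://example.net")
        transform(lines, users).foreach(println)   // only ("alice", "https://example.org") survives
      }
    }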
  73. 73. Streaming Applications Windowing ● Trends in a bounded interval, like tweets or sales
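A minimal sketch of a tumbling (fixed-interval) window count over timestamped events, in plain Scala with illustrative names; streaming frameworks expose this as a built-in windowing primitive:

    object TumblingWindow {

      // Count events per fixed window of `windowMillis`, keyed by window start time.
      def countPerWindow(events: Seq[(Long, String)], windowMillis: Long): Map[Long, Int] =
        events
          .groupBy { case (ts, _) => ts / windowMillis }
          .map { case (window, evs) => window * windowMillis -> evs.size }

      def main(args: Array[String]): Unit = {
        val events = Seq((1000L, "tweet"), (1500L, "tweet"), (2500L, "tweet"), (3100L, "sale"))
        countPerWindow(events, 1000L).toSeq.sortBy(_._1).foreach(println)
        // (1000,2), (2000,1), (3000,1)
      }
    }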
  74. 74. Streaming Applications Machine Learning ● Clustering, Trend fitting, Classification
  75. 75. Streaming Applications Pattern Recognition ● Fraud detection, Signal triggering, Anomaly detection
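As a toy illustration of stream-based anomaly detection, the sketch below flags a value that exceeds the running mean by a fixed factor (plain Scala, illustrative only; real pipelines would use far richer models):

    object AnomalyDetector {

      // Flag a value as anomalous when it exceeds the running mean of all
      // previously seen values by `factor`.
      def flagAnomalies(values: Seq[Double], factor: Double = 3.0): Seq[Double] = {
        val (_, _, anomalies) = values.foldLeft((0.0, 0, Vector.empty[Double])) {
          case ((sum, n, found), v) =>
            val isAnomaly = n > 0 && v > factor * (sum / n)
            (sum + v, n + 1, if (isAnomaly) found :+ v else found)
        }
        anomalies
      }

      def main(args: Array[String]): Unit =
        println(flagAnomalies(Seq(10.0, 12.0, 11.0, 95.0, 10.0)))   // Vector(95.0)
    }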
  76. 76. Fault Tolerance ● Fault tolerance in streaming systems is inherently harder than in batch
  77. 77. Managing State f: (input, state) => (output, state’)
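The signature above can be written out directly in Scala; a minimal sketch for a running word count (illustrative only; real engines additionally make this state fault tolerant via changelogs or checkpoints):

    object StatefulCount {

      type State = Map[String, Int]

      // One step of the operator: consume a word, emit its updated count, return the new state.
      def step(word: String, state: State): ((String, Int), State) = {
        val count = state.getOrElse(word, 0) + 1
        ((word, count), state + (word -> count))
      }

      def main(args: Array[String]): Unit = {
        val words = Seq("apache", "spark", "apache")
        val (outputs, _) = words.foldLeft((Vector.empty[(String, Int)], Map.empty[String, Int])) {
          case ((out, st), w) =>
            val (o, st2) = step(w, st)
            (out :+ o, st2)
        }
        outputs.foreach(println)   // (apache,1), (spark,1), (apache,2)
      }
    }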
  78. 78. Performance Hard to design an unbiased test; lots of variables
  79. 79. Performance ➜ Latency vs. Throughput
  80. 80. Performance ➜ Latency vs. Throughput ➜ Costs of Delivery Guarantees, Fault-tolerance & State Management
  81. 81. Performance ➜ Latency vs. Throughput ➜ Costs of Delivery Guarantees, Fault-tolerance & State Management ➜ Tuning
  82. 82. Performance ➜ Latency vs. Throughput ➜ Costs of Delivery Guarantees, Fault-tolerance & State Management ➜ Tuning ➜ Network operations, Data locality & Serialization
  83. 83. Project Maturity When picking a framework, you should always consider its maturity
  84. 84. Project Maturity [Storm & Trident] For a long time the de-facto industrial standard
  85. 85. Project Maturity [Spark Streaming] The most trending Scala repository these days and one of the engines behind Scala’s popularity
  86. 86. Project Maturity [Samza] Used by LinkedIn and also by tens of other companies
  87. 87. Project Maturity [Kafka Streams] ???
  88. 88. Project Maturity [Apex] Graduated very recently, adopted by a couple of corporate clients already
  89. 89. Project Maturity [Flink] Still an emerging project, but we can see its first production deployments
