Large-Scale Stream Processing in the Hadoop Ecosystem


Distributed stream processing is one of the hot topics in big data analytics today. An increasing number of applications are shifting from traditional static data sources to processing incoming data in real time. Large-scale stream processing and analysis require specialized tools and techniques, which have become publicly available in the last couple of years.

This talk will give a deep, technical overview of the top-level Apache stream processing landscape. We compare several frameworks, including Spark, Storm, Samza and Flink. Our goal is to highlight the strengths and weaknesses of the individual systems in a project-neutral manner, to help select the best tool for a specific application. We will touch on API expressivity, runtime architecture, performance, fault tolerance and strong use cases for the individual frameworks.

Published in: Data & Analytics

  1. Large-Scale Stream Processing in the Hadoop Ecosystem
     Gyula Fóra, Márton Balassi
  2. This talk
     § Stream processing by example
     § Open source stream processors
     § Runtime architecture and programming model
     § Counting words…
     § Fault tolerance and stateful processing
     § Closing
     Apache: Big Data Europe, 2015-09-28
  3. Stream processing by example
  4. Streaming applications
     ETL style operations
     • Filter incoming data, Log analysis
     • High throughput, connectors, at-least-once processing
     Window aggregations
     • Trending tweets, User sessions, Stream joins
     • Window abstractions
     [Diagram: several inputs feeding a Process/Enrich step]
  5. Streaming applications
     Machine learning
     • Fitting trends to the evolving stream, Stream clustering
     • Model state, cyclic flows
     Pattern recognition
     • Fraud detection, Triggering signals based on activity
     • Exactly-once processing
  6. Open source stream processors
  7. Apache Streaming landscape
  8. Apache Storm
     § Started in 2010, development driven by BackType, then Twitter
     § Pioneer in large scale stream processing
     § Distributed dataflow abstraction (spouts & bolts)
  9. Apache Flink
     § Started in 2008 as a research project (Stratosphere) at European universities
     § Unique combination of low latency streaming and high throughput batch analysis
     § Flexible operator states and windowing
     [Diagram: stream data from Kafka, RabbitMQ, ...; batch data from HDFS, JDBC, ...]
  10. Apache Spark
      § Started in 2009 at UC Berkeley, Apache since 2013
      § Very strong community, wide adoption
      § Unified batch and stream processing over a batch runtime
      § Good integration with batch programs
  11. Apache Samza
      § Developed at LinkedIn, open sourced in 2013
      § Builds heavily on Kafka's log-based philosophy
      § Pluggable messaging system and execution backend
  12. System comparison
                       Storm            Spark             Samza               Flink
      Streaming model  Native           Micro-batching    Native              Native
      API              Compositional    Declarative       Compositional       Declarative
      Fault tolerance  Record ACKs      RDD-based         Log-based           Checkpoints
      Guarantee        At-least-once    Exactly-once      At-least-once       Exactly-once
      State            Only in Trident  State as DStream  Stateful operators  Stateful operators
      Windowing        Not built-in     Time based        Not built-in        Policy based
      Latency          Very-Low         Medium            Low                 Low
      Throughput       Medium           High              High                High
  13. Runtime and programming model
  14. Native Streaming
  15. Distributed dataflow runtime
      § Storm, Samza and Flink
      § General properties
        • Long-running operators
        • Pipelined execution
        • Usually possible to create cyclic flows
      Pros
        • Full expressivity
        • Low-latency execution
        • Stateful operators
      Cons
        • Fault-tolerance is hard
        • Throughput may suffer
        • Load balancing is an issue
  16. Distributed dataflow runtime
      § Storm
        • Dynamic typing + Kryo
        • Dynamic topology rebalancing
      § Samza
        • Almost every component pluggable
        • Full task isolation, no backpressure (buffering handled by the messaging layer)
      § Flink
        • Strongly typed streams + custom serializers
        • Flow control mechanism
        • Memory management
  17. Micro-batching
  18. Micro-batch runtime
      § Implemented by Apache Spark
      § General properties
        • Computation broken down to time intervals
        • Load aware scheduling
        • Easy interaction with batch
      Pros
        • Easy to reason about
        • High-throughput
        • FT comes for “free”
        • Dynamic load balancing
      Cons
        • Latency depends on batch size
        • Limited expressivity
        • Stateless by nature
  19. Programming model
      Declarative
      § Expose a high-level API
      § Operators are higher order functions on abstract data stream types
      § Advanced behavior such as windowing is supported
      § Query optimization
      Compositional
      § Offer basic building blocks for composing custom operators and topologies
      § Advanced behavior such as windowing is often missing
      § Topology needs to be hand-optimized
  20. Programming model
      DStream, DataStream
      § Transformations abstract operator details
      § Suitable for engineers and data analysts
      Spout, Consumer, Bolt, Task, Topology
      § Direct access to the execution graph / topology
      § Suitable for engineers
  21. Counting words…
  22. WordCount
      Input stream:
        storm budapest flink apache storm spark streaming samza storm flink apache flink bigdata storm flink streaming
      Output counts:
        (storm, 4) (budapest, 1) (flink, 4) (apache, 2) (spark, 1) (streaming, 2) (samza, 1) (bigdata, 1)
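The counts on the WordCount slide can be reproduced with a few lines of plain Java. This is only an illustrative sketch outside any streaming framework: the class name `RollingWordCount` is hypothetical, and the slide's input is treated as one whitespace-separated string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RollingWordCount {
    /** Counts words in arrival order; LinkedHashMap keeps first-seen order. */
    static Map<String, Integer> count(String stream) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : stream.trim().split("\\s+")) {
            counts.merge(word, 1, Integer::sum); // insert 1 or add 1 to the old count
        }
        return counts;
    }

    public static void main(String[] args) {
        String input = "storm budapest flink apache storm spark streaming samza "
                + "storm flink apache flink bigdata storm flink streaming";
        System.out.println(count(input));
        // {storm=4, budapest=1, flink=4, apache=2, spark=1, streaming=2, samza=1, bigdata=1}
    }
}
```

A real stream never ends, so the framework versions on the following slides emit an updated count per incoming word instead of one final map.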
  23. Storm
      Assembling the topology:

          TopologyBuilder builder = new TopologyBuilder();
          // 5 spout tasks feed 8 splitter tasks; shuffle grouping spreads sentences evenly
          builder.setSpout("spout", new SentenceSpout(), 5);
          builder.setBolt("split", new Splitter(), 8).shuffleGrouping("spout");
          // fields grouping sends every occurrence of a word to the same counter task
          builder.setBolt("count", new Counter(), 12)
                 .fieldsGrouping("split", new Fields("word"));

      Rolling word count bolt:

          public class Counter extends BaseBasicBolt {
              // in-memory counts held by this bolt task
              Map<String, Integer> counts = new HashMap<String, Integer>();

              public void execute(Tuple tuple, BasicOutputCollector collector) {
                  String word = tuple.getString(0);
                  Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
                  counts.put(word, count);
                  collector.emit(new Values(word, count));
              }

              // declare the fields of the emitted (word, count) tuples
              public void declareOutputFields(OutputFieldsDeclarer declarer) {
                  declarer.declare(new Fields("word", "count"));
              }
          }
  24. Samza
      Rolling word count task:

          public class WordCountTask implements StreamTask {
              // local key-value store holding the running count per word
              private KeyValueStore<String, Integer> store;

              public void process(IncomingMessageEnvelope envelope,
                                  MessageCollector collector,
                                  TaskCoordinator coordinator) {
                  String word = (String) envelope.getMessage();
                  Integer count = store.get(word);
                  if (count == null) { count = 0; }
                  count = count + 1;
                  store.put(word, count);
                  // emit the updated count to the "wc" stream of the "kafka" system
                  collector.send(new OutgoingMessageEnvelope(
                      new SystemStream("kafka", "wc"), Tuple2.of(word, count)));
              }
          }
  25. Flink
      Rolling word count:

          case class Word (word: String, frequency: Int)

          val lines: DataStream[String] = env.fromSocketStream(...)

          lines.flatMap {line => line.split(" ")
                  .map(word => Word(word,1))}
               .groupBy("word").sum("frequency")
               .print()

      Window word count:

          val lines: DataStream[String] = env.fromSocketStream(...)

          lines.flatMap {line => line.split(" ")
                  .map(word => Word(word,1))}
               // 5-second window, evaluated every second
               .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
               .groupBy("word").sum("frequency")
               .print()
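The window word count on the Flink slide counts words over the last 5 seconds and re-evaluates every second. The plain-Java sketch below imitates that sliding-window behavior on a fixed list of events. It is not Flink code: the `Event` type, the explicit millisecond timestamps, and the half-open window bounds are all assumptions made for illustration, since the slide does not say which notion of time the window uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SlidingWindowCount {
    /** A word observed at a given time in milliseconds (hypothetical helper type). */
    static final class Event {
        final long timeMs;
        final String word;
        Event(long timeMs, String word) { this.timeMs = timeMs; this.word = word; }
    }

    /** Counts words with windowEndMs - sizeMs <= timeMs < windowEndMs. */
    static Map<String, Integer> countWindow(List<Event> events, long windowEndMs, long sizeMs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            if (e.timeMs >= windowEndMs - sizeMs && e.timeMs < windowEndMs) {
                counts.merge(e.word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(500, "storm"));
        events.add(new Event(1500, "flink"));
        events.add(new Event(5200, "flink"));
        events.add(new Event(6100, "storm"));
        // Evaluate a 5-second window once per second, as in the slide's window/every pair.
        for (long end = 1000; end <= 7000; end += 1000) {
            System.out.println("window ending at " + end + " ms: "
                    + countWindow(events, end, 5000));
        }
    }
}
```

Because the window is longer than the slide interval, each event contributes to several consecutive results. That is the main difference from the rolling count, where every word is counted exactly once.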
  26. Spark
      Window word count
      Rolling word count (kind of)
  27. Fault tolerance and stateful processing
  28. Fault tolerance intro
      § Fault-tolerance in streaming systems is inherently harder than in batch
        • Can’t just restart computation
        • State is a problem
        • Fast recovery is crucial
        • Streaming topologies run 24/7 for a long period
      § Fault-tolerance is a complex issue
        • No single point of failure is allowed
        • Guaranteeing input processing
        • Consistent operator state
        • Fast recovery
        • At-least-once vs Exactly-once semantics
  29. Storm record acknowledgements
      § Track the lineage of tuples as they are processed (anchors and acks)
      § Special “acker” bolts track each lineage DAG (efficient xor based algorithm)
      § Replay the root of failed (or timed out) tuples
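The XOR-based tracking mentioned on the Storm slide relies on the fact that `x ^ x == 0`: if every tuple id is XORed into a per-root value once when the tuple is created and once when it is acked, the value returns to zero exactly when the whole tuple tree has been processed. The sketch below is a simplified plain-Java illustration of that idea, not Storm's acker implementation. The class `XorAcker` and its method names are hypothetical, and the small ids are chosen for readability; the scheme depends on ids being random enough that unrelated ids rarely cancel each other.

```java
import java.util.HashMap;
import java.util.Map;

final class XorAcker {
    // root tuple id -> XOR of the ids of all tuples created but not yet acked
    private final Map<Long, Long> pending = new HashMap<>();

    /** Called when a spout emits a new root tuple. */
    void track(long rootId, long tupleId) {
        pending.put(rootId, tupleId);
    }

    /**
     * Called when a bolt acks tuple ackedId after emitting the tuples in anchoredIds,
     * all anchored to the same root. Returns true once the whole tuple tree is acked.
     */
    boolean ack(long rootId, long ackedId, long... anchoredIds) {
        Long current = pending.get(rootId);
        if (current == null) {
            throw new IllegalStateException("unknown or already completed root " + rootId);
        }
        long value = current ^ ackedId;   // acking an id cancels it out
        for (long id : anchoredIds) {
            value ^= id;                  // newly created children become pending
        }
        if (value == 0L) {
            pending.remove(rootId);
            return true;
        }
        pending.put(rootId, value);
        return false;
    }

    boolean isPending(long rootId) {
        return pending.containsKey(rootId);
    }

    public static void main(String[] args) {
        XorAcker acker = new XorAcker();
        acker.track(1L, 10L);                            // spout emits root tuple 10
        System.out.println(acker.ack(1L, 10L, 11L, 12L)); // splitter emits 11 and 12: false
        System.out.println(acker.ack(1L, 11L));           // false
        System.out.println(acker.ack(1L, 12L));           // true, tree fully processed
    }
}
```

The appeal of this design is constant memory per root tuple, regardless of how large the tuple tree grows. A root whose value has not reached zero before a timeout would be replayed from the spout.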
  30. Samza offset tracking
      § Exploits the properties of a durable, offset based messaging layer
      § Each task maintains its current offset, which moves forward as it processes elements
      § The offset is checkpointed and restored on failure (some messages might be repeated)
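The note that some messages might be repeated follows directly from restoring the last checkpointed offset: everything processed after that checkpoint and before the crash is processed again. The plain-Java sketch below simulates this with an in-memory list standing in for the durable log. It is not Samza code, and `OffsetReplay`, `simulate`, `checkpointAt` and `crashAt` are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OffsetReplay {
    /** Processes log[from..to) by appending each message to processed; returns the new offset. */
    static int run(List<String> log, int from, int to, List<String> processed) {
        for (int offset = from; offset < to; offset++) {
            processed.add(log.get(offset));
        }
        return to;
    }

    /**
     * Simulates a task that last checkpointed offset checkpointAt, crashed after
     * processing up to crashAt, and then restarted from the checkpoint.
     * Assumes 0 <= checkpointAt <= crashAt <= log.size().
     */
    static List<String> simulate(List<String> log, int checkpointAt, int crashAt) {
        List<String> processed = new ArrayList<>();
        run(log, 0, crashAt, processed);                // work done before the crash
        int restored = checkpointAt;                    // only this offset survived the crash
        run(log, restored, log.size(), processed);      // replay from the checkpoint
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c", "d");
        System.out.println(simulate(log, 2, 3)); // [a, b, c, c, d]
    }
}
```

Message `c` is seen twice because it was processed after the checkpoint at offset 2 but before the crash. This is the at-least-once guarantee listed for Samza in the comparison table.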
  31. Flink checkpointing
      § Based on consistent global snapshots
      § Algorithm designed for stateful dataflows (minimal runtime overhead)
      § Exactly-once semantics
  32. Spark RDD recomputation
      § Immutable data model with repeatable computation
      § Failed RDDs are recomputed using their lineage
      § Checkpoint RDDs to reduce lineage length
      § Parallel recovery of failed RDDs
      § Exactly-once semantics
  33. State in streaming programs
      § Almost all non-trivial streaming programs are stateful
      § Stateful operators (in essence): $f: (in, state) \to (out, state')$
      § State hangs around and can be read and modified as the stream evolves
      § Goal: get as close as possible to this model while maintaining scalability and fault tolerance
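The stateful operator signature on this slide, a function from an input and the current state to an output and the next state, can be written down directly. The following plain-Java sketch threads the state through a finite list of inputs; `StatefulOperator`, `Step`, `Fn` and `run` are hypothetical names, and a real stream processor would additionally have to partition, persist and recover the state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StatefulOperator {
    /** Result of one invocation: the emitted output and the next state. */
    static final class Step<O, S> {
        final O out;
        final S state;
        Step(O out, S state) { this.out = out; this.state = state; }
    }

    /** The operator function: (in, state) -> (out, next state). */
    interface Fn<I, S, O> {
        Step<O, S> apply(I in, S state);
    }

    /** Feeds the inputs through f one by one, carrying the state along. */
    static <I, S, O> List<O> run(List<I> inputs, S initialState, Fn<I, S, O> f) {
        List<O> outputs = new ArrayList<>();
        S state = initialState;
        for (I in : inputs) {
            Step<O, S> step = f.apply(in, state);
            outputs.add(step.out);
            state = step.state;
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Running sum: the state is the sum so far, the output is the updated sum.
        List<Integer> sums = StatefulOperator.<Integer, Integer, Integer>run(
                Arrays.asList(1, 2, 3), 0,
                (in, state) -> new Step<Integer, Integer>(state + in, state + in));
        System.out.println(sums); // [1, 3, 6]
    }
}
```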
  34. Storm
      § States available only in Trident API
      § Dedicated operators for state updates and queries
      § State access methods
        • stateQuery(…)
        • partitionPersist(…)
        • persistentAggregate(…)
      § It’s very difficult to implement transactional states
      § Exactly-once guarantee
  35. Spark
      § Stateless runtime by design
        • No continuous operators
        • UDFs are assumed to be stateless
      § State can be generated as a separate stream of RDDs: updateStateByKey(…), with $f: (\mathrm{Seq}[in_k], state_k) \to state'_k$
      § $f$ is scoped to a specific key
      § Exactly-once semantics
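The per-key update function on this slide takes all new values for a key in the current micro-batch plus that key's previous state, and returns the key's next state. The plain-Java sketch below imitates one such update step for word counts. It borrows the name `updateStateByKey` only for readability and is not the Spark API: the class `MicroBatchState` is hypothetical, and copying the state for keys that are absent from a batch is a simplification of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MicroBatchState {
    /** Per-key update: (new values for the key, old state or null) -> new state. */
    static int update(List<Integer> newValues, Integer oldState) {
        int sum = (oldState == null) ? 0 : oldState;
        for (int v : newValues) {
            sum += v;
        }
        return sum;
    }

    /** Applies update to one micro-batch and returns a new state map; the old map is not modified. */
    static Map<String, Integer> updateStateByKey(List<String> batch, Map<String, Integer> state) {
        // Group the batch by key: every word contributes the value 1.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String word : batch) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        Map<String, Integer> next = new HashMap<>(state);
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            next.put(e.getKey(), update(e.getValue(), state.get(e.getKey())));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Integer> s0 = new HashMap<>();
        Map<String, Integer> s1 = updateStateByKey(Arrays.asList("storm", "flink", "storm"), s0);
        Map<String, Integer> s2 = updateStateByKey(Arrays.asList("flink", "spark"), s1);
        System.out.println(s1);
        System.out.println(s2);
    }
}
```

Each batch yields a fresh, immutable-by-convention state map derived from the previous one, which mirrors the slide's point that state is a separate stream of datasets and not a mutable field inside a long-running operator.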
  36. Samza
      § Stateful dataflow operators (any task can hold state)
      § State changes are stored as a log by Kafka
      § Custom storage engines can be plugged in to the log
      § $f$ is scoped to a specific task
      § At-least-once processing semantics
  37. Flink
      § Stateful dataflow operators (conceptually similar to Samza)
      § Two state access patterns
        • Local (Task) state
        • Partitioned (Key) state
      § Proper API integration
        • Java: OperatorState interface
        • Scala: mapWithState, flatMapWithState…
      § Exactly-once semantics by checkpointing
  38. Performance
      § Throughput/Latency
        • The cost of a network hop is 25+ ms
        • 1 million records/sec/core is nice
      § Size of Network Buffers/Batching
      § Buffer Timeout
      § Cost of Fault Tolerance
      § Operator chaining/Stages
      § Serialization/Types
  39. Closing
  40. Comparison revisited
                       Storm            Spark             Samza               Flink
      Streaming model  Native           Micro-batching    Native              Native
      API              Compositional    Declarative       Compositional       Declarative
      Fault tolerance  Record ACKs      RDD-based         Log-based           Checkpoints
      Guarantee        At-least-once    Exactly-once      At-least-once       Exactly-once
      State            Only in Trident  State as DStream  Stateful operators  Stateful operators
      Windowing        Not built-in     Time based        Not built-in        Policy based
      Latency          Very-Low         Medium            Low                 Low
      Throughput       Medium           High              High                High
  41. Summary
      § Streaming applications and stream processors are very diverse
      § 2 main runtime designs
        • Dataflow based (Storm, Samza, Flink)
        • Micro-batch based (Spark)
      § The best framework varies based on application specific needs
      § But high-level APIs are nice :)
  42. Thank you!