
Have your cake and eat it too

Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well-designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.

  1. Have Your Cake and Eat It Too: Architectures for Batch and Stream Processing
  2. Stuff We'll Talk About
     • Why do we need both streams and batches?
     • Why is that a problem?
     • Stream-only patterns (i.e., the Kappa Architecture)
     • Lambda-Architecture technologies:
       – SummingBird
       – Apache Spark
       – Apache Flink
       – Bring-your-own-framework
  3. About Me
     • 15 years of moving data
     • Formerly a consultant
     • Now a Cloudera engineer:
       – Sqoop committer
       – Kafka
       – Flume
     • @gwenshap
  4. Why Streaming and Batch
  5. Batch Processing
     • Store data somewhere
     • Read large chunks of data
     • Do something with the data
     • Sometimes store the results
  6. Batch Examples
     • Analytics
     • ETL / ELT
     • Training machine-learning models
     • Recommendations
  7. Stream Processing
     • Listen to incoming events
     • Do something with each event
     • Maybe store the events / results
  8. Stream Processing Examples
     • Anomaly detection, alerts
     • Monitoring, SLAs
     • Operational intelligence
     • Analytics, dashboards
     • ETL
  9. Streaming & Batch
     (diagram of use cases spanning the two worlds: alerts, monitoring/SLAs, operational intelligence, risk analysis, anomaly detection, analytics, ETL)
  10. Four Categories
      • Streams only
      • Batch only
      • Can be done in both
      • Must be done in both
      (diagram labels: ETL, some analytics)
  11. ETL
      Most stream-processing projects I see involve a few simple transformations:
      • Currency conversion
      • JSON to Avro
      • Field extraction
      • Joining a stream to a static data set
      • Aggregating on a window
      • Identifying a change in trend
      • Document indexing
      (a sketch of the first few follows)
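      Such transformations are just per-record functions. A minimal sketch, with a hypothetical Purchase event type and rate table, of a few items on that list: field use, currency conversion, and a join against a static data set. Because each step is a pure function, the same code can run per-event in a stream or per-record in a batch.

      // Hypothetical event type and static rate table, for illustration only.
      case class Purchase(userId: String, amount: Double, currency: String)

      val usdRates: Map[String, Double] =                // static data set, joined per event
        Map("EUR" -> 1.1, "GBP" -> 1.3, "USD" -> 1.0)

      def toUsd(p: Purchase): Purchase =                 // currency conversion
        p.copy(amount = p.amount * usdRates.getOrElse(p.currency, 1.0), currency = "USD")

      // The same pure function works on a batch collection or on each streamed event:
      List(Purchase("u1", 10.0, "EUR"), Purchase("u2", 5.0, "GBP"))
        .map(toUsd)
        .foreach(println)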
  12. Batch || Streaming
      Batch is:
      • Efficient:
        – Lower CPU utilization
        – Better network and disk throughput
        – Fewer locks and waits
      • Easier administration
      • Easier integration with RDBMSs
      • Existing expertise
      • Existing tools
      Streaming gives you:
      • Real-time information
  13. The Problem
  14. We Like
      • Efficiency
      • Scalability
      • Fault tolerance
      • Recovery from errors
      • Experimenting with different approaches
      • Debuggers
      • Cookies
  15. But… we don't like maintaining two applications that do the same thing.
  16. Do we really need to maintain the same app twice?
      Yes, because:
      • We are not sure about the requirements
      • We sometimes need to re-process with very high efficiency
      Not really, because:
      • We can use different apps for batch and streaming
      • We can re-process with streams
      • We can error-correct with streams
      • We can maintain one code base for batches and streams
  17. Stream-Only Patterns (Kappa Architecture)
  18. DWH Example
      (diagram: an OLTP database and sensor/log sources feed two applications. App 1 does stream processing into real-time fact tables; App 2 does an occasional load into a partitioned fact table. Dimensions, views, and aggregates complete the warehouse.)
  19. We Need to Fix Older Data
      (diagram: Streaming App v1 keeps feeding the real-time table while a new Streaming App v2 re-processes history, writing a replacement partition into the partitioned fact table)
  20. We Need to Fix Older Data
      (same diagram: the next build step of the animation)
  21. We Need to Fix Older Data
      (final build: with the replacement partitions swapped in, v1 is retired and Streaming App v2 alone feeds the real-time table; a sketch of the rewind step follows)
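      The mechanics behind this pattern: since Kafka retains history, "fixing older data" amounts to deploying the new app version under a fresh consumer group and rewinding it to the start of the topic, while v1 keeps serving live traffic from its own offsets. A minimal sketch, assuming a topic named "events" and the plain Kafka consumer API (the v2 processing logic itself is elided):

      import java.util.Properties
      import scala.jdk.CollectionConverters._
      import org.apache.kafka.clients.consumer.KafkaConsumer

      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("group.id", "app-v2")   // separate group => separate offsets from v1
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

      val consumer = new KafkaConsumer[String, String](props)
      consumer.subscribe(List("events").asJava)
      consumer.poll(java.time.Duration.ofMillis(0))     // join the group, get an assignment
      consumer.seekToBeginning(consumer.assignment())   // rewind to re-process history

      while (true) {
        for (record <- consumer.poll(java.time.Duration.ofMillis(500)).asScala) {
          // v2 logic writes into the replacement partitions / real-time table
          println(s"${record.offset}: ${record.value}")
        }
      }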
  22. Lambda-Architecture Technologies
  23. WordCount in Scala
      source.flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .print()
  24. SummingBird
  25. MapReduce Was Great Because…
      It is a very simple abstraction:
      – Map
      – Shuffle
      – Reduce
      – Type-safe
      And even simpler abstractions were built on top of it.
  26. SummingBird
      • Multi-stage MapReduce
      • Runs on Hadoop, Spark, Storm
      • Makes it very easy to combine batch and streaming results
  27. API
      • Platform – Storm, Scalding, Spark…
      • Producer.source(Platform) <- get data
      • Producer – a collection of events
      • Transformations – map, filter, merge, leftJoin (lookup)
      • Output – write(sink), sumByKey(store)
      • Store – contains the aggregate for each key, and the reduce operation
  28. Associative Reduce
      (a sketch of why associativity matters follows)
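      The point of the slide, sketched in plain Scala (names illustrative): sumByKey demands an associative reduce because associativity is what lets partial aggregates, computed independently per batch or per stream shard, be merged into the same answer as one global aggregation.

      // Aggregate a keyed collection into per-key sums.
      def sumByKey(xs: Seq[(String, Long)]): Map[String, Long] =
        xs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

      // Merge two partial results; relies on + being associative.
      def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
        b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

      val counts = Seq(("cake", 1L), ("eat", 1L), ("cake", 1L), ("too", 1L))
      val (left, right) = counts.splitAt(2)   // arbitrary split, e.g. batch vs. stream

      // Merging the partials gives the same answer as aggregating everything at once:
      assert(merge(sumByKey(left), sumByKey(right)) == sumByKey(counts))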
  29. WordCount in SummingBird
      def wordCount[P <: Platform[P]]
          (source: Producer[P, String], store: P#Store[String, Long]) =
        source.flatMap { sentence =>
          toWords(sentence).map(_ -> 1L)
        }.sumByKey(store)

      val stormTopology = Storm.remote("stormName").plan(wordCount)
      val hadoopJob = Scalding("scaldingName").plan(wordCount)
  30. Spark Streaming
  31. First, There Was the RDD
      • Spark is its own execution engine
      • With a high-level API
      • RDDs are sharded collections
      • They can be mapped, reduced, grouped, filtered, etc.
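      A minimal sketch of those bullets, using Spark in local mode (app name illustrative): an RDD is created from a local collection, transformed lazily, and only computed when an action runs.

      import org.apache.spark.{SparkConf, SparkContext}

      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-demo"))

      val nums = sc.parallelize(1 to 10, numSlices = 2)   // a sharded collection
      val evens = nums.filter(_ % 2 == 0)                 // transformations are lazy
      println(evens.map(_ * 10).reduce(_ + _))            // an action triggers execution: 300

      sc.stop()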
  32. Spark Streaming
      (diagram: a DStream is a sequence of RDDs. Before the first batch, a Receiver connects to the Source; in each batch interval it emits an RDD, and the same single-pass pipeline of Filter, Count, Print runs over that batch's RDD.)
  33. Spark Streaming, Stateful
      (same diagram, plus state: each batch folds its results into a stateful RDD, which is carried forward and combined with the next batch; a sketch follows)
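      A minimal sketch of that stateful picture, assuming a socket source, using Spark Streaming's updateStateByKey to carry per-key counts from batch to batch:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc = new StreamingContext(
        new SparkConf().setMaster("local[2]").setAppName("stateful"), Seconds(1))
      ssc.checkpoint("/tmp/checkpoints")   // state requires a checkpoint directory

      val pairs = ssc.socketTextStream("localhost", 9999)
        .flatMap(_.split(" "))
        .map((_, 1))

      // Merge this batch's values into the running total for each key.
      val totals = pairs.updateStateByKey[Int] { (batch: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + batch.sum)
      }
      totals.print()

      ssc.start()
      ssc.awaitTermination()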
  34. Compared to SummingBird
      Differences:
      • Micro-batches
      • A completely new execution model
      • Real joins
      • Reduce is not limited to Monoids
      • Spark Streaming has a richer API
      • SummingBird can aggregate batch and stream into one dataset
      • Spark Streaming runs in a debugger
      Similarities:
      • Almost the same code runs in batch and in streams
      • Use of Scala
      • Use of functional-programming concepts
  35. Spark Example
      val conf = new SparkConf().setMaster("local[2]")
      val sc = new SparkContext(conf)
      val lines = sc.textFile(path, 2)
      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)
      wordCounts.collect().foreach(println)   // RDDs have no print(); collect, then print
  36. Spark Streaming Example
      val conf = new SparkConf().setMaster("local[2]")
      val ssc = new StreamingContext(conf, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)
      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)
      wordCounts.print()
      ssc.start()
      ssc.awaitTermination()   // keep the streaming context running
  37. Apache Flink
  38. Execution Model
      You don’t want to know.
  39. Flink vs. Spark Streaming
      Differences:
      • Flink streams event by event; events flow through the pipeline
      • Spark Streaming has good integration with HBase as a state store
      • "Checkpoint barriers"
      • Optimization based on strong typing
      • Flink is newer than Spark Streaming, so there is less production experience
      Similarities:
      • Very similar APIs
      • Built-in stream-specific operators (windows); see the sketch below
      • Exactly-once guarantees through checkpoints of offsets and state (Flink is limited to small state for now)
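      A minimal sketch of those built-in window operators, assuming a socket source and Flink's Scala API (operator names vary somewhat across Flink versions): word counts over five-second tumbling windows.

      import org.apache.flink.streaming.api.scala._
      import org.apache.flink.streaming.api.windowing.time.Time

      val env = StreamExecutionEnvironment.getExecutionEnvironment

      val counts = env.socketTextStream("localhost", 9999)
        .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
        .map((_, 1))
        .keyBy(_._1)                    // key the stream by word
        .timeWindow(Time.seconds(5))    // tumbling processing-time window
        .sum(1)                         // sum the counts within each window

      counts.print()
      env.execute("WindowedWordCount")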
  40. WordCount, Batch
      val env = ExecutionEnvironment.getExecutionEnvironment
      val text = getTextDataSet(env)
      val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
        .map { (_, 1) }
        .groupBy(0)
        .sum(1)
      counts.print()
      env.execute("Wordcount Example")
  41. WordCount, Streaming
      // socketTextStream lives on the streaming environment
      val env = StreamExecutionEnvironment.getExecutionEnvironment
      val text = env.socketTextStream(host, port)
      val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
        .map { (_, 1) }
        .groupBy(0)
        .sum(1)
      counts.print()
      env.execute("Wordcount Example")
  42. Bring Your Own Framework
  43. If the requirements are simple…
  44. How Difficult Is It to Parallelize Transformations?
      Simple transformations are simple.
  45. Just Add Kafka
      • Kafka is a reliable data source
      • You can read: batches, micro-batches, streams (see the sketch below)
      • It also allows for re-partitioning
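      A minimal sketch, assuming a topic named "events" and the plain Kafka consumer API, of how the same topic serves both read styles: a stream polls forever, while a batch captures the end offsets at start and stops once it catches up to them.

      import java.time.Duration
      import scala.jdk.CollectionConverters._
      import org.apache.kafka.clients.consumer.KafkaConsumer

      // Stream: just poll forever.
      def readAsStream(consumer: KafkaConsumer[String, String]): Unit = {
        consumer.subscribe(List("events").asJava)
        while (true)
          consumer.poll(Duration.ofMillis(500)).asScala.foreach(r => println(r.value))
      }

      // Batch: consume only up to the offsets that existed when the job started.
      def readAsBatch(consumer: KafkaConsumer[String, String]): Unit = {
        consumer.subscribe(List("events").asJava)
        consumer.poll(Duration.ofMillis(0))                      // join group, get partitions
        val stopAt = consumer.endOffsets(consumer.assignment())  // the batch boundary: "now"
        var done = false
        while (!done) {
          consumer.poll(Duration.ofMillis(500)).asScala.foreach(r => println(r.value))
          done = stopAt.asScala.forall { case (tp, end) => consumer.position(tp) >= end }
        }
      }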
  46. Cluster Management
      • Managing cluster resources used to be difficult
      • Now we have:
        – YARN
        – Mesos
        – Docker
        – Kubernetes
  47. So Your App Should…
      • Allocate resources and track tasks with YARN / Mesos
      • Read from Kafka (however often you want)
      • Do simple transformations
      • Write to Kafka / HBase
      • How difficult can it possibly be? (a sketch follows)
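      The heart of such an app, as a minimal sketch (topic names and the toUpperCase transform are placeholders; resource management and error handling are elided):

      import java.time.Duration
      import java.util.Properties
      import scala.jdk.CollectionConverters._
      import org.apache.kafka.clients.consumer.KafkaConsumer
      import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

      val consumerProps = new Properties()
      consumerProps.put("bootstrap.servers", "localhost:9092")
      consumerProps.put("group.id", "byof-etl")
      consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

      val producerProps = new Properties()
      producerProps.put("bootstrap.servers", "localhost:9092")
      producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

      val consumer = new KafkaConsumer[String, String](consumerProps)
      val producer = new KafkaProducer[String, String](producerProps)
      consumer.subscribe(List("raw-events").asJava)

      // Consume, transform, produce: the whole "framework".
      while (true) {
        for (rec <- consumer.poll(Duration.ofMillis(500)).asScala) {
          val transformed = rec.value.toUpperCase   // stand-in for a simple transformation
          producer.send(new ProducerRecord("clean-events", rec.key, transformed))
        }
      }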
  48. Parting Thoughts
  49. Good Engineering Lessons
      • DRY – do you really need the same code twice?
      • Error correction is critical
      • Reliability guarantees are critical
      • Debuggers are really nice
      • Latency / throughput trade-offs
      • Use existing expertise
      • Stream processing is about patterns
  50. Thank you
