
Dive into Spark Streaming


A walk-through of the Spark Streaming API and insights into how it works internally. Presented at the Spark Belgium Meetup. (The presentation included a live demo on backpressure.)


Dive into Spark Streaming

  1. Dive into Spark Streaming. Gerard Maas, Data Processing Team Lead at Virdata. gerard.maas@virdata.com, @maasg
  2. Virdata: a 'born on the cloud' IoT Managed Services & Analytics Platform
  3. A 'born on the cloud' IoT Managed Services & Analytics Platform (timeline diagram): 2012, 2013, 2014; affiliate and certified partner badges
  4. Virdata's Stack (diagram): devices publish over MQTT / WebSockets into a pub/sub layer (Kafka); data flows into raw storage and processed storage, with a query layer on top.
  5. Virdata's Full Stack / Virdata's Spark as a Service (diagram): how Spark is driving a new loosely-coupled stand-alone service; the stack adds a notebook server alongside the pub/sub, storage, and query layers.
  6. 100 TB vs. 5 MB
  7. 100 TB vs. 5 MB/second
  8. Agenda: What is Spark Streaming?; Programming Model; Demo 1; Execution Model; Demo 2; Resources; Q/A
  9. Apache Spark
  10. Spark Streaming: a scalable, fault-tolerant stream processing system. Sources: Kafka, Flume, Kinesis, Twitter, sockets, HDFS/S3, custom streams. Sinks: databases, HDFS, server applications.
  11. Spark: RDD Operations (diagram): input data (HDFS text/sequence files) is loaded through the SparkContext into an RDD; transformations are lazily evaluated; an action materializes the output (HDFS text/sequence files, Cassandra).
  12. Spark Streaming: DStreams (diagram): a receiver turns the incoming stream into RDDs; as with plain RDDs, transformations are lazily evaluated until an action runs.
  13. Spark Streaming
  14. Spark Streaming: a DStream[T] is a sequence of RDD[T], one per batch interval (t0, t1, t2, t3, ..., ti, ti+1).
  15. Transformations apply to each RDD[T] of a DStream[T], producing a DStream[U] of RDD[U].
  16. Actions then run on every RDD produced by those transformations.
  17. Transformations: map, flatMap, filter; count, reduce, countByValue, reduceByKey; union, join, cogroup (a minimal end-to-end sketch follows below).
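A minimal sketch of how these pieces fit together, assuming a socket text source on localhost:9999 (the host, port, batch interval, and word-count logic are illustrative, not from the deck):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[*]")
    // Each batch interval (here 5 seconds) produces one RDD of every DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations are lazy: each step just defines a new DStream.
    val pairs = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // An action (print) is what forces each batch to be computed.
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The state and window sketches further below reuse `ssc` and `pairs` from this snippet.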
  18. Transformations: transform applies an arbitrary RDD-to-RDD function to each batch, e.g. joining the stream against a Cassandra table:

        val iotDstream = MQTTUtils.createStream(...)
        val devicePriority = sparkContext.cassandraTable(...)
        val prioritizedDStream = iotDstream.transform { rdd =>
          rdd.map(d => (d.id, d)).join(devicePriority)
        }
  19. Transformations: updateStateByKey maintains per-key state across batches (sketch below).
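The slide only names the API; here is a minimal sketch of updateStateByKey keeping a running count per key, reusing the `pairs` stream from the earlier sketch (the checkpoint path is illustrative):

```scala
// Stateful DStream operations require checkpointing for the state lineage.
ssc.checkpoint("/tmp/checkpoints") // illustrative path

// Fold this batch's values for a key into the previous state of that key.
def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

val runningCounts = pairs.updateStateByKey(updateCount _)
runningCounts.print()
```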
  20. Transformations: trackStateByKey, the faster per-key state API introduced with Spark 1.6 (sketch below).
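trackStateByKey was the working name of this API; it shipped in Spark 1.6 as mapWithState. A sketch of the released form, again over the `pairs` stream (the function and variable names are mine):

```scala
import org.apache.spark.streaming.{State, StateSpec}

// The mapping function sees one (key, value) pair plus its state per element,
// updates the state in place, and emits a mapped record.
def trackCount(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

val tracked = pairs.mapWithState(StateSpec.function(trackCount _))
tracked.print()
```

Unlike updateStateByKey, which touches every key on every batch, mapWithState only processes the keys present in the current batch, which is where the speedup comes from.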
  21. Actions: print dumps a sample of each batch to stdout:

        -------------------------------------------
        Time: 1459875469000 ms
        -------------------------------------------
        data1
        data2

      saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles write each batch out; and foreachRDD, the general-purpose escape hatch, is covered next.
  22. Actions - foreachRDD: gives you the batch RDD itself, opening the door to DataFrames, Spark SQL, MLlib, GraphX, databases, ...

        dstream.foreachRDD { rdd =>
          ...
        }
  23. Actions - foreachRDD usage:

        dstream.foreachRDD { rdd =>
          rdd.cache()
          val alternatives = restServer.get("/v1/alternatives").toSet
          alternatives.foreach { alternative =>
            val byAlternative = rdd.filter(element => element.kind == alternative)
            val asRecords = byAlternative.map(element => asRecord(element))
            val conn = DB.connect(server)
            asRecords.foreachPartition { partition =>
              partition.foreach(element => conn.insert(element))
            }
          }
          rdd.unpersist(true)
        }
  24. Actions - foreachRDD usage: the body of foreachRDD executes on the driver; only the code inside RDD operations executes on the workers. In the snippet above, restServer.get and DB.connect run on the driver, while the filter, map, and foreachPartition bodies run on the workers, so the connection created on the driver must be serialized to them (and most connections are not serializable).
  25. Actions - foreachRDD usage, corrected: create the connection on the workers, inside foreachPartition, so it never has to be serialized:

        asRecords.foreachPartition { partition =>
          val conn = DB.connect(server)
          partition.foreach(element => conn.insert(element))
        }

      A connection-pooling refinement follows below.
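One step further than the slide: opening a connection per partition per batch still adds up, so a common refinement is a lazily initialized singleton per worker JVM. A sketch, where ConnectionPool is a name of mine and DB.connect / asRecord are the placeholder client calls from the slides:

```scala
// Hypothetical per-JVM holder: `lazy val` initializes once per executor,
// on first use, and is then reused across partitions and batches.
object ConnectionPool {
  lazy val conn = DB.connect(server)
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val conn = ConnectionPool.conn // resolved locally on the worker
    partition.foreach(element => conn.insert(asRecord(element)))
  }
}
```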
  26.–28. Windows - Sliding (animated diagram): dstream.window(windowLength = 6, slideInterval = 3) over the elements 1..14 yields the overlapping windows [1,2,3,4,5,6], [4,5,6,7,8,9], [7,8,9,10,11,12], ...
  29. Windows - Non-Overlapping (diagram): dstream.window(windowLength = 6, slideInterval = 6) yields back-to-back windows [1,2,3,4,5,6], [7,8,9,10,11,12], ...
  30. Windows - Operations: window, countByWindow, reduceByWindow, reduceByKeyAndWindow, countByValueAndWindow (sketch below).
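In code, window length and slide are given as durations, and both must be multiples of the batch interval. A sketch over the `pairs` stream (durations are illustrative):

```scala
import org.apache.spark.streaming.Seconds

// The last 30 seconds of data, recomputed every 10 seconds.
val windowed = pairs.window(Seconds(30), Seconds(10))

// Or fuse the reduce with the windowing in one operation:
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
```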
  31.–32. Windows - Inverse Function Optimization: reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) (diagram): to slide from window [1..6] to [4..9], func folds in the entering elements 7, 8, 9 and invFunc subtracts the leaving elements 1, 2, 3, instead of re-reducing the whole window from scratch.
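The same windowed count as before, using the inverse-function variant; it requires checkpointing (enabled in the updateStateByKey sketch), and invFunc must exactly undo func (durations again illustrative):

```scala
val counts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // func: fold in data entering the window
  (a: Int, b: Int) => a - b, // invFunc: remove data leaving the window
  Seconds(30), Seconds(10))
```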
  33. Top 3 starter issues: serialization; where is my closure executing?; do I have enough cores? (See the sketch below.)
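A sketch of the first two pitfalls, with a made-up non-serializable client class: referencing a field inside a transformation captures the whole enclosing object, which Spark then tries to serialize and ship to the workers.

```scala
import org.apache.spark.streaming.dstream.DStream

// Stand-in for any client object that is not Serializable.
class NotSerializableClient { def tag: String = "x" }

class Pipeline(ds: DStream[String]) {
  val client = new NotSerializableClient

  // Broken: the lambda references a field, so the closure captures `this`,
  // and shipping it to the workers throws NotSerializableException.
  def broken: DStream[String] = ds.map(s => s + client.tag)

  // Fixed: copy what you need into a local val; only that value is captured.
  def fixed: DStream[String] = {
    val tag = client.tag
    ds.map(s => s + tag)
  }
}
```

For the third issue: each receiver permanently occupies one core, so an application with n receivers needs at least n + 1 cores, or the receivers will starve the batch processing (local[1] with one receiver processes nothing).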
  34. Demo 1: Anatomy of a Spark Streaming Application
  35. Tweet a few keywords about your interests and experience, using the hashtag #sparkbe, e.g.: Scala Spark Streaming DistributedSystems #sparkbe
  36. Ready to dive in?
  37. Deployment Options (diagram): Local (spark.master=local[*]); Standalone cluster with a master and workers (spark.master=spark://host:port); using a cluster manager such as Mesos (spark.master=mesos://host:port). M = master, W = worker, D = driver.
  38. Deployment Options (diagram, continued): the same three modes, showing how receivers (Rec) and executors (Exec) are placed across the workers. (Config sketch below.)
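Only the master URL changes between these modes. A sketch (host and ports are placeholders):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-app")
  .setMaster("spark://host:7077") // standalone; or "local[*]", or "mesos://host:5050"
```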
  39. Scheduling (diagram): at t0, the consumers fill batch #0.
  40. Scheduling (diagram): at t1, batch #1 accumulates while Spark processes batch #0; the system is stable as long as process time < batch interval.
  41. Scheduling (diagram): when processing takes longer than the batch interval, batches #1, #2, #3, ... queue up and the scheduling delay keeps growing.
  42. From Streams to μbatches: the consumer groups incoming data into blocks every blockInterval and into a batch every batchInterval, so #partitions = receivers x batchInterval / blockInterval.
  43.–49. From Streams to μbatches (animated diagram sequence): the blocks produced every blockInterval become the partitions of each batch's RDD, which are then spread across the Spark executors; hence #partitions = receivers x batchInterval / blockInterval.
  50. From Streams to μbatches: solving the same relation for the block interval: spark.streaming.blockInterval = batchInterval x receivers / (partitionFactor x sparkCores) (worked example below).
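The arithmetic with illustrative numbers, taking partitionFactor as the target number of partitions per core per batch:

```scala
import org.apache.spark.SparkConf

val batchIntervalMs = 5000 // 5s batches
val receivers       = 2
val sparkCores      = 8    // cores left for processing, after the receivers
val partitionFactor = 2    // desired partitions per core per batch

// blockInterval = batchInterval x receivers / (partitionFactor x sparkCores)
val blockIntervalMs = batchIntervalMs * receivers / (partitionFactor * sparkCores)
// = 5000 * 2 / (2 * 8) = 625 ms

val conf = new SparkConf().set("spark.streaming.blockInterval", s"${blockIntervalMs}ms")
```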
  51. The Importance of Caching:

        dstream.foreachRDD { rdd =>
          rdd.cache() // cache the RDD before iterating!
          keys.foreach { key =>
            // keyOf extracts the grouping key of an element (pseudocode)
            rdd.filter(elem => keyOf(elem) == key).saveAsFooBar(...)
          }
          rdd.unpersist()
        }

      Without the cache, each pass over keys would recompute the RDD from its source.
  52. The Receiver model: rate limiting via spark.streaming.receiver.maxRate; fault tolerance? Enable the Write-Ahead Log (WAL). (Config sketch below.)
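The two knobs named on the slide, as configuration (values are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Throttle each receiver to at most 10,000 records per second.
  .set("spark.streaming.receiver.maxRate", "10000")
  // Persist received data to a write-ahead log before acknowledging it,
  // so blocks survive a driver or receiver failure.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

// The WAL lives under the checkpoint directory, so checkpointing must be on:
ssc.checkpoint("hdfs:///checkpoints") // illustrative path
```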
  53. The Receiver model. Source: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
  54. Direct Kafka Stream (diagram): the driver computes the Kafka offset ranges for each batch (compute(offsets)); the executors then read those ranges from Kafka directly, without a receiver.
  55. Kafka: the receiver-less model: simplified parallelism, efficiency, exactly-once semantics, fewer degrees of freedom. Rate limiting via spark.streaming.kafka.maxRatePerPartition.

        val directKafkaStream = KafkaUtils.createDirectStream[
          [key class], [value class], [key decoder class], [value decoder class]](
          streamingContext, [map of Kafka parameters], [set of topics to consume])

      A filled-in example follows below.
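The template above, filled in for the common case of String keys and values (broker list and topic are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")

// One RDD partition per Kafka topic-partition; offsets are tracked by
// Spark Streaming itself rather than by ZooKeeper.
val directKafkaStream =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
```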
  56. Kafka: the receiver-less model. Source: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
  57. Delivery Semantics:
      • Spark Streaming receiver-based (< v1.2): roughly at least once
      • Receiver with WAL: at least once + zero data loss
      • Direct: at least once + zero data loss
      • Direct + offset management: exactly once, given idempotent writes or transactions
  58. Spark Streaming (v1.5) made reactive: backpressure support, driven by a proportional-integral-derivative (PID) controller (config below).
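Enabling it is a single flag; the PID gains are also tunable, but the default controller usually suffices:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Let the PID rate estimator adapt the ingestion rate to the rate at
  // which batches are actually processed.
  .set("spark.streaming.backpressure.enabled", "true")
```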
  59. Demo 2: Spark Streaming Performance
  60. Resources:
      Spark Streaming Official Programming Guide: http://spark.apache.org/docs/latest/streaming-programming-guide.html
      Backpressure in Spark Streaming: http://blog.garillot.net/post/121183250481/a-quick-update-on-spark-streaming-work-since-i
      Virdata's Spark Streaming tuning guide: http://www.virdata.com/tuning-spark/
      Spark Summit presentations: https://spark-summit.org/
      Diving into Spark Streaming's Execution Model: https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
      Kafka direct approach: https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
  61. Questions?
  62. Thanks! Gerard Maas, @maasg. We're hiring!
