Spark Streaming


by Tathagata Das

Published in: Data & Analytics

  1. 1. Spark Streaming: What’s new and what’s coming next. Tathagata “TD” Das, AMP Camp 6, @tathadas
  2. 2. Who am I? Project Management Committee (PMC) member of Spark; started Spark Streaming in AMPLab; current technical lead of Spark Streaming; software engineer at Databricks.
  3. 3. Spark Streaming: scalable, fault-tolerant stream processing system. Sources: Kafka, Flume, Kinesis, HDFS/S3, Twitter. Sinks: file systems, databases, dashboards. High-level API: joins, windows, …; often 5x less code. Fault-tolerant: exactly-once semantics, even for stateful ops. Integration: integrates with MLlib, Spark SQL, DataFrames.
  4. 4. What can you use it for? Detect fraud in transactions in real time; react to anomalies in sensors in real time; find cat videos in tweets as soon as they go viral.
  5. 5. Spark Streaming: receivers receive data streams and chop them up into batches; Spark processes the batches and pushes out the results. [Diagram: data streams → receivers → batches → results]
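
The batching step described on this slide can be pictured with a small plain-Python model. This is only a sketch of the idea, not Spark code: the function name `discretize` and the representation of records as `(timestamp, value)` pairs are assumptions made for illustration.

```python
from collections import defaultdict


def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive fixed-length batches.

    Batch k holds the values whose timestamp t satisfies
    k * batch_interval <= t < (k + 1) * batch_interval.
    Intervals without data produce empty batches, so the result has no gaps.
    Timestamps are assumed to be non-negative.
    """
    if not records:
        return []
    buckets = defaultdict(list)
    for timestamp, value in records:
        buckets[int(timestamp // batch_interval)].append(value)
    last_batch = max(buckets)
    return [buckets.get(k, []) for k in range(last_batch + 1)]


# Four records arriving over 3.5 seconds, chopped into 1-second batches.
batches = discretize([(0.1, "a"), (0.9, "b"), (1.2, "c"), (3.5, "d")], 1.0)
```

Each element of `batches` plays the role of one RDD in the stream; a processing function is then applied to every batch in turn.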
  6. 6. Word Count with Kafka
     val context = new StreamingContext(conf, Seconds(1))   // entry point of streaming functionality
     val lines = KafkaUtils.createStream(context, ...)      // create DStream from Kafka data
     Discretized Stream (DStream): the basic abstraction of Spark Streaming, a series of RDDs representing a stream of data.
     [Diagram: the lines DStream as a series of RDDs: batch @ t, batch @ t+1, batch @ t+2]
  7. 7. Word Count with Kafka
     val context = new StreamingContext(conf, Seconds(1))
     val lines = KafkaUtils.createStream(context, ...)
     val words = lines.flatMap(_.split(" "))   // split lines into words
     [Diagram: flatMap turns each RDD of the lines DStream (RDD @ t, RDD @ t+1, RDD @ t+2) into the corresponding RDD of the words DStream]
  8. 8. Word Count with Kafka
     val context = new StreamingContext(conf, Seconds(1))
     val lines = KafkaUtils.createStream(context, ...)
     val words = lines.flatMap(_.split(" "))
     val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)   // count the words
     wordCounts.print()   // print some counts on screen
     context.start()      // start receiving and transforming the data
  9. 9. Word Count with Kafka
     val context = new StreamingContext(conf, Seconds(1))
     val lines = KafkaUtils.createStream(context, ...)
     val words = lines.flatMap(_.split(" "))
     val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
     wordCounts.foreachRDD(rdd => /* do something */ )   // push data out to storage systems
     context.start()
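
To make the per-batch semantics of the word count concrete, here is a plain-Python sketch that applies the same three steps (split lines, pair each word with 1, sum per word) to each batch independently. It does not use Spark; `word_count`, `run_stream` and the `sink` callback are hypothetical names standing in for the DStream pipeline and `foreachRDD`.

```python
from collections import Counter


def word_count(batch):
    """Count words in one batch of lines (split, pair with 1, sum per word)."""
    words = (word for line in batch for word in line.split(" ") if word)
    return Counter(words)


def run_stream(batches, sink):
    """Apply the same word count to every batch and hand each result to `sink`."""
    for batch_time, batch in enumerate(batches):
        sink(batch_time, dict(word_count(batch)))


# Collect per-batch results in memory instead of writing to a storage system.
results = []
run_stream(
    [["to be or not to be"], ["be quick"], []],
    lambda batch_time, counts: results.append((batch_time, counts)),
)
```

Note that counts are not carried over between batches: the second batch reports `be` once, regardless of the first batch. Carrying information across batches is what the window and state operations are for.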
  10. 10. Many transformations
      Window operations:
      words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, Minutes(1))
      Arbitrary stateful processing:
      def stateUpdateFunc(newData, lastState) => updatedState   // schematic signature
      val stateStream = keyValueDStream.updateStateByKey(stateUpdateFunc)
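
A windowed reduction combines the results of the most recent batches rather than a single one. The following plain-Python sketch shows the idea for per-key sums; it is a model only, with the window expressed as a number of batches (an assumption, since the slide expresses it as a duration) and `windowed_counts` a hypothetical name.

```python
from collections import Counter, deque


def windowed_counts(batch_counts, window_length):
    """For each batch, yield the per-key sum over the last `window_length` batches."""
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    for counts in batch_counts:
        window.append(Counter(counts))
        total = Counter()
        for batch in window:
            total.update(batch)  # adds counts key by key
        yield dict(total)


# Window of 2 batches sliding forward one batch at a time.
windows = list(windowed_counts([{"a": 1}, {"a": 2, "b": 1}, {"b": 3}], 2))
```

With a 1-second batch interval, a one-minute window as on the slide would correspond to `window_length=60` in this model.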
  11. 11. Integrates with Spark Ecosystem. [Diagram: Spark Streaming, Spark SQL, DataFrames, MLlib and GraphX on top of Spark Core]
  12. 12. Combine batch and streaming processing: join data streams with static data sets
      // Create data set from Hadoop file
      val dataset = sparkContext.hadoopFile("file")
      // Join each batch in stream with the dataset
      kafkaStream.transform { batchRDD =>
        batchRDD.join(dataset).filter(...)
      }
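
What the join inside `transform` does to each batch can be shown with a plain-Python sketch. It models a batch as a list of `(key, value)` pairs and the static data set as a dictionary; both representations and the name `join_with_static` are assumptions for illustration, not Spark API.

```python
def join_with_static(batch, static):
    """Inner-join one batch of (key, value) pairs with a static key -> value mapping.

    Keys that are missing from the static data set are dropped, as in an inner join.
    """
    return [(key, (value, static[key])) for key, value in batch if key in static]


# Static reference data, loaded once, joined against every incoming batch.
countries = {"u1": "DE", "u2": "US"}
stream = [[("u1", "click"), ("u3", "view")], [("u2", "click")]]
joined = [join_with_static(batch, countries) for batch in stream]
```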
  13. 13. Combine machine learning with streaming: learn models offline, apply them online
      // Learn model offline
      val model = KMeans.train(dataset, ...)
      // Apply model online on stream
      kafkaStream.map { event =>
        model.predict(event.feature)
      }
  14. 14. Combine SQL with streaming: interactively query streaming data with SQL and DataFrames
      // Register each batch in stream as table
      kafkaStream.foreachRDD { batchRDD =>
        batchRDD.toDF.registerTempTable("events")
      }
      // Interactively query table
      sqlContext.sql("select * from events")
  15. 15. Spark Streaming Adoption
  16. 16. Spark Survey by Databricks: survey of 1417 individuals from 842 organizations; 56% increase in Spark Streaming users since 2014; fastest rising component in Spark.
  17. 17. Feedback from community: we have learnt a lot from our rapidly growing user base. Most of the development in the last few releases has been driven by community demands.
  18. 18. What’s new?
  19. 19. Ease of use, Infrastructure, Libraries
  20. 20. Streaming MLlib algos [Spark 1.1-1.3]: continuous learning and prediction on streaming data. Algorithms: StreamingLinearRegression, StreamingKMeans, StreamingLogisticRegression.
      val model = new StreamingKMeans()
        .setK(10).setDecayFactor(1.0).setRandomCenters(4, 0.0)
      model.trainOn(trainingDStream)   // Train on one DStream
      // Predict on another DStream
      model.predictOnValues(….map { lp => (lp.label, lp.features) })
  21. 21. Python API for Spark Streaming: added Python API [Spark 1.2]; added Python API for various data sources [Spark 1.3 - 1.6]; added Python API for Streaming ML algos [Spark 1.5]
      lines = KafkaUtils.createStream(streamingContext, kafkaTopics, kafkaParams)
      counts = lines.flatMap(lambda line: line.split(" "))
  22. 22. Improved state management [1.6]: earlier, stateful stream processing was done with updateStateByKey
      def stateUpdateFunc(newData, lastState) => updatedState
      val stateDStream = keyValueDStream.updateStateByKey(stateUpdateFunc)
      [Diagram: key-value data and per-key state feed the state update, which produces the updated states]
  23. 23. Improved state management [1.6]. Feedback from community about updateStateByKey: need to keep much larger state; processing times of batches increase with the amount of state, which limits performance; need to expire keys that have received no data for a while.
  24. 24. Improved state management [1.6]: new API with timeouts, trackStateByKey (SPARK-2629)
      def updateFunc(values, state) => emittedData
        // call state.update(newState) to update state
      keyValueDStream.trackStateByKey(
        StateSpec.function(updateFunc).timeout(Minutes(10)))
      Can provide an order of magnitude higher performance than updateStateByKey.
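
The two requests from the community feedback (work proportional to new data rather than to total state, and expiry of idle keys) can be illustrated with a small plain-Python model of a keyed state store. This is a sketch under assumptions, not the Spark implementation: the class `KeyedState`, its method names, and the convention that the update function returns a `(new_state, output)` pair are all invented for illustration, and time is counted in batch numbers.

```python
class KeyedState:
    """Per-key state that is updated only for keys with new data and expires idle keys."""

    def __init__(self, update_func, timeout):
        self.update_func = update_func  # (values, old_state) -> (new_state, output)
        self.timeout = timeout          # idle batches after which a key is dropped
        self.state = {}                 # key -> current state
        self.last_seen = {}             # key -> batch time of the last update

    def process_batch(self, batch_time, records):
        # Group the new values by key; keys without new data are not touched.
        grouped = {}
        for key, value in records:
            grouped.setdefault(key, []).append(value)

        emitted = []
        for key, values in grouped.items():
            new_state, output = self.update_func(values, self.state.get(key))
            self.state[key] = new_state
            self.last_seen[key] = batch_time
            emitted.append((key, output))

        # Drop keys that have received no data for `timeout` batches.
        expired = [k for k, seen in self.last_seen.items() if batch_time - seen >= self.timeout]
        for key in expired:
            del self.state[key]
            del self.last_seen[key]
        return emitted


def running_sum(values, old_state):
    """Example update function: keep a running total and emit it."""
    total = (old_state or 0) + sum(values)
    return total, total


store = KeyedState(running_sum, timeout=2)
out0 = store.process_batch(0, [("a", 1), ("b", 5)])
out1 = store.process_batch(1, [("a", 2)])
out2 = store.process_batch(2, [("a", 1)])  # "b" has now been idle for 2 batches
```

In this model the update loop visits only the keys present in the current batch, which is the property that lets the cost of a batch stay independent of the number of idle keys.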
  25. 25. Ease of use, Infrastructure, Libraries
  26. 26. New Visualizations [Spark 1.4-1.6]: stats over the last 1000 batches. For stability, scheduling delay should be approximately 0 and processing time should not exceed the batch interval.
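
Why these two numbers indicate stability can be seen in a short simulation. The sketch below assumes a simplified model in which batches are processed strictly one after another and batch i becomes ready at the end of its interval; the function name `scheduling_delays` is hypothetical.

```python
def scheduling_delays(processing_times, batch_interval):
    """Return how long each batch waits before its processing can start.

    Batch i is ready at (i + 1) * batch_interval and starts as soon as it is
    ready and the previous batch has finished.
    """
    delays = []
    free_at = 0.0  # time at which the previous batch finishes
    for i, processing_time in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval
        start = max(ready_at, free_at)
        delays.append(start - ready_at)
        free_at = start + processing_time
    return delays


stable = scheduling_delays([0.5, 0.5, 0.5], 1.0)    # processing faster than the interval
unstable = scheduling_delays([1.5, 1.5, 1.5], 1.0)  # processing slower than the interval
```

When every batch takes longer than the interval, each batch starts later than the one before it, so the scheduling delay grows without bound; this is the trend to look for in the UI graphs.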
  27. 27. New Visualizations [Spark 1.4-1.6]: details of individual batches. Kafka offsets processed in each batch, which can help in debugging bad data; list of Spark jobs in each batch.
  28. 28. New Visualizations [Spark 1.4-1.6]: full DAG of RDDs and stages generated by Spark Streaming.
  29. 29. Ease of use, Infrastructure, Libraries
  30. 30. Zero data loss, System stability
  31. 31. Zero data loss: many streaming applications need guarantees that no data is lost despite any kind of failure in the system. At-least-once guarantee: every record is processed at least once. Exactly-once guarantee: every record is processed exactly once. Spark Streaming applications should get these guarantees no matter what fails in Spark.
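
The difference between the two guarantees shows up when a batch is replayed after a failure. The plain-Python sketch below contrasts an append-only sink, where a replay produces duplicates, with a sink that stores output under the batch id so that a replay overwrites instead of duplicating. Idempotent writes are one common way to obtain exactly-once output; that choice, and the names `AppendSink`, `IdempotentSink` and `deliver`, are assumptions for illustration and are not taken from the slides.

```python
class AppendSink:
    """Sink that appends blindly: a replayed batch is written twice."""

    def __init__(self):
        self.rows = []

    def write(self, batch_id, records):
        self.rows.extend(records)


class IdempotentSink:
    """Sink keyed by batch id: writing the same batch again replaces the old copy."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, records):
        self.batches[batch_id] = list(records)

    @property
    def rows(self):
        return [row for batch_id in sorted(self.batches) for row in self.batches[batch_id]]


def deliver(sink, batch_id, records, crash_after_write):
    """Write a batch; if a crash hits after the write, the batch is replayed once."""
    sink.write(batch_id, records)
    if crash_after_write:
        # The system cannot know the write succeeded, so it processes the batch again.
        sink.write(batch_id, records)


append_sink, idempotent_sink = AppendSink(), IdempotentSink()
for sink in (append_sink, idempotent_sink):
    deliver(sink, 0, ["r1", "r2"], crash_after_write=True)
    deliver(sink, 1, ["r3"], crash_after_write=False)
```

Both sinks contain every record at least once; only the idempotent one contains every record exactly once.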
  32. 32. Zero data loss: two cases. Replayable sources: sources that allow data to be replayed from any position (e.g. Kafka, Kinesis, etc.); solved with more reliable Kafka and Kinesis integrations [Spark 1.3-1.5]. Non-replayable sources: sources that do not support replay from any position (e.g. Flume, etc.); solved using Write Ahead Log (WAL) [Spark 1.3].
  33. 33. Write Ahead Log (WAL) [Spark 1.3]: all received data is synchronously written to HDFS and replayed when necessary after failure. WAL can be enabled by setting the Spark configuration spark.streaming.receiver.writeAheadLog.enable to true. Can give an end-to-end at-least-once guarantee for sources that support acks but do not support replays.
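
The ordering that makes a write-ahead log useful (persist first, acknowledge second, replay on recovery) can be sketched in plain Python. This is a local-file model only; Spark writes to a fault-tolerant file system such as HDFS, and the names `WriteAheadLog`, `receive` and the JSON-lines file format are assumptions made for this example.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log of received records, stored as one JSON document per line."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # make sure the record is on disk before returning

    def replay(self):
        """Return all logged records in arrival order (empty if nothing was logged)."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as log:
            return [json.loads(line) for line in log if line.strip()]


def receive(record, wal, buffer, ack):
    """Persist the record, then buffer it for processing, then acknowledge the source."""
    wal.append(record)     # 1. durable copy first
    buffer.append(record)  # 2. in-memory copy for the next batch
    ack(record)            # 3. only now tell the source it may forget the record


# Usage: after a simulated crash the in-memory buffer is gone, but the log is not.
log_path = os.path.join(tempfile.mkdtemp(), "received.log")
in_memory, acked = [], []
for event in [{"id": 1}, {"id": 2}]:
    receive(event, WriteAheadLog(log_path), in_memory, acked.append)
recovered = WriteAheadLog(log_path).replay()
```

Because the acknowledgement is sent only after the durable write, any record the source has stopped tracking is guaranteed to be recoverable from the log.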
  34. 34. Reliable Kinesis [Spark 1.5]: saves record sequence numbers instead of data to the WAL; replays from Kinesis using sequence numbers; higher throughput than writing the data to the WAL; can give an at-least-once guarantee.
  35. 35. Reliable Kafka [Spark 1.3, graduated in 1.5]: new API, the direct Kafka stream. Does not use receivers and does not use ZK to save offsets; offset management (saving, replaying) is done by Spark Streaming. Can provide up to 10x higher throughput than the earlier receiver. Can give an exactly-once guarantee. Can run Spark batch jobs directly on Kafka. Number of RDD partitions = number of Kafka partitions, which is easy to reason about.
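
The offset management of the direct approach can be pictured as planning, for every batch, one offset range per Kafka partition. The plain-Python sketch below shows that bookkeeping only; it does not talk to Kafka, and `plan_batch` with its parameters is a hypothetical helper, not part of any Spark or Kafka API.

```python
def plan_batch(last_offsets, latest_offsets, max_per_partition=None):
    """Return {partition: (from_offset, until_offset)} for the next batch.

    last_offsets:   offset up to which each partition was already processed
    latest_offsets: newest offset currently available in each partition
    max_per_partition: optional cap on records read per partition in one batch
    """
    ranges = {}
    for partition, latest in latest_offsets.items():
        start = last_offsets.get(partition, 0)  # unseen partitions start at 0
        until = latest
        if max_per_partition is not None:
            until = min(latest, start + max_per_partition)
        ranges[partition] = (start, until)
    return ranges


processed = {0: 10, 1: 5}
available = {0: 15, 1: 5, 2: 3}
unbounded = plan_batch(processed, available)
capped = plan_batch(processed, available, max_per_partition=2)
```

Each range maps to exactly one partition of the batch, and because a range fully determines which records belong to the batch, re-reading the same range after a failure reproduces the same data.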
  36. 36. System stability: streaming applications may have to deal with variations in data rates and processing rates. For stability, any streaming application must receive data only as fast as it can process it. Static rate limits on receivers [Spark 1.1] address this, but it is hard to figure out the right rate.
  37. 37. Backpressure [Spark 1.5]: the system automatically and dynamically adapts rate limits to ensure stability under any processing conditions. If sinks slow down, the system automatically pushes back on the source to slow down receiving. [Diagram: sources → receivers → sinks]
  38. 38. Backpressure [Spark 1.5]: the system uses batch processing times and scheduling delays to set rate limits. Well-known PID controller theory (used in industrial control systems) is used to calculate appropriate rate limits. Contributed by Typesafe.
  39. 39. Backpressure [Spark 1.5]: the system uses batch processing times and scheduling delays to set rate limits. Scheduling delay is kept in check by the rate limits; the dynamic rate limit prevents receivers from receiving too fast.
  40. 40. Backpressure [Spark 1.5]: experimental, so disabled by default in Spark 1.5. Enabled by setting the Spark configuration spark.streaming.backpressure.enabled to true. Will be enabled by default in future releases.
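
How a PID controller can turn batch processing times and scheduling delays into a rate limit is sketched below in plain Python. It is a simplified model under assumptions, not the code shipped in Spark: the class name, the gain values (`kp`, `ki`, `kd`), the minimum rate and the choice to seed the first limit with the measured processing rate are all illustrative.

```python
class PIDRateEstimator:
    """Simplified PID-style estimator of a sustainable ingestion rate (records/second)."""

    def __init__(self, batch_interval_ms, kp=1.0, ki=0.2, kd=0.0, min_rate=100.0):
        self.batch_interval_ms = batch_interval_ms
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_rate = min_rate
        self.latest_rate = None     # last rate limit that was issued
        self.latest_error = 0.0     # last proportional error, for the derivative term
        self.latest_time_ms = None  # time of the last update

    def compute(self, time_ms, num_records, processing_ms, scheduling_delay_ms):
        """Return the new rate limit after a batch has completed."""
        if num_records <= 0 or processing_ms <= 0:
            return self.latest_rate  # nothing to learn from an empty batch

        processing_rate = num_records / processing_ms * 1000.0  # records/second achieved

        if self.latest_rate is None:
            # First observation: start from the rate the system actually achieved.
            self.latest_rate = processing_rate
            self.latest_time_ms = time_ms
            return self.latest_rate

        elapsed_s = (time_ms - self.latest_time_ms) / 1000.0
        if elapsed_s <= 0:
            return self.latest_rate

        # Proportional: how much faster we ingest than we process.
        error = self.latest_rate - processing_rate
        # Integral: backlog implied by the scheduling delay, as records/second.
        backlog_error = scheduling_delay_ms * processing_rate / self.batch_interval_ms
        # Derivative: how quickly the proportional error is changing.
        d_error = (error - self.latest_error) / elapsed_s

        new_rate = self.latest_rate - self.kp * error - self.ki * backlog_error - self.kd * d_error
        new_rate = max(new_rate, self.min_rate)

        self.latest_rate, self.latest_error, self.latest_time_ms = new_rate, error, time_ms
        return new_rate


estimator = PIDRateEstimator(batch_interval_ms=1000)
first = estimator.compute(1000, num_records=1000, processing_ms=500, scheduling_delay_ms=0)
slowed = estimator.compute(2000, num_records=1000, processing_ms=1000, scheduling_delay_ms=500)
floored = estimator.compute(3000, num_records=100, processing_ms=10000, scheduling_delay_ms=20000)
```

When processing slows down and a scheduling delay builds up, both the proportional and the integral term push the limit down, and the floor keeps the stream from stalling completely.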
  41. 41. What’s next?
  42. 42. API and Libraries: Streaming DataFrames! Logical-to-physical plan optimizations; Tungsten-based binary optimizations; support for event-time based windowing; support for out-of-order data.
  43. 43. Ease of use: better streaming UI (make it easy to understand bottlenecks in the system and processing trends); programmatic monitoring (more information exposed through StreamingListener).
  44. 44. Infrastructure: add native support for Dynamic Allocation for Streaming (dynamically scale the cluster resources based on processing load; will work in collaboration with backpressure to scale up/down while maintaining stability). Higher throughput and lower latency, specifically improved performance for stateful ops (e.g. trackStateByKey).
  45. 45. Additional resources: research paper (SOSP 2013); programming guide; recipes for running Spark Streaming in production; blog posts on Spark Streaming.
  46. 46. Fastest growing component in the Spark ecosystem. Significant improvements in fault-tolerance, stability, visualizations and Python API. More community-requested features to come. @tathadas