
Strata NYC 2015: What's new in Spark Streaming



As the adoption of Spark Streaming in the industry is increasing, so is the community’s demand for more features. Since the beginning of this year, we have made significant improvements in performance, usability, and semantic guarantees. In particular, some of these features are:

- New Kafka integration for exactly-once guarantees
- Improved Kinesis integration for stronger guarantees
- Addition of more sources to the Python API

- Significantly improved UI for better monitoring and debuggability
In this talk, I am going to discuss these improvements as well as the plethora of features we plan to add in the near future.



  1. What's new in Spark Streaming. Tathagata "TD" Das, Strata NY 2015, @tathadas
  2. Who am I? Project Management Committee (PMC) member of Spark. Started Spark Streaming in the AMPLab, UC Berkeley. Current technical lead of Spark Streaming. Software engineer at Databricks.
  3. What is Databricks? Founded by the creators of Spark and remains the largest contributor. Offers a hosted service: Spark on EC2, notebooks, plot visualizations, cluster management, scheduled jobs.
  4. Spark Streaming: a scalable, fault-tolerant stream processing system. Sources: Flume, Kinesis, HDFS/S3, Kafka, Twitter. Sinks: file systems, databases, dashboards. High-level API: joins, windows, ... often 5x less code. Fault-tolerant: exactly-once semantics, even for stateful ops. Integration: integrates with MLlib, SQL, DataFrames, GraphX.
  5. What can you use it for? Real-time fraud detection in transactions. React to anomalies in sensors in real time. Cat videos in tweets as soon as they go viral.
  6. Spark Streaming: receivers receive data streams and chop them up into batches; Spark processes the batches and pushes out the results. (Diagram: data streams, receivers, batches, results.)
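The chop-into-batches step can be illustrated with a tiny self-contained sketch. This is purely conceptual, not Spark's implementation: timestamped records are grouped into fixed-length batch intervals.

```python
# Toy model of micro-batching: records tagged with arrival times are
# grouped into fixed-length batch intervals, which is conceptually what
# Spark Streaming's receivers and batch scheduler do. Illustrative only.

def chop_into_batches(records, batch_interval):
    """records: list of (timestamp, value) pairs.
    Returns a dict mapping batch index -> list of values."""
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // batch_interval), []).append(value)
    return batches

events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
print(chop_into_batches(events, 1.0))  # {0: ['a', 'b'], 1: ['c'], 2: ['d']}
```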
  7. Word Count with Kafka

     val context = new StreamingContext(conf, Seconds(1))  // entry point of streaming functionality
     val lines = KafkaUtils.createStream(context, ...)     // create DStream from Kafka data

  8. Word Count with Kafka

     val context = new StreamingContext(conf, Seconds(1))
     val lines = KafkaUtils.createStream(context, ...)
     val words = lines.flatMap(_.split(" "))               // split lines into words

  9. Word Count with Kafka

     val context = new StreamingContext(conf, Seconds(1))
     val lines = KafkaUtils.createStream(context, ...)
     val words = lines.flatMap(_.split(" "))
     val wordCounts = => (x, 1))         // count the words
                           .reduceByKey(_ + _)
     wordCounts.print()                                    // print some counts on screen
     context.start()                                       // start receiving and transforming the data
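For readers following the Scala, here is what the flatMap + map + reduceByKey pipeline computes on a single batch of lines, written as a plain-Python illustration (not Spark code):

```python
# Plain-Python equivalent of the per-batch word count: split every line
# into words, then count occurrences of each word. Illustrative only.
from collections import Counter

def word_count(lines):
    words = [w for line in lines for w in line.split(" ")]
    return dict(Counter(words))

print(word_count(["a b", "b c"]))  # {'a': 1, 'b': 2, 'c': 1}
```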
  10. Integrates with the Spark ecosystem: Spark Core, Spark Streaming, Spark SQL, DataFrames, MLlib, GraphX.
  11. Combine batch and streaming processing. Join data streams with static data sets:

      // Create data set from Hadoop file
      val dataset = sparkContext.hadoopFile("file")

      // Join each batch in stream with the dataset
      kafkaStream.transform { batchRDD =>
        batchRDD.join(dataset)
                .filter( ... )
      }

  12. Combine machine learning with streaming. Learn models offline, apply them online:

      // Learn model offline
      val model = KMeans.train(dataset, ...)

      // Apply model online on stream { event =>
        model.predict(event.feature)
      }

  13. Combine SQL with streaming. Interactively query streaming data with SQL and DataFrames:

      // Register each batch in stream as table
      kafkaStream.foreachRDD { batchRDD =>
        batchRDD.toDF.registerTempTable("events")
      }

      // Interactively query table
      sqlContext.sql("select * from events")
  14. Spark Streaming Adoption
  15. Spark Survey by Databricks: survey of 1,417 individuals from 842 organizations. 56% increase in Spark Streaming users since 2014; the fastest rising component in Spark.
  16. Feedback from the community: we have learned a lot from our rapidly growing user base. Most of the development in the last few releases has been driven by community demands.
  17. What have we added recently?
  18. Ease of use. Infrastructure. Libraries.
  19. Streaming MLlib algorithms: continuous learning and prediction on streaming data. StreamingLinearRegression [Spark 1.1], StreamingKMeans [Spark 1.2], StreamingLogisticRegression [Spark 1.3].

      val model = new StreamingKMeans()
        .setK(10)
        .setDecayFactor(1.0)
        .setRandomCenters(4, 0.0)

      // Train on one DStream
      model.trainOn(trainingDStream)

      // Predict on another DStream
      model.predictOnValues( { lp =>
          (lp.label, lp.features)
        }
      ).print()

  20. Python API improvements. Added Python API for Streaming ML algos [Spark 1.5]. Added Python API for various data sources: Kafka [Spark 1.3 - 1.5]; Flume, Kinesis, MQTT [Spark 1.5].

      lines = KinesisUtils.createStream(streamingContext,
          appName, streamName, endpointUrl, regionName,
          InitialPositionInStream.LATEST, 2)

      counts = lines.flatMap(lambda line: line.split(" "))

  21. Ease of use. Infrastructure. Libraries.
  22. New Visualizations [Spark 1.4-1.5]: stats over the last 1000 batches. For stability, the scheduling delay should be approximately 0 and the processing time should stay below the batch interval.
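The two stability conditions the UI surfaces can be written down as a simple check. The function name and tolerance here are illustrative assumptions, not a Spark API:

```python
# Encodes the stability rule from the slide: scheduling delay should stay
# near zero and processing time should stay under the batch interval.
# Name and epsilon are illustrative, not part of Spark.

def is_stable(scheduling_delay, processing_time, batch_interval, eps=1e-3):
    return scheduling_delay <= eps and processing_time < batch_interval

print(is_stable(0.0, 0.8, 1.0))  # True
print(is_stable(5.0, 1.2, 1.0))  # False
```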
  23. New Visualizations [Spark 1.4-1.5]: details of individual batches. Kafka offsets processed in each batch can help in debugging bad data. List of Spark jobs in each batch.
  24. New Visualizations [Spark 1.4-1.5]: full DAG of RDDs and stages generated by Spark Streaming.
  25. New Visualizations [Spark 1.4-1.5]: memory usage of received data. Can be used to understand memory consumption across executors.
  26. Ease of use. Infrastructure. Libraries.
  27. Zero data loss. System stability.
  28. Zero data loss: two cases. Replayable sources: sources that allow data to be replayed from any position (e.g. Kafka, Kinesis); Spark Streaming saves only the record identifiers and replays the data back directly from the source. Non-replayable sources: sources that do not support replay from an arbitrary position (e.g. Flume); Spark Streaming saves received data to a Write Ahead Log (WAL) and replays data from the WAL on failure.
  29. Cluster Write Ahead Log (WAL) [Spark 1.3]: save received data in a WAL in a fault-tolerant file system. (Diagram: the driver runs the user code and launches tasks to process received data; receivers run on executors, buffer the data stream in memory, and write it to the WAL in HDFS.)
  30. Cluster Write Ahead Log (WAL) [Spark 1.3]: replay unprocessed data from the WAL if the driver fails and restarts. (Diagram: the restarted driver reruns failed tasks on restarted executors, and those tasks read the data from the WAL in HDFS.)
  31. Write Ahead Log (WAL) [Spark 1.3]: the WAL can be enabled by setting the Spark configuration spark.streaming.receiver.writeAheadLog.enable to true. Use a reliable receiver, which ensures data is written to the WAL before acknowledging the source. A reliable receiver + WAL gives an at-least-once guarantee.
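The WAL pattern in slides 29-31 can be sketched as a small conceptual model. This is not Spark's WAL implementation, just the append-before-ack / replay-on-failure idea; the in-memory list stands in for files on HDFS:

```python
# Conceptual write-ahead log: a record is persisted before the source is
# acknowledged, so after a crash everything not yet processed can be
# replayed. Purely illustrative; names are assumptions.

class WriteAheadLog:
    def __init__(self):
        self._segments = []              # stands in for WAL files on HDFS

    def append(self, record):
        self._segments.append(record)    # persist BEFORE acking the source
        return len(self._segments) - 1   # position, usable as a checkpoint

    def replay(self, from_position=0):
        return self._segments[from_position:]

wal = WriteAheadLog()
for record in ["a", "b", "c"]:
    wal.append(record)

# If the driver crashed after processing only the first record,
# recovery replays the rest:
print(wal.replay(from_position=1))  # ['b', 'c']
```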
  32. Kinesis [Spark 1.5]: save the Kinesis sequence numbers instead of the raw data. (Diagram: executors receive data using the KCL; sequence number ranges are sent to the driver and saved to HDFS.)
  33. Kinesis [Spark 1.5]: recover unprocessed data directly from Kinesis using the recovered sequence numbers. (Diagram: the restarted driver recovers the ranges from HDFS and reruns tasks on restarted executors, which read from Kinesis using the AWS SDK.)
  34. Kinesis [Spark 1.5]: after any failure, records are either recovered from saved sequence numbers or replayed via the KCL. No need to replicate received data in Spark Streaming. Provides an end-to-end at-least-once guarantee.
  35. Kafka [1.3, graduated in 1.5]: decide a priori the offset ranges to consume in the next batch. Every batch interval, the latest offset info is fetched for each Kafka partition; the offset ranges for the next batch are decided by the driver and saved to HDFS.
  36. Kafka [1.3, graduated in 1.5]: decide a priori the offset ranges to consume in the next batch. Tasks then run on the executors to read each offset range from the brokers in parallel.
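The a-priori planning step can be sketched as follows: given the last consumed offset and the latest available offset per Kafka partition, the driver derives one offset range per partition for the next batch. The helper below is hypothetical, not the actual Spark or Kafka API:

```python
# Sketch of a-priori offset-range planning: for each Kafka partition,
# the next batch covers [last consumed offset, latest available offset).
# Names and shapes are illustrative assumptions.

def plan_offset_ranges(consumed, latest):
    """consumed/latest: dicts mapping partition -> offset.
    Returns a list of (partition, from_offset, until_offset) ranges,
    one per partition that has new data."""
    ranges = []
    for partition, until in sorted(latest.items()):
        start = consumed.get(partition, 0)
        if until > start:
            ranges.append((partition, start, until))
    return ranges

print(plan_offset_ranges({0: 100, 1: 50}, {0: 180, 1: 50, 2: 10}))
# [(0, 100, 180), (2, 0, 10)]
```

One range per partition is what makes the "# RDD partitions = # Kafka partitions" property on the next slide easy to reason about.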
  37. Direct Kafka API [Spark 1.5]: does not use receivers, so there is no need for Spark Streaming to replicate data. Can provide up to 10x higher throughput than the earlier receiver approach. Can provide exactly-once semantics, provided the output operation to external storage is idempotent or transactional. Can run Spark batch jobs directly on Kafka. # RDD partitions = # Kafka partitions, which is easy to reason about.
  38. System stability: streaming applications may have to deal with variations in data rates and processing rates. For stability, any streaming application must receive data only as fast as it can process it. Since 1.1, Spark Streaming has allowed setting static limits on receiver ingestion rates to guard against spikes.
  39. Backpressure [Spark 1.5]: the system automatically and dynamically adapts rate limits to ensure stability under any processing conditions. If the sinks slow down, the system automatically pushes back on the sources to slow down receiving. (Diagram: sources, receivers, sinks.)
  40. Backpressure [Spark 1.5]: the system uses batch processing times and scheduling delays to set rate limits. Well-known PID controller theory (used in industrial control systems) is used to calculate appropriate rate limits. Contributed by Typesafe.
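The PID idea can be sketched with a minimal proportional-integral controller. The gains, names, and exact error terms below are illustrative assumptions, not Spark's internal rate estimator:

```python
# Minimal PI-style rate controller in the spirit of the slide: the error
# is the gap between the rate we ingested and the rate we could actually
# process; accumulated scheduling delay contributes an integral-like term.
# Gains and formula details are illustrative, not Spark's implementation.

def next_rate(latest_rate, processing_delay, scheduling_delay,
              batch_interval, kp=1.0, ki=0.2):
    # elements/sec the system actually managed to process
    processing_rate = latest_rate * batch_interval / processing_delay
    error = latest_rate - processing_rate                          # proportional
    backlog = scheduling_delay * processing_rate / batch_interval  # integral-ish
    return max(latest_rate - kp * error - ki * backlog, 0.0)

# Ingesting 100 rec/s, but each 1s batch takes 2s to process:
print(next_rate(100.0, 2.0, 0.0, 1.0))  # 50.0
```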
  41. Backpressure [Spark 1.5]: the system uses batch processing times and scheduling delays to set rate limits. (Diagram: the dynamic rate limit prevents receivers from receiving too fast; the scheduling delay is kept in check by the rate limits.)
  42. Backpressure [Spark 1.5]: experimental, so disabled by default in Spark 1.5. Enabled by setting the Spark configuration spark.streaming.backpressure.enabled to true. Will be enabled by default in future releases.
  43. What's next?
  44. API and libraries: support for operations on event time and out-of-order data, the most demanded feature from the community. Tighter integration between Streaming and SQL + DataFrames, which helps leverage Project Tungsten.
  45. Infrastructure: add native support for dynamic allocation for streaming, to dynamically scale cluster resources based on processing load. Will work in collaboration with backpressure to scale up/down while maintaining stability. Note: as of 1.5, the existing dynamic allocation is not optimized for streaming, but users can build their own scaling logic using the developer APIs sparkContext.requestExecutors() and sparkContext.killExecutors().
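A user-built scaling loop on top of requestExecutors()/killExecutors() might decide the scaling direction like this. The thresholds and function name are assumptions for illustration; only the two Spark calls named on the slide are real API:

```python
# Illustrative scaling policy of the kind a user could implement with
# sparkContext.requestExecutors() / killExecutors(): scale up when
# batches fall behind, scale down when there is plenty of headroom.
# Thresholds and the function name are assumptions, not a Spark API.

def executor_delta(scheduling_delay, processing_time, batch_interval,
                   scale_up_ratio=0.9, scale_down_ratio=0.3):
    if scheduling_delay > 0 or processing_time > scale_up_ratio * batch_interval:
        return 1    # falling behind: request one more executor
    if processing_time < scale_down_ratio * batch_interval:
        return -1   # large headroom: release one executor
    return 0        # steady state: leave the cluster alone

print(executor_delta(5.0, 2.0, 10.0))  # 1  (scheduling delay building up)
print(executor_delta(0.0, 2.0, 10.0))  # -1 (batches finish quickly)
print(executor_delta(0.0, 5.0, 10.0))  # 0
```

Working together with backpressure, a policy like this can keep the scheduling delay near zero while releasing resources the application does not need.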
  46. Infrastructure: higher throughput and lower latency by leveraging Project Tungsten; specifically, improved performance of stateful ops.
  47. Summary: the fastest growing component in the Spark ecosystem. Significant improvements in fault tolerance, stability, visualizations, and the Python API. More community-requested features to come. @tathadas