
Spark streaming state of the union


In this talk at the 2015 Spark Summit East, the lead developer of Spark Streaming, @tathadas, talks about the state of Spark Streaming:

Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data applications are being written. It is being rapidly adopted by companies across various business verticals: ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detection, etc. These companies are mainly adopting Spark Streaming because:
– Its simple, declarative batch-like API makes large-scale stream processing accessible to non-scientists.
– Its unified API and single processing engine (i.e. the Spark core engine) allow a single cluster and a single set of operational processes to cover the full spectrum of use cases: batch, interactive, and stream processing.
– Its stronger, exactly-once semantics make it easier to express and debug complex business logic.
In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, this presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.



  1. Spark Streaming: The State of the Union and the Road Beyond. Tathagata “TD” Das (@tathadas), March 18, 2015
  2. Who am I? Project Management Committee (PMC) member of Spark; lead developer of Spark Streaming; formerly in AMPLab, UC Berkeley; software developer at Databricks
  3. What is Spark Streaming?
  4. Spark Streaming: a scalable, fault-tolerant stream processing system. Sources: Kafka, Flume, Kinesis, Twitter, HDFS/S3. Sinks: file systems, databases, dashboards. High-level API (joins, windows, …), often 5x less code. Fault-tolerant: exactly-once semantics, even for stateful ops. Integration: works with MLlib, SQL, DataFrames, GraphX
  5. How does it work? Receivers receive data streams and chop them up into batches; Spark processes the batches and pushes out the results (diagram: data streams → receivers → batches → results)
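The receive-buffer-process cycle on this slide can be sketched in plain Python (an illustrative toy, not Spark code; all names here are made up): a receiver buffers incoming events, and on every batch interval the buffer is drained and handed to a processing function as one batch.

```python
# Toy micro-batching sketch (illustrative; not Spark APIs).
from collections import deque

class MicroBatcher:
    def __init__(self, process):
        self.buffer = deque()   # receiver-side buffer of incoming events
        self.process = process  # function applied to each completed batch
        self.results = []

    def receive(self, event):
        # Called continuously as events arrive on the stream.
        self.buffer.append(event)

    def tick(self):
        # Called once per batch interval: drain the buffer into one batch
        # and hand it to the processing function.
        batch = list(self.buffer)
        self.buffer.clear()
        self.results.append(self.process(batch))

batcher = MicroBatcher(process=len)  # process() here just counts events
for e in ["a", "b", "c"]:
    batcher.receive(e)
batcher.tick()
batcher.receive("d")
batcher.tick()
print(batcher.results)  # [3, 1]
```

In real Spark Streaming the batch interval is fixed when the StreamingContext is created; shorter intervals give lower latency at the cost of more per-batch scheduling overhead.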
  6. Streaming word count with Kafka:
     val kafka = KafkaUtils.createStream(ssc, kafkaParams, …)  // create DStream with lines from Kafka
     val words = kafka.map(_._2).flatMap(_.split(" "))         // split lines into words
     val wordCounts = words.map(x => (x, 1))
                           .reduceByKey(_ + _)                 // count the words
     wordCounts.print()                                        // print some counts on screen
     ssc.start()                                               // start processing the stream
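What each DStream operation does per batch can be mirrored in plain Python (an illustrative helper, not a Spark API): flatMap splits lines into words, map pairs each word with 1, and reduceByKey sums the counts per key.

```python
# Plain-Python sketch of the per-batch word-count logic (illustrative).
from collections import defaultdict

def word_count(lines):
    words = [w for line in lines for w in line.split(" ")]  # flatMap(_.split(" "))
    pairs = [(w, 1) for w in words]                         # map(x => (x, 1))
    counts = defaultdict(int)
    for word, n in pairs:                                   # reduceByKey(_ + _)
        counts[word] += n
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```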
  7. Languages: Scala, Java, and Python natively; any other language by using RDD.pipe()
  8. Integrates with the Spark ecosystem: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX
  9. Combine batch and streaming processing: join data streams with static data sets
     // Create data set from Hadoop file
     val dataset = sparkContext.hadoopFile("file")
     // Join each batch in stream with the dataset
     kafkaStream.transform { batchRDD =>
       batchRDD.join(dataset).filter(...)
     }
  10. Combine machine learning with streaming: learn models offline, apply them online
      // Learn model offline
      val model = KMeans.train(dataset, ...)
      // Apply model online on stream
      kafkaStream.map { event =>
        model.predict(event.feature)
      }
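The learn-offline, apply-online pattern can be sketched without Spark (a toy threshold classifier stands in for MLlib's KMeans; every name here is illustrative): training runs once over a static dataset, and the resulting model is then applied to each event in the stream.

```python
# Toy "train offline, score online" sketch (illustrative; not MLlib).
def train(dataset):
    # "Offline" training: the model is just the mean of historical values.
    return sum(dataset) / len(dataset)

def predict(model, value):
    # "Online" scoring, applied to each streaming event as it arrives.
    return "high" if value > model else "low"

model = train([1.0, 2.0, 3.0, 4.0])          # offline, on a static dataset
stream = [0.5, 3.5, 2.0]                     # stand-in for streaming events
print([predict(model, v) for v in stream])   # ['low', 'high', 'low']
```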
  11. Combine SQL with streaming: interactively query streaming data with SQL
      // Register each batch in stream as a table
      kafkaStream.foreachRDD { batchRDD =>
        batchRDD.registerTempTable("latestEvents")
      }
      // Interactively query the table
      sqlContext.sql("select * from latestEvents")
  12. A brief history. Late 2011: research idea at AMPLab, UC Berkeley. “We need to make Spark faster.” “Okay...umm, how??!?!”
  13. A brief history. Q2 2012: prototype; rewrote large parts of Spark core; smallest job went from 900 ms to <50 ms. Q3 2012: Spark core improvements open sourced in Spark 0.6. Feb 2013: alpha release; 7.7k lines, merged in 7 days; released with Spark 0.7
  14. A brief history. Late 2011: idea (AMPLab, UC Berkeley). Q2 2012: prototype. Q3 2012: core improvements open sourced in Spark 0.6. Feb 2013: alpha release with Spark 0.7. Jan 2014: stable release, graduation with Spark 0.9
  15. Current state of Spark Streaming
  16. Adoption, development, and roadmap
  17. What have we added in the last year?
  18. Python API: core functionality in Spark 1.2, with sockets and files as sources; Kafka support in Spark 1.3; other sources coming in the future
      kafka = KafkaUtils.createStream(ssc, params, …)
      lines = kafka.map(lambda x: x[1])
      counts = lines.flatMap(lambda line: line.split(" ")) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)
      counts.pprint()
  19. Streaming MLlib algorithms: continuous learning and prediction on streaming data. StreamingLinearRegression in Spark 1.1, StreamingKMeans in Spark 1.2, StreamingLogisticRegression in Spark 1.3
      val model = new StreamingKMeans()
        .setK(10)
        .setDecayFactor(1.0)
        .setRandomCenters(4, 0.0)
      // Apply model to DStreams
      model.trainOn(trainingDStream)
      model.predictOnValues(
        testDStream.map { lp => (lp.label, lp.features) }
      ).print()
      https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
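The decay idea behind setDecayFactor can be sketched in one dimension (a toy exponentially weighted running mean; illustrative names, not the MLlib implementation): each center carries a weight, old batches are discounted by the decay factor, and a decay below 1.0 lets the model track drift in the stream.

```python
# Toy 1-D sketch of a decayed streaming mean (illustrative; not MLlib).
def update_center(center, weight, batch, decay=1.0):
    # Discount the accumulated weight by `decay`, then fold in the new batch.
    batch_sum, batch_n = sum(batch), len(batch)
    new_weight = decay * weight + batch_n
    new_center = (decay * weight * center + batch_sum) / new_weight
    return new_center, new_weight

center, weight = 0.0, 0.0
for batch in ([1.0, 1.0], [5.0, 5.0]):
    center, weight = update_center(center, weight, batch, decay=1.0)
print(center)  # 3.0: with decay=1.0 this is the plain mean of all points
```

With decay=1.0 every batch counts equally forever; with decay=0.0 only the latest batch matters.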
  20. Kafka “direct” stream API. The earlier receiver-based approach for Kafka (using the high-level consumer) requires replicated journals (write ahead logs) to ensure zero data loss under driver failures. http://spark.apache.org/docs/latest/streaming-kafka-integration.html
  21. Kafka “direct” stream API. The new direct approach for Kafka in Spark 1.3 uses the simple consumer API to read Kafka topics, instead of a receiver built on the high-level consumer
  22. Kafka “direct” stream API: the new direct approach in Spark 1.3 treats Kafka like a file system. No receivers! Directly query Kafka for the latest topic offsets, and read data like reading files. Instead of Zookeeper, Spark Streaming keeps track of Kafka offsets. More efficient, fault-tolerant, exactly-once receiving of Kafka data
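The "treat Kafka like a file system" idea can be sketched in plain Python (a toy in-memory log; illustrative names, not the Spark or Kafka APIs): each batch is defined purely by an explicit offset range per partition, which the driver records, so replaying the same range after a failure re-reads exactly the same data.

```python
# Toy sketch of direct-stream offset-range reads (illustrative).
log = ["e0", "e1", "e2", "e3", "e4"]  # stand-in for one Kafka partition

def next_batch(committed_offset, latest_offset):
    # Define the batch by (from, until) offsets, then read it like a
    # file slice; no receiver buffers data in between.
    records = log[committed_offset:latest_offset]
    return records, latest_offset

records, committed = next_batch(0, 3)
print(records)                    # ['e0', 'e1', 'e2']
replayed, _ = next_batch(0, 3)    # recovery: same range -> same records
print(replayed == records)        # True
```

Because the offset range, not a receiver's buffer, defines the batch, re-running a failed batch is deterministic, which is what gives exactly-once input semantics.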
  23. Other library additions: Amazon Kinesis integration [Spark 1.1]; more fault-tolerant Flume integration [Spark 1.1]
  24. System infrastructure: automated driver fault-tolerance [Spark 1.0]; graceful shutdown [Spark 1.0]; write ahead logs for zero data loss [Spark 1.2]
  25. Contributors to Streaming (chart: contributor count per release, growing from Spark 0.9 to Spark 1.2)
  26. Contributors, the full picture (chart: Streaming vs. Core + Streaming without SQL, MLlib, …, from Spark 0.9 to Spark 1.2). All contributions to core Spark directly improve Spark Streaming
  27. Spark Packages: more contributions from the community on spark-packages, e.g. an alternate Kafka receiver, an Apache Camel receiver, and Cassandra examples. http://spark-packages.org/
  28. Who is using Spark Streaming?
  29. Spark Summit 2014 survey: 40% of Spark users were using Spark Streaming in production or prototyping; another 39% were evaluating it (not using: 21%, evaluating: 39%, prototyping: 31%, production: 9%)
  31. 80+ known deployments
  32. Intel China builds big data solutions for large enterprises. Multiple streaming applications for top businesses: real-time risk analysis for a top online payment company; real-time deal and flow metric reporting for a top online shopping company
  33. Complicated stream processing: SQL queries on streams; joining streams with large historical datasets; >1 TB/day passing through Spark Streaming (stack: Kafka, RocketMQ → Spark Streaming on YARN → HBase)
  34. One of the largest publishing and education companies wants to accelerate its push into digital learning. Needed to combine student activities and domain events to continuously update each student’s learning model. Earlier implementation was in Storm, but they have now moved to Spark Streaming
  35. Chose Spark Streaming because Spark combines batch, streaming, machine learning, and graph processing (stack: Kafka → Spark Streaming on Spark Standalone → Cassandra, Apache Blur). More information: http://dbricks.co/1BnFZZ8
  36. Leading advertising automation company with an exchange platform for in-feed ads. Processes clickstream data to optimize real-time bidding for ads (stack: Kinesis, RabbitMQ, SQS → Spark Streaming on Mesos+Marathon → MySQL, Redis)
  37. Wants to learn trending movies and shows in real time. Currently in the middle of replacing one of their internal stream processing architectures with Spark Streaming. Tested the resiliency of Spark Streaming with Chaos Monkey. More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
  38. Spark Streaming can handle all kinds of failures: driver failures are handled with the Spark Standalone cluster’s supervise mode; worker, executor, and receiver failures are handled automatically. More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
  39. Neuroscience @ Freeman Lab, Janelia Farm: Spark Streaming and MLlib to analyze neural activity. Laser microscope scans zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons! http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
  40. Neuroscience @ Freeman Lab, Janelia Farm: streaming machine learning algorithms on time series data from every neuron. Up to 2 TB/hour and increasing with brain size; up to 80 HPC nodes. http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
  41. Why are they adopting Spark Streaming? Easy, high-level API; unified API across batch and streaming; integration with Spark SQL and MLlib; ease of operations
  42. What’s coming next?
  43. Libraries, operational ease, performance
  44. Roadmap, libraries: streaming machine learning algorithms (A/B testing, online Latent Dirichlet Allocation (LDA), more streaming linear algorithms); Streaming + DataFrames; Streaming + SQL
  45. Roadmap, operational ease: better flow control; elastic scaling; cross-version upgradability; improved support for non-Hadoop environments
  46. Roadmap, performance: higher throughput, especially for stateful operations; lower latencies; easy deployment of streaming apps in Databricks Cloud!
  47. You can help! Roadmaps are heavily driven by community feedback, and we have listened to community demands over the last year (write ahead logs for zero data loss, the new Kafka direct API). Let us know what you want to see in Spark Streaming: the Spark user mailing list, or tweet it to me @tathadas
  48. Industry adoption is increasing rapidly; the community is contributing very actively; more libraries, operational ease, and performance are on the roadmap. @tathadas
  49. Backup slides
  50. Typesafe survey of Spark users: 2,136 developers, data scientists, and other tech professionals. http://java.dzone.com/articles/apache-spark-survey-typesafe-0
  51. Typesafe survey of Spark users: 65% of Spark users are interested in Spark Streaming
  52. Typesafe survey of Spark users: 2/3 of Spark users want to process event streams
  53. More use cases
  54. Big data solution provider for enterprises. Multiple applications for different businesses: monitoring and optimizing the online services of a Tier-1 bank; fraudulent transaction detection for a Tier-2 bank. Kafka → Spark Streaming → Cassandra, MongoDB. Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, and MongoDB
  55. Provides data analytics solutions for Communication Service Providers: 4 of the 5 top mobile operators, 3 of the 4 top internet backbone providers; processes >50% of all US mobile traffic. Multiple applications for different businesses: real-time anomaly detection in cell tower traffic; real-time call quality optimizations. Kafka → Spark Streaming. http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark
  56. Runs claims processing applications for healthcare providers. Predictive models can look for claims that are likely to be held up for approval; Spark Streaming allows model scoring in seconds instead of hours. http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims
