Successfully reported this slideshow.
Your SlideShare is downloading. ×

SSR: Structured Streaming on R for Machine Learning with Felix Cheung

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad

Check these out next

1 of 42 Ad

SSR: Structured Streaming on R for Machine Learning with Felix Cheung

Download to read offline

Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Application, this session will explore the ever more popular Structure Streaming API in Apache Spark, its application to R, and building examples of machine learning use cases.

Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.

Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Application, this session will explore the ever more popular Structure Streaming API in Apache Spark, its application to R, and building examples of machine learning use cases.

Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.

Advertisement
Advertisement

More Related Content

Slideshows for you (20)

Similar to SSR: Structured Streaming on R for Machine Learning with Felix Cheung (20)

Advertisement

More from Databricks (20)

Recently uploaded (20)

Advertisement

SSR: Structured Streaming on R for Machine Learning with Felix Cheung

  1. 1. Felix Cheung Principal Engineer & Spark Committer SSR: Structured Streaming for R and Machine Learning
  2. 2. Disclaimer: Apache Spark community contributions
  3. 3. Agenda • Structured Streaming • ML Pipeline • R - putting it all together • Considerations
  4. 4. Why Streaming? • Faster insight at scale • ETL • Trends • Latest data to static data • Continuous Learning
  5. 5. Spark Streaming 1. Receiver 2. Direct DStream 3. Structured Streaming
  6. 6. Structured Streaming https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  7. 7. Structured Streaming • "Streaming Logical Plan" – Extending Dataset/DataFrame to include incremental execution of unbounded input – aka Rinse & Repeat
  8. 8. Same • Transformations: map filter aggregate window join* (*some limitations)
  9. 9. Better • Trigger • Consistency • Fault Tolerance • Event time – late data, watermark
  10. 10. Execution Plan https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
  11. 11. SS in a Circuit DataFrame DataFrame Trigger
  12. 12. Source File Kafka Socket MQTT
  13. 13. Sink File (new formats in 2.1+) Console Memory (aka Temp View) Foreach Kafka (new in 2.2)
  14. 14. Output Mode Append (default) Complete Update (new in 2.1.1)
  15. 15. Streaming & ML Don't Mix*
  16. 16. ML Pipeline Model
  17. 17. Remember the SS Flow? DataFrame DataFrame
  18. 18. ML Pipeline fit() • Essentially an Action • Results in a Model • Sink start() also an Action • Structured Streaming circuit must be completed with Sink start()
  19. 19. R to the Rescue
  20. 20. R • Statistical computing and graphics • 10.7k+ packages on CRAN
  21. 21. Why Streaming in R • Single integrated job for everything 1. Ingest 2. ETL 3. Machine Learning • Use your favorite packages - freedom to choose • rkafka – last published 2015
  22. 22. SparkR • DataFrame API like R data.frame, dplyr – Full Spark optimizations • SQL, Session, Catalog • “Spark Packages” • ML • R-native UDF • SS
  23. 23. Native R UDF • User-Defined Functions - custom transformation • Apply by Partition • Apply by Group
  24. 24. Parallel Processing By Partition
  25. 25. https://spark-summit.org/east-2017/events/scalable-data-science-with-sparkr/
  26. 26. Native R UDF = DF Transform DataFrame DataFrame DataFrame DataFrame
  27. 27. SS in R 1. DataStreamReader/Writer 2. StreamingQuery 3. Extending DataFrame (isStreaming)
  28. 28. About Demo • Create a job to discover trending news topics – Structured Streaming – Machine Learning with native R package in UDF
  29. 29. Demo! https://goo.gl/0v6YxF
  30. 30. Demo • SS – read text stream from Kafka • R-UDF – a partition with lines of text – RTextTools – text vector into DTM – scrubbing – LDA – terms • SQL – group by words, count • SS – write to console
  31. 31. Read DataFrame vs Stream read.df(datapath, source = "parquet") read.stream("kafka", kafka.bootstrap.servers = servers, subscribe = topic)
  32. 32. Streaming WordCount in 1 line library(magrittr) kbsrvs <- "kafka-0.broker.kafka.svc.cluster.local:9092" topic <- "test1" read.stream("kafka", kafka.bootstrap.servers = kbsrvs, subscribe = topic) %>% selectExpr("explode(split(value as string, ' ')) as word") %>% group_by("word") %>% count() %>% write.stream("console", outputMode = "complete")
  33. 33. Challenges
  34. 34. Streaming and ML • Streaming – small batch • ML – sometimes large data to build model => pre-trained model => online machine learning • Adopting to data schema, pattern changes • Updating model (when?)
  35. 35. Practical Implementation - LSI – online training - Online LDA - kNN - k-means with predict on new data
  36. 36. SS Considerations • Schema of DataFrame from Kafka: key (object), value (object), topic, partition, offset, timestamp, timestampType • OutputMode requirements
  37. 37. ML with R-UDF • Native code UDF can break the job - eg. ML packages could be sensitive to empty row - more data checks In Real Life • Debugging can be challenging – run separately first • UDF must return that matches schema • Model as state to distribute to each UDF instance
  38. 38. Future – SSR • Configurable trigger • Watermark for late data
  39. 39. Thank You. https://github.com/felixcheung linkedin: http://linkd.in/1OeZDb7 blog: http://bit.ly/1E2z6OI

×