Briefing on the Modern ML Stack with R

We provide an update on developments at the intersection of R and the broader machine learning ecosystem. The packages covered let R users apply the latest technologies for big data analytics and deep learning in their existing workflows, and also facilitate collaboration within multidisciplinary data science teams. Topics covered include:

  • MLflow: managing the ML lifecycle with improved dependency management and more deployment targets
  • TensorFlow: TF 2.0 update and probabilistic (deep) machine learning with TensorFlow Probability
  • Spark: latest improvements and extensions, including text processing at scale with SparkNLP

  2. Briefing on the Modern ML Stack with R. Javier Luraschi, RStudio. #UnifiedDataAnalytics #SparkAISummit
  3. Intro to R: “R is a programming language and free software environment for statistical computing and graphics.”
  4. Modern R
     The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

         library(tidyverse)
         library(nycflights13)  # assumption: source of the flights data, not shown on the slide

         # Daily flight counts and mean departure delay, keeping only busy days
         flights %>%
           group_by(month, day) %>%
           summarise(count = n(), avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
           filter(count > 1000)
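To make explicit what the pipeline on this slide computes, here is a minimal base R sketch of the same group, summarise, and filter steps. The `toy` data frame and its values are made up for illustration, and the count threshold is lowered from 1000 to 1 so the tiny table still produces output.

```r
# Toy stand-in for the flights table (values are made up for illustration)
toy <- data.frame(
  month     = c(1, 1, 1, 2, 2),
  day       = c(1, 1, 2, 1, 1),
  dep_delay = c(10, NA, 5, 20, 40)
)

# Equivalent of group_by(month, day): one data frame per (month, day) pair
groups <- split(toy, list(toy$month, toy$day), drop = TRUE)

# Equivalent of summarise(count = n(), avg_delay = mean(dep_delay, na.rm = TRUE))
summary_tbl <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    month     = g$month[1],
    day       = g$day[1],
    count     = nrow(g),
    avg_delay = mean(g$dep_delay, na.rm = TRUE)
  )
}))

# Equivalent of filter(count > 1000), with the threshold lowered for the toy data
result <- summary_tbl[summary_tbl$count > 1, ]
print(result)
```

The dplyr version expresses the same three steps as a single readable chain, which is the point the slide makes about the tidyverse grammar.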
  5. About RStudio
  6. RStudio Multiverse Team: Authors of R packages to support Apache Spark, TensorFlow and MLflow. Contributors to tidyverse and Apache Arrow.
  7. The Modern ML Stack with R (timeline figure spanning 2015, 2016, 2017, 2018 and 2019)
  9. Spark with R - Motivation

         library(sparklyr)
         library(tidyverse)

         # The slide lists the possible masters as local|yarn|mesos|spark|livy
         sc <- spark_connect(master = "local")

         # Copy the local flights data frame into Spark
         flights <- copy_to(sc, flights)

         # The same dplyr pipeline as on the Modern R slide, now run against Spark
         flights %>%
           group_by(month, day) %>%
           summarise(count = n(), avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
           filter(count > 1000)
  10. Spark with R - Timeline
      • Sep 2016: sparklyr 0.4, R interface for Apache Spark.
      • Jan 2017: sparklyr 0.5, Livy and dplyr improvements.
      • Jul 2017: sparklyr 0.6, Distributed R and external sources.
      • Jan 2018: sparklyr 0.7, Spark Pipelines and Machine Learning.
      • May 2018: sparklyr 0.8, Production pipelines and graphs.
      • Oct 2018: sparklyr 0.9, Streams and Kubernetes.
      • Mar 2019: sparklyr 1.0, Arrow, XGBoost, Broom and TFRecords.
      • Oct 2019: latest date shown, with no release attached.
  11. Spark - What’s new?

          library(sparklyr)
          library(arrow)
  12. Spark - What’s new? - XGBoost

          library(sparkxgb)

          iris <- copy_to(sc, iris)

          # Fit a three-class XGBoost classifier on Spark
          xgb_model <- xgboost_classifier(
            iris,
            Species ~ .,
            num_class = 3,
            num_round = 50,
            max_depth = 4
          )

          # Score the data and inspect predicted labels and class probabilities
          xgb_model %>%
            ml_predict(iris) %>%
            select(Species, predicted_label, starts_with("probability_")) %>%
            glimpse()
  13. Spark - What’s new? - Broom
  14. Spark - New? - TF Records
  15. Spark - What’s next? - Genomics
      github.com/r-spark/variantspark by Samuel Macêdo

          library(sparklyr)
          library(variantspark)

          sc <- spark_connect(master = "local")
          vsc <- vs_connect(sc)

          # Read the variant calls and the sample labels
          hipster_vcf <- vs_read_vcf(vsc, "inst/extdata/hipster.vcf.bz2")
          hipster_labels <- vs_read_csv(vsc, "inst/extdata/hipster_labels.txt")
          labels <- vs_read_labels(vsc, "inst/extdata/hipster_labels.txt")

          # Run the importance analysis with 100 trees
          vs_importance_analysis(vsc, hipster_vcf, labels, n_trees = 100)
  16. Spark - What’s next? - Genomics
      github.com/r-spark/sparkhail by Samuel Macêdo

          library(sparkhail)

          sc <- spark_connect(master = "local", version = "2.4", config = hail_config())
          hl <- hail_context(sc)

          # Read the example matrix table shipped with the package
          mt <- hail_read_matrix(hl, system.file("extdata/1kg.mt", package = "sparkhail"))
          hail_dataframe(mt)
  17. Spark - What’s next? - Genomics: github.com/lawremi/hailr by Michael Lawrence
  18. Spark - What’s next? - NLP: github.com/r-spark/sparknlp by Kevin Kuo
  19. Spark - What’s next? - GitHub: sparklyr moving to github.com/r-spark and more...
  21. TensorFlow with R - Timeline (timeline figure)
      • Dates shown: Mar 2017, Jul 2017, Dec 2017, Jan 2018, Jun 2018, Aug 2018, Oct 2018.
      • Milestones shown: tensorflow 0.7 (initial release), keras 2.0.5 (initial release), tfestimators 1.4.2 (initial release), cloudml 0.5 (initial release), tensorflow eager execution, tfprobability (initial release), tfdatasets 1.5 (initial release).
  22. TensorFlow - New? - tfdatasets Feature specs

          ft_spec <- training %>%
            select(-id) %>%
            feature_spec(target ~ .) %>%
            step_numeric_column(ends_with("bin")) %>%
            step_numeric_column(
              -ends_with("bin"), -ends_with("cat"),
              normalizer_fn = scaler_standard()
            ) %>%
            step_categorical_column_with_vocabulary_list(ends_with("cat")) %>%
            step_embedding_column(
              ends_with("cat"),
              dimension = function(vocab_size) as.integer(sqrt(vocab_size) + 1)
            ) %>%
            fit()
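The `dimension` argument on this slide sizes each embedding from its vocabulary: the square root of the vocabulary size plus one, truncated to an integer. The base R sketch below shows what that rule yields. The helper name `embedding_dim` and the vocabulary sizes are made up for illustration and are not part of tfdatasets.

```r
# Hypothetical helper mirroring the slide's anonymous function:
#   function(vocab_size) as.integer(sqrt(vocab_size) + 1)
embedding_dim <- function(vocab_size) {
  as.integer(sqrt(vocab_size) + 1)  # as.integer() truncates toward zero
}

# Made-up vocabulary sizes for three categorical columns
vocab_sizes <- c(small = 4, medium = 10, large = 100)

# Apply the rule to each column's vocabulary size
dims <- vapply(vocab_sizes, embedding_dim, integer(1))
print(dims)
```

Under this rule the embedding width grows much more slowly than the vocabulary, so high-cardinality columns stay compact.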
  23. TensorFlow - New? - tfprobability
      Combine probabilistic models and deep learning on modern hardware. github.com/rstudio/tfprobability

          # create a binomial distribution with n = 7 and p = 0.3
          d <- tfd_binomial(total_count = 7, probs = 0.3)

          # compute mean
          d %>% tfd_mean()

          # compute variance
          d %>% tfd_variance()

          # compute probability
          d %>% tfd_prob(2.3)
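As a sanity check on the binomial example, the mean and variance can be worked out in base R from the closed-form expressions n·p and n·p·(1 − p). This is a sketch that does not use tfprobability. It also evaluates the probability at the integer count 2 rather than the slide's 2.3, because base R's `dbinom` expects integer counts.

```r
# Parameters taken from the slide's tfd_binomial(total_count = 7, probs = 0.3)
n <- 7
p <- 0.3

binom_mean <- n * p            # closed-form mean of Binomial(n, p)
binom_var  <- n * p * (1 - p)  # closed-form variance of Binomial(n, p)

# Probability mass at an integer count, using base R's stats::dbinom
prob_two <- dbinom(2, size = n, prob = p)

print(c(mean = binom_mean, variance = binom_var, prob_at_2 = prob_two))
```

The mean and variance come out as 2.1 and 1.47, the values `tfd_mean()` and `tfd_variance()` should report for the same parameters.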
  24. TensorFlow - What’s next? TF 2.0
  25. TensorFlow - Next? - Distributed
  27. MLflow - Timeline: Available on CRAN since v0.7.0
  28. MLflow - Timeline: Docs site on a par with Python!
  29. MLflow - What’s next?
      ● renv (packrat successor)
      ● Cloud Deployment Targets
      ● Keras Autolog
  30. DEMO: Modern ML Stack with R
  31. Resources
      • Mastering Spark with R (book)
      • github.com/r-spark
      • spark.rstudio.com
      • github.com/r-tensorflow
      • tensorflow.rstudio.com
      • youtube.com/c/multiverses
