sparklyr - Jeff Allen

Jeff will showcase sparklyr, the new R package for interfacing with Spark, and discuss its extensions, including the rsparkling machine learning package.

- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata


  1. Spark + R = sparklyr. Jeff Allen, RStudio, @trestleJeff
  2. THE R LANGUAGE • 5th most popular programming language in the world1 • Data analysis, statistical modeling, visualization • Great interface to your data • Historically limited to in-memory data 1 http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages
  3. WHAT IS SPARK? • Open-source Apache computing engine • Bigger-than-memory data, low-latency distributed computing • Can integrate with the Hadoop ecosystem • Built-in machine learning
  4. Introducing… sparklyr ("SPARRK-lee ARR") http://spark.rstudio.com
  5. sparklyr • New open-source R package from RStudio • Complete dplyr back-end for Spark • Integrated with the RStudio IDE • Extensible foundation for Spark + R
  6. KICK THE TIRES
     library(sparklyr)
     spark_install()
     sc <- spark_connect("local")
     my_tbl <- copy_to(sc, iris)
     [Diagram: Spark architecture: Driver Program with Spark Context, Cluster Manager, Worker with Executor, Tasks, and Cache]
  7. USE DPLYR TO WRITE SPARK SQL
     library(dplyr)
     # use standard verbs to filter and aggregate
     select(
       filter(my_tbl, Petal_Width < 0.3),
       Petal_Length, Petal_Width
     )
     # use magrittr pipes for a cleaner syntax
     my_tbl %>%
       filter(Petal_Width < 0.3) %>%
       select(Petal_Length, Petal_Width)
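The dplyr pipeline on this slide is not executed in R; the verbs are translated into Spark SQL and run on the cluster. A minimal sketch of inspecting that translation with dplyr's `show_query()`, assuming a local Spark connection and the `iris` data copied over as on the previous slide (note that `copy_to()` replaces the dots in column names like `Petal.Width` with underscores):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance and copy the iris data into it
sc <- spark_connect(master = "local")
my_tbl <- copy_to(sc, iris, overwrite = TRUE)

# show_query() prints the Spark SQL the dplyr verbs translate to,
# without pulling any data back into R
my_tbl %>%
  filter(Petal_Width < 0.3) %>%
  select(Petal_Length, Petal_Width) %>%
  show_query()

spark_disconnect(sc)
```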
  8. Demo
  9. RUN IN PRODUCTION
     Use RStudio Server on the Spark cluster master node:
     > sc <- spark_connect("spark://spark.company.org:7077")
     > my_tbl <- tbl(sc, "tblname")
     [Diagram: Spark cluster: Master Node running the Driver Program, Spark Context, and Cluster Manager; Worker Nodes each running an Executor with Tasks and Cache]
  10. SPARKLYR FUNCTIONALITY • Full dplyr back-end for Spark DataFrames • R wrappers for all MLlib functions • Easily leverage from R Markdown, Shiny, etc. • IDE integration • Windows, too!
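As one example of the MLlib wrappers, fitting a linear regression on the Spark-side copy of `iris` might look like the following. This is a sketch assuming a local connection; `ml_linear_regression()` is one of sparklyr's `ml_*` wrapper functions, and the model is trained by Spark MLlib on the cluster rather than in R:

```r
library(sparklyr)
library(dplyr)

# Connect locally and copy iris into Spark
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# Fit MLlib's linear regression through the sparklyr wrapper;
# only the fitted coefficients come back to the R session
fit <- iris_tbl %>%
  ml_linear_regression(Petal_Length ~ Petal_Width)

summary(fit)

spark_disconnect(sc)
```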
  11. SPARKLYR & H2O • rsparkling: R interface to the Sparkling Water Spark package • A sparklyr extension • http://spark.rstudio.com/h2o.html
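A rough sketch of how the rsparkling extension bridges a sparklyr table into H2O. Function names such as `as_h2o_frame()` follow rsparkling's documented interface and may differ across versions; the H2O model call (`h2o.glm()`) is standard H2O R API:

```r
library(sparklyr)
library(rsparkling)
library(h2o)

# Connect to Spark and copy a sample data set over
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Convert the Spark DataFrame into an H2O frame via Sparkling Water
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl)

# From here any H2O algorithm applies, e.g. a generalized linear model
fit <- h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = mtcars_h2o)
```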
  12. RELATIONSHIP TO SPARKR • Working together to establish a common extension API • Some differences in approach: • CRAN distribution • dplyr compatibility
  13. SPARKR vs. DPLYR
  14. APPLICATIONS http://spark.rstudio.com/examples.html
  15. QUESTIONS? http://spark.rstudio.com @trestleJeff https://github.com/trestletech/user2016-sparklyr
