R and Spark: How to Analyze Data Using RStudio's Sparklyr and H2O's Rsparkling Packages: Spark Summit East talk by Nathan Stephens

Sparklyr is an R package that lets you analyze data in Spark while using familiar tools in R. Sparklyr provides a complete backend for dplyr, a popular tool for working with data frame objects both in memory and out of memory. You can use dplyr to translate R code into Spark SQL. Sparklyr also supports MLlib, so you can run classifiers, regressions, clustering, decision trees, and many more machine learning algorithms on your distributed data in Spark. With sparklyr you can analyze amounts of data that would not normally fit into R's memory, then collect the results from Spark into R for further visualization and documentation.
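
The workflow described above might look roughly like the following. This is a minimal sketch, assuming a local Spark installation (for example one set up with `spark_install()`) and R's built-in `iris` data set; the table name `"iris"` and the variable names are illustrative and do not come from the talk.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally, e.g. via sparklyr::spark_install().
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark. sparklyr replaces the dots in
# column names with underscores (Petal.Width becomes Petal_Width).
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Build a lazy dplyr pipeline; nothing is executed in Spark yet.
small_petals <- iris_tbl %>%
  filter(Petal_Width < 0.3) %>%
  select(Petal_Length, Petal_Width)

# Print the Spark SQL that dplyr generates for this pipeline.
show_query(small_petals)

# Run the query in Spark and bring the (small) result back into R.
local_df <- collect(small_petals)

spark_disconnect(sc)
```

Only the filtered rows cross from Spark into R, so the local data frame stays small even when the source table is large.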

Sparklyr is also extensible. You can create R packages that depend on sparklyr to call the full Spark API. One example of an extension is H2O’s rsparkling, an R package that works with H2O’s machine learning algorithms. With sparklyr and rsparkling you have access to all the tools in H2O for analysis with R and Spark.
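
As a rough illustration of what calling the Spark API from R can look like, the sketch below uses sparklyr's `invoke()` and `spark_dataframe()` functions. The helper name `sdf_row_count` is hypothetical and introduced only for this example, and a local Spark installation is assumed.

```r
library(sparklyr)

# Hypothetical helper (not part of sparklyr): count rows by calling the
# count() method of the underlying Spark DataFrame directly.
sdf_row_count <- function(tbl) {
  tbl %>%
    spark_dataframe() %>%  # get a reference to the JVM DataFrame object
    invoke("count")        # call its count() method through the Spark API
}

# Assumption: Spark is installed locally.
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

n <- sdf_row_count(mtcars_tbl)
print(n)

spark_disconnect(sc)
```

An extension package would typically wrap calls like this in exported R functions, so that its users never need to touch `invoke()` themselves.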

In this presentation I will demonstrate how to analyze data in Spark by using sparklyr and rsparkling.

Published in: Data & Analytics

  1. ANALYZE DATA USING RSTUDIO'S SPARKLYR
     R AND SPARK
     https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast
  2. Apache Spark
     • Huge investments in big data and Hadoop
     • Data scientists wanting to analyze data at scale
     • Rapid progress and adoption in Spark libraries
     R and RStudio
     • Wide range of tools and packages
     • Powerful ways to share insights
     • Interactive notebooks
     • Great visualizations
     What we hear from our customers
  3. Best of both worlds
     If you are investing in Spark, then there is nothing stopping you from using it with the full power of R
     Using R with Spark
  4. Benefits of Spark for the R user
     Apache Spark…
     • Can integrate with Hadoop
     • Supports familiar SQL syntax
     • Has built-in machine learning
     • Is designed for performance
     • Great for interactive data analysis
     R users can take advantage of all these investments
  5. New! Open-source R package from RStudio
     • Integrated with the RStudio IDE
     • Sparklyr is a dplyr back-end for Spark
     • Extensible foundation for Spark applications and R
     sparklyr
     http://spark.rstudio.com/
  6. Create your own R packages with interfaces to Spark
     • Interfaces to custom machine learning pipelines
     • Interfaces to 3rd party Spark packages
     • Many other R interfaces
     sparklyr extensions
     Example: Count the number of lines in a file
     Extension
       library(sparklyr)
       count_lines <- function(sc, file) {
         spark_context(sc) %>%
           invoke("textFile", file, 1L) %>%
           invoke("count")
       }
     Call
       sc <- spark_connect(master = "local")
       count_lines(sc, "hdfs://path/data.csv")
  7. R for data science toolchain
     “You’ll learn how to get your data into R [with Spark], get it into the most useful structure, transform it, visualize it and model it [with Spark].”
  8. Import
     Create a connection
       sc <- spark_connect()
     Import data from file/S3/HDFS/R
       spark_read_csv(sc, "table", "hdfs://<path>")
       sdf_copy_to(sc, table, "table")
       nyct2010_tbl <- tbl(sc, "table")
     Write data
       spark_write_parquet(table, "hdfs://<path>")
     Sparklyr
     Connect to Spark. Read and write data in CSV, JSON, and Parquet formats. Data can be stored in HDFS, S3, or on the local filesystem.
  9. Wrangle
     dplyr
       my_tbl %>%
         filter(Petal_Width < 0.3) %>%
         select(Petal_Length, Petal_Width)
     Spark SQL
       select Petal_Length, Petal_Width
       from mytable
       where Petal_Width < 0.3
     Use dplyr to write Spark SQL
     A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
  10. Visualize
      ggplot2
        collect(mpg_tbl) %>%
          ggplot() +
          aes(displ, hwy, color = class) +
          geom_point()
      Use ggplot2 to visualize data collected from Spark
      A plotting system for R that makes it easy to produce complex multi-layered graphics.
  11. Model
      Models: K-means, Linear regression, Logistic regression, Survival regression, Generalized linear regression, Decision trees, Random forests, Gradient boosted trees, Principal component analysis, Naive Bayes, Multilayer perceptron, Latent Dirichlet allocation, One vs rest
      Industry Specific: Chemometrics, ClinicalTrials, Econometrics, Environmetrics, Finance, Genetics, Pharmacokinetics, Phylogenetics, Psychometrics, Social Sciences
      Models: GLMNet, Bayesian regression, Multinomial regression, Random Forest, Gradient boosted machine, Decision trees, Multi-Layer Perceptron, Auto-encoder, Restricted Boltzmann, K-Means, LSH, SVD, ALS, ARIMA Forecasting, Collaborative filtering, Solvers and optimization
      General Topics: Machine Learning, Bayesian, Cluster, Design of experiments, ExtremeValue, Meta Analysis, Multivariate, NLP, Robust methods, Spatial, Survival, Time Series, Graphical models
      No data movement required. Native ML algorithms. Fast growing ecosystem.
      Over 10,000 packages. Time tested, industry specific models. Integrated with other R packages.
      Scalable, high performance models. Wide variety of algorithms. Useful diagnostics.
      MLlib
  12. Communicate
      R Markdown Notebooks
      Make decisions
      Take actions
      See results
      Weave together text and code to produce high quality documents, apps, and plots.
      Share
  13. Demo
      Analyzing 1 billion records with Spark and R
      http://colorado.rstudio.com:3939/content/262/
      https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast
  14. rsparkling extension
      Spark is extensible… sparklyr is extensible
      https://github.com/h2oai/rsparkling/blob/master/R/h2o_context.R#L53
      [Diagram: Spark, R, H2O, rsparkling, sparklyr, h2o, sparkling water]
  15. Where should I model my data?
      Benefits: No data movement required. Native ML algorithms. Fast growing ecosystem.
      Limitations: Comparatively fewer algorithms and fewer diagnostics.
      Benefits: Scalable, high performance models. Wide variety of algorithms. Useful diagnostics.
      Limitations: Data conversion requires 3-4X memory. Added complexity around introducing and learning another tool.
      Benefits: Access to CRAN packages, visualization, reporting tools, and time tested algorithms.
      Limitations: Data collection is expensive and collection size is limited (< 10 GB).
      Others…
      MLlib
  16. What’s new with sparklyr?
  17. spark.rstudio.com
      https://github.com/rstudio/sparkDemos/tree/master/prod/presentations/sparkSummitEast
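
The slides list MLlib models and note that modeling in Spark requires no data movement, while collecting data into R is expensive. The sketch below illustrates that trade-off; it is not code from the talk. It assumes a local Spark installation, uses R's built-in `mtcars` data as a stand-in for a large table, and assumes a recent sparklyr release in which `ml_predict()` is available.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally.
sc <- spark_connect(master = "local")

# mtcars is a stand-in for a table too large to hold in R.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Fit a linear regression with Spark MLlib; the data stays in Spark.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

# Score inside Spark, then collect only the columns needed in R.
# Assumption: ml_predict() exists in the installed sparklyr version.
predictions <- ml_predict(fit, mtcars_tbl) %>%
  select(mpg, prediction) %>%
  collect()

spark_disconnect(sc)
```

Selecting columns before `collect()` keeps the transfer into R small, which matters given the collection limits mentioned on the slides.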
