Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Zaidi

A growing number of data scientists use R as their primary language. While the SparkR API has made tremendous progress since release 1.6, with major advancements in Apache Spark 2.0 and 2.1, it can be difficult for traditional R programmers to embrace the Spark ecosystem.

In this session, Zaidi will discuss the sparklyr package, a feature-rich and tidy interface for data science with Spark, and will show how it can be coupled with Microsoft R Server and extended with its lower-level API to become a full, first-class citizen of Spark. Learn how easy it is to go from single-threaded, memory-bound R functions to multi-threaded, multi-node, out-of-memory applications that can be deployed in a distributed cluster environment with minimal code changes. You’ll also get best practices for reproducibility and performance by looking at a real-world case study of default risk classification and prediction entirely through R and Spark.

  1. Ali Zaidi, Data Scientist @Microsoft (akzaidi, @alikzaidi): Extending R’s API with Microsoft R Server and Spark
  2. Incredible R Speakers and Talks
       • Felix Cheung: SSR: Structured Streaming on R for Machine Learning (Tuesday, 11:00 AM – 11:30 AM)
       • Javier Luraschi: Sparklyr: Recap, Updates and Use Cases (Wednesday, 2:00 PM – 3:10 PM)
       • Hossein Falaki: Apache SparkR Under the Hood: How to Debug your SparkR Applications (Wednesday, 4:20 PM – 4:50 PM)
       • Navdeep Gill: From R Script to Production Using rsparkling (Wednesday, 5:00 PM – 5:30 PM)
  3. Language Popularity: R’s popularity is growing rapidly. (Charts: IEEE Spectrum Top Programming Languages; R Usage Growth, Rexer Data Miner Survey, 2007-2013.)
  4. What is R? (Categories shown: Language, Platform, Community, Ecosystem)
       • A statistics programming language
       • A data visualization tool
       • Open source
       • 2.5+M users
       • Taught in most universities
       • Thriving user groups worldwide
       • 10,000+ free algorithms in CRAN
       • Scalable to big data
       • New and recent grads use it
       • Rich application & platform integration
  5. R as an Interface: “[Y]ou should understand R as a user interface, or a language that’s capable of providing very good user interfaces” – JJ Allaire, RStudio
  6. Distributions of R
  7. R APIs for Spark (PR#78)
       • SparkR
       • ASF, Apache-licensed
       • Ships with Apache Spark since 1.4.x
       • SparkSQL and SparkML support through RPC
       • UDF support through gapply, dapply, spark.lapply
  8. MRS in Different Contexts
       • On a workstation, that means:
           – All available cores will be used for math operations and parallel processes
           – Hard drive capacity, not RAM, sets the limit for data size
       • On a cluster:
           – Parallel utilization of all available nodes
           – Distributed file systems like HDFS greatly expand possible data sizes
  9. MRS in Different Contexts: code written on a workstation will run on a cluster by tweaking a single function call:
        # Use your local computer:
        rxSetComputeContext( RxLocalParallel() )
        # Switch to your cluster:
        rxSetComputeContext( RxSpark(...) )
  10. How MRS Works: Parallel External Memory Algorithms (PEMAs)
       1. A chunk/subset of data is extracted from the main dataset
       2. An intermediate result is calculated from that chunk of data
       3. The intermediate results are combined into a final dataset
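
The three PEMA steps above can be sketched in base R. This is only an illustration of the chunk, summarize, combine pattern, not how RevoScaleR implements it: the helper names (`chunk_indices`, `chunk_stats`, `combine_stats`) are hypothetical, and the in-memory `mtcars$mpg` vector stands in for a dataset that would normally be read block by block from disk.

```r
# Step 1 (hypothetical helper): split row indices into fixed-size chunks.
chunk_indices <- function(n, chunk_size) {
  split(seq_len(n), ceiling(seq_len(n) / chunk_size))
}

# Step 2: intermediate result for one chunk (sufficient statistics only).
chunk_stats <- function(x) {
  c(n = length(x), sum = sum(x), sum_sq = sum(x^2))
}

# Step 3: combine the per-chunk results into the final mean and variance.
combine_stats <- function(stats_list) {
  total <- Reduce(`+`, stats_list)
  mean_x <- total[["sum"]] / total[["n"]]
  var_x <- (total[["sum_sq"]] - total[["n"]] * mean_x^2) / (total[["n"]] - 1)
  c(mean = mean_x, var = var_x)
}

x <- mtcars$mpg
chunks <- chunk_indices(length(x), chunk_size = 10)
partials <- lapply(chunks, function(idx) chunk_stats(x[idx]))
result <- combine_stats(partials)
print(result)
```

Because each chunk only contributes a small vector of sums, the chunks can be processed one at a time or on different workers; the sum-of-squares formula is used here for brevity and is not the most numerically stable choice.
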
  11. How Does Remote Compute Context Work?
       • Microsoft R Server functions are prefixed with rx
       • A compute context defines where to process, e.g. a remote context like Hadoop MapReduce
       • The currently set compute context determines the processing location
       • (Diagram labels: Microsoft R Client console, R IDE or command line; “Pack and Ship” requests to remote environments; Microsoft R Server in the REMOTE CONTEXT; algorithm master distributes work and compiles results; predictive algorithm analyzes blocks in parallel; big data, load a block at a time; results. Copyright Microsoft Corporation. All rights reserved.)
  12. Parallelized, Remote Execution Algorithms
       • Data Step: data import (Delimited, Fixed, SAS, SPSS, ODBC); variable creation & transformation; recode variables; factor variables; missing value handling; sort, merge, split; aggregate by category (means, sums)
       • Descriptive Statistics: min / max, mean, median (approx.); quantiles (approx.); standard deviation; variance; correlation; covariance; sum of squares (cross product matrix for set variables); pairwise cross tabs; risk ratio & odds ratio; cross-tabulation of data (standard tables & long form); marginal summaries of cross tabulations
       • Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher’s Exact Test; Student’s t-Test
       • Sampling: subsample (observations & variables); random sampling; stratified sampling
       • Predictive Models: sum of squares (cross product matrix for set variables); quantiles (approx.); Generalized Linear Models (GLM) with exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), and user-defined distributions & link functions; covariance & correlation matrices; logistic regression, SDCA; classification & regression trees, random forest; neural networks, convolutional neural networks; ensemble algorithms
       • Variable Selection: stepwise regression
       • Simulation: simulation (e.g. Monte Carlo); parallel random number generation
       • Cluster Analysis: K-Means
       • Classification: decision trees; decision forests; gradient boosted decision trees; Naïve Bayes
       • Combination: rxDataStep; rxExec, rxExecBy; PEMA-R API
       • Custom Algorithms
  13. Sharing a Spark Session
       • In Microsoft R Server 9.1 (MRS), you can create a single Spark session from MRS and connect sparklyr to that same session
       • This allows you to share data across MRS and sparklyr, and use functions from either package in the same pipeline
       • You can also share the session with any other sparklyr extension packages, like rsparkling
       • MRS has a reader for Hive, the RxHiveData data source, so anything you persist to the Hive metastore from SparkSQL/sparklyr can be consumed by MRS
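
A sketch of what the shared session might look like in code, assuming an MRS 9.1+ installation with RevoScaleR, sparklyr, and dplyr available on a Spark cluster node. The function name `share_spark_session` and the table names are hypothetical, and the body is wrapped in a function so that nothing runs until it is called on such a cluster.

```r
# Hypothetical wrapper; call it only on a machine with MRS 9.1+ and Spark.
share_spark_session <- function() {
  library(RevoScaleR)
  library(sparklyr)
  library(dplyr)

  # Create the MRS Spark compute context with sparklyr interop enabled.
  cc <- rxSparkConnect(reset = TRUE, interop = "sparklyr")
  # Retrieve the sparklyr connection that shares the same Spark session.
  sc <- rxGetSparklyrConnection(cc)

  # Prepare data with sparklyr/dplyr and register the result as a table.
  cars_tbl <- copy_to(sc, mtcars, "cars_raw", overwrite = TRUE)
  training <- cars_tbl %>% filter(hp >= 100)
  sdf_register(training, "cars_training")  # hypothetical table name

  # Read the registered table from MRS and fit a model with an rx function.
  hive_data <- RxHiveData(table = "cars_training")
  model <- rxLinMod(mpg ~ wt + cyl, data = hive_data)

  rxSparkDisconnect(cc)
  model
}
```

The design point is that data preparation stays in dplyr verbs while model fitting uses an rx function, with the registered table as the handoff between the two.
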
  14. Sparklyr’s Role in the Tidyverse
  15. Spark Abstractions for R
       • RDD: distributed immutable list
       • DataFrame: distributed data.frame (tbl_spark, tbl_sql)
       • Transformers (a function that takes an RDD/DataFrame in and pushes another RDD/DataFrame out): lazy computations (dplyr, sql, sdf_transforms, ml_transformer)
       • Actions (a function that takes an RDD/DataFrame in and puts anything else out): eager queries (dplyr::collect, head)
  16. The Power of Interfaces and Abstractions
       • Interfaces and abstractions make it easier to write reusable code
       • For dplyr, Spark SQL method dispatch occurs at the tbl_sql class
       • Laziness occurs at the tbl_spark class
       • We can develop scripts with dplyr that manipulate other tbl objects: tbl_df, tbl_mysql, etc.
       • When we're ready to run our code in Spark, we simply change the data source
  17. Scaling Analytics from Single Machines to Clusters: suppose we had a local, single machine running a dplyr method:
  18. Changing the Source
       • Not sparklyr functions, just regular dplyr methods dispatched for the right tbl
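
The slide code is not in the transcript, so here is a minimal sketch of the idea under stated assumptions: `mpg_by_cyl` is a hypothetical analysis function written only with dplyr verbs, and `mtcars` stands in for the real data. The Spark part is commented out because it assumes sparklyr and a local Spark installation.

```r
library(dplyr)

# Hypothetical analysis function written only with dplyr verbs.
mpg_by_cyl <- function(tbl) {
  tbl %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
    arrange(cyl)
}

# Local, single-machine source: an ordinary data frame.
local_result <- mpg_by_cyl(mtcars)
print(local_result)

# Spark source: only the input changes (assumes sparklyr and a local Spark).
# sc <- sparklyr::spark_connect(master = "local")
# cars_spark <- copy_to(sc, mtcars, "cars")
# spark_result <- mpg_by_cyl(cars_spark) %>% collect()
# sparklyr::spark_disconnect(sc)
```

The function body never mentions Spark; dplyr dispatches on the class of the table it receives, which is what lets the same script move from a data frame to a tbl_spark.
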
  19. Can You Explain That?
       • If we call the explain method, we can see the code generated by dplyr
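
To inspect generated code without a cluster, one option is dbplyr's simulated backend. This sketch assumes the dbplyr package is installed and uses `lazy_frame()` with its default simulated connection, so the SQL shown is generic rather than Spark SQL; on a real tbl_spark, `explain()` would also print the query plan.

```r
library(dplyr)
library(dbplyr)

# lazy_frame() builds a table on a simulated database connection, so the
# generated SQL can be inspected without a database or Spark cluster.
cars_lazy <- lazy_frame(cyl = c(4, 6, 8), mpg = c(30, 20, 15))

query <- cars_lazy %>%
  filter(mpg > 18) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

# Nothing has been computed yet; show_query() prints the SQL dplyr generated.
show_query(query)

# sql_render() returns the same SQL as an object we can inspect.
sql_text <- as.character(sql_render(query))
```
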
  20. Two-Table Joins and Lazy Operations
  21. Distributed Computing with MRS
       • MRS has two packages for distributed machine learning and predictive modeling:
           – RevoScaleR: primarily written in C++, with Spark extensions written in Scala
           – MicrosoftML: primarily written in C#, with a CLR bridge to the MRS process (BxlServer)
       • Both packages are executed as parallel external memory algorithms, and can consume a variety of data sources:
           – Hive tables
           – Parquet
           – Text
           – ODBC, and many more
  22. From sparklyr to MRS (diagram labels: Spark DataFrame, tbl_spark; Parquet, spark_write_parquet; Hive, sdf_register; MRS, rxDataStep(outFile = xdfd))
  23. Training ML Models with MRS
       • The same intuitive syntax we are used to from CRAN-R modeling functions
  24. Training Ensembles
       • MRS provides a convenient function to parallelize training across your cluster and combine the models into an ensemble
       • It creates an ensemble by sampling over the training data and features, and aggregating over the predictions
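
The sampling-and-aggregating idea can be illustrated in base R. This is a single-machine sketch, not the MRS function the slide refers to: `train_ensemble` and `predict_ensemble` are hypothetical helpers, linear models stand in for whatever learner is used, and averaging is assumed as the aggregation rule.

```r
set.seed(42)

# Hypothetical helper: fit each model on a bootstrap sample of rows
# and a random subset of the features.
train_ensemble <- function(data, response, n_models = 25, n_features = 2) {
  predictors <- setdiff(names(data), response)
  lapply(seq_len(n_models), function(i) {
    rows <- sample(nrow(data), replace = TRUE)  # sample over the training rows
    feats <- sample(predictors, n_features)     # sample over the features
    lm(reformulate(feats, response = response),
       data = data[rows, , drop = FALSE])
  })
}

# Hypothetical helper: aggregate by averaging the per-model predictions.
predict_ensemble <- function(models, newdata) {
  preds <- do.call(cbind, lapply(models, function(m) predict(m, newdata)))
  rowMeans(preds)
}

cars <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec")]
models <- train_ensemble(cars, response = "mpg")
ensemble_pred <- predict_ensemble(models, cars)
head(ensemble_pred)
```

Each model is trained independently of the others, which is why this pattern parallelizes naturally across a cluster.
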
  25. Distributed Hyperparameter Optimization
       • Can conduct cross-validation and hyperparameter tuning using foreach or rxExec
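
A base R sketch of the tuning loop, under the assumption that the hyperparameter is a polynomial degree and the model is a linear regression on `mtcars` (both chosen only for illustration). It runs sequentially with `sapply`; the loop over the grid is the part that foreach or rxExec would distribute, since each grid point is evaluated independently.

```r
set.seed(123)

k <- 5
# Assign each row to one of k folds at random.
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))
grid <- 1:4  # hypothetical hyperparameter grid: polynomial degree

# Mean k-fold RMSE for one hyperparameter value.
cv_rmse <- function(degree) {
  fold_err <- sapply(seq_len(k), function(f) {
    train <- mtcars[folds != f, ]
    test <- mtcars[folds == f, ]
    fit <- lm(mpg ~ poly(hp, degree), data = train)
    sqrt(mean((test$mpg - predict(fit, test))^2))
  })
  mean(fold_err)
}

# Sequential here; each call to cv_rmse is independent of the others.
results <- data.frame(degree = grid, rmse = sapply(grid, cv_rmse))
best <- results$degree[which.min(results$rmse)]
print(results)
```
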
  26. Pretrained Algorithms for Transfer Learning
       • MicrosoftML includes pre-trained ImageNet models that you can use to directly featurize your images, and that you can fine-tune for your data
  27. Streaming Extensions
       • sparklyr::invoke
       • sparkstreaming | structlystreams
       • sparkstreaming and structlystreams interface to Spark Streaming (RDD) and Structured Streams respectively
       • sparkstreaming:
  28. Structured Streams
       • structlystreams:
  29. GraphFrames Extensions
       • GraphFrames
       • kevinykuo/sparklygraphs: R interface for GraphFrames
  30. Thanks!