There’s a growing number of data scientists who use R as their primary language. While the SparkR API has made tremendous progress since release 1.6, with major advancements in Apache Spark 2.0 and 2.1, it can be difficult for traditional R programmers to embrace the Spark ecosystem.
In this session, Zaidi will discuss the sparklyr package, a feature-rich and tidy interface for data science with Spark, and will show how it can be coupled with Microsoft R Server and extended with its lower-level API to become a full, first-class citizen of Spark. Learn how easy it is to go from single-threaded, memory-bound R functions to multi-threaded, multi-node, out-of-memory applications that can be deployed in a distributed cluster environment with minimal code changes. You’ll also get best practices for reproducibility and performance by looking at a real-world case study of default-risk classification and prediction carried out entirely in R and Spark.
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Zaidi
1. Ali Zaidi
Data Scientist @Microsoft
akzaidi
Extending R’s API with Microsoft R Server and Spark
@alikzaidi
2. Incredible R Speakers and Talks
• Felix Cheung -- SSR: Structured Streaming on R for Machine Learning
– Tuesday, 11:00 AM – 11:30 AM
• Javier Luraschi -- Sparklyr: Recap, Updates and Use Cases
– Wednesday, 2:00 PM – 3:10 PM
• Hossein Falaki -- Apache SparkR Under the Hood: How to Debug your SparkR Applications
– Wednesday, 4:20 PM – 4:50 PM
• Navdeep Gill -- From R Script to Production Using rsparkling
– Wednesday, 5:00 PM – 5:30 PM
3. Language Popularity
[Chart: IEEE Spectrum Top Programming Languages – R’s popularity is growing rapidly]
[Chart: R Usage Growth – Rexer Data Miner Survey, 2007–2013]
4. What is R?
Language
• A statistics programming language
• A data visualization tool
• Open source
Community
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• New and recent grads use it
Ecosystem
• 10,000+ free algorithms in CRAN
• Scalable to big data
Platform
• Rich application & platform integration
5. R as an Interface
“[Y]ou should understand R as a user interface, or a language that’s capable of providing very good user interfaces.”
– JJ Allaire, RStudio
8. R APIs for Spark
• SparkR (PR#78)
– ASF, Apache-licensed
– Ships with Apache Spark since 1.4.x
– SparkSQL and SparkML support through RPC
– UDF support through gapply, dapply, spark.lapply
9. MRS in Different Contexts
• On a workstation, that means:
– All available cores will be used for math operations and parallel processes
– Hard drive capacity, not RAM, sets the limit on data size
• On a cluster:
– Parallel utilization of all available nodes
– Distributed file systems like HDFS greatly expand possible data sizes
10. MRS in Different Contexts
Code written on a workstation will run on a cluster by tweaking a single function call:
# Use your local computer:
rxSetComputeContext( RxLocalParallel() )
# Switch to your cluster:
rxSetComputeContext( RxSpark(...) )
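To make the single-function-call switch concrete, here is a sketch of the same model fit run in both contexts (the data source and formula are illustrative, not from the talk):

```r
library(RevoScaleR)

air_xdf <- RxXdfData("airline.xdf")   # illustrative data source

# Run locally, using all available cores
rxSetComputeContext(RxLocalParallel())
fit_local <- rxLinMod(ArrDelay ~ DayOfWeek, data = air_xdf)

# The identical call, now distributed over the Spark cluster
rxSetComputeContext(RxSpark())        # cluster connection details omitted
fit_spark <- rxLinMod(ArrDelay ~ DayOfWeek, data = air_xdf)
```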
11. How MRS Works: Parallel External Memory Algorithms (PEMAs)
1. A chunk/subset of data is extracted from the main dataset
2. An intermediate result is calculated from that chunk of data
3. The intermediate results are combined into a final result
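The three steps above can be sketched in plain R as a chunk-wise mean (a toy illustration of the PEMA pattern, not the actual MRS implementation):

```r
# Toy chunk-wise mean, mirroring the extract / compute / combine steps
chunked_mean <- function(data, chunk_size = 1000) {
  n <- length(data)
  starts <- seq(1, n, by = chunk_size)
  # Steps 1-2: extract each chunk and compute an intermediate result
  partials <- lapply(starts, function(s) {
    chunk <- data[s:min(s + chunk_size - 1, n)]
    list(sum = sum(chunk), count = length(chunk))
  })
  # Step 3: combine the intermediates into the final result
  total <- Reduce(function(a, b) list(sum = a$sum + b$sum,
                                      count = a$count + b$count),
                  partials)
  total$sum / total$count
}

chunked_mean(1:10000)  # -> 5000.5
```

Because each chunk only needs its own sum and count, the intermediate steps can run on different cores or nodes, which is what makes the algorithm external-memory and parallel.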
12. How Does Remote Compute Context Work?
[Diagram: the master predictive algorithm “packs and ships” requests to remote environments; big data is loaded one block at a time, blocks are analyzed in parallel by Microsoft R Server functions, work is distributed, and results are compiled and returned.]
• A compute context defines where processing happens, e.g. a remote context like Hadoop MapReduce
• Microsoft R functions are prefixed with rx
• The currently set compute context determines the processing location
Copyright Microsoft Corporation. All rights reserved.
[Diagram: Microsoft R Client (console, R IDE, or command line) sending requests to Microsoft R Server in a remote context.]
13. Parallelized, Remote Execution Algorithms
Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation, Variance
• Correlation, Covariance
• Sum of Squares (cross-product matrix for set variables)
• Pairwise cross tabs
• Risk Ratio & Odds Ratio
• Cross-tabulation of data (standard tables & long form)
• Marginal summaries of cross tabulations
Statistical Tests
• Chi-Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
• Stratified Sampling
Predictive Models
• Generalized Linear Models (GLM): exponential-family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchy, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression (SDCA)
• Classification & Regression Trees, Random Forest
• Neural Networks, Convolutional Neural Networks
• Ensemble Algorithms
Variable Selection
• Stepwise Regression
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Cluster Analysis
• K-Means
Classification
• Decision Trees, Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Combination
• rxDataStep
• rxExec, rxExecBy
Custom Algorithms
• PEMA-R API
14. Sharing a Spark Session
• In Microsoft R Server 9.1 (MRS), you can create a single Spark session from MRS and connect sparklyr to that same session
• This lets you share data across MRS and sparklyr, and use functions from either package in the same pipeline
• The session can also be shared with any other sparklyr extension packages, like rsparkling
• MRS has readers for Hive via the RxHiveData source, so anything you persist to the Hive metastore from SparkSQL/sparklyr can be consumed by MRS
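A sketch of how the shared session might be set up (assumes MRS 9.1 on a Spark cluster; the table and variable names are illustrative):

```r
library(sparklyr)
library(dplyr)
library(RevoScaleR)

# Create one Spark session from MRS and hand it to sparklyr
cc <- rxSparkConnect(reset = TRUE, interop = "sparklyr")
sc <- rxGetSparklyrConnection(cc)

# Work on the data with sparklyr/dplyr...
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
spark_write_table(mtcars_tbl, "mtcars_hive")  # persist to the Hive metastore

# ...then consume the same table from MRS via RxHiveData
hive_df <- RxHiveData(table = "mtcars_hive")
rxSummary(~ mpg + wt, data = hive_df)
```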
16. Spark Abstractions for R
• RDD → a distributed immutable list
• DataFrame → a distributed data.frame: tbl_spark, tbl_sql
• Transformer (a function that takes an RDD/DataFrame in and pushes another RDD/DataFrame out) → lazy computations: dplyr, sql, sdf_transforms, ml_transformer
• Action (a function that takes an RDD/DataFrame in and puts anything else out) → eager queries: dplyr::collect, head
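The lazy/eager split shows up directly in a small sparklyr pipeline (a sketch assuming a local Spark install and the nycflights13 package):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

# Transformation: lazy -- only builds a Spark SQL plan, nothing runs yet
delays <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

# Action: eager -- triggers execution and returns a local tibble
collect(delays)
```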
17. The Power of Interfaces and Abstractions
• Interfaces and abstractions make it easier to write reusable code
• For dplyr, Spark SQL method dispatch occurs at the tbl_sql class
• Laziness occurs at the tbl_spark class
• We can develop scripts with dplyr that manipulate other tbl objects: tbl_df, tbl_mysql, etc.
• When we’re ready to run our code in Spark, we simply change the data source
18. Scaling Analytics from Single Machines to Clusters
Suppose we had a local, single machine running a dplyr method:
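For instance, a pipeline on an ordinary in-memory tbl_df might look like this (nycflights13 is used as an illustrative dataset, standing in for the code shown on the slide):

```r
library(dplyr)

flights_df <- nycflights13::flights   # an ordinary local tbl_df

flights_df %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay)) %>%
  arrange(desc(mean_delay))
```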
19. Changing the Source
• Not sparklyr functions; regular dplyr methods dispatched on the right tbl class
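The same pipeline runs unchanged once the source is a tbl_spark (a sketch; assumes a local Spark connection and nycflights13):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")  # a tbl_spark

# Identical dplyr code; dispatch on tbl_spark pushes it to Spark SQL
flights_tbl %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay)) %>%
  arrange(desc(mean_delay))
```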
20. Can You Explain That?
• If we try the explain method, we can see the SQL code generated by dplyr
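For example (the flights query is illustrative):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  dplyr::explain()   # prints the generated SQL and Spark's query plan
```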
22. Distributed Computing with MRS
• MRS has two packages for distributed machine learning and predictive modeling:
– RevoScaleR: primarily written in C++; Spark extensions written in Scala
– MicrosoftML: primarily written in C#; CLR bridge to the MRS process (BxlServer)
• Both packages execute as parallel external memory algorithms, and can consume a variety of data sources:
– Hive tables
– Parquet
– Text
– ODBC, and many more
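Opening these sources in MRS is a matter of choosing the right data-source constructor (a sketch; paths, table names, and the DSN are illustrative):

```r
library(RevoScaleR)

hive_src    <- RxHiveData(table = "defaults")
parquet_src <- RxParquetData(file = "/data/defaults.parquet")
text_src    <- RxTextData(file = "/data/defaults.csv")
odbc_src    <- RxOdbcData(table = "defaults",
                          connectionString = "DSN=finance")

# Any of these can feed rx* functions directly
rxGetInfo(parquet_src, getVarInfo = TRUE)
```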
23. From sparklyr to MRS
Spark DataFrame (tbl_spark)
→ Parquet via spark_write_parquet
→ Hive via sdf_register
→ MRS via rxDataStep(outFile = xdfd)
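One possible path from a tbl_spark to an XDF that MRS can consume (a sketch; the dataset and paths are illustrative):

```r
library(sparklyr)
library(RevoScaleR)

sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "mtcars")

# Persist the sparklyr table to Parquet (or register it in Hive)
spark_write_parquet(cars_tbl, "/share/mtcars_parquet")

# Read the Parquet from MRS and land it in XDF for the rx* functions
pq   <- RxParquetData(file = "/share/mtcars_parquet")
xdfd <- RxXdfData("/share/mtcars.xdf")
rxDataStep(inData = pq, outFile = xdfd, overwrite = TRUE)
```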
24. Training ML Models with MRS
• The same intuitive formula syntax we’re used to from CRAN R modeling functions
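A sketch using the talk’s default-risk theme (the variable names and XDF sources are illustrative):

```r
library(RevoScaleR)

train_xdf <- RxXdfData("/share/defaults_train.xdf")

# Formula syntax just like glm(); runs distributed under the Spark context
fit <- rxLogit(default ~ balance + income + utilization,
               data = train_xdf)
summary(fit)

# Score held-out data in parallel
rxPredict(fit,
          data    = RxXdfData("/share/defaults_test.xdf"),
          outData = RxXdfData("/share/defaults_scored.xdf"))
```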
25. Training Ensembles
• MRS provides a convenient function to parallelize training across your cluster and combine the models into an ensemble
• It creates the ensemble by sampling over the training observations and features, and aggregating over the predictions
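This corresponds to rxEnsemble in MicrosoftML; a sketch of how it might be called (the formula, data source, and trainer choices are illustrative, not from the slides):

```r
library(MicrosoftML)
library(RevoScaleR)

train_xdf <- RxXdfData("/share/defaults_train.xdf")

# Train several heterogeneous learners and combine their predictions
ens <- rxEnsemble(
  formula  = default ~ balance + income + utilization,
  data     = train_xdf,
  type     = "binary",
  trainers = list(fastTrees(), fastForest(), logisticRegression())
)
```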
27. Pretrained Algorithms for Transfer Learning
• MicrosoftML includes pre-trained ImageNet models that you can use to featurize your images directly, and fine-tune them for your data
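A sketch of image featurization with one of the pretrained models (paths, variable names, and the choice of resnet18 are illustrative):

```r
library(MicrosoftML)

images <- data.frame(path = c("img/cat1.jpg", "img/cat2.jpg"),
                     stringsAsFactors = FALSE)

# Chain of ML transforms: load, resize, extract pixels, featurize via a DNN
features <- rxFeaturize(
  data = images,
  mlTransforms = list(
    loadImage(vars = list(Image = "path")),
    resizeImage(vars = "Image", width = 224, height = 224),
    extractPixels(vars = list(Pixels = "Image")),
    featurizeImage(var = "Pixels", dnnModel = "resnet18")
  )
)
# 'features' now holds DNN features usable as inputs to any rx/ML learner
```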
28. Streaming Extensions
• sparklyr::invoke
• sparkstreaming | structlystreams
• sparkstreaming and structlystreams interface to Spark Streaming (RDDs) and Structured Streaming, respectively
• sparkstreaming:
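sparklyr::invoke lets R call arbitrary JVM methods, which is how extension packages like these reach Spark APIs that sparklyr doesn’t wrap; a minimal sketch (the streaming class and constructor are shown purely for illustration):

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Call a method on the underlying JVM SparkContext
sc %>% spark_context() %>% invoke("version")

# Instantiate JVM objects directly, e.g. a streaming context
ssc <- invoke_new(sc, "org.apache.spark.streaming.StreamingContext",
                  spark_context(sc),
                  invoke_static(sc, "org.apache.spark.streaming.Durations",
                                "seconds", 1L))
```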