
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Zaidi


A growing number of data scientists use R as their primary language. While the SparkR API has made tremendous progress since release 1.6, with major advancements in Apache Spark 2.0 and 2.1, it can be difficult for traditional R programmers to embrace the Spark ecosystem.

In this session, Zaidi will discuss the sparklyr package, a feature-rich and tidy interface for data science with Spark, and will show how it can be coupled with Microsoft R Server and extended with its lower-level API to become a full, first-class citizen of Spark. Learn how easy it is to go from single-threaded, memory-bound R functions to multi-threaded, multi-node, out-of-memory applications that can be deployed in a distributed cluster environment with minimal code changes. You’ll also get best practices for reproducibility and performance by looking at a real-world case study of default risk classification and prediction entirely through R and Spark.

Published in: Data & Analytics

  1. Extending R’s API with Microsoft R Server and Spark • Ali Zaidi, Data Scientist @Microsoft • akzaidi • @alikzaidi
  2. Incredible R Speakers and Talks • Felix Cheung -- SSR: Structured Streaming on R for Machine Learning – Tuesday, 11:00 AM – 11:30 AM • Javier Luraschi -- Sparklyr: Recap, Updates and Use Cases – Wednesday, 2:00 PM – 3:10 PM • Hossein Falaki -- Apache SparkR Under the Hood: How to Debug your SparkR Applications – Wednesday, 4:20 PM – 4:50 PM • Navdeep Gill -- From R Script to Production Using rsparkling – Wednesday, 5:00 PM – 5:30 PM
  3. R’s popularity is growing rapidly • Language Popularity: IEEE Spectrum Top Programming Languages • R Usage Growth: Rexer Data Miner Survey, 2007-2013
  4. What is R? (Language, Platform, Community, Ecosystem) • A statistics programming language • A data visualization tool • Open source • 2.5+M users • Taught in most universities • Thriving user groups worldwide • 10,000+ free algorithms in CRAN • Scalable to big data • New and recent grads use it • Rich application & platform integration
  5. R as an Interface • “[Y]ou should understand R as a user interface, or a language that’s capable of providing very good user interfaces” – JJ Allaire, RStudio
  6. Distributions of R
  7. R APIs for Spark PR#78 • SparkR • ASF, Apache-licensed • Ships with Apache Spark since 1.4.x • SparkSQL and SparkML support through RPC • UDF support through gapply, dapply, spark.lapply
  8. MRS in Different Contexts • On a workstation, that means: – All available cores will be used for math operations and parallel processes – Hard drive capacity, not RAM, sets the limit for data size • On a cluster: – Parallel utilization of all available nodes – Distributed file systems like HDFS greatly expand possible data sizes
  9. MRS in Different Contexts • Code written on a workstation will run on a cluster by tweaking a single function call:
         # Use your local computer:
         rxSetComputeContext(RxLocalParallel())
         # Switch to your cluster:
         rxSetComputeContext(RxSpark(...))
  10. How MRS Works • Parallel External Memory Algorithms (PEMAs): (1) A chunk/subset of data is extracted from the main dataset (2) An intermediate result is calculated from that chunk of data (3) The intermediate results are combined into a final dataset
  11. How Does Remote Compute Context Work? • Microsoft R Server functions: • A compute context defines where to process • E.g. a remote context like Hadoop MapReduce • Microsoft R functions are prefixed with rx • The currently set compute context determines the processing location • Diagram labels: Console (R IDE or command-line), Microsoft R Client, Microsoft R Server, “Pack and Ship” Requests to Remote Environments, Remote Context, Algorithm Master, Predictive Algorithm, Big Data, Load Block At A Time, Analyze Blocks In Parallel, Distribute Work, Compile Results, Results • Copyright Microsoft Corporation. All rights reserved.
  12. Parallelized, Remote Execution Algorithms
      • Data Step: Data import (Delimited, Fixed, SAS, SPSS, ODBC); Variable creation & transformation; Recode variables; Factor variables; Missing value handling; Sort, Merge, Split; Aggregate by category (means, sums)
      • Descriptive Statistics: Min / Max, Mean, Median (approx.); Quantiles (approx.); Standard Deviation; Variance; Correlation; Covariance; Sum of Squares (cross product matrix for set variables); Pairwise Cross tabs; Risk Ratio & Odds Ratio; Cross-Tabulation of Data (standard tables & long form); Marginal Summaries of Cross Tabulations
      • Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher’s Exact Test; Student’s t-Test
      • Sampling: Subsample (observations & variables); Random Sampling; Stratified Sampling
      • Predictive Models: Sum of Squares (cross product matrix for set variables); Quantiles (approx.); Generalized Linear Models (GLM) with exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchy, identity, log, logit, probit), and user-defined distributions & link functions; Covariance & Correlation Matrices; Logistic Regression, SDCA; Classification & Regression Trees, Random Forest, Neural Networks, Convolutional Neural Networks; Ensemble Algorithms
      • Variable Selection: Stepwise Regression
      • Simulation: Simulation (e.g. Monte Carlo); Parallel Random Number Generation
      • Cluster Analysis: K-Means
      • Classification: Decision Trees; Decision Forests; Gradient Boosted Decision Trees; Naïve Bayes
      • Combination: rxDataStep; rxExec, rxExecBy; PEMA-R API Custom Algorithms
  13. Sharing a Spark Session • In Microsoft R Server (MRS) 9.1, you can create a single Spark session from MRS and connect sparklyr to that same session • This lets you share data across MRS and sparklyr, and use functions from either package in the same pipeline • You can also share the session with any other sparklyr extension packages, like rsparkling • MRS has readers for Hive, using the RxHiveData source reader, so anything you persist to the Hive metastore from SparkSQL/sparklyr can be consumed by MRS
  14. sparklyr’s Role in the Tidyverse
  15. Spark Abstractions for R • RDD: distributed immutable list • DataFrame: distributed data.frame (tbl_spark, tbl_sql) • Transformers, functions that take an RDD/DataFrame in and push another RDD/DataFrame out: lazy computations (dplyr, sql, sdf_transforms, ml_transformer) • Actions, functions that take an RDD/DataFrame in and put anything else out: eager queries (dplyr::collect, head)
  16. The Power of Interfaces and Abstractions • Interfaces and abstractions make it easier to write reusable code • For dplyr, Spark SQL method dispatch occurs at the tbl_sql class • Laziness occurs at the tbl_spark class • We can develop scripts with dplyr that manipulate other tbl objects: tbl_df, tbl_mysql, etc. • When we're ready to run our code in Spark, simply change the data source
  17. Scaling Analytics from Single Machines to Clusters • Suppose we had a local, single machine running a dplyr method:
  18. Changing the Source • These are not sparklyr functions, but regular dplyr methods dispatched on the right tbl class
  19. Can You Explain That? • If we try the explain method, we can see the code generated by dplyr
  20. Two-Table Joins and Lazy Operations
  21. Distributed Computing with MRS • MRS has two packages for distributed machine learning and predictive modeling: RevoScaleR (primarily written in C++; Spark extensions written in Scala) and MicrosoftML (primarily written in C#; CLR bridge to the MRS process, BxlServer) • Both packages are executed as parallel external memory algorithms, and can consume a variety of data sources: Hive tables, Parquet, text, ODBC, and many more
  22. From sparklyr to MRS • Spark DataFrame: tbl_spark • Parquet: spark_write_parquet • Hive: sdf_register • MRS: rxDataStep(outFile = xdfd)
  23. Training ML Models with MRS • Same intuitive syntax we are used to from all CRAN-R modeling functions
  24. Training Ensembles • MRS provides a convenient function to parallelize training across your cluster and combine the models into an ensemble • It creates an ensemble by sampling over the training data and features, and aggregating over the predictions
  25. Distributed Hyperparameter Optimization • You can conduct cross-validation and hyperparameter tuning using foreach or rxExec
  26. Pretrained Algorithms for Transfer Learning • MicrosoftML includes pre-trained ImageNet models that you can use to directly featurize your images, and that you can fine-tune for your data
  27. Streaming Extensions • sparklyr::invoke • sparkstreaming | structlystreams • sparkstreaming and structlystreams interface to Spark Streaming (RDD) and Structured Streams respectively • sparkstreaming:
  28. Structured Streams • structlystreams:
  29. GraphFrames Extensions • GraphFrames • kevinykuo/sparklygraphs: R interface for GraphFrames
  30. Thanks!
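
The three PEMA steps listed on the "How MRS Works" slide (extract a chunk, compute an intermediate result, combine the intermediate results) can be illustrated with a minimal base R sketch. This is not how RevoScaleR implements its algorithms. `chunked_mean` is a hypothetical helper written only for this illustration, it assumes a non-empty numeric vector that already fits in memory, and it processes the chunks sequentially rather than in parallel.

```r
# Hypothetical helper (not an MRS or RevoScaleR function): computes the mean
# of a non-empty numeric vector one chunk at a time, in the spirit of a PEMA.
chunked_mean <- function(x, chunk_size = 1000L) {
  starts <- seq(1L, length(x), by = chunk_size)

  # Steps 1 and 2: extract each chunk and reduce it to an intermediate
  # result that is small enough to keep around (a partial sum and a count).
  partials <- lapply(starts, function(s) {
    chunk <- x[s:min(s + chunk_size - 1L, length(x))]
    c(sum = sum(chunk), n = length(chunk))
  })

  # Step 3: combine the intermediate results into the final answer.
  totals <- Reduce(`+`, partials)
  unname(totals["sum"] / totals["n"])
}

x <- as.numeric(1:10000)
chunked_mean(x, chunk_size = 1024L)
```

Because each chunk is reduced to a partial sum and a count, only one chunk has to be held in memory at a time, and the per-chunk work is independent, which is what makes this pattern a candidate for distribution across cores or nodes.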