SlideShare a Scribd company logo
1 of 31
Download to read offline
Ali Zaidi
Data Scientist @Microsoft
akzaidi
Extending R’s API with
Microsoft R Server and
Spark
@alikzaidi
Incredible R Speakers and Talks
• Felix Cheung -- SSR: Structured Streaming on R for Machine
Learning
– Tuesday, 11:00 AM – 11:30 AM
• Javier Luraschi -- Sparklyr: Recap, Updates and Use Cases
– Wednesday, 2:00 PM - 3:10 PM
• Hossein Falaki -- Apache SparkR Under the Hood: How to
Debug your SparkR Applications
– Wednesday, 4:20 PM – 4:50 PM
• Navdeep Gill -- From R Script to Production Using rsparkling
– Wednesday, 5:00 PM – 5:30 PM
Language Popularity
IEEE Spectrum Top Programming Languages
R’s popularity is growing rapidly
R Usage Growth
Rexer Data Miner Survey, 2007-2013
Rexer Data Miner Survey
What is
• A statistics programming language
• A data visualization tool
• Open source
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• 10,000+ free algorithms in CRAN
• Scalable to big data
• New and recent grad’s use it
Language
Platform
Community
Ecosystem
• Rich application & platform integration
R as an Interface
[Y]ou should understand R as a user interface,
or a language that’s capable of providing very
good user interfaces
– JJ Allaire, RStudio
Distributions of R
R APIs for Spark
PR#78
• SparkR
• ASF, Apache-
licensed
• Ships with Apache-
Spark since 1.4x
• SparkSQL and
SparkML support
through RPC
• UDF support
through gapply,
dapply,
spark.lapply
• On a workstation, that means:
– All available cores will be used for math operations and parallel
processes
– Hard drive capacity sets limit for data size, not RAM
• On a cluster:
– Parallel utilization of all available nodes
– Distributed file systems like HDFS greatly expand possible data sizes
MRS in Different Contexts
Code written on a workstation will run on a cluster by
tweaking a single function call:
# Use your local computer:
rxSetComputeContext( RxLocalParallel() )
# Switch to your cluster:
rxSetComputeContext( RxSpark(...) )
MRS in Different Contexts
Parallel External Memory Algorithms (PEMAs)
1. A chunk/subset of data is extracted from the main dataset
2. An intermediate result is calculated from that chunk of data
3. The intermediate results are combined into a final dataset
How MRS Works
How	Does	Remote	Compute	Context	Work?
Algorithm
Master
Predictive
Algorithm
Big
Data
Analyze
Blocks In
Parallel
Load Block
At A Time
Distribute Work,
Compile Results
“Pack and Ship”
Requests to
Remote
Environments
Results
Microsoft R Server functions
• A compute context defines where to process.
• E.g. remote context like Hadoop Map Reduce
• Microsoft R functions prefixed with rx
• Current set compute context determines processing
location
Copyright Microsoft Corporation. All rights reserved.
Microsoft R Client Microsoft R Server
Console
R IDE or
command-
line REMOTE
CONTEXT
Variable Selection
Stepwise Regression
Simulation
Simulation (e.g. Monte Carlo)
Parallel Random Number Generation
Cluster Analysis
K-Means
Classification
Decision Trees
Decision Forests
Gradient Boosted Decision Trees
Naïve Bayes
Combination
rxDataStep
rxExec, rxExecBy
PEMA-R API Custom Algorithms
Parallelized, Remote Execution
Algorithms
Data Step
Data import – Delimited, Fixed, SAS, SPSS,
OBDC
Variable creation & transformation
Recode variables
Factor variables
Missing value handling
Sort, Merge, Split
Aggregate by category (means, sums)
Descriptive Statistics
Min / Max, Mean, Median (approx.)
Quantiles (approx.)
Standard Deviation
Variance
Correlation
Covariance
Sum of Squares (cross product matrix for set variables)
Pairwise Cross tabs
Risk Ratio & Odds Ratio
Cross-Tabulation of Data (standard tables & long form)
Marginal Summaries of Cross Tabulations
Statistical Tests
Chi Square Test
Kendall Rank Correlation
Fisher’s Exact Test
Student’s t-Test
Sampling
Subsample (observations & variables)
Random Sampling
Stratified Sampling
Predictive Models
Sum of Squares (cross product matrix for set variables)
Quantiles (approx.)
Generalized Linear Models (GLM) exponential family
distributions: binomial, Gaussian, inverse Gaussian, Poisson,
Tweedie. Standard link functions: cauchy,, identity, log, logit,
probit. User defined distributions & link functions.
Covariance & Correlation Matrices
Logistic Regression, SDCA
Classification & Regression Trees, Random Forest,
Neural Networks, Convolutional Neural Networks
Ensemble Algorithms
Sharing a Spark Session
• In Microsoft R Server 9.1 (MRS), you can create a single Spark Session
from MRS that connects with sparklyr to that session
• Allows you to share data across MRS and sparklyr, and use functions
from either package in the same pipeline
• can also share the session with any other sparklyr extension
packages, like rsparkling
• MRS has readers for Hive, using the RxHiveData source reader, so
anything you persist to the Hive metastore from SparkSQL/sparklyr can
be consumed by MRS
Sparklyr’s Role in the Tidyverse
Spark Abstractions for R
RDD;
DataFrame;
Transformers;
a function that takes a
RDD/DataFrame in and pushes
another RDD/DataFrame out
Actions;
a function that takes a
RDD/DataFrame in, and puts
anything else out
Distributed immutable list;
Distributed data.frame:
tbl_spark, tbl_sql
Lazy Computations;
dplyr, sql,
sdf_transforms,
ml_transformer
Eager Queries;
dplyr::collect, head
The Power of Interfaces and
Abstractions
• Interfaces and abstractions make it easier to write reusable code
• For dplyr, Spark SQL method dispatch occurs at the tbl_sql class
• Laziness occurs at the tbl_spark class
• We can develop scripts with dplyr that manipulate
other tbl objects: tbl_df, tbl_mysql, etc.
• When we're ready to run our code in Spark, simply change the data
source
Scaling Analytics from Single
Machines to Clusters
Suppose we had a local, single machine running an dplyr method:
Changing the Source
• Not sparklyr
functions,
regular dplyr
methods for
the right tbl
Can You Explain That?
• If we try the explain method, we can see the code-gen by dplyr
Two-Tables Joins and Lazy
Operations
Distributed Computing with MRS
• MRS has two packages for distributed machine learning and
predictive modeling:
• RevoScaleR
• Primarily written in C++
• Spark extensions written in Scala
• MicrosoftML
• Primarily written in C#
• CLR bridge to MRS process – (BxlServer)
• Both packages are executed as parallel external memory
algorithms, and can consume a variety a data sources
• Hive tables
• Parquet
• Text
• ODBC, and many more
From sparklyr to MRS
Spark
DataFrame,
tbl_spark
Parquet,
spark_write
_parquet
Hive
sdf_register
MRS
rxDataStep(
outFile =
xdfd)
Training ML Models with MRS
• Same intuitive syntax we are used to from all CRAN-R modeling functions
Training Ensembles
• MRS provides a convenient function to parallelize training across your
cluster and combine the models into an ensemble
• Creates an ensemble by sampling over the training and features, and
aggregating over the predictions
Distributed Hyperparameter
Optimization
• Can conduct cross-validation and hyperparameter tuning using foreach or
rxExec
Pretrained Algorithms for Transfer
Learning
• MicrosoftML includes pre-trained ImageNet models that you can use to
directly featurize your images, and fine-tune it for your data
Streaming Extensions
• sparklyr::invoke
• sparkstreaming | structlystreams
• sparkstreaming and structlystreams interface to Spark
Streaming (RDD) and Structured Streams respectively
• sparkstreaming:
Structured Streams
• structlystreams:
GraphFrames Extensions
• GraphFrames
• kevinykuo/sparklygraphs: R interface for GraphFrames
Thanks!

More Related Content

What's hot

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learningfelixcss
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark Summit
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Databricks
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Databricks
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics Databricks
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPDatabricks
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark Summit
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache sparkRahul Kumar
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark Summit
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Databricks
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangDatabricks
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 

What's hot (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...
 
Reactive app using actor model & apache spark
Reactive app using actor model & apache sparkReactive app using actor model & apache spark
Reactive app using actor model & apache spark
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 
Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 

Similar to Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Zaidi

TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkAlex Zeltov
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAlex Palamides
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkDatabricks
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL ServerŁukasz Grala
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 

Similar to Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Zaidi (20)

TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Ml2
Ml2Ml2
Ml2
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Michal Marušan: Scalable R
Michal Marušan: Scalable RMichal Marušan: Scalable R
Michal Marušan: Scalable R
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Spark Workshop
Spark WorkshopSpark Workshop
Spark Workshop
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 

Recently uploaded (20)

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 

Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Zaidi

  • 1. Ali Zaidi Data Scientist @Microsoft akzaidi Extending R’s API with Microsoft R Server and Spark @alikzaidi
  • 2. Incredible R Speakers and Talks • Felix Cheung -- SSR: Structured Streaming on R for Machine Learning – Tuesday, 11:00 AM – 11:30 AM • Javier Luraschi -- Sparklyr: Recap, Updates and Use Cases – Wednesday, 2:00 PM - 3:10 PM • Hossein Falaki -- Apache SparkR Under the Hood: How to Debug your SparkR Applications – Wednesday, 4:20 PM – 4:50 PM • Navdeep Gill -- From R Script to Production Using rsparkling – Wednesday, 5:00 PM – 5:30 PM
  • 3. Language Popularity IEEE Spectrum Top Programming Languages R’s popularity is growing rapidly R Usage Growth Rexer Data Miner Survey, 2007-2013 Rexer Data Miner Survey
  • 4. What is • A statistics programming language • A data visualization tool • Open source • 2.5+M users • Taught in most universities • Thriving user groups worldwide • 10,000+ free algorithms in CRAN • Scalable to big data • New and recent grad’s use it Language Platform Community Ecosystem • Rich application & platform integration
  • 5. R as an Interface [Y]ou should understand R as a user interface, or a language that’s capable of providing very good user interfaces – JJ Allaire, RStudio
  • 6.
  • 8. R APIs for Spark PR#78 • SparkR • ASF, Apache- licensed • Ships with Apache- Spark since 1.4x • SparkSQL and SparkML support through RPC • UDF support through gapply, dapply, spark.lapply
  • 9. • On a workstation, that means: – All available cores will be used for math operations and parallel processes – Hard drive capacity sets limit for data size, not RAM • On a cluster: – Parallel utilization of all available nodes – Distributed file systems like HDFS greatly expand possible data sizes MRS in Different Contexts
  • 10. Code written on a workstation will run on a cluster by tweaking a single function call: # Use your local computer: rxSetComputeContext( RxLocalParallel() ) # Switch to your cluster: rxSetComputeContext( RxSpark(...) ) MRS in Different Contexts
  • 11. Parallel External Memory Algorithms (PEMAs) 1. A chunk/subset of data is extracted from the main dataset 2. An intermediate result is calculated from that chunk of data 3. The intermediate results are combined into a final dataset How MRS Works
  • 12. How Does Remote Compute Context Work? Algorithm Master Predictive Algorithm Big Data Analyze Blocks In Parallel Load Block At A Time Distribute Work, Compile Results “Pack and Ship” Requests to Remote Environments Results Microsoft R Server functions • A compute context defines where to process. • E.g. remote context like Hadoop Map Reduce • Microsoft R functions prefixed with rx • Current set compute context determines processing location Copyright Microsoft Corporation. All rights reserved. Microsoft R Client Microsoft R Server Console R IDE or command- line REMOTE CONTEXT
  • 13. Variable Selection Stepwise Regression Simulation Simulation (e.g. Monte Carlo) Parallel Random Number Generation Cluster Analysis K-Means Classification Decision Trees Decision Forests Gradient Boosted Decision Trees Naïve Bayes Combination rxDataStep rxExec, rxExecBy PEMA-R API Custom Algorithms Parallelized, Remote Execution Algorithms Data Step Data import – Delimited, Fixed, SAS, SPSS, OBDC Variable creation & transformation Recode variables Factor variables Missing value handling Sort, Merge, Split Aggregate by category (means, sums) Descriptive Statistics Min / Max, Mean, Median (approx.) Quantiles (approx.) Standard Deviation Variance Correlation Covariance Sum of Squares (cross product matrix for set variables) Pairwise Cross tabs Risk Ratio & Odds Ratio Cross-Tabulation of Data (standard tables & long form) Marginal Summaries of Cross Tabulations Statistical Tests Chi Square Test Kendall Rank Correlation Fisher’s Exact Test Student’s t-Test Sampling Subsample (observations & variables) Random Sampling Stratified Sampling Predictive Models Sum of Squares (cross product matrix for set variables) Quantiles (approx.) Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchy,, identity, log, logit, probit. User defined distributions & link functions. Covariance & Correlation Matrices Logistic Regression, SDCA Classification & Regression Trees, Random Forest, Neural Networks, Convolutional Neural Networks Ensemble Algorithms
  • 14. Sharing a Spark Session • In Microsoft R Server 9.1 (MRS), you can create a single Spark Session from MRS that connects with sparklyr to that session • Allows you to share data across MRS and sparklyr, and use functions from either package in the same pipeline • can also share the session with any other sparklyr extension packages, like rsparkling • MRS has readers for Hive, using the RxHiveData source reader, so anything you persist to the Hive metastore from SparkSQL/sparklyr can be consumed by MRS
  • 15. Sparklyr’s Role in the Tidyverse
  • 16. Spark Abstractions for R RDD; DataFrame; Transformers; a function that takes a RDD/DataFrame in and pushes another RDD/DataFrame out Actions; a function that takes a RDD/DataFrame in, and puts anything else out Distributed immutable list; Distributed data.frame: tbl_spark, tbl_sql Lazy Computations; dplyr, sql, sdf_transforms, ml_transformer Eager Queries; dplyr::collect, head
  • 17. The Power of Interfaces and Abstractions • Interfaces and abstractions make it easier to write reusable code • For dplyr, Spark SQL method dispatch occurs at the tbl_sql class • Laziness occurs at the tbl_spark class • We can develop scripts with dplyr that manipulate other tbl objects: tbl_df, tbl_mysql, etc. • When we're ready to run our code in Spark, simply change the data source
  • 18. Scaling Analytics from Single Machines to Clusters Suppose we had a local, single machine running an dplyr method:
  • 19. Changing the Source • Not sparklyr functions, regular dplyr methods for the right tbl
  • 20. Can You Explain That? • If we try the explain method, we can see the code-gen by dplyr
  • 21. Two-Tables Joins and Lazy Operations
  • 22. Distributed Computing with MRS • MRS has two packages for distributed machine learning and predictive modeling: • RevoScaleR • Primarily written in C++ • Spark extensions written in Scala • MicrosoftML • Primarily written in C# • CLR bridge to MRS process – (BxlServer) • Both packages are executed as parallel external memory algorithms, and can consume a variety a data sources • Hive tables • Parquet • Text • ODBC, and many more
  • 23. From sparklyr to MRS Spark DataFrame, tbl_spark Parquet, spark_write _parquet Hive sdf_register MRS rxDataStep( outFile = xdfd)
  • 24. Training ML Models with MRS • Same intuitive syntax we are used to from all CRAN-R modeling functions
  • 25. Training Ensembles • MRS provides a convenient function to parallelize training across your cluster and combine the models into an ensemble • Creates an ensemble by sampling over the training and features, and aggregating over the predictions
  • 26. Distributed Hyperparameter Optimization • Can conduct cross-validation and hyperparameter tuning using foreach or rxExec
  • 27. Pretrained Algorithms for Transfer Learning • MicrosoftML includes pre-trained ImageNet models that you can use to directly featurize your images, and fine-tune it for your data
  • 28. Streaming Extensions • sparklyr::invoke • sparkstreaming | structlystreams • sparkstreaming and structlystreams interface to Spark Streaming (RDD) and Structured Streams respectively • sparkstreaming:
  • 30. GraphFrames Extensions • GraphFrames • kevinykuo/sparklygraphs: R interface for GraphFrames