Microsoft R - Data Science at Scale

What is R?
• A statistics programming language
• A data visualization tool
• Open source
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• 10000+ free algorithms in CRAN
• Scalable to big data
• New and recent grads use it
Language
Platform
Community
Ecosystem
• Rich application & platform integration

Challenges posed by open source R
• Lack of commercial support
• Inadequate modeling performance
• Complex deployment processes
• Limited data scale

R from Microsoft brings
• Peace of mind
• Efficiency
• Speed and scalability
• Flexibility and agility

Linux, Windows, Hadoop & Teradata
R Server Technology

Convergence with Flexibility
Scalable Algorithms
R: Write Once Deploy Anywhere
Templates & Samples
Microsoft R Server Family
R & Python to AML Interop.
Cortana Intelligence

SQL Server R Services
Linux, Hadoop, Teradata, Windows
Commercial: R Server
Community: R Open

Installed Packages
Base
- stats
- graphics
- grDevices
- utils
- datasets
- methods
- base
Recommended
- boot
- class
- cluster
- codetools
- foreign
- kernSmooth
- lattice
- MASS
- Matrix
- mgcv
- nlme
- nnet
- rpart
- spatial
- survival
Microsoft (Developed / Maintained)
- checkpoint
- deployRserve
- doParallel
- foreach
- jsonlite
- iterators
- microsoftR
- RevoIOQ
- RevoMods
- RevoUtils
- RODBC
- RevoUtilsMath
- azureml
- rmr2
- rhdfs
- rhbase
- plyrmr
Open-Source #1
Additional
CRAN R
- curl
- jsonlite
- png
- R6
- RODBC
Microsoft R Open #2
(Intel MKL)
Microsoft R Server #4
Microsoft R Client (free) #3
Microsoft (Developed / Maintained)
- RevoScaleR
- MicrosoftML
- CompatibilityAPI
- mrupdate
- RevoIOQ
- RevoTreeView
- mrsdeploy
- sqlrutils
- olapR
Commercially licensed & supported
Open-Source
Open-Source
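To see how these groups map onto a given installation, base R can list installed packages by priority. This is a minimal sketch that uses only installed.packages() from the utils package; the variable names are illustrative, and the Microsoft-maintained packages will only be found on a Microsoft R installation.

```r
# List installed packages using the "Priority" field that R maintains.
base_pkgs <- rownames(installed.packages(priority = "base"))
recommended_pkgs <- rownames(installed.packages(priority = "recommended"))

print(sort(base_pkgs))
print(sort(recommended_pkgs))

# Check for a few of the Microsoft-maintained packages named above.
# On a plain CRAN R installation these are typically not present.
ms_pkgs <- c("RevoScaleR", "MicrosoftML", "mrsdeploy")
ms_present <- ms_pkgs %in% rownames(installed.packages())
print(setNames(ms_present, ms_pkgs))
```

Note that R's own "base" priority group also contains a few packages the slide does not list, such as parallel and tools.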

[Diagram: parallel external-memory processing]
Microsoft R Server “Client” (console: R IDE or command line) → REMOTE CONTEXT → Microsoft R Server “Server”
Algorithm Master: Predictive Algorithm
Big Data → Load Block At A Time → Analyze Blocks In Parallel → Distribute Work, Compile Results → Results
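The block-at-a-time pattern (load a block, analyze blocks in parallel, compile results) can be illustrated in base R. This is a minimal sketch and not ScaleR's implementation: the helper names summarise_block and combine_blocks, the simulated data, and the block size are all assumptions made for illustration. Each block is reduced to a small intermediate result, and the intermediate results are combined at the end.

```r
# Reduce one block of data to a small intermediate result.
summarise_block <- function(x) {
  c(n = length(x), sum = sum(x), sumsq = sum(x^2))
}

# Combine the intermediate results from all blocks into final statistics.
combine_blocks <- function(parts) {
  total <- Reduce(`+`, parts)
  n <- total[["n"]]
  mean_x <- total[["sum"]] / n
  var_x <- (total[["sumsq"]] - n * mean_x^2) / (n - 1)
  list(n = n, mean = mean_x, var = var_x)
}

set.seed(42)
x <- rnorm(10000, mean = 5, sd = 2)

# Split into blocks of 1,000 values. On a real system each block would be
# read from disk in turn or handed to a separate worker.
blocks <- split(x, ceiling(seq_along(x) / 1000))
parts <- lapply(blocks, summarise_block)
result <- combine_blocks(parts)

print(result)
```

Because only the three running totals are kept per block, memory use does not grow with the size of the data set. The simple sum-of-squares formula used here can lose precision on badly scaled data, so it should be read as an illustration of the pattern only.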

R+CRAN
MicrosoftR
DistributedR
DeployR DevelopR
ScaleR
ConnectR
Delivers High Performance Parallel Distributed
Analytics Across Individual and Clustered Systems
• Cloudera
• Hortonworks
• MapR
• Apache Spark
• IBM Platform LSF
• Microsoft HPC
Clusters
• SQL Server
• Teradata
Database
• Red Hat
• SuSE Servers
• Windows
DistributedR

Local parallel processing (Linux or Windows)

### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
localFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf",
                            fileSystem = localFS)

In Hadoop

### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCC <- RxHadoopMR()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
hdfsFS

Functional model R script (the same in both contexts)

### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)

ScaleR models can be deployed from a server or edge node to run in Hadoop
without any functional R model re-coding for MapReduce.
• Compute context R script: sets where the model will run
• Functional model R script: does not need to change to run in Hadoop
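The split between a compute context script and a functional model script can be mimicked in base R with the parallel package. This is only an analogy under stated assumptions, not how RevoScaleR compute contexts are implemented: fit_by_group, local_context, and cluster_context are hypothetical names, and the built-in mtcars data stands in for the airline data.

```r
library(parallel)

# "Functional" script: the analysis itself. It receives an apply function
# and never needs to know where the work runs.
fit_by_group <- function(apply_fun) {
  groups <- split(mtcars, mtcars$cyl)
  apply_fun(groups, function(d) coef(lm(mpg ~ wt, data = d)))
}

# "Compute context" option 1: run sequentially in the current session.
local_context <- function(X, FUN) lapply(X, FUN)
res_local <- fit_by_group(local_context)

# "Compute context" option 2: run on a small local cluster of R workers.
cl <- makeCluster(2)
cluster_context <- function(X, FUN) parLapply(cl, X, FUN)
res_cluster <- fit_by_group(cluster_context)
stopCluster(cl)

print(res_local)
print(all.equal(res_local, res_cluster))
```

Only the context definition changes between the two runs; fit_by_group is called unmodified, which is the property the slide attributes to ScaleR scripts.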
Copyright Microsoft Corporation. All rights reserved.

Data Step
▪ Data import: Delimited, Fixed, SAS, SPSS, ODBC
▪ Variable creation & transformation
▪ Recode variables
▪ Factor variables
▪ Missing value handling
▪ Sort, Merge, Split
▪ Aggregate by category (means, sums)

Descriptive Statistics
▪ Min / Max, Mean, Median (approx.)
▪ Quantiles (approx.)
▪ Standard Deviation
▪ Variance
▪ Correlation
▪ Covariance
▪ Sum of Squares (cross product matrix for set variables)
▪ Pairwise Cross tabs
▪ Risk Ratio & Odds Ratio
▪ Cross-Tabulation of Data (standard tables & long form)
▪ Marginal Summaries of Cross Tabulations

Statistical Tests
▪ Chi Square Test
▪ Kendall Rank Correlation
▪ Fisher’s Exact Test
▪ Student’s t-Test

Sampling
▪ Subsample (observations & variables)
▪ Random Sampling

Predictive Models
▪ Sum of Squares (cross product matrix for set variables)
▪ Multiple Linear Regression
▪ Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
▪ Covariance & Correlation Matrices
▪ Logistic Regression
▪ Classification & Regression Trees
▪ Predictions/scoring for models
▪ Residuals for all models

Variable Selection
▪ Stepwise Regression

Simulation
▪ Simulation (e.g. Monte Carlo)
▪ Parallel Random Number Generation

Cluster Analysis
▪ K-Means

Classification
▪ Decision Trees
▪ Decision Forests
▪ Gradient Boosted Decision Trees
▪ Naïve Bayes

Combination
▪ rxDataStep
▪ rxExec
▪ PEMA-R API Custom Algorithms
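The "Sum of Squares (cross product matrix)" entry hints at why linear models scale well: the cross products X'X and X'y can be accumulated one block at a time and the coefficients solved for at the end. The following base R sketch illustrates that idea only; chunked_lm and its block_size argument are hypothetical names, it is not the rxLinMod algorithm, and it assumes numeric predictors so that every block produces the same model matrix columns.

```r
# Accumulate X'X and X'y one block at a time, then solve the normal equations.
chunked_lm <- function(formula, data, block_size = 10) {
  starts <- seq(1, nrow(data), by = block_size)
  xtx <- NULL
  xty <- NULL
  for (s in starts) {
    rows <- s:min(s + block_size - 1, nrow(data))
    mf <- model.frame(formula, data[rows, , drop = FALSE])
    X <- model.matrix(formula, mf)
    y <- model.response(mf)
    if (is.null(xtx)) {
      xtx <- crossprod(X)       # X'X for the first block
      xty <- crossprod(X, y)    # X'y for the first block
    } else {
      xtx <- xtx + crossprod(X)
      xty <- xty + crossprod(X, y)
    }
  }
  drop(solve(xtx, xty))
}

coefs <- chunked_lm(mpg ~ wt + hp, mtcars)
print(coefs)

# Reference fit on the full data set, for comparison.
print(coef(lm(mpg ~ wt + hp, data = mtcars)))
```

Only the small p-by-p cross-product matrix is held in memory, regardless of how many rows are processed. Solving the normal equations directly is less numerically stable than the QR decomposition that lm() uses, so this is a teaching sketch rather than a production method.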

Spark SQL (structured data)
Spark Streaming (real-time)
MLlib (machine learning)
GraphX (graph)
SparkR (R on Spark)
Core
YARN, Mesos, Standalone

Read from HDFS → Write to HDFS → Read from HDFS → Write to HDFS → Read from HDFS

[Diagram: Spark on YARN]
Spark Users
Spark Driver (RDDs)
YARN Resource Management; HDFS Name Node
Three worker nodes, each running: Spark Executor, YARN Node Manager, HDFS Data Node

[Diagram: R Server for Hadoop v8.0.5 on Spark]
R User Workstation
Edge Node: ScaleR Master Task (Initiator, Finalizer), Spark Driver
YARN Resource Management; HDFS Name Node; RDDs; HDFS
Three worker nodes, each running: Spark Executor with a Worker Task, YARN Node Manager, HDFS Data Node
[Diagram: client access to R Server on Spark]
Edge Node: ScaleR Master Task (Initiator, Finalizer), Spark Driver
YARN Resource Management; HDFS Name Node; RDDs; HDFS
Three Spark Executors, each with a Worker Task and HDFS
Remote Execution: ssh
Web Services: DeployR
Clients (https://): R Tools for Visual Studio, BI Tools & Applications, Jupyter Notebooks, Thin Client IDEs

Open Source R Package
Microsoft R Package

EXECUTE sp_execute_external_script
      @language = N'R'
    , @script = N'
        x <- as.matrix(InputDataSet);
        y <- array(dim1:dim2);
        OutputDataSet <- as.data.frame(x %*% y);'
    , @input_data_1 = N'SELECT [Col1] FROM MyData;'
    , @params = N'@dim1 int, @dim2 int'
    , @dim1 = 12, @dim2 = 15
WITH RESULT SETS (([Col1] int, [Col2] int, [Col3] int, [Col4] int));
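The R code embedded in @script can be tried outside SQL Server to see why the result set has four columns. In this sketch the InputDataSet values are made up to stand in for the rows of MyData; inside SQL Server, InputDataSet is populated from @input_data_1 and dim1 and dim2 arrive through @params.

```r
# Stand-in for the rows returned by @input_data_1 (values are illustrative).
InputDataSet <- data.frame(Col1 = c(1L, 2L, 3L))

# Stand-ins for the @dim1 and @dim2 parameters.
dim1 <- 12L
dim2 <- 15L

x <- as.matrix(InputDataSet)   # one column, one row per input row
y <- array(dim1:dim2)          # the four values 12, 13, 14, 15

# x has a single column, so R treats y as a row vector and the product
# has one row per input row and one column per element of y.
OutputDataSet <- as.data.frame(x %*% y)

print(dim(OutputDataSet))
print(OutputDataSet)
```

The four columns of OutputDataSet correspond to the four columns declared in WITH RESULT SETS.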

[Diagram: SQL Server R Services process architecture]
MSSQLSERVER Service: sqlservr.exe (SQLOS, XEvent), sp_execute_external_script
Named pipe
MSSQLLAUNCHPAD Service: launchpad.exe (“launcher”)
Windows “satellite” process: sqlsatellite.dll
