What is R?
• A statistics programming language
• A data visualization tool
• Open source
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• 10000+ free algorithms in CRAN
• Scalable to big data
• New and recent grads use it
Language
Platform
Community
Ecosystem
• Rich application & platform integration
Challenges posed by open source R
• Lack of Commercial Support
• Inadequate Modeling Performance
• Complex Deployment Processes
• Limited Data Scale
R from Microsoft brings
• Peace of mind
• Efficiency
• Speed and scalability
• Flexibility and agility
Linux, Windows, Hadoop & Teradata
R Server Technology
Convergence with Flexibility
Scalable Algorithms
R: Write Once Deploy Anywhere
Templates & Samples
Microsoft R Server Family
R & Python to AML Interop.
Cortana Intelligence
SQL Server R Services
Linux
Hadoop
Teradata
Windows
Commercial: R Server
Community: R Open
Installed Packages
Base
- stats
- graphics
- grDevices
- utils
- datasets
- methods
- base
Recommended
- boot
- class
- cluster
- codetools
- foreign
- kernSmooth
- lattice
- MASS
- Matrix
- mgcv
- nlme
- nnet
- rpart
- spatial
- survival
Microsoft (Developed / Maintained)
- checkpoint
- deployRserve
- doParallel
- foreach
- jsonlite
- iterators
- microsoftR
- RevoIOQ
- RevoMods
- RevoUtils
- RODBC
- RevoUtilsMath
- azureml
- rmr2
- rhdfs
- rhbase
- plyrmr
Open-Source #1
Additional
CRAN R
- curl
- jsonlite
- png
- R6
- RODBC
Microsoft R Open #2
(Intel MKL)
Microsoft R Server #4
Microsoft R Client (free) #3
Microsoft (Developed / Maintained)
- RevoScaleR
- MicrosoftML
- CompatibilityAPI
- mrupdate
- RevoIOQ
- RevoTreeView
- mrsdeploy
- sqlrutils
- olapR
Commercially licensed & supported
Open-Source
Open-Source
Algorithm Master
Predictive Algorithm
Big Data
Analyze Blocks In Parallel
Load Block At A Time
Distribute Work, Compile Results
Results
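The block-at-a-time pattern named on this slide (load a block at a time, analyze blocks in parallel, compile results) can be illustrated in plain base R. This is only a sketch of the general idea, not ScaleR's actual implementation; the helper names `summarize_block` and `combine_summaries` and the block size of 1,000 are made up for illustration.

```r
# Illustrative only: a chunked mean/variance in base R.
# summarize_block() and combine_summaries() are hypothetical helper names,
# not part of RevoScaleR.

# Per-block step: reduce one block to a small summary
# (row count, sum, and sum of squares).
summarize_block <- function(block) {
  c(n = length(block), sum = sum(block), sumsq = sum(block^2))
}

# Master step: add up the per-block summaries and derive the final statistics.
combine_summaries <- function(summaries) {
  total <- Reduce(`+`, summaries)
  n <- total[["n"]]
  mean_x <- total[["sum"]] / n
  var_x <- (total[["sumsq"]] - n * mean_x^2) / (n - 1)
  list(n = n, mean = mean_x, var = var_x)
}

set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)

# Split the data into blocks of 1,000 values; in a real setting each block
# could be read from disk or handled by a different worker.
blocks <- split(x, ceiling(seq_along(x) / 1000))

# lapply() processes the blocks one after another here; a parallel map such as
# parallel::parLapply() could be substituted without changing the helpers.
result <- combine_summaries(lapply(blocks, summarize_block))
```

Because each block is reduced to three numbers, only the small summaries need to travel back to the master step, which is what makes this kind of computation easy to distribute.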
Microsoft R Server “Client”
Microsoft R Server “Server”
Console: R IDE or command-line
REMOTE CONTEXT
DI
R+CRAN
MicrosoftR
DistributedR
DeployR
DevelopR
ScaleR
ConnectR
Delivers High Performance Parallel Distributed Analytics Across Individual and Clustered Systems
• Cloudera
• Hortonworks
• MapR
• Apache Spark
Clusters
• IBM Platform LSF
• Microsoft HPC
Database
• SQL Server
• Teradata
Servers
• Red Hat
• SuSE
• Windows
DistributedR
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
localFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf",
                            fileSystem = localFS)

### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCC <- RxHadoopMR()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
hdfsFS
### NOTE: AirlineDataSet must also point at data on hdfsFS before the
### analysis below runs in Hadoop; that line is not shown on the slide

### ANALYTICAL PROCESSING (IDENTICAL FOR BOTH COMPUTE CONTEXTS) ###
### Statistical Summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
Local Parallel processing – Linux or Windows
In – Hadoop
ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for map-reduce
Compute context R script – sets where the model will run
Functional model R script – does not need to change to run in Hadoop
Copyright Microsoft Corporation. All rights reserved.
Data Step
▪ Data import – Delimited, Fixed, SAS, SPSS, ODBC
▪ Variable creation & transformation
▪ Recode variables
▪ Factor variables
▪ Missing value handling
▪ Sort, Merge, Split
▪ Aggregate by category (means, sums)
Descriptive Statistics
▪ Min / Max, Mean, Median (approx.)
▪ Quantiles (approx.)
▪ Standard Deviation
▪ Variance
▪ Correlation
▪ Covariance
▪ Sum of Squares (cross product matrix for set variables)
▪ Pairwise Cross tabs
▪ Risk Ratio & Odds Ratio
▪ Cross-Tabulation of Data (standard tables & long form)
▪ Marginal Summaries of Cross Tabulations
Statistical Tests
▪ Chi Square Test
▪ Kendall Rank Correlation
▪ Fisher’s Exact Test
▪ Student’s t-Test
Sampling
▪ Subsample (observations & variables)
▪ Random Sampling
Predictive Models
▪ Sum of Squares (cross product matrix for set variables)
▪ Multiple Linear Regression
▪ Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
▪ Covariance & Correlation Matrices
▪ Logistic Regression
▪ Classification & Regression Trees
▪ Predictions/scoring for models
▪ Residuals for all models
Cluster Analysis
▪ K-Means
Classification
▪ Decision Trees
▪ Decision Forests
▪ Gradient Boosted Decision Trees
▪ Naïve Bayes
Variable Selection
▪ Stepwise Regression
Simulation
▪ Simulation (e.g. Monte Carlo)
▪ Parallel Random Number Generation
Combination
▪ rxDataStep
▪ rxExec
▪ PEMA-R API Custom Algorithms
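The two simulation items in the list above (Monte Carlo simulation and parallel random number generation) can be illustrated with base R's `parallel` package. This is a hedged sketch of the general technique only: it does not use rxExec or any ScaleR function, the chunks run one after another rather than on separate workers, and the chunk count, sample size and the helper name `count_hits` are assumptions made for the example.

```r
library(parallel)  # ships with R; provides nextRNGStream()

# Illustrative only: independent random-number streams for the chunks of a
# Monte Carlo estimate of pi. count_hits() is a hypothetical helper name.
RNGkind("L'Ecuyer-CMRG")
set.seed(42)

n_chunks <- 4
n_per_chunk <- 25000

# Derive one stream seed per chunk so the chunks do not share random numbers.
seeds <- vector("list", n_chunks)
seed <- .Random.seed
for (i in seq_len(n_chunks)) {
  seeds[[i]] <- seed
  seed <- nextRNGStream(seed)
}

# Count how many random points in the unit square fall inside the
# quarter circle of radius 1, using the stream given by `seed`.
count_hits <- function(seed, n) {
  assign(".Random.seed", seed, envir = globalenv())
  x <- runif(n)
  y <- runif(n)
  as.numeric(sum(x^2 + y^2 <= 1))
}

# Each chunk could run on a different worker; vapply() runs them in sequence.
hits <- vapply(seeds, count_hits, numeric(1), n = n_per_chunk)

# The fraction of hits approximates pi / 4.
pi_hat <- 4 * sum(hits) / (n_chunks * n_per_chunk)
```

Giving each chunk its own stream keeps the result reproducible regardless of which worker handles which chunk.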
Spark SQL: structured data
Spark Streaming: real-time
MLlib: machine learning
GraphX: graph
Core
SparkR: R on Spark
YARN, Mesos, Standalone
[Diagram: repeated cycle of Read from HDFS / Write to HDFS]
[Diagram: Spark Users, RDDs, Spark Driver, YARN Resource Management and HDFS Name Node, with three worker nodes each running a Spark Executor, YARN Node Manager and HDFS Data Node]
R User Workstation
R Server for Hadoop v8.0.5
[Diagram: Edge Node running the ScaleR Master Task (Initiator, Finalizer) and the Spark Driver; YARN Resource Management and HDFS Name Node; RDDs and HDFS; worker nodes each running a Spark Executor with a Worker Task, YARN Node Manager and HDFS Data Node]
Remote Execution: ssh
Web Services
DeployR
R Tools for Visual Studio
BI Tools & Applications
Jupyter Notebooks
Thin Client IDEs
https://
https://
Familiar
Scalable
Secure
Open Source R Package
Microsoft R Package
-- Run an R script inside SQL Server: multiply the input column by the vector dim1:dim2
EXECUTE sp_execute_external_script
@language = N'R'
, @script = N'
x <- as.matrix(InputDataSet);
y <- array(dim1:dim2);
OutputDataSet <- as.data.frame(x %*% y);'
, @input_data_1 = N'SELECT [Col1] FROM MyData;'
, @params = N'@dim1 int, @dim2 int'
, @dim1 = 12, @dim2 = 15
WITH RESULT SETS (([Col1] int, [Col2] int, [Col3] int, [Col4] int));
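To see why the result set above declares four columns, the R part of the script can be run on its own in base R. This is a sketch under assumptions: the slide does not show the contents of MyData, so the three input values below are made up, and `InputDataSet`, `dim1` and `dim2` are set by hand here instead of being supplied by SQL Server.

```r
# Stand-in for the rows SQL Server would pass in as InputDataSet.
# The values 1, 2, 3 are made up for illustration.
InputDataSet <- data.frame(Col1 = c(1, 2, 3))
dim1 <- 12
dim2 <- 15

x <- as.matrix(InputDataSet)   # 3 x 1 matrix
y <- array(dim1:dim2)          # the values 12, 13, 14, 15

# x has a single column, so R treats y as a 1 x 4 row vector and the product
# is 3 x 4: one output row per input row, one column per value of y.
OutputDataSet <- as.data.frame(x %*% y)
```

With `@dim1 = 12` and `@dim2 = 15` the vector has four elements, which matches the four `int` columns in `WITH RESULT SETS`; a different range would need a different column list.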
MSSQLSERVER Service: sqlservr.exe, sp_execute_external_script, SQLOS XEvent
Named pipe
MSSQLLAUNCHPAD Service: launchpad.exe, “launcher”
Windows “satellite” processes (sqlsatellite.dll)
Microsoft R - Data Science at Scale
Microsoft R - Data Science at Scale
Microsoft R - Data Science at Scale
Microsoft R - Data Science at Scale

More Related Content

What's hot

Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid
DataWorks Summit
 
Composable Data Processing with Apache Spark
Composable Data Processing with Apache SparkComposable Data Processing with Apache Spark
Composable Data Processing with Apache Spark
Databricks
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
DataStax
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Sumeet Singh
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
Eric Sun
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsDataWorks Summit
 
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaAzure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Databricks
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database Syst...
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database Syst...Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database Syst...
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database Syst...
Databricks
 
Engineering practices in big data storage and processing
Engineering practices in big data storage and processingEngineering practices in big data storage and processing
Engineering practices in big data storage and processing
Schubert Zhang
 
Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Don Demcsak
 
Powering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakePowering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta Lake
Databricks
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابر
datastack
 
Data Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive ApproachesData Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive Approaches
Databricks
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Sudhir Mallem
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Guido Schmutz
 

What's hot (20)

Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid
 
Composable Data Processing with Apache Spark
Composable Data Processing with Apache SparkComposable Data Processing with Apache Spark
Composable Data Processing with Apache Spark
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
Designing & Optimizing Micro Batching Systems Using 100+ Nodes (Ananth Ram, R...
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
How to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and AnalyticsHow to use Parquet as a Sasis for ETL and Analytics
How to use Parquet as a Sasis for ETL and Analytics
 
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaAzure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database Syst...
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database Syst...Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database Syst...
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database Syst...
 
Engineering practices in big data storage and processing
Engineering practices in big data storage and processingEngineering practices in big data storage and processing
Engineering practices in big data storage and processing
 
Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)Big Data (NJ SQL Server User Group)
Big Data (NJ SQL Server User Group)
 
Powering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakePowering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta Lake
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابر
 
Data Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive ApproachesData Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive Approaches
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
 

Similar to Microsoft R - Data Science at Scale

TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
Data Science Thailand
 
SQL Server 2017 Machine Learning Services
SQL Server 2017 Machine Learning ServicesSQL Server 2017 Machine Learning Services
SQL Server 2017 Machine Learning Services
Sascha Dittmann
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
DataWorks Summit/Hadoop Summit
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
Jürgen Ambrosi
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Revolution Analytics
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
Alex Palamides
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
Łukasz Grala
 
Ml2
Ml2Ml2
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
Alex Zeltov
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & R
Łukasz Grala
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
Revolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
DataWorks Summit
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
Cloudera, Inc.
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computing
BAINIDA
 
Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR Overview
Khalid Salama
 
eRum2016 -RevoScaleR - Performance and Scalability R
eRum2016 -RevoScaleR - Performance and Scalability ReRum2016 -RevoScaleR - Performance and Scalability R
eRum2016 -RevoScaleR - Performance and Scalability R
Łukasz Grala
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
Revolution Analytics
 
Evolution of Spark APIs
Evolution of Spark APIsEvolution of Spark APIs
Evolution of Spark APIs
Máté Szalay-Bekő
 

Similar to Microsoft R - Data Science at Scale (20)

TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
SQL Server 2017 Machine Learning Services
SQL Server 2017 Machine Learning ServicesSQL Server 2017 Machine Learning Services
SQL Server 2017 Machine Learning Services
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
 
Ml2
Ml2Ml2
Ml2
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & R
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computing
 
Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR Overview
 
eRum2016 -RevoScaleR - Performance and Scalability R
eRum2016 -RevoScaleR - Performance and Scalability ReRum2016 -RevoScaleR - Performance and Scalability R
eRum2016 -RevoScaleR - Performance and Scalability R
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Evolution of Spark APIs
Evolution of Spark APIsEvolution of Spark APIs
Evolution of Spark APIs
 

More from Sascha Dittmann

C# + SQL = Big Data
C# + SQL = Big DataC# + SQL = Big Data
C# + SQL = Big Data
Sascha Dittmann
 
Hochskalierbare, relationale Datenbanken in Microsoft Azure
Hochskalierbare, relationale Datenbanken in Microsoft AzureHochskalierbare, relationale Datenbanken in Microsoft Azure
Hochskalierbare, relationale Datenbanken in Microsoft Azure
Sascha Dittmann
 
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSONSQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
Sascha Dittmann
 
dotnet Cologne 2015 - Azure Service Fabric
dotnet Cologne 2015 - Azure Service Fabric dotnet Cologne 2015 - Azure Service Fabric
dotnet Cologne 2015 - Azure Service Fabric
Sascha Dittmann
 
SQL Saturday #313 Rheinland - MapReduce in der Praxis
SQL Saturday #313 Rheinland - MapReduce in der PraxisSQL Saturday #313 Rheinland - MapReduce in der Praxis
SQL Saturday #313 Rheinland - MapReduce in der Praxis
Sascha Dittmann
 
Hadoop 2.0 - The Next Level
Hadoop 2.0 - The Next LevelHadoop 2.0 - The Next Level
Hadoop 2.0 - The Next Level
Sascha Dittmann
 
Microsoft HDInsight Podcast #001 - Was ist HDInsight
Microsoft HDInsight Podcast #001 - Was ist HDInsightMicrosoft HDInsight Podcast #001 - Was ist HDInsight
Microsoft HDInsight Podcast #001 - Was ist HDInsight
Sascha Dittmann
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
Sascha Dittmann
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
Sascha Dittmann
 
dotnet Cologne 2013 - Windows Azure Mobile Services
dotnet Cologne 2013 - Windows Azure Mobile Servicesdotnet Cologne 2013 - Windows Azure Mobile Services
dotnet Cologne 2013 - Windows Azure Mobile ServicesSascha Dittmann
 
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwicklerdotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
dotnet Cologne 2013 - Microsoft HD Insight für .NET Entwickler
Sascha Dittmann
 
Developer Open Space 2012 - Cloud Computing Workshop
Developer Open Space 2012 - Cloud Computing WorkshopDeveloper Open Space 2012 - Cloud Computing Workshop
Developer Open Space 2012 - Cloud Computing Workshop
Sascha Dittmann
 
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
PASS Camp 2012 - Big Data mit Microsoft (Teil 1)
Sascha Dittmann
 
CloudOps Summit 2012 - 3 Wege in die Cloud
CloudOps Summit 2012 - 3 Wege in die CloudCloudOps Summit 2012 - 3 Wege in die Cloud
CloudOps Summit 2012 - 3 Wege in die CloudSascha Dittmann
 
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv....NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
.NET Usergroup Rhein-Neckar: Big Data in der Cloud - Apache Hadoop-based Serv...
Sascha Dittmann
 
NoSQL mit RavenDB und Azure
NoSQL mit RavenDB und AzureNoSQL mit RavenDB und Azure
NoSQL mit RavenDB und AzureSascha Dittmann
 
Windows Azure für Entwickler V1
Windows Azure für Entwickler V1Windows Azure für Entwickler V1
Windows Azure für Entwickler V1Sascha Dittmann
 

More from Sascha Dittmann (18)

C# + SQL = Big Data
C# + SQL = Big DataC# + SQL = Big Data
C# + SQL = Big Data
 
Hochskalierbare, relationale Datenbanken in Microsoft Azure
Hochskalierbare, relationale Datenbanken in Microsoft AzureHochskalierbare, relationale Datenbanken in Microsoft Azure
Hochskalierbare, relationale Datenbanken in Microsoft Azure
 
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSONSQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
SQL Server vs. Azure DocumentDB – Ein Battle zwischen XML und JSON
 
dotnet Cologne 2015 - Azure Service Fabric
dotnet Cologne 2015 - Azure Service Fabric dotnet Cologne 2015 - Azure Service Fabric
dotnet Cologne 2015 - Azure Service Fabric
 
SQL Saturday #313 Rheinland - MapReduce in der Praxis
SQL Saturday #313 Rheinland - MapReduce in der PraxisSQL Saturday #313 Rheinland - MapReduce in der Praxis
SQL Saturday #313 Rheinland - MapReduce in der Praxis
 
Hadoop 2.0 - The Next Level
Hadoop 2.0 - The Next LevelHadoop 2.0 - The Next Level
Hadoop 2.0 - The Next Level
 
Microsoft HDInsight Podcast #001 - Was ist HDInsight
Microsoft HDInsight Podcast #001 - Was ist HDInsightMicrosoft HDInsight Podcast #001 - Was ist HDInsight
Microsoft HDInsight Podcast #001 - Was ist HDInsight
 
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 2)
Microsoft R - Data Science at Scale

  • 2. What is R? (Language, Platform, Community, Ecosystem)
      • A statistics programming language
      • A data visualization tool
      • Open source
      • 2.5+M users
      • Taught in most universities
      • Thriving user groups worldwide
      • 10,000+ free algorithms in CRAN
      • Scalable to big data
      • New and recent grads use it
      • Rich application & platform integration
  • 3. Challenges posed by open source R: lack of commercial support; inadequate modeling performance; complex deployment processes; limited data scale
  • 4. R from Microsoft brings: peace of mind; efficiency; speed and scalability; flexibility and agility
  • 5. Linux, Windows, Hadoop & Teradata R Server Technology
  • 6. Convergence with Flexibility Scalable Algorithms R: Write Once Deploy Anywhere Templates & Samples Microsoft R Server Family R & Python to AML Interop. Cortana Intelligence
  • 7. Commercial: R Server. Community: R Open. Platforms: SQL Server R Services, Linux, Hadoop, Teradata, Windows
  • 8. Installed Packages
      Base: stats, graphics, grDevices, utils, datasets, methods, base
      Recommended: boot, class, cluster, codetools, foreign, kernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet, rpart, spatial, survival
      Microsoft (Developed / Maintained): checkpoint, deployRserve, doParallel, foreach, jsonlite, iterators, microsoftR, RevoIOQ, RevoMods, RevoUtils, RODBC, RevoUtilsMath, azureml, rmr2, rhdfs, rhbase, plyrmr
      Additional CRAN R: curl, jsonlite, png, R6, RODBC
      Microsoft (Developed / Maintained): RevoScaleR, MicrosoftML, CompatibilityAPI, mrupdate, RevoIOQ, RevoTreeView, mrsdeploy, sqlrutils, olapR
      Distribution labels on the slide: #1 Open-Source; #2 Microsoft R Open (Intel MKL); #3 Microsoft R Client (free); #4 Microsoft R Server. Packages are marked either open-source or commercially licensed & supported.
  • 10. Diagram labels: Big Data; Load Block At A Time; Analyze Blocks In Parallel; Distribute Work, Compile Results; Results; Algorithm Master; Predictive Algorithm. Remote context: Microsoft R Server “Client” (console: R IDE or command-line) and Microsoft R Server “Server”
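The block-wise flow named on slide 10 (load a block at a time, analyze each block, then compile the results) can be sketched in base R. This is only an illustration of the general idea and not the RevoScaleR implementation; the function name `chunked_mean` and the `block_size` parameter are hypothetical, and the input is assumed to be a non-empty numeric vector.

```r
# Minimal sketch of block-wise processing in base R.
# Assumption: `chunked_mean` and `block_size` are hypothetical names chosen
# for illustration; `x` is assumed to be a non-empty numeric vector.
chunked_mean <- function(x, block_size = 1000L) {
  # Start index of each consecutive block ("load block at a time").
  starts <- seq(1L, length(x), by = block_size)
  # Analyze each block independently, keeping only a small summary per block.
  partials <- lapply(starts, function(s) {
    block <- x[s:min(s + block_size - 1L, length(x))]
    c(sum = sum(block), n = length(block))
  })
  # Combine the per-block summaries into the final result ("compile results").
  totals <- Reduce(`+`, partials)
  unname(totals["sum"] / totals["n"])
}

set.seed(1)
delays <- rnorm(10500, mean = 5)
# The block-wise result should agree with mean() up to floating-point tolerance.
all.equal(chunked_mean(delays), mean(delays))
```

Because each block only contributes a sum and a count, the `lapply` step is the part that a parallel or distributed backend could farm out to workers before the summaries are combined.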
  • 11. DistributedR: delivers high performance parallel distributed analytics across individual and clustered systems. Components: R+CRAN, MicrosoftR, DistributedR, DeployR, DevelopR, ScaleR, ConnectR. Listed systems: Cloudera, Hortonworks, MapR, Apache Spark, IBM Platform LSF, Microsoft HPC Clusters, SQL Server, Teradata Database, Red Hat, SuSE Servers, Windows
  • 12. ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for map-reduce.
      Compute context R script (sets where the model will run), local parallel processing on Linux or Windows:
        ### SETUP LOCAL ENVIRONMENT VARIABLES ###
        myLocalCC <- "localpar"
        ### LOCAL COMPUTE CONTEXT ###
        rxSetComputeContext(myLocalCC)
        ### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
        localFS <- RxNativeFileSystem()
        AirlineDataSet <- RxXdfData("AirlineDemoSmall.xdf", fileSystem = localFS)
      Compute context R script, in Hadoop:
        ### SETUP HADOOP ENVIRONMENT VARIABLES ###
        myHadoopCC <- RxHadoopMR()
        ### HADOOP COMPUTE CONTEXT ###
        rxSetComputeContext(myHadoopCC)
        ### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
        hdfsFS <- RxHdfsFileSystem()
        hdfsFS
      Functional model R script (does not need to change to run in Hadoop):
        ### ANALYTICAL PROCESSING ###
        ### Statistical Summary of the data
        rxSummary(~ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
        ### CrossTab the data
        rxCrossTabs(ArrDelay ~ DayOfWeek, data = AirlineDataSet, means = TRUE)
        ### Linear Model and plot
        hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0, data = AirlineDataSet)
        plot(hdfsXdfArrLateLinMod$coefficients)
      Copyright Microsoft Corporation. All rights reserved.
  • 17. ScaleR algorithms by category
      Data Step: data import (Delimited, Fixed, SAS, SPSS, ODBC); variable creation & transformation; recode variables; factor variables; missing value handling; sort, merge, split; aggregate by category (means, sums)
      Descriptive Statistics: min / max, mean, median (approx.); quantiles (approx.); standard deviation; variance; correlation; covariance; sum of squares (cross product matrix for set variables); pairwise cross tabs; risk ratio & odds ratio; cross-tabulation of data (standard tables & long form); marginal summaries of cross tabulations
      Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher’s Exact Test; Student’s t-Test
      Sampling: subsample (observations & variables); random sampling
      Predictive Models: sum of squares (cross product matrix for set variables); multiple linear regression; Generalized Linear Models (GLM) with exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), and user defined distributions & link functions; covariance & correlation matrices; logistic regression; classification & regression trees; predictions/scoring for models; residuals for all models
      Cluster Analysis: K-Means
      Classification: decision trees; decision forests; gradient boosted decision trees; Naïve Bayes
      Variable Selection: stepwise regression
      Simulation: simulation (e.g. Monte Carlo); parallel random number generation
      Combination: rxDataStep; rxExec; PEMA-R API custom algorithms
  • 22. Read from HDFS Write to HDFS Read from HDFS Write to HDFS Read from HDFS
  • 23. Spark Users RDDs Spark Driver YARN Resource Management Name Node HDFS Spark Executor YARN Node Manager HDFS Data Node Spark Executor YARN Node Manager HDFS Data Node Spark Executor YARN Node Manager HDFS Data Node
  • 24. R User Workstation R Server for Hadoop v8.0.5 RDDs HDFS YARN Resource Management Spark Executor YARN Node Manager HDFS Data Node Worker Task Spark Executor YARN Node Manager HDFS Data Node Worker Task Spark Executor YARN Node Manager HDFS Data Node Worker Task HDFS Name Node ScaleR Master Task Finalizer Initiator Edge Node Spark Spark Driver
  • 25. RDDs HDFS YARN Resource Management Spark Executor HDFS Worker Task HDFS Name Node ScaleR Master Task Finalizer Initiator Edge Node Spark Spark Driver Spark Executor HDFS Worker Task Spark Executor HDFS Worker Task Remote Execution: ssh Web Services DeployR R Tools for Visual Studio BI Tools & Applications Jupyter Notebooks Thin Client IDEs https:// https://
  • 31. execute sp_execute_external_script
        @language = N'R'
      , @script = N'
          x <- as.matrix(InputDataSet);
          y <- array(dim1:dim2);
          OutputDataSet <- as.data.frame(x %*% y);'
      , @input_data_1 = N'SELECT [Col1] from MyData;'
      , @params = N'@dim1 int, @dim2 int'
      , @dim1 = 12, @dim2 = 15
      WITH RESULT SETS (([Col1] int, [Col2] int, [Col3] int, [Col4] int));
  • 32. launchpad.exe sp_execute_external_script sqlservr.exe Named pipe SQLOS XEvent MSSQLSERVER Service MSSQLLAUNCHPAD Service “launcher” Windows “satellite” process sqlsatellite.dll Windows “satellite” process Windows “satellite” process Windows “satellite” process Windows “satellite” process