TDWI - Accelerate
October 16, 2:30 – 3:15 PM EDT
Hyatt Regency, Bellevue
• Introduction to R
• Benefits and challenges
• R in Apache Spark: Distributed computing
• R in Databases: In-DB intelligence
Slideshare.net
• 3M+ users
• Taught in most universities
• Thriving user groups worldwide
• 5th in the 2016 IEEE Spectrum ranking
• ~40% of professional analysts prefer R (highest among R, SAS, and Python)
• 10,000+ contributed packages
• Many common use cases across industry
• Rich application & platform integration
What is R?
• The most popular statistical & ML programming language
• A data visualization tool
• Open source
Language
Platform
Community
Ecosystem
R Adoption is on a tear
76% of analytic professionals use R
36% select R as their primary tool
R Usage Growth
Rexer Data Miner Survey 2007-2015
2016 IEEE Spectrum rank
• In-memory operation
• Lack of implicit parallelism
• Expensive data movement & duplication
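To make the point about implicit parallelism concrete, here is a minimal sketch in base R. The workload (`simulate_mean`) and the choice of two workers are hypothetical and only for illustration. A plain `lapply()` runs on one core, and using more cores requires an explicit request, here via the `parallel` package that ships with R.

```r
library(parallel)

# Hypothetical workload: a small simulation repeated for several seeds.
simulate_mean <- function(seed) {
  set.seed(seed)
  mean(rnorm(1e5))
}
seeds <- 1:8

# Sequential: base R runs one task at a time on a single core.
seq_result <- lapply(seeds, simulate_mean)

# Explicit parallelism: start worker processes, distribute the tasks,
# then shut the workers down again.
cl <- makeCluster(2)
par_result <- parLapply(cl, seeds, simulate_mean)
stopCluster(cl)

# Check that both approaches produce the same values.
same_result <- identical(seq_result, par_result)
```

Each worker is a separate R process with its own memory, so any data a task needs has to be copied to it. This is one form of the data movement and duplication cost listed above.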
Scaling R on Spark clusters
• What is Spark?
• A unified, open source, parallel data processing framework for big data analytics
SparkR: R API included with Apache Spark
Data processing and modeling with SparkR
MLlib: Apache Spark's scalable machine learning library
sparklyr: R interface for Apache Spark
Source: http://spark.rstudio.com/
• Easy installation from CRAN
• Loads data into SparkDataFrame from: local R data frames, Hive tables, CSV, JSON, and Parquet files.
• Connect to both local instances of Spark and remote Spark clusters
dplyr and ML in sparklyr
• Includes three families of functions for machine learning pipelines:
  • ml_*: Machine learning algorithms, provided by the spark.ml package, for analyzing data.
    K-Means, GLM, LR, Survival Regression, DT, RF, GBT, PCA, Naive-Bayes, Multilayer Perceptron, LDA
  • ft_*: Feature transformers for manipulating individual features.
  • sdf_*: Functions for manipulating SparkDataFrames.
• Provides a complete dplyr backend for data manipulation and analysis
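As a rough sketch of how the dplyr backend and the three function families fit together, the following assumes the sparklyr and dplyr packages plus a local Spark installation (for example via `sparklyr::spark_install()`). The table name `mtcars_spark`, the `high_hp` column, and the threshold of 150 are illustrative choices that do not come from the slides.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Copy a small built-in R data frame into Spark.
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in Spark;
# collect() brings the small summary back into the R session.
mpg_by_cyl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

# sdf_*: inspect the Spark DataFrame (here, its row count).
n_rows <- sdf_nrow(cars_tbl)

# ft_*: feature transformer that flags cars with horsepower above 150.
cars_flagged <- ft_binarizer(cars_tbl, input_col = "hp",
                             output_col = "high_hp", threshold = 150)

# ml_*: fit a linear regression with Spark's ML library.
fit <- ml_linear_regression(cars_flagged, mpg ~ wt + cyl)

spark_disconnect(sc)
```

Only `mpg_by_cyl` and `n_rows` are ordinary R objects after the connection closes. `cars_tbl` and `cars_flagged` are references to data held in Spark.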
h2o: prediction engine in R
http://www.h2o.ai/product/
• Open source ML platform
• Optimized for “in memory” distributed, parallel ML
• Data manipulation and modeling on H2OFrame: R functions + h2o-prefixed functions.
• Transformations: h2o.group_by(), h2o.impute()
• Statistics: h2o.summary(), h2o.quantile(), h2o.mean()
• Algorithms: h2o.glm(), h2o.naiveBayes(), h2o.deeplearning(), h2o.kmeans(), ...
• rsparkling package: h2o on Spark
• Provides bindings to h2o’s machine learning algorithms: extension package for sparklyr
• Simple data conversion: SparkDataFrame -> H2OFrame
https://github.com/h2oai/rsparkling
ML Server 9.x: Scale-out R
• 100% compatible with open source R
• Virtually any code/package that works today with R will work in ML Server.
• Ability to parallelize any R function
• Ideal for parameter sweeps, simulation, scoring.
• Wide range of scalable and distributed rx-prefixed functions in the RevoScaleR package.
• Transformations: rxDataStep()
• Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()…
• Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()…
• Parallelism: rxSetComputeContext()
Free Developer’s version available
https://aka.ms/freemrs
ScaleR library: parallel and portable for Big Data
• Stream data into blocks from sources: Hive tables, CSV, Parquet, XDF, ODBC and SQL Server.
• ScaleR algorithms work inside multiple cores / nodes in parallel at high speed
• Interim results are collected and combined analytically to produce the output on the entire data set
• XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
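The block-wise pattern described above can be illustrated in base R. This is a conceptual sketch only, not how ScaleR is implemented: the data, the block size, and the choice of statistics (mean and variance) are assumptions made for the example.

```r
# Hypothetical data set, processed in blocks as if it were too large for memory.
set.seed(1)
x <- rnorm(100000, mean = 10, sd = 3)
blocks <- split(x, ceiling(seq_along(x) / 10000))   # 10 blocks of 10,000 values

# Step 1: reduce each block to a few sufficient statistics.
# The blocks are independent, so this step could run on separate cores or nodes.
interim <- lapply(blocks, function(b) {
  c(n = length(b), sum = sum(b), sumsq = sum(b^2))
})

# Step 2: combine the interim results by simple addition.
total <- Reduce(`+`, interim)

# Step 3: derive statistics for the entire data set from the combined totals.
chunked_mean <- unname(total["sum"] / total["n"])
chunked_var  <- unname((total["sumsq"] - total["n"] * chunked_mean^2) /
                         (total["n"] - 1))

# Compare against the usual in-memory results.
mean_matches <- isTRUE(all.equal(chunked_mean, mean(x)))
var_matches  <- isTRUE(all.equal(chunked_var, var(x)))
```

Only the small per-block summaries travel between workers, never the raw rows. The sum-of-squares formula is used here for brevity; a production implementation would likely prefer a numerically more stable update.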
Write once - deploy anywhere (WODA)
ScaleR: Portable across multiple platforms – local, Spark, SQL-Server, etc.
Models can be trained in one and deployed in another
Compute context R script (sets where the model will run)

In Spark:
### SETUP SPARK/HADOOP ENVIRONMENT VARIABLES ###
mySparkCC <- RxSpark()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(mySparkCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = hdfsFS)

Local parallel processing (Linux or Windows):
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = linuxFS)

Functional model R script (does not need to change to run in Spark)
### ANALYTICAL PROCESSING ###
### Statistical summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
Spark clusters in Azure HDInsight
• Provisions Azure compute resources with Spark 2.1 installed and configured.
• Supports multiple versions (e.g. Spark 1.6).
• Stores data in Azure Blob storage (WASB), Azure Data Lake Store or local HDFS.
ML Server Spark cluster architecture
[Diagram: master R process on the edge node; Apache YARN and Spark; worker R processes on the data nodes; ML Server; data in distributed storage]
Model deployment using ML Server operationalization services (mrsdeploy)
Easy Setup
• In-cloud or on-prem
• Adding nodes to scale
• High availability & load balancing
• Remote execution server
Easy Deployment
Easy Integration
Easy Consumption
[Diagram: Data Scientist with Microsoft R Client (mrsdeploy package) calling publishService; Microsoft ML Server configured for operationalizing R analytics; Developer]
Typical advanced analytics lifecycle
[Diagram: Prepare/Explore, Model, Operationalize]
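# Scoring function; `model` is assumed to be a previously trained model available in scope
scoringFn <- function(newdata) {
  library(RevoScaleR)
  data <- rxImport(newdata)   # import the new data
  rxPredict(model, data)      # return predictions from the trained model
}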
ML Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data
[Figure: Logistic regression on the NYC Taxi dataset (2.2 TB); elapsed time vs. billions of rows (0 to 13)]
Base and scalable approaches comparison
Approach     Scalability                      Spark  Hadoop  SQL Server  Teradata  Support
CRAN R (1)   Single machines                  -      -       -           -         Community
SparkR       Single + distributed computing   X      -       -           -         Community
sparklyr     Single + distributed computing   X      -       -           -         Community
h2o          Single + distributed computing   X      X       -           -         Community
RevoScaleR   Single + distributed computing   X      X       X           X         Enterprise
(1) CRAN R indicates no additional R packages installed
tinyurl.com/Strata2017R
https://aka.ms/kdd2017r
https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Misc/StrataSanJose2017
https://learnanalytics.microsoft.com/
https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Misc/KDD2017MRS
https://docs.microsoft.com/en-us/machine-learning-server/what-is-machine-learning-server
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview
For Oracle in-database analytics, see: https://www.oracle.com/database/advanced-analytics/index.html
In-database machine learning
• Develop: Develop, explore and experiment in your favorite IDE
• Train: Train models with sp_execute_external_script and save the models in the database
• Deploy: Deploy your ML scripts with sp_execute_external_script and predict using the models
• Consume: Make your apps/reports intelligent by consuming predictions
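The train, save, and deploy steps can be sketched in plain R, outside the database. This assumes the model is stored as serialized bytes (for example in a varbinary(max) column) and that the script uses the default `InputDataSet` and `OutputDataSet` variable names of sp_execute_external_script. The `mtcars` data and the `lm()` model stand in for a real table and model.

```r
# Train: inside SQL Server the query result arrives as 'InputDataSet';
# here a built-in data frame stands in for it.
InputDataSet <- mtcars
model <- lm(mpg ~ wt + hp, data = InputDataSet)

# Save: convert the model to a raw vector that could be written to a
# binary column in a database table.
model_blob <- serialize(model, connection = NULL)

# Deploy: read the stored bytes back and score new rows.
restored <- unserialize(model_blob)
new_rows <- data.frame(wt = c(2.5, 3.2), hp = c(110, 150))

# The data frame assigned to 'OutputDataSet' is what the stored procedure
# would return to the caller.
OutputDataSet <- data.frame(new_rows,
                            predicted_mpg = predict(restored, new_rows))
```

Because the model travels as bytes, training and scoring can live in separate stored procedures. An application then consumes predictions with an ordinary SQL call.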
Eliminate data movement
Operationalize ML scripts and models
Enterprise grade performance and scale
SQL Transformations
Relational data
Analytics library
Free Developer’s versions available
https://aka.ms/sqlserverdeveloper
R services in-database: Data exploration and predictive modeling (Data Scientist)
EXEC TrainTipPredictionModel
https://docs.microsoft.com/en-us/sql/advanced-analytics/getting-started-with-machine-learning-services
https://blogs.msdn.microsoft.com/microsoft_press/2016/10/19/free-ebook-data-science-with-microsoft-sql-server-2016/
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics with R, Debraj GuhaThakurta
