TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics with R, Debraj GuhaThakurta

TDWI - Accelerate
October 16, 2:30 – 3:15 PM EDT
Hyatt Regency, Bellevue

• Introduction to R
• Benefits and challenges
• R in Apache Spark: Distributed computing
• R in Databases: In-DB intelligence

• 3+M users
• Taught in most universities
• Thriving user groups worldwide
• 5th in 2016 IEEE Spectrum rank
• ~40% of professional analysts prefer R (highest among R, SAS, and Python)
• 10,000+ contributed packages
• Many common use cases across industry
• Rich application & platform integration
What is R?
• The most popular statistical & ML programming language
• A data visualization tool
• Open source
Language
Platform
Community
Ecosystem

R adoption is on a tear
• 76% of analytic professionals use R
• 36% select R as their primary tool
Sources: R Usage Growth, Rexer Data Miner Survey 2007-2015; 2016 IEEE Spectrum rank

o In-Memory operation
o Lack of implicit parallelism
o Expensive data movement & duplication
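The second challenge, the lack of implicit parallelism, can be illustrated with base R alone. The sketch below is not from the slides: it assumes the `parallel` package that ships with R, a two-worker local cluster, and a hypothetical function `slow_square` standing in for real work. Plain `lapply()` stays on one core; the work is only spread out when a cluster is created and used explicitly.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-element computation.
slow_square <- function(x) {
  Sys.sleep(0.01)
  x^2
}

inputs <- 1:8

# Plain lapply() runs every call sequentially in the current R process.
serial_result <- lapply(inputs, slow_square)

# Parallel execution has to be requested explicitly: start workers,
# send the work to them, then shut them down again.
cl <- makeCluster(2)
parallel_result <- parLapply(cl, inputs, slow_square)
stopCluster(cl)

# Both approaches return the same values; only the execution differs.
stopifnot(identical(serial_result, parallel_result))
```

The frameworks in the rest of the deck (SparkR, sparklyr, h2o, RevoScaleR) take over this bookkeeping and extend it from local cores to cluster nodes.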

Scaling R on Spark clusters
• What is Spark?
• A unified, open source, parallel data processing framework for Big Data analytics

SparkR: R API included with Apache Spark

Data processing and modeling with SparkR
MLlib: Apache Spark's scalable machine learning library
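A minimal sketch of what data processing and modeling with SparkR can look like. It is not taken from the slides; it assumes Spark 2.x with the SparkR package on the library path, a local master, and R's built-in `faithful` data set as a stand-in for real data.

```r
library(SparkR)

# Assumption: a local Spark installation that SparkR can locate (SPARK_HOME).
sparkR.session(master = "local[*]", appName = "sparkr-sketch")

# Copy a built-in R data frame into a distributed SparkDataFrame.
df <- as.DataFrame(faithful)

# Data processing: group and count on the Spark side, then peek at the result.
counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(counts, desc(counts$count)))

# Modeling: fit a Gaussian GLM through MLlib.
model <- spark.glm(df, waiting ~ eruptions, family = "gaussian")
summary(model)

sparkR.session.stop()
```

On a real cluster only the `master` setting and the data source would change; the processing and modeling calls stay the same.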

sparklyr: R interface for Apache Spark
Source: http://spark.rstudio.com/
• Easy installation from CRAN
• Loads data into SparkDataFrame from: local R data frames, Hive tables, CSV, JSON, and Parquet files.
• Connect to both local instances of Spark and remote Spark clusters

dplyr and ML in sparklyr
• Includes 3 families of functions for machine learning pipelines
• ml_*: Machine learning algorithms for analyzing data provided by the spark.ml package.
• K-Means, GLM, LR, Survival Regression, DT, RF, GBT, PCA, Naive-Bayes, Multilayer Perceptron, LDA
• ft_*: Feature transformers for manipulating individual features.
• sdf_*: Functions for manipulating SparkDataFrames.
• Provides a complete dplyr backend for data manipulation and analysis
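The three function families and the dplyr backend can be sketched together as follows. This is an illustration rather than code from the talk: it assumes a recent sparklyr release (underscore-style argument names such as `input_col`), a local Spark installation, and the built-in `mtcars` data set; the column name `high_hp` and the threshold are hypothetical.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally, e.g. via spark_install().
sc <- spark_connect(master = "local")

# Copy a local R data frame into Spark.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr backend: the verbs are translated to Spark SQL and run in Spark;
# collect() brings the small result back into R.
by_cyl <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# ft_*: a feature transformer; ml_*: an algorithm from spark.ml.
fit <- mtcars_tbl %>%
  ft_binarizer(input_col = "hp", output_col = "high_hp", threshold = 150) %>%
  ml_linear_regression(mpg ~ wt + high_hp)

# sdf_*: a SparkDataFrame helper, here a row count.
sdf_nrow(mtcars_tbl)

spark_disconnect(sc)
```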

h2o: prediction engine in R
http://www.h2o.ai/product/
• Open source ML platform
• Optimized for “in memory” distributed, parallel ML
• Data manipulation and modeling on H2OFrame: R functions + h2o-prefixed functions.
• Transformations: h2o.group_by(), h2o.impute()
• Statistics: h2o.summary(), h2o.quantile(), h2o.mean()
• Algorithms: h2o.glm(), h2o.naiveBayes(), h2o.deeplearning(), h2o.kmeans(), ...
• rsparkling package: h2o on Spark
• Provides bindings to h2o’s machine learning algorithms: extension package for sparklyr
• Simple data conversion: SparkDataFrame -> H2OFrame
https://github.com/h2oai/rsparkling
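A short sketch of the h2o workflow named above: convert data to an H2OFrame, compute statistics, fit models. It is not from the slides and assumes the `h2o` package with Java available so that a local H2O instance can start; the built-in `iris` data set is a stand-in for real data.

```r
library(h2o)

# Assumption: Java is installed, so h2o.init() can start a local instance.
h2o.init()

# Convert an R data frame into an H2OFrame held by the H2O instance.
iris_hf <- as.h2o(iris)

# Statistics computed by H2O rather than in the R process.
h2o.mean(iris_hf$Sepal.Length)
h2o.quantile(iris_hf$Sepal.Length)

# Algorithms: k-means clustering and a Gaussian GLM.
predictors <- c("Sepal.Width", "Petal.Length", "Petal.Width")
km  <- h2o.kmeans(training_frame = iris_hf, x = predictors, k = 3)
fit <- h2o.glm(x = predictors, y = "Sepal.Length",
               training_frame = iris_hf, family = "gaussian")

h2o.shutdown(prompt = FALSE)
```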

ML Server 9.x: Scale-out R
• 100% compatible with open source R
• Virtually any code/package that works today with R will work in ML Server.
• Ability to parallelize any R function
• Ideal for parameter sweeps, simulation, scoring.
• Wide range of scalable and distributed rx-prefixed functions in the RevoScaleR package.
• Transformations: rxDataStep()
• Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()…
• Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()…
• Parallelism: rxSetComputeContext()

Free Developer’s version available
https://aka.ms/freemrs

ScaleR library: parallel and portable for Big Data
Stream data into blocks from sources: Hive tables, CSV, Parquet, XDF, ODBC and SQL Server.
ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.
Interim results are collected and combined analytically to produce the output on the entire data set.
XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
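The idea of processing blocks independently and combining interim results analytically can be shown with base R alone. This is a simplified illustration under stated assumptions, not RevoScaleR's actual implementation: the data are simulated, the block size of 1,000 is arbitrary, and the statistic is just a mean and variance.

```r
# Hypothetical data standing in for a large file that is read block by block.
set.seed(42)
x <- rnorm(10000)
blocks <- split(x, ceiling(seq_along(x) / 1000))

# Step 1: each block yields a small interim summary. The blocks are
# independent, so this step could run on different cores or nodes.
interim <- lapply(blocks, function(b) {
  c(n = length(b), sum = sum(b), sumsq = sum(b^2))
})

# Step 2: combine the interim results analytically into statistics
# for the entire data set, without revisiting the raw data.
totals <- Reduce(`+`, interim)
n <- totals[["n"]]
block_mean <- totals[["sum"]] / n
block_var  <- (totals[["sumsq"]] - n * block_mean^2) / (n - 1)

# The combined results agree with the single-pass, in-memory answers.
stopifnot(isTRUE(all.equal(block_mean, mean(x))),
          isTRUE(all.equal(block_var, var(x))))
```

Only the three numbers per block travel between workers, which is why this pattern scales to data that does not fit in memory.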

Write once - deploy anywhere (WODA)
ScaleR: Portable across multiple platforms – local, Spark, SQL-Server, etc.
Models can be trained in one and deployed in another
Compute context R script (sets where the model will run): in Spark
### SETUP SPARK/HADOOP ENVIRONMENT VARIABLES ###
mySparkCC <- RxSpark()
### SPARK COMPUTE CONTEXT ###
rxSetComputeContext(mySparkCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = hdfsFS)

Compute context R script (sets where the model will run): local parallel processing, Linux or Windows
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = linuxFS)

Functional model R script (does not need to change to run in Spark)
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)

Spark clusters in Azure HDInsight
• Provisions Azure compute resources with Spark 2.1 installed and configured.
• Supports multiple versions (e.g. Spark 1.6).
• Stores data in Azure Blob storage (WASB), Azure Data Lake Store or local HDFS.

ML Server Spark cluster architecture
[Diagram labels: ML Server; master R process on edge node; Apache YARN and Spark; worker R processes on data nodes; data in distributed storage]

Model deployment using ML Server operationalization services (mrsdeploy)
• Easy Setup: Microsoft ML Server configured for operationalizing R analytics
o In-cloud or on-prem
o Adding nodes to scale
o High availability & load balancing
o Remote execution server
• Easy Deployment: Data Scientist, Microsoft R Client (mrsdeploy package), publishService
• Easy Integration: Developer
• Easy Consumption: Data Scientist, Microsoft R Client (mrsdeploy package)

Typical advanced analytics lifecycle
Prepare/Explore, Model, Operationalize

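# Scoring function for operationalization; 'model' is not defined here and
# must already exist in the environment where the function runs.
scoringFn <- function(newdata) {
  library(RevoScaleR)
  data <- rxImport(newdata)   # import the incoming data
  rxPredict(model, data)      # score it with the trained model
}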
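A sketch of how a function and model can be published as a web service with mrsdeploy. It is not from the slides: it assumes an ML Server instance configured for operationalization, and the endpoint address, credentials, service name, and the `glm` model on `mtcars` are all hypothetical placeholders.

```r
library(mrsdeploy)

# Assumptions: hypothetical endpoint and credentials for an ML Server
# instance configured for operationalization.
remoteLogin("http://localhost:12800",
            username = "admin",
            password = Sys.getenv("MLSERVER_PASSWORD"),
            session = FALSE)

# A small illustrative model; any R model object could be used instead.
model <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# The function the service will expose; it relies on 'model' being
# shipped along with it (see the 'model' argument below).
predictAm <- function(hp, wt) {
  newdata <- data.frame(hp = hp, wt = wt)
  predict(model, newdata, type = "response")
}

# Publish the function and model as a versioned web service.
api <- publishService(
  "amPredictionService",
  code = predictAm,
  model = model,
  inputs = list(hp = "numeric", wt = "numeric"),
  outputs = list(answer = "numeric"),
  v = "v1.0.0"
)

# Consume the service from R.
result <- api$predictAm(120, 2.8)
print(result$output("answer"))

remoteLogout()
```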

ML Server on Hadoop/HDInsight scales to hundreds of
nodes, billions of rows and terabytes of data
[Chart: Logistic Regression on NYC Taxi Dataset (2.2 TB); elapsed time vs. billions of rows, 0 to 13]

Base and scalable approaches comparison
Approach     Scalability                      Spark  Hadoop  SQL Server  Teradata  Support
CRAN R [1]   Single machines                                                       Community
SparkR       Single + Distributed computing   X                                    Community
sparklyr     Single + Distributed computing   X                                    Community
h2o          Single + Distributed computing   X      X                             Community
RevoScaleR   Single + Distributed computing   X      X       X           X         Enterprise
1. CRAN R indicates no additional R packages installed
tinyurl.com/Strata2017R
https://aka.ms/kdd2017r

https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Misc/StrataSanJose2017
https://learnanalytics.microsoft.com/
https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Misc/KDD2017MRS

https://docs.microsoft.com/en-us/machine-learning-server/what-is-machine-learning-server

https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview

For Oracle in-DB analytics, see: https://www.oracle.com/database/advanced-analytics/index.html

In-database machine learning
• Develop: Develop, explore and experiment in your favorite IDE
• Train: Train models with sp_execute_external_script and save the models in database
• Deploy: Deploy your ML scripts with sp_execute_external_script and predict using the models
• Consume: Make your app/reports intelligent by consuming predictions
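The R side of the train and deploy steps can be sketched in plain R. With sp_execute_external_script, the query result arrives in the script as a data frame named `InputDataSet` by default, and the data frame assigned to `OutputDataSet` is returned to SQL Server. The sketch below simulates that hand-off outside the database; the `mtcars` data, the `lm` model, and the column name `predicted_mpg` are illustrative assumptions.

```r
# Simulated input: inside SQL Server this data frame would be filled
# from the T-SQL query passed to sp_execute_external_script.
InputDataSet <- mtcars[, c("mpg", "wt", "hp")]

# Train: fit a model on the rows handed over by the database.
model <- lm(mpg ~ wt + hp, data = InputDataSet)

# Save: serialize the model to a raw vector, which is the form that
# can be stored in a varbinary(max) column.
model_bytes <- serialize(model, connection = NULL)

# Deploy: a scoring script restores the stored model and predicts.
restored <- unserialize(model_bytes)
OutputDataSet <- data.frame(predicted_mpg = predict(restored, InputDataSet))

stopifnot(is.raw(model_bytes),
          nrow(OutputDataSet) == nrow(InputDataSet))
```

Keeping the model as a binary value next to the data is what lets scoring run in the database without moving rows out.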

Eliminate data movement
Operationalize ML scripts and models
Enterprise grade performance and scale
SQL Transformations
Relational data
Analytics library

Free Developer’s versions available
https://aka.ms/sqlserverdeveloper

R services in-database: Data exploration and predictive modeling (Data Scientist)

EXEC TrainTipPredictionModel

https://docs.microsoft.com/en-us/sql/advanced-analytics/getting-started-with-machine-learning-services
https://blogs.msdn.microsoft.com/microsoft_press/2016/10/19/free-ebook-data-science-with-microsoft-sql-server-2016/
