| © Copyright 2015 Hitachi Consulting1
Microsoft R
ScaleR Overview with a Quick Tutorial
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.
Outline
• Experimental Data Science vs Operational Machine Learning
• Microsoft R Server
• Overview of ScaleR
• How to: Set Up the Environment
• How to: Get Data
• How to: Process & Transform
• How to: Summarize, Analyse, and Visualize
• How to: Learn & Predict
• How to: Deploy and Consume (mrsdeploy)
• Overview of MicrosoftML package functionality
Experimental Data Science vs Operational
Machine Learning
Data Science Activities
Experimentation vs. Operationalization
Data Analysis & Experimentation
• Interactive
• Easy to perform
• Rich Visualizations
[Diagram: Exploratory Data Analysis (Collect Data, Blend, Visualize, Prepare) leads to a Report of Visuals & Findings and a Decision. ML Experiment (Algorithm Selection, Parameter Tuning, Training & Testing) learns a Model from the Dataset.]
Data Science Activities
Experimentation vs. Operationalization
Operational ML Pipelines
• Pipelined (ETL Integration)
• Scalable
• Apps Integration
[Diagram: Automated ML Pipeline: Data Ingestion, Data Processing, Model Training, Scoring (Batch or Real-time). Train, then Export the Model and Deploy it as Web APIs, which Online Apps call to Predict.]
Microsoft R Server
Microsoft R Server
R in Microsoft World
Microsoft R Open (MRO)
• Based on the latest Open Source R (3.2.2) – built, tested, and distributed by Microsoft
• More efficient and multi-threaded computation
• Enhanced by Intel Math Kernel Library (MKL) to speed up linear algebra functions
• Compatible with all R-related software
Microsoft R Server
Comparison
              | CRAN                               | MRO                                | MRS
Data size     | In-memory                          | In-memory                          | In-memory & disk
Efficiency    | Single threaded                    | Multi-threaded                     | Multi-threaded, parallel processing 1:N servers
Support       | Community                          | Community                          | Community + Commercial
Functionality | 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions
Licence       | Open Source                        | Open Source                        | Commercial license
Microsoft R Server
Components & Compute Contexts
[Diagram: Microsoft R Server stack: CRAN & MS R Open, ScaleR, DistributedR, ConnectR, MicrosoftML package, and Operationalization (mrsdeploy), scaled and deployed across different compute contexts. Clients: RStudio | RTVS with MS R Client.]
• Installed on Windows or Linux
• ScaleR – Optimized for parallel execution on Big Data, to eliminate memory limitations.
• ConnectR – Provides access to local file systems, HDFS, Hive, SQL Server, Teradata, etc.
• DistributedR – Adaptable parallel execution framework to enable running on different (distributed) compute contexts.
• Operationalization (mrsdeploy) – Deploy the model as a Web API.
https://msdn.microsoft.com/en-us/microsoft-r/microsoft-r-getting-started
Microsoft R
ScaleR Summary Map

Setup
1- Get Information
• Revo.home()
• Revo.version
• rxGetComputeContext()
• rxGetFileSystem()
• rxOptions()
2- Set Properties
• rxSetComputeContext(): RxLocalSeq, RxLocalParallel, RxInSqlServer, RxInTeradata, RxHadoopMR, RxSpark
• rxSetFileSystem(): RxNativeFileSystem, RxHdfsFileSystem
• rxOptions(option = value)

Import Data
1- Reference to a Data Source
• RxTextData()
• RxSqlServerData()
• RxOdbcData()
• RxTeradata()
• RxSasData()
• RxSpssData()
• RxHiveData()
• RxParquetData()
2- Import Data to XDF
• rxImport()
3- Reference XDF
• RxXdfData()
4- View Data Information
• rxGetInfo()

Process & Transform
rxDataStep()
• inData (ref to data source)
• outFile (xdf)
• overwrite (the outFile if exists)
• varsToKeep (column selection)
• rowSelection (filter)
• transformObjects (needed in your process)
• transformPackages (needed in your process)
• transformFunc (function with your processing logic)
rxMerge()
• inData1
• inData2
• outFile
• matchVars
• type (match type)
Others
• rxSplit()
• rxSort()
• rxFactors()

Summarize
• rxSummary(), rxQuantile(), rxCrossTabs(), rxCube() – take (formula, data)
• rxMarginals(), as.xtabs() – take (crossTabs)

Analyze
• rxCovCor(), rxCor(), rxSSCP() – take (formula, data)
• rxChiSquaredTest(), rxFisherTest(), rxKendallCor(), rxRiskRatio(), rxOddsRatio() – take (xtab)

Visualize
• rxHistogram()
• rxLinePlot()
• rxRocCurve()

Learn & Predict
Classification – take (formula, data)
• rxDTree()
• rxBTrees()
• rxDForest()
• rxNaiveBayes()
• rxLogit()
Regression – take (formula, data)
• rxLinMod()
• rxGlm()
• rxDTree()
• rxBTrees()
Clustering – takes (formula, data)
• rxKmeans()
Predict
• rxPredict(model, data)
• rxRoc()

Deploy (mrsdeploy)
• remoteLogin()
• listServices()
• getService()
• publishService()
• api$consume()
Microsoft R – ScaleR
Get Information
Revo.version – query the version of the current ScaleR installation.
Revo.home() – get the path of the R installation currently in use. Make sure it is Microsoft R (Client or Server), not Open-Source R.
rxGetComputeContext() – get the current compute context. You can set the current compute context to many different options, as shown next.
rxGetFileSystem() – get the default file system in use. You can change the file system from "native" to "hdfs", as shown next.
rxOptions() – list all the ScaleR configuration options and their current values. You can get the value of a specific option using rxGetOption("optionName").
Microsoft R – ScaleR
Set Information
rxSetComputeContext(computeContext) – set the compute context. The following are the available options; each is a compute context object (each needs different parameters to construct):
• RxLocalSeq()
• RxLocalParallel()
• RxInSqlServer()
• RxInTeradata()
• RxHadoopMR()
• RxSpark()
rxSetFileSystem(fileSystem) – the file system object can be one of the following two options:
• RxNativeFileSystem()
• RxHdfsFileSystem()
rxOptions(option = value) – used to set an option. Note that these are global default values; you can override them in each operation. The defaults you set here are used when an operation does not specify its own.
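The get and set calls on the last two slides can be combined into a short session check. This is a minimal sketch rather than a definitive setup script: it assumes the RevoScaleR package is installed (Microsoft R Client or Server), and it falls back to base R otherwise so the script still runs. The variable names `has_scaler` and `previous_context` are illustrative, and `reportProgress` is used only as an example option.

```r
# Use ScaleR when it is installed; otherwise fall back to base R.
has_scaler <- requireNamespace("RevoScaleR", quietly = TRUE)

if (has_scaler) {
  library(RevoScaleR)

  # Remember the current compute context so it can be restored afterwards.
  previous_context <- rxGetComputeContext()

  # Switch to a local, sequential compute context and the native file system.
  rxSetComputeContext(RxLocalSeq())
  rxSetFileSystem(RxNativeFileSystem())

  # Set a global default option, then read it back.
  rxOptions(reportProgress = 0)
  print(rxGetComputeContext())
  print(rxGetFileSystem())
  print(rxGetOption("reportProgress"))

  # Restore the original compute context.
  rxSetComputeContext(previous_context)
} else {
  # Assumption: plain open-source R, so only base information is available.
  message("RevoScaleR is not installed; running on ", R.version.string)
  print(R.home())
}
```

Restoring the previous compute context at the end keeps the check from changing how later operations in the same session run.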
Microsoft R – ScaleR
Get Data
1. Reference a Data Source – the following functions create references to various data sources:
• RxTextData()
• RxOdbcData()
• RxSqlServerData()
• RxTeradata()
• RxSasData()
• RxSpssData()
• RxHiveData()
• RxParquetData()
2. Import the data to an eXternal Data Frame (xdf) – Note that you can query the data in the data source, but you need to import it to xdf to be able to process it in your compute context.
rxImport( inData = dataSource, outFile = xdfFile.xdf )
• overwrite = Boolean flag: whether to replace an existing xdf file
• append = use "rows" to append to the same .xdf file
3. Read the imported xdf data
RxXdfData( file = xdfFile.xdf )
• createCompositeSet = set to TRUE if you point to a directory that contains multiple .xdf files, to treat them as one dataset.
Microsoft R – ScaleR
Reference a Data Source
file_path = file.path(data_directory, "iris.csv")
txtDataSource = RxTextData(file = file_path)
OR
connection_string = "Driver=SQL Server; Server=.; Database=dbdemo; Trusted_Connection = True;"
sql_query = "SELECT * FROM iris;"
sqlDataSource = RxSqlServerData(connectionString = connection_string, sqlQuery = sql_query)
Note that this is only a reference to the data source; nothing is done with the data until you query it, e.g. head(txtDataSource).
Microsoft R – ScaleR
Import to xdf
xdf_file_path = file.path(data_directory, "iris.xdf")
iris_xdata = rxImport( inData = txtDataSource, outFile = xdf_file_path,
                       overwrite = TRUE, append = "none" )
• inData = any "Rx" data source, or a file path
• outFile = file to store the .xdf dataset
• overwrite = Boolean flag: whether to replace an existing xdf file
• append = use "rows" to append to the same .xdf file
This creates the iris.xdf file in your file system and returns iris_xdata, a reference for working with the dataset.
You can read the .xdf file later:
iris_xdata = RxXdfData( file = xdf_file_path )
class(iris_xdata)
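The reference, import, and inspect steps can be tried end to end on a throwaway copy of the built-in iris data. The sketch below is illustrative only: it assumes RevoScaleR is installed for the ScaleR path and otherwise falls back to `read.csv()` so that it still runs in open-source R. The temporary directory, the `stringsAsFactors = TRUE` setting, and the names `csv_path`, `txt_source`, and `n_rows` are choices made for this example, not part of the original slides.

```r
# Write the built-in iris data to a CSV file in a temporary directory.
data_directory <- tempdir()
csv_path <- file.path(data_directory, "iris.csv")
write.csv(iris, csv_path, row.names = FALSE)

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # 1. Reference the text data source (nothing is read yet).
  txt_source <- RxTextData(file = csv_path, stringsAsFactors = TRUE)

  # 2. Import it to an .xdf file; the result is a reference to that file.
  xdf_path <- file.path(data_directory, "iris.xdf")
  iris_xdata <- rxImport(inData = txt_source, outFile = xdf_path, overwrite = TRUE)

  # 3. Inspect the imported dataset.
  n_rows <- rxGetInfo(iris_xdata)$numRows
} else {
  # Fallback for open-source R: read the same file into memory.
  iris_df <- read.csv(csv_path, stringsAsFactors = TRUE)
  n_rows <- nrow(iris_df)
}

print(n_rows)
```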
Microsoft R – ScaleR
Describing xdf
rxGetInfo( data = iris_xdata, getVarInfo = TRUE, numRows = 2)
rxSummary(formula = ~., data = iris_xdata)
Microsoft R – ScaleR
Read a subset of xdf to a data frame
iris_subset = rxReadXdf(file = iris_xdata, startRow = 10, numRows = 5)
• iris_subset = in-memory data frame
• file = the xdf data source (or the path to the .xdf file)
• startRow = first row to retrieve
• numRows = number of rows to retrieve
It is sometimes useful to read a (small) subset of the xdf into a data frame, to test a processing function on it before applying it to the big data (xdf).
Microsoft R – ScaleR
Process & Transform
Remember that your compute context can be a distributed processing cluster: HPC, Spark, Hadoop, etc.
In such a case, each node of the compute cluster processes a subset of your xdf, which is itself split across HDFS.
Your data processing operation needs to take this into account, i.e., all the needed objects and packages must be available on each node to process its portion of the data.
The rxDataStep() function is used to process and transform an xdf dataset, and can be used to:
• Filter rows
• Select columns
• Add computed columns
• Convert column types (e.g. discretize to factors)
• Update existing columns (handling missing values, scaling & normalizing, etc.)
rxDataStep(…)
• inData = xdf to process
• outFile = can be the same as the input xdf. If omitted, the function returns a data frame
• overwrite = set to TRUE if inData = outFile
• rowSelection = (col1 > 50) & …
• varsToKeep = character vector of columns to select
• transformFunc = a function that holds the processing logic
• transformObjects = named list of objects used in the function
• transformPackages = character vector of packages used in the function
Microsoft R – ScaleR
Process & Transform
Extract means and stdvs (will be used to normalize some columns)
rxsummary = rxSummary(~.,iris_xdata)
str(rxsummary$sDataFrame)
means = rxsummary$sDataFrame$Mean
stdvs = rxsummary$sDataFrame$StdDev
Extract quantiles for Sepal.Length (will be used to discretize it)
cut_points = rxQuantile(varName = "Sepal.Length", data = iris_xdata)
cut_points
Microsoft R – ScaleR
Process & Transform
Create data processing function
process_data = function(data_frame){
  # discretize (include.lowest = TRUE keeps the minimum value in the first bin)
  data_frame$Sepal.Length_Disc = cut(data_frame$Sepal.Length, breaks = cut_points,
                                     include.lowest = TRUE)
  # normalize
  data_frame$Petal.Length_norm = (data_frame$Petal.Length - means[3])/stdvs[3]
  data_frame$Petal.Width_norm = (data_frame$Petal.Width - means[4])/stdvs[4]
  return(data_frame)
}
Note the following:
• The function receives one chunk of the xdf dataset at a time (the portion being processed on a compute node); ScaleR passes the chunk as a list of column vectors, which can be handled here like a data frame
• cut_points, means, and stdvs are variables that become available in the scope of this function when passed via the transformObjects argument of rxDataStep()
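Before running a transform on the full xdf, the logic can be checked on an in-memory data frame, as suggested earlier in the deck. The following is a base-R sketch only: it stands in `quantile()`, `mean()`, and `sd()` for `rxQuantile()` and `rxSummary()`, and it uses the built-in iris data frame in place of a chunk of the xdf. The names `check_df` and `out` are illustrative.

```r
# In-memory stand-ins for the objects that ScaleR would compute from the xdf.
check_df <- iris
means <- sapply(check_df[, 1:4], mean)      # same column order as iris.csv
stdvs <- sapply(check_df[, 1:4], sd)
cut_points <- quantile(check_df$Sepal.Length)

process_data <- function(data_frame) {
  # Discretize Sepal.Length at the quartiles; include.lowest keeps the minimum.
  data_frame$Sepal.Length_Disc <- cut(data_frame$Sepal.Length,
                                      breaks = cut_points,
                                      include.lowest = TRUE)
  # Normalize the petal measurements to zero mean and unit standard deviation.
  data_frame$Petal.Length_norm <- (data_frame$Petal.Length - means[3]) / stdvs[3]
  data_frame$Petal.Width_norm <- (data_frame$Petal.Width - means[4]) / stdvs[4]
  data_frame
}

out <- process_data(check_df)
print(table(out$Sepal.Length_Disc))
print(summary(out$Petal.Length_norm))
```

Checking for missing values in the discretized column is a quick way to catch break points that do not cover the full range of the data.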
Microsoft R – ScaleR
Process & Transform
Execute the process_data function on the iris_xdata
rxDataStep(inData = iris_xdata, outFile = iris_xdata, overwrite = TRUE,
           rowSelection = !is.na(Species),
           transformFunc = process_data,
           transformObjects = list(
             "cut_points" = cut_points,
             "means" = means,
             "stdvs" = stdvs
           )
)
Microsoft R – ScaleR
Summarize & Analyse
Understand variable dependencies & correlations
formula = ~ Species + Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
rxCovCor(formula, data = iris_xdata, type = "Cor")
• "Cor" = correlation
• "Cov" = covariance
• "SSCP" = sums of squares / cross-products
Microsoft R – ScaleR
Summarize & Analyse
Summarize data (generate sums, means, and counts)
using cross tabs
formula = Sepal.Width ~ Sepal.Length_Disc:Species
ctabs = rxCrossTabs(formula, data = iris_xdata, means = TRUE)
ctabs$sums
ctabs$means
ctabs$counts
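To make the relationship between the sums, counts, and means concrete, the same cross tabulation can be reproduced in memory with base R. This is only a cross-check sketch on the built-in iris data frame, using `xtabs()` in place of `rxCrossTabs()`; the discretized column is rebuilt here from quartiles, which is an assumption about how Sepal.Length_Disc was produced.

```r
# Rebuild the discretized column on an in-memory copy of iris.
iris_df <- iris
quartiles <- quantile(iris_df$Sepal.Length)
iris_df$Sepal.Length_Disc <- cut(iris_df$Sepal.Length,
                                 breaks = quartiles,
                                 include.lowest = TRUE)

# Sum of Sepal.Width in each (Sepal.Length_Disc, Species) cell.
sums <- xtabs(Sepal.Width ~ Sepal.Length_Disc + Species, data = iris_df)

# Number of rows in each cell.
counts <- xtabs(~ Sepal.Length_Disc + Species, data = iris_df)

# Cell means are sums divided by counts (NaN where a cell is empty).
means <- sums / counts

print(counts)
print(round(means, 2))
```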
Microsoft R – ScaleR
Summarize & Analyse
Summarize cross tab results
summary(ctabs, output = "means")
Get Margins
rxMarginals(ctabs, output = "sums")
Perform Statistical Dependency test
Microsoft R – ScaleR
Summarize & Analyse
Summarize using rxCube (to produce a long-format table)
formula = Petal.Width ~ F(Petal.Length)
rxCube(formula, data = iris_xdata)
• F(variable) converts the variable into a factor, on the fly, using the distinct rounded values of this variable
Microsoft R – ScaleR
Visualize
rxHistogram(~Sepal.Length|Species, data = iris_xdata)
Microsoft R – ScaleR
Learn & Predict
Classification Algorithms
• rxDTree() – Decision Trees for classification and regression. Can be converted to rpart tree models
• rxBTrees() – Gradient Boosted Trees
• rxDForest() – Random Forests
• rxNaiveBayes()
• rxLogit() – Logistic Regression Models
Regression Algorithms
• rxLinMod() – Linear Regression Models
• rxGlm() – Generalized Linear Models
• rxDTree()
• rxBTrees()
Clustering Algorithm
• rxKmeans()
All the algorithms accept the following parameters
• formula: response ~ input1+input2:input3
• data: learning set
• Other parameters, depending on the algorithm
Microsoft R – ScaleR
Learn & Predict – Decision Trees Example
rxDTree() is used to train classification trees (target variable is categorical) and regression trees (target variable is numeric).
The output is similar to an rpart tree model. The key parameters are:
• formula: response ~ input1+input2:input3
• data: training set
• xVal: number of cross validation folds for pruning
• maxDepth: maximum number of tree levels (to control complexity)
• minBucket: minimum number of examples that must be in a leaf node (to control complexity)
formula = Species ~ Sepal.Length + Sepal.Width +
          Petal.Length + Petal.Width
models.dtree = rxDTree(formula, data = iris_xdata)
models.dtree
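Because the rxDTree() output resembles an rpart model, the complexity controls can be prototyped in memory with the open-source rpart package before training on the xdf. The sketch below uses rpart, not ScaleR: `rpart.control()`'s `xval`, `maxdepth`, and `minbucket` play roles similar to xVal, maxDepth, and minBucket, but the two implementations are not guaranteed to grow identical trees. The parameter values are arbitrary examples, and rpart is assumed to be installed (it ships with most R distributions).

```r
library(rpart)

tree_formula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

# Complexity controls, analogous to xVal, maxDepth, and minBucket in rxDTree().
controls <- rpart.control(xval = 5, maxdepth = 3, minbucket = 5)

# Fit a classification tree on the in-memory iris data frame.
prototype_tree <- rpart(tree_formula, data = iris, method = "class",
                        control = controls)

print(prototype_tree)
```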
Microsoft R – ScaleR
Learn & Predict – Decision Trees Example
# get predictions, in form of probabilities
predictions = rxPredict(models.dtree, data = iris_xdata,
                        type = c("prob"))
# select only columns of actual and predicted (as data frame)
predictions = rxDataStep(predictions,
  varsToKeep = c("Species",
                 "setosa_Pred",
                 "versicolor_Pred", "virginica_Pred"),
  transforms = list( setosa_actual = as.numeric(Species == 'setosa'),
                     versicolor_actual = as.numeric(Species == 'versicolor'),
                     virginica_actual = as.numeric(Species == 'virginica')
  )
)
# display the prediction results
rxGetInfo(predictions, getVarInfo = TRUE, numRows = 5)
# plot Roc Curve (with respect to versicolor predictions)
rxRocCurve(actualVarName = "versicolor_actual",
           predVarNames = c("versicolor_Pred"),
           data = predictions)
Microsoft R – ScaleR
Learn & Predict – Decision Trees Example
# compute accuracy
predictions = rxPredict(models.dtree, data = iris_xdata,
                        type = c("class"))
predictions = rxReadXdf( predictions,
                         varsToKeep = c("Species","Species_Pred"))
# compare as character, in case the two factors have different level sets
accuracy = mean(as.character(predictions$Species) ==
                as.character(predictions$Species_Pred))
print(accuracy)
# use Revo Tree View to show the tree
tree = RevoTreeView::createTreeView(models.dtree)
plot(tree)
# convert to an rpart tree model
rpart_tree = as.rpart(models.dtree)
class(rpart_tree)
# export to PMML format
library(pmml)
pmml(rpart_tree)
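The accuracy calculation is easy to verify on a handful of hand-written labels. The sketch below uses made-up toy vectors (not output from the model above) to show the fraction of matching labels and a confusion matrix in base R. The predicted factor deliberately lacks one level, since comparing two factors with different level sets directly with `==` raises an error in R.

```r
# Toy labels, invented for illustration only.
actual <- factor(c("setosa", "setosa", "versicolor", "virginica"))
predicted <- factor(c("setosa", "versicolor", "versicolor", "versicolor"))

# Accuracy: fraction of rows where the predicted label equals the actual one.
accuracy <- mean(as.character(actual) == as.character(predicted))
print(accuracy)

# Confusion matrix: align the predicted levels with the actual levels first.
confusion <- table(actual = actual,
                   predicted = factor(predicted, levels = levels(actual)))
print(confusion)
```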
Microsoft R – ScaleR
Parallel Processing on Partitioned Data
In some cases, instead of building one "Big" model using all your "Big" data, you build "many" models using "small" subsets of the data.
For example, you might build many time-series models, one for each product line, for demand forecasting, or several regression models, one for each geographic area, for fraud detection.
This is also called a mixture of local models.
In this case, your data is partitioned into (smaller) subsets by a certain criterion, and then local models are built, one for each data subset.
Such a process can be performed in parallel using the rxExecBy() function, which takes the following parameters:
• inData = xdf dataset to be partitioned
• keys = character vector of the names of the dataset columns by which the data will be partitioned. These columns should be of type factor
• func = the function that will be applied to each data partition (i.e., learning a local model)
• rxExecBy() returns a list containing the constructed model of each partition
[Diagram: Parallel Learning. The Dataset is partitioned into Subset 1, Subset 2, and Subset 3; Local Model 1, Local Model 2, and Local Model 3 are each learned from the corresponding subset.]
Microsoft R – ScaleR
Parallel Processing on Partitioned Data
For example, using the iris dataset, let's build a regression model that estimates Sepal.Length based on Sepal.Width, for each Species type.
In other words, we will partition the iris dataset into 3 subsets, one for each Species type (setosa, versicolor, virginica), and build a local model for each partition, in parallel.
iris_source = RxTextData(file = file.path(data_directory,"iris.csv"))
buildLocalModels = function(keys, data){
  # data references one partition; import it to a data frame first
  local_df = rxImport(inData = data)
  local_model = rxLinMod(formula = Sepal.Length ~ Sepal.Width, data = local_df)
  return(local_model)
}
local_models = rxExecBy(inData = iris_source, keys = c("Species"),
                        func = buildLocalModels)
local_models[[1]]$result
local_models[[2]]$result
local_models[[3]]$result
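The partition-then-learn pattern can be checked in memory before running it with rxExecBy(). The sketch below is a sequential base-R equivalent, not a parallel one: `split()` plays the role of the partitioning by key, and `lapply()` with `lm()` stands in for the per-partition function and rxLinMod(). It runs on the built-in iris data frame, and the names `partitions` and `slopes` are illustrative.

```r
# Partition the data by the key column: one data frame per Species.
partitions <- split(iris, iris$Species)

# Learn one local linear model per partition (sequentially, in memory).
local_models <- lapply(partitions, function(part) {
  lm(Sepal.Length ~ Sepal.Width, data = part)
})

# Compare the fitted slope of Sepal.Width across the local models.
slopes <- sapply(local_models, function(model) coef(model)[["Sepal.Width"]])
print(slopes)
```

Comparing the per-partition coefficients against the distributed results is a simple way to confirm that the function passed to rxExecBy() does what is intended.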
Microsoft R – mrsdeploy
Deploy & Consume
In order to deploy an R model as a web API, you need to configure an MS R Server for operationalization, by running the R-Server-Admin-Util, as described in this link: https://msdn.microsoft.com/en-us/microsoft-r/operationalize/about
Microsoft R – mrsdeploy
Deploy & Consume
library(mrsdeploy)
# generate data
x = 1:100
y = 2*x + rnorm(n = length(x), mean = 0, sd = 5)
# build a linear model
reg_model = lm(y ~ x)
# create a prediction function: takes input, and uses the lm to estimate the output
estimate_output = function(input){
  newdata = data.frame(x = input)
  estimates = predict(reg_model, newdata = newdata, type = "response")
  return(estimates)
}
# connect to the R Server to deploy to
# (session = FALSE logs in without opening an interactive remote R session)
remoteLogin("http://localhost:12800", username = "admin", password = <password>,
            session = FALSE)
# paste0() avoids a space in the service name
serviceName <- paste0("estimate_output_", round(as.numeric(Sys.time()), 0))
# publish the prediction function
api = publishService( serviceName, code = estimate_output,
                      model = reg_model, # model to be used in the function
                      inputs = list(input = "numeric"),
                      outputs = list(output = "numeric"),
                      v = "v1.0.0")
# query the published API
api
# list the deployed APIs
mrsdeploy::listServices()
# consume the API
result = api$estimate_output(120)
result$output("output")
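It can help to exercise the prediction function locally before publishing it, so that any error surfaces in the R session rather than behind the web API. The sketch below needs no R Server and does not use mrsdeploy: it rebuilds the same toy model with a fixed seed (an addition for reproducibility) and calls the function directly with a single value and with a vector.

```r
# Rebuild the toy model from the slide, with a seed so runs are repeatable.
set.seed(42)
x <- 1:100
y <- 2 * x + rnorm(n = length(x), mean = 0, sd = 5)
reg_model <- lm(y ~ x)

# The function that would be published: numeric input in, numeric estimate out.
estimate_output <- function(input) {
  newdata <- data.frame(x = input)
  unname(predict(reg_model, newdata = newdata, type = "response"))
}

# Call it locally, with one value and with several values.
single <- estimate_output(120)
batch <- estimate_output(c(10, 50, 120))
print(single)
print(batch)
```

Since the data were generated as y = 2x plus noise, the estimate for an input of 120 should land close to 240.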
Microsoft R – MicrosoftML
MicrosoftML Overview
Machine Learning Algorithms
• rxFastLinear() – binary classification & regression
• rxOneClassSvm() – anomaly detection (unsupervised)
• rxFastTrees() – classification & regression
• rxFastForest() – classification & regression
• rxNeuralNet() – classification & regression
• rxLogisticRegression() – binary & multi-class classification
• rxEnsemble() – combine a number of models of various kinds
Text Processing
• featurizeText() – TF, IDF, TF-IDF
• getSentiment() – using a pretrained model
Image Processing
• featurizeImage() – using a pretrained model
• loadImage()
• resizeImage()
• extractPixels() – extracts the pixel values from an image
Other Processing
• selectFeatures() – using minCount or mutualInformation
• categorical() – converts a categorical variable to indicator columns
• categoricalHash() – converts a categorical variable to indicator columns using hashing (used with variables that have many values)
https://msdn.microsoft.com/en-us/microsoft-r/microsoftml-get-started
My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing, University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• 28+ published journal and conference papers in the fields of AI and ML
https://www.researchgate.net/profile/Khalid_Salama https://www.linkedin.com/in/khalid-salama-24403144/
https://github.com/khalid-m-salama/sqlbits-2017
Thanks!

More Related Content

What's hot

Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesDataWorks Summit
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsDataWorks Summit
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreCloudera, Inc.
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseDataWorks Summit
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big DataDataWorks Summit
 
The convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on HadoopThe convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on HadoopDataWorks Summit
 
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Rittman Analytics
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage CCG
 
Cloud Innovation Day - Commonwealth of PA v11.3
Cloud Innovation Day - Commonwealth of PA v11.3Cloud Innovation Day - Commonwealth of PA v11.3
Cloud Innovation Day - Commonwealth of PA v11.3Eric Rice
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHumza Naseer
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...DataWorks Summit
 
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...DataWorks Summit
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData StoryLynn Langit
 
Data lake analytics for the admin
Data lake analytics for the adminData lake analytics for the admin
Data lake analytics for the adminTillmann Eitelberg
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudDataWorks Summit
 

What's hot (20)

Insights into Real World Data Management Challenges
Insights into Real World Data Management ChallengesInsights into Real World Data Management Challenges
Insights into Real World Data Management Challenges
 
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data Store
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big Data
 
The convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on HadoopThe convergence of reporting and interactive BI on Hadoop
The convergence of reporting and interactive BI on Hadoop
 
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
 
Data-In-Motion Unleashed
Data-In-Motion UnleashedData-In-Motion Unleashed
Data-In-Motion Unleashed
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 
Containers and Big Data
Containers and Big DataContainers and Big Data
Containers and Big Data
 
Cloud Innovation Day - Commonwealth of PA v11.3
Cloud Innovation Day - Commonwealth of PA v11.3Cloud Innovation Day - Commonwealth of PA v11.3
Cloud Innovation Day - Commonwealth of PA v11.3
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
 
Big Data in Azure
Big Data in AzureBig Data in Azure
Big Data in Azure
 
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData Story
 
Data lake analytics for the admin
Data lake analytics for the adminData lake analytics for the admin
Data lake analytics for the admin
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloud
 

Similar to Microsoft R - ScaleR Overview

TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
 
Azure HDlnsight에서 R 및 Spark를 이용하여 확장 가능한 머신러닝
Azure HDlnsight에서 R 및 Spark를 이용하여 확장 가능한 머신러닝Azure HDlnsight에서 R 및 Spark를 이용하여 확장 가능한 머신러닝
Azure HDlnsight에서 R 및 Spark를 이용하여 확장 가능한 머신러닝OSS On Azure
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at ScaleSascha Dittmann
 
Microsoft R - ScaleR Overview

  • 1. Microsoft R: ScaleR Overview with a Quick Tutorial. Khalid M. Salama, Ph.D., Business Insights & Analytics, Hitachi Consulting UK. We Make it Happen. Better. © Copyright 2015 Hitachi Consulting
  • 2. Outline
      - Experimental Data Science vs Operational Machine Learning
      - Microsoft R Server
      - Overview of ScaleR
      - How to: Set Up the Environment
      - How to: Get Data
      - How to: Process & Transform
      - How to: Summarize, Analyse, and Visualize
      - How to: Learn & Predict
      - How to: Deploy and Consume (mrsdeploy)
      - Overview of MicrosoftML package functionality
  • 3. Experimental Data Science vs Operational Machine Learning
  • 4. Data Science Activities: Experimentation vs. Operationalization
      [Diagram labels: Exploratory Data Analysis (Collect Data, Blend, Visualize, Prepare); Report of Visuals & Findings; Decision!; ML Experiment (Algorithm Selection, Parameter Tuning, Training & Testing); Learning Dataset; Model]
      Data Analysis & Experimentation:
      - Interactive
      - Easy to perform
      - Rich visualizations
  • 5. Data Science Activities: Experimentation vs. Operationalization
      [Diagram labels: Automated ML Pipeline (Data Ingestion, Data Processing, Model Training, Scoring); Train; Export; Model; Deploy; Web APIs; Online Apps; Predict; Batch; Real-time]
      Operational ML Pipelines:
      - Pipelined (ETL integration)
      - Scalable
      - Apps integration
  • 6. Microsoft R Server
  • 7. Microsoft R Server: R in the Microsoft World
      Microsoft R Open (MRO)
      - Based on the latest open-source R (3.2.2); built, tested, and distributed by Microsoft
      - More efficient, multi-threaded computation
      - Enhanced by the Intel Math Kernel Library (MKL) to speed up linear algebra functions
      - Compatible with all R-related software
  • 8. Microsoft R Server: Comparison of CRAN R, MRO, and MRS
      Data size:     CRAN: in-memory | MRO: in-memory | MRS: in-memory & disk
      Efficiency:    CRAN: single-threaded | MRO: multi-threaded | MRS: multi-threaded, parallel processing on 1:N servers
      Support:       CRAN: community | MRO: community | MRS: community + commercial
      Functionality: CRAN: 7500+ innovative analytic packages | MRO: 7500+ innovative analytic packages | MRS: 7500+ innovative packages + commercial parallel high-speed functions
      Licence:       CRAN: open source | MRO: open source | MRS: commercial licence
  • 9. Microsoft R Server: Components & Compute Contexts
      [Diagram labels: Microsoft R Server; CRAN & MS R Open; ScaleR; DistributedR; ConnectR; MicrosoftML package; Operationalization (mrsdeploy); RStudio | RTVS; MS R Client; Scale & Deploy; Different Compute Contexts]
      - Installed on Windows or Linux
      - ScaleR: optimized for parallel execution on Big Data, to eliminate memory limitations
      - ConnectR: provides access to local file systems, HDFS, Hive, SQL Server, Teradata, etc.
      - DistributedR: adaptable parallel execution framework that enables running on different (distributed) compute contexts
      - Operationalization (mrsdeploy): deploy the model as a Web API
      https://msdn.microsoft.com/en-us/microsoft-r/microsoft-r-getting-started
  • 10. Microsoft R ScaleR Summary Map
      Setup
        1- Get Information: Revo.home(), Revo.version, rxGetComputeContext(), rxGetFileSystem(), rxOptions()
        2- Set Properties:
           rxSetComputeContext(): RxLocalSeq, RxLocalParallel, RxInSqlServer, RxInTeradata, RxHadoopMR, RxSpark
           rxSetFileSystem(): RxNativeFileSystem, RxHdfsFileSystem
           rxOptions(option = value)
      Import Data
        1- Reference a Data Source: RxTextData(), RxSqlServerData(), RxOdbcData(), RxTeradata(), RxSasData(), RxSpssData(), RxHiveData(), RxParquetData()
        2- Import Data to XDF: rxImport()
        3- Reference XDF: RxXdfData()
        4- View Data Information: rxGetInfo()
      Process & Transform
        rxDataStep(): inData (reference to a data source), outFile (xdf), overwrite (the outFile, if it exists), varsToKeep (column selection), rowSelection (filter), transformObjects (objects needed by your processing), transformPackages (packages needed by your processing), transformFunc (function with your processing logic)
        rxMerge(): inData1, inData2, outFile, matchVars, type (match type)
        Others: rxSplit(), rxSort(), rxFactors()
      Summarize
        (formula, data): rxSummary(), rxQuantile(), rxCrossTabs(), rxCube()
        (crossTabs): rxMarginals(), as.xtabs()
      Analyse
        (formula, data): rxCovCor(), rxCor(), rxSSCP()
        (xtab): rxChiSquaredTest(), rxFisherTest(), rxKendallCor(), rxRiskRatio(), rxOddsRatio()
      Visualize: rxHistogram(), rxLinePlot(), rxRocCurve()
      Learn & Predict
        Classification (formula, data): rxDTree(), rxBTrees(), rxDForest(), rxNaiveBayes(), rxLogit()
        Regression (formula, data): rxLinMod(), rxGlm(), rxDTree(), rxBTrees()
        Clustering (formula, data): rxKmeans()
        Predict: rxPredict(model, data), rxRoc()
      Deploy (mrsdeploy): remoteLogin(), listServices(), getService(), publishService(), api$consume()
  • 11. Microsoft R – ScaleR: Get Information
      - Revo.version: query the version of the current ScaleR installation.
      - Revo.home(): get the path of the R installation in use. Make sure it is Microsoft R (Client or Server), not open-source R.
      - rxGetComputeContext(): get the current compute context. The compute context can be set to several different options, as shown next.
      - rxGetFileSystem(): get the default file system. You can change the file system from "native" to "hdfs", as shown next.
      - rxOptions(): list all the ScaleR configuration options and their current values. To get the value of a specific option, use rxGetOption("optionName").
  • 12. Microsoft R – ScaleR: Set Information
      - rxSetComputeContext(computeContext): the options below are compute context objects, each of which needs different parameters to construct:
          RxLocalSeq(), RxLocalParallel(), RxInSqlServer(), RxInTeradata(), RxHadoopMR(), RxSpark()
      - rxSetFileSystem(fileSystem): the file system object can be one of the following two options:
          RxNativeFileSystem(), RxHdfsFileSystem()
      - rxOptions(option = value): set an option. Note that these are global defaults: they are used when an operation does not specify a value, and each operation can override them.
  • 13. Microsoft R – ScaleR: Get Data
      1. Reference a data source. The following functions reference the various data sources:
           RxTextData(), RxOdbcData(), RxSqlServerData(), RxTeradata(), RxSasData(), RxSpssData(), RxHiveData(), RxParquetData()
      2. Import the data to an eXternal Data Frame (xdf). Note that you can query the data in the data source directly, but you need to import it to xdf to be able to process it in your compute context.
           rxImport(inData = dataSource, outFile = xdfFile.xdf)
           - overwrite = Boolean flag: whether to replace an existing xdf file
           - append = use "rows" to append to the same .xdf file
      3. Read the imported xdf data.
           RxXdfData(file = xdfFile.xdf)
           - createCompositeSet = set to TRUE if you point to a directory that contains multiple .xdf files, to treat them as one dataset
  • 14. Microsoft R – ScaleR: Reference a Data Source
      file_path = file.path(data_directory, "iris.csv")
      txtDataSource = RxTextData(file = file_path)
      OR
      connection_string = "Driver=SQL Server; Server=.; Database=dbdemo; Trusted_Connection = True;"
      sql_query = "SELECT * FROM iris;"
      sqlDataSource = RxSqlServerData(connectionString = connection_string, sqlQuery = sql_query)
      Note that this is only a reference to the data source: nothing is done with the data until you query it, e.g. head(dataSource).
  • 15. Microsoft R – ScaleR: Import to xdf
      xdf_file_path = file.path(data_directory, "iris.xdf")
      iris_xdata = rxImport(inData = dataSource,
                            outFile = xdf_file_path,
                            overwrite = TRUE,
                            append = "none")
      - inData = any "Rx" data source, or a file path
      - outFile = file in which to store the .xdf dataset
      - overwrite = Boolean flag: whether to replace an existing xdf file
      - append = use "rows" to append to the same .xdf file
      This creates the iris.xdf file in your file system and returns iris_xdata, a reference for working with the dataset. You can read the .xdf file later:
      iris_xdata = RxXdfData(file = xdf_file_path)
      class(iris_xdata)
  • 16. Microsoft R – ScaleR: Describing xdf
      rxGetInfo(data = iris_xdata, getVarInfo = TRUE, numRows = 2)
      rxSummary(formula = ~., data = iris_xdata)
  • 17. Microsoft R – ScaleR: Read a subset of xdf to a data frame
      iris_subset = rxReadXdf(file = iris_xdata, startRow = 10, numRows = 5)
      - iris_subset = in-memory data frame
      - file = Rx data source (or path) of the xdf to read
      - startRow = first row to retrieve
      - numRows = number of rows to retrieve
      It is sometimes useful to read a (small) subset of the xdf into a data frame, to test a processing function on it before applying that function to the big data (xdf).
  • 18. Microsoft R – ScaleR: Process & Transform
      Remember that your compute context can be a distributed processing cluster: HPC, Spark, Hadoop, etc. In that case, each node of the compute cluster processes a subset of your xdf, which is itself split across HDFS.
      Your data processing operation needs to take this into account, i.e., all the objects and packages it needs must be available on the local node that processes each data portion.
      The rxDataStep() function processes and transforms an xdf dataset, and can be used to:
      - Filter rows
      - Select columns
      - Add computed columns
      - Convert column types (e.g. discretize to factors)
      - Update existing columns (handle missing values, scale & normalize, etc.)
      rxDataStep(…)
      - inData = xdf to process
      - outFile = can be the same as the input xdf. If omitted, the function returns a data frame
      - overwrite = set to TRUE if inData = outFile
      - rowSelection = (col1 > 50) & …
      - varsToKeep = character vector of columns to select
      - transformFunc = a function that contains the processing logic
      - transformObjects = list of objects used in the function
      - transformPackages = list of packages used in the function
  • 19. Microsoft R – ScaleR: Process & Transform
      Extract means and standard deviations (used later to normalize some columns):
      rxsummary = rxSummary(~., iris_xdata)
      str(rxsummary$sDataFrame)
      means = rxsummary$sDataFrame$Mean
      stdvs = rxsummary$sDataFrame$StdDev
      Extract quantiles for Sepal.Length (used later to discretize it):
      cut_points = rxQuantile(varName = "Sepal.Length", data = iris_xdata)
      cut_points
  • 20. Microsoft R – ScaleR: Process & Transform
      Create the data processing function:
      process_data = function(data_frame){
        # discretize
        data_frame$Sepal.Length_Disc = cut(data_frame$Sepal.Length, breaks = cut_points)
        # normalize
        data_frame$Petal.Length_norm = (data_frame$Petal.Length - means[3])/stdvs[3]
        data_frame$Petal.Width_norm = (data_frame$Petal.Width - means[4])/stdvs[4]
        return(data_frame)
      }
      Note the following:
      - The function expects a data frame, which will be a subset of the xdf dataset running on a compute node
      - cut_points, means, and stdvs are variables that become available in the scope of this function when they are passed via the rxDataStep() function
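As suggested on slide 17, a processing function like this can be tried on a small in-memory data frame before it is handed to rxDataStep(). The following is a minimal sketch in base R only, assuming the built-in iris data frame stands in for iris_xdata; colMeans(), sd(), and quantile() are stand-ins for the values the slides take from rxSummary() and rxQuantile(), and include.lowest = TRUE is an addition not present in the slides.

```r
# Base R stand-ins (assumption) for the rxSummary() / rxQuantile() results in the slides
num_cols   <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
means      <- colMeans(iris[, num_cols])      # same column order as the slides' means
stdvs      <- sapply(iris[, num_cols], sd)
cut_points <- quantile(iris$Sepal.Length)     # 0%, 25%, 50%, 75%, 100%

process_data <- function(data_frame) {
  # discretize; include.lowest keeps the minimum value inside the first bin
  data_frame$Sepal.Length_Disc <- cut(data_frame$Sepal.Length,
                                      breaks = cut_points, include.lowest = TRUE)
  # normalize with the precomputed means and standard deviations
  data_frame$Petal.Length_norm <- (data_frame$Petal.Length - means[3]) / stdvs[3]
  data_frame$Petal.Width_norm  <- (data_frame$Petal.Width  - means[4]) / stdvs[4]
  data_frame
}

processed <- process_data(iris)
str(processed[, c("Sepal.Length_Disc", "Petal.Length_norm", "Petal.Width_norm")])
```

Without include.lowest = TRUE, rows whose Sepal.Length equals the smallest cut point would fall outside every bin and become NA, which is worth checking on a subset before processing the full xdf.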
  • 21. Microsoft R – ScaleR: Process & Transform
      Execute the process_data function on iris_xdata:
      rxDataStep(inData = iris_xdata,
                 outFile = iris_xdata,
                 overwrite = TRUE,
                 rowSelection = !is.na(Species),
                 transformFunc = process_data,
                 transformObjects = list("cut_points" = cut_points, "means" = means, "stdvs" = stdvs))
  • 22. Microsoft R – ScaleR: Summarize & Analyse
      Understand variable dependencies & correlations:
      formula = ~ Species + Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
      rxCovCor(formula, data = iris_xdata, type = "Cor")
      - "Cor" = correlation
      - "Cov" = covariance
      - "SSCP" = sums of squares and cross-products
  • 23. Microsoft R – ScaleR: Summarize & Analyse
      Summarize data (generate sums, means, and counts) using cross tabs:
      formula = Sepal.Width ~ Sepal.Length_Disc:Species
      ctabs = rxCrossTabs(formula, data = iris_xdata, means = TRUE)
      ctabs$sums
      ctabs$means
      ctabs$counts
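To make the three cross-tab outputs concrete, here is a rough in-memory analogue in base R. It is only a sketch: it assumes the built-in iris data frame, rebuilds the discretized column with cut() and quantile() rather than reading it from the xdf, and uses tapply() and table(), which are not what rxCrossTabs() runs internally.

```r
# Discretized Sepal.Length, rebuilt in memory (assumption: quartile cut points)
disc <- cut(iris$Sepal.Length, breaks = quantile(iris$Sepal.Length),
            include.lowest = TRUE)
groups <- list(Sepal.Length_Disc = disc, Species = iris$Species)

# Sum, count, and mean of Sepal.Width per (Sepal.Length_Disc, Species) cell
sums   <- tapply(iris$Sepal.Width, groups, sum)
counts <- table(Sepal.Length_Disc = disc, Species = iris$Species)
means  <- tapply(iris$Sepal.Width, groups, mean)

print(counts)
print(round(means, 2))   # empty cells show up as NA
```

Each cell of means is the matching cell of sums divided by the matching cell of counts, which is the relationship between the three tables shown on the slide.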
  • 24. Microsoft R – ScaleR: Summarize & Analyse
      Summarize cross tab results:
      summary(ctabs, output = "means")
      Get margins:
      rxMarginals(ctabs, output = "sums")
      Perform a statistical dependency test
  • 25. Microsoft R – ScaleR: Summarize & Analyse
      Summarize using rxCube (to produce a long-format table):
      formula = Petal.Width ~ F(Petal.Length)
      rxCube(formula, data = iris_xdata)
      - F(variable) converts the variable into a factor, on the fly, using the distinct rounded values of this variable
  • 26. Microsoft R – ScaleR: Visualize
      rxHistogram(~Sepal.Length|Species, data = iris_xdata)
  • 27. Microsoft R – ScaleR: Learn & Predict
      Classification algorithms
      - rxDTree(): decision trees for classification and regression. Can be converted to rpart tree models
      - rxBTrees(): gradient boosted trees
      - rxDForest(): random forests
      - rxNaiveBayes()
      - rxLogit(): logistic regression models
      Regression algorithms
      - rxLinMod(): linear regression models
      - rxGlm(): generalized linear models
      - rxDTree()
      - rxBTrees()
      Clustering algorithm
      - rxKmeans()
      All the algorithms accept the following parameters:
      - formula: response ~ input1+input2:input3
      - data: learning set
      - Other parameters, depending on the algorithm
  • 28. Microsoft R – ScaleR: Learn & Predict – Decision Trees Example
      rxDTree() is used to train classification trees (categorical target variable) and regression trees (numeric target variable). The output is similar to an rpart tree model. The key parameters are:
      - formula: response ~ input1+input2:input3
      - data: training set
      - xVal: number of cross-validation folds for pruning
      - maxDepth: maximum number of tree levels (to control complexity)
      - minBucket: minimum number of examples that must be in a leaf node (to control complexity)
      formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
      models.dtree = rxDTree(formula, data = iris_xdata)
      models.dtree
  • 29. Microsoft R – ScaleR: Learn & Predict – Decision Trees Example
      # get predictions, in the form of probabilities
      predictions = rxPredict(models.dtree, data = iris_xdata, type = c("prob"))
      # select only the actual and predicted columns (as a data frame)
      predictions = rxDataStep(predictions,
                               varsToKeep = c("Species", "setosa_Pred", "versicolor_Pred", "virginica_Pred"),
                               transforms = list(
                                 setosa_actual = as.numeric(Species == 'setosa'),
                                 versicolor_actual = as.numeric(Species == 'versicolor'),
                                 virginica_actual = as.numeric(Species == 'virginica')))
      # display the prediction results
      rxGetInfo(predictions, getVarInfo = TRUE, numRows = 5)
      # plot the ROC curve (with respect to versicolor predictions)
      rxRocCurve(actualVarName = "versicolor_actual", predVarNames = c("versicolor_Pred"), data = predictions)
  • 30. Microsoft R – ScaleR: Learn & Predict – Decision Trees Example
      # compute accuracy
      predictions = rxPredict(models.dtree, data = iris_xdata, type = c("class"))
      predictions = rxReadXdf(predictions, varsToKeep = c("Species", "Species_Pred"))
      accuracy = sum(predictions$Species == predictions$Species_Pred) / nrow(predictions)
      print(accuracy)
      # use Revo Tree View to show the tree
      tree = RevoTreeView::createTreeView(models.dtree)
      plot(tree)
      # convert to an rpart tree model
      rpart_tree = as.rpart(models.dtree)
      class(rpart_tree)
      # export to PMML format
      library(pmml)
      pmml(rpart_tree)
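Since the slides note that rxDTree() output is similar to an rpart model, the accuracy step can be illustrated in memory with rpart itself. This is a sketch under assumptions: it uses the built-in iris data frame and the recommended rpart package (assumed to be installed) in place of rxDTree() and rxPredict(), and it adds a confusion matrix that the slides do not show.

```r
library(rpart)  # assumption: the recommended 'rpart' package is installed

# Train a classification tree on the in-memory iris data frame
tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = iris, method = "class")

# Predicted class labels for the training data
pred_class <- predict(tree, newdata = iris, type = "class")

# Accuracy: share of rows where the predicted class equals the actual class
accuracy <- mean(pred_class == iris$Species)
print(accuracy)

# Confusion matrix: rows are actual species, columns are predicted species
confusion <- table(actual = iris$Species, predicted = pred_class)
print(confusion)
```

As in the slide, the accuracy here is measured on the training data, so it is an optimistic estimate; a held-out test set would give a fairer number.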
  • 31. Microsoft R – ScaleR
     Parallel Processing on Partitioned Data
     In some cases, instead of building one “Big” model using all your “Big” data, you build “many” models using “small” subsets of the data. For example, you might build many time-series models, one for each product line, for demand forecasting, or several regression models, one for each geographic area, for fraud detection. This is also called a mixture of local models.
     In this case, your data is partitioned into (smaller) subsets by a certain criterion, and then local models are built, one for each data subset.
     Such a process can be performed in parallel using the rxExecBy() function, which takes the following parameters:
      inData = the xdf dataset to be partitioned
      keys = a character vector of the names of the dataset columns by which the data will be partitioned. These columns should be of type factor
      func = the function that will be applied to each data partition (i.e., learning a local model)
     rxExecBy() returns a list containing the model constructed for each partition.
     [Diagram: the dataset is partitioned into Subsets 1, 2, and 3; Local Models 1, 2, and 3 are learned from them in parallel.]
  • 32. Microsoft R – ScaleR
     Parallel Processing on Partitioned Data
     For example, using the iris dataset, let's build a regression model that estimates Sepal.Length based on Sepal.Width, for each Species type. In other words, we will partition the iris dataset into 3 subsets, one for each Species type (setosa, versicolor, virginica), and build a local model for each partition, in parallel.
     xdf = RxTextData(file = file.path(data_directory, "iris.csv"))
     buildLocalModels = function(keys, data){
       # import the partition, then fit the local model on the imported data
       local_xdf = rxImport(inData = data)
       local_model = rxLinMod(formula = Sepal.Length ~ Sepal.Width, data = local_xdf)
       return(local_model)
     }
     local_models = rxExecBy(inData = xdf, keys = c("Species"), func = buildLocalModels)
     local_models[[1]]$result
     local_models[[2]]$result
     local_models[[3]]$result
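The partition-then-learn idea itself does not depend on ScaleR. As a rough sketch in base R, split() can play the role of the partitioning step and lapply() the role of applying a function to each partition. The function name build_local_model is made up for this illustration, and unlike rxExecBy() this version runs sequentially and only on in-memory data.

```r
# Base-R sketch of "one local model per partition" on the in-memory iris data.
# Assumption: split() + lapply() stand in for rxExecBy(); this runs sequentially.

# partition the data: one data frame per Species level
partitions = split(iris, iris$Species)

# hypothetical helper: fits the local model for a single partition
build_local_model = function(data) {
  lm(Sepal.Length ~ Sepal.Width, data = data)
}

# learn one model per partition; the result is a named list (setosa, versicolor, virginica)
local_models = lapply(partitions, build_local_model)

# compare intercepts and slopes across the three local models
coefficients = sapply(local_models, coef)
print(coefficients)
```

Because each partition is processed independently, the lapply() call is the natural place to swap in a parallel apply function, for example one from the parallel package, when the partitions are large.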
  • 33. Microsoft R – mrsdeploy
     Deploy & Consume
     To deploy an R model as a web API, you need to configure a Microsoft R Server for operationalization by running the R-Server-Admin-Util, as described at this link: https://msdn.microsoft.com/en-us/microsoft-r/operationalize/about
  • 34. Microsoft R – mrsdeploy
     Deploy & Consume
     library(mrsdeploy)
     # generate data
     x = 1:100
     y = 2*x + rnorm(n = length(x), mean = 0, sd = 5)
     # build a linear model
     reg_model = lm(y ~ x)
     # create a prediction function: takes the input and uses the linear model to estimate the output
     estimate_output = function(input){
       newdata = as.data.frame(x = input)
       names(newdata) = c("x")
       estimates = predict(reg_model, newdata = newdata, type = "response")
       return(estimates)
     }
     # connect to the R Server to deploy to
     remoteLogin("http://localhost:12800", username = "admin", password = <password>)
     # paste0() avoids the space that paste() would insert into the service name
     serviceName <- paste0("estimate_output_", round(as.numeric(Sys.time()), 0))
     # publish the prediction function
     api = publishService(
       serviceName,
       code = estimate_output,
       model = reg_model, # model to be used in the function
       inputs = list(input = "numeric"),
       outputs = list(output = "numeric"),
       v = "v1.0.0")
     # query the published API
     api
     # list the deployed APIs
     mrsdeploy::listServices()
     # consume the API
     result = api$estimate_output(120)
     result$output("output")
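Before publishing, it can help to check that the prediction function behaves as expected in a plain local R session, with no server involved. The sketch below rebuilds the same toy model and calls the function directly; the fixed seed and the use of data.frame() and unname() are choices made for this illustration rather than part of the slide's code.

```r
# Local sanity check of the prediction function, without mrsdeploy or a server.
# Assumption: the seed value is arbitrary and only makes the run repeatable.
set.seed(42)

# same toy data as on the slide: y is roughly 2 * x plus noise
x = 1:100
y = 2 * x + rnorm(n = length(x), mean = 0, sd = 5)
reg_model = lm(y ~ x)

# prediction function: wraps the input in a data frame with the column name the model expects
estimate_output = function(input) {
  newdata = data.frame(x = input)
  estimates = predict(reg_model, newdata = newdata)
  return(unname(estimates))
}

# the model was trained on y ~ 2x, so the estimate for 120 should land near 240
local_result = estimate_output(120)
print(local_result)
```

If the local call returns a single numeric value close to what the data suggests, the same function is a reasonable candidate to pass as the code argument of publishService().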
  • 35. Microsoft R – MicrosoftML
     MicrosoftML Overview
     Machine Learning Algorithms
      rxFastLinear() – binary classification & regression
      rxOneClassSvm() – anomaly detection (unsupervised)
      rxFastTrees() – classification & regression
      rxFastForest() – classification & regression
      rxNeuralNet() – classification & regression
      rxLogisticRegression() – classification
      rxEnsemble() – combines a number of models of various kinds
     Text Processing
      featurizeText() – TF, IDF, TF-IDF
      getSentiment() – using a pretrained model
     Image Processing
      featurizeImage() – using a pretrained model
      loadImage()
      resizeImage()
      extractPixels() – extracts the pixel values from an image
     Other Processing
      selectFeatures() – using minCount or mutualInformation
      categorical() – converts a categorical variable to indicator columns
      categoricalHash() – converts a categorical variable to indicator columns using hashing (used with variables that have many values)
     https://msdn.microsoft.com/en-us/microsoft-r/microsoftml-get-started
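To make "converts a categorical variable to indicator columns" concrete, here is a small base-R illustration of the idea using model.matrix(). It is not the MicrosoftML categorical() transform, only a sketch of the kind of output such a transform produces: one 0/1 column per category value.

```r
# Base-R illustration of indicator (one-hot) columns for a categorical variable.
# Assumption: model.matrix() is used only to show the idea, not the MicrosoftML transform.

# "- 1" removes the intercept so that every Species level gets its own column
indicators = model.matrix(~ Species - 1, data = iris)

# each row has exactly one indicator set to 1
print(head(indicators, 3))
print(colSums(indicators))
```

With many distinct values this representation gets very wide, which is the situation the hashing variant on the slide is meant for.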
  • 36. My Background
     Applying Computational Intelligence in Data Mining
      Honorary Research Fellow, School of Computing, University of Kent.
      Ph.D. Computer Science, University of Kent, Canterbury, UK.
      28+ published journal and conference papers in the fields of AI and ML
     https://www.researchgate.net/profile/Khalid_Salama
     https://www.linkedin.com/in/khalid-salama-24403144/
     https://github.com/khalid-m-salama/sqlbits-2017
  • 37. Thanks!

Editor's Notes

  1. Hello everyone and welcome to the last day of SQLBits… My name is Khalid Salama. I work at Hitachi Consulting, in the Business Insights & Analytics practice, focusing on designing and delivering Data & Analytics solutions. In this session, I would like to explore with you the various Microsoft technologies that can help to operationalize your Machine Learning pipelines and enable scalable data science. Well, it’s more of an engineering session than a data science one, to be fair. However, I think it is an important topic to discuss because data science is perceived as an experimental, isolated activity… while in many contemporary applications, especially with the rise of digital transformation and IoT, your data science products need to be incorporated into your operational systems, and your ML pipelines need to be an integral part of your ETL process. So, we will try to touch on the various Microsoft options for performing both experimental data science and operational ML.
  2. So without further ado, we have a lot of ground to cover… I’ll start with a very quick intro to data science; I assume everybody here has “a” background in data science. Then, I will give some insights into the difference between exploratory data science and operational ML. After that, we are going to delve into the MS technologies for Advanced Analytics and show several demos… And finally, I will conclude with some general remarks.
  3. Now let’s take a look at the activities of any data science process, to try to distinguish between experimental data science and operational machine learning.
  4. It starts with an exploratory data analysis phase… After being presented with an analytics problem, you start by collecting the relevant data and importing it into your environment… Then you blend this data by performing some generic data engineering tasks, such as merging, joining, aggregating, and so on… After that, you apply some machine learning-specific data preparation tasks, also known as feature engineering, including feature construction, extraction, selection, and feature tuning, like scaling, handling missing values & outliers, and so on. The output of this phase is a learning dataset that will be used in your ML experimentation phase. In this phase, you perform iterative steps of training & testing to select the algorithm & parameters that produce the model that best captures the hidden patterns in your data… The final output of this whole experimentation phase is a report of findings, along with comprehensive visuals. That can be in the form of a markdown file, using Jupyter notebooks, that tells the end-to-end data analysis story and supports reproducibility. These results may lead to a specific decision or recommendation. In some scenarios, these results are the ultimate output of the data science activity.
  5. However, in many other scenarios, where you need repeated and real-time intelligence, such as targeted advertising and recommender systems, you need to productionize the models produced by the previous data science process, and integrate them with your operational systems to perform online predictions and recommendations. In that case, the whole ML pipeline, including data ingestion, processing, model training and/or scoring, needs to be a repeatable, automated process. The process should produce a model that exposes a Web API to be integrated with your operational apps and consumed in real time.
  6. Microsoft R Server: probably the most important analytics product for Microsoft at the moment… If you are an R developer, you will probably know that open-source R has scalability limitations, because it is single-threaded and in-memory only… You needed to use commercial R libraries to make your program multi-threaded, to process your data partly in memory and partly on disk, so that you can handle data sizes bigger than your workstation’s memory, and to run your R app on a cluster for distributed computing and scaling your data processing…
  7. Well, Microsoft has acquired a company that builds such libraries, called Revolution Analytics, and included their open-source libraries in MRO and their commercial ones in MRS. Besides, Microsoft R Open is enhanced with the Intel Math Kernel Library for more efficient mathematical computations, and it is compatible with all R-related software.
  8. So in this comparison, you can see that Microsoft R Server processes the data in memory as well as on disk (using external data frames, or xdf files, which we will see shortly). It is multi-threaded, and supports distributed computing to scale for big data processing.
  9. Let’s take a closer look at the main components of MS R Server. ScaleR – the core libraries in MS R, optimized for parallel execution, which use external data frames to overcome the memory limitation. ConnectR – provides access to various data sources, including distributed file systems and relational databases. DistributedR – allows your R application to run in different execution contexts, including a distributed one. So you can write your application once and, with a few lines of code, configure it to run in a different execution context in order to scale it. MS R Server Operationalization – allows you to deploy your R models, on a configured R Server, as Web APIs (similar to what we have seen in Azure ML) using the mrsdeploy library. Let’s have a look at some sample MS R code.