Spark Technology Center
Alok Singh
IBM Spark Technology Center
About the Presenter: Alok Singh
Principal Engineer at the IBM Spark Technology Center, where he leads the R4ML team. He has built and architected analytical frameworks and implemented various machine learning algorithms.
Interests: big data, scalable machine learning software, and algorithms.
Contact:
Email: singh_alok@hotmail.com
GitHub: https://github.com/aloknsingh
IBM Spark Technology Center
Founded in 2015.
Location:
Physical: 505 Howard St., San Francisco CA
Web: http://spark.tc Twitter: @apachespark_tc
Mission:
Contribute intellectual and technical capital to the Apache Spark community.
Make the core technology enterprise- and cloud-ready.
Build data science skills to drive intelligence into business applications: http://bigdatauniversity.com
Key statistics:
About 50 developers, co-located with 25 IBM designers.
Major contributions to Apache Spark http://jiras.spark.tc
Apache SystemML started as an Apache Incubator project and is now a top-level Apache project.
Founding member of UC Berkeley AMPLab and RISE Lab
Member of R Consortium and Scala Center
Introducing R4ML
R4ML
R4ML is an R-language front end to Apache SystemML that integrates seamlessly with Apache Spark’s SparkR APIs.
SparkR is a Thin Wrapper over Apache Spark
It brings R's powerful statistics to Spark.
Source: SparkR talk by Shivaram Venkataraman at Spark Summit 2015
Linear Algebra is the Language of Machine Learning
Linear algebra is powerful, precise, and high-level.
Express complex transformations over large arrays of data
• using a small number of operations
• in a clear and unambiguous way
Linear Algebra at Scale via Distributed Computing
• In R, linear algebra and the various matrix packages are very useful for creating new algorithms.
• Wouldn’t it be nice if all that linear algebra scaled to millions or billions of rows?
• Wouldn’t it be nicer still if users could write their code without worrying about the underlying distributed framework?
Apache SystemML
Goal: Take the linear algebra formulation of an algorithm and make it scale automatically.
IBM Research technology, now an open-source Apache top-level project (formerly incubating).
To learn more: http://systemml.apache.org
Linear Algebra-Based Programming Languages
L2-regularized linear regression by the conjugate gradient method in DML:
X = read.csv(…);                          # n x m feature matrix
y = read.csv(…);                          # n x 1 label vector
maxi = 50; lambda = 0.001; ...
r = -(t(X) %*% y)                         # initial residual (gradient at w = 0)
norm_r2 = sum(r * r); p = -r              # initial search direction
w = matrix(0, ncol(X), 1); i = 0
while (i < maxi & norm_r2 > norm_r2_trgt) {
  q = (t(X) %*% (X %*% p)) + lambda * p   # apply (t(X) %*% X + lambda * I) to p
  alpha = norm_r2 / sum(p * q)            # compute step size
  # update model and residuals
  w = w + alpha * p; r = r + alpha * q
  old_norm_r2 = norm_r2
  norm_r2 = sum(r^2); i = i + 1
  p = -r + norm_r2 / old_norm_r2 * p      # next conjugate direction
}
write.csv(w, …);
Variables are matrices.
Expressions use linear algebra operations.
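For readers who want to trace the DML loop on a small input, here is a minimal single-machine sketch of the same conjugate gradient iteration in plain Python. It is not SystemML or R4ML code. The helper names (`matvec`, `t_matvec`, `dot`, `ridge_cg`) and the relative tolerance used for the stopping threshold are assumptions made for illustration, since the slide elides how `norm_r2_trgt` is set.

```python
# Illustrative sketch only: conjugate gradient for L2-regularized least squares,
# solving (X^T X + lam * I) w = X^T y, mirroring the DML loop step by step.

def matvec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def t_matvec(A, v):
    """Multiply the transpose of A by vector v without materializing t(A)."""
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(A[0]))]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def ridge_cg(X, y, lam=0.001, max_iter=50, tol=1e-12):
    w = [0.0] * len(X[0])
    r = [-g for g in t_matvec(X, y)]   # r = -(t(X) %*% y)
    p = [-ri for ri in r]              # initial search direction
    norm_r2 = dot(r, r)
    target = tol * norm_r2             # assumed stand-in for norm_r2_trgt
    i = 0
    while i < max_iter and norm_r2 > target:
        # q = (t(X) %*% (X %*% p)) + lambda * p
        q = [a + lam * b for a, b in zip(t_matvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / dot(p, q)    # step size
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        old_norm_r2 = norm_r2
        norm_r2 = dot(r, r)
        p = [-ri + (norm_r2 / old_norm_r2) * pi for ri, pi in zip(r, p)]
        i += 1
    return w
```

Calling `ridge_cg([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0], lam=0.0)` recovers weights close to 1 and 2, the exact least-squares solution for that input. The point of SystemML is that the DML version keeps this shape while the matrix products are planned and distributed automatically.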
R4ML ML Flow
• Read input data
• Pre-process
• Run a built-in ML algorithm
• Run a custom ML algorithm
• Scoring
Reading Input Example
Pre-processing
• Typical machine learning use cases require many pre-processing steps
• R4ML provides and enhances these commonly used pre-processing options:
  • NA removal
  • Binning
  • Normalizing
  • Recoding
  • One-hot encoding
• R4ML also supports all the standard SparkR pre-processors
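The five pre-processing options listed above can be sketched on small in-memory data as follows. This is plain Python for illustration, not the R4ML API. Every function name here is hypothetical, and the specific choices (equal-width bins, population standard deviation, 1-based codes in sorted order) are assumptions rather than R4ML's documented behavior.

```python
import math

def drop_na(rows):
    """NA removal: keep only the rows (dicts) that contain no None value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def bin_equal_width(values, n_bins):
    """Binning: map each number to a 1-based equal-width bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # guard against a constant column
    return [min(int((v - lo) / width) + 1, n_bins) for v in values]

def normalize(values):
    """Normalizing: shift to zero mean and scale to unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]

def recode(values):
    """Recoding: replace each category with a 1-based integer code."""
    codes = {c: i + 1 for i, c in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def one_hot(codes, n_levels):
    """One-hot encoding: turn integer code k into an indicator vector."""
    return [[1 if k == j + 1 else 0 for j in range(n_levels)] for k in codes]

# Hypothetical airline-style rows; the second row is dropped for its missing carrier.
rows = drop_na([
    {"carrier": "UA", "distance": 300.0},
    {"carrier": None, "distance": 500.0},
    {"carrier": "AA", "distance": 900.0},
])
carrier_codes, mapping = recode([r["carrier"] for r in rows])
carrier_indicators = one_hot(carrier_codes, len(mapping))
```

Recoding before one-hot encoding keeps the category-to-column mapping explicit, so the same `mapping` can be reused when scoring new data.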
Pre-processing Example
R4ML’s Built-in ML Algorithms
• Regression
  • Linear Model (LM)
  • Generalized Linear Model (GLM)
  • Stepwise Linear Model (Step.LM)
• Classification
  • Multinomial Logistic Regression (MLOGIT)
  • Support Vector Machine (SVM)
• Factorization
  • Principal Component Analysis (PCA)
  • Alternating Least Squares (ALS)
• Survival analysis
  • Kaplan-Meier (KM) survival model
  • Cox Proportional Hazards model
Training & Scoring GLM Example
Custom ML algorithms
• Lets users create custom machine learning algorithms for proofs of concept (POC).
• Works via the Apache SystemML DML connector.
• Users need to attach the input and output variables from R/R4ML to the DML variables.
• Users can package their algorithms as a library using R package management.
R4ML Custom Algorithms Skeleton
Airline Data Analysis using R4ML
• Do exploratory analysis and manual feature selection on a real dataset.
• Pre-process the inputs: NA removal, recoding, etc.
• PCA example.
• SVM example: predict airline delays using the SVM.
• See the results.
• Discuss wide-table support in R4ML and SystemML.
• Illustrate cross-validation and model tuning.
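The cross-validation and tuning step in the list above can be sketched generically as below. This is a plain Python illustration of k-fold splitting and grid search, not R4ML's own cross-validation API. The function names and the `fit`/`score` callback shapes are assumptions made for this example.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train, test) index lists; each row is held out exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def cross_validate(n_rows, k, fit, score, params):
    """Average the held-out score of fit(train, params) over the k folds."""
    scores = [score(fit(train, params), test)
              for train, test in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)

def tune(n_rows, k, fit, score, grid):
    """Model tuning: return the candidate with the best mean held-out score."""
    return max(grid, key=lambda params: cross_validate(n_rows, k, fit, score, params))
```

Here `fit` would train a model (for example an SVM) on the training rows with one candidate setting, and `score` would evaluate it on the held-out rows. Higher scores are treated as better.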
Bridging the Gap for Wide-Column Datasets
• Many datasets produce a large number of columns once they are one-hot encoded.
• Many image processing applications require many columns.
• One of the distinguishing features of SystemML and R4ML is their support for wide tables.
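To see why one-hot encoding widens a table, the short sketch below counts the output columns for a small example. It is plain Python for illustration; the function name and the sample rows are hypothetical.

```python
def one_hot_width(rows, categorical):
    """Count the columns produced if every categorical column is one-hot encoded."""
    width = 0
    for col in rows[0]:
        if col in categorical:
            width += len({r[col] for r in rows})   # one indicator per distinct level
        else:
            width += 1                             # numeric columns pass through
    return width

flights = [
    {"origin": "SFO", "carrier": "UA", "distance": 300.0},
    {"origin": "JFK", "carrier": "AA", "distance": 900.0},
    {"origin": "ORD", "carrier": "UA", "distance": 700.0},
]
encoded_width = one_hot_width(flights, {"origin", "carrier"})
```

Three input columns become six here (three origins, two carriers, one numeric column). A categorical column with thousands of levels widens the table accordingly.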
Future work
• Support for more ML algorithms.
• More real-world use cases, examples, and PDF demos.
• Many more work items are in the pipeline. Please see the latest changes at https://github.com/SparkTC/r4ml
• Other useful links:
  • SparkR: https://spark.apache.org/docs/latest/sparkr.html
  • Apache SystemML: http://apache.github.io/systemml/algorithms-reference.html
Summary
• Distributed ML is a complex problem. Apache SystemML makes it easier, and we have created an R front end that works with the underlying SparkR and the whole R ecosystem.
• R4ML is an open-source, scalable machine learning package for implementing end-to-end ML flows.
• It bridges important gaps in SparkR and uses the best of Apache SystemML and SparkR to help data scientists and analysts get the most out of big data analytics.
• R4ML advantages:
  • More ML algorithms
  • Ability to write custom algorithms
  • Wide-table datasets (images, etc.)
  • Commonly used pre-processors
Thanks to the R4ML Team!
• https://github.com/sparkTC/r4ml

R4ML: An R Based Scalable Machine Learning Framework

  • 1. Spark Technology Center Alok Singh IBM Spark Technology Center 1
  • 2. Spark Technology Center About Presenter Alok Singh Principal Engineer at the IBM Spark Technology Center. He is leading the R4ML team at IBM. He has built/architected analytical frameworks and implemented various machine learning algorithms. Interests: Big Data, Scalable Machine Learning Software, and Algorithms. Contact: email- singh_alok@hotmail.com Github -https://github.com/aloknsingh 2
  • 3. Spark Technology Center 3 IBM Spark Technology Center Founded in 2015. Location: Physical: 505 Howard St., San Francisco CA Web: http://spark.tc Twitter: @apachespark_tc Mission: Contribute intellectual and technical capital to the Apache Spark community. Make the core technology enterprise- and cloud-ready. Build data science skills to drive intelligence into business applications — http://bigdatauniversity.com Key statistics: About 50 developers, co-located with 25 IBM designers. Major contributions to Apache Spark http://jiras.spark.tc Apache SystemML is now an Apache Incubator project. Founding member of UC Berkeley AMPLab and RISE Lab Member of R Consortium and Scala Center Spark Technology Center
  • 5. Spark Technology Center 5 R4ML R4ML is an R-language front end to Apache SystemML that integrates seamlessly with Apache Spark’s SparkR APIs.
  • 6. SparkR is a Thin Wrapper Over Apache Spark. It brings the powerful statistics of R to Spark. Source: SparkR talk by Shivaram Venkataraman at Spark Summit 15
  • 7. Linear Algebra is the Language of Machine Learning. Linear algebra is powerful, precise, and high-level: it expresses complex transformations over large arrays of data using a small number of operations, in a clear and unambiguous way.
  • 8. Linear Algebra at Scale via Distributed Computing
      • In R, linear algebra and the various matrix packages are very useful for creating new algorithms.
      • Wouldn’t it be nice if all that linear algebra scaled to millions and billions of rows?
      • Wouldn’t it be nicer still if users could write their code without worrying about the internal distributed framework?
  • 9. Apache SystemML (incubating). Goal: take the linear algebra formulation of an algorithm and make it scale automatically. IBM Research technology, now an open-source Apache top-level project. To learn more: http://systemml.apache.org
  • 10. Linear Algebra-Based Programming Languages. L2-regularized linear regression by the conjugate gradient method in DML. Variables are matrices; expressions use linear algebra operations.
        X = read.csv(…);   # n x m feature matrix
        y = read.csv(…);   # n x 1 label vector
        maxi = 50; lambda = 0.001;
        ...
        r = -(t(X) %*% y)
        norm_r2 = sum(r * r); p = -r   # initial gradient
        w = matrix(0, ncol(X), 1); i = 0
        while(i < maxi & norm_r2 > norm_r2_trgt) {
          q = ((t(X) %*% (X %*% p)) + lambda * p);   # compute conjugate gradient
          alpha = norm_r2 / sum(p * q)               # compute step size
          # update model and residuals
          w = w + alpha * p; r = r + alpha * q
          old_norm_r2 = norm_r2
          norm_r2 = sum(r^2); i = i + 1
          p = -r + norm_r2 / old_norm_r2 * p
        }
        write.csv(w, …);
  • 14. R4ML ML flow: read input data; pre-process; run a built-in ML algorithm; run a custom ML algorithm; scoring.
  • 16. Pre-processing
      • Typical machine learning use cases require many pre-processing steps.
      • R4ML provides and enhances these commonly used pre-processing options:
          • NA removal
          • Binning
          • Normalizing
          • Recoding
          • One-hot encoding
      • R4ML also supports all the standard SparkR pre-processors.
  • 18. R4ML’s Built-in ML Algorithms
      • Regression
          • Linear Model (LM)
          • Generalized Linear Model (GLM)
          • Step Linear Model (Step.LM)
      • Classification
          • Multinomial Logistic Regression (MLOGIT)
          • Support Vector Machine (SVM)
      • Factorization
          • Principal Component Analysis (PCA)
          • Alternating Least Squares (ALS)
      • Survival analysis
          • Kaplan-Meier (KM) survival model
          • Cox Proportional Hazards model
  • 19. Training & Scoring GLM Example
  • 20. Custom ML algorithms
      • Allow users to create custom machine learning algorithms for proofs of concept (POC).
      • Work via the Apache SystemML DML connector.
      • Users need to attach the input and output variables from R/R4ML to the DML variables.
      • Users can package the result as a library using R package management.
  • 21. R4ML Custom Algorithms Skeleton
  • 23. Airline Data Analysis using R4ML
      • Do exploratory analysis and manual feature selection on a real dataset.
      • Pre-process the inputs: NA removal, recoding, and so on.
      • PCA examples.
      • SVM example: predict airline delay using the SVM.
      • Review the results.
      • Discuss wide-table support in R4ML and SystemML.
      • Illustrate cross-validation and model tuning.
  • 24. Bridging the gap for wide-column datasets
      • Many datasets end up with a large number of columns once they are one-hot encoded.
      • Many image processing applications also require a large number of columns.
      • One of the distinguishing features of SystemML and R4ML is therefore their support for wide tables.
  • 26. Future work
      • Support for more ML algorithms.
      • Many real-world use cases, examples, and PDF demos.
      • Many more work items are in the pipeline. See the latest changes at https://github.com/SparkTC/r4ml
      • Other useful links:
          • SparkR: https://spark.apache.org/docs/latest/sparkr.html
          • Apache SystemML: http://apache.github.io/systemml/algorithms-reference.html
  • 27. Summary
      • Distributed ML is a complex problem, and Apache SystemML makes it easier. We have created an R front end that works with the underlying SparkR and with the wider R ecosystem.
      • R4ML is an open-source, scalable machine learning package for implementing end-to-end ML flows.
      • It bridges important gaps in SparkR and uses the best of Apache SystemML and SparkR to help data scientists and analysts get the most out of big data analytics.
      • R4ML advantages:
          • More ML algorithms
          • Ability to write custom algorithms
          • Wide-table datasets (images, etc.)
          • Commonly used pre-processors
  • 28. Thanks to the R4ML Team! https://github.com/sparkTC/r4ml
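The DML listing on slide 10 can be traced step by step outside SystemML. What follows is a minimal sketch in plain Python (standard library only), not R4ML or SystemML code. The helper names (`matvec`, `transpose`, `dot`, `ridge_cg`), the toy data, and the stopping tolerance `tol` (standing in for the slide's elided `norm_r2_trgt`) are all assumptions made for illustration.

```python
# Conjugate gradient for L2-regularized linear regression,
# i.e. solving (X^T X + lambda * I) w = X^T y, mirroring slide 10's loop.

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def ridge_cg(X, y, lam=0.001, max_iter=50, tol=1e-12):
    Xt = transpose(X)
    r = [-v for v in matvec(Xt, y)]   # r = -(X^T y): gradient at w = 0
    p = [-v for v in r]               # first search direction
    w = [0.0] * len(Xt)               # one weight per feature column
    norm_r2 = dot(r, r)
    i = 0
    while i < max_iter and norm_r2 > tol:   # tol is an assumed target
        # q = (X^T X + lambda * I) p, computed without forming X^T X
        q = [a + lam * b for a, b in zip(matvec(Xt, matvec(X, p)), p)]
        alpha = norm_r2 / dot(p, q)         # step size
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        old_norm_r2 = norm_r2
        norm_r2 = dot(r, r)
        p = [-ri + (norm_r2 / old_norm_r2) * pi for ri, pi in zip(r, p)]
        i += 1
    return w

# Toy data (hypothetical): an intercept column plus one feature, y = 1 + 2x.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [3.0, 5.0, 7.0, 9.0]
w = ridge_cg(X, y)
print(w)  # close to [1, 2], shrunk slightly by the penalty
```

The sketch never materializes `X^T X`; it applies `X` and then `X^T` to the direction vector, which is the same `t(X) %*% (X %*% p)` pattern the DML uses and the kind of expression a distributed engine can plan across a cluster.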

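The pre-processing options listed on slide 16 can each be shown on a few values. This is a minimal sketch in plain Python, not R4ML's API: every function name here is hypothetical, the toy columns are invented, and choices such as equal-width bins and alphabetical recoding order are assumptions rather than documented R4ML behavior.

```python
from statistics import mean, stdev

def impute_mean(col):
    """NA handling: replace missing values (None) with the column mean."""
    present = [v for v in col if v is not None]
    m = mean(present)
    return [m if v is None else v for v in col]

def normalize(col):
    """Normalizing: subtract the mean, then divide by the standard deviation."""
    m, s = mean(col), stdev(col)
    return [(v - m) / s for v in col]

def bin_equal_width(col, n_bins):
    """Binning: map each value to one of n_bins equal-width buckets (0-based)."""
    lo, hi = min(col), max(col)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in col]

def recode(col):
    """Recoding: map each distinct string to an integer code (alphabetical order assumed)."""
    codes = {label: i + 1 for i, label in enumerate(sorted(set(col)))}
    return [codes[v] for v in col], codes

def one_hot(coded_col, n_levels):
    """One-hot encoding: one 0/1 indicator column per level."""
    return [[1 if c == k else 0 for k in range(1, n_levels + 1)] for c in coded_col]

# Toy columns (hypothetical values).
heights = [150.0, None, 170.0, 190.0]
species = ["setosa", "virginica", "versicolor", "setosa"]

imputed = impute_mean(heights)
z_scores = normalize(imputed)
bins = bin_equal_width(imputed, 3)        # short / medium / tall
coded, mapping = recode(species)
indicators = one_hot(coded, len(mapping))

print(imputed, bins, coded, indicators)
```

Note how one-hot encoding turns a single column into one column per level, which is why wide-table support matters once categorical columns have many levels.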
Editor's Notes

  1. R4ML is built on top of SystemML and SparkR, and its core engine is Spark. Before we go deeper into what R4ML is, let's talk a little about the two open-source Apache projects, SystemML and SparkR.
  2. SparkR brings the best of Spark and R together.
  3. Let me explain why SystemML is needed. Let's talk about linear algebra and why it is so important for machine learning.
  4. SystemML makes writing linear algebra code seamless and easy.
      - SystemML is currently an Apache top-level project.
      - Automatic optimization based on data and cluster characteristics ensures both efficiency and scalability.
      - DML is its declarative machine learning language, with a simple, Fortran- and R-like syntax.
      - It runs on Spark, the current state-of-the-art in-memory, distributed, MapReduce-like framework.
      - APIs are available for Scala, Java, Python, and R.
      - The SystemML library supports customized machine learning on large-scale data.
      - Its compiler analyzes and optimizes the program and creates the Spark-based MR plan to be run on the cluster.
      - R has been the de facto tool and standard for statistical analysis and ML.
  5. How easy it is to write complex formulations.
  6. Architecture diagram:
      - The left-hand block represents R4ML.
      - The upper block represents SparkR: it has multiple workers and can support CRAN packages.
      - The lower block represents SystemML.
      - The main engine is Spark.
      - A distributed file system is used for storing data.
  7. Let's dive deeper into R4ML and see how it can help big data scientists do their job.
  8. This is a typical machine learning flow, and R4ML fits seamlessly into it. Let's walk through it, and call out one special feature: the custom machine learning algorithms made possible by the SystemML base engine.
  9. Supports CSV and other standard formats. The r4ml.read.csv API works the same way as read.csv.
  10. Explain: the input data usually requires some cleanup and pre-processing to increase the predictive power of each feature.
      Shorter version:
      - NA removal: remove missing values, or replace them with the mean or a constant.
      - Binning lets the user bucketize a column.
      - Normalizing is centering and scaling.
      - Recoding converts strings to integers for ordinal values.
      - One-hot encoding is used when order is not important, e.g. the states of America.
      Longer version: R4ML supports all the major pre-processing tasks.
      - NA removal: this option lets one remove the missing data, or substitute a constant or the mean (if the column is numeric).
      - Binning: a typical use case is a column holding people's height in feet and inches when we only care about three top-level categories, i.e. short, medium, and tall.
      - Normalizing: the predictive power or the result of most algorithms improves if the data is normalized, i.e. subtract the mean and then divide by the standard deviation.
      - Recoding: since most machine learning algorithms are matrix-based linear algebra, string columns have to be turned into numbers. Categorical columns often have an inherent order, like height or shirt_size; this option lets the user encode those string columns as ordinal categorical numeric values.
      - One-hot encoding: categorical columns often have no inherent order, like people's race or the state they live in; in that case we use one-hot encoding.
  11. This example illustrates the typical pre-processing steps. The famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. Note that we apply both recoding and one-hot encoding to the species name.
  12. Shorter version:
      - Go over the buckets.
      - Explain the differentiator (wide-column support).
      - Mention Step.LM, KM, and Cox.
      Longer version:
      - GLM: similar to LM in that it solves a linear system to calculate y_est. However, the output is related to y_est via the link function and the distribution, which allows one to model a variety of use cases. It supports various families and link functions. Unlike the base package, it does not suffer from limits on the number of features.
      - PCA: rotates the data so that the axes align with the directions of maximum variance. The user can then choose the rotated data (eigenvectors) along those axes, which results in dimensionality reduction. The R4ML APIs are similar to prcomp in R.
      - MLOGIT: uses the logit function to separate the dataset into C different classes. Equivalent to a Poisson GLM (log-linear model).
      - SVM: maximum-margin classifier. The user can regularize using slack variables. Multi-class problems use a one-against-rest strategy.
      - ALS: approximate factorization of a low-rank matrix using the alternating least squares method. Used in recommendation systems.
      - Survival models: used for examining the time until a particular event (usually death) occurs, and its hazard. Supports KM and the Cox proportional hazards model.
  13. Here we illustrate training and scoring using the built-in GLM algorithm. Note that in a typical ML flow, we divide the input dataset into a training set and a test set, build the model on the training set, and verify it on the test set. At the end we look at our predictions and compare them with the real measurements. We also print the typical statistics that measure the performance of the model, such as R-squared and dispersion.
  14. Airline dataset: the Airline data set consists of flight arrival and departure details for all commercial flights from 1987 to 2008. The approximately 120 million records (CSV format) occupy 120GB of space. The dataset is a real-world example and showcases typical problems and use cases.
      How the examples were created: via the R package knitr, by running R4ML's R code embedded in the docs.
      Notes:
      - You can copy and paste the examples and they will work in your R4ML console.
      - For convenience, we have created the following markers:
          - green: the code of interest
          - yellow: the outputs
          - purple: notes
          - call-outs: to highlight the key message
  15. Message: R4ML will keep evolving, and it has the backing of a big company like IBM.
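Note 13 describes the usual split, fit, score, and evaluate loop. The sketch below shows that loop in plain Python with a one-variable least-squares line standing in for the GLM. It is not R4ML code: the function names, the 25% test fraction, the seed, and the synthetic data are all assumptions made for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Ordinary least squares for y = intercept + slope * x."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    my = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Synthetic data (hypothetical): y = 3 + 2x with a small alternating wobble.
rows = [(float(x), 3.0 + 2.0 * x + (0.1 if x % 2 == 0 else -0.1)) for x in range(20)]

train, test = train_test_split(rows)
intercept, slope = fit_line(train)                 # build the model on the training set
predictions = [intercept + slope * x for x, _ in test]
r2 = r_squared([y for _, y in test], predictions)  # verify it on the test set

print(intercept, slope, r2)
```

Computing R-squared only on the held-out rows is what makes it a check on generalization rather than on how well the line memorized the training data.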