SlideShare a Scribd company logo
1 of 26
1
Distributed R
The Next Generation Platform for Predictive Analytics
Jorge Martinez
Vishrut Gupta
Ed Ma April 10th, 2015
2
About me
FPGAs
Barcelona
2009
Embedded
software,
GPUs
Barcelona
2011
Distributed
systems
and ML
SF
2013
@jorgemarsal
http://github.com/jorgemarsal
3
The data
explosion
4
Horizontal scaling
The shift from BI to Data Science
The shift from BI to
data science
Happens!
https://www.youtube.com/watch?v=vbb-AjiXyh0
5
Predictive analytics workflow
Build Models
Evaluate Models
Deploy
Models
(In-database
scoring)
BI Integration
1 2
3
Build and evaluate predictive
models on large datasets
using Distributed R
2
1 Ingest and prepare data by
leveraging HP Vertica
Analytics Platform (SQL DB)
3 Deploy models to Vertica and
use in-database scoring to
produce prediction results for
BI and applications.
6
Data Scientists Preferred Languages: R & SQL
Adoption of R increased across industries
1) http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
2) http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
7
R is …
“The best thing about R is that it was developed by statisticians. The worst
thing about R is that… it was developed by statisticians.”
-Bo Cogwill, Google
8
R is ….
Popular
Not
scalable
Open
source No parallel
algorithms
Flexible
Extensible
Limited
pre/post
processing
9
Horizontal scaling
Functional programming and big dataScale-out
Scale-out
10
Horizontal scaling
“The future has arrived, it’s just
not evenly distributed yet”
- William Gibson
“The future has arrived, it’s just
not evenly distributed yet”
- William Gibson
Ship code to data,
Functional programming
11
Distributed R
The Next Generation Platform for Predictive Analytics
12
Distributed R
ANew Enterpriseclass predictive analytics platform
A scalable, high-performance platform for the R language
• Implemented as an R package
• Open source
Use familiar GUIs
and packages
Analyze data too
large for vanilla R
Leverage multiple
nodes for
distributed
processing
Vastly
improved
performance
13
Distributed R: architecture
Master
• Schedules tasks across the cluster.
• Sends commands/code to workers
Workers
• Do the actual work
• Own the data
• Work on independent data partitions in
parallel
DistR Master
Worker 1
Worker 2
Worker 3
Worker 4
14
• Relies on user defined partitioning
• Also support for distributed data-frames and lists
darray
Distributed R: Distributed data structures
15
• Express computations over partitions
• Execute across the cluster
foreach
Distributed R: Distributed code
f (x)
16
Distributed R basic demo
17
• Similar signature, accuracy as R packages
• Scalable and high performance
• E.g., regression on billions of rows in a couple of minutes
Distributed R: Built-in distributed algorithms
Algorithm Use cases
Linear Regression (GLM) Risk Analysis, Trend Analysis, etc.
Logistic Regression (GLM)
Customer Response modeling, Healthcare analytics
(Disease analysis)
Random Forest Customer churn, Market campaign analysis
K-Means Clustering
Customer segmentation, Fraud detection, Anomaly
detection
Page Rank Identify influencers
18
Distributed R March Madness demo
19
Parallel Random Forest Example
Random Forest – building an
ensemble of deep decision trees
Need to build 100 decision trees on 4
machines
Each machine builds 25 decision trees
Can use random forest to predict
March Madness Bracket
X
7
>
5
X1
2
>
3.
4
X
3
>
3
01 10
21
March Madness Bracket
Train Model to predict individual games
Use team and opponent features to train a model
• blocks, steals, assists, rebounds, free throw accuracy, field goal accuracy, 3 point accuracy
Calculate the summary statistics of each team
Group by teams and get the mean of each team’s features
Predict the result of the game
Concatenate the summary statistics of the team and feed to model that predicts individual
games
Fill out bracket by predicting 1 game at the time
22
23
Distributed R Census demo using Shiny
http://15.126.194.41/public/index.html
24
Distributed R rocks!
• Regression on billions of rows in minutes
• Graph algorithms on 10B edges
• Load 400GB+ data from database to R in < 10 minutes
• Open source!
25
That’s cool… what can I do with it?
• Collaborate
• Github (report issues, send PRs) https://github.com/vertica/DistributedR
• Standardization with R-core http://www.r-bloggers.com/enhancing-r-for-distributed-computing/
• Get the SW + docs: http://www.vertica.com/hp-vertica-products/hp-vertica-
distributed-r/
• Buy commercial support
26
“The future has already arrived,
it’s just not evenly distributed yet”
- William Gibson
Thank you
http://github.com/vertica/distributedr

More Related Content

What's hot

Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...TigerGraph
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R ServicesGregg Barrett
 
Data Visualization Project Presentation
Data Visualization Project PresentationData Visualization Project Presentation
Data Visualization Project PresentationShubham Shrivastava
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and OntarioBigData_Europe
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisYuanyuan Tian
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormRevolution Analytics
 
Streaming Data in R
Streaming Data in RStreaming Data in R
Streaming Data in RRory Winston
 
BDE-BDVA Webinar: BDE Technical Overview
BDE-BDVA Webinar: BDE Technical OverviewBDE-BDVA Webinar: BDE Technical Overview
BDE-BDVA Webinar: BDE Technical OverviewBigData_Europe
 
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AIGraph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AITigerGraph
 
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJAEvaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJADataWorks Summit
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedRevolution Analytics
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingDatabricks
 
How To Download and Process SEC XBRL Data Directly from EDGAR
How To Download and Process SEC XBRL Data Directly from EDGARHow To Download and Process SEC XBRL Data Directly from EDGAR
How To Download and Process SEC XBRL Data Directly from EDGARAlexander Falk
 
GraphTech Ecosystem - part 2: Graph Analytics
 GraphTech Ecosystem - part 2: Graph Analytics GraphTech Ecosystem - part 2: Graph Analytics
GraphTech Ecosystem - part 2: Graph AnalyticsLinkurious
 
Platform introduction & Summary
Platform introduction & SummaryPlatform introduction & Summary
Platform introduction & SummaryBigData_Europe
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speedmarkgrover
 

What's hot (20)

Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
Data Visualization Project Presentation
Data Visualization Project PresentationData Visualization Project Presentation
Data Visualization Project Presentation
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and Ontario
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Streaming Data in R
Streaming Data in RStreaming Data in R
Streaming Data in R
 
BDE-BDVA Webinar: BDE Technical Overview
BDE-BDVA Webinar: BDE Technical OverviewBDE-BDVA Webinar: BDE Technical Overview
BDE-BDVA Webinar: BDE Technical Overview
 
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AIGraph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
 
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJAEvaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce Setting
 
How To Download and Process SEC XBRL Data Directly from EDGAR
How To Download and Process SEC XBRL Data Directly from EDGARHow To Download and Process SEC XBRL Data Directly from EDGAR
How To Download and Process SEC XBRL Data Directly from EDGAR
 
Altova NIEM keynote
Altova NIEM keynoteAltova NIEM keynote
Altova NIEM keynote
 
GraphTech Ecosystem - part 2: Graph Analytics
 GraphTech Ecosystem - part 2: Graph Analytics GraphTech Ecosystem - part 2: Graph Analytics
GraphTech Ecosystem - part 2: Graph Analytics
 
Platform introduction & Summary
Platform introduction & SummaryPlatform introduction & Summary
Platform introduction & Summary
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 

Viewers also liked

Building a Mature Analytics Workflow: The Analyst Collective Viewpoint
Building a Mature Analytics Workflow: The Analyst Collective ViewpointBuilding a Mature Analytics Workflow: The Analyst Collective Viewpoint
Building a Mature Analytics Workflow: The Analyst Collective ViewpointTristan Handy
 
Hp distributed R User Guide
Hp distributed R User GuideHp distributed R User Guide
Hp distributed R User GuideAndrey Karpov
 
OSGeo와 Open Data
OSGeo와 Open DataOSGeo와 Open Data
OSGeo와 Open Datar-kor
 
황성수 공공데이터 개방과 공공이슈 해결
황성수 공공데이터 개방과 공공이슈 해결황성수 공공데이터 개방과 공공이슈 해결
황성수 공공데이터 개방과 공공이슈 해결r-kor
 
Deciphering voice of customer through speech analytics
Deciphering voice of customer through speech analyticsDeciphering voice of customer through speech analytics
Deciphering voice of customer through speech analyticsR Systems International
 
Optimizing Facebook Campaigns with R
Optimizing Facebook Campaigns with ROptimizing Facebook Campaigns with R
Optimizing Facebook Campaigns with RDomino Data Lab
 
The Next List: R&D Breakthroughs that are Changing the World
The Next List: R&D Breakthroughs that are Changing the WorldThe Next List: R&D Breakthroughs that are Changing the World
The Next List: R&D Breakthroughs that are Changing the WorldGE
 
Implementing a highly scalable stock prediction system with R, Geode, SpringX...
Implementing a highly scalable stock prediction system with R, Geode, SpringX...Implementing a highly scalable stock prediction system with R, Geode, SpringX...
Implementing a highly scalable stock prediction system with R, Geode, SpringX...William Markito Oliveira
 
Cloud Conf 2015 - Develop and Deploy IOT Applications
Cloud Conf 2015 - Develop and Deploy IOT ApplicationsCloud Conf 2015 - Develop and Deploy IOT Applications
Cloud Conf 2015 - Develop and Deploy IOT ApplicationsCorley S.r.l.
 
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...In-Memory Computing Summit
 
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화r-kor
 
Trading System Design
Trading System DesignTrading System Design
Trading System DesignMarketcalls
 
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LDr-kor
 
Trading sentimental analysis
Trading sentimental analysisTrading sentimental analysis
Trading sentimental analysisMarketcalls
 
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla AirII-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla AirDr. Haxel Consult
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
H2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangH2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangSri Ambati
 

Viewers also liked (20)

Building a Mature Analytics Workflow: The Analyst Collective Viewpoint
Building a Mature Analytics Workflow: The Analyst Collective ViewpointBuilding a Mature Analytics Workflow: The Analyst Collective Viewpoint
Building a Mature Analytics Workflow: The Analyst Collective Viewpoint
 
resume
resumeresume
resume
 
Hp distributed R User Guide
Hp distributed R User GuideHp distributed R User Guide
Hp distributed R User Guide
 
OSGeo와 Open Data
OSGeo와 Open DataOSGeo와 Open Data
OSGeo와 Open Data
 
황성수 공공데이터 개방과 공공이슈 해결
황성수 공공데이터 개방과 공공이슈 해결황성수 공공데이터 개방과 공공이슈 해결
황성수 공공데이터 개방과 공공이슈 해결
 
Deciphering voice of customer through speech analytics
Deciphering voice of customer through speech analyticsDeciphering voice of customer through speech analytics
Deciphering voice of customer through speech analytics
 
Optimizing Facebook Campaigns with R
Optimizing Facebook Campaigns with ROptimizing Facebook Campaigns with R
Optimizing Facebook Campaigns with R
 
R lecture oga
R lecture ogaR lecture oga
R lecture oga
 
The Next List: R&D Breakthroughs that are Changing the World
The Next List: R&D Breakthroughs that are Changing the WorldThe Next List: R&D Breakthroughs that are Changing the World
The Next List: R&D Breakthroughs that are Changing the World
 
Implementing a highly scalable stock prediction system with R, Geode, SpringX...
Implementing a highly scalable stock prediction system with R, Geode, SpringX...Implementing a highly scalable stock prediction system with R, Geode, SpringX...
Implementing a highly scalable stock prediction system with R, Geode, SpringX...
 
Cloud Conf 2015 - Develop and Deploy IOT Applications
Cloud Conf 2015 - Develop and Deploy IOT ApplicationsCloud Conf 2015 - Develop and Deploy IOT Applications
Cloud Conf 2015 - Develop and Deploy IOT Applications
 
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
 
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화
 
Trading System Design
Trading System DesignTrading System Design
Trading System Design
 
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
 
Trading sentimental analysis
Trading sentimental analysisTrading sentimental analysis
Trading sentimental analysis
 
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla AirII-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
 
Language R
Language RLanguage R
Language R
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
H2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangH2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy Wang
 

Similar to Distributed R: The Next Generation Platform for Predictive Analytics

End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed REnd-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed RJorge Martinez de Salinas
 
Marketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success RatesMarketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success RatesRevolution Analytics
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software EngineeringMiroslaw Staron
 
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar 18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar Revolution Analytics
 
Hedge Fund case study solution - Credit default swaps execution system and Gr...
Hedge Fund case study solution - Credit default swaps execution system and Gr...Hedge Fund case study solution - Credit default swaps execution system and Gr...
Hedge Fund case study solution - Credit default swaps execution system and Gr...Naveen Kumar
 
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDeploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDatabricks
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15MLconf
 
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...Precisely
 
Machine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdfMachine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdfCarlos Paredes
 
Analytics what to look for sustaining your growing business-
Analytics   what to look for sustaining your growing business-Analytics   what to look for sustaining your growing business-
Analytics what to look for sustaining your growing business-Ajay Ohri
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowJan Kirenz
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 

Similar to Distributed R: The Next Generation Platform for Predictive Analytics (20)

End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed REnd-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
 
Marketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success RatesMarketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success Rates
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software Engineering
 
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar 18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
 
AI meets Big Data
AI meets Big DataAI meets Big Data
AI meets Big Data
 
Hedge Fund case study solution - Credit default swaps execution system and Gr...
Hedge Fund case study solution - Credit default swaps execution system and Gr...Hedge Fund case study solution - Credit default swaps execution system and Gr...
Hedge Fund case study solution - Credit default swaps execution system and Gr...
 
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDeploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
 
Pan Dhoni - Modernizing Data And Analytics using AI.pdf
Pan Dhoni - Modernizing Data And Analytics using AI.pdfPan Dhoni - Modernizing Data And Analytics using AI.pdf
Pan Dhoni - Modernizing Data And Analytics using AI.pdf
 
Dinkar mishra101206
Dinkar mishra101206Dinkar mishra101206
Dinkar mishra101206
 
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
 
CV
CVCV
CV
 
Machine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdfMachine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdf
 
Analytics what to look for sustaining your growing business-
Analytics   what to look for sustaining your growing business-Analytics   what to look for sustaining your growing business-
Analytics what to look for sustaining your growing business-
 
Evaluation guide to Streaming Analytics
Evaluation guide to Streaming AnalyticsEvaluation guide to Streaming Analytics
Evaluation guide to Streaming Analytics
 
Resume kartikeya sharma
Resume kartikeya sharmaResume kartikeya sharma
Resume kartikeya sharma
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Distributed R: The Next Generation Platform for Predictive Analytics

  • 1. 1 Distributed R The Next Generation Platform for Predictive Analytics Jorge Martinez Vishrut Gupta Ed Ma April 10th, 2015
  • 4. 4 Horizontal scaling The shift from BI to Data Science The shift from BI to data science Happens! https://www.youtube.com/watch?v=vbb-AjiXyh0
  • 5. 5 Predictive analytics workflow Build Models Evaluate Models Deploy Models (In-database scoring) BI Integration 1 2 3 Build and evaluate predictive models on large datasets using Distributed R 2 1 Ingest and prepare data by leveraging HP Vertica Analytics Platform (SQL DB) 3 Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications.
  • 6. 6 Data Scientists Preferred Languages: R & SQL Adoption of R increased across industries 1) http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html 2) http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
  • 7. 7 R is … “The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.” -Bo Cogwill, Google
  • 8. 8 R is …. Popular Not scalable Open source No parallel algorithms Flexible Extensible Limited pre/post processing
  • 9. 9 Horizontal scaling Functional programming and big dataScale-out Scale-out
  • 10. 10 Horizontal scaling “The future has arrived, it’s just not evenly distributed yet” - William Gibson “The future has arrived, it’s just not evenly distributed yet” - William Gibson Ship code to data, Functional programming
  • 11. 11 Distributed R The Next Generation Platform for Predictive Analytics
  • 12. 12 Distributed R ANew Enterpriseclass predictive analytics platform A scalable, high-performance platform for the R language • Implemented as an R package • Open source Use familiar GUIs and packages Analyze data too large for vanilla R Leverage multiple nodes for distributed processing Vastly improved performance
  • 13. 13 Distributed R: architecture Master • Schedules tasks across the cluster. • Sends commands/code to workers Workers • Do the actual work • Own the data • Work on independent data partitions in parallel DistR Master Worker 1 Worker 2 Worker 3 Worker 4
  • 14. 14 • Relies on user defined partitioning • Also support for distributed data-frames and lists darray Distributed R: Distributed data structures
  • 15. 15 • Express computations over partitions • Execute across the cluster foreach Distributed R: Distributed code f (x)
  • 17. 17 • Similar signature, accuracy as R packages • Scalable and high performance • E.g., regression on billions of rows in a couple of minutes Distributed R: Built-in distributed algorithms Algorithm Use cases Linear Regression (GLM) Risk Analysis, Trend Analysis, etc. Logistic Regression (GLM) Customer Response modeling, Healthcare analytics (Disease analysis) Random Forest Customer churn, Market campaign analysis K-Means Clustering Customer segmentation, Fraud detection, Anomaly detection Page Rank Identify influencers
  • 18. 18 Distributed R March Madness demo
  • 19. 19 Parallel Random Forest Example Random Forest – building an ensemble of deep decision trees Need to build 100 decision trees on 4 machines Each machine builds 25 decision trees Can use random forest to predict March Madness Bracket X 7 > 5 X1 2 > 3. 4 X 3 > 3 01 10
  • 20. 21 March Madness Bracket Train Model to predict individual games Use team and opponent features to train a model • blocks, steals, assists, rebounds, free throw accuracy, field goal accuracy, 3 point accuracy Calculate the summary statistics of each team Group by teams and get the mean of each team’s features Predict the result of the game Concatenate the summary statistics of the team and feed to model that predicts individual games Fill out bracket by predicting 1 game at the time
  • 21. 22
  • 22. 23 Distributed R Census demo using Shiny http://15.126.194.41/public/index.html
  • 23. 24 Distributed R rocks! • Regression on billions of rows in minutes • Graph algorithms on 10B edges • Load 400GB+ data from database to R in < 10 minutes • Open source!
  • 24. 25 That’s cool… what can I do with it? • Collaborate • Github (report issues, send PRs) https://github.com/vertica/DistributedR • Standardization with R-core http://www.r-bloggers.com/enhancing-r-for-distributed-computing/ • Get the SW + docs: http://www.vertica.com/hp-vertica-products/hp-vertica- distributed-r/ • Buy commercial support
  • 25. 26 “The future has already arrived, it’s just not evenly distributed yet” - William Gibson