Speeding up R
with Parallel Programming in the Cloud
David M Smith
Developer Advocate, Microsoft
@revodavid
Hi. I’m David.
• Lapsed statistician
• Cloud Developer Advocate
at Microsoft
– cda.ms/7f
• Editor, Revolutions blog
– cda.ms/7g
• Twitter: @revodavid
What is R?
• Widely used data science software
• Used by millions of data scientists, statisticians and analysts
• Most powerful statistical programming language
• Flexible, extensible and comprehensive for productivity
• Creates beautiful and unique data visualizations
• As seen in New York Times, The Economist and FlowingData
• Thriving open-source community
• Leading edge of Statistics research
What aren’t you telling me about R?
• R is single-threaded
• R is an in-memory application
• R is kinda quirky
And yet, major companies use R for production data science on
large databases.
• Examples: blog.revolutionanalytics.com/applications/
Secrets to using R in production
• Don’t use R alone
– Know what it’s good for! (And what it’s not good for.)
– Use R as part of a production stack
• Use modern R workflows
– Hire great data scientists
• Use R in conjunction with parallel/distributed data & compute
architectures
Speeding up your R code
• Vectorize
• Moar megahertz!
• Moar RAM!
• Moar cores!
• Moar computers!
• Moar cloud!
Embarrassingly Parallel Problems
Easy to speed things up when:
• Calculating similar things many times
– Iterations in a loop, chunks of data, …
• Calculations are independent of each other
• Each calculation takes a decent amount of time
Just run multiple calculations at the same time
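A minimal sketch of that idea, using only the parallel package that ships with R. The task function and the worker count of 2 are illustrative assumptions, not something shown in the talk.

```r
library(parallel)

# Hypothetical stand-in for one independent, moderately expensive calculation.
# Seeding inside the task makes each result depend only on its input i.
one_task <- function(i) {
  set.seed(i)
  mean(rnorm(1e5))
}

# Serial baseline: one calculation after another
serial_results <- lapply(1:8, one_task)

# The same calculations spread over 2 local worker processes (assumed count)
cl <- makeCluster(2)
parallel_results <- parLapply(cl, 1:8, one_task)
stopCluster(cl)

# The tasks do not depend on each other, so both runs give the same answers
identical(serial_results, parallel_results)
```

Whether the parallel run is actually faster depends on the third condition above: each task has to take long enough to outweigh the cost of starting workers and shipping results back.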
Is this Embarrassing?
Embarrassingly Parallel
Group-by Analyses
Reporting
Simulations
Resampling / Bootstrapping
Optimization / Search (somewhat)
Prediction (scoring)
Cross-Validation
Backtesting
Not Embarrassingly Parallel
SQL operations (many)
Matrix inverse
Linear regression (training)
Logistic Regression (training)
Trees (training)
Neural Networks (training)
Time Series (most things)
Train tracks
The Birthday Paradox
What is the likelihood that there are two people in this room
who share the same birthday?
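For reference, the question has a closed-form answer if birthdays are assumed to be uniform over 365 days (no leap years, no twins): one minus the probability that all n birthdays differ. The function name below is made up for this sketch.

```r
# Exact probability that at least two of n people share a birthday,
# assuming 365 equally likely birthdays (an idealization).
pbirthday_exact <- function(n) {
  if (n > 365) return(1)                    # pigeonhole: a match is certain
  1 - prod((365 - seq_len(n) + 1) / 365)    # 1 - P(all n birthdays differ)
}

p <- sapply(1:100, pbirthday_exact)

# Smallest room size with better-than-even odds of a shared birthday
min(which(p > 0.5))

# Probability for that room size, rounded for display
round(p[min(which(p > 0.5))], 4)
```

Under these assumptions the odds pass 50% at the well-known room size of 23 people. A simulation of the same problem should land close to these values, which makes the closed form a handy sanity check.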
Birthday Problem Simulator
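pbirthdaysim <- function(n) {
  ntests <- 100000   # simulated rooms per group size
  pop <- 1:365       # possible birthdays (leap days ignored)
  # TRUE if a simulated room of n people contains a shared birthday
  anydup <- function(i)
    any(duplicated(
      sample(pop, n, replace=TRUE)))
  # proportion of simulated rooms with at least one match
  sum(sapply(seq(ntests), anydup)) / ntests
}
bdayp <- sapply(1:100, pbirthdaysim)  # about 5 minutes (on this laptop)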
Looping with the foreach package on CRAN
library(foreach)
x <- foreach (n=1:100) %dopar% pbirthdaysim(n)
– x is a list of results
– each entry calculated from RHS of %dopar%
Learn more about foreach: cda.ms/6Q
Parallel processing with foreach
• Change how processing is done by registering a backend
– registerDoSEQ() sequential processing (default)
– registerDoParallel() local cluster via library(parallel)
– registerDoAzureParallel() remote cluster in Azure
• Whatever you use, call to foreach does not change
– Also: no need to worry about data, packages etc. (mostly)
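A small sketch of the point above: the foreach call is written once and only the registered backend changes. It assumes the foreach and doParallel packages are installed; the task function and the worker count are hypothetical.

```r
library(foreach)
library(doParallel)   # also attaches the parallel package

# Hypothetical task, used only for illustration
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

# 1. Sequential backend
registerDoSEQ()
res_seq <- foreach(i = 1:8, .combine = c) %dopar% slow_square(i)

# 2. Local cluster backend with 2 workers (assumed count)
cl <- makeCluster(2)
registerDoParallel(cl)
res_par <- foreach(i = 1:8, .combine = c) %dopar% slow_square(i)
stopCluster(cl)
registerDoSEQ()   # leave the session on the sequential backend

# Same loop, same results, different execution strategy
identical(res_seq, res_par)
```

The .combine = c argument collapses the results into a vector; without it, foreach returns a list, which is why a list result is sometimes passed through unlist() afterwards.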
library(doParallel)
cl <- makeCluster(2) # local cluster, 2 workers
registerDoParallel(cl)
bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n)
foreach + doAzureParallel
• doAzureParallel: A simple R package that uses the Azure Batch
cluster service as a parallel-backend for foreach
github.com/Azure/doAzureParallel
Demo: birthday simulation
8-node cluster (compute-optimized D2v2 2-core instances)
• specify VM class in cluster.json
• specify credentials for Azure Batch and Azure Storage in credentials.json
library(doAzureParallel)
setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)
bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n)
bdayp <- unlist(bdayp)
cluster.json (excerpt):
"name": "davidsmi8caret",
"vmSize": "Standard_D2_v2",
"maxTasksPerNode": 8,
"poolSize": {
"dedicatedNodes": {
"min": 8,
"max": 8
}
45 seconds (more than 6 times faster!)
Scale
• From 1 to 10,000 VMs for a cluster
• From 1 to millions of tasks
• Your selection of hardware:
– General compute VMs (A-Series / D-Series)
– Memory / storage optimized (G-Series)
– Compute Optimized (F-Series)
– GPU enabled (N-Series)
• Results from computing the Mandelbrot set when scaling up:
[Chart: local machine vs. 5, 10 and 20 parallel workers]
Cross-validation with caret
• Most predictive modeling algorithms have
“tuning parameters”
• Example: Boosted Trees
– Boosting iterations
– Max Tree Depth
– Shrinkage
• Parameters affect model performance
• Try ‘em out: cross-validate
grid <-
  data.frame(
    nrounds = …,
    max_depth = …,
    gamma = …,
    colsample_bytree = …,
    min_child_weight = …,
    subsample = …
  )
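One common way to build such a grid is expand.grid() from base R, which forms every combination of the candidate values. The values below are purely illustrative and are not the ones used in the talk. The eta (shrinkage) column is an addition: to the best of my knowledge caret's xgbTree method expects it alongside the six parameters listed above.

```r
# Illustrative candidate values only; the talk does not show the actual grid.
grid_sketch <- expand.grid(
  nrounds          = c(100, 200),   # boosting iterations
  max_depth        = c(3, 6),       # max tree depth
  eta              = c(0.05, 0.3),  # shrinkage (assumed to be required)
  gamma            = 0,
  colsample_bytree = c(0.6, 1),
  min_child_weight = 1,
  subsample        = c(0.75, 1)
)

# Number of parameter combinations that cross-validation would have to fit
nrow(grid_sketch)

# Peek at the first few combinations
head(grid_sketch, 3)
```

Each row is one model configuration, and each configuration is refit once per resample, so the number of independent fits grows quickly. That is what makes tuning a good candidate for a parallel backend.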
Cross-validation in parallel
• Caret’s train function will automatically
use the registered foreach backend
• Just register your cluster first:
registerDoAzureParallel(cluster)
• Handles sending objects, packages to
nodes
mod <- train(
Class ~ .,
data = dat,
method = "xgbTree",
trControl = ctrl,
tuneGrid = grid,
nthread = 1
)
caret speed-ups
• Max Kuhn benchmarked various hardware and OS for local parallel processing
– cda.ms/6V
• Let’s see how it works
with doAzureParallel
Source: cda.ms/6V
Packages and Containers
• Docker images used to spawn nodes
– Default: rocker/tidyverse:latest
– Lots of R packages pre-installed
• But this cross-validation also needs:
– xgboost, e1071
• Easy fix: add to cluster.json
{
"name": "davidsmi8caret",
"vmSize": "Standard_D2_v2",
"maxTasksPerNode": 8,
"poolSize": {
"dedicatedNodes": {
"min": 4,
"max": 4
},
"lowPriorityNodes": {
"min": 4,
"max": 4
},
"autoscaleFormula": "QUEUE"
},
"containerImage":
"rocker/tidyverse:latest",
"rPackages": {
"cran": ["xgboost","e1071"],
"github": [],
"bioconductor": []
},
"commandLine": []
}
================================================
Id: job20180126022301
chunkSize: 1
enableCloudCombine: TRUE
packages:
caret;
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE
================================================
Submitting tasks (1250/1250)
Submitting merge task. . .
Job Preparation Status: Package(s) being install
Waiting for tasks to complete. . .
| Progress: 13.84% (173/1250) | Running: 59 | Qu
MY LAPTOP: 78 minutes
THIS CLUSTER: 16 minutes
(almost 5x faster)
How much does it cost?
• Pay by the minute only for VMs used in cluster
– No additional cost for the Azure Batch cluster service
• Using D2v2 Virtual Machines
– Ubuntu 16, 7 GB RAM, 2-core “compute optimized”
• 17 minutes × 8 VMs @ $0.10 / hour
– about 23 cents (not counting startup)
… but why pay full price?
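The arithmetic behind that estimate, as a quick sketch. The three inputs are the figures quoted above; the variable names are made up, and cluster startup time is ignored, as on the slide.

```r
minutes      <- 17     # run time quoted above
n_vms        <- 8      # VMs in the cluster
price_per_hr <- 0.10   # USD per VM-hour, as quoted above

# VM-hours consumed, times the hourly price
cost <- minutes / 60 * n_vms * price_per_hr

round(cost, 2)   # USD, i.e. about 23 cents
```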
Low Priority Nodes
• Low-Priority = (very) low-cost VMs from surplus capacity
– up to 80% discount
• Clusters can mix dedicated VMs and low-priority VMs
[Diagram: My Local R Session, Azure Batch, Dedicated VMs, Low Priority VMs at up to 80% discount]
cluster.json (excerpt):
"poolSize": {
"dedicatedNodes": {
"min": 3,
"max": 3
},
"lowPriorityNodes": {
"min": 9,
"max": 9
},
TL;DR: Embarrassingly Parallel
• Install the foreach and doAzureParallel packages
• Get Azure Batch and Azure Storage accounts
– Need an account? http://azure.com/free
• Set up Azure keys in credentials.json
• Define your cluster size/type in cluster.json
• Use registerDoAzureParallel to set up your job
• Use foreach / %dopar% to loop in parallel
• Worked example with code: cda.ms/7d
For when it’s not embarrassingly parallel:
DISTRIBUTED DATA PROCESSING WITH SPARKLYR
What is Spark?
• Distributed data processing engine
• Store and analyze massive volumes in a robust, scalable cluster
• Successor to Hadoop
• In-memory engine, up to 100x faster than MapReduce on some workloads
• Highly extensible, with machine-learning capabilities
• Supports Scala, Java, Python, R …
• Managed cloud services available
– Azure Databricks & HDInsight, AWS EMR, GCP Dataproc
• Largest open-source data project
• Apache project with 1000+ contributors
R and Spark: Sparklyr
• sparklyr: R interface to Spark
– open-source R package from RStudio
• Move data between R and Spark
• “References” to Spark Data Frames
– Familiar R operations, including dplyr syntax
– Computations offloaded to Spark cluster, and deferred until needed
• CPU/RAM/Disk consumed in cluster, not by R
• Interfaces to Spark ML algorithms
spark.rstudio.com
Provisioning clusters for Sparklyr with aztk
• aztk: Command-line interface to provision Spark-ready (and
Sparklyr-ready) clusters in Batch
– www.github.com/azure/aztk
• Provision a Spark cluster in about 5 minutes
– Choice of VM instance types
– Use provided Docker instances (or your own)
– Pay only for VM usage, by the minute
– Optionally, use low-priority nodes to save costs
• Tools to manage persistent storage
• Easily connect to RStudio Server and Spark UIs from desktop
Launch and connect
Provision a Spark cluster:
aztk spark cluster create --id davidsmispark4 --size 4
Connect to the Spark cluster and map ports:
aztk spark cluster ssh --id davidsmispark4
Launch RStudio
http://localhost:8787
dplyr with Sparklyr
Connect to the Spark cluster:
library(sparklyr)
cluster_url <- paste0("spark://", system("hostname -i", intern = TRUE), ":7077")
sc <- spark_connect(master = cluster_url)
Load in some data:
library(dplyr)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
Munge with dplyr:
delay <- flights_tbl %>%
group_by(tailnum) %>%
summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
filter(count > 20, dist < 2000, !is.na(delay)) %>%
collect
Things to note
• All of the computation takes place in the Spark cluster
– Computations are delayed until you need results
– Behind the scenes, Spark SQL statements are being written for you
• None of the data comes back to R
– Until you call collect, when it becomes a tbl
– It’s only at this point you have to worry about data size
• This is all ordinary dplyr syntax
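The deferred evaluation and SQL generation can be seen without a cluster. The sketch below is not sparklyr itself: it uses dbplyr's simulated Hive connection as a rough stand-in for the Spark SQL dialect, and the empty table with three columns is a hypothetical placeholder for the flights data. It assumes the dplyr and dbplyr packages are installed.

```r
library(dplyr)
library(dbplyr)

# Stand-in for a remote table: no Spark needed, no rows stored.
# simulate_hive() only approximates the SQL dialect Spark would see.
flights_lazy <- lazy_frame(
  tailnum   = character(),
  distance  = numeric(),
  arr_delay = numeric(),
  con = simulate_hive()
)

# Building the pipeline does not run anything
delay_query <- flights_lazy %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            dist  = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20, dist < 2000, !is.na(delay))

# Render the SQL that would be sent to the backend
sql_text <- as.character(sql_render(delay_query))
cat(sql_text)
```

With a real sparklyr connection, calling show_query() on the lazy table should print the corresponding Spark SQL in the same way; the exact text will differ from this simulated dialect.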
Machine Learning with SparklyR
> m <- ml_linear_regression(delay ~ dist, data=delay_near)
* No rows dropped by 'na.omit' call
> summary(m)
Call: ml_linear_regression(delay ~ dist, data = delay_near)
Deviance Residuals::
     Min       1Q   Median       3Q      Max
-19.9499  -5.8752  -0.7035   5.1867  40.8973
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6904319  1.0199146   0.677   0.4986
dist        0.0195910  0.0019252  10.176   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-Squared: 0.09619
Root Mean Squared Error: 8.075
>
SparklyR provides R interfaces to Spark’s distributed machine learning algorithms (MLlib)
Computations happen in the Spark cluster, not in R
In summary
• Embarrassingly parallel (small): foreach + local backend
• Embarrassingly parallel (big): foreach + cluster backend
– Create & use clusters in Azure with doAzureParallel
• Big, distributed data: sparklyr
– Create Spark clusters for sparklyr in Azure with aztk
Get your links here
• Code for birthday problem simulation: cda.ms/7d
• Using the foreach package: cda.ms/6Q
• Get doAzureParallel: cda.ms/7w
• Get aztk (for sparklyr): cda.ms/7x
• Sparklyr: spark.rstudio.com
• Free Azure account with $200 credit: cda.ms/7v
Thank you!

More Related Content

What's hot

Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosPaco Nathan
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Spark Summit
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Hadoop User Group
 
Terraform Modules Restructured
Terraform Modules RestructuredTerraform Modules Restructured
Terraform Modules RestructuredDoiT International
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsBen Laird
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Spark Summit
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBase
HBaseCon 2015: Blackbird Collections - In-situ  Stream Processing in HBaseHBaseCon 2015: Blackbird Collections - In-situ  Stream Processing in HBase
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBaseHBaseCon
 
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Julia Angell
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learningfelixcss
 
Provisioning Datadog with Terraform
Provisioning Datadog with TerraformProvisioning Datadog with Terraform
Provisioning Datadog with TerraformMatt Spurlin
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Spark Summit
 

What's hot (20)

Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Getting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache MesosGetting Started Running Apache Spark on Apache Mesos
Getting Started Running Apache Spark on Apache Mesos
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
 
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
 
Terraform Modules Restructured
Terraform Modules RestructuredTerraform Modules Restructured
Terraform Modules Restructured
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 
Enterprise Grade Streaming under 2ms on Hadoop
Enterprise Grade Streaming under 2ms on HadoopEnterprise Grade Streaming under 2ms on Hadoop
Enterprise Grade Streaming under 2ms on Hadoop
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBase
HBaseCon 2015: Blackbird Collections - In-situ  Stream Processing in HBaseHBaseCon 2015: Blackbird Collections - In-situ  Stream Processing in HBase
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBase
 
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
 
Dask: Scaling Python
Dask: Scaling PythonDask: Scaling Python
Dask: Scaling Python
 
SSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine LearningSSR: Structured Streaming for R and Machine Learning
SSR: Structured Streaming for R and Machine Learning
 
Provisioning Datadog with Terraform
Provisioning Datadog with TerraformProvisioning Datadog with Terraform
Provisioning Datadog with Terraform
 
January 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka PresentationJanuary 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka Presentation
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 

Similar to Speed up R with parallel programming in the Cloud

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudRevolution Analytics
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at PollfishPollfish
 
Migrating existing open source machine learning to azure
Migrating existing open source machine learning to azureMigrating existing open source machine learning to azure
Migrating existing open source machine learning to azureMicrosoft Tech Community
 
Tips, Tricks & Best Practices for large scale HDInsight Deployments
Tips, Tricks & Best Practices for large scale HDInsight DeploymentsTips, Tricks & Best Practices for large scale HDInsight Deployments
Tips, Tricks & Best Practices for large scale HDInsight DeploymentsAshish Thapliyal
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureRevolution Analytics
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbMongoDB APAC
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...Gianmario Spacagna
 
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationIncrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationSean Chittenden
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance ComputersDave Hiltbrand
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in productionParis Data Engineers !
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightScyllaDB
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
Multi Source Replication With MySQL 5.7 @ Verisure
Multi Source Replication With MySQL 5.7 @ VerisureMulti Source Replication With MySQL 5.7 @ Verisure
Multi Source Replication With MySQL 5.7 @ VerisureKenny Gryp
 

Similar to Speed up R with parallel programming in the Cloud (20)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Migrating existing open source machine learning to azure
Migrating existing open source machine learning to azureMigrating existing open source machine learning to azure
Migrating existing open source machine learning to azure
 
Tips, Tricks & Best Practices for large scale HDInsight Deployments
Tips, Tricks & Best Practices for large scale HDInsight DeploymentsTips, Tricks & Best Practices for large scale HDInsight Deployments
Tips, Tricks & Best Practices for large scale HDInsight Deployments
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Buildingsocialanalyticstoolwithmongodb
BuildingsocialanalyticstoolwithmongodbBuildingsocialanalyticstoolwithmongodb
Buildingsocialanalyticstoolwithmongodb
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
 
Incrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern AutomationIncrementalism: An Industrial Strategy For Adopting Modern Automation
Incrementalism: An Industrial Strategy For Adopting Modern Automation
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
Introduction to Galera Cluster
Introduction to Galera ClusterIntroduction to Galera Cluster
Introduction to Galera Cluster
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Multi Source Replication With MySQL 5.7 @ Verisure
Multi Source Replication With MySQL 5.7 @ VerisureMulti Source Replication With MySQL 5.7 @ Verisure
Multi Source Replication With MySQL 5.7 @ Verisure
 
Concurrency
ConcurrencyConcurrency
Concurrency
 

More from Revolution Analytics

Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint packageRevolution Analytics
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution Analytics
 
Warranty Predictive Analytics solution
Warranty Predictive Analytics solutionWarranty Predictive Analytics solution
Warranty Predictive Analytics solutionRevolution Analytics
 
Reproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R ConferenceReproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R ConferenceRevolution Analytics
 




Speed up R with parallel programming in the Cloud

  • 1. Speeding up R with Parallel Programming in the Cloud David M Smith Developer Advocate, Microsoft @revodavid
  • 2. Hi. I’m David. • Lapsed statistician • Cloud Developer Advocate at Microsoft – cda.ms/7f • Editor, Revolutions blog – cda.ms/7g • Twitter: @revodavid
  • 3. What is R? • Widely used data science software • Used by millions of data scientists, statisticians and analysts • Most powerful statistical programming language • Flexible, extensible and comprehensive for productivity • Creates beautiful and unique data visualizations • As seen in New York Times, The Economist and FlowingData • Thriving open-source community • Leading edge of Statistics research
  • 4. What aren’t you telling me about R? • R is single-threaded • R is an in-memory application • R is kinda quirky And yet, major companies use R for production data science on large databases. • Examples: blog.revolutionanalytics.com/applications/
  • 5. Secrets to using R in production • Don’t use R alone – Know what it’s good for! (And what it’s not good for.) – Use R as part of a production stack • Use modern R workflows – Hire great data scientists • Use R in conjunction with parallel/distributed data & compute architectures
  • 6. Speeding up your R code • Vectorize • Moar megahertz! • Moar RAM! • Moar cores! • Moar computers! • Moar cloud!
  • 7. Embarrassingly Parallel Problems Easy to speed things up when: • Calculating similar things many times – Iterations in a loop, chunks of data, … • Calculations are independent of each other • Each calculation takes a decent amount of time Just run multiple calculations at the same time
  • 8. Is this Embarrassing? Embarrassingly Parallel: Group-by Analyses, Reporting, Simulations, Resampling / Bootstrapping, Optimization / Search (somewhat), Prediction (scoring), Cross-Validation, Backtesting. Not Embarrassingly Parallel: SQL operations (many), Matrix inverse, Linear regression (training), Logistic Regression (training), Trees (training), Neural Networks (training), Time Series (most things), Train tracks.
  • 9. The Birthday Paradox What is the likelihood that there are two people in this room who share the same birthday?
  • 10. Birthday Problem Simulator
      pbirthdaysim <- function(n) {
        ntests <- 100000
        pop <- 1:365
        anydup <- function(i)
          any(duplicated(sample(pop, n, replace=TRUE)))
        sum(sapply(seq(ntests), anydup)) / ntests
      }
      bdayp <- sapply(1:100, pbirthdaysim)
    About 5 minutes (on this laptop)
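The simulated probabilities can be sanity-checked against the closed-form answer. The following is a minimal sketch in base R, assuming 365 equally likely birthdays (the same assumption the simulator makes). The helper name `pbirthday_exact` is illustrative and does not appear in the slides.

```r
# Exact probability that at least two of n people share a birthday,
# assuming `days` equally likely birthdays (illustrative helper, not from the slides).
pbirthday_exact <- function(n, days = 365) {
  if (n > days) return(1)  # pigeonhole: a shared birthday is certain
  # P(no match) = (365/365) * (364/365) * ... * ((365 - n + 1)/365)
  1 - prod((days - seq_len(n) + 1) / days)
}

exact <- sapply(1:100, pbirthday_exact)
exact[23]  # 23 is the smallest group size where the probability exceeds one half
```

Plotting `exact` next to the simulated `bdayp` values is a quick way to see how much noise the chosen number of trials leaves in the estimate.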
  • 11. library(foreach) Looping with the foreach package on CRAN – x is a list of results – each entry calculated from RHS of %dopar% Learn more about foreach: cda.ms/6Q
      x <- foreach (n=1:100) %dopar% pbirthdaysim(n)
  • 12. Parallel processing with foreach • Change how processing is done by registering a backend – registerDoSEQ() sequential processing (default) – registerDoParallel() local cluster via library(parallel) – registerDoAzureParallel() remote cluster in Azure • Whatever you use, the call to foreach does not change – Also: no need to worry about data, packages etc. (mostly)
      library(doParallel)
      cl <- makeCluster(2)  # local cluster, 2 workers
      registerDoParallel(cl)
      bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n)
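The same embarrassingly parallel pattern can be tried without any add-on packages. The sketch below uses the `parallel` package that ships with R instead of foreach, so it is an alternative to the slide's code rather than a copy of it. The trial count is reduced to an illustrative 2000 so it finishes quickly, and the seed value is an arbitrary assumption.

```r
library(parallel)  # ships with base R

# Same simulator as on the slide, with fewer trials so the sketch runs quickly
# (ntests = 2000 is an illustrative value, not the 100000 used in the talk).
pbirthdaysim <- function(n, ntests = 2000) {
  pop <- 1:365
  anydup <- function(i) any(duplicated(sample(pop, n, replace = TRUE)))
  sum(sapply(seq(ntests), anydup)) / ntests
}

cl <- makeCluster(2)                   # local cluster, 2 workers
clusterSetRNGStream(cl, iseed = 42)    # independent, reproducible random streams
bdayp <- parSapply(cl, 1:100, pbirthdaysim)
stopCluster(cl)                        # release the worker processes

bdayp[c(10, 23, 50)]
```

Each value of `n` is an independent task, which is why the work can be handed to the workers with no coordination between them.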
  • 13. foreach + doAzureParallel • doAzureParallel: A simple R package that uses the Azure Batch cluster service as a parallel-backend for foreach github.com/Azure/doAzureParallel
  • 14. Demo: birthday simulation 8-node cluster (compute-optimized D2v2 2-core instances) • specify VM class in cluster.json • specify credentials for Azure Batch and Azure Storage in credentials.json
      library(doAzureParallel)
      setCredentials("credentials.json")
      cluster <- makeCluster("cluster.json")
      registerDoAzureParallel(cluster)
      bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n)
      bdayp <- unlist(bdayp)
    cluster.json (excerpt):
      "name": "davidsmi8caret",
      "vmSize": "Standard_D2_v2",
      "maxTasksPerNode": 8,
      "poolSize": {
        "dedicatedNodes": { "min": 8, "max": 8 }
    45 seconds (more than 6 times faster!)
  • 15. Scale • From 1 to 10,000 VMs for a cluster • From 1 to millions of tasks • Your selection of hardware: – General compute VMs (A-Series / D-Series) – Memory / storage optimized (G-Series) – Compute Optimized (F-Series) – GPU enabled (N-Series) • Results from computing the Mandelbrot set when scaling up [figure panels: local machine; 5, 10 and 20 parallel workers]
  • 16. Cross-validation with caret • Most predictive modeling algorithms have “tuning parameters” • Example: Boosted Trees – Boosting iterations – Max Tree Depth – Shrinkage • Parameters affect model performance • Try ‘em out: cross-validate
      grid <- data.frame(
        nrounds = …,
        max_depth = …,
        gamma = …,
        colsample_bytree = …,
        min_child_weight = …,
        subsample = …)
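One common way to build such a grid is base R's `expand.grid`, which generates every combination of the candidate values. This is a sketch only: the column names follow the slide, but the candidate values are hypothetical, since the slide elides them.

```r
# Candidate values are hypothetical, chosen only to show how the grid grows.
grid <- expand.grid(
  nrounds          = c(50, 100),
  max_depth        = c(2, 4, 6),
  gamma            = 0,
  colsample_bytree = c(0.6, 0.8),
  min_child_weight = 1,
  subsample        = 1
)

nrow(grid)  # 2 * 3 * 1 * 2 * 1 * 1 parameter combinations
head(grid, 3)
```

Every row is one model configuration to cross-validate, so the number of independent fits grows multiplicatively with each added candidate value.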
  • 17. Cross-validation in parallel • Caret’s train function will automatically use the registered foreach backend • Just register your cluster first: registerDoAzureParallel(cluster) • Handles sending objects, packages to nodes
      mod <- train(
        Class ~ .,
        data = dat,
        method = "xgbTree",
        trControl = ctrl,
        tuneGrid = grid,
        nthread = 1
      )
  • 18. caret speed-ups • Max Kuhn benchmarked various hardware and OS for local parallel – cda.ms/6V • Let’s see how it works with doAzureParallel Source: cda.ms/6V
  • 19. Packages and Containers • Docker images used to spawn nodes – Default: rocker/tidyverse:latest – Lots of R packages pre-installed • But this cross-validation also needs: – xgboost, e1071 • Easy fix: add to cluster.json
      {
        "name": "davidsmi8caret",
        "vmSize": "Standard_D2_v2",
        "maxTasksPerNode": 8,
        "poolSize": {
          "dedicatedNodes": { "min": 4, "max": 4 },
          "lowPriorityNodes": { "min": 4, "max": 4 },
          "autoscaleFormula": "QUEUE"
        },
        "containerImage": "rocker/tidyverse:latest",
        "rPackages": {
          "cran": ["xgboost", "e1071"],
          "github": [],
          "bioconductor": []
        },
        "commandLine": []
      }
  • 20.
      ================================================
      Id: job20180126022301
      chunkSize: 1
      enableCloudCombine: TRUE
      packages: caret;
      errorHandling: stop
      wait: TRUE
      autoDeleteJob: TRUE
      ================================================
      Submitting tasks (1250/1250)
      Submitting merge task. . .
      Job Preparation Status: Package(s) being install
      Waiting for tasks to complete. . .
      | Progress: 13.84% (173/1250) | Running: 59 | Qu
    MY LAPTOP: 78 minutes THIS CLUSTER: 16 minutes (almost 5x faster)
  • 21. How much does it cost? • Pay by the minute only for VMs used in cluster – No additional cost for the Azure Batch cluster service • Using D2v2 Virtual Machines – Ubuntu 16, 7 GB RAM, 2-core “compute optimized” • 17 minutes × 8 VMs @ $0.10 / hour – about 23 cents (not counting startup) … but why pay full price?
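The 23-cent figure follows directly from the numbers on the slide. A short worked calculation, using only those quoted values and ignoring cluster startup time as the slide does:

```r
minutes      <- 17    # run time quoted on the slide
n_vms        <- 8     # cluster size quoted on the slide
price_per_hr <- 0.10  # USD per VM-hour, as quoted on the slide

# VM-hours consumed, multiplied by the hourly price
cost <- minutes / 60 * n_vms * price_per_hr
round(cost, 2)  # USD, excluding cluster startup time
```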
  • 22. Low Priority Nodes • Low-Priority = (very) Low Cost VMs from surplus capacity – up to 80% discount • Clusters can mix dedicated VMs and low-priority VMs [diagram labels: My Local R Session; Azure Batch; Dedicated VMs; Low Priority VMs at up to 80% discount]
      "poolSize": {
        "dedicatedNodes": { "min": 3, "max": 3 },
        "lowPriorityNodes": { "min": 9, "max": 9 },
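To get a feel for what mixing node types saves, here is a rough hourly comparison for the 3 dedicated + 9 low-priority pool in the excerpt. It is a sketch under two assumptions: the $0.10 per VM-hour price quoted for D2v2 machines elsewhere in the talk, and the full 80% discount, which the slide gives only as an upper bound.

```r
price_dedicated <- 0.10  # USD per VM-hour (price quoted earlier in the talk)
discount        <- 0.80  # assumed: the maximum discount mentioned on the slide
price_low       <- price_dedicated * (1 - discount)

n_dedicated <- 3         # from the poolSize excerpt
n_low       <- 9         # from the poolSize excerpt

hourly_mixed     <- n_dedicated * price_dedicated + n_low * price_low
hourly_dedicated <- (n_dedicated + n_low) * price_dedicated

c(mixed = hourly_mixed, all_dedicated = hourly_dedicated)
```

Under these assumptions the mixed pool costs well under half of an all-dedicated pool of the same size. The actual saving depends on the discount in effect.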
  • 23. TL;DR: Embarrassingly Parallel • Install the foreach and doAzureParallel packages • Get Azure Batch and Azure Storage accounts – Need an account? http://azure.com/free • Set up Azure keys in credentials.json • Define your cluster size/type in cluster.json • Use registerDoAzureParallel to set up your job • Use foreach / %dopar% to loop in parallel • Worked example with code: cda.ms/7d
  • 24. For when it’s not embarrassingly parallel: DISTRIBUTED DATA PROCESSING WITH SPARKLYR
  • 25. What is Spark? • Distributed data processing engine • Store and analyze massive volumes in a robust, scalable cluster • Successor to Hadoop • in-memory engine 100x faster than map-reduce • Highly extensible, with machine-learning capabilities • Supports Scala, Java, Python, R … • Managed cloud services available – Azure Databricks & HDInsight, AWS EMR, GCP Dataproc • Largest open-source data project • Apache project with 1000+ contributors
  • 26. R and Spark: Sparklyr • sparklyr: R interface to Spark – open-source R package from RStudio • Move data between R and Spark • “References” to Spark Data Frames – Familiar R operations, including dplyr syntax – Computations offloaded to Spark cluster, and deferred until needed • CPU/RAM/Disk consumed in cluster, not by R • Interfaces to Spark ML algorithms spark.rstudio.com
  • 27. Provisioning clusters for Sparklyr with aztk • aztk: Command-line interface to provision Spark-ready (and Sparklyr-ready) clusters in Batch – www.github.com/azure/aztk • Provision a Spark cluster in about 5 minutes – Choice of VM instance types – Use provided Docker instances (or your own) – Pay only for VM usage, by the minute – Optionally, use low-priority nodes to save costs • Tools to manage persistent storage • Easily connect to RStudio Server and Spark UIs from desktop
  • 28. Launch and connect Provision a Spark cluster:
      aztk spark cluster create --id davidsmispark4 --size 4
    Connect to the Spark cluster and map ports:
      aztk spark cluster ssh --id davidsmispark4
    Launch RStudio: http://localhost:8787
  • 29. dplyr with Sparklyr Connect to the Spark cluster:
      library(sparklyr)
      cluster_url <- paste0("spark://", system("hostname -i", intern = TRUE), ":7077")
      sc <- spark_connect(master = cluster_url)
    Load in some data:
      library(dplyr)
      flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
    Munge with dplyr:
      delay <- flights_tbl %>%
        group_by(tailnum) %>%
        summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
        filter(count > 20, dist < 2000, !is.na(delay)) %>%
        collect
  • 30. Things to note • All of the computation takes place in the Spark cluster – Computations are delayed until you need results – Behind the scenes, Spark SQL statements are being written for you • None of the data comes back to R – Until you call collect, when it becomes a tbl – It’s only at this point you have to worry about data size • This is all ordinary dplyr syntax
  • 31.
  • 32. Machine Learning with SparklyR
      > m <- ml_linear_regression(delay ~ dist, data=delay_near)
      * No rows dropped by 'na.omit' call
      > summary(m)
      Call: ml_linear_regression(delay ~ dist, data = delay_near)
      Deviance Residuals::
           Min       1Q   Median       3Q      Max
      -19.9499  -5.8752  -0.7035   5.1867  40.8973
      Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
      (Intercept) 0.6904319  1.0199146   0.677   0.4986
      dist        0.0195910  0.0019252  10.176   <2e-16 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      R-Squared: 0.09619
      Root Mean Squared Error: 8.075
    SparklyR provides R interfaces to Spark’s distributed machine learning algorithms (MLlib). Computations happen in the Spark cluster, not in R.
  • 33. In summary • Embarrassingly parallel (small): foreach + local backend • Embarrassingly parallel (big): foreach + cluster backend – Create & use clusters in Azure with doAzureParallel • Big, distributed data: sparklyr – Create Spark clusters for sparklyr in Azure with aztk
  • 34. Get your links here • Code for birthday problem simulation: cda.ms/7d • Using the foreach package: cda.ms/6Q • Get doAzureParallel: cda.ms/7w • Get aztk (for sparklyr): cda.ms/7x • Sparklyr: spark.rstudio.com • Free Azure account with $200 credit: cda.ms/7v Thank you!

Editor's Notes

  1. R is #8 in January 2018 Tiobe language rankings. #6 in IEEE Spectrum 2017 top programming languages.