deathstar – seamless cloud computing for R
Whit Armstrong
armstrong.whit@gmail.com
KLS Diversified Asset Management
May 11, 2012
What is deathstar?
deathstar is a dead simple way to run lapply across EC2 nodes or local
compute nodes
deathstar uses ZMQ (via the rzmq package) to deliver both data and code
over the wire
deathstar consists of two components: a daemon and an R package
simplicity is the key theme
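For a sense of the API, here is a minimal sketch of the call pattern: zmq.cluster.lapply mirrors lapply, adding only a vector of hostnames that run the deathstar daemon. The node names below are placeholders; a real run appears in the estimatePi example later.
## minimal sketch: same shape as lapply, plus a cluster argument;
## "node0" and "node1" are placeholders for your own machines
library(deathstar)
f <- function(x) x^2
res.serial  <- lapply(1:8, f)                    # local, serial
res.cluster <- zmq.cluster.lapply(cluster = c("node0", "node1"),
                                  1:8, f)        # distributed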
deathstar – visual
(architecture diagram omitted in this transcript; legend below)
user – user workstation
node0, node1 – local or cloud machines
n0 workers, n1 workers – deathstar workers
deathstar daemon – details
deathstar is a daemon that listens on two ports
the 'exec' port listens for incoming work
the 'status' port answers queries about the server's current workload capacity
(this is what enables dynamic job allocation; a client-side sketch follows)
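For illustration, a client-side capacity probe might look like the sketch below. The port number and the reply payload are assumptions, not documented API (the deck only notes later that ports 6000 and 6001 must be open).
## hypothetical capacity probe against a daemon's status port
library(rzmq)
query.capacity <- function(host, port = 6001) {   # port is an assumption
  context <- init.context()
  status.socket <- init.socket(context, "ZMQ_REQ")
  connect.socket(status.socket, paste("tcp://", host, ":", port, sep = ""))
  send.null.msg(status.socket)    # empty status request
  receive.socket(status.socket)   # reply: assumed to report free worker slots
}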
deathstar – queue device (about 100 lines of C++)
void deathstar(zmq::socket_t& frontend, zmq::socket_t& backend,
               zmq::socket_t& worker_signal, zmq::socket_t& status) {
  int number_of_workers(0);
  // loosely copied from lazy pirate (ZMQ guide)
  while (1) {
    zmq_pollitem_t items[] = {
      { worker_signal, 0, ZMQ_POLLIN, 0 },
      { status,        0, ZMQ_POLLIN, 0 },
      { backend,       0, ZMQ_POLLIN, 0 },
      { frontend,      0, ZMQ_POLLIN, 0 }
    };
    // poll the frontend only if we have available workers;
    // otherwise poll just worker_signal, status, and backend
    int rc = zmq_poll(items, number_of_workers ? 4 : 3, -1);
    // interrupted
    if (rc == -1) {
      break;
    }
    // worker_signal -- worker came online
    if (items[0].revents & ZMQ_POLLIN) {
      process_worker_signal(worker_signal, number_of_workers);
    }
    // status -- client status request
    if (items[1].revents & ZMQ_POLLIN) {
      process_status_request(status, number_of_workers);
    }
    // backend -- send data from worker to client
    if (items[2].revents & ZMQ_POLLIN) {
      process_backend(frontend, backend);
    }
    // frontend -- send data from client to worker
    if (items[3].revents & ZMQ_POLLIN) {
      process_frontend(frontend, backend, number_of_workers);
    }
  }
}
deathstar – worker (25 SLOC of R)
#!/usr/bin/env Rscript
library(rzmq)

worker.id <- paste(Sys.info()["nodename"], Sys.getpid(), sep = ":")

cmd.args <- commandArgs(trailingOnly = TRUE)
print(cmd.args)
work.endpoint  <- cmd.args[1]
ready.endpoint <- cmd.args[2]
log.file       <- cmd.args[3]
sink(log.file)

context      <- init.context()
ready.socket <- init.socket(context, "ZMQ_PUSH")
work.socket  <- init.socket(context, "ZMQ_REP")
connect.socket(ready.socket, ready.endpoint)
connect.socket(work.socket, work.endpoint)

while (1) {
  ## send control message to indicate worker is up
  send.null.msg(ready.socket)
  ## wait for work
  msg <- receive.socket(work.socket)
  index <- msg$index
  fun   <- msg$fun
  args  <- msg$args
  print(system.time(result <- try(do.call(fun, args), silent = TRUE)))
  send.socket(work.socket, list(index = index, result = result, node = worker.id))
  print(gc(verbose = TRUE))
}
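For local testing, the worker can be launched by hand with its three arguments: work endpoint, ready endpoint, and log file. The script name and endpoints below are assumptions (the ports match the firewall note on the EC2 slide).
## hypothetical manual launch of one worker against a local daemon
system(paste("Rscript deathstar-worker.R",
             "tcp://localhost:6000",
             "tcp://localhost:6001",
             "/tmp/worker.log"),
       wait = FALSE)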
deathstar package – exposes zmq.cluster.lapply
the code is a little too long for a slide, but it does three basic things (a sketch follows the list):
check node capacity, if capacity > 0, submit job
collect jobs
node life cycle monitoring (future feature)
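A minimal sketch of that loop, with the ZMQ transport stubbed out; query.capacity and run.job below are local stand-ins, not the package's real internals.
## skeleton of the capacity-check / submit / collect cycle
cluster.dispatch <- function(cluster, jobs, fun, ...) {
  query.capacity <- function(node) 1              # stub: one free slot always
  run.job <- function(node, job) fun(job, ...)    # stub: run locally, not over ZMQ
  results <- vector("list", length(jobs))
  i <- 1
  while (i <= length(jobs)) {
    for (node in cluster) {
      if (i > length(jobs)) break
      if (query.capacity(node) > 0) {             # 1. check node capacity
        results[[i]] <- run.job(node, jobs[[i]])  # 2. submit, 3. collect
        i <- i + 1
      }
    }
  }
  results
}
## e.g. cluster.dispatch(c("node0", "node1"), as.list(1:4), function(x) x^2)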
So, what can it do?
Stochastic estimation of Pi
Borrowed from JD Long:
http://www.vcasmo.com/video/drewconway/8468
or
http://joefreeman.co.uk/blog/2009/07/estimating-pi-with-monte-carlo-methods
estimatePi <- function(seed, draws) {
  set.seed(seed)
  r <- .5
  x <- runif(draws, min = -r, max = r)
  y <- runif(draws, min = -r, max = r)
  inCircle <- ifelse((x^2 + y^2)^.5 < r, 1, 0)
  sum(inCircle) / length(inCircle) * 4
}
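A quick serial sanity check before distributing it (the exact value depends on the RNG stream):
estimatePi(seed = 1, draws = 1e5)
## ~3.14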
Basic lapply example – estimatePi
require(deathstar)
cluster <- c("krypton","xenon","node1","mongodb","research")
pi.local.runtime <- system.time(pi.local <-
zmq.cluster.lapply(cluster=cluster,1:1e4,estimatePi,draws=1e5))
print(mean(unlist(pi.local)))
## [1] 3.142
print(pi.local.runtime)
## user system elapsed
## 7.828 1.024 48.378
print(node.summary(attr(pi.local,"execution.report")))
## node jobs.completed
## 1 krypton 2206
## 2 mongodb 2191
## 3 node1 1598
## 4 research 1357
## 5 xenon 2648
Simple beta calculation
calc.beta <- function(ticker, mkt, n.years) {
  require(KLS)
  x <- get.bbg(ticker)
  x <- tail(x, 250 * n.years)
  ticker <- gsub(" ", ".", ticker)
  x.diff <- diff(x, 1) * 100
  reg.data <- as.data.frame(cbind(x.diff, mkt))
  colnames(reg.data) <- c("asofdate", ticker, "SPX")
  formula <- paste(ticker, "SPX", sep = "~")
  coef(lm(formula, data = reg.data))
}
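get.bbg and the KLS package are internal to the author's firm, so here is the same formula-string + lm pattern on synthetic data; every name and number below is made up.
## the calc.beta regression pattern on synthetic data
set.seed(42)
mkt.ret   <- rnorm(1000)                            # stand-in for SPX returns
asset.ret <- 1.5 * mkt.ret + rnorm(1000, sd = 0.5)  # true beta of 1.5
reg.data  <- data.frame(MY.TICKER = asset.ret, SPX = mkt.ret)
formula   <- paste("MY.TICKER", "SPX", sep = "~")
coef(lm(formula, data = reg.data))
## recovers an SPX coefficient near 1.5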
Simple beta calculation – setup
require(KLS)
spx <- get.bbg("SPXT Index")
spx.ret <- (spx/lag(spx, 1) - 1) * 100
tickers <- c("USGG30YR Index", "USGG10YR Index", "USGG5YR Index",
"USGG2YR Index", "USGG3YR Index", "USSW10 Curncy", "USSW5 Curncy",
"USSW2 Curncy", "TY1 Comdty", "CN1 Comdty", "RX1 Comdty", "G 1 Comdty",
"JB1 Comdty", "XM1 Comdty")
cluster <- c("krypton", "research", "mongodb")
Simple beta calculation – results
require(deathstar)
beta <- zmq.cluster.lapply(cluster = cluster, tickers, calc.beta,
mkt = spx.ret, n.years = 4)
beta <- do.call(rbind, beta)
rownames(beta) <- tickers
print(beta)
## (Intercept) SPX
## USGG30YR Index -0.2011 1.9487
## USGG10YR Index -0.2695 2.0328
## USGG5YR Index -0.3130 1.9442
## USGG2YR Index -0.2738 1.3025
## USGG3YR Index -0.1675 1.3840
## USSW10 Curncy -0.3034 1.8889
## USSW5 Curncy -0.3526 1.6594
## USSW2 Curncy -0.3275 0.8852
## TY1 Comdty 3.0219 -13.1236
## CN1 Comdty 3.0726 -12.8818
## RX1 Comdty 3.9561 -11.6475
## G 1 Comdty 4.4650 -10.2193
## JB1 Comdty 1.0143 -0.8301
## XM1 Comdty 0.3558 -0.4453
So, what have we learned?
deathstar looks pretty much like lapply, or parLapply
it’s pretty fast
not much configuration
into the cloud!
Basic steps to customize your EC2 instance.
The deathstar daemon can be downloaded here:
http://github.com/downloads/armstrtw/deathstar.core/deathstar_0.0.1-1_amd64.deb
## launch EC2 instance from web console
## assume instance public dns is: ec2-23-20-55-67.compute-1.amazonaws.com
## from local workstation
scp rzmq_0.6.6.tar.gz \
    ec2-23-20-55-67.compute-1.amazonaws.com:/home/ubuntu
scp ~/.s3cfg \
    ec2-23-20-55-67.compute-1.amazonaws.com:/var/lib/deathstar
## from EC2 node
sudo apt-get install r-base-core r-base-dev libzmq-dev s3cmd
sudo R CMD INSTALL rzmq_0.6.6.tar.gz
sudo R CMD INSTALL AWS.tools_0.0.6.tar.gz
sudo dpkg -i deathstar_0.0.1-1_amd64.deb
## from local workstation
## check that the right ports are open (6000,6001)
## and AWS firewall is configured properly
nmap ec2-23-20-55-67.compute-1.amazonaws.com
a little more aggressive – estimatePi on EC2
c1.xlarge is an 8 core machine with 7 GB of RAM.
require(deathstar)
require(AWS.tools)
## start the EC2 cluster
cluster <- startCluster(ami = "ami-1c05a075", key = "kls-ec2",
instance.count = 4, instance.type = "c1.xlarge")
nodes <- get.nodes(cluster)
pi.ec2.runtime <- system.time(pi.ec2 <- zmq.cluster.lapply(cluster = nodes,
1:10000, estimatePi, draws = 1e+05))
estimatePi on EC2 – workload distribution
print(mean(unlist(pi.ec2)))
## [1] 3.142
print(pi.ec2.runtime)
## user system elapsed
## 6.285 1.044 122.751
print(node.summary(attr(pi.ec2, "execution.report")))
## node jobs.completed
## 1 ip-10-115-54-77 2510
## 2 ip-10-34-239-135 2526
## 3 ip-10-39-10-26 2521
## 4 ip-10-78-49-250 2443
mixed mode – use local + cloud resources
## the EC2 cluster is still running from the last example
cloud.nodes <- get.nodes(cluster)
nodes <- c(c("krypton", "xenon", "node1", "mongodb", "research"),
cloud.nodes)
pi.mixed.runtime <- system.time(pi.mixed <- zmq.cluster.lapply(cluster = nodes,
1:10000, estimatePi, draws = 1e+05))
## turn the cluster off
terminateCluster(cluster)
## [,1] [,2] [,3] [,4]
## [1,] "INSTANCE" "i-d35d82b5" "running" "shutting-down"
## [2,] "INSTANCE" "i-d15d82b7" "running" "shutting-down"
## [3,] "INSTANCE" "i-d75d82b1" "running" "shutting-down"
## [4,] "INSTANCE" "i-d55d82b3" "running" "shutting-down"
workload distribution for local + cloud
print(mean(unlist(pi.mixed)))
## [1] 3.142
print(pi.mixed.runtime)
## user system elapsed
## 5.812 0.708 36.957
print(node.summary(attr(pi.mixed, "execution.report")))
## node jobs.completed
## 1 ip-10-115-54-77 815
## 2 ip-10-34-239-135 810
## 3 ip-10-39-10-26 826
## 4 ip-10-78-49-250 827
## 5 krypton 1470
## 6 mongodb 1648
## 7 node1 1051
## 8 research 940
## 9 xenon 1613
distributed memory regression
Borrowing from Thomas Lumley's presentation [1], one can calculate regression
coefficients in a distributed memory environment via these steps:
Compute $X^T X$ and $X^T y$ in chunks
aggregate the chunks
reduce by $\hat{\beta} = (X^T X)^{-1} X^T y$
[1] http://faculty.washington.edu/tlumley/tutorials/user-biglm.pdf
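The reduction is easy to verify serially on a toy problem; the Reduce step below mirrors the one used in distributed.reg two slides ahead (base R only).
## verify the chunked normal-equation reduction on a toy problem
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n))    # intercept + two covariates
y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(n))
chunks <- split(seq_len(n), rep(1:4, length.out = n))  # four "nodes"
pieces <- lapply(chunks, function(idx)
  list(xTx = t(X[idx, ]) %*% X[idx, ],
       xTy = t(X[idx, ]) %*% y[idx]))
xTx <- Reduce("+", lapply(pieces, "[[", "xTx"))
xTy <- Reduce("+", lapply(pieces, "[[", "xTy"))
beta.hat <- solve(xTx) %*% xTy
## matches the full-data fit
all.equal(as.vector(beta.hat), unname(coef(lm(y ~ X[, 2] + X[, 3]))))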
distributed memory regression – data
Airline on-time performance
http://stat-computing.org/dataexpo/2009/
The data:
The data consists of flight arrival and departure details for all commercial flights
within the USA, from October 1987 to April 2008. This is a large dataset: there
are nearly 120 million records in total, taking up 1.6 gigabytes of space
compressed and 12 gigabytes when uncompressed. To make sure that you're not
overwhelmed by the size of the data, we've provided two brief introductions to some
useful tools: linux command line tools and sqlite, a simple sql database.
distributed memory regression – setup
distributed.reg <- function(bucket, nodes, y.transform, x.transform) {
  calc.xTx.xTy <- function(chnk.name, y.transform, x.transform) {
    require(AWS.tools)
    raw.data <- s3.get(chnk.name)
    y <- y.transform(raw.data)
    X <- x.transform(raw.data)
    bad.mask <- is.na(y) | apply(is.na(X), 1, any)
    y <- y[!bad.mask]
    X <- X[!bad.mask, ]
    list(xTx = t(X) %*% X, xTy = t(X) %*% y)
  }
  chunks <- s3.ls(bucket)[, "bucket"]
  ans <- zmq.cluster.lapply(cluster = nodes, chunks, calc.xTx.xTy,
                            y.transform = y.transform,
                            x.transform = x.transform)
  exe.rpt <- attr(ans, "execution.report")
  xTx <- Reduce("+", lapply(ans, "[[", "xTx"))
  xTy <- Reduce("+", lapply(ans, "[[", "xTy"))
  ans <- solve(xTx) %*% xTy
  attr(ans, "execution.report") <- exe.rpt
  ans
}
distributed memory regression – more setup
What are we estimating?
We supply 'get.y' and 'get.x', which are applied to each chunk.
'get.y' returns the log of the departure delay in minutes.
'get.x' returns a matrix of the departure time in hours and the log(distance) of
the flight.
require(AWS.tools)
get.y <- function(X) {
  log(ifelse(!is.na(X[, "DepDelay"]) & X[, "DepDelay"] > 0,
             X[, "DepDelay"], 1e-06))
}
get.x <- function(X) {
  as.matrix(cbind(round(X[, "DepTime"]/100, 0) + X[, "DepTime"] %% 100/60,
                  log(ifelse(X[, "Distance"] > 0, X[, "Distance"], 0.1))))
}
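A quick look at what the transforms produce on a couple of hand-made rows; the column names follow the airline dataset, the values are invented.
## made-up rows using the airline dataset's column names
toy <- data.frame(DepDelay = c(15, NA, -3),
                  DepTime  = c(930, 1745, 2210),  # hhmm encoding
                  Distance = c(300, 1500, 0))
get.y(toy)  # log(15), then log(1e-06) for the NA and negative delays
get.x(toy)  # col 1: decimal hours (9.5, 17.75, ~22.17); col 2: log(distance)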
distributed memory regression – local execution
## research is an 8 core, 64GB server, 64/8 ~ 8GB per worker
## mongodb is a 12 core, 32GB server, 32/12 ~ 2.6GB per worker
nodes <- c("research", "mongodb", "krypton")
coefs.local.runtime <- system.time(coefs.local <-
    distributed.reg(bucket = "s3://airline.data", nodes,
                    y.transform = get.y, x.transform = get.x))
print(as.vector(coefs.local))
## [1] 0.1702 -1.4925
print(coefs.local.runtime)
## user system elapsed
## 0.10 0.02 402.44
print(node.summary(attr(coefs.local, "execution.report")))
## node jobs.completed
## 1 krypton 1
## 2 mongodb 12
## 3 research 8
distributed memory regression – cloud execution
Make sure you pick an instance type that provides adequate RAM per core.
m1.large is a 2 core machine with 7.5 GB of RAM, i.e. 3.75 GB per core (which is
more than we need for this chunk size).
require(AWS.tools)
cluster <- startCluster(ami = "ami-1c05a075", key = "kls-ec2",
instance.count = 11, instance.type = "m1.large")
nodes <- get.nodes(cluster)
coefs.ec2.runtime <- system.time(coefs.ec2 <-
    distributed.reg(bucket = "s3://airline.data", nodes,
                    y.transform = get.y, x.transform = get.x))
## turn the cluster off
res <- terminateCluster(cluster)
print(as.vector(coefs.ec2))
## [1] 0.1702 -1.4925
print(coefs.ec2.runtime)
## user system elapsed
## 0.152 0.088 277.166
print(node.summary(attr(coefs.ec2, "execution.report")))
## node jobs.completed
## 1 ip-10-114-150-31 2
## 2 ip-10-116-235-189 1
## 3 ip-10-12-119-154 2
## 4 ip-10-12-123-135 2
## 5 ip-10-204-111-140 2
## 6 ip-10-204-150-93 2
## 7 ip-10-36-33-101 2
## 8 ip-10-40-62-117 2
## 9 ip-10-83-131-140 2
## 10 ip-10-85-151-177 2
## 11 ip-10-99-21-228 2
Final points
consuming cloud resources should not require arduous configuration steps (in
contrast to MPI, SLURM, Sun Grid Engine, etc.)
deathstar allows cloud access with minimal headache and maximum simplicity
Conclusion
The next generation of R programmers needs something better than MPI... Please
help me continue to build that tool.
Thanks!
Many people contributed ideas and helped debug work in progress as the package
was being developed.
Kurt Hornik for putting up with my packaging.
John Laing for initial debugging.
Gyan Sinha for ideas and initial testing.
Prof Brian Ripley for just being himself.
My wife for enduring many solitary hours while I built these packages.