Parallel Computing with R

Literature Seminar
Abhirup Mallik
malli066@umn.edu
School of Statistics
University of Minnesota

November 15, 2013

Why Parallel?

R does not take advantage of multiple cores by default.
R does not support passing by reference.
R cannot read files dynamically, etc.

What is Parallel?

'Parallel': doing more than one task at the same time.
Use different cores of the same CPU for different tasks.
Use different computers in a cluster for different tasks.
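
As a quick check of how much single-machine parallelism is available, the sketch below asks R for the core count. It relies only on detectCores() from the parallel package; the variable names are illustrative and not part of the original slides.

```r
library(parallel)

# Number of logical cores (including hyper-threads) visible to R.
n.logical <- detectCores()

# Number of physical cores; may be NA on platforms where R cannot tell.
n.physical <- detectCores(logical = FALSE)

cat("logical cores: ", n.logical, "\n")
cat("physical cores:", n.physical, "\n")
```

The reported count is an upper bound; on a shared machine it is usually polite to use fewer workers than this.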

How to go Parallel?

Using Multicore (Implicit Parallelism)
The main process forks child processes, which run in parallel on different cores.
library(parallel)
mclapply(X, FUN, ...)

Or use
library(parallel)
# ... setup stuff ...
for (isplit in 1:nsplit) {
    mcparallel(some R expression involving isplit)
}
out <- mccollect()  # named collect() in the older multicore package
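
The two fork-based patterns above can be tried on a toy problem. This is a minimal sketch, assuming a Unix-alike system (forking is not available on Windows, so the sketch falls back to one core there); the squaring and multiply-by-ten tasks are placeholders chosen only for illustration.

```r
library(parallel)

# Forking is unavailable on Windows, so fall back to one core there.
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# mclapply() is a drop-in replacement for lapply().
squares <- unlist(mclapply(1:8, function(i) i^2, mc.cores = n.cores))

# On Unix-alikes, each mcparallel() call starts one forked job and
# mccollect() waits for the jobs and returns their results in job order.
if (.Platform$OS.type != "windows") {
  jobs <- lapply(1:3, function(i) mcparallel(i * 10))
  out <- unlist(mccollect(jobs), use.names = FALSE)
}
```

mclapply() is usually the simpler choice; mcparallel() with mccollect() is useful when the jobs are not naturally elements of one list.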

Warnings:
All child processes compete for memory.
Closing the terminal or any graphical window kills only the parent.
Ctrl+C kills the parent, not the children.
Kill the child processes manually if they become unresponsive.

Using SNOW (Explicit Parallelism)
Make a cluster with any one of these options:
cl <- makeCluster(spec, type, ...)
cl <- makePSOCKcluster(names, ...)
cl <- makeForkCluster(nnodes, ...)

Export essential objects to the cluster (pass their names as strings):
clusterExport(cl, c("var1", "fun1", ...))

Evaluate on the cluster:
clusterEvalQ(cl, expr)
parLapply(cl = NULL, X, fun, ...)
parSapply(cl = NULL, X, fun, ...)

Stop the cluster:
stopCluster(cl)
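
The four steps above (make, export, evaluate, stop) can be put together in one short run. This is a sketch under assumptions: a local two-worker cluster, and the names offset and add.offset are hypothetical objects invented for the example.

```r
library(parallel)

cl <- makeCluster(2)          # two local worker processes (socket cluster)

offset <- 100                 # hypothetical object needed by the workers
add.offset <- function(x) x + offset

# Workers start with empty workspaces, so ship the objects by name.
clusterExport(cl, c("offset", "add.offset"), envir = environment())

# Run a setup expression on every worker, e.g. load a package.
invisible(clusterEvalQ(cl, library(stats)))

res <- parSapply(cl, 1:5, function(x) add.offset(x))

stopCluster(cl)               # always release the workers
print(res)
```

Forgetting the export step is the most common failure: the workers then report that add.offset cannot be found, because they do not share the master's workspace.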

Demonstration

Using the Swiss fertility data from 1888 (included with base R).
> str(swiss)
'data.frame':   47 obs. of  6 variables:
 $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 ...
 $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 ...
 $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
 $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
 $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
 $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 ...

10-fold cross-validation
# Assign each observation to one of 10 folds at random
fold <- sample(seq(1, 10), size = nrow(swiss), replace = TRUE)

Cross-validation for the i-th fold
library(randomForest)

# Fit on all folds except i, then return the sum of squared
# prediction errors on fold i (on the square-root scale)
fold.cv <- function(i) {
    train <- swiss[fold != i, ]
    test <- swiss[fold == i, ]
    swiss.rf <- randomForest(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50),
                             data = train)
    predict.test <- predict(swiss.rf, test, type = "response")
    actual.test <- sqrt(test$Fertility)
    err <- predict.test - actual.test
    sum(err * err)
}
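
The per-fold function above can be checked sequentially before any parallel code is involved. The sketch below keeps the same structure but swaps lm() in for randomForest() so that it runs with base R only; that substitution, the fixed seed, and the names fold.cv.lm, sse and rmse are assumptions made for the example, not part of the original demonstration.

```r
data(swiss)
set.seed(1)  # fixed seed so the fold assignment is reproducible

# Random fold labels, assigned the same way as in the demonstration.
fold <- sample(seq(1, 10), size = nrow(swiss), replace = TRUE)

# Same structure as fold.cv(), but with lm() swapped in for randomForest().
fold.cv.lm <- function(i) {
  train <- swiss[fold != i, ]
  test <- swiss[fold == i, ]
  fit <- lm(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50), data = train)
  err <- predict(fit, test) - sqrt(test$Fertility)
  sum(err * err)
}

# One sum of squared errors per fold, then an overall RMSE.
sse <- sapply(1:10, fold.cv.lm)
rmse <- sqrt(sum(sse) / nrow(swiss))
print(rmse)
```

Because every call depends only on the fold index i, the sapply() call is exactly the part that can later be handed to mclapply() or parLapply().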

How to create a cluster?

Create a local cluster of size 4 (parallel socket):
cl <- makePSOCKcluster(4)

Create a local cluster on different cores of the CPU (8 cores):
cl <- makeForkCluster(8)
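
One way to confirm that a cluster really consists of separate worker processes is to ask each worker for its process ID. This is a small sketch assuming a two-worker local socket cluster; the variable names are illustrative.

```r
library(parallel)

cl <- makePSOCKcluster(2)   # small local socket cluster

# Ask every worker for its process ID; each differs from the master's.
worker.pids <- unlist(clusterCall(cl, Sys.getpid))
master.pid <- Sys.getpid()

stopCluster(cl)

print(worker.pids)
print(master.pid)
```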

How to create a cluster in our lab?

Set up password-less login using ssh-keygen (from the shell):
ssh-keygen -t dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Check which computers are running:
grephosts LAB
# Then ssh once to each computer you want to connect to;
# it will be remembered for the session.

Now we are ready to make a cluster:
library(parallel)
machines <- c("crab", "sugar", "strike", "hyland", "lovejoy", "driller")
# Resolve each host name to an IP address
address <- rapply(lapply(machines, nsl), c)
cl <- makePSOCKcluster(address)

If you are connecting to stat.umn.edu from your own computer, create a
password-less ssh session as follows:
ssh-keygen -t dsa
# Then use scp to copy id_dsa.pub to ~/.ssh/authorized_keys on the remote machine

Comparison

On cluster:
> system.time({
+     garbage <- clusterEvalQ(cl, data(swiss))
+     garbage <- clusterEvalQ(cl, library(randomForest))
+     clusterExport(cl, c("fold", "fold.cv"))
+     clusterSetRNGStream(cl, 123)
+     res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
+     stopCluster(cl)
+ })
   user  system elapsed
  0.008   0.000   0.838

On Multicore:
> system.time({
+     res1 <- do.call(c, mclapply(1:10, fold.cv, mc.cores = 8))
+ })
   user  system elapsed
  0.386   0.162   0.120

Using Fork cluster:
> system.time({
+     cl <- makeForkCluster(8)
+     garbage <- clusterEvalQ(cl, data(swiss))
+     garbage <- clusterEvalQ(cl, library(randomForest))
+     clusterExport(cl, c("fold", "fold.cv"))
+     clusterSetRNGStream(cl, 123)
+     res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
+     stopCluster(cl)
+ })
   user  system elapsed
  0.010   0.054   0.153

Without any parallelization:
> system.time({
+     res2 <- do.call(c, lapply(1:10, fold.cv))
+ })
   user  system elapsed
  0.233   0.000   0.235
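
A comparison of this kind can be reproduced on any machine with a toy workload. The sketch below is not the benchmark from the slides: slow.task is a hypothetical stand-in that just sleeps, and the cluster has only two workers. It shows the measurement pattern, including the fact that cluster start-up is part of the elapsed time.

```r
library(parallel)

# Hypothetical stand-in for an expensive task: sleep, then return a value.
slow.task <- function(i) {
  Sys.sleep(0.2)
  i^2
}

# Sequential baseline.
t.seq <- system.time(res.seq <- lapply(1:4, slow.task))

# Same work on a 2-worker socket cluster; start-up is timed too.
t.par <- system.time({
  cl <- makePSOCKcluster(2)
  res.par <- parLapply(cl, 1:4, slow.task)
  stopCluster(cl)
})

# Elapsed (wall-clock) seconds for both runs.
print(c(sequential = t.seq[["elapsed"]], cluster = t.par[["elapsed"]]))
```

For short tasks the start-up and communication overhead can dominate, which is why the socket cluster in the comparison above has the largest elapsed time.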

When to go Parallel?


When the gain from parallelization is much larger than the cost of
data transfer, network delays, etc.
When the problem is embarrassingly parallel: there is no dependency
between the parallel tasks.
Cross-validation and bootstrapping are examples where going
parallel works well.
For iterative numerical methods such as coordinate descent or
Newton-Raphson, where each step depends on the previous one,
going parallel may not be possible.
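
Bootstrapping illustrates the embarrassingly parallel case: each replicate is computed independently of the others. The sketch below is an assumed example, not taken from the slides; it bootstraps the mean of Fertility in the swiss data on a two-worker cluster, and boot.mean, boot.means and boot.se are hypothetical names.

```r
library(parallel)
data(swiss)

# One bootstrap replicate: resample the values, return their mean.
# The first argument is the replicate index and is otherwise unused.
boot.mean <- function(b, values) mean(sample(values, replace = TRUE))

cl <- makePSOCKcluster(2)
clusterSetRNGStream(cl, 123)   # reproducible, independent RNG streams

# Replicates do not depend on each other, so they can run in any order.
boot.means <- parSapply(cl, 1:200, boot.mean, values = swiss$Fertility)
stopCluster(cl)

boot.se <- sd(boot.means)      # bootstrap standard error of the mean
print(boot.se)
```

A Newton-Raphson iteration, by contrast, needs the result of step k before it can start step k + 1, so its steps cannot be distributed in this way.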

To infinity and beyond

What is beyond the wall?

Parallelization in big-data frameworks: RHadoop
Other, related implementations of parallelization: MPI, NWS, etc.
Other cool libraries: foreach, snowfall, etc.
GPU!

Where to get the code?

All the code in this presentation is available at:
https://github.com/abhirupkgp/parallelseminar/blob/master/cv.R

Acknowledgements and References

Sincere thanks to Charles Geyer
Resourceful slides by Ryan Rosario.
Some other and more resourceful slides.
Parallel R Book

Thank You!
