Parallel Computing with R

Literature Seminar
Abhirup Mallik
malli066@umn.edu
School of Statistics
University of Minnesota

November 15, 2013

Why Parallel?

R does not take advantage of multiple cores by default.
R does not support passing by reference.
R cannot read files dynamically, etc.

What is Parallel?

'Parallel': doing more than one task at the same time.
Use different cores of the same CPU for different tasks.
Use different computers in a cluster for different tasks.

How to go Parallel?

Using Multicore (Implicit Parallelism)
The main process forks child processes, which run in parallel on different cores.
    library(parallel)
    mclapply(X, FUN, ...)

Or use
    library(parallel)
    # ... setup ...
    for (isplit in 1:nsplit) {
      mcparallel(some R expression involving isplit)
    }
    # mccollect() waits for all children; it was called collect() in the older multicore package
    out <- mccollect()
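
The two forking interfaces above can be tried on a toy task. This is only a sketch: the function `slow_square` and the choice of two cores are assumptions made for the example, not part of the talk's code, and forking is available only on Unix-alikes (Linux, macOS).

```r
library(parallel)

# Hypothetical toy task: pretend each call is expensive.
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Forking is not available on Windows, so fall back to a single core there.
can_fork <- .Platform$OS.type != "windows"
n_cores <- if (can_fork) 2L else 1L

# High-level interface: a parallel drop-in replacement for lapply().
squares <- unlist(mclapply(1:8, slow_square, mc.cores = n_cores))

# Low-level interface: start each child explicitly, then collect the results.
if (can_fork) {
  jobs <- lapply(1:3, function(i) mcparallel(slow_square(i)))
  collected <- unlist(mccollect(jobs), use.names = FALSE)
} else {
  collected <- sapply(1:3, slow_square)
}
```

Keeping the job handles returned by mcparallel() and passing them to mccollect() returns the results in the order the jobs were started.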

Warnings:
All child processes compete for memory.
Closing the terminal or any graphical window kills only the parent.
Ctrl+C kills the parent, not the children.
Kill the children manually if they become unresponsive.
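
Since all children compete for the same memory, one simple precaution is to cap how many are started. The sketch below assumes a policy of leaving one core free and never using more than four children; both numbers are illustrative assumptions, not recommendations from the talk.

```r
library(parallel)

# detectCores() may return NA on some platforms, so guard against that.
available <- detectCores()
if (is.na(available)) available <- 1L

# Assumed policy: leave one core free and never start more than 4 children.
n_workers <- max(1L, min(available - 1L, 4L))

# mclapply() can only fork on Unix-alikes; Windows must use a single core.
if (.Platform$OS.type == "windows") n_workers <- 1L

# Toy workload: each task sums the integers 1..i.
totals <- unlist(mclapply(1:6, function(i) sum(seq_len(i)), mc.cores = n_workers))
```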

Using SNOW (Explicit Parallelism)
Make a cluster with any one of these options:
    cl <- makeCluster(spec, type, ...)
    cl <- makePSOCKcluster(names, ...)
    cl <- makeForkCluster(nnodes, ...)

Export essential objects to the cluster (objects are named as character strings):
    clusterExport(cl, c("var1", "fun1", ..))

Evaluate on the cluster:
    clusterEvalQ(cl, expr)
    parLapply(cl = NULL, X, fun, ...)
    parSapply(cl = NULL, X, FUN, ...)

Stop the cluster:
    stopCluster(cl)
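
The four steps above (make, export, evaluate, stop) can be strung together as follows. This is a minimal sketch: the two local workers and the names `offset` and `add_offset` are hypothetical choices for illustration.

```r
library(parallel)

# Two local socket workers; the count is an arbitrary choice for this sketch.
cl <- makeCluster(2, type = "PSOCK")

# Hypothetical objects that the workers need.
offset <- 10
add_offset <- function(x) x + offset

# Socket workers start with an empty workspace, so ship the objects by name.
clusterExport(cl, c("offset", "add_offset"), envir = environment())

# Run the same expression once on every worker (here: attach a package).
invisible(clusterEvalQ(cl, library(stats)))

# Apply the function across the workers, then shut the cluster down.
shifted <- parSapply(cl, 1:5, add_offset)
stopCluster(cl)
```

Calling stopCluster() at the end matters: socket workers are separate R processes and would otherwise keep running.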

Demonstration

Using the Swiss fertility data from 1888 (included with base R).
    > str(swiss)
    'data.frame':   47 obs. of  6 variables:
     $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 ...
     $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 ...
     $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
     $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
     $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
     $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 ...

10-fold cross-validation
    # Assign each of the 47 observations to one of 10 folds at random
    fold <- sample(seq(1, 10), size = nrow(swiss), replace = TRUE)

Cross-validation for the i-th fold
    library(randomForest)
    fold.cv <- function(i) {
      train <- swiss[fold != i, ]
      test <- swiss[fold == i, ]
      swiss.rf <- randomForest(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50),
                               data = train)
      predict.test <- predict(swiss.rf, test, type = "response")
      actual.test <- sqrt(test$Fertility)
      err <- predict.test - actual.test
      sum(err * err)  # sum of squared errors on the held-out fold
    }
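
For readers without the randomForest package, the same cross-validation loop can be tried with a stand-in model. The sketch below swaps in `lm()` purely for illustration; the name `fold.cv.lm`, the seed, and the pooled RMSE summary are assumptions added here, not part of the talk's code.

```r
# Same fold assignment as in the slides; the seed is an added assumption.
set.seed(123)
fold <- sample(seq(1, 10), size = nrow(swiss), replace = TRUE)

# Stand-in for fold.cv(): lm() replaces randomForest() purely for illustration.
fold.cv.lm <- function(i) {
  train <- swiss[fold != i, ]
  test <- swiss[fold == i, ]
  fit <- lm(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50), data = train)
  err <- predict(fit, newdata = test) - sqrt(test$Fertility)
  sum(err * err)
}

# Serial baseline: one sum of squared errors per fold.
sse <- do.call(c, lapply(1:10, fold.cv.lm))

# Pooled root-mean-squared error over all held-out predictions.
rmse <- sqrt(sum(sse) / nrow(swiss))
```

Because each fold is fitted independently of the others, the `lapply()` call is the only line that needs to change to run the folds in parallel.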

How to create a cluster?

Create a local cluster of size 4 (parallel socket):
    cl <- makePSOCKcluster(4)

Create a local cluster on different cores of the CPU (8 cores):
    cl <- makeForkCluster(8)
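
A practical difference between the two cluster types is what the workers can see. Socket workers are fresh R sessions and know nothing about the master's workspace, whereas fork workers inherit it at the moment of forking. The sketch below checks the socket case; the object name `demo_value` and the cluster size are hypothetical.

```r
library(parallel)

# Hypothetical object that so far exists only in the master session.
demo_value <- 42

cl <- makePSOCKcluster(2)

# Socket workers are fresh R sessions, so the object is not there yet.
before <- unlist(clusterEvalQ(cl, exists("demo_value")))

# After an explicit export the workers can see it.
clusterExport(cl, "demo_value", envir = environment())
after <- unlist(clusterEvalQ(cl, exists("demo_value")))

stopCluster(cl)
```

This is why the socket-cluster code in this talk exports `fold` and `fold.cv` and loads the data and packages on every worker before calling `parLapply()`.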

How to create a cluster in our LAB?
Create a password-less login using ssh-keygen (from the shell):
    ssh-keygen -t dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Check which computers are running:
    grephosts LAB
    # Then ssh once into each computer you want to connect to;
    # it will be remembered for the session.

Now we are ready to make a cluster:
    library(parallel)
    machines <- c("crab", "sugar", "strike", "hyland", "lovejoy", "driller")
    # nsl() looks up the IP address of each host name
    address <- rapply(lapply(machines, nsl), c)
    cl <- makePSOCKcluster(address)
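
Before sending real work to a cluster built from a host list, it can help to confirm where each worker actually runs. The sketch below uses two workers on `localhost` as a stand-in for the lab machines, so it runs without any ssh setup; the host list is an assumption for illustration only.

```r
library(parallel)

# Stand-in host list: two workers on this machine instead of lab machines.
hosts <- c("localhost", "localhost")
cl <- makePSOCKcluster(hosts)

# Ask every worker which machine it runs on and which process it is.
where <- clusterCall(cl, function() {
  list(node = Sys.info()[["nodename"]], pid = Sys.getpid())
})
stopCluster(cl)

nodes <- vapply(where, function(w) w$node, character(1))
pids <- vapply(where, function(w) w$pid, integer(1))
```

With a real host list, `nodes` would show one entry per remote machine, which makes a mistyped or unreachable host easy to spot.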

If you are connecting to stat.umn.edu from your own computer, create a password-less ssh session as follows:
    ssh-keygen -t dsa
    # Then use scp to copy id_dsa.pub to ~/.ssh/authorized_keys on stat.umn.edu

Comparison
On cluster:
    > system.time({
    +   garbage <- clusterEvalQ(cl, data(swiss))
    +   garbage <- clusterEvalQ(cl, library(randomForest))
    +   clusterExport(cl, c("fold", "fold.cv"))
    +   clusterSetRNGStream(cl, 123)
    +   res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
    +   stopCluster(cl)
    + })
       user  system elapsed
      0.008   0.000   0.838

On Multicore:
    > system.time({
    +   res1 <- do.call(c, mclapply(1:10, fold.cv, mc.cores = 8))
    + })
       user  system elapsed
      0.386   0.162   0.120

Using Fork cluster:
    > system.time({
    +   cl <- makeForkCluster(8)
    +   garbage <- clusterEvalQ(cl, data(swiss))
    +   garbage <- clusterEvalQ(cl, library(randomForest))
    +   clusterExport(cl, c("fold", "fold.cv"))
    +   clusterSetRNGStream(cl, 123)
    +   res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
    +   stopCluster(cl)
    + })
       user  system elapsed
      0.010   0.054   0.153

Without any parallelization:
    > system.time({
    +   res2 <- do.call(c, lapply(1:10, fold.cv))
    + })
       user  system elapsed
      0.233   0.000   0.235
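
To make the comparison easier to read, the elapsed times reported above can be turned into speedups relative to the serial run. The numbers are copied from the slides; only the labels in the vector are invented here.

```r
# Elapsed times (seconds) as reported on the slides.
elapsed <- c(serial = 0.235, mclapply = 0.120,
             fork_cluster = 0.153, socket_cluster = 0.838)

# Speedup relative to the serial run; values below 1 mean a slowdown.
speedup <- round(elapsed[["serial"]] / elapsed, 2)
print(speedup)
# serial 1.00, mclapply 1.96, fork_cluster 1.54, socket_cluster 0.28
```

On these numbers mclapply roughly halves the elapsed time, while the socket cluster is slower than the serial run, presumably because communication overhead dominates for such a small task.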

When to go Parallel?

When the gain from parallelization is much greater than the cost of data transfer, network delays, etc.
If the problem is embarrassingly parallel: no dependency between the parallel tasks.
Cross-validation and bootstrapping are examples where going parallel works well.
For iterative numerical methods such as coordinate descent or Newton-Raphson, where each step depends on the previous one, going parallel may not be possible.
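
Bootstrapping is a typical embarrassingly parallel task, since every replicate is independent of the others. The sketch below bootstraps the mean of `swiss$Fertility` on a small local cluster; the helper names, the two workers, and the 200 replicates are assumptions chosen for illustration.

```r
library(parallel)

cl <- makeCluster(2)  # two local workers; an arbitrary choice for this sketch

# One bootstrap replicate: resample the Fertility column and take the mean.
# swiss ships with R's datasets package, so every worker can find it.
boot_mean <- function(b) {
  x <- datasets::swiss$Fertility
  mean(sample(x, length(x), replace = TRUE))
}

run_bootstrap <- function(seed, n_boot = 200) {
  clusterSetRNGStream(cl, seed)  # reproducible, independent streams per worker
  parSapply(cl, seq_len(n_boot), boot_mean)
}

first <- run_bootstrap(123)
second <- run_bootstrap(123)  # same seed and cluster size: same replicates
stopCluster(cl)

boot_se <- sd(first)  # bootstrap standard error of the mean
```

Resetting the streams with clusterSetRNGStream() before each run is what makes the two runs agree; without it the workers would draw from unrelated random states.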

To infinity and beyond

What is beyond the wall?

Parallelization in a big-data framework: RHadoop
Other, related implementations of parallelization: MPI, NWS, etc.
Other cool libraries: foreach, snowfall, etc.
GPU!

Where to get the code?

All the code in this presentation is available at:
https://github.com/abhirupkgp/parallelseminar/blob/master/cv.R

Acknowledgements and References

Sincere thanks to Charles Geyer.
Resourceful slides by Ryan Rosario.
Some other, more resourceful slides.
Parallel R Book

Thank You!
