1. Tackling repetitive tasks with serial or parallel programming in R
Speaker: Lun-Hsien CHANG
Affiliation: Immunology in Cancer and Infection, QIMR Berghofer
Meeting: R user group meeting #14
Time: 1-2 PM, 28th July 2020
Place: Seminar room, Level 6, Central building, QIMR, Brisbane
2. It is the central dogma but….
Even if your program is working fine, you may still want to
● Try a faster R package than the current one
● Rewrite code so that it is less error-prone
● Revise code for simplicity and efficiency
3. Outline
R programming basics
● Syntax forms, data structure, vector, elapsed time
Serial computing
● For loop, vectorised functions, *apply() functions
Parallel computing
● The doParallel, parallel, foreach package
Compare time performance in serial and parallel computing
4. Common syntax forms in an R program
# Comments are preceded by a hash
# Assign value "A.1" to variable.1
variable.1 <- "A.1"
# Load package.A, then use function.A from it
library(package.A)
function.A( argument1=values
           ,argument2=values
           ,...)
# Use function.A from package.A without loading the package first
package.A::function.A( argument1=values
                      ,argument2=values
                      ,...)
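As a concrete instance of the two call styles above, the short sketch below uses file_ext() from the tools package, which ships with R. The package, the function and the file names are chosen here for illustration only; they are not from the talk.

```r
# Assign a file name to a variable (example value chosen for illustration)
file.1 <- "photo.jpg"

# Style 1: attach the package, then call the function by name
library(tools)
ext.1 <- file_ext(file.1)

# Style 2: call the function through its namespace, without attaching the package
ext.2 <- tools::file_ext("scan.tif")

print(c(ext.1, ext.2))
```

The second style makes it explicit where a function comes from, which can help when two attached packages export functions with the same name.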
6. What is a vector in R?
A vector is a one-dimensional collection of numbers, characters or logicals, all of the same type
v1 <- c(1:5)
v1
# [1] 1 2 3 4 5
v2 <- c("a","b","c","d","e")
v2
# [1] "a" "b" "c" "d" "e"
v3 <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
v3
# [1] TRUE TRUE FALSE FALSE TRUE
# Mixing types coerces every element to the most general type (here character)
v4 <- c(1, "a", TRUE, 4, "b")
v4
# [1] "1"    "a"    "TRUE" "4"    "b"
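The coercion seen in v4 can be checked with class(). The sketch below reuses v1, v3 and v4 from the slide; v5 is an extra vector added here for illustration, showing that logicals mixed with integers become integers.

```r
v1 <- c(1:5)
v3 <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
v4 <- c(1, "a", TRUE, 4, "b")

class(v1)  # "integer"
class(v3)  # "logical"
class(v4)  # "character": numbers and logicals were converted to strings

# Hypothetical extra example: logical values combined with an integer
v5 <- c(v3, 10L)
class(v5)  # "integer": TRUE becomes 1L and FALSE becomes 0L
v5
```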
7. What is elapsed time in R?
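User time : CPU time spent running the R process's own code
System time : CPU time spent by the operating system (OS) on behalf of the R process
Elapsed time : the wall-clock time that passes from the start of a program to its finish
# Option 1: take the difference between two proc.time() readings
Start.time <- proc.time()
# run some R code
Time.used <- proc.time() - Start.time
# Option 2: wrap the code in system.time()
system.time({
  # run some R code
})
# user system elapsed
# 0.4 0.1 132.2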
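A runnable sketch of both timing approaches follows. Sys.sleep() and the square-root sum are stand-ins for real work chosen for illustration; the variable names are not from the talk.

```r
# Approach 1: subtract two proc.time() readings
start.time <- proc.time()
Sys.sleep(0.2)                      # stand-in for real work; sleeping uses almost no CPU
time.used <- proc.time() - start.time
time.used["elapsed"]                # wall-clock seconds

# Approach 2: wrap the code in system.time()
timing <- system.time({
  x <- sum(sqrt(1:1e6))             # stand-in for real work that does use the CPU
})
timing[c("user.self", "sys.self", "elapsed")]
```

In the first approach the elapsed time is much larger than the user time, because the process spends most of its time waiting rather than computing. The output shown on the slide (elapsed 132.2 against user 0.4) has the same shape.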
9. What is serial (sequential) computing?
Runs on a single CPU core, solving one task at a time
Ideal for dependent tasks (e.g. Task 2 uses result from task 1)
Run time grows with the number of tasks, because each task waits for the previous one to finish
[Figure: Tasks 1 to 4 run one after another along the time axis on a single-core processor (CPU)]
10. R functions that run serial computing
● for loop
● Vectorised functions
○ Most R functions take a vector, usually as their first argument
○ A few R functions take only a single value (e.g. dir.create() )
● lapply(), sapply() from the apply family
11. Syntax form of a for loop in R
for (i in 1:10){
Command1
Command2
...
}
for (i in 1:10) creates a variable i that takes the values 1 to 10, one per iteration
12. Syntax form of a for loop in R
for (i in 1:10){
Command1
Command2
...
}
The commands between { and } run once for each value of i
13. Syntax form of a for loop in R
for (i in 1:10){
Command1
Command2
...
} # Close the for loop with }
14. Syntax forms of a for loop in R
# This works
for (i in 1:10){
  Command1
  Command2
  ...
}
# This works too
for(i in c(1:10)){
  Command1
  Command2
  ...
}
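To make the general form concrete, here is a minimal runnable sketch with real commands in the loop body. The variable names and the folder labels are illustrative assumptions, not part of the talk.

```r
# Pre-allocate the result vector, then fill it one element per iteration
squares <- numeric(10)
for (i in 1:10) {
  squares[i] <- i^2
}
squares

# seq_along() iterates over the positions of an existing vector
folders <- c("JPG", "TIF", "PNG")
labels <- character(length(folders))
for (k in seq_along(folders)) {
  labels[k] <- paste0("folder ", k, ": ", folders[k])
}
labels
```

Pre-allocating the result avoids growing a vector inside the loop, which is a common cause of slow for loops in R.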
15. Vectorised operations in R
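Many operations are vectorised in R, meaning that a single call operates on every
element of a vector at once. The loop over elements runs inside the function rather
than in your R code; it does not by itself use multiple CPU cores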
Task : Look up JPG images in 3 folders and get their file paths
dir.1 <- "C:/images"
dir.2 <- "D:/images"
dir.3 <- "E:/images"
list.files(path = c( dir.1, dir.2, dir.3)
           ,pattern = "\\.jpg$" # regular expression: file names ending in .jpg
           ,full.names = TRUE )
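list.files() treats its pattern argument as a regular expression, in which an unescaped dot matches any character. The sketch below uses the vectorised function grepl() on a made-up vector of file names (an assumption for illustration, so that no folders are needed) to show how a loose pattern and an escaped, anchored pattern differ.

```r
# Hypothetical file names, used instead of real folders
file.names <- c("a.jpg", "b.JPG", "notes_jpg.txt", "c.jpg.bak")

# One call tests every element of the vector; no explicit loop is needed
loose  <- grepl(".*.jpg", file.names)      # TRUE FALSE TRUE TRUE
strict <- grepl("\\.jpg$", file.names)     # TRUE FALSE FALSE FALSE
strict.any.case <- grepl("\\.jpg$", file.names, ignore.case = TRUE)  # TRUE TRUE FALSE FALSE

data.frame(file.names, loose, strict, strict.any.case)
```

list.files() also accepts ignore.case = TRUE, which may be useful when some images carry an upper-case extension.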
16. The *apply() functions
● Examples: lapply(X=, FUN=), sapply(X=, FUN=)
● Use them when the function to apply is simple
● Misconception: *apply() functions are faster than a for loop. In fact they are loops
internally: they apply a function (FUN=) to each element of a vector or list (X=) in turn,
so they are not inherently faster than a well-written for loop!
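The difference between lapply() and sapply() is mainly the shape of what they return. A minimal sketch, using sqrt() as an arbitrary simple function (an illustrative choice), also shows the equivalent for loop and the vectorised call:

```r
x <- c(1, 4, 9)

res.lapply <- lapply(X = x, FUN = sqrt)   # always returns a list
res.sapply <- sapply(X = x, FUN = sqrt)   # simplifies to a vector when it can

# The same work written as a for loop
res.for <- numeric(length(x))
for (i in seq_along(x)) res.for[i] <- sqrt(x[i])

# sqrt() is itself vectorised, so no loop of any kind is required here
res.vec <- sqrt(x)

str(res.lapply)
res.sapply
```

All four give the same numbers; when a vectorised function exists, it is usually the simplest choice.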
17. The *apply() functions
● Task: Create 3 folders under C:/images
# Specify the full path of new folders
new.folder.1 <- "C:/images/JPG"
new.folder.2 <- "C:/images/TIF"
new.folder.3 <- "C:/images/PNG"
# Create new folders using dir.create()
lapply( X= c( new.folder.1
,new.folder.2
,new.folder.3)
,FUN = function(x) dir.create(x, recursive = TRUE))
18. An unnecessary usage of lapply()
Task: check whether the 3 image folders exist
# Check the existence of 3 image folders by lapply()
unlist(lapply(X=c( new.folder.1
                  ,new.folder.2
                  ,new.folder.3)
             ,FUN = function(x) dir.exists(x))
      )
# [1] TRUE TRUE TRUE
# By vectorised operation
dir.exists(paths = c( new.folder.1
                     ,new.folder.2
                     ,new.folder.3)
          )
# [1] TRUE TRUE TRUE
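The folder examples can be tried without touching C:/images by working under R's temporary directory. The sketch below is an adaptation under that assumption: base.dir and new.folders are illustrative names, and the folders are deleted at the end.

```r
# Build 3 folder paths under a fresh temporary location (not C:/images)
base.dir <- tempfile("images_")
new.folders <- file.path(base.dir, c("JPG", "TIF", "PNG"))

# Vectorised check before creation: one call, three answers
before <- dir.exists(paths = new.folders)

# dir.create() is applied to one path at a time through lapply()
created <- unlist(lapply( X = new.folders
                         ,FUN = function(x) dir.create(x, recursive = TRUE)))

# Vectorised check after creation
after <- dir.exists(paths = new.folders)
print(rbind(before, created, after))

# Clean up the temporary folders
unlink(base.dir, recursive = TRUE)
```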
19. Task: read multiple text files to a single data frame with lapply()
Specify paths of input folders
Check to see if these input folders exist
Get full paths of input txt files
Read these files to a list
Concatenate these files as a single data frame
20. Read multiple text files to a single data frame (1/4)
# Specify full paths of data folders
drive.dir.C <- 'C:/Lab_MarkS'
input.data.dir <- file.path(drive.dir.C,"lunC/Immunohistochemistry_images/data_output")
input.data.folder.1 <- file.path(input.data.dir, "MT_Exp023.2_18-001-A","analysis-results")
input.data.folder.2 <- file.path(input.data.dir, "MT_Exp023.2_18-001-B","analysis-results")
input.data.folder.3 <- file.path(input.data.dir, "MT_Exp023.2_18-001-C","analysis-results")
input.data.folder.4 <- file.path(input.data.dir, "MT_Exp023.2_18-002-A","analysis-results")
input.data.folder.5 <- file.path(input.data.dir, "MT_Exp023.2_18-002-B","analysis-results")
21. Read multiple text files to a single data frame (2/4)
# Check if input folders exist (returns one TRUE or FALSE per folder)
dir.exists(paths=c( input.data.folder.1
                   ,input.data.folder.2
                   ,input.data.folder.3
                   ,input.data.folder.4
                   ,input.data.folder.5))
22. Read multiple text files to a single data frame (3/4)
# Get full paths of input files
input.data.file.paths <- list.files(path = c( input.data.folder.1
                                             ,input.data.folder.2
                                             ,input.data.folder.3
                                             ,input.data.folder.4
                                             ,input.data.folder.5)
                                    ,pattern = ".*cell-segmentation-summary_long-format_based-on-merged-cell-seg-file.tsv"
                                    ,full.names = TRUE )
# length(input.data.file.paths) 5
23. Read multiple text files to a single data frame (4/4)
# Read multiple tsv files to a list
input.data.list <- lapply( X=input.data.file.paths
                          ,FUN= function(x) read.delim( file=x
                                                       ,header = TRUE
                                                       ,stringsAsFactors = FALSE)
                          )
# class(input.data.list) "list"
# length(input.data.list) 5
# Combine list elements to a single data.frame
input.data.read <- do.call(what = "rbind", args = input.data.list)
# dim(input.data.read) 375 14
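The four steps can be rehearsed end to end without the original data. The sketch below first writes three small tab-separated files into a temporary folder; the folder, file names and columns are invented for illustration and do not reflect the real analysis files.

```r
# Create a temporary folder holding 3 small tsv files (made-up data)
demo.dir <- tempfile("tsv_demo_")
dir.create(demo.dir)
for (k in 1:3) {
  demo.data <- data.frame( sample.id = paste0("S", k)
                          ,cell = 1:4
                          ,area = k * c(10, 20, 30, 40))
  write.table( demo.data
              ,file = file.path(demo.dir, paste0("sample_", k, ".tsv"))
              ,sep = "\t", row.names = FALSE, quote = FALSE)
}

# Get full paths of the input files
demo.paths <- list.files(path = demo.dir, pattern = "\\.tsv$", full.names = TRUE)

# Read the files to a list, one data frame per file
demo.list <- lapply( X = demo.paths
                    ,FUN = function(x) read.delim(file = x, header = TRUE, stringsAsFactors = FALSE))

# Combine the list elements to a single data frame
demo.read <- do.call(what = "rbind", args = demo.list)
dim(demo.read)  # 12 rows (3 files x 4 rows) and 3 columns

unlink(demo.dir, recursive = TRUE)
```

rbind() requires every file to have the same column names, which is why this pattern suits batches of output files produced by the same pipeline.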
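25. What is parallel computing?
Runs on multiple CPU cores, solving tasks in parallel (simultaneously).
Ideal for independent tasks (i.e. Task 2 does not rely on the result from Task 1)
Run time is usually shorter than in serial computing, provided the tasks are large enough to outweigh the cost of starting and coordinating the workers
[Figure: Tasks 1 to 4 run at the same time along the time axis on a multi-core processor]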
26. Parallelised programming in R
Use it when you run a batch of similar tasks that are independent of each other
● On a supercomputer: call an R script multiple times from a shell script
● On a local computer: use the doParallel, parallel and foreach packages
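Before turning to foreach, the parallel package alone (it ships with R) is enough for a first experiment. The sketch below is an illustration under assumptions: slow.square() is a made-up task that sleeps to imitate work, and 2 workers are requested regardless of how many cores the machine has.

```r
library(parallel)

# A made-up independent task: wait 0.1 s, then return the square
slow.square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Serial version
serial.result <- sapply(1:4, slow.square)

# Parallel version on a cluster of 2 worker processes
cluster <- makeCluster(2)
parallel.result <- parSapply(cl = cluster, X = 1:4, FUN = slow.square)
stopCluster(cluster)

# Both approaches return the same values
identical(serial.result, parallel.result)
```

Starting the workers takes time of its own, so for tasks this small the parallel version is not necessarily faster overall.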
27. Syntax forms of multiple CPU cores and 1 foreach loop
Load required packages into R (Windows users)
# Load required packages
library(doParallel)
library(parallel)
library(foreach)
# Detect number of CPU cores in your computer
parallel::detectCores() # 4 cores detected
# Set up a backend with 2 CPU cores (4 detected minus 2)
cluster <- parallel::makeCluster(parallel::detectCores() - 2)
doParallel::registerDoParallel(cluster)
# foreach general form
foreach( i=1:10
        ,.combine = 'c'
        ,.packages = c("package.A","package.B")) %dopar% {
  Command.1
  Command.2
  ... }
# Stop the cluster when the work is finished
parallel::stopCluster(cluster)
28. Syntax forms of multiple CPU cores and 1 foreach loop
Find number of CPU cores in your computer
# Detect number of CPU cores in your computer
parallel::detectCores() # 4 cores detected
29. Syntax forms of multiple CPU cores and 1 foreach loop
Use 2 CPU cores for R, leave the other 2 for software running in the background
# Set up a backend with 2 CPU cores (4 detected minus 2)
cluster <- parallel::makeCluster(parallel::detectCores() - 2)
30. Syntax forms of multiple CPU cores and 1 foreach loop
Register the cluster
doParallel::registerDoParallel(cluster)
31. Syntax forms of multiple CPU cores and 1 foreach loop
Specify arguments in a foreach loop
# for loop
for (i in 1:10){
  Command1
  Command2
  ...
}
# foreach loop: same iterator, plus the .combine and .packages arguments
foreach( i=1:10
        ,.combine = 'c',.packages = c("package.A","package.B"))%dopar%{
  Command.1
  Command.2
  ... }
32. Syntax forms of multiple CPU cores and 1 foreach loop
Parallelise tasks with %dopar%
# for loop: the commands run serially
for (i in 1:10){
  Command1
  Command2
  ...
}
# foreach loop: %dopar% sends the iterations to the registered CPU cores
foreach( i=1:10
        ,.combine = 'c',.packages = c("package.A","package.B"))%dopar%{
  Command.1
  Command.2
  ... }
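The general form can be turned into a small runnable sketch. It assumes the doParallel and foreach packages are installed, requests a fixed 2 workers rather than detectCores() - 2 (which would give 0 workers on a machine that reports only 2 cores), and uses squaring as a placeholder task.

```r
library(doParallel)
library(foreach)

# Set up and register a backend with 2 worker processes
cluster <- parallel::makeCluster(2)
doParallel::registerDoParallel(cluster)

# Each iteration returns one value; .combine = 'c' joins them into a vector
squares <- foreach(i = 1:10, .combine = 'c') %dopar% {
  i^2
}

# Number of workers that %dopar% is currently using
n.workers <- foreach::getDoParWorkers()

# Release the workers when finished
parallel::stopCluster(cluster)

squares
n.workers
```

The .packages argument is omitted because the loop body uses base R only; it is needed when the body calls functions from packages that the workers must load.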
33. Syntax form of nested foreach loops
# nested foreach general form
foreach( i=1:10
        ,.combine = 'rbind')%:%
  foreach( j=1:5
          ,.combine = 'rbind'
          ,.packages = c("package.A","package.B"))%dopar%{
    command.1
    command.2
  }
%:% nests the inner foreach loop inside the outer one, so that both form a single stream of tasks
Parallelise computation using %dopar%
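A small nested example may help to see what the two .combine arguments do. In this sketch (assuming doParallel and foreach are installed; the multiplication task and the 2 workers are illustrative choices) the inner loop builds one row with 'c' and the outer loop stacks the rows with 'rbind'.

```r
library(doParallel)
library(foreach)

cluster <- parallel::makeCluster(2)
doParallel::registerDoParallel(cluster)

# Outer loop over i (rows), inner loop over j (columns)
products <- foreach(i = 1:3, .combine = 'rbind') %:%
  foreach(j = 1:4, .combine = 'c') %dopar% {
    i * j
  }

parallel::stopCluster(cluster)

dim(products)   # 3 rows, 4 columns
products
```

The result holds the same values as outer(1:3, 1:4), the vectorised way to build a multiplication table.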
35. Compare time used by vectorised & parallel computing
The testing tool: a birthday simulator
● A function that estimates, by simulation, the probability that at least 2 of the
N people in the same room share a birthday
● Returns N probabilities
The timing tool: system.time()
36. Compare time used by vectorised & parallel computing
The birthday simulation function
# Birthday problem simulator
pbirthdaysim <- function(n){
  ## n: number of people in the room
  ## ntests: number of simulations whose results are averaged
  ntests <- 100000
  pop <- 1:365
  ## anydup: TRUE if one simulated room of n people contains a shared birthday
  anydup <- function(i)
    any(duplicated(
      sample(pop, n, replace=TRUE)))
  ## Proportion of simulated rooms with a shared birthday
  sum(sapply(seq(ntests), anydup)) / ntests
}
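One way to sanity-check the simulator is to compare it with pbirthday() from the stats package, which computes the probability directly. The sketch below uses pbirthdaysim2(), a hypothetical variant introduced here that takes ntests as an argument so that a quick run with fewer simulations is possible; the seed and the room size of 23 are arbitrary choices.

```r
# Hypothetical variant of the simulator with an adjustable number of simulations
pbirthdaysim2 <- function(n, ntests = 2000) {
  pop <- 1:365
  anydup <- function(i) any(duplicated(sample(pop, n, replace = TRUE)))
  sum(sapply(seq(ntests), anydup)) / ntests
}

set.seed(2020)                      # make the simulation reproducible
simulated <- pbirthdaysim2(23)      # estimate for a room of 23 people
exact <- stats::pbirthday(23)       # direct calculation, about 0.507

c(simulated = simulated, exact = exact)
```

With only 2000 simulations the estimate typically differs from the exact value in the second decimal place; more simulations narrow the gap at the cost of run time.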
37. Compare the time used by vectorised and parallel computing
system.time({ # run birthday simulator using lapply()
})
system.time({ # run birthday simulator using sapply()
})
system.time({ # run birthday simulator using a for loop
})
system.time({ # run birthday simulator using 1 CPU core and foreach loop
})
system.time({ # run birthday simulator using all CPU cores and 1 foreach loop
})
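The comparison could be set up along the following lines. This is a sketch under assumptions, not the code used in the talk: pbirthdaysim.small() is a hypothetical variant with far fewer simulations so that it finishes quickly, the room sizes 1 to 20 are arbitrary, and 2 workers stand in for "all CPU cores". It assumes doParallel and foreach are installed.

```r
library(doParallel)
library(foreach)

# Hypothetical small variant of the simulator so the sketch runs quickly
pbirthdaysim.small <- function(n, ntests = 200) {
  pop <- 1:365
  anydup <- function(i) any(duplicated(sample(pop, n, replace = TRUE)))
  sum(sapply(seq(ntests), anydup)) / ntests
}
room.sizes <- 1:20

# Serial approaches
time.lapply <- system.time(res.lapply <- lapply(room.sizes, pbirthdaysim.small))
time.sapply <- system.time(res.sapply <- sapply(room.sizes, pbirthdaysim.small))
time.for <- system.time({
  res.for <- numeric(length(room.sizes))
  for (k in seq_along(room.sizes)) res.for[k] <- pbirthdaysim.small(room.sizes[k])
})

# foreach with 2 worker processes
cluster <- parallel::makeCluster(2)
doParallel::registerDoParallel(cluster)
time.foreach <- system.time(
  res.foreach <- foreach(n = room.sizes, .combine = 'c') %dopar% pbirthdaysim.small(n)
)
parallel::stopCluster(cluster)

# Collect the elapsed times (seconds) side by side
elapsed <- sapply( list( lapply = time.lapply, sapply = time.sapply
                        ,for.loop = time.for, foreach.2.cores = time.foreach)
                  ,function(t) t[["elapsed"]])
elapsed
```

With a workload this small, the cost of sending tasks to the workers can dominate, so the parallel timing may come out slower; the benefit is expected to appear as ntests and the number of room sizes grow.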
38. Timing serial and parallel programming
Testing conditions:
● Dell E7440 laptop (Intel Core i5-4300U, 2 x 1.9 - 2.9 GHz, Haswell)
● 1 million simulations
Function Elapsed time
lapply
sapply
For loop
Foreach + 1 CPU core
Foreach + all CPU cores detected
sessionInfo()
# R version 4.0.0 (2020-04-24)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
39. Don’t hesitate to ask yourself
● What is the time performance of my working code?
● Can I replace a loop with vectorised functions?
● If my computing tasks are independent, why haven’t I used multiple CPU cores and parallelised computing?
40. Q & A
My Qs:
How many CPU cores are detected in your computer?
What are the elapsed times running the birthday simulator in your R?
Your Qs?
41. Serial and parallel processing in the real world
https://slideplayer.com/slide/7066858/
42. What is a CPU core?
A core, or CPU core, is the actual hardware component.
It is the "brain" of a CPU. It receives instructions, and performs calculations, or
operations, to satisfy those instructions. A CPU can have multiple cores.
A processor with two cores is called a dual-core processor; with four cores, a
quad-core; six cores, hexa-core; eight cores, octa-core.
As of 2019, the majority of consumer CPUs feature between 2 and 12 cores.
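The number reported by detectCores() may count logical cores (hardware threads) rather than physical cores, which is possibly why a dual-core laptop can report 4. A minimal sketch of how to ask for both counts follows; whether logical = FALSE is honoured depends on the operating system, and either call can return NA when the count cannot be determined.

```r
library(parallel)

# Logical cores: what the operating system schedules work on
logical.cores <- detectCores(logical = TRUE)

# Physical cores, where the platform supports reporting them
physical.cores <- detectCores(logical = FALSE)

c(logical = logical.cores, physical = physical.cores)
```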