1. using R and High Performance Computers
an overview by Dave Hiltbrand
2. talking points
● why HPC?
● R environment tips
● staging R scripts for HPC
● purrr::map functions
3. what to do if the computation is too big for your desktop/laptop?
• a common user question:
– i have an existing R pipeline for my research work, but the data is growing too big. now my R program runs for days (weeks) to finish or simply runs out of memory.
• 3 Strategies
– move to bigger hardware
– advanced libraries/C++
– implement code using parallel packages
4. trends in HPC
➔ processors not getting faster
➔ increase performance => cram more cores on each chip
➔ requires reducing clock speed (power + heat)
➔ single-threaded applications will run SLOWER on these new resources; must start thinking in parallel
https://www.quora.com/Why-havent-CPU-clock-speeds-increased-in-the-last-5-years
5. strategy 1: powerful hardware
Stampede2 - HPC
● KNL - 68 cores (4x hyperthreading 272)/ 96GB mem/ 4200 nodes
● SKX - 48 cores (2x hyperthreading 96)/ 192 GB mem/ 1736 nodes
Maverick - Vis
● vis queue: 20 cores/ 256 GB mem/ 132 nodes
○ RStudio/ Jupyter Notebooks
● gpu queue: 132 NVIDIA Tesla K40 GPUs
Wrangler - Data
● Hadoop/Spark
● reservations last up to a month
6. allocations
open to the national researcher community
do you work in industry?
XSEDE
● national organization providing computational resources; ~90% of the cycles on Stampede2 are allocated through XSEDE
tip
if you need more power, all you have to do is ask:
https://portal.xsede.org/allocations/resource-info
7. HPCs are:
➔ typically run Linux
➔ more command-line driven
➔ daunting to Windows-only users
➔ RStudio helps the transition
8. login nodes
➔ always log into the login nodes
➔ shared nodes with limited resources
➔ ok to edit, compile, move files
➔ for R, ok to install packages from login nodes
➔ !!! don't run R scripts !!!
compute nodes
➔ dedicated nodes for each job
➔ only accessible via a job scheduler
➔ once you have a job running on a node you can ssh into the node
9. access
R command line
● useful for installing packages on the login nodes (see the sketch after this list)
● using interactive development jobs you can request compute resources, log in straight to a compute node, and use R via the command line
RStudio
● availability depends on the structure of the HPC cluster
● at TACC the window to use RStudio through the visualization portal is only 4 hours
batch jobs
● best method to use R on HPCs
● relies on a job scheduler to fill your request
● can run multiple R scripts on multiple compute nodes
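as a minimal sketch, a package can be installed into a personal library straight from the login node; the path ~/R/library and the package name here are only examples:

# run on a login node before submitting jobs
dir.create( "~/R/library", recursive = TRUE, showWarnings = FALSE )
install.packages( "data.table", lib = "~/R/library", repos = "https://cran.r-project.org" )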
10. sample batch script
#!/bin/bash
#----------------------------------------------------
#
#----------------------------------------------------
#SBATCH -J myjob              # Job name
#SBATCH -o myjob.o%j          # Name of stdout output file
#SBATCH -e myjob.e%j          # Name of stderr error file
#SBATCH -p skx-normal         # Queue (partition) name
#SBATCH -N 1                  # Total # of nodes (must be 1 for serial)
#SBATCH -n 1                  # Total # of mpi tasks (should be 1 for serial)
#SBATCH -t 01:30:00           # Run time (hh:mm:ss)
#SBATCH --mail-user=myname@myschool.edu
#SBATCH --mail-type=all       # Send email at begin and end of job
#SBATCH -A myproject          # Allocation name (req'd if you have more than 1)

# Other commands must follow all #SBATCH directives...
module list
pwd
date

# Launch serial code... (stdout to output.Rout, stderr to error.Rerr)
Rscript ./my_analysis.R > output.Rout 2> error.Rerr
# ---------------------------------------------------
11. .libPaths() and .Rprofile
using your Rprofile.site or .Rprofile files along with the .libPaths() command will allow you to install packages in your user folder and have them load up when you start R on the HPC.
in R, a library is the location on disk where you install your packages. R creates a different library for each dot-version of R itself.
when R starts, it performs a series of steps to initialize the session. you can modify the startup sequence by changing the contents in a number of locations. the following sequence is somewhat simplified:
● first, R reads the file Rprofile.site in the R_HOME/etc folder, where R_HOME is the location where you installed R.
○ for example, this file could live at C:\R\R-3.2.2\etc\Rprofile.site.
○ making changes to this file affects all R sessions that use this version of R.
○ this might be a good place to define your preferred CRAN mirror, for example.
● next, R reads the file ~/.Rprofile in the user's home folder.
● lastly, R reads the file .Rprofile in the project folder.
tip
i like to make a .Rprofile for each GitHub project repo which loads my most commonly used libraries by default (a sketch follows below).
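a minimal sketch of such a startup file, assuming the personal library lives at ~/R/library and that dplyr and purrr are just examples of commonly used packages:

# ~/.Rprofile (or a per-project .Rprofile)
# prepend a personal library so install.packages() and library() find it
.libPaths( c( "~/R/library", .libPaths() ) )

# load commonly used packages at startup (adjust to your own list)
if ( interactive() ) {
  suppressMessages( {
    library( dplyr )
    library( purrr )
  } )
}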
12. going parallel
often you need to convert your code into parallel form to get the most out of HPC. the foreach and doMC packages will let you convert loops from sequential operation to parallel. if you have a really complex data set, you can even use multiple nodes with the snow package.
require( foreach )
require( doMC )
registerDoMC( cores = 4 )  # register the multicore backend (core count is an example)
result <- foreach( i = 1:10, .combine = c ) %dopar% {
  myProc()  # myProc() stands in for your own function
}
require( foreach )
require( doSNOW )
# get backend hostnames
hostnames <- scan( "nodelist.txt", what = "", sep = "\n" )
# set reps to match core count
num.cores <- 4
hostnames <- rep( hostnames, each = num.cores )
cluster <- makeSOCKcluster( hostnames )
registerDoSNOW( cluster )
result <- foreach( i = 1:10, .combine = c ) %dopar% {
  myProc()
}
stopCluster( cluster )
13. profiling
➔ simple procedure checks with the tictoc package
➔ use more advanced packages like microbenchmark for multiple procedures
➔ for an easy-to-read graphic output, use the profvis package to create flamegraphs
checkpointing
➔ when writing your script, think of procedure runtime
➔ you can save objects in your workflow as a checkpoint (a sketch follows below)
◆ library(readr)
◆ write_rds(obj, "obj.rds")
➔ if you want to run post hoc analysis it makes it easier to have all the parts
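a minimal sketch of timing a step with tictoc and checkpointing its result with readr; slow_step() and model_fit.rds are placeholders, not part of any package:

library( tictoc )
library( readr )

tic( "model fit" )                 # start a named timer
fit <- slow_step( my.data )        # slow_step() stands in for your own long-running procedure
toc()                              # print the elapsed time for "model fit"

write_rds( fit, "model_fit.rds" )  # checkpoint the result to disk

# later, e.g. in a post hoc analysis script:
fit <- read_rds( "model_fit.rds" )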
14. always start small
i'm quick
● build a toy dataset
● find your typos
● easier to rerun
i'm slow
● run the real data
● request the right resources
● once you run a small dataset you can benchmark the resources needed
15. if you don't already, you need to Git
Git is a command-line tool, but the center around which all things involving Git revolve is the hub, GitHub.com, where developers store their projects and network with like-minded people.
use RStudio and all the advanced IDE tools on your local machine, then push and pull to GitHub to run your job. RStudio features built-in version control (VCS) support.
track changes in your analysis: git lets you go back in time to a previous version of your file.
17. the purrr map functions are an optional replacement for the lapply functions. they are not technically faster (although the speed comparison is in nanoseconds).
the main advantage is to use uniform syntax with other tidyverse applications such as dplyr, tidyr, readr, and stringr, as well as the helper functions.
map( .x, .f, ... )
map( vector_or_list_input, function_to_apply, optional_other_stuff )
modify( .x, .f, ... )
ex. modify_at( my.data, 1:5, as.numeric )
https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf
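a short sketch of the syntax with made-up data; the typed variants like map_dbl() return atomic vectors instead of lists:

library( purrr )

scores <- list( a = c( 80, 90, 100 ), b = c( 60, 70 ), c = 85 )

map( scores, mean )                         # returns a list
map_dbl( scores, mean )                     # returns a double vector
map_dbl( scores, ~ max( .x ) - min( .x ) )  # formula shorthand: .x is the current element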
18. map in parallel
another key advantage of purrr is the use of lambda functions, which have been crucial for analysis involving multiple columns of a data frame. using the same basic syntax we create an anonymous function which maps over many lists simultaneously.
# my.data is assumed to be a data frame with numeric columns var3 and var4
my.data %<>% mutate( var5 = map2_dbl( .$var3, .$var4, ~ ( .x + .y ) / 2 ) )
my.data %<>% mutate( var6 = pmap_dbl( list( .$var3, .$var4, .$var5 ), ~ ( ..1 + ..2 + ..3 ) / 3 ) )
tip
using the grammar of graphics, data, and lists through tidyverse packages can build a strong workflow
19. closing
unburden your personal device
➔ learn basic linux cli
using batch job submissions gives you the most flexibility
➔ profile/checkpoint/test
resources are not without limits
➔ share your code
don't hold onto code until it's perfect. use GitHub and get feedback early and often.