using R and High Performance Computers
an overview by Dave Hiltbrand
talking points
● why HPC?
● R environment tips
● staging R scripts for HPC
● purrr::map functions
what to do if the computation is too big for your desktop/laptop?
• a common user question:
– i have an existing R pipeline for my research work. but the data is
growing too big. now my R program runs for days (weeks) to finish or
simply runs out of memory.
• 3 Strategies
– move to bigger hardware
– advanced libraries/C++
– implement code using parallel packages
trends in HPC
➔ processors not getting faster
➔ increase performance => cram
more cores on each chip
➔ requires reducing clock speed
(power + heat)
➔ single-threaded applications
will run SLOWER on these new
resources, must start thinking in
parallel
https://www.quora.com/Why-havent-CPU-clock-speeds-increased-in-the-last-5-years
strategy 1: powerful hardware
Stampede2 - HPC
● KNL - 68 cores (4x hyperthreading 272)/ 96GB mem/ 4200 nodes
● SKX - 48 cores (2x hyperthreading 96)/ 192 GB mem/ 1736 nodes
Maverick - Vis
● vis queue: 20 cores/ 256 GB mem/ 132 nodes
○ RStudio/ Jupyter Notebooks
● gpu queue: 132 NVIDIA Tesla K40 GPUs
Wrangler - Data
● Hadoop/Spark
● reservations last up to a month
allocations
open to national researcher community
do you work in industry?
XSEDE
● national organization that allocates computational
resources, including ~90% of the cycles on Stampede2
tip
if you need more power
all you have to do is ask
https://portal.xsede.org/allocations/resource-info
HPCs are:
➔ typically run with linux
➔ more command line
driven
➔ daunting to Windows-only
users
➔ RStudio helps the
transition
login nodes
➔ always log into the login nodes
➔ shared nodes with limited
resources
➔ ok to edit, compile, move files
➔ for R, ok to install packages
from login nodes
➔ !!! don’t run R Scripts!!!
compute nodes
➔ dedicated nodes for each job
➔ only accessible via a job
scheduler
➔ once you have a job running on
a node you can ssh into the
node
access
R command line
● useful to install packages on login nodes
● using interactive development jobs you can request compute resources,
log in straight to a compute node, and use R via the command line
RStudio
● availability depends on the structure of the HPC cluster
● at TACC the window to use RStudio is only 4 hours through the visual
portal
batch Jobs
● best method to use R on HPCs
● relies on a job scheduler to fill your request
● can run multiple R scripts on multiple compute nodes
sample batch script
#!/bin/bash
#----------------------------------------------------
#
#----------------------------------------------------
#SBATCH -J myjob # Job name
#SBATCH -o myjob.o%j # Name of stdout output file
#SBATCH -e myjob.e%j # Name of stderr error file
#SBATCH -p skx-normal # Queue (partition) name
#SBATCH -N 1 # Total # of nodes (must be 1 for serial)
#SBATCH -n 1 # Total # of mpi tasks (should be 1 for serial)
#SBATCH -t 01:30:00 # Run time (hh:mm:ss)
#SBATCH --mail-user=myname@myschool.edu
#SBATCH --mail-type=all # Send email at begin and end of job
#SBATCH -A myproject # Allocation name (req'd if you have more than 1)
# Other commands must follow all #SBATCH directives...
module list
pwd
date
# Launch serial code: stdout goes to output.Rout, stderr to error.Rerr
Rscript ./my_analysis.R > output.Rout 2> error.Rerr
# ---------------------------------------------------
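a sketch of what a script like my_analysis.R might contain so that it behaves well under a scheduler, assuming it takes an optional input path on the command line. the default file name input.csv is hypothetical; SLURM_JOB_ID and SLURM_CPUS_ON_NODE are environment variables that Slurm sets inside a job, and the fallbacks let the same script run on a laptop.

```r
# my_analysis.R: hypothetical skeleton for a script run with Rscript in a batch job

# arguments given after the script name, e.g. Rscript my_analysis.R data.csv
args <- commandArgs(trailingOnly = TRUE)
input_file <- if (length(args) >= 1) args[[1]] else "input.csv"  # assumed default

# Slurm exports these inside a job; fall back to safe values elsewhere
job_id  <- Sys.getenv("SLURM_JOB_ID", unset = "local")
n_cores <- suppressWarnings(as.integer(Sys.getenv("SLURM_CPUS_ON_NODE", unset = "1")))
if (is.na(n_cores) || n_cores < 1L) n_cores <- 1L

# message() writes to stderr, print() and cat() write to stdout
message("job ", job_id, ": reading ", input_file, " with ", n_cores, " core(s)")
```

reading the core count from the environment instead of hard-coding it keeps the script in step with whatever the job actually requested.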
.libPaths() and .Rprofile
using your Rprofile.site or .Rprofile files along with
the .libPaths() command will allow you to install
packages in your user folder and have them load up
when you start R on the HPC.
in R, a library is the location on disk where you install your packages. R
creates a different library for each dot-version of R itself
when R starts, it performs a series of steps to initialize the session. you can
modify the startup sequence by changing the contents in a number of
locations.
the following sequence is somewhat simplified:
● first, R reads the file Rprofile.site in the R_HOME/etc folder,
where R_HOME is the location where you installed R.
○ for example, this file could live at
C:\R\R-3.2.2\etc\Rprofile.site.
○ making changes to this file affects all R sessions
that use this version of R.
○ this might be a good place to define your preferred
CRAN mirror, for example.
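● next, R looks for a user profile: the file .Rprofile in the current
(project) folder or, if there is none, ~/.Rprofile in the user's home folder.
● only the first of these that is found gets read, so a project .Rprofile
replaces the one in your home folder rather than adding to it.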
tip
i like to make a .Rprofile
for each GitHub project
repo which loads my
most commonly used
libraries by default.
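a minimal sketch of a .Rprofile along these lines, assuming you want a personal package library under your home folder. the helper name use_user_library, the path ~/R/library, and the stats4 package are illustrative choices, not something the HPC requires.

```r
# .Rprofile sketch: put a personal library first on the search path

use_user_library <- function(path) {
  # create the folder on first use; .libPaths() drops folders that do not exist
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(path, .libPaths()))
  invisible(.libPaths())
}

# assumed location; any folder you can write to works
use_user_library(file.path(Sys.getenv("HOME"), "R", "library"))

# pick a CRAN mirror once so install.packages() does not prompt
options(repos = c(CRAN = "https://cloud.r-project.org"))

# packages named in defaultPackages are attached at startup after this file is read;
# "stats4" is only a stand-in for the packages you use most
options(defaultPackages = c(getOption("defaultPackages"), "stats4"))
```

with the personal library first in .libPaths(), install.packages() run from a login node writes there by default, and batch jobs started later find the same packages.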
going parallel
often you need to convert your code
into parallel form to get the most out
of HPC. the foreach and doMC
packages will let you convert loops
from sequential operation to parallel.
you can even use multiple nodes if you
have a really complex data set with the
snow package.
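library( foreach )
library( doMC )
# register the multicore backend; without this %dopar% runs sequentially
registerDoMC( cores = 4 )
# myProc() is a placeholder for your own function
result <- foreach( i = 1:10, .combine = c ) %dopar% {
  myProc()
}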
library( foreach )
library( doSNOW )
# get backend hostnames, one per line in nodelist.txt
hostnames <- scan( "nodelist.txt", what = "", sep = "\n" )
# repeat each hostname to match the core count per node
num.cores <- 4
hostnames <- rep( hostnames, each = num.cores )
cluster <- makeSOCKcluster( hostnames )
registerDoSNOW( cluster )
result <- foreach( i = 1:10, .combine = c ) %dopar% {
  myProc()
}
stopCluster( cluster )
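the same loop-to-parallel conversion can be sketched without extra packages. this is an illustration using the parallel package that ships with R, not the foreach approach above; slow_square is a made-up stand-in for real work, and mclapply forks processes, so it only runs in parallel on Linux and macOS.

```r
library(parallel)

# made-up stand-in for an expensive, independent computation
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

inputs <- 1:10

# sequential version
seq_result <- unlist(lapply(inputs, slow_square))

# parallel version on one node: same call shape, work split across forked processes
n_cores <- 2L  # assumed; on a compute node match this to the cores you requested
par_result <- unlist(mclapply(inputs, slow_square, mc.cores = n_cores))

# each task is independent, so both versions should agree
print(identical(seq_result, par_result))
```

checking the parallel result against the sequential one on a small input is a cheap way to catch tasks that secretly depend on each other.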
profiling
➔ simple procedure checks with
tictoc package
➔ use more advanced packages
like microbenchmark for
multiple procedures
➔ for easy-to-read graphical
output use the profvis package
to create flame graphs
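a quick timing check needs nothing beyond base R. the sketch below uses system.time rather than tictoc or microbenchmark, and loop_mean is a deliberately slow function invented for the comparison.

```r
# deliberately slow: accumulate the sum one element at a time
loop_mean <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total / length(x)
}

set.seed(1)
x <- runif(1e6)

# system.time() reports user, system and elapsed seconds for one expression
t_loop <- system.time(m_loop <- loop_mean(x))
t_vec  <- system.time(m_vec  <- mean(x))

# compare wall-clock time of the two procedures
print(c(loop = t_loop[["elapsed"]], vectorised = t_vec[["elapsed"]]))

# a faster version is only useful if it still gives the same answer
stopifnot(isTRUE(all.equal(m_loop, m_vec)))
```

a single run like this is noisy; for small differences a package that repeats the measurement many times gives a steadier picture.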
checkpointing
➔ when writing your script think of
procedure runtime
➔ you can save objects in your
workflow as a checkpoint
◆ library(readr)
◆ write_rds(obj, "obj.rds")
➔ if you want to run post hoc
analysis it makes it easier to
have all the parts
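one way to sketch a checkpoint is a small wrapper that loads a saved object if it exists and computes it otherwise. this uses base R's saveRDS and readRDS in place of readr's write_rds; the checkpoint helper, the file name, and the lm fit on mtcars are hypothetical stand-ins for a long-running step.

```r
# return the saved result if the checkpoint file exists, otherwise compute and save it
checkpoint <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# assumed location; on an HPC this would sit in your work or scratch folder
ckpt_file <- file.path(tempdir(), "fit_checkpoint.rds")

# first call runs the (stand-in) expensive step and writes the checkpoint
fit <- checkpoint(ckpt_file, function() lm(mpg ~ wt, data = mtcars))

# second call never evaluates its function: the result comes from disk
fit_again <- checkpoint(ckpt_file, function() stop("should not recompute"))

print(all.equal(coef(fit), coef(fit_again)))
```

if a job hits its time limit, resubmitting the same script then skips every step whose checkpoint file is already on disk.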
always start small
i’m quick:
build a toy dataset
find your typos
easier to rerun
i’m slow:
run the real data
request the right resources
once you run a small
dataset you can benchmark
resources needed
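a sketch of that workflow: sample a toy dataset, time the analysis on it, and scale up. full_data and run_analysis are invented stand-ins, and the estimate assumes runtime grows linearly with the number of rows, which many analyses do not satisfy.

```r
set.seed(42)

# stand-in for the real data; in practice this would be read from disk
full_data <- data.frame(x = rnorm(1e5), y = rnorm(1e5))

# toy dataset: a 1% random sample of rows
toy_data <- full_data[sample(nrow(full_data), 1000), ]

# hypothetical analysis step
run_analysis <- function(df) coef(lm(y ~ x, data = df))

elapsed <- system.time(toy_fit <- run_analysis(toy_data))[["elapsed"]]

# rough estimate only, under the linear-scaling assumption stated above
scale_factor <- nrow(full_data) / nrow(toy_data)
estimated_seconds <- elapsed * scale_factor
message("toy run: ", elapsed, " s; estimated full run: ", estimated_seconds, " s")
```

treat the number as a lower bound and pad it generously when choosing the run time to request from the scheduler.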
if you don’t already you need to Git
Git is a command-line tool,
but the center around
which all things involving
Git revolve is the hub—
GitHub.com—where
developers store their
projects and network with
like minded people.
use RStudio and all the advanced IDE tools on your local machine, then
push and pull to GitHub to run your job.
RStudio features built-in version control (vcs).
track changes in your analysis: Git lets you go back in time to a
previous version of your file.
Purrr Package
Map functions apply a function iteratively to each
element of a list or vector
the purrr map functions are an optional replacement to the
lapply functions. they are not technically faster ( although
the speed comparison is in nanoseconds ).
the main advantage is to use uniform syntax with other
tidyverse applications such as dplyr, tidyr, readr, and stringr
as well as the helper functions.
map( .x, .f, … )
map( vector_or_list_input, function_to_apply, optional_other_stuff )
modify( .x, .f, …)
ex. modify_at( my.data, 1:5, as.numeric)
https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf
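a small sketch of these calls on made-up data, assuming the purrr package is installed. scores and df are invented for the illustration.

```r
library(purrr)

# made-up input: a named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# map() returns a list, just like lapply()
means_list <- map(scores, mean)
means_base <- lapply(scores, mean)

# the typed variants return an atomic vector instead of a list
means_dbl <- map_dbl(scores, mean)

# modify_at() changes only the selected elements and keeps the container type,
# so a data frame stays a data frame
df <- data.frame(id = c("1", "2"), value = c("3.5", "4.5"), label = c("x", "y"),
                 stringsAsFactors = FALSE)
df_num <- modify_at(df, c("id", "value"), as.numeric)

str(df_num)
```

the typed variants fail loudly when a result is not of the promised type, which helps in unattended batch jobs where a silent type change would otherwise surface much later.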
map in parallel
another key advantage from purrr is use of lambda
functions which has been crucial for analysis involving
multiple columns of a data frame. using the same
basic syntax we create an anonymous function which
maps over many lists simultaneously
my.data %<>% mutate( var5 = map2_dbl( .$var3, .$var4, ~ ( .x + .y ) / 2 ) )
my.data %<>% mutate( var6 = pmap_dbl( list( .$var3, .$var4, .$var5 ), ~ ( ..1 + ..2 + ..3 ) / 3 ) )
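the same two calls can be tried on a toy data frame without the pipe. this sketch assumes purrr is installed; my_data and its values are invented, and plain column assignment stands in for mutate.

```r
library(purrr)

# toy data frame with the column names used above
my_data <- data.frame(var3 = c(1, 2, 3), var4 = c(3, 4, 5))

# map2_dbl() walks two vectors in step; .x and .y are the paired elements
my_data$var5 <- map2_dbl(my_data$var3, my_data$var4, ~ (.x + .y) / 2)

# pmap_dbl() walks any number of vectors in step; ..1, ..2, ..3 are positional
my_data$var6 <- pmap_dbl(
  list(my_data$var3, my_data$var4, my_data$var5),
  ~ (..1 + ..2 + ..3) / 3
)

print(my_data)
```

here "in parallel" means the inputs are stepped through together, row by row; the work itself still runs on a single core.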
tip
using the grammar of
graphics, data, and lists
through tidyverse
packages can build a
strong workflow
closing
unburden your personal device
➔ learn basic linux cli
using batch job submissions gives you
the most flexibility
➔ profile/checkpoint/test
resources are not without limits
➔ share your code
don’t hold onto code until it’s perfect.
use GitHub and get feedback early and
often
$ questions -h
refs:
1. https://jennybc.github.io/purrr-tutorial/
2. https://portal.tacc.utexas.edu/user-guides/stampede2#running-jobs-on-the-stampede2-compute-nodes
3. https://learn.tacc.utexas.edu/mod/page/view.php?id=24
4. http://blog.revolutionanalytics.com/2015/11/r-projects.html

Using R on High Performance Computers

  • 1. using R and High Performance Computers an overview by Dave Hiltbrand
  • 2. talking points ● why HPC? ● R environment tips ● staging R scripts for HPC ● purrr::map functions
  • 3. what to do if the computation is too big for your desktop/laptop ? • a common user question: – i have an existing R pipeline for my research work. but the data is growing too big. now my R program runs for days (weeks) to finish or simply runs out of memory. • 3 Strategies – move to bigger hardware – advanced libraries/C++ – implement code using parallel packages
  • 4. trends in HPC ➔ processors not getting faster ➔ increase performance => cram more cores on each chip ➔ requires reducing clock speed (power + heat) ➔ single-threaded applications will run SLOWER on these new resources, must start thinking in parallel https://www.quora.com/Why-havent-CPU-clock-speeds-increased-in-the-last-5-years
  • 5. strategy 1: powerful hardware Stampede2 - HPC ● KNL - 68 cores (4x hyperthreading 272)/ 96 GB mem/ 4200 nodes ● SKX - 48 cores (2x hyperthreading 96)/ 192 GB mem/ 1736 nodes Maverick - Vis ● vis queue: 20 cores/ 256 GB mem/ 132 nodes ○ RStudio/ Jupyter Notebooks ● gpu queue: 132 NVIDIA Tesla K40 GPUs Wrangler - Data ● Hadoop/Spark ● reservations last up to a month
  • 6. allocations open to the national research community do you work in industry? XSEDE ● national organization that allocates computation resources, including ~90% of the cycles on Stampede2 tip: if you need more power, all you have to do is ask https://portal.xsede.org/allocations/resource-info
  • 7. HPCs are: ➔ typically run with linux ➔ more command line driven ➔ daunting to Windows only users ➔ RStudio helps the transition
  • 8. login nodes ➔ always log into the login nodes ➔ shared nodes with limited resources ➔ ok to edit, compile, move files ➔ for R, ok to install packages from login nodes ➔ !!! don’t run R Scripts!!! compute nodes ➔ dedicated nodes for each job ➔ only accessible via a job scheduler ➔ once you have a job running on a node you can ssh into the node
  • 9. access R command line ● useful to install packages on login nodes ● using interactive development jobs you can request compute resources to log in straight to a compute node and use R via the command line RStudio ● availability depends on the structure of the HPC cluster ● at TACC the window to use RStudio is only 4 hours through the visual portal batch jobs ● best method to use R on HPCs ● relies on a job scheduler to fill your request ● can run multiple R scripts on multiple compute nodes
  • 10. sample batch script
    #!/bin/bash
    #----------------------------------------------------
    #
    #----------------------------------------------------
    #SBATCH -J myjob           # Job name
    #SBATCH -o myjob.o%j       # Name of stdout output file
    #SBATCH -e myjob.e%j       # Name of stderr error file
    #SBATCH -p skx-normal      # Queue (partition) name
    #SBATCH -N 1               # Total # of nodes (must be 1 for serial)
    #SBATCH -n 1               # Total # of mpi tasks (should be 1 for serial)
    #SBATCH -t 01:30:00        # Run time (hh:mm:ss)
    #SBATCH --mail-user=myname@myschool.edu
    #SBATCH --mail-type=all    # Send email at begin and end of job
    #SBATCH -A myproject       # Allocation name (req'd if you have more than 1)

    # Other commands must follow all #SBATCH directives...
    module list
    pwd
    date

    # Launch serial code (stdout to output.Rout, stderr to error.Rerr)...
    Rscript ./my_analysis.R > output.Rout 2> error.Rerr
    # ---------------------------------------------------
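the batch script above hands control to my_analysis.R. a minimal sketch of how such a script might pick up information from the scheduler, assuming the job runs under Slurm and that Slurm's SLURM_JOB_ID and SLURM_CPUS_ON_NODE environment variables are set inside the job. the helper slurm_env_int(), the placeholder analysis, and the output file name are hypothetical.

```r
# Sketch of a script launched by the batch job (names here are illustrative).

# Read an integer from an environment variable, falling back to a default
# when the variable is unset (for example when testing on a laptop).
slurm_env_int <- function(name, default) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) return(default)
  as.integer(value)
}

job_id  <- Sys.getenv("SLURM_JOB_ID", unset = "local")
n_cores <- slurm_env_int("SLURM_CPUS_ON_NODE", default = 1L)

message("job ", job_id, " sees ", n_cores, " core(s)")

# Placeholder analysis: replace with the real work.
result <- summary(rnorm(1000))

# Tag the output with the job id so reruns do not overwrite each other.
# tempdir() keeps this sketch free of side effects; on a cluster you would
# write to your work or scratch directory instead.
out_file <- file.path(tempdir(), paste0("result_", job_id, ".rds"))
saveRDS(result, out_file)
```

because of the fallbacks, the same script can be tried outside the scheduler before it is submitted.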
  • 11. .libPaths() and .Rprofile
    using your Rprofile.site or .Rprofile files along with the .libPaths() command will allow you to install packages in your user folder and have them load up when you start R on the HPC.
    in R, a library is the location on disk where you install your packages. R creates a different library for each dot-version of R itself.
    when R starts, it performs a series of steps to initialize the session. you can modify the startup sequence by changing the contents in a number of locations. the following sequence is somewhat simplified:
    ● first, R reads the file Rprofile.site in the R_HOME/etc folder, where R_HOME is the location where you installed R.
      ○ for example, this file could live at C:\R\R-3.2.2\etc\Rprofile.site.
      ○ making changes to this file affects all R sessions that use this version of R.
      ○ this might be a good place to define your preferred CRAN mirror, for example.
    ● next, R looks for a file named .Rprofile in the current working directory (the project folder).
    ● if there is none there, R reads the file ~/.Rprofile in the user's home folder instead. only one of the two is run.
    tip: i like to make a .Rprofile for each GitHub project repo which loads my most commonly used libraries by default.
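a minimal sketch of what a user-level .Rprofile along these lines might contain. the folder name is hypothetical, tempdir() is only a stand-in for a personal folder under your home or work directory, and the CRAN mirror is one common choice rather than a requirement.

```r
# Sketch of the contents of a .Rprofile (paths here are illustrative).
# tempdir() stands in for a personal folder such as one under your home
# directory, so that running this sketch leaves nothing behind.
user_lib <- file.path(tempdir(), "R-library-demo")
dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)

# Put the personal library first so install.packages() targets it and
# library() searches it before the system-wide library.
.libPaths(c(user_lib, .libPaths()))

# Pick a CRAN mirror once so installs never prompt for one.
options(repos = c(CRAN = "https://cloud.r-project.org"))

print(.libPaths())
```

.libPaths() silently drops folders that do not exist, which is why the sketch creates the folder before registering it.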
  • 12. going parallel
    often you need to convert your code into parallel form to get the most out of HPC. the foreach and doMC packages will let you convert loops from sequential operation to parallel. with the snow package (through doSNOW) you can even use multiple nodes if you have a really complex data set.

    # single node: foreach with the doMC backend
    library( foreach )
    library( doMC )
    # register the backend first, otherwise %dopar% runs sequentially (4 cores is an example)
    registerDoMC( cores = 4 )
    # myProc() is a placeholder for your own function
    result <- foreach( i = 1:10, .combine = c ) %dopar% { myProc() }

    # multiple nodes: foreach with the doSNOW backend
    library( foreach )
    library( doSNOW )
    # Get backend hostnames
    hostnames <- scan( "nodelist.txt", what = "", sep = "\n" )
    # Set reps to match core count
    num.cores <- 4
    hostnames <- rep( hostnames, each = num.cores )
    cluster <- makeSOCKcluster( hostnames )
    registerDoSNOW( cluster )
    result <- foreach( i = 1:10, .combine = c ) %dopar% { myProc() }
    stopCluster( cluster )
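the same loop-to-parallel idea can be tried without installing anything. this sketch swaps in mclapply() from the base parallel package in place of foreach and doMC; like doMC it relies on forking, so it stays on a single node. slow_square() is a hypothetical stand-in for an expensive step, and the core count of 2 is an assumption.

```r
library(parallel)  # ships with R, no installation needed

# Stand-in for an expensive per-item computation (hypothetical).
slow_square <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# Forking is unavailable on Windows, where mclapply() only accepts one core.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

sequential <- lapply(1:10, slow_square)
forked     <- mclapply(1:10, slow_square, mc.cores = n_cores)

# Same answers either way; only the wall-clock time should differ.
stopifnot(identical(sequential, forked))
result <- unlist(forked)
```

checking the parallel result against the sequential one on a small input is a cheap way to catch mistakes before scaling up.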
  • 13. profiling
    ➔ simple procedure checks with the tictoc package
    ➔ use more advanced packages like microbenchmark for multiple procedures
    ➔ for an easy-to-read graphic output use the profvis package to create flamegraphs
    checkpointing
    ➔ when writing your script think of procedure runtime
    ➔ you can save objects in your workflow as a checkpoint
      ◆ library(readr)
      ◆ write_rds(obj, "obj.rds")
    ➔ if you want to run post hoc analysis it makes it easier to have all the parts
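a minimal sketch of timing plus checkpointing using only base R, with system.time() standing in for tictoc and saveRDS()/readRDS() standing in for readr's write_rds()/read_rds(). with_checkpoint() is a hypothetical helper, not a function from any package, and the model fit is only a placeholder for a long step.

```r
# Run `compute()` only if no checkpoint exists yet; otherwise reload the saved
# object. `with_checkpoint()` is a hypothetical helper, not a package function.
with_checkpoint <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

checkpoint <- file.path(tempdir(), "fit_checkpoint.rds")
unlink(checkpoint)  # start clean for the demonstration

slow_step <- function() {
  Sys.sleep(0.5)  # stand-in for a long computation
  lm(dist ~ speed, data = cars)
}

# The first call computes and saves; the second call only reads the file.
first  <- system.time(fit1 <- with_checkpoint(checkpoint, slow_step))
second <- system.time(fit2 <- with_checkpoint(checkpoint, slow_step))

print(first[["elapsed"]])
print(second[["elapsed"]])
```

if a job hits its time limit after the slow step, resubmitting the same script picks up from the saved object instead of starting over.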
  • 14. always start small
    i’m quick / i’m slow
    build a toy dataset
    ➔ find your typos
    ➔ easier to rerun
    run the real data
    ➔ request the right resources
    ➔ once you run a small dataset you can benchmark the resources needed
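a minimal sketch of the toy-dataset step, assuming the full data fits in a data frame. full_data, analyse(), and the sample size of 1000 rows are hypothetical, and the runtime extrapolation assumes cost grows linearly with the number of rows, which many analyses do not satisfy.

```r
set.seed(42)  # make the toy subset reproducible

# Stand-in for the full dataset (hypothetical): 100,000 rows.
full_data <- data.frame(
  x = rnorm(1e5),
  g = sample(letters, 1e5, replace = TRUE)
)

# Toy dataset: a random sample of 1000 rows.
toy_rows <- sample(nrow(full_data), size = 1000)
toy_data <- full_data[toy_rows, ]

# The analysis under test (illustrative): a per-group mean.
analyse <- function(df) tapply(df$x, df$g, mean)

toy_time <- system.time(toy_result <- analyse(toy_data))[["elapsed"]]

# Crude first guess at the full runtime, assuming linear scaling in rows.
estimated_full_time <- toy_time * nrow(full_data) / nrow(toy_data)
```

timing two or three toy sizes instead of one gives a better sense of how the runtime really scales before you request resources.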
  • 15. if you don’t already, you need to Git
    Git is a command-line tool, but the center around which all things involving Git revolve is the hub, GitHub.com, where developers store their projects and network with like-minded people.
    use RStudio and all the advanced IDE tools on your local machine, then push to GitHub and pull from it on the cluster to run your job.
    RStudio features built-in version control to track changes in your analysis. Git lets you go back in time to a previous version of your file.
  • 16. Purrr Package Map functions apply a function iteratively to each element of a list or vector
  • 17. the purrr map functions are an optional replacement for the lapply family of functions. they are not technically faster ( although the speed difference is in nanoseconds ). the main advantage is uniform syntax with other tidyverse packages such as dplyr, tidyr, readr, and stringr, as well as the helper functions.
    map( .x, .f, ... )
    map( vector_or_list_input, function_to_apply, optional_other_stuff )
    modify( .x, .f, ... )
    ex. modify_at( my.data, 1:5, as.numeric )
    https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf
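a small sketch comparing lapply() with the purrr equivalents, assuming the purrr package is installed. the scores list and the raw data frame are made-up examples.

```r
library(purrr)

scores <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# lapply() and map() both return a list with one element per input element.
via_lapply <- lapply(scores, mean)
via_map    <- map(scores, mean)

# Typed variants return an atomic vector and fail loudly on a type mismatch.
means <- map_dbl(scores, mean)

# Formula shorthand for an anonymous function; .x is the current element.
ranges <- map_dbl(scores, ~ max(.x) - min(.x))

# modify_at() changes only the selected elements and keeps the input type,
# so a data frame stays a data frame.
raw <- data.frame(
  id = c("1", "2"), value = c("3.5", "4.5"), label = c("x", "y"),
  stringsAsFactors = FALSE
)
clean <- modify_at(raw, c("id", "value"), as.numeric)
```

the typed variants such as map_dbl() are often the practical reason to switch, since the shape of the result is known in advance.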
  • 18. map in parallel
    another key advantage of purrr is its use of lambda functions, which have been crucial for analysis involving multiple columns of a data frame. using the same basic syntax we create an anonymous function which maps over many lists simultaneously
    my.data %<>% mutate( var5 = map2_dbl( .$var3, .$var4, ~ ( .x + .y ) / 2 ))
    my.data %<>% mutate( var6 = pmap_dbl( list( .$var3, .$var4, .$var5 ), ~ ( ..1 + ..2 + ..3 ) / 3 ))
    tip: combining the tidyverse grammars for graphics, data, and lists can build a strong workflow
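a worked version of the two calls on a tiny made-up data frame, assuming purrr is installed. to keep the sketch to one package it uses plain column assignment instead of dplyr's mutate() and the magrittr %<>% pipe. note that "parallel" here means stepping through several vectors in lockstep, not running on several cores.

```r
library(purrr)

my.data <- data.frame(var3 = c(1, 2, 3), var4 = c(3, 4, 5))

# map2_dbl() walks two vectors in step; .x and .y are the paired elements.
my.data$var5 <- map2_dbl(my.data$var3, my.data$var4, ~ (.x + .y) / 2)

# pmap_dbl() takes a list of any number of vectors; ..1, ..2, ..3 refer to
# the first, second, and third of them.
my.data$var6 <- pmap_dbl(
  list(my.data$var3, my.data$var4, my.data$var5),
  ~ (..1 + ..2 + ..3) / 3
)

print(my.data)
```

for simple arithmetic like this, vectorized column math would do the same job; the map2/pmap forms pay off when the function applied to each row is not vectorized.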
  • 19. closing
    unburden your personal device
    ➔ learn basic linux cli
      using batch job submissions gives you the most flexibility
    ➔ profile/checkpoint/test
      resources are not without limits
    ➔ share your code
      don’t hold onto code until it’s perfect. use GitHub and get feedback early and often
  • 20. $ questions -h refs: 1. https://jennybc.github.io/purrr-tutorial/ 2. https://portal.tacc.utexas.edu/user-guides/stampede2#running-jobs-on-the-stampede2-compute-nodes 3. https://learn.tacc.utexas.edu/mod/page/view.php?id=24 4. http://blog.revolutionanalytics.com/2015/11/r-projects.html