While much of the recent literature in spatial statistics has focused on addressing the big-data issue, practical implementations of these methods on high performance computing systems for truly large data are still rare. We discuss our explorations in this area at the National Center for Atmospheric Research for a range of applications that can benefit from large-scale computing infrastructure. These applications include extreme value analysis, approximate spatial methods, spatial localization methods, and statistically based data compression, and are implemented in different programming languages. We focus on timing results and practical considerations, such as speed vs. memory trade-offs, limits of scaling, and ease of use.
High performance computing and spatial data: an overview of recent work at NCAR
Dorit Hammerling
Analytics and Integrative Machine Learning Group
Technology Development Division
National Center for Atmospheric Research (NCAR)
Joint work with Sophia Chen, Joseph Guinness, Marcin Jurek, Matthias Katzfuss,
Daniel Milroy, Douglas Nychka, Vinay Ramakrishnaiah, Yun Joon Soon and Brian
Vanderwende
February 13, 2018
Hammerling et al. (NCAR) HPC for spatial data February 13, 2018 1 / 45
Outline
1 Introduction and Motivation
2 Application benchmarking study example
3 Other examples and future work
The National Center for Atmospheric Research (NCAR)
• A federally funded research and development center
• Mission: To understand the behavior of the atmosphere and related
Earth and geospace systems
NCAR’s Community Earth System Model
• a “virtual laboratory” to study past, present and future climate states
• describes interactions of the atmosphere, land, river runoff, land-ice,
oceans and sea-ice
• complex! Large code base: approx. 1.5 million lines of code
Earth System Models
• Computationally very demanding: differential equations are solved for
millions of grid boxes. → Requires HPC infrastructure.
• Approximately 200 variables, many in 3-D, are saved to describe the
state of the atmosphere, land, river runoff, land-ice, oceans and
sea-ice for every grid cell. → Massive amounts of data and storage
requirements, lots of science questions.
Work with scientists to gain insights from massive data sets, ideally
without moving the data.
Analyzing large spatial data: initial considerations
• What is the scientific question and what statistical or machine
learning modeling framework can we use to answer it?
• Is the analysis inherently parallel or does the model allow for
parallelization?
• What software? Where and how is the data stored?
• How can we optimize the execution on our HPC infrastructure?
Outline
1 Introduction and Motivation
2 Application benchmarking study example
3 Other examples and future work
Study of precipitation extremes: a typical example
• daily data for 35 years: 12,775 values per grid cell
• 288 longitudes × 192 latitudes: 55,296 grid cells
• 12,775 × 55,296 = 706,406,400 data points (2.83 GB)
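The arithmetic above is easy to check; the 2.83 GB figure corresponds to 4-byte (single precision) storage, which is an assumption here:

```r
# Verify the data volume quoted above.
n_time  <- 12775             # daily values over 35 years per grid cell
n_cells <- 288 * 192         # longitudes x latitudes
n_vals  <- n_time * n_cells  # total data points
gb      <- n_vals * 4 / 1e9  # assuming 4-byte (single precision) values

n_vals        # 706,406,400
round(gb, 2)  # about 2.83 GB
```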
Fitting a Generalized Pareto distribution
• This is a complementary approach to block maxima for Extreme Value Analysis.
• For data above a given threshold (µ), fit a probability density of the form
f(x) = (1/σ) [1 + ξ(x − µ)/σ]^(−(1/ξ+1)) for x ≥ µ.
• σ – scale parameter, ξ – shape parameter
• We ignore all the data below the threshold to fit just the tail.
• Having selected the threshold, estimate σ and ξ by maximum likelihood.
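As a self-contained illustration of tail fitting (the talk uses extRemes::fevd for the actual MLE fit), the sketch below simulates exponential data, whose threshold exceedances follow a GPD with ξ = 0 and σ = 1, and recovers the parameters with simple moment estimators rather than maximum likelihood:

```r
# Exceedances of an Exp(1) sample over any threshold are again Exp(1),
# i.e. a GPD with xi = 0 and sigma = 1, so the truth is known here.
set.seed(1)
y   <- rexp(2e5)
u   <- quantile(y, 0.99)  # threshold at the 99th percentile
exc <- y[y > u] - u       # excesses over the threshold

# Method-of-moments estimators for the GPD, from
# mean = sigma/(1 - xi) and var = sigma^2 / ((1 - xi)^2 (1 - 2 xi)):
m <- mean(exc); v <- var(exc)
xi_hat    <- (1 - m^2 / v) / 2
sigma_hat <- m * (1 + m^2 / v) / 2

c(xi = xi_hat, sigma = sigma_hat)  # should be near 0 and 1
```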
Fitting a Generalized Pareto distribution: R code
tailProb <- 0.01        # tail probability used in extremes fitting
returnLevelYear <- 100  # years used for the return level
Y <- dataset[lonindex, latindex]
threshold <- quantile(Y, 1 - tailProb)
frac <- sum(Y > threshold) / length(Y)
GPFit <- fevd(Y, threshold = threshold, type = "GP", method = "MLE")
ReturnLevel <- return.level(GPFit, returnLevelYear, do.ci = TRUE)
Depending on your machine, this takes somewhere from 0.3 to 1 second.
Why use HPC systems for statistical computing?
Doing repetitive tasks can take a lot of time.
Even short tasks add up quickly:
• 0.33 seconds for one location corresponds to approx. 5 hours for
55,000 locations.
• 1 second for one location corresponds to approx. 15 hours for 55,000
locations.
And that is for a single data set. Often we want to analyze hundreds of
data sets and test different parameters.
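The back-of-the-envelope numbers above are easy to reproduce:

```r
cells <- 55000                       # approximate number of grid cells
hours_fast <- 0.33 * cells / 3600    # at 0.33 s per fit
hours_slow <- 1.00 * cells / 3600    # at 1 s per fit
round(c(hours_fast, hours_slow), 1)  # roughly 5 and 15 hours
```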
NCAR’s high performance computing (HPC) systems
Yellowstone (previous system, decommissioned at the end of 2017):
• 1.5 petaflops peak
• 72,256 cores
• 145 TB total memory
• 56 Gb/s interconnects
Cheyenne (new system: an evolutionary increase):
• 5.34 petaflops peak
• 145,152 cores
• 313 TB total memory
• 100 Gb/s interconnects
Cores and nodes on HPC systems
• Usually cores on one node share memory (cache).
• Memory between nodes is typically not shared, but can be accessed.
• Understanding the basics of the architecture and interconnects can be
really helpful!
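Within a single node, shared memory makes fork-based parallelism cheap; base R's parallel package illustrates the idea (mclapply forks workers that share the parent's memory pages, so no data travels over a network; forking is unavailable on Windows, so we fall back to one core there):

```r
library(parallel)

# Fork-based parallelism within one node: workers share the parent's
# memory. Unix only; fall back to a single core on Windows.
n_cores <- if (.Platform$OS.type == "unix") 2 else 1
res <- mclapply(1:4, function(i) i^2, mc.cores = n_cores)
unlist(res)  # 1 4 9 16
```

Crossing node boundaries is where this stops working and message passing (e.g. MPI) becomes necessary.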
Relevant details: Memory and parallelization tools
Memory available on compute nodes
Two classes of nodes on Cheyenne:
• Standard nodes have 64 GB of memory (46 GB usable).
• Large memory nodes with 128 GB of memory (110 GB usable).
• Data Analysis cluster: 1TB (!) of memory (1000 GB usable)
• But very different network architecture, not meant for working across
nodes!
You need to know what is installed and how it is configured!
• Rmpi:
• Limits on workers?
• What physical interconnect is it using?
• MATLAB Distributed Computing Server
• Spark for Python or Scala
Application benchmarks
Even if one knows the architecture very well and has data on low-level
benchmarks, application benchmarking is critical.
Application benchmarking: benchmarking that uses code as close as
possible to the real production code (including I/O operations!).
For our application we use a quarter (approximately 14,000 grid cells) of
the full data for initial benchmarking.
Double for loop: code sketch
# outer loop over latitude
for (latindex in 1:dim(lat)) {
  # inner loop over longitude
  for (lonindex in 1:dim(lon)) {
    Y <- dataset[latindex, lonindex, ]  # extract data
    ......
    EXTREME VALUE ANALYSIS (EVA) CODE
    outSummary[latindex, lonindex, ] <- EVA RESULTS
    ......
    print(lonindex)  # counter
  }  # end of inner loop
  print(latindex)  # counter
}  # end of outer loop
Setup for Application benchmarking
Experimental design:
• What kind of cluster, i.e., which communication protocol and network?
• What loop to parallelize: inner or outer? Or nested?
• What to put in the inner/outer loop: latitude or longitude?
• How to read in the data? All at once, one latitude/longitude band at
a time, one grid cell at a time?
• How does the application scale with more cores and nodes?
Additional consideration:
• Do we want our code to run as fast as possible or as efficiently as
possible? Total execution time vs. time per core?
Physical networks and communication protocols
• TCP/IP is the protocol on which the internet is based. Connections
can be high bandwidth but also high latency, partially because the
protocol is designed to work in lossy networks. Logical endpoints of
TCP/IP connections are called sockets.
• Ethernet is a physical network designed to support TCP/IP
connections.
• InfiniBand is a physical network designed to support MPI message
passing. The physical connections are very high bandwidth, very low
latency, and very expensive.
• MPI (Message Passing Interface) is a library written to enable
message passing on compute clusters. It employs algorithms that
optimize communication efficiency and speed on clusters. It works
best with a high bandwidth, low latency and near lossless physical
network, but can work on Ethernet via TCP/IP as well.
Remote Direct Memory Access (RDMA)
• RDMA allows data to be written/read to/from other nodes’
memories.
• In a sense, the nodes behave like a single aggregate node.
Cluster setup in R
To run foreach in parallel, a cluster needs to be set up.
• Starting an R PSOCK cluster sets up TCP/IP connections (without
RDMA) across the Ethernet network (and can be tricked to work over
InfiniBand).
• Starting an MPI cluster sets up MPI communications with RDMA
across the InfiniBand network.
library(Rmpi)
library(doMPI)
##### Cluster setup #####
cl <- startMPIcluster(numCores) # Create MPI cluster
registerDoMPI(cl) # Register parallel backend for foreach
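For comparison, a PSOCK cluster (the TCP/IP option described above) can be registered for foreach via doParallel; a minimal sketch, assuming the foreach and doParallel packages are installed and using local workers:

```r
library(doParallel)  # also loads foreach and parallel

# PSOCK cluster: workers are separate R processes reached over
# TCP/IP sockets (no RDMA), here both on the local host.
cl <- makePSOCKcluster(2)
registerDoParallel(cl)  # register parallel backend for foreach

res <- foreach(i = 1:4, .combine = c) %dopar% i^2
stopCluster(cl)
res  # 1 4 9 16
```

The foreach body is identical either way; only the registered backend decides whether iterations travel over sockets or MPI.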
Parallelizing for loops: the “foreach” package
• foreach provides a looping construct using binary operators (%do%, %dopar%),
which can be easily parallelized.
• A hybrid between a standard for loop and lapply: it evaluates an expression
rather than a function (as lapply does).
• Returns a value rather than causing side effects.
• Needs a parallel backend, most commonly doParallel.
Code sketch for inner loop parallelization
library(doParallel)  # loads foreach, parallel and iterators
# outer loop over latitude
for (latindex in 1:numLat) {
  dataset <- getData(latindex)  # load data for a specific latitude
  # inner loop over longitude (executed in parallel)
  outSummary[latindex, , ] <- foreach(lonindex = 1:dim(lon),
      .combine = rbind, .packages = c("extRemes")) %dopar% {
    Y <- dataset[lonindex, ]  # extract Y for a specific longitude
    ......
    EXTREME VALUE ANALYSIS (EVA) CODE
    ......
    c(threshold, GPFit$results$par, frac = frac, ReturnLevel)
    # print(lonindex)  # counter; DON'T use in parallel execution
  }
  print(latindex)  # counter to monitor progress
}
One latitude at a time: inner loop over longitude
One longitude at a time: inner loop over latitude
Code sketch for outer loop parallelization
# outer loop over latitude (executed in parallel)
outSummary <- foreach(latindex = 1:numLat, lat_count = icount(),
    .combine = rbind, .packages = c("extRemes", "ncdf4", "foreach",
    "iterators")) %dopar% {
  dataset <- getData(latindex)
  # inner loop over longitude (executed sequentially)
  foreach(lonindex = 1:dim(lon), lon_count = icount(),
      .combine = rbind, .packages = c("extRemes", "foreach")) %do% {
    Y <- dataset[lonindex, ]
    ......
    EXTREME VALUE ANALYSIS (EVA) CODE
    ......
    c(threshold, GPFit$results$par, frac = frac, ReturnLevel)
  }
}
Options for reading in the data
• All the data at once
• One latitude or longitude band at a time
• One grid cell at a time
Trade-off between the number of I/O calls and filling up memory.
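Reading one band at a time can be done with the start/count arguments of ncdf4; a sketch with a hypothetical file and variable name (precip.nc and PRECT are illustrative, not from the talk):

```r
library(ncdf4)

# Hypothetical file and variable names, for illustration only.
nc <- nc_open("precip.nc")

# One latitudinal band: all longitudes, one latitude, all times.
latindex <- 1
band <- ncvar_get(nc, "PRECT",
                  start = c(1, latindex, 1),
                  count = c(-1, 1, -1))  # -1 means "the full extent"

nc_close(nc)
```

Reading per band trades a few hundred small I/O calls against holding the whole 2.83 GB array in each worker's memory.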
Application Benchmarking Results
Experiment 1: Inner Loop Parallelization Across Longitude for 48
Latitudes reading in the data by latitudinal band
Cluster numCores ptile numNodes Run Time (seconds)
IPoIB 16 16 1 375
IPoIB 32 16 2 523
IPoIB 48 16 3 543
mpi 16 16 1 2955
mpi 32 16 2 3206
mpi 48 16 3 3277
Application Benchmarking Results cont.
Experiment 2: Inner Loop Parallelization Across Longitude for 48
Latitudes reading all the data at once
Experiment 3: Inner Loop Parallelization Across Latitude for 48
Longitudes reading in the data by longitudinal band
Experiment 4: Inner Loop Parallelization Across Latitude for 48
Longitudes reading all the data at once
• Results equivalent or worse for experiments 2 through 4.
• The inner loop parallelization does NOT scale across nodes and can
run out of memory when reading in all the data at once.
• R does not have variable slicing, meaning each worker is sent the full
data and then extracts its worker-specific data. [Different in Matlab!]
Application Benchmarking Results cont.
Experiment 5: Outer Loop Parallelization Across Latitude for 48 Latitude
Values reading in the data by latitudinal band
Cluster numCores ptile numNodes Run Time (seconds)
IPoIB 16 16 1 198
IPoIB 32 16 2 172
IPoIB 48 16 3 101
mpi 16 16 1 208
mpi 32 16 2 129
mpi 48 16 3 80
Much better scaling than inner loop parallelization.
Application Benchmarking Results cont.
Experiment 7: Outer Loop Parallelization Across Longitude for 72
Longitude Values reading in the data by longitudinal band
Cluster numCores ptile numNodes Run Time (seconds)
IPoIB 16 16 1 214
IPoIB 32 16 2 152
IPoIB 48 16 3 116
mpi 16 16 1 209
mpi 32 16 2 146
mpi 48 16 3 101
Much better scaling than inner loop parallelization.
Application Benchmarking Results cont.
Experiment 5: Outer Loop Parallelization Across Latitude for 48 Latitude
Values reading in the data by latitudinal band
Cluster numCores ptile numNodes Run Time (seconds)
mpi 16 16 1 208
mpi 32 16 2 129
mpi 48 16 3 80
Experiment 7: Outer Loop Parallelization Across Longitude for 72
Longitude Values reading in the data by longitudinal band
Cluster numCores ptile numNodes Run Time (seconds)
mpi 16 16 1 209
mpi 32 16 2 146
mpi 48 16 3 101
Outer loop parallelization over latitude somewhat faster.
Application Benchmarking Results cont.
Experiment 5: Outer Loop Parallelization Across Latitude for 48 Latitude
Values reading in the data by latitudinal band
Cluster numCores ptile numNodes Run Time (seconds)
mpi 16 16 1 208
mpi 32 16 2 129
mpi 48 16 3 80
Experiment 6: Outer Loop Parallelization Across Longitude for 72
Longitude Values reading in the data only once
Cluster numCores ptile numNodes Run Time (seconds)
mpi 16 16 1 220
mpi 32 16 2 123
mpi 48 16 3 93
Data read options provide similar results for this data size.
Application Benchmarking Results for FULL data
Experiment 5: Outer Loop Parallelization Across Latitude for 192 Latitude
Values reading in the data by latitudinal band
Cluster numCores ptile numNodes Run Time (seconds)
IPoIB 16 16 1 719
IPoIB 32 16 2 521
IPoIB 48 16 3 352
IPoIB 64 16 4 279
IPoIB 80 16 5 252
IPoIB 96 16 6 192
IP (Socket) clusters have a limit of 128 workers within R.
66. Application benchmarking study example
Application Benchmarking Results for FULL data
Experiment 5: Outer Loop Parallelization Across Latitude for 192 Latitude
Values, reading in the data by latitudinal band
Cluster numCores ptile numNodes Run Time (seconds)
mpi 16 16 1 658
mpi 32 16 2 354
mpi 48 16 3 233
mpi 64 16 4 206
mpi 80 16 5 187
mpi 96 16 6 154
mpi 192 16 12 88
Fastest setup overall. About one and a half minutes for the entire data set.
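One way to read this table is through parallel speedup and efficiency relative to the 16-core baseline; the numbers below are computed directly from the run times reported above:

```python
# Speedup and parallel efficiency for the full-data MPI runs (Experiment 5),
# taken straight from the run times in the table above.
base_cores, base_time = 16, 658.0
runs = {16: 658, 32: 354, 48: 233, 64: 206, 80: 187, 96: 154, 192: 88}

for cores, t in runs.items():
    speedup = base_time / t                    # relative to the 16-core run
    efficiency = speedup * base_cores / cores  # fraction of ideal scaling
    print(f"{cores:4d} cores: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
```

Efficiency falls from roughly 0.93 at 32 cores to about 0.62 at 192 cores, one concrete view of the limits of scaling at this problem size.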
68. Application benchmarking study example
Technical Report, Data and Code available
69. Other examples and future work
Outline
1 Introduction and Motivation
2 Application benchmarking study example
3 Other examples and future work
70. Other examples and future work
Think globally - act locally (Doug Nychka’s latest work)
• A global statistical model for a spatial field provides seamless
inference across a spatial domain
• A local analysis of spatial data avoids large memory requirements and
simplifies parallel computation
The goal is to combine these two ideas:
• Compute on local neighborhoods of the spatial field but assemble the
results into a global model.
• The local computations are embarrassingly parallel and so easily scale
to many cores.
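A minimal sketch of this compute-locally, assemble-globally pattern, with hypothetical names and a placeholder local fit standing in for the actual local spatial model:

```python
import numpy as np

def local_estimate(window):
    """Stand-in for a local covariance-parameter fit on one neighborhood."""
    return float(window.std())

def local_to_global(field, half=5):
    """Estimate a parameter on each (2*half+1) x (2*half+1) neighborhood
    and assemble the local estimates into a global parameter surface."""
    ny, nx = field.shape
    surface = np.full((ny, nx), np.nan)
    # Every window is independent of every other one, so this double loop
    # is embarrassingly parallel across cores in practice.
    for i in range(half, ny - half):
        for j in range(half, nx - half):
            window = field[i - half:i + half + 1, j - half:j + half + 1]
            surface[i, j] = local_estimate(window)
    return surface

field = np.random.rand(40, 40)    # synthetic spatial field
surface = local_to_global(field)  # 11x11 windows for half=5
```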
71. Other examples and future work
Emulation of model output
• Pattern scaling is based on a linear relationship between local
temperatures and the global mean.
• Derived from a long coupled climate model run.
[Figures: mean scaling pattern; variation across 8 pattern scaling fields]
Goal: Simulate additional fields cheaply that reflect
the properties of the ensemble.
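The linear relationship underlying pattern scaling can be written down directly: at each grid point, the scaling pattern is the least-squares slope of local temperature on the global mean. A sketch on synthetic data (all names and numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def scaling_pattern(local, global_mean):
    """local: (ntime, ny, nx); global_mean: (ntime,). Returns the slope map."""
    g = global_mean - global_mean.mean()
    anomalies = local - local.mean(axis=0)
    # Per-pixel least-squares slope: cov(local, global) / var(global)
    return np.tensordot(g, anomalies, axes=(0, 0)) / (g @ g)

ntime, ny, nx = 100, 8, 12
g = np.linspace(0.0, 1.5, ntime)               # synthetic global-mean warming
true_slopes = rng.uniform(0.5, 2.0, (ny, nx))  # "true" local sensitivities
local = true_slopes * g[:, None, None] + 0.01 * rng.standard_normal((ntime, ny, nx))
pattern = scaling_pattern(local, g)            # recovers true_slopes closely
```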
72. Other examples and future work
Parameters of local spatial models
[Maps of local parameter estimates: Range (degrees), Sigma (C), Tau (C)]
• Parallel fits using Rmpi on moving 11×11 pixel windows
• Demonstrated linear scaling to at least 1000 cores
• Highly nonstationary!
73. Other examples and future work
Emulation of the model output
• Encode the local parameter estimates into a global Markov Random
field model (LatticeKrig).
• Fast simulation due to the sparsity of the LatticeKrig precision matrix
and basis functions
Top row: 4 model fields. Bottom row: 4 simulated fields.
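The speed of the simulation step rests on a standard identity: if the precision matrix Q is factored as Q = L Lᵀ, then solving Lᵀx = z with z ~ N(0, I) yields a draw x ~ N(0, Q⁻¹), and both the factorization and the solve exploit sparsity. A dense sketch with a tridiagonal stand-in for the actual LatticeKrig precision matrix (real implementations use sparse Cholesky factorizations):

```python
import numpy as np

n = 200
# Sparse-structured (tridiagonal) precision matrix Q, a first-order
# Markov structure, stored densely here only to keep the sketch short.
Q = 2.1 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L = np.linalg.cholesky(Q)    # Q = L L^T
z = np.random.randn(n)       # iid standard normal draw
x = np.linalg.solve(L.T, z)  # Cov(x) = L^{-T} L^{-1} = Q^{-1}
```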
74. Other examples and future work
Other spatial work using HPC infrastructure
• Comparison of Python and Matlab implementations of the
Multi-resolution approximation
• Statistical compression algorithms using half-spectral models for
spatio-temporal data, parallelization over temporal frequencies
• Bayesian climate change detection and attribution models,
parallelization over number of basis functions
• . . .
Thanks! Any questions? Dorit Hammerling (dorith@ucar.edu)