Using A100 MIG to
Scale Astronomy Scientific Output
Presented by Igor Sfiligoi – UC San Diego
Prepared in collaboration with David Schultz, Benedikt Riedel, James A. Clark and Frank Würthwein
April 2021
Image Credit: NASA's Goddard Space Flight Center
The night sky may look peaceful.
But there are many really violent
events going on!
This artist's rendering shows the tidal disruption event named ASASSN-14li, where a star wandering too close to a
3-million-solar-mass black hole was torn apart. The debris gathered into an accretion disk around the black hole.
The most violent events
cannot be seen with
“traditional” methods!
Multi-messenger astronomy
High-energy neutrinos
can travel
long distances
in a straight line.
Produce Cherenkov light
when passing through ice.
Really big events can distort space-time
creating gravitational waves.
Can be detected with
laser interferometry.
Image credit: LIGO/T. Pyle
This illustration shows the merger of two black holes and the gravitational
waves that ripple outward as the black holes spiral toward each other.
LIGO Livingston, Louisiana (two 4 km interferometer arms)
Both detectors require extensive
compute power to properly reconstruct
the observed events.
Optical Properties
• Combining all the possible information
• These features are included in the simulation
• We are always developing them further
Nature never gives us a perfect answer,
but we have obtained satisfactory agreement with the data!
The need for calibration
• Natural medium
• Hard to calibrate properly
• We effectively dropped a detector into a grey box
• The ice is very clear, but…
• Is it uniform?
• How has construction changed the ice?
• Drastic changes
in reconstructed position
with different ice models
GPUs great fit for calibration purposes
• Too complex for a parametrized approach
• Needs a brute-force approach
• Basically, ray-tracing (see the sketch below)
• Starting from MC-produced events
• Switched to GPU Ray-Tracing in 2011
• At that point
1x GPU ~ 200x CPU cores
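The production photon-propagation code is far more sophisticated (layered ice, wavelength-dependent properties, GPU kernels); the toy sketch below only illustrates why the problem maps so well onto massively parallel hardware: every photon is propagated independently by the same simple loop. All constants, positions, and names here are invented.

```python
# A toy sketch of brute-force photon propagation in ice.
# NOT IceCube's production code; all constants are made up.
import numpy as np

rng = np.random.default_rng(42)

N_PHOTONS = 100_000
SCATTER_LEN = 25.0                        # metres, hypothetical scattering length
ABSORB_LEN = 100.0                        # metres, hypothetical absorption length
DOM_POS = np.array([0.0, 0.0, -10.0])     # one detector module, made up
DOM_RADIUS = 1.0                          # metres

def random_directions(n):
    """Isotropic unit vectors (real ice scattering is strongly forward-peaked)."""
    theta = np.arccos(rng.uniform(-1.0, 1.0, n))
    phi = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=1)

pos = np.zeros((N_PHOTONS, 3))            # all photons start at the emission vertex
dirs = random_directions(N_PHOTONS)
alive = np.ones(N_PHOTONS, dtype=bool)
hits = 0

for _ in range(200):                      # cap the number of scattering steps
    step = rng.exponential(SCATTER_LEN, N_PHOTONS)
    # Exponential absorption: survive this step with probability exp(-step/ABSORB_LEN).
    alive &= rng.exponential(ABSORB_LEN, N_PHOTONS) > step
    pos[alive] += dirs[alive] * step[alive, None]
    # Count photons that end a step close enough to the (single) module.
    close = alive & (np.linalg.norm(pos - DOM_POS, axis=1) < DOM_RADIUS)
    hits += int(close.sum())
    alive &= ~close
    dirs = random_directions(N_PHOTONS)   # re-scatter the survivors
    if not alive.any():
        break

print(f"{hits} of {N_PHOTONS} photons reached the module")
```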
Really excited about the A100
• Great speed growth in core IceCube compute benchmark
[Chart: IceCube benchmark throughput in Mphotons/s for T4, P100, P40, V100, A100]
P100 = 1.1x T4
V100 = 1.7x T4
A100 = 3.3x T4
A100 = 2.9x P100
A100 = 1.9x V100
(Image credit: Nvidia)
But actual application speed improvement much lower
• IceCube CPU-based code cannot keep the GPU busy!
[Chart: benchmark throughput (Mphotons/s) vs. application throughput (jobs/day, with oversize=3) for T4, P100, P40, V100, A100]
A100 = 1.9x T4
A100 = 1.4x V100
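One way to confirm that the GPU is being starved by CPU-side code (an assumption about how one might check this, not something shown on the slide) is to poll utilization with the NVML Python bindings while a job runs:

```python
# A small monitoring sketch using the pynvml (nvidia-ml-py) bindings:
# poll GPU utilization once a second while an IceCube-style job is running.
# If utilization sits far below 100%, the CPU side is the bottleneck.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # first GPU

try:
    for _ in range(60):                          # watch for one minute
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU {util.gpu:3d}%  memory {util.memory:3d}%")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```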
A100 MIG (Multi-Instance GPU) to the rescue
• With 7x MIG partitions we are back to the theoretical speedup
[Chart: benchmark throughput (Mphotons/s) vs. application throughput (jobs/day, with oversize=3) for T4, P100, P40, V100, A100, and 2x/3x/7x A100 MIG]
A100 = 3.5x T4
A100 = 2.5x V100
Actually… even better.
We did try to share a single GPU between multiple jobs, but no real advantage was observed.
With MIG, each job believes it has a full GPU to itself.
A slower GPU per job, but we mostly care about throughput.
A100 MIG (Multi-Instance GPU) to the rescue
• And the story gets even better for faster codes!
[Chart: benchmark throughput (Mphotons/s) vs. application throughput (jobs/day, with oversize=5) for T4, V100, A100, and 2x/3x/7x A100 MIG]
A100 = 4.3x T4
A100 = 4.3x V100
With oversize=5 the V100 was already too fast for IceCube in fast mode.
2x A100 MIG runtime is identical to the non-MIG A100 runtime,
so 2x the HTC throughput (when precision is less important).
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
Relatively easy to set up,
but requires a reboot
to be enabled
(see the sketch below).
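For reference, here is a minimal sketch of the enable-and-partition sequence from the MIG user guide linked above, wrapped in Python for convenience. Profile ID 19 (1g.5gb on an A100-40GB) is an assumption; verify the available profiles with `nvidia-smi mig -lgip` on your driver before relying on it.

```python
# A minimal sketch of enabling MIG and carving an A100 into 7 slices,
# following the NVIDIA MIG user guide linked on this slide.
# Run as root. Profile ID 19 (1g.5gb on A100-40GB) is an assumption —
# check `nvidia-smi mig -lgip` for your driver/GPU combination.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvidia-smi", "-i", "0", "-mig", "1"])        # enable MIG mode on GPU 0
# ... reboot (or reset the GPU) here before the next step takes effect ...
run(["nvidia-smi", "mig", "-i", "0",
     "-cgi", "19,19,19,19,19,19,19", "-C"])        # 7x GPU + compute instances
run(["nvidia-smi", "-L"])                          # list the resulting MIG devices
```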
A few words about LIGO
Gravitational wave parameter estimation
• Gravitational wave data:
• Very noisy time series of strain measurements
• Sources like binary black hole coalescence
induce a well-modelled signal
• Infer source parameters (masses, spins, etc.)
by solving an inverse problem:
• matching waveform model
“templates” to the data
PRL 118, 221101 (2017)
Computationally expensive
• Computing posterior
probability density functions (sketched below)
• 1k – 1M likelihood evaluations
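As a reminder of what is being computed (the textbook Bayesian formulation, not a formula taken from the slides), the posterior over source parameters θ given strain data d is:

```latex
% Standard Bayesian parameter-estimation formulation; h(\theta) is the
% waveform "template" and \langle\cdot,\cdot\rangle the noise-weighted
% inner product over the detector band.
p(\theta \mid d) \;\propto\; \mathcal{L}(d \mid \theta)\,\pi(\theta),
\qquad
\ln \mathcal{L}(d \mid \theta) \;=\;
  -\tfrac{1}{2}\,\bigl\langle\, d - h(\theta),\; d - h(\theta) \,\bigr\rangle
  + \text{const}.
```

Each likelihood evaluation requires generating a waveform and computing a noise-weighted inner product, which is why 1k – 1M evaluations become computationally expensive.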
RIFT = Rapid parameter inference for
gravitational wave sources via Iterative FiTting
• Highly parallelizable, grid-based algorithm:
• Start with discrete grid of waveforms
• Monte-Carlo integrate likelihood over extrinsic,
“nuisance” parameters (e.g., distance, sky-location)
• Gaussian process interpolation
→ approximate continuous likelihood function
• Interpolation ported to GPUs using cupy (see the sketch below)
• 1x Nvidia Quadro P2000 GPU ~ 20x Intel Xeon Silver 4116 CPU cores
• Requires MC input, which is currently CPU-only
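The sketch below illustrates the kind of work that benefits from cupy here: evaluating a Gaussian-process interpolant of the marginal likelihood at a large batch of sample points in a single GPU call. It is a toy under invented assumptions, not RIFT's actual implementation; all array shapes, kernel choices, and values are made up.

```python
# A toy sketch (NOT the RIFT implementation) of GPU-batched evaluation of a
# Gaussian-process interpolant of the marginal log-likelihood with cupy.
import numpy as np
import cupy as cp

def rbf_kernel(x1, x2, length_scale=0.2):
    """Squared-exponential kernel between two sets of parameter points."""
    d2 = (cp.sum(x1 ** 2, axis=1)[:, None]
          + cp.sum(x2 ** 2, axis=1)[None, :]
          - 2.0 * x1 @ x2.T)
    return cp.exp(-0.5 * d2 / length_scale ** 2)

rng = np.random.default_rng(0)

# Hypothetical training set: a grid of intrinsic parameters (e.g. masses)
# with marginal log-likelihoods already computed by the CPU-side MC stage.
grid = cp.asarray(rng.uniform(size=(512, 2)))
lnL_grid = cp.asarray(rng.normal(size=512))

# "Fit" once on the GPU: solve (K + jitter * I) alpha = lnL_grid.
K = rbf_kernel(grid, grid) + 1e-6 * cp.eye(grid.shape[0])
alpha = cp.linalg.solve(K, lnL_grid)

# Evaluate the interpolant at a large batch of proposed sample points.
samples = cp.asarray(rng.uniform(size=(100_000, 2)))
lnL_pred = rbf_kernel(samples, grid) @ alpha        # one batched GPU evaluation
print(float(lnL_pred.mean()))
```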
LIGO is still in the process of transitioning
• All GPUs too fast!
• Workflow heavily single-CPU-core-bound
Image Credit: T. Pyle/Caltech/MIT/LIGO Lab
[Chart: jobs/day for GTX1080TI, V100, A100]
A100 = 1.3x GTX1080TI
A100 ~ V100
GTX1080TI ~ T4
Code crashing on
Turing platforms.
Coming soon, but not yet ready:
GPU-based Monte-Carlo integration
RIFT
Multi-processing is the way to go
• All GPUs still too fast!
• Linear scaling from 1 to 3 processes per GPU
[Chart: jobs/day for GTX1080TI, V100, A100 with 1 process/GPU vs. 3 processes/GPU (~3x)]
A100 = 1.3x GTX1080TI
A100 ~ V100
No reason to stop at
3 processes/GPU,
but it makes scheduling harder
(a minimal launch sketch follows).
Image Credit: T. Pyle/Caltech/MIT/LIGO Lab
RIFT
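For local testing, launching several independent processes on one physical GPU can be as simple as the sketch below. The worker script name and its argument are hypothetical placeholders; how the real jobs are scheduled (batch system, job router, etc.) is not shown here.

```python
# A minimal local sketch of running 3 worker processes on one physical GPU.
# "rift_worker.py" and "--chunk" are hypothetical placeholders.
import os
import subprocess

GPU = "0"             # physical GPU index; a MIG UUID also works (next slide)
PROCS_PER_GPU = 3     # the sweet spot reported on this slide

env = dict(os.environ, CUDA_VISIBLE_DEVICES=GPU)
workers = [
    subprocess.Popen(["python", "rift_worker.py", f"--chunk={i}"], env=env)
    for i in range(PROCS_PER_GPU)
]
for w in workers:
    w.wait()
```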
A100 MIG helps pack more processes
• Each A100 7x-MIG partition is still fast enough to feed 3 processes
• Each MIG partition can be scheduled independently (see the sketch below)
[Chart: jobs/day for GTX1080TI, V100, A100, 3x A100 MIG, 7x A100 MIG at 1 and 3 processes per GPU]
7x more throughput
with marginal per-job
slowdown
Image Credit: SXS, the Simulating eXtreme Spacetimes
RIFT
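As an illustration (an assumption about how one might wire this up locally, not the production scheduling setup), the MIG device UUIDs reported by `nvidia-smi -L` can be used to pin each group of worker processes to its own partition via CUDA_VISIBLE_DEVICES:

```python
# A minimal local sketch: enumerate MIG devices and pin one batch of worker
# processes to each partition via CUDA_VISIBLE_DEVICES.
# "rift_worker.py" is a hypothetical placeholder, and the exact UUID format
# reported by nvidia-smi depends on the driver version.
import os
import re
import subprocess

# Recent drivers print lines like: "  MIG 1g.5gb Device 0: (UUID: MIG-xxxx-...)"
listing = subprocess.run(["nvidia-smi", "-L"],
                         capture_output=True, text=True).stdout
mig_uuids = re.findall(r"UUID:\s*(MIG-[0-9a-fA-F/-]+)", listing)

PROCS_PER_PARTITION = 3
workers = []
for uuid in mig_uuids:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)   # one partition per slot
    for i in range(PROCS_PER_PARTITION):
        workers.append(subprocess.Popen(
            ["python", "rift_worker.py", f"--chunk={i}"], env=env))

for w in workers:
    w.wait()
```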
To summarize…
MIG makes A100 a great GPU
• The A100 GPU is fast
• Too fast for IceCube and LIGO to
fully use the whole GPU!
• Splitting the GPU with MIG
makes it much easier to use
• Required for IceCube
• Convenient for LIGO
• Enables 200% - 600% throughput
on identical HW!
[Chart: relative jobs/day (0% – 600%) for V100, A100, 3x A100 MIG, 7x A100 MIG; series: IceCube Regular, IceCube Fast, LIGO (3x)]
Acknowledgments
This work was partially funded by the
US National Science Foundation (NSF) through grants
OAC-1941481, OAC-1841530, OAC-1826967 and
OPP-1600823.
