Using A100 MIG to
Scale Astronomy Scientific Output
Presented by Igor Sfiligoi – UC San Diego
Prepared in collaboration with David Schultz, Benedikt Riedel, James A. Clark and Frank Würthwein
April 2021
Image Credit: NASA's Goddard Space Flight Center
The night sky may look peaceful.
But there are many really violent
events going on!
This artist's rendering shows the tidal disruption event named ASASSN-14li, where a star wandering too close to a
3-million-solar-mass black hole was torn apart. The debris gathered into an accretion disk around the black hole.
The most violent events
cannot be seen with
“traditional” methods!
Multi-messenger astronomy
High-energy neutrinos
can travel
long distances
in a straight line.
Produce Cherenkov light
when passing through ice.
Really big events can distort space-time
creating gravitational waves.
Can be detected with
laser interferometry.
Image credit: LIGO/T. Pyle
This illustration shows the merger of two black holes and the gravitational
waves that ripple outward as the black holes spiral toward each other.
LIGO Livingston, Louisiana (two 4 km interferometer arms)
Both detectors require extensive
compute power to properly reconstruct
the observed events.
Optical Properties
• Combining all the possible information
• These features are included in the simulation
• We are always developing them further
Nature never gives us a perfect answer,
but we have obtained satisfactory agreement with the data!
The need for calibration
• Natural medium
• Hard to calibrate properly
• We effectively dropped a detector into a grey box
• The ice is very clear, but…
• Is it uniform?
• How has construction changed the ice?
• Drastic changes
in reconstructed position
with different ice models
GPUs great fit for calibration purposes
• Too complex for a parametrized approach
• Needs a brute-force approach
• Basically, ray-tracing (see the sketch below)
• Starting from MC-produced events
• Switched to GPU Ray-Tracing in 2011
• At that point
1x GPU ~ 200x CPU cores
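The production photon-propagation code is far more sophisticated (layered ice, wavelength-dependent properties, GPU kernels); the toy sketch below only illustrates why the problem maps so well onto massively parallel hardware: every photon is propagated independently by the same simple loop. All constants, positions, and names here are invented.

```python
# A toy sketch of brute-force photon propagation in ice.
# NOT IceCube's production code; all constants are made up.
import numpy as np

rng = np.random.default_rng(42)

N_PHOTONS = 100_000
SCATTER_LEN = 25.0                        # metres, hypothetical scattering length
ABSORB_LEN = 100.0                        # metres, hypothetical absorption length
DOM_POS = np.array([0.0, 0.0, -10.0])     # one detector module, made up
DOM_RADIUS = 1.0                          # metres

def random_directions(n):
    """Isotropic unit vectors (real ice scattering is strongly forward-peaked)."""
    theta = np.arccos(rng.uniform(-1.0, 1.0, n))
    phi = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=1)

pos = np.zeros((N_PHOTONS, 3))            # all photons start at the emission vertex
dirs = random_directions(N_PHOTONS)
alive = np.ones(N_PHOTONS, dtype=bool)
hits = 0

for _ in range(200):                      # cap the number of scattering steps
    step = rng.exponential(SCATTER_LEN, N_PHOTONS)
    # Exponential absorption: survive this step with probability exp(-step/ABSORB_LEN).
    alive &= rng.exponential(ABSORB_LEN, N_PHOTONS) > step
    pos[alive] += dirs[alive] * step[alive, None]
    # Count photons that end a step close enough to the (single) module.
    close = alive & (np.linalg.norm(pos - DOM_POS, axis=1) < DOM_RADIUS)
    hits += int(close.sum())
    alive &= ~close
    dirs = random_directions(N_PHOTONS)   # re-scatter the survivors
    if not alive.any():
        break

print(f"{hits} of {N_PHOTONS} photons reached the module")
```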
Really excited about the A100
• Great speed growth in core IceCube compute benchmark
[Chart: IceCube benchmark throughput in Mphotons/s for T4, P100, P40, V100, A100]
P100 = 1.1x T4
V100 = 1.7x T4
A100 = 3.3x T4
A100 = 2.9x P100
A100 = 1.9x V100
(Image credit: Nvidia)
But actual application speed improvement much lower
• IceCube CPU-based code cannot keep the GPU busy!
[Chart: benchmark throughput (Mphotons/s) vs. application throughput (jobs/day, with oversize=3) for T4, P100, P40, V100, A100]
A100 = 1.9x T4
A100 = 1.4x V100
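One way to confirm that the GPU is being starved by CPU-side code (an assumption about how one might check this, not something shown on the slide) is to poll utilization with the NVML Python bindings while a job runs:

```python
# A small monitoring sketch using the pynvml (nvidia-ml-py) bindings:
# poll GPU utilization once a second while an IceCube-style job is running.
# If utilization sits far below 100%, the CPU side is the bottleneck.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)    # first GPU

try:
    for _ in range(60):                          # watch for one minute
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU {util.gpu:3d}%  memory {util.memory:3d}%")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```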
A100 MIG (Multi-Instance GPU) to the rescue
• With 7x MIG partitions we are back to the theoretical speedup
[Chart: benchmark throughput (Mphotons/s) vs. application throughput (jobs/day, with oversize=3) for T4, P100, P40, V100, A100, and 2x/3x/7x A100 MIG]
A100 = 3.5x T4
A100 = 2.5x V100
Actually… even better.
We did try to share a single GPU between multiple jobs, but no real advantage was observed.
With MIG, each job believes it has a full GPU to itself.
A slower GPU per job, but we mostly care about throughput.
A100 MIG (Multi-Instance GPU) to the rescue
• And the story gets even better for faster codes!
[Chart: benchmark throughput (Mphotons/s) vs. application throughput (jobs/day, with oversize=5) for T4, V100, A100, and 2x/3x/7x A100 MIG]
A100 = 4.3x T4
A100 = 4.3x V100
With oversize=5 the V100 was already too fast for IceCube in fast mode.
2x A100 MIG runtime is identical to the non-MIG A100 runtime,
so 2x the HTC throughput (when precision is less important).
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
Relatively easy to set up,
but requires a reboot
to be enabled
(see the sketch below).
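For reference, here is a minimal sketch of the enable-and-partition sequence from the MIG user guide linked above, wrapped in Python for convenience. Profile ID 19 (1g.5gb on an A100-40GB) is an assumption; verify the available profiles with `nvidia-smi mig -lgip` on your driver before relying on it.

```python
# A minimal sketch of enabling MIG and carving an A100 into 7 slices,
# following the NVIDIA MIG user guide linked on this slide.
# Run as root. Profile ID 19 (1g.5gb on A100-40GB) is an assumption —
# check `nvidia-smi mig -lgip` for your driver/GPU combination.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvidia-smi", "-i", "0", "-mig", "1"])        # enable MIG mode on GPU 0
# ... reboot (or reset the GPU) here before the next step takes effect ...
run(["nvidia-smi", "mig", "-i", "0",
     "-cgi", "19,19,19,19,19,19,19", "-C"])        # 7x GPU + compute instances
run(["nvidia-smi", "-L"])                          # list the resulting MIG devices
```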
A few words about LIGO
Gravitational wave parameter estimation
• Gravitational wave data:
• Very noisy time series of strain measurements
• Sources like binary black hole coalescence
induce a well-modelled signal
• Infer source parameters (masses, spins, etc.)
by solving an inverse problem:
• matching waveform model
“templates” to the data
PRL 118, 221101 (2017)
Computationally expensive
• Computing posterior
probability density functions (sketched below)
• 1k – 1M likelihood evaluations
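As a reminder of what is being computed (the textbook Bayesian formulation, not a formula taken from the slides), the posterior over source parameters θ given strain data d is:

```latex
% Standard Bayesian parameter-estimation formulation; h(\theta) is the
% waveform "template" and \langle\cdot,\cdot\rangle the noise-weighted
% inner product over the detector band.
p(\theta \mid d) \;\propto\; \mathcal{L}(d \mid \theta)\,\pi(\theta),
\qquad
\ln \mathcal{L}(d \mid \theta) \;=\;
  -\tfrac{1}{2}\,\bigl\langle\, d - h(\theta),\; d - h(\theta) \,\bigr\rangle
  + \text{const}.
```

Each likelihood evaluation requires generating a waveform and computing a noise-weighted inner product, which is why 1k – 1M evaluations become computationally expensive.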
RIFT = Rapid parameter inference for
gravitational wave sources via Iterative FiTting
• Highly parallelizable, grid-based algorithm:
• Start with discrete grid of waveforms
• Monte-Carlo integrate likelihood over extrinsic,
“nuisance” parameters (e.g., distance, sky-location)
• Gaussian process interpolation
→ approximate continuous likelihood function
• Interpolation ported to GPUs using cupy (see the sketch below)
• 1x Nvidia Quadro P2000 GPU ~ 20x Intel Xeon Silver 4116 CPU cores
• Requires MC input, which is currently CPU-only
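The sketch below illustrates the kind of work that benefits from cupy here: evaluating a Gaussian-process interpolant of the marginal likelihood at a large batch of sample points in a single GPU call. It is a toy under invented assumptions, not RIFT's actual implementation; all array shapes, kernel choices, and values are made up.

```python
# A toy sketch (NOT the RIFT implementation) of GPU-batched evaluation of a
# Gaussian-process interpolant of the marginal log-likelihood with cupy.
import numpy as np
import cupy as cp

def rbf_kernel(x1, x2, length_scale=0.2):
    """Squared-exponential kernel between two sets of parameter points."""
    d2 = (cp.sum(x1 ** 2, axis=1)[:, None]
          + cp.sum(x2 ** 2, axis=1)[None, :]
          - 2.0 * x1 @ x2.T)
    return cp.exp(-0.5 * d2 / length_scale ** 2)

rng = np.random.default_rng(0)

# Hypothetical training set: a grid of intrinsic parameters (e.g. masses)
# with marginal log-likelihoods already computed by the CPU-side MC stage.
grid = cp.asarray(rng.uniform(size=(512, 2)))
lnL_grid = cp.asarray(rng.normal(size=512))

# "Fit" once on the GPU: solve (K + jitter * I) alpha = lnL_grid.
K = rbf_kernel(grid, grid) + 1e-6 * cp.eye(grid.shape[0])
alpha = cp.linalg.solve(K, lnL_grid)

# Evaluate the interpolant at a large batch of proposed sample points.
samples = cp.asarray(rng.uniform(size=(100_000, 2)))
lnL_pred = rbf_kernel(samples, grid) @ alpha        # one batched GPU evaluation
print(float(lnL_pred.mean()))
```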
LIGO is still in the process of transitioning
• All GPUs too fast!
• Workflow heavily single-CPU-core-bound
Image Credit: T. Pyle/Caltech/MIT/LIGO Lab
[Chart: jobs/day for GTX1080TI, V100, A100]
A100 = 1.3x GTX1080TI
A100 ~ V100
GTX1080TI ~ T4
Code crashing on
Turing platforms.
Coming soon, but not yet ready:
GPU-based Monte-Carlo integration
RIFT
Multi-processing is the way to go
• All GPUs still too fast!
• Linear scaling from 1 to 3 processes per GPU
[Chart: jobs/day for GTX1080TI, V100, A100 with 1 process/GPU vs. 3 processes/GPU (~3x)]
A100 = 1.3x GTX1080TI
A100 ~ V100
No reason to stop at
3 processes/GPU,
but it makes scheduling harder
(a minimal launch sketch follows).
Image Credit: T. Pyle/Caltech/MIT/LIGO Lab
RIFT
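For local testing, launching several independent processes on one physical GPU can be as simple as the sketch below. The worker script name and its argument are hypothetical placeholders; how the real jobs are scheduled (batch system, job router, etc.) is not shown here.

```python
# A minimal local sketch of running 3 worker processes on one physical GPU.
# "rift_worker.py" and "--chunk" are hypothetical placeholders.
import os
import subprocess

GPU = "0"             # physical GPU index; a MIG UUID also works (next slide)
PROCS_PER_GPU = 3     # the sweet spot reported on this slide

env = dict(os.environ, CUDA_VISIBLE_DEVICES=GPU)
workers = [
    subprocess.Popen(["python", "rift_worker.py", f"--chunk={i}"], env=env)
    for i in range(PROCS_PER_GPU)
]
for w in workers:
    w.wait()
```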
A100 MIG helps pack more processes
• Each A100 7x-MIG partition is still fast enough to feed 3 processes
• Each MIG partition can be scheduled independently (see the sketch below)
[Chart: jobs/day for GTX1080TI, V100, A100, 3x A100 MIG, 7x A100 MIG at 1 and 3 processes per GPU]
7x more throughput
with marginal per-job
slowdown
Image Credit: SXS, the Simulating eXtreme Spacetimes
RIFT
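As an illustration (an assumption about how one might wire this up locally, not the production scheduling setup), the MIG device UUIDs reported by `nvidia-smi -L` can be used to pin each group of worker processes to its own partition via CUDA_VISIBLE_DEVICES:

```python
# A minimal local sketch: enumerate MIG devices and pin one batch of worker
# processes to each partition via CUDA_VISIBLE_DEVICES.
# "rift_worker.py" is a hypothetical placeholder, and the exact UUID format
# reported by nvidia-smi depends on the driver version.
import os
import re
import subprocess

# Recent drivers print lines like: "  MIG 1g.5gb Device 0: (UUID: MIG-xxxx-...)"
listing = subprocess.run(["nvidia-smi", "-L"],
                         capture_output=True, text=True).stdout
mig_uuids = re.findall(r"UUID:\s*(MIG-[0-9a-fA-F/-]+)", listing)

PROCS_PER_PARTITION = 3
workers = []
for uuid in mig_uuids:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)   # one partition per slot
    for i in range(PROCS_PER_PARTITION):
        workers.append(subprocess.Popen(
            ["python", "rift_worker.py", f"--chunk={i}"], env=env))

for w in workers:
    w.wait()
```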
To summarize…
MIG makes A100 a great GPU
• The A100 GPU is fast
• Too fast for IceCube and LIGO to
fully use the whole GPU!
• Splitting the GPU with MIG
makes it much easier to use
• Required for IceCube
• Convenient for LIGO
• Enables 200% - 600% throughput
on identical HW!
[Chart: relative jobs/day (0% – 600%) for V100, A100, 3x A100 MIG, 7x A100 MIG; series: IceCube Regular, IceCube Fast, LIGO (3x)]
Acknowledgments
This work was partially funded by the
US National Science Foundation (NSF) through grants
OAC-1941481, OAC-1841530, OAC-1826967 and
OPP-1600823.
