Igor Sfiligoi, UC San Diego & SDSC, November 2020
SPEEDING MICROBIOME
RESEARCH BY THREE
ORDERS OF MAGNITUDE
Presented at
NVIDIA Virtual Theater
[Figure: bar chart of wallclock time, in hours, for Original CPU, Xeon Gold, V100, A100 and RTX8000; y-axis from 0 to 32, with the Original CPU bar labeled 8000. Annotations: 6000x, 25x]
2
THE CONTEXT
Microbiome research has expanded over the years from analyzing handfuls of
samples to hundreds of thousands.
What worked at small scale does not work at today's scales!
UniFrac is one such tool and is used for comparing microbiome profiles to one
another. One important field is studying the impact of the gut population of
microbes, which influence diseases ranging from ulcers to heart disease to
autism to COVID-19.
My colleagues at UCSD were hurting due to excessive runtimes of the tool, so
they asked me to explore whether porting it to GPUs would be an advantageous option.
Since I am here to talk, I think you can guess how that worked out.
3
THE SCIENCE
4
WE ARE WHAT WE EAT
Microbiome is critical for health:
• Produces compounds your body needs
which it cannot otherwise produce
• Disruptions in the microbiome are
associated with a range of diseases
• Many non-communicable diseases,
like Alzheimer’s, various cancers,
cardiovascular disease and many
more are associated with the
microbiome
On a more recent theme:
• Many high-risk populations for
COVID-19 also have diseases known to
be associated with the microbiome
https://www.biotechniques.com/multimedia/archive/00252/microbiome2_252150a.jpg
5
KNIGHT LAB AT UCSD LEADING AMERICAN GUT PROJECT
Collecting specimens, DNA sequencing samples, and analyzing the results
Daniel McDonald, Rob Knight
6
7
SAMPLE RELATIONSHIPS
A fundamental component of
microbiome analysis is understanding
how entire microbial communities
relate to each other. This requires
pairwise comparisons of all samples
in a dataset.
DISTANCE MATRIX
8
UNIFRAC DISTANCE
• Incorporates information on the
evolutionary relatedness of community
members by incorporating the phylogeny of
the observed organisms in the computation.
• Other measures, such as Euclidean distance,
implicitly assume all organisms are equally
related.
Lozupone and Knight, Applied and Environmental Microbiology, 2005
Samples whose organisms are very similar
from an evolutionary perspective
will have a small UniFrac distance.
On the other hand, two samples composed of
very different organisms will have
a large UniFrac distance.
A distance metric
https://en.wikipedia.org/wiki/UniFrac
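To make the definition concrete, here is a minimal sketch of the unweighted flavor of UniFrac for two samples: the branch length seen in exactly one sample, divided by the branch length seen in either. It assumes the tree has already been reduced to a flat list of branches; the `Branch` struct and the `unweighted_unifrac` function are illustrative names, not part of the UniFrac code base.

```cpp
#include <cassert>
#include <vector>

// One branch of the phylogenetic tree (hypothetical layout, for illustration only).
struct Branch {
    double length;  // branch length
    bool in_a;      // true if some organism below this branch occurs in sample A
    bool in_b;      // same for sample B
};

// Unweighted UniFrac: branch length unique to one sample divided by the
// branch length observed in at least one of the two samples.
double unweighted_unifrac(const std::vector<Branch>& branches) {
    double unique = 0.0;
    double total = 0.0;
    for (const Branch& b : branches) {
        if (b.in_a != b.in_b) unique += b.length;  // seen in exactly one sample
        if (b.in_a || b.in_b) total += b.length;   // seen in at least one sample
    }
    assert(total > 0.0 && "at least one branch must be observed");
    return unique / total;
}
```

Two samples that share every branch get a distance of 0, and two samples with no branch in common get a distance of 1.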
9
COMPUTING ON GPU
10
STARTING WITH STRIPED UNIFRAC
Recent (2018) algorithm
that is optimized for
both speed and parallelism (on CPUs)
Allowed the microbiome researchers to analyze
tens of thousands of samples from modern
studies.
But going into the 100k range and beyond
was becoming too expensive
Runtime scales approximately quadratically
State of the art as of early 2020
From: Striped UniFrac: enabling microbiome analysis at unprecedented scale
Projected runtimes using the early 2020 CPU-only code
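The quadratic scaling follows from the number of sample pairs in the distance matrix. The sketch below counts those pairs and projects a runtime from one reference measurement, assuming purely quadratic growth; both helper names are made up for this illustration, and measured runtimes may grow somewhat differently.

```cpp
#include <cassert>

// Number of distinct sample pairs in an n-sample distance matrix.
unsigned long long n_pairs(unsigned long long n) {
    return n * (n - 1) / 2;
}

// Project a runtime from a reference measurement (t_ref at n_ref samples),
// under the assumption that runtime grows with the square of the sample count.
double project_runtime(double t_ref, double n_ref, double n) {
    assert(n_ref > 0.0);
    const double ratio = n / n_ref;
    return t_ref * ratio * ratio;
}
```

Under this assumption, going from 25k to 100k samples multiplies the runtime by 16.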
11
COULD PORTING
TO GPU ALLOW US
TO DRASTICALLY
REDUCE THE RUNTIME?
Let me take a look!
Igor Sfiligoi
12
Where is most of the
time spent?
Turns out to be a tight double loop
• With many iterations
• All independent
Conceptually not too far away
from BLAS
• Should fly on GPUs!
Simple stack sampling method
for(unsigned int stripe = start; stripe < stop; stripe++) {
  dm_stripe = dm_stripes[stripe];
  for(unsigned int k = 0; k < n_samples; k++) {
    unsigned int l = (k + stripe + 1)%n_samples;
    double u1 = emb[k];
    double v1 = emb[k + stripe + 1];
    …
    dm_stripe[k] += fabs(u1-v1)*length;
  }
}
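A self-contained version of this loop, for a single tree branch, could look like the sketch below. It assumes that `emb` holds 2 * n_samples values, with the second half repeating the first, so that `k + stripe + 1` never needs to wrap around; the elided part of the slide is simply left out, and the function name and container types are my own choices.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Accumulate one branch's contribution into every stripe.
// dm_stripes[s][k] holds the partial distance between sample k and
// sample (k + s + 1) % n_samples.
void accumulate_branch(std::vector<std::vector<double>>& dm_stripes,
                       const std::vector<double>& emb,  // assumed length: 2 * n_samples
                       double length,
                       unsigned int n_samples) {
    assert(emb.size() == 2u * n_samples);
    assert(dm_stripes.size() <= n_samples);
    for (unsigned int stripe = 0; stripe < dm_stripes.size(); stripe++) {
        std::vector<double>& dm_stripe = dm_stripes[stripe];
        assert(dm_stripe.size() == n_samples);
        for (unsigned int k = 0; k < n_samples; k++) {
            const double u1 = emb[k];
            const double v1 = emb[k + stripe + 1];  // duplicated buffer avoids the modulo
            dm_stripe[k] += std::fabs(u1 - v1) * length;
        }
    }
}
```

Every (stripe, k) iteration writes to a different element and reads only `emb`, which is why the iterations are independent.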
13
OpenACC makes it easy
to have a first port
Almost as easy as adding a decorator
• Too bad arrays of pointers
not well supported
• Thus required a bit of refactoring
Done in less than a week
• A couple days FTE
8x speedup CPU -> GPU (chip vs chip)
#pragma acc parallel loop collapse(2) \
        present(emb,dm_stripes_buf)
for(unsigned int stripe = start; stripe < stop; stripe++) {
  for(unsigned int k = 0; k < n_samples; k++) {
    int idx = (stripe-start_idx)*n_samples;
    double *dm_stripe = dm_stripes_buf+idx;
    unsigned int l = (k + stripe + 1)%n_samples;
    double u1 = emb[k];
    double v1 = emb[k + stripe + 1];
    …
    dm_stripe[k] += fabs(u1-v1)*length;
  }
}
Runtime on 25k samples:
Intel Xeon E5-2680 v4 (using all 14 cores): 800 minutes (13 hours)
NVIDIA Tesla V100 (using all 84 SMs): 92 minutes (1.5 hours)
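The refactoring mentioned above, replacing the array of stripe pointers with one contiguous buffer, can be sketched as follows. This is an illustration under assumptions rather than the actual port: it uses explicit `copyin`/`copy` data clauses instead of `present` so that it stands on its own, it assumes `emb` holds 2 * n_samples values (second half repeating the first), and the function name is hypothetical. Without OpenACC enabled, the pragma is ignored and the code runs as plain C++.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// All stripes live in one flat buffer; element (stripe, k) is at stripe * n_samples + k.
void accumulate_branch_flat(std::vector<double>& dm_stripes_buf,  // n_stripes * n_samples
                            const std::vector<double>& emb_vec,   // assumed 2 * n_samples
                            double length,
                            unsigned int n_stripes,
                            unsigned int n_samples) {
    assert(dm_stripes_buf.size() == static_cast<std::size_t>(n_stripes) * n_samples);
    assert(emb_vec.size() == 2u * n_samples);
    assert(n_stripes <= n_samples);
    double* dm = dm_stripes_buf.data();  // raw pointers for the data clauses
    const double* emb = emb_vec.data();
    const unsigned long dm_len = static_cast<unsigned long>(n_stripes) * n_samples;
    const unsigned long emb_len = 2ul * n_samples;

#pragma acc parallel loop collapse(2) copyin(emb[0:emb_len]) copy(dm[0:dm_len])
    for (unsigned int stripe = 0; stripe < n_stripes; stripe++) {
        for (unsigned int k = 0; k < n_samples; k++) {
            const unsigned long idx = static_cast<unsigned long>(stripe) * n_samples + k;
            dm[idx] += std::fabs(emb[k] - emb[k + stripe + 1]) * length;
        }
    }
}
```

A flat buffer needs only one data clause and plain index arithmetic, which is easier for the compiler to map to the device than a pointer per stripe.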
14
But how it is used is just
as important
The emb input buffers must be prepared
for each invocation
• Data movement latency
Main buffer all traversed every time
• No cache reuse
Large number of invocations
• GPU invocation overhead
initialize(dm_stripe_buf);
#pragma acc data copy(dm_stripe_buf)
for(unsigned int k = 0; k < (tree.nparens / 2) - 1; k++) {
  // must be run sequentially
  // on CPU, logic and deep function nesting
  // rewrites all of emb buffer
  embed(emb, tree, k);
  // on GPU
  #pragma acc data copyin(emb)
  run_loop(dm_stripe_buf, emb, tree.getlen(k));
}
return dm_stripe_buf;
Bad for both
CPU and GPU code-paths
15
Batching to the rescue
Batching many emb buffers
• Improves memory locality
for main buffer
• Reduces GPU invocation overhead
and allows for overlap with CPU
• At the expense of more memory use
Cache-awareness in loop becomes very
important
Additional 8x speedup on GPU (total 64x)
And a decent 4x speedup on CPU
#pragma acc parallel loop collapse(3) async \
        present(emb,dm_stripes_buf,length)
for(sk) { // swap order and tile
  for(stripe) {
    for(unsigned int ik = 0; ik < step_size ; ik++) {
      unsigned int k = sk*step_size + ik;
      unsigned int l = (k + stripe + 1)%n_samples;
      …
      double my_stripe = dm_stripe[k];
      #pragma acc loop seq
      for (unsigned int e=0; e<filled_embs; e++) {
        uint64_t offset = n_samples*e;
        double u = emb[offset+k];
        double v = emb[offset+k+stripe+ 1];
        my_stripe += fabs(u-v)*length[e];
      }
      …
      dm_stripe[k] = my_stripe;
    }
  }
}
#ifdef _OPENACC
std::swap(emb,emb_alt);
#endif
Runtime on 25k samples:
Intel Xeon E5-2680 v4 (using all 14 cores): 193 minutes (3.2 hours)
NVIDIA Tesla V100 (using all 84 SMs): 12 minutes
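The batching idea can be sketched in plain C++ as below: each element of the stripe buffer is loaded once, updated with every embedding in the batch, and written back once. This is a simplified illustration, not the production kernel; it leaves out the tiling over `sk` and the OpenACC directives, and it indexes the second sample with the wrapped index `l` instead of relying on a duplicated buffer. The function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Apply a whole batch of embeddings in a single pass over the stripe buffer.
void accumulate_batch(std::vector<double>& dm_stripes_buf,  // n_stripes * n_samples
                      const std::vector<double>& emb,       // filled_embs * n_samples
                      const std::vector<double>& length,    // one branch length per embedding
                      unsigned int n_stripes,
                      unsigned int n_samples) {
    const std::uint64_t filled_embs = length.size();
    assert(dm_stripes_buf.size() == static_cast<std::uint64_t>(n_stripes) * n_samples);
    assert(emb.size() == filled_embs * n_samples);
    for (unsigned int stripe = 0; stripe < n_stripes; stripe++) {
        for (unsigned int k = 0; k < n_samples; k++) {
            const unsigned int l = (k + stripe + 1) % n_samples;
            const std::uint64_t idx = static_cast<std::uint64_t>(stripe) * n_samples + k;
            double my_stripe = dm_stripes_buf[idx];  // running sum stays in a local variable
            for (std::uint64_t e = 0; e < filled_embs; e++) {
                const std::uint64_t offset = n_samples * e;
                const double u = emb[offset + k];
                const double v = emb[offset + l];
                my_stripe += std::fabs(u - v) * length[e];
            }
            dm_stripes_buf[idx] = my_stripe;  // single write-back per element
        }
    }
}
```

With a batch of B embeddings, the main buffer is traversed once instead of B times, at the cost of holding B embedding rows in memory.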
16
WE WERE PRETTY HAPPY WITH SPEEDUP
Switching to fp32 gave an additional boost
80x
600x
Spring 2020
17
RETHINKING THE
ALGORITHM
18
SEVERAL FLAVORS OF UNIFRAC
There are several versions of UniFrac
Two of the popular ones are true FP compute (like previous slides)
One is binary in nature
Expected binary version to be significantly faster
But was not!
19
Binary operations
only in tight loop
Moreover, the same emb buffer
is read multiple times
• FP -> bool conversion every single time
Full FP logic still needed
#pragma acc parallel loop …
for(sk) {
  for(stripe) {
    for(unsigned int ik = 0; ik < step_size ; ik++) {
      unsigned int k = sk*step_size + ik;
      unsigned int l = (k + stripe + 1)%n_samples;
      …
      double my_stripe = dm_stripe[k];
      for (unsigned int e=0; e<filled_embs; e++) {
        uint64_t offset = n_samples*e;
        bool u = emb[offset+k]>0;
        bool v = emb[offset+k+stripe+ 1]>0;
        my_stripe += (u^v)*length[e];
      }
      …
      dm_stripe[k] = my_stripe;
    }
  }
}
20
BINARY PRE-PROCESSING AND PACKING
Computing FP -> bool before invoking the loop saves a lot of compute
Packing 8 bools into a single UINT8 saves a lot of memory (size and access)
I can pre-compute all 256 combinations, too
• Just memory lookup and sums in loop now
Runtime on 25k samples:
NVIDIA Tesla V100 (using all 84 SMs): 2.5 minutes
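The two steps described above, packing 8 presence flags into one byte and pre-computing the sum of branch lengths for all 256 byte values, can be sketched as follows. The names `pack8` and `build_psum` are hypothetical, and the sketch handles a single group of 8 embeddings.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Pack 8 presence flags into one byte; bit i corresponds to flags[i].
std::uint8_t pack8(const std::array<bool, 8>& flags) {
    std::uint8_t packed = 0;
    for (unsigned int i = 0; i < 8; i++) {
        if (flags[i]) packed |= static_cast<std::uint8_t>(1u << i);
    }
    return packed;
}

// For every possible byte value, pre-compute the sum of the lengths whose bit is set.
std::array<double, 256> build_psum(const std::array<double, 8>& lengths) {
    std::array<double, 256> psum{};
    for (unsigned int b = 0; b < 256; b++) {
        double sum = 0.0;
        for (unsigned int i = 0; i < 8; i++) {
            sum += ((b >> i) & 1u) * lengths[i];
        }
        psum[b] = sum;
    }
    return psum;
}
```

Inside the tight loop, the contribution of 8 embeddings then reduces to one XOR of two packed bytes and one table lookup, `psum[u ^ v]`.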
21
LOTS OF ZEROES EVERYWHERE!
Since I have only 256 combinations, I get curious and check the distribution
• >90% of the time it is a 0!
We were doing a huge amount of NOOP compute (add by zero)
• The emb buffer is basically a sparse matrix!
Using UINT64 and adding a simple if (!=zero)
gets me another 3x speedup
• Ran out of time for further optimizations
Runtime on 25k samples:
NVIDIA Tesla V100 (using all 84 SMs): 45 seconds
Basically a sparse matrix problem
22
#pragma acc parallel loop …
for(sk) {
  for(stripe) {
    for(unsigned int ik = 0; ik < step_size ; ik++) {
      unsigned int k = sk*step_size + ik;
      unsigned int l = (k + stripe + 1)%n_samples;
      …
      double my_stripe = dm_stripe[k];
      for (unsigned int e=0; e<filled_embs; e++) {
        uint64_t offset = n_samples*e;
        uint64_t u1 = emb_packed[offset+k];
        uint64_t v1 = emb_packed[offset+k+stripe+ 1];
        uint64_t o1 = u1 | v1;
        if (o1!=0) { // zeros are prevalent
          my_stripe += psum[ (o1 & 0xff)] +
                       psum[0x100+((o1 >> 8) & 0xff)] +
                       …
                       psum[0x700+((o1 >> 56) )];
        }
      }
      …
      dm_stripe[k] = my_stripe;
    }
  }
}
#pragma acc parallel loop …
for (unsigned int emb_el=0; emb_el<embs_els; emb_el++) {
  for (unsigned int sub8=0; sub8<8; sub8++) {
    unsigned int emb8 = emb_el*8+sub8;
    TFloat * psum = &(sums[emb8<<8]);
    TFloat * pl = &(lengths[emb8*8]);
    for (unsigned int b8_i=0; b8_i<0x100; b8_i++) {
      psum[b8_i] = (((b8_i >> 0) & 1) * pl[0]) +
                   (((b8_i >> 1) & 1) * pl[1]) +
                   …
                   (((b8_i >> 7) & 1) * pl[7]);
    }
  }
}
Sparse packed version
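A runnable sketch of the packed lookup with the zero check is shown below. It covers a single 64-bit word: eight tables of 256 entries, one per byte of the word, and an early return when the word is zero. The function names are hypothetical; the caller is assumed to pass in the already combined word (for example the XOR of the two packed samples).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build 8 lookup tables of 256 entries; table b covers bits [8*b, 8*b+7] of the word.
std::vector<double> build_psum64(const std::vector<double>& lengths) {
    assert(lengths.size() == 64);
    std::vector<double> psum(8 * 256, 0.0);
    for (unsigned int b = 0; b < 8; b++) {
        for (unsigned int v = 0; v < 256; v++) {
            for (unsigned int i = 0; i < 8; i++) {
                if ((v >> i) & 1u) psum[(b << 8) + v] += lengths[8 * b + i];
            }
        }
    }
    return psum;
}

// Sum the lengths of all set bits in x using one table lookup per byte.
double packed_sum(std::uint64_t x, const std::vector<double>& psum) {
    assert(psum.size() == 8 * 256);
    if (x == 0) return 0.0;  // zeros are prevalent: skip all eight lookups
    double sum = 0.0;
    for (unsigned int b = 0; b < 8; b++) {
        sum += psum[(b << 8) + ((x >> (8 * b)) & 0xff)];
    }
    return sum;
}
```

Because most words are zero, the single comparison replaces eight lookups and seven additions in the common case.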
23
WORKS EVEN BETTER ON LARGER PROBLEMS
Samples                   25k           50k            115k          300k
Original, Xeon Gold CPU   30k seconds   2.5k minutes   30k minutes   8k hours
Latest, Xeon Gold CPU     290 seconds   16.5 minutes   180 minutes   33 hours
Latest, V100 GPU          45 seconds    2.2 minutes    13 minutes    1.9 hours
Latest, A100 GPU          33 seconds    1.72 minutes   9.8 minutes   1.4 hours
Latest, RTX8000 GPU       29 seconds    1.58 minutes   9.4 minutes   1.3 hours
Speedup over original     1000x         1500x          3000x         6000x
24
CPU speed now often
the limiting factor
Original code was single threaded
• Relying on partitioning of problem
• GPUs prefer full problem in loop
Using OpenMP for CPU parallelization
• Together with OpenACC for GPUs
make for a great pair
GPU compute just so fast!
initialize(dm_stripe_buf);
#pragma acc data copy(dm_stripe_buf)
for(unsigned int k0 = 0; k0 < (tree.nparens / 2) - 1; k0+=chunk) {
  #pragma omp parallel for
  for (unsigned int i=0; i<chunk; i+=64) {
    embed_packed(emb[i], tree, k0+i);
    fill_lengths(lengths, tree, k0+i);
  }
  #pragma acc update device(emb)
  #pragma acc wait
  #pragma acc data copyin(lengths)
  run_loop(dm_stripe_buf, emb, lengths);
}
#pragma acc wait
return dm_stripe_buf;
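The pairing of OpenMP and OpenACC can be sketched as below: the CPU fills a chunk of embedding rows in parallel, then one batched loop consumes the whole chunk. This is a toy illustration under assumptions: `embed_stub` is a hypothetical stand-in for the real embedding step, every branch length is set to 1, and the data clauses copy the buffers on every chunk, whereas the slide keeps the distance buffer on the device throughout. Without OpenMP or OpenACC enabled, the pragmas are ignored and the code runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the real embedding step: fills one row with toy data.
static void embed_stub(double* row, unsigned int n_samples, unsigned int branch) {
    for (unsigned int k = 0; k < n_samples; k++) row[k] = (k + branch) % 3;
}

std::vector<double> run_chunked(unsigned int n_branches, unsigned int n_samples,
                                unsigned int n_stripes, unsigned int chunk) {
    assert(chunk > 0 && n_stripes <= n_samples);
    std::vector<double> dm_vec(static_cast<std::size_t>(n_stripes) * n_samples, 0.0);
    std::vector<double> emb_vec(static_cast<std::size_t>(chunk) * n_samples);
    std::vector<double> len_vec(chunk);
    double* dm = dm_vec.data();
    double* emb = emb_vec.data();
    double* lengths = len_vec.data();
    const unsigned long dm_len = dm_vec.size();
    const unsigned long emb_len = emb_vec.size();

    for (unsigned int k0 = 0; k0 < n_branches; k0 += chunk) {
        const unsigned int filled = std::min(chunk, n_branches - k0);

        // CPU side: each row of the batch is independent, so OpenMP can split them.
#pragma omp parallel for
        for (int i = 0; i < static_cast<int>(filled); i++) {
            embed_stub(emb + static_cast<std::size_t>(i) * n_samples, n_samples, k0 + i);
            lengths[i] = 1.0;  // toy branch length
        }

        // Accelerator side: one batched kernel per chunk.
#pragma acc parallel loop collapse(2) copyin(emb[0:emb_len], lengths[0:chunk]) copy(dm[0:dm_len])
        for (unsigned int stripe = 0; stripe < n_stripes; stripe++) {
            for (unsigned int k = 0; k < n_samples; k++) {
                const unsigned int l = (k + stripe + 1) % n_samples;
                double my_stripe = dm[stripe * n_samples + k];
                for (unsigned int e = 0; e < filled; e++) {
                    const double u = emb[e * n_samples + k];
                    const double v = emb[e * n_samples + l];
                    my_stripe += std::fabs(u - v) * lengths[e];
                }
                dm[stripe * n_samples + k] = my_stripe;
            }
        }
    }
    return dm_vec;
}
```

The result does not depend on the chunk size, so the chunk can be tuned freely to balance memory use against invocation overhead.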
25
IN SUMMARY
26
THE PORTING TO GPUS WAS A MAJOR SUCCESS
Most of the time spent in a tight loop
Easy to port to GPUs using OpenACC
But deep understanding of code critical for maximum speedup
Gained significantly more from better algorithm than better HW
Having a single code set between CPU and GPU
helped a lot in improving both code paths
Optimizing one side usually led to discoveries for the other
GPUs still way faster than CPUs, so HW does matter
Way beyond our greatest hopes
6000x
27
ENABLING SCIENCE THAT WOULD OTHERWISE NOT BE POSSIBLE
300k samples now computed in about an hour on a single node, versus a heroic HPC job
6000x
Supported by NSF grants DBI-2038509, OAC-1826967, OAC-1541349 and CNS-1730158, and NIH grant DP1-AT010885.
28
ACKNOWLEDGMENTS
This work was partially funded by US National Science Foundation (NSF) grants
DBI-2038509, OAC-1826967, OAC-1541349 and CNS-1730158, and by
US National Institutes of Health (NIH) grant DP1-AT010885.

Speeding Microbiome Research by Three Orders of Magnitude

  • 1. Igor Sfiligoi, UC San Diego & SDSC, November 2020 SPEEDING MICROBIOME RESEARCH BY THREE ORDERS OF MAGNITUDE Presented at NVIDIA Virtual Theater 0 4 8 12 16 20 24 28 32 Original CPU Xeon Gold V100 A100 RTX8000 Wallclock time, in hours 8000 6000x 25x
  • 2. 2 THE CONTEXT Microbiome research has expanded over the years from analyzing handfuls of samples to hundreds of thousands. What worked at small scale, does not work at the more recent scales! UniFrac is one such tool and is used for comparing microbiome profiles to one another. One important field is studying the impact of the gut population of microbes, which influence diseases ranging from ulcers to heart disease to autism to COVID-19. My collogues at UCSD were hurting due to excessive runtimes of the tool, so they asked me to explore if porting it to GPUs would be an advantageous option. Since I am here here to talk, I think you will guess how that worked out.
  • 4. 4 WE ARE WHAT WE EAT Microbiome is critical for health: • Produces compounds your body needs which it cannot otherwise produce • Disruptions in the microbiome are associated with a range of diseases • Many non-communicable diseases, like Alzheimer’s, various cancers, Cardiovascular disease and much more are associated with the microbiome On a more recent theme: • Many high-risk populations for COVID19 also have diseases known to be associated with the microbiome https://www.biotechniques.com/multimedia/archive/00252/microbiome2_252150a.jpg
  • 5. 5 KNIGHT LAB AT UCSD LEADING AMERICAN GUT PROJECT Collecting specimens, DNA sequencing samples, and analyzing the results Daniel McDonaldRob Knight
  • 6. 6
  • 7. 7 SAMPLES RELATIONSHIPS A fundamental component to microbiome analysis is understanding how entire microbial communities relate to each other. This requires pairwise comparisons of all samples in a dataset DISTANCE MATRIX
  • 8. UNIFRAC DISTANCE • Incorporates information on the evolutionary relatedness of community members by incorporating the phylogeny of the observed organisms in the computation. • Other measures, such as Euclidean distance, implicitly assume all organisms are equally related. (Lozupone and Knight, Applied and Environmental Microbiology, 2005.) Samples whose organisms are very similar from an evolutionary perspective will have a small UniFrac distance. On the other hand, two samples composed of very different organisms will have a large UniFrac distance. A distance metric: https://en.wikipedia.org/wiki/UniFrac
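To make the definition concrete, here is a minimal sketch of the unweighted flavor of UniFrac for a single pair of samples. It assumes the tree has already been reduced to a flat list of branches with a presence flag per sample; the `Branch` struct and the `unweighted_unifrac` function are hypothetical names used for illustration only, not part of the UniFrac code base.

```cpp
#include <cassert>
#include <vector>

// Hypothetical flattened view of a phylogenetic tree: one entry per branch.
struct Branch {
    double length;  // branch length
    bool in_a;      // true if an organism below this branch occurs in sample A
    bool in_b;      // same for sample B
};

// Unweighted UniFrac: branch length unique to one sample, divided by the
// branch length observed in at least one of the two samples.
double unweighted_unifrac(const std::vector<Branch>& branches) {
    double unique = 0.0, total = 0.0;
    for (const Branch& b : branches) {
        if (b.in_a != b.in_b) unique += b.length;  // present in exactly one sample
        if (b.in_a || b.in_b) total += b.length;   // present in at least one sample
    }
    assert(total > 0.0 && "at least one branch must be observed");
    return unique / total;
}
```

Two samples that share every observed branch get a distance of 0, and two samples with no branch in common get 1. A full analysis repeats this for every pair of samples, which is where the distance matrix comes from.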
  • 10. STARTING WITH STRIPED UNIFRAC A recent (2018) algorithm that is optimized for both speed and parallelism (on CPUs). It allowed microbiome researchers to analyze tens of thousands of samples from modern studies, but going into the 100k range and beyond was becoming too expensive. Runtime scales approximately quadratically. State of the art as of early 2020. [Figure: projected runtimes using the early 2020 CPU-only code. From: Striped UniFrac: enabling microbiome analysis at unprecedented scale]
  • 11. COULD PORTING TO GPU ALLOW US TO DRASTICALLY REDUCE THE RUNTIME? Let me take a look! Igor Sfiligoi
  • 12. Where is most of the time spent? Turns out to be a tight double loop • With many iterations • All independent. Conceptually not too far away from BLAS • Should fly on GPUs! (Simple stack sampling method.)
        for(unsigned int stripe = start; stripe < stop; stripe++) {
          dm_stripe = dm_stripes[stripe];
          for(unsigned int k = 0; k < n_samples; k++) {
            unsigned int l = (k + stripe + 1)%n_samples;
            double u1 = emb[k];
            double v1 = emb[k + stripe + 1];
            …
            dm_stripe[k] += fabs(u1-v1)*length;
          }
        }
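The slide elides part of the loop body, so the following is only a self-contained sketch of the same striped accumulation, not the original function. It assumes that the `emb` values are stored twice in a row (here called `emb2`, a hypothetical name), which is one way to make `emb[k + stripe + 1]` valid without a modulo; `accumulate_stripes` is likewise a made-up name.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Accumulate one embedding (one tree branch) into the striped distance matrix.
// emb2 holds the per-sample values twice in a row (length 2*n_samples), so
// emb2[k + stripe + 1] is the value of sample (k + stripe + 1) % n_samples.
void accumulate_stripes(std::vector<std::vector<double>>& dm_stripes,
                        const std::vector<double>& emb2,
                        double length, unsigned int n_samples) {
    assert(emb2.size() == 2u * n_samples);
    assert(dm_stripes.size() <= n_samples);
    for (unsigned int stripe = 0; stripe < dm_stripes.size(); stripe++) {
        std::vector<double>& dm_stripe = dm_stripes[stripe];
        for (unsigned int k = 0; k < n_samples; k++) {
            double u1 = emb2[k];
            double v1 = emb2[k + stripe + 1];
            // Every (stripe, k) pair touches its own output element,
            // so all iterations are independent.
            dm_stripe[k] += std::fabs(u1 - v1) * length;
        }
    }
}
```

Stripe `s` holds the distances between sample `k` and sample `(k + s + 1) % n_samples`, so roughly `n_samples / 2` stripes cover every pair once.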
  • 13. OpenACC makes it easy to have a first port. Almost as easy as adding a decorator • Too bad arrays of pointers are not well supported • Thus required a bit of refactoring. Done in less than a week • A couple of days FTE. 8x speedup CPU -> GPU (chip vs chip).
        #pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf)
        for(unsigned int stripe = start; stripe < stop; stripe++) {
          for(unsigned int k = 0; k < n_samples; k++) {
            int idx =(stripe-start_idx)*n_samples;
            double *dm_stripe =dm_stripes_buf+idx;
            unsigned int l = (k + stripe + 1)%n_samples;
            double u1 = emb[k];
            double v1 = emb[k + stripe + 1];
            …
            dm_stripe[k] += fabs(u1-v1)*length;
          }
        }
        Runtime on 25k samples:
        Intel Xeon E5-2680 v4 (using all 14 cores): 800 minutes (13 hours)
        NVIDIA Tesla V100 (using all 84 SMs): 92 minutes (1.5 hours)
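The refactoring mentioned here, replacing the array of stripe pointers with one flat buffer, can be sketched as follows. This is an illustration under assumptions, not the original code: the function name `accumulate_stripes_flat` and the doubled `emb2` layout are hypothetical, and the sketch uses `copy`/`copyin` clauses instead of the slide's `present(...)` so that it does not depend on an enclosing data region. The `#ifdef _OPENACC` guard is also an addition, so the same source builds with a plain C++ compiler.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// All stripes are stored back to back in one flat buffer, so a single
// pointer (plus a size) is all the GPU needs to know about.
void accumulate_stripes_flat(double* dm_stripes_buf, const double* emb2,
                             double length, unsigned int n_stripes,
                             unsigned int n_samples) {
    assert(n_stripes <= n_samples);
#ifdef _OPENACC
#pragma acc parallel loop collapse(2) \
    copy(dm_stripes_buf[0:n_stripes*n_samples]) copyin(emb2[0:2*n_samples])
#endif
    for (unsigned int stripe = 0; stripe < n_stripes; stripe++) {
        for (unsigned int k = 0; k < n_samples; k++) {
            // Row-major index replaces the dm_stripes[stripe][k] pointer chase.
            unsigned long idx = (unsigned long)stripe * n_samples + k;
            dm_stripes_buf[idx] += std::fabs(emb2[k] - emb2[k + stripe + 1]) * length;
        }
    }
}
```

Copying the buffers in and out on every call is exactly the kind of overhead the later slides remove, so this form is only a reasonable starting point.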
  • 14. But how it is used is just as important. The emb input buffers must be prepared for each invocation • Data movement latency. Main buffer all traversed every time • No cache reuse. Large number of invocations • GPU invocation overhead. Bad for both CPU and GPU code-paths.
        initialize(dm_stripe_buf);
        #pragma acc data copy(dm_stripe_buf)
        for(unsigned int k = 0; k < (tree.nparens / 2) - 1; k++) {
          // must be run sequentially
          // on CPU, logic and deep function nesting
          // rewrites all of emb buffer
          embed(emb, tree, k);
          // on GPU
          #pragma acc data copyin(emb)
          run_loop(dm_stripe_buf, emb, tree.getlen(k));
        }
        return dm_stripe_buf;
  • 15. Batching to the rescue. Batching many emb buffers • Improves memory locality for main buffer • Reduces GPU invocation overhead and allows for overlap with CPU • At the expense of more memory use. Cache-awareness in loop becomes very important. Additional 8x speedup on GPU (total 64x). And a decent 4x speedup on CPU.
        #pragma acc parallel loop collapse(3) async present(emb,dm_stripes_buf,length)
        for(sk) { // swap order and tile
          for(stripe) {
            for(unsigned int ik = 0; ik < step_size ; ik++) {
              unsigned int k = sk*step_size + ik;
              unsigned int l = (k + stripe + 1)%n_samples;
              …
              double my_stripe = dm_stripe[k];
              #pragma acc loop seq
              for (unsigned int e=0; e<filled_embs; e++) {
                uint64_t offset = n_samples*e;
                double u = emb[offset+k];
                double v = emb[offset+k+stripe+ 1];
                my_stripe += fabs(u-v)*length[e];
              }
              …
              dm_stripe[k] += my_stripe;
            }
          }
        }
        #ifdef _OPENACC
        std::swap(emb,emb_alt);
        #endif
        Runtime on 25k samples:
        Intel Xeon E5-2680 v4 (using all 14 cores): 193 minutes (3.2 hours)
        NVIDIA Tesla V100 (using all 84 SMs): 12 minutes
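The core of the batching idea, visiting each distance matrix element once per batch and looping over the embeddings innermost, can be sketched without the tiling and the async double buffering shown on the slide. The function name `accumulate_batch` and the layout (each embedding stored twice in a row, embeddings back to back) are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Process a whole batch of embeddings in one pass over the distance matrix.
// embs holds lengths.size() embeddings back to back, each of length
// 2*n_samples (values stored twice, so k + stripe + 1 needs no modulo).
void accumulate_batch(std::vector<double>& dm_stripes_buf,
                      const std::vector<double>& embs,
                      const std::vector<double>& lengths,
                      unsigned int n_stripes, unsigned int n_samples) {
    const std::size_t filled_embs = lengths.size();
    const std::size_t emb_len = 2u * static_cast<std::size_t>(n_samples);
    assert(embs.size() == filled_embs * emb_len);
    assert(n_stripes <= n_samples);
    assert(dm_stripes_buf.size() >= static_cast<std::size_t>(n_stripes) * n_samples);
    for (unsigned int stripe = 0; stripe < n_stripes; stripe++) {
        for (unsigned int k = 0; k < n_samples; k++) {
            // Keep the running sum in a local variable, so the main buffer is
            // updated once per batch instead of once per embedding.
            double my_stripe = 0.0;
            for (std::size_t e = 0; e < filled_embs; e++) {
                const std::size_t offset = emb_len * e;
                const double u = embs[offset + k];
                const double v = embs[offset + k + stripe + 1];
                my_stripe += std::fabs(u - v) * lengths[e];
            }
            dm_stripes_buf[static_cast<std::size_t>(stripe) * n_samples + k] += my_stripe;
        }
    }
}
```

The trade-off is the one named on the slide: the batch of embeddings has to fit in memory, in exchange for far fewer passes over the large output buffer and far fewer kernel launches.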
  • 16. WE WERE PRETTY HAPPY WITH SPEEDUP (Spring 2020) Switching to fp32 added an additional boost. [Chart labels: 80x, 600x]
  • 18. SEVERAL FLAVORS OF UNIFRAC There are several versions of UniFrac. Two of the popular ones are true FP compute (like previous slides). One is binary in nature. Expected the binary version to be significantly faster. But it was not!
  • 19. Binary operations only in tight loop. Moreover, the same emb buffer is being read multiple times • FP -> bool conversion every single time. Full FP logic still needed.
        #pragma acc parallel loop …
        for(sk) {
          for(stripe) {
            for(unsigned int ik = 0; ik < step_size ; ik++) {
              unsigned int k = sk*step_size + ik;
              unsigned int l = (k + stripe + 1)%n_samples;
              …
              double my_stripe = dm_stripe[k];
              for (unsigned int e=0; e<filled_embs; e++) {
                uint64_t offset = n_samples*e;
                bool u = emb[offset+k]>0;
                bool v = emb[offset+k+stripe+ 1]>0;
                my_stripe += (u^v)*length[e];
              }
              …
              dm_stripe[k] += my_stripe;
            }
          }
        }
  • 20. BINARY PRE-PROCESSING AND PACKING Computing FP -> bool before invoking the loop saves a lot of compute. Packing 8 bools into a single UINT8 saves a lot of memory (size and access). I can pre-compute all 256 combinations, too • Just memory lookup and sums in loop now. Runtime on 25k samples, NVIDIA Tesla V100 (using all 84 SMs): 2.5 minutes
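A minimal sketch of the two steps described on this slide: packing eight presence flags into one byte, and precomputing the sum of branch lengths for each of the 256 possible bytes. The helper names `pack8` and `make_psum` are hypothetical, and the bit order (embedding `i` in bit `i`) is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pack the presence flags of up to 8 embeddings (for one sample) into a byte:
// bit i is set when embedding i has a value > 0 for that sample.
std::uint8_t pack8(const double* values, unsigned int count) {
    assert(count <= 8);
    std::uint8_t packed = 0;
    for (unsigned int i = 0; i < count; i++) {
        if (values[i] > 0) packed |= static_cast<std::uint8_t>(1u << i);
    }
    return packed;
}

// Precompute, for every possible byte, the sum of the branch lengths
// whose bit is set in that byte.
std::vector<double> make_psum(const double (&lengths)[8]) {
    std::vector<double> psum(256, 0.0);
    for (unsigned int b = 0; b < 256; b++) {
        for (unsigned int i = 0; i < 8; i++) {
            if ((b >> i) & 1u) psum[b] += lengths[i];
        }
    }
    return psum;
}
```

With both in place, the contribution of eight embeddings to one pair of samples becomes a single lookup such as `psum[u ^ v]`, where `u` and `v` are the packed bytes of the two samples.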
  • 21. LOTS OF ZEROES EVERYWHERE! Since I have only 256 combinations, I get curious and check the distribution • >90% of the time it is a 0! We were doing a huge amount of NOOP compute (add by zero) • The emb buffer is basically a sparse matrix! Using UINT64 and adding a simple if (!=zero) gets me another 3x speedup • Ran out of time for further optimizations. Basically a sparse matrix problem. Runtime on 25k samples, NVIDIA Tesla V100 (using all 84 SMs): 45 seconds
  • 22. Sparse packed version
        #pragma acc parallel loop …
        for(sk) {
          for(stripe) {
            for(unsigned int ik = 0; ik < step_size ; ik++) {
              unsigned int k = sk*step_size + ik;
              unsigned int l = (k + stripe + 1)%n_samples;
              …
              double my_stripe = dm_stripe[k];
              for (unsigned int e=0; e<filled_embs; e++) {
                uint64_t offset = n_samples*e;
                uint64_t u = emb_packed[offset+k];
                uint64_t v = emb_packed[offset+k+stripe+ 1];
                uint64_t o1 = u | v;
                if (o1!=0) { // zeros are prevalent
                  my_stripe += psum[ (o1 & 0xff)] +
                               psum[0x100+((o1 >> 8) & 0xff)] +
                               …
                               psum[0x700+((o1 >> 56) )];
                }
              }
              …
              dm_stripe[k] += my_stripe;
            }
          }
        }

        #pragma acc parallel loop …
        for (unsigned int emb_el=0; emb_el<embs_els; emb_el++) {
          for (unsigned int sub8=0; sub8<8; sub8++) {
            unsigned int emb8 = emb_el*8+sub8;
            TFloat * psum = &(sums[emb8<<8]);
            TFloat * pl = &(lengths[emb8*8]);
            for (unsigned int b8_i=0; b8_i<0x100; b8_i++) {
              psum[b8_i] = (((b8_i >> 0) & 1) * pl[0]) +
                           (((b8_i >> 1) & 1) * pl[1]) +
                           …
                           (((b8_i >> 7) & 1) * pl[7]);
            }
          }
        }
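The lookup with the zero shortcut can be isolated into a small, testable sketch. It assumes 64 presence bits per word and eight 256-entry tables stored back to back, mirroring the `psum[0x100 + …]` indexing on the slide; the function names `masked_length_sum` and `make_psum64` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sum the branch lengths selected by a 64-bit mask, using eight 256-entry
// lookup tables stored back to back in psum (table j covers bits 8j..8j+7).
double masked_length_sum(std::uint64_t mask, const std::vector<double>& psum) {
    assert(psum.size() == 8 * 256);
    if (mask == 0) return 0.0;  // the common case: skip all eight lookups
    double sum = 0.0;
    for (unsigned int j = 0; j < 8; j++) {
        sum += psum[j * 256 + ((mask >> (8 * j)) & 0xff)];
    }
    return sum;
}

// Build the eight lookup tables from 64 branch lengths.
std::vector<double> make_psum64(const std::vector<double>& lengths) {
    assert(lengths.size() == 64);
    std::vector<double> psum(8 * 256, 0.0);
    for (unsigned int j = 0; j < 8; j++)
        for (unsigned int b = 0; b < 256; b++)
            for (unsigned int i = 0; i < 8; i++)
                if ((b >> i) & 1u) psum[j * 256 + b] += lengths[8 * j + i];
    return psum;
}
```

When more than 90% of the masks are zero, the single comparison replaces eight table reads and eight additions most of the time, which is consistent with the extra speedup reported on the previous slide.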
  • 23. WORKS EVEN BETTER ON LARGER PROBLEMS
        Samples                    25k           50k            115k           300k
        Original, Xeon Gold CPU    30k seconds   2.5k minutes   30k minutes    8k hours
        Latest, Xeon Gold CPU      290 seconds   16.5 minutes   180 minutes    33 hours
        Latest, V100 GPU           45 seconds    2.2 minutes    13 minutes     1.9 hours
        Latest, A100 GPU           33 seconds    1.72 minutes   9.8 minutes    1.4 hours
        Latest, RTX8000 GPU        29 seconds    1.58 minutes   9.4 minutes    1.3 hours
        Speedup                    1000x         1500x          3000x          6000x
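The speedup labels on this slide can be checked against the runtimes it lists. The sketch below assumes the labels compare the original Xeon Gold CPU run with the latest RTX8000 run at each problem size (the pairing that reproduces the rounded labels); `speedup` and `print_speedups` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of two runtimes expressed in the same unit.
double speedup(double original, double latest) {
    assert(latest > 0.0);
    return original / latest;
}

// Original Xeon Gold CPU vs latest RTX8000 GPU, per problem size.
void print_speedups() {
    std::printf("25k samples:  %.0fx\n", speedup(30000.0, 29.0));  // seconds
    std::printf("50k samples:  %.0fx\n", speedup(2500.0, 1.58));   // minutes
    std::printf("115k samples: %.0fx\n", speedup(30000.0, 9.4));   // minutes
    std::printf("300k samples: %.0fx\n", speedup(8000.0, 1.3));    // hours
}
```

The ratios come out a little above 1000, 1500, 3000 and 6000 respectively, so the speedup grows with the problem size.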
  • 24. CPU speed is now often the limiting factor. Original code was single threaded • Relying on partitioning of problem • GPUs prefer full problem in loop. Using OpenMP for CPU parallelization • Together with OpenACC for GPUs, they make for a great pair. GPU compute is just so fast!
        initialize(dm_stripe_buf);
        #pragma acc data copy(dm_stripe_buf)
        for(unsigned int k0 = 0; k0 < (tree.nparens / 2) - 1; k0+=chunk) {
          #pragma omp parallel for
          for (unsigned int i=0; i<chunk; i+=64) {
            embed_packed(emb[i], tree, k0+i);
            fill_lengths(lengths, tree, k0+i);
          }
          #pragma acc update device(emb)
          #pragma acc wait
          #pragma acc data copyin(lengths)
          run_loop(dm_stripe_buf, emb, lengths);
        }
        #pragma acc wait
        return dm_stripe_buf;
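The CPU side of this pattern, preparing a whole chunk of embeddings in parallel before handing it to the GPU, can be sketched on its own. `embed_one` is a hypothetical stand-in for the real tree-walking embedding step, and `fill_chunk` is a made-up name; the `#ifdef _OPENMP` guard is an addition so the sketch also builds without OpenMP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-branch embedding step: the real code
// walks the phylogenetic tree, here we just write a recognizable pattern.
void embed_one(double* emb, std::size_t n_samples, std::size_t branch) {
    for (std::size_t s = 0; s < n_samples; s++) {
        emb[s] = static_cast<double>(branch + s);
    }
}

// Fill a chunk of embeddings on the CPU, one branch per loop iteration.
// Iterations write to disjoint slices, so they can run on separate threads.
void fill_chunk(std::vector<double>& embs, std::size_t n_samples,
                std::size_t k0, std::size_t chunk) {
    assert(embs.size() >= chunk * n_samples);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long i = 0; i < static_cast<long>(chunk); i++) {
        const std::size_t idx = static_cast<std::size_t>(i);
        embed_one(&embs[idx * n_samples], n_samples, k0 + idx);
    }
}
```

Because each iteration owns its slice of the buffer, no locking is needed, and the GPU kernel for the previous chunk can keep running while the CPU threads prepare the next one.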
  • 26. THE PORTING TO GPUS WAS A MAJOR SUCCESS Most of the time is spent in a tight loop. Easy to port to GPUs using OpenACC. But deep understanding of the code is critical for maximum speedup. Gained significantly more from a better algorithm than from better HW. Having a single code set between CPU and GPU helped a lot in improving both code paths. Optimizing one side usually led to discoveries for the other. GPUs are still way faster than CPUs, so HW does matter. 6000x: way beyond our greatest hopes.
  • 27. ENABLING SCIENCE THAT WOULD OTHERWISE NOT BE POSSIBLE 300k samples now computed in about an hour on a single node vs. a heroic HPC job (6000x). Supported by NSF grants DBI-2038509, OAC-1826967, OAC-1541349 and CNS-1730158, and NIH grant DP1-AT010885.
  • 28. ACKNOWLEDGMENTS This work was partially funded by US National Science Foundation (NSF) grants DBI-2038509, OAC-1826967, OAC-1541349 and CNS-1730158, and by US National Institutes of Health (NIH) grant DP1-AT010885.