Porting and optimizing UniFrac for GPUs
Reducing microbiome analysis runtimes by orders of magnitude
Igor Sfiligoi Daniel McDonald Rob Knight
(isfiligoi@sdsc.edu) (danielmcdonald@ucsd.edu) (robknight@ucsd.edu)
University of California San Diego - La Jolla, CA, USA
This work was partially funded by US National Science Foundation (NSF) grants OAC-1826967, OAC-1541349 and CNS-1730158, and by US National Institutes of Health (NIH) grant DP1-AT010885.
Abstract
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another
(“beta diversity”). The recently implemented Striped UniFrac added the capability to split the problem into many
independent subproblems, and exhibits near linear scaling on multiple nodes, but does not scale linearly with the
number of CPU cores on a single node, indicating memory-bound compute.
Massively parallel algorithms, especially memory-heavy ones, are natural candidates for GPU compute. In this
poster we describe steps undertaken in porting and optimizing Striped Unifrac to GPUs. We chose to do the porting
using OpenACC, since it allows for a single code base for both GPU and CPU compute. The choice proved to be
wise, as the optimizations aimed at speeding up execution on GPUs helped CPU execution, too.
The GPUs have proven to be an excellent resource for this type of problem, reducing runtimes from days to hours.
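For orientation, here is a minimal, illustrative sketch (our own simplification, not the project code) of the striped decomposition implied by the indexing in the snippets below: stripe s stores one value per sample k, pairing sample k with sample k+s+1, so every stripe is an independent subproblem. The stripe count and the modulo wrap-around are illustrative only; the indexing in the real code (emb[k + stripe + 1], with no modulo) suggests its buffers are extended instead.

#include <cstdio>
#include <vector>

int main() {
    const unsigned int n_samples = 6;
    const unsigned int n_stripes = 3;  // illustrative; the real code derives this from n_samples
    // One contiguous row per stripe, as in dm_stripes_buf below.
    std::vector<double> dm_stripes_buf(n_stripes * n_samples, 0.0);

    for (unsigned int stripe = 0; stripe < n_stripes; stripe++) {
        double *dm_stripe = dm_stripes_buf.data() + stripe * n_samples;
        for (unsigned int k = 0; k < n_samples; k++) {
            const unsigned int other = (k + stripe + 1) % n_samples;  // paired sample
            dm_stripe[k] = 0.0;  // the UniFrac contribution for the pair (k, other) accumulates here
            std::printf("stripe %u, element %u -> sample pair (%u, %u)\n",
                        stripe, k, k, other);
        }
    }
    return 0;
}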
// Original CPU implementation: the hot loop, manually unrolled by a factor of 4.
for(unsigned int stripe = start; stripe < stop; stripe++) {
    dm_stripe = dm_stripes[stripe];
    for(unsigned int j = 0; j < n_samples / 4; j++) {
        int k = j * 4;
        double u1 = emb[k];
        double u2 = emb[k+1];
        double v1 = emb[k + stripe + 1];
        double v2 = emb[k + stripe + 2];
        …
        dm_stripe[k]   += (u1-v1)*length;
        dm_stripe[k+1] += (u2-v2)*length;
    }
}
// OpenACC port: decorate the loop nest with a pragma. The per-stripe arrays are
// refactored into a single unified buffer (dm_stripes_buf) so one data clause covers them all.
#pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf)
for(unsigned int stripe = start; stripe < stop; stripe++) {
    for(unsigned int j = 0; j < n_samples / 4; j++) {
        int idx = (stripe-start_idx)*n_samples;
        double *dm_stripe = dm_stripes_buf+idx;
        int k = j * 4;
        double u1 = emb[k];
        double u2 = emb[k+1];
        double v1 = emb[k + stripe + 1];
        double v2 = emb[k + stripe + 2];
        …
        dm_stripe[k]   += (u1-v1)*length;
        dm_stripe[k+1] += (u2-v2)*length;
    }
}
// Cluster the reads and minimize the writes: the manual unroll is undone and each
// output element is accumulated in a register (my_stripe) before a single write back.
#pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf,length)
for(unsigned int stripe = start; stripe < stop; stripe++) {
    for(unsigned int k = 0; k < n_samples; k++) {
        …
        double my_stripe = dm_stripe[k];
#pragma acc loop seq
        for (unsigned int e = 0; e < filled_embs; e++) {
            uint64_t offset = n_samples*e;
            double u = emb[offset+k];
            double v = emb[offset+k+stripe+1];
            my_stripe += (u-v)*length[e];
        }
        …
        dm_stripe[k] = my_stripe;
    }
}
// Reorder (tile) the loops to maximize cache reuse; the three loops are collapsed
// so the GPU can schedule them as one large iteration space.
#pragma acc parallel loop collapse(3) present(emb,dm_stripes_buf,length)
for(unsigned int sk = 0; sk < sample_steps; sk++) {
    for(unsigned int stripe = start; stripe < stop; stripe++) {
        for(unsigned int ik = 0; ik < step_size; ik++) {
            unsigned int k = sk*step_size + ik;
            …
            double my_stripe = dm_stripe[k];
#pragma acc loop seq
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = n_samples*e;
                double u = emb[offset+k];
                double v = emb[offset+k+stripe+1];
                my_stripe += (u-v)*length[e];
            }
            …
            dm_stripe[k] = my_stripe;
        }
    }
}
Most of the time is spent in a small area of the code, implemented as a set of tight loops.
OpenACC makes it trivial to port to GPU compute: just decorate the loops with a pragma.
It did, however, require minor refactoring to use a unified buffer.
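As a concrete illustration of that refactoring (a sketch under assumed names, not the project's actual data structures): the per-stripe rows, previously reachable only through an array of separately allocated pointers, become offsets into one contiguous allocation, so a single OpenACC clause such as present(dm_stripes_buf) can cover every stripe and the row address is a plain offset computation inside the kernel.

#include <cstddef>
#include <vector>

struct StripedMatrix {
    unsigned int n_samples;
    unsigned int start_idx;              // first stripe owned by this task
    std::vector<double> dm_stripes_buf;  // (stop-start) * n_samples doubles, contiguous

    StripedMatrix(unsigned int n, unsigned int start, unsigned int stop)
        : n_samples(n), start_idx(start),
          dm_stripes_buf(static_cast<std::size_t>(stop - start) * n, 0.0) {}

    // Row pointer for a stripe; the same arithmetic works on the host and
    // inside the accelerated loops once the buffer has been mapped.
    double *stripe(unsigned int s) {
        return dm_stripes_buf.data() + static_cast<std::size_t>(s - start_idx) * n_samples;
    }
};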
Memory writes are much more expensive than memory reads, so cluster the reads and minimize the writes.
Also undo the manual unrolls: they were optimal for the CPU but bad for the GPU.
Reorder loops to maximize
cache reuse.
Runtime computing UniFrac on the EMP dataset at successive optimization stages:
- Intel Xeon E5-2680 v4 CPU (using all 14 cores), original code: 800 minutes (~13 hours)
- NVIDIA Tesla V100 (using all 84 SMs), initial OpenACC port: 92 minutes (~1.5 hours)
- NVIDIA Tesla V100 (using all 84 SMs), after clustering reads and minimizing writes: 33 minutes
- NVIDIA Tesla V100 (using all 84 SMs), after reordering the loops: 12 minutes
UniFrac was originally designed and has always been implemented using fp64 math operations.
The higher-precision floating-point math was used to maximize the reliability of the results.
Can we use fp32?
On mobile and gaming GPUs
fp64 compute is 32x slower
than fp32 compute.
We can!
We compared the results of the compute using the fp32-enabled and the fp64-only code
on the same EMP input and observed near-identical results
(Mantel R² = 0.99999, p < 0.001, comparing pairwise distances in the two matrices).
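Below is a sketch of how a single code base can serve both precisions (our own illustration; the actual interface of the UniFrac code may differ): template the kernel on the floating-point type and instantiate it once for double (fp64) and once for float (fp32). The loop body and the OpenACC pragmas stay identical; only the element type of the buffers changes.

#include <cstddef>

template <typename TFloat>
void unifrac_stripes(TFloat *dm_stripes_buf, const TFloat *emb, const TFloat *length,
                     unsigned int start, unsigned int stop, unsigned int start_idx,
                     unsigned int n_samples, unsigned int filled_embs) {
#pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf,length)
    for (unsigned int stripe = start; stripe < stop; stripe++) {
        for (unsigned int k = 0; k < n_samples; k++) {
            TFloat *dm_stripe = dm_stripes_buf
                              + static_cast<std::size_t>(stripe - start_idx) * n_samples;
            TFloat my_stripe = dm_stripe[k];
#pragma acc loop seq
            for (unsigned int e = 0; e < filled_embs; e++) {
                const std::size_t offset = static_cast<std::size_t>(n_samples) * e;
                const TFloat u = emb[offset + k];
                const TFloat v = emb[offset + k + stripe + 1];
                my_stripe += (u - v) * length[e];
            }
            dm_stripe[k] = my_stripe;  // single write per output element
        }
    }
}

// Explicit instantiations: fp64 for maximum precision, fp32 for consumer GPUs.
template void unifrac_stripes<double>(double *, const double *, const double *,
                                      unsigned int, unsigned int, unsigned int,
                                      unsigned int, unsigned int);
template void unifrac_stripes<float>(float *, const float *, const float *,
                                     unsigned int, unsigned int, unsigned int,
                                     unsigned int, unsigned int);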
Runtime computing UniFrac on the EMP dataset (single chip, in minutes):

        E5-2680 v4 CPU      V100   2080TI   1080TI   1080   Mobile 1050
        Original    New
fp64         800    193       12       59       77     99           213
fp32           -    190      9.5       19       31     36            64
Runtime computing UniFrac on a dataset containing 113,721 samples (using multiple chips):

Per chip          128x E5-2680 v4 CPU   128x GPU   4x GPU   16x GPU   16x GPU
(in minutes)      Original        New       V100     V100    2080TI    1080TI
fp64                   415         97         14       29       184       252
fp32                     -         91         12       20        32        82

Aggregated        128x E5-2680 v4 CPU   128x GPU   4x GPU   16x GPU   16x GPU
(in chip hours)   Original        New       V100     V100    2080TI    1080TI
fp64                   890        207         30      1.9        49        67
fp32                     -        194         26      1.3       8.5        22
Conclusions
Our work now allows many microbiome analyses that were previously relegated to
large compute clusters to be performed with much lower resource requirements.
Even the largest datasets currently envisaged could be processed in reasonable
time with just a handful of server-class GPUs, while smaller but still sizable
datasets like the EMP can be processed even on GPU-enabled workstations.
We also explored the use of lower-precision floating-point math to effectively
exploit consumer-grade GPUs, which are typical in desktop and laptop setups.
We conclude that fp32 math yields nearly identical results to fp64 and should be
adequate for the vast majority of studies, making GPU-enabled personal devices,
even laptops, a sufficient resource for this otherwise rate-limiting step for
many researchers.
Figure: an ordination plot of unweighted UniFrac distances over 113,721 samples.
The GPU-oriented optimizations also help CPU execution:
Xeon E5-2680 v4 CPU (using all 14 cores), optimized code: 193 minutes (~3 hours).
Properly aligning the memory buffers and picking the right values for the
grouping parameters can make a 5x difference in speed.
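A short sketch of those two knobs, with hypothetical helper names of our own (the poster does not prescribe specific values or an API): allocate the contiguous stripe buffer on an aligned boundary, and expose the grouping parameter (step_size in the collapse(3) version above) so it can be benchmarked per platform.

#include <cstddef>
#include <cstdlib>

// Hypothetical helper: allocate n doubles aligned to `alignment` bytes.
// std::aligned_alloc (C++17) requires the size to be a multiple of the alignment,
// so round the byte count up; release the buffer with std::free.
inline double *alloc_aligned_doubles(std::size_t n, std::size_t alignment = 4096) {
    std::size_t bytes = n * sizeof(double);
    bytes = ((bytes + alignment - 1) / alignment) * alignment;  // round up
    return static_cast<double *>(std::aligned_alloc(alignment, bytes));
}

// Hypothetical default for the grouping parameter; the best value is
// hardware-dependent, so benchmark a few candidates (e.g. 8, 16, 32, 64).
constexpr unsigned int kDefaultStepSize = 16;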
