Slides meyer116

Utility of Graphics Processing Units
for dense matrix calculations in
computing and inverting genomic
relationship matrices
Karin Meyer and Bruce Tier
Animal Genetics and Breeding Unit,
University of New England,
Armidale, Australia
AAABG 2013

GPUs for GRMs | Introduction
Computing in the genomics age
Requirements for mixed model analyses have changed!
Pre-genomics:
mixed model equations sparse (few non-zero elements)
– inverse of pedigree based relationship matrix (NRM)
concentrated on exploiting sparsity in computations
Now: Genomic relationship matrix (GRM)
dense (large proportion of elements non-zero)
require manipulation of large, dense matrices to
– compute the GRM
– invert the GRM
– predict breeding values in single-step approach
computationally demanding
exploit new hardware and software libraries
– computing in parallel
K. M. | 2 / 16

Objectives
A ﬁrst look to examine scope for
speeding up computations to calculate & invert GRM
using parallel computations
exploiting a Graphics Processing Unit (GPU)
“An excursion into the land of computer gaming,
modern compilers and software libraries”
K. M. | 3 / 16

What is a GPU?
Graphical Processing Unit
initially: co-processor for computer gaming
– graphics card
– fast recalculation of pixel values
– remember: i387 math processor for i386
now: GP-GPU adapted to General Purpose computing
– hundreds to thousands of cores
– capable of double precision calculations
– turns desktop PC into personal super-computer
– but: limited memory (currently 6GB max.)
specialised interface for application programming
– CUDA (proprietary of NVIDIA) −→ Fortran interface
– OPENCL (C++ like)
K. M. | 4 / 16

Tesla K20X
2688 cores
6 GB RAM
Peak: 3.95/1.31 Tﬂops (single/double precision)
K. M. | 5 / 16

GPUs for GRMs | Calculating the GRM
Memory needed: G = ZZ
Gigabytes – single precision matrices
n Gn×n Zn×s
s = 100K 500K 1000K
5 000 0.1 1.9 9.3 18.6
10 000 0.4 3.7 18.6 37.3
20 000 1.5 7.5 37.3 74.5
30 000 3.4 11.2 55.9 111.8
50 000 9.3 18.6 93.1 186.3
−→ break calculations into blocks
−→ use out-of-core storage
→ parallel computing literature
K. M. | 6 / 16

Blockwise multiplication
All-in-one
=Z
Z
G
G = ZZ
Column-wise division
= + · · · +Z.1 · · · · · · Z.4
Z.1
· · ·
· · ·
Z.4
Z.1Z.1 Z.4Z.4
G =
k
Z.kZ.k
K. M. | 7 / 16
Z: n animals
× s SNPs
G: symmetric
sn(n + 1)/2
ﬂops

Blockwise multiplication - cont’
Row-wise division
=Zi.
Zj. Gij = Zi.Zj.
Row- and columnwise division
=
Gij =
k
ZikZjk
K. M. | 8 / 16

GPUs for GRMs | Inverting the GRM
Blockwise inversion
G =


GPP GPC GPT
GCP GCC GCT
GTP GTC GTT


1. Subdivide matrix into blocks
C current −→ choosen size
P previous
T trailing
2. Factor & invert chol(GCC)
3. Adjust GPP & GPC
4. ‘Absorb’ into GTT & GCT
5. Calculate inverse
6. Redeﬁne C, P & T; repeat
Gauss-Jordan Elimination
GCC := chol(GCC)
GCC := G−1
CC
GPC := GPCGCC
GPP := GPP + GPCGPC
GCT := GCCGCT
GTT := GTT − GCTGCT
GPT := GPT − GPCGCT
GCT := −(GCCGCT)
GPC := GPCGCC
GCC := GCCGCC
−→ break matrix products into blocks as needed
K. M. | 9 / 16

GPUs for GRMs | Computing environment
Hardware
‘Standard’ multi-core CPU
– Intel I7-3930K
– 3.2 GHz, 12M cache
– 64GB RAM
– 6 cores
GPU
– Nvidia GTX 580
– 512 cores
– 3GB RAM
– capable of double precision calculations
→ high end gaming card, not GP-GPU
K. M. | 10 / 16

GPUs for GRMs | Computing environment
Software
Linux (CentOS 6)
gfortran, gcc 4.4.7; Intel composer XE 13.0.1
Intel MKL library 11.0
– BLAS: SSYRK, SGEMM, STRTRI, STRMM
– LAPACK: SPOTRF, SPOTRI
CUDA 5.0
– CUBLAS: CUBLAS_SSYRK, CUBLAS_SGEMM,
CUBLAS_STRTRI, CUBLAS_STRMM
CULA dense R16a
– CULA_SPOTRF, CULA_SPOTRI, CULA_STRTRI
MAGMA 1.3
– MAGMAF_SLAUUM
K. M. | 11 / 16

GPUs for GRMs | Results
Time to ‘make’ GRM∗
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
GPU
CPU6
CPU1
1 hour
5 hours
10−1
100
101
102
103
0 20000 40000 60000 80000
No. animals
Systemtime(min)
∗
Single precision calculations, s = 500 000
K. M. | 12 / 16

Time to ‘make’ GRM∗
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
GPU
CPU6
CPU1
1 hour
5 hours
10−1
100
101
102
103
0 20000 40000 60000 80000
No. animals
Systemtime(min)
∗
Single precision calculations, s = 500 000
K. M. | 12 / 16
●
● ●
●
● ●
●
●
●
●
●
● ● ● ●
●
●
● ●
● ● ● ● ● ● ● ● ● ● ● ● ●
● ●
●
● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ●
● ●
● ●
GPU / CPU1
CPU6 / CPU1
5
10
15
20

Time to invert GRM†
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ● ● ● ● ● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
1 hour
GPU
CPU6
CPU1
10−4
10−3
10−2
10−1
100
101
102
0 20000 40000 60000 80000
No. animals
Systemtime(min)
†
Single precision calculations
K. M. | 13 / 16

Time to invert GRM†
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ● ● ● ● ● ● ●
● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
1 hour
GPU
CPU6
CPU1
10−4
10−3
10−2
10−1
100
101
102
0 20000 40000 60000 80000
No. animals
Systemtime(min)
†
Single precision calculations
K. M. | 13 / 16
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●●●
●●●●
●
●
●●●●●●●●
●●●●●●
●
●●●●●●●●
●
●●●●
●●
●
●●●●
●●
●●●●●●●●●
●
GPU / CPU1
CPU6 / CPU1
5
10
15

Time to invert GRM‡
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ●
● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●
● ● ●
●
●
●
● ●
1 hour
GPU
CPU6
CPU1
10−3
10−2
10−1
100
101
102
0 20000 40000 60000 80000
No. animals
Systemtime(min)
‡
Double precision calculations
K. M. | 14 / 16
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ● ● ●
●
●
● ●
●
● ● ●
● ●
●
●
● ●
● ●
●
● ● ● ● ●
● ● ● ● ● ●
●
●
● ● ● ● ● ●
● ● ●
●
●
●
●
●
●
●
●
GPU / CPU1
CPU6 / CPU1
4
6
8

GPUs for GRMs | Results | Conclusions
Conclusions
Build & invert GRM
−→ dense matrix multiplications
−→ highly parallelisable
K. M. | 15 / 16

Conclusions
Build & invert GRM
BLAS and LAPACK library routines awesome
– highly optimised
– exploit memory architecture of modern processors
– multi-threaded versions
K. M. | 15 / 16

Conclusions
Build & invert GRM
BLAS and LAPACK library routines awesome
– highly optimised
– exploit memory architecture of modern processors
– multi-threaded versions
GPU: powerful new hardware for parallel computing
– thousands of processors
– can use multiple GPUs simultaneously
– BLAS/LAPACK etc. libraries available
– but: limited memory −→ blocked algorithms
– but: specialised interface, not much Fortran support yet
K. M. | 15 / 16

K. M. | 16 / 16

Slides meyer116

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (15)

Similar to Slides meyer116

Similar to Slides meyer116 (20)

Recently uploaded

Recently uploaded (20)

Slides meyer116