The document discusses using OpenCL to accelerate genomic analysis through parallelization. It introduces OpenCL and provides examples of using it to parallelize algorithms for copy number inference in tumors, computing relatedness between individuals, and performing variable selection in regression. Key applications discussed include hidden Markov models for copy number inference, principal component analysis on relatedness matrices, and coordinate descent algorithms for lasso regression. Performance gains of up to 155x are reported for the parallel implementations compared to serial code.
OpenCL applications in genomics
1. Using OpenCL to accelerate
genomic analysis
Gary K. Chen
June 16, 2011
2. An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
3. Scientific Programming on GPGPU
devices
nVidia and ATI are currently market leaders
Very competitive in performance and price
Impressive double-precision performance, though
still about 4 times slower than 32-bit FP
ATI 9370 chipset: 528 GFLOPS (FP64), 4GB GDDR5,
$2,399
nVidia Tesla C2050: 520 GFLOPS (FP64), 3GB GDDR5,
$2,199
Source: www.sabrepc.com
5. Future multi-core CPUs
Intel’s 48 core SCC chip
Potentially a more powerful solution when
considering data-intensive computing. Not
constrained by the PCI bus
10. An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
11. Biology background
DNA
A string with a four letter alphabet: A,C,G,T
Humans have two copies: one from mom, one from
dad
Most of the sequence between two strands is the
same, except for a small proportion
Example sequence: ATATTGC. We could have:
A single nucleotide polymorphism (common point
mutation): ATATAGC
Copy number variants/aberrations
(deletions, amplifications, translocations):
AT–GC (a deletion)
ATATTATTATTGC (an amplification)
13. What is observed
Microarray output
Probes are dyed, and microarrays scanned with
CCD cameras
X,Y: Intensities of A and B alleles (two possible
variants)
R = X+Y: Overall intensity
LRR (log2 R ratio): Intensity relative to a standard
intensity
BAF (B allele frequency): Ratio of allelic intensity
between A and B
15. Hidden Markov Model
A formalized statistical model
We want to use information from observables
(LRR,BAF) to infer true state of nature (copy
number, genotype)
Table: Example hidden states from PennCNV software
State  CN  possible genotypes
1      0   Null
2      1   A,B
3      2   AA,AB,BB
4      2   AA,BB
5      3   AAA,AAB,ABB,BBB
6      4   AAAA,AAAB,AABB,ABBB,BBBB
16. Copy number inference in tumors
Inference is harder!
1. When dissecting breast tissue for example,
stromal (normal cell) contamination is almost
inevitable. Hence you are modeling a mixture of
two or more cell populations
Suppose you have a state assuming normal CN=2,
tumor CN=4, α = 0.2
e.g. r_i = α r_{i,n} + (1 − α) r_{i,t}
expected mean intensity: 0.2(1) + 0.8(1.68) = 1.544
2. Amplification events can be wilder than
germline (e.g. blood) events, leading to greater
copy number/genotype possibilities
Combine issues 1) and 2) and you can get a huge
search space
19. Algorithm
Initialize
Empirically estimate σ of BAF and LRR
Compute emission matrix O for each state/obs
from a Gaussian pdf
Train: Expectation Maximization
Forward-backward: computes posterior probs and
overall likelihood
Baum-Welch: compute MLE of transition
probabilities in matrix T
Traverse state path
Viterbi (dynamic programming): walk the state
path based on max-product
20. Parallel Forward Algorithm
We compute the probability vector at observation
t: f_{0:t} = f_{0:t−1} T O_t
Each state (element of the m-state vector) can
independently compute a sum-product
Threadblocks map to states
Threads calculate products in parallel, followed by
a log2(m) addition reduction
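As a serial reference for the kernel mapping described above, the matrix form of the forward recursion can be sketched in NumPy (a hypothetical helper, not the deck's OpenCL code); on the GPU, each of the m sum-products is independent, which is what lets threadblocks map to states:

```python
import numpy as np

def forward(T, emit, init):
    """Forward algorithm in matrix form: f_{0:t} = f_{0:t-1} T O_t.

    T    : (m, m) transition matrix
    emit : (n_obs, m) emission probabilities O_t per observation
    init : (m,) initial state distribution
    Returns the (n_obs, m) matrix of unnormalized forward probabilities;
    the total likelihood is the sum of the last row.
    """
    f = init * emit[0]
    out = [f]
    for t in range(1, emit.shape[0]):
        f = (f @ T) * emit[t]   # m independent sum-products
        out.append(f)
    return np.vstack(out)
```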
21. Technical issue: Underflow
Tiny probabilities often have to be represented
in log space (even for FP64)
How do we deal with adding log probabilities?
We usually exponentiate, add, then log
Remedy
Add an offset to log before exponentiating
Subtract the offset from the log space answer
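The offset trick above is the standard log-sum-exp construction; a minimal sketch:

```python
import numpy as np

def log_sum_exp(log_p):
    """Sum probabilities given in log space without underflow.

    Subtracting the max before exponentiating guarantees at least one
    term equals exp(0) = 1, so the sum never underflows to zero; the
    offset is added back to the log-space answer.
    """
    c = np.max(log_p)                        # the offset
    return c + np.log(np.sum(np.exp(log_p - c)))
```

For instance, summing two probabilities near exp(−1000) stays finite here, whereas the naive exponentiate-add-log would return −inf.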
26. Performance
Table: One EM iteration on Chr 1 (41,263 SNPs)
states  CPU     GPU     fold-speedup
128     9.5m    37s     15x
512     2h 35m  1m 44s  108x
27. An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
28. Storing data
Global memory
Relatively abundant, but slow
However, even 4GB may be insufficient for modern
datasets
Genotype data
Highly compressible
We only care if a position differs from the
canonical sequence
Thus: AA,AB,BB,NULL are 4 possible genotypes
Should be able to encode this into two bits, so 4
genotypes per byte
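A minimal host-side sketch of the 2-bit encoding just described (hypothetical helper name, using genotype codes 0–3 for AA, AB, BB, NULL and packing low bits first):

```python
def pack_genotypes(genos):
    """Pack genotype codes (0-3) into a bytearray, 4 genotypes per byte.

    Genotype i lands in byte i//4, at bit offset 2*(i%4).
    """
    packed = bytearray((len(genos) + 3) // 4)
    for i, g in enumerate(genos):
        packed[i // 4] |= (g & 3) << (2 * (i % 4))
    return packed
```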
29. Possible approaches
Store as a float array
+: Easy to implement
-: Uses 16 times as much memory as needed!
Store as an int array
Allocate a local memory array of 256 rows, 4 cols
for mapping all possible genotype 4-tuples
+: Uses global memory efficiently, maximizes
bandwidth
-: You might not even have enough local memory,
much less for real work
Store as a char array
Right bitshift pairs of bits, then OR mask with 3
+: Uses global memory efficiently, saves on local
memory
-: Threads load a minimum of 4 bytes per word,
you use 25% of available bandwidth
30. One solution: custom container
Idea:
Designate each threadblock to handle 512
genotypes
First 32 threads: each loads a packedgeno_t
element
For each of the 32 threads:
Loop four times, extracting each char
Subloop four times, extracting each genotype via
bitshift/mask
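A serial model of the bitshift/mask extraction the kernel performs (hypothetical helper, assuming 2-bit genotypes packed low bits first within each byte):

```python
def unpack_genotypes(packed, n):
    """Recover n genotype codes (0-3) from a packed byte buffer.

    Each byte holds four genotypes; right-shift pairs of bits out
    and mask with 3, as in the kernel subloop described above.
    """
    genos = []
    for byte in packed:
        for shift in (0, 2, 4, 6):   # four genotypes per byte
            genos.append((byte >> shift) & 3)
    return genos[:n]
```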
32. An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
33. Inferring Relatedness
Inferring relatedness
The human race is one large pedigree
Individuals of the same ethnicity are expected
to share more SNP alleles
We can summarize this relationship through a
correlation matrix called ’K’
34. Uses for the ’K’ matrix
Principal Components Analysis
A singular value decomposition on ’K’
K = V D V'
V contains orthogonal axes, facilitating population
structure inference
Estimating heritability
In random effects models
Y = µ + βX + γ²K + σ²I
h² = γ² / (γ² + σ²)
36. Computing K
Essentially a matrix multiplication
K̂_jk = (1/m) Σ_{i=1}^{m} (x_{ij} − 2f_i)(x_{ik} − 2f_i) / (4 f_i (1 − f_i))
Or in other words: K = ZZ'
Including more SNPs adds more precise, subtle
information
Parallel code
Carrying out matrix multiplication is
straightforward on GPU
Matrix multiplication is ideal for GPU: Approx.
240x speedup.
Because K is summed over SNPs, we can split
genotype matrix by subsets of SNPs and run each
K slice in parallel
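The standardization and K = ZZ' step can be sketched as follows (a NumPy stand-in for the GPU matrix multiply; assumes no monomorphic SNPs, i.e. 0 < f_i < 1):

```python
import numpy as np

def kinship(X):
    """Estimate K from an (n_subjects, m_snps) genotype matrix X with
    entries 0/1/2 (counts of one allele).

    Each SNP i is centered by 2*f_i and scaled by sqrt(4*f_i*(1-f_i)),
    then K = Z Z' / m -- a single matrix multiplication.
    """
    f = X.mean(axis=0) / 2.0                      # allele frequencies
    Z = (X - 2 * f) / np.sqrt(4 * f * (1 - f))    # standardized genotypes
    return Z @ Z.T / X.shape[1]
```

Because K is a sum over SNPs, slices of columns can be run through this independently and the resulting partial K matrices added, which is the parallel split described above.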
37. An outline
OpenCL Introduction
Copy number inference in tumors
Data considerations
Hidden Relatedness
Variable Selection
38. Variable Selection
One goal in biomedical research is correlating
DNA variation to disease phenotypes
Genomics technology
The number of subjects n remains about the same
(cost of recruiting, sample preps, etc), while
number of features p is exploding
Rate that data is being generated per dollar
surpasses Moore’s Law
40. Regression
Standard logistic regression
The usual method for hypothesis testing of
candidate predictors
log(p / (1 − p)) = Xβ, p being the probability of affection
We apply Newton-Raphson scoring until f(β) is
maximized.
Logistic regression simply fails when p > n
L1 penalized regression, aka LASSO
Idea: Fit the logistic regression model, but subject
to a penalty parameter λ
g(β) = f(β) − λ Σ_{j=1}^{p} |β_j|
41. Algorithms for fitting the LASSO
Cyclic Coordinate Descent
One-dimensional Newton-Raphson at variable j:
Δβ_j = β_j^(new) − β_j = −g′(β_j) / g″(β_j)
g′(β_j) = Σ_{i=1}^{n} x_{i,j} y_i / (1 + exp(x_{i,j} β_j y_i)) − sgn(β_j) λ
g″(β_j) = −Σ_{i=1}^{n} x_{i,j}² exp(x_{i,j} β_j y_i) / (1 + exp(x_{i,j} β_j y_i))²
We cycle through each j until the likelihood stops
increasing within some tolerance
Performs great, but only allows parallelization
across samples
ref: Genkin, Lewis, Madigan: Technometrics 2007, Vol 49, No. 3
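A simplified serial sketch of cyclic coordinate descent for the L1-penalized logistic likelihood (labels in {−1,+1}). This is an illustration, not the deck's implementation: the full linear predictor x_i·β is used in the weights, and a crude step clip stands in for the trust-region safeguard of Genkin et al.'s BBR algorithm.

```python
import numpy as np

def ccd_lasso_logistic(X, y, lam, n_sweeps=50, tol=1e-6):
    """Cyclic coordinate descent for L1-penalized logistic regression.

    X: (n, p) design matrix; y: labels in {-1, +1}; lam: L1 penalty.
    One Newton step per coordinate per sweep.
    """
    n, p = X.shape
    beta = np.zeros(p)
    eta = X @ beta                                  # linear predictor
    for _ in range(n_sweeps):
        max_step = 0.0
        for j in range(p):
            z = np.clip(eta * y, -35.0, 35.0)       # guard exp overflow
            w = 1.0 / (1.0 + np.exp(z))             # = 1 - sigmoid(y*eta)
            g1 = X[:, j] @ (y * w)                  # dlogL/dbeta_j
            g2 = -(X[:, j] ** 2) @ (w * (1.0 - w))  # d2logL/dbeta_j^2 (< 0)
            if beta[j] != 0.0:
                g1 -= np.sign(beta[j]) * lam        # penalty subgradient
            else:
                g1 = np.sign(g1) * max(abs(g1) - lam, 0.0)  # soft threshold
            if g1 == 0.0 or g2 == 0.0:
                continue
            step = float(np.clip(-g1 / g2, -1.0, 1.0))  # clipped Newton step
            if beta[j] != 0.0 and np.sign(beta[j] + step) != np.sign(beta[j]):
                step = -beta[j]                     # stop at zero, don't cross
            beta[j] += step
            eta += step * X[:, j]
            max_step = max(max_step, abs(step))
        if max_step < tol:                          # likelihood has stabilized
            break
    return beta
```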
42. Distributed GPU implementation
If possible to parallelize across variables, it is
worth splitting up design matrix
For really large dimensions, we can link up an
arbitrary number of GPUs
Message Passing Interface allows us to be
agnostic to physical location of GPU devices
43. Distributed GPU implementation
Approach:
MPI master node delegates heavy lifting to slaves
across network
Master node performs fast serial code, such as
sampling new λ, comparing logLs, broadcasting
gradients, etc.
Network traffic is kept to a minimum
Implemented for Greedy Coordinate Descent and
Gradient Descent
Developed on server at USC Epigenome Center: 2
Tesla C2050s
44.
45. Parallel algorithms for fitting the LASSO
Greedy coordinate descent (ref)
Same algorithm as CCD, except for each variable
sweep, update only j that gives greatest increase in
logL
No dependencies between subjects and variables,
massive parallelization across subjects AND
variables
Ideal if you have a huge dataset, and you want a
stringent type 1 error rate (only care about a few
variables)
Ayers and Cordell, Gen Epi 2010: Permute, and
pick largest λ that allows first “false” variable to
enter
ref: Wu, Lange: Annals Appl Stat 2008 Vol 2,No. 1
47. Overview of Greedy CD algorithm
Newton-Raphson kernel
Each threadblock maps to a block of 512 subjects
(threads) for 1 variable
Each thread calculates subject’s contribution to
gradient and hessian
Sum (reduction) across 512 subjects
Sum (reduction) across subject blocks in new
kernel
Compute log-likelihood change for each
variable (like above).
Apply a max operator (log2 reduction) to
select variable with greatest contribution to
likelihood.
Iterate repeatedly until likelihood increase less
than epsilon
48. Evaluation on large dataset
GWAS data
6,806 subjects in a case control study of prostate
cancer
1,047,986 SNPs typed
Invoke approx. 7 billion threads per iteration
Total walltime for 1 GCD iteration (sweep
across all variables)
15 minutes on optimized serial implementation
split across 2 slave CPUs
5.8 seconds on parallel implementation across 2
nVidia Tesla C2050 GPU devices
155x speed up
49. Parallel algorithms for fitting the LASSO
(Stochastic Mirror) Gradient Descent (ref)
Sometimes, we are interested in tuning λ for say
the best cross validation errors
Greedy descent seems awfully wasteful in that only
one βj is updated
However, we can update all variables in parallel
cycling through subjects
Algorithm
Extremely simple:
For subject i: gradient g_i = −y_i / (1 + exp(x_i·β y_i))
Update the β vector, where β_j ← β_j − η g_i x_{i,j}
η is a learning parameter, set sufficiently small
(e.g. 0.0001)
ref: Shalev-Shwartz, Tewari: Proc. 26th Intern. Conf.
Machine Learning 2009
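One update of this scheme, sketched serially: just the per-subject gradient step, with every coordinate of β updated in parallel. The L1 penalty handling (the mirror-descent link function of Shalev-Shwartz and Tewari) is omitted here.

```python
import numpy as np

def sgd_step(beta, x_i, y_i, eta=1e-4):
    """One stochastic gradient step from a single subject.

    beta : (p,) current coefficients
    x_i  : (p,) subject i's feature vector
    y_i  : label in {-1, +1}
    All p coordinates update at once, which is what makes the method
    parallelizable across variables.
    """
    g_i = -y_i / (1.0 + np.exp(x_i @ beta * y_i))   # scalar gradient factor
    return beta - eta * g_i * x_i
```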
50. Gradient descent
Performance
Slow convergence compared to serial cyclic
coordinate descent, but far more scalable
For large lambdas, slower than greedy coordinate
descent
Computation-to-bandwidth ratio is not great
For 1 million SNPs, only about a 15x speedup. Far
more SNPs are needed
Technical issues
Must store genotypes in subject-major order to
enable coalesced memory loads/stores
Makes SNP-level summaries like means and SDs
difficult to compute.
Heterogeneous data types: floats (E, ExG),
compressed chars (G, GxG)
Memory constrained: can perform interactions on
the fly with SNP-major storage
51. Potential for robust variable selection:
Subsampling:
Applying the LASSO once overfits the data; model
selection is inconsistent
Subsampling is preferable: bootstrapping, stability
selection, x-fold cross-validation
Number of replicates << number of samples <<
number of features
Bayesian variable selection:
If we assume the β_LASSO are conditionally independent
Master node can (quickly) sample hyperparameters
(e.g. λ) from a prior distribution