
OpenCL applications in genomics



  1. Using OpenCL to accelerate genomic analysis. Gary K. Chen, June 16, 2011
  2. An outline: OpenCL introduction; copy number inference in tumors; data considerations; hidden relatedness; variable selection
  3. Scientific programming on GPGPU devices. nVidia and ATI are currently the market leaders, and are very competitive in performance and price. Impressive double-precision performance, though still about 4 times slower than 32-bit FP. ATI 9370 chipset: 528 GFLOPS (FP64), 4GB GDDR5, $2,399. nVidia Tesla C2050: 520 GFLOPS (FP64), 3GB GDDR5, $2,199. Source:
  4. Future multi-core CPUs. Intel's 48-core SCC chip: potentially a more powerful solution for data-intensive computing, since it is not constrained by the PCI bus.
  5. An open-standards based development platform
  6. Same idea as CUDA, different terms
  7. Data parallel coding
  8. OpenCL concepts
  9. An outline: OpenCL introduction; copy number inference in tumors; data considerations; hidden relatedness; variable selection
  10. Biology background. DNA is a string over a four-letter alphabet: A, C, G, T. Humans have two copies: one from mom, one from dad. Most of the sequence between two strands is the same, except for a small proportion. Example sequence: ATATTGC. We could have a single nucleotide polymorphism (a common point mutation): ATATAGC; or copy number variants/aberrations (deletions, amplifications, translocations): AT–GC, ATATTATTATTGC.
  11. SNP microarrays
  12. What is observed: microarray output. Probes are dyed, and microarrays are scanned with CCD cameras. X, Y: intensities of the A and B alleles (the two possible variants). R = X + Y: overall intensity. LRR (log2 R ratio): intensity relative to a standard intensity. BAF (B allele frequency): ratio of allelic intensity between A and B.
  13. Inferring CNVs from microarray output
  14. Hidden Markov Model. A formalized statistical model: we want to use information from the observables (LRR, BAF) to infer the true state of nature (copy number, genotype). Table: example hidden states from the PennCNV software.
     State  CN  Possible genotypes
     1      0   Null
     2      1   A,B
     3      2   AA,AB,BB
     4      2   AA,BB
     5      3   AAA,AAB,ABB,BBB
     6      4   AAAA,AAAB,AABB,ABBB,BBBB
  15. Copy number inference in tumors. Inference is harder! 1. When dissecting breast tissue, for example, stromal (normal cell) contamination is almost inevitable, so you are modeling a mixture of two or more cell populations. The observed intensity is r_i = α r_i,n + (1 − α) r_i,t. Suppose you have a state assuming normal CN=2, tumor CN=4, α = 0.2: the expected mean intensity is 0.2(1) + 0.8(1.68) = 1.544. 2. Amplification events can be wilder than germline (e.g. blood) events, leading to more copy number/genotype possibilities. Combine issues 1) and 2) and you get a huge search space.
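The mixture calculation above can be checked in a few lines; the function name is illustrative, not from the talk.

```python
def mixed_intensity(alpha, r_normal, r_tumor):
    """Expected intensity when a fraction `alpha` of cells is normal tissue:
    r_i = alpha * r_normal + (1 - alpha) * r_tumor."""
    return alpha * r_normal + (1 - alpha) * r_tumor

# Slide example: normal CN=2 (r=1), tumor CN=4 (r=1.68), alpha=0.2
print(mixed_intensity(0.2, 1.0, 1.68))  # ~1.544
```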
  16. Expanded state space
     ID  BACn  CNt  BACt  α    r̄     b̄
     0   0     2    0     0.3  1     0
     1   0     2    0     0.6  1     0
     2   1     2    1     0.3  1     0.5
     3   1     2    1     0.6  1     0.5
     4   2     2    2     0.3  1     1
     5   2     2    2     0.6  1     1
     6   0     1    0     0.3  0.65  0
     7   0     1    0     0.6  0.8   0
     8   0     1    1     0.3  0.65  0.538462
     9   0     1    1     0.6  0.8   0.25
     10  1     1    0     0.3  0.65  0.230769
     11  1     1    0     0.6  0.8   0.375
     12  1     1    1     0.3  0.65  0.769231
     13  1     1    1     0.6  0.8   0.625
     14  2     1    0     0.3  0.65  0.461538
     15  2     1    0     0.6  0.8   0.75
     16  2     1    1     0.3  0.65  1
     17  2     1    1     0.6  0.8   1
     18  0     3    0     0.3  1.35  0
     19  0     3    0     0.6  1.2   0
     20  1     3    1     0.3  1.35  0.37037
     21  1     3    1     0.6  1.2   0.416667
     22  1     3    2     0.3  1.35  0.62963
     23  1     3    2     0.6  1.2   0.583333
     24  2     3    3     0.3  1.35  1
     25  2     3    3     0.6  1.2   1
  17. Algorithm. Initialize: empirically estimate σ of BAF and LRR; compute the emission matrix O for each state/observation from a Gaussian pdf. Train (Expectation Maximization): forward-backward computes the posterior probabilities and the overall likelihood; Baum-Welch computes the MLE of the transition probabilities in matrix T. Traverse the state path: Viterbi (dynamic programming) walks the state path based on max-product.
  18. Parallel forward algorithm. We compute the probability vector at observation t: f_{0:t} = f_{0:t−1} T O_t. Each state (element of the m-state vector) can independently compute a sum-product. Threadblocks map to states; threads calculate the products in parallel, followed by a log2(m)-step addition reduction.
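A serial sketch of one forward-recursion step. On the GPU each threadblock owns one state s: its threads form the products f_prev[j]·T[j,s] in parallel and a log2(m)-step reduction sums them. The 2-state matrices below are illustrative, not from the talk.

```python
import numpy as np

def forward_step(f_prev, T, O_t):
    # sum-product over previous states, scaled by the emission probabilities
    return (f_prev @ T) * O_t

T = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # transition probabilities
f0 = np.array([0.5, 0.5])        # initial state distribution
O1 = np.array([0.7, 0.1])        # emission probs for the observation at t=1
print(forward_step(f0, T, O1))   # approximately [0.385, 0.045]
```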
  19. Technical issue: underflow. Tiny probabilities often have to be represented in log space (even in FP64). How do we deal with adding log probabilities? We usually exponentiate, add, then take the log. Remedy: add an offset to the log values before exponentiating, then subtract the offset from the log-space answer.
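A minimal sketch of the offset trick above (log-sum-exp): shift the log probabilities by their maximum before exponentiating so exp() cannot underflow, then shift the answer back.

```python
import math

def log_add(log_probs):
    offset = max(log_probs)                            # apply offset before exp
    total = sum(math.exp(lp - offset) for lp in log_probs)
    return offset + math.log(total)                    # undo offset in log space

# Naive exponentiation of these underflows to 0.0 in FP64...
assert math.exp(-1000.0) == 0.0
# ...but the offset trick recovers log(exp(-1000) + exp(-1001)):
print(log_add([-1000.0, -1001.0]))  # about -999.687
```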
  20. Gridblocks: forward-backward calculation
  21. Code: computing products in parallel
  22. Code: two reductions, computing the offset and the sum-product
  23. Algorithm improvements. Examples: re-scaling the transition matrix (accounting for SNP spacing): serial O(2nm²), parallel O(n). Forward-backward: serial O(2nm²), parallel O(n log2(m)). Viterbi: serial O(nm²), parallel O(n log2(m)). Normalizing constant (Baum-Welch): serial O(nm), parallel O(log2(n)). MLE of the transition matrix (Baum-Welch): serial O(nm²), parallel O(n).
  24. Performance. Table: one EM iteration on Chr 1 (41,263 SNPs).
     states  CPU     GPU     fold-speedup
     128     9.5m    37s     15x
     512     2h 35m  1m 44s  108x
  25. An outline: OpenCL introduction; copy number inference in tumors; data considerations; hidden relatedness; variable selection
  26. Storing data. Global memory is relatively abundant but slow, and even 4GB may be insufficient for modern datasets. Genotype data is highly compressible: we only care whether a position differs from the canonical sequence, so AA, AB, BB, NULL are the 4 possible genotypes. We should be able to encode each in two bits, i.e. 4 genotypes per byte.
  27. Possible approaches. Store as a float array. +: easy to implement. -: uses 16 times as much memory as needed! Store as an int array, allocating a local memory array of 256 rows and 4 columns mapping all possible genotype 4-tuples. +: uses global memory efficiently, maximizes bandwidth. -: you might not even have enough local memory, much less any left for real work. Store as a char array, right-bitshifting by pairs of bits and then AND-masking with 3. +: uses global memory efficiently, saves on local memory. -: threads load a minimum of 4 bytes per word, so you use only 25% of the available bandwidth.
  28. One solution: a custom container. Idea: designate each threadblock to handle 512 genotypes. The first 32 threads each load one packedgeno_t element. Each of the 32 threads then loops four times, extracting each char, and subloops four times, extracting each genotype via bitshift/mask.
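A serial sketch of the two-bit packing and extraction scheme described above; the genotype codes (0=AA, 1=AB, 2=BB, 3=NULL) are one illustrative assignment.

```python
def pack(genotypes):
    """Pack genotype codes (0-3) four per byte, low bits first."""
    out = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        out[i // 4] |= (g & 3) << (2 * (i % 4))    # each genotype takes 2 bits
    return bytes(out)

def unpack(packed, n):
    """Recover n genotypes via right bitshift and AND-mask with 3."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 3 for i in range(n)]

genos = [0, 1, 2, 3, 1, 0, 2]
assert unpack(pack(genos), len(genos)) == genos    # round-trips losslessly
```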
  29. Illustration
  30. An outline: OpenCL introduction; copy number inference in tumors; data considerations; hidden relatedness; variable selection
  31. Inferring relatedness. The human race is one large pedigree, and individuals of the same ethnicity are expected to share more SNP alleles. We can summarize this relationship through a correlation matrix called 'K'.
  32. Uses for the 'K' matrix. Principal Components Analysis: a singular value decomposition on 'K', K = VDV'. V contains orthogonal axes, facilitating population structure inference. Estimating heritability in random effects models: Y = µ + βX + ε, with Var(ε) = γ²K + σ²I, so h² = γ²/(γ² + σ²).
  33. Example: Latino samples in LA
  34. Computing K. Essentially a matrix multiplication: K_jk = (1/m) Σ_{i=1}^{m} (x_ij − 2f̂_i)(x_ik − 2f̂_i) / (4f̂_i(1 − f̂_i)). Or in other words, K = ZZ'. Including more SNPs adds more precise, subtle information. Parallel code: carrying out matrix multiplication is straightforward on a GPU, and matrix multiplication is ideal for the GPU (approx. 240x speedup). Because K is summed over SNPs, we can split the genotype matrix into subsets of SNPs and compute each K slice in parallel.
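A sketch of the estimator above: standardize each SNP column of the n×m genotype matrix X (coded 0/1/2 copies of the B allele) by its estimated allele frequency, then form K = ZZ'/m. The random data here are illustrative.

```python
import numpy as np

def kinship(X):
    n, m = X.shape
    f = X.mean(axis=0) / 2.0                       # estimated allele frequency per SNP
    Z = (X - 2.0 * f) / np.sqrt(4.0 * f * (1.0 - f))
    # Because K is a sum over SNPs, column slices of Z could be multiplied on
    # separate devices and the partial K matrices added together.
    return (Z @ Z.T) / m                           # K = ZZ'/m

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(50, 200)).astype(float)
K = kinship(X)
assert K.shape == (50, 50) and np.allclose(K, K.T)  # symmetric n x n matrix
```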
  35. An outline: OpenCL introduction; copy number inference in tumors; data considerations; hidden relatedness; variable selection
  36. Variable selection. One goal in biomedical research is correlating DNA variation with disease phenotypes. Genomics technology: the number of subjects n remains about the same (cost of recruiting, sample preps, etc.), while the number of features p is exploding. The rate at which data is generated per dollar surpasses Moore's Law.
  37. Regression. Standard logistic regression is the usual method for hypothesis testing of candidate predictors: log(p/(1 − p)) = βX, p being the probability of affection. We apply Newton-Raphson scoring until f(β) is maximized. Logistic regression simply fails when p > n. L1-penalized regression, aka the LASSO. Idea: fit the logistic regression model, but subject to a penalty parameter λ: g(β) = f(β) − λ Σ_{j=1}^{p} |β_j|.
  38. Algorithms for fitting the LASSO. Cyclic coordinate descent: one-dimensional Newton-Raphson at variable j: Δβ_j = β_j(new) − β_j = −g′(β_j)/g″(β_j), where g′(β_j) = Σ_{i=1}^{n} x_{i,j} y_i / (1 + exp(x_{i,j} β_j y_i)) − sgn(β_j)λ and g″(β_j) = −Σ_{i=1}^{n} x_{i,j}² exp(x_{i,j} β_j y_i) / (1 + exp(x_{i,j} β_j y_i))². We cycle through each j until the likelihood stops increasing within some tolerance. Performs great, but only allows parallelization across samples. ref: Genkin, Lewis, Madigan: Am Stat Assoc 2007, Vol 49, No. 3
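A serial sketch of one such Newton-Raphson coordinate update, with labels y_i coded ±1. The toy data and the λ=0 sanity check are illustrative; production code also clamps steps that would cross zero, which this sketch omits.

```python
import math

def newton_step(x_j, y, beta_j, lam):
    """One 1-D Newton-Raphson step on the L1-penalized log-likelihood for variable j."""
    g1 = sum(x * yi / (1.0 + math.exp(x * beta_j * yi)) for x, yi in zip(x_j, y))
    if beta_j != 0.0:
        g1 -= math.copysign(lam, beta_j)          # gradient of -lambda * |beta_j|
    g2 = 0.0
    for x, yi in zip(x_j, y):
        e = math.exp(x * beta_j * yi)
        g2 -= x * x * e / (1.0 + e) ** 2          # second derivative (negative)
    return beta_j - g1 / g2

# Sanity check with lam=0: for this symmetric toy data the MLE is beta=0,
# and a few Newton steps converge there.
x_j, y = [1.0, 1.0], [1, -1]
beta = 0.5
for _ in range(10):
    beta = newton_step(x_j, y, beta, lam=0.0)
print(abs(beta) < 1e-8)  # True
```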
  39. Distributed GPU implementation. If it is possible to parallelize across variables, it is worth splitting up the design matrix. For really large dimensions, we can link up an arbitrary number of GPUs; the Message Passing Interface allows us to be agnostic to the physical location of the GPU devices.
  40. Distributed GPU implementation. Approach: an MPI master node delegates the heavy lifting to slaves across the network. The master node performs fast serial code, such as sampling a new λ, comparing log-likelihoods, and broadcasting gradients; network traffic is kept to a minimum. Implemented for greedy coordinate descent and gradient descent. Developed on a server at the USC Epigenome Center: 2 Tesla C2050s.
  41. Parallel algorithms for fitting the LASSO. Greedy coordinate descent (ref): the same algorithm as CCD, except that on each variable sweep we update only the j that gives the greatest increase in logL. No dependencies between subjects and variables, so massive parallelization across subjects AND variables. Ideal if you have a huge dataset and want a stringent type 1 error rate (you only care about a few variables). Ayers and Cordell, Gen Epi 2010: permute, and pick the largest λ that allows the first "false" variable to enter. ref: Wu, Lange: Annals Appl Stat 2008, Vol 2, No. 1
  42. Layout for the greedy coordinate descent implementation
  43. Overview of the greedy CD algorithm. Newton-Raphson kernel: each threadblock maps to a block of 512 subjects (threads) for 1 variable; each thread calculates its subject's contribution to the gradient and Hessian. Sum (reduction) across the 512 subjects, then sum (reduction) across subject blocks in a new kernel. Compute the log-likelihood change for each variable (as above), then apply a max operator (log2 reduction) to select the variable with the greatest contribution to the likelihood. Iterate until the likelihood increase is less than epsilon.
  44. Evaluation on a large dataset. GWAS data: 6,806 subjects in a case-control study of prostate cancer, with 1,047,986 SNPs typed; we invoke approx. 7 billion threads per iteration. Total walltime for 1 GCD iteration (a sweep across all variables): 15 minutes for an optimized serial implementation split across 2 slave CPUs, versus 5.8 seconds for the parallel implementation across 2 nVidia Tesla C2050 GPU devices: a 155x speedup.
  45. Parallel algorithms for fitting the LASSO. (Stochastic mirror) gradient descent (ref): sometimes we are interested in tuning λ for, say, the best cross-validation error. Greedy descent seems awfully wasteful in that only one β_j is updated; here, however, we can update all variables in parallel while cycling through subjects. The algorithm is extremely simple. For subject i, compute the gradient g_i = −y_i / (1 + exp(x_i β y_i)), then update the β vector: β_j = β_j − η g_i x_{i,j}. η is a learning parameter, set sufficiently small (e.g. 0.0001). ref: Shalev-Shwartz, Tewari: Proc. 26th Intern. Conf. Machine Learning 2009
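A sketch of the per-subject update above: the scalar gradient g_i is computed once for subject i, after which every coordinate β_j can be updated independently (and hence in parallel on the GPU). The data and the learning rate η are illustrative.

```python
import math

def sgd_update(beta, x_i, y_i, eta=1e-4):
    margin = sum(b * x for b, x in zip(beta, x_i)) * y_i
    g_i = -y_i / (1.0 + math.exp(margin))             # scalar gradient for subject i
    return [b - eta * g_i * x for b, x in zip(beta, x_i)]  # all j updated in parallel

beta = sgd_update([0.0, 0.0], [1.0, 2.0], 1)          # at beta=0, g_i = -0.5
print(beta)  # [5e-05, 0.0001]
```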
  46. Gradient descent performance. Slow convergence compared to serial cyclic coordinate descent, but far more scalable; for large lambdas, slower than greedy coordinate descent. The computation-to-bandwidth ratio is not great: for 1 million SNPs, only about a 15x speedup, so far more SNPs are needed. Technical issues: we must store genotypes in subject-major order to enable coalesced memory loads/stores, which makes SNP-level summaries like means and SDs difficult to compute. Heterogeneous data types: floats (E, ExG) and compressed chars (G, GxG). Memory constrained: we can perform interactions on the fly with SNP-major order.
  47. Potential for robust variable selection. Subsampling: applying the LASSO once overfits the data, and the resulting model selection is inconsistent. Subsampling is preferable: bootstrapping, stability selection, x-fold cross-validation, with number of replicates << number of samples << number of features. Bayesian variable selection: if we assume the β_LASSO are conditionally independent, the master node can (quickly) sample hyperparameters (e.g. λ) from a prior distribution.