
- 1. Using OpenCL to accelerate genomic analysis. Gary K. Chen, June 16, 2011
- 2. An outline: OpenCL Introduction; Copy number inference in tumors; Data considerations; Hidden Relatedness; Variable Selection
- 3. Scientific Programming on GPGPU devices: nVidia and ATI are currently the market leaders, very competitive in performance and price. Impressive double-precision performance, though still about 4 times slower than 32-bit FP. ATI 9370 chipset: 528 GFLOPS (FP64), 4GB GDDR5, $2,399. nVidia Tesla C2050: 520 GFLOPS (FP64), 3GB GDDR5, $2,199. Source: www.sabrepc.com
- 4. Future multi-core CPUs: Intel's 48-core SCC chip. Potentially a more powerful solution for data-intensive computing, since it is not constrained by the PCI bus
- 5. An open-standards based development platform
- 6. Same idea as CUDA, different terms
- 7. Data parallel coding
- 8. OpenCL Concepts
- 9. An outline: OpenCL Introduction; Copy number inference in tumors; Data considerations; Hidden Relatedness; Variable Selection
- 10. Biology background: DNA is a string with a four-letter alphabet: A, C, G, T. Humans have two copies: one from mom, one from dad. Most of the sequence between two strands is the same, except for a small proportion. Example sequence: ATATTGC. We could have: a single nucleotide polymorphism (common point mutation): ATATAGC; or copy number variants/aberrations (deletions, amplifications, translocations): AT–GC, ATATTATTATTGC
- 11. SNP microarrays
- 12. What is observed: Microarray output. Probes are dyed, and microarrays are scanned with CCD cameras. X, Y: intensities of the A and B alleles (the two possible variants). R = X+Y: overall intensity. LRR (log2 R ratio): intensity relative to a standard intensity. BAF (B allele frequency): ratio of allelic intensity between A and B
- 13. Inferring CNVs from microarray output
- 14. Hidden Markov Model: A formalized statistical model. We want to use information from observables (LRR, BAF) to infer the true state of nature (copy number, genotype). Table: Example hidden states from PennCNV software:
  State  CN  Possible genotypes
  1      0   Null
  2      1   A, B
  3      2   AA, AB, BB
  4      2   AA, BB
  5      3   AAA, AAB, ABB, BBB
  6      4   AAAA, AAAB, AABB, ABBB, BBBB
- 15. Copy number inference in tumors: Inference is harder! 1. When dissecting breast tissue, for example, stromal (normal cell) contamination is almost inevitable; hence you are modeling a mixture of two or more cell populations. Suppose you have a state assuming normal CN=2, tumor CN=4, α = 0.2, e.g. ri = α·ri,n + (1 − α)·ri,t; expected mean intensity: 0.2(1) + 0.8(1.68) = 1.544. 2. Amplification events can be wilder than germline (e.g. blood) events, leading to greater copy number/genotype possibilities. Combine issues 1) and 2) and you can get a huge search space
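The mixture arithmetic on this slide is simple enough to check directly; a minimal sketch (the function name `mixed_intensity` is mine, not from the talk):

```python
def mixed_intensity(alpha, r_normal, r_tumor):
    """Expected overall intensity for a mixed cell population, where
    alpha is the fraction of normal (stromal) cells and the rest is tumor:
    r = alpha * r_normal + (1 - alpha) * r_tumor."""
    return alpha * r_normal + (1 - alpha) * r_tumor

# The slide's example: normal CN=2 (r=1), tumor CN=4 (r=1.68), alpha=0.2
print(mixed_intensity(0.2, 1.0, 1.68))  # 1.544
```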
- 16. Expanded state space:
  ID  BACn  CNt  BACt  α    r̄     b̄
  0   0     2    0     0.3  1     0
  1   0     2    0     0.6  1     0
  2   1     2    1     0.3  1     0.5
  3   1     2    1     0.6  1     0.5
  4   2     2    2     0.3  1     1
  5   2     2    2     0.6  1     1
  6   0     1    0     0.3  0.65  0
  7   0     1    0     0.6  0.8   0
  8   0     1    1     0.3  0.65  0.538462
  9   0     1    1     0.6  0.8   0.25
  10  1     1    0     0.3  0.65  0.230769
  11  1     1    0     0.6  0.8   0.375
  12  1     1    1     0.3  0.65  0.769231
  13  1     1    1     0.6  0.8   0.625
  14  2     1    0     0.3  0.65  0.461538
  15  2     1    0     0.6  0.8   0.75
  16  2     1    1     0.3  0.65  1
  17  2     1    1     0.6  0.8   1
  18  0     3    0     0.3  1.35  0
  19  0     3    0     0.6  1.2   0
  20  1     3    1     0.3  1.35  0.37037
  21  1     3    1     0.6  1.2   0.416667
  22  1     3    2     0.3  1.35  0.62963
  23  1     3    2     0.6  1.2   0.583333
  24  2     3    3     0.3  1.35  1
  25  2     3    3     0.6  1.2   1
- 17. Algorithm: Initialize: empirically estimate σ of BAF and LRR; compute emission matrix O for each state/observation from a Gaussian pdf. Train (Expectation Maximization): forward-backward computes posterior probabilities and the overall likelihood; Baum-Welch computes the MLE of the transition probabilities in matrix T. Traverse the state path: Viterbi (dynamic programming) walks the state path based on max-product
- 18. Parallel Forward Algorithm: We compute the probability vector at observation t: f0:t = f0:t−1 · T · Ot. Each state (element of the m-state vector) can independently compute a sum-product. Threadblocks map to states; threads calculate products in parallel, followed by a log2(m) addition reduction
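The kernel itself is not in the transcript; as a reference for what each threadblock computes, here is a serial sketch of one forward step (the function name `forward_step` is mine):

```python
def forward_step(f_prev, T, O_t):
    """One step of the forward algorithm: f_{0:t} = f_{0:t-1} T O_t.
    f_prev: length-m probability vector from the previous observation;
    T: m x m transition matrix; O_t: length-m emission probabilities
    for observation t. On the GPU, each output state j is one threadblock:
    threads compute the products f_prev[i] * T[i][j] in parallel, then a
    log2(m) tree reduction forms the sum."""
    m = len(f_prev)
    return [sum(f_prev[i] * T[i][j] for i in range(m)) * O_t[j]
            for j in range(m)]

# Tiny 2-state example
print(forward_step([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]], [1.0, 1.0]))
```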
- 19. Technical issue: Underflow. Tiny probabilities often have to be represented in log space (even for FP64). How do we deal with adding log probabilities? We usually exponentiate, add, then log. Remedy: add an offset to the log values before exponentiating, then subtract the offset from the log-space answer
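The offset remedy described here is the standard log-sum-exp trick; a minimal sketch (using the max as the offset, a common choice the slide does not spell out):

```python
import math

def log_sum_exp(log_probs):
    """Add probabilities stored in log space without underflow: subtract
    an offset (here, the maximum) before exponentiating, so every exp()
    argument is <= 0, then add the offset back to the log of the sum."""
    offset = max(log_probs)
    return offset + math.log(sum(math.exp(lp - offset) for lp in log_probs))

# Naive exp-add-log would underflow here; the offset version does not.
print(log_sum_exp([-1000.0, -1000.0]))  # -1000 + log(2)
```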
- 20. Gridblocks: Forward Backward Calculation
- 21. Code: Computing products in parallel
- 22. Code: 2 Reductions: computing oﬀset,sum-product
- 23. Algorithm Improvements: Examples: Re-scaling the transition matrix (accounting for SNP spacing): serial O(2nm²), parallel O(n). Forward-backward: serial O(2nm²), parallel O(n log2 m). Viterbi: serial O(nm²), parallel O(n log2 m). Normalizing constant (Baum-Welch): serial O(nm), parallel O(log2 n). MLE of the transition matrix (Baum-Welch): serial O(nm²), parallel O(n)
- 24. Performance: Table: One EM iteration on Chr 1 (41,263 SNPs):
  States  CPU     GPU     Fold-speedup
  128     9.5m    37s     15x
  512     2h 35m  1m 44s  108x
- 25. An outline: OpenCL Introduction; Copy number inference in tumors; Data considerations; Hidden Relatedness; Variable Selection
- 26. Storing data: Global memory is relatively abundant, but slow. However, even 4GB may be insufficient for modern datasets. Genotype data is highly compressible: we only care whether a position differs from the canonical sequence, so AA, AB, BB, NULL are the 4 possible genotypes. We should be able to encode this in two bits, i.e. 4 genotypes per byte
- 27. Possible approaches: Store as a float array (+: easy to implement; -: uses 16 times as much memory as needed!). Store as an int array, allocating a local memory array of 256 rows by 4 cols to map all possible genotype 4-tuples (+: uses global memory efficiently, maximizes bandwidth; -: you might not even have enough local memory, much less any left for real work). Store as a char array, right-bitshifting pairs of bits and then masking with 3 (+: uses global memory efficiently, saves on local memory; -: threads load a minimum of 4 bytes per word, so you use 25% of available bandwidth)
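A serial sketch of the 2-bit packing scheme behind the char-array approach (function names and the genotype-to-code assignment are mine; extraction uses shift-then-AND-with-3, which is presumably what the slide's "mask with 3" means):

```python
def pack_genotypes(genos):
    """Pack genotypes coded 0..3 (e.g. AA, AB, BB, NULL) four per byte,
    two bits each, least-significant pair first."""
    packed = bytearray((len(genos) + 3) // 4)
    for i, g in enumerate(genos):
        packed[i // 4] |= (g & 3) << (2 * (i % 4))
    return bytes(packed)

def unpack_genotype(packed, i):
    """Recover genotype i: right-bitshift its pair of bits into place,
    then mask with 3 to drop the neighboring genotypes."""
    return (packed[i // 4] >> (2 * (i % 4))) & 3

genos = [0, 1, 2, 3, 2, 1]
packed = pack_genotypes(genos)
print([unpack_genotype(packed, i) for i in range(len(genos))])
```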
- 28. One solution: a custom container. Idea: designate each threadblock to handle 512 genotypes. First 32 threads: each loads a packedgeno_t element. For each of the 32 threads: loop four times, extracting each char; subloop four times, extracting each genotype via bitshift/mask
- 29. Illustration
- 30. An outline: OpenCL Introduction; Copy number inference in tumors; Data considerations; Hidden Relatedness; Variable Selection
- 31. Inferring relatedness: The human race is one large pedigree. Individuals of the same ethnicity are expected to share more SNP alleles. We can summarize this relationship through a correlation matrix called 'K'
- 32. Uses for the 'K' matrix: Principal Components Analysis: a singular value decomposition on 'K', K = VDV′; V contains orthogonal axes, facilitating population structure inference. Estimating heritability in random effects models: Y = µ + βX + ε with Var(ε) = γ²K + σ²I, and h² = γ²/(γ² + σ²)
- 33. Example: Latino samples in LA
- 34. Computing K: Essentially a matrix multiplication: K̂jk = (1/m) Σi=1..m (xij − 2fi)(xik − 2fi) / (4fi(1 − fi)). In other words, K = ZZ′. Including more SNPs adds more precise, subtle information. Parallel code: carrying out matrix multiplication is straightforward on a GPU; matrix multiplication is ideal for the GPU (approx. 240x speedup). Because K is summed over SNPs, we can split the genotype matrix into subsets of SNPs and compute each K slice in parallel
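A serial reference sketch of the K̂ formula above (function name `kinship` is mine; genotypes coded 0/1/2, fi is the allele frequency of SNP i). The final comment mirrors the slide's point that K is a sum over SNPs, so SNP subsets can be computed on separate devices and the partial matrices added:

```python
def kinship(X, freqs):
    """K_jk = (1/m) * sum_i (x_ij - 2 f_i)(x_ik - 2 f_i) / (4 f_i (1 - f_i)),
    i.e. K = Z Z' with standardized genotypes Z.
    X: m SNPs x n subjects, genotypes coded 0/1/2; freqs: the m allele
    frequencies f_i (assumed strictly between 0 and 1)."""
    m, n = len(X), len(X[0])
    K = [[0.0] * n for _ in range(n)]
    for i in range(m):  # each SNP contributes an independent rank-1 term,
        f = freqs[i]    # so SNP subsets can run on separate GPUs
        denom = 4 * f * (1 - f)
        z = [x - 2 * f for x in X[i]]
        for j in range(n):
            for k in range(n):
                K[j][k] += z[j] * z[k] / denom
    return [[v / m for v in row] for row in K]
```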
- 35. An outline: OpenCL Introduction; Copy number inference in tumors; Data considerations; Hidden Relatedness; Variable Selection
- 36. Variable Selection: One goal in biomedical research is correlating DNA variation with disease phenotypes. Genomics technology: the number of subjects n remains about the same (cost of recruiting, sample preps, etc.), while the number of features p is exploding. The rate at which data is generated per dollar surpasses Moore's Law
- 37. Regression: Standard logistic regression is the usual method for hypothesis testing of candidate predictors: log(p/(1 − p)) = βX, p being the probability of affection. We apply Newton-Raphson scoring until f(β) is maximized. Logistic regression simply fails when p > n. L1 penalized regression, aka LASSO. Idea: fit the logistic regression model, but subject to a penalty parameter λ: g(β) = f(β) − λ Σj=1..p |βj|
- 38. Algorithms for fitting the LASSO: Cyclic Coordinate Descent: a one-dimensional Newton-Raphson step at variable j: Δβj = βj(new) − βj = −g′(βj)/g″(βj), where g′(βj) = Σi=1..n xi,j yi / (1 + exp(xi,j βj yi)) − sgn(βj)λ and g″(βj) = Σi=1..n xi,j² exp(xi,j βj yi) / (1 + exp(xi,j βj yi))². We cycle through each j until the likelihood stops increasing within some tolerance. Performs great, but only allows parallelization across samples. ref: Genkin, Lewis, Madigan: Technometrics 2007, Vol 49, No. 3
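A serial sketch of the one-dimensional update above (function name `ccd_update` is mine; labels yi in {−1, +1}, as the exp(xi,j βj yi) form implies; g2 below is the magnitude of the second derivative, so the Newton step −g′/g″ becomes an addition):

```python
import math

def ccd_update(beta_j, x_col, y, lam):
    """One Newton-Raphson step for coordinate j of LASSO logistic
    regression. x_col: the j-th column of the design matrix;
    y: labels in {-1, +1}; lam: the L1 penalty."""
    s = math.copysign(1.0, beta_j) if beta_j != 0 else 0.0
    # penalized gradient g'(beta_j)
    g1 = sum(x * yi / (1 + math.exp(x * beta_j * yi))
             for x, yi in zip(x_col, y)) - s * lam
    # |second derivative| g''(beta_j)
    g2 = sum(x * x * math.exp(x * beta_j * yi) /
             (1 + math.exp(x * beta_j * yi)) ** 2
             for x, yi in zip(x_col, y))
    return beta_j + g1 / g2
```

Each term of the two sums is one subject's contribution, which is what the talk parallelizes across samples.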
- 39. Distributed GPU implementation: If it is possible to parallelize across variables, it is worth splitting up the design matrix. For really large dimensions, we can link up an arbitrary number of GPUs. The Message Passing Interface allows us to be agnostic to the physical location of the GPU devices
- 40. Distributed GPU implementation: Approach: the MPI master node delegates the heavy lifting to slaves across the network. The master node performs fast serial code, such as sampling a new λ, comparing logLs, broadcasting gradients, etc. Network traffic is kept to a minimum. Implemented for Greedy Coordinate Descent and Gradient Descent. Developed on a server at the USC Epigenome Center: 2 Tesla C2050s
- 41. Parallel algorithms for fitting the LASSO: Greedy coordinate descent: the same algorithm as CCD, except that in each variable sweep we update only the j that gives the greatest increase in logL. No dependencies between subjects and variables: massive parallelization across subjects AND variables. Ideal if you have a huge dataset and want a stringent type 1 error rate (you only care about a few variables). Ayers and Cordell, Gen Epi 2010: permute, and pick the largest λ that allows the first "false" variable to enter. ref: Wu, Lange: Annals Appl Stat 2008, Vol 2, No. 1
- 42. Layout for greedy coordinate descent implementation
- 43. Overview of the Greedy CD algorithm: Newton-Raphson kernel: each threadblock maps to a block of 512 subjects (threads) for 1 variable. Each thread calculates its subject's contribution to the gradient and hessian. Sum (reduction) across the 512 subjects; sum (reduction) across subject blocks in a new kernel. Compute the log-likelihood change for each variable (as above). Apply a max operator (log2 reduction) to select the variable with the greatest contribution to the likelihood. Iterate until the likelihood increase is less than epsilon
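The kernel code is not in the transcript; a serial sketch of the log2-style max reduction used to pick the winning variable (function name `tree_max_reduce` is mine; each pass halves the array, mirroring a threadblock tree reduction):

```python
def tree_max_reduce(vals):
    """Pairwise max reduction in ceil(log2(n)) passes, returning the
    argmax and max. Each pass compares element i against element i+half,
    the same access pattern a GPU reduction kernel uses."""
    idx = list(range(len(vals)))
    vals = list(vals)
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        for i in range(half):
            j = i + half
            if j < len(vals) and vals[j] > vals[i]:
                vals[i], idx[i] = vals[j], idx[j]
        vals, idx = vals[:half], idx[:half]
    return idx[0], vals[0]

# Pick the variable with the greatest log-likelihood change
print(tree_max_reduce([0.1, 0.5, 0.3, 0.2]))
```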
- 44. Evaluation on a large dataset: GWAS data: 6,806 subjects in a case-control study of prostate cancer; 1,047,986 SNPs typed; approx. 7 billion threads invoked per iteration. Total walltime for 1 GCD iteration (a sweep across all variables): 15 minutes on an optimized serial implementation split across 2 slave CPUs; 5.8 seconds on the parallel implementation across 2 nVidia Tesla C2050 GPU devices; a 155x speedup
- 45. Parallel algorithms for fitting the LASSO: (Stochastic Mirror) Gradient Descent: Sometimes we are interested in tuning λ for, say, the best cross-validation error. Greedy descent seems awfully wasteful in that only one βj is updated. However, we can update all variables in parallel while cycling through subjects. The algorithm is extremely simple: for subject i, gradient gi = −yi / (1 + exp(xi β yi)); update the β vector as βj = βj − η gi xi,j, where η is a learning parameter set sufficiently small (e.g. 0.0001). ref: Shalev-Shwartz, Tewari: Proc. 26th Intern. Conf. on Machine Learning, 2009
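A serial sketch of one subject's update (function name `sgd_update` is mine; labels yi in {−1, +1}, consistent with the gradient formula above). The coordinate updates have no dependencies on each other, which is what makes this step GPU-friendly:

```python
import math

def sgd_update(beta, x_i, y_i, eta=1e-4):
    """One stochastic gradient step for subject i in L1-penalized logistic
    regression (penalty handling omitted for brevity). All p coordinates
    of beta update independently, so they can run in parallel on the GPU."""
    dot = sum(b * x for b, x in zip(beta, x_i))
    g_i = -y_i / (1 + math.exp(dot * y_i))  # scalar gradient for subject i
    return [b - eta * g_i * x for b, x in zip(beta, x_i)]
```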
- 46. Gradient descent: Performance: slow convergence compared to serial cyclic coordinate descent, but far more scalable. For large lambdas, slower than greedy coordinate descent. The computation-to-bandwidth ratio is not great: for 1 million SNPs, only about a 15x speedup; far more SNPs are needed. Technical issues: genotypes must be stored in subject-major order to enable coalesced memory loads/stores, which makes SNP-level summaries like means and SDs difficult to compute. Heterogeneous data types: floats (E, ExG), compressed chars (G, GxG). Memory constrained: can perform interactions on the fly with SNP-major order
- 47. Potential for robust variable selection: Subsampling: applying LASSO once overfits the data; model selection is inconsistent. Subsampling is preferable: bootstrapping, stability selection, x-fold cross validation. Number of replicates << number of samples << number of features. Bayesian variable selection: if we assume the βLASSO are conditionally independent, the master node can (quickly) sample hyperparameters (e.g. λ) from a prior distribution
