Multi-core programming talk for weekly biostat seminar
Presentation Transcript

  • Title slide: Using multi-core algorithms to speed up optimization
    Gary K. Chen, Biostat Noon Seminar, March 23, 2011
  • An outline
    Introduction to high-performance computing
    Concepts
    Example 1: Hidden Markov Model Training
    Example 2: Regularized Logistic Regression
    Closing remarks
  • CPUs are not getting any faster
    Heat and power are the sole obstacles.
    According to Intel: underclock a single core by 20 percent and you save half the power while sacrificing only 13 percent of the performance.
    Implication: two cores at the same power budget deliver roughly 73% more performance, since (100 - 13) * 2 / 100 is about 1.7.
  • 1. High-performance computing clusters
    Coarse-grained, aka "embarrassingly parallel", problems:
    1. Launch multiple instances of the program
    2. Compute summary statistics across log files
    Examples: Monte Carlo simulations (power/specificity), GWAS scans, imputation, etc.
    Pros: maximizes throughput (CPUs kept busy), gentle learning curve
    Cons: doesn't address some interesting computational problems
  • Cluster resource example
    HPCC at USC: a 94-teraflop cluster with 1,980 simultaneous processes running on the main queue.
    Jobs are asynchronous; they can start and end in any order.
    Portable Batch System: simply prepend some headers to your shell script describing how much memory you want, how long your job will run, etc.
  • 2. High-performance computing clusters
    Tightly-coupled parallel programs via the Message Passing Interface (MPI):
    1. Programs are distributed across multiple physical hosts
    2. Each program executes the exact same code
    3. All processes can be synchronized at strategic points
    Pro: can run interesting algorithms like parallel tempered MCMC
    Con: the developer is responsible for establishing a communication protocol
  • Exploiting multiple-core processors
    Fine-grained parallelism: implies a much higher degree of inter-dependence between processes.
    A "master" process executes the majority of the code base; "slave" processes are invoked to ease bottlenecks.
    We hope to minimize the time spent in the master process.
    Some Bayesian algorithms stand to benefit.
  • Amdahl's Law
    speedup = 1 / ((1 - P) + P/N), where P is the parallelizable fraction of the program and N is the number of processors.
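A hypothetical worked example of Amdahl's Law (the values P = 0.9 and N = 8 are assumed, not from the talk):

\[
\text{speedup} = \frac{1}{(1 - 0.9) + 0.9/8} = \frac{1}{0.2125} \approx 4.7
\]

So even with 90% of the work parallelized, eight cores yield less than a 5x speedup; the serial fraction dominates.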
  • Heterogeneous Computing (figure slide)
  • Multi-core programming
    aka data-parallel programming; built in to common compilers (e.g. gcc), so it is very easy to get started.
    SSE (Streaming SIMD Extensions): each core can do vector operations.
    OpenMP: parallel processing across multiple cores; simply insert a "pragma omp for" directive and compile with gcc (see the sketch below).
    CUDA/OpenCL: CUDA is a proprietary C-based language endorsed by nVidia; OpenCL is a standards-based implementation backed by the Khronos Group.
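A minimal sketch of the OpenMP directive mentioned above. The loop body and array are illustrative only; compiled with something like gcc -fopenmp (or nvcc -Xcompiler -fopenmp when mixed with CUDA host code), the iterations are divided across the available cores.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const int n = 1000000;
    static double x[1000000];

    /* Each core gets a contiguous chunk of the loop iterations. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        x[i] = sin(i * 0.001);
    }

    printf("x[42] = %f\n", x[42]);
    return 0;
}
```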
  • OpenCL and CUDA
    CUDA: powerful libraries available to enrich productivity, e.g. Thrust (C++ generics) and cuBLAS (level 1 and 2 parallel BLAS); supported only on nVidia GPU devices.
    OpenCL: compatible with nVidia and ATI GPU devices, as well as AMD/Intel CPUs; lags behind CUDA in libraries and tools; good to work with, given that ATI hardware currently leads in value.
  • A $60 HPC under your desk (figure slide)
  • An outline (section divider; repeats the outline above)
  • Threads and threadblocks
    Threads: perform a very limited function, but do all the heavy lifting; extremely lightweight, so you'll want to launch thousands.
    Threadblocks: the developer assigns threads that can cooperate on a common task into threadblocks; threadblocks cannot communicate with one another and run in any order (asynchronously).
  • Thread organization (figure slide)
  • Memory hierarchy (figure slide)
  • Kernels
    Warps/wavefronts: an atomic set of threads (32 for nVidia, 64 for ATI); instructions are executed in lock step across the set, with each thread processing a distinct data element; the developer is responsible for synchronizing across warps.
    Kernels: the code the developer writes, which executes on a SIMD device; essentially C functions.
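A minimal sketch of what a CUDA kernel looks like; the operation (scaling a vector) and the launch parameters are illustrative assumptions, not code from the talk.

```cuda
#include <cuda_runtime.h>

// Kernel: runs once per thread; each thread handles one array element.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overshoot
        x[i] *= a;
}

// Host-side launch: 256 threads per threadblock, enough blocks to cover n.
void scale_on_gpu(float *d_x, float a, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_x, a, n);
    cudaDeviceSynchronize();
}
```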
  • An outline (section divider; repeats the outline above)
  • Hidden Markov Models
    A staple in machine learning.
    Many applications in statistical genetics, including imputation of untyped genotypes, local ancestry, and sequence alignment (e.g. protein family scoring).
  • Application to cancer tumor data
    Extending PennCNV:
    Tissues are assumed to be a mixture of tumor and normal cells.
    Tumors are assumed to be heterogeneous in copy number across cells, implying fractional copy-number states.
    PennCNV defines 6 hidden integer states for normal cells and does not infer allelic state.
    We can make more precise estimates of both copy number and allelic state in tumors with little sacrifice in performance.
    Copy number: z = (1 - α) z_normal + α z_tumor, where z is fractional and z_tumor = I(z ≤ 2) floor(z) + I(z > 2) ceil(z).
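A purely hypothetical plug-in of the mixture formula above (the tumor fraction and copy numbers are assumed values, not from the talk):

\[
z = (1-\alpha)\,z_{\text{normal}} + \alpha\,z_{\text{tumor}} = 0.5 \times 2 + 0.5 \times 3 = 2.5
\]

That is, a 50/50 mixture of normal cells (2 copies) and tumor cells carrying a single-copy gain yields a fractional copy number of 2.5.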
  • State space
    state  CNfrac  BACnormal  CNtumor  BACtumor
    0      2       0          2        0
    1      2       1          2        1
    2      2       2          2        2
    3      0       0          0        0
    4      0       1          0        0
    5      0       2          0        0
    6      0.5     0          0        0
    7      0.5     1          0        0
    8      0.5     2          0        0
    9      1       0          1        0
    10     1       1          1        0
    11     1       1          1        1
    12     1       2          1        1
    13     1.5     0          1        0
    14     1.5     1          1        0
    15     1.5     1          1        1
    16     1.5     2          1        1
    17     2.5     0          3        0
    18     2.5     1          3        1
    19     2.5     1          3        2
    20     2.5     2          3        3
    21     3       0          4        0
    22     3       1          4        1
    23     3       1          4        2
    24     3       1          4        3
    25     3       2          4        4
    26     3.5     0          4        0
    27     3.5     1          4        1
    28     3.5     1          4        2
    29     3.5     1          4        3
    30     3.5     2          4        4
  • Training a Hidden Markov Model
    Objective: infer the probabilities of transitioning between any pair of states.
    Apply the forward-backward and Baum-Welch algorithms, a special case of the Expectation-Maximization (or, more generally, MM) family of algorithms.
    Expectation step: forward-backward computes posterior probabilities based on the estimated parameters.
    Maximization step: Baum-Welch empirically estimates the parameters by averaging across observations.
  • Forward algorithm
    We compute the probability vector at observation t: f_{0:t} = f_{0:t-1} T O_t.
    Each state (element of the m-state vector) can independently compute a sum-product.
    Threadblocks map to states; threads calculate the products in parallel, followed by a log2(m) addition reduction.
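A minimal sketch of how one forward-recursion step might map onto the GPU as described above. Assumptions not in the talk: m is a power of two no larger than the threadblock limit, T is stored row-major (source state by destination state), there is no log-space rescaling, and all pointer names are hypothetical.

```cuda
#include <cuda_runtime.h>

// One threadblock per destination state j; one thread per source state i.
// Computes f_t[j] = (sum_i f_prev[i] * T[i][j]) * O_t[j] via a log2(m) reduction.
__global__ void forward_step(const float *f_prev, const float *T,
                             const float *O_t, float *f_t, int m) {
    extern __shared__ float prod[];        // one partial product per thread
    int j = blockIdx.x;                    // destination state
    int i = threadIdx.x;                   // source state

    prod[i] = f_prev[i] * T[i * m + j];    // this thread's product
    __syncthreads();

    // Tree reduction: log2(m) rounds of pairwise additions.
    for (int stride = m / 2; stride > 0; stride >>= 1) {
        if (i < stride) prod[i] += prod[i + stride];
        __syncthreads();
    }

    if (i == 0) f_t[j] = prod[0] * O_t[j]; // fold in the observation term
}

// Launch sketch: m blocks of m threads, m*sizeof(float) bytes of shared memory.
// forward_step<<<m, m, m * sizeof(float)>>>(d_f_prev, d_T, d_O_t, d_f_t, m);
```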
  • Gridblock of threadblocks (figure slide)
  • Speedups
    We implement 8 kernels. Examples:
    Re-scaling the transition matrix (for SNP spacing): serial O(2nm^2); parallel O(n)
    Forward-backward: serial O(2nm^2); parallel O(n log2(m))
    Normalizing constant (Baum-Welch): serial O(nm); parallel O(log2(n))
    MLE of the transition matrix (Baum-Welch): serial O(nm^2); parallel O(n)
  • Run time comparison
    Table: 1 iteration of HMM training on Chr 1 (41,263 SNPs)
    states    CPU       GPU       fold-speedup
    128       9.5m      37s       15x
    512       2h 35m    1m 44s    108x
  • An outline (section divider; repeats the outline above)
  • Regularized Regression: variable selection
    For tractability, most GWAS analyses entail separate univariate tests of each variable (e.g. SNP, GxG, GxE).
    However, it is preferable to model all variables simultaneously to tease out correlated variables.
    This is problematic when p > n: parameters are unestimable and matrix inversion becomes computationally intractable.
  • Regularized Regression: the LASSO
    The LASSO method (Tibshirani, 1996) seeded a cottage industry of related methods, e.g. Group LASSO, Elastic Net, MCP, NEG, Overlap LASSO, Graph LASSO.
    It fundamentally solves the variable-selection problem by introducing an L1 norm to induce sparsity.
    Limitation: these methods do not provide a mechanism for hypothesis testing (e.g. p-values).
  • Regularized Regression: Bayesian methods
    Posterior inferences on β, e.g. the Bayesian LASSO and the Bayesian Elastic Net.
    Highly computational; scaling up to the genome-wide scale is not obvious.
    MCMC is inherently serial, so the best option is to speed up the sampling chain.
    Proposal: implement the key bottleneck on the GPU: fitting β_LASSO to the data.
  • Optimization
    For binomial logistic regression:
    L(β) = Σ_{i=1..n} [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    p_i = exp(µ + x_i'β) / (1 + exp(µ + x_i'β))
    dL(β) = Σ_{i=1..n} [ y_i - p_i(β) ] x_i
    -d²L(β) = Σ_{i=1..n} p_i(β) [ 1 - p_i(β) ] x_i x_i'
    For *penalized* regression: f(β) = L(β) - λ Σ_{j=1..p} |β_j|
    Find the global maximum by applying Newton-Raphson one variable at a time:
    β_j^{m+1} = β_j^m + [ Σ_{i=1..n} (y_i - p_i(β^m)) x_ij - λ sgn(β_j^m) ] / [ Σ_{i=1..n} p_i(β^m) (1 - p_i(β^m)) x_ij² ]
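A hedged serial sketch of the coordinate-wise update above, for a single variable j. The data layout (X stored one row per subject), the array names, and the assumption that prob[] already holds the current p_i are illustrative; the non-differentiability of the penalty at β_j = 0 is ignored for brevity.

```c
#include <math.h>

// One coordinate-wise Newton-Raphson step for variable j of a
// lambda-penalized logistic regression (see the update rule above).
// X is n-by-p, row-major (one row per subject); prob[i] = p_i(beta).
double update_beta_j(const double *X, const double *y, const double *prob,
                     double beta_j, double lambda, int n, int p, int j) {
    double grad = 0.0, hess = 0.0;
    for (int i = 0; i < n; i++) {
        double xij = X[i * p + j];
        grad += (y[i] - prob[i]) * xij;                 // dL/dbeta_j
        hess += prob[i] * (1.0 - prob[i]) * xij * xij;  // -d2L/dbeta_j^2
    }
    double sgn = (beta_j > 0) - (beta_j < 0);           // sgn(beta_j)
    return beta_j + (grad - lambda * sgn) / hess;
}
```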
  • Overview of algorithm
    Newton-Raphson kernel:
    Each threadblock maps to a block of 512 subjects (threads) for 1 variable.
    Each thread calculates its subject's contribution to the gradient and Hessian.
    Sum (reduction) across the 512 subjects; then sum (reduction) across subject blocks in a new kernel.
    Compute the log-likelihood change for each variable (as above).
    Apply a max operator (a log2 reduction) to select the variable with the greatest contribution to the likelihood.
    Iterate until the likelihood increase is less than epsilon.
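A hedged sketch of the per-variable gradient/Hessian reduction described above, with 512 subjects per threadblock as on the slide. For clarity it assumes uncompressed float covariates and handles a single variable; the real kernel would read packed genotypes and a second kernel would reduce the per-block partial sums. All names are assumptions.

```cuda
// One threadblock handles 512 subjects for one variable. Per-block partial
// sums are written to global memory for a follow-up reduction kernel.
__global__ void grad_hess_partial(const float *x,    // covariate values, length n
                                  const float *y,    // outcomes, length n
                                  const float *p,    // current fitted p_i, length n
                                  float *grad_part,  // one entry per block
                                  float *hess_part,  // one entry per block
                                  int n) {
    __shared__ float g[512], h[512];
    int i = blockIdx.x * 512 + threadIdx.x;

    // Each thread computes one subject's contribution to gradient and Hessian.
    float xi = (i < n) ? x[i] : 0.0f;
    float ri = (i < n) ? (y[i] - p[i]) : 0.0f;
    float wi = (i < n) ? (p[i] * (1.0f - p[i])) : 0.0f;
    g[threadIdx.x] = ri * xi;
    h[threadIdx.x] = wi * xi * xi;
    __syncthreads();

    // log2(512) = 9 rounds of pairwise sums across the block.
    for (int stride = 256; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            g[threadIdx.x] += g[threadIdx.x + stride];
            h[threadIdx.x] += h[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        grad_part[blockIdx.x] = g[0];
        hess_part[blockIdx.x] = h[0];
    }
}
```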
  • Gridblock of threadblocks (figure slide)
  • Consideration of datatypes
    We need to compress genotypes. Why? Global memory is scarce and bandwidth is expensive.
    A warp of 32 threads loads 32 words (containing 512 genotypes) into local memory.
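A minimal sketch of the kind of 2-bit genotype packing implied above (16 genotypes per 32-bit word, so 32 words hold 512 genotypes). The specific encoding (0/1/2 minor-allele copies, 3 = missing) and the function names are assumptions, not the talk's actual scheme.

```cuda
#include <stdint.h>

// Pack 16 genotypes (values 0-3) into one 32-bit word: 2 bits per genotype.
__host__ __device__ inline uint32_t pack16(const uint8_t *geno) {
    uint32_t word = 0;
    for (int k = 0; k < 16; k++)
        word |= (uint32_t)(geno[k] & 0x3) << (2 * k);
    return word;
}

// Unpack genotype k (0-15) from a packed word.
__host__ __device__ inline int unpack(uint32_t word, int k) {
    return (word >> (2 * k)) & 0x3;
}

// Device usage sketch: a warp of 32 threads loads 32 consecutive words
// (one coalesced transaction) covering 512 genotypes for one SNP:
//   uint32_t word = packed[snp_offset + threadIdx.x];  // 16 genotypes
//   for (int k = 0; k < 16; k++) { int g = unpack(word, k); /* ... */ }
```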
  • Distributed GPU implementation
    For really large dimensions, we can link up an arbitrary number of GPUs; MPI allows us to spread the work across a cluster.
    Developed on Epigraph: 2 Tesla C2050s.
    Approach: the MPI master node delegates the heavy lifting to slaves across the network.
    The master node performs fast serial code, such as sampling from the full conditional likelihood of any penalty parameter (e.g. λ).
    To minimize network traffic, the slaves maintain up-to-date copies of the data structures.
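A hedged skeleton of the master/slave structure described above, with MPI rank 0 as the master and each remaining rank driving one GPU. The loop contents, the broadcast of λ, and the reduction of partial log-likelihoods are assumptions about how such a protocol might look, not the talk's actual code.

```c
#include <mpi.h>

// Master does the fast serial work (e.g. updating lambda) and broadcasts it;
// slaves run GPU sweeps over their slice of the variables and report partial
// log-likelihoods, which the master combines.
void lasso_mpi_loop(int iterations) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double lambda = 1.0, local_ll = 0.0, total_ll = 0.0;

    for (int it = 0; it < iterations; it++) {
        // Everyone receives the master's current lambda.
        MPI_Bcast(&lambda, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank != 0) {
            // Slave: launch GPU kernels for this rank's variables
            // (placeholder value stands in for the real GPU sweep).
            local_ll = 0.0; /* run_gpu_sweep(rank, lambda); */
        }

        // Combine the slaves' partial results on the master.
        MPI_Reduce(&local_ll, &total_ll, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    lasso_mpi_loop(10);
    MPI_Finalize();
    return 0;
}
```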
  • Evaluation on a large dataset
    GWAS data: 6,806 African American subjects in a case-control study of prostate cancer; 1,047,986 SNPs typed.
    Elapsed walltime for 1 LASSO iteration (a sweep across all variables):
    15 minutes for an optimized serial implementation across 2 slave CPUs.
    5.8 seconds for the parallel implementation across 2 nVidia Tesla C2050 GPU devices.
    A 155x speed-up.
  • An outline (section divider; repeats the outline above)
  • Conclusion
    Multi-core programming is not a panacea: insufficient parallelism leads to an inferior implementation, and graph algorithms generally do not map well to SIMD architectures.
    Programming effort: expect to spend at least 90% of your time debugging a black box.
    Is it worth it? Is human time > computer time? For generic problems (matrix multiplication, sorting), absolutely.
    OpenCL is a bit more verbose than CUDA, but is more portable.
  • Potential future work
    Reconstructing Bayesian networks: compute the joint probability for each possible topology; code the graph as a sparse adjacency matrix.
    Approximate Bayesian Computation: sample θ from some assumed prior distribution, generate a dataset conditional on θ, and examine how close the fake data are to the real data (see the sketch below).
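A purely illustrative sketch of the Approximate Bayesian Computation recipe just listed, using rejection sampling on a toy model (a coin bias θ with a uniform prior and the proportion of heads as the summary statistic). The model, tolerance, and function names are assumptions, not the talk's proposal.

```c
#include <stdlib.h>
#include <math.h>

static double runif(void) { return rand() / (double)RAND_MAX; }

// ABC rejection: draw theta from the prior, simulate a fake dataset given
// theta, and keep theta when the fake data are close enough to the real data.
// The accepted thetas approximate the posterior.
int abc_rejection(double observed_prop, int n, double eps,
                  double *accepted, int max_accepted) {
    int kept = 0;
    while (kept < max_accepted) {
        double theta = runif();                       // sample from Uniform(0,1) prior
        int heads = 0;                                // generate data conditional on theta
        for (int i = 0; i < n; i++) heads += (runif() < theta);
        double fake_prop = heads / (double)n;
        if (fabs(fake_prop - observed_prop) < eps)    // how close is the fake data?
            accepted[kept++] = theta;
    }
    return kept;
}
```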
  • Tomorrow's clusters will require heterogeneous programming (figure slide)
  • Tianhe-1A
    The world's fastest supercomputer: 4.7 petaflops (quadrillion floating-point operations per second).
    14,336 Xeon CPUs and 7,168 Tesla M2050 GPUs.
    According to nVidia, a CPU-only equivalent would need 50,000 CPUs and twice the floor space, and would draw 12 megawatts compared to 4.04 megawatts.
    $88 million to build; $20 million in annual energy costs.
  • Thanks to
    Kai: ideas for the CNV analysis
    Duncan, Wei: discussions on the LASSO
    Tim, Zack: access to Epigraph
    Alex, James: lively HPC discussions/debates