AMP Lab presentation -- Cloudbreak: A MapReduce Algorithm for Detecting Genomic Structural Variation
Description of the Cloudbreak Hadoop-based genomic structural variation detection algorithm for the UC Berkeley AMP Lab, 3/05/13

Presentation Transcript

  • Cloudbreak: A MapReduce Algorithm for Genomic Structural Variation Detection. Chris Whelan & Kemal Sönmez, Oregon Health & Science University. March 5, 2013.
  • Overview: Background; Current Approaches; MapReduce Framework for SV Detection; Cloudbreak Algorithm; Results; Ongoing Work.
  • Background - High-Throughput Sequencing: High-throughput (Illumina) sequencing produces millions of paired short (~100bp) reads of DNA from an input sample. The challenge: use these reads to find characteristics of the DNA sample relevant to disease or phenotype. The approach: in resequencing experiments, align the short reads to a reference genome for the species and find the differences. Sequencing errors, diploid genomes, and hard-to-map repetitive sequences make this difficult. High coverage (e.g. 30X) is needed to detect all single nucleotide polymorphisms (SNPs), which results in large data sets (100GB of compressed raw data for a human genome).
  • Structural Variations: Harder to detect than SNPs are structural variations: deletions, insertions, inversions, duplications, etc., generally events that affect more than 40 or 50 bases. The majority of variant bases in a normal individual genome are due to structural variations (primarily insertions and deletions). Structural variants are associated with cancer and neurological disease.
  • SV Detection Approaches: Four main algorithmic approaches. Read pair (RP): look for paired reads that map to the reference at a distance or orientation that disagrees with the expected characteristics of the library. Read depth (RD): infer deletions and duplications from the number of reads mapped to each locus. Split read mapping (SR): split individual reads into two parts and see if they map to either side of a breakpoint. De novo assembly (AS): assemble the reads into their original sequence and compare it to the reference. Hybrid approaches combine these signals.
  • SV Detection from Sequencing Data (figure from Mills et al., Nature 2011).
  • SV Detection is Hard: Sensitivity and FDR of deletion detection methods used in the 1000 Genomes Project (figure from Mills et al., Nature 2011).
  • Read-Pair (RP) SV Detection: Building the sample library involves selecting the size of the DNA fragments. Only the ends of each fragment are sequenced, from the outside in. Therefore the distance between the two sequenced reads (the insert size) is known, and is typically modeled as a normal distribution (see the sketch below).
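
To make the discordance test concrete, here is a minimal Java sketch under the usual normal model of the library insert size; the InsertSizeModel class and the 3-sigma threshold are illustrative assumptions, not details from the talk:

    // Minimal sketch (not Cloudbreak code): flagging a distance-discordant
    // read pair under a normal model of the library insert size. The
    // 3-sigma threshold is a common convention, assumed for illustration.
    public final class InsertSizeModel {
        private final double mean; // library mean insert size
        private final double sd;   // library insert size standard deviation

        public InsertSizeModel(double mean, double sd) {
            this.mean = mean;
            this.sd = sd;
        }

        /** Insert size implied by the outermost coordinates of the pair. */
        public static int insertSize(int leftReadStart, int rightReadEnd) {
            return rightReadEnd - leftReadStart;
        }

        /** A pair is distance-discordant if it falls outside mean +/- 3*sd. */
        public boolean isDiscordant(int observedInsertSize) {
            return Math.abs(observedInsertSize - mean) > 3 * sd;
        }
    }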
  • Discordant Read Pairs: When reads map to the reference farther apart than expected, this indicates a deletion in the sample between the mapping locations of the two ends. Reads that map closer together than expected imply an insertion. Reads in the wrong orientation imply an inversion (Medvedev et al. 2009).
  • Read Pair Algorithms: Identify all read pairs with discordant mappings and attempt to cluster discordant pairs supporting the same variant; concordant mappings are typically ignored. Some algorithms consider reads with multiple mappings by choosing the mappings that minimize the number of predicted variants, which has been shown to increase sensitivity in repetitive regions of the genome. Mapping results for a high-coverage human genome are very large (100GB of compressed alignment data storing only the best mappings for a 30X genome).
  • MapReduce and Hadoop: Provides a distributed filesystem across a cluster with redundant storage. Divides computation into Map and Reduce phases: Mappers emit key-value pairs for a block of data, and Reducers process all of the values for each key. Good at handling data sets of the size seen in sequencing experiments, and much larger. Able to harness a cluster of commodity machines rather than a single high-powered server. Some algorithms translate easily to the MapReduce model; others are much harder. A natural abstraction in resequencing experiments is to use a key for each location in the genome, as in SNP calling in GATK or Crossbow.
  • SV Detection in MapReduce: Clustering of read pairs as in traditional RP algorithms typically involves global computations or graph structures; MapReduce, on the other hand, forces local, parallel computations. Our approach: use MapReduce to compute features for each location in the genome from the alignments relevant to that location. Locations can be small tiled windows to make the problem more tractable. SV calls are then made from the features computed along the genome in a post-processing step.
  • An Algorithmic Framework for SV Detection in MapReduce (a Hadoop sketch of the second job follows below):

     1: job Alignment
     2:   function Map(ReadPairId rpid, ReadId r, ReadSequence s, ReadQuality q)
     3:     for all Alignments a ∈ Align(⟨s, q⟩) do
     4:       Emit(ReadPairId rpid, Alignment a)
     5:   function Reduce(ReadPairId rpid, Alignments a_{1,2,...})
     6:     AlignmentPairList ap ← ValidAlignmentPairs(a_{1,2,...})
     7:     Emit(ReadPairId rpid, AlignmentPairList ap)
     8: job Compute SV Features
     9:   function Map(ReadPairId rpid, AlignmentPairList ap)
    10:     for all AlignmentPairs ⟨a1, a2⟩ ∈ ap do
    11:       for all GenomicLocations l ∈ Loci(a1, a2) do
    12:         ReadPairInfo rpi ← ⟨InsertSize(a1, a2), AlignmentScore(a1, a2)⟩
    13:         Emit(GenomicLocation l, ReadPairInfo rpi)
    14:   function Reduce(GenomicLocation l, ReadPairInfos rpi_{1,2,...})
    15:     SVFeatures φ_l ← Φ(InsertSizes i_{1,2,...}, AlignmentScores q_{1,2,...})
    16:     Emit(GenomicLocation l, SVFeatures φ_l)
    17: StructuralVariationCalls svs ← PostProcess(φ_{1,2,...})
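
As a concrete illustration, here is a minimal Hadoop (Java) sketch of the Compute SV Features job; the 25bp window size, the tab-separated text encoding of alignment pairs, and the placeholder feature (a support count standing in for the GMM fit described later) are all assumptions made for the example, not details of the Cloudbreak source:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative sketch of the "Compute SV Features" job.
    public class SVFeatureJob {
        static final int WINDOW = 25; // tiling resolution (assumed)

        // Input value: one alignment pair, encoded "start<TAB>end<TAB>score".
        public static class SVFeatureMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = value.toString().split("\t");
                long start = Long.parseLong(f[0]);
                long end = Long.parseLong(f[1]);
                String insertAndScore = (end - start) + "\t" + f[2];
                // Emit the pair's insert size and alignment score to every
                // genomic window spanned by the pair (the Loci function).
                for (long w = start / WINDOW; w <= end / WINDOW; w++) {
                    ctx.write(new LongWritable(w), new Text(insertAndScore));
                }
            }
        }

        // The reducer plays the role of Phi: it summarizes all ReadPairInfos
        // for a window into SV features (here just a support count, standing
        // in for the mixture-model features described later in the talk).
        public static class SVFeatureReducer
                extends Reducer<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void reduce(LongWritable window, Iterable<Text> infos,
                    Context ctx) throws IOException, InterruptedException {
                int n = 0;
                for (Text ignored : infos) n++;
                ctx.write(window, new Text("supportingPairs=" + n));
            }
        }
    }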
  • Three User-Defined Functions: This framework leaves three functions to be defined; many different approaches can be taken within the framework, depending on the application:

    Loci : (a1, a2) → L_m ⊆ L
    Φ : {ReadPairInfo rpi_{m,i,j}} → R^N
    PostProcess : {φ_1, φ_2, ..., φ_N} → {⟨SVType s, l_start, l_end⟩}
  • Cloudbreak Implementation: We focus on detecting deletions and small insertions. Implemented as a native Hadoop application. Uses features computed by fitting a mixture model to the observed distribution of insert sizes at each locus. Processes as many mappings as possible for ambiguously mapped reads.
  • Local Distributions of Insert Sizes: Estimate the distribution of insert sizes observed at each window as a Gaussian mixture model (GMM), similar to the idea in MoDIL (Lee et al. 2009). A constrained expectation-maximization algorithm is used: one component is constrained to have the library mean insert size, both components are constrained to have the same variance, and the mean and weight of the second component are estimated (see the sketch below). Features computed include the log-likelihood ratio of the fitted two-component model to the likelihood of the insert sizes under a no-variant model (a single normal distribution with the library parameters), as well as the weight and estimated mean of the second component.
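
A minimal Java sketch of the constrained EM fit, written from the slide's description rather than from the Cloudbreak source; the initialization, the fixed iteration count, and the class and field names are assumptions:

    // Component 0 is pinned at the library mean; both components share the
    // library variance; EM estimates only the second component's mean and
    // weight. Convergence handling is simplified for the sketch.
    public final class ConstrainedGmm {
        public double mu2;     // estimated mean of the second component
        public double weight2; // estimated weight of the second component
        public double llr;     // log-likelihood ratio vs. no-variant model

        static double normalPdf(double x, double mu, double sd) {
            double z = (x - mu) / sd;
            return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
        }

        /** Fit to the insert sizes observed at one genomic window. */
        public void fit(double[] y, double libMean, double libSd) {
            mu2 = libMean + 3 * libSd; // crude initialization
            weight2 = 0.5;
            for (int iter = 0; iter < 50; iter++) {
                double rSum = 0, rySum = 0;
                for (double yi : y) {
                    double p1 = (1 - weight2) * normalPdf(yi, libMean, libSd);
                    double p2 = weight2 * normalPdf(yi, mu2, libSd);
                    double r = p2 / (p1 + p2); // E-step: responsibility
                    rSum += r;
                    rySum += r * yi;
                }
                if (rSum == 0) { weight2 = 0; break; }
                weight2 = rSum / y.length;     // M-step: component weight
                mu2 = rySum / rSum;            // M-step: component mean
            }
            // Log-likelihood ratio of the mixture vs. a single normal
            // with the library parameters (the no-variant model).
            double llMix = 0, llNull = 0;
            for (double yi : y) {
                llMix += Math.log((1 - weight2) * normalPdf(yi, libMean, libSd)
                                  + weight2 * normalPdf(yi, mu2, libSd));
                llNull += Math.log(normalPdf(yi, libMean, libSd));
            }
            llr = llMix - llNull;
        }
    }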
  • Local Distributions of Insert Sizes: Example insert size histograms (0-500bp) for four cases: no variant, homozygous deletion, heterozygous deletion, and heterozygous insertion.
  • Cloudbreak Output Example (figure).
  • Handling Ambiguous Mappings: Incorrect mappings of read pairs are unlikely to form clusters of insert sizes at a given window. Before fitting the GMM, outliers are removed with a nearest-neighbor method: if the kth nearest neighbor of a mapped pair is more than c * (library fragment size SD) away, that mapping is removed. The number of mappings is controlled with an adaptive cutoff on alignment score: mapping m is discarded if the ratio of the best alignment score in that window to the score of m is larger than some cutoff. This allows visibility into regions where no reads map unambiguously. A sketch of both filters follows.
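
A hedged Java sketch of the two filters; the parameters k and c are tuning knobs whose values are left to the caller, and alignment scores are assumed to be positive with larger meaning better:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public final class MappingFilters {
        /** Remove insert sizes whose k-th nearest neighbor (in insert-size
         *  space) is farther than c * librarySd away. */
        public static double[] removeOutliers(double[] inserts, int k,
                                              double c, double librarySd) {
            double[] sorted = inserts.clone();
            Arrays.sort(sorted);
            List<Double> kept = new ArrayList<>();
            for (double x : inserts) {
                // Distances to all observations; d[0] is x's distance to
                // itself, so d[k] is the k-th nearest neighbor.
                double[] d = new double[sorted.length];
                for (int j = 0; j < sorted.length; j++) {
                    d[j] = Math.abs(sorted[j] - x);
                }
                Arrays.sort(d);
                double kth = d[Math.min(k, d.length - 1)];
                if (kth <= c * librarySd) kept.add(x);
            }
            double[] out = new double[kept.size()];
            for (int i = 0; i < out.length; i++) out[i] = kept.get(i);
            return out;
        }

        /** Adaptive score cutoff: keep mapping with score s only if the
         *  ratio of the window's best score to s is within the cutoff. */
        public static boolean keepMapping(double s, double bestScore,
                                          double cutoff) {
            return bestScore / s <= cutoff;
        }
    }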
  • Postprocessing: First extract contiguous genomic loci where the log-likelihood ratio of the two models is greater than a given threshold; to eliminate noise, a median filter with window size 5 is applied (a sketch follows). Let µ′ be the estimated mean of the second component and µ be the library insert size. Regions are ended when µ′ changes by more than 60bp (2σ), and regions whose length differs from the estimated event size (µ′ − µ) by more than µ are discarded. Cloudbreak loses some breakpoint resolution due to the genome windows and filters.
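
A minimal Java sketch of the thresholding and median-filter steps; the µ′-based region splitting is omitted for brevity, and the class and method names are hypothetical:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public final class PostProcess {
        /** Median filter with window size 5, as described in the slide. */
        public static double[] medianFilter5(double[] llr) {
            double[] out = new double[llr.length];
            for (int i = 0; i < llr.length; i++) {
                int lo = Math.max(0, i - 2), hi = Math.min(llr.length, i + 3);
                double[] win = Arrays.copyOfRange(llr, lo, hi);
                Arrays.sort(win);
                out[i] = win[win.length / 2];
            }
            return out;
        }

        /** Contiguous [start, end) window ranges where the filtered
         *  log-likelihood ratio exceeds the threshold. */
        public static List<int[]> extractRegions(double[] llr, double threshold) {
            List<int[]> regions = new ArrayList<>();
            int start = -1;
            for (int i = 0; i <= llr.length; i++) {
                boolean above = i < llr.length && llr[i] > threshold;
                if (above && start < 0) start = i;
                if (!above && start >= 0) {
                    regions.add(new int[]{start, i});
                    start = -1;
                }
            }
            return regions;
        }
    }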
  • Results Comparison: We compare Cloudbreak to a selection of widely used algorithms taking different approaches. Breakdancer (Chen et al. 2009): traditional RP-based approach. DELLY (Rausch et al. 2012): RP-based approach with SR refinement of calls. GASVPro (Sindi et al. 2012): RP-based approach that uses ambiguous mappings of discordant read pairs, resolved with an MCMC algorithm, and looks for RD signals at predicted breakpoint locations by examining concordant pairs. Pindel (Ye et al. 2009): SR approach that looks for clusters of read pairs where only one read could be mapped and searches for split-read mappings of the other read. MoDIL (Lee et al. 2009): mixture of distributions; run only on simulated data due to its runtime requirements.
  • Simulated Data: Very little publicly available NGS data exists for a genome with fully characterized structural variations. We can match algorithm output to validated SVs, but we don't know whether novel predictions are wrong or simply undiscovered. To get a simulated data set with known ground truth and realistic events: take a (somewhat) fully characterized genome, apply its variants to the reference sequence, and simulate reads from the modified reference. We use the Venter genome (Levy et al. 2007), chromosome 2. To simulate heterozygosity, we randomly assign half of the variants to be homozygous and half heterozygous, and create two modified references. We simulated 100bp paired reads with a 100bp insert size to 30X coverage.
  • ROC Curve for Chromosome 2 Deletion Simulation: True positives vs. false positives for Cloudbreak, Breakdancer, Pindel, GASVPro, and DELLY on deletions in the Venter diploid chr2 simulation. Caveat: methods perform better on simulated data than on real whole-genome datasets.
  • Ability to Find Simulated Deletions by Size at 10% FDR: Number of deletions found in each size class (number of predictions exclusive to that algorithm in parentheses). Cloudbreak is competitive across a range of size classes.

                      40-100bp   101-250bp  251-500bp  501-1000bp  >1000bp
        Total Number  224        84         82         31          26
        Cloudbreak    47 (7)     50 (2)     55 (4)     12 (4)      15 (0)
        Breakdancer   52 (10)    49 (2)     49 (0)     7 (0)       14 (0)
        GASVPro       31 (4)     25 (0)     23 (0)     2 (0)       6 (0)
        DELLY         22 (2)     56 (3)     40 (0)     8 (0)       12 (0)
        Pindel        60 (35)    16 (0)     41 (2)     1 (0)       12 (0)
  • Insertions in Simulated Data: True positives vs. false positives for Cloudbreak, Breakdancer, and Pindel on insertions in the Venter diploid chr2 simulation.
  • NA18507 Data Set: A well-studied sample from a Yoruban male individual; high-quality sequence at 37X coverage, with 100bp reads and a 100bp insert size. We created a gold standard set of deletions from three studies with low false discovery rates: Mills et al. 2011, the Human Genome Structural Variation Project (Kidd et al. 2008), and the 1000 Genomes Project (Mills et al. 2011).
  • ROC Curve for NA18507 Deletions: True positives vs. novel predictions for Cloudbreak, Breakdancer, Pindel, GASVPro, and DELLY. All algorithms look much worse on real data (possibly due to the lack of a complete truth set).
  • Ability to Find NA18507 Deletions by Size: Using the same cutoffs that yielded a 10% FDR on the simulated chromosome 2 data set, adjusted for the difference in coverage from 30X to 37X. Cloudbreak identifies more small deletions and contributes more exclusive predictions (in parentheses).

                      Prec.    Recall   40-100bp    101-250bp  251-500bp  501-1000bp  >1000bp
        Total Number                    7,466       235        218        110         375
        Cloudbreak    0.0978   0.115    423 (179)   128 (9)    158 (8)    70 (3)      186 (12)
        Breakdancer   0.122    0.112    261 (41)    132 (8)    167 (1)    92 (0)      288 (10)
        GASVPro       0.134    0.0401   104 (17)    37 (2)     77 (0)     26 (0)      93 (0)
        DELLY         0.0824   0.091    143 (9)     125 (7)    158 (1)    83 (1)      256 (3)
        Pindel        0.16     0.0685   149 (12)    57 (0)     140 (1)    58 (0)      172 (2)
  • Ability to Detect Deletions in Repetitive Regions: Detected deletions on the simulated and NA18507 data sets for each tool, broken down by whether the deletion overlaps a RepeatMasker-annotated element (exclusive predictions in parentheses).

                      Simulated Data          NA18507
                      Non-repeat  Repeat      Non-repeat  Repeat
        Total Number  120         327         553         7,851
        Cloudbreak    28 (4)      151 (13)    204 (46)    761 (165)
        Breakdancer   29 (5)      142 (7)     186 (21)    754 (39)
        GASVPro       15 (2)      72 (2)      71 (6)      266 (13)
        DELLY         21 (2)      117 (3)     147 (11)    618 (10)
        Pindel        18 (9)      112 (28)    103 (4)     473 (11)
  • Genotyping Deletions: The mixing parameter that controls the weight of the two GMM components can be used to predict deletion genotypes (see the sketch below). By setting a simple cutoff of 0.2 on the average value of the weight in each prediction, we achieved 86.7% and 94.9% accuracy in predicting the genotype of the true positive deletions we detected in the simulated and real data sets, respectively.

                                         Actual Genotypes
                                 Simulated Data            NA18507
                                 Homozygous  Heterozygous  Homozygous  Heterozygous
        Predicted  Homozygous    88          3             70          11
        Genotypes  Heterozygous  18          70            4           209
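
A minimal Java sketch of this rule; which component's weight the 0.2 cutoff applies to, and the direction of the comparison, are assumptions inferred from the slide's description:

    public final class Genotyper {
        /** libWeights: per-window GMM weight of the no-variant
         *  (library-mean) component within one predicted deletion.
         *  ASSUMPTION: a homozygous deletion leaves almost no pairs at
         *  the library insert size (average weight below the cutoff),
         *  while a heterozygous one leaves roughly half of them. */
        public static String genotype(double[] libWeights) {
            double sum = 0;
            for (double w : libWeights) sum += w;
            double avg = sum / libWeights.length;
            return avg < 0.2 ? "homozygous" : "heterozygous";
        }
    }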
  • Running Times: Wall-clock running times on both data sets. Cloudbreak used approximately 150 workers for the simulated data and 650 workers for NA18507 (42m spent in MapReduce). Breakdancer and DELLY were run on a single CPU but can be set to process each chromosome independently (roughly a 10X speedup). Pindel was run in single-threaded mode. MoDIL was run on 200 cores.

                      Simulated Chromosome 2 Data   NA18507
        Cloudbreak    835s                          106m
        Breakdancer   653s                          36h
        GASVPro       3339s                         33h
        DELLY         1964s                         208m
        Pindel        1336s                         38h
        MoDIL         48h                           **
  • Ongoing Work: Generate Additional Features, Improve Postprocessing: Goals: increase accuracy and breakpoint resolution. Features involving split-read mappings or pairs in which only one end is mapped. Features involving sequence and sequence variants. Annotations of sequence features and previously identified variants. Apply machine learning techniques: conditional random fields, deep learning. Potential future work: add local assembly of breakpoints.
  • Ongoing Work: Automate Deployment and Execution on Cloud Providers: Many researchers don't have access to Hadoop clusters, or to servers powerful enough to process these data sets. On-demand creation of clusters with cloud providers can be cost-effective, especially with spot pricing. We are developing scripts to automate on-demand construction of Hadoop clusters in the cloud (Amazon EC2, Rackspace) using the Apache Whirr project. Bottleneck: transferring data into and out of the cloud.
  • Conclusions: A novel approach to applying the MapReduce model to the structural variation detection problem. Makes insert size distribution clustering approaches feasible in terms of run time. Improved accuracy over existing algorithms, especially in repetitive regions. Able to accurately genotype calls. Costs: additional CPU hours and somewhat lower breakpoint resolution.