Hanna bosc2010



Published in: Technology, Education

  1. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data
     Matt Hanna and Mark DePristo
     Genome Sequencing and Analysis Group, Medical and Population Genetics Program, Broad Institute of Harvard and MIT
  2. The Genome Analysis Toolkit
     Agenda
     •  GATK Overview and Concepts
     •  GATK Workflow
     •  Example: A Simple Bayesian Genotyper
  3. GATK: Overview and Concepts
     Motivation
     [Figure: Coverage in xMHC region of JPT individuals]
     •  Dataset size greatly increases analysis complexity.
     •  Implementation issues can prematurely terminate long-running jobs or introduce subtle bugs.
  4. GATK: Overview
     Simplifying the process of writing analysis tools for resequencing data
     •  The framework is designed to support most common paradigms of analysis algorithms
        –  Provides structured access to reads in BAM format, reference context, as well as reference-associated metadata
     •  General-purpose
        –  Optimized for ease of use and completeness of functionality within scope
     •  Efficient
        –  Engineering investment on performance of critical data structures and manipulation routines
     •  Convenient
        –  Structured plug-in model makes developing in Java against the framework relatively pain-free
  5. GATK: Overview
     The MapReduce design philosophy
     •  Data elements: a, b, c, d, e
     •  Map, X = f(x): operations are independent of each other, producing A, B, C, D, E.
     •  Reduce, r(x, y, …, z): the result depends on all sites, R = r(A, R(B, …, E)).
     Result is:
     •  Map: function f applied to each element of the list
     •  Reduce: function r recursively reduced over each f(…)
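The map/reduce split on this slide can be sketched in plain Java. This is a minimal illustration only, not GATK code: the class name `MapReduceSketch` and the method `mapReduce` are hypothetical, and the recursion R = r(A, R(B, …, E)) is written as an equivalent loop that folds one mapped value at a time into the running result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical illustration of the map/reduce pattern; not part of the GATK.
final class MapReduceSketch {

    /** Applies map to each element independently, then folds the mapped values with reduce. */
    static <T, M, R> R mapReduce(List<T> elements,
                                 Function<T, M> map,
                                 BiFunction<M, R, R> reduce,
                                 R initial) {
        R result = initial;
        for (T element : elements) {
            M mapped = map.apply(element);          // X = f(x), independent per element
            result = reduce.apply(mapped, result);  // fold the mapped value into the result so far
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        // Map: square each element. Reduce: sum the squares.
        long sumOfSquares = MapReduceSketch.<Integer, Integer, Long>mapReduce(
                data, x -> x * x, (square, sum) -> square + sum, 0L);
        System.out.println(sumOfSquares); // prints 55
    }
}
```

The type parameters mirror the map type and reduce type that a walker declares (Integer and Long in the genotyper example later in the deck).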
  6. GATK: Overview
     Rapid development of efficient and robust analysis tools
     •  Genome Analysis Toolkit (GATK) infrastructure: provides the boilerplate code required to perform any NGS analysis
     •  Traversal engine: provided by framework
     •  Analysis tool: implemented by user
  7. GATK: Workflow
     Introduction
     •  GATK Overview and Concepts
     •  GATK Workflow
        –  An example of one of the GATK’s most common workflows
        –  Data access pattern: by locus
        –  Inputs: reads, reference, dbSNP
     •  Example: A Simple Bayesian Genotyper
  8. GATK: Workflow
     The sharding system: dividing data into processor-sized pieces
     [Figure: reads, reference, and dbSNP inputs divided into shards]
     •  Divides data into small chunks that can be processed independently
     •  Handles extraction of subsets of data
     •  Groups small intervals together to avoid repetitive decompression
  9. GATK: Workflow
     Traversal engines: preparing data for processing
     Builds data structures that are easily consumed by the analysis
  10. GATK: Workflow
      Interaction between sharding system and traversal engines
      •  Datasets are split into shards, which can be processed sequentially or in parallel.
      •  When processing sequentially, the reduce value of each shard is used to bootstrap the next shard.
      •  When processing in parallel, the result of each shard is computed independently and then “tree-reduced” together.
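The two processing modes on this slide can be sketched as follows. This is a toy model under stated assumptions, not the GATK's sharding code: the class `ShardReduceSketch`, the fixed-size shards, and the per-shard computation (counting even positions) are all hypothetical. It shows that sequential bootstrapping and pairwise tree reduction agree when the reduce operation is associative, as addition is here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of sequential versus tree-reduced shard processing.
final class ShardReduceSketch {

    /** Splits the positions [0, total) into consecutive shards of at most shardSize positions. */
    static List<int[]> shard(int total, int shardSize) {
        List<int[]> shards = new ArrayList<>();
        for (int start = 0; start < total; start += shardSize) {
            shards.add(new int[] {start, Math.min(start + shardSize, total)});
        }
        return shards;
    }

    /** Toy map and reduce over one shard: adds the number of even positions to initial. */
    static long processShard(int[] shard, long initial) {
        long sum = initial;
        for (int pos = shard[0]; pos < shard[1]; pos++) {
            sum += (pos % 2 == 0) ? 1 : 0;
        }
        return sum;
    }

    /** Sequential mode: the reduce value of each shard bootstraps the next shard. */
    static long sequential(List<int[]> shards) {
        long sum = 0L;
        for (int[] s : shards) {
            sum = processShard(s, sum);
        }
        return sum;
    }

    /** Parallel mode: per-shard results are computed independently, then combined pairwise. */
    static long treeReduced(List<int[]> shards) {
        List<Long> partial = new ArrayList<>();
        for (int[] s : shards) {
            partial.add(processShard(s, 0L)); // each call could run on its own thread
        }
        while (partial.size() > 1) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < partial.size(); i += 2) {
                long left = partial.get(i);
                long right = (i + 1 < partial.size()) ? partial.get(i + 1) : 0L;
                next.add(left + right);
            }
            partial = next;
        }
        return partial.isEmpty() ? 0L : partial.get(0);
    }

    public static void main(String[] args) {
        List<int[]> shards = shard(10, 3);
        System.out.println(sequential(shards) + " " + treeReduced(shards)); // prints 5 5
    }
}
```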
  11. GATK: Workflow
      Walkers: Analyses written by end-users
      [Figure: dbSNP, exons, reference base, and read bases at a single locus passed to the analysis tool]
      •  Walkers (analyses) can easily be written by end users. The GATK is distributed with a significant library of walkers.
      •  Only the reads, reference, and reference metadata applicable to a single-base location are presented to the analysis tool.
      •  The GATK provides tools to filter the pileup automatically or on demand.
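The idea of presenting only the data at a single-base location can be sketched as below. This is a simplified illustration, not the GATK's pileup implementation: the `Read` class, its 0-based `start` field, and the `pileupAt` method are hypothetical, and reads are assumed to align without gaps or clipping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of building a by-locus pileup from ungapped reads.
final class PileupSketch {

    /** Hypothetical ungapped read: 0-based start position on the reference plus its bases. */
    static final class Read {
        final int start;
        final String bases;

        Read(int start, String bases) {
            this.start = start;
            this.bases = bases;
        }
    }

    /** Returns the base that each overlapping read contributes at the given locus, in input order. */
    static String pileupAt(List<Read> reads, int locus) {
        StringBuilder pileup = new StringBuilder();
        for (Read read : reads) {
            int offset = locus - read.start;
            if (offset >= 0 && offset < read.bases.length()) {
                pileup.append(read.bases.charAt(offset));
            }
        }
        return pileup.toString();
    }

    public static void main(String[] args) {
        List<Read> reads = new ArrayList<>();
        reads.add(new Read(0, "ACGT"));
        reads.add(new Read(2, "GTAC"));
        reads.add(new Read(3, "AAAA"));
        System.out.println(pileupAt(reads, 3)); // prints TTA
    }
}
```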
  12. GATK: Workflow
      Other data access patterns

      Traversal Type     Description
      Reads              Call map per read, along with the reference and reference-ordered metadata spanning that read.
      Duplicates         Call map for each set of duplicate reads.
      Read pair (naïve)  Call map for each read and its mate (naïve, requires the input BAM to be sorted in query name order).

      Straightforward (but not necessarily easy) to add any new access pattern involving streaming data.
  13. GATK: Additional features
      Additional inputs and outputs
      Reference metadata
      •  Support for additional input data that is sorted in reference order can easily be added to the GATK.
      •  Input types can be added by creating two new classes: a feature (data access object) and a codec (parser).
      •  New file formats are indexed automatically.
      •  New data types are autodiscovered via a classpath search.
      •  Joint initiative with IGV.
      Additional I/O
      •  Analysis parameters can be added to a walker by annotating a field in the walker with an @Argument annotation.
      •  Command-line argument types can become very sophisticated.
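One way an annotation-driven argument system can work is through runtime reflection, sketched below. Everything here is an assumption for illustration: the nested `Argument` annotation is a stand-in defined in this file (not the GATK's own class), and `ToyWalker` and `populate` are hypothetical names. The sketch handles only `double` fields.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Hypothetical illustration of annotation-driven command-line arguments.
final class ArgumentSketch {

    /** Stand-in annotation defined for this sketch; not the GATK's @Argument. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Argument {
        String fullName();
        String shortName();
    }

    /** Toy walker with one annotated parameter and a default value. */
    static final class ToyWalker {
        @Argument(fullName = "log_odds_score", shortName = "LOD")
        private double lodScore = 3.0;

        double getLodScore() {
            return lodScore;
        }
    }

    /** Sets each annotated double field from "--fullName value" or "-shortName value". */
    static void populate(Object target, String[] args) {
        for (Field field : target.getClass().getDeclaredFields()) {
            Argument arg = field.getAnnotation(Argument.class);
            if (arg == null || field.getType() != double.class) {
                continue; // not an argument field, or a type this sketch does not handle
            }
            for (int i = 0; i + 1 < args.length; i++) {
                boolean matches = args[i].equals("--" + arg.fullName())
                        || args[i].equals("-" + arg.shortName());
                if (matches) {
                    try {
                        field.setAccessible(true);
                        field.setDouble(target, Double.parseDouble(args[i + 1]));
                    } catch (IllegalAccessException e) {
                        throw new IllegalStateException(e);
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        ToyWalker walker = new ToyWalker();
        populate(walker, new String[] {"-LOD", "5.0"});
        System.out.println(walker.getLodScore()); // prints 5.0
    }
}
```

A field that is not named on the command line keeps its initializer, which is how the default of 3.0 survives when no argument is passed.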
  14. Walkers: Example
      A simple Bayesian genotyper
      •  GATK Overview and Concepts
      •  GATK Workflow
      •  Example: A Simple Bayesian Genotyper
         –  A functional genotyper in under 150 lines of code
         –  A minimal example: calls are much lower in quality than the UnifiedGenotyper
  15. Walkers: Example
      A simple Bayesian genotyper: the model
      Bayesian model: $L(G \mid D) = P(G)\, P(D \mid G)$, where $L(G \mid D)$ is the likelihood for the genotype, $P(G)$ is the prior for the genotype, and $P(D \mid G)$ is the likelihood of the data given the genotype.
      Independent base model: $P(D \mid G) = \prod_{b \in \{\text{good bases}\}} P(b \mid G)$
      •  Likelihood of data computed using pileup of bases and associated quality scores at given locus
      •  Only “good bases” are included: those satisfying minimum base quality, read mapping quality, pair mapping quality, NQS
      •  L(G|D) computed for all 10 genotypes
      See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for a more complete approach
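The model above can be tried out without the GATK using the standalone sketch below. It is an illustration under stated assumptions, not the deck's implementation: the class `GenotypeLikelihoodSketch` and its methods are hypothetical, a uniform prior over the 10 genotypes replaces the genotype priors, no "good base" filtering is applied, and a miscalled base is assumed equally likely to be any of the other three bases.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the independent base model with a flat genotype prior.
final class GenotypeLikelihoodSketch {

    /** The 10 unordered diploid genotypes over A, C, G, T. */
    static final String[] GENOTYPES =
            {"AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"};

    /** Returns log10 P(D | G) given pileup bases and their Phred-scaled qualities. */
    static double log10Likelihood(String genotype, byte[] bases, byte[] quals) {
        double total = 0.0;
        for (int i = 0; i < bases.length; i++) {
            double epsilon = Math.pow(10, quals[i] / -10.0); // error probability from Phred score
            double p = 0.0;
            for (char allele : genotype.toCharArray()) {
                p += (allele == bases[i]) ? 1 - epsilon : epsilon / 3;
            }
            total += Math.log10(p / genotype.length()); // average over the two alleles
        }
        return total;
    }

    /** Returns the genotype maximizing log10 P(G) + log10 P(D | G). */
    static String bestGenotype(byte[] bases, byte[] quals) {
        double log10Prior = Math.log10(1.0 / GENOTYPES.length); // assumption: uniform prior
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String genotype : GENOTYPES) {
            double score = log10Prior + log10Likelihood(genotype, bases, quals);
            if (score > bestScore) {
                bestScore = score;
                best = genotype;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] bases = "AACC".getBytes(StandardCharsets.US_ASCII);
        byte[] quals = {30, 30, 30, 30};
        System.out.println(bestGenotype(bases, quals)); // prints AC
    }
}
```

With two A and two C bases at Q30, each base has probability close to 0.5 under AC but about 0.0003 for the mismatching bases under AA or CC, so the heterozygous genotype scores highest.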
  16. Walkers: Example
      A simple Bayesian genotyper
      •  Walker specifies the data access pattern and declares command-line arguments.
      •  Inheritance defines traversal type.
      •  Annotation defines command-line argument.

      public class GATKPaperGenotyper extends LocusWalker<Integer,Long> {
          @Argument(fullName = "log_odds_score", shortName = "LOD", doc = "The LOD threshold", required = false)
          private double LODScore = 3.0;
  17. Walkers: Example
      A simple Bayesian genotyper
      •  Walker prepares the input dataset.
      •  ReadBackedPileup utility can be used to filter pileup on demand.

      public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
          double likelihoods[] = DiploidGenotypePriors.getReferencePolarizedPrior(
                  ref.getBase(), DiploidGenotypePriors.HUMAN_HETEROZYGOSITY, 0.01);
          // get the bases and qualities from the pileup
          ReadBackedPileup pileup = context.getBasePileup().getPileupWithoutMappingQualityZeroReads();
          byte bases[] = pileup.getBases();
          byte quals[] = pileup.getQuals();
          …
  18. Walkers: Example
      A simple Bayesian genotyper
      •  Calculate the likelihood for each possible genotype.
      •  Determine the best of the calculated genotypes.

      for (GENOTYPE genotype : GENOTYPE.values())
          for (int index = 0; index < bases.length; index++) {
              // our epsilon is the de-Phred scored base quality
              double epsilon = Math.pow(10, quals[index] / -10.0);
              byte pileupBase = bases[index];
              double p = 0;
              for (char r : genotype.toString().toCharArray())
                  p += r == pileupBase ? 1 - epsilon : epsilon / 3;
              likelihoods[genotype.ordinal()] += Math.log10(p / genotype.length());
          }
      Integer sortedList[] = MathUtils.sortPermutation(likelihoods);
  19. Walkers: Example
      A simple Bayesian genotyper
      •  Conditionally output the results.
      •  Use reduce to calculate number of genotypes called.
      •  Writing to provided output stream is guaranteed to be thread-safe.

          …
          if (lod > LODScore)
              out.printf("%s\t%s\t%.4f\t%c%n", context.getLocation(), selectedGenotype, lod, (char)ref.getBase());
          return 1;
          }
      } // end of map() function

      public Long reduce(Integer value, Long sum) {
          return value + sum;
      }

      // the reduce type of this walker is Long, so the final result is a Long
      public void onTraversalDone(Long result) {
          out.printf("Simple Genotyper genotyped %d loci.", result);
      }
  20. Walkers: Threading performance
      A simple Bayesian genotyper
      GATK performance improves nearly linearly as processors are added
  21. Genome Analysis Toolkit
      1000 Genomes Project
      Pipeline stages: Initial alignment, MSA realignment, Q-score recalibration, Base error modeling, Genotyping, SNP filtering
      •  Supports any BAM-compatible aligner
      •  All of these tools have been developed in the GATK
      •  They are memory and CPU efficient, cluster friendly, and easily parallelized
      •  They are now publicly available and are being used at many sites around the world
      More info: http://www.broadinstitute.org/gsa/wiki/
      Support: http://www.getsatisfaction.com/gsa/
  22. Acknowledgments
      •  Genome sequencing and analysis group (MPG): Kiran Garimella (Analysis lead), Michael Melgar, Chris Hartl, Sherman Jia, Eric Banks (Development lead), Ryan Poplin, Guillermo del Angel, Aaron McKenna, Khalid Shakir, Brett Thomas, Corin Boyko
      •  Broad postdocs, staff, and faculty: Anthony Philippakis, Vineeta Agarwala, Manny Rivas, Jared Maguire, Carrie Sougnez, David Jaffe, Nick Patterson, Steve Schaffner, Shamil Sunyaev, Paul de Bakker
      •  1000 Genomes project (in general, but notably): Matt Hurles, Philip Awadalla, Richard Durbin, Goncalo Abecasis, Richard Gibbs, Gabor Marth, Thomas Keane, Gil McVean, Gerton Lunter, Heng Li
      •  Copy number group: Bob Handsaker, Jim Nemesh, Josh Korn, Steve McCarroll
      •  Cancer genome analysis: Kristian Cibulskis, Andrey Sivachenko, Gad Getz
      •  Genome Sequencing Platform (in general, but notably): Lauren Ambrogio, Illumina Production Team, Tim Fennell, Kathleen Tibbetts, Alec Wysoker, Ben Weisburd, Toby Bloom
      •  Integrative Genomics Viewer (IGV): Jim Robinson, Jesse Whitworth, Helga Thorvaldsdottir
      •  MPG directorship: Stacey Gabriel, David Altshuler, Mark Daly