Ann-Marie Patch - Structural variants and mutation detection using whole genome sequencing



As part of the International Cancer Genome Consortium, the Queensland Centre for Medical Genomics (QCMG) has established a world-class laboratory and computational infrastructure, balanced with high-level expertise, to enable the analysis of whole human genomes for the presence of DNA, RNA and epigenetic variants that are associated with the hallmarks of cancer. This talk will describe and discuss the principles and challenges of identifying structural variants (SVs) using whole genome sequencing. I will present the basis of SV detection, a tool developed at the QCMG, and examples of how SV analysis can identify mechanisms driving tumorigenesis.

First presented at the 2014 Winter School in Mathematical and Computational Biology



  1. 1. Structural variants and mutation detection using whole genome sequencing. 2014 Winter School in Mathematical and Computational Biology. Ann-Marie Patch
  2. 2. Mutation detection success depends on previous steps. Generalised sequencing workflow:
     • Sample preparation: DNA, RNA, miRNA, BAC
     • Library preparation: fragmentation, size selection, target enrichment, indexing
     • Sequence generation: platform, sequence length
     • Initial data processing: base calling, quality assessment, de novo assembly, alignment
     • Data analysis: mutation detection, annotation, biological interpretation
  3. 3. Mutation cataloguing aids understanding of cancer genetics. International Cancer Genome Consortium projects sequenced >1500 samples from >600 patients.
  4. 4. DNA mutations are normally sensed and repaired or the cell dies. Diagram labels: normal cell division; DNA damage is sensed by the cell; mechanisms for DNA damage; mechanisms for DNA repair; cell cycle check points; signalling for growth; signalling for cell death; cell suicide or apoptosis.
  5. 5. Cancer cells accumulate mutations. Diagram labels: tumour cell division; DNA damage is NOT sensed by the cell; disrupted mechanisms for DNA damage; disrupted mechanisms for DNA repair; cell cycle check points; signalling for growth; signalling for cell death; cell suicide or apoptosis. First mutation, second mutation, third mutation, fourth or later mutation: uncontrolled growth.
  6. 6. Cancer sequencing basics. Typically our projects involve the parallel analysis of at least two samples for each patient.
     • Inherited genome sample (germline): germline variants, seen in both samples
     • Tumour genome sample: somatic mutations, specific to the tumour sample. This sample contains a mixed population of normal and tumour cells, and also subpopulations of different tumour cells.
     • Data analysis: mutation detection, annotation, biological interpretation
  7. 7. Mutation detection: finding differences. DNA mutation detection covers:
     • SNV/SNP/substitutions
     • Small insertions and deletions
     • Large structural variations
     • Copy number aberrations
     (Cloonan et al 2011.) Large structural variants are only detectable from whole genome sequencing.
  8. 8. Whole genome paired-end sequencing process recap: library preparation. Genomic DNA; fragmented DNA.
  9. 9. Whole genome paired-end sequencing process recap: library preparation. Genomic DNA; fragmented DNA; clean-up of DNA fragments gives a consistent fragment size distribution.
  10. 10. Whole genome paired-end sequencing process recap: library preparation. Clean-up of DNA fragments; adaptors added. Sequence reads are produced from both ends of each fragment. The distance between the outer ends of the paired reads should follow the DNA size distribution (~300 bp).
  11. 11. Paired-end sequence alignment to the reference genome. Figure: paired-end sequences mapped to the reference genome, with coverage depth. Examining how the mapping position and content of the pairs of reads vary across the reference genome allows us to determine mutations and structural rearrangements.
  12. 12. Detection software pinpoints differences in your sample from the reference. Figure: normal/germline DNA showing a germline SNV; tumour DNA showing a somatic SNV, somatic deletion, somatic translocation and somatic amplification. We convert mutation data into positional information and counts using detection software. Somatic mutations, which occur only in the tumour, are then determined.
  13. 13. Choosing what software to use to identify mutations. Choice can be guided by: type of data; the biological question; available computing resources; past experience; related literature.
     Software list (QCMG DNA mutation detection):
     • Substitutions: qSNP (in-house tool), GATK (Broad)
     • Small insertions and deletions: Pindel (Sanger), GATK (Broad)
     • Large structural variations: qSV (in-house tool)
  14. 14. Visualising a germline single nucleotide variant example. Paired-end HiSeq data for an ovarian cancer patient, chromosome 11, showing tumour data, normal data and the reference sequence (Robinson et al 2011). Grey blocks are 100 bp reads matching the reference; small coloured blocks indicate a change from the reference. The reference base is a G; there is an A present in some of the reads.
  15. 15. Pileup analysis produces counts of alleles. Coverage: count of non-duplicate reads that cross any given position. Allele frequency: count of bases at any position. Tumour: G=36, A=20 (total coverage 56x). Normal: G=26, A=33, T=1 (total coverage 60x).
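The allele counting on this slide can be sketched in a few lines. This is a minimal illustration, not the pileup implementation used in the talk: the input format (a list of `(base, is_duplicate)` pairs for one position) and the function name `allele_counts` are assumptions, and the three duplicate reads in the toy data are invented to show that duplicates are excluded.

```python
from collections import Counter


def allele_counts(bases_at_position):
    """Count bases from non-duplicate reads crossing one position.

    bases_at_position: iterable of (base, is_duplicate) pairs.
    This input format is hypothetical, chosen only for illustration.
    """
    return Counter(base for base, is_dup in bases_at_position if not is_dup)


# Toy data matching the tumour counts on the slide (G=36, A=20),
# plus three invented duplicate reads that must not be counted.
tumour_reads = [("G", False)] * 36 + [("A", False)] * 20 + [("A", True)] * 3

counts = allele_counts(tumour_reads)
coverage = sum(counts.values())
print(dict(counts), "coverage:", coverage)
```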
  16. 16. Considering error and bias. Allele proportions:
     Sample   Coverage   Reference allele %   Alternate allele %   Other allele %   Bi-allelic ratio
     Tumour   56         G=64%                A=36%                -                1:0.56
     Normal   60         G=43%                A=55%                T=2%             1:1.3
     Annotations on the table: highly skewed representation in tumour samples; sequencing error.
     For a diploid organism the expected bi-allelic proportion is 50% (ratio 1:1). Changes in expected proportions can be due to: sample contamination/integrity; stochastic sampling/low coverage depth; capture or enrichment bias; alignment/mapping strategy; sequencing error. How should we determine a good call from error?
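The percentages and bi-allelic ratios in the table follow directly from the allele counts on the previous slide. The sketch below reproduces that arithmetic; the function name `allele_proportions` is an assumption, and the ratio is taken as alternate count divided by reference count, which matches the 1:0.56 and 1:1.3 values shown.

```python
def allele_proportions(counts, ref, alt):
    """Return (coverage, percent per allele, alt:ref ratio) for one position."""
    total = sum(counts.values())
    percents = {base: 100.0 * n / total for base, n in counts.items()}
    ratio = counts[alt] / counts[ref]  # reported on the slide as 1:ratio
    return total, percents, ratio


tumour = {"G": 36, "A": 20}
normal = {"G": 26, "A": 33, "T": 1}

t_cov, t_pct, t_ratio = allele_proportions(tumour, ref="G", alt="A")
n_cov, n_pct, n_ratio = allele_proportions(normal, ref="G", alt="A")

for name, cov, pct, ratio in [("Tumour", t_cov, t_pct, t_ratio),
                              ("Normal", n_cov, n_pct, n_ratio)]:
    shown = ", ".join(f"{b}={p:.0f}%" for b, p in pct.items())
    print(f"{name}: coverage {cov}x, {shown}, ratio 1:{ratio:.2f}")
```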
  17. 17. How many SNVs would we expect to find? Human genome (length ~3,000,000,000 bases): germline changes = ~3,000,000 (~1000 mutations per Mb (0.1%)). Ovarian cancer genome: somatic mutations = ~6,000 (~2 mutations per Mb). This number can vary greatly depending upon the type of cancer being sequenced.
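The expected counts are a rate per megabase multiplied by the genome length in megabases. A short sketch of that arithmetic, using only the figures on the slide (the helper name `expected_count` is an assumption):

```python
GENOME_LENGTH_BP = 3_000_000_000  # approximate human genome length from the slide


def expected_count(rate_per_mb, genome_length_bp=GENOME_LENGTH_BP):
    """Expected number of variants for a given rate per megabase."""
    return rate_per_mb * genome_length_bp / 1_000_000


germline = expected_count(1000)  # ~1000 per Mb, i.e. 0.1% of bases
somatic = expected_count(2)      # ~2 per Mb in the ovarian cancer example

print(f"germline ~{germline:,.0f}, somatic ~{somatic:,.0f}")
```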
  18. 18. Filtering of results from mutation detection tools is necessary. Example for sample purity = 64%:
     Tool   Raw somatic   Filtered somatic   Raw germline   Filtered germline
     qSNP   298,388       6,632              4,180,630      3,698,034
     GATK   224,839       9,722              4,945,990      4,069,314
     Keep between 2-4% of somatic calls; keep between 84-88% of germline calls. Remember the expected number of somatic mutations is ~6,000 and of germline variants ~3,000,000. qSNP: in-house, rules-based heuristic tool, sensitive (Kassahn et al 2013). GATK (Unified Genotyper): a Bayesian tool (McKenna et al 2010). The intersect of these tools produces a high confidence SNV call.
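Taking the intersect of two call sets can be sketched as a set intersection over variant keys. This is an illustration only: the key format `(chromosome, position, ref, alt)`, the function name and the toy calls are all assumptions, and real call sets would be read from each tool's output files.

```python
def high_confidence_calls(calls_a, calls_b):
    """Variants reported by both callers (set intersection on variant keys)."""
    return set(calls_a) & set(calls_b)


# Invented toy calls keyed as (chromosome, position, ref, alt).
qsnp_calls = {
    ("chr11", 1000, "G", "A"),
    ("chr11", 2500, "C", "T"),
    ("chr13", 400, "T", "C"),
}
gatk_calls = {
    ("chr11", 1000, "G", "A"),
    ("chr13", 400, "T", "C"),
    ("chr17", 77, "A", "G"),
}

shared = high_confidence_calls(qsnp_calls, gatk_calls)
print(sorted(shared))
```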
  19. 19. QCMG strategy for identifying somatic substitution mutations.
     Control of quality of variant calls through input filtering:
     • mapping quality for reads >10
     • maximum number of mismatches in read <=3
     • minimum consecutive matched bases in a read >=34
     • duplicate reads removed
     Somatic variant calls are made when thresholds are met for:
     • the minimum number of reads with the variant
     • minimum coverage in tumour and normal sample
     • maximum variant count for a given coverage in the matched normal
     • threshold proportion of variant call qualities at that position
     Potential weaknesses in calls are annotated:
     • variant seen in unfiltered BAM of matched normal
     • position of variant within 5 bp of ends of reads
     • variant not seen in sequencing reads of both directions
     • variant seen in germline of another patient
     • number of novel starts for reads supporting variant is low
     Example somatic variant: tumour T=63%, C=37%; normal T=100%.
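The input filters listed on this slide can be expressed as a simple predicate over per-read summary values. The sketch below is not qSNP code: the `Read` record and its field names are hypothetical, and only the four thresholds (mapping quality >10, mismatches <=3, consecutive matched bases >=34, duplicates removed) come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical per-read summary; field names are illustrative only."""
    mapq: int               # mapping quality
    mismatches: int         # number of mismatches in the read
    longest_match_run: int  # longest run of consecutive matched bases
    is_duplicate: bool


def passes_input_filters(read):
    """Apply the input-filter thresholds stated on the slide."""
    return (
        read.mapq > 10
        and read.mismatches <= 3
        and read.longest_match_run >= 34
        and not read.is_duplicate
    )


reads = [
    Read(mapq=60, mismatches=1, longest_match_run=80, is_duplicate=False),  # kept
    Read(mapq=10, mismatches=0, longest_match_run=100, is_duplicate=False), # mapq too low
    Read(mapq=60, mismatches=4, longest_match_run=50, is_duplicate=False),  # too many mismatches
    Read(mapq=60, mismatches=2, longest_match_run=33, is_duplicate=False),  # match run too short
    Read(mapq=60, mismatches=0, longest_match_run=100, is_duplicate=True),  # duplicate
]
kept = [r for r in reads if passes_input_filters(r)]
print(len(kept), "of", len(reads), "reads pass")
```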
  20. 20. Detection - examination - verification - modify. We have used a cyclical feedback approach to inform the filtering strategy and improve our mutation calling: detect mutations; examine (manual IGV review); independent verification (PCR and capillary sequencing, PCR and deep MiSeq sequencing, SOLiD sequencing, mRNA sequencing); identify patterns and modify filtering strategies. This approach has been key for the detection of small insertions and deletions, as sequencing errors and alignment biases are often exaggerated for indels.
  21. 21. Large genomic structural variants need different detection strategies. Ovarian cancer genomes have high instability and are highly rearranged. Structural variants underlie copy number changes. Figures: spectral karyotype from an HGSOvCa cell line (Ouellet et al 2008, BMC Cancer), low resolution; reference versus sample diagrams of deletion, duplication/insertion and translocation.
  22. 22. There are 4 main methods for SV detection in WG sequencing (Alkan, Coe and Eichler 2011). Most well-known tools use only one detection method; a few multi-method tools are now available.
  23. 23. Visualising structural variants. Submicroscopic homozygous deletion in a tumour sample (tumour and normal tracks; Robinson et al 2011). Chromosome 13: 1.3 kb somatic deletion including exon 17 of the RB1 gene.
  24. 24. Insert size estimation is key for detection with discordantly mapped read pairs. DNA fragment size distribution: ~300 bp. Sequence reads are produced from the ends of the fragments. Alignment of read pairs allows calculation of insert size. Figure: typical read-pair insert size distribution (300 bp median) for normally mapped reads, visualised by qProfiler (axes: log count against base pairs).
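One way to turn an observed insert size distribution into a "normal" range is to take the median and a robust spread around it. The sketch below uses the median absolute deviation (MAD) with a multiplier `k`; this choice, the function name and the toy sizes are assumptions for illustration, and the talk does not state how qSV or qProfiler derive their thresholds.

```python
from statistics import median


def insert_size_bounds(insert_sizes, k=5.0):
    """Return (median, lower, upper) using median +/- k * MAD.

    k is an arbitrary illustrative multiplier, not a value from the talk.
    """
    sizes = list(insert_sizes)
    med = median(sizes)
    mad = median([abs(s - med) for s in sizes])
    return med, med - k * mad, med + k * mad


# Invented insert sizes: mostly near 300 bp, with one far outlier.
sizes = [280, 290, 295, 300, 300, 305, 310, 320, 1350]
med, lower, upper = insert_size_bounds(sizes)
print(f"median {med} bp, normal range {lower} to {upper} bp")
```

Using the median rather than the mean keeps the estimate stable when a few read pairs span a rearrangement, as the 1350 bp pair does here.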
  25. 25. Discordantly mapped read pairs mark rearrangements. Figure (tumour and normal against the reference): read pairs too far apart (>1.3 kb insert size); read pairs too close together; read pairs in wrong orientation. Detection tools identify clusters of read pairs with similar characteristics; large clusters indicate more evidence.
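The three discordant categories on this slide can be sketched as a small classifier followed by a count per category. The bounds (250 to 350 bp), the orientation codes and the function name are assumptions; "FR" (forward-reverse, reads facing each other) is assumed as the expected orientation for a standard paired-end library.

```python
from collections import Counter


def classify_pair(insert_size, orientation, lower, upper, expected_orientation="FR"):
    """Label one read pair as concordant or one of three discordant types."""
    if orientation != expected_orientation:
        return "wrong_orientation"
    if insert_size > upper:
        return "too_far_apart"
    if insert_size < lower:
        return "too_close_together"
    return "concordant"


# Invented read pairs as (insert size in bp, orientation).
pairs = [(300, "FR"), (1350, "FR"), (1400, "FR"), (90, "FR"), (310, "RF"), (295, "FR")]
labels = [classify_pair(size, orient, lower=250, upper=350) for size, orient in pairs]

# Larger groups of pairs with the same label provide more evidence.
print(Counter(labels))
```

A real detector would also require the pairs in a cluster to map near each other; that positional grouping is left out of this sketch.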
  26. 26. Changes in coverage support rearrangements. Clear drop in coverage over the region in the tumour sample (tumour and normal tracks).
  27. 27. Coverage changes are often associated with SVs. Changes in coverage can be interpreted as copy number and can mark rearrangement breakpoints. Figure (copy number against genomic position): deletion, fewer reads mapped; duplication, more reads mapped. Tools are available that identify copy number variants from read depth partitioning and GC content correction, e.g. CNVnator (Abyzov et al 2011).
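The idea of reading copy number from coverage can be illustrated with a per-bin log2 ratio of tumour depth to normal depth. This is a deliberately simplified sketch and not CNVnator's method: it assumes both samples were sequenced to similar depth, applies no GC correction or segmentation, and the pseudocount and the gain/loss thresholds are arbitrary illustrative values.

```python
import math


def log2_ratio(tumour_depth, normal_depth, pseudocount=1.0):
    """log2 of tumour/normal depth for one bin; pseudocount avoids division by zero."""
    return math.log2((tumour_depth + pseudocount) / (normal_depth + pseudocount))


def label_bin(ratio, loss=-0.4, gain=0.3):
    """Label a bin using hypothetical thresholds on the log2 ratio."""
    if ratio <= loss:
        return "loss"
    if ratio >= gain:
        return "gain"
    return "neutral"


# Invented mean depths per genomic bin.
tumour_depths = [30, 31, 14, 15, 62]
normal_depths = [30, 30, 30, 30, 30]

ratios = [log2_ratio(t, n) for t, n in zip(tumour_depths, normal_depths)]
labels = [label_bin(r) for r in ratios]
print(labels)
```

The positions where the label changes (here between the second and third bins, and between the fourth and fifth) are candidate rearrangement breakpoints at bin resolution.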
  28. 28. Clusters of soft clipping indicate rearrangement break points. Alignment software that performs soft clipping can reveal exact positions of the break points. Further realignment of the clipped sequences produces split reads. Reads with soft clipping and unmapped reads can be assembled into contigs that span break points.
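Soft-clip positions can be read from the alignment start and the CIGAR string of each read, then counted to find positions supported by several reads. The sketch below is an illustration under stated assumptions: positions are 1-based leftmost aligned bases as in SAM, the reported coordinate is the aligned base adjacent to the clip, hard clips are ignored, and the minimum support of 2 and the toy reads are invented.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def clip_positions(pos, cigar):
    """Reference coordinates of aligned bases that sit next to a soft clip."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    points = []
    if ops and ops[0][1] == "S":
        points.append(pos)  # first aligned base follows a left clip
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    if len(ops) > 1 and ops[-1][1] == "S":
        points.append(pos + ref_len - 1)  # last aligned base precedes a right clip
    return points


# Invented alignments as (leftmost position, CIGAR).
alignments = [(1000, "60M40S"), (1010, "50M50S"), (1060, "30S70M"), (900, "100M")]

support = Counter(p for pos, cigar in alignments for p in clip_positions(pos, cigar))
candidates = sorted(p for p, n in support.items() if n >= 2)
print(dict(support), "candidates:", candidates)
```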
  29. 29. qSV: detecting somatic structural variants. qSV detects 3 types of supporting evidence and resolves all lines of evidence to identify breakpoints to base pair resolution. (Felicity Newell)
  30. 30. Automation of SV verification process. Structural variants require verification: PCR amplification over breakpoints followed by sequencing. Automation of key stages can increase throughput of verification. Figure: PCR of tumour and normal DNA; verified events are circled (Quek et al in press).
  31. 31. Characterising tumour genomes by the distribution of SVs. A huge range in the distribution of SVs in ovarian cancer patients, e.g. unstable (>300 events) and complex localised events. Circos plot tracks: chromosomes, copy number, B allele frequency, SVs (Circos, Krzywinski et al 2009).
  32. 32. Chromothripsis events can be identified. Chromosome 15 tracks: SV break density; SV types and positions; copy number segmentation; Log R Ratio; B allele frequency (Stephens et al 2011).
  33. 33. Breakage-fusion-bridge amplification can be identified. Chromosome 12 tracks: SV break density; SV types and positions; copy number segmentation; Log R Ratio; B allele frequency. Loss of telomere region; control sample versus tumour sample (Kinsella and Bafna 2012).
  34. 34. Other complex regions with high density of breakpoints. Chromosome 19 tracks: SV break density; SV types and positions (translocations); copy number segmentation; Log R Ratio; B allele frequency.
  35. 35. Associating structural variants with proximal genes. Structural variant break points are annotated with gene features.
  36. 36. Gene model annotation of break points can predict fusion genes. Gene fusions can occur if:
     • both breakpoints are within the footprints of genes
     • the transcription directions of the two genes align
     • the translation phases of adjoining exons match
     • splicing signals are not disrupted
     (Barsha Poudel)
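The four conditions above can be combined into a single check over the two sides of a breakpoint. This is a sketch of the logic only, not the annotation tool from the talk: the `FusionSide` record, its field names and the gene names are hypothetical, and the strand field is assumed to already account for how the rearrangement joins the two segments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FusionSide:
    """Hypothetical annotation of one side of a breakpoint."""
    gene: Optional[str]         # None if the breakpoint is outside any gene footprint
    strand_after_join: str      # "+" or "-" once the segments are joined
    exon_phase: int             # translation phase of the adjoining exon (0, 1 or 2)
    splice_signal_intact: bool


def can_form_fusion(five_prime, three_prime):
    """True only if all four conditions listed on the slide hold."""
    return (
        five_prime.gene is not None and three_prime.gene is not None
        and five_prime.strand_after_join == three_prime.strand_after_join
        and five_prime.exon_phase == three_prime.exon_phase
        and five_prime.splice_signal_intact and three_prime.splice_signal_intact
    )


upstream = FusionSide("GENE_A", "+", 1, True)        # placeholder gene names
in_frame = FusionSide("GENE_B", "+", 1, True)
out_of_frame = FusionSide("GENE_B", "+", 2, True)
intergenic = FusionSide(None, "+", 1, True)

print(can_form_fusion(upstream, in_frame),
      can_form_fusion(upstream, out_of_frame),
      can_form_fusion(upstream, intergenic))
```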
  37. 37. Patient summary of mutations identified. Circos plot (Krzywinski et al 2009): chromosomes; coding small mutations with amino acid change; SNP array track that shows copy number gain in red and loss in green, and regions of loss of heterozygosity; structural variants in centre. ICGC catalogue of mutations.
  38. 38. Mutation detection summary. Output of mutation detection software requires careful filtering. Development of a filtering strategy typically requires a feedback process. Verification is a key part of this process. Detection - examination - verification - modify: detect mutations; examine (manual IGV review); independent verification; identify patterns and modify filtering strategies.
  39. 39. Acknowledgements:
     • Bioinformatics: John Pearson, Felicity Newell, Lynn Fink, Conrad Leonard, Oliver Holmes, Qinying Xu, Matthew Anderson, Stephen Kazakoff, Nick Waddell, Scott Wood
     • Genome Biology: Sean Grimmond, Nicola Waddell, Katia Nones, Peter Bailey, Michael Quinn, Kelly Quek
     • Sequencing: David Miller, Angelika Christ, Tim Bruxner, Craig Nourse, Ehsan Nourbakhsh, Suzanne Manning, Ivon Harliwong, Senel Idrisoglu, Shivangi Wani
     • Previous team members: Karin Kassahn, Barsha Poudel, Sarah Song, Nicole Cloonan, Darrin Taylor, Deborah Gywnne, Peter Wilson, Anita Steptoe
     • Peter MacCallum Cancer Centre: David Bowtell, Dariush Etemadmoghadam, Elizabeth Christie, Dale Garsed, Joshy George, Sian Fereday, Laura Galletta, Kathryn Alsop, Nadia Traficante, Joy Hendley, Chris Mitchell, Prue Cowin
     • Westmead Institute for Cancer Research: Anna deFazio, Catherine Kennedy, Yoke-Eng Chiew, Jillian Hung
     • National Health and Medical Research Council; Australian Government
     • Clinicians and patients