Variant (SNPs/Indels) calling in DNA sequences, Part 2

Abstract: This session focuses on the steps involved in identifying genomic variants after an initial mapping has been achieved: improving the mapping, SNP and indel calling, and variant filtering/recalibration are introduced.

Published in: Technology

  1. 1. [www.absolutefab.com]<br />Variant calling for disease association (2/2)<br />Searching the haystack<br />July 14, 2011<br />
  2. 2. Quick recap: DNA sequence read mapping<br />Sequencing -> FASTQ -> alignment to reference genome<br />Resulting file type: BAM<br />Visualized in Genome Viewer<br />“What genomic regions were sequenced?”<br />[Figure labels: Quality Control, Projects, FASTQ, BAM]<br />
  3. 3. Production Informatics and Bioinformatics<br />Basic Production Informatics: produce raw sequence reads<br />Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs)<br />Bioinformatics Research: analyze the data; uncover the biological meaning<br />Per one-flowcell project<br />
  4. 4. Good mapping is crucial<br />Mapping tools trade accuracy for speed, so the mapping is approximate.<br />Identifying exactly where the reads map is the foundation for all subsequent analyses.<br />The exact alignment of each read is especially important for variant calling.<br />by neilalderney123<br />
  5. 5. Mapping challenges<br />Incorrect mapping: among 3 billion bp (human), a 100-mer can match a location by chance<br />Multi-mappers: the genome has non-unique regions (e.g. repeats), so one read can map to multiple sites<br />Duplicates: PCR duplicates can introduce artifacts.<br />Streptococcus suis (squares)<br />Mus musculus (triangles)<br />ACGATATTACACGTACACTCAAGTCGTTCGGAACCT<br /> TTACACGTACA<br /> TACACGTACAC<br /> ACACGTACACT<br /> CACGTACTCTC<br /> CACGTACTCTC<br /> CACGTACTCTC<br /> CACGTACTCTC<br />Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome. 2009 Jun;20(6):327-38. PMID: 19452216<br />
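The duplicate problem on the slide above (four identical reads stacked at one position) can be sketched in code. This is a minimal illustration, assuming reads are already mapped and represented as plain dictionaries; the field names (`name`, `chrom`, `pos`, `strand`, `qual_sum`) and the function `mark_duplicates` are hypothetical and not taken from any particular tool. It groups reads by mapped position and strand and keeps only the read with the highest summed base quality, which is roughly the idea behind common duplicate-marking tools.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as likely PCR duplicates.

    reads: list of dicts with hypothetical keys
           'name', 'chrom', 'pos', 'strand', 'qual_sum'.
    Reads sharing chromosome, start position and strand are treated as
    copies of one fragment; the copy with the highest summed base
    quality is kept and the rest are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


# Toy data mirroring the slide: three staggered reads plus four reads
# stacked at one start position (positions and qualities are made up).
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 6, "strand": "+", "qual_sum": 300},
    {"name": "r2", "chrom": "chr1", "pos": 7, "strand": "+", "qual_sum": 310},
    {"name": "r3", "chrom": "chr1", "pos": 8, "strand": "+", "qual_sum": 305},
    {"name": "r4", "chrom": "chr1", "pos": 9, "strand": "+", "qual_sum": 350},
    {"name": "r5", "chrom": "chr1", "pos": 9, "strand": "+", "qual_sum": 340},
    {"name": "r6", "chrom": "chr1", "pos": 9, "strand": "+", "qual_sum": 330},
    {"name": "r7", "chrom": "chr1", "pos": 9, "strand": "+", "qual_sum": 320},
]
print(sorted(mark_duplicates(reads)))
```

Without this step, the four stacked reads would count as four independent observations of the same mismatch, which is how a PCR artifact can end up looking like a well-supported variant.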
  6. 6. Methods for ensuring a good alignment<br />Biological: using paired-end reads to increase coverage<br />Bioinformatic:<br />Local realignment<br />Base quality score recalibration<br />[Figure labels: ~200 bp, ?, Repeat region]<br />
  7. 7. Local Realignment (GATK)<br />All reads at a specific location are realigned simultaneously to minimize mismatches to the reference genome<br />Reduces erroneous SNPs and refines the location of indels<br />[Figure panels: original, realigned (QBI data)]<br />DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889<br />
  8. 8. Quality score recalibration (GATK)<br />Phred scores are predicted<br />Looking at all reads at a specific location allows a better estimate of the base quality score.<br />Exclude all known dbSNP sites<br />Assume all other mismatches are sequencing errors<br />Compute a new calibration table based on mismatch rates per position on the read<br />Important for variant calling<br />DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889<br />Thomas Keane, 9th European Conference on Computational Biology, 26th September 2010<br />
  9. 9. Recalibration of quality score<br />All bases are called with Q25<br />In reality not all are that good:<br />“bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20” (GATK)<br />DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889<br />
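The Q25 versus Q20 example rests on the Phred relation Q = -10 * log10(p), where p is the error probability. The sketch below shows that conversion and a toy recalibration table built from mismatch counts. It is only an illustration of the idea on the slides, not GATK's implementation: the function names, the observation tuples `(reported_q, read_position, is_mismatch)` and the `max_q` cap are assumptions, and the input is assumed to contain only bases at sites that are not in dbSNP.

```python
import math
from collections import defaultdict


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def recalibration_table(observations, max_q=60.0):
    """Empirical quality per (reported quality, read position) bin.

    observations: iterable of (reported_q, read_position, is_mismatch)
    for bases at sites not known to be polymorphic, so every mismatch
    is treated as a sequencing error. max_q is a hypothetical cap used
    when a bin contains no mismatches at all.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total]
    for reported_q, read_position, is_mismatch in observations:
        bin_counts = counts[(reported_q, read_position)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        if mismatches == 0:
            table[key] = max_q
        else:
            table[key] = min(max_q, error_to_phred(mismatches / total))
    return table


# Slide example: 1000 bases reported as Q25 at read position 1,
# of which 10 (1 in 100) mismatch the reference.
observations = [(25, 1, i < 10) for i in range(1000)]
print(recalibration_table(observations)[(25, 1)])
```

A real recalibration uses more covariates than read position and smooths sparse bins; the point here is only that the reported score is replaced by one derived from the observed mismatch rate.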
  10. 10. Variant calling methods<br />More than 15 different algorithms<br />Three categories:<br />Allele counting<br />Probabilistic methods, e.g. a Bayesian model, to quantify statistical uncertainty<br />Priors can be assigned, e.g. by taking the observed allele frequency across multiple samples into account<br />Incorporating linkage disequilibrium (LD)<br />Especially helpful for low coverage and common variants<br />[Figure: variant vs. SNP. Ref: A; Ind1: G/G; Ind2: A/G]<br />Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.<br />http://seqanswers.com/wiki/Software/list<br />
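To make the difference between allele counting and a probabilistic caller concrete, here is a minimal Bayesian genotype sketch. It assumes a diploid site with one reference and one alternate allele, a single per-base error rate, and independent reads; the error rate and the prior values are made up for illustration, and the function `genotype_posteriors` is hypothetical rather than the model of any specific caller.

```python
import math


def genotype_posteriors(n_ref, n_alt, error=0.01, priors=None):
    """Posterior probability of 0/0, 0/1 and 1/1 given allele counts.

    n_ref, n_alt: number of reads supporting the reference / alternate
    allele. error: assumed per-base error rate. priors: assumed prior
    probability of each genotype (illustrative defaults).
    """
    if priors is None:
        priors = {"0/0": 0.998, "0/1": 0.001, "1/1": 0.001}

    # Probability that a single read shows the alternate allele.
    p_alt = {"0/0": error, "0/1": 0.5, "1/1": 1 - error}

    log_post = {}
    for genotype, prior in priors.items():
        log_post[genotype] = (
            math.log(prior)
            + n_alt * math.log(p_alt[genotype])
            + n_ref * math.log(1 - p_alt[genotype])
        )

    # Normalize in log space to avoid underflow at high depth.
    top = max(log_post.values())
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_genotype(n_ref, n_alt, **kwargs):
    """Return the genotype with the highest posterior probability."""
    posteriors = genotype_posteriors(n_ref, n_alt, **kwargs)
    return max(posteriors, key=posteriors.get)


print(call_genotype(10, 10))  # balanced support for both alleles
print(call_genotype(3, 1))    # one alternate read at low coverage
```

With these assumed priors, one alternate read out of four is still called 0/0, whereas counting alone would report 25% alternate alleles. This is the sense in which priors matter most at low coverage.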
  11. 11. VCF format<br />[HEADER LINES]<br />#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878<br />chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255<br />chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0<br />chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26<br />chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255<br />Individual statistics (values from the first record)<br />GT - genotype - 0/1<br />AD - total number of REF/ALT seen - 173 T, 141 G<br />DP - depth (MAPQ > 17) - 282<br />GQ - genotype quality - 99<br />PL - genotype likelihood - 0/0: 10^-25.5 = unlikely, 0/1: 10^0 = likely, 1/1: 10^-25.5 = unlikely<br />Location statistics, e.g.<br />Strand bias<br />How many reads have a deletion at this site<br />
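The per-sample column is easier to read once it is paired with its FORMAT keys. The sketch below parses the first record shown on the slide and converts the PL values to relative likelihoods with 10^(-PL/10). It is a minimal illustration, assuming a tab-separated single-sample record; the INFO column is replaced by `.` because the slide elides the annotations, and the function names are hypothetical.

```python
def parse_vcf_record(line):
    """Split one single-sample VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info, fmt = fields[:9]
    # Pair each FORMAT key (GT, AD, ...) with the sample's value.
    sample = dict(zip(fmt.split(":"), fields[9].split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "filter": filt,
        "info": info,
        "sample": sample,
    }


def pl_to_likelihoods(pl_field):
    """Convert a PL string such as '255,0,255' to relative likelihoods."""
    return [10 ** (-int(value) / 10) for value in pl_field.split(",")]


# First record from the slide; INFO is a placeholder.
record = "\t".join([
    "chr1", "873762", ".", "T", "G", "5231.78", "PASS", ".",
    "GT:AD:DP:GQ:PL", "0/1:173,141:282:99:255,0,255",
])
parsed = parse_vcf_record(record)
ref_count, alt_count = (int(n) for n in parsed["sample"]["AD"].split(","))
print(parsed["sample"]["GT"], ref_count, alt_count)
print(pl_to_likelihoods(parsed["sample"]["PL"]))
```

The PL values are ordered 0/0, 0/1, 1/1, so the middle value of 0 marks the heterozygous genotype as the most likely one for this record.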
  12. 12. When to call a variant?<br />Hom: REF 0%, ALT 100%<br />Het: REF 50%, ALT 50%<br />??: REF 77%, ALT 23%<br />QBI data<br />
  13. 13. Hard Filtering<br />Reducing false positives by e.g. requiring<br />Sufficient depth<br />Variant to be in >30% of reads<br />High quality<br />Strand balance<br />…<br />Subjective and dangerous in this high-dimensional search space<br />[Figure: strand bias (QBI data)]<br />Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).<br />Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).<br />
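The hard-filtering criteria listed above translate directly into a set of threshold checks. The sketch below is one possible reading: only the 30% alternate-read fraction comes from the slide, while the depth, quality and strand thresholds, the filter names and the function `hard_filter` are assumptions chosen for illustration.

```python
def hard_filter(depth, alt_reads, qual, alt_forward, alt_reverse,
                min_depth=10, min_alt_fraction=0.30,
                min_qual=30.0, min_strand_fraction=0.10):
    """Return the names of the hard filters a candidate variant fails.

    An empty list means the candidate passes. All thresholds except
    min_alt_fraction are illustrative defaults, not recommendations.
    """
    failed = []

    if depth < min_depth:
        failed.append("LowDepth")

    # The slide requires the variant to be in more than 30% of reads.
    if depth == 0 or alt_reads / depth <= min_alt_fraction:
        failed.append("LowAltFraction")

    if qual < min_qual:
        failed.append("LowQual")

    # Strand balance: both strands should support the alternate allele.
    alt_total = alt_forward + alt_reverse
    if alt_total == 0 or min(alt_forward, alt_reverse) / alt_total < min_strand_fraction:
        failed.append("StrandBias")

    return failed


# A balanced heterozygous site, and a site with 23% alternate reads
# like the ambiguous 77%/23% case (other values are made up).
print(hard_filter(depth=100, alt_reads=50, qual=100.0, alt_forward=25, alt_reverse=25))
print(hard_filter(depth=100, alt_reads=23, qual=100.0, alt_forward=12, alt_reverse=11))
```

Each threshold cuts along one axis independently of the others, so a site at 29% is rejected however strong the rest of its evidence is. That is the kind of subjectivity the slide warns about.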
  14. 14. Gaussian mixture model<br />Train on trusted variants and require new variants to fall in the same region of the feature space<br />Potential problem: overfitting and biasing towards features of known SNPs!<br />
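The idea of training on trusted variants and scoring new ones can be shown with a deliberately simplified model. The sketch below fits a single Gaussian with independent features instead of a full mixture, so it is a stand-in for the approach and not the method itself. The two features (a quality-by-depth value and a strand-bias score), the training values and the function names are all hypothetical.

```python
import math
import statistics


def fit_diagonal_gaussian(training_rows):
    """Fit per-feature mean and standard deviation on trusted variants.

    training_rows: list of equal-length tuples of feature values.
    Needs at least two rows and non-zero spread in every feature.
    """
    columns = list(zip(*training_rows))
    means = [statistics.mean(column) for column in columns]
    stdevs = [statistics.stdev(column) for column in columns]
    return means, stdevs


def log_density(row, means, stdevs):
    """Log-density of one candidate under the fitted Gaussian."""
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (x - m) ** 2 / (2 * s * s)
        for x, m, s in zip(row, means, stdevs)
    )


# Hypothetical trusted variants: (quality by depth, strand-bias score).
trusted = [(20.0, 1.0), (22.0, 2.0), (18.0, 1.5), (21.0, 0.5)]
means, stdevs = fit_diagonal_gaussian(trusted)

typical = (20.0, 1.2)   # resembles the trusted set
outlier = (5.0, 30.0)   # low quality, strong strand bias
print(log_density(typical, means, stdevs) > log_density(outlier, means, stdevs))
```

Candidates are then ranked by this score rather than cut by fixed thresholds. The overfitting concern is visible even here: a true variant that does not resemble the trusted set scores poorly.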
  15. 15. Indel calling<br />The first local realignment might not be sufficient to confidently determine the start and end of indels<br />Dindel algorithm: local realignment for every indel candidate<br />Albers CA, Lunter G, Macarthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: Accurate indel calls from short-read data. Genome Res. 2011 Jun;21(6):961-73. PMID: 20980555.<br />
  16. 16. Recap<br />DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889<br />
  17. 17. Outcome: How many variants will I find?<br />HiSeq: whole genome; mean coverage 60; 101PE; (NA12878)<br />Exome: Agilent capture; mean coverage 20; 76/101PE; (NA12878)<br />DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889<br />
  18. 18. Three things to remember<br />Getting the mapping right is critical<br />Variant calling is more than counting the differences<br />Just listing the variants does not tell you anything biologically relevant.<br />by Яick Harris<br />Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011 Jul 1;27(13):1741-8. PMID: 21596790<br />
  19. 19. Next week:<br />Abstract: This seminar aims to answer the question of what to make of the identified variants, specifically how to evaluate their quality and how to prioritize and functionally annotate them.<br />
  20. 20. Walk-in-clinic<br />
