Variant (SNPs/Indels) calling in DNA sequences, Part 2

Abstract: This session focuses on the steps involved in identifying genomic variants after an initial mapping has been achieved: improving the mapping, SNP and indel calling, and variant filtering/recalibration will be introduced.

Published in: Technology
  • unmethylated ‘C’ bases, or cytosines, are converted to ‘T’
  • The proportion of unique sequence in the Streptococcus suis (squares) and Mus musculus (triangles) genomes for varying read lengths. This graph indicates that read length has a critical effect on the ability to place reads uniquely in the genome.

    1. Variant calling for disease association (2/2)
        • Searching the haystack
        • July 14, 2011
        • [www.absolutefab.com]
    2. Quick recap: DNA sequence read mapping
        • Sequencing -> FASTQ -> alignment to reference genome
        • Resulting file type: BAM
        • Visualized in a genome viewer
        • “What genomic regions were sequenced?”
        • [Diagram labels: Quality Control, Projects, FASTQ, BAM]
    3. Production Informatics and Bioinformatics
        • Basic Production Informatics: produce raw sequence reads
        • Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs)
        • Bioinformatics Research: analyze the data; uncover the biological meaning
        • Per one-flowcell project
    4. Good mapping is crucial
        • Mapping tools trade accuracy for speed: the mapping is approximate.
        • Identifying exactly where the reads map is the foundation for all subsequent analyses.
        • The exact alignment of each read is especially important for variant calling.
        • [Image by neilalderney123]
    5. Mapping challenges
        • Incorrect mapping: amongst 3 billion bp (human), a 100-mer can occur by chance.
        • Multi-mappers: the genome has non-unique regions (e.g. repeats), so one read can map to multiple sites.
        • Duplicates: PCR duplicates can introduce artifacts.
        • [Figure: proportion of unique sequence by read length; Streptococcus suis (squares), Mus musculus (triangles)]
        • [Figure: reads aligned to the reference]
          ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
                TTACACGTACA
                 TACACGTACAC
                  ACACGTACACT
                   CACGTACTCTC
                   CACGTACTCTC
                   CACGTACTCTC
                   CACGTACTCTC
        • Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome. 2009 Jun;20(6):327-38. PMID: 19452216
    6. Methods for ensuring a good alignment
        • Biological: using paired-end reads to increase coverage
        • Bioinformatic: local realignment; base quality score recalibration
        • [Figure labels: ~200 bp, ?, Repeat region]
    7. Local Realignment (GATK)
        • Local realignment of all reads at a specific location simultaneously, to minimize mismatches to the reference genome.
        • Reduces erroneous SNPs and refines the location of indels.
        • [Figure: QBI data, original vs. realigned]
        • DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
    8. Quality score recalibration (GATK)
        • PHRED scores are predicted.
        • Looking at all reads at a specific location allows a better estimate of the base quality score:
            • Exclude all known dbSNP sites.
            • Assume all other mismatches are sequencing errors.
            • Compute a new calibration table based on mismatch rates per position on the read.
        • Important for variant calling.
        • DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
        • Thomas Keane, 9th European Conference on Computational Biology, 26th September, 2010
    9. Recalibration of quality score
        • All bases are called with Q25.
        • In reality not all are that good: “bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20” (GATK)
        • DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
    10. Variant calling methods
        • More than 15 different algorithms (http://seqanswers.com/wiki/Software/list)
        • Three categories:
            • Allele counting
            • Probabilistic methods, e.g. a Bayesian model, to quantify statistical uncertainty. Priors can be assigned, e.g. by taking the observed allele frequency of multiple samples into account.
            • Incorporating linkage disequilibrium (LD): especially helpful for low coverage and common variants
        • [Figure: variant (SNP); Ref: A; Ind1: G/G; Ind2: A/G]
        • Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.
    11. VCF format
        [HEADER LINES]
        #CHROM  POS     ID          REF  ALT  QUAL     FILTER   INFO           FORMAT          NA12878
        chr1    873762  .           T    G    5231.78  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:173,141:282:99:255,0,255
        chr1    877664  rs3828047   A    G    3931.66  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
        chr1    899282  rs28548431  C    T    71.77    PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
        chr1    974165  rs9442391   T    C    29.84    LowQual  [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255
        • Individual statistics (values from the first record):
            • GT: genotype, 0/1
            • AD: total number of REF/ALT seen, 173 T, 141 G
            • DP: depth (MAPQ > 17), 282
            • GQ: genotype quality, 99
            • PL: genotype likelihoods, 0/0: 10^-25.5 = unlikely, 0/1: 10^0 = likely, 1/1: 10^-25.5 = unlikely
        • Location statistics, e.g.
            • Strand bias
            • How many reads have a deletion at this site
    12. When to call a variant?
        • Hom: REF 0%, ALT 100%
        • Het: REF 50%, ALT 50%
        • ??: REF 77%, ALT 23%
        • [Figures: QBI data]
    13. Hard Filtering
        • Reducing false positives by e.g. requiring:
            • Sufficient depth
            • Variant to be in >30% of reads
            • High quality
            • Strand balance
            • …
        • Subjective and dangerous in this high-dimensional search space
        • [Figure: strand bias, QBI data]
        • Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
        • Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
    14. Gaussian mixture model
        • Train on trusted variants and require new variants to lie in the same region of the feature space.
        • Potential problem: overfitting and biasing towards the features of known SNPs.
    15. Indel calling
        • The first local realignment might not be sufficient to confidently determine the beginning and end of indels.
        • Dindel algorithm: local realignment for every indel candidate
        • Albers CA, Lunter G, Macarthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: Accurate indel calls from short-read data. Genome Res. 2011 Jun;21(6):961-73. PMID: 20980555.
    16. Recap
        • DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
    17. Outcome: How many variants will I find?
        • HiSeq: whole genome; mean coverage 60; 101PE; (NA12878)
        • Exome: Agilent capture; mean coverage 20; 76/101PE; (NA12878)
        • DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
    18. Three things to remember
        • Getting the mapping right is critical.
        • Variant calling is more than merely counting the differences.
        • Just listing the variants does not tell you anything biologically relevant.
        • [Image by Яick Harris]
        • Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011 Jul 1;27(13):1741-8. PMID: 21596790
    19. Next week
        • Abstract: This seminar aims to answer the question of what to make of the identified variants, specifically how to evaluate their quality and how to prioritize and functionally annotate them.
    20. Walk-in-clinic
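
The Q25 versus Q20 example on slide 9 follows from the PHRED definition Q = -10 · log10(p), where p is the probability that a base call is wrong. The sketch below only illustrates that arithmetic; the function names and the mismatch counts are made up for illustration and are not GATK's API or the seminar's data.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a PHRED score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def empirical_phred(mismatches, total_bases):
    """PHRED score implied by an observed mismatch rate.

    Assumes known variant sites were already excluded, so every remaining
    mismatch is treated as a sequencing error (as on slide 8).
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if mismatches == 0:
        return float("inf")  # no observed errors: quality is unbounded
    return -10 * math.log10(mismatches / total_bases)


# Hypothetical bin of bases that the sequencer reported as Q25.
reported_q = 25
expected_error = phred_to_error_prob(reported_q)

# Hypothetical observation: 1,000 mismatches in 100,000 such bases (1 in 100).
observed_q = empirical_phred(1_000, 100_000)

print(f"reported Q{reported_q} implies error rate {expected_error:.5f}")
print(f"observed mismatch rate implies Q{observed_q:.1f}")
```

A real recalibration builds such a table per covariate (for example reported quality and position on the read) instead of for a single bin.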
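
The per-sample fields on slide 11 can be unpacked with a few lines of Python. This is a minimal sketch that assumes a single-sample, biallelic record with exactly the GT:AD:DP:GQ:PL layout shown on the slide; the helper names are hypothetical, and a real pipeline would use a dedicated VCF parser.

```python
def parse_sample(format_field, sample_field):
    """Pair the FORMAT keys with the sample's colon-separated values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def pl_to_likelihoods(pl_string):
    """Convert PHRED-scaled likelihoods (PL) to relative likelihoods 10^(-PL/10).

    For a biallelic site the order is 0/0, 0/1, 1/1.
    """
    return [10 ** (-int(pl) / 10) for pl in pl_string.split(",")]


def alt_allele_fraction(ad_string):
    """Fraction of reads supporting ALT, from the AD field 'ref,alt'."""
    ref_reads, alt_reads = (int(x) for x in ad_string.split(","))
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0


# First record from slide 11 (chr1:873762, T>G).
sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:173,141:282:99:255,0,255")

likelihoods = pl_to_likelihoods(sample["PL"])
for genotype, likelihood in zip(["0/0", "0/1", "1/1"], likelihoods):
    print(f"{genotype}: relative likelihood {likelihood:.3g}")

print(f"ALT allele fraction: {alt_allele_fraction(sample['AD']):.2f}")
```

A PL of 0 marks the most likely genotype, here the heterozygous call 0/1, and a PL of 255 corresponds to the 10^-25.5 quoted on the slide.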
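
The hard-filtering criteria on slide 13 amount to a set of fixed thresholds. The sketch below encodes three of them; only the ">30% of reads" cut-off comes from the slide, while the depth and quality thresholds, the `Call` class and the filter labels are assumptions chosen for illustration. Strand balance is left out because the slides give no per-strand counts.

```python
from dataclasses import dataclass


@dataclass
class Call:
    ref_reads: int   # reads supporting REF
    alt_reads: int   # reads supporting ALT
    qual: float      # variant quality (QUAL column)


def hard_filter(call, min_depth=10, min_alt_fraction=0.30, min_qual=50.0):
    """Return the list of failed filters; an empty list means the call passes.

    min_depth and min_qual are illustrative defaults, not recommended values.
    """
    failed = []
    depth = call.ref_reads + call.alt_reads
    if depth < min_depth:
        failed.append("LowDepth")
    # The slide requires the variant to be in MORE than 30% of reads.
    if depth == 0 or call.alt_reads / depth <= min_alt_fraction:
        failed.append("LowAltFraction")
    if call.qual < min_qual:
        failed.append("LowQual")
    return failed


calls = {
    "slide 11, chr1:873762": Call(173, 141, 5231.78),
    "slide 11, chr1:899282": Call(1, 3, 71.77),
    "slide 12, 77%/23% case (hypothetical counts)": Call(77, 23, 500.0),
}
for name, call in calls.items():
    failed = hard_filter(call)
    print(name, "->", "PASS" if not failed else ",".join(failed))
```

Shifting any threshold slightly changes which calls survive, which is the subjectivity the slide warns about.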