Variant (SNPs/Indels) calling in DNA sequences, Part 1

Abstract: This session focuses on the first steps in identifying SNPs from whole-genome, exome-capture, or targeted resequencing data. It introduces the different approaches for mapping reads to a DNA reference sequence and discusses the associated quality metrics.

Published in: Technology
  • http://www.absolutefab.com/assets/images/home_needle_haystack.jpg
  • http://www.slideshare.net/thomaskeane/eccb-2010-nextgen-sequencing-tutorial
  • unmethylated ‘C’ bases, or cytosines, are converted to ‘T’
  • The BWT limits the number of allowed mismatches: BWA can only find alignments within a certain edit distance. For 100-bp reads, BWA allows 5 edits.
  • Variant (SNPs/Indels) calling in DNA sequences, Part 1

    1. Variant calling for disease association (1/2): Ordering the haystack
       June 30, 2011
       [www.absolutefab.com]
    2. Quick recap: Production informatics
       • Sequencing -> Images -> Conversion (Demultiplexing)
       • Resulting file type: FASTQ
       • Several projects can be processed on one flowcell
       • One project can have several samples
       • Diagram labels: Sequencing, Image, Fastq, Quality Control, Projects
    3. Production Informatics and Bioinformatics
       • Basic Production Informatics: Produce raw sequence reads
       • Advanced Production Inform.: Map to genome and generate raw genomic features (e.g. SNPs)
       • Bioinformatics Research: Analyze the data; uncover the biological meaning
       • Per one-flowcell project
    4. Where in the genome do the reads come from?
       • Diagram labels: Reads, Alignment
    5. Short read mapping
       • A brute-force algorithm would take years to process one lane: data structures matter!
       • Constant trade-off: speed vs. sensitivity
       • To date, there are more than 50 read mapping tools
       • Two categories:
         • Hash tables: MAQ, ELAND, SOAP, BFAST, RazerS, Novoalign
         • Suffix trees: BWA, SOAP2, BOWTIE
       Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ. Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet. 2011 Apr 28. PubMed PMID: 21525877.
       Thomas Keane, 9th European Conference on Computational Biology, 26th September, 2010
    6. Hash table based aligners
       • Modification
         • Speed-up: Spaced seeds 111010010100110111
         • Gapped seeds: Qgrams
       • Hash of the reads: MAQ, ELAND, ZOOM and SHRiMP
         • Potentially much smaller memory requirements
       • Hash the reference: SOAP, BFAST and MOSAIK
         • Constant memory cost, one-time effort
       Thomas Keane, 9th European Conference on Computational Biology, 26th September, 2010
    7. Suffix tree and Burrows-Wheeler Transformation
       • Suffix trees are much faster
       • E.g. BWA is ~20 times faster than hash-based MAQ
       • The BW transformation makes them applicable (memory)
       Reference: queensland
       BWT(Ref): dlnuesae$nq
       Rotated        Sorted
       queensland$    $queensland
       ueensland$q    and$queensl
       eensland$qu    d$queenslan
       ensland$que    eensland$qu
       nsland$quee    ensland$que
       sland$queen    land$queens
       land$queens    nd$queensla
       and$queensl    nsland$quee
       nd$queensla    queensland$
       d$queenslan    sland$queen
       $queensland    ueensland$q
       Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. PMID: 19451168
    8. Find exact matches in transformed sequence
       Reference: queensland (positions 1 to 10)
       Read: ensl
       In the table, P is the position in the reference at which the row starts, and C counts which occurrence of its letter the row's last character is within the last column.
        P  BWT          C
        0  $queensland  1
        8  and$queensl  1
       10  d$queenslan  1
        3  eensland$qu  1
        4  ensland$que  1
        7  land$queens  1
        9  nd$queensla  1
        5  nsland$quee  2
        1  queensland$  1
        6  sland$queen  2
        2  ueensland$q  1
       Steps:
       1. Search backwards
       2. Find letter i in the last column
       3. Jump to the count-th letter i in the first column
       4. Set i to be the letter in the last column
       5. Repeat 3+4 to the end
       John Pearson, Winter School in Mathematical and Computational Biology, 5-9 July 2010
       Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25; PMID: 19261174.
    9. Which aligner to use?
       • Hash-based approaches are more suitable for divergent alignments
       • Rule of thumb:
         • <2% divergence -> BWT (e.g. human alignments)
         • >2% divergence -> hash-based approach (e.g. wild mouse strain alignments)
       • However, the space develops fast: don't be sentimental
       Thomas Keane, 9th European Conference on Computational Biology, 26th September, 2010
    10. File format: SAM/BAM
       ref      AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
       +r001/1        TTAGATAAAGGATA*CTG
       +r002         aaaAGATAA*GGATA
       +r003       gcctaAGCTAA
       +r004                     ATAGCT..............TCAGC
       -r003                            ttagctTAGGC
       -r001/2                                        CAGCGCCAT
       Plus an unlimited number of additional fields in TAG:TYPE:VALUE form, e.g. NM (edit distance)
       The SAM Format Specification (v1.4-r962), April 17, 2011
    11. Flag
       Hex  0x80  0x40  0x20  0x10  0x8  0x4  0x2  0x1
       Bit   128    64    32    16    8    4    2    1
       Set     1     0     1     0    0    0    1    1   = 163 (128 + 32 + 2 + 1)
    12. CIGAR String
    13. Visualizing BAM files: IGV
       • Exome capture
       • Whole genome sequencing
       http://www.broadinstitute.org/igv/
    14. BAM file: Quality control
       //cluster-vm.qbi.uq.edu/<yourProject>
       • Percentage mapped: aim for 80%
       • Coverage: aim for coverage >10
       • Duplicates: aim for <1% (whole genome)
    15. Three things to remember
       • Getting the mapping right is critical
       • QC consists of checking the mapping stats and visualizing the BAM file
       • Knowing where the reads are does not necessarily tell you about their function
    16. Next week: Part 2
       Abstract: This session will focus on the steps involved in identifying genomic variants after an initial mapping has been achieved: improving the mapping, SNP and indel calling, and variant filtering/recalibration will be introduced and quality metrics discussed.
       http://climbers.net/blog/Exhibiting-at-Cliffhanger-12-13th-July-Sheffield
    17. Walk-in-clinic
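
The rotate-and-sort construction on slide 7 and the backward search on slide 8 can be sketched in a few lines of Python. This is a minimal illustration for the "queensland" toy example, not how BWA or Bowtie are implemented: the function names `build_index` and `find_read` are hypothetical, and the letter counts are recomputed with `str.count` on every step for clarity, whereas real aligners use precomputed occurrence tables.

```python
from collections import Counter

def build_index(ref):
    """Sort all rotations of ref + '$' and keep what backward search needs."""
    text = ref + "$"  # '$' marks the end and sorts before every letter
    # sa[row] = 0-based start of the rotation that lands in sorted row `row`
    sa = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    # The BWT is the last column: the character preceding each rotation start
    last = "".join(text[i - 1] for i in sa)
    # first_row[ch] = first sorted row whose rotation begins with ch
    first_row, total = {}, 0
    for ch, n in sorted(Counter(text).items()):
        first_row[ch] = total
        total += n
    return sa, last, first_row

def find_read(index, read):
    """Exact-match positions (1-based, as on the slide) via backward search."""
    sa, last, first_row = index
    lo, hi = 0, len(last)  # current half-open range of matching rows
    for ch in reversed(read):  # search backwards through the read
        if ch not in first_row:
            return []
        # Map the range through the last column to rows starting with ch
        lo = first_row[ch] + last[:lo].count(ch)
        hi = first_row[ch] + last[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(sa[row] + 1 for row in range(lo, hi))

index = build_index("queensland")
print(index[1])                  # BWT string of the reference
print(find_read(index, "ensl"))  # positions where the read matches
```

Each loop iteration performs steps 3 and 4 of the slide: it counts how many times the current letter occurs in the last column above the range, then jumps to the matching rows in the first column.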
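
The flag value 163 on slide 11 can be decoded by testing each bit in turn. The sketch below uses the bit meanings from the SAM specification; the short labels such as `proper_pair` and the function name `decode_flag` are illustrative choices, not official identifiers.

```python
# Bit values from slide 11, with shortened labels for their SAM meanings
FLAG_BITS = [
    (0x1, "paired"),           # read is part of a pair
    (0x2, "proper_pair"),      # both mates aligned as expected
    (0x4, "unmapped"),         # read itself is unmapped
    (0x8, "mate_unmapped"),    # mate is unmapped
    (0x10, "reverse"),         # read aligned to the reverse strand
    (0x20, "mate_reverse"),    # mate aligned to the reverse strand
    (0x40, "first_in_pair"),   # first read of the pair
    (0x80, "second_in_pair"),  # second read of the pair
]

def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag value."""
    return [name for bit, name in FLAG_BITS if flag & bit]

print(decode_flag(163))
```

For 163 = 128 + 32 + 2 + 1 this reports a properly paired second read whose mate maps to the reverse strand.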
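
To illustrate the spaced-seed idea from slide 6, the following sketch hashes a reference using the mask shown on that slide. It is a simplified assumption of how such an index could look, not the scheme of any particular aligner; `spaced_key`, `index_reference`, and `candidate_positions` are hypothetical names, and the reference is the ungapped sequence from the SAM example on slide 10.

```python
from collections import defaultdict

MASK = "111010010100110111"  # seed pattern from slide 6: 1 = compare, 0 = ignore

def spaced_key(window, mask=MASK):
    """Keep only the bases that sit under a '1' in the mask."""
    if len(window) != len(mask):
        raise ValueError("window and mask must have the same length")
    return "".join(base for base, m in zip(window, mask) if m == "1")

def index_reference(ref, mask=MASK):
    """Hash the reference: map every spaced key to its start positions."""
    index = defaultdict(list)
    for start in range(len(ref) - len(mask) + 1):
        index[spaced_key(ref[start:start + len(mask)], mask)].append(start)
    return index

def candidate_positions(index, window, mask=MASK):
    """Look up the 0-based reference positions sharing the window's key."""
    return index.get(spaced_key(window, mask), [])

ref = "AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT"
index = index_reference(ref)
read = ref[:3] + "C" + ref[4:18]  # mismatch at offset 3, a '0' in the mask
print(candidate_positions(index, read))
```

Because the mismatch falls on an ignored position, the read still produces the same key as the reference window it came from. A mismatch under a '1' would change the key, which is the sensitivity trade-off the slide refers to.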
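
The three quality-control targets on slide 14 amount to simple arithmetic on read counts. The sketch below is one possible way to compute them, under stated assumptions: the function `mapping_qc` and its parameters are hypothetical, coverage is approximated as mapped bases divided by target size, and the duplicate rate is taken relative to mapped reads. The slides do not define these details.

```python
def mapping_qc(total_reads, mapped_reads, duplicate_reads, read_length, target_size):
    """Compute the slide-14 metrics and compare them with the stated targets."""
    pct_mapped = 100 * mapped_reads / total_reads
    coverage = mapped_reads * read_length / target_size  # assumed approximation
    pct_duplicates = 100 * duplicate_reads / mapped_reads  # assumed denominator
    return {
        "pct_mapped": pct_mapped,
        "coverage": coverage,
        "pct_duplicates": pct_duplicates,
        "mapped_ok": pct_mapped >= 80,         # aim for 80%
        "coverage_ok": coverage > 10,          # aim for coverage >10
        "duplicates_ok": pct_duplicates < 1,   # aim for <1% (whole genome)
    }

# Made-up numbers purely for illustration
print(mapping_qc(1_000_000, 900_000, 5_000, 100, 3_000_000))
```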