SSAHA_pileup

A talk I gave at AGBT 2008

  1. SSAHA_pileup: A Genome Variation Detection Pipeline for Various Sequencing Platforms
     Ben Blackburne, Wellcome Trust Sanger Institute
     Photo Credit: saynine on flickr.com
  2. Acknowledgments
     ● Zemin Ning
     ● Yong Gu
     ● Antony Cox
     ● Adam Spargo
     ● Hannes Ponstingl
  3. Introduction
     ● New sequencing technologies
       – More data
       – Different kinds of data
         ● Solexa, 454
         ● capillary, too
       – Diploid genomes
       – SNPs, indels, VNTRs
     Photo Credit: mknowles on flickr.com
  4. SSAHA_pileup
     ● Sequence Search and Alignment by Hashing Algorithm
     ● SSAHA_SNP
       – Global positioning with SSAHA algorithm
       – Fast Smith-Waterman implementation (from Cross_Match)
       – Identification of best match
     ● SSAHA_pileup
       – Determines SNPs from set of best alignments
     ● Works on Solexa, 454, and capillary reads
  5. The Toolchain
     [Flow diagram; labels: Reference Genome, Reads, SSAHA_snp/SSAHA2, Alignments, SSAHA_pileup, variations, refinement]
  6. SSAHA_SNP
     ● Reference genome is “hashed”
       – table made of all k-mer words
       – overlapping or not, at user's option
  7. SSAHA_SNP
     ● k-mer matches found for query in reference
     [Figure: chr n, chr m]
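
The hashing and lookup steps on slides 6 and 7 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SSAHA implementation: the function names `build_kmer_table` and `find_kmer_hits`, the `step` parameter standing in for the "overlapping or not" option, and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def build_kmer_table(reference, k, step=None):
    """Map each k-mer word in the reference to its (sequence name, position) hits.

    step=1 hashes overlapping words; step=k (the default in this sketch)
    hashes non-overlapping words. The parameter name is hypothetical.
    """
    step = k if step is None else step
    table = defaultdict(list)
    for name, seq in reference.items():
        for pos in range(0, len(seq) - k + 1, step):
            table[seq[pos:pos + k]].append((name, pos))
    return table


def find_kmer_hits(table, query, k):
    """Return (name, ref_pos, query_pos) for every query k-mer present in the table."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, rpos in table.get(query[qpos:qpos + k], ()):
            hits.append((name, rpos, qpos))
    return hits


# Toy reference with two sequences (illustrative data only).
reference = {"chr_n": "GGTCCCACAGAGCTGGAGAAAG", "chr_m": "TTTTCCCACAGTTTT"}
table = build_kmer_table(reference, k=4, step=1)
hits = find_kmer_hits(table, "CCCACAGAGC", k=4)
for hit in hits:
    print(hit)
```

Hits that share the same sequence name and roughly the same diagonal (`ref_pos - query_pos`) are the natural candidates to hand on to the global mapping step.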
  8. SSAHA_SNP
     Global Mapping
     [Figure: chr n, chr m]
  9. SSAHA_SNP
     Local Mapping (Smith-Waterman)
     [Figure: chr n, score: 126; score: 113, chr m]
  10. SSAHA_SNP
      Select best match
      [Figure: chr n, score: 126; score: 113, chr m]
  11. SSAHA_SNP
      ● Read pair information
        – currently possible with extra step using SSAHA2
        – being integrated into SSAHA_SNP
        – Removes incorrectly mapped pairs
      Photo Credit: Matthew Fang on flickr.com
  12. SSAHA_pileup
      [Flow diagram; labels: Reference Genome, Reads, SSAHA_snp/SSAHA2, Alignments, SSAHA_pileup, variations, refinement]
  13. SSAHA_pileup
      Reference      ...GGTCCCACAGAGCTGGAGAAAG...
      Aligned reads     GGTCCCACGGAGCTGGAG
                            CCACGGAGCTGGAGAAAGCCT
                          TCCCACGGAGCTGGAGAAAGCCT
                          TCCCACGGAGCTGGAGAAAGCCT
                          TCCCACGGAGCTGGAGAAAGCCT
      Homozygous SNP
  14. SSAHA_pileup
      Reference      ...GGTCCCACAGAGCTGGAGAAAG...
      Aligned reads     GGTCCCACAGAGCTGGAG
                            CCACAGAGCTGGAGAAAGCCT
                          TCCCACGGAGCTGGAGAAAGCCT
                          TCCCACGGAGCTGGAGAAAGCCT
                          TCCCACGGAGCTGGAGAAAGCCT
      Heterozygous SNP
  15. SSAHA_pileup
      Reference      ...GGTCCCACAGAGCTGGAGAAAG...
      Aligned reads     GGTCCCACAGAGCTGGAG
                            CCACAGAGCTGGAGAAAGCCT
                          TCCCACggagCTGGAGAAAGCCT
                          TCCCACggagcTGGAGAAAGCCT
                          TCCCacggagcTGGAGAAAGCCT
      Heterozygous SNP?? (Probably not)
  16. SSAHA_pileup
      Reference      ...GGTCCCACAGAGCTGGAGAAAG...
      Aligned reads     GGTCCCAC-----TGGAG
                            CCAC-----TGGAGAAAGCCT
                          TCCCACGGAGCTGGAGAAAGCCT
                          TCCCACGGAGCTGGAGAAAGCCT
                          TCCCACGGAGCTGGAGAAAGCCT
      Heterozygous indel
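
The pileup examples on slides 13 to 15 suggest a simple per-column classification, sketched below. It is an illustration only, not SSAHA_pileup's actual calling rule: the function name `call_site`, both thresholds, and the reading of lowercase bases as low-quality calls to be ignored are assumptions.

```python
from collections import Counter


def call_site(ref_base, column, min_depth=2, het_fraction=0.2):
    """Classify one pileup column as 'ref', 'hom_snp', 'het_snp', or 'no_call'.

    Assumptions of this sketch: lowercase bases are low quality and are
    ignored, and both thresholds are illustrative values.
    """
    good = [b for b in column if b.isupper() and b in "ACGT"]
    if len(good) < min_depth:
        return "no_call"
    counts = Counter(good)
    # Most frequent non-reference base among the trusted calls, if any.
    _, alt_count = max(
        ((b, n) for b, n in counts.items() if b != ref_base),
        key=lambda item: item[1],
        default=(None, 0),
    )
    if alt_count / len(good) < het_fraction:
        return "ref"
    if counts[ref_base] / len(good) < het_fraction:
        return "hom_snp"
    return "het_snp"


# Columns at the variant position in slides 13, 14, and 15 (reference base A).
print(call_site("A", "GGGGG"))
print(call_site("A", "AAGGG"))
print(call_site("A", "AAggg"))
```

Under these assumptions the slide 15 column falls back to the reference call, because the only evidence for G comes from lowercase bases.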
  17. How well does it work?
  18. Datasets
      ● Venter: ABI capillary reads
        – Celera: 19,397,599 (55% in pairs)
        – JCVI: 12,541,352 (98% in pairs)
        – Total: 31,938,951 (72% in pairs; 90% mapped)
      ● Watson: 454 GS FLX reads
        – Baylor & Roche: 74,198,831 (90.5% mapped)
        – single-end reads with length 150–280 bp
      ● Chromosome X Illumina reads
        – 278,557,156 reads (71.6% mapped)
        – paired, with insert size 200 bp
  19. How conservative should we be?
  20. How conservative should we be?
  21. Or.... How liberal should we be?
  22. How do we even know if we are winning?
  23. dbSNP (but not ideal)
  24. Filtering
      ● Processes that cause bogus SNPs
        – Incorrect global mapping
        – Incorrect local alignment
        – Poor quality reads
        – Sequence amplification errors
  25. Global Mapping Problems
      ● Reads from unmapped regions of the genome
        – Lead to absurdly high apparent coverage
      [Figure: chr n, chr m]
  26. Global Mapping Problems
      ● Reads from unmapped regions of the genome
        – Lead to absurdly high apparent coverage
      [Figure: chr n, chr m]
  27. Global Mapping Problems
      ● Reads from unmapped regions of the genome
        – Lead to absurdly high apparent coverage
      [Figure: chr n]
  28. SNPs
  29. Solution: Filter out SNPs called from abnormally high read depths
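
A read-depth filter of the kind named on slide 29 might look like the sketch below. The talk does not say how "abnormally high" is defined, so the cutoff here (a multiple of the median depth), the function name `filter_by_depth`, and the record layout are all assumptions.

```python
from statistics import median


def filter_by_depth(snps, max_ratio=2.0):
    """Keep SNP calls whose read depth is at most max_ratio times the median depth.

    The median-based cutoff is an assumption of this sketch, not the rule
    used by SSAHA_pileup.
    """
    typical = median(s["depth"] for s in snps)
    return [s for s in snps if s["depth"] <= max_ratio * typical]


# Illustrative calls; the last one sits in a pile of reads from elsewhere.
calls = [
    {"pos": 1_200, "depth": 10},
    {"pos": 5_431, "depth": 12},
    {"pos": 9_877, "depth": 11},
    {"pos": 20_150, "depth": 250},
]
kept = filter_by_depth(calls)
print([c["pos"] for c in kept])
```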
  30. Global Mapping Problems
      ● Incorrectly aligned reads
      [Figure: chr n, score: 132; score: 136, chr m]
  31. Solution: Filter out SNPs where the 2nd-best score is too close to the best
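
The second-best-score test on slide 31 can be sketched as a gap check on the alignment scores of one read. The function name `is_uniquely_mapped` and the `min_gap` value are assumptions; the talk gives no threshold.

```python
def is_uniquely_mapped(scores, min_gap=10):
    """Return True when the best alignment score beats the runner-up by min_gap.

    min_gap is an illustrative value chosen for this sketch.
    """
    ordered = sorted(scores, reverse=True)
    if not ordered:
        return False  # no alignment at all
    if len(ordered) == 1:
        return True  # nothing competes with the only hit
    return ordered[0] - ordered[1] >= min_gap


# Score pairs shown on slides 9 and 30.
print(is_uniquely_mapped([126, 113]))
print(is_uniquely_mapped([132, 136]))
```

With this gap, the 126 versus 113 read is kept, while the 132 versus 136 read is treated as ambiguous and its SNPs would be dropped.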
  32. Local Alignment Problems
      ● Misalignment
        – Uncaught incorrect global alignment
        – Variations in short repeats
  33. Local Misalignment
      Reference      ...GGTCCCACAGAGCTGGAGAAAA...
      Aligned reads     GGTCCCACT---CTAGTG
                            CCACT---CTAGTGAAAA
                          TCCCACT---CTAGTGAAAA
      Real SNPs?
  34. Local Misalignment
      Reference
        ..TAATAATAATAATAATAATAAGAAG..
      Aligned reads
        AATAATAAGAAGAAGAAGAAGAAG
        AATAATAAGAAGAAGAAGAAGAAG
        AATAATAAGAAGAAGAAGAAGAAG
      Real SNPs?
  35. Solution: Filter out short blocks of many SNPs
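
One way to express the "short blocks of many SNPs" filter from slide 35 is a sliding window over sorted SNP positions. The window size, the SNP limit, and the name `filter_snp_clusters` are assumptions made for this sketch.

```python
def filter_snp_clusters(positions, window=10, max_snps=2):
    """Drop every SNP that falls in a window of `window` bases holding more than max_snps SNPs.

    Both parameters are illustrative values, not SSAHA_pileup settings.
    """
    positions = sorted(positions)
    flagged = set()
    start = 0
    for end, pos in enumerate(positions):
        # Shrink the window until it spans fewer than `window` bases.
        while pos - positions[start] >= window:
            start += 1
        if end - start + 1 > max_snps:
            flagged.update(positions[start:end + 1])
    return [p for p in positions if p not in flagged]


# Four calls packed into ten bases, plus two isolated calls (toy data).
snp_positions = [100, 103, 106, 109, 500, 900]
print(filter_snp_clusters(snp_positions))
```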
  36. Venter SNP Calling (Capillary)
                             count       fraction in dbSNP
      Homozygous SNPs        1 347 806   97.1%
      Heterozygous SNPs      1 857 167   90.9%
      Total SNPs             3 204 973   93.5%
  37. Watson SNP Calling (454)
                             count       fraction in dbSNP
      Homozygous SNPs        1 298 309   93.0%
      Heterozygous SNPs      1 767 951   63.9%
      Total SNPs             3 066 260   76.3%
  38. X Chromosome SNPs (Solexa)
                             count       fraction in dbSNP
      Homozygous SNPs           27 708   92.8%
      Heterozygous SNPs         63 197   81.8%
      Total SNPs                90 905   85.1%
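
The totals on slides 36 to 38 can be cross-checked from the per-genotype rows: the counts should add up, and the total dbSNP fraction should be close to the count-weighted mean of the two genotype fractions. The sketch below does that arithmetic; the names `tables` and `combined_fraction` are hypothetical, and because the slide percentages are rounded to one decimal place, agreement is only expected to within rounding.

```python
# (count, fraction in dbSNP) copied from slides 36-38.
tables = {
    "Venter (capillary)": {
        "hom": (1_347_806, 0.971),
        "het": (1_857_167, 0.909),
        "total": (3_204_973, 0.935),
    },
    "Watson (454)": {
        "hom": (1_298_309, 0.930),
        "het": (1_767_951, 0.639),
        "total": (3_066_260, 0.763),
    },
    "X chromosome (Solexa)": {
        "hom": (27_708, 0.928),
        "het": (63_197, 0.818),
        "total": (90_905, 0.851),
    },
}


def combined_fraction(hom, het):
    """Count-weighted mean of the homozygous and heterozygous dbSNP fractions."""
    (n_hom, f_hom), (n_het, f_het) = hom, het
    return (n_hom * f_hom + n_het * f_het) / (n_hom + n_het)


for name, rows in tables.items():
    total_count, reported = rows["total"]
    summed = rows["hom"][0] + rows["het"][0]
    estimate = combined_fraction(rows["hom"], rows["het"])
    print(f"{name}: counts sum to {summed} (reported {total_count}); "
          f"weighted fraction {estimate:.3f} (reported {reported:.3f})")
```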
  39. Venter-Watson Overlap
      Venter only: 1 593 791
      Shared:      1 611 182
      Watson only: 1 455 078
  40. X Chromosome Overlap
      [Venn diagram of the Solexa X reads, Venter, and Watson SNP sets; region counts: 40 625, 19 978, 12 590, 17 712, 26 502, 6 588, 22 872]
  41. Conclusions
      ● SSAHA_pileup is effective across both new and old sequencing technologies
      ● Questions
        – When is a SNP not a SNP?
        – Homozygous/Heterozygous SNPs
  42. Conclusions
      ● SSAHA_pileup is effective across both new and old sequencing technologies
      ● Questions
        – When is a SNP not a SNP?
        – Homozygous/Heterozygous SNPs
      ● Length matters...?
        – But it's what you do with it that counts
  43. Obtaining SSAHA_pileup
      SSAHA_pileup: ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/
      SSAHA2: http://www.sanger.ac.uk/Software/analysis/SSAHA2/
      These Slides: http://slideshare.net/bpb/
