Making the most of short reads - Torsten Seemann - AGRF NGS SIG - 28 Apr 2009

Published in: Technology

  1. Making the most of short reads
     Torsten Seemann
     Victorian Bioinformatics Consortium, Monash University
  2. Outline
     ● About the VBC
     ● Sequencing technologies
     ● Read mapping
     ● Applications
     ● Conclusion
     ● Questions
     07/08/12 Making the most of short reads 2
  3. What is the VBC?
     ● Victorian Bioinformatics Consortium
     ● 2000-2005
       – Monash .med .infotech, CSIRO, DPI
       – $4M STI grant from State Govt.
     ● 2005+
       – Dept. Microbiology, Monash Uni.
       – NHMRC/ARC Network Parasitology
       – Micromon (sequencing centre)
  4. Where is the VBC?
     ● Monash Uni.
     ● Clayton Campus
     ● STRIP2 / Bldg 76
     ● Level 2
     ● Microbiology
     ● Rooms 223-225
  5. VBC capabilities
     ● Sequence analysis
     ● Assembly, annotation, SNPs
     ● Anything-omics!
     ● Microarray analysis/storage
     ● Data mining/visualization
     ● Custom software development
     ● Computer system architecture
  6. VBC Collaborators
     ● Monash Uni.
     ● CSIRO : FSA, LI
     ● Uni. Melbourne
     ● USDA : ARS
     ● Bio21
     ● Pasteur Institute
     ● UNSW, Uni. Syd
     ● TIGR
     ● UQ : IMB
     ● UCSD
     ● MIMR, MMC, Austin
     ● UCLA
     ● MISCL
     ● Uni. Copenhagen
  7. Sanger sequencing
     ● Dye terminated capillary sequencing
     ● Read length ~ 300 - 900 bp
     ● Yield ~ 1 Mbp per day maximum
     ● Cost ~ $HIGH
  8. Roche 454 FLX+
     ● Pyro-sequencing
     ● Read length ~ 100 - 250 bp
     ● Yield ~ 600 Mbp (250 bp PE)
     ● Run time ~ 1 day
     ● Prep time ~ 5 days
     ● Homo-polymer run errors
     ● Cost $MEDIUM
  9. ABI SOLiD 3
     ● Sequencing by ligation
     ● Read length ~ 35 – 50 bp
     ● Yield ~ 15,000 Mbp (50 bp PE)
     ● Run time ~ 14 days
     ● Prep time ~ ? days
     ● Colour space error propagation
     ● Cost $MEDIUM
  10. Illumina GA2 (Solexa)
      ● Sequencing by synthesis
      ● Read length ~ 36 – 100 bp
      ● Yield ~ 6,000 Mbp (36 bp PE)
      ● Run time ~ 5 days
      ● Prep time ~ 1 day
      ● No homo-polymer errors
      ● Cost $LOW
  11. Illumina output 36bp
      Good read
        @HWUSI-EAS100R:3:1:5:1526#0/1
        TCCCTTGCATTACTCTTAATCGAGGAAATCCCTTTG
        +HWUSI-EAS100R:3:1:5:1526#0/1
        abbaaaaaaaaaaaaaaaaa_X^WT]a```a_a`
        a=Q33 Pr(wrong)=0.0005
      Bad read
        @HWUSI-EAS100R:3:1:3:1073#0/2
        TGNNNNNNCAAATTCANNNNNNNTCNNTTTATATCT
        +HWUSI-EAS100R:3:1:3:1073#0/2
        aDDDDDD^[K]BBBBBBBBBBBBBBBBBBBBBBBB
        B=Q2 Pr(wrong)=0.38
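
The two annotations on the slide above (a=Q33, B=Q2) can be reproduced with a short sketch. It assumes the quality characters use an ASCII offset of 64, which is inferred from the slide's own values rather than stated on it. It also shows two common score-to-probability conversions: the slide's 0.38 for Q2 appears consistent with the odds-based (Solexa-style) form, whereas the Phred form would give about 0.63. All function names are illustrative.

```python
def qual_to_scores(qual, offset=64):
    """Convert a quality string to integer scores (offset 64 is an assumption)."""
    return [ord(ch) - offset for ch in qual]


def phred_error(q):
    """Phred-style conversion: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def solexa_error(q):
    """Odds-based (Solexa-style) conversion: p = 1 / (1 + 10^(Q/10))."""
    return 1 / (1 + 10 ** (q / 10))


if __name__ == "__main__":
    for ch in "aB":
        q = qual_to_scores(ch)[0]
        print(ch, q, round(phred_error(q), 4), round(solexa_error(q), 4))
```

For high scores the two conversions are nearly identical; they diverge only at the low-quality end, which is where the "bad read" sits.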
  12. Read mapping
      ● Align 10^8 36 bp reads to a 5 Mbp reference
      ● Traditional tools too slow
      ● New crop of “short read aligners” (SRA)
        – SHRiMP
        – MAQ
        – Bowtie
        – ELAND
        – Novocraft
  13. SRA capabilities
      ● SNP = Single nucleotide polymorphism
        – Substitution, e.g. A → C
        – Insertion or deletion (“indel”), e.g. A → -
      ● Warning: not all aligners support indels!
      ● We tend to use SHRiMP
        – Supports substitutions and indels
        – Fast SIMD implementation & parallelizable
        – Full post-hit Smith-Waterman alignment
        – Will identify “most” high scoring hits
  14. Genome coverage
      ● Mapped 7 M reads to 4 Mbp genome
      ● Yellow line is mean coverage (56x)
      ● Bowl shaped coverage = circular genome
      ● Could be used to guide scaffolding
  15. Missing DNA
      ● Read coverage drops to zero where reference has DNA that the new sequence does not
      ● LB022 absent
      ● hemH present
  16. Repeated DNA
      ● Coverage increases in repeated areas
      ● LA_SNP3199 is probably triplicated in this strain
        – depth 120, average 40
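
The coverage-based observations on the last three slides (mean depth, zero-depth gaps, elevated depth over repeats) can be sketched as follows. This is a minimal illustration, not the tooling used for the talk: it assumes reads are given as hypothetical (start, length) pairs with 0-based starts on a linear reference, and the "twice the mean" cutoff is an arbitrary choice for the example.

```python
from itertools import accumulate


def depth_profile(ref_len, reads):
    """Per-base depth from (start, length) read placements (0-based starts)."""
    diff = [0] * (ref_len + 1)
    for start, length in reads:
        end = min(start + length, ref_len)  # clip reads running off the end
        diff[start] += 1
        diff[end] -= 1
    # A running sum of the difference array gives depth at each position.
    return list(accumulate(diff[:-1]))


def regions(depth, predicate):
    """Half-open [start, end) intervals where predicate(depth) is true."""
    found, start = [], None
    for i, d in enumerate(depth):
        if predicate(d):
            if start is None:
                start = i
        elif start is not None:
            found.append((start, i))
            start = None
    if start is not None:
        found.append((start, len(depth)))
    return found


if __name__ == "__main__":
    depth = depth_profile(10, [(0, 4), (2, 4), (2, 4), (8, 2)])
    mean = sum(depth) / len(depth)
    print("depth:", depth)
    print("missing:", regions(depth, lambda d: d == 0))
    print("repeated?:", regions(depth, lambda d: d > 2 * mean))
```

On real data the ratio of local depth to mean depth gives a rough copy-number estimate, as in the slide's depth 120 versus average 40.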
  17. SNPs
      ● SNPs appear as dips/pinches in the coverage graph
      ● LA1299 gene has 4 possible SNPs relative to ref.
      ● Rest of gene has average coverage
  18. Repairing 454 data
      ● 454 has “homopolymer” errors
      ● Loses track if same base > 3 times in a row
      ● Traditional assemblers don't like too many indels or frame shifts
      ● 454 developed Newbler assembler
      ● Challenging for hybrid assemblies
      ● What if we could “repair” our 454 data?
  19. 454 Repair Guide
      ● One sample with 454 and Illumina reads
      ● Get a read mapper supporting indels
      ● Align all your Illumina reads to 454 data
      ● If sufficient un-ambiguous depth
        – correct the 454 sequence!
      ● Can apply to old closed sequences, 454 contigs, 454 reads etc.
      ● Find old errors via resequencing
  20. Example repair
      >FF6ELPM06G1HYY original 180bp
      AAATCTAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT

      Sequence        Pos  Change type       Old  New  Evidence
      FF6ELPM06G1HYY  11   insertion-before  -    A    "A"x166
      FF6ELPM06G1HYY  61   insertion-before  -    A    "A"x212 "-"x12
      FF6ELPM06G1HYY  92   insertion-before  -    A    "A"x368 "-"x1

      >FF6ELPM06G1HYY repaired 183bp
      AAATCTAAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT
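
The repair step in the example above can be sketched in a few lines. The sequence and the three edits are taken from the slide; everything else is an assumption made for illustration: the function name, the tuple layout, and the acceptance thresholds (`min_depth`, `min_fraction`), since the slides only say "sufficient un-ambiguous depth". Positions are treated as 1-based coordinates in the original read.

```python
ORIGINAL = (
    "AAATCTAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAA"
    "TATTATAAGAGCTCAAAGAAAAGAAGGAAAACAATAAAAGAGCTTGCAACTTTAAATAAT"
    "TGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT"
)

# (position, inserted base, reads supporting the insertion, reads against)
EDITS = [(11, "A", 166, 0), (61, "A", 212, 12), (92, "A", 368, 1)]


def repair(seq, edits, min_depth=20, min_fraction=0.9):
    """Apply 'insertion-before' edits that pass hypothetical evidence cutoffs."""
    accepted = []
    for pos, base, support, against in edits:
        total = support + against
        if total > 0 and total >= min_depth and support / total >= min_fraction:
            accepted.append((pos, base))
    # Work from the 3' end backwards so earlier positions are not shifted.
    for pos, base in sorted(accepted, reverse=True):
        seq = seq[: pos - 1] + base + seq[pos - 1 :]
    return seq


if __name__ == "__main__":
    fixed = repair(ORIGINAL, EDITS)
    print(len(ORIGINAL), "->", len(fixed))
```

Each of the three insertions lengthens an existing run of A bases, which matches the homopolymer under-calling described for 454 data.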
  21. Trimming short reads
      ● Quality worsens toward 3′ end
      ● Many reads have “N” basecalls
      ● Variation across flowcell/slide
      ● Will reduce data size
      ● Trade quality for depth
      ● Is it worth it?
  22. Should I trim?
      ● For 36 bp
        – Results are mixed
        – Usually best NOT to trim
        – Depth will “fix” most errors
      ● For 75+ bp
        – 3′ quality can be very poor
        – Seems best to trim
        – Not all reads need trimming
      ● More research needed
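
As a rough illustration of what trimming means here, the sketch below removes bases from the 3′ end of a read while they are either "N" or below a quality cutoff. It is one simple strategy among many and is not presented as the method used in the talk; the cutoff `min_q=20` and the ASCII offset of 64 are assumptions.

```python
def trim_3prime(seq, qual, min_q=20, offset=64):
    """Drop trailing bases that are 'N' or have quality below min_q."""
    end = len(seq)
    while end > 0 and (seq[end - 1] == "N" or ord(qual[end - 1]) - offset < min_q):
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    print(trim_3prime("ACGTACNN", "aaaaaaBB"))
```

Reads that are already good to the last base pass through unchanged, which fits the slide's point that not all reads need trimming.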
  23. Conclusion
      ● Short read mapping is a powerful tool for genomic discovery
        – Automated analysis e.g. SNPs
        – Visualization e.g. depth/coverage graphs
        – Repairing longer read data
      ● Still need de novo assembly for unmapped reads
  24. Contact me
      Web: http://www.vicbioinformatics.com/
      Email: torsten.seemann@infotech.monash.edu.au
