Making the most of short reads - Torsten Seemann - AGRF NGS SIG - 28 Apr 2009
 

    Presentation Transcript

    • Making the most of short reads Torsten Seemann Victorian Bioinformatics Consortium Monash University
    • Outline ● About the VBC ● Sequencing technologies ● Read mapping ● Applications ● Conclusion ● Questions 07/08/12 Making the most of short reads 2
    • What is the VBC? ● Victorian Bioinformatics Consortium ● 2000-2005 – Monash .med .infotech, CSIRO, DPI – $4M STI grant from State Govt. ● 2005+ – Dept. Microbiology, Monash Uni. – NHMRC/ARC Network Parasitology – Micromon (sequencing centre)
    • Where is the VBC? ● Monash Uni. ● Clayton Campus ● STRIP2 / Bldg 76 ● Level 2 ● Microbiology ● Rooms 223-225
    • VBC capabilities ● Sequence analysis ● Assembly, annotation, SNPs ● Anything-omics! ● Microarray analysis/storage ● Data mining/visualization ● Custom software development ● Computer system architecture
    • VBC Collaborators ● Monash Uni. ● CSIRO : FSA, LI ● Uni. Melbourne ● USDA : ARS ● Bio21 ● Pasteur Institute ● UNSW, Uni. Syd ● TIGR ● UQ : IMB ● UCSD ● MIMR, MMC, Austin ● UCLA ● MISCL ● Uni. Copenhagen
    • Sanger sequencing ● Dye terminated capillary sequencing ● Read length ~ 300 - 900 bp ● Yield ~ 1 Mbp per day maximum ● Cost ~ $HIGH
    • Roche 454 FLX+ ● Pyro-sequencing ● Read length ~ 100 - 250 bp ● Yield ~ 600 Mbp (250 bp PE) ● Run time ~ 1 day ● Prep time ~ 5 days ● Homo-polymer run errors ● Cost $MEDIUM
    • ABI SOLiD 3 ● Sequencing by ligation ● Read length ~ 35 – 50 bp ● Yield ~ 15,000 Mbp (50 bp PE) ● Run time ~ 14 days ● Prep time ~ ? days ● Colour space error propagation ● Cost $MEDIUM
    • Illumina GA2 (Solexa) ● Sequencing by synthesis ● Read length ~ 36 – 100 bp ● Yield ~ 6,000 Mbp (36 bp PE) ● Run time ~ 5 days ● Prep time ~ 1 day ● No homo-polymer errors ● Cost $LOW
    • Illumina output 36bp
      Good read (a = Q33, Pr(wrong) = 0.0005):
        @HWUSI-EAS100R:3:1:5:1526#0/1
        TCCCTTGCATTACTCTTAATCGAGGAAATCCCTTTG
        +HWUSI-EAS100R:3:1:5:1526#0/1
        abbaaaaaaaaaaaaaaaaa_X^WT]a```a_a`
      Bad read (B = Q2, Pr(wrong) = 0.38):
        @HWUSI-EAS100R:3:1:3:1073#0/2
        TGNNNNNNCAAATTCANNNNNNNTCNNTTTATATCT
        +HWUSI-EAS100R:3:1:3:1073#0/2
        aDDDDDD^[K]BBBBBBBBBBBBBBBBBBBBBBBB
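The quality annotations on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes an ASCII offset of 64 for the quality characters (consistent with a = Q33 and B = Q2 as stated on the slide). The function names are illustrative, not from the talk. Two error-probability definitions are shown because the slide's 0.38 for Q2 matches the Solexa odds-based formula, whereas the standard Phred formula gives about 0.63 for the same score.

```python
def quality_from_char(ch: str, offset: int = 64) -> int:
    """Quality score for one FASTQ quality character (offset 64 is an assumption)."""
    return ord(ch) - offset


def p_error_phred(q: int) -> float:
    """Standard Phred definition: Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def p_error_solexa(q: int) -> float:
    """Solexa odds-based definition: Q = -10 * log10(p / (1 - p))."""
    return 1 / (1 + 10 ** (q / 10))


for ch in "aB":
    q = quality_from_char(ch)
    print(ch, q, round(p_error_phred(q), 4), round(p_error_solexa(q), 4))
```

At high scores the two definitions are practically identical; they only diverge for poor-quality bases such as the run of B characters in the bad read.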
    • Read mapping ● Align 10^8 36bp reads to 5 Mbp reference ● Traditional tools too slow ● New crop of “short read aligners” (SRA) – SHRiMP – MAQ – Bowtie – ELAND – Novocraft
    • SRA capabilities ● SNP = Single nucleotide polymorphism – Substitution, e.g. A → C – Insertion or deletion (“indel”), e.g. A → - ● Warning: not all aligners support indels! ● We tend to use SHRiMP – Supports substitutions and indels – Fast SIMD implementation & parallelizable – Full post-hit Smith-Waterman alignment – Will identify “most” high scoring hits
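To illustrate why a Smith-Waterman step lets an aligner tolerate both substitutions and indels, here is a textbook sketch of the local alignment score. It is not SHRiMP's SIMD implementation; the scoring values (match 2, mismatch -1, gap -2) and the linear gap penalty are assumptions chosen only for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b, using a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: that is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


read = "ACGTACGT"
print(smith_waterman_score(read, "ACGTACGT"))   # identical reference
print(smith_waterman_score(read, "ACGAACGT"))   # one substitution
print(smith_waterman_score(read, "ACGTTACGT"))  # one inserted base in the reference
```

An aligner without indel support would only be able to score the third case as two separate shorter hits.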
    • Genome coverage ● Mapped 7 M reads to 4 Mbp genome ● Yellow line is mean coverage (56x) ● Bowl shaped coverage = circular genome ● Could be used to guide scaffolding
    • Missing DNA ● Read coverage drops to zero where reference has DNA that the new sequence does not ● LB022 absent ● hemH present
    • Repeated DNA ● Coverage increases in repeated areas ● LA_SNP3199 is probably triplicated in this strain – depth 120, average 40
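A coverage graph like the one described is just a per-base count of how many mapped reads overlap each reference position. The sketch below assumes hits are given as hypothetical (start, length) pairs with 0-based starts, and lets reads wrap around the end of a circular reference. As a rough cross-check, if all 7 M reads were 36 bp and mapped at full length (the slide does not state the read length for this data set), the mean depth would be 7e6 * 36 / 4e6 = 63x, close to the 56x reported.

```python
from typing import List, Tuple


def depth_profile(genome_len: int, hits: List[Tuple[int, int]],
                  circular: bool = True) -> List[int]:
    """Per-base read depth from (start, length) hits with 0-based starts."""
    depth = [0] * genome_len
    for start, length in hits:
        for offset in range(length):
            pos = start + offset
            if pos >= genome_len:
                if not circular:
                    break  # read runs off the end of a linear reference
                pos %= genome_len  # wrap around a circular genome
            depth[pos] += 1
    return depth


def mean_depth(depth: List[int]) -> float:
    """Average depth over all reference positions."""
    return sum(depth) / len(depth)


profile = depth_profile(10, [(0, 4), (2, 4), (8, 4)])
print(profile, mean_depth(profile))
```

A real pipeline would read the hit positions from the aligner's output rather than a hand-written list.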
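Regions of the reference that are absent from the sequenced strain can be listed automatically by scanning a depth profile for runs of zero coverage. This is a minimal sketch; the `min_len` cut-off is a hypothetical parameter for ignoring very short gaps, and the depth list is assumed to hold one integer per reference position.

```python
from typing import List, Tuple


def zero_coverage_runs(depth: List[int], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where depth is zero."""
    runs = []
    start = None
    for pos, d in enumerate(depth):
        if d == 0:
            if start is None:
                start = pos  # a new zero-coverage run begins here
        elif start is not None:
            if pos - start >= min_len:
                runs.append((start, pos))
            start = None
    # Close a run that extends to the end of the reference.
    if start is not None and len(depth) - start >= min_len:
        runs.append((start, len(depth)))
    return runs


print(zero_coverage_runs([3, 0, 0, 0, 5, 0, 4, 0, 0], min_len=2))
```

Intersecting the reported intervals with gene annotations would then give a list of candidate absent genes.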
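The triplication estimate follows from simple arithmetic: a region at depth 120 against a genome-wide average of 40 gives 120 / 40 = 3 copies. A small sketch of that calculation, assuming a per-base depth list and a hypothetical half-open region (start, end):

```python
from typing import List


def copy_number(depth: List[int], start: int, end: int, average_depth: float) -> int:
    """Estimate copy number as mean region depth over average depth, rounded."""
    region = depth[start:end]
    region_mean = sum(region) / len(region)
    return round(region_mean / average_depth)


# Toy profile: average depth 40 with a 5-base region at depth 120.
toy = [40] * 10 + [120] * 5 + [40] * 10
print(copy_number(toy, 10, 15, average_depth=40))
```

The estimate is only as good as the evenness of coverage, so ratios that fall between integers deserve a closer look.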
    • SNPs ● SNPs appear as dips/pinches in the coverage graph ● LA1299 gene has 4 possible SNPs relative to ref. ● Rest of gene has average coverage
    • Repairing 454 data ● 454 has “homopolymer” errors ● Loses track if same base > 3 times in a row ● Traditional assemblers don't like too many indels or frame shifts ● 454 developed Newbler assembler ● Challenging for hybrid assemblies ● What if we could “repair” our 454 data?
    • 454 Repair Guide ● One sample with 454 and Illumina reads ● Get a read mapper supporting indels ● Align all your Illumina reads to 454 data ● If sufficient unambiguous depth – correct the 454 sequence! ● Can apply to old closed sequences, 454 contigs, 454 reads etc. ● Find old errors via resequencing
    • Example repair
      >FF6ELPM06G1HYY original 180bp
      AAATCTAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT
      Sequence        Pos  Change type       Old  New  Evidence
      FF6ELPM06G1HYY  11   insertion-before  -    A    "A"x166
      FF6ELPM06G1HYY  61   insertion-before  -    A    "A"x212 "-"x12
      FF6ELPM06G1HYY  92   insertion-before  -    A    "A"x368 "-"x1
      >FF6ELPM06G1HYY repaired 183bp
      AAATCTAAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT
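The repair table can be applied mechanically. The sketch below is not the tool used in the talk; it assumes positions are 1-based and that "insertion-before" means the new base goes immediately before that position, which matches the first edit in the example (an extra A before position 11). The 0.9 support threshold is a hypothetical choice for "sufficient unambiguous depth".

```python
from typing import List, Tuple


def is_supported(reads_for: int, reads_against: int, min_fraction: float = 0.9) -> bool:
    """Accept an edit only if most overlapping reads agree (threshold is assumed)."""
    total = reads_for + reads_against
    return total > 0 and reads_for / total >= min_fraction


def apply_insertions(seq: str, edits: List[Tuple[int, str]]) -> str:
    """Insert each base before its 1-based position in the original sequence."""
    # Work from right to left so earlier positions are not shifted.
    for pos, base in sorted(edits, reverse=True):
        seq = seq[:pos - 1] + base + seq[pos - 1:]
    return seq


# First 20 bases of the original 454 read and the first edit from the table.
prefix = "AAATCTAAAAGAATAGTCGT"
if is_supported(reads_for=166, reads_against=0):
    print(apply_insertions(prefix, [(11, "A")]))
```

Applying edits from the highest position downwards keeps every position in the coordinates of the original read, so the table can be used as written.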
    • Trimming short reads ● Quality worsens toward 3' end ● Many reads have “N” basecalls ● Variation across flowcell/slide ● Will reduce data size ● Trade quality for depth ● Is it worth it?
    • Should I trim? ● For 36 bp – Results are mixed – Usually best NOT to trim – Depth will “fix” most errors ● For 75+ bp – 3' quality can be very poor – Seems best to trim – Not all reads need trimming ● More research needed
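One simple way to trim is to walk back from the 3' end and drop bases until a base of acceptable quality is reached. This is a sketch of that idea only, not a recommendation from the talk; the quality cut-off of 20 and the ASCII offset of 64 are assumptions, and the sequence and quality strings must have equal length.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 64) -> Tuple[str, str]:
    """Remove trailing bases that are N or below min_q (both cut-offs assumed)."""
    end = len(seq)
    while end > 0 and (seq[end - 1] == "N" or ord(qual[end - 1]) - offset < min_q):
        end -= 1
    return seq[:end], qual[:end]


print(trim_3prime("ACGTACNN", "aaaaaaBB"))
```

Reads that are already good to the last base pass through unchanged, which fits the observation that not all reads need trimming.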
    • Conclusion ● Short read mapping is a powerful tool for genomic discovery – Automated analysis, e.g. SNPs – Visualization, e.g. depth/coverage graphs – Repairing longer read data ● Still need de novo assembly for unmapped reads
    • Contact me Web http://www.vicbioinformatics.com/ Email torsten.seemann@infotech.monash.edu.au