Benchmarking short-read mapping programs

These slides were made by Stella Hartono, in the Korf Lab, UC Davis. For a rotation project in graduate school, she benchmarked the performance of various short-read mapping programs using simulated datasets.

Published in: Education, Business, Technology

Transcript

  • 1. Benchmarking DNA mapping programs. Stella Hartono, Tagkopoulos and Korf Labs
  • 2. INTRODUCTION
  • 3. WHY BENCHMARK?
      • There are 20+ mapping programs: Bowtie, Bowtie2, Eland, BFAST, BWA, GMAP, MAQ, MOSAIK, RMAP, Zoom, SHRiMP, SOAP2, etc.
      • Which is the best program?
          • It depends on how you define "best"
          • It depends on the data, time, available resources, and expertise
  • 4. WHAT HAPPENS IF YOU DON'T BENCHMARK?
      • You choose a program at random, by top Google hit, or because "a friend told me"
      • You might use the wrong program for your data
      • Results might not be as accurate as reported
      • Mapping might take too much time
  • 5. MY PROJECT
      • Question: What is the best mapping program for short-read (Illumina) and long-read (PacBio) data?
      • I used 3 main criteria:
          • Accuracy
          • Speed
          • Memory usage
  • 6. METHODS
  • 7. DATA IS GENERATED USING A READ SIMULATION PROGRAM
      • To assess accuracy, I simulated reads tagged with their correct coordinates
      • Dwgsim: DNAA Whole Genome Simulator
          • Available on GitHub
      • Outputs reads that mimic various sequencing platforms
          • Illumina
          • PacBio
          • IonTorrent
      • Has a feature that evaluates the results generated by mapping programs
  • 8. READ DATA PARAMETERS
      • Genome sequence used: human genome (hg19)
      • Chr 1: 50-60 Mb (chosen to represent the average human genome)
          • Dwgsim randomly "chops" up the genomic sequence file
      • Illumina-like reads
          • 100 bp long
          • 0.5 to 2% base substitution (increasing along the read)
      • PacBio-like reads
          • 3000 bp long
          • 16% random error rate, made up of 14% indels and 2% base substitution
      • Coverage: 4x and 20x
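The read lengths and coverages on slide 8 imply rough dataset sizes. The sketch below assumes the usual definition of coverage (reads × read length / region length) and single-end reads; neither assumption is stated on the slides, and `reads_for_coverage` is a hypothetical helper, not part of Dwgsim.

```python
import math

def reads_for_coverage(region_bp: int, read_bp: int, coverage: float) -> int:
    """Smallest read count such that reads * read_bp / region_bp >= coverage."""
    return math.ceil(coverage * region_bp / read_bp)

# chr1:50-60 Mb, as on the slide, is a 10 Mb region.
REGION_BP = 10_000_000

for label, read_bp in (("Illumina-like", 100), ("PacBio-like", 3000)):
    for coverage in (4, 20):
        n_reads = reads_for_coverage(REGION_BP, read_bp, coverage)
        print(f"{label:14s} {coverage:>2}x: {n_reads:,} reads")
```

Under these assumptions, the Illumina-like sets come to 400,000 reads at 4x and 2,000,000 reads at 20x, while the PacBio-like sets are about 30 times smaller in read count.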
  • 9. MAPPING PROGRAMS
      • There are 20+ mapping programs available
      • Ideally, I would try all of them, but within a 1-month rotation I was only able to try 8:
          1. BWA
          2. Bowtie2
          3. MAQ
          4. Soap2
          5. Rmap: output format can't be evaluated by Dwgsim
          6. SHRiMP: output format can't be evaluated by Dwgsim
          7. SSAHA2: very slow (10-20x slower)
          8. Novoalign: very slow (10-20x slower)
  • 10. RESULTS
  • 11. ILLUMINA-LIKE READ: ACCURACY
      • Accuracy = (reads mapped correctly / total reads) × 100%
      • BWA and Bowtie2 have the best accuracy
      • Soap2 has the lowest accuracy
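The accuracy formula on slide 11 can be sketched as a small function. This is an illustration only: in the project, the evaluation was done by Dwgsim's own evaluation feature, and the `(true_start, mapped_start)` tuples, the `tolerance` parameter, and the toy data below are all hypothetical.

```python
from typing import Iterable, Optional, Tuple

# (true_start, mapped_start); mapped_start is None when the read was not mapped.
# Assumption: all reads come from one region, so the chromosome is ignored.
ReadResult = Tuple[int, Optional[int]]

def mapping_accuracy(results: Iterable[ReadResult], tolerance: int = 0) -> float:
    """Percentage of all reads mapped within `tolerance` bp of their true start."""
    total = 0
    correct = 0
    for true_start, mapped_start in results:
        total += 1
        if mapped_start is not None and abs(mapped_start - true_start) <= tolerance:
            correct += 1
    if total == 0:
        raise ValueError("no reads to evaluate")
    return 100.0 * correct / total

# Toy data: two exact hits, one read placed 10 bp off, one unmapped read.
toy = [(100, 100), (250, 250), (400, 390), (900, None)]
print(mapping_accuracy(toy))                # 50.0
print(mapping_accuracy(toy, tolerance=10))  # 75.0
```

Because the denominator is the total number of reads, unmapped reads count against accuracy, which matches the formula on the slide.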
  • 12. ILLUMINA-LIKE READ: SPEED
      • Bowtie2 is the slowest
      • For each program, run time scales linearly with coverage (the 20x run takes about 5 times as long as the 4x run)
  • 13. ILLUMINA-LIKE READ: MEMORY
      • Bowtie2 uses the least memory
      • All programs except MAQ use about the same memory at both coverages
      • Memory used by MAQ increased ~4 times at 20x coverage
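The slides do not say how run time and memory were measured. One possible approach on a Unix system is sketched below using Python's standard `resource` and `subprocess` modules; `run_and_measure` is a hypothetical helper, and the command being timed is a stand-in rather than a real mapper invocation.

```python
import resource
import subprocess
import sys
import time

def run_and_measure(cmd):
    """Run `cmd` and return (wall-clock seconds, peak RSS of child processes).

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS, and it
    is the maximum over all children waited for so far, so measure one
    program per Python process to keep the numbers separate.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss

# Stand-in command; replace it with the mapper's actual command line.
elapsed, peak_rss = run_and_measure([sys.executable, "-c", "pass"])
print(f"wall time: {elapsed:.3f} s, peak RSS: {peak_rss}")
```

Running the same command on the 4x and 20x datasets and comparing the two wall times would be one way to check the linear scaling reported on slide 12.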
  • 14. ILLUMINA-LIKE READ: OVERALL
    Ranking table (lower is better):

    Program   Accuracy     Speed   Memory
    BWA       1 (94-95%)   1       2 (150 MB)
    Bowtie2   1 (94-95%)   4       1 (80 MB)
    MAQ       3 (90-91%)   1       4 (300-1200 MB)
    Soap2     4 (71-82%)   1       3 (650 MB)

      • BWA is accurate, fast, and quite memory efficient
      • Bowtie2 is accurate and memory efficient, but slow
      • MAQ is fairly accurate and fast, but uses a lot of memory
      • Soap2 is fast, but not very accurate, and uses a lot of memory
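One simple way to read the ranking table is to sum each program's ranks across the three criteria. The slides do not combine the ranks this way, so the equal weighting below is an assumption; only the rank values themselves come from slide 14.

```python
# Ranks copied from the slide 14 table (lower is better).
ranks = {
    "BWA":     {"accuracy": 1, "speed": 1, "memory": 2},
    "Bowtie2": {"accuracy": 1, "speed": 4, "memory": 1},
    "MAQ":     {"accuracy": 3, "speed": 1, "memory": 4},
    "Soap2":   {"accuracy": 4, "speed": 1, "memory": 3},
}

def total_rank(program_ranks):
    """Sum of ranks with equal weight per criterion (an assumption)."""
    return sum(program_ranks.values())

# sorted() is stable, so tied programs keep their table order.
ordered = sorted(ranks, key=lambda name: total_rank(ranks[name]))
for name in ordered:
    print(f"{name:8s} {total_rank(ranks[name])}")
```

With equal weights, BWA has the lowest total (4), followed by Bowtie2 (6), with MAQ and Soap2 tied at 8. A different weighting, for example one that prioritizes accuracy, could change the order of the lower-ranked programs.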
  • 15. PACBIO-LIKE READ
      • All programs except BWA failed to map any reads
      • The newest BWA version has a function specific to PacBio reads
  • 16. CONCLUSION Benchmarking 4 Mapping Programs (BWA, Bowtie2, MAQ, Soap2) Criteria: Accuracy, Speed, and Memory Illumina-like Reads (100bp, 0.5-2% substitution rate)  BWA is the best for Illumina-like Data Pacbio-like Reads (3000bp, 4% indels, 2% substitution)  All but BWA failed  BWA is the best for Pacbio-like Read  High accuracy (~90%)
  • 17. CONCLUSION It takes a lot of effort to benchmark programs, but the results are useful From this rotation, I learned that BWA seems to be the best for mapping both short and long read data Future Directions:  Different data types (Nanopore 60kb reads?)  Benchmark more programs  Fine tune parameters for each programs
  • 18. ACKNOWLEDGEMENTS• UC Davis GGG for funding• My overlords in Tagkopoulos and Korf Lab: Ilias Tagkopoulos Ian Korf• Everyone else in the lab! • Vadim, Jiyeon, Eren, Linh, Ted, Keith B, Keith D, Natalie, Ken, Paul, Abby, Yen, Kristen, Matt, Daniel, Danielle,
