Benchmarking short-read mapping programs


These slides were made by Stella Hartono, in the Korf Lab, UC Davis. For a rotation project in graduate school, she benchmarked the performance of various short-read mapping programs using simulated datasets.


  1. 1. Benchmarking DNA mapping programs. Stella Hartono, Tagkopoulos and Korf Labs
  3. 3. WHY DO BENCHMARKING?
     • There are 20+ mapping programs: Bowtie, Bowtie2, Eland, BFAST, BWA, GMAP, MAQ, MOSAIK, RMAP, Zoom, SHRiMP, SOAP2, etc.
     • Which is the best program? It depends on how you define "best", and on your data, time, available resources, and expertise
  4. 4. WHAT HAPPENS IF YOU DON'T BENCHMARK?
     • You choose a program at random, from the top Google hit, or because "a friend told me"
     • You might use the wrong program for your data
     • Results might not be as accurate as reported
     • Mapping might take too much time
  5. 5. MY PROJECT Question: What is the best mapping program for short-read (Illumina) and long-read (PacBio) data? I used 3 main criteria: • Accuracy • Speed • Memory usage
  6. 6. METHODS
  7. 7. DATA IS GENERATED USING A READ SIMULATION PROGRAM
     • To assess accuracy, I simulated reads tagged with their correct coordinates
     • Dwgsim: DNAA Whole Genome Simulator
       • Available on GitHub
       • Outputs reads that mimic various sequencing platforms: Illumina, PacBio, IonTorrent
       • Has a feature that evaluates the results generated by mapping programs
  8. 8. READ DATA PARAMETERS
     • Genome sequence used: human genome (hg19), chr1:50-60 Mb (chosen to represent the average human genome)
     • Dwgsim randomly "chops" up the genomic sequence file
     • Illumina-like reads: 100 bp long, 0.5 to 2% base substitution (increasing along the read)
     • PacBio-like reads: 3000 bp long, 16% random error rate, made up of 14% indels and 2% base substitution
     • Coverage: 4x and 20x
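As a rough illustration of what these parameters imply, the sketch below estimates how many reads each simulated dataset contains, using coverage = (number of reads × read length) / region length. It assumes single (unpaired) reads and a 10 Mb region; the slides do not state whether reads were paired, and `reads_for_coverage` is a hypothetical helper name.

```python
def reads_for_coverage(region_length_bp: int, read_length_bp: int, coverage: float) -> int:
    """Return the read count N such that N * read_length / region_length ~= coverage."""
    return round(coverage * region_length_bp / read_length_bp)


REGION_BP = 10_000_000  # chr1:50-60 Mb, i.e. a 10 Mb region

# Read lengths are taken from the slide; unpaired reads are an assumption.
for label, read_len in (("Illumina-like", 100), ("PacBio-like", 3000)):
    for cov in (4, 20):
        n = reads_for_coverage(REGION_BP, read_len, cov)
        print(f"{label:14s} {cov:>2}x coverage: {n} reads")
```

Under these assumptions the Illumina-like datasets hold hundreds of thousands to millions of reads, while the PacBio-like datasets hold only tens of thousands of much longer reads.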
  9. 9. MAPPING PROGRAMS
     • There are 20+ mapping programs available
     • Ideally I would have tried all of them, but within a one-month rotation I was only able to try 8:
       1. BWA
       2. Bowtie2
       3. MAQ
       4. Soap2
       5. Rmap: output format can't be evaluated by Dwgsim
       6. SHRiMP: output format can't be evaluated by Dwgsim
       7. SSAHA2: very slow (10-20x slower)
       8. Novoalign: very slow (10-20x slower)
  10. 10. RESULTS
  11. 11. ILLUMINA-LIKE READ: ACCURACY
      Accuracy = (reads mapped correctly / total reads) × 100%
      • BWA and Bowtie2 have the best accuracy
      • Soap2 has the lowest accuracy
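The accuracy definition above can be sketched in a few lines. This is not Dwgsim's evaluator: the dictionary layout, the `tolerance` parameter, and the toy reads are all assumptions made for illustration. Unmapped reads count against accuracy because the denominator is the total number of simulated reads.

```python
def mapping_accuracy(truth, mapped, tolerance=0):
    """Percentage of simulated reads placed at their true position.

    truth  : dict read_id -> (chrom, pos), the simulated origin of every read
    mapped : dict read_id -> (chrom, pos), as reported by a mapper (unmapped reads absent)
    tolerance : allowed distance in bp between true and reported position (an assumption)
    """
    if not truth:
        raise ValueError("no simulated reads to evaluate")
    correct = 0
    for read_id, (true_chrom, true_pos) in truth.items():
        hit = mapped.get(read_id)
        if hit is None:
            continue  # unmapped read: counted in the total, not in the correct set
        chrom, pos = hit
        if chrom == true_chrom and abs(pos - true_pos) <= tolerance:
            correct += 1
    return 100.0 * correct / len(truth)


# Toy data (hypothetical): four simulated reads, one unmapped, one misplaced.
truth = {"r1": ("chr1", 100), "r2": ("chr1", 500), "r3": ("chr1", 900), "r4": ("chr1", 1300)}
mapped = {"r1": ("chr1", 100), "r2": ("chr1", 500), "r3": ("chr1", 7000)}
print(mapping_accuracy(truth, mapped))
```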
  12. 12. ILLUMINA-LIKE READ: SPEED
      • Bowtie2 is the slowest
      • For each program, run time scales linearly with coverage (the 20x run takes about 5 times as long as the 4x run)
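One way to check the linear-scaling observation is to compare the ratio of run times with the ratio of coverages. The sketch below does that; the timings are hypothetical placeholders, not measurements from the slides.

```python
def scaling_factor(time_low, time_high, cov_low, cov_high):
    """Return (time ratio) / (coverage ratio); a value near 1.0 indicates linear scaling."""
    return (time_high / time_low) / (cov_high / cov_low)


# Hypothetical wall-clock times in seconds for one mapper at 4x and 20x coverage.
factor = scaling_factor(time_low=60.0, time_high=300.0, cov_low=4, cov_high=20)
print(f"scaling factor: {factor:.2f}")
```

A factor well above 1.0 would suggest worse-than-linear scaling, and a factor below 1.0 would suggest a fixed start-up cost (such as index loading) dominating the smaller run.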
  13. 13. ILLUMINA-LIKE READ: MEMORY
      • Bowtie2 uses the least memory
      • All programs except MAQ use about the same memory at both coverages
      • Memory used by MAQ increased ~4 times at 20x coverage
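The slides do not say how run time and memory were recorded. As one possible approach on a Unix system, the sketch below times a command and reads the peak resident memory of child processes from Python's standard `resource` module. The helper name `run_and_measure` is hypothetical, and a trivial Python command stands in for a real mapper invocation.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run cmd and return (elapsed seconds, peak child RSS as reported by the OS).

    Caveats: ru_maxrss is in kilobytes on Linux but bytes on macOS, and
    RUSAGE_CHILDREN reports the maximum over all child processes waited for so far,
    so each measurement should be made in a fresh Python process.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss


# Stand-in command; a real benchmark would pass the mapper's command line here.
elapsed, peak_rss = run_and_measure([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.3f} s, peak RSS: {peak_rss}")
```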
  14. 14. ILLUMINA-LIKE READ: OVERALL
      Ranking table (lower rank is better):

      Program   Accuracy     Speed   Memory
      BWA       1 (94-95%)   1       2 (150 MB)
      Bowtie2   1 (94-95%)   4       1 (80 MB)
      MAQ       3 (90-91%)   1       4 (300-1200 MB)
      Soap2     4 (71-82%)   1       3 (650 MB)

      • BWA is accurate, fast, and quite memory efficient
      • Bowtie2 is accurate and memory efficient, but slow
      • MAQ is fairly accurate and fast, but uses a lot of memory
      • Soap2 is fast, but not very accurate, and uses a lot of memory
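The ranks in the table follow "competition" ranking, where tied programs share a rank and the next rank is skipped (1, 1, 3, 4). A minimal sketch of that scheme is below. It assumes each reported range is summarized by its upper end, which reproduces the accuracy and memory columns; the slides do not say how ranges were reduced to a rank.

```python
def competition_rank(scores, higher_is_better=False):
    """Rank items so that ties share a rank: rank = 1 + number of strictly better items."""
    ranks = {}
    for name, value in scores.items():
        if higher_is_better:
            better = sum(1 for v in scores.values() if v > value)
        else:
            better = sum(1 for v in scores.values() if v < value)
        ranks[name] = better + 1
    return ranks


# Upper ends of the ranges reported in the table (an assumption about how to summarize them).
accuracy_pct = {"BWA": 95, "Bowtie2": 95, "MAQ": 91, "Soap2": 82}
memory_mb = {"BWA": 150, "Bowtie2": 80, "MAQ": 1200, "Soap2": 650}

accuracy_ranks = competition_rank(accuracy_pct, higher_is_better=True)
memory_ranks = competition_rank(memory_mb)
print(accuracy_ranks)
print(memory_ranks)
```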
  15. 15. PACBIO-LIKE READ
      • All programs except BWA failed to map any reads
      • The newest BWA has a function specific to PacBio data
  16. 16. CONCLUSION Benchmarking 4 Mapping Programs (BWA, Bowtie2, MAQ, Soap2) Criteria: Accuracy, Speed, and Memory Illumina-like Reads (100bp, 0.5-2% substitution rate)  BWA is the best for Illumina-like Data Pacbio-like Reads (3000bp, 4% indels, 2% substitution)  All but BWA failed  BWA is the best for Pacbio-like Read  High accuracy (~90%)
  17. 17. CONCLUSION It takes a lot of effort to benchmark programs, but the results are useful From this rotation, I learned that BWA seems to be the best for mapping both short and long read data Future Directions:  Different data types (Nanopore 60kb reads?)  Benchmark more programs  Fine tune parameters for each programs
  18. 18. ACKNOWLEDGEMENTS
      • UC Davis GGG for funding
      • My overlords in the Tagkopoulos and Korf Labs: Ilias Tagkopoulos, Ian Korf
      • Everyone else in the lab!
        • Vadim, Jiyeon, Eren, Linh, Ted, Keith B, Keith D, Natalie, Ken, Paul, Abby, Yen, Kristen, Matt, Daniel, Danielle,