
2014 abic-talk


Talk at ABiC 2014.

Published in: Science

  1. 1. BUILDING BETTER BIOINFORMATICS SOFTWARE (WHY THE HECK NOT?) C. Titus Brown ctb@msu.edu Assistant Professor, MMG / CSE Michigan State University
  2. 2. BUILDING BETTER BIOINFORMATICS SOFTWARE (WHY THE HECK NOT?) C. Titus Brown ctb@msu.edu A???????? Professor, VetMed, UC Davis
  3. 3. Lansing, Michigan -> Davis, California
  4. 4. Dot plots FTW! Brown et al., 2005.
  5. 5. So I said these things… “this tipping point was exacerbated by the loss of about 80% of the world’s data scientists in the 2021 Great California Disruption.” “[ Benchmarks ] have proven to be stifling of innovation, because of the tendency to do incremental improvement.” ivory.idyll.org/blog/2014-bosc-keynote.html
  6. 6. So I said these things… “this tipping point was exacerbated by the loss of about 80% of the world’s data scientists in the 2021 Great California Disruption.” “[ Benchmarks ] have proven to be stifling of innovation, because of the tendency to do incremental improvement.” ivory.idyll.org/blog/2014-bosc-keynote.html
  7. 7. There is a real problem.
  8. 8. There is a massive profusion of software! Mick Watson, @BioMickWatson: biomickwatson.wordpress.com/2012/12/28/an-embargo-on-short-read-alignment-software/ jeffvictor.deviantart.com
  9. 9. The players, in caricature: 1. Computer scientists 2. Software engineers 3. Data scientists 4. Statisticians 5. Biologists
  10. 10. The Computer Scientist Fast, sensitive, specific – pick one.
  11. 11. The (Good) Software Engineer Does it have any unit tests?
  12. 12. The Data Scientist How quickly can I run it, starting from scratch?
  13. 13. The Statistician What gives me the best p-value?
  14. 14. The Biologist What gives me the most publishable result?
  15. 15. Problems all along the way… 1. Computer scientists: build delicate, hard-to-use, very high performance software that solves the wrong problem. 2. Software engineers: all work for Google. 3. Data scientists: use the wrong programs -- because they’re actually usable. 4. Statisticians: only get invited into the project six months after all the data is generated. 5. Biologists: are desperate to find any one of the above who knows any biology at all.
  16. 16. Example: de novo mRNAseq Quality control Assembly Annotation Differential expression Every one of these steps is still an open research problem, with computational challenges and direct biological implications!
  17. 17. So: 1. This is all still research. 2. We’re unlikely to ever find out the right answer, but will merely settle for one that’s not obviously terrible. 3. Everything is changing all the time: the data generation tech, the hardware, the software, the theory... 4. Who are any of us to judge the value of any particular approach?
  18. 18. So: 1. This is all still research 2. We’re unlikely to ever find out the right answer, but will merely settle for one that’s not obviously terrible. 3. Everything is changing all the time: the data generation tech, the hardware, the software, the theory... 4. Who are any of us to judge the value of any particular approach? (Well, sometimes me, when I’m peer reviewer #2.)
  19. 19. All hands on deck! Quality control Assembly Annotation Differential expression We need it all! • Fast/sensitive/specific algorithms; • Solid software; • Statistical robustness; • Biological insight; • Well-trained data scientists. (The best bioinformaticians have multiple personality disorder, or so I tell myself.)
  20. 20. That sort of explains why. But this still leaves us with too many choices.
  21. 21. Example: de novo mRNAseq Quality control (10-20 packages) x Assembly (2-5 packages) x Annotation (5-10 packages) x Differential expression (20-40 packages) = 2,000-40,000 combinations
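The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the package-count ranges come from the slide, but pairing each range with a stage by position is an assumption, and `STAGES` and `combination_range` are names made up for illustration.

```python
from math import prod

# Package-count ranges per stage, as listed on the slide.
# Assumption: the ranges pair with the stages in the order shown.
STAGES = {
    "quality control": (10, 20),
    "assembly": (2, 5),
    "annotation": (5, 10),
    "differential expression": (20, 40),
}


def combination_range(stages):
    """Return (fewest, most) distinct workflows, picking one package per stage."""
    fewest = prod(low for low, _ in stages.values())
    most = prod(high for _, high in stages.values())
    return fewest, most


fewest, most = combination_range(STAGES)
print(f"{fewest:,} to {most:,} possible workflows")
```

Even the low end is far more workflows than anyone will evaluate by hand, which is the point of the slide.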
  22. 22. What’s the solution!? Ultimately? All of… Whole-workflow evaluations of tools. Small tools (see “small tools manifesto”). Automation! Simulations, synthetic data, mock data, real data. Antagonistic data set development (**). Tool development driven with use cases. Build based on solid command-line workflows. Those things called “controls”. …and more
  23. 23. Trying out a few approaches…
  24. 24. 1. Automate the hell out of everything (Ubuntu 14.04, git, make, IPython Notebook, LaTeX)
  25. 25. Time from publication of KAnalyze to our 100% reproducible re-evaluation? ~8 hours.
  26. 26. 2. Protocols, not pipelines. STOP HIDING THE ANALYSIS STEPS. BIG BLACK BOXES ARE NOT SMALL TOOLS!
  27. 27. Write down what you’re doing… https://khmer-protocols.readthedocs.org/
  28. 28. …and add automated end-to-end tests. cf. “literate ReSTing”
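One way to test a written protocol end to end is to pull the commands out of the document and run them. The sketch below is not the lab's actual tooling. It assumes the protocol is reStructuredText and that every command lives in an indented literal block introduced by a line ending in `::`. The names `extract_literal_blocks`, `run_blocks`, and `PROTOCOL` are hypothetical.

```python
import subprocess
import textwrap


def _flush(blocks, lines):
    """Dedent the collected lines and store them if anything is left."""
    text = textwrap.dedent("\n".join(lines)).strip()
    if text:
        blocks.append(text)


def extract_literal_blocks(rst_text):
    """Return the indented blocks that follow a line ending in '::'."""
    blocks = []
    current = None  # None means we are not inside a literal block
    for line in rst_text.splitlines():
        if current is not None:
            # Blank and indented lines belong to the open block.
            if not line.strip() or line[0] in " \t":
                current.append(line)
                continue
            # The first unindented line closes the block.
            _flush(blocks, current)
            current = None
        if line.rstrip().endswith("::"):
            current = []
    if current is not None:
        _flush(blocks, current)
    return blocks


def run_blocks(blocks):
    """Run each block in a shell; a non-zero exit raises CalledProcessError."""
    for block in blocks:
        subprocess.run(block, shell=True, check=True)


# A made-up protocol fragment, standing in for a real document.
PROTOCOL = """\
Quality control
===============

Trim the reads::

   echo trimming reads
   echo done trimming

Then check the output::

   echo checking output

That is all.
"""

commands = extract_literal_blocks(PROTOCOL)
for number, block in enumerate(commands, start=1):
    print(f"--- block {number} ---")
    print(block)
```

Calling `run_blocks(commands)` from a scheduled job or a commit hook would turn the document itself into the test: if a step in the write-up stops working, the run fails.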
  29. 29. 3. Drive sustainable software development with use cases.
  30. 30. …that are explicit…
  31. 31. …versioned…
  32. 32. …and automated.
  33. 33. 4. Put everything in the cloud and measure it. ~40 hours; m1.xlarge Eel Pond mRNAseq protocol.
  34. 34. 5. Compare programs and workflows fairly.
      Genome Reference
                Quality Filtered  Diginorm  Partition  Reinflation
      Velvet    -                 80.90     83.64      84.57
      IDBA      90.96             91.38     90.52      88.80
      SPAdes    90.42             90.35     89.57      90.02
      Mis-assembled Contig Length
                Quality Filtered  Diginorm  Partition  Reinflation
      Velvet    -                 52071358  44730449   45381867
      IDBA      21777032          20807513  17159671   18684159
      SPAdes    28238787          21506019  14247392   18851571
      Kalamazoo metagenome protocol run on mock data from Shakya et al., 2013. Also! Tip o’ the hat to Michael Barton, nucleotid.es
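Once every assembler has been run on the same data through the same protocol, the comparison itself is mechanical. As a rough sketch, the snippet below takes the mis-assembled contig lengths from this slide and picks the best assembler for each treatment. It assumes that the four values per assembler line up with Quality Filtered, Diginorm, Partition, and Reinflation in that order, and that a smaller mis-assembled length is better. `best_per_treatment` is a name invented here.

```python
TREATMENTS = ["Quality Filtered", "Diginorm", "Partition", "Reinflation"]

# Mis-assembled contig length per assembler, copied from the slide.
# None marks the cell shown as "-" (no Velvet result for that treatment).
MISASSEMBLED = {
    "Velvet": [None, 52071358, 44730449, 45381867],
    "IDBA": [21777032, 20807513, 17159671, 18684159],
    "SPAdes": [28238787, 21506019, 14247392, 18851571],
}


def best_per_treatment(table, treatments, lower_is_better=True):
    """Map each treatment to the (assembler, value) pair with the best score."""
    pick = min if lower_is_better else max
    winners = {}
    for column, treatment in enumerate(treatments):
        scored = [
            (values[column], assembler)
            for assembler, values in table.items()
            if values[column] is not None
        ]
        value, assembler = pick(scored)
        winners[treatment] = (assembler, value)
    return winners


for treatment, (assembler, value) in best_per_treatment(
    MISASSEMBLED, TREATMENTS
).items():
    print(f"{treatment}: {assembler} ({value:,} bp mis-assembled)")
```

The same function works for the Genome Reference table with `lower_is_better=False`, under the further assumption that a higher value there is the better one.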
  35. 35. A super fun way to do reviews! • “What a nice new transcriptome assembler! Interesting how it doesn’t perform that well on my 10 test data sets.” • “Hey, so you make these claims, but I ran your code, and…” • “Fun fact! Your source code has a syntax error in it – even Perl has standards! You’re still sure that’s the script you used?” • “Here – use our evaluation pipeline, since you clearly need something better.” The Brown Lab: taking passive aggression to a whole new level!
  36. 36. We breed our own problems. Reward the behavior you want to see. Let’s level up the field, already.
  37. 37. What are we working on, scientifically speaking?
  38. 38. Streaming error correction of genomic, transcriptomic, metagenomic data via graph alignment Jason Pell, Jordan Fish, Michael Crusoe
  39. 39. Error correction on simulated E. coli data
                TP (corrected)     FP (mistakes)  TN (OK)      FN (missed)
      1.2-pass  3,494,631 (99.8%)  3,865          460,601,171  5,533 (2.8%)
      1% error rate, 100x coverage. Michael Crusoe, Jordan Fish, Jason Pell
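The counts on this slide are enough to recompute the usual summary rates. The sketch below assumes the standard definitions (sensitivity = TP / (TP + FN), precision = TP / (TP + FP), specificity = TN / (TN + FP)). The slide does not say how its percentages were derived, so treat these formulas as assumptions. The sensitivity computed this way does round to the 99.8% shown next to the TP count.

```python
# Counts for the 1.2-pass run, copied from the slide.
TP = 3_494_631    # errors that were corrected
FP = 3_865        # correct bases that were wrongly changed
TN = 460_601_171  # correct bases left alone
FN = 5_533        # errors that were missed


def sensitivity(tp, fn):
    """Fraction of real errors that were corrected."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of the corrections made that were right."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """Fraction of correct bases that were left untouched."""
    return tn / (tn + fp)


print(f"sensitivity: {sensitivity(TP, FN):.1%}")
print(f"precision:   {precision(TP, FP):.3%}")
print(f"specificity: {specificity(TN, FP):.4%}")
```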
  40. 40. Error correction  variant calling Single pass, reference free, tunable, streaming online variant calling. (Hey, look, ma – a new mapper!)
  41. 41. Infrastructure: distributed graph database server. [Diagram: web interface + API; compute server (Galaxy? Arvados?); data/info; raw data sets; public servers, "walled garden" server, private server; graph query layer; upload/submit (NCBI, KBase); import (MG-RAST, SRA, EBI).] ivory.idyll.org/blog/2014-moore-ddd-talk.html
  42. 42. AGTA talk on Monday • 3:15-4pm – come see me try to convince biomedical researchers to share their data! • 4-4:30pm – come listen to Ana Conesa talk about multi-omics data integration! Thanks!