Discussing three community strategies to improve bioinformatics: collective open source development (Biopython), critical assessment challenges (CAFA), and crowdsourcing (maize phenomics).
The roles communities play in improving bioinformatics: better software, better algorithms, better data
1. The roles communities play in improving bioinformatics: better software, better algorithms, better data
Iddo Friedberg
Iowa State University
@iddux iddo-friedberg.net
3. DNA is Information
● DNA is composed of ordered nucleotides
● An alphabet of 4 characters
● Genome: the complete set of genetic material in an organism
attttaaccgcaagcta...
“The cellular machinery”
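A minimal sketch of the "DNA is information" idea, using only the fragment visible on the slide (the rest of the sequence is elided there and is not reconstructed here). The function names `nucleotide_counts` and `bits_needed` are hypothetical helpers written for this illustration, not part of any library. With an alphabet of 4 characters, a fixed-width encoding needs log2(4) = 2 bits per nucleotide.

```python
import math
from collections import Counter

# The four-character DNA alphabet mentioned on the slide.
DNA_ALPHABET = "acgt"


def nucleotide_counts(sequence):
    """Count each of the four nucleotides in a DNA string (case-insensitive)."""
    seq = sequence.lower()
    unknown = set(seq) - set(DNA_ALPHABET)
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    counts = Counter(seq)
    return {base: counts[base] for base in DNA_ALPHABET}


def bits_needed(sequence):
    """Bits required by a fixed-width code: log2(4) = 2 bits per nucleotide."""
    bits_per_base = int(math.log2(len(DNA_ALPHABET)))
    return len(sequence) * bits_per_base


# Only the visible prefix of the slide's example sequence is used.
fragment = "attttaaccgcaagcta"
print(nucleotide_counts(fragment))
print(bits_needed(fragment))
```

The validation step is a design choice for the sketch: real sequence files often contain ambiguity codes such as `n`, which this toy version rejects rather than guesses at.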
4. DNA is Information
https://www.youtube.com/watch?v=xlPDEpimzB8
5. Genomes have many features
Figure labels: positive regulation site; negative regulators stop transcription; genes; REPs
11. Managing Biopython
● A “coding culture” and standards are adhered to
– Python PEP-8 style
– Inline documentation
– Unit testing
● Dual Licensing: Python license and BSD-3
● Use version control and “social coding”: GitHub
● Distributions: Conda, pip, ...
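The coding standards listed above (PEP-8 style, inline documentation, unit testing) can be sketched in a few lines. This is an illustration only: `gc_fraction_sketch` is a hypothetical function written for this example and is not taken from the Biopython code base.

```python
import unittest


def gc_fraction_sketch(sequence):
    """Return the fraction of G and C nucleotides in a DNA sequence.

    Hypothetical helper for illustration. The comparison is
    case-insensitive, and an empty sequence yields 0.0 instead of
    raising a ZeroDivisionError.
    """
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


class GCFractionTest(unittest.TestCase):
    """Unit tests that document the expected behaviour."""

    def test_half_gc(self):
        self.assertAlmostEqual(gc_fraction_sketch("ATGC"), 0.5)

    def test_case_insensitive(self):
        self.assertAlmostEqual(gc_fraction_sketch("atgc"), 0.5)

    def test_empty_sequence(self):
        self.assertEqual(gc_fraction_sketch(""), 0.0)
```

Saved as a module, the tests can be run with `python -m unittest`. Keeping the tests next to the documented function is one way contributors can check that a change does not break existing behaviour.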
12. Why Better Software?
● Many contributors and users
● Most contributors are users: eat your own dogfood
● Low barriers to contribution and use
● Quality breeds more quality: Biopython incorporated in other software
16. Most protein functions are determined computationally, by homology inference
● Homology: originating from a common ancestor
● Often same function, but sometimes not
● Sequence similarity is a proxy for homology
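The reasoning above (similarity as a proxy for homology, homology as a hint at function) can be sketched as follows. Everything here is an assumption made for illustration: the sequences and function labels are toy data, the helpers `percent_identity` and `infer_function` are hypothetical, and the 40% identity cutoff is arbitrary. Real pipelines use alignment tools and statistical scores, not this shortcut.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of non-gap alignment columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def infer_function(hits, min_identity=40.0):
    """Transfer the function label of the most similar hit, if any.

    hits: list of (aligned_query, aligned_hit, function_label) tuples.
    Returns None when no hit reaches min_identity, reflecting that
    similar sequences often, but not always, share a function.
    """
    best_label = None
    best_identity = min_identity
    for aligned_query, aligned_hit, label in hits:
        identity = percent_identity(aligned_query, aligned_hit)
        if identity >= best_identity:
            best_identity = identity
            best_label = label
    return best_label


# Toy data, invented for this sketch.
toy_hits = [
    ("MKTAYIA", "MKTAYLA", "toy kinase"),
    ("MKTAYIA", "MQSAWLG", "toy transporter"),
]
print(infer_function(toy_hits))
```

Returning `None` below the cutoff is deliberate: a prediction transferred from a weak match is exactly the kind of annotation that needs independent assessment.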
24. The Critical Assessment of Function Annotation
“On the one hand, we have enormous “protein” databases that are replete with
errors, wishful thinking, phantoms, and uncertainties. On the other, we have a
tiny fraction of real proteins that have been studied in any depth.”
– Dan Graur
31. CAFA Timeline
Dates on the timeline: 9/2016, 2/2017, 7/2017
● CAFA3 announced with > 100,000 from 9 species
● Prediction phase ends: 147 submissions, 62 groups
● Prediction phase ends: 148 submissions, 69 methods
● 1,755 sequences collected as benchmarks for the assessment
● Experimental annotations during this period are used as ground truth in evaluation
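The time-delayed evaluation on this slide compares predictions submitted before the deadline with experimental annotations that appear afterwards. A minimal sketch of such a comparison for a single protein follows. The scoring scheme shown (precision, recall, and the best F-measure over score thresholds) is an assumption made for illustration, since these slides do not spell out the metric; the term names are placeholders, and `precision_recall` and `f_max` are hypothetical helpers.

```python
def precision_recall(predicted, truth):
    """Precision and recall of a predicted term set against ground truth."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def f_max(scored_predictions, truth, thresholds):
    """Best F-measure over the given score thresholds (single protein).

    scored_predictions: dict mapping a function term to a confidence score.
    truth: set of terms supported by later experimental annotation.
    """
    best = 0.0
    for threshold in thresholds:
        predicted = {
            term for term, score in scored_predictions.items()
            if score >= threshold
        }
        if not predicted:
            continue  # nothing predicted at this threshold
        precision, recall = precision_recall(predicted, truth)
        if precision + recall > 0:
            f_measure = 2 * precision * recall / (precision + recall)
            best = max(best, f_measure)
    return best


# Placeholder terms and scores, invented for this sketch.
scores = {"term_A": 0.9, "term_B": 0.6, "term_C": 0.2}
ground_truth = {"term_A", "term_C"}
print(f_max(scores, ground_truth, thresholds=[0.1, 0.5, 0.8]))
```

A full assessment would average over all benchmark proteins; the single-protein version is kept here only to show how delayed experimental annotations act as the reference set.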
62. Conclusions
● Communities of different types enhance and advance computational biology
– Community software development (Biopython)
– Community challenges (CAFA)
– Crowdsourcing
● Join or form a community!
63. Acknowledgements (CAFA)
Iowa State University
Huy Nguyen
Md Nafiz Hamid
Iddo Friedberg
Indiana University Bloomington
Yuxiang Jiang
Predrag Radivojac
University of Washington
Timothy Bergquist
Sean Mooney
University of Pennsylvania
Casey Greene
European Bioinformatics Institute
Maria Martin
Claire O'Donovan
George Georghiou
Dartmouth College
Deborah A. Hogan (Microbe Screen)
Kimberley A. Lewis (Microbe Screen)
Alex W. Crocker (Microbe Screen)
Giovanni Bosco (Drosophila Screen)
University of Illinois, Chicago
Constance Jeffery
64. Corn Tassel Crowdsourcing
Naihui Zhou
Zachary D. Siegel
Scott Zarecor
Nigel Lee
Darwin A. Campbell
Carson M. Andorf
Dan Nettleton
Carolyn J. Lawrence-Dill
Baskar Ganapathysubramanian
Jonathan W. Kelly
Iowa State University
Presidential Interdisciplinary Research Initiative