The roles communities play in improving bioinformatics: better software, better algorithms, better data

Iddo Friedberg
Iowa State University
@iddux iddo-friedberg.net

What is Bioinformatics?
Applying computational tools to biological problems

DNA is Information
● DNA is composed of ordered nucleotides
● An alphabet of 4 characters
● Genome: the complete set of genetic material in an organism
attttaaccgcaagcta... “The cellular machinery”

https://www.youtube.com/watch?v=xlPDEpimzB8

Genomes have many features
Figure labels: positive regulation site; negative regulators; stop transcription; genes; REPs

Genome Sizes Vary
Genome size in bases ≈ bytes

Describing Genomes
GenBank format
FASTA format
FASTQ format
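
Of the three formats, FASTA is the simplest: a header line starting with ">" followed by one or more lines of sequence. The sketch below is a minimal pure-Python reader written for illustration; it is not Biopython's implementation (Biopython provides `Bio.SeqIO.parse` for this), and the record names and sequences are made up.

```python
from io import StringIO

def parse_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record file, used only to exercise the parser.
example = StringIO(
    ">seq1 hypothetical example record\n"
    "attttaacc\n"
    "gcaagcta\n"
    ">seq2\n"
    "ggcc\n"
)
records = list(parse_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

A generator keeps memory use flat for large files, since only one record is held at a time.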

Biopython: mostly parsers and data structures...

... but many other functions
Alignment, translation, transcription
Visualization
Interfaces with common bioinformatics programs
Phylogenetics
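
To make "translation, transcription" concrete, here is a small pure-Python sketch of the three basic sequence operations. It is an illustration under simple assumptions (uppercase-normalized DNA, the standard genetic code, no ambiguity codes), not Biopython's code; in Biopython the corresponding methods live on the `Bio.Seq.Seq` object (`transcribe()`, `translate()`, `reverse_complement()`).

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")

def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons to one-letter amino acids ('*' marks a stop)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(transcribe("ATGGCC"))          # AUGGCC
print(reverse_complement("ATGC"))    # GCAT
print(translate("ATGGCCATTTGA"))     # MAI*
```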

Over 200 Contributors!
https://github.com/biopython/biopython/blob/master/CONTRIB.rst
1999 to 2018
1350 citations (9/25/2018)

Managing Biopython
● A “coding culture”: shared standards that contributors adhere to
– Python PEP-8 style
– Inline documentation
– Unit testing
● Dual Licensing: Python license and BSD-3
● Use version control and “social coding”: GitHub
● Distributions: Conda, pip, ...

Why Better Software?
● Many contributors and users
● Most contributors are users: eat your own dogfood
● Low barriers to contribution and use
● Quality breeds more quality: Biopython is incorporated in other software

How do we know what genes do?
Figure labels: positive regulation site; negative regulators; stop transcription; genes; REPs

Experiments and Observations to Determine Gene Function

Most protein functions are determined computationally, by homology inference
● Homology: originating from a common ancestor
● Often same function, but sometimes not
● Sequence similarity is a proxy for homology

Homologs, same & different functions
Locomotion
● Ground locomotion
● Water locomotion

Inferring gene or protein homology by pairwise sequence alignment
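
As an illustration of what a pairwise alignment computes, the sketch below implements global alignment with the Needleman-Wunsch dynamic program. The talk does not say which alignment method or scoring scheme is meant; the scores used here (match +1, mismatch -1, gap -1) are assumptions chosen for readability, and real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

result = needleman_wunsch("ACGT", "AGT")
print(result)  # (2, 'ACGT', 'A-GT')
```

A high alignment score indicates sequence similarity, which the previous slide treats as a proxy for homology rather than proof of it.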

Different methods to computationally predict function
Lobb & Doxey, Current Opinion in Structural Biology 38 (2016)

There are Many Methods: Which is Best?

The Critical Assessment of Functional Annotation
“On the one hand, we have enormous “protein” databases that are replete with
errors, wishful thinking, phantoms, and uncertainties. On the other, we have a
tiny fraction of real proteins that have been studied in any depth.”
– Dan Graur

Standardizing “Function” with Gene Ontology
Locomotion
● Ground locomotion
● Water locomotion

CAFA: The Critical Assessment of Functional Annotation
CAFA is an ongoing community challenge for assessing protein function prediction methods.

Assessment
Ground Truth
Assessment Metrics
Results

CAFA Timeline
● 9/2016: CAFA3 announced with > 100,000 from 9 species
● 2/2017: Prediction phase ends
– 147 submissions, 62 groups
– 148 submissions, 69 methods
● 7/2017: 1,755 sequences collected as benchmarks for the assessment
● Experimental annotations accumulated between the end of the prediction phase and benchmark collection are used as ground truth in evaluation

Metrics
Example GO terms: Binding, Nucleic Acid binding, DNA binding, RNA binding, rRNA binding
Legend: True function, Predicted function
True Positives: 2
False Positives: 2
False Negatives: 1

Precision and Recall
Precision: 2/4
Recall: 2/3
AUTHOR ZZZ
MODEL 1
KEYWORDS sequence alignment.
T96060020120 GO:0008270 0.80
T96060020120 GO:0003700 0.80
T96060020120 GO:0006351 0.80
T96060020119 GO:0005730 0.01
T96060020119 GO:0003676 0.07
T96060020119 GO:0005622 0.07
T96060020119 GO:0046872 0.07
T96060020118 GO:0008270 0.75
T96060020118 GO:0006351 0.68
T96060020118 GO:0003677 0.67
T96060020118 GO:0005634 0.67
T96060020118 GO:0006355 0.55
T96060020118 GO:0003700 0.34
Columns: Protein ID, GO term, Confidence
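
A file in the layout shown above can be read with a few lines of Python. The sketch below assumes only what the slide shows: header lines beginning with AUTHOR, MODEL or KEYWORDS, followed by whitespace-separated rows of protein ID, GO term and confidence. The function name `parse_predictions` is made up for this example, and any other record types a real submission file may contain are ignored.

```python
HEADER_KEYS = ("AUTHOR", "MODEL", "KEYWORDS")

def parse_predictions(lines):
    """Return (metadata, predictions) where predictions[protein][go_term] = confidence."""
    meta, preds = {}, {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] in HEADER_KEYS:
            meta[parts[0]] = " ".join(parts[1:])
        elif len(parts) == 3 and parts[1].startswith("GO:"):
            protein, term, confidence = parts
            preds.setdefault(protein, {})[term] = float(confidence)
        # anything else is ignored in this sketch
    return meta, preds

# A few rows copied from the slide.
sample = """AUTHOR ZZZ
MODEL 1
KEYWORDS sequence alignment.
T96060020120 GO:0008270 0.80
T96060020119 GO:0005730 0.01
T96060020118 GO:0008270 0.75
T96060020118 GO:0003700 0.34
""".splitlines()

meta, preds = parse_predictions(sample)
print(meta["AUTHOR"], sorted(preds))
```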

Precision Recall Curve
Plot axes: Recall (0 to 1) and Precision (0 to 1); example point: precision 2/4, recall 2/3
● Results are calculated for each threshold of prediction confidence
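
Sweeping the confidence threshold yields one precision/recall pair per threshold, and the F-max metric named on the result slides is the best harmonic mean along that curve. The sketch below follows the usual protein-centric definition as an assumption: precision is averaged over proteins with at least one prediction at the threshold, recall over all benchmark proteins. The proteins, terms and scores are hypothetical, and the term sets are taken to be already propagated to ancestors.

```python
def f_max(predictions, truth, steps=100):
    """Return (best F, threshold) over thresholds 1/steps, 2/steps, ..., 1."""
    best_f, best_t = 0.0, None
    for i in range(1, steps + 1):
        t = i / steps
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            called = {term for term, s in scored.items() if s >= t}
            hits = len(called & true_terms)
            recalls.append(hits / len(true_terms))
            if called:  # precision is undefined when nothing is predicted
                precisions.append(hits / len(called))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            f = 2 * pr * rc / (pr + rc)
            if f > best_f:
                best_f, best_t = f, t
    return best_f, best_t

# Hypothetical benchmark and predictions.
truth = {"P1": {"A", "B", "C"}, "P2": {"A", "D"}}
predictions = {
    "P1": {"A": 0.9, "B": 0.6, "E": 0.3},
    "P2": {"A": 0.8, "D": 0.2, "E": 0.5},
}
best_f, best_t = f_max(predictions, truth)
print(round(best_f, 4), best_t)
```

In this toy case the best score is reached at the lowest thresholds, where every prediction is kept: average precision is 2/3, average recall is 5/6, and their harmonic mean is 20/27.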

Information content of a graph
Clark & Radivojac. Bioinformatics. 29: i53 (2013).

Information-theoretic similarity between graphs
Compared: the predicted annotation graph and the true annotation graph
Measures: Missing Information, Remaining Uncertainty
Clark & Radivojac. Bioinformatics. 29: i53 (2013).
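
A minimal sketch of the two information-theoretic quantities, assuming the slide's "Missing Information" corresponds to what Clark & Radivojac call misinformation. Remaining uncertainty sums the information content of true terms that were not predicted; misinformation sums it over predicted terms that are not true; the two combine into a distance S, whose minimum over confidence thresholds is the S-min metric. The per-term information values below are invented for illustration. In practice they are estimated from annotation frequencies.

```python
from math import sqrt

# Hypothetical information content per term, in bits (made-up values).
INFO = {
    "Binding": 0.5,
    "Nucleic acid binding": 1.0,
    "DNA binding": 2.0,
    "RNA binding": 2.0,
    "rRNA binding": 3.0,
}

def remaining_uncertainty(true_terms, predicted_terms):
    """Information in the true annotation that the prediction failed to provide."""
    return sum(INFO[t] for t in true_terms - predicted_terms)

def misinformation(true_terms, predicted_terms):
    """Information in the prediction that is not part of the true annotation."""
    return sum(INFO[t] for t in predicted_terms - true_terms)

def semantic_distance(true_terms, predicted_terms):
    """Euclidean combination of the two quantities; lower is better."""
    ru = remaining_uncertainty(true_terms, predicted_terms)
    mi = misinformation(true_terms, predicted_terms)
    return sqrt(ru ** 2 + mi ** 2)

true_terms = {"Binding", "Nucleic acid binding", "DNA binding"}
predicted_terms = {"Binding", "Nucleic acid binding", "RNA binding", "rRNA binding"}

ru = remaining_uncertainty(true_terms, predicted_terms)
mi = misinformation(true_terms, predicted_terms)
s = semantic_distance(true_terms, predicted_terms)
print(ru, mi, round(s, 3))
```

Unlike counting terms, this weighting penalizes a wrong specific term (here "rRNA binding") more than a wrong general one.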

Topological score vs. information theoretic score
Topological: Precision, Recall
Information theoretic: Remaining Uncertainty, Missing Information

Molecular Function
Benchmark: all species; Type: no knowledge; Mode: full; Metric: F-max

Molecular Function
Benchmark: all species; Type: no knowledge; Mode: full; Metric: S-min

Molecular Function
Benchmark: all species; Type: no knowledge; Mode: full

Biological Process
Benchmark: all species; Type: no knowledge; Mode: full; Metric: F-max

Biological Process
Benchmark: all species; Type: no knowledge; Mode: full; Metric: S-min

Biological Process
Benchmark: all species; Type: no knowledge; Mode: full

Things We’ve Learned (top method)
● Our ability to predict in MFO has improved
– CAFA1: 0.60
– CAFA2: 0.60
– CAFA3: 0.70
● Our ability to predict in BPO has improved, kinda...
– CAFA1: 0.40
– CAFA2: 0.37
– CAFA3: 0.39

Are we doing better?
Biological Process

Phenomics
A set of physical traits belonging to a given organism
Used in agriculture to improve crops
Problem: obtaining good gold-standard data

Train a classifier using images labeled by different methods

Fatigue: performance over time

Training data from the three crowds gives same classifier performance

Conclusions
● Communities of different types enhance and advance computational biology
– Community software development (Biopython)
– Community challenges (CAFA)
– Crowdsourcing
● Join or form a community!

Acknowledgements (CAFA)
Huy Nguyen
Md Nafiz Hamid
Iddo Friedberg
Indiana University Bloomington
Yuxiang Jiang
Predrag Radivojac
University of Washington
Timothy Bergquist
Sean Mooney
University of Pennsylvania
Casey Greene
European Bioinformatics Institute
Maria Martin
Claire O'Donovan
George Georghiou
Dartmouth College
Deborah A. Hogan (Microbe Screen)
Kimberley A. Lewis (Microbe Screen)
Alex W. Crocker (Microbe Screen)
Giovanni Bosco (Drosophila Screen)
University of Illinois, Chicago
Constance Jeffery

Corn Tassel Crowdsourcing
Naihui Zhou
Zachary D. Siegel
Scott Zarecor
Nigel Lee
Darwin A. Campbell
Carson M. Andorf
Dan Nettleton
Carolyn J. Lawrence-Dill
Baskar Ganapathysubramanian
Jonathan W. Kelly
Presidential Interdisciplinary Research Initiative
