Photo: Amy Apprill, Woods Hole Oceanographic Institution
pain
bioinformatics
oh and btw Elvira works in my lab
Custom Perl scripts FTW
• Luscombe et al., 2001:
*especially, but definitely not limited to, gene & protein sequence data
**often impressively large datasets. Please do not call it “big data”
https://xenabrowser.net/heatmap/
http://www.rcsb.org/structure/3V6D
Sylvester et al. (2017)
bentzenlab.ca
https://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/ucm363134.htm
Keddy and Beiko (2017)
65 protein sequences
1962
Dr. Dayhoff established an on-line computer
database and a sophisticated retrieval system, accessible by
phone to outside users, in September 1980
http://www.dayhoff.cc/MODBiography.html
Dec 1982: 680,338 nucleotides
Apr 2003: ~40 billion nucleotides
Dec 2017: ~3 trillion nucleotides
Kaye et al.
“Data sharing in genomics — re-shaping
scientific practice” Nat Rev Genet 2009
Langille et al. (2018) Microbiome
Data-release policies are
only as good as researchers’
willingness to abide by
them, and the will on the
part of journals and funding
bodies to enforce them!
• Standard formats
• Efficient representations
quantitative data
• Gene expression
• Metabolite concentrations
many ways
protein functional prediction
new results
re-run the analysis yourself
ontologies
controlled vocabularies
“(i) the recorded information about each experiment should be sufficient to
interpret the experiment and should be detailed enough to enable
comparisons to similar experiments and permit replication of experiments
and (ii) the information should be structured in a way that enables useful
querying as well as automated data analysis and mining.”
Brazma et al. (2001) Nat Genet
Figure: expression matrix of g genes by s samples (diffuse large B-cell lymphoma patients of two types, with different prognoses); expression levels shown on a scale from low to high.
Li et al. (2001) Bioinformatics
(Also available as XML)
Pedro’s Biomolecular Research Tools, ca. 1997
All articles published in Science in 2009 that mention H1N1:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+h1n1+AND+2009[pdat]
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=19516283&retmode=text&rettype=abstract
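The two E-utilities requests above can also be assembled programmatically rather than typed by hand. The following is a minimal sketch using only Python's standard library; the helper names `esearch_url` and `efetch_url` are hypothetical, and the code only builds the URLs (nothing is sent over the network). Note that `urlencode` percent-encodes the square brackets, which is equivalent to the literal form shown on the slide.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str) -> str:
    """Build an ESearch URL (hypothetical helper; the query is not sent)."""
    # urlencode turns spaces into '+' and percent-encodes '[' and ']'.
    return f"{EUTILS_BASE}/esearch.fcgi?" + urlencode({"db": db, "term": term})


def efetch_url(db: str, record_id: str,
               retmode: str = "text", rettype: str = "abstract") -> str:
    """Build an EFetch URL for a single record ID (hypothetical helper)."""
    params = {"db": db, "id": record_id, "retmode": retmode, "rettype": rettype}
    return f"{EUTILS_BASE}/efetch.fcgi?" + urlencode(params)


# Same search and record as in the URLs above.
search = esearch_url("pubmed", "science[journal] AND h1n1 AND 2009[pdat]")
fetch = efetch_url("pubmed", "19516283")
print(search)
print(fetch)
```

To actually run the query, the resulting URL could be passed to `urllib.request.urlopen`; that step is left out here so the sketch stays offline.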
Evidence codes – was protein function inferred in the lab (and if so, how?), or through
computational predictions?
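The lab-versus-computation distinction can be made explicit when filtering annotations. The sketch below assumes the codes in question are Gene Ontology (GO) evidence codes, which the slide does not state, and it lists only a subset of them. The function `evidence_category` and the gene names are hypothetical.

```python
# Subset of GO evidence codes, grouped by how the annotation was supported.
# Assumption: the slide refers to GO evidence codes; the lists are not exhaustive.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"}
ELECTRONIC = {"IEA"}  # assigned automatically, without curator review


def evidence_category(code: str) -> str:
    """Map an evidence code to a coarse category (hypothetical helper)."""
    code = code.strip().upper()
    if code in EXPERIMENTAL:
        return "experimental"
    if code in COMPUTATIONAL:
        return "computational analysis"
    if code in ELECTRONIC:
        return "electronic annotation"
    return "other"


# Made-up annotations: (gene, evidence code)
annotations = [("geneA", "IDA"), ("geneB", "ISS"), ("geneC", "IEA")]

# Keep only annotations backed by a lab experiment.
lab_supported = [gene for gene, code in annotations
                 if evidence_category(code) == "experimental"]
print(lab_supported)
```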
Evidence codes
en.wikipedia.org
?
GOLD implements MIxS standards
Nucleic Acids Research 2018 Database issue
181 databases
Nucleic Acids Research 2017 Web Server issue
86 servers (284 submitted)
right
exact same
3. Archive the Exact Versions of All External Programs Used
5. Record All Intermediate Results, When Possible in
Standardized Formats
Sandve et al., PLoS Comput Biol 2013
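One lightweight way to act on rules 3 and 5 is to write a small manifest next to each analysis, recording tool versions and a checksum for every intermediate file. This is only a sketch under stated assumptions: the function `write_manifest`, the tool name `example_aligner`, its version string, and the file `aligned.fasta` are all hypothetical placeholders, not part of Sandve et al.

```python
import hashlib
import json
import platform
import sys
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(outputs, manifest_path, tool_versions):
    """Record interpreter, platform, tool versions, and file checksums as JSON."""
    manifest = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "tools": dict(tool_versions),  # supplied by the caller, e.g. from `tool --version`
        "files": {str(p): sha256_of(Path(p)) for p in outputs},
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


# Demo with a made-up intermediate result in a temporary directory.
workdir = Path(tempfile.mkdtemp())
intermediate = workdir / "aligned.fasta"
intermediate.write_text(">seq1\nACGT\n")
manifest = write_manifest(
    [intermediate],
    workdir / "manifest.json",
    {"example_aligner": "1.2.3"},  # hypothetical tool and version
)
print(json.dumps(manifest, indent=2))
```

Checksums make it possible to tell later whether an intermediate file is the exact same one the analysis produced; archiving the programs themselves still has to be done separately.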
Galaxy – customize, execute,
save, and share workflows
CSCI 6802 Tutorial #2 (Michael Hall)
G Dudas et al. Nature 1–7 (2017) doi:10.1038/nature22040
Recent paper about Ebola transmission
• 1610 publicly available genomes, 2014-2015
• Data cleaning
• Relaxed molecular clock to infer root
• Markov chains to infer transmission rates
(not location specific, but based on
attributes)
Key Conclusions:
• Median transmission distance = 72 km
• Important factors:
• National vs int’l dispersal
• Distances between regions
• Population at source and destination
• Shared int’l border
Dudas et al. (2017) Nature
Jupyter (IPython) Notebook - https://github.com/ebov/space-time
must
enforced
modes of access
FIN

Biomedical data