Why Research Data Management May Save Science
Anita de Waard
VP Research Data Collaborations
a.dewaard@elsevier.com
http://researchdata.elsevier.com/
Why Life is Difficult,
And What We Can Do About It
Outline:
• The problem: life is difficult.
• One approach to tackling this: claim-evidence
networks.
– How do we find claims?
– How do we find evidence?
– How do we connect the two?
• What is still missing?
• Call to action!
The Problem
Problem 1: a rose is not a rose:
• “…there was significant variability of the
injected venom composition from
specimen to specimen, in spite of their
common biogeographic origin.”
Jose A. Rivera-Ortiz, Herminsul Cano, Frank Marí, Intraspecies variability of the
injected venom of Conus ermineus, doi:10.1016/j.peptides.2010.11.014
• “…Strains DV-3/84 DV-7/84 (group 3)
showed 76.6% similarity to each other and
were similar to all other strains at the
67.6% level.”
Zofia Dzierżewicz et al., Intraspecies variability of Desulfovibrio desulfuricans
strains determined by the genetic profiles, FEMS Microbiology Letters, Volume
219, Issue 1, 14 February 2003, Pages 69–74, doi:10.1016/S0378-1097(02)01199-0
=> A specimen is not a species!
Problem 2: gene expression varies with:
Age: “SIRT1-Associated genes are deregulated in the aged brain”
Philipp Oberdoerffer et al., SIRT1 Redistribution on Chromatin Promotes Genomic Stability but Alters Gene Expression
during Aging, Cell, Volume 135, Issue 5, 28 November 2008, Pages 907–918, doi:10.1016/j.cell.2008.10.025
Smell: “…major urinary proteins […] mediate the pregnancy blocking
effects of male urine”
P.A. Brennan et al., Patterns of expression of the immediate-early gene egr-1 in the accessory olfactory bulb of female
mice exposed to pheromonal constituents of male urine, Neuroscience, Volume 90, Issue 4, June 1999, Pages 1463–1470,
doi:10.1016/S0306-4522(98)00556-9
Hunger: “Out of the ~30K genes, about 10K are differentially expressed
in liver cells when an animal is in different states of satiety.”
Zhang F, Xu X, Zhou B, He Z, Zhai Q (2011) Gene Expression Profile Change and Associated Physiological and
Pathological Effects in Mouse Liver Induced by Fasting and Refeeding.
PLoS ONE 6(11): e27553. doi:10.1371/journal.pone.002755
Light: “Longer-term enrichment training also altered the mRNA levels of
many genes associated with structural changes that occur during
neuronal growth.”
Cailotto C., et al. (2009) Effects of Nocturnal Light on (Clock) Gene Expression in Peripheral Organs: A Role for the
Autonomic Innervation of the Liver. PLoS ONE 4(5): e5650. doi:10.1371/journal.pone.0005650
=> Knowing genes is not knowing
how they are expressed!
Problem 3:
No man (or mouse) is an island…
• “We found the diversity and abundance of each habitat’s
signature microbes to vary widely even among healthy
subjects, with strong niche specialization both within
and among individuals.”
The Human Microbiome Project Consortium, Structure, function and diversity of the healthy
human microbiome, Nature 486, 207–214 (14 June 2012) doi:10.1038/nature11234
• “Colonization of an infant’s gastrointestinal tract begins
at birth. The acquisition and normal development of the
neonatal microflora is vital for the healthy maturation of
the immune system.”
Mackie RI, Sghir A, Gaskins HR., Developmental microbial ecology of the neonatal
gastrointestinal tract. Am J Clin Nutr. 1999 May;69(5):1035S-1045S
=> An animal is an ecosystem!
Problem 4:
Interactions create more complexity:
• Computing cancer: “No amount of information about
what happens inside a single cell can ever tell you
what a tissue is going to do,” [Glazier] said. “Much of
the information and complexity of tissues and life is
embedded in the way cells talk to each other and the
extracellular environment.”
• Megadata: “These complex emergent systems are
impossible to understand,” “[we] founded Applied
Proteomics to create a protein diagnostic that reveals
not just where a cancer is, but how it interacts with
the body.”
Nature Special Issue Vol. 491 No. 7425, ‘Physical Scientists Take On Cancer’
=> The whole is more than the sum of its parts!
Big problems in biology:
http://en.wikipedia.org/wiki/File:Duck_of_Vaucanson.jpg
1. Intraspecies variability > A specimen is not a species!
2. Gene expression variability > Knowing genes is not
knowing how they are expressed!
3. Microbiome > An animal is an ecosystem!
4. Systems biology > The whole is more than the sum of its parts!
5. Models vs. experiment > Are we talking about the same
things? In a way we can all use?
6. Dynamics > Life is not in equilibrium!
Life is complicated!
Reductionism doesn’t
work for living systems.
Statistics could help!
With enough observations, trends and anomalies can be
detected:
• “Here we present resources from a population of 242
healthy adults sampled at 15 or 18 body sites up to three
times, which have generated 5,177 microbial taxonomic
profiles from 16S ribosomal RNA genes and over 3.5
terabases of metagenomic sequence so far.”
The Human Microbiome Project Consortium, Structure, function and diversity of
the healthy human microbiome, Nature 486, 207–214 (14 June 2012)
doi:10.1038/nature11234
• “The large sample size — 4,298 North Americans of
European descent and 2,217 African Americans — has
enabled the researchers to mine down into the human
genome.”
Nidhi Subbaraman, Nature News, 28 November 2012, High-resolution sequencing
study emphasizes importance of rare variants in disease.
But biological research is insular!
• Biology is small: at scales of 10^-5 – 10^2 m,
a scientist can work alone (‘King’ and
‘subjects’).
• Biology is messy: it doesn’t
happen behind a terminal.
• Biology is competitive: many
people with similar skill sets,
vying for the same grants.
• In summary: the structure of biological
research does not inherently promote
collaboration (unlike, for instance, high-energy physics or
astronomy, though they’re not all they’re cracked up to
be, either…).
Prepare
Observe
Analyze
Ponder
Communicate
How Can We Connect
This Knowledge?
Claim-Evidence Networks
Offer A Model for Connecting Knowledge:
Experimental
Evidence
Converging on Claim/Evidence/Networks, e.g. here:
• The Karyotype Ontology: a computational representation for human cytogenetic patterns. Jennifer Warrender and
Phillip Lord
• Lexical Analysis and Characterization of the OBOFoundry Ontologies. Manuel Quesada-Martínez, Jesualdo Tomás
Fernández-Breis and Robert Stevens
• Exomiser: improved exome prioritization of disease genes through cross species phenotype comparison. Peter
Robinson, Sebastian Köhler, Anika Oellrich, Kai Wang, Chris Mungall, Suzanna E. Lewis, Sebastian Bauer, Dominik
Seelow, Peter Krawitz, Christian Gilissen, Melissa Haendel and Damian Smedley
• BioAssay Ontology (BAO): Modularization, Integration and Applications. Uma Vempati, Hande Kucuk, Saminda
Abeyruwan, Ubbo Visser, Vance Lemmon, Ahsan Mir and Stephan Schürer
• eXframe: A Semantic Web Platform for Genomics Experiments. Emily Merrill, Stephane Corlosquet, Paolo
Ciccarese, Tim Clark and Sudeshna Das
• Ovopub: Modular data publication with minimal provenance. Alison Callahan and Michel Dumontier
• Zooma – A tool for automated ontology annotation. Tony Burdett, Simon Jupp, James Malone, Helen
Parkinson, Eleanor Williams and Adam Faulconbridge
• A Probabilistic Framework for Ontology-Based Annotation in Neuroimaging Literature. Chayan
Chakrabarti, Thomas B. Jones, Jiawei F. Xu, George F. Luger, Angela R. Laird, Matthew D. Turner and Jessica A.
Turner
• Preserving sequence annotations across reference sequences. Zuotian Tatum, Andrew Gibson, Marco Roos, Peter
E.M. Taschner, Mark Thompson, Erik A. Schultes and Jeroen F. J. Laros
• A Taxonomy for Immunologists. James A. Overton, Randi Vita, Jason A. Greenbaum, Heiko Dietze, Alessandro Sette
and Bjoern Peters
• Health Data Ontology Trunk: A middle-layer ontology for healthcare. Ulf Schwarz, Luc Schneider, Emilio
Sanfilippo, Holger Stenzhorn and Nikolina Koleva
• Structured representation of scientific evidence using semantic web techniques – a biochemistry use
case. Christian Bölling, Michael Weidlich and Hermann-Georg Holzhütter
• Synthetic Biology Open Language Visual: an ontological use case. Jacqueline Quinn, Michal Galdzicki, Robert
Step 1: Find claims:
E.g., using XIP for discourse analysis:
In contrast with previous hypotheses compact plaques form before significant
deposition of diffuse A beta, suggesting that different mechanisms are involved
in the deposition of diffuse amyloid and the aggregation into plaques.
[Figure labels: entities, relationships, temporality, connections, thematic roles, status; core information (proposition) / information extraction; rhetorical metadiscourse / discourse analysis; discourse structure]
Sándor, Àgnes and de Waard, Anita, (2012).
Finding Claimed Knowledge Updates:
Sandor, A. and de Waard, A. (2012)
Here we used mass spectrometry to identify HuD as a novel
neuronal SMN-interacting partner
Our analysis of known HuD-associated mRNAs in neurons identified
cpg15 mRNA as a highly abundant mRNA in HuD IPs
Our finding that SMN protein associates with HuD protein and the
HuD target cpg15 mRNA in neurons …
Definition of a Claimed Knowledge Update (CKU):
1) A CKU expresses a verbal or nominal proposition about biological entities.
2) A CKU is a new proposition.
3) The authors present the CKU as factual.
4) A CKU is derived from the experimental work described in the article.
5) The ownership of the proposition is attributed to the author(s) of the article.
6) Criteria 4) and 5) are either explicitly expressed or implicitly conveyed by a
structural position such as a title, section title or caption title.
Allow for Hedging and Uncertainty:
Ontology of Reasoning, Certainty and Attribution (ORCA)
For a proposition P, an epistemically marked clause E
is an evaluation of P, written E_{V,B,S}(P), with:
– V = Value:
3 = assumed true, 2 = probable, 1 = possible, 0 = unknown,
(-1 = possibly untrue, -2 = probably untrue, -3 = assumed untrue)
– B = Basis:
Reasoning
Data
– S = Source:
A = speaker is author A, explicit
IA = speaker is author A, implicit
N = other author N, explicit
NN = other author NN, implicit
Based on a conversation with Ed Hovy;
de Waard, A. and Schneider, J. (2012)
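The three ORCA dimensions can be held in a small record type. The following is only a minimal sketch: the class and member names are invented for illustration, the "Unknown" members are an assumption taken from the evaluations shown elsewhere in this talk, and `is_hedged` is a hypothetical helper, not part of ORCA itself.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Value(IntEnum):
    """Certainty scale from the slide: -3 (assumed untrue) to 3 (assumed true)."""
    ASSUMED_UNTRUE = -3
    PROBABLY_UNTRUE = -2
    POSSIBLY_UNTRUE = -1
    UNKNOWN = 0
    POSSIBLE = 1
    PROBABLE = 2
    ASSUMED_TRUE = 3


class Basis(Enum):
    REASONING = "Reasoning"
    DATA = "Data"
    UNKNOWN = "Unknown"  # assumption: used when no basis is stated


class Source(Enum):
    AUTHOR_EXPLICIT = "A"
    AUTHOR_IMPLICIT = "IA"
    OTHER_EXPLICIT = "N"
    OTHER_IMPLICIT = "NN"
    UNKNOWN = "Unknown"  # assumption: used when no source is stated


@dataclass(frozen=True)
class Evaluation:
    """An epistemic evaluation E_{V,B,S}(P) of a proposition P."""
    proposition: str
    value: Value
    basis: Basis
    source: Source

    def is_hedged(self) -> bool:
        # Hypothetical helper: anything short of full certainty
        # (in either direction) counts as hedged.
        return abs(self.value) < Value.ASSUMED_TRUE


claim = Evaluation(
    proposition="Increased abundance of miR-372 decreases abundance of LATS2",
    value=Value.POSSIBLE,
    basis=Basis.UNKNOWN,
    source=Source.UNKNOWN,
)
print(claim.is_hedged())
```

Keeping Value as an ordered integer scale makes it possible to compare how certain the same proposition is in different papers.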
Turning claims into formal representations:

Example 1: biological statement with BEL / epistemic markup
Statement: “These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor-suppressor LATS2.”
BEL representation: r(MIR:miR-372) -| (tscript(p(HUGO:Trp53)) -| kin(p(PFH:"CDK Family")))
Statement: Increased abundance of miR-372 decreases abundance of LATS2
BEL representation: r(MIR:miR-372) -| r(HUGO:LATS2)
Epistemic evaluation: Value = Possible; Source = Unknown; Basis = Unknown

Example 2: biological statement with MedScan / epistemic markup
Statement: “Furthermore, we present evidence that the secretion of nesfatin-1 into the culture media was dramatically increased during the differentiation of 3T3-L1 preadipocytes into adipocytes (P < 0.001) and after treatments with TNF-alpha, IL-6, insulin, and dexamethasone (P < 0.01).”
MedScan representation: IL-6 → NUCB2 (nesfatin-1); Relation: MolTransport; Effect: Positive; CellType: Adipocytes; Cell Line: 3T3-L1
Epistemic evaluation: Value = Probable; Source = Author; Basis = Data
Claims Link to Evidence:
The evidence is in data. To structure this:
• There are many different research databases, both generic
(Dryad, Dataverse, DataBank, Zenodo, etc.) and specific
(NIF, IEDA, PDB)
• There are many systems for creating/sharing workflows
(Taverna, MyExperiment, Vistrails, Workflow4Ever)
• There are many e-lab notebooks
(LabGuru, LabArchives, LaBlog etc)
• There are scores of
projects, committees, standards, bodies, grants, initiatives,
conferences for discussing and connecting all of this
(KEfED, Pegasus, PROV, RDA, Science
Gateways, Codata, BRDI, Earthcube, etc. etc)
• … you could make a living out of this !
…but this is what most scientists do:
Using antibodies
and squishy bits
Grad Students experiment
and enter details into their
lab notebook.
The PI then tries to make
sense of their slides,
and writes a paper.
End of story.
One attempt to structure data:
CMU Urban Legend
de Waard, A., Burton, S. et al., 2013
Connecting experimental results:
[Diagram labels: parallel Prepare / Analyze / Communicate workflows, linked through shared Observations]
• Across labs and experiments: track reagents and how they are used.
• Compare the outcome of interactions with these entities.
• Build a ‘virtual reagent spectrogram’ by comparing how different entities
interacted in different experiments.
• Reason collectively!
NIF Antibodies Registry
collects antibody information:
Step 3: Connect Claims and Evidence
Example: Hunter et al., Hanalyzer:
Example: Drug-Drug Interactions (DDIs)
Step 1: Manually identify DDIs and drug names in a wide collection of
content sources
Step 2: Develop a model of Drug-Drug Interaction and define candidates
Step 3: Automate this process and store as Linked Data
Boyce, Schroeder et al., 2013
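As a rough illustration of what "store as Linked Data" in Step 3 could look like, here is a sketch that writes one candidate interaction as N-Triples lines. The namespace, the predicate names and the drug identifiers are all hypothetical placeholders. They are not the vocabulary used by Boyce et al.

```python
from typing import List

# Hypothetical namespace and predicates, for illustration only.
EX = "http://example.org/ddi/"


def ddi_to_ntriples(drug_a: str, drug_b: str, source_url: str) -> List[str]:
    """Serialize one candidate drug-drug interaction as N-Triples lines."""
    interaction = f"<{EX}interaction/{drug_a}-{drug_b}>"
    return [
        f"{interaction} <{EX}involvesDrug> <{EX}drug/{drug_a}> .",
        f"{interaction} <{EX}involvesDrug> <{EX}drug/{drug_b}> .",
        # Keep a pointer to the content source the candidate was found in.
        f"{interaction} <{EX}evidenceSource> <{source_url}> .",
    ]


triples = ddi_to_ntriples("drugA", "drugB", "http://example.org/label/123")
for line in triples:
    print(line)
```

Each interaction gets its own URI, so the evidence source stays attached to the claim it supports.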
Connect recommendations
in clinical guidelines to underlying
evidence
Hoekstra, de Waard and Vdovjak, 2012
Example: do science ON the web:
Using what is known about interactions in fly & yeast,
predict new interactions with a human protein –
Running over data on the web that he neither created nor knew about!
Given a protein P in Species X:
• Find proteins similar to P in Species Y
• Retrieve interactors in Species Y
• Sequence-compare Y-interactors with Species X genome
  (1) → Keep only those with homologue in
• Find proteins similar to P in Species Z
• Retrieve interactors in Species Z
• Sequence-compare Z-interactors with (1)
  → Putative interactors in Species X
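The cross-species procedure can be sketched over toy in-memory data. Everything here is an assumption made for illustration: the lookup tables stand in for web services, all identifiers are invented, and the final comparison with list (1) is approximated as a set intersection of Species X homologues.

```python
from typing import Dict, Set, Tuple

# Toy stand-ins for remote data sources (all identifiers are invented).
SIMILAR: Dict[Tuple[str, str], Set[str]] = {
    ("P", "Y"): {"Py"},
    ("P", "Z"): {"Pz"},
}
INTERACTORS: Dict[str, Set[str]] = {
    "Py": {"Y1", "Y2"},
    "Pz": {"Z1", "Z3"},
}
HOMOLOGUE_IN_X: Dict[str, Set[str]] = {
    "Y1": {"X1"},
    "Y2": {"X2"},
    "Z1": {"X1"},
    "Z3": {"X3"},
}


def candidates_via(protein: str, species: str) -> Set[str]:
    """Species X homologues of the interactors of P's look-alikes in `species`."""
    found: Set[str] = set()
    for lookalike in SIMILAR.get((protein, species), set()):
        for partner in INTERACTORS.get(lookalike, set()):
            # Keep only partners that have a homologue in Species X.
            found |= HOMOLOGUE_IN_X.get(partner, set())
    return found


def putative_interactors(protein: str) -> Set[str]:
    via_y = candidates_via(protein, "Y")  # list (1) in the outline
    via_z = candidates_via(protein, "Z")
    # Assumption: a candidate is kept when both species support it.
    return via_y & via_z


print(putative_interactors("P"))
```

Requiring support from both species is the conservative reading of the outline. Taking the union instead would trade precision for recall.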
Great! So we’re almost
done, right – and we can all go
home!
Not so fast…
Both seminomas and the EC component of
nonseminomas share features with ES cells. To
exclude that the detection of miR-371-3 merely
reflects its expression pattern in ES cells, we tested
by RPA miR-302a-d, another ES cells-specific
miRNA cluster (Suh et al, 2004). In many of the
miR-371-3 expressing seminomas and
nonseminomas, miR-302a-d was undetectable (Figs
S7 and S8), suggesting that miR-371-3 expression is
a selective event during tumorigenesis.
Both seminomas and the EC component of
nonseminomas share features with ES cells.
To exclude that
the detection of miR-371-3 merely reflects its
expression pattern in ES cells,
we tested by RPA miR-302a-d, another ES cells-
specific miRNA cluster (Suh et al, 2004).
In many of the miR-371-3 expressing seminomas
and nonseminomas, miR-302a-d was undetectable
(Figs S7 and S8),
suggesting that
miR-371-3 expression is a selective event during
tumorigenesis.
Fact
Hypothesis
Method
Result
Implication
Goal
Reg-Implication
Conceptual
knowledge
Experimental
Evidence
What is a claim? In a paragraph?
• Voorhoeve et al., 2006: “These miRNAs neutralize p53-mediated CDK
inhibition, possibly through direct inhibition of the expression of the tumor
suppressor LATS2.”
• Kloosterman and Plasterk, 2006: “In a genetic screen, miR-372 and miR-373
were found to allow proliferation of primary human cells that express
oncogenic RAS and active p53, possibly by inhibiting the tumor suppressor
LATS2 (Voorhoeve et al., 2006).”
• Okada et al., 2011: “Two oncogenic miRNAs, miR-372 and miR-373, directly
inhibit the expression of Lats2, thereby allowing tumorigenic growth in the
presence of p53 (Voorhoeve et al., 2006).”
“[Y]ou can transform .. fiction into fact, just by adding
or subtracting references”, Latour, 1987
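The drift from hedged claim to stated fact in the three quotations can be made visible with a crude hedge-word check. This is only a sketch: the word list is an assumption picked for this example, and real hedging detection needs far more than keyword matching.

```python
import re
from typing import List

# Assumed, deliberately tiny hedge lexicon for this illustration.
HEDGE_WORDS = {"possibly", "probably", "may", "might", "perhaps", "suggests"}

QUOTES = {
    "Voorhoeve et al., 2006": (
        "These miRNAs neutralize p53-mediated CDK inhibition, possibly through "
        "direct inhibition of the expression of the tumor suppressor LATS2."
    ),
    "Kloosterman and Plasterk, 2006": (
        "In a genetic screen, miR-372 and miR-373 were found to allow "
        "proliferation of primary human cells that express oncogenic RAS and "
        "active p53, possibly by inhibiting the tumor suppressor LATS2."
    ),
    "Okada et al., 2011": (
        "Two oncogenic miRNAs, miR-372 and miR-373, directly inhibit the "
        "expression of Lats2, thereby allowing tumorigenic growth in the "
        "presence of p53."
    ),
}


def hedges_in(sentence: str) -> List[str]:
    """Return the hedge words found in a sentence, in order of appearance."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w in HEDGE_WORDS]


for paper, sentence in QUOTES.items():
    print(paper, "->", hedges_in(sentence) or "no hedge")
```

On these three sentences the check finds "possibly" in the 2006 papers and nothing in the 2011 citation.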
What is the claim? Who makes it?
> 50 My Papers
2 M scientists
2 My papers/year
Evidence is largely lost…
• Majority of data (90%?) is stored on local hard drives.
• Some data (8%?) is stored in large, generic data repositories:
  Dryad: 7,631 files; Dataverse: 0.6 My; Datacite: 1.5 My
• A small portion of data (1-2%?) is stored in small, topic-focused data repositories:
  MiRB: 25 k; PetDB: 1.5 k; TAIR: 72.1 k; PDB: 88.3 k; SedDB: 0.6 k
…or buried..
• In 220 publications, only 40% of antibodies, 40% of cell lines and 25% of
constructs can be manually identified (Vasilevsky et al., submitted)
• The good news: we can find automatically
what we can find manually
• Proposal (NIH, June 2013):
– Author is asked to add methods section to a tool
– Tool extracts likely reagents / resources
– User interface asks author to confirm or select
…and you can’t extract it after the fact.
Entity Type    Precision    Recall
Antibody       87.5         63.3
Resource       95.6         98.9
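For readers less familiar with the two measures: precision is the share of extracted mentions that are correct, and recall is the share of true mentions that were extracted. A minimal sketch follows. The counts are made up for illustration and are not the counts behind the figures reported here.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts: 8 correct extractions, 2 spurious ones, 2 missed mentions.
p, r = precision_recall(true_pos=8, false_pos=2, false_neg=2)
print(f"precision={p:.1%} recall={r:.1%}")
```

Read this way, high precision with lower recall means the extractor rarely mislabels a mention but misses a sizeable share of them.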
Even if we can link to evidence:
• Is it true?
In Summary:
We’re not out of the woods
(or a job) just yet!
We need to improve claim networks:
• Can we make systems of computer-readable
meaning that still represent the fullness of
natural language?
>> Let’s work with computational linguists!
• Trace claims across publications:
>> Let’s work with legal/political argumentation
specialists! Sentiment analysis!
> 50 My Papers
2 M scientists
2 My papers/year
Improve evidence: scale up data curation!
• Majority of data (90%?) is stored on local hard drives.
• Some data (8%?) is stored in large, generic data repositories:
  Dryad: 7,631 files; Dataverse: 0.6 My; Datacite: 1.5 My
• A small portion of data (1-2%?) is stored in small, topic-focused data repositories:
  MiRB: 25 k; PetDB: 1.5 k; TAIR: 72.1 k; PDB: 88.3 k; SedDB: 0.6 k
Actions:
• INCREASE DATA DIGITISATION
• DEVELOP SUSTAINABLE MODELS
• IMPROVE REPOSITORY INTEROPERABILITY
Keep asking big questions:
• Is this true?
• Does it matter?
• To whom?
“Let us now build systems that allow a kid in Mali
who wants to learn about proteomics to not be
overwhelmed by the irrelevant and the untrue.”
- John Perry Barlow, iAnnotate, SF 2013
In Memoriam Douglas C. Engelbart, 1925-2013:
“This is an initial summary report of a project taking a new
and systematic approach to improving the intellectual
effectiveness of the individual human being. A detailed
conceptual framework explores the nature of the system
composed of the individual and the tools, concepts, and
methods that match his basic capabilities to his problems.
One of the tools that shows the greatest immediate promise
is the computer, when it can be harnessed for direct on-line
assistance, integrated with new concepts and methods.”
Summary:
• The problem: life is difficult.
• One approach to tackle this: claim-evidence
networks:
– Find claims
– Identify evidence
– Connect the two.
• But we still need:
– Better ways to represent subtlety of natural language
– Better evidence: more structured, better connected
– Focus on the big questions.
• There’s a lot of work to do!
Collaborations and discussions gratefully acknowledged:
• CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Ed Hovy
• UCSD: Phil Bourne, Brian Schottlaender, Ilya Zaslavsky
• NIF: Maryann Martone, Anita Bandrowski
• MSU: Brian Bothner
• OHSU: Melissa Haendel, Nicole Vasilevsky
• CDL: Carly Strasser, John Kunze, Stephen Abrams
• Harvard/MGH: Tim Clark, Paolo Ciccarese
• VU: Rinke Hoekstra, Frank van Harmelen, Paul Groth
• Columbia/IEDA: Kerstin Lehnert, Leslie Hsu
• University of Pittsburgh: Richard Boyce
• Xerox Research Europe: Agnes Sandor
• DERI: Jodi Schneider
Thank you!
References:
• de Waard, Buckingham Shum, Park, Samwald, Sandor, 2009: Hypotheses, Evidence and Relationships, ISWC2009
• Biological Expression Language – http://www.openbel.org
• Latour, B. and Woolgar, S., Laboratory Life: the Social Construction of Scientific Facts, 1979, Sage Publications
• Latour, B., Science in Action, 1987
• de Waard, A. and Pander Maat, H. (2012). Epistemic Modality and Knowledge Attribution in Scientific Discourse: A
Taxonomy of Types and Overview of Features. Proceedings of the 50th Annual Meeting of the Association for
Computational Linguistics, pages 47–55, Jeju, Republic of Korea, 12 July 2012.
• Data2Semantics project: http://www.data2semantics.org/
• Sándor, Àgnes and de Waard, Anita, (2012). Identifying Claimed Knowledge Updates in Biomedical Research
Articles, Workshop on Detecting Structure in Scholarly Discourse, ACL 2012.
• de Waard, A. and Schneider, J. (2012) Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution
(ORCA), Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine workshop, ISWC 2012
• de Waard, A., Burton, S.D., Gerkin, R.C., Harviston, M., Marques, D., Tripathy, S.J., Urban, N.N., Creating an Urban
Legend: A System for Electrophysiology Data Management and Exploration, Discovery Informatics, 2013
• Boyce, R.D., Horn, J.R., Hassanzadeh, O., de Waard, A., Schneider, J., Luciano, J.S., Liakata, M., Dynamic enhancement of
drug product labels to support drug safety, efficacy, and effectiveness. Jnl of Biomedical Semantics, 2013, 4:5.
• Hoekstra, R., de Waard, A., Vdovjak, R. (2012) Annotating Evidence Based Clinical Guidelines - A Lightweight
Ontology, Proceedings of SWAT4LS 2012, Paris, Adrian Paschke, Albert Burger, Paolo Roma, M. Scott Marshall, Andrea
Splendiani (ed.), Springer.
http://researchdata.elsevier.com/

Why Life is Difficult, and What We MIght Do About It

  • 1.
    Why Research Data Management MaySave Science Anita de Waard VP Research Data Collaborations a.dewaard@elsevier.com http://researchdata.elsevier.com/ Why Life is Difficult, And What We Can Do About It
  • 2.
    Outline: • The problem:life is difficult. • One approach to tackling this: claim-evidence networks. – How do we find claims? – How do we find evidence? – How do we connect the two? • What is still missing? • Call to action!
  • 3.
  • 4.
    Problem 1: arose is not a rose: • “…there was significant variability of the injected venom composition from specimen to specimen, in spite of their common biogeographic origin.” Jose A. Rivera-Ortiz, Herminsul Cano, Frank Marí, Intraspecies variability of the injected venom of Conus ermineus, doi:10.1016/j.peptides.2010.11.014 • “…Strains DV-3/84 DV-7/84 (group 3) showed 76.6% similarity to each other and were similar to all other strains at the 67.6% level.” Zofia Dzierżewicz et al., Intraspecies variability of Desulfovibrio desulfuricans strains determined by the genetic profiles, FEMS Microbiology Letters, Volume 219, Issue 1, 14 February 2003, Pages 69–74, doi:10.1016/S0378- 1097(02)01199-0 => A specimen is not a species!
  • 5.
    Problem 2: geneexpression varies with: Age: “SIRT1-Associated genes are deregulated in the aged brain” Philipp Oberdoerffer et al., SIRT1 Redistribution on Chromatin Promotes Genomic Stability but Alters Gene Expression during Aging, Cell, Volume 135, Issue 5, 28 November 2008, Pages 907–918, doi:10.1016/j.cell.2008.10.025 Smell: “…major urinary proteins *…+ mediate the pregnancy blocking effects of male urine” P.A. Brennan, et al, Patterns of expression of the immediate-early gene egr-1 in the accessory olfactory bulb of female mice exposed to pheromonal constituents of male urine, Neuroscience, Volume 90, Issue 4, June 1999, P 1463– 1470, doi:10.1016/S0306-4522(98)00556-9 Hunger: “Out of the ~30K genes, about 10K are differentially expressed in liver cells when an animal is in different states of satiety.“ Zhang F, Xu X, Zhou B, He Z, Zhai Q (2011) Gene Expression Profile Change and Associated Physiological and Pathological Effects in Mouse Liver Induced by Fasting and Refeeding. PLoS ONE 6(11): e27553. doi:10.1371/journal.pone.002755 Light: “Longer-term enrichment training also altered the mRNA levels of many genes associated with structural changes that occur during neuronal growth.” Cailotto C., et al. (2009) Effects of Nocturnal Light on (Clock) Gene Expression in Peripheral Organs: A Role for the Autonomic Innervation of the Liver. PLoS ONE 4(5): e5650. doi:10.1371/journal.pone.0005650: => Knowing genes is not knowing how they are expressed!
  • 6.
    • “We foundthe diversity and abundance of each habitat’s signature microbes to vary widely even among healthy subjects, with strong niche specialization both within and among individuals.” The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome, Nature 486, 207–214 (14 June 2012) doi:10.1038/nature11234 • “Colonization of an infant’s gastrointestinal tract begins at birth. The acquisition and normal development of the neonatal microflora is vital for the healthy maturation of the immune system.” Mackie RI, Sghir A, Gaskins HR., Developmental microbial ecology of the neonatal gastrointestinal tract. Am J Clin Nutr. 1999 May;69(5):1035S-1045S Problem 3: No man (or mouse) is an island… => An animal is an ecosystem!
  • 7.
    Problem 4: Interactions createmore complexity: • Computing cancer: “No amount of information about what happens inside a single cell can ever tell you what a tissue is going to do,” *Glazier+ said. “Much of the information and complexity of tissues and life is embedded in the way cells talk to each other and the extracellular environment.” • Megadata:“These complex emergent systems are impossible to understand,”,”*we+ founded Applied Proteomics to create a protein diagnostic that reveals not just where a cancer is, but how it interacts with the body..” Nature Special Issue Vol. 491 No. 7425 ‘Physical Scientists Take On Cancer’ : => The whole is more than the sum of its parts!
  • 8.
    Big problems inbiology: http://en.wikipedia.org/wiki/File:Duck_of_Vaucanson.jpg 1. Interspecies variability > A specimen is not a species! 2. Gene expression variability > Knowing genes is not knowing how they are expressed! 3. Microbiome > An animal is an ecosystem! 4. Systems biology > Whole is more than the sum of its parts! 5. Models vs. experiment > Are we talking about the same things? In a way we can all use? 6. Dynamics > Life is not in equilibrium! Life is complicated! Reductionism doesn’t work for living systems.
  • 9.
    Statistics could help! Withenough observations, trends and anomalies can be detected: • “Here we present resources from a population of 242 healthy adults sampled at 15 or 18 body sites up to three times, which have generated 5,177 microbial taxonomic profiles from 16S ribosomal RNA genes and over 3.5 terabases of metagenomic sequence so far.” The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome, Nature 486, 207–214 (14 June 2012) doi:10.1038/nature11234 • “The large sample size — 4,298 North Americans of European descent and 2,217 African Americans — has enabled the researchers to mine down into the human genome.” Nidhi Subbaraman, Nature News, 28 November 2012, High-resolution sequencing study emphasizes importance of rare variants in disease.
  • 10.
    But biological researchis insular! • Biology is small: size 10^-5 – 10^2 m, scientist can work alone (‘King’ and ‘subjects’). • Biology is messy: it doesn’t happen behind a terminal. • Biology is competitive: many people with similar skill sets, vying for the same grants • In summary: the structure of biological research does not inherently promote collaboration (vs., for instance, HE physics or astronomy (and they’re not all they’re cracked up to be, either…)). Prepare Observe Analyze Ponder Communicate
  • 11.
    How Can WeConnect This Knowledge?
  • 12.
    Claim-Evidence Networks Offer AModel for Connecting Knowledge: Experimental Evidence
  • 13.
    Converging on Claim/Evidence/Networks,e.g. here: • The Karyotype Ontology: a computational representation for human cytogenetic patterns. Jennifer Warrender and Phillip Lord • Lexical Analysis and Characterization of the OBOFoundry Ontologies. Manuel Quesada-Martínez, Jesualdo Tomás Fernández-Breis and Robert Stevens • Exomiser: improved exome prioritization of disease genes through cross species phenotype comparison. Peter Robinson, Sebastian Köhler, Anika Oellrich, Kai Wang, Chris Mungall, Suzanna E. Lewis, Sebastian Bauer, Dominik Seelow, Peter Krawitz, Christian Gilissen, Melissa Haendel and Damian Smedley • BioAssay Ontology (BAO): Modularization, Integration and Applications. Uma Vempati, Hande Kucuk, Saminda Abeyruwan, Ubbo Visser, Vance Lemmon, Ahsan Mir and Stephan Schürer • eXframe: A Semantic Web Platform for Genomics Experiments. Emily Merrill, Stephane Corlosquet, Paolo Ciccarese, Tim Clark and Sudeshna Das • Ovopub: Modular data publication with minimal. provenance Alison Callahan and Michel Dumontier • Zooma – A tool for automated ontology annotation. Tony Burdett, Simon Jupp, James Malone, Helen Parkinson, Eleanor Williams and Adam Faulconbridge • A Probabilistic Framework for Ontology-Based Annotation in Neuroimaging Literature. Chayan Chakrabarti, Thomas B. Jones, Jiawei F. Xu, George F. Luger, Angela R. Laird, Matthew D. Turner and Jessica A. Turner • Preserving sequence annotations across reference sequences. Zuotian Tatum, Andrew Gibson, Marco Roos, Peter E.M. Taschner, Mark Thompson, Erik A. Schultes and Jeroen F. J. Laros • A Taxonomy for Immunologists. James A. Overton, Randi Vita, Jason A. Greenbaum, Heiko Dietze, Alessandro Sette and Bjoern Peters • Health Data Ontology Trunk: A middle-layer ontology for health- care. 
Ulf Schwarz, Luc Schneider, Emilio Sanfilippo, Holger Stenzhorn and Nikolina Koleva • Structured representation of scientific evidence using semantic web techniques – a biochemistry use case.Christian Bölling, Michael Weidlich and Hermann-Georg Holzhütter • Synthetic Biology Open Language Visual: an ontological use case. Jacqueline Quinn, Michal Galdzicki, Robert
  • 14.
    Step 1: Findclaims: E.g., using XIP for discourse analysis: In contrast with previous hypotheses compact plaques form before significant deposition of diffuse A beta, suggesting that different mechanisms are involved in the deposition of diffuse amyloid and the aggregation into plaques. Entities Relationships Temporality Connections thematic roles Status core information (proposition) information extraction rhetorical metadiscourse discourse analysis discourse analysisdiscourse structure Sándor, Àgnes and de Waard, Anita, (2012).
  • 15.
    Finding Claimed KnowledgeUpdates: Sandor, A. and de Waard, A. (2012) Here we used mass spectrometry to identify HuD as a novel neuronal SMN-interacting partner Our analysis of known HuD-associated mRNAs in neurons identified cpg15 mRNA as a highly abundant mRNA in HuD IPs Our finding that SMN protein associates with HuD protein and the HuD target cpg15 mRNA in neurons … Definition: 1) A CKU expresses a verbal or nominal proposition about biological entities. 2) A CKU is a new proposition. 3) The authors present the CKU as factual. 4) A CKU is derived from the experimental work described in the article. 5) The ownership of the proposition is attributed to the author(s) of the article. 6) 4) and 5) are either explicitly expressed or are implicitly conveyed by a structural position as title, section or caption title.
  • 16.
    Allow for Hedgingand Uncertainty: Ontology of Reasoning, Certainty and Attribution (ORCA) For a Proposition P, an epistemically marked clause E is an evaluation of P, where EV, B, S(P), with: – V = Value: 3 = Assumed true, 2 = Probable, 1 = Possible, 0 = Unknown, (- 1= possibly untrue, - 2 = probably untrue, -3 = assumed untrue) – B = Basis: Reasoning Data – S = Source: A = speaker is author A, explicit IA = speaker author, A, implicit N = other author N, explicit NN = other author NN, implicit Based on a conversation with Ed Hovy; de Waard, A. and Schneider, J. (2012)
  • 17.
    Turning claims intoformal representations: Biological statement with BEL/ epistemic markup BEL representation: Epistemic evaluation These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor-suppressor LATS2. r(MIR:miR-372) - |(tscript(p(HUGO:Trp53)) -| kin(p(PFH:”CDK Family”))) Increased abundance of miR- 372 decreases abundance of LATS2 r(MIR:miR-372) -| r(HUGO:LATS2) Value = Possible Source = Unknown Basis = Unknown Biological statement with Medscan/epistemic markup MedScan Representation: Epistemic evaluation Furthermore, we present evidence that the secretion of nesfatin-1 into the culture media was dramatically increased during the differentiation of 3T3-L1 preadipocytes into adipocytes (P < 0.001) and after treatments with TNF-alpha, IL-6, insulin, and dexamethasone (P < 0.01). IL-6  NUCB2 (nesfatin-1) Relation: MolTransport Effect: Positive CellType: Adipocytes Cell Line: 3T3-L1 Value = Probable Source = Author Basis = Data
  • 18.
    Claims Link to Evidence:
  • 19.
    The evidence is in data. To structure this:
    • There are many different research databases, both generic (Dryad, Dataverse, DataBank, Zenodo, etc.) and specific (NIF, IEDA, PDB)
    • There are many systems for creating/sharing workflows (Taverna, MyExperiment, Vistrails, Workflow4Ever)
    • There are many e-lab notebooks (LabGuru, LabArchives, LaBlog, etc.)
    • There are scores of projects, committees, standards, bodies, grants, initiatives, conferences for discussing and connecting all of this (KEfED, Pegasus, PROV, RDA, Science Gateways, Codata, BRDI, Earthcube, etc.)
    • … you could make a living out of this!
  • 20.
    …but this is what most scientists do: Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of their slides, and writes a paper. End of story.
  • 21.
    One attempt to structure data: CMU Urban Legend (de Waard, A., Burton, S. et al., 2013)
  • 22.
    Connecting experimental results:
    [Diagram: repeated Prepare / Analyze / Communicate stages, each with Observations]
    Across labs, experiments: track reagents and how they are used
  • 23.
    Connecting experimental results:
    [Diagram: repeated Prepare / Analyze / Communicate stages, each with Observations]
    Compare outcome of interactions with these entities
  • 24.
    Connecting experimental results:
    [Diagram: repeated Prepare / Analyze / Communicate stages, each with Observations]
    Build a ‘virtual reagent spectrogram’ by comparing how different entities interacted in different experiments
    Think Reason collectively!
  • 25.
    NIF Antibodies Registry collects antibody information:
  • 26.
    Step 3: Connect Claims and Evidence
    Example: Hunter et al., Hanalyzer:
  • 27.
    Example: Drug-Drug Interactions (Boyce, Schroeder et al., 2013)
    Step 1: Manually identify DDIs and drug names in wide collection of content sources
    Step 2: Develop a model of Drug-Drug Interaction and define candidates
    Step 3: Automate this process and store as Linked Data
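To make "store as Linked Data" concrete, here is a minimal sketch of how one candidate interaction might be written out as N-Triples. It is not the model used by Boyce, Schroeder et al.: the `http://example.org/ddi/` namespace, the predicate names (`hasDrug`, `assertedIn`) and the drug and document identifiers are all hypothetical placeholders.

```python
from typing import List, Tuple

# Hypothetical namespace; a real project would reuse established vocabularies.
BASE = "http://example.org/ddi/"

Triple = Tuple[str, str, str]


def ddi_triples(drug_a: str, drug_b: str, source_doc: str) -> List[Triple]:
    """Describe one candidate drug-drug interaction as subject/predicate/object triples."""
    interaction = f"{BASE}interaction/{drug_a}-{drug_b}"
    return [
        (interaction, BASE + "hasDrug", BASE + drug_a),
        (interaction, BASE + "hasDrug", BASE + drug_b),
        # Keep a pointer back to the content source the interaction was found in.
        (interaction, BASE + "assertedIn", BASE + source_doc),
    ]


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize IRI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Placeholder identifiers, not real drugs or documents.
output = to_ntriples(ddi_triples("drugA", "drugB", "doc1"))
print(output)
```

Giving the interaction its own identifier, rather than linking the two drugs directly, leaves room to attach the source document and, later, the evidence behind the claim.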
  • 28.
    Example: Connect recommendations in clinical guidelines to underlying evidence (Hoekstra, de Waard and Vdovjak, 2012)
  • 29.
    Example: do science ON the web:
    Using what is known about interactions in fly & yeast, predict new interactions with a human protein – Running over data on the web that he neither created nor knew about!
    Given a protein P in Species X:
    • Find proteins similar to P in Species Y
    • Retrieve interactors in Species Y
    • Sequence-compare Y-interactors with Species X genome (1) Keep only those with homologue in
    • Find proteins similar to P in Species Z
    • Retrieve interactors in Species Z
    • Sequence-compare Z-interactors with (1) Putative interactors in Species X
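The procedure on this slide can be sketched in a few lines of code. This is an illustration under stated assumptions, not the original workflow: the lookup tables stand in for web services, every identifier is made up, and the final "sequence-compare with (1)" step is approximated as keeping candidates that are supported by both model species.

```python
from typing import Dict, Set, Tuple

# Toy stand-ins for remote similarity and interaction services (all data hypothetical).
# (protein, species) -> proteins in that species that are similar to the protein
SIMILAR: Dict[Tuple[str, str], Set[str]] = {
    ("P", "fly"): {"fP"},
    ("P", "yeast"): {"yP"},
    ("fA", "human"): {"hA"},
    ("fB", "human"): set(),      # no human homologue: dropped
    ("yA", "human"): {"hA"},
    ("yC", "human"): {"hC"},
}
# protein -> known interaction partners within the same species
INTERACTORS: Dict[str, Set[str]] = {"fP": {"fA", "fB"}, "yP": {"yA", "yC"}}


def similar(protein: str, species: str) -> Set[str]:
    return SIMILAR.get((protein, species), set())


def candidates_via(protein: str, model_species: str, target_species: str) -> Set[str]:
    """Find interactors of the protein's homologues in a model species,
    then map those interactors back to homologues in the target species."""
    found: Set[str] = set()
    for homologue in similar(protein, model_species):
        for partner in INTERACTORS.get(homologue, set()):
            found |= similar(partner, target_species)
    return found


def putative_interactors(protein: str, target: str, species_y: str, species_z: str) -> Set[str]:
    via_y = candidates_via(protein, species_y, target)  # set (1) on the slide
    via_z = candidates_via(protein, species_z, target)
    # Assumption: keep only candidates supported by both model species.
    return via_y & via_z


result = putative_interactors("P", "human", "fly", "yeast")
print(result)
```

In this toy data only `hA` is reachable through both fly and yeast, so it is the single putative interactor returned.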
  • 30.
    Great! So we’re almost done, right – and we can all go home! Not so fast…
  • 31.
    What is a claim? In a paragraph?
    “Both seminomas and the EC component of nonseminomas share features with ES cells. To exclude that the detection of miR-371-3 merely reflects its expression pattern in ES cells, we tested by RPA miR-302a-d, another ES cells-specific miRNA cluster (Suh et al, 2004). In many of the miR-371-3 expressing seminomas and nonseminomas, miR-302a-d was undetectable (Figs S7 and S8), suggesting that miR-371-3 expression is a selective event during tumorigenesis.”
    [Figure: the same paragraph annotated by segment type: Fact, Hypothesis, Method, Result, Implication, Goal, Reg-Implication; grouped as Conceptual knowledge and Experimental Evidence]
  • 32.
    What is the claim? Who makes it?
    • Voorhoeve et al., 2006: “These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor suppressor LATS2.”
    • Kloosterman and Plasterk, 2006: “In a genetic screen, miR-372 and miR-373 were found to allow proliferation of primary human cells that express oncogenic RAS and active p53, possibly by inhibiting the tumor suppressor LATS2 (Voorhoeve et al., 2006).”
    • Okada et al., 2011: “Two oncogenic miRNAs, miR-372 and miR-373, directly inhibit the expression of Lats2, thereby allowing tumorigenic growth in the presence of p53 (Voorhoeve et al., 2006).”
    “[Y]ou can transform .. fiction into fact, just by adding or subtracting references”, Latour, 1987
  • 33.
    Evidence is largely lost…
    > 50 My Papers; 2 M scientists; 2 My papers/year
    • Majority of data (90%?) is stored on local hard drives
    • Some data (8%?) stored in large, generic data repositories: Dryad: 7,631 files; Dataverse: 0.6 My; Datacite: 1.5 My
    • A small portion of data (1-2%?) stored in small, topic-focused data repositories: MiRB: 25 k; PetDB: 1.5 k; TAIR: 72.1 k; PDB: 88.3 k; SedDB: 0.6 k
  • 34.
  • 35.
    …and you can’t extract it after the fact.
    • In 220 publications only 40% of antibodies, 40% of cell lines and 25% of constructs can be manually identified (Vasilevsky et al., submitted)
    • The good news: we can find automatically what we can find manually
    • Proposal (NIH, June 2013):
    – Author is asked to add methods section to a tool
    – Tool extracts likely reagents / resources
    – User interface asks author to confirm or select
    [Figure labels: 49 publications, 193 publications, 76 publications, 214 publications, 210 publications]
    Entity Type   Precision   Recall
    Antibody      87.5        63.3
    Resource      95.6        98.9
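For readers less familiar with the two columns in the table: precision is the share of extracted entities that are correct, and recall is the share of the manually identified entities that the tool found. The sketch below uses the standard definitions on made-up identifiers; it does not reproduce the evaluation behind the reported figures.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Compare automatically extracted entities against a manual (gold) annotation."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: four antibodies identified by hand, three extracted by a tool.
gold = {"ab1", "ab2", "ab3", "ab4"}
predicted = {"ab1", "ab2", "x9"}   # "x9" is a false positive

p, r = precision_recall(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f}")
```

A tool that then asks the author to confirm or select, as proposed on the slide, can tolerate lower precision, since the author removes false positives at submission time.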
  • 36.
    Even if we can link to evidence: • Is it true?
  • 37.
    In Summary: We’re not out of the woods (or a job) just yet!
  • 38.
    We need to improve claim networks:
    • Can we make systems of computer-readable meaning that still represent the fullness of natural language? >> Let’s work with computational linguists!
    • Trace claims across publications: >> Let’s work with legal/political argumentation specialists! Sentiment analysis!
  • 39.
    Improve evidence: scale up data curation!
    > 50 My Papers; 2 M scientists; 2 My papers/year
    • Majority of data (90%?) is stored on local hard drives
    • Some data (8%?) stored in large, generic data repositories: Dryad: 7,631 files; Dataverse: 0.6 My; Datacite: 1.5 My
    • A small portion of data (1-2%?) stored in small, topic-focused data repositories: MiRB: 25 k; PetDB: 1.5 k; TAIR: 72.1 k; PDB: 88.3 k; SedDB: 0.6 k
    INCREASE DATA DIGITISATION
    DEVELOP SUSTAINABLE MODELS
    IMPROVE REPOSITORY INTEROPERABILITY
  • 40.
    Keep asking big questions:
    • Is this true?
    • Does it matter?
    • To whom?
    “Let us now build systems that allow a kid in Mali who wants to learn about proteomics to not be overwhelmed by the irrelevant and the untrue.” - John Perry Barlow, iAnnotate, SF 2013
  • 41.
    In Memoriam Douglas C. Engelbart, 1925-2013:
    “This is an initial summary report of a project taking a new and systematic approach to improving the intellectual effectiveness of the individual human being. A detailed conceptual framework explores the nature of the system composed of the individual and the tools, concepts, and methods that match his basic capabilities to his problems. One of the tools that shows the greatest immediate promise is the computer, when it can be harnessed for direct on-line assistance, integrated with new concepts and methods.”
  • 42.
    Summary:
    • The problem: life is difficult.
    • One approach to tackle this: claim-evidence networks:
    – Find claims
    – Identify evidence
    – Connect the two.
    • But we still need:
    – Better ways to represent subtlety of natural language
    – Better evidence: more structured, better connected
    – Focus on the big questions.
    • There’s a lot of work to do!
  • 43.
    Collaborations and discussions gratefully acknowledged:
    • CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Ed Hovy
    • UCSD: Phil Bourne, Brian Schottlaender, Ilya Zaslavsky
    • NIF: Maryann Martone, Anita Bandrowski
    • MSU: Brian Bothner
    • OHSU: Melissa Haendel, Nicole Vasilevsky
    • CDL: Carly Strasser, John Kunze, Stephen Abrams
    • Harvard/MGH: Tim Clark, Paolo Ciccarese
    • VU: Rinke Hoekstra, Frank van Harmelen, Paul Groth
    • Columbia/IEDA: Kerstin Lehnert, Leslie Hsu
    • University of Pittsburgh: Richard Boyce
    • Xerox Research Europe: Agnes Sandor
    • DERI: Jodi Schneider
    Thank you!
  • 44.
    References:
    • de Waard, Buckingham Shum, Park, Samwald, Sandor, 2009: Hypotheses, Evidence and Relationships, ISWC 2009
    • Biological Expression Language – http://www.openbel.org
    • Latour, B. and Woolgar, S., Laboratory Life: the Social Construction of Scientific Facts, 1979, Sage Publications
    • Latour, B., Science in Action, 1987
    • de Waard, A. and Pander Maat, H. (2012). Epistemic Modality and Knowledge Attribution in Scientific Discourse: A Taxonomy of Types and Overview of Features. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 47–55, Jeju, Republic of Korea, 12 July 2012.
    • Data2Semantics project: http://www.data2semantics.org/
    • Sándor, Ágnes and de Waard, Anita (2012). Identifying Claimed Knowledge Updates in Biomedical Research Articles, Workshop on Detecting Structure in Scholarly Discourse, ACL 2012.
    • de Waard, A. and Schneider, J. (2012). Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution (ORCA), Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine workshop, ISWC 2012
    • de Waard, A., Burton, S.D., Gerkin, R.C., Harviston, M., Marques, D., Tripathy, S.J., Urban, N.N., Creating an Urban Legend: A System for Electrophysiology Data Management and Exploration, Discovery Informatics, 2013
    • Boyce, R.D., Horn, J.R., Hassanzadeh, O., de Waard, A., Schneider, J., Luciano, J.S., Liakata, M., Dynamic enhancement of drug product labels to support drug safety, efficacy, and effectiveness. Journal of Biomedical Semantics, 2013, 4:5.
    • Hoekstra, R., de Waard, A., Vdovjak, R. (2012). Annotating Evidenced Based Clinical Guidelines - A Lightweight Ontology, Proceedings of SWAT4LS 2012, Paris, Adrian Paschke, Albert Burger, Paolo Roma, M. Scott Marshall, Andrea Splendiani (ed.), Springer.
    http://researchdata.elsevier.com/