Why Life is Difficult, and What We Might Do About It



Keynote ISMB 2013 Bio-ontologies Workshop - http://www.bio-ontologies.org.uk/programme

Published in: Technology


  1. 1. Why Research Data Management May Save Science Anita de Waard VP Research Data Collaborations a.dewaard@elsevier.com http://researchdata.elsevier.com/ Why Life is Difficult, And What We Can Do About It
  2. 2. Outline: • The problem: life is difficult. • One approach to tackling this: claim-evidence networks. – How do we find claims? – How do we find evidence? – How do we connect the two? • What is still missing? • Call to action!
  3. 3. The Problem
  4. 4. Problem 1: a rose is not a rose: • “…there was significant variability of the injected venom composition from specimen to specimen, in spite of their common biogeographic origin.” Jose A. Rivera-Ortiz, Herminsul Cano, Frank Marí, Intraspecies variability of the injected venom of Conus ermineus, doi:10.1016/j.peptides.2010.11.014 • “…Strains DV-3/84 DV-7/84 (group 3) showed 76.6% similarity to each other and were similar to all other strains at the 67.6% level.” Zofia Dzierżewicz et al., Intraspecies variability of Desulfovibrio desulfuricans strains determined by the genetic profiles, FEMS Microbiology Letters, Volume 219, Issue 1, 14 February 2003, Pages 69–74, doi:10.1016/S0378-1097(02)01199-0 => A specimen is not a species!
  5. 5. Problem 2: gene expression varies with: Age: “SIRT1-Associated genes are deregulated in the aged brain” Philipp Oberdoerffer et al., SIRT1 Redistribution on Chromatin Promotes Genomic Stability but Alters Gene Expression during Aging, Cell, Volume 135, Issue 5, 28 November 2008, Pages 907–918, doi:10.1016/j.cell.2008.10.025 Smell: “…major urinary proteins […] mediate the pregnancy blocking effects of male urine” P.A. Brennan et al., Patterns of expression of the immediate-early gene egr-1 in the accessory olfactory bulb of female mice exposed to pheromonal constituents of male urine, Neuroscience, Volume 90, Issue 4, June 1999, Pages 1463–1470, doi:10.1016/S0306-4522(98)00556-9 Hunger: “Out of the ~30K genes, about 10K are differentially expressed in liver cells when an animal is in different states of satiety.” Zhang F, Xu X, Zhou B, He Z, Zhai Q (2011) Gene Expression Profile Change and Associated Physiological and Pathological Effects in Mouse Liver Induced by Fasting and Refeeding. PLoS ONE 6(11): e27553. doi:10.1371/journal.pone.002755 Light: “Longer-term enrichment training also altered the mRNA levels of many genes associated with structural changes that occur during neuronal growth.” Cailotto C., et al. (2009) Effects of Nocturnal Light on (Clock) Gene Expression in Peripheral Organs: A Role for the Autonomic Innervation of the Liver. PLoS ONE 4(5): e5650. doi:10.1371/journal.pone.0005650 => Knowing genes is not knowing how they are expressed!
  6. 6. Problem 3: No man (or mouse) is an island… • “We found the diversity and abundance of each habitat’s signature microbes to vary widely even among healthy subjects, with strong niche specialization both within and among individuals.” The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome, Nature 486, 207–214 (14 June 2012) doi:10.1038/nature11234 • “Colonization of an infant’s gastrointestinal tract begins at birth. The acquisition and normal development of the neonatal microflora is vital for the healthy maturation of the immune system.” Mackie RI, Sghir A, Gaskins HR., Developmental microbial ecology of the neonatal gastrointestinal tract. Am J Clin Nutr. 1999 May;69(5):1035S-1045S => An animal is an ecosystem!
  7. 7. Problem 4: Interactions create more complexity: • Computing cancer: “No amount of information about what happens inside a single cell can ever tell you what a tissue is going to do,” [Glazier] said. “Much of the information and complexity of tissues and life is embedded in the way cells talk to each other and the extracellular environment.” • Megadata: “These complex emergent systems are impossible to understand,” “[we] founded Applied Proteomics to create a protein diagnostic that reveals not just where a cancer is, but how it interacts with the body.” Nature Special Issue Vol. 491 No. 7425 ‘Physical Scientists Take On Cancer’ => The whole is more than the sum of its parts!
  8. 8. Big problems in biology: http://en.wikipedia.org/wiki/File:Duck_of_Vaucanson.jpg 1. Intraspecies variability > A specimen is not a species! 2. Gene expression variability > Knowing genes is not knowing how they are expressed! 3. Microbiome > An animal is an ecosystem! 4. Systems biology > The whole is more than the sum of its parts! 5. Models vs. experiment > Are we talking about the same things? In a way we can all use? 6. Dynamics > Life is not in equilibrium! Life is complicated! Reductionism doesn’t work for living systems.
  9. 9. Statistics could help! With enough observations, trends and anomalies can be detected: • “Here we present resources from a population of 242 healthy adults sampled at 15 or 18 body sites up to three times, which have generated 5,177 microbial taxonomic profiles from 16S ribosomal RNA genes and over 3.5 terabases of metagenomic sequence so far.” The Human Microbiome Project Consortium, Structure, function and diversity of the healthy human microbiome, Nature 486, 207–214 (14 June 2012) doi:10.1038/nature11234 • “The large sample size — 4,298 North Americans of European descent and 2,217 African Americans — has enabled the researchers to mine down into the human genome.” Nidhi Subbaraman, Nature News, 28 November 2012, High-resolution sequencing study emphasizes importance of rare variants in disease.
  10. 10. But biological research is insular! • Biology is small: size 10^-5 – 10^2 m, so a scientist can work alone (‘King’ and ‘subjects’). • Biology is messy: it doesn’t happen behind a terminal. • Biology is competitive: many people with similar skill sets, vying for the same grants. • In summary: the structure of biological research does not inherently promote collaboration (vs., for instance, HE physics or astronomy, and they’re not all they’re cracked up to be, either…). Diagram labels: Prepare, Observe, Analyze, Ponder, Communicate.
  11. 11. How Can We Connect This Knowledge?
  12. 12. Claim-Evidence Networks Offer A Model for Connecting Knowledge: Experimental Evidence
  13. 13. Converging on Claim/Evidence/Networks, e.g. here: • The Karyotype Ontology: a computational representation for human cytogenetic patterns. Jennifer Warrender and Phillip Lord • Lexical Analysis and Characterization of the OBOFoundry Ontologies. Manuel Quesada-Martínez, Jesualdo Tomás Fernández-Breis and Robert Stevens • Exomiser: improved exome prioritization of disease genes through cross species phenotype comparison. Peter Robinson, Sebastian Köhler, Anika Oellrich, Kai Wang, Chris Mungall, Suzanna E. Lewis, Sebastian Bauer, Dominik Seelow, Peter Krawitz, Christian Gilissen, Melissa Haendel and Damian Smedley • BioAssay Ontology (BAO): Modularization, Integration and Applications. Uma Vempati, Hande Kucuk, Saminda Abeyruwan, Ubbo Visser, Vance Lemmon, Ahsan Mir and Stephan Schürer • eXframe: A Semantic Web Platform for Genomics Experiments. Emily Merrill, Stephane Corlosquet, Paolo Ciccarese, Tim Clark and Sudeshna Das • Ovopub: Modular data publication with minimal provenance. Alison Callahan and Michel Dumontier • Zooma – A tool for automated ontology annotation. Tony Burdett, Simon Jupp, James Malone, Helen Parkinson, Eleanor Williams and Adam Faulconbridge • A Probabilistic Framework for Ontology-Based Annotation in Neuroimaging Literature. Chayan Chakrabarti, Thomas B. Jones, Jiawei F. Xu, George F. Luger, Angela R. Laird, Matthew D. Turner and Jessica A. Turner • Preserving sequence annotations across reference sequences. Zuotian Tatum, Andrew Gibson, Marco Roos, Peter E.M. Taschner, Mark Thompson, Erik A. Schultes and Jeroen F. J. Laros • A Taxonomy for Immunologists. James A. Overton, Randi Vita, Jason A. Greenbaum, Heiko Dietze, Alessandro Sette and Bjoern Peters • Health Data Ontology Trunk: A middle-layer ontology for health-care. Ulf Schwarz, Luc Schneider, Emilio Sanfilippo, Holger Stenzhorn and Nikolina Koleva • Structured representation of scientific evidence using semantic web techniques – a biochemistry use case. Christian Bölling, Michael Weidlich and Hermann-Georg Holzhütter • Synthetic Biology Open Language Visual: an ontological use case. Jacqueline Quinn, Michal Galdzicki, Robert
  14. 14. Step 1: Find claims: E.g., using XIP for discourse analysis: “In contrast with previous hypotheses compact plaques form before significant deposition of diffuse A beta, suggesting that different mechanisms are involved in the deposition of diffuse amyloid and the aggregation into plaques.” Diagram labels: Entities, Relationships, Temporality, Connections, thematic roles, Status; core information (proposition); information extraction; rhetorical metadiscourse; discourse analysis; discourse structure. Sándor, Àgnes and de Waard, Anita, (2012).
  15. 15. Finding Claimed Knowledge Updates: Sandor, A. and de Waard, A. (2012) Here we used mass spectrometry to identify HuD as a novel neuronal SMN-interacting partner Our analysis of known HuD-associated mRNAs in neurons identified cpg15 mRNA as a highly abundant mRNA in HuD IPs Our finding that SMN protein associates with HuD protein and the HuD target cpg15 mRNA in neurons … Definition: 1) A CKU expresses a verbal or nominal proposition about biological entities. 2) A CKU is a new proposition. 3) The authors present the CKU as factual. 4) A CKU is derived from the experimental work described in the article. 5) The ownership of the proposition is attributed to the author(s) of the article. 6) 4) and 5) are either explicitly expressed or are implicitly conveyed by a structural position as title, section or caption title.
  16. 16. Allow for Hedging and Uncertainty: Ontology of Reasoning, Certainty and Attribution (ORCA). For a Proposition P, an epistemically marked clause E is an evaluation of P, written E_{V,B,S}(P), with: – V = Value: 3 = Assumed true, 2 = Probable, 1 = Possible, 0 = Unknown (-1 = Possibly untrue, -2 = Probably untrue, -3 = Assumed untrue) – B = Basis: Reasoning, Data – S = Source: A = speaker is author A, explicit; IA = speaker is author A, implicit; N = other author N, explicit; NN = other author NN, implicit. Based on a conversation with Ed Hovy; de Waard, A. and Schneider, J. (2012)
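The value, basis and source scheme on slide 16 could be captured in a small data structure. The following is a minimal Python sketch, not the ORCA ontology itself: the class and field names are illustrative assumptions, and the `UNKNOWN` members for basis and source are added only because slide 17 uses "Unknown" for both.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Value(IntEnum):
    """Certainty scale copied from the slide (3 down to -3)."""
    ASSUMED_TRUE = 3
    PROBABLE = 2
    POSSIBLE = 1
    UNKNOWN = 0
    POSSIBLY_UNTRUE = -1
    PROBABLY_UNTRUE = -2
    ASSUMED_UNTRUE = -3


class Basis(Enum):
    REASONING = "Reasoning"
    DATA = "Data"
    UNKNOWN = "Unknown"  # assumption: not listed on slide 16, used on slide 17


class Source(Enum):
    A = "author, explicit"
    IA = "author, implicit"
    N = "other author, explicit"
    NN = "other author, implicit"
    UNKNOWN = "Unknown"  # assumption: not listed on slide 16, used on slide 17


@dataclass(frozen=True)
class EpistemicEvaluation:
    """Hypothetical container for E_{V,B,S}(P): a proposition plus its markup."""
    proposition: str
    value: Value
    basis: Basis
    source: Source


# The hedged miRNA statement, evaluated as on slide 17.
lats2_claim = EpistemicEvaluation(
    proposition="miR-372 decreases abundance of LATS2",
    value=Value.POSSIBLE,
    basis=Basis.UNKNOWN,
    source=Source.UNKNOWN,
)
print(lats2_claim.value.name, int(lats2_claim.value))
```

Using an integer enum for the value keeps the ordering of the scale, so two evaluations of the same proposition can be compared directly (for example, `Value.PROBABLE > Value.POSSIBLE`).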
  17. 17. Turning claims into formal representations: Biological statement with BEL/epistemic markup: “These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor-suppressor LATS2.” BEL representation: r(MIR:miR-372) -| (tscript(p(HUGO:Trp53)) -| kin(p(PFH:"CDK Family"))); Increased abundance of miR-372 decreases abundance of LATS2: r(MIR:miR-372) -| r(HUGO:LATS2). Epistemic evaluation: Value = Possible, Source = Unknown, Basis = Unknown. Biological statement with MedScan/epistemic markup: “Furthermore, we present evidence that the secretion of nesfatin-1 into the culture media was dramatically increased during the differentiation of 3T3-L1 preadipocytes into adipocytes (P < 0.001) and after treatments with TNF-alpha, IL-6, insulin, and dexamethasone (P < 0.01).” MedScan representation: IL-6 → NUCB2 (nesfatin-1); Relation: MolTransport; Effect: Positive; CellType: Adipocytes; Cell Line: 3T3-L1. Epistemic evaluation: Value = Probable, Source = Author, Basis = Data.
  18. 18. Claims Link to Evidence:
  19. 19. The evidence is in data. To structure this: • There are many different research databases, both generic (Dryad, Dataverse, DataBank, Zenodo, etc.) and specific (NIF, IEDA, PDB) • There are many systems for creating/sharing workflows (Taverna, MyExperiment, Vistrails, Workflow4Ever) • There are many e-lab notebooks (LabGuru, LabArchives, LaBlog, etc.) • There are scores of projects, committees, standards, bodies, grants, initiatives, conferences for discussing and connecting all of this (KEfED, Pegasus, PROV, RDA, Science Gateways, Codata, BRDI, Earthcube, etc.) • … you could make a living out of this!
  20. 20. …but this is what most scientists do: Using antibodies and squishy bits Grad Students experiment and enter details into their lab notebook. The PI then tries to make sense of their slides, and writes a paper. End of story.
  21. 21. One attempt to structure data: CMU Urban Legend de Waard, A., Burton, S. et al., 2013
  22. 22. Connecting experimental results: Across labs, experiments: track reagents and how they are used. Diagram labels: Prepare, Analyze, Communicate (repeated per experiment); Observations.
  23. 23. Connecting experimental results: Compare outcome of interactions with these entities. Diagram labels: Prepare, Analyze, Communicate (repeated per experiment); Observations.
  24. 24. Connecting experimental results: Build a ‘virtual reagent spectrogram’ by comparing how different entities interacted in different experiments. Think. Reason collectively! Diagram labels: Prepare, Analyze, Communicate (repeated per experiment); Observations.
  25. 25. NIF Antibodies Registry collects antibody information:
  26. 26. Step 3: Connect Claims and Evidence Example: Hunter et al., Hanalyzer:
  27. 27. Example: Drug-Drug Interactions (Boyce, Schroeder et al., 2013). Step 1: Manually identify DDIs and drug names in a wide collection of content sources. Step 2: Develop a model of Drug-Drug Interaction and define candidates. Step 3: Automate this process and store as Linked Data.
  28. 28. Example: Connect recommendations in clinical guidelines to underlying evidence (Hoekstra, de Waard and Vdovjak, 2012).
  29. 29. Example: do science ON the web: Using what is known about interactions in fly & yeast, predict new interactions with a human protein – Running over data on the web that he neither created nor knew about! Given a protein P in Species X: • Find proteins similar to P in Species Y • Retrieve interactors in Species Y • Sequence-compare Y-interactors with Species X genome => (1) Keep only those with homologue in • Find proteins similar to P in Species Z • Retrieve interactors in Species Z • Sequence-compare Z-interactors with (1) => Putative interactors in Species X
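The workflow on slide 29 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three lookup tables are hypothetical toy data standing in for the web services the slide alludes to, every protein name is made up, and reading the final "sequence-compare with (1)" step as a set intersection is an interpretation, not something the slide states.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy data; in the scenario on the slide these would be
# answers from independent services on the web.
SIMILAR: Dict[Tuple[str, str], Set[str]] = {
    ("P", "fly"): {"P_fly"},
    ("P", "yeast"): {"P_yeast"},
}
INTERACTORS: Dict[str, Set[str]] = {
    "P_fly": {"A_fly", "B_fly"},
    "P_yeast": {"A_yeast", "C_yeast"},
}
# Homologue in Species X for each interactor; C_yeast has none.
HOMOLOGUE_IN_X: Dict[str, str] = {
    "A_fly": "A",
    "B_fly": "B",
    "A_yeast": "A",
}


def candidates_from(protein: str, species: str) -> Set[str]:
    """Species-X homologues of the interactors of proteins similar to `protein`."""
    found: Set[str] = set()
    for similar in SIMILAR.get((protein, species), set()):
        for interactor in INTERACTORS.get(similar, set()):
            homologue = HOMOLOGUE_IN_X.get(interactor)
            if homologue is not None:  # keep only those with a homologue
                found.add(homologue)
    return found


def putative_interactors(protein: str, species_y: str, species_z: str) -> Set[str]:
    """Candidates supported by both model organisms (assumed: intersection)."""
    set_1 = candidates_from(protein, species_y)   # set (1) on the slide
    from_z = candidates_from(protein, species_z)
    return set_1 & from_z


print(putative_interactors("P", "fly", "yeast"))
```

With the toy tables above, only the candidate supported by both fly and yeast survives; a real pipeline would replace each dictionary lookup with a similarity search or a database query.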
  30. 30. Great! So we’re almost done, right – and we can all go home! Not so fast…
  31. 31. What is a claim? In a paragraph? “Both seminomas and the EC component of nonseminomas share features with ES cells. To exclude that the detection of miR-371-3 merely reflects its expression pattern in ES cells, we tested by RPA miR-302a-d, another ES cells-specific miRNA cluster (Suh et al, 2004). In many of the miR-371-3 expressing seminomas and nonseminomas, miR-302a-d was undetectable (Figs S7 and S8), suggesting that miR-371-3 expression is a selective event during tumorigenesis.” Annotation labels: Fact, Hypothesis, Method, Result, Implication, Goal, Reg-Implication; Conceptual knowledge, Experimental Evidence.
  32. 32. What is the claim? Who makes it? • Voorhoeve et al., 2006: “These miRNAs neutralize p53-mediated CDK inhibition, possibly through direct inhibition of the expression of the tumor suppressor LATS2.” • Kloosterman and Plasterk, 2006: “In a genetic screen, miR-372 and miR-373 were found to allow proliferation of primary human cells that express oncogenic RAS and active p53, possibly by inhibiting the tumor suppressor LATS2 (Voorhoeve et al., 2006).” • Okada et al., 2011: “Two oncogenic miRNAs, miR-372 and miR-373, directly inhibit the expression of Lats2, thereby allowing tumorigenic growth in the presence of p53 (Voorhoeve et al., 2006).” “[Y]ou can transform .. fiction into fact, just by adding or subtracting references”, Latour, 1987
  33. 33. Evidence is largely lost… > 50 My Papers; 2 M scientists; 2 My papers/year. Majority of data (90%?) is stored on local hard drives. Some data (8%?) stored in large, generic data repositories (Dryad: 7,631 files; Dataverse: 0.6 My; Datacite: 1.5 My). A small portion of data (1-2%?) stored in small, topic-focused data repositories (MiRB: 25k; PetDB: 1,5 k; TAIR: 72,1 k; PDB: 88,3 k; SedDB: 0.6 k).
  34. 34. …or buried..
  35. 35. …and you can’t extract it after the fact. • In 220 publications only 40% of antibodies, 40% of cell lines and 25% of constructs can be manually identified (Vasilevsky et al., submitted) • The good news: we can find automatically what we can find manually • Proposal (NIH, June 2013): – Author is asked to add methods section to a tool – Tool extracts likely reagents / resources – User interface asks author to confirm or select. Chart labels: 49, 193, 76, 214 and 210 publications. Extraction results by entity type (precision, recall): Antibody (87.5, 63.3); Resource (95.6, 98.9).
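To compare the two entity types on slide 35 with a single number, precision and recall can be combined into an F1 score (their harmonic mean). The slide reports only precision and recall; the F1 figures are derived here for illustration, and the function name is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the slide's table.
RESULTS = {"Antibody": (87.5, 63.3), "Resource": (95.6, 98.9)}

for entity, (precision, recall) in RESULTS.items():
    print(f"{entity}: F1 = {f1_score(precision, recall):.1f}")
```

On these figures the antibody extractor comes out around 73.5 and the resource extractor around 97.2, which makes the recall gap for antibodies easy to see.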
  36. 36. Even if we can link to evidence: • Is it true?
  37. 37. In Summary: We’re not out of the woods (or a job) just yet!
  38. 38. We need to improve claim networks: • Can we make systems of computer-readable meaning that still represent the fullness of natural language? >> Let’s work with computational linguists! • Trace claims across publications: >> Let’s work with legal/political argumentation specialists! Sentiment analysis!
  39. 39. Improve evidence: scale up data curation! > 50 My Papers; 2 M scientists; 2 My papers/year. Majority of data (90%?) is stored on local hard drives: INCREASE DATA DIGITISATION. Some data (8%?) stored in large, generic data repositories (Dryad: 7,631 files; Dataverse: 0.6 My; Datacite: 1.5 My): DEVELOP SUSTAINABLE MODELS. A small portion of data (1-2%?) stored in small, topic-focused data repositories (MiRB: 25k; PetDB: 1,5 k; TAIR: 72,1 k; PDB: 88,3 k; SedDB: 0.6 k): IMPROVE REPOSITORY INTEROPERABILITY.
  40. 40. Keep asking big questions: • Is this true? • Does it matter? • To whom? “Let us now build systems that allow a kid in Mali who wants to learn about proteomics to not be overwhelmed by the irrelevant and the untrue.” - John Perry Barlow, iAnnotate, SF 2013
  41. 41. In Memoriam Douglas C. Engelbart, 1925-2013: “This is an initial summary report of a project taking a new and systematic approach to improving the intellectual effectiveness of the individual human being. A detailed conceptual framework explores the nature of the system composed of the individual and the tools, concepts, and methods that match his basic capabilities to his problems. One of the tools that shows the greatest immediate promise is the computer, when it can be harnessed for direct on-line assistance, integrated with new concepts and methods.”
  42. 42. Summary: • The problem: life is difficult. • One approach to tackle this: claim-evidence networks: – Find claims – Identify evidence – Connect the two. • But we still need: – Better ways to represent subtlety of natural language – Better evidence: more structured, better connected – Focus on the big questions. • There’s a lot of work to do!
  43. 43. Collaborations and discussions gratefully acknowledged: • CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Ed Hovy • UCSD: Phil Bourne, Brian Shoettlander, Ilya Zaslavsky • NIF: Maryann Martone, Anita Bandrowski • MSU: Brian Bothner • OHSU: Melissa Haendel, Nicole Vasilevsky • CDL: Carly Strasser, John Kunze, Stephen Abrams • Harvard/MGH: Tim Clark, Paolo Ciccarese • VU: Rinke Hoekstra, Frank van Harmelen, Paul Groth • Columbia/IEDA: Kerstin Lehnert, Leslie Hsu • University of Pittsburgh: Richard Boyce • Xerox Research Europe: Agnes Sandor • DERI: Jodi Schneider Thank you!
  44. 44. References: • de Waard, Buckingham Shum, Park, Samwald, Sandor, 2009: Hypotheses, Evidence and Relationships, ISWC2009 • Biological Expression Language – http://www.openbel.org • Latour, B. and Woolgar, S., Laboratory Life: the Social Construction of Scientific Facts, 1979, Sage Publications • Latour, B., Science in Action, 1987 • de Waard, A. and Pander Maat, H. (2012). Epistemic Modality and Knowledge Attribution in Scientific Discourse: A Taxonomy of Types and Overview of Features. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 47–55, Jeju, Republic of Korea, 12 July 2012. • Data2Semantics project: http://www.data2semantics.org/ • Sándor, Àgnes and de Waard, Anita (2012). Identifying Claimed Knowledge Updates in Biomedical Research Articles, Workshop on Detecting Structure in Scholarly Discourse, ACL 2012. • de Waard, A. and Schneider, J. (2012) Formalising Uncertainty: An Ontology of Reasoning, Certainty and Attribution (ORCA), Semantic Technologies Applied to Biomedical Informatics and Individualized Medicine workshop, ISWC 2012 • de Waard, A., Burton, S.D., Gerkin, R.C., Harviston, M., Marques, D., Tripathy, S.J., Urban, N.N., Creating an Urban Legend: A System for Electrophysiology Data Management and Exploration, Discovery Informatics, 2013 • Boyce, R.D., Horn, J.R., Hassanzadeh, O., de Waard, A., Schneider, J., Luciano, J.S., Liakata, M., Dynamic enhancement of drug product labels to support drug safety, efficacy, and effectiveness. Jnl of Biomedical Semantics, 2013, 4:5. • Hoekstra, R., de Waard, A., Vdovjak, R. (2012) Annotating Evidenced Based Clinical Guidelines - A Lightweight Ontology, Proceedings of SWAT4LS 2012, Paris, Adrian Paschke, Albert Burger, Paolo Roma, M. Scott Marshall, Andrea Splendiani (ed.), Springer. http://researchdata.elsevier.com/