
The CEDAR Workbench for Metadata Management



With the explosion of interest in both enhanced knowledge management and open science, the past few years have seen considerable discussion about making scientific data "FAIR": findable, accessible, interoperable, and reusable. The problem is that most scientific datasets are not FAIR. When left to their own devices, scientists do a terrible job creating the metadata that describe the experimental datasets that make their way into online repositories. The lack of standardization makes it extremely difficult for other investigators to locate relevant datasets, to re-analyze them, and to integrate them with other data. The Center for Expanded Data Annotation and Retrieval (CEDAR) has the goal of enhancing the authoring of experimental metadata to make online datasets more useful to the scientific community. This webinar presents the CEDAR Workbench for metadata management. CEDAR illustrates the importance of semantic technology in driving open science, and it demonstrates a means for simplifying access to scientific datasets and enhancing the reuse of those data to drive new discoveries.

Published in: Data & Analytics


  1. 1. The CEDAR Workbench for Metadata Management Mark A. Musen, M.D., Ph.D. Stanford University musen@Stanford.EDU
  2. 2. A Story
  3. 3. Purvesh Khatri, Ph.D. A self-professed “data parasite”
  4. 4. Gene Expression Omnibus (GEO)
  5. 5. Khatri has used GEO to build a career from the analysis of “other people’s data” Khatri has a pipeline for • Searching for datasets in the GEO database • Elucidating genomic signals • Confirming those signals in “validation” data sets • Making discoveries without ever performing a primary experiment
  6. 6. Khatri has reused public data sets to identify genomic signatures … • For incipient sepsis • For active tuberculosis • For distinguishing viral from bacterial respiratory infection • For rejection of organ transplants … and he has never touched a pipette!
  7. 7. But you can be a data parasite only if scientific data are … • Available in an accessible repository • Findable through some sort of search facility • Retrievable in a standard format • Self-describing so that third parties can make sense of the data • Intended to be reused, and to outlive the experiment for which they were collected
  8. 8. But data in most repositories are a mess! • Investigators view their work as publishing papers, not leaving a legacy of reusable data • Sponsors—both public and private—may require data sharing, but they do not explicitly pay for it • Creating the metadata to describe data sets is unbearably hard
  9. 9. age Age AGE `Age age (after birth) age (in years) age (y) age (year) age (years) Age (years) Age (Years) age (yr) age (yr-old) age (yrs) Age (yrs) age [y] age [year] age [years] age in years age of patient Age of patient age of subjects age(years) Age(years) Age(yrs.) Age, year age, years age, yrs age.year age_years Failure to use standard terms makes datasets often impossible to search
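The mess of "age" spellings above can be attacked with even simple normalization. A minimal sketch, assuming we only need to collapse these variants onto one canonical field name; the rule and the variant list are illustrative, not CEDAR's actual logic:

```python
import re

def canonical_field(name: str) -> str:
    # Lowercase and collapse all punctuation/digits/whitespace to single spaces,
    # so "Age (Years)", "age_years", and "age [y]" all reduce to "age ...".
    key = re.sub(r"[^a-z]+", " ", name.lower()).strip()
    # Heuristic: any field whose first word is "age" is the canonical "age" field.
    if key.split()[:1] == ["age"]:
        return "age"
    return key

variants = ["Age (Years)", "age_years", "AGE", "age [y]", "age of patient"]
print({v: canonical_field(v) for v in variants})
```

A rule this crude would misfire on genuinely different fields ("age of onset" vs. "age at death"), which is exactly why controlled terms beat after-the-fact cleanup.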
  10. 10. Metadata from NCBI BioSample Stink! • 73% of “Boolean” metadata values are not actually true or false – nonsmoker, former-smoker • 26% of “integer” metadata values cannot be parsed into integers – JM52, UVPgt59.4, pig • 68% of metadata entries that are supposed to represent terms from biomedical ontologies do not actually do so. – presumed normal, wild_type
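Percentages like those above come from auditing whether values declared as a given type actually parse as that type. A minimal sketch of that kind of check, with invented sample values; this is not the published analysis code:

```python
# Literals we are willing to accept as boolean (an assumption for this sketch).
BOOLEAN_LITERALS = {"true", "false", "yes", "no", "1", "0"}

def is_valid_boolean(value: str) -> bool:
    return value.strip().lower() in BOOLEAN_LITERALS

def is_valid_integer(value: str) -> bool:
    try:
        int(value.strip())
        return True
    except ValueError:
        return False

# Illustrative values, echoing the slide's examples.
declared_boolean = ["true", "nonsmoker", "former-smoker", "false"]
declared_integer = ["42", "JM52", "UVPgt59.4", "pig"]

bad_bool = [v for v in declared_boolean if not is_valid_boolean(v)]
bad_int = [v for v in declared_integer if not is_valid_integer(v)]
print(bad_bool)  # declared-boolean values that are not actually boolean
print(bad_int)   # declared-integer values that cannot be parsed as integers
```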
  11. 11. At a minimum, science needs • Open, online access to experimental data sets • Mechanisms to search for metadata to find relevant experimental results • Annotation of online data sets with adequate metadata • Use of controlled terms in metadata whenever possible • A culture that cares about data sharing, secondary analyses, and the ability to make new discoveries from legacy data
  12. 12. Requirement #1: Standard ontologies to describe what exists in a dataset completely and consistently
  13. 13.
  14. 14. BioPortal User Activity—Interactive
  15. 15. BioPortal User Activity—Programmatic
  16. 16. Requirement #2: Describe properties of experiments completely and consistently
  17. 17. When we look at experimental data … • How can we tell what the data are about? • How can we search for additional data that we might be interested in? • How can we group data into similar categories?
  18. 18. The microarray community took the lead in standardizing biological metadata DNA Microarray — What was the substrate of the experiment? — What array platform was used? — What were the experimental conditions?
  19. 19. But it didn’t stop with MIAME! • Minimal Information About T Cell Assays (MIATA) • Minimal Information Required in the Annotation of biochemical Models (MIRIAM) • MINImal MEtagenome Sequence analysis Standard (MINIMESS) • Minimal Information Specification For In Situ Hybridization and Immunohistochemistry Experiments (MISFISHIE)
  20. 20. [Slide: a community registry] lists more than 100 guidelines for reporting experimental data
  21. 21. Minimal Information Guidelines are not Models • MIAME and its kin specify only the “kinds of things” that investigators should include in their metadata • They do not provide a detailed list of standard metadata elements • They do not provide datatypes for valid metadata entries • It takes work to convert a prose checklist into a computable model
  22. 22. To make sense of metadata … • We need computer-based standards – For what we want to say about experimental data – For the language that we use to say those things – For how we structure the metadata • We need a scientific culture that values making experimental data accessible to others
  23. 23. Requirement #3: Make it palatable—maybe even fun—to describe experiments completely and consistently
  24. 24.
  25. 25. A Metadata Ecosystem • Scientists perform experiments and generate datasets • Community-based standards groups create minimal information models (like MIAME) to define requirements for annotating particular kinds of experimental data in a uniform manner • CEDAR users create formal metadata templates based on community standards • CEDAR users fill in metadata templates to create instances of metadata • CEDAR stores the datasets (and the metadata instances) in public repositories
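The template-and-instance idea in this ecosystem can be sketched in a few lines. The field names and rules below are invented for illustration; CEDAR's real templates are considerably richer and ontology-aware:

```python
# A toy metadata "template": each field declares a datatype and whether it
# is required. A filled-in "instance" is validated against the template.
TEMPLATE = {
    "organism":  {"type": str, "required": True},
    "tissue":    {"type": str, "required": True},
    "age_years": {"type": int, "required": False},
}

def validate(instance: dict) -> list:
    errors = []
    for field, rule in TEMPLATE.items():
        if field not in instance:
            if rule["required"]:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(instance[field], rule["type"]):
            errors.append(f"wrong type for {field}: {instance[field]!r}")
    return errors

good = {"organism": "Homo sapiens", "tissue": "brain", "age_years": 52}
bad = {"organism": "Homo sapiens", "age_years": "JM52"}
print(validate(good))  # no errors
print(validate(bad))   # missing tissue; non-integer age
```

The point of the ecosystem is that the template, not the scientist's memory, carries the community standard, so every instance filled against it is uniform by construction.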
  26. 26. The CEDAR Approach to Metadata
  27. 27. The CEDAR Workbench
  28. 28. The CEDAR Workbench
  29. 29.
  30. 30. [Slide: given the value “brain,” CEDAR suggests likely disease terms with confidence scores: Parkinson’s disease (DOID) 39%, central nervous system lymphoma (DOID) 27%, autistic disorder (DOID) 22%, melanoma (DOID) 5%, Edwards syndrome (DOID) 2%, schizophrenia (DOID) 1%]
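One rough way such value suggestions could be ranked: count how often each candidate term has co-occurred with "brain" in previously entered metadata, then rank by relative frequency. The counts below are invented to mirror the slide; CEDAR's actual recommender is more sophisticated:

```python
from collections import Counter

# Hypothetical co-occurrence counts of disease terms with the value "brain"
# in previously submitted metadata (invented numbers).
cooccurrence = Counter({
    "Parkinson's disease": 39,
    "central nervous system lymphoma": 27,
    "autistic disorder": 22,
    "melanoma": 5,
    "Edwards syndrome": 2,
    "schizophrenia": 1,
})

total = sum(cooccurrence.values())
# Top suggestions with their relative frequencies.
suggestions = [(term, count / total) for term, count in cooccurrence.most_common(3)]
for term, p in suggestions:
    print(f"{term}: {p:.0%}")
```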
  31. 31. The CEDAR Workbench
  32. 32. age Age AGE `Age age (after birth) age (in years) age (y) age (year) age (years) Age (years) Age (Years) age (yr) age (yr-old) age (yrs) Age (yrs) age [y] age [year] age [years] age in years age of patient Age of patient age of subjects age(years) Age(years) Age(yrs.) Age, year age, years age, yrs age.year age_years
  33. 33. AIRR is providing our first experience uploading CEDAR-authored metadata directly to NCBI
  34. 34. AIRR: CEDAR serves as a conduit to various NCBI repositories. [Workflow diagram: (1) TEMPLATE CREATION: data curators use the Template Designer, backed by biomedical ontologies, to create a MiAIRR template from the MiAIRR standard. (2) METADATA EDITING: scientists generate experimental data and enter metadata in the Metadata Editor. (3) SUBMISSION: the Submission Manager, Submission Generator, and Submission Service deliver metadata and data through the NCBI FTP Server to the BioProject, BioSample, and SRA repositories at NCBI, while the Submission Status Monitor lets users check submission status.]
  35. 35. LINCS: CEDAR provides access to a central data-coordinating center. [Workflow diagram: (1) TEMPLATE CREATION: curators use the Template Designer, backed by biomedical ontologies, to create LINCS templates. (2) METADATA EDITING: scientists generate experimental data and enter metadata in the Metadata Editor. (3) SUBMISSION: metadata and data are submitted to the LINCS Repository.]
  36. 36.
  37. 37. GO-FAIR: Sponsors use CEDAR to declare minimal metadata for grantees
  38. 38. With CEDAR, the data parasites of the world will be able to learn from data that are “cleaner,” better organized, and more searchable
  39. 39. Scientists will learn when new results become available that are relevant to their work, and how those results relate to the data that they are already collecting
  40. 40. Physicians will be able to find the clinical evidence that best helps them to select treatments for their patients
  41. 41. Pharma will know what experiments have already been done—by their own company and by those with which they have merged!
  42. 42. Once scientific knowledge and data are disseminated in machine-interpretable form … • Technology such as CEDAR will assist in the automated “publication” of scientific results online • Intelligent agents will – Search and “read” the “literature” – Integrate information – Track scientific advances – Re-explore existing scientific datasets – Suggest the next set of experiments to perform – And maybe even do them!
  43. 43.