RDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework



Research Data Access and Preservation Summit, 2014
San Diego, CA
March 26-28, 2014

Maryann Martone, Principal Investigator, Neuroscience Information Framework, University of California, San Diego


  1. 1. The Neuroscience Information Framework. Maryann E. Martone, Ph.D., University of California, San Diego
  2. 2. We say this to each other all the time, but we set up systems for scholarly advancement and communication that are the antithesis of integration. Whole brain data (20 µm microscopic MRI); mosaic LM images (1 GB+); conventional LM images; individual cell morphologies; EM volumes & reconstructions; solved molecular structures. No single technology serves these all equally well. Multiple data types; multiple scales; multiple databases: a data integration problem.
  3. 3. • NIF is an initiative of the NIH Blueprint consortium of institutes – What types of resources (data, tools, materials, services) are available to the neuroscience community? – How many are there? – What domains do they cover? What domains do they not cover? – Where are they? • Web sites • Databases • Literature • Supplementary material • PDF files • Desk drawers – Who uses them? – Who creates them? – How can we find them? – How can we make them better in the future? http://neuinfo.org
  4. 4. How many resources are there? • NIF Registry: a catalog of neuroscience-relevant resources • > 10,000 currently listed • > 2,500 databases • And we are finding more every day. June 10, 2013
  5. 5. But we have Google! • Current web is designed to share documents – Documents are unstructured data • Much of the content of digital resources is part of the “hidden web” • Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
  6. 6. Which databases do you use? • Mouse Genome Database • Allen Brain Atlas • ClinicalTrials.gov • PubMed • dbGaP • GEO • NIH RePORTER • OMIM • BioNumbers – a database of numerical values extracted from literature • Epigenomics – human epigenomic data to catalyze basic biology and disease-oriented research • Antibody Registry – 2M antibodies • BioGRID – an interaction repository of protein and genetic interactions. Most resources are largely unknown and underutilized.
  7. 7. NIF: A New Type of Entity for New Modes of Scientific Dissemination • NIF’s mission is to maximize the awareness of, access to and utility of research resources produced worldwide to enable better science and promote efficient use – NIF unites neuroscience information without respect to domain, funding agency, institute or community – NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases – Makes them searchable from a single interface – Practical and cost-effective; tries to be sensible – Learned a lot about effective data sharing
  8. 8. How do resources get added to the NIF? Two-tiered system: low barrier to entry. NIF Registry (requires no special skills; site map available for local hosting): • NIF curators • Nomination by the community • Semi-automated text mining pipelines. NIF Data Federation: • DISCO interop • Requires some programming skill • Open Source Brain < 2 hr
  9. 9. NIF searches across 3 main indices: Registry, Federation and Literature. Data Federation: 200 databases/400M records. Registry: 6300 resources (2500 databases). Literature: 22 million articles.
  10. 10. What resources are available for GRM1? With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
  11. 11. NIF makes it easier to browse different databases. Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”. Query expansion: synonyms and related concepts. Boolean queries. Data sources categorized by “data type” and level of nervous system. Common views across multiple sources. Tutorials for using full resource when getting there from NIF. Link back to record in original source.
  12. 12. Making it easier to access and understand distributed databases. Each resource implements a different, though related, model; in many cases, systems are complex and difficult to learn.
  13. 13. NIF Semantic Framework: NIFSTD ontology • NIF covers multiple structural scales and domains of relevance to neuroscience • Aggregate of community ontologies (e.g., Gene Ontology, ChEBI, Protein Ontology) with some extensions for neuroscience. NIF capitalizes on the growing set of community ontologies available in biomedical science. [Diagram: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]
  14. 14. NIF Concept Mapper: Reducing false positives
  15. 15. Is there a framework for neuroscience? • Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories: – Organism – Anatomical structure – Cell – Molecule – Function – Dysfunction – Technique • When NIF combines multiple sources, a set of common fields emerges – Basic information models/semantic models exist for certain types of entities. Biomedical science does have a conceptual framework.
  16. 16. Bringing knowledge to data: ontologies as framework. There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent. [Diagram labels: Purkinje cell, axon terminal, axon, dendritic tree, dendritic spine, dendrite, cell body, cerebellar cortex]
  17. 17. This is your brain on computers. Neurolex: > 1 million triples. NIF Cell Graph. Dr. Yi Zeng: Chinese neural knowledge base.
  18. 18. NIF Concept-Based Search • Incorporate basic neuroscience knowledge into search – Google: searches for the string “GABAergic neuron” – NIF automatically searches for types of GABAergic neurons. [Figure: types of GABAergic neurons] Neuroscience Information Framework – http://neuinfo.org
  19. 19. Ontologies as a data integration framework • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions – Brain Architecture Management System (rodent) – Temporal lobe.com (rodent) – Connectome Wiki (human) – Brain Maps (various) – CoCoMac (primate cortex) – UCLA Multimodal database (Human fMRI) – Avian Brain Connectivity Database (Bird) • Total: 1800 unique brain terms (excluding Avian) • Number of exact terms used in > 1 database: 42 • Number of synonym matches: 99 • Number of 1st order partonomy matches: 385
  20. 20. Open World-Closed World: mapping the knowledge-data space. NIF lets us ask: where isn’t there data? What isn’t studied? Why? [Figure: number of data sources; legend: 0, 1-10, 11-100, >101]
  21. 21. The data space is not uniform. [Figure: number of data sources for Forebrain, Midbrain, Hindbrain; legend: 0, 1-10, 11-100, >101]
  22. 22. “The Data Homunculus”: funding drives representation in the data space
  23. 23. What can we learn from the NIF Registry? NIF supports a semantic model for describing research resources
  24. 24. Resource Curation • NIF Registry is hosted on the Semantic MediaWiki platform Neurolex – Community can add, review, edit without special privileges – Searchable by Google – Integrated with NIF ontologies – Graph structure http://neurolex.org
  25. 25. Can we mine relationships between resources? NIF semantic graph of research resources. Text mining gives a picture of the most used resources. [Figure label: PDB] http://neuinfo.org http://force11.org/Resource_identification_initiative
  26. 26. Tracking digital resources since 2008 • Automated text mining is used to look for “web page last updated” or copyright dates – Identified for 570 resources – 373 were not updated within the last 2 years (65%) • Manual review of ~200 resources – 38 not updated within the past 2 years (~20%) – 8 migrated to new addresses or institutions – 7 are no longer in service (~3%) – 3 were deemed no longer appropriate. NIF helps stabilize the dynamic resource landscape.
  27. 27. Keeping content up to date • New tags come into existence • New resource types come into existence, e.g., mobile apps • Resources add new types of content • Change name • Change scope • > 7000 updates to the registry last year. It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review. [Figure labels: Connectome, Tractography, Epigenetics]
  28. 28. What can we learn from the NIF Data Federation? NIF supports a semantic model for describing research resources
  29. 29. Data Federation Growth. NIF searches the largest collation of neuroscience-relevant data on the web. [Chart: Number of Federated Databases and Number of Federated Records (Millions), Jun-08 to May-13; label: DISCO] Slide footer: dkCOIN Investigator's Retreat, June 10, 2013
  30. 30. What do you mean by data? Databases come in many shapes and sizes • Primary data: – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL) • Secondary data: – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS) • Tertiary data: – Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task • Registries: – Metadata – Pointers to data sets or materials stored elsewhere • Data aggregators: – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede • Single source: – Data acquired within a single context, e.g., Allen Brain Atlas. Researchers are producing a variety of information artifacts using a multitude of technologies.
  31. 31. What have we learned: Grabbing the long tail of small data • NIF is in a unique position to ask questions against the data resource landscape • The data space is not uniform • Data “flows” from one resource to the next – Data is reinterpreted, reanalyzed or added to • Currently very difficult to track data as it moves across the landscape – Makes it difficult to learn from combined efforts. NIF is trying to make it easier to work with diverse data.
  32. 32. Phases of NIF • 2006-2008: A survey of what was out there • 2008-2009: Strategy for resource discovery – NIF Registry vs NIF Data Federation – Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF – Effective search across semantically diverse sources • NIFSTD ontologies • 2009-2011: Strategy for data integration – Unified views across common sources – Mapping of content to NIF vocabularies • 2011-present: Data analytics – Uniform external data references • 2012-present: SciCrunch: unified biomedical resource services. NIF provides a strategy and set of tools applicable to all biomedical science.
  33. 33. NIF team (past and present): Jeff Grethe, UCSD, Co Investigator, Interim PI; Amarnath Gupta, UCSD, Co Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, Caltech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum; Fahim Imam; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Svetlana Sulima; Davis Banks; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer (retired); Jonathan Pollock, NIH, Program Officer. And my colleagues in Monarch, dkNet, 3DVC, Force 11.
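
Illustration for slide 11 (query expansion): the sketch below shows one way a search term could be expanded into a Boolean OR query over its synonyms, as in the slide's Hippocampus example. It is not NIF's implementation. The `SYNONYMS` table and the `expand_query` function are hypothetical stand-ins; the slides indicate that NIF draws synonyms and related concepts from its ontologies, but do not describe the mechanism.

```python
# Hypothetical synonym table for illustration only; it is not NIF data
# and does not reflect any NIF API.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term, synonyms=SYNONYMS):
    """Return a Boolean OR query over a term and its known synonyms."""
    alternatives = [term] + synonyms.get(term.lower(), [])
    # Quote multi-word phrases so a search engine treats each as one unit.
    quoted = [f'"{t}"' if " " in t else t for t in alternatives]
    return " OR ".join(quoted)


print(expand_query("Hippocampus"))
# prints: Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"
```

A term with no entry in the table is returned unchanged, so the expansion step can be applied to every query.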