A Deep Survey of the Digital Resource Landscape


A Deep Survey of the Digital Resource Landscape: Perspectives from the Neuroscience Information Framework

Published in: Technology, Education
  • Lists all NIF resources registered at levels 2+ in the DISCO server. Shows their DISCO services and the location of DISCO files. Controls to filter, sort, and page all resources.

    1. A Deep Survey of the Digital Resource Landscape: Perspectives from the Neuroscience Information Framework
        Maryann E. Martone, Ph.D., University of California, San Diego
    2. NIF is an initiative of the NIH Blueprint consortium of institutes (http://neuinfo.org)
        – What types of resources (data, tools, materials, services) are available to the neuroscience community?
        – How many are there?
        – What domains do they cover? What domains do they not cover?
        – Where are they?
          • Web sites
          • Databases
          • Literature
          • Supplementary material
          • PDF files
          • Desk drawers
        – Who uses them?
        – Who creates them?
        – How can we find them?
        – How can we make them better in the future?
    3. The Neuroscience Information Framework
        • NIF has developed a production technology platform for researchers to discover, share, analyze, and integrate neuroscience-relevant information
        • Since 2008, NIF has assembled the largest searchable catalog of neuroscience data and resources on the web
        • Cost-effective and innovative strategy for managing data assets
        • “This unique data depository serves as a model for other Web sites to provide research data.” (Choice Reviews Online)
        • NIF is poised to capitalize on the new tools and emphasis on big data and open science
    4. The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience
        http://neuinfo.org; June 10, 2013, dkCOIN Investigator's Retreat
        • A portal for finding and using neuroscience resources
          – A consistent framework for describing resources
          – Provides simultaneous search of multiple types of information, organized by category
          – Supported by an expansive ontology for neuroscience
          – Utilizes advanced technologies to search the “hidden web”
        • UCSD, Yale, Cal Tech, George Mason, Washington Univ
        • Literature, Database Federation, Registry
    5. Part 1: Surveying the resource landscape
        • NIF Registry: A catalog of neuroscience-relevant resources
        • > 6000 currently listed
        • > 2200 databases
        • And we are finding more every day
    6. How do resources get added to the NIF Registry?
        • NIF curators
        • Nomination by the community
        • Semi-automated text mining pipelines
        • NIF Registry: requires no special skills; site map available for local hosting
        • NIF Data Federation: DISCO interop; requires some programming skill
        (Bandrowski et al., 2012)
    7. NIF Registry
        • First catalog: SFN Neuroscience Database Gateway (~2003) → NIF 0.5 (2006) → NIF 1.0+ (2008)
        • Simple metadata model: name, description, type, URL, other names, keywords, unique identifier
        • Extended over time
          – Parent resource
          – Supporting agency
          – Grant numbers
          – Accessibility
          – Related to
          – Organism
          – Disease or condition
          – Last updated
    8. Resource Curation
        • NIF Registry is hosted on the Semantic MediaWiki platform Neurolex (http://neurolex.org)
          – Community can add, review, edit without special privileges
          – Searchable by Google
          – Integrated with NIF ontologies
          – Graph structure
    9. The resource graph: NIF is creating the linked data graph of resources
    10. Keeping the Registry Current
        – NIF employs an automated link checker
        – Last analysis: 478/6100 invalid URLs (~8%)
        – 199 could not be located at another university or location → out of service (~3%)
        – Bigger issue: Many resources are no longer updated or maintained
        [Chart: resources added and resources last updated, by year, 1996 to 2014]
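The slide does not describe how NIF's link checker is built. As a minimal sketch of the idea only, the following assumes a plain HTTP HEAD request per registry URL and treats any status outside 200 to 399 as invalid; `http_status`, `find_invalid`, and `invalid_share` are hypothetical names, not part of NIF.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if no response could be obtained."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return 0


def find_invalid(urls: Iterable[str],
                 fetch: Callable[[str], int] = http_status) -> List[str]:
    """Return the URLs whose status is not in the 200-399 range."""
    return [url for url in urls if not 200 <= fetch(url) < 400]


def invalid_share(invalid: int, total: int) -> float:
    """Percentage of invalid entries, as reported on the slide."""
    return 100.0 * invalid / total
```

With the slide's figures, `invalid_share(478, 6100)` is roughly 7.8, which matches the reported ~8%. A real checker would also need retries and a manual follow-up step, since a dead URL may mean the resource moved rather than disappeared.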
    11. Tracking the fate of digital resources
        • Automated text mining is used to look for “web page last updated” or copyright dates
          – Identified for 570 resources; manual review suggested that the results were accurate, although we can’t guarantee that the date itself is accurate
          – 373 were not updated within the last 2 years (65%)
        • Manual review of ~200 resources identified by 3DVC for their catalog
          – 38 not updated within the past 2 years (~20%)
          – 8 migrated to new addresses or institutions
          – 7 are no longer in service (~3%)
          – 3 were deemed no longer appropriate
        Yuling Li, Paul Sternberg, Cal Tech
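The text mining pipeline itself is not shown in the slides. The sketch below illustrates the general approach under simple assumptions: it looks for a year near a "last updated" phrase, falls back to the latest year in a copyright notice, and flags a page as stale if that year is more than two years old. The patterns and the function names are illustrative guesses, not the Cal Tech implementation.

```python
import re
from typing import Optional

# A year that follows "last updated" / "last modified" within a short window.
LAST_UPDATED = re.compile(
    r"last\s+(?:updated|modified).{0,40}?\b((?:19|20)\d{2})\b",
    re.IGNORECASE,
)

# A copyright year or year range, e.g. "Copyright © 2005-2010".
COPYRIGHT = re.compile(
    r"(?:copyright|©|\(c\)).{0,20}?\b((?:19|20)\d{2})\b"
    r"(?:\s*[-–]\s*((?:19|20)\d{2})\b)?",
    re.IGNORECASE,
)


def estimate_last_update_year(page_text: str) -> Optional[int]:
    """Best-effort guess at the year a page was last touched, or None."""
    match = LAST_UPDATED.search(page_text)
    if match:
        return int(match.group(1))
    match = COPYRIGHT.search(page_text)
    if match:
        # For a range such as 2005-2010, use the later year.
        return max(int(year) for year in match.groups() if year)
    return None


def is_stale(year: Optional[int], current_year: int, max_age_years: int = 2) -> bool:
    """True if a known update year is more than max_age_years in the past."""
    return year is not None and current_year - year > max_age_years
```

As the slide notes, a date found this way says what the page claims, not whether the claim is true, so manual review of a sample remains necessary.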
    12. Keeping content up to date
        • New tags come into existence (Connectome, Tractography, Epigenetics)
        • New resource types come into existence, e.g., mobile apps
        • Resources add new types of content
        • Change name
        • Change scope
        • > 7000 updates to the registry last year
        It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review
    13. Ontology provides a human-centric model for search and data integration
    14. Last updated... Are all databases and data sets equally valuable?
        • Some neglected resources are still valuable
          – Complete data sets
          – Rare data
        • Software may still be usable
        • Some databases, however, may only be of historical interest
          – “all metalloproteins found in PDB”
    15. Summary: NIF Registry provides insight into the state of digital resources on the web
        • The NIF Registry has created a linked data graph of web-accessible resources
        • Maintained on a community wiki platform
        • Provides data on the fluidity of the resource landscape
          – New resources continue to be created and found
          – Relatively few disappear altogether
          – Many more grow stale, although their value may still be significant
          – Keeping curation up to date requires frequent updating
    16. Part 2: Surveying the data landscape
        • The NIF data federation performs deep search over the content of over 200 databases
        • New databases are added at a rate of 25-40 per year
        • Latest update: Open Source Brain; ingest completed in 2 hours
        • Databases chosen on a variety of criteria:
          – Early: testing different types of resources
          – Thematic areas
          – Volunteers
    17. Data Federation Growth: NIF searches the largest collation of neuroscience-relevant data on the web
        [Chart: number of federated databases (0 to 250) and number of federated records in millions (log scale, 0.01 to 1000), Jun-08 to May-13; DISCO marked on the chart]
    18. Data Ingestion Architecture (current and planned)
        DISCO Dashboard Functions
        • Ingest Script Manager
        • Public Script Repository
        • Data & Event Tracker
        • Versioning System
        • Curator Tool
        • Data Transformer Manager
        Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University
    19. DISCO Dashboard
        • Management of registry resources through a single administrative dashboard
        • Associated discovery pipeline
        • Tools to manage data updates
        • Change tracking
        • Globally unique identifier creation
        Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University
    20. NIF data federation: NIF was designed to be populated rapidly with progressive refinement
    21. What are the connections of the hippocampus?
        • Query expansion (synonyms and related concepts): Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
        • Boolean queries
        • Data sources categorized by “data type” and level of nervous system
        • Common views across multiple sources
        • Tutorials for using the full resource when getting there from NIF
        • Link back to record in original source
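The hippocampus example can be sketched as a small synonym expansion step. This is an illustration only: the `SYNONYMS` table is a hypothetical stand-in for lookups against the NIF ontology, and `expand_query` is not a NIF function.

```python
from typing import Dict, List

# Hypothetical synonym table; NIF would draw these from its ontology.
SYNONYMS: Dict[str, List[str]] = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def expand_query(term: str, synonyms: Dict[str, List[str]]) -> str:
    """Join a term and its synonyms into a Boolean OR query.

    Multi-word variants are quoted so they are searched as phrases.
    """
    variants = [term] + synonyms.get(term.lower(), [])
    quoted = [f'"{v}"' if " " in v else v for v in variants]
    return " OR ".join(quoted)


print(expand_query("Hippocampus", SYNONYMS))
# Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"
```

A term with no entry in the table passes through unchanged, so the expansion step never narrows a search.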
    22. Results are organized within a common framework
        • Relationship labels used by the sources: Connects to; Synapsed with; Synapsed by; Input region; innervates; Axon innervates; Projects to; Cellular contact; Subcellular contact; Source site; Target site
        • Each resource implements a different, though related, model; in many cases the systems are complex and difficult to learn
    23. NIF Semantic Framework: NIFSTD ontology
        • NIF covers multiple structural scales and domains of relevance to neuroscience
        • Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
        • NIFSTD: Organism; NS Function; Molecule; Investigation; Subcellular structure; Macromolecule; Gene; Molecule Descriptors; Techniques; Reagent; Protocols; Cell; Resource; Instrument; Dysfunction; Quality; Anatomical Structure
    24. Use of Ontologies
        • Controlled vocabulary for describing type of resource and content
          – Database, Image, Diabetes
        • Entity-mapping of database and data content
        • Data integration across sources
        • Search: Mixture of mapped content and string-based search
          – Different parts of the infrastructure use the vocabularies in different ways
          – Utilize synonyms, parents, children to refine search
          – Increasing use of other relationships and logical inferencing
        • Generation of semantic content (e.g., RDF, Linked Data)
    25. NIF Concept Mapper: Aligns sources to the NIF semantic framework
    26. Column-level mapping: Reducing false positives
    27. The scourge of neuroanatomical nomenclature: Importance of NIF semantic framework
        • NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
          – Brain Architecture Management System (rodent)
          – Temporal lobe.com (rodent)
          – Connectome Wiki (human)
          – Brain Maps (various)
          – CoCoMac (primate cortex)
          – UCLA Multimodal database (human fMRI)
          – Avian Brain Connectivity Database (bird)
        • Total: 1800 unique brain terms (excluding Avian)
        • Number of exact terms used in > 1 database: 42
        • Number of synonym matches: 99
        • Number of 1st order partonomy matches: 385
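The three match levels counted above (exact term, synonym, first-order partonomy) can be sketched as a tiered comparison. The tables below are tiny hypothetical stand-ins for the NIF ontology, the function names are illustrative, and the sketch folds case before comparing, which the original analysis may not have done.

```python
from typing import Dict, Optional

# Hypothetical lookup tables: synonym -> preferred term, part -> direct parent.
SYNONYMS: Dict[str, str] = {
    "ammon's horn": "hippocampus",
    "cornu ammonis": "hippocampus",
}
PART_OF: Dict[str, str] = {
    "hippocampus": "hippocampal formation",
    "dentate gyrus": "hippocampal formation",
}


def canonical(term: str, synonyms: Dict[str, str]) -> str:
    """Map a term to its preferred label, or return it normalized."""
    key = term.strip().lower()
    return synonyms.get(key, key)


def classify_match(term_a: str, term_b: str,
                   synonyms: Dict[str, str],
                   part_of: Dict[str, str]) -> Optional[str]:
    """Return 'exact', 'synonym', 'partonomy', or None for two brain terms."""
    a, b = term_a.strip().lower(), term_b.strip().lower()
    if a == b:
        return "exact"
    ca, cb = canonical(a, synonyms), canonical(b, synonyms)
    if ca == cb:
        return "synonym"
    # First-order partonomy: one term is the direct parent of the other.
    if part_of.get(ca) == cb or part_of.get(cb) == ca:
        return "partonomy"
    return None
```

Applied pairwise across databases, a classifier of this shape shows why so few of the 1800 terms align by string alone and why most alignments need the ontology.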
    28. Content Annotation: Google Refine
    29. Resource Provider Services: Linkout
    30. What have we learned: Grabbing the long tail of small data
        • NIF can be used to survey the data landscape
        • Analysis of NIF shows multiple databases with similar scope and content
        • Many contain partially overlapping data
        • Data “flows” from one resource to the next
          – Data is reinterpreted, reanalyzed or added to
        • Is duplication good or bad?
    31. What do you mean by data? Databases come in many shapes and sizes
        • Primary data
          – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
        • Secondary data
          – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
        • Tertiary data
          – Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
        • Registries
          – Metadata
          – Pointers to data sets or materials stored elsewhere
        • Data aggregators
          – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
        • Single source
          – Data acquired within a single context, e.g., Allen Brain Atlas
        Researchers are producing a variety of information artifacts using a multitude of technologies
    32. NIF Analytics: The Neuroscience Landscape
        NIF is in a unique position to answer questions about the neuroscience landscape
        [Chart: Where are the data? Brain region (Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain) by data source]
        Vadim Astakhov, Kepler Workflow Engine
    33. Whither neuroscience information?
        [Diagram labels: ∞; What is easily machine processable and accessible; What is potentially knowable; What is known: Literature, images, human knowledge; Unstructured; Natural language processing, entity recognition, image processing and analysis; communication]
    34. Open world meets closed world
        • We know a lot about some things and less about others; some of NIF’s sources are comprehensive; others are highly biased
        • But... NIF has > 900,000 antibodies, 250,000 model organisms, and 3 million microarray records
    35. What drives discovery?
        The combination of ontologies, diverse data and analytics lets us look at the current landscape in interesting ways
        [Chart: Diseases of nervous system (Neurodegenerative; Seizure disorders; Neoplastic disease of nervous system) in NIH Reporter and in NIF data federated sources]
    36. Embracing duplication: Data mashups
        • NIF queries across 3 of approximately 10 fMRI databases
        • Two resources, Brede and SUMSdb, curated activation foci from the literature
        • ~300 PMIDs were common between Brede and SUMSdb
        • PMID serves as a unique identifier for an article
        • Same information; value added
        Data is additive
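Because the PMID identifies an article uniquely, records from the two resources can be combined by joining on it. The sketch below assumes each record is a dictionary with a `pmid` and a `focus` field; that record layout, the function names, and the sample values are all invented for illustration and are not Brede or SUMSdb data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, object]


def merge_by_pmid(*sources: Tuple[str, List[Record]]) -> Dict[object, Dict[str, list]]:
    """Group activation foci by PMID, keeping track of the contributing source."""
    merged: Dict[object, Dict[str, list]] = defaultdict(dict)
    for source_name, records in sources:
        for record in records:
            merged[record["pmid"]].setdefault(source_name, []).append(record["focus"])
    return dict(merged)


def shared_pmids(merged: Dict[object, Dict[str, list]], a: str, b: str) -> list:
    """PMIDs for which both named sources contributed at least one record."""
    return sorted(pmid for pmid, by_source in merged.items()
                  if a in by_source and b in by_source)


# Placeholder records; PMIDs and coordinates are made up.
brede = [{"pmid": 1, "focus": (10, 20, 30)}, {"pmid": 2, "focus": (-4, 8, 12)}]
sumsdb = [{"pmid": 2, "focus": (-5, 8, 12)}, {"pmid": 3, "focus": (0, 0, 0)}]

merged = merge_by_pmid(("Brede", brede), ("SUMSdb", sumsdb))
print(shared_pmids(merged, "Brede", "SUMSdb"))  # [2]
```

Keeping the foci grouped per source, rather than deduplicating them, preserves the point of the slide: each curation of the same article can add information.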
    37. Same data: different analysis (chronic vs acute morphine in striatum)
        • Gemma: Gene ID + Gene Symbol
        • DRG: Gene name + Probe ID
        • Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
        • Analysis:
          – 1370 statements from Gemma regarding gene expression as a function of chronic morphine
          – 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
          – Results for 1 gene were opposite in DRG and Gemma
          – 45 did not have enough information provided in the paper to make a judgment
        Relatively simple standards would make life easier
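The direction problem can be illustrated with a small normalization step: before comparing, express both sets of claims relative to the same reference condition. This sketch assumes a simple two-condition comparison where changing the reference just flips "up" and "down"; the gene names are placeholders and the functions are hypothetical, not the analysis actually run on Gemma and DRG.

```python
from typing import Dict

OPPOSITE = {"up": "down", "down": "up"}


def relative_to(direction: str, reference: str, target_reference: str) -> str:
    """Re-express a direction of change against a different reference condition."""
    return direction if reference == target_reference else OPPOSITE[direction]


def compare_claims(gemma: Dict[str, str], drg: Dict[str, str]) -> Dict[str, int]:
    """Tally Gemma claims against DRG after moving both to the saline reference."""
    tally = {"consistent": 0, "opposite": 0, "missing": 0}
    for gene, direction in gemma.items():
        if gene not in drg:
            tally["missing"] += 1
            continue
        # Assumption from the slide: Gemma reports relative to chronic morphine,
        # DRG relative to saline.
        normalized = relative_to(direction, "chronic morphine", "saline")
        tally["consistent" if normalized == drg[gene] else "opposite"] += 1
    return tally


# Placeholder claims keyed by an invented gene label.
gemma = {"GENE_A": "up", "GENE_B": "down", "GENE_C": "up"}
drg = {"GENE_A": "down", "GENE_B": "down"}
print(compare_claims(gemma, drg))
# {'consistent': 1, 'opposite': 1, 'missing': 1}
```

In practice the harder step is the one the slide points to: Gemma and DRG key their records differently (gene ID and symbol versus gene name and probe ID), so the records must be mapped to a shared identifier before any comparison like this is possible.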
    38. Phases of NIF
        • 2006-2008: A survey of what was out there
        • 2008-2009: Strategy for resource discovery
          – NIF Registry vs NIF data federation
          – Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
          – Effective search across semantically diverse sources
            • NIFSTD ontologies
        • 2009-2011: Strategy for data integration
          – Unified views across common sources
          – Mapping of content to NIF vocabularies
        • 2011-present: Data analytics
          – Uniform external data references
        • 2012-present: SciCrunch: unified biomedical resource services
        NIF provides a strategy and set of tools applicable to all biomedical science
    39. Where is the Neuroscience in NIF?
        • Search semantics
        • Ranking
        • Resources supported by NIH Blueprint Institutes are more thoroughly covered
        • Data types, e.g., brain activation foci
    40. Building a Uniform Resource Layer
        [Diagram labels: Discoverability; Accessibility; Web of Data; Data specified via simple semantics; Data in a usable form; Semantically-enabled search; Enhanced semantics; Standardized representation; Linked Open Data - RDF; Data resources simply described; Automated data harvesting technologies; Common resource registry]
        A production data (resource) catalog and underlying technology platform for researchers to discover, share, access, analyze, and integrate biomedical information
    41. Community Built Uniform Resource Layer
        [Diagram labels: SciCrunch; NIF; Neuroscience; MONARCH; Animal Models; Community Services; dkCOIN; Shared Resources; Undiagnosed Disease Program; Phenotype RCN; 3D Virtual Cell; National Institute on Aging; One Mind for Research; BIRN; International Neuroinformatics Coordinating Facility; Model Organism Databases; Community Outreach; DELSA; Varied (not just a data catalog)]
    42. Each project shares resources and adds unique value to the resource layer
        • 3DVC: Focus on models and simulation
        • Gene Ontology: Focus on bioinformatics tools
        • National Institute on Aging: Aging-related data sets
        • Monarch: Phenotype-Genotype; deep semantic data integration
        • One Mind for Research: Biospecimen repositories
        • NeuroGateway: Computational resources
        • FORCE11: Tools for next-gen publishing and e-scholarship
        SciCrunch is actively supporting multiple communities; multiple communities are enriching and improving SciCrunch
    43. Customized portals and rankings
        [Diagram labels: SciCrunch; NIF; Neuroscience; MONARCH; Animal Models; Community Services; dkCOIN; Shared Resources; Undiagnosed Disease Program; Phenotype RCN; 3D Virtual Cell; National Institute on Aging; One Mind for Research; BIRN; International Neuroinformatics Coordinating Facility; Model Organism Databases; Community Outreach; DELSA; Varied; dkCOIN Ontology; SciCrunch Shared Resources]
    44. Register your resource to NIF!
        • “How do I share my data/tool?”
        • “There is no database for my data”
        [Diagram labels: Community database: beginning; Community database: end; Institutional repositories; Cloud; INCF: Global infrastructure; Government; Education; Industry; Tool repositories]
        NIF is designed to leverage existing investments in resources and infrastructure
    45. Collaboration, competition, coordination, cooperation
        • The diversity and dynamism of biomedical data will always make data integration challenging
        • The overall data space is vast: No one group or individual can do everything
          – Cooperation and coordination are essential
        • Creating a core resource registry and data catalog allows the entire community to track resources, work together to keep it updated, promote cross-fertilization, and build better resources
    46. NIF team (past and present)
        Jeff Grethe, UCSD, Co-Investigator, Interim PI; Amarnath Gupta, UCSD, Co-Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, Cal Tech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum; Fahim Imam; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Jennifer Lawrence; Svetlana Sulima; Davis Banks; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer (retired); Jonathan Pollock, NIH, Program Officer
        And my colleagues in Monarch, dkNet, 3DVC, Force 11