Overview of the Neuroscience Information Framework and how it brings together data, in the form of distributed databases, and knowledge, in the form of ontologies, to map the dataspace and expose places where data and knowledge do not match.
Neuroscience research increasingly relies on large, heterogeneous datasets from various sources. Integrating these diverse data types and making them accessible presents challenges. The NIF (Neuroscience Information Framework) addresses this by creating a federated search engine and unified interface to access multiple neuroscience databases. NIF aims to make neuroscience data more discoverable, accessible, and usable through techniques like unique identifiers, metadata standards, and semantic integration. This will help researchers more effectively find and use relevant neuroscience information.
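The federated-search idea described above can be sketched in a few lines: one query fans out to several source databases and the hits come back merged under a single interface, each tagged with its provenance. The source names and records below are illustrative stand-ins, not NIF's actual sources or API.

```python
# Minimal sketch of federated search: the same query is sent to every
# registered source and the results are merged, tagged with provenance.
# Source names and records are invented for illustration.
SOURCES = {
    "ModelDB": [
        {"id": "modeldb:87284", "title": "Purkinje cell model"},
        {"id": "modeldb:2488", "title": "Hippocampal CA1 pyramidal neuron"},
    ],
    "NeuroMorpho": [
        {"id": "neuromorpho:NMO_00001", "title": "Purkinje cell reconstruction"},
    ],
}

def federated_search(query: str) -> list:
    """Query every registered source and merge the hits."""
    hits = []
    for source, records in SOURCES.items():
        for record in records:
            if query.lower() in record["title"].lower():
                hits.append({**record, "source": source})
    return hits

for r in federated_search("purkinje"):
    print(r["source"], r["id"])
```

In a real federation the inner loop would be a network call to each database, but the merge-and-tag pattern is the same.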
The document discusses the Neuroscience Information Framework (NIF), which aims to provide a consistent framework and portal for discovering and utilizing web-based neuroscience resources. It summarizes the goals of NIF in indexing over 2000 databases and making their content searchable through an expansive neuroscience ontology. The document outlines the history and development of NIF, describes its search capabilities and use of ontologies, and provides examples of tools and resources that integrate NIF services like the Whole Brain Catalog.
A Deep Survey of the Digital Resource Landscape: Perspectives from the Neuros... (Maryann Martone)
The NIF Registry provides insight into the state of digital neuroscience resources on the web. It has cataloged over 6,000 resources, including more than 2,200 databases. While some resources disappear over time, many more grow stale as they are not updated regularly. Maintaining an up-to-date registry requires frequent updates. The NIF data federation can search over 200 databases containing over 1 billion records. This collection continues to grow as new databases are added. The NIF utilizes ontologies and semantic frameworks to integrate data across diverse sources and provide insights into the neuroscience landscape.
How do we know what we don't know? Exploring the data and knowledge space th... (Maryann Martone)
The document discusses the Neuroscience Information Framework (NIF), an initiative that aims to catalog and integrate neuroscience resources and data. NIF surveys the neuroscience resource landscape, currently cataloging over 3000 databases and datasets. It provides semantic integration of these resources through the use of ontologies and allows deep search of aggregated data. However, significant amounts of neuroscience data and resources remain inaccessible in publications, databases, and file drawers. Barriers to data sharing include lack of incentives, standards, and resources. NIF and related efforts aim to develop solutions to make more neuroscience data FAIR - findable, accessible, interoperable, and reusable.
Data Landscapes: The Neuroscience Information Framework (Maryann Martone)
Overview of how to use the Neuroscience Information Framework for data discovery, presented at the Genetics of Addiction Workshop held at the Jackson Laboratory, Aug 28 - Sept 1, 2014.
The document discusses the Neuroscience Information Framework (NIF), which aims to provide a portal for finding and utilizing web-based neuroscience resources. NIF provides a consistent framework for describing various resources like databases, literature, and images. It allows simultaneous searches across these different data types and is supported by neuroscience ontologies. NIF currently catalogs over 5,000 resources and is working to integrate these diverse data sources to help answer questions and discover gaps in our knowledge about the brain.
How do we know what we don’t know: Using the Neuroscience Information Framew... (Maryann Martone)
The document discusses using the Neuroscience Information Framework (NIF) to reveal knowledge gaps in neuroscience. It summarizes that NIF aims to maximize awareness, access, and utility of neuroscience research resources by uniting information from over 200 databases containing over 400 million records. However, it notes that certain domains may still be underrepresented due to biases in available data driven by factors like funding priorities. The framework uses ontologies to help integrate diverse data types and link them with defined concepts, but notes that neuroanatomical structures in particular pose challenges due to inconsistent naming conventions across studies.
The document discusses navigating the neuroscience data landscape. It notes that a grand challenge in neuroscience is to understand brain function across multiple scales of organization. Central to this effort is understanding "neural choreography" - the integrated functioning of neurons into brain circuits. The Neuroscience Information Framework (NIF) aims to facilitate discovery and utilization of web-based neuroscience resources. However, the neuroscience community has not fully exploited currently available data or prepared for forthcoming data.
Neurosciences Information Framework (NIF): An example of community Cyberi... (Maryann Martone)
The document discusses the challenges of managing and utilizing the large amount of neuroscience data being generated. It notes that currently about half of researchers store data only in their own labs, and many lack funding for proper archiving. The Neuroscience Information Framework (NIF) is working to address these issues by creating a catalog and federation of neuroscience resources to facilitate discovery, access, analysis, and integration of data. NIF has assembled the largest searchable collection of neuroscience data on the web, using an ontology and technologies that can search the "hidden web" of resources.
Big data from small data: A survey of the neuroscience landscape through the... (Maryann Martone)
The document discusses the Neuroscience Information Framework (NIF), an initiative by the NIH Blueprint to provide a single access point for searching across multiple neuroscience databases and data types. NIF aims to maximize access to and utility of worldwide neuroscience resources by creating a consistent framework for describing resources and enabling simultaneous searches. It notes that neuroscience data exists in many forms, from raw data to processed data to claims, across multiple scales and data types. NIF is designed to rapidly integrate these diverse resources through a tiered system that has a low barrier for data providers to participate.
The document discusses methodologies for sharing long-tail data and what has been learned. It notes that unique identifiers (PIDs) are important for identifying entities across contexts. Standards like MINI and common data elements (CDEs) help ensure data is findable, accessible, and reusable. The Neuroscience Information Framework (NIF) aggregates ontologies and searches over 200 data sources to organize information. What we have learned is that data should be in repositories, not personal servers; people are key to these efforts; and resources should be comprehensive and support each other to advance open data sharing.
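The unique identifiers (PIDs) mentioned above work because a registry maps each identifier to exactly one resource, so the same entity can be cited consistently across papers and databases. A toy resolver, with RRID-style identifiers whose entries are invented for illustration, might look like:

```python
# Toy PID resolver: one registry maps an identifier to the resource it
# names. The RRID-style entries below are illustrative, not real records.
registry = {
    "RRID:AB_0000001": {"type": "antibody", "name": "Anti-GFAP (example)"},
    "RRID:SCR_0000002": {"type": "tool", "name": "Example Morphology Archive"},
}

def resolve(pid: str) -> dict:
    """Look up a PID; fail loudly if it was never registered."""
    try:
        return registry[pid]
    except KeyError:
        raise KeyError(f"unregistered identifier: {pid}") from None

print(resolve("RRID:SCR_0000002")["name"])
```

The design point is that resolution fails loudly for unregistered identifiers, rather than silently matching on an ambiguous free-text name.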
How Portable Are the Metadata Standards for Scientific Data? (Jian Qin)
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with ever-growing data. This paper reports findings from a survey of metadata standards in the scientific data domain and argues for a metadata infrastructure. The survey collected 4,400+ unique elements from 16 standards and categorized them into 9 categories. The highest counts of elements occurred in the descriptive category, and many of these overlapped with Dublin Core elements. The same pattern was repeated in the elements that co-occurred across standards: a small number of semantically general elements appeared across the largest number of standards, while the remaining co-occurrences formed a long tail with a wide range of specific semantics. The paper discusses the implications of these findings for metadata portability and infrastructure, pointing out that large, complex standards and widely varied naming practices are the major hurdles to building a metadata infrastructure.
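The co-occurrence analysis in that survey amounts to counting, for each element name, how many standards carry it. A minimal sketch with toy element sets (the standards are real, but the element selections here are illustrative, not the paper's survey data):

```python
from collections import Counter

# Toy element sets for a few metadata standards; real standards carry
# far more elements. Element choices are illustrative only.
standards = {
    "DublinCore": {"title", "creator", "subject", "date", "format"},
    "DataCite":   {"title", "creator", "publisher", "date", "rights"},
    "EML":        {"title", "creator", "geographicCoverage", "methods"},
}

# Count in how many standards each element name appears.
occurrence = Counter(e for elements in standards.values() for e in elements)

# A few general elements span all standards; the rest form a long tail.
shared = sorted(e for e, n in occurrence.items() if n == len(standards))
print(shared)  # -> ['creator', 'title']
```

Even this toy run shows the paper's shape: the head of the distribution is generic Dublin Core-like elements, while domain-specific ones like `geographicCoverage` appear in only one standard.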
The Neuroscience Information Framework has indexed over 100 big-data databases, allowing us to ask questions about the big-data landscape. Anita Bandrowski presents an overview of the NIF system and offers insights into the addiction data landscape for the Jackson Laboratory (JAX).
EiTESAL eHealth Conference, 14 & 15 May 2017 (EITESANGO)
This document discusses bioinformatics and some of its key concepts and tools. It begins with definitions of bioinformatics as the intersection of biology, computer science, and information technology. It then discusses some of the data formats, tools, and skills used in bioinformatics, including working with nucleotide sequence data, translating sequences into amino acids, and analyzing large datasets. It also summarizes how ontologies are used to represent concepts and how various data types are organized and stored in databases for analysis.
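One of the bioinformatics tasks named above, translating a nucleotide sequence into amino acids, is small enough to sketch directly. The codon table below covers only the codons used in the example; a real translation would use the full standard genetic code.

```python
# Sketch of DNA-to-protein translation. Only the codons needed for the
# example are included; the standard genetic code has 64 entries.
CODON_TABLE = {
    "ATG": "M", "GCC": "A", "AAA": "K", "TAA": "*",  # "*" marks a stop codon
}

def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):       # walk the sequence codon by codon
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCCAAATAA"))  # -> MAK
```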
Semantics for Bioinformatics: What, Why and How of Search, Integration and An... (Amit Sheth)
Amit Sheth's Keynote at Semantic Web Technologies for Science and Engineering Workshop (held in conjunction with ISWC2003), Sanibel Island, FL, October 20, 2003.
This document discusses biological networks and how to analyze genome-scale data using networks. It defines different types of biological networks including DNA-protein, RNA-RNA, RNA-protein, and protein-protein networks. It also describes popular network visualization and analysis tools like Cytoscape and different databases for retrieving protein-protein and pathway interaction networks. The document emphasizes that networks can help validate findings, explore and discover new insights from large genomic and omics datasets.
Next-Generation Search Engines for Information Retrieval (Waqas Tariq)
In recent years there have been significant advances in scientific data management and retrieval, particularly in standards and protocols for archiving data and metadata. Scientific data is generally rich, hard to interpret, and spread across many locations. To integrate these pieces, a data archive and associated metadata should be generated and stored in a form that is locatable, retrievable, and understandable, and, just as importantly, one that will remain accessible as technology changes, such as XML. New search technologies built around these protocols make searching easy, fast, and robust. One such system is Mercury, a metadata harvesting, data discovery, and access system built for researchers to search for, share, and obtain spatiotemporal data across a range of climate and ecological sciences.
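The point about XML as a durable archive format can be made concrete with a minimal metadata record. The field names below follow Dublin Core conventions, but the record itself is invented for illustration and is not Mercury's actual schema.

```python
import xml.etree.ElementTree as ET

# A minimal XML metadata record of the kind a harvester could index.
# Field names follow Dublin Core; the values are invented for illustration.
record = ET.Element("record")
ET.SubElement(record, "title").text = "Soil respiration measurements, 2001-2005"
ET.SubElement(record, "creator").text = "Example, A."
ET.SubElement(record, "coverage").text = "35.96N -84.28W"
ET.SubElement(record, "date").text = "2005-12-31"

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Because the format is plain XML, it stays parseable as tools change:
parsed = ET.fromstring(xml_text)
print(parsed.findtext("title"))
```

A harvester like Mercury would crawl many such records and build a searchable index over the parsed fields.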
The document provides an overview of a presentation on open science and open data for librarians. It includes:
- An introduction to open science/open data concepts and the library's role in research data services.
- Examples of activities working with research data, including data collection, visualization, cleaning, analysis and preservation.
- A discussion of the benefits of open data, challenges researchers face in opening their data, and the role of data repositories and standards.
- An overview of the African Open Science Platform project which aims to promote open science on the continent.
Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte... (Amit Sheth)
Ora Lassila and Amit Sheth, "Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Interoperability", Invited Talk at ONC-HHS Invitational Workshop on Next Generation Interoperability for Health, Washington DC, January 19-20, 2011.
Keynote for Theory and Practice of Digital Libraries 2017
The theory and practice of digital libraries provides a long history of thought on how to manage knowledge, ranging from collection development to cataloging and resource description. These tools were all designed to make knowledge findable and accessible to people. Even technical progress in information retrieval and question answering is targeted at answering a human's information need.
However, demand is increasingly for data: data needed not for human consumption but to drive machines. As one sign of this demand, there has been explosive growth in job openings for Data Engineers, professionals who prepare data for machine consumption. In this talk, I overview the information needs of machine intelligence and ask: are our knowledge management techniques applicable for serving this new consumer?
Data Science and What It Means to Library and Information Science (Jian Qin)
Data science involves collecting, analyzing, and preserving large datasets to extract knowledge and make predictions. It differs from traditional disciplines by dealing with heterogeneous, unstructured data from complex networks. A data scientist requires math, computing, communication skills, and the ability to ask the right questions. Libraries are well-positioned to offer various data services including data discovery, consulting, mining, integration, and curation to support research and decision-making. Practicing data science in libraries requires vision, risk-taking, data science knowledge, careful planning, and collaboration.
IJCER (www.ijceronline.com) International Journal of Computational Engineerin... (ijceronline)
This document summarizes text mining techniques for information retrieval, extraction, and indexing. It discusses common information retrieval techniques like inverted indices and signature files. It also covers stemming, domain dictionaries, exclusion lists, and research directions in text mining like finding better representations for extracted information, enabling multilingual analysis, and integrating domain knowledge. The key techniques discussed are text indexing, query processing, and information extraction from text.
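The inverted index mentioned above is the core structure behind these retrieval techniques: it maps each term to the set of documents containing it, so a multi-term query becomes a set intersection. A minimal sketch over toy documents:

```python
from collections import defaultdict

# Toy document collection; in practice terms would also be stemmed and
# stop words removed, as discussed above.
docs = {
    0: "text mining for information retrieval",
    1: "information extraction from text",
    2: "query processing and indexing",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms: str) -> set:
    """Return ids of documents containing ALL query terms (AND query)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(search("text", "information")))  # -> [0, 1]
```

Signature files, also discussed in the paper, trade this exact structure for a compact probabilistic filter, at the cost of false matches that must be checked against the documents.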
RDAP14: Maryann Martone, Keynote, The Neuroscience Information Framework (ASIS&T)
The Neuroscience Information Framework (NIF) is an initiative of the NIH Blueprint to maximize access to and utility of worldwide neuroscience research resources. NIF catalogs over 10,000 resources including databases, literature, and materials. It provides search capabilities across these resources and develops ontologies and semantic frameworks to integrate diverse data types and scales. NIF aims to make dispersed neuroscience information more findable, accessible, interoperable, and reusable to enable new insights.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear... (dkNET)
dkNET provides a single portal for discovering over 3,500 biomedical research resources and datasets. It aims to make these resources findable, accessible, interoperable, and reusable in accordance with the FAIR principles. The portal contains three main sections for browsing community resources, additional resources, and literature. It utilizes faceted searching and provides analytics and notifications to help users track changes to resources over time.
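Faceted search of the kind the dkNET portal offers comes down to counting facet values over a result set so the interface can show filters, then narrowing the set when a value is picked. The resource records below are invented for illustration and are not dkNET's actual schema.

```python
from collections import Counter

# Illustrative resource records with two facets, "type" and "access";
# this is not dkNET's real data model.
resources = [
    {"name": "Metabolic Phenotyping Center", "type": "service", "access": "open"},
    {"name": "Diabetes dataset A", "type": "dataset", "access": "open"},
    {"name": "Kidney atlas", "type": "dataset", "access": "restricted"},
]

def facet_counts(records, facet):
    """Count how many records carry each value of the given facet."""
    return Counter(r[facet] for r in records)

print(facet_counts(resources, "type"))    # counts shown next to each filter

# Selecting a facet value narrows the result set:
datasets = [r for r in resources if r["type"] == "dataset"]
print(len(datasets))
```

Recomputing the counts on the narrowed set is what lets the remaining filters update as the user drills down.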
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ... (Amit Sheth)
Talk presented in Spain (WiMS 2013/UAM-Madrid, UMA-Malaga), June 2013.
Replaces earlier version at: http://www.slideshare.net/apsheth/semantic-technology-empowering-real-world-outcomes-in-biomedical-research-and-clinical-practices
Biomedical and translational research, as well as clinical practice, are increasingly data driven. Activities routinely involve large numbers of devices, data, and people, bringing the challenges of volume, velocity (change), variety (heterogeneity), and veracity (provenance, quality). Equally important is the challenge of serving broader ecosystems of people and organizations, extending beyond traditional stakeholders such as drug makers, clinicians, and policy makers to increasingly technology-savvy and information-empowered patients. We believe that semantics is becoming the centerpiece of informatics solutions that convert data into meaningful, contextually relevant information and insights, leading to optimal decisions for translational research and 360-degree health, fitness, and well-being.
In this talk, I will provide a series of snapshots of efforts in which semantic approach and technology is the key enabler. I will emphasize real-world and in-use projects, technologies and systems, involving significant collaborations between my team and biomedical researchers or practicing clinicians. Examples include:
• Active Semantic Electronic Medical Record
• Semantics and Services enabled Problem Solving Environment for T.cruzi (SPSE)
• Data Mining of Cardiology data
• Semantic Search, Browsing and Literature Based Discovery
• PREscription Drug abuse Online Surveillance and Epidemiology (PREDOSE)
• kHealth: development of knowledge-enhanced sensing and mobile computing applications (using low-cost sensors and a smartphone), along with the ability to convert low-level observations into clinically relevant abstractions
Further details are at http://knoesis.org/amit/hcls
How do we know what we don’t know: Using the Neuroscience Information Framew...Maryann Martone
The document discusses using the Neuroscience Information Framework (NIF) to reveal knowledge gaps in neuroscience. It summarizes that NIF aims to maximize awareness, access, and utility of neuroscience research resources by uniting information from over 200 databases containing over 400 million records. However, it notes that certain domains may still be underrepresented due to biases in available data driven by factors like funding priorities. The framework uses ontologies to help integrate diverse data types and link them with defined concepts, but notes that neuroanatomical structures in particular pose challenges due to inconsistent naming conventions across studies.
The document discusses navigating the neuroscience data landscape. It notes that a grand challenge in neuroscience is to understand brain function across multiple scales of organization. Central to this effort is understanding "neural choreography" - the integrated functioning of neurons into brain circuits. The Neuroscience Information Framework (NIF) aims to facilitate discovery and utilization of web-based neuroscience resources. However, the neuroscience community has not fully exploited currently available data or prepared for forthcoming data.
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...Maryann Martone
The document discusses the challenges of managing and utilizing the large amount of neuroscience data being generated. It notes that currently, about half of researchers only store data in their own labs and many lack funding for proper archiving. The National Information Framework (NIF) is working to address these issues by creating a catalog and federation of neuroscience resources to facilitate discovery, access, analysis and integration of data. NIF has assembled the largest searchable collection of neuroscience data on the web using an ontology and technologies that can search the "hidden web" of resources.
Big data from small data: A survey of the neuroscience landscape through the...Maryann Martone
The document discusses the Neuroscience Information Framework (NIF), an initiative by the NIH Blueprint to provide a single access point for searching across multiple neuroscience databases and data types. NIF aims to maximize access to and utility of worldwide neuroscience resources by creating a consistent framework for describing resources and enabling simultaneous searches. It notes that neuroscience data exists in many forms, from raw data to processed data to claims, across multiple scales and data types. NIF is designed to rapidly integrate these diverse resources through a tiered system that has a low barrier for data providers to participate.
The document discusses methodologies for sharing long-tail data and what has been learned. It notes that unique identifiers (PIDs) are important for identifying entities across contexts. Standards like MINI and common data elements (CDEs) help ensure data is findable, accessible, and reusable. The Neuroscience Information Framework (NIF) aggregates ontologies and searches over 200 data sources to organize information. What we have learned is that data should be in repositories, not personal servers; people are key to these efforts; and resources should be comprehensive and support each other to advance open data sharing.
How Portable Are the Metadata Standards for Scientific Data?Jian Qin
The one-covers-all approach in current metadata standards for scientific data has serious limitations in keeping up with the ever-growing data. This paper reports the findings from a survey to metadata standards in the scientific data domain and argues for the need for a metadata infrastructure. The survey collected 4400+ unique elements from 16 standards and categorized these elements into 9 categories. Findings from the data included that the highest counts of element occurred in the descriptive category and many of them overlapped with DC elements. This pattern also repeated in the elements co-occurred in different standards. A small number of semantically general elements appeared across the largest numbers of standards while the rest of the element co-occurrences formed a long tail with a wide range of specific semantics. The paper discussed implications of the findings in the context of metadata portability and infrastructure and pointed out that large, complex standards and widely varied naming practices are the major hurdles for building a metadata infrastructure.
the Neuroscience Information Framework has over 100 big data databases indexed, allowing us to ask big data landscape questions. Anita Bandrowski presents an overview of the NIF system and provides insights into the addiction data landscape to JAX laboratories.
EiTESAL eHealth Conference 14&15 May 2017 EITESANGO
This document discusses bioinformatics and some of its key concepts and tools. It begins with definitions of bioinformatics as the intersection of biology, computer science, and information technology. It then discusses some of the data formats, tools, and skills used in bioinformatics, including working with nucleotide sequence data, translating sequences into amino acids, and analyzing large datasets. It also summarizes how ontologies are used to represent concepts and how various data types are organized and stored in databases for analysis.
Semantics for Bioinformatics: What, Why and How of Search, Integration and An...Amit Sheth
Amit Sheth's Keynote at Semantic Web Technologies for Science and Engineering Workshop (held in conjunction with ISWC2003), Sanibel Island, FL, October 20, 2003.
This document discusses biological networks and how to analyze genome-scale data using networks. It defines different types of biological networks including DNA-protein, RNA-RNA, RNA-protein, and protein-protein networks. It also describes popular network visualization and analysis tools like Cytoscape and different databases for retrieving protein-protein and pathway interaction networks. The document emphasizes that networks can help validate findings, explore and discover new insights from large genomic and omics datasets.
Next-Generation Search Engines for Information RetrievalWaqas Tariq
In the recent years, there have been significant advancements in the areas of scientific data management and retrieval techniques, particularly in terms of standards and protocols for archiving data and metadata. Scientific data is generally rich, not easy to understand, and spread across different places. In order to integrate these pieces together, a data archive and associated metadata should be generated. This data should be stored in a format that can be locatable, retrievable and understandable, more importantly it should be in a form that will continue to be accessible as technology changes, such as XML. New search technologies are being implemented around these protocols, which makes searching easy, fast and yet robust. One such system, Mercury, a metadata harvesting, data discovery, and access system, built for researchers to search to, share and obtain spatiotemporal data used across a range of climate and ecological sciences.
The document provides an overview of a presentation on open science and open data for librarians. It includes:
- An introduction to open science/open data concepts and the library's role in research data services.
- Examples of activities working with research data, including data collection, visualization, cleaning, analysis and preservation.
- A discussion of the benefits of open data, challenges researchers face in opening their data, and the role of data repositories and standards.
- An overview of the African Open Science Platform project which aims to promote open science on the continent.
Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Inte...Amit Sheth
Ora Lassila and Amit Sheth, "Semantic Web for 360-degree Health: State-of-the-Art & Vision for Better Interoperability", Invited Talk at ONC-HHS Invitational Workshop on Next Generation Interoperability for Health, Washington DC, January 19-20, 2011.
Keynote for Theory and Practice of Digital Libraries 2017
The theory and practice of digital libraries provides a long history of thought around how to manage knowledge ranging from collection development, to cataloging and resource description. These tools were all designed to make knowledge findable and accessible to people. Even technical progress in information retrieval and question answering are all targeted to helping answer a human’s information need.
However, increasingly demand is for data. Data that is needed not for people’s consumption but to drive machines. As an example of this demand, there has been explosive growth in job openings for Data Engineers – professionals who prepare data for machine consumption. In this talk, I overview the information needs of machine intelligence and ask the question: Are our knowledge management techniques applicable for serving this new consumer?
Data Science and What It Means to Library and Information ScienceJian Qin
Data science involves collecting, analyzing, and preserving large datasets to extract knowledge and make predictions. It differs from traditional disciplines by dealing with heterogeneous, unstructured data from complex networks. A data scientist requires math, computing, communication skills, and the ability to ask the right questions. Libraries are well-positioned to offer various data services including data discovery, consulting, mining, integration, and curation to support research and decision-making. Practicing data science in libraries requires vision, risk-taking, data science knowledge, careful planning, and collaboration.
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
This document summarizes text mining techniques for information retrieval, extraction, and indexing. It discusses common information retrieval techniques like inverted indices and signature files. It also covers stemming, domain dictionaries, exclusion lists, and research directions in text mining like finding better representations for extracted information, enabling multilingual analysis, and integrating domain knowledge. The key techniques discussed are text indexing, query processing, and information extraction from text.
RDAP14: Maryann Martone, Keynote, The Neuroscience Information FrameworkASIS&T
The Neuroscience Information Framework (NIF) is an initiative of the NIH Blueprint to maximize access to and utility of worldwide neuroscience research resources. NIF catalogs over 10,000 resources including databases, literature, and materials. It provides search capabilities across these resources and develops ontologies and semantic frameworks to integrate diverse data types and scales. NIF aims to make dispersed neuroscience information more findable, accessible, interoperable, and reusable to enable new insights.
bioCADDIE Webinar: The NIDDK Information Network (dkNET) - A Community Resear...dkNET
dkNET provides a single portal for discovering over 3,500 biomedical research resources and datasets. It aims to make these resources findable, accessible, interoperable, and reusable in accordance with the FAIR principles. The portal contains three main sections for browsing community resources, additional resources, and literature. It utilizes faceted searching and provides analytics and notifications to help users track changes to resources over time.
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ... (Amit Sheth)
Talk presented in Spain (WiMS 2013/UAM-Madrid, UMA-Malaga), June 2013.
Replaces earlier version at: http://www.slideshare.net/apsheth/semantic-technology-empowering-real-world-outcomes-in-biomedical-research-and-clinical-practices
Biomedical and translational research, as well as clinical practice, are increasingly data driven. Activities routinely involve large numbers of devices, data, and people, resulting in the challenges associated with volume, velocity (change), variety (heterogeneity), and veracity (provenance, quality). Equally important is the challenge of serving the needs of broader ecosystems of people and organizations, extending beyond traditional stakeholders like drug makers, clinicians, and policy makers to increasingly technology-savvy and information-empowered patients. We believe that semantics is becoming the centerpiece of informatics solutions that convert data into meaningful, contextually relevant information and insights that lead to optimal decisions for translational research and 360-degree health, fitness, and well-being.
In this talk, I will provide a series of snapshots of efforts in which semantic approach and technology is the key enabler. I will emphasize real-world and in-use projects, technologies and systems, involving significant collaborations between my team and biomedical researchers or practicing clinicians. Examples include:
• Active Semantic Electronic Medical Record
• Semantics and Services enabled Problem Solving Environment for T.cruzi (SPSE)
• Data Mining of Cardiology data
• Semantic Search, Browsing and Literature Based Discovery
• PREscription Drug abuse Online Surveillance and Epidemiology (PREDOSE)
• kHealth: development of a knowledge-enhanced sensing and mobile computing applications (using low cost sensors and smartphone), along with ability to convert low level observations into clinically relevant abstractions
Further details are at http://knoesis.org/amit/hcls
The document discusses the challenges of managing and analyzing the large amounts of neuroscience data being generated. It notes that currently about half of researchers store their data only locally in their labs rather than in shared databases or archives, which prevents other researchers from accessing and using the data. The Neuroscience Information Framework (NIF) is working to address these issues by creating a registry of neuroscience resources and developing technologies that allow researchers to discover, share, analyze, and integrate data from various sources. NIF's registry currently catalogs over 6,000 resources, including 2,200 databases. The goal is for NIF to help the neuroscience community better exploit existing data and prepare for future increases in data.
The real world of ontologies and phenotype representation: perspectives from... (Maryann Martone)
The document discusses the Neuroscience Information Framework (NIF) and its role in facilitating discovery and use of neuroscience resources through a consistent semantic framework. NIF provides a portal for searching various types of neuroscience data and information organized by categories. It utilizes ontologies and advanced technologies to allow simultaneous searching of multiple sources. Challenges include the large number of databases and other resources, differing data types, and inconsistent naming of brain structures across sources.
The hippocampus receives input from the entorhinal cortex and sends projections to multiple targets in the brain. Its main outputs are to the subiculum, which projects to regions like the nucleus accumbens, amygdala, and medial prefrontal cortex. The hippocampus plays an important role in memory formation and spatial navigation.
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016 (Jisc)
There is broad recognition within the scientific community that the emerging data deluge will fundamentally alter disciplines in areas throughout academic research. A wide variety of researchers - from scientists and engineers to social scientists and humanities researchers - will require tools, technologies, and platforms that seamlessly integrate into standard scientific methodologies and processes.
'The fourth paradigm' refers to the data management techniques and the computational systems needed to manipulate, visualize, and manage large amounts of research data. This talk will illustrate the challenges researchers will face, the opportunities these changes will afford, and the resulting implications for data-intensive researchers.
In addition, the talk will review the global movement towards open access, research repositories and open science and the importance of curation of digital data. The talk concludes with some comments on the research requirements for campus e-infrastructure and the end-to-end performance of the network.
A description of software as infrastructure at NSF, and how Apache projects may be similar. What lessons can be shared from one organization to the other? How does science software compare with more general software?
This document provides an introduction to big data, including:
- Big data is characterized by its volume, velocity, and variety, which makes it difficult to process using traditional databases and requires new technologies.
- Technologies like Hadoop, MongoDB, and cloud platforms from Google and Amazon can provide scalable storage and processing of big data.
- Examples of how big data is used include analyzing social media and search data to gain insights, enabling personalized experiences and targeted advertising.
- As data volumes continue growing exponentially from sources like sensors, simulations, and digital media, new tools and approaches are needed to effectively analyze and make sense of "big data".
Biological databases store and organize large amounts of biological data for research use. There are many types of biological databases that classify data by type, such as nucleotide sequences, protein sequences, genomes, protein structures, gene expression, and metabolic pathways. Databases can also be classified by their data source as primary databases containing experimental results or secondary databases that analyze primary database results. Database availability varies, with some publicly open and others proprietary. Common biological databases discussed include GenBank, UniProt, PDB, KEGG, and FlyBase.
This document discusses leveraging graph data structures to analyze variant data and related annotations from large genomic datasets in a scalable way. An in-memory graph database was used to model variants, annotations, and their relationships. Simple queries on the graph performed as well or better than a relational database. More complex queries and analysis, like spectral clustering of populations, were also possible with the graph model and helped identify patterns not feasible with relational approaches. The results indicate graph databases are a powerful tool for precision medicine research by enabling both known and novel analysis of large genomic datasets.
Meeting Federal Research Requirements for Data Management Plans, Public Acces... (ICPSR)
These slides cover evolving federal research requirements for sharing scientific data. Provided are updates on federal agency responses to the 2013 OSTP memo, guidance on data management plans, resources for data management and curation training for staff/researchers, and tips for evaluating public data-sharing services. ICPSR's public data-sharing service, openICPSR, is also presented. Recording of this presentation is here: https://www.youtube.com/watch?v=2_erMkASSv4&feature=youtu.be
Data and Donuts: How to write a data management plan (C. Tobin Magle)
This presentation describes best practices for how to write a data management plan for your research data. Additionally, it provides information about finding funder requirements, metadata standards, and repositories.
Data and donuts: how to write a data management plan (C. Tobin Magle)
This document provides guidance on how to write a data management plan (DMP). It discusses what a DMP is, why researchers should care about data management, and where data management fits into the research cycle. It also covers the key components of a successful DMP, including a data inventory, a strategy for describing the data, a plan for long-term data preservation, and methods for making the data accessible. The document provides examples and exercises to help researchers develop the sections of a DMP for their own research projects.
Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discove... (Spark Summit)
This document describes a project at Novartis to use Apache Spark for high-dimensional data analysis from drug screening. Large datasets from various screening technologies were analyzed using Spark pipelines for quality control, normalization, and classification. Visualizations were built using WebGL. The goals were to speed up multi-day batch jobs, create a unified analysis workflow, and build an application for scientists. Future work includes elastic infrastructure, supervised learning of cell phenotypes, and contributing methods to open source.
Reproducibility in human cognitive neuroimaging: a community-driven data sha... (Nolan Nichols)
The document summarizes Nolan Nichols' dissertation defense on a community-driven data sharing framework for integrating and interoperating neuroimaging provenance information. His research aimed to enhance the reusability of neuroimaging data and workflows by advancing data exchange standards that incorporate provenance. Through two phases involving multiple collaborations, he extended existing standards and developed neuroimaging data models and web services to compute and discover provenance from brain imaging workflows in order to improve reproducibility in cognitive neuroimaging research.
This document discusses leveraging graph data structures to analyze variant data and related annotations from large genomic datasets. In phase I, simple queries on a graph database had performance speeds better than or equal to a relational database. Complex queries exploring patterns and clusters were also possible. In phase II, spectral clustering of 1000 genomes data identified three main clusters supporting known population genetics patterns, demonstrating the potential of graph databases for mining complex genomic correlations. The results indicate a graph database provides an effective approach for precision cancer research by enabling both known and novel queries on large genomic datasets.
2. • NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
NIF has been surveying, cataloging, and tracking the neuroscience resource landscape since before 2008.
3. BD2K: Big Data to Knowledge
• BD2K is a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximize community engagement.
• BD2K aims to develop the new approaches, standards, methods, tools, software, and competencies that will enhance the use of biomedical Big Data by:
– Facilitating broad use of biomedical digital assets by making them discoverable, accessible, and citable
– Conducting research and developing the methods, software, and tools needed to analyze biomedical Big Data
– Enhancing training in the development and use of methods and tools necessary for biomedical Big Data science
– Supporting a data ecosystem that accelerates discovery as part of a digital enterprise
http://bd2k.nih.gov/
5. How do resources get added to the NIF?
NIF Registry:
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
• Requires no special skills; manual and semi-automated updates
NIF Data Federation:
• DISCO interop
• Requires some programming skill
• Open Source Brain: < 2 hr
• Automated update via NIF DISCO dashboard
Low barrier to entry; incremental refinement (Marenco et al., 2010; 2014)
7. What resources are available for GRM1?
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice.
8. THE STATE OF RESEARCH RESOURCES: RESOURCE REGISTRY
9. Population, Coverage and Linkage of Resource Registry
[Chart by Anita Bandrowski and Burak Ozyurt: registry growth by year, broken down by resource type: Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource.]
10. • Automated text mining is used to look for "web page last updated" or copyright dates
– Identified for 570 resources
– 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources
– 38 not updated within the past 2 years (~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
What happens to these resources? The Registry provides a persistent identifier and metadata record for what once existed but no longer does.
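The date-mining heuristic described above can be sketched as follows. This is an illustrative reconstruction, not NIF's actual pipeline; the regex patterns, function names, and the two-year threshold are assumptions drawn from the slide.

```python
import re
from datetime import datetime

# Hypothetical sketch: scan page text for "last updated ..." or copyright
# dates and flag a resource as stale when the newest year found is more
# than max_age_years old.
DATE_PATTERNS = [
    re.compile(r"last\s+updated[:\s]+.*?(\d{4})", re.IGNORECASE),
    re.compile(r"(?:copyright|\(c\)|©)\s*(?:\d{4}\s*[-–]\s*)?(\d{4})", re.IGNORECASE),
]

def newest_year(page_text):
    """Return the most recent 4-digit year matched by any pattern, or None."""
    years = []
    for pattern in DATE_PATTERNS:
        years += [int(y) for y in pattern.findall(page_text)]
    return max(years) if years else None

def is_stale(page_text, now=None, max_age_years=2):
    """A page with no detectable date is reported as unknown (None)."""
    year = newest_year(page_text)
    if year is None:
        return None
    current = (now or datetime.now()).year
    return (current - year) > max_age_years

print(is_stale("Page last updated: June 2012", now=datetime(2015, 1, 1)))  # True
```

A real pipeline would fetch each registry URL and combine these cues with HTTP headers and sitemap data; the point here is only the pattern-based date extraction.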
11. Keeping content up to date
• New tags come into existence, e.g., Connectome, Tractography, Epigenetics
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Resources change name
• Resources change scope
• > 7,000 updates to the registry last year
It's a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review.
13. NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views.
250 sources; > 800 M records
14. What do you mean by data?
Databases come in many shapes and sizes:
• Primary data: data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data: data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas), brain connectivity statements (BAMS)
• Tertiary data: claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries: metadata and pointers to data sets or materials stored elsewhere
• Data aggregators: aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source: data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies.
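The taxonomy above lends itself to a simple data model. A minimal sketch; `DataLevel`, `SourceKind`, and `Resource` are invented names for illustration, not NIF's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class DataLevel(Enum):
    PRIMARY = "primary"      # raw data available for reanalysis (e.g., GEO arrays)
    SECONDARY = "secondary"  # extracted features (e.g., brain structure volumes)
    TERTIARY = "tertiary"    # claims/assertions about what the data mean

class SourceKind(Enum):
    REGISTRY = "registry"        # metadata + pointers to data held elsewhere
    AGGREGATOR = "aggregator"    # same data type pooled from many sources
    SINGLE_SOURCE = "single"     # data acquired within one context

@dataclass
class Resource:
    name: str
    kind: SourceKind
    levels: set = field(default_factory=set)

# A single-source resource can still expose more than one data level.
geo = Resource("GEO", SourceKind.SINGLE_SOURCE, {DataLevel.PRIMARY})
print(DataLevel.PRIMARY in geo.levels)  # True
```

Tagging each federated source this way is one route to the "what do you mean by data" question: queries can then be scoped to, say, primary data only.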
16. NIF Information Framework: Query and alignment
• NIFSTD: an aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal
[Diagram: NIFSTD modules: Organism, Nervous System Function, Molecule (Macromolecule, Gene, Molecule Descriptors), Investigation (Techniques, Reagent, Protocols, Resource, Instrument), Subcellular Structure, Cell, Anatomical Structure, Dysfunction, Quality.]
NIF uses ontologies to enhance search and discovery but is not constrained by them.
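One way an ontology can enhance search without constraining it is query expansion: the user's term is broadened with synonyms and subclasses, while plain keyword matching still works for anything the ontology misses. A toy sketch with an invented two-entry vocabulary, not NIFSTD itself:

```python
# Illustrative vocabulary only; a real system would pull these relations
# from ontology services such as NIFSTD via NIF or BioPortal.
SYNONYMS = {"cerebellum": {"cerebellar cortex"}}
SUBCLASSES = {"neuron": {"purkinje cell", "granule cell"}}

def expand(term):
    """Broaden a query term with its synonyms and subclasses."""
    terms = {term}
    terms |= SYNONYMS.get(term, set())
    terms |= SUBCLASSES.get(term, set())
    return terms

def search(records, term):
    wanted = expand(term)
    # A record matches if it mentions the term itself OR any expansion of it.
    return [r for r in records if any(w in r.lower() for w in wanted)]

records = ["Purkinje cell firing rates", "Cortical thickness maps"]
print(search(records, "neuron"))  # ['Purkinje cell firing rates']
```

The record about Purkinje cells is found even though it never says "neuron", yet records using vocabulary outside the ontology remain reachable by ordinary keyword search.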
18. Current challenge: With so much available, how do I find what I need?
• "What genes are upregulated by chronic morphine?" It depends.
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools; depending upon the data and tools, the answers may differ.
• Many databases have tool bases and workflows that they support; much value has been added to individual data sets.
19. Facets and filters: Progressive refinement of search
[Diagram: a query for "Addiction" is narrowed step by step using facets and filters: Source (Registry, Data, Literature), Category (Gene, Expression), and Index (Gemma, GEO, Integrated), with further filters such as Gene, Organism, and Expression level.]
More effective to start with a general query and use the navigation to refine search.
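Progressive refinement of this kind can be sketched as conjunctive filtering over faceted records. The records and facet names below are hypothetical, not NIF's API:

```python
# Toy faceted index: each record carries facet values plus free text.
records = [
    {"source": "Gemma", "category": "Gene", "organism": "mouse", "text": "addiction study"},
    {"source": "GEO", "category": "Expression", "organism": "rat", "text": "addiction dataset"},
    {"source": "Registry", "category": "Tool", "organism": None, "text": "imaging pipeline"},
]

def query(records, keyword):
    """Broad entry point: plain keyword match over free text."""
    return [r for r in records if keyword in r["text"]]

def refine(results, **facets):
    """Each keyword argument is a facet; apply them conjunctively."""
    for key, value in facets.items():
        results = [r for r in results if r.get(key) == value]
    return results

hits = query(records, "addiction")               # broad query: 2 hits
mouse_genes = refine(hits, category="Gene", organism="mouse")
print([r["source"] for r in mouse_genes])        # ['Gemma']
```

Starting broad and narrowing interactively, as the slide recommends, maps directly onto this query-then-refine shape.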
20. Concept Mapper: Alignment and weighting
Find: gene cerebellum = find all sources with a column mapped to "gene" that also contain the keyword "cerebellum"; Find: gene Anatomy:cerebellum
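One plausible reading of that query semantics, sketched with invented sources rather than NIF's actual Concept Mapper implementation:

```python
# Toy federation: each source declares which concept each column is mapped
# to, plus some rows of data. Invented contents for illustration.
sources = {
    "Gemma": {"columns": {"symbol": "gene", "level": "expression"},
              "rows": [{"symbol": "GRM1", "level": "cerebellum high"}]},
    "BAMS": {"columns": {"region": "anatomy"},
             "rows": [{"region": "cerebellum"}]},
}

def find(concept, keyword):
    """Sources with a column mapped to `concept` that also contain `keyword`."""
    hits = []
    for name, src in sources.items():
        has_concept = concept in src["columns"].values()
        has_keyword = any(keyword in str(v).lower()
                          for row in src["rows"] for v in row.values())
        if has_concept and has_keyword:
            hits.append(name)
    return hits

print(find("gene", "cerebellum"))  # ['Gemma']
```

BAMS mentions "cerebellum" but has no gene-mapped column, so it is excluded; this is the alignment step that column-to-concept mappings make possible.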
22. Query across Registry and Federation
• Registry and Federation were treated separately, even though the Federation comprises views of Registry entries
• Experimenting with a new combined index
23. SciCrunch: A "social network" for resources
• NIF is a general search engine across all of neuroscience
– Very powerful for discovery and general browsing
– Can perform analytics across the spectrum of biomedical resources
• Many communities want to create more focused portals
– Specialized for their domain
– Restrict the particular sources
– Organize the data according to their needs
– Use their own branding
• How do we create a system that satisfies community needs without creating another silo?
29. Making use of community
[Diagram: the same facet-and-filter refinement as the earlier search example, scoped to a community: Source (Community resources, SciCrunch data (all), Literature), Category (Gene, Expression), Index (Gemma, GEO, Integrated), with filters such as Gene, Organism, and Expression level.]
Brings the expertise of the community to understanding how to work with data.
33. Adult mouse brain connectivity matrix: revenge of the midbrain
SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186
34. The tale of the tail
"Human neuroimaging typically is performed on a whole brain basis. However, for several reasons tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms, that typically are optimized to maximize accuracy for cortical rather than subcortical structures. ...
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body, and completely exclude the tail...
• A final reason is that the tail of the caudate is close to the hippocampus, and could be misidentified as such especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature."
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
35. Importance of comprehensive indices: For how many proteins are there antibodies?
[Chart by Trish Whetzel and Anita Bandrowski: human protein-coding genes (Entrez Gene) binned by number of search results from antibodyregistry.org: 0, 1-10, 11-100, 101-1000, 1001+.]
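The binning behind a chart like this is straightforward; the gene symbols and counts below are made up for illustration, not real antibodyregistry.org results:

```python
from collections import Counter

# Bin edges matching the chart's categories: 0, 1-10, 11-100, 101-1000, 1001+.
BINS = [(0, 0, "0"), (1, 10, "1-10"), (11, 100, "11-100"),
        (101, 1000, "101-1000"), (1001, float("inf"), "1001+")]

def bin_label(n):
    """Map an antibody search-result count to its chart bin."""
    for lo, hi, label in BINS:
        if lo <= n <= hi:
            return label
    raise ValueError(n)

# Invented counts: gene symbol -> number of antibody search results.
antibody_counts = {"GRM1": 250, "OPRM1": 40, "NOVELGENE1": 0}
histogram = Counter(bin_label(n) for n in antibody_counts.values())
print(dict(histogram))  # {'101-1000': 1, '11-100': 1, '0': 1}
```

Run over the full Entrez Gene list against a comprehensive antibody index, this is exactly the tally the slide's chart displays, including the genes with zero antibodies that only a comprehensive index can reveal.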
38. The scourge of neuroanatomical nomenclature
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
– Brain Architecture Management System (rodent)
– Temporal lobe.com (rodent)
– Connectome Wiki (human)
– Brain Maps (various)
– CoCoMac (primate cortex)
– UCLA Multimodal database (human fMRI)
– Avian Brain Connectivity Database (bird)
• Total: 1,800 unique brain terms (excluding avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st-order partonomy matches: 385
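Tallies like the exact-term and synonym counts above come from comparing source vocabularies pairwise. A toy sketch with invented two-source vocabularies, not the seven real databases:

```python
from itertools import combinations

# Illustrative vocabularies and synonym table only.
vocab = {
    "BAMS": {"prefrontal cortex", "caudoputamen"},
    "BrainMaps": {"prefrontal cortex", "caudate-putamen"},
}
synonyms = {"caudoputamen": {"caudate-putamen"}}

def exact_matches(vocab):
    """Terms spelled identically in at least two sources."""
    shared = set()
    for a, b in combinations(vocab.values(), 2):
        shared |= a & b
    return shared

def synonym_matches(vocab, synonyms):
    """Terms recovered only via the synonym table."""
    found = set()
    for a, b in combinations(vocab.values(), 2):
        for term in a:
            if synonyms.get(term, set()) & b:
                found.add(term)
    return found

print(exact_matches(vocab))              # {'prefrontal cortex'}
print(synonym_matches(vocab, synonyms))  # {'caudoputamen'}
```

The gap between 42 exact matches and 1,800 terms on the slide shows how little aligns without synonym and partonomy reasoning, which is what the ontology layer supplies.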
39. 6 parcellation schemes of mouse prefrontal cortex based on Nissl alone
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi: 10.1007/s00429-013-0630-
40. How many neuron types are there?
NIH funding announcement: BRAIN Initiative: Transformative Approaches for Cell-Type Classification in the Brain
"The mammalian brain contains a vast number of cells. These cells are generally grouped within broad classes (e.g., neurons or glia) but it is currently unknown exactly how many classes exist."
41. Transition Zones: Neurons and their properties
• Location of cell soma
• Location of dendrites
• Location of local axon arbor
42. Analysis of Red Links in the Neuron Registry
• INCF Project: Neuron Registry
– Neurolex.org (Semantic MediaWiki)
– > 30 experts worldwide fill out neuron pages in the Neurolex Wiki
[Chart: number of red links (total, easy fixes, hard fixes) for soma location, dendrite location, and axon location; counts range from 0 to 300.]
Social networks and community sites let us learn from the collective behavior of contributors; they show the limits of our knowledge and of our knowledge representations.
43. SciCrunch: Creating a Data and Resource Discovery Environment
[Diagram: Search and Discovery spans three layers: Domain Knowledge (ontologies, atlases/maps, annotation, claims and assertions, registries), Derived Data (models and simulations, analyses), and Data (databases, data sets, literature).]
Cannot try to shoe-horn everything into a single representation or system; instead, figure out how information (data + knowledge) can flow between them. Knowledge is fluid and will continually update.
44. BD2K: Creating a Data Discovery Index
• bioCADDIE: the Biomedical and Health Care Data Discovery and Indexing Engine center
– Dr. Lucila Ohno-Machado, PI
– FORCE11: community engagement piece
• What should a data discovery index do?
– Task forces
– Pilot projects
• How should it be built?
http://biocaddie.org
45. NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer
(retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
46. BD2K-K2BD: Data Discovery Index
• Accounting of what is available
– Comprehensive resource registry
– UPCs for research resources
• Information framework
– Major concepts contained in data, but also an accounting of what happens to data as it flows through the ecosystem (provenance)
• Community-based portals into shared data resources
– Share expertise
– Metrics of trust
– Shared curation and upkeep
• Two-way validation of knowledge to data
47. Registry vs. Federation: metadata about a resource vs. metadata/data in the database
With thousands of databases and other information sources available, simple descriptive metadata will not suffice.
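The registry/federation distinction can be made concrete: a registry entry describes a resource from the outside, while a federated source exposes the records inside it. The resource name, fields, and records below are illustrative, not NIF's actual schema.

```python
# Registry: descriptive metadata *about* a resource (all values invented).
registry_entry = {
    "name": "ExampleNeuronDB",
    "url": "https://example.org/neurondb",
    "description": "Curated database of neuron properties",
}

# Federation: record-level data that a search can actually match against.
federated_records = [
    {"neuron": "Purkinje cell", "property": "firing rate", "value": 40, "unit": "Hz"},
    {"neuron": "Basket cell", "property": "firing rate", "value": 10, "unit": "Hz"},
]

def search(term):
    """Registry search matches only the description; federated search matches data."""
    in_registry = term.lower() in registry_entry["description"].lower()
    in_federation = [r for r in federated_records
                     if any(term.lower() in str(v).lower() for v in r.values())]
    return in_registry, in_federation

hit_reg, hit_fed = search("Purkinje")
```

A query for "Purkinje" misses the registry entry entirely but finds the federated record, which is why descriptive metadata alone does not suffice.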
48. What have we learned: Grabbing the long tail of small data
• NIF is in a unique position to ask questions against the data resource landscape
• The data space is not uniform
• Data "flows" from one resource to the next
  – Data is reinterpreted, reanalyzed, or added to
• It is currently very difficult to track data as it moves across the landscape
  – This makes it difficult to learn from combined efforts
49.
50. Working with and extending ontologies: Neurolex.org
http://neurolex.org (Larson et al., Frontiers in Neuroinformatics, in press)
• Semantic MediaWiki
• Provides a simple interface for defining the concepts required
• Lightweight semantics: sets of triples
• Good teaching tool for learning about semantic integration and the benefits of a consistent semantic framework
• Community based:
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
Demo D03
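The "sets of triples" model behind Neurolex can be sketched in a few lines: a knowledge base is just a set of (subject, predicate, object) statements plus pattern matching. The neuron names and properties below are illustrative examples, not authoritative Neurolex entries.

```python
# Lightweight semantics: the knowledge base as a set of
# (subject, predicate, object) triples. Entries are illustrative.
triples = {
    ("PurkinjeCell", "is_a", "Neuron"),
    ("PurkinjeCell", "has_soma_location", "CerebellarCortex"),
    ("PurkinjeCell", "has_neurotransmitter", "GABA"),
    ("BasketCell", "is_a", "Neuron"),
    ("BasketCell", "has_soma_location", "CerebellarCortex"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# Which cells have their soma in cerebellar cortex?
cells = [s for s, _, _ in query(p="has_soma_location", o="CerebellarCortex")]
```

Because every statement has the same three-part shape, anyone can add a term, edit a property, or link two concepts without touching a fixed schema — which is what makes the wiki model work for community curation.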
51. Neuron Lexicon: Gauging the state of knowledge in neuroscience
• Led by Dr. Gordon Shepherd
• > 30 worldwide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with the Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
54. Same data: different analysis
• Gemma: Gene ID + gene symbol
• DRG: gene name + probe ID
• Gemma presented results relative to a chronic morphine baseline; DRG relative to saline, so the direction of change is opposite in the two databases
Chronic vs. acute morphine in striatum
• Analysis:
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment
NIF is working to make it easier to find where data has gone and what has been done with it
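The consistency check described above hinges on one subtlety: because the two databases report against opposite baselines, a statement's direction must be flipped before comparison. A minimal sketch, with invented gene symbols and directions standing in for the real Gemma/DRG statements:

```python
# Hypothetical per-gene direction-of-change statements.
# Gemma reports relative to a chronic-morphine baseline; DRG reports
# relative to saline, so a consistent pair has *opposite* directions.
gemma = {"Oprm1": "up", "Fos": "down", "Arc": "up"}
drg   = {"Oprm1": "down", "Fos": "up", "Arc": "up"}

def flip(direction):
    """Re-express a direction against the opposite baseline."""
    return {"up": "down", "down": "up"}[direction]

# A Gemma statement is confirmed by DRG when DRG reports the
# baseline-flipped direction for the same gene.
consistent = [g for g in gemma if g in drg and drg[g] == flip(gemma[g])]
```

Here `Arc` reads "up" in both databases, which after baseline normalization is actually a disagreement — the kind of mismatch that naive record matching would count as a confirmation.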
55. How many do we use?
These resources themselves need to be citable
56. Resource Identification Initiative: Linking resources to literature
• Have authors supply appropriate identifiers for key resources used within a study such that they are:
  – Machine processable (i.e., a unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Pilot project: SciCrunch portal serving identifiers for
  – Software/databases
  – Antibodies
  – Genetically modified organisms
Launched February 2014: > 30 journals participating
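"Machine processable" and "uniform across journals" mean a single pattern can pull every resource identifier out of a methods section. A sketch of that extraction; the sentence and the specific identifiers below are illustrative, though the `RRID:` prefix style follows the published convention:

```python
import re

# Invented methods-section sentence with illustrative RRIDs
# (AB_ prefix for antibodies, SCR_ for software, per the RRID convention).
methods = ("Sections were stained with anti-GFAP (RRID:AB_306827) and "
           "analyzed in ImageJ (RRID:SCR_003070).")

# Because the identifier scheme is uniform, one regular expression
# recovers every resource citation regardless of journal or publisher.
rrids = re.findall(r"RRID:[A-Z]+_[A-Za-z0-9_:-]+", methods)
```

The same pattern works outside the paywall on abstracts and full text alike, which is what lets downstream tools count which studies used a given resource.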
57. What studies have used...?
• > 200 articles have appeared to date
• > 30 journals
• Data set being made available to the community
• > 650 RRIDs
  • ~10% disappeared after copyediting
  • 5% were in error
Database available at: https://www.force11.org/node/5635
58. Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on
computers