Presented at Beyond the PDF2, Amsterdam, 2013 (http://www.force11.org/beyondthepdf2). This talk describes preliminary data showing a lack of scientific reproducibility based solely on an inability to identify the material resources used in the research. Final work to be published soon!
What is the reproducibility crisis in science and what can we do about it? - Dorothy Bishop
Talk given to the Rhodes Biomedical Association, 4th May 2016.
For references see: http://www.slideshare.net/deevybishop/references-on-reproducibility-crisis-in-science-by-dvm-bishop
Patient-led deep phenotyping using a lay-friendly version of the Human Phenotype Ontology - mhaendel
Presented at AMIA TBI CRI 2018.
Rare disease patients are experts in their own medical histories; they are not only among the most engaged patients, but can also themselves provision data for use in clinical evaluation. We therefore created a lay-person version of our clinical deep-phenotyping instrument, the Human Phenotype Ontology. Here, we evaluate the diagnostic utility of this lay HPO and debut a new software tool for patient-led deep phenotyping.
The Software and Data Licensing Solution: Not Your Dad’s UBMTA - mhaendel
Presented at the Association of University Technology Managers (AUTM) Annual Conference 2018
Moderator: Arvin Paranjpe, Oregon Health & Science University
Speakers: Frank Curci, Ater Wynne LLP
Melissa Haendel, Oregon Health & Science University
Charles Williams, University of Oregon
Big data is an open frontier, and it’s quickly expanding. However, transaction costs and legal barriers stand squarely in the way of meaningful, far-reaching data integration. We’ll grapple with the issues raised by a large-scale data integration project spanning humans, model organisms, and non-model organisms. Without pointing fingers, we’ll also share a few highlights from the (Re)usable Data Project, which outlined a five-part rubric to evaluate data licenses with respect to clarity and to the reuse and redistribution of data. The topic also raises the question: how well suited are off-the-shelf software and data licenses for universities? Data scientists and software programmers are all too quick to pick one when they release their technology on GitHub. What should technology transfer professionals recommend? We’ll discuss the usefulness and attributes of a uniform software and data license for university researchers and software programmers.
Equivalence is in the (ID) of the beholder - mhaendel
Presented at PIDapalooza 2018. https://pidapalooza.org/
Determining identifier equivalency is key to data integration and to realizing the scientific discoveries that can only be made by collating our vast disconnected data stores.
There are two key problems in determining equivalency: conceptual and syntactic alignment. Conceptual alignment often relies on Xrefs and string-matching against synonyms, but there is a better way: algorithmic determination of identifier equivalency across different sources can combine Xrefs, prior rules, existing semantic relations, and synonyms to create equivalency cliques that highlight discrepancies in conceptual definitions for manual review. This is especially useful for data sources subject to concept drift and definitional differences, such as disease terminologies. The syntactic problem is that the same identifier appears in many variant forms, making data joins difficult. We present a framework to reconcile and provide authoritative, integration-ready prefixed identifiers (CURIEs), to capture and consolidate prefixes, and to build links across key resource registries. The combination of JSON-LD context technology with a prefix metadata repository provides the basis for infrastructure that handles identifiers in a consistent fashion. Finally, this architecture also allows resources to be self-describing "beacons" with respect to their identifiers.
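The identifier-normalization idea above can be sketched as follows, assuming a toy two-entry prefix context (a real registry and JSON-LD context are far larger, and the prefix-synonym table here is purely illustrative):

```python
# A minimal sketch (not the project's actual implementation) of how a
# JSON-LD-style context -- a prefix-to-IRI map -- can normalize the many
# syntactic variants of one identifier into a single canonical CURIE.
# The prefix entries and synonym spellings below are illustrative.

CONTEXT = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
}

# Common variant spellings of a prefix, mapped to the canonical form.
PREFIX_SYNONYMS = {"hp": "HP", "hpo": "HP", "mondo": "MONDO"}

def to_curie(identifier: str) -> str:
    """Normalize an identifier variant (CURIE-like or full IRI) to a canonical CURIE."""
    # Full IRI: find the matching base in the context.
    for prefix, base in CONTEXT.items():
        if identifier.startswith(base):
            return f"{prefix}:{identifier[len(base):]}"
    # CURIE-like form: normalize the separator and the prefix spelling.
    head, _, local = identifier.replace("_", ":").partition(":")
    prefix = PREFIX_SYNONYMS.get(head.lower(), head.upper())
    return f"{prefix}:{local}"

def expand(curie: str) -> str:
    """Expand a canonical CURIE back to a full IRI using the context."""
    prefix, _, local = curie.partition(":")
    return CONTEXT[prefix] + local

print(to_curie("http://purl.obolibrary.org/obo/HP_0001250"))  # HP:0001250
```

With such a context, `hp:0001250`, `HP_0001250`, and the full IRI all join on the same canonical key, which is the property that makes cross-source data joins tractable.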
Building (and traveling) the data-brick road: A report from the front lines... - mhaendel
The NIH Data Commons must treat the data it will contain not unlike the mortar and stones of a road. To help our fellow traveling scientists use the road, we must engineer for heavy traffic and diverse destinations. There are many steps to architecting a robust and persistent road. First, the data must be sourced and manipulated into common data models. This requires versioned access to the data, determining the equivalency of identifiers within the data or minting new ones, and transforming the data according to common data models (e.g., one source may model a genotype-to-phenotype association as a variant related to a disease, while another relates a set of alleles to a set of phenotypes; each source models the data differently). Inclusion of the data in the Commons must meet all licensing restrictions, which are varied and usually poorly declared, as well as security, HIPAA, and ethics requirements. Software tools are needed to perform the Extract-Transform-Load (ETL) process on a regular cycle to keep the data current, and to assess changes and quality over time. For records that disappear, there needs to be a way to keep an archive. Once in the Commons, the data requires a map to navigate the roads: where do you want to go? Indexing and search across the data require the data to be self-reporting: loading the ontologies used in the data for indexing and providing faceted query over these and other attributes, sophisticated text mining tools, relevance ranking, and equivalency and similarity determination across providers. Once the data is found, users need vehicles to drive upon the road. These are their workspaces, the places where they design and implement the operations they need to get where they want to go.
Unimaginable scientific emeralds are to be found at the end of the road, as the sum of all the data, if well integrated and made computationally reusable, has proven to be far greater than its parts in getting us where we want to go.
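The "transform into a common data model" step described above might look like this minimal sketch, with invented field names and records standing in for two differently-modeled sources:

```python
# A hypothetical sketch of the transform step: two sources model
# genotype-to-phenotype data differently, and both are normalized into
# one common association record (subject, predicate, object, source).
# All field names and example identifiers are invented for illustration.

def normalize_source_a(record):
    """Source A relates a single variant to a disease."""
    return [{
        "subject": record["variant_id"],
        "predicate": "associated_with",
        "object": record["disease_id"],
        "source": "A",
    }]

def normalize_source_b(record):
    """Source B relates a set of alleles to a set of phenotypes:
    flatten into one association per (allele, phenotype) pair."""
    return [
        {"subject": a, "predicate": "associated_with", "object": p, "source": "B"}
        for a in record["alleles"]
        for p in record["phenotypes"]
    ]

associations = (
    normalize_source_a({"variant_id": "VAR:1", "disease_id": "MONDO:0000001"})
    + normalize_source_b({"alleles": ["VAR:2", "VAR:3"],
                          "phenotypes": ["HP:0001250"]})
)
# Three uniform records, ready to index and load.
print(len(associations))  # 3
```

The point of the sketch is only that once every source is flattened into the same association shape, indexing, faceting, and equivalency checks can operate uniformly downstream.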
Reusable data for biomedicine: A data licensing odyssey - mhaendel
Biomedical data integrators grapple with a fundamental blocker in research today: licensing for data use and redistribution. Complex licensing and data reuse restrictions hinder most publicly funded, seemingly “open” biomedical data from being put to its full potential. Such issues include missing licenses, non-standard licenses, and restrictive provisions. The sheer diversity of licenses is particularly thorny for those who aim to redistribute data. Redistributors are often required to contact each sub-source to obtain permissions, and this is complicated by the fact that on each side of the agreement there may be multiple legal entities involved, and some sub-sources may themselves already be aggregating data from other sub-sources. Furthermore, interpreting legal compliance with source data licensing and use agreements is complicated, as data is often manipulated, shared, and redistributed by many types of research groups and users in various and subtle ways. Here, we debut a new effort, the (Re)usable Data Project, in which we have created a five-part rubric to evaluate biomedical data sources and their licensing information to determine the degree to which unnegotiated and unrestricted reuse and redistribution are provided. We have tested the (Re)usable Data rubric against various biomedical data sources, ranking each source on a scale of zero to five stars, and have found that approximately half of the resources rank poorly, getting 2.5 stars or fewer. Our goal is to help biomedical informaticians and other users navigate the plethora of issues in reusing and redistributing biomedical data. The (Re)usable Data Project aims to promote standardization and ease of reuse in licensing practices by data providers.
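As a rough illustration of how a five-part, zero-to-five-star rubric can be scored (the criterion names and the half-star weighting below are assumptions for the sketch, not the actual (Re)usable Data rubric):

```python
# A hedged sketch of scoring a five-part licensing rubric. Each criterion
# contributes up to one star (0.0, 0.5, or 1.0), for a total of zero to
# five stars. Criterion names and the example assessment are invented.

CRITERIA = ["license_findable", "license_standard", "permits_reuse",
            "permits_redistribution", "no_negotiation_needed"]

def star_score(assessment: dict) -> float:
    """Sum the per-criterion scores into a star total; missing criteria score 0."""
    return sum(assessment.get(c, 0.0) for c in CRITERIA)

# A hypothetical source with a findable, standard license that still
# restricts redistribution and requires negotiated permissions.
source = {"license_findable": 1.0, "license_standard": 1.0,
          "permits_reuse": 0.5, "permits_redistribution": 0.0,
          "no_negotiation_needed": 0.0}
print(star_score(source))  # 2.5, i.e., in the poorly-ranking half
```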
Data Translator: an Open Science Data Platform for Mechanistic Disease Discovery - mhaendel
Architecture of language and data translation that underlies the NCATS Biomedical Data Translator. Presented at the Fanconi Anemia Annual Meeting. http://fanconi.org/index.php/research/annual_symposium
How open is open? An evaluation rubric for public knowledgebases - mhaendel
Presented at the 2017 International Biocuration Conference.
Data relevant to any given scientific investigation is highly decentralized across thousands of specialized databases. Within the Biocuration community, we recognize that the value of open scientific knowledge bases is that they make scientific knowledge easier to find and compute, thereby maximizing impact and minimizing waste. The ever-increasing number of databases forces us to question our priorities with respect to maintaining them, developing new ones, or senescing/subsuming those that have completed their mission. Therefore, open biomedical data repositories should be carefully evaluated according to the quality, accessibility, and value of the database resources over time and across the translational divide.
Traditional citation counts and publication impact factors are known to be inadequate measures of the success or value of a resource. This is especially true for integrative resources. For example, almost everyone in biomedicine relies on PubMed, but almost no one ever cites or mentions it in their publications. While the Nucleic Acids Research Database issues have increased citation of some databases, many still go unpublished or uncited; even novel derivations of methodology, applications, and workflows from biomedical knowledge bases are often “adapted” but never cited. There is a lack of citation best practices for widely used biomedical database resources (e.g., should a paper be cited? A URL? Is mention of the name and access date sufficient?).
We have developed a draft rubric for evaluating open science databases according to the commonly cited FAIR principles (Findable, Accessible, Interoperable, and Reusable), with three additional principles: Traceable, Licensed, and Connected. These additions are largely overlooked and underappreciated, yet are critical to reuse of the knowledge contained within any given database. It is worth noting that the FAIR principles apply not only to a resource as a whole, but also to its key components; this “fractal FAIRness” means that even the licenses, identifiers, vocabularies, and APIs themselves must be Findable, Accessible, Interoperable, Reusable, etc. Here we report on initial testing of our evaluation rubric on the recent NIH/Wellcome Trust Open Science projects and seek community input on how to further advance this rubric as a Biocuration community resource.
Deep phenotyping to aid identification of coding & non-coding rare disease variants - mhaendel
Whole-exome sequencing has revolutionized disease research, but many cases remain unsolved because ~100-1000 candidates remain after removing common or non-pathogenic variants. We present Genomiser to prioritize coding and non-coding variants by leveraging phenotype data encoded with the Human Phenotype Ontology and a curated database of non-coding Mendelian variants. Genomiser is able to identify causal regulatory variants as the top candidate in 77% of simulated whole genomes.
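A toy sketch of the kind of phenotype-driven prioritization described above (this is not Genomiser's actual algorithm; the candidate records, scores, and the naive averaging are invented purely for illustration):

```python
# A toy illustration of phenotype-driven variant prioritization: each
# remaining candidate gets a combined score from an invented variant-level
# score and a phenotype-match score, and candidates are ranked so the
# best-supported variant comes first. All values here are made up.

candidates = [
    {"variant": "chr1:g.100A>T", "variant_score": 0.80, "phenotype_score": 0.20},
    {"variant": "chr2:g.200C>G", "variant_score": 0.70, "phenotype_score": 0.95},
    {"variant": "chr3:g.300G>A", "variant_score": 0.95, "phenotype_score": 0.10},
]

def combined(c):
    # Naive mean of the two evidence channels; real tools combine the
    # channels with calibrated statistical models, not a plain average.
    return (c["variant_score"] + c["phenotype_score"]) / 2

ranked = sorted(candidates, key=combined, reverse=True)
print(ranked[0]["variant"])  # the chr2 variant: strong phenotype support
```

Note how the chr3 variant, despite the highest variant-level score, is outranked once phenotype evidence is folded in; this is the intuition behind letting HPO-encoded phenotype data break ties among hundreds of surviving candidates.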
Global Phenotypic Data Sharing Standards to Maximize Diagnostics and Mechanis... - mhaendel
Presented at the IRDiRC 2017 conference in Paris, Feb 9th, 2017 (http://irdirc-conference.org/). This talk reviews use of the Human Phenotype Ontology for phenotype comparisons against other patients, known diseases, and animal models for diagnostic discovery. It also discusses the new Phenopackets Exchange mechanism for open phenotypic data sharing.
www.monarchinitiative.org
www.phenopackets.org
www.human-phenotype-ontology.org
Credit where credit is due: acknowledging all types of contributions - mhaendel
This is an update for COASP (http://oaspa.org/conference/) on the representation of attribution beyond authorship of a publication. Publications are proxies for the projects and people that are actually engaged in the work, and represent only the dissemination aspect. How can we better understand the individual contributions and their impact? The openRIF, openVIVO, and FORCE11 Attribution WG efforts aim to represent scholarship in a computationally tractable manner so as to enable credit and evaluation of all types of scholarly contributions.
The Human Phenotype Ontology (HPO) was developed to describe phenotypic abnormalities, i.e., “deep phenotyping,” whereby symptoms and characteristic phenotypic findings (a phenotypic profile) are captured. The HPO has been utilized to great success for computational phenotype comparison against known diseases, other patients, and model organisms to support diagnosis of rare disease patients. Clinicians and geneticists create phenotypic profiles based on clinical evaluation, but this is time consuming and can miss important phenotypic features. Patients are sometimes the best source of information about symptoms that might otherwise be missed in a clinical encounter. However, the HPO primarily uses medical terminology, which can be difficult for patients and their families to understand. To make the HPO accessible to patients, we systematically added non-expert (i.e., layperson) synonyms. Using semantic similarity, patient-recorded phenotypic profiles can be evaluated against those created clinically for undiagnosed patients to determine the improvement gained from patient-driven phenotyping, as well as how much the patient phenotyping narrows the diagnosis. This patient-centric HPO can be utilized by all: in patient-centered rare disease websites, in patient community platforms and registries, or even to post one’s hard-to-diagnose phenotypic profile on the Web.
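The semantic-similarity comparison described above can be sketched over a toy ontology fragment (the term IDs, parent links, and the simple ancestor-closure Jaccard measure below are illustrative assumptions; production tools use information-content-based measures over the real HPO):

```python
# A minimal sketch of comparing two phenotypic profiles: each term is
# closed over its ancestors, and the two closed term sets are compared
# with a Jaccard similarity. The tiny "ontology" below is invented.

# Hypothetical parent links for a three-term HPO-like fragment.
PARENTS = {
    "HP:seizure": ["HP:neuro_abnormality"],
    "HP:absence_seizure": ["HP:seizure"],
    "HP:neuro_abnormality": [],
}

def ancestors(term):
    """The term itself plus all of its ancestors, via the parent links."""
    result = {term}
    for p in PARENTS.get(term, []):
        result |= ancestors(p)
    return result

def profile_similarity(profile_a, profile_b):
    """Jaccard similarity of the ancestor-closed term sets of two profiles."""
    a = set().union(*(ancestors(t) for t in profile_a))
    b = set().union(*(ancestors(t) for t in profile_b))
    return len(a & b) / len(a | b)

# A lay-recorded, more general "seizure" still scores a substantial
# overlap with a clinician's more specific "absence seizure".
print(profile_similarity(["HP:seizure"], ["HP:absence_seizure"]))
```

Closing over ancestors is what lets a coarse layperson term and a precise clinical term register as similar rather than as disjoint strings, which is the key to evaluating patient-recorded profiles against clinical ones.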
Why the world needs phenopacketeers, and how to be one - mhaendel
Keynote presented at the Ninth International Biocuration Conference, Geneva, Switzerland, April 10-14, 2016
The health of an individual organism results from complex interplay between its genes and environment. Although great strides have been made in standardizing the representation of genetic information for exchange, there are no comparable standards to represent phenotypes (e.g. patient disease features, variation across biodiversity) or environmental factors that may influence such phenotypic outcomes. Phenotypic features of individual organisms are currently described in diverse places and in diverse formats: publications, databases, health records, registries, clinical trials, museum collections, and even social media. In these contexts, biocuration has been pivotal to obtaining a computable representation, but is still deeply challenged by the lack of standardization, accessibility, persistence, and computability among these contexts. How can we help all phenotype data creators contribute to this biocuration effort when the data is so distributed across so many communities, sources, and scales? How can we track contributions and provide proper attribution? How can we leverage phenotypic data from the model organism or biodiversity communities to help diagnose disease or determine evolutionary relatedness? Biocurators unite in a new community effort to address these challenges.
On the frontier of genotype-2-phenotype data integration - mhaendel
Presented at the AMIA TBI 2016 BD2K Panel. A description of the Monarch Initiative's efforts to perform deep phenotyping data integration across species, facilitate data exchange, and build computable G2P evidence models to aid variant interpretation.
Envisioning a world where everyone helps solve disease - mhaendel
Keynote presented at the Semantic Web for Life Sciences conference in Cambridge, UK, December 9th, 2015
http://www.swat4ls.org/
The talk focuses on the use of ontologies for data integration to support rare disease diagnostics, and on how many people, unbeknownst to the patient or even to the researchers creating the data, are involved in a diagnosis.
The Monarch Initiative: From Model Organism to Precision Medicine - mhaendel
NIH BD2K all-hands meeting poster November 12, 2015.
Attempts at correlating phenotypic aspects of disease with causal genetic influences are often confounded by the challenges of interpreting diverse data distributed across numerous resources. New approaches to data modeling, integration, tooling, and community practices are needed to make efficient use of these data. The Monarch Initiative is an international consortium working on the development of shared data, tools, and standards to enable direct translation of integrated genotype, phenotype, and environmental data from human and model organisms to enhance our understanding of human disease. We utilize sophisticated semantic mapping techniques across a diverse set of standardized ontologies to deeply integrate data across species, sources, and modalities. Using phenotype similarity matching algorithms across these data enables disorder prediction, variant prioritization, and patient matching against known diseases and model organisms. These similarity algorithms form the core of several innovative tools. The Exomiser enables exome variant prioritization by combining pathogenicity, frequency, inheritance, protein interaction, and cross-species phenotype data. Our Phenotype Sufficiency tool provides clinicians the ability to compare patient phenotypic profiles using the Human Phenotype Ontology to determine uniqueness and specificity in support of variant prioritization. The PhenoGrid visualization widget illustrates phenotype similarity between patients, known diseases, and model organisms. Monarch develops models in collaboration with the community in support of the burgeoning genotype-phenotype disease research community. We have successfully used the Exomiser to solve a number of undiagnosed patient cases in collaboration with the NIH Undiagnosed Disease Program.
Ongoing development in coordination with the Global Alliance for Genetic Health (GA4GH) and other groups will catalyze the realization of our goal of a vital translational community focused on the collaborative application of integrated genotype, phenotype, and environmental data to human disease.
Force11: Enabling transparency and efficiency in the research landscape - mhaendel
Presented at the Feb 2015, NISO Virtual Conference
Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
http://www.niso.org/news/events/2015/virtual_conferences/sci_data_management/
Reusable data for biomedicine: A data licensing odysseymhaendel
Biomedical data integrators grapple with a fundamental blocker in research today: licensing for data use and redistribution. Complex licensing and data reuse restrictions hinder most publicly-funded, seemingly “open” biomedical data from being put to its full potential. Such issues include missing licenses, non-standard licenses, and restrictive provisions. The sheer diversity of licenses are particularly thorny for those that aim to redistribute data. Redistributors are often required to contact each sub-source to obtain permissions, and this is complicated by the fact that on each side of the agreement there may be multiple legal entities involved and some sub-sources may themselves already be aggregating data from other sub-sources. Furthermore, interpreting legal compliance with source data licensing and use agreements is complicated, as data is often manipulated, shared, and redistributed by many types of research groups and users in various and subtle ways. Here, we debut a new effort, the (Re)usable Data Project, where we have created a five-part rubric to evaluate biomedical data sources and their licensing information to determine the degree to which unnegotiated and unrestricted reuse and redistribution are provided. We have tested the (Re)usable Data rubric against various biomedical data sources, ranking each source on a scale of zero to five stars, and have found that approximately half of the resources rank poorly, getting 2.5 stars or less. Our goal is to help biomedical informaticians and other users navigate the plethora of issues in reusing and redistributing biomedical data. The (Re)usable Data project aims to promote standardization and ease of reuse licensing practices by data providers.
Data Translator: an Open Science Data Platform for Mechanistic Disease Discoverymhaendel
Architecture of language and data translation that underlays the NCATS Biomedical Data Translator. Presented at the Fanconi Anemia Annual Meeting. http://fanconi.org/index.php/research/annual_symposium
How open is open? An evaluation rubric for public knowledgebasesmhaendel
Presented at the 2017 International Biocuration Conference.
Data relevant to any given scientific investigation is highly decentralized across thousands of specialized databases. Within the Biocuration community, we recognize that the value of open scientific knowledge bases is that they make scientific knowledge easier to find and compute, thereby maximizing impact and minimizing waste. The ever-increasing number of databases makes us necessarily question what are our priorities with respect to maintaining them, developing new ones, or senescing/subsuming ones that have completed in their mission. Therefore, open biomedical data repositories should be carefully evaluated according to quality, accessibility, and value of the database resources over time and across the translational divide.
Traditional citation count and publication impact factors as a measure of success or value are known to be inadequate to assess the usefulness of a resource. This is especially true for integrative resources. For example, almost everyone in biomedicine relies on PubMed, but almost no one ever cites or mentions it in their publications. While the Nucleic Acids Research Database issues have increased citation of some databases, many still go unpublished or uncited; even novel derivations of methodology, applications, and workflows from biomedical knowledge bases are often “adapted” but never cited. There is a lack of citation best practices for widely used biomedical database resources (e.g. should a paper be cited? A URL? Is mention of the name and access date sufficient?).
We have developed a draft evaluation rubric for evaluating open science databases according to the commonly cited FAIR principles -- Findable, Accessible, Interoperable, and Reusable, but with three additional principles: Traceable, Licensed, and Connected. These additions are largely overlooked and underappreciated, yet are critical to reuse of the knowledge contained within any given database. It is worth noting that FAIR principles apply not only to the resource as a whole, but also to their key components; this “fractal FAIRness” means that even the license, identifiers, vocabularies, APIs themselves must be Findable, Accessible, Interoperable, Reusable, etc. Here we report on initial testing of our evaluation rubric on the recent NIH/Wellcome Trust Open Science projects and seek community input for how to further advance this rubric as a Biocuration community resource.
Deep phenotyping to aid identification of coding & non-coding rare disease v...mhaendel
Whole-exome sequencing has revolutionized disease research, but many cases remain unsolved because ~100-1000 candidates remain after removing common or non-pathogenic variants. We present Genomiser to prioritize coding and non-coding variants by leveraging phenotype data encoded with the Human Phenotype Ontology and a curated database of non-coding Mendelian variants. Genomiser is able to identify causal regulatory variants as the top candidate in 77% of simulated whole genomes.
Global Phenotypic Data Sharing Standards to Maximize Diagnostics and Mechanis...mhaendel
Presented at the IRDiRC 2017 conference in Paris, Feb 9th, 2017 (http://irdirc-conference.org/). This talk reviews use of the Human Phenotype Ontology for phenotype comparisons against other patients, known diseases, and animal models for diagnostic discovery. It also discusses the new Phenopackets Exchange mechanism for open phenotypic data sharing.
www.monarchinitiative.org
www.phenopackets.org
www.human-phenotype-ontology.org
Credit where credit is due: acknowledging all types of contributionsmhaendel
This is an update for COASP (http://oaspa.org/conference/) on the representation of attribution beyond authorship of a publication. Publications are proxies for the projects and people that area actually engaged in the work, and represent the dissemination aspect. How can we better understand the individual contributions and their impact? The openRIF, openVIVO and FORCE11 Attribution WG efforts aim to represent scholarship in a computationally tractable manner so as to enable credit and evaluation of all types of scholarly contributions.
The Human Phenotype Ontology (HPO) was developed to describe phenotypic abnormalities, aka, “deep phenotyping”, whereby symptoms and characteristic phenotypic findings (a phenotypic profile) are captured. The HPO has been utilized to great success for assisting computational phenotype comparison against known diseases, other patients, and model organisms to support diagnosis of rare disease patients. Clinicians and geneticists create phenotypic profiles based on clinical evaluation, but this is time consuming and can miss important phenotypic features. Patients are sometimes the best source of information about their symptoms that might otherwise be missed in a clinical encounter. However, HPO primarily use medical terminology, which can be difficult for patients and their families to understand. To make the HPO accessible to patients, we systematically added non-expert terminology (i.e., layperson terms) synonyms. Using semantic similarity, patient-recorded phenotypic profiles can be evaluated against those created clinically for undiagnosed patients to determine the improvement gained from the patient-driven phenotyping, as well as how much the patient phenotyping narrows the diagnosis. This patient-centric HPO can be utilized by all: in patient-centered rare disease websites, in patient community platforms and registries, or even to post one’s hard-to-diagnosed phenotypic profile on the Web.
Why the world needs phenopacketeers, and how to be onemhaendel
Keynote presented at the Ninth International Biocuration Conference, Geneva, Switzerland, April 10-14, 2016
The health of an individual organism results from complex interplay between its genes and environment. Although great strides have been made in standardizing the representation of genetic information for exchange, there are no comparable standards to represent phenotypes (e.g. patient disease features, variation across biodiversity) or environmental factors that may influence such phenotypic outcomes. Phenotypic features of individual organisms are currently described in diverse places and in diverse formats: publications, databases, health records, registries, clinical trials, museum collections, and even social media. In these contexts, biocuration has been pivotal to obtaining a computable representation, but is still deeply challenged by the lack of standardization, accessibility, persistence, and computability among these contexts. How can we help all phenotype data creators contribute to this biocuration effort when the data is so distributed across so many communities, sources, and scales? How can we track contributions and provide proper attribution? How can we leverage phenotypic data from the model organism or biodiversity communities to help diagnose disease or determine evolutionary relatedness? Biocurators unite in a new community effort to address these challenges.
On the frontier of genotype-2-phenotype data integrationmhaendel
Presented at AMIA TBI 2016 BD2K Panel. A description of the Monarch Initiative's efforts to perform deep phenotyping data integration across species, facilitate exchange, and build computable G2P evidence models to aid variant interpretation.
Envisioning a world where everyone helps solve diseasemhaendel
Keynote presented at the Semantic Web for Life Sciences conference in Cambridge, UK, December 9th, 2015
http://www.swat4ls.org/
The talk focuses on the use of ontologies for data integration to support rare disease diagnostics, and on how very many people, unbeknownst to the patient or even to the researchers creating the data, are involved in a diagnosis.
The Monarch Initiative: From Model Organism to Precision Medicinemhaendel
NIH BD2K all-hands meeting poster November 12, 2015.
Attempts at correlating phenotypic aspects of disease with causal genetic influences are often confounded by the challenges of interpreting diverse data distributed across numerous resources. New approaches to data modeling, integration, tooling, and community practices are needed to make efficient use of these data. The Monarch Initiative is an international consortium working on the development of shared data, tools, and standards to enable direct translation of integrated genotype, phenotype, and environmental data from human and model organisms to enhance our understanding of human disease. We utilize sophisticated semantic mapping techniques across a diverse set of standardized ontologies to deeply integrate data across species, sources, and modalities. Using phenotype similarity matching algorithms across these data enables disorder prediction, variant prioritization, and patient matching against known diseases and model organisms. These similarity algorithms form the core of several innovative tools. The Exomiser enables exome variant prioritization by combining pathogenicity, frequency, inheritance, protein interaction, and cross-species phenotype data. Our Phenotype Sufficiency tool provides clinicians the ability to compare patient phenotypic profiles using the Human Phenotype Ontology to determine uniqueness and specificity in support of variant prioritization. The PhenoGrid visualization widget illustrates phenotype similarity between patients, known diseases, and model organisms. Monarch develops models in collaboration with the community in support of the burgeoning genotype-phenotype disease research community. We have successfully used Exomiser to solve a number of undiagnosed patient cases in collaboration with the NIH Undiagnosed Disease Program.
Ongoing development in coordination with the Global Alliance for Genomics and Health (GA4GH) and other groups will catalyze the realization of our goal: a vital translational community focused on the collaborative application of integrated genotype, phenotype, and environmental data to human disease.
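The phenotype similarity matching described above can be sketched as ranking candidate diseases by how well their annotated phenotype profiles cover a patient's profile. This is a hedged illustration only: the disease names and HPO IDs are hypothetical, and real tools such as Exomiser combine ontology-aware phenotype scores with variant-level evidence rather than the simple overlap used here.

```python
# Minimal sketch of phenotype-driven disease ranking. Simple term overlap
# stands in for the semantic similarity algorithms described in the text.
# Disease profiles and HPO term IDs are illustrative placeholders.

def overlap_score(patient: set[str], disease: set[str]) -> float:
    """Fraction of the patient's phenotype terms found in the disease profile."""
    if not patient:
        return 0.0
    return len(patient & disease) / len(patient)

def rank_diseases(
    patient: set[str], diseases: dict[str, set[str]]
) -> list[tuple[str, float]]:
    """Return (disease, score) pairs, best match first."""
    scores = {name: overlap_score(patient, terms) for name, terms in diseases.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical disease-to-phenotype annotations
diseases = {
    "Disease A": {"HP:0001250", "HP:0001263", "HP:0000252"},
    "Disease B": {"HP:0001250", "HP:0002360"},
}
patient = {"HP:0001250", "HP:0000252"}

for name, score in rank_diseases(patient, diseases):
    print(name, round(score, 2))  # Disease A 1.0, then Disease B 0.5
```

The same ranking idea extends to matching a patient against model-organism phenotype annotations, which is what allows cross-species data to inform diagnosis.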
Force11: Enabling transparency and efficiency in the research landscapemhaendel
Presented at the Feb 2015, NISO Virtual Conference
Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
http://www.niso.org/news/events/2015/virtual_conferences/sci_data_management/
Semantic phenotyping for disease diagnosis and discovery
On the reproducibility of science
1. On the reproducibility of science
Melissa Haendel
Beyond the PDF2, 20 March 2013
@ontowonka | haendel@ohsu.edu
2. The science cycle
Slide from Gully Burns
Do we know if the infrastructure is actually broken?
3. The science cycle
Image: http://www.joinchangenation.org/blog/post/roadblocks-on-the-pathway-to-citizenship
This is a broken data story.
4. Reproducibility is dependent, at a minimum, on using the same resources. But…
"All companies from which materials were obtained should be listed." - A well-known journal
Journal guidelines for methods are often poor and space is limited
5. Hypothesis: Antibodies in the published literature are not uniquely identifiable
Gather journal articles: 28 journals; 5 domains (immunology, cell biology, neuroscience, developmental biology, general biology); 3 impact factors (high, medium, low); 119 papers; 454 antibodies (408 commercial, 46 non-commercial)
Identifying questions: Is the antibody identifiable on the vendor site? Is the catalog number reported? Is the source organism reported? Is the antibody target identifiable?
An experiment in reproducibility
6. Approximately half of antibodies are not uniquely identifiable in 119 publications
[Bar chart: percent identifiable for commercial antibodies (n=408) vs. non-commercial antibodies (n=46)]
The data shows…
7. Unique identification of commercial antibodies varies across discipline and impact factor
[Bar chart: percent identifiable by high, medium, and low impact factor across immunology (n=87), neuroscience (n=95), developmental biology (n=94), cell biology (n=124), and general biology (n=56)]
In some domains high impact journals have worse reporting, and in others it is the opposite
12. [Bar chart: percent identifiable for commercial antibody uniquely identifiable, non-commercial antibody identifiable, catalog number reported, source organism reported, and target uniquely identifiable]
Of 14 antibodies published in 45 articles, only 38% were identifiable
15.
- Promote better reporting guidelines in journals
- Include reviewing guidelines
- Provide tools to reference research resources with unique and persistent IDs/URIs
- Train librarians and other data stewards to apply data standards
What are we going to do about it?