Presented at Beyond the PDF2, Amsterdam, 2013 (http://www.force11.org/beyondthepdf2). This talk describes preliminary data showing a lack of scientific reproducibility based solely on an inability to identify the material resources used in the research. Final work to be published soon!
What is the reproducibility crisis in science and what can we do about it? - Dorothy Bishop
Talk given to the Rhodes Biomedical Association, 4th May 2016.
For references see: http://www.slideshare.net/deevybishop/references-on-reproducibility-crisis-in-science-by-dvm-bishop
Patient-led deep phenotyping using a lay-friendly version of the Human Phenotype Ontology - mhaendel
Presented at AMIA TBI CRI 2018.
Rare disease patients are experts in their own medical histories; they are not only among the most engaged patients, but can also themselves provision data for use in clinical evaluation. We therefore created a lay-person version of our clinical deep-phenotyping instrument, the Human Phenotype Ontology. Here, we evaluate the diagnostic utility of this lay HPO and debut a new software tool for patient-led deep phenotyping.
The Software and Data Licensing Solution: Not Your Dad’s UBMTA - mhaendel
Presented at the Association of University Technology Managers (AUTM) Annual Conference 2018
Moderator: Arvin Paranjpe, Oregon Health & Science University
Speakers: Frank Curci, Ater Wynne LLP
Melissa Haendel, Oregon Health & Science University
Charles Williams, University of Oregon
Big data is an open frontier, and it’s quickly expanding. However, transaction costs and legal barriers stand squarely in the way of meaningful, far-reaching data integration. We’ll grapple with the issues raised by a large-scale data integration project spanning humans, model organisms, and non-model organisms. Without pointing fingers, we’ll also share a few highlights from the (Re)usable Data Project, which outlined a five-part rubric to evaluate data licenses with respect to clarity and to the reuse and redistribution of data. The topic also raises the question: how well suited are off-the-shelf software and data licenses for universities? Data scientists and software programmers are all too quick to pick one when they release their technology on GitHub. What should technology transfer professionals recommend? We’ll discuss the usefulness and attributes of a uniform software and data license for university researchers and software programmers.
Equivalence is in the (ID) of the beholder - mhaendel
Presented at PIDapalooza 2018. https://pidapalooza.org/
Determining identifier equivalency is key to data integration and to realizing the scientific discoveries that can only be made by collating our vast disconnected data stores.
There are two key problems in determining equivalency: conceptual and syntactic alignment. Conceptual alignment often relies on Xrefs and string-matching against synonyms, but there is a better way: algorithmic determination of identifier equivalency across different sources can combine Xrefs, prior rules, existing semantic relations, and synonyms to create equivalency cliques that highlight discrepancies in conceptual definitions for manual review. This is especially useful for data sources subject to concept drift and definitional differences, such as disease terminologies. The syntactic problem is that the same identifier appears in many variant forms, making data joins difficult. We present a framework to reconcile and provide authoritative, integration-ready prefixed identifiers (CURIEs), to capture and consolidate prefixes, and to build links across key resource registries. The combination of JSON-LD context technology with a prefix metadata repository provides the basis for infrastructure that handles identifiers in a consistent fashion. Finally, this architecture also allows resources to be self-describing "beacons" with respect to their identifiers.
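The identifier-normalization idea above can be sketched as follows, assuming a toy two-entry prefix context (a real registry and JSON-LD context are far larger, and the prefix-synonym table here is purely illustrative):

```python
# A minimal sketch (not the project's actual implementation) of how a
# JSON-LD-style context -- a prefix-to-IRI map -- can normalize the many
# syntactic variants of one identifier into a single canonical CURIE.
# The prefix entries and synonym spellings below are illustrative.

CONTEXT = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
}

# Common variant spellings of a prefix, mapped to the canonical form.
PREFIX_SYNONYMS = {"hp": "HP", "hpo": "HP", "mondo": "MONDO"}

def to_curie(identifier: str) -> str:
    """Normalize an identifier variant (CURIE-like or full IRI) to a canonical CURIE."""
    # Full IRI: find the matching base in the context.
    for prefix, base in CONTEXT.items():
        if identifier.startswith(base):
            return f"{prefix}:{identifier[len(base):]}"
    # CURIE-like form: normalize the separator and the prefix spelling.
    head, _, local = identifier.replace("_", ":").partition(":")
    prefix = PREFIX_SYNONYMS.get(head.lower(), head.upper())
    return f"{prefix}:{local}"

def expand(curie: str) -> str:
    """Expand a canonical CURIE back to a full IRI using the context."""
    prefix, _, local = curie.partition(":")
    return CONTEXT[prefix] + local

print(to_curie("http://purl.obolibrary.org/obo/HP_0001250"))  # HP:0001250
```

With such a context, `hp:0001250`, `HP_0001250`, and the full IRI all join on the same canonical key, which is the property that makes cross-source data joins tractable.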
Building (and traveling) the data-brick road: A report from the front lines... - mhaendel
The NIH Data Commons must treat the data it will contain not unlike the mortar and stones of a road. To help our fellow traveling scientists use the road, we must engineer for heavy traffic and diverse destinations. There are many steps to architecting a robust and persistent road. First, the data must be sourced and manipulated into common data models. This requires versioned access to the data, determining the equivalency of identifiers within the data or minting new ones, and transforming the data according to common data models (e.g., one source may model a genotype-to-phenotype association as a variant related to a disease, while another relates a set of alleles to a set of phenotypes; each source models the data differently). Inclusion of the data in the Commons must meet all licensing restrictions, which are varied and usually poorly declared, as well as security, HIPAA, and ethics requirements. Software tools are needed to perform the Extract-Transform-Load (ETL) process on a regular cycle to keep the data current, and to assess changes and quality over time. For records that disappear, there needs to be a way to keep an archive. Once in the Commons, the data requires a map to navigate the roads: where do you want to go? Indexing and search across the data require the data to be self-reporting: loading the ontologies used in the data for indexing and providing faceted query over these and other attributes, sophisticated text mining tools, relevance ranking, and equivalency and similarity determination across providers. Once the data is found, users need vehicles to drive upon the road. These are their workspaces, the places where they design and implement the operations they need to get where they want to go.
Unimaginable scientific emeralds are to be found at the end of the road, as the sum of all the data, if well integrated and made computationally reusable, has proven to be far greater than its parts in getting us where we want to go.
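The "transform into a common data model" step described above might look like this minimal sketch, with invented field names and records standing in for two differently-modeled sources:

```python
# A hypothetical sketch of the transform step: two sources model
# genotype-to-phenotype data differently, and both are normalized into
# one common association record (subject, predicate, object, source).
# All field names and example identifiers are invented for illustration.

def normalize_source_a(record):
    """Source A relates a single variant to a disease."""
    return [{
        "subject": record["variant_id"],
        "predicate": "associated_with",
        "object": record["disease_id"],
        "source": "A",
    }]

def normalize_source_b(record):
    """Source B relates a set of alleles to a set of phenotypes:
    flatten into one association per (allele, phenotype) pair."""
    return [
        {"subject": a, "predicate": "associated_with", "object": p, "source": "B"}
        for a in record["alleles"]
        for p in record["phenotypes"]
    ]

associations = (
    normalize_source_a({"variant_id": "VAR:1", "disease_id": "MONDO:0000001"})
    + normalize_source_b({"alleles": ["VAR:2", "VAR:3"],
                          "phenotypes": ["HP:0001250"]})
)
# Three uniform records, ready to index and load.
print(len(associations))  # 3
```

The point of the sketch is only that once every source is flattened into the same association shape, indexing, faceting, and equivalency checks can operate uniformly downstream.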
Reusable data for biomedicine: A data licensing odyssey - mhaendel
Biomedical data integrators grapple with a fundamental blocker in research today: licensing for data use and redistribution. Complex licensing and data reuse restrictions hinder most publicly funded, seemingly “open” biomedical data from being put to its full potential. Such issues include missing licenses, non-standard licenses, and restrictive provisions. The sheer diversity of licenses is particularly thorny for those who aim to redistribute data. Redistributors are often required to contact each sub-source to obtain permissions, and this is complicated by the fact that on each side of the agreement there may be multiple legal entities involved, and some sub-sources may themselves already be aggregating data from other sub-sources. Furthermore, interpreting legal compliance with source data licensing and use agreements is complicated, as data is often manipulated, shared, and redistributed by many types of research groups and users in various and subtle ways. Here, we debut a new effort, the (Re)usable Data Project, in which we have created a five-part rubric to evaluate biomedical data sources and their licensing information to determine the degree to which unnegotiated and unrestricted reuse and redistribution are provided. We have tested the (Re)usable Data rubric against various biomedical data sources, ranking each source on a scale of zero to five stars, and have found that approximately half of the resources rank poorly, getting 2.5 stars or fewer. Our goal is to help biomedical informaticians and other users navigate the plethora of issues in reusing and redistributing biomedical data. The (Re)usable Data Project aims to promote standardization and ease of reuse in licensing practices by data providers.
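As a rough illustration of how a five-part, zero-to-five-star rubric can be scored (the criterion names and the half-star weighting below are assumptions for the sketch, not the actual (Re)usable Data rubric):

```python
# A hedged sketch of scoring a five-part licensing rubric. Each criterion
# contributes up to one star (0.0, 0.5, or 1.0), for a total of zero to
# five stars. Criterion names and the example assessment are invented.

CRITERIA = ["license_findable", "license_standard", "permits_reuse",
            "permits_redistribution", "no_negotiation_needed"]

def star_score(assessment: dict) -> float:
    """Sum the per-criterion scores into a star total; missing criteria score 0."""
    return sum(assessment.get(c, 0.0) for c in CRITERIA)

# A hypothetical source with a findable, standard license that still
# restricts redistribution and requires negotiated permissions.
source = {"license_findable": 1.0, "license_standard": 1.0,
          "permits_reuse": 0.5, "permits_redistribution": 0.0,
          "no_negotiation_needed": 0.0}
print(star_score(source))  # 2.5, i.e., in the poorly-ranking half
```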
Data Translator: an Open Science Data Platform for Mechanistic Disease Discovery - mhaendel
Architecture of language and data translation that underlies the NCATS Biomedical Data Translator. Presented at the Fanconi Anemia Annual Meeting. http://fanconi.org/index.php/research/annual_symposium
How open is open? An evaluation rubric for public knowledgebases - mhaendel
Presented at the 2017 International Biocuration Conference.
Data relevant to any given scientific investigation is highly decentralized across thousands of specialized databases. Within the Biocuration community, we recognize that the value of open scientific knowledge bases is that they make scientific knowledge easier to find and compute, thereby maximizing impact and minimizing waste. The ever-increasing number of databases forces us to question our priorities with respect to maintaining them, developing new ones, or senescing/subsuming those that have completed their mission. Therefore, open biomedical data repositories should be carefully evaluated according to the quality, accessibility, and value of the database resources over time and across the translational divide.
Traditional citation counts and publication impact factors are known to be inadequate measures of the success or value of a resource. This is especially true for integrative resources. For example, almost everyone in biomedicine relies on PubMed, but almost no one ever cites or mentions it in their publications. While the Nucleic Acids Research Database issues have increased citation of some databases, many still go unpublished or uncited; even novel derivations of methodology, applications, and workflows from biomedical knowledge bases are often “adapted” but never cited. There is a lack of citation best practices for widely used biomedical database resources (e.g., should a paper be cited? A URL? Is mention of the name and access date sufficient?).
We have developed a draft rubric for evaluating open science databases according to the commonly cited FAIR principles (Findable, Accessible, Interoperable, and Reusable), with three additional principles: Traceable, Licensed, and Connected. These additions are largely overlooked and underappreciated, yet are critical to reuse of the knowledge contained within any given database. It is worth noting that the FAIR principles apply not only to a resource as a whole, but also to its key components; this “fractal FAIRness” means that even the licenses, identifiers, vocabularies, and APIs themselves must be Findable, Accessible, Interoperable, Reusable, etc. Here we report on initial testing of our evaluation rubric on the recent NIH/Wellcome Trust Open Science projects and seek community input on how to further advance this rubric as a Biocuration community resource.
Deep phenotyping to aid identification of coding & non-coding rare disease variants - mhaendel
Whole-exome sequencing has revolutionized disease research, but many cases remain unsolved because ~100-1000 candidates remain after removing common or non-pathogenic variants. We present Genomiser to prioritize coding and non-coding variants by leveraging phenotype data encoded with the Human Phenotype Ontology and a curated database of non-coding Mendelian variants. Genomiser is able to identify causal regulatory variants as the top candidate in 77% of simulated whole genomes.
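A toy sketch of the kind of phenotype-driven prioritization described above (this is not Genomiser's actual algorithm; the candidate records, scores, and the naive averaging are invented purely for illustration):

```python
# A toy illustration of phenotype-driven variant prioritization: each
# remaining candidate gets a combined score from an invented variant-level
# score and a phenotype-match score, and candidates are ranked so the
# best-supported variant comes first. All values here are made up.

candidates = [
    {"variant": "chr1:g.100A>T", "variant_score": 0.80, "phenotype_score": 0.20},
    {"variant": "chr2:g.200C>G", "variant_score": 0.70, "phenotype_score": 0.95},
    {"variant": "chr3:g.300G>A", "variant_score": 0.95, "phenotype_score": 0.10},
]

def combined(c):
    # Naive mean of the two evidence channels; real tools combine the
    # channels with calibrated statistical models, not a plain average.
    return (c["variant_score"] + c["phenotype_score"]) / 2

ranked = sorted(candidates, key=combined, reverse=True)
print(ranked[0]["variant"])  # the chr2 variant: strong phenotype support
```

Note how the chr3 variant, despite the highest variant-level score, is outranked once phenotype evidence is folded in; this is the intuition behind letting HPO-encoded phenotype data break ties among hundreds of surviving candidates.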
Global Phenotypic Data Sharing Standards to Maximize Diagnostics and Mechanis... - mhaendel
Presented at the IRDiRC 2017 conference in Paris, Feb 9th, 2017 (http://irdirc-conference.org/). This talk reviews use of the Human Phenotype Ontology for phenotype comparisons against other patients, known diseases, and animal models for diagnostic discovery. It also discusses the new Phenopackets Exchange mechanism for open phenotypic data sharing.
www.monarchinitiative.org
www.phenopackets.org
www.human-phenotype-ontology.org
Credit where credit is due: acknowledging all types of contributions - mhaendel
This is an update for COASP (http://oaspa.org/conference/) on the representation of attribution beyond authorship of a publication. Publications are proxies for the projects and people that are actually engaged in the work, and represent only the dissemination aspect. How can we better understand the individual contributions and their impact? The openRIF, openVIVO, and FORCE11 Attribution WG efforts aim to represent scholarship in a computationally tractable manner so as to enable credit and evaluation of all types of scholarly contributions.
The Human Phenotype Ontology (HPO) was developed to describe phenotypic abnormalities, i.e., “deep phenotyping,” whereby symptoms and characteristic phenotypic findings (a phenotypic profile) are captured. The HPO has been utilized to great success for computational phenotype comparison against known diseases, other patients, and model organisms to support diagnosis of rare disease patients. Clinicians and geneticists create phenotypic profiles based on clinical evaluation, but this is time consuming and can miss important phenotypic features. Patients are sometimes the best source of information about symptoms that might otherwise be missed in a clinical encounter. However, the HPO primarily uses medical terminology, which can be difficult for patients and their families to understand. To make the HPO accessible to patients, we systematically added non-expert (i.e., layperson) synonyms. Using semantic similarity, patient-recorded phenotypic profiles can be evaluated against those created clinically for undiagnosed patients to determine the improvement gained from patient-driven phenotyping, as well as how much the patient phenotyping narrows the diagnosis. This patient-centric HPO can be utilized by all: in patient-centered rare disease websites, in patient community platforms and registries, or even to post one’s hard-to-diagnose phenotypic profile on the Web.
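The semantic-similarity comparison described above can be sketched over a toy ontology fragment (the term IDs, parent links, and the simple ancestor-closure Jaccard measure below are illustrative assumptions; production tools use information-content-based measures over the real HPO):

```python
# A minimal sketch of comparing two phenotypic profiles: each term is
# closed over its ancestors, and the two closed term sets are compared
# with a Jaccard similarity. The tiny "ontology" below is invented.

# Hypothetical parent links for a three-term HPO-like fragment.
PARENTS = {
    "HP:seizure": ["HP:neuro_abnormality"],
    "HP:absence_seizure": ["HP:seizure"],
    "HP:neuro_abnormality": [],
}

def ancestors(term):
    """The term itself plus all of its ancestors, via the parent links."""
    result = {term}
    for p in PARENTS.get(term, []):
        result |= ancestors(p)
    return result

def profile_similarity(profile_a, profile_b):
    """Jaccard similarity of the ancestor-closed term sets of two profiles."""
    a = set().union(*(ancestors(t) for t in profile_a))
    b = set().union(*(ancestors(t) for t in profile_b))
    return len(a & b) / len(a | b)

# A lay-recorded, more general "seizure" still scores a substantial
# overlap with a clinician's more specific "absence seizure".
print(profile_similarity(["HP:seizure"], ["HP:absence_seizure"]))
```

Closing over ancestors is what lets a coarse layperson term and a precise clinical term register as similar rather than as disjoint strings, which is the key to evaluating patient-recorded profiles against clinical ones.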
Why the world needs phenopacketeers, and how to be one - mhaendel
Keynote presented at the Ninth International Biocuration Conference, Geneva, Switzerland, April 10-14, 2016
The health of an individual organism results from complex interplay between its genes and environment. Although great strides have been made in standardizing the representation of genetic information for exchange, there are no comparable standards to represent phenotypes (e.g. patient disease features, variation across biodiversity) or environmental factors that may influence such phenotypic outcomes. Phenotypic features of individual organisms are currently described in diverse places and in diverse formats: publications, databases, health records, registries, clinical trials, museum collections, and even social media. In these contexts, biocuration has been pivotal to obtaining a computable representation, but is still deeply challenged by the lack of standardization, accessibility, persistence, and computability among these contexts. How can we help all phenotype data creators contribute to this biocuration effort when the data is so distributed across so many communities, sources, and scales? How can we track contributions and provide proper attribution? How can we leverage phenotypic data from the model organism or biodiversity communities to help diagnose disease or determine evolutionary relatedness? Biocurators unite in a new community effort to address these challenges.
On the frontier of genotype-2-phenotype data integration - mhaendel
Presented at the AMIA TBI 2016 BD2K Panel. A description of the Monarch Initiative's efforts to perform deep phenotyping data integration across species, facilitate data exchange, and build computable G2P evidence models to aid variant interpretation.
Envisioning a world where everyone helps solve disease - mhaendel
Keynote presented at the Semantic Web for Life Sciences conference in Cambridge, UK, December 9th, 2015
http://www.swat4ls.org/
The talk focuses on the use of ontologies for data integration to support rare disease diagnostics, and on how many people, unbeknownst to the patient or even to the researchers creating the data, are involved in a diagnosis.
The Monarch Initiative: From Model Organism to Precision Medicine - mhaendel
NIH BD2K all-hands meeting poster November 12, 2015.
Attempts at correlating phenotypic aspects of disease with causal genetic influences are often confounded by the challenges of interpreting diverse data distributed across numerous resources. New approaches to data modeling, integration, tooling, and community practices are needed to make efficient use of these data. The Monarch Initiative is an international consortium working on the development of shared data, tools, and standards to enable direct translation of integrated genotype, phenotype, and environmental data from human and model organisms to enhance our understanding of human disease. We utilize sophisticated semantic mapping techniques across a diverse set of standardized ontologies to deeply integrate data across species, sources, and modalities. Using phenotype similarity matching algorithms across these data enables disorder prediction, variant prioritization, and patient matching against known diseases and model organisms. These similarity algorithms form the core of several innovative tools. The Exomiser enables exome variant prioritization by combining pathogenicity, frequency, inheritance, protein interaction, and cross-species phenotype data. Our Phenotype Sufficiency tool provides clinicians the ability to compare patient phenotypic profiles using the Human Phenotype Ontology to determine uniqueness and specificity in support of variant prioritization. The PhenoGrid visualization widget illustrates phenotype similarity between patients, known diseases, and model organisms. Monarch develops models in collaboration with the community in support of the burgeoning genotype-phenotype disease research community. We have successfully used the Exomiser to solve a number of undiagnosed patient cases in collaboration with the NIH Undiagnosed Disease Program.
Ongoing development in coordination with the Global Alliance for Genetic Health (GA4GH) and other groups will catalyze the realization of our goal of a vital translational community focused on the collaborative application of integrated genotype, phenotype, and environmental data to human disease.
Force11: Enabling transparency and efficiency in the research landscape - mhaendel
Presented at the Feb 2015, NISO Virtual Conference
Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
http://www.niso.org/news/events/2015/virtual_conferences/sci_data_management/
Reusable data for biomedicine: A data licensing odysseymhaendel
Biomedical data integrators grapple with a fundamental blocker in research today: licensing for data use and redistribution. Complex licensing and data reuse restrictions hinder most publicly-funded, seemingly “open” biomedical data from being put to its full potential. Such issues include missing licenses, non-standard licenses, and restrictive provisions. The sheer diversity of licenses are particularly thorny for those that aim to redistribute data. Redistributors are often required to contact each sub-source to obtain permissions, and this is complicated by the fact that on each side of the agreement there may be multiple legal entities involved and some sub-sources may themselves already be aggregating data from other sub-sources. Furthermore, interpreting legal compliance with source data licensing and use agreements is complicated, as data is often manipulated, shared, and redistributed by many types of research groups and users in various and subtle ways. Here, we debut a new effort, the (Re)usable Data Project, where we have created a five-part rubric to evaluate biomedical data sources and their licensing information to determine the degree to which unnegotiated and unrestricted reuse and redistribution are provided. We have tested the (Re)usable Data rubric against various biomedical data sources, ranking each source on a scale of zero to five stars, and have found that approximately half of the resources rank poorly, getting 2.5 stars or less. Our goal is to help biomedical informaticians and other users navigate the plethora of issues in reusing and redistributing biomedical data. The (Re)usable Data project aims to promote standardization and ease of reuse licensing practices by data providers.
Data Translator: an Open Science Data Platform for Mechanistic Disease Discoverymhaendel
Architecture of language and data translation that underlays the NCATS Biomedical Data Translator. Presented at the Fanconi Anemia Annual Meeting. http://fanconi.org/index.php/research/annual_symposium
How open is open? An evaluation rubric for public knowledgebasesmhaendel
Presented at the 2017 International Biocuration Conference.
Data relevant to any given scientific investigation is highly decentralized across thousands of specialized databases. Within the Biocuration community, we recognize that the value of open scientific knowledge bases is that they make scientific knowledge easier to find and compute, thereby maximizing impact and minimizing waste. The ever-increasing number of databases makes us necessarily question what are our priorities with respect to maintaining them, developing new ones, or senescing/subsuming ones that have completed in their mission. Therefore, open biomedical data repositories should be carefully evaluated according to quality, accessibility, and value of the database resources over time and across the translational divide.
Traditional citation count and publication impact factors as a measure of success or value are known to be inadequate to assess the usefulness of a resource. This is especially true for integrative resources. For example, almost everyone in biomedicine relies on PubMed, but almost no one ever cites or mentions it in their publications. While the Nucleic Acids Research Database issues have increased citation of some databases, many still go unpublished or uncited; even novel derivations of methodology, applications, and workflows from biomedical knowledge bases are often “adapted” but never cited. There is a lack of citation best practices for widely used biomedical database resources (e.g. should a paper be cited? A URL? Is mention of the name and access date sufficient?).
We have developed a draft evaluation rubric for evaluating open science databases according to the commonly cited FAIR principles -- Findable, Accessible, Interoperable, and Reusable, but with three additional principles: Traceable, Licensed, and Connected. These additions are largely overlooked and underappreciated, yet are critical to reuse of the knowledge contained within any given database. It is worth noting that FAIR principles apply not only to the resource as a whole, but also to their key components; this “fractal FAIRness” means that even the license, identifiers, vocabularies, APIs themselves must be Findable, Accessible, Interoperable, Reusable, etc. Here we report on initial testing of our evaluation rubric on the recent NIH/Wellcome Trust Open Science projects and seek community input for how to further advance this rubric as a Biocuration community resource.
Deep phenotyping to aid identification of coding & non-coding rare disease v...mhaendel
Whole-exome sequencing has revolutionized disease research, but many cases remain unsolved because ~100-1000 candidates remain after removing common or non-pathogenic variants. We present Genomiser to prioritize coding and non-coding variants by leveraging phenotype data encoded with the Human Phenotype Ontology and a curated database of non-coding Mendelian variants. Genomiser is able to identify causal regulatory variants as the top candidate in 77% of simulated whole genomes.
Global Phenotypic Data Sharing Standards to Maximize Diagnostics and Mechanis...mhaendel
Presented at the IRDiRC 2017 conference in Paris, Feb 9th, 2017 (http://irdirc-conference.org/). This talk reviews use of the Human Phenotype Ontology for phenotype comparisons against other patients, known diseases, and animal models for diagnostic discovery. It also discusses the new Phenopackets Exchange mechanism for open phenotypic data sharing.
www.monarchinitiative.org
www.phenopackets.org
www.human-phenotype-ontology.org
Credit where credit is due: acknowledging all types of contributionsmhaendel
This is an update for COASP (http://oaspa.org/conference/) on the representation of attribution beyond authorship of a publication. Publications are proxies for the projects and people that area actually engaged in the work, and represent the dissemination aspect. How can we better understand the individual contributions and their impact? The openRIF, openVIVO and FORCE11 Attribution WG efforts aim to represent scholarship in a computationally tractable manner so as to enable credit and evaluation of all types of scholarly contributions.
The Human Phenotype Ontology (HPO) was developed to describe phenotypic abnormalities, aka, “deep phenotyping”, whereby symptoms and characteristic phenotypic findings (a phenotypic profile) are captured. The HPO has been utilized to great success for assisting computational phenotype comparison against known diseases, other patients, and model organisms to support diagnosis of rare disease patients. Clinicians and geneticists create phenotypic profiles based on clinical evaluation, but this is time consuming and can miss important phenotypic features. Patients are sometimes the best source of information about their symptoms that might otherwise be missed in a clinical encounter. However, HPO primarily use medical terminology, which can be difficult for patients and their families to understand. To make the HPO accessible to patients, we systematically added non-expert terminology (i.e., layperson terms) synonyms. Using semantic similarity, patient-recorded phenotypic profiles can be evaluated against those created clinically for undiagnosed patients to determine the improvement gained from the patient-driven phenotyping, as well as how much the patient phenotyping narrows the diagnosis. This patient-centric HPO can be utilized by all: in patient-centered rare disease websites, in patient community platforms and registries, or even to post one’s hard-to-diagnosed phenotypic profile on the Web.
Why the world needs phenopacketeers, and how to be onemhaendel
Keynote presented at the Ninth International Biocuration Conference, Geneva, Switzerland, April 10-14, 2016
The health of an individual organism results from complex interplay between its genes and environment. Although great strides have been made in standardizing the representation of genetic information for exchange, there are no comparable standards to represent phenotypes (e.g. patient disease features, variation across biodiversity) or environmental factors that may influence such phenotypic outcomes. Phenotypic features of individual organisms are currently described in diverse places and in diverse formats: publications, databases, health records, registries, clinical trials, museum collections, and even social media. In these contexts, biocuration has been pivotal to obtaining a computable representation, but is still deeply challenged by the lack of standardization, accessibility, persistence, and computability among these contexts. How can we help all phenotype data creators contribute to this biocuration effort when the data is so distributed across so many communities, sources, and scales? How can we track contributions and provide proper attribution? How can we leverage phenotypic data from the model organism or biodiversity communities to help diagnose disease or determine evolutionary relatedness? Biocurators unite in a new community effort to address these challenges.
On the frontier of genotype-2-phenotype data integrationmhaendel
Presented at AMIA TBI 2016 BD2K Panel. A description of the Monarch Initiative's efforts to perform deep phenotyping data integration across species, facilitate exchange, and build computable G2P evidence models to aid variant interpretation.
Envisioning a world where everyone helps solve diseasemhaendel
Keynote presented at the Semantic Web for Life Sciences conference in Cambridge, UK, December 9th, 2015
http://www.swat4ls.org/
The talk focuses on the use of ontologies for data integration to support rare disease diagnostics, and on how very many people, unbeknownst to the patient or even to the researchers creating the data, are involved in a diagnosis.
The Monarch Initiative: From Model Organism to Precision Medicinemhaendel
NIH BD2K all-hands meeting poster November 12, 2015.
Attempts at correlating phenotypic aspects of disease with causal genetic influences are often confounded by the challenges of interpreting diverse data distributed across numerous resources. New approaches to data modeling, integration, tooling, and community practices are needed to make efficient use of these data. The Monarch Initiative is an international consortium working on the development of shared data, tools, and standards to enable direct translation of integrated genotype, phenotype, and environmental data from human and model organisms to enhance our understanding of human disease. We utilize sophisticated semantic mapping techniques across a diverse set of standardized ontologies to deeply integrate data across species, sources, and modalities. Using phenotype similarity matching algorithms across these data enables disorder prediction, variant prioritization, and patient matching against known diseases and model organisms. These similarity algorithms form the core of several innovative tools. The Exomiser enables exome variant prioritization by combining pathogenicity, frequency, inheritance, protein interaction, and cross-species phenotype data. Our Phenotype Sufficiency tool provides clinicians the ability to compare patient phenotypic profiles using the Human Phenotype Ontology to determine uniqueness and specificity in support of variant prioritization. The PhenoGrid visualization widget illustrates phenotype similarity between patients, known diseases, and model organisms. Monarch develops models in collaboration with the community in support of the burgeoning genotype-phenotype disease research community. We have successfully used Exomiser to solve a number of undiagnosed patient cases in collaboration with the NIH Undiagnosed Disease Program.
Ongoing development in coordination with the Global Alliance for Genomics and Health (GA4GH) and other groups will catalyze the realization of our goal: a vital translational community focused on the collaborative application of integrated genotype, phenotype, and environmental data to human disease.
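The phenotype similarity matching described above can be sketched as ranking candidate diseases by how well their annotated phenotype profiles cover a patient's profile. This is a hedged illustration only: the disease names and HPO IDs are hypothetical, and real tools such as Exomiser combine ontology-aware phenotype scores with variant-level evidence rather than the simple overlap used here.

```python
# Minimal sketch of phenotype-driven disease ranking. Simple term overlap
# stands in for the semantic similarity algorithms described in the text.
# Disease profiles and HPO term IDs are illustrative placeholders.

def overlap_score(patient: set[str], disease: set[str]) -> float:
    """Fraction of the patient's phenotype terms found in the disease profile."""
    if not patient:
        return 0.0
    return len(patient & disease) / len(patient)

def rank_diseases(
    patient: set[str], diseases: dict[str, set[str]]
) -> list[tuple[str, float]]:
    """Return (disease, score) pairs, best match first."""
    scores = {name: overlap_score(patient, terms) for name, terms in diseases.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical disease-to-phenotype annotations
diseases = {
    "Disease A": {"HP:0001250", "HP:0001263", "HP:0000252"},
    "Disease B": {"HP:0001250", "HP:0002360"},
}
patient = {"HP:0001250", "HP:0000252"}

for name, score in rank_diseases(patient, diseases):
    print(name, round(score, 2))  # Disease A 1.0, then Disease B 0.5
```

The same ranking idea extends to matching a patient against model-organism phenotype annotations, which is what allows cross-species data to inform diagnosis.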
Force11: Enabling transparency and efficiency in the research landscapemhaendel
Presented at the Feb 2015, NISO Virtual Conference
Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
http://www.niso.org/news/events/2015/virtual_conferences/sci_data_management/
Semantic phenotyping for disease diagnosis and discovery
On the reproducibility of science
1. On the reproducibility of science
Melissa Haendel
Beyond the PDF2, 20 March 2013
@ontowonka | haendel@ohsu.edu
2. The science cycle
Slide from Gully Burns
Do we know if the infrastructure is actually broken?
3. The science cycle
Image: http://www.joinchangenation.org/blog/post/roadblocks-on-the-pathway-to-citizenship
This is a broken data story.
4. Reproducibility is dependent, at a minimum, on using the same resources. But…
"All companies from which materials were obtained should be listed." - A well-known journal
Journal guidelines for methods are often poor and space is limited
5. Hypothesis: Antibodies in the published literature are not uniquely identifiable
Gather journal articles: 28 journals; 5 domains (immunology, cell biology, neuroscience, developmental biology, general biology); 3 impact factors (high, medium, low); 119 papers; 454 antibodies (408 commercial, 46 non-commercial)
Identifying questions: Is the antibody identifiable on the vendor site? Is the catalog number reported? Is the source organism reported? Is the antibody target identifiable?
An experiment in reproducibility
6. Approximately half of antibodies are not uniquely identifiable in 119 publications
[Bar chart: percent identifiable for commercial antibodies (n=408) vs. non-commercial antibodies (n=46)]
The data shows…
7. Unique identification of commercial antibodies varies across discipline and impact factor
[Bar chart: percent identifiable by high, medium, and low impact factor across immunology (n=87), neuroscience (n=95), developmental biology (n=94), cell biology (n=124), and general biology (n=56)]
In some domains high impact journals have worse reporting, and in others it is the opposite
12. [Bar chart: percent identifiable for commercial antibody uniquely identifiable, non-commercial antibody identifiable, catalog number reported, source organism reported, and target uniquely identifiable]
Of 14 antibodies published in 45 articles, only 38% were identifiable
15.
- Promote better reporting guidelines in journals
- Include reviewing guidelines
- Provide tools to reference research resources with unique and persistent IDs/URIs
- Train librarians and other data stewards to apply data standards
What are we going to do about it?