Preparing your taxonomy to be ready for data scientists & machine readability:
A case study and work in progress
Mary Chitty,
Library Director &
Taxonomist, MSLS
Cambridge Healthtech,
Needham MA
mchitty@healthtech.com
SLA Annual Conference, Cleveland, Ohio, Tuesday, June 18, 2019,
Taxonomy-Ontology Conversions: Case Studies
1992
2000
2006-14
2016 2018-19
Historical Taxonomy Process
Taxonomies & Ontologies glossary & taxonomy: http://www.genomicglossaries.com/content/ontologies.asp
Company founded.
Taxonomy created by CEO
with a few hundred terms.
Major products:
conferences on emerging
technologies. Focus on
preclinical drug discovery.
Acquired companies dealing
with bioinformatics, clinical
trials, energy and batteries.
Still integrating their
databases.
Met people from
OntoForce, Belgian
semantic search
engine company.
Began informal
collaboration.
Acquired companies in artificial
intelligence and Internet of
Things. Still determining how to
integrate databases. Several
data scientists hired. Signed
formal contract with OntoForce
to use Disqover search engine.
https://www.ontoforce.com/
Taxonomy stands at
1,600+ terms now.
Conferences and other
products in preclinical and
clinical biotech and
pharma, clinical trials,
energy, AI and Internet of
Things and more.
Published Genomic
Glossaries & Taxonomies
www.genomicglossaries.com
2019
Ongoing challenges
Legacy data with inconsistencies, redundancies and ambiguities.
Integrating company acquisitions’ data into in-house database.
Still cleaning up, disambiguating and documenting in-house data and database.
Scaling up difficulties often underestimated. A major pain point for us right now.
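One way to start surfacing redundancies in a legacy term list is to group terms that differ only in case, punctuation or spacing. The following is a minimal sketch, not the process actually used in this project; the function names and the sample terms are hypothetical, and true synonyms or ambiguous terms would still need human review.

```python
import re
from collections import defaultdict


def normalize(term: str) -> str:
    """Lowercase a term, turn punctuation into spaces, and trim whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


def find_redundant_terms(terms):
    """Group terms that share a normalized form; keep groups with 2+ variants."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalize(term)].append(term)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# Hypothetical legacy terms, for illustration only.
legacy_terms = ["Clinical Trials", "clinical-trials", "Clinical  trials ",
                "Bioinformatics", "Batteries"]
redundant = find_redundant_terms(legacy_terms)
```

Normalization of this kind only catches surface variants. It is a cheap first pass before the harder disambiguation and documentation work.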
FAIR Data
Both the European Commission and NIH have allocated considerable resources to making data FAIRer.
https://www.go-fair.org/fair-principles/
Findable
• First step in
(re)using data is
to find them.
Metadata and
data should be
easy to find for
both humans
and computers.
… an essential
component of
the FAIRification
process.
Accessible
• Once the user
finds the
required data,
she/he needs to
know how they
can be
accessed.
Interoperable
• Data usually
need to be
integrated with
other data …
need to
interoperate with
applications or
workflows.
Reusable
• Ultimate goal of
FAIR is to
optimise the
reuse of data…
metadata and
data should be
well-described
so that they can
be replicated
and/or combined
in different
settings.
Taxonomies and ontologies are critical for interoperability
and reproducibility, particularly in the life sciences.
Life sciences data relatively
sparse, with many attributes
(“highly dimensional”), leading
to complexity and sometimes
chaos. Data on longitudinal
health outcomes limited by
HIPAA & other privacy
regulations, but crucial for
validation.
Increasing attention
being paid to data
stewardship and data
curation. Support still
a tough sell.
Reproducibility crisis?
More than 70% of
researchers have tried and
failed to reproduce
experiments.
More than half have failed
to reproduce their own
experiments.
Nature 2016 survey of researchers.
https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
Life science ontologies and taxonomies
So many to choose from!
BioPortal https://bioportal.bioontology.org/
repository of biomedical ontologies has almost
800 ontologies, and mapping from ontologies
to I2B2 http://i2b2.bioontology.org/
Interdisciplinary work holds great
promise – and needs mapping of
terms between disciplines.
Pistoia Alliance Ontologies Mapping
https://www.pistoiaalliance.org/projects/current-projects/ontologies-mapping/
Data mapping also known as “data
wrangling” or “data munging”. Many
people trying to automate. Still
works in progress.
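As a rough illustration of what automated term mapping can look like, the sketch below proposes a closest external label for each in-house term using approximate string matching from Python's standard library. It is only a sketch under stated assumptions: the term lists, the `map_terms` helper and the 0.8 cutoff are hypothetical, and string similarity is a stand-in for the semantic mapping that real ontology alignment requires.

```python
from difflib import get_close_matches


def map_terms(in_house, external_labels, cutoff=0.8):
    """Suggest the closest external label for each in-house term, or None."""
    # Compare on lowercase forms but return the original external label.
    lookup = {label.lower(): label for label in external_labels}
    mapping = {}
    for term in in_house:
        matches = get_close_matches(term.lower(), list(lookup), n=1, cutoff=cutoff)
        mapping[term] = lookup[matches[0]] if matches else None
    return mapping


# Hypothetical terms, for illustration only.
in_house_terms = ["Biomarker", "Clinical trials", "Batteries"]
external_labels = ["biomarkers", "clinical trial", "genomics"]
suggested = map_terms(in_house_terms, external_labels)
```

Terms that come back as `None` are the ones a curator would need to map by hand, and every suggested match should still be reviewed before it is accepted.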
ROI Return On Investment & Cost Benefit
Cost of not having FAIR research data, PwC EU Services, 2018, European Union Publications.
https://publications.europa.eu/en/publication-detail/-/publication/d375368c-1a0a-11e9-8d04-01aa75ed71a1
Stakeholders may
balk at investing in
taxonomies or
ontologies. Software,
other IT & technology
considerations only
part of the issues.
Educating decision
makers is an
ongoing process,
even with CXOs who
value taxonomies
and ontologies.
Estimated cost of
not having FAIR
research data:
Minimum of 10.2
billion Euros per
year.
Key insights
“…[T]here is a lot of work that needs doing
to prepare the data sets for these
technologies … there is a disproportionate
amount being invested in the technologies
as opposed to investing in ‘data-readiness’…
It's just not a slam dunk to
mash up a lot of data and think it will work.”
Life Science Leader 2019 March 1, “AI In Life Sciences: Seeing past the Hype” Francois Nicolas and comment by
Christy Wilson https://www.lifescienceleader.com/doc/ai-in-life-sciences-seeing-past-the-hype-0001
“The AI solution may help accelerate some tasks, but
human expertise may be required for the broad
scope of what is needed. Currently AI in healthcare is
in the second stage of the Gartner Hype Cycle: ‘the
peak of inflated expectation.’ However, if we don’t
allow it to catch up to the hype, it may fall back into
what Gartner calls the ‘trough of disillusionment.’”
Key takeaways
Don’t try to “boil the
ocean”. Prototype early and
often. Think modular
• Pareto Principle 80/20
80% of effects come from
20% of effort.
Don’t try for 100%.
• Identify what your
stakeholders value.
Aim for quick wins.
Understand existing
workflows.
• Seek out allies and shared
buy-in for justification and
sustainability.
• Bundle stakeholders’ key
wants and items you know
they will eventually need.
Communicating ROI on
taxonomies, ontologies and
metadata is still challenging.
• Expectations and change
management are crucial
skills to cultivate.
• Report metrics, both
quantitative and qualitative.
• Recognize that some challenges
are not yet resolved by anyone.
Acknowledgments
Many people have participated in this ongoing project. I’m grateful for their work, insights and
encouragement.
Cambridge Innovation
Institute CII
& Cambridge Healthtech
• Phillips Kuhl, President
• Tonya Urquizo,
Knowledge Information
Services Analyst and IT
Liaison
• Sanaye Bartlett, Data
Analyst & Project Manager
• Kaushik Chaudhuri,
Director of Product
Marketing
CII Disqover Team
• Kaitlyn Barago,
Associate Conference
Producer
• Nancy Clarke, Data Scientist
• Mike Croft,
Software Architect
• Ben Lakin,
Director New Initiatives
• Jaime Parlee, Director
Marketing Analytics
• Craig Wohlers, Manager
Knowledge Foundation
OntoForce
• Hans Constandt, CEO &
Founder
• Filip Pattyn, Scientific Lead
• Carla Suijkerbuijk, Business
Development North America
• Niels Vanneste,
Customer Data Scientist
• Berenice Wulbrecht, Data
Science Director, Systems
Biology
Fruitful Conversations and
emails
• Ingrid Akerblom, IEA
Diversified Consulting
• Juliane Schneider, Lead
Data Curator, eagle-I,
Harvard Catalyst
• Jane Lomax,
Head Ontologist, SciBite
• Terence Russell,
Chief Technologist, IRODS
Consortium
• John Wilbanks,
Chief Commons Officer,
Sage Bionetworks
Editor's Notes
Key motivations for taxonomy changes were company acquisitions in new disciplines, and new data science hires.
No easy answers. Issues remain around integrating internal and external ontologies. Starting to look into issues around ambiguity. Progress often seems to be three steps forward, one or two steps back.
A colleague commented, “As science becomes ever more interdisciplinary, it is a huge challenge to map data on different granular levels but semantically link them across different languages, standards, and cultures.”
An ontology colleague notes, “Institutions either underestimate the resources needed to do this work, or they are daunted by the entire prospect and researchers have to find repositories/help outside the institution to store and curate their data, if they bother to do so. Honestly, very little data will ever be reused.”
Some resources for locating life science ontologies and mappings. BioPortal has 773 ontologies as of May 2019. Considerations include graph-based ontologies and open vs. proprietary ontologies. My in-house taxonomy tends to be narrow and deep. Some external taxonomies tend to be broad and shallow.
PwC publication estimates time lost per year at 4.5 billion Euros; cost of storage at 5.3 billion Euros [only data from academic research, private sector data not available]; license cost at 360 million Euros [private sector data not available]. Interdisciplinary and potential economic growth impacts cannot be estimated reliably.
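As a quick arithmetic check, the three cost items quoted in this note can be summed and compared with the headline minimum of 10.2 billion Euros per year. This is a sketch using only the figures cited here; the variable names are illustrative, and any smaller items in the PwC report that are not listed in these notes are left out.

```python
# Annual cost items quoted in the note above, in millions of Euros.
cost_items_million_eur = {
    "time lost": 4500,
    "storage": 5300,
    "licenses": 360,
}

total_million_eur = sum(cost_items_million_eur.values())  # 10160
total_billion_eur = total_million_eur / 1000              # 10.16

print(f"Sum of quoted items: {total_billion_eur:.2f} billion Euros per year")
```

The three quoted items alone come to about 10.16 billion Euros, which is consistent with the rounded minimum of 10.2 billion Euros per year.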
People don’t always know what they want or will eventually need, and can have difficulty articulating their desires. It is important to understand the challenges of the people whose problems you are trying to solve. If you ask them to change their workflow drastically, change will never happen.
Don’t be too hard on yourself. Some of these are issues everyone else is still trying to figure out.