Chitty taxo cleveland 2019 june

  1. Preparing your taxonomy to be ready for data scientists & machine readability: A case study and work in progress. Mary Chitty, Library Director & Taxonomist, MSLS, Cambridge Healthtech, Needham MA. SLA Annual Conference, Cleveland, Ohio, Tuesday, June 18, 2019. Taxonomy-Ontology Conversions: Case Studies
  2. Historical Taxonomy Process (timeline: 1992, 2000, 2006-14, 2016, 2018-19). Taxonomies & Ontologies glossary & taxonomy. Company founded. Taxonomy created by CEO with a few hundred terms. Major products: conferences on emerging technologies. Focus on preclinical drug discovery. Acquired companies dealing with bioinformatics, clinical trials, energy and batteries. Still integrating their databases. Met people from OntoForce, Belgian semantic search engine company. Began informal collaboration. Acquired companies in artificial intelligence and Internet of Things. Still determining how to integrate databases. Several data scientists hired. Signed formal contract with OntoForce to use Disqover search engine. Taxonomy stands at 1,600+ terms now. Conferences and other products in preclinical and clinical biotech and pharma, clinical trials, energy, AI and Internet of Things and more. Published Genomic Glossaries & Taxonomies 2019
  3. Ongoing challenges Legacy data with inconsistencies, redundancies and ambiguities. Integrating company acquisitions’ data into in-house database. Still cleaning up, disambiguating and documenting in-house data and database. Scaling up difficulties often underestimated. A major pain point for us right now.
  4. FAIR Data Both the European Commission and NIH have allocated considerable resources to making data FAIRer. Findable • First step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. … an essential component of the FAIRification process. Accessible • Once the user finds the required data, she/he needs to know how they can be accessed. Interoperable • Data usually need to be integrated with other data … need to interoperate with applications or workflows. Reusable • Ultimate goal of FAIR is to optimise the reuse of data … metadata and data should be well-described so that they can be replicated and/or combined in different settings.
  5. Taxonomies and ontologies are critical for interoperability and reproducibility, particularly in the life sciences. Life sciences data are relatively sparse, with many attributes ("highly dimensional"), leading to complexity and sometimes chaos. Data on longitudinal health outcomes are limited by HIPAA & other privacy regulations, but crucial for validation. Increasing attention is being paid to data stewardship and data curation. Support is still a tough sell. Reproducibility crisis? More than 70% of researchers have tried and failed to reproduce experiments. More than half have failed to reproduce their own experiments. Nature 2016 survey of researchers. scientists-lift-the-lid-on-reproducibility- 1.19970
  6. Life science ontologies and taxonomies So many to choose from! BioPortal repository of biomedical ontologies has almost 800 ontologies, and mapping from ontologies to I2B2. Interdisciplinary work holds great promise – and needs mapping of terms between disciplines. Pistoia Alliance Ontologies Mapping nt-projects/ontologies-mapping/ Data mapping is also known as "data wrangling" or "data munging". Many people are trying to automate it. Still a work in progress.
  7. ROI Return On Investment & Cost Benefit Cost of not having FAIR research data, PwC EU Services, 2018, European Union Publications. Stakeholders may balk at investing in taxonomies or ontologies. Software, other IT & technology considerations only part of the issues. Educating decision makers is an ongoing process, even with CXOs who value taxonomies and ontologies. Estimated cost benefit analysis of not having FAIR research data: Minimum of 10.2 billion Euros per year.
  8. Key insights "…[T]here is a lot of work that needs doing to prepare the data sets for these technologies … there is a disproportionate amount being invested in the technologies as opposed to investing in 'data-readiness' … It's just not a slam dunk to mash up a lot of data and think it will work." Life Science Leader 2019 March 1, "AI In Life Sciences: Seeing past the Hype", Francois Nicolas and comment by Christy Wilson. "The AI solution may help accelerate some tasks, but human expertise may be required for the broad scope of what is needed. Currently AI in healthcare is in the second stage of the Gartner Hype Cycle: 'the peak of inflated expectation.' However, if we don't allow it to catch up to the hype, it may fall back into what Gartner calls the 'trough of disillusionment.'"
  9. Key takeaways Don't try to "boil the ocean". Prototype early and often. Think modular. • Pareto Principle 80/20: 80% of effects come from 20% of effort. Don't try for 100%. • Identify what your stakeholders value. Aim for quick wins. Understand existing workflows. • Seek out allies and shared buy-in for justification and sustainability. • Bundle stakeholders' key wants and items you know they will eventually need. Communicating ROI on taxonomies, ontologies and metadata is still challenging. • Expectations and change management are crucial skills to cultivate. • Report metrics, both quantitative and qualitative. • Recognize that some challenges are not yet resolved by anyone.
  10. Acknowledgments Many people have participated in this ongoing project. I'm grateful for their work, insights and encouragement. Cambridge Innovation Institute CII & Cambridge Healthtech • Phillips Kuhl, President • Tonya Urquizo, Knowledge Information Services Analyst and IT Liaison • Sanaye Bartlett, Data Analyst & Project Manager • Kaushik Chaudhuri, Director of Product Marketing CII Disqover Team • Kaitlyn Barago, Associate Conference Producer • Nancy Clarke, Data Scientist • Mike Croft, Software Architect • Ben Lakin, Director New Initiatives • Jaime Parlee, Director Marketing Analytics • Craig Wohlers, Manager Knowledge Foundation OntoForce • Hans Constandt, CEO & Founder • Filip Pattyn, Scientific Lead • Carla Suijkerbuijk, Business Development North America • Niels Vanneste, Customer Data Scientist • Berenice Wulbrecht, Data Science Director, Systems Biology Fruitful Conversations and emails • Ingrid Akerblom, IEA Diversified Consulting • Juliane Schneider, Lead Data Curator, eagle-I, Harvard Catalyst • Jane Lomax, Head Ontologist, SciBite • Terence Russell, Chief Technologist, IRODS Consortium • John Wilbanks, Chief Commons Officer, Sage Bionetworks
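
The legacy-data cleanup described in slide 3 (inconsistencies, redundancies and ambiguities) can be illustrated with a minimal Python sketch. It is not the method used in the project; it only shows one crude way to flag likely duplicate terms for human review. The sample terms and the function names `normalize_key` and `find_redundant_terms` are hypothetical.

```python
import re
from collections import defaultdict


def normalize_key(term: str) -> str:
    """Build a crude comparison key: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def find_redundant_terms(terms):
    """Group terms that share a comparison key; return groups with 2+ variants."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalize_key(term)].append(term)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# Hypothetical sample terms, not taken from the real taxonomy.
sample_terms = [
    "Bioinformatics",
    "bio-informatics",
    "Clinical Trials",
    "clinical trials ",
    "Batteries",
]

redundant = find_redundant_terms(sample_terms)
for key, variants in redundant.items():
    print(key, "->", variants)
```

A key this aggressive will also merge terms that are genuinely different, so the output is a review list for a taxonomist rather than an automatic merge.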

Editor's Notes

  1. Key motivations for taxonomy changes were company acquisitions in new disciplines, and new data science hires.
  2. No easy answers. Issues around integrating internal and external ontologies. Starting to look into issues around ambiguity. Progress often seems to be three steps forward, one or two steps back.
  3. A colleague commented, "As science becomes ever more interdisciplinary, it is a huge challenge to map data on different granular levels but semantically link them across different languages, standards, and cultures."
  4. An ontology colleague notes, "Institutions either underestimate the resources needed to do this work, or they are daunted by the entire prospect and researchers have to find repositories/help outside the institution to store and curate their data, if they bother to do so. Honestly, very little data will ever be reused."
  5. Some resources for locating life science ontologies and mappings. BioPortal has 773 ontologies as of May 2019. Graph-based ontologies, open vs. proprietary ontologies. My in-house taxonomy tends to be narrow and deep. Some external taxonomies tend to be broad and shallow.
  6. PwC publication estimates time lost per year at 4.5 billion Euros, cost of storage 5.3 billion Euros [only data from academic research, private sector data not available]; license cost 360 million [private sector data not available]. Interdisciplinary and potential economic growth impacts cannot be estimated reliably.
  7. People don't always know what they want or will eventually need, and can have difficulty articulating their desires. It is important to understand the challenges of the people whose problems you are trying to solve. If you ask them to change their workflow drastically, change will never happen. Don't be too hard on yourself. Some of these are issues everyone else is still trying to figure out.
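
As a quick arithmetic check on note 6 and slide 7, the short Python sketch below adds up the three cost figures quoted in note 6. It assumes these three are the main components of the PwC minimum estimate; the slides do not say whether the report includes other items. The variable names are illustrative.

```python
# Annual cost figures quoted in note 6, in billions of euros.
cost_components_billion_eur = {
    "time_lost": 4.5,
    "storage": 5.3,
    "licenses": 0.36,  # 360 million euros
}

total = sum(cost_components_billion_eur.values())
rounded_total = round(total, 1)

print(f"Sum of quoted components: about {rounded_total} billion euros per year")
```

The three quoted figures add up to roughly 10.16 billion euros, which rounds to the 10.2 billion minimum given on slide 7.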