CoronaWhy Knowledge Graph
Slava Tykhonov
Senior Information Scientist
(DANS-KNAW, the Netherlands)
W3C Semantic Web Health Care and Life Sciences Interest Group webinar
06.10.2020
About me: DANS-KNAW projects (2016-2020)
● CLARIAH+ (ongoing)
● EOSC Synergy (ongoing)
● SSHOC Dataverse (ongoing)
● CESSDA DataverseEU 2018
● Time Machine Europe Supervisor at DANS-KNAW
● PARTHENOS Horizon 2020
● CESSDA PID (Personal Identifiers) Horizon 2020
● CLARIAH
● RDA (Research Data Alliance) PITTS Horizon 2020
● CESSDA SaW H2020-EU.1.4.1.1 Horizon 2020
2
Source: LinkedIn
7 weeks in lockdown in Spain
3
Resistere (I will resist)
About CoronaWhy
1300+ people
registered in the
organization, more
than 300 actively
contributing! 4
www.coronawhy.org
COVID-19 Open Research Dataset Challenge (CORD-19)
It all started with this (March 2020):
“In response to the COVID-19 pandemic and with the view to boost research, the Allen
Institute for AI together with CZI, MSR, Georgetown, NIH & The White House is collecting
and making available for free the COVID-19 Open Research Dataset (CORD-19). This
resource is updated weekly and contains over 287,000 scholarly articles, including 82,870
with full text, about COVID-19 and other viruses of the coronavirus family.” (Kaggle)
5
CoronaWhy Community Tasks (March-April)
1. Task-Risk helps to identify risk factors that can increase the chance of being infected or affect the severity or survival outcome of the infection
2. Task-Ties to explore transmission, incubation and environmental stability
3. Match Clinical Trials allows exploration of the results from the COVID-19
International Clinical Trials dataset
4. COVID-19 Literature Visualization helps to explore the data behind the AI-
powered literature review
5. Named Entity Recognition across the entire corpus of full-text CORD-19 papers (a minimal spaCy sketch follows after this slide)
6
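Task 5 is easy to prototype. Below is a minimal sketch of running NER over a CORD-19-style abstract with spaCy; the model name ("en_core_web_sm") and the sample text are placeholders, and the actual CoronaWhy pipeline may rely on domain-specific (biomedical) models instead.

# Minimal NER sketch over a CORD-19-style abstract with spaCy.
# Assumes the general-purpose "en_core_web_sm" model is installed
# (python -m spacy download en_core_web_sm); the real pipeline may use
# biomedical models instead.
import spacy

nlp = spacy.load("en_core_web_sm")

abstract = (
    "SARS-CoV-2 was first reported in Wuhan in December 2019 and has since "
    "been studied by groups at Harvard Medical School and Stanford University."
)

doc = nlp(abstract)
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. ("Wuhan", "GPE"), ("December 2019", "DATE")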
CORD-19 affiliations recognized with Deep Learning
7
Source: CORD-19 map visualization and institution affiliation data
Collaboration with other organizations
● Harvard Medical School, INDRA integration
● Helix Group, Stanford University
● NASA JPL, COVID-19 knowledge graph and GeoParser
● Kaggle, coronamed application
● Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS,
knowledge graph
● dcyphr, a platform for creating and engaging with distillations of academic
articles
● CAMARADES (Collaborative Approach to Meta-Analysis and Review of
Animal Data from Experimental Studies)
We’ve got almost endless data streams...
8
Looking for Commons
Merce Crosas, “Harvard Data Commons” 9
Building a horizontal platform to serve vertical teams
Source: CoronaWhy infrastructure introduction 10
Turning FAIR into reality!
11
DANS-KNAW is one of the worldwide leaders in FAIR Data (FAIRsFAIR)
Dataverse as data integration point
12
● Available as a service for the community since April 2020
● Used by CoronaWhy vertical teams for data exchange and sharing
● Intended to help researchers make their data FAIR
● One of the biggest COVID-19 data archives in the world, with 700k files
● New teams get their own data containers and can reuse data collected and produced by others
https://datasets.coronawhy.org
Dataset from CoronaWhy vertical teams
13
Source: CoronaWhy Dataverse
COVID-19 data files verification
14
We verify every file by importing its contents into a dataframe. All column names (variables) extracted from tabular data become available as labels in the file metadata (see the sketch after this slide).
We’ve enabled Dataverse data previewers to browse through the content of files without downloading them!
We’re starting internal challenges to build ML models for metadata classification
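A minimal sketch of the verification step described above, assuming CSV/TSV inputs; the file path and the metadata structure are illustrative, not the exact CoronaWhy code.

# Read a tabular file into a pandas DataFrame and expose its column names
# (variables) as candidate metadata labels. Path and label layout are illustrative.
import pandas as pd

def extract_variable_labels(path: str) -> dict:
    sep = "\t" if path.endswith(".tsv") else ","
    df = pd.read_csv(path, sep=sep, nrows=1000)   # a sample is enough for headers
    return {
        "file": path,
        "variables": list(df.columns),
        "rows_sampled": len(df),
    }

print(extract_variable_labels("covid_cases_by_region.csv"))   # hypothetical file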
Dataverse content in Jupyter notebooks
15
Source: Dataverse examples on Google Colabs
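Getting Dataverse content into a notebook can go through the standard Dataverse search and data access APIs; a minimal requests-based sketch is below. The query string and the datafile id are placeholders, and the linked Colab notebooks may use pyDataverse or other helpers instead.

# Browse the CoronaWhy Dataverse from a notebook via the standard Dataverse APIs.
import requests

BASE = "https://datasets.coronawhy.org"

# 1. Find datasets matching a free-text query (search API).
resp = requests.get(f"{BASE}/api/search", params={"q": "CORD-19", "type": "dataset"})
resp.raise_for_status()
for item in resp.json()["data"]["items"]:
    print(item["global_id"], "-", item["name"])

# 2. Download a single data file by its numeric id (data access API).
file_id = 12345   # placeholder id
data = requests.get(f"{BASE}/api/access/datafile/{file_id}")
with open("datafile.csv", "wb") as fh:
    fh.write(data.content)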
COVID-19 Data Crowdsourcing
The CoronaWhy data management team reviews all harvested datasets and tries to identify the important data.
We’re approaching GitHub repository owners by creating issues in their repos and inviting them to help us.
More than 20% of data owners join the CoronaWhy community or are interested in curating their datasets.
Bottom-up data collection works!
16
Challenge of data integration and various ontologies
CORD-19 collection workflows with NLP pipeline:
● manual annotation and labelling of COVID-19 related papers
● automatic entity extraction and classification of text fragments
● statements extraction and curation
● linking papers to specific research questions with relationship extraction
Dataverse Data Lake streaming COVID-19 datasets from various sources:
● medical data
● socio-economic data
● political data and statistics
17
The importance of standards and ontologies
Generic controlled vocabularies used to link metadata in bibliographic collections are well known: ORCID, GRID, GeoNames, Getty.
Medical knowledge graphs powered by:
● Biological Expression Language (BEL)
● Medical Subject Headings (MeSH®) by U.S. National Library of Medicine (NIH)
● Wikidata (Open ontology) - Wikipedia
Integration based on metadata standards:
● MARC21, Dublin Core (DC), Data Documentation Initiative (DDI)
18
Statements extraction with INDRA
19
Source: EMMAA (Ecosystem of Machine-maintained Models with Automated Assembly)
“INDRA (Integrated Network and Dynamical
Reasoning Assembler) is an automated
model assembly system, originally
developed for molecular systems biology
and currently being generalized to other
domains.”
Developed as a part of Harvard Program in
Therapeutic Science and the Laboratory of
Systems Pharmacology at Harvard Medical
School.
http://indra.bio
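For orientation, the canonical usage pattern from the INDRA documentation is sketched below: reading a single sentence into INDRA Statements. It assumes the indra package is installed and that the TRIPS reading web service is reachable; the sentence is illustrative and this is not the CoronaWhy pipeline itself.

# Extract INDRA Statements from one sentence, following the documented pattern.
# Requires the `indra` package and network access to the TRIPS reading service.
from indra.sources import trips

tp = trips.process_text("ACE2 binds the SARS-CoV-2 spike protein.")
for stmt in tp.statements:
    print(stmt)   # assembled Statement objects, e.g. a Complex statement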
Knowledge Graph curation in INDRA
20
Advanced Text Analysis (NLP pipeline)
Source: D.Shlepakov, How to Build Knowledge Graph, notebook
We need a good understanding of all domain-specific texts to make correct statements:
21
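One simple way to turn sentences into candidate statements before curation is dependency-based subject-verb-object extraction; a rough sketch is below. This is a generic illustration, not necessarily the approach taken in the linked notebook, and it assumes the "en_core_web_sm" model is installed.

# Rough subject-verb-object extraction with spaCy's dependency parse,
# as one generic way to turn text into candidate KG statements.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_svo(sentence: str):
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr", "pobj")]
            if subjects and objects:
                triples.append((subjects[0].text, token.lemma_, objects[0].text))
    return triples

print(extract_svo("Chloroquine inhibits viral replication in cell cultures."))
# e.g. [('Chloroquine', 'inhibit', 'replication')]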
Building domain-specific knowledge graphs
● We’re collecting all possible COVID-19 data and archiving it in our Dataverse
● Looking for various related controlled vocabularies and ontologies
● Building and reusing conversion pipelines to get all data values linked in RDF format
The ultimate goal is to automate the process of knowledge extraction by using the latest developments in Artificial Intelligence and Deep Learning. We need to organize all available knowledge in a common way.
22
Simple Knowledge Organization System (SKOS)
SKOS models thesaurus-like resources:
- skos:Concepts with preferred labels and alternative labels (synonyms) attached to them (skos:prefLabel, skos:altLabel).
- skos:Concepts can be related with the skos:broader, skos:narrower and skos:related properties.
- terms and concepts can have more than one broader term or concept.
SKOS makes it possible to create a semantic layer on top of objects: a network of statements and relationships.
A major difference from formal ontologies is that SKOS hierarchies are not strictly logical “is-a” hierarchies: in a thesaurus the hierarchical relation can represent anything from “is-a” to “part-of”. A minimal rdflib sketch follows after this slide.
23
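A minimal rdflib sketch of two SKOS concepts with preferred/alternative labels and a broader/narrower link; the namespace URI is a placeholder, not an official CoronaWhy vocabulary.

# Express two related concepts in SKOS with rdflib. Namespace is a placeholder.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import SKOS, RDF

EX = Namespace("https://example.org/covid-vocab/")   # placeholder namespace

g = Graph()
g.bind("skos", SKOS)

covid = EX["covid-19"]
corona = EX["coronavirus-infections"]

g.add((covid, RDF.type, SKOS.Concept))
g.add((covid, SKOS.prefLabel, Literal("COVID-19", lang="en")))
g.add((covid, SKOS.altLabel, Literal("2019-nCoV disease", lang="en")))
g.add((corona, RDF.type, SKOS.Concept))
g.add((corona, SKOS.prefLabel, Literal("Coronavirus infections", lang="en")))
g.add((covid, SKOS.broader, corona))
g.add((corona, SKOS.narrower, covid))

print(g.serialize(format="turtle"))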
RDF graph using the SKOS Core Vocabulary
24
Source: SKOS Core Guide
Global Research Identifier Database (GRID) in SKOS
25
We already have a lot of data in the CoronaWhy Data Lake.
Can we provide humans with a convenient web interface to create links to data points?
Can we use Machine Learning algorithms to predict links and convert data to SKOS automatically?
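Before any Machine Learning model, a plain string-similarity baseline already answers part of the link-prediction question. The sketch below matches free-text affiliations against GRID institution names with the standard library; the GRID identifiers are placeholders and the threshold is arbitrary.

# Baseline affiliation-to-GRID linking with plain string similarity (difflib).
# GRID ids below are placeholders; a real pipeline would use the full GRID dump
# and/or a trained model on top of this kind of candidate generation.
import difflib

grid_labels = {
    "grid.placeholder.1": "Harvard University",
    "grid.placeholder.2": "Stanford University",
    "grid.placeholder.3": "DANS-KNAW",
}

def predict_grid_link(affiliation: str, threshold: float = 0.75):
    best_id, best_score = None, 0.0
    for grid_id, label in grid_labels.items():
        score = difflib.SequenceMatcher(None, affiliation.lower(), label.lower()).ratio()
        if score > best_score:
            best_id, best_score = grid_id, score
    return (best_id, best_score) if best_score >= threshold else (None, best_score)

print(predict_grid_link("Stanford University, CA"))   # ('grid.placeholder.2', ~0.9)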
SKOSMOS framework to discover ontologies
26
● SKOSMOS is developed in
Europe by the National Library
of Finland (NLF)
● active global user community
● search and browsing interface for SKOS concepts
● multilingual vocabulary support
● used for different use cases
(publish vocabularies, build
discovery systems, vocabulary
visualization)
Skosmos pipeline
27
The Skosmos backend is based on Jena Fuseki, a SPARQL server and RDF triple store integrated with the Lucene search engine. The jena-text extension can be used for faster text search and is used by both the Skosmos GUI and the REST API.
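A minimal sketch of querying such a Fuseki backend for SKOS concepts with SPARQLWrapper; the endpoint URL and the label filter are placeholders.

# Query a Skosmos-style Fuseki backend for SKOS concepts whose prefLabel
# contains "covid". Endpoint URL is a placeholder.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://localhost:3030/skosmos/sparql")   # placeholder
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label WHERE {
        ?concept a skos:Concept ;
                 skos:prefLabel ?label .
        FILTER(CONTAINS(LCASE(STR(?label)), "covid"))
    } LIMIT 10
""")

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["concept"]["value"], "-", row["label"]["value"])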
MeSH in SKOSMOS
The SKOS version of Medical Subject Headings is provided by the Finnish Thesaurus and Ontology Service (Finto):
https://finto.fi/mesh/
28
SKOSMOS API specification in Swagger
29
Source: Finto API
SKOSMOS API example for GRID ontology
30
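The same lookups are available without SPARQL through the Skosmos REST API; a minimal sketch against a Finto-style /rest/v1/search endpoint is below. The host and vocabulary id are placeholders for the CoronaWhy instance, and the response fields follow the public Finto API.

# Minimal Skosmos REST API lookup (Finto-style /rest/v1/search endpoint).
import requests

SKOSMOS = "https://finto.fi"   # or the CoronaWhy Skosmos host

resp = requests.get(
    f"{SKOSMOS}/rest/v1/search",
    params={"query": "covid*", "vocab": "mesh", "lang": "en"},
)
resp.raise_for_status()
for hit in resp.json()["results"]:
    print(hit["uri"], "-", hit["prefLabel"])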
COVID-19 expert questions
31
Source: Epidemic Questions Answering
“In response to the COVID-19 pandemic, the Epidemic Question Answering (EPIC-QA) track challenges teams to develop
systems capable of automatically answering ad-hoc questions about the disease COVID-19, its causal virus SARS-CoV-2,
related corona viruses, and the recommended response to the pandemic. While COVID-19 has been an impetus for a
large body of emergent scientific research and inquiry, the response to COVID-19 raises questions for consumers.”
COVID-19 questions in SKOSMOS framework
32
COVID-19 questions in SKOSMOS API
33
Visual graph of COVID-19 dataset
34
Source: CoronaWhy GraphDB
COVID-19 questions in datasets metadata
35
Source: COVID-19 European data hub in Harvard Dataverse
● COVID-19 ontologies can be hosted in the SKOSMOS framework
● Researchers can enrich metadata by adding standardized questions provided by SKOSMOS ontologies
● rich metadata is exported back to the Linked Open Data Cloud to increase the chance of being found
● enriched metadata can be used for further ML model training
Semantic chatbot - ontology lookup service
36
Source: CoronaWhy Semantic Bot
SPARQL endpoint for CoronaWhy KG
37
Source: YASGUI
Do you know many people who can use SPARQL?
38
Source: Semantic annotation of the Laboratory Chemical Safety Summary in PubChem
CLARIAH conclusions
“By developing these decentralised, yet controlled Knowledge Graph development practices we have
contributed to increasing interoperability in the humanities and enabling new research opportunities to a
wide range of scholars. However, we observe that users without Semantic Web knowledge find these
technologies hard to use, and place high value in end-user tools that enable engagement.
Therefore, for the future we emphasise the importance of tools to specifically target the goals of concrete
communities – in our case, the analytical and quantitative answering of humanities research questions for
humanities scholars. In this sense, usability is not just important in a tool context; in our view, we need to
empower users in deciding under what models these tools operate.” (CLARIAH: Enabling
Interoperability Between Humanities Disciplines with Ontologies)
Chicken-and-egg problem: users build tools without data models and ontologies, but in reality they need to build a knowledge graph with common ontologies first!
39
Linked Data integration challenges
● datasets are very heterogeneous and multilingual
● data usually lacks sufficient quality control
● data providers use different modeling schemas and styles
● linked data cleansing and versioning is very difficult to track and maintain properly, and web resources aren’t persistent
● even modern data repositories provide only metadata records describing data, without giving access to individual data items stored in files
● it is difficult to assign and manually keep up to date the entity relationships in a knowledge graph
CoronaWhy has too many information streams, which seem impossible to integrate and give back to COVID-19 researchers. So, do we have a solution? 40
Bibliographic Framework (BIBFRAME) as a Web of Data
“The Library of Congress officially launched its Bibliographic Framework Initiative
in May 2011. The Initiative aims to re-envision and, in the long run, implement a
new bibliographic environment for libraries that makes "the network" central and
makes interconnectedness commonplace.”
“Instead of thousands of catalogers repeatedly describing the same resources, the
effort of one cataloger could be shared with many.” (Source)
In 2019 BIBFRAME 2.0, the Library of Congress Pilot, was announced.
Let’s take a journey and move from domain-specific ontologies to bibliographic ones!
41
BIBFRAME 2.0 concepts
42
● Work. The highest level of abstraction, a Work, in the BIBFRAME context,
reflects the conceptual essence of the cataloged resource: authors, languages,
and what it is about (subjects).
● Instance. A Work may have one or more individual, material embodiments, for
example, a particular published form. These are Instances of the Work. An
Instance reflects information such as its publisher, place and date of publication,
and format.
● Item. An item is an actual copy (physical or electronic) of an Instance. It reflects
information such as its location (physical or virtual), shelf mark, and barcode.
● Agents: Agents are people, organizations, jurisdictions, etc., associated with a
Work or Instance through roles such as author, editor, artist, photographer,
composer, illustrator, etc.
● Subjects: A Work might be “about” one or more concepts. Such a concept is
said to be a “subject” of the Work. Concepts that may be subjects include topics,
places, temporal expressions, events, works, instances, items, agents, etc.
● Events: Occurrences, the recording of which may be the content of a Work.
Source: the Library of Congress
MARC as a foundation of the structured Data Hub
43
Source: the Library of Congress, USA
MARC standard was developed in the 1960s to create records that
could be read by computers and shared among libraries. The term
MARC is an abbreviation for MAchine Readable Cataloging.
The MARC 21 bibliographic format was created for the international
community. It’s very rich, with more than 2,000 data elements
defined!
It’s identified by its ISO (International Organization for Standardization) number: ISO 2709.
How to integrate data in the common KG?
● Use MARC 21 as a basis for all bibliographic and authority records
● All controlled vocabularies should be expressed in the MARC 21 Format for Authority Data; we need to build an authority linking process with a “human in the loop” approach that allows AI-predicted links to be verified.
● Different MARC 21 fields can be linked to different ontologies and even interlinked. For example, we can get some entities linked to both MeSH and Wikidata in the same bibliographic record to increase the interoperability of the Knowledge Graph (see the MARCXML sketch after this list).
● Every CORD-19 paper can get metadata enrichment provided by any research team working on NLP extraction of entities and relations or on linking controlled vocabularies together.
44
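A rough sketch of what such an enriched subject field can look like in MARCXML, built with the standard library: a 650 datafield carrying a heading plus $0 links to MeSH and Wikidata. The identifiers and field layout are given as examples of common MARC practice, not as the exact CoronaWhy record structure (which is summarized on the next slide).

# Build an enriched MARCXML 650 subject field with links to MeSH and Wikidata.
# Identifiers and layout are illustrative.
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("", MARC_NS)

record = ET.Element(f"{{{MARC_NS}}}record")

def add_subject(record, heading, authority_uris):
    field = ET.SubElement(
        record, f"{{{MARC_NS}}}datafield", {"tag": "650", "ind1": " ", "ind2": "0"}
    )
    ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": "a"}).text = heading
    for uri in authority_uris:
        ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": "0"}).text = uri

add_subject(
    record,
    "COVID-19",
    [
        "https://id.nlm.nih.gov/mesh/D000086382",     # MeSH descriptor (example)
        "http://www.wikidata.org/entity/Q84263196",   # Wikidata item (example)
    ],
)

print(ET.tostring(record, encoding="unicode"))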
COVID-19 paper in MARC 21 representation
45
Structure of the bibliographic record:
● authority records contain information about authors and affiliations
● medical entities extracted by the NLP pipeline are interlinked in 650x fields
● part of the metadata fields are generated and filled by Machine Learning models, part are contributed by human experts
● provenance information is kept in 833x fields, indicating fully or partially machine-generated records
● relations between entities are stored in 730x fields
Source: CoronaWhy CORD-19 portal
VuFind discovery tool for libraries powered by MARC 21
46
Source: CoronaWhy VuFind
Landing page of publications from CORD-19
47
Source: CoronaWhy VuFind
CORD-19 collection in BIBFRAME 2.0
48
CoronaWhy Graph published as RDF
49
Source: CoronaWhy Dataverse
CLARIAH and Network Digital Heritage (NDE) GraphiQL
50
CoronaWhy is running its own instance of NDE on a Kubernetes cluster and maintains support for additional ontologies (MeSH, ICD, NDF) available via their SPARQL endpoints!
If you want to query the API:
curl "http://nde.dataverse-dev.coronawhy.org/nde/graphql?query=%20%7B%20terms(match%3A%22COVID%22%2Cdataset%3A%5B%22wikidata%22%5D)%20%7B%20dataset%20terms%20%7Buri%2C%20altLabel%7D%20%7D%20%7D"
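The same query is easier to read when the GraphQL string is passed un-encoded and the HTTP library does the URL encoding; a small Python sketch against the endpoint above:

# The same NDE terms lookup as the curl example, with the GraphQL query readable.
import requests

query = """
{
  terms(match: "COVID", dataset: ["wikidata"]) {
    dataset
    terms { uri, altLabel }
  }
}
"""

resp = requests.get(
    "http://nde.dataverse-dev.coronawhy.org/nde/graphql",
    params={"query": query},
)
print(resp.json())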
Increasing Dataverse metadata interoperability
51
External controlled vocabulary support contributed by the SSHOC project (data infrastructure for the EOSC)
Hypothes.is annotations as a peer review service
1. The AI pipeline does domain-specific entity extraction and ranking of relevant CORD-19 papers.
2. Automatically extracted entities and statements will be added, and important fragments should be highlighted.
3. Human annotators should verify the results and validate all statements.
52
Doccano annotation with Machine Learning
Source: Doccano Labs
Building an Operating System for Open Science
54
The CoronaWhy Common Research and Data Infrastructure is distributed and robust enough to be scaled up and reused for other tasks such as cancer research.
All services are built from Open Source components.
Data is processed and published in a FAIR way, and the provenance information is part of our Data Lake.
Data evaluation and credibility are the top priority; we’re providing tools for the expert community to verify our datasets.
The transparency of data and services guarantees the reproducibility of all experiments and brings new insights into COVID-19 research.
CoronaWhy Common Research and Data Infrastructure
Data preprocessing pipeline implemented in Jupyter notebooks (Docker image with extra modules).
Dataverse as the data repository to store data from automatic and curated workflows.
Elasticsearch holds CORD-19 indexes at the section and sentence level with spaCy enrichments. Other indexes: MeSH, GeoNames, GRID.
Hypothesis and Doccano annotation services to annotate publications.
Virtuoso and GraphDB with public SPARQL endpoints to query the COVID-19 Knowledge Graph.
Other services: Colab, MongoDB, Kibana, BEL Commons 3.0, INDRA, Geoparser, Tabula.
https://github.com/CoronaWhy/covid-19-infrastructure
55
56
Source: CoronaWhy API built on FastAPI framework for Python
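A minimal sketch of the FastAPI pattern behind such an API; the route and response shape are illustrative only, not the actual CoronaWhy endpoints.

# Minimal FastAPI sketch illustrating the pattern behind the CoronaWhy API.
from fastapi import FastAPI

app = FastAPI(title="CoronaWhy API (sketch)")

@app.get("/papers/{cord_uid}")
def get_paper(cord_uid: str):
    # In the real service this would look up Elasticsearch / Dataverse / the KG.
    return {"cord_uid": cord_uid, "title": "placeholder", "entities": []}

# Run locally with: uvicorn app:app --reload   (assuming this file is app.py)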
Thank you! Questions?
@Slava Tykhonov on CoronaWhy Slack
vyacheslav.tykhonov@dans.knaw.nl
www.coronawhy.org
www.dans.knaw.nl/en
57

More Related Content

What's hot

Controlled vocabularies and ontologies in Dataverse data repository
Controlled vocabularies and ontologies in Dataverse data repositoryControlled vocabularies and ontologies in Dataverse data repository
Controlled vocabularies and ontologies in Dataverse data repository
vty
 
The world of Docker and Kubernetes
The world of Docker and Kubernetes The world of Docker and Kubernetes
The world of Docker and Kubernetes
vty
 
CLARIN CMDI use case and flexible metadata schemes
CLARIN CMDI use case and flexible metadata schemes CLARIN CMDI use case and flexible metadata schemes
CLARIN CMDI use case and flexible metadata schemes
vty
 
Building COVID-19 Museum as Open Science Project
Building COVID-19 Museum as Open Science ProjectBuilding COVID-19 Museum as Open Science Project
Building COVID-19 Museum as Open Science Project
vty
 
External CV support in Dataverse 5.7
External CV support in Dataverse 5.7External CV support in Dataverse 5.7
External CV support in Dataverse 5.7
vty
 
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
vty
 
Fighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial IntelligenceFighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial Intelligence
vty
 
Metaverse for Dataverse
Metaverse for DataverseMetaverse for Dataverse
Metaverse for Dataverse
vty
 
Running Dataverse repository in the European Open Science Cloud (EOSC)
Running Dataverse repository in the European Open Science Cloud (EOSC)Running Dataverse repository in the European Open Science Cloud (EOSC)
Running Dataverse repository in the European Open Science Cloud (EOSC)
vty
 
Integration of WORSICA’s thematic service in EOSC, Service QA and Dataverse
Integration of WORSICA’s thematic service in EOSC,  Service QA and DataverseIntegration of WORSICA’s thematic service in EOSC,  Service QA and Dataverse
Integration of WORSICA’s thematic service in EOSC, Service QA and Dataverse
vty
 
CLARIAH CMDI use case and flexible metadata schemes
CLARIAH CMDI use case and flexible metadata schemesCLARIAH CMDI use case and flexible metadata schemes
CLARIAH CMDI use case and flexible metadata schemes
Vyacheslav Tykhonov
 
Dataverse SSHOC enrichment of DDI support at EDDI'19 2
Dataverse SSHOC enrichment of DDI support at EDDI'19 2Dataverse SSHOC enrichment of DDI support at EDDI'19 2
Dataverse SSHOC enrichment of DDI support at EDDI'19 2
vty
 
Building an electronic repository and archives on Dataverse in the European O...
Building an electronic repository and archives on Dataverse in the European O...Building an electronic repository and archives on Dataverse in the European O...
Building an electronic repository and archives on Dataverse in the European O...
vty
 
Flexible metadata schemes for research data repositories - Clarin Conference...
Flexible metadata schemes for research data repositories  - Clarin Conference...Flexible metadata schemes for research data repositories  - Clarin Conference...
Flexible metadata schemes for research data repositories - Clarin Conference...
Vyacheslav Tykhonov
 
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the...Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the...
Andrea Scharnhorst
 
Dataverse in the European Open Science Cloud
Dataverse in the European Open Science CloudDataverse in the European Open Science Cloud
Dataverse in the European Open Science Cloud
vty
 
SSHOC Dataverse in the European Open Science Cloud
SSHOC Dataverse in the European Open Science CloudSSHOC Dataverse in the European Open Science Cloud
SSHOC Dataverse in the European Open Science Cloud
vty
 
LOD2 Webinar Series: D2R and Sparqlify
LOD2 Webinar Series: D2R and SparqlifyLOD2 Webinar Series: D2R and Sparqlify
LOD2 Webinar Series: D2R and Sparqlify
LOD2 Creating Knowledge out of Interlinked Data
 
e-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCubee-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCube
FAO
 
D4Science scientific data infrastructure promoting interoperability by embrac...
D4Science scientific data infrastructure promoting interoperability by embrac...D4Science scientific data infrastructure promoting interoperability by embrac...
D4Science scientific data infrastructure promoting interoperability by embrac...
FAO
 

What's hot (20)

Controlled vocabularies and ontologies in Dataverse data repository
Controlled vocabularies and ontologies in Dataverse data repositoryControlled vocabularies and ontologies in Dataverse data repository
Controlled vocabularies and ontologies in Dataverse data repository
 
The world of Docker and Kubernetes
The world of Docker and Kubernetes The world of Docker and Kubernetes
The world of Docker and Kubernetes
 
CLARIN CMDI use case and flexible metadata schemes
CLARIN CMDI use case and flexible metadata schemes CLARIN CMDI use case and flexible metadata schemes
CLARIN CMDI use case and flexible metadata schemes
 
Building COVID-19 Museum as Open Science Project
Building COVID-19 Museum as Open Science ProjectBuilding COVID-19 Museum as Open Science Project
Building COVID-19 Museum as Open Science Project
 
External CV support in Dataverse 5.7
External CV support in Dataverse 5.7External CV support in Dataverse 5.7
External CV support in Dataverse 5.7
 
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and DAN...
 
Fighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial IntelligenceFighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial Intelligence
 
Metaverse for Dataverse
Metaverse for DataverseMetaverse for Dataverse
Metaverse for Dataverse
 
Running Dataverse repository in the European Open Science Cloud (EOSC)
Running Dataverse repository in the European Open Science Cloud (EOSC)Running Dataverse repository in the European Open Science Cloud (EOSC)
Running Dataverse repository in the European Open Science Cloud (EOSC)
 
Integration of WORSICA’s thematic service in EOSC, Service QA and Dataverse
Integration of WORSICA’s thematic service in EOSC,  Service QA and DataverseIntegration of WORSICA’s thematic service in EOSC,  Service QA and Dataverse
Integration of WORSICA’s thematic service in EOSC, Service QA and Dataverse
 
CLARIAH CMDI use case and flexible metadata schemes
CLARIAH CMDI use case and flexible metadata schemesCLARIAH CMDI use case and flexible metadata schemes
CLARIAH CMDI use case and flexible metadata schemes
 
Dataverse SSHOC enrichment of DDI support at EDDI'19 2
Dataverse SSHOC enrichment of DDI support at EDDI'19 2Dataverse SSHOC enrichment of DDI support at EDDI'19 2
Dataverse SSHOC enrichment of DDI support at EDDI'19 2
 
Building an electronic repository and archives on Dataverse in the European O...
Building an electronic repository and archives on Dataverse in the European O...Building an electronic repository and archives on Dataverse in the European O...
Building an electronic repository and archives on Dataverse in the European O...
 
Flexible metadata schemes for research data repositories - Clarin Conference...
Flexible metadata schemes for research data repositories  - Clarin Conference...Flexible metadata schemes for research data repositories  - Clarin Conference...
Flexible metadata schemes for research data repositories - Clarin Conference...
 
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the...Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the...
Flexibility in Metadata Schemes and Standardisation: the Case of CMDI and the...
 
Dataverse in the European Open Science Cloud
Dataverse in the European Open Science CloudDataverse in the European Open Science Cloud
Dataverse in the European Open Science Cloud
 
SSHOC Dataverse in the European Open Science Cloud
SSHOC Dataverse in the European Open Science CloudSSHOC Dataverse in the European Open Science Cloud
SSHOC Dataverse in the European Open Science Cloud
 
LOD2 Webinar Series: D2R and Sparqlify
LOD2 Webinar Series: D2R and SparqlifyLOD2 Webinar Series: D2R and Sparqlify
LOD2 Webinar Series: D2R and Sparqlify
 
e-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCubee-Infrastructure Integration-with gCube
e-Infrastructure Integration-with gCube
 
D4Science scientific data infrastructure promoting interoperability by embrac...
D4Science scientific data infrastructure promoting interoperability by embrac...D4Science scientific data infrastructure promoting interoperability by embrac...
D4Science scientific data infrastructure promoting interoperability by embrac...
 

Similar to Building COVID-19 Knowledge Graph at CoronaWhy

Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
mantatheralyasriy
 
Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
mantatheralyasriy
 
BDCC-06-00004.pdf
BDCC-06-00004.pdfBDCC-06-00004.pdf
BDCC-06-00004.pdf
AsiyaKhan63
 
Data and science
Data and scienceData and science
Data and science
Anand Deshpande
 
Ciard Initiative and a Global Infrastructure for Linked Open Data
Ciard Initiative and a Global Infrastructure for Linked Open Data Ciard Initiative and a Global Infrastructure for Linked Open Data
Ciard Initiative and a Global Infrastructure for Linked Open Data
AIMS (Agricultural Information Management Standards)
 
Cornell 2011 05-13
Cornell 2011 05-13Cornell 2011 05-13
Cornell 2011 05-13
Johannes Keizer
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of Science
Globus
 
The Future of LOD
The Future of LODThe Future of LOD
The Future of LOD
Ghislain ATEMEZING
 
Decentralized Data Management for the Semantic Web
Decentralized Data Management for the Semantic WebDecentralized Data Management for the Semantic Web
Decentralized Data Management for the Semantic Web
hala Skaf
 
Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)
Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)
Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)
Joachim Neubert
 
Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
mantatheralyasriy
 
First they have to find it: Getting Open Government Data Discovered and Used
First they have to find it: Getting Open Government Data Discovered and UsedFirst they have to find it: Getting Open Government Data Discovered and Used
First they have to find it: Getting Open Government Data Discovered and Used
Rensselaer Polytechnic Institute
 
Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
mantatheralyasriy
 
Semantic Web for Cultural Heritage valorisation
Semantic Web for Cultural Heritage valorisationSemantic Web for Cultural Heritage valorisation
Semantic Web for Cultural Heritage valorisation
Valentina Carriero
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
Sören Auer
 
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
Digitalmikkeli
 
Beyond Meta-Data: Nano-Publications Recording Scientific Endeavour
Beyond Meta-Data: Nano-Publications Recording Scientific EndeavourBeyond Meta-Data: Nano-Publications Recording Scientific Endeavour
Beyond Meta-Data: Nano-Publications Recording Scientific Endeavour
KNOWeSCAPE2014
 
Hala skafkeynote@conferencedata2021
Hala skafkeynote@conferencedata2021Hala skafkeynote@conferencedata2021
Hala skafkeynote@conferencedata2021
hala Skaf
 
Linked Data: Why Bother?
Linked Data:  Why Bother?Linked Data:  Why Bother?
Linked Data: Why Bother?
Jennifer Bowen
 
From data portal to knowledge portal: Leveraging semantic technologies to sup...
From data portal to knowledge portal: Leveraging semantic technologies to sup...From data portal to knowledge portal: Leveraging semantic technologies to sup...
From data portal to knowledge portal: Leveraging semantic technologies to sup...
Xiaogang (Marshall) Ma
 

Similar to Building COVID-19 Knowledge Graph at CoronaWhy (20)

Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
 
Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
 
BDCC-06-00004.pdf
BDCC-06-00004.pdfBDCC-06-00004.pdf
BDCC-06-00004.pdf
 
Data and science
Data and scienceData and science
Data and science
 
Ciard Initiative and a Global Infrastructure for Linked Open Data
Ciard Initiative and a Global Infrastructure for Linked Open Data Ciard Initiative and a Global Infrastructure for Linked Open Data
Ciard Initiative and a Global Infrastructure for Linked Open Data
 
Cornell 2011 05-13
Cornell 2011 05-13Cornell 2011 05-13
Cornell 2011 05-13
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of Science
 
The Future of LOD
The Future of LODThe Future of LOD
The Future of LOD
 
Decentralized Data Management for the Semantic Web
Decentralized Data Management for the Semantic WebDecentralized Data Management for the Semantic Web
Decentralized Data Management for the Semantic Web
 
Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)
Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)
Linking Knowledge Organization Systems via Wikidata (DCMI conference 2018)
 
Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
 
First they have to find it: Getting Open Government Data Discovered and Used
First they have to find it: Getting Open Government Data Discovered and UsedFirst they have to find it: Getting Open Government Data Discovered and Used
First they have to find it: Getting Open Government Data Discovered and Used
 
Dataset Sources Repositories.pptx
Dataset Sources Repositories.pptxDataset Sources Repositories.pptx
Dataset Sources Repositories.pptx
 
Semantic Web for Cultural Heritage valorisation
Semantic Web for Cultural Heritage valorisationSemantic Web for Cultural Heritage valorisation
Semantic Web for Cultural Heritage valorisation
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
 
Beyond Meta-Data: Nano-Publications Recording Scientific Endeavour
Beyond Meta-Data: Nano-Publications Recording Scientific EndeavourBeyond Meta-Data: Nano-Publications Recording Scientific Endeavour
Beyond Meta-Data: Nano-Publications Recording Scientific Endeavour
 
Hala skafkeynote@conferencedata2021
Hala skafkeynote@conferencedata2021Hala skafkeynote@conferencedata2021
Hala skafkeynote@conferencedata2021
 
Linked Data: Why Bother?
Linked Data:  Why Bother?Linked Data:  Why Bother?
Linked Data: Why Bother?
 
From data portal to knowledge portal: Leveraging semantic technologies to sup...
From data portal to knowledge portal: Leveraging semantic technologies to sup...From data portal to knowledge portal: Leveraging semantic technologies to sup...
From data portal to knowledge portal: Leveraging semantic technologies to sup...
 

More from vty

Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs
vty
 
Decentralisation and knowledge graphs
Decentralisation and knowledge graphs Decentralisation and knowledge graphs
Decentralisation and knowledge graphs
vty
 
Decentralised identifiers for CLARIAH infrastructure
Decentralised identifiers for CLARIAH infrastructure Decentralised identifiers for CLARIAH infrastructure
Decentralised identifiers for CLARIAH infrastructure
vty
 
Dataverse repository for research data in the COVID-19 Museum
Dataverse repository for research data  in the COVID-19 MuseumDataverse repository for research data  in the COVID-19 Museum
Dataverse repository for research data in the COVID-19 Museum
vty
 
Flexible metadata schemes for research data repositories - CLARIN Conference'21
Flexible metadata schemes for research data repositories - CLARIN Conference'21Flexible metadata schemes for research data repositories - CLARIN Conference'21
Flexible metadata schemes for research data repositories - CLARIN Conference'21
vty
 
Data standardization process for social sciences and humanities
Data standardization process for social sciences and humanitiesData standardization process for social sciences and humanities
Data standardization process for social sciences and humanities
vty
 
Development in Dataverse SSHOC project
Development in Dataverse SSHOC projectDevelopment in Dataverse SSHOC project
Development in Dataverse SSHOC project
vty
 
DataverseEU as multilingual repository
DataverseEU as multilingual repositoryDataverseEU as multilingual repository
DataverseEU as multilingual repository
vty
 

More from vty (8)

Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs
 
Decentralisation and knowledge graphs
Decentralisation and knowledge graphs Decentralisation and knowledge graphs
Decentralisation and knowledge graphs
 
Decentralised identifiers for CLARIAH infrastructure
Decentralised identifiers for CLARIAH infrastructure Decentralised identifiers for CLARIAH infrastructure
Decentralised identifiers for CLARIAH infrastructure
 
Dataverse repository for research data in the COVID-19 Museum
Dataverse repository for research data  in the COVID-19 MuseumDataverse repository for research data  in the COVID-19 Museum
Dataverse repository for research data in the COVID-19 Museum
 
Flexible metadata schemes for research data repositories - CLARIN Conference'21
Flexible metadata schemes for research data repositories - CLARIN Conference'21Flexible metadata schemes for research data repositories - CLARIN Conference'21
Flexible metadata schemes for research data repositories - CLARIN Conference'21
 
Data standardization process for social sciences and humanities
Data standardization process for social sciences and humanitiesData standardization process for social sciences and humanities
Data standardization process for social sciences and humanities
 
Development in Dataverse SSHOC project
Development in Dataverse SSHOC projectDevelopment in Dataverse SSHOC project
Development in Dataverse SSHOC project
 
DataverseEU as multilingual repository
DataverseEU as multilingual repositoryDataverseEU as multilingual repository
DataverseEU as multilingual repository
 

Recently uploaded

Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
PirithiRaju
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
Carl Bergstrom
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
Sérgio Sacani
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Leonel Morgado
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
Scintica Instrumentation
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
frank0071
 
Gadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdfGadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdf
PirithiRaju
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
Areesha Ahmad
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
vluwdy49
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)
Sciences of Europe
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
LengamoLAppostilic
 
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills MN
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
Frédéric Baudron
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
MaheshaNanjegowda
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Selcen Ozturkcan
 

Recently uploaded (20)

Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
 
Gadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdfGadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdf
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
 

Building COVID-19 Knowledge Graph at CoronaWhy

  • 1. CoronaWhy Knowledge Graph Slava Tykhonov Senior Information Scientist (DANS-KNAW, the Netherlands) W3C Semantic Web Health Care and Life Sciences Interest Group webinar 06.10.2020
  • 2. About me: DANS-KNAW projects (2016-2020) ● CLARIAH+ (ongoing) ● EOSC Synergy (ongoing) ● SSHOC Dataverse (ongoing) ● CESSDA DataverseEU 2018 ● Time Machine Europe Supervisor at DANS-KNAW ● PARTHENOS Horizon 2020 ● CESSDA PID (Personal Identifiers) Horizon 2020 ● CLARIAH ● RDA (Research Data Alliance) PITTS Horizon 2020 ● CESSDA SaW H2020-EU.1.4.1.1 Horizon 2020 2 Source: LinkedIn
  • 3. 7 weeks in lockdown in Spain 3 Resistere (I will resist)
  • 4. About CoronaWhy 1300+ people registered in the organization, more than 300 actively contributing! 4 www.coronawhy.org
  • 5. COVID-19 Open Research Dataset Challenge (CORD-19) It’s all started from this (March, 2020): “In response to the COVID-19 pandemic and with the view to boost research, the Allen Institute for AI together with CZI, MSR, Georgetown, NIH & The White House is collecting and making available for free the COVID-19 Open Research Dataset (CORD-19). This resource is updated weekly and contains over 287,000 scholarly articles, including 82,870 with full text, about COVID-19 and other viruses of the coronavirus family.” (Kaggle) 5
  • 6. CoronaWhy Community Tasks (March-April) 1. Task-Risk helps to identify risk factors that can increase the chance of being infected, or affects the severity or the survival outcome of the infection 2. Task-Ties to explore transmission, incubation and environment stability 3. Match Clinical Trials allows exploration of the results from the COVID-19 International Clinical Trials dataset 4. COVID-19 Literature Visualization helps to explore the data behind the AI- powered literature review 5. Named Entity Recognition across the entire corpus of CORD-19 papers with full text 6
  • 7. CORD-19 affiliations recognized with Deep Learning 7 Source: CORD-19 map visualization and institution affiliation data
  • 8. Collaboration with other organizations ● Harvard Medical School, INDRA integration ● Helix Group, Stanford University ● NASA JPL, COVID-19 knowledge graph and GeoParser ● Kaggle, coronamed application ● Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, knowledge graph ● dcyphr, a platform for creating and engaging with distillations of academic articles ● CAMARADES (Collaborative Approach to Meta-Analysis and Review of Animal Data from Experimental Studies) We’ve got almost endless data streams... 8
  • 9. Looking for Commons Merce Crosas, “Harvard Data Commons” 9
  • 10. Building a horizontal platform to serve vertical teams Source: CoronaWhy infrastructure introduction 10
  • 11. Turning FAIR into reality! 11 DANS-KNAW is one of worldwide leaders in FAIR Data (FAIRsFAIR)
  • 12. Dataverse as data integration point 12 ● Available as a service for the community from April, 2020 ● Used by CoronaWhy vertical teams for the data exchange and share ● Intended to help researchers to make their data FAIR ● One of the biggest COVID-19 data archives in the world with 700k files ● New teams are getting own data containers and can reuse data collected and produced by others https://datasets.coronawhy.org
  • 13. Dataset from CoronaWhy vertical teams 13 Source: CoronaWhy Dataverse
  • 14. COVID-19 data files verification 14 We do a verification of every file by importing its contents to dataframe. All column names (variables) extracted from tabular data available as labels in files metadata We’ve enabled Dataverse data previewers to browse through the content of files without download! We’re starting internal challenges to build ML models for the metadata classification
  • 15. Dataverse content in Jupyter notebooks 15 Source: Dataverse examples on Google Colabs
  • 16. COVID-19 Data Crowdsourcing CoronaWhy data management team does does the review of all harvested datasets and try to identify the important data. We’re approaching github owners by creating issues in their repos and inviting them to help us. More than 20% of data owners joining CoronaWhy community or interested to curate their datasets. Bottom-up data collection works! 16
  • 17. Challenge of data integration and various ontologies CORD-19 collection workflows with NLP pipeline: ● manual annotation and labelling of COVID-19 related papers ● automatic entity extraction and classification of text fragments ● statements extraction and curation ● linking papers to specific research questions with relationships extraction Dataverse Data Lake streaming COVID-19 datasets from various sources: ● medical data ● socio-economic Data ● political data and statistics 17
  • 18. The importance of standards and ontologies Generic controlled vocabularies to link metadata in the bibliographic collections are well known: ORCID, GRID, GeoNames, Getty. Medical knowledge graphs powered by: ● Biological Expression Language (BEL) ● Medical Subject Headings (MeSH®) by U.S. National Library of Medicine (NIH) ● Wikidata (Open ontology) - Wikipedia Integration based on metadata standards: ● MARC21, Dublin Core (DC), Data Documentation Initiative (DDI) 18
  • 19. Statements extraction with INDRA 19 Source: EMMAA (Ecosystem of Machine-maintained Models with Automated Assembly) “INDRA (Integrated Network and Dynamical Reasoning Assembler) is an automated model assembly system, originally developed for molecular systems biology and currently being generalized to other domains.” Developed as a part of Harvard Program in Therapeutic Science and the Laboratory of Systems Pharmacology at Harvard Medical School. http://indra.bio
  • 21. Advanced Text Analysis (NLP pipeline) Source: D.Shlepakov, How to Build Knowledge Graph, notebook We need a good understanding of all domain specific texts to make right statements: 21
  • 22. Building domain specific knowledge graphs ● We’re collecting all possible COVID-19 data and archiving in our Dataverse ● Looking for various related controlled vocabularies and ontologies ● Building and reusing conversion pipelines to get all data values linked in RDF format The ultimate goal is to automate the process of the Knowledge extraction by using the latest developments in Artificial Intelligence and Deep Learning. We need to organize all available knowledge in a common way. 22
  • 23. Simple Knowledge Organization System (SKOS) SKOS models a thesauri-like resources: - skos:Concepts with preferred labels and alternative labels (synonyms) attached to them (skos:prefLabel, skos:altLabel). - skos:Concept can be related with skos:broader, skos:narrower and skos:related properties. - terms and concepts could have more than one broader term and concept. SKOS allows to create a semantic layer on top of objects, a network with statements and relationships. A major difference of SKOS is logical “is-a hierarchies”. In thesauri the hierarchical relation can represent anything from “is-a” to “part-of”. 23
  • 24. RDF graph using the SKOS Core Vocabulary 24 Source: SKOS Core Guide
  • 25. Global Research Identifier Database (GRID) in SKOS 25 We already have a lot of data in CoronaWhy Data Lake. Can we provide human a convenient web interface to create links to data points? Can we use Machine Learning algorithms to make a prediction about links and convert data in SKOS automatically?
  • 26. SKOSMOS framework to discover ontologies 26 ● SKOSMOS is developed in Europe by the National Library of Finland (NLF) ● active global user community ● search and browsing interface for SKOS concept ● multilingual vocabularies support ● used for different use cases (publish vocabularies, build discovery systems, vocabulary visualization)
  • 27. Skosmos pipeline 27 Skosmos backend is based on Jena Fuseki, a SPARQL server and RDF triple store integrated with Lucene search engine. The jena-text extension can be used for faster text search and used both by Skosmos GUI and REST API.
  • 28. MeSH in SKOSMOS SKOS version of Medical Subjects Headings provided by Finnish Thesaurus and Ontology Service (Finto): https://finto.fi/mesh/ 28
  • 29. SKOSMOS API specification in Swagger 29 Source: Finto API
  • 30. SKOSMOS API example for GRID ontology 30
  • 31. COVID-19 expert questions 31 Source: Epidemic Questions Answering “In response to the COVID-19 pandemic, the Epidemic Question Answering (EPIC-QA) track challenges teams to develop systems capable of automatically answering ad-hoc questions about the disease COVID-19, its causal virus SARS-CoV-2, related corona viruses, and the recommended response to the pandemic. While COVID-19 has been an impetus for a large body of emergent scientific research and inquiry, the response to COVID-19 raises questions for consumers.”
  • 32. COVID-19 questions in SKOSMOS framework 32
  • 33. COVID-19 questions in SKOSMOS API 33
  • 34. Visual graph of COVID-19 dataset 34 Source: CoronaWhy GraphDB
  • 35. COVID-19 questions in datasets metadata 35 Source: COVID-19 European data hub in Harvard Dataverse ● COVID-19 ontologies can be hosted by the SKOSMOS framework ● researchers can enrich metadata by adding standardized questions provided by SKOSMOS ontologies ● rich metadata is exported back to the Linked Open Data Cloud to increase the chance of being found ● enriched metadata can be used for further ML model training
  • 36. Semantic chatbot - ontology lookup service 36 Source: CoronaWhy Semantic Bot
  • 37. SPARQL endpoint for CoronaWhy KG 37 Source: YASGUI
  • 38. Do you know a lot of people that can use SPARQL? 38 Source: Semantic annotation of the Laboratory Chemical Safety Summary in PubChem
  • 39. CLARIAH conclusions “By developing these decentralised, yet controlled Knowledge Graph development practices we have contributed to increasing interoperability in the humanities and enabling new research opportunities to a wide range of scholars. However, we observe that users without Semantic Web knowledge find these technologies hard to use, and place high value in end-user tools that enable engagement. Therefore, for the future we emphasise the importance of tools to specifically target the goals of concrete communities – in our case, the analytical and quantitative answering of humanities research questions for humanities scholars. In this sense, usability is not just important in a tool context; in our view, we need to empower users in deciding under what models these tools operate.” (CLARIAH: Enabling Interoperability Between Humanities Disciplines with Ontologies) Chicken-and-egg problem: users are building tools without data models and ontologies, but in reality they need to build a knowledge graph with common ontologies first! 39
  • 40. Linked Data integration challenges ● datasets are very heterogeneous and multilingual ● data usually lacks sufficient quality control ● data providers use different modeling schemas and styles ● linked data cleansing and versioning are very difficult to track and maintain properly; web resources aren’t persistent ● even modern data repositories provide only metadata records describing the data, without giving access to the individual data items stored in files ● it is difficult to assign entity relationships in a knowledge graph and keep them up to date manually CoronaWhy has so many information streams that it seems impossible to integrate them all and give them back to COVID-19 researchers. So, do we have a solution? 40
  • 41. Bibliographic Framework (BIBFRAME) as a Web of Data “The Library of Congress officially launched its Bibliographic Framework Initiative in May 2011. The Initiative aims to re-envision and, in the long run, implement a new bibliographic environment for libraries that makes "the network" central and makes interconnectedness commonplace.” “Instead of thousands of catalogers repeatedly describing the same resources, the effort of one cataloger could be shared with many.” (Source) In 2019 BIBFRAME 2.0, the Library of Congress Pilot, was announced. Let’s take a journey and move from domain-specific ontologies to a bibliographic one! 41
  • 42. BIBFRAME 2.0 concepts 42 ● Work. The highest level of abstraction, a Work, in the BIBFRAME context, reflects the conceptual essence of the cataloged resource: authors, languages, and what it is about (subjects). ● Instance. A Work may have one or more individual, material embodiments, for example, a particular published form. These are Instances of the Work. An Instance reflects information such as its publisher, place and date of publication, and format. ● Item. An item is an actual copy (physical or electronic) of an Instance. It reflects information such as its location (physical or virtual), shelf mark, and barcode. ● Agents: Agents are people, organizations, jurisdictions, etc., associated with a Work or Instance through roles such as author, editor, artist, photographer, composer, illustrator, etc. ● Subjects: A Work might be “about” one or more concepts. Such a concept is said to be a “subject” of the Work. Concepts that may be subjects include topics, places, temporal expressions, events, works, instances, items, agents, etc. ● Events: Occurrences, the recording of which may be the content of a Work. Source: the Library of Congress
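A minimal rdflib sketch of how these classes hang together, using only core BIBFRAME 2.0 terms (bf:Work, bf:Instance, bf:Item, bf:instanceOf, bf:itemOf); the resource URIs are hypothetical:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    BF = Namespace("http://id.loc.gov/ontologies/bibframe/")
    EX = Namespace("http://example.org/cord19/")  # hypothetical identifiers

    g = Graph()
    g.add((EX.work1, RDF.type, BF.Work))      # conceptual essence of the paper
    g.add((EX.work1, RDFS.label, Literal("A CORD-19 paper (Work)")))

    g.add((EX.inst1, RDF.type, BF.Instance))  # a particular published embodiment
    g.add((EX.inst1, BF.instanceOf, EX.work1))

    g.add((EX.item1, RDF.type, BF.Item))      # a concrete (electronic) copy
    g.add((EX.item1, BF.itemOf, EX.inst1))

    print(g.serialize(format="turtle"))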
  • 43. MARC as a foundation of the structured Data Hub 43 Source: the Library of Congress, USA The MARC standard was developed in the 1960s to create records that could be read by computers and shared among libraries. The term MARC is an abbreviation of MAchine Readable Cataloging. The MARC 21 bibliographic format was created for the international community. It’s very rich, with more than 2,000 data elements defined! It’s identified by its ISO (International Organization for Standardization) number: ISO 2709
  • 44. How to integrate data in the common KG? ● Use MARC 21 as the basis for all bibliographic and authority records ● All controlled vocabularies should be expressed in the MARC 21 Format for Authority Data; we need to build an authority linking process with a “human in the loop” approach that lets humans verify AI-predicted links. ● Different MARC 21 fields can be linked to different ontologies and even interlinked. For example, we can get some entities linked to both MeSH and Wikidata in the same bibliographic record to increase the interoperability of the Knowledge Graph. ● Every CORD-19 paper can get metadata enrichment from any research team working on NLP extraction of entities and relations or on linking controlled vocabularies together. 44
  • 45. COVID-19 paper in MARC 21 representation 45 Structure of the bibliographic record: ● authority records contain information about authors and affiliations ● medical entities extracted by the NLP pipeline are interlinked in 650x fields ● some metadata fields are generated and filled by Machine Learning models, others are contributed by human experts ● provenance information is kept in 833x fields, indicating fully or partially machine-generated records ● relations between entities are stored in 730x fields Source: CoronaWhy CORD-19 portal
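A hedged sketch of assembling such a record with pymarc (using the pre-5.0 list-style subfields API); the title is made up, and the 650 field links a MeSH heading via its $0 authority URI as described above:

    from pymarc import Field, Record  # pymarc 4.x style

    record = Record()
    record.add_field(
        Field(tag="245", indicators=["0", "0"],
              subfields=["a", "A CORD-19 paper title"]))
    # Medical entity interlinked with MeSH in a 650 field ($0 = authority URI)
    record.add_field(
        Field(tag="650", indicators=[" ", "0"],
              subfields=["a", "Coronavirus Infections",
                         "0", "https://id.nlm.nih.gov/mesh/D018352"]))
    print(record)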
  • 46. Vufind discovery tool for libraries powered by MARC 21 46 Source: CoronaWhy VuFind
  • 47. Landing page of publications from CORD-19 47 Source: CoronaWhy VuFind
  • 48. CORD-19 collection in BIBFRAME 2.0 48
  • 49. CoronaWhy Graph published as RDF 49 Source: CoronaWhy Dataverse
  • 50. CLARIAH and Network Digital Heritage (NDE) GraphiQL 50 CoronaWhy is running its own instance of NDE on a Kubernetes cluster and maintains support for other ontologies (MeSH, ICD, NDF) available via their SPARQL endpoints! If you want to query the API: curl "http://nde.dataverse-dev.coronawhy.org/nde/graphql?query=%20%7B%20terms(match%3A%22COVID%22%2Cdataset%3A%5B%22wikidata%22%5D)%20%7B%20dataset%20terms%20%7Buri%2C%20altLabel%7D%20%7D%20%7D" (the URL-encoded query is { terms(match:"COVID", dataset:["wikidata"]) { dataset terms { uri, altLabel } } })
  • 51. Increasing Dataverse metadata interoperability 51 External controlled vocabularies support contributed by SSHOC project (data infrastructure for the EOSC)
  • 52. Hypothes.is annotations as a peer review service 1. The AI pipeline does domain-specific entity extraction and ranking of relevant CORD-19 papers. 2. Automatically extracted entities and statements are added as annotations, and important fragments are highlighted. 3. Human annotators verify the results and validate all statements. 52
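A sketch of how step 2 could push a machine-generated annotation through the Hypothes.is web API (the API token and target paper URL are placeholders):

    import requests

    API = "https://api.hypothes.is/api/annotations"
    TOKEN = "YOUR_HYPOTHESIS_API_TOKEN"  # placeholder developer token

    payload = {
        "uri": "https://doi.org/10.1101/2020.01.01.000000",  # placeholder paper
        "text": "Candidate statement extracted by the AI pipeline (needs human review).",
        "tags": ["coronawhy", "machine-generated"],
    }
    resp = requests.post(API, json=payload,
                         headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    print(resp.json()["id"])  # id of the newly created annotation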
  • 53. Doccano annotation with Machine Learning Source: Doccano Labs
  • 54. Building an Operating System for Open Science 54 The CoronaWhy Common Research and Data Infrastructure is distributed and robust enough to be scaled up and reused for other tasks, such as cancer research. All services are built from Open Source components. Data is processed and published in a FAIR way, and provenance information is part of our Data Lake. Data evaluation and credibility are the top priority; we’re providing tools for the expert community to verify our datasets. The transparency of data and services guarantees the reproducibility of all experiments and brings new insights into COVID-19 research.
  • 55. CoronaWhy Common Research and Data Infrastructure The data preprocessing pipeline is implemented in a Jupyter notebook Docker image with extra modules. Dataverse is the data repository used to store data from automatic and curated workflows. Elasticsearch holds CORD-19 indexes at the section and sentence level with spaCy enrichments. Other indexes: MeSH, Geonames, GRID. Hypothes.is and Doccano annotation services are used to annotate publications. Virtuoso and GraphDB provide public SPARQL endpoints to query the COVID-19 Knowledge Graph. Other services: Colab, MongoDB, Kibana, BEL Commons 3.0, INDRA, Geoparser, Tabula https://github.com/CoronaWhy/covid-19-infrastructure 55
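For illustration, a query against the sentence-level CORD-19 index could look like this (elasticsearch-py 7.x style; the host, index name, and field name are assumptions):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # hypothetical host
    resp = es.search(index="cord19-sentences",   # hypothetical index name
                     body={"query": {"match": {"sentence": "incubation period"}},
                           "size": 5})
    for hit in resp["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("sentence"))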
  • 56. 56 Source: CoronaWhy API built on FastAPI framework for Python
  • 57. Thank you! Questions? @Slava Tykhonov on CoronaWhy Slack vyacheslav.tykhonov@dans.knaw.nl www.coronawhy.org www.dans.knaw.nl/en 57