Talk given at LGI2P (Conférence Communication Science et Société) in Nîmes on 17 March 2015. Content partly based on the work of Juan Antonio Lossio Ventura.
About the use of biomedical ontologies to play with text in the context of the SIFR project.
1. About the use of biomedical ontologies to play with text
… in the context of the…
Clement Jonquet (jonquet@lirmm.fr)
Conférence Communication Science et Société
LGI2P, Nîmes, 17 March 2015
3. Biologists have adopted ontologies
To provide a canonical representation of scientific knowledge
To annotate experimental data to enable interpretation, comparison, and discovery across databases
To facilitate knowledge-based applications for:
• Decision support
• Natural language processing
• Data integration
But ontologies are spread out, in different formats, of different sizes, with different structures
4. Working with terminologies & ontologies – a portal please!
You’ve built an ontology, how do you let the world know?
You need an ontology, where do you go to get it?
How do you know whether an ontology is any good?
How do you find resources that are relevant to the domain of the ontology (or to specific terms)?
How could you leverage your ontology to enable new science?
How could you use ontologies without managing them?
6. Annotation challenge
Explosion of biomedical data: diverse, distributed, unstructured… not linked to ontologies
Hard for biomedical researchers to find the data they need
Data integration problem
Translational discoveries are prevented
Good examples
• GO annotations
• PubMed (biomedical literature) indexed with MeSH headings
Annotate data with ontology concepts
Horizontal approach
[Figure labels: ONTOLOGIES, RESOURCES]
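To make "annotate data with ontology concepts" concrete, here is a minimal sketch of dictionary-based annotation. It is not the annotator discussed in the talk: the lexicon, the concept identifiers (EX:0001, EX:0002) and the function name annotate are all hypothetical, chosen only for illustration.

```python
import re

# Hypothetical mini-lexicon: term label -> concept identifier (illustrative values only).
LEXICON = {
    "melanoma": "EX:0001",
    "skin neoplasm": "EX:0002",
}

def annotate(text, lexicon):
    """Return (start, end, matched_text, concept_id) for every lexicon label found in text."""
    annotations = []
    for label, concept_id in lexicon.items():
        # Whole-word, case-insensitive match of the label.
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append((match.start(), match.end(), match.group(0), concept_id))
    return sorted(annotations)

for annotation in annotate("Melanoma is a malignant skin neoplasm.", LEXICON):
    print(annotation)
```

Each annotation links a span of raw text to a concept, which is what lets a resource be searched by concept rather than by keyword.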
7. A few words about the SIFR project
9. Context: increasing amount of biomedical data + multilingualism
Limits of keyword-based indexing
The biomedical community has turned to ontologies to describe its data and turn them into structured and formalized knowledge
Ontologies are used by creating semantic annotations
Crucial need for tools & services for French biomedical data
Biomedical data integration challenge
New potential scientific discoveries hidden in data
Translational research
10. Use ontologies for indexing, mining and searching (French) biomedical data
Obj1: Design, development and deployment of the French Annotator.
Obj2: Obtain new research results to exploit and enhance ontology-based indexing services.
• semantic distances
• ontology alignment
• ontology enrichment and disambiguation
Obj3: Valorization (transfer and exploitation) of the indexing services
15. http://data.bioontology.org
Ontology Services
• Search
• Traverse
• Comment
• Download
Widgets
• Tree-view
• Auto-complete
• Graph-view
Mapping Services
• Create
• Upload
• Download
Annotation
• Term recognition
Data Access
• Search “data” annotated with a given term
http://bioportal.bioontology.org
16. SIFR axes of research (1/2)
Design of the SIFR (French) Annotator service
Deployment of a local instance of BioPortal at LIRMM
Scoring of annotations & RDF representation using the AO [SWAT4LS 2014]
Dealing with multilingualism within BioPortal [TOTh-w 2014]
Automatic extraction of biomedical terminology from text
Presented hereafter [LBM 2013][ISWC 2014][TALN 2014][PolTAL 2014]
Semantic distance framework
Collaboration with LGI2P to reuse the Semantic Measures Library (SML)
17. SIFR axes of research (2/2)
Dealing with public patient data on blogs, forums and tweets (Sandra Bringay)
Detection of emotion [EGC 2014]
Patient vocabulary [eTELEMED 2014]
Adverse drug event mining from EHRs
Project to compare pharmacogenomics literature and EHRs
Design of a semantic annotation workflow for plant data, in collaboration with the IBC project [CO-PDI 2014]
AgroLD project [RDA 2014]
Cropontology.org
Semantic indexing and user feedback – Viewpoint [IC 2014]
Collaboration with P. Lemoisson (CIRAD)
PhD project of Guillaume Surroca
19. Motivations for automatic terminology extraction
Experiment and validate approaches for French data
Offer services for both English and French communities
Go beyond the state-of-the-art
Contribute to the ontology enrichment process
Acquire some NLP expertise to enhance the NCBO Annotation workflow
20. Combining ATR & AKE
            ATR (Automatic Term Recognition)    AKE (Automatic Keyword Extraction)
Input       one large corpus                    single document of a dataset
Output      technical terms of a domain         keywords that describe the document
Domain      very specific                       none
Examples    C-value                             TF-IDF, Okapi
[Figure: ATR yields one term list for the corpus (term1, term2, …, termn); AKE yields a keyword list (Keyword1, Keyword2, …) for each document]
21. (1) Part-of-speech tagging
(2) Candidate term extraction
(3) Ranking of candidate terms
(4) Computing of new combination measures
(5) Re-ranking using web-based measure
23. (1) Part-of-speech tagging
Assign each word in a text to its grammatical category (e.g., noun, adjective).
We apply part-of-speech tagging to the whole corpus.
Three tools:
• TreeTagger,
• Stanford Tagger,
• Brill’s rules
25. (2) Candidate term extraction following patterns
[Figure: Unified Medical Language System (UMLS), ~5M concepts from 161 sources: MeSH, ICD, SNOMED, …]
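Pattern-based candidate extraction can be sketched as follows: slide a window over POS-tagged tokens and keep every span whose tag sequence equals one of the allowed patterns. The pattern list, the tag names and the function name extract_candidates are hypothetical; the patterns actually used in the talk are not given in these slides.

```python
def extract_candidates(tagged_tokens, patterns):
    """Return every token span whose POS-tag sequence equals one of the patterns."""
    candidates = []
    n = len(tagged_tokens)
    for pattern in patterns:
        k = len(pattern)
        for i in range(n - k + 1):
            window = tagged_tokens[i:i + k]
            if tuple(tag for _, tag in window) == pattern:
                candidates.append(" ".join(word.lower() for word, _ in window))
    return candidates

# Hypothetical patterns and a hand-tagged sentence (a real pipeline would use a POS tagger).
PATTERNS = [("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN"), ("ADJ", "NOUN", "NOUN")]
TAGGED = [("Chronic", "ADJ"), ("kidney", "NOUN"), ("disease", "NOUN"),
          ("is", "VERB"), ("common", "ADJ")]

print(extract_candidates(TAGGED, PATTERNS))
```

The output still contains noise such as "chronic kidney", which is why the candidates are ranked in the next step.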
28. (3) Ranking of candidate terms
Using C-value, in order to extract single-word and multi-word terms
[Equation: C-value formula and its symbol definitions, shown as an image on the slide]
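As a reminder of how C-value ranks candidates, here is a minimal sketch of the classic definition: a term's frequency is weighted by its length, and a term nested inside longer candidates is penalized by the mean frequency of those longer candidates. The weight log2(|a| + 1) is an assumption (it keeps single-word terms from scoring zero); the exact variant shown on the slide is not recoverable from this transcript. The function names and the toy frequencies are illustrative.

```python
import math

def is_nested(short, long):
    """True if the word list `short` occurs contiguously inside the longer list `long`."""
    k = len(short)
    return k < len(long) and any(long[i:i + k] == short for i in range(len(long) - k + 1))

def c_value(freq):
    """freq maps each candidate term to its corpus frequency; returns term -> C-value."""
    scores = {}
    for a, f_a in freq.items():
        words_a = a.split()
        # Longer candidates that contain `a`.
        containers = [b for b in freq if is_nested(words_a, b.split())]
        weight = math.log2(len(words_a) + 1)  # assumed variant, see lead-in
        if containers:
            nested_mean = sum(freq[b] for b in containers) / len(containers)
            scores[a] = weight * (f_a - nested_mean)
        else:
            scores[a] = weight * f_a
    return scores

# Made-up frequencies for illustration.
FREQ = {"soft contact lens": 5, "contact lens": 8, "lens": 10}
for term, score in sorted(c_value(FREQ).items(), key=lambda item: -item[1]):
    print(term, round(score, 3))
```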
29. (3) Ranking of candidate terms
Using TF-IDF and Okapi BM25
[Figure: one keyword list (Keyword1, Keyword2, …) per document]
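TF-IDF and Okapi BM25 both score a term against a single document, so some aggregation is needed to rank terms for a whole corpus. The sketch below assumes each document is already represented as the list of candidate terms found in it, uses a Lucene-style smoothed IDF for BM25 with the usual defaults k1 = 1.2 and b = 0.75, and aggregates with the maximum over documents. All three choices are assumptions made for illustration, not details taken from the slides.

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """Relative term frequency in `doc` times log inverse document frequency."""
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return (Counter(doc)[term] / len(doc)) * math.log(len(docs) / df)

def bm25(term, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of `term` for `doc`, with a smoothed (always positive) IDF."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
    tf = Counter(doc)[term]
    avg_len = sum(len(d) for d in docs) / n
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))

def corpus_score(term, docs, scorer):
    """Assumed aggregation: best per-document score of the term."""
    return max(scorer(term, d, docs) for d in docs)

# Toy corpus: each document is a list of candidate terms.
DOCS = [["lens", "lens", "eye"], ["eye", "drug"], ["drug", "dose"]]
print(corpus_score("lens", DOCS, tf_idf), corpus_score("lens", DOCS, bm25))
```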
31. (4) Computing of new combination measures
F-OCapi and F-TFIDF-C (harmonic mean)
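A harmonic-mean combination rewards terms that score well on both measures at once, since one low score pulls the result down. The sketch below combines a C-value score with a keyword score (TF-IDF or Okapi) per term. The min-max normalization step and the function names are assumptions; the slides only state that a harmonic mean is used.

```python
def harmonic_mean(x, y):
    """Harmonic mean of two non-negative scores (0 if both are 0)."""
    return 2 * x * y / (x + y) if x + y > 0 else 0.0

def min_max(scores):
    """Rescale a term -> score mapping to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    return {t: (s - lo) / (hi - lo) if hi > lo else 0.0 for t, s in scores.items()}

def combine(c_values, keyword_scores):
    """Harmonic mean of the two normalized scores for terms present in both mappings."""
    c_norm, k_norm = min_max(c_values), min_max(keyword_scores)
    return {t: harmonic_mean(c_norm[t], k_norm[t]) for t in c_norm if t in k_norm}

# Made-up scores for three candidate terms.
print(combine({"a": 10, "b": 5, "c": 0}, {"a": 1, "b": 2, "c": 3}))
```

In this toy example the term "b", average on both measures, outranks "a" and "c", which are each best on one measure and worst on the other.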
32. (4) Computing of new combination measures
C-Okapi and C-TFIDF
34. (5) Re-ranking using web-based measure
[Figure: each candidate term (term 1, term 2, …, term n) is sent to the WEB both as an exact-phrase query (“treponema pallidum”) and as a plain query (treponema pallidum)]
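One way to read the quoted versus unquoted query is as a ratio: the more often the words of a candidate appear as an exact phrase relative to how often they co-occur at all, the more likely the candidate is a real term. The sketch below assumes that reading. The hit_count callable is hypothetical (no real search API is called), and the hit counts are made up for illustration.

```python
def web_ratio(term, hit_count):
    """Ratio of exact-phrase hits to plain-query hits for a candidate term."""
    exact = hit_count('"' + term + '"')
    loose = hit_count(term)
    return exact / loose if loose else 0.0

def rerank(terms, hit_count):
    """Sort candidate terms by decreasing web ratio."""
    return sorted(terms, key=lambda t: web_ratio(t, hit_count), reverse=True)

# Made-up hit counts standing in for a search engine (illustration only).
FAKE_HITS = {
    '"treponema pallidum"': 900, "treponema pallidum": 1000,
    '"patient said"': 50, "patient said": 5000,
}

def fake_hit_count(query):
    return FAKE_HITS.get(query, 0)

print(rerank(["patient said", "treponema pallidum"], fake_hit_count))
```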
35. Experiments: datasets
Drugs and Herbs (EN, FR)
Medical Tests (EN, FR)
PubMed (EN)
Plus automatic validation using UMLS (EN) & MeSH (FR)
36. Experiments: results
Precision comparison of the best measures for term extraction for English.
Precision comparison of the best measures for term extraction for French.
Precision comparison between F-OCapiM and WebR with automatic validation for French.
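Precision with automatic validation can be computed by checking how many of the top-k ranked candidates appear in a reference terminology. This is a minimal sketch: the reference set stands in for a resource such as UMLS or MeSH, and the function name and toy data are illustrative, not the evaluation code behind the reported results.

```python
def precision_at_k(ranked_terms, reference, k):
    """Fraction of the top-k ranked terms found in the reference terminology."""
    top = ranked_terms[:k]
    if not top:
        return 0.0
    return sum(1 for term in top if term.lower() in reference) / len(top)

# Toy ranked list and reference set (illustration only).
RANKED = ["treponema pallidum", "patient said", "syphilis", "last week"]
REFERENCE = {"treponema pallidum", "syphilis"}

for k in (1, 2, 4):
    print(k, precision_at_k(RANKED, REFERENCE, k))
```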
39. Current & future work on term extraction
Methodology for term extraction and ranking for two languages, French and English.
C-value adapted to extract French biomedical terms.
Two new measures thanks to the combination of three existing methods and another new web-based measure.
WebR was applied to re-rank the best list, positioning the true biomedical terms at the top of the list.
Reuse such NLP within the SIFR Annotator workflow to enhance semantic annotation
40. A few words by way of conclusion
41. Terminologies & ontologies are relevant features for knowledge representation
But a large majority of the data are texts
Go beyond one language
Share & pool relevant resources in the domain: ontologies, terminologies, mappings, annotations, technologies