Language undergoes a continuous process of change, both at a semantic level, where existing words acquire new meanings, and at a lexical level, where new concepts appear and old ones disappear or are used less frequently. New words (terms/concepts) may be added as a result of scientific discoveries or socio-cultural influences, while other words are "forgotten" or are assigned alternative meanings. These changes in a vocabulary usually characterize important shifts in the environment or the domain they are used in. For experts there is an evident connection between a new concept and some of the existing ones, but for non-experts these relations remain hidden and need to be identified. In particular, in the medical domain new terms appear as a result of new discoveries, and establishing the connections between different concepts becomes an important challenge. Moreover, it is important to detect whether such a relation exists at all. In this paper, we present a graph-based approach to identify the semantic path (a chain of semantically related words) between the concepts that appeared in the bio-medical publications available in the PubMed corpus over a time period of 20 years.
Tracing the Paths Between Concepts in Large Bio-medical Corpora
1. Author, Scientific Advisor
University Politehnica of Bucharest
Faculty of Automatic Control and Computers
Computer Science Department
Tracing the Paths Between Concepts in
Large Bio-medical Corpora
Zaruhi Alaverdyan, Marcello Benedetti, Falitokiniaina Rabearison, Nishara
Pathirana, Costin-Gabriel CHIRU and Traian Rebedea
{zara.alaverdyan, 4marcello, r.falitokiniaina, nishara.pdn}@gmail.com,
{costin.chiru, traian.rebedea}@cs.pub.ro
2. Introduction
• Language undergoes a continuous process of change: existing words acquire
new meanings, new concepts appear and old ones disappear or are used
less frequently.
• For experts there is an evident connection between a new concept and
some of the existing ones, but for non-experts these relations remain
hidden and need to be identified.
• E.g. bio-medical domain: new terms appear as a result of new discoveries
and it becomes an important challenge to establish the connections
between different concepts.
• Why is it important to identify the connections?
– Micro-level: experiments are very costly in terms of time and resources, so it is
important to find some connections before actually undertaking the
experiments, in order to minimize the risks
– Macro-level: better understanding of the domain evolution, establishing some
investment strategies in specific domains, forecasting the next findings, paving
the way for new inventions, etc.
27.05.2015 CSCS 2015 2
3. Solution
• Identify the relations between different concepts
extracted from PubMed (a corpus of bio-medical
publications) over a time period of 20 years.
• Discover the paths from the existing concepts to the
newly introduced terms by building paths leading from
one concept to another.
• Use a graph-based approach for efficiency reasons.
• Use time series + cosine distance and Kullback-Leibler
(KL) divergence to estimate the distance (or
dissimilarity) between two terms.
4. Related Work
• Wijaya and Yeniterzi propose an analysis of the semantic
changes of a word based on exploring the changes of the
words co-occurring with it over time, using the Google
N-gram corpus → k-means clustering + topic modelling
• Hall, Jurafsky and Manning try to detect the history of
ideas or topics in a scientific field. Their assumption is that
shifts in vocabulary usage are closely related to
discoveries and new ideologies, and so can characterize the
appearance of new ideas or scientific topics
• Named-entity recognizers (NER):
– General: Stanford NER, NaCTeM's TerMine
– Focused on medical ontologies: MetaMap, Open Biomedical
Annotator (OBA), ADEPT (from Stanford University)
5. Methodology (1)
• Several steps:
1. Use PubMed to extract medical articles using
different filters: 542,228 articles
2. Pre-process (cleaning + NER)
6. Methodology (2)
3. Build the Co-occurrence Graph
• Each vertex in V is a tuple <concept,
first year of appearance of that concept in the corpus>;
• There is an edge from v_i to v_j, and vice versa, iff
concepts i and j co-occur in at least one article;
• The weight w_ij of the edge from v_i to v_j is defined in terms of
nco_ij = the number of co-occurrences of concepts i and j
(the number of articles containing both concepts).
• w_ij = the probability of the two concepts not appearing
together (the distance between them);
p_ij = 1 - w_ij is the probability of concept i
co-appearing with concept j.
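As a rough illustration of the counting behind this graph, the sketch below collects, for each concept, its first year of appearance and its article count, and for each pair of concepts the number of articles containing both (nco_ij). The input format (a list of `(year, concepts)` pairs) and the names `build_cooccurrence`, `first_year`, `n_articles` and `nco` are assumptions made for this example; the slides do not say how the data was stored.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(articles):
    """articles: iterable of (year, concepts) pairs (hypothetical input format)."""
    first_year = {}         # concept -> first year it appears in the corpus
    n_articles = Counter()  # concept -> number of articles mentioning it
    nco = Counter()         # (a, b) with a < b -> number of articles containing both
    for year, concepts in articles:
        unique = sorted(set(concepts))  # count each concept once per article
        for c in unique:
            n_articles[c] += 1
            first_year[c] = min(year, first_year.get(c, year))
        for a, b in combinations(unique, 2):
            nco[(a, b)] += 1
    return first_year, n_articles, nco
```

Each unordered pair is stored once; the two directed edges between v_i and v_j can then be derived from the same count.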
7. Methodology (3)
Connection between "shock therapy", found for the first time in 46 abstracts published in
1991, and "tennis elbow", appearing for the first time in 1998 in 28 abstracts.
The two terms co-appeared twice in 1998. The link from "shock therapy" to "tennis
elbow" = 1 - 2/46 ≈ 0.96, while the reverse link = 1 - 2/28 ≈ 0.93.
The connection from newer concepts to older ones is stronger (smaller distances) than the reverse connection.
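The arithmetic in this example suggests that each directed weight divides the co-occurrence count by the number of abstracts of the source concept. The sketch below encodes that reading; the general form w_ij = 1 - nco_ij / n_i is inferred from the two numbers on the slide, not stated there, and `edge_weight` is a hypothetical helper name.

```python
def edge_weight(nco_ij, n_i):
    """Distance from concept i to concept j, assuming w_ij = 1 - nco_ij / n_i.

    nco_ij: number of abstracts containing both concepts
    n_i:    number of abstracts containing concept i (assumed denominator)
    """
    if n_i <= 0:
        raise ValueError("concept i must appear in at least one abstract")
    return 1.0 - nco_ij / n_i

# Numbers from the slide: 46 abstracts, 28 abstracts, 2 co-occurrences.
shock_to_elbow = edge_weight(2, 46)  # 1 - 2/46
elbow_to_shock = edge_weight(2, 28)  # 1 - 2/28
```

Because only the denominator differs, the concept with fewer abstracts (here the newer one) gets the smaller outgoing distance, which matches the asymmetry noted on the slide.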
8. Methodology (4)
4. Filter the graph
• The number of edges increases substantially with the
number of articles in the corpus
• Eliminated concepts that co-occurred in only a single article
• Eliminated the top 150 most frequent concepts, which
co-occur with practically all the other concepts
in the corpus (e.g. therapy, surgery, analysis, etc.).
• Final graph had 743,117 distinct vertices (tuples
<concept, first year of appearance of that concept in
the corpus>) and 13,550,938 edges between them.
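A minimal sketch of the two filters, under stated assumptions: the single-article rule is read here as dropping pairs with only one co-occurrence, and "most frequent" is measured by the number of articles mentioning a concept. Neither detail is spelled out on the slide, and `filter_graph`, `nco` and `n_articles` are hypothetical names.

```python
from collections import Counter

def filter_graph(nco, n_articles, top_k=150, min_cooccurrences=2):
    """Keep only informative pairs.

    nco:        {(a, b): number of articles containing both concepts}
    n_articles: Counter {concept: number of articles mentioning it}
    """
    # The top_k most frequent concepts co-occur with almost everything.
    too_common = {c for c, _ in n_articles.most_common(top_k)}
    return {
        (a, b): n
        for (a, b), n in nco.items()
        if n >= min_cooccurrences and a not in too_common and b not in too_common
    }
```

Removing the hub concepts matters for path search as well: otherwise almost any two concepts would be linked through a generic term such as "therapy".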
10. Methodology (6)
5. Discover the Concept Chains
• For each concept, identify the concepts that co-occur
with it frequently and, hence, are semantically related
→ extract the chains of related concepts
• distij =
• Computing the shortest path in such a huge graph is
computationally expensive: O(E + V log V)
• Use A* (an informed search algorithm) to find it
faster → requires an estimate of the distance
between any two concepts in the graph
• Estimate the distance between any two concepts
using time series analysis (how often each
concept appears in the articles published in
every year of the analyzed time span).
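The search step can be sketched as a textbook A* over a weighted directed graph. This is a generic version, not the authors' implementation: the adjacency format `{node: {neighbour: weight}}` and the `heuristic(node, goal)` callback are assumptions for this example. A* returns a true shortest path only when the heuristic is consistent (it never overestimates the remaining distance, step by step); the slides do not state whether the time-series estimate satisfies this, so with it the result should be treated as a good path rather than a proven optimum.

```python
import heapq
from itertools import count

def a_star(graph, start, goal, heuristic):
    """graph: {node: {neighbour: weight}}; heuristic(node, goal) -> estimated distance."""
    tie = count()  # tie-breaker so the heap never has to compare nodes
    frontier = [(heuristic(start, goal), next(tie), 0.0, start, None)]
    came_from = {}            # expanded node -> its parent on the chosen path
    best_cost = {start: 0.0}  # cheapest known cost from start to each node
    while frontier:
        _, _, cost, node, parent = heapq.heappop(frontier)
        if node in came_from:
            continue  # already expanded
        came_from[node] = parent
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1], cost
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                priority = new_cost + heuristic(nxt, goal)
                heapq.heappush(frontier, (priority, next(tie), new_cost, nxt, node))
    return None, float("inf")  # goal not reachable from start
```

With a heuristic that always returns 0 this reduces to Dijkstra's algorithm; a more informative estimate lets the search expand fewer vertices.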
11. Methodology (6)
• The distance between two concepts is
computed using the cosine similarity or the
Kullback-Leibler distance
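Both measures can be sketched on per-year appearance counts. The assumptions here: each concept is represented by a list of yearly counts of equal length, the cosine "distance" is taken as 1 minus the cosine similarity, and a small constant `eps` is added before normalising so that the KL divergence stays finite when a concept is absent in some year. The slide does not specify any of these choices.

```python
import math

def cosine_distance(x, y):
    """1 - cosine similarity of two equal-length time series."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm else 1.0  # an all-zero series is treated as unrelated

def kl_divergence(x, y, eps=1e-9):
    """KL(P || Q) after smoothing and normalising both series to distributions."""
    p = [a + eps for a in x]
    q = [b + eps for b in y]
    sum_p, sum_q = sum(p), sum(q)
    return sum((a / sum_p) * math.log((a / sum_p) / (b / sum_q)) for a, b in zip(p, q))
```

Note that the KL divergence is not symmetric, so swapping the two concepts generally gives a different value, whereas the cosine distance does not depend on the order or on the overall scale of the counts.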
12. Results (1)
Main achievement: the terms appearing on the path from one concept to another are in
close semantic relationship with each other and with the initial terms.
13. Results (2)
[Figure: the path between the concepts retraced on the web. Labels in the figure: Wikipedia Search, Wikipedia Link, Google Search, semantics. Nodes:]
• Cruzi: no Wikipedia page
• Trypanosoma: en.wikipedia.org/wiki/Trypanosoma
• Trypanosoma Cruzi: en.wikipedia.org/wiki/Trypanosoma_cruzi
• List of parasites of humans: en.wikipedia.org/wiki/List_of_parasites_of_humans
• Sleeping Sickness: en.wikipedia.org/wiki/African_trypanosomiasis
• Trypanosoma Brucei: en.wikipedia.org/wiki/Trypanosoma_brucei
• CNS - central nervous system
• Astrogliosis: en.wikipedia.org/wiki/Astrogliosis
• Brain connectivity
• Behavioral Testing: no Wikipedia page; www.humanconnectome.org/about/project/behavioral-testing.html
14. Conclusions
• The application managed to identify complex paths
from one concept to another. It was difficult to find
such a path using normal web searches and links:
it requires a mix of Wikipedia links, Google searches and
other links on the web + implicit knowledge about the
concepts along the path.
• This was done using a graph-based approach, which
formalized the concept of term co-occurrence and
allowed us to trace the semantic paths between
concepts.
• The paths were identified using the A* algorithm + time
series analysis combined with cosine similarity / KL
distance (cosine performed better than KL)
• Our approach depends heavily on the identification of
medical terms (ADEPT) → a better NER gives better results
15. Questions
Thank you very much!
This work has been partially funded by the Sectoral Operational Programme
Human Resources Development 2007-2013 of the Ministry of European Funds
through the Financial Agreement POSDRU/159/1.5/S/132395 and by the FP7
project LTfLL (Language Technologies for Lifelong Learning).
Editor's Notes
Search on Wikipedia: Cruzi (page didn't exist); Trypanosoma (connections to cruzi, to sleeping sickness and to Chagas disease). No direct connection to Astrogliosis, and no connection in the reverse direction either (from Astrogliosis to Trypanosoma).
New search on Wikipedia: a different page that made it possible to connect Sleeping Sickness and Astrogliosis through the CNS.
New search on Wikipedia: behavioral testing (no relevant results). A search on Google gave the connection between "behavioral testing" and brain connectivity; brain connectivity can be damaged by Astrogliosis.