Universitatea Politehnica București (University Politehnica of Bucharest)
Faculty of Automatic Control and Computers
Computer Science Department
Tracing the Paths Between Concepts in
Large Bio-medical Corpora
Zaruhi Alaverdyan, Marcello Benedetti, Falitokiniaina Rabearison, Nishara
Pathirana, Costin-Gabriel CHIRU and Traian Rebedea
{zara.alaverdyan, 4marcello, r.falitokiniaina, nishara.pdn}@gmail.com,
{costin.chiru, traian.rebedea}@cs.pub.ro
Introduction
• Language undergoes a continuous process of change: existing words acquire
new meanings, new concepts appear and old ones disappear or are used
less frequently.
• For experts there is an evident connection between a new concept and
some of the existing ones, but for regular people these relations remain
hidden and need to be identified.
• E.g. bio-medical domain: new terms appear as a result of new discoveries
and it becomes an important challenge to establish the connections
between different concepts.
• Why is it important to identify the connections?
– Micro-level: experiments are very costly in terms of time and resources, so it is
important to find some connections before actually undertaking the
experiments, in order to minimize the risks
– Macro-level: better understanding of the domain evolution, establishing some
investment strategies in specific domains, forecasting the next findings, paving
the way for new inventions, etc.
27.05.2015 CSCS 2015 2
Solution
• Identify the relations between different concepts
extracted from PubMed (a corpus of bio-medical
publications) over a time period of 20 years.
• Discover the paths from the existing concepts to the
newly introduced terms by building paths leading from
one concept to another.
• Use a graph-based approach for efficiency reasons.
• Use time series + cosine distance and Kullback-Leibler
(KL) divergence to estimate the distance (or
dissimilarity) between two terms.
Related Work
• Wijaya and Yeniterzi propose an analysis of the semantic
changes of a word by exploring how the words co-occurring
with it change over time in the Google N-gram corpus,
using k-means clustering + topic modelling
• Hall, Jurafsky and Manning try to detect the history of
ideas or topics in a scientific field. Their assumption is that
shifts in vocabulary usage are closely related to
discoveries and new ideologies, so such shifts can characterize
the appearance of new ideas or scientific topics
• NERs:
– General: Stanford NER, NaCTeM's TerMine
– Focused on medical ontology: MetaMap and Open Biomedical
Annotator (OBA), ADEPT (from Stanford University)
Methodology (1)
• Several steps:
1. Query PubMed with several filters to extract
medical articles: 542,228 articles
2. Pre-process the articles (cleaning + NER)
Methodology (2)
3. Build the Co-occurrence Graph
• Each vertex in V is a tuple <concept,
first year of appearance of that concept in the corpus>;
• There is an edge from v_i to v_j, and vice versa, iff
concepts i and j co-occur in at least one article;
• The weight of the edge from v_i to v_j is defined as
$w_{ij} = 1 - nco_{ij} / n_i$, where nco_ij is the number of
co-occurrences of concepts i and j (the number of articles
containing both concepts) and n_i is the number of articles
counted for concept i, as in the worked example that follows.
• w_ij is the probability of the two concepts not appearing
together and serves as the distance from concept i to concept j;
p_ij = 1 - w_ij is the probability of concept i
co-appearing with concept j.
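The graph construction can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes the weight implied by the shock therapy / tennis elbow example (1 - 2/46), i.e. w_ij = 1 - nco_ij / n_i with n_i taken as the number of articles containing concept i. The function name `build_cooccurrence_graph` and the input layout (one collection of concepts per article) are hypothetical, and the "first year of appearance" part of each vertex is left out for brevity.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(articles):
    """Return directed edge weights w[(i, j)] = 1 - nco_ij / n_i.

    `articles` is an iterable with one collection of concepts per article
    (hypothetical input format, assumed for this sketch).
    """
    concept_count = Counter()  # n_i: number of articles containing concept i
    pair_count = Counter()     # nco_ij: number of articles containing both i and j
    for concepts in articles:
        unique = sorted(set(concepts))  # count each concept once per article
        concept_count.update(unique)
        pair_count.update(combinations(unique, 2))

    weights = {}
    for (i, j), nco in pair_count.items():
        # The two directions differ because n_i and n_j differ.
        weights[(i, j)] = 1 - nco / concept_count[i]
        weights[(j, i)] = 1 - nco / concept_count[j]
    return weights
```

Feeding it 46 articles mentioning "shock therapy" and 28 mentioning "tennis elbow", two of them shared, gives 1 - 2/46 in one direction and 1 - 2/28 in the other.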
Methodology (3)
Example: the connection between "shock therapy", found for the first time in 46 abstracts published in
1991, and "tennis elbow", appearing for the first time in 1998 in 28 abstracts.
The two terms co-appeared twice in 1998, so the link from "shock therapy" to "tennis
elbow" = 1 - 2/46 ≈ 0.957, while the reverse link = 1 - 2/28 ≈ 0.929.
Observation: the connection from newer concepts to older ones is stronger (smaller
distance) than the reverse connection.
Methodology (4)
4. Filter the graph
• The number of edges increases substantially with the
number of articles in the corpus
• Eliminated concept pairs that co-occurred in only a single article
• Eliminated the top 150 most frequent concepts, which
co-occur with practically all the other concepts
in the corpus (e.g. therapy, surgery, analysis, etc.).
• Final graph had 743,117 distinct vertices (tuples
<concept, first year of appearance of that concept in
the corpus>) and 13,550,938 edges between them.
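The two filtering rules above might look as follows in Python. This is a sketch under stated assumptions: the first rule is read as "drop pairs that co-occur in only one article", and `filter_graph`, `pair_count` and `concept_count` are hypothetical names for the co-occurrence and frequency tables, not structures described in the slides.

```python
from collections import Counter


def filter_graph(pair_count, concept_count, min_cooccurrence=2, top_k=150):
    """Keep pairs seen in at least `min_cooccurrence` articles whose
    endpoints are not among the `top_k` most frequent concepts."""
    # Very frequent concepts (e.g. "therapy") connect to almost everything
    # and would shortcut every path, so they are removed as hubs.
    hubs = {concept for concept, _ in Counter(concept_count).most_common(top_k)}
    return {
        (i, j): nco
        for (i, j), nco in pair_count.items()
        if nco >= min_cooccurrence and i not in hubs and j not in hubs
    }
```

The threshold of 2 and the cut-off of 150 mirror the numbers on the slide; both are exposed as parameters because a different corpus would likely need different values.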
Methodology (6)
5. Discover the Concept Chains
• For each concept, identify the concepts that co-occur
with it frequently and, hence, are semantically related,
then extract the chains of related concepts.
• dist_ij = [formula not preserved in the extracted slide text]
• Computing the shortest path in such a huge graph is
computationally expensive: O(E + V log V), the bound for
Dijkstra's algorithm with a Fibonacci heap
• Use A* (an informed search algorithm) to find it
faster; this requires an estimate of the distance
between any two concepts in the graph
• The distance between any two concepts is estimated
using time series analysis: each concept is described by a
measure of its appearance in the articles published in
every year of the analyzed time span.
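A minimal A* sketch in Python illustrates the search step. Several things here are assumptions rather than details from the slides: the path cost is taken to be the sum of the edge weights, the graph is a plain dict of dicts, and `heuristic` is a caller-supplied function (for instance a distance between the yearly time series of two concepts). A* only guarantees the shortest path when the heuristic never overestimates the remaining cost, which a time-series distance need not satisfy.

```python
import heapq


def a_star(graph, start, goal, heuristic):
    """Search `graph` (dict: node -> {neighbour: non-negative weight}).

    Returns (path, cost), or (None, inf) when `goal` is unreachable.
    """
    # Queue entries: (estimated total cost, cost so far, node, path so far).
    frontier = [(heuristic(start, goal), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to `node` was found later
        for neighbour, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best_cost.get(neighbour, float("inf")):
                best_cost[neighbour] = new_cost
                estimate = new_cost + heuristic(neighbour, goal)
                heapq.heappush(
                    frontier, (estimate, new_cost, neighbour, path + [neighbour])
                )
    return None, float("inf")
```

With a heuristic that always returns 0 the function behaves like Dijkstra's algorithm, which is a convenient way to check results on a small graph before plugging in a time-series estimate.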
Methodology (7)
• The distance between the time series of two concepts
is computed using either the cosine distance or the
Kullback-Leibler divergence
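Both measures can be written directly over two yearly count vectors. The sketch below is an illustration under assumptions: the series are plain lists of per-year counts of equal length, cosine distance is taken as 1 minus cosine similarity, and the small `eps` smoothing (needed so that years with zero counts do not break the logarithm) is a choice made here, not something stated in the slides.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm else 1.0  # all-zero series: maximal distance


def kl_divergence(x, y, eps=1e-9):
    """KL(P || Q), with P and Q obtained by normalising the smoothed counts."""
    p = [a + eps for a in x]
    q = [b + eps for b in y]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (a / p_total) * math.log((a / p_total) / (b / q_total))
        for a, b in zip(p, q)
    )
```

Note that KL divergence is not symmetric, so the value from concept i to concept j generally differs from the value in the opposite direction, whereas cosine distance is the same both ways.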
Results (1)
Main achievement: the terms appearing on the path from one concept to another are in
close semantic relationship with each other and with the initial terms.
Results (2)
[Figure: a path between concepts retraced by hand through web pages. Connections are labelled "Wikipedia Search", "Wikipedia Link" or "Google Search". Nodes shown:]
• Cruzi: no Wikipedia page
• Trypanosoma: en.wikipedia.org/wiki/Trypanosoma
• Trypanosoma Cruzi: en.wikipedia.org/wiki/Trypanosoma_cruzi
• List of parasites of humans: en.wikipedia.org/wiki/List_of_parasites_of_humans
• Trypanosoma Brucei: en.wikipedia.org/wiki/Trypanosoma_brucei
• Sleeping Sickness: en.wikipedia.org/wiki/African_trypanosomiasis
• CNS (central nervous system)
• Astrogliosis: en.wikipedia.org/wiki/Astrogliosis
• Behavioral Testing: no Wikipedia page; www.humanconnectome.org/about/project/behavioral-testing.html
• Semantic link: brain connectivity
Conclusions
• The application managed to identify complex paths
from one concept to another. Such a path was difficult to
find using normal web searches and links: it required
a mix of Wikipedia links, Google searches and
other links on the web, plus implicit knowledge about the
concepts along the path.
• This was achieved with a graph-based approach that
formalized the concept of term co-occurrence and
allowed us to trace the semantic paths between
concepts.
• The paths were identified using the A* algorithm + time
series analysis combined with cosine distance / KL
divergence (cosine performed better than KL)
• Our approach depends heavily on the identification of
medical terms (ADEPT): a better NER would give better results
Questions
Thank you very much!
This work has been partially funded by the Sectoral Operational Programme
Human Resources Development 2007-2013 of the Ministry of European Funds
through the Financial Agreement POSDRU/159/1.5/S/132395 and by the FP7
project LTfLL (Language Technologies for Lifelong Learning).


Editor's Notes

  1. Search on Wikipedia: Cruzi (the page did not exist) → Trypanosoma (connected to cruzi, sleeping sickness and Chagas disease) → no direct connection to Astrogliosis, and no connection in the reverse direction either (from Astrogliosis to Trypanosoma). New search on Wikipedia: a different page made it possible to connect Sleeping Sickness and Astrogliosis through the CNS. New search on Wikipedia: behavioral testing (no relevant results) → search on Google → connection between "behavioral testing" and brain connectivity → brain connectivity can be damaged by Astrogliosis.