  • The objective of this work was to develop tools that automatically compute the semantic similarity between two concepts in the biomedical domain, using measures originally developed for general English. We obtain these biomedical concepts from the Unified Medical Language System (UMLS).
  • The Unified Medical Language System (UMLS) is a knowledge representation framework consisting of three main components: (1) the Metathesaurus, (2) the Semantic Network, and (3) the SPECIALIST Lexicon. Our focus with respect to these packages is the Metathesaurus.
  • The Metathesaurus integrates biomedical concepts from over 100 controlled medical terminologies into a single source. The concepts from these terminologies, often called source vocabularies, are identified by Atom Unique Identifiers (AUIs). These AUIs are semi-automatically clustered into Concept Unique Identifiers (CUIs).
  • For example, two source vocabularies in the Metathesaurus are MSH and SNOMEDCT. Both MSH and SNOMEDCT contain a concept for “Cold Temperature”; these AUIs are combined to form a single CUI for the concept Cold Temperature.
  • The concepts from the source vocabularies may be encoded with information such as a definition or relationships with other concepts. This information, encoded at the AUI level, can be obtained through the respective CUIs.
  • For example, in MSH there is an is-a relationship between the concepts Cold Temperature and Temperature. This is carried over as a PAR/CHD relation between the CUIs Cold Temperature and Temperature. The Metathesaurus keeps track of where each relationship originated.
  • So, for example, if a relationship exists between Low Temperature and Temperature in SNOMEDCT, it is also carried over to the CUI level,
  • creating two entries in the Metathesaurus for the same pair of CUIs, one for each source.
  • This relation information is stored in two tables in the Metathesaurus: MRHIER, which stores the relations between AUIs in their respective sources, and MRREL, which stores the relations between CUIs.
  • MRHIER contains the full path-to-root relations between the AUIs for each of the different sources. MRREL contains the pairwise relations between the CUIs derived from these sources; these are labeled PAR/CHD relations in MRREL. MRREL also contains other types of relations. The two considered hierarchical in the literature are PAR/CHD and RB/RN (broader or narrower). For most sources it is possible to generate MRHIER from MRREL through the PAR/CHD relations, but not for all of them. MSH is one exception, because an MSH concept may have different children depending on its location within the hierarchy. (Presenter's note, not read aloud: the full path-to-root is a transitive closure of the pairwise PAR/CHD relations, which does not hold for MSH because an MSH concept may have different children depending on its location within the hierarchy.)
  • There are benefits to using either the AUI hierarchy in MRHIER or the CUI hierarchy in MRREL. The benefit of using CUIs is the ability to obtain path information between concepts in different sources and over relations other than PAR/CHD. The benefit of using AUIs is that the PAR/CHD path information between concepts incorporates the tree positional information for the sources that use it, such as MSH. Currently, the one package we know of that obtains path information using MRHIER is UMLS-Query by Shah and Musen.
  • The UMLS-Interface package is a Perl interface to a local copy of the UMLS stored in a MySQL database. Its main purpose is to return path information about CUIs using information from MRREL, such as all of the possible paths from a CUI to the root given a specified set of sources and relations.
  • The UMLS-Similarity package is a Perl package that computes the semantic similarity between two CUIs given a specified set of sources and relations. The measures currently implemented in UMLS-Similarity are the measures proposed by Leacock and Chodorow, Wu and Palmer, and Nguyen and Al-Mubaid, the conceptual distance measure proposed by Rada et al., and the path measure. These are called path-based measures because they rely on the location of the concepts in a hierarchy. The path information is obtained from the UMLS-Interface package, which returns it for a specified set of sources and relations.
  • In the next couple of slides I am going to show an example using the path measure, which is one over the number of links in the shortest path between two concepts.
  • In this example we would like to calculate the similarity between Limbs and Body Regions using the path measure, given a set of specified sources. Using MSH, the similarity is 1/1, since there is only one link between the two concepts. Using SNOMEDCT, the similarity is 1/2, since there are two links between them. If we combine the MSH and SNOMEDCT sources, the similarity is one over the number of links in the shortest path between the two concepts, which is 1/1.
  • In this example we would like to calculate the similarity between Limbs and Body Regions using the path measure, given a set of specified relations. Using the PAR/CHD relations, the similarity is 1/1, since there is only one link between the two concepts. Using the RB/RN relations, the similarity is 1/2, since there are two links between them. If we combine the PAR/CHD and RB/RN relations, the similarity is one over the number of links in the shortest path between the two concepts, which is 1/1.
  • We compare the results obtained by UMLS-Similarity to those reported by Pedersen et al., Nguyen and Al-Mubaid, and Caviedes and Cimino.
  • Pedersen et al. evaluate the semantic similarity measure proposed by Leacock and Chodorow and the path measure using SNOMEDCT prior to its inclusion in the UMLS. Their dataset consists of 29 medical term pairs whose similarity was determined by 9 medical coders and 3 physicians on a 4-point scale, where 4 indicates that the terms are practically synonymous and 1 indicates that they are unrelated. The authors evaluate the measures by computing the correlation between the human annotations and the measures using Spearman’s rank correlation coefficient.
  • We compare the correlation results obtained by UMLS-Similarity on the same dataset using SNOMEDCT and the PAR/CHD relations in the 2008AB version of the UMLS
  • The results show that the correlation between UMLS-Similarity and the human annotators when using the path measure is only one percentage point lower than that obtained by Pedersen et al., and the results obtained using the measure proposed by Leacock and Chodorow are identical to those obtained by Pedersen et al.
  • Nguyen and Al-Mubaid evaluate the semantic similarity measures proposed by Leacock and Chodorow and by Wu and Palmer, their own proposed measure, and the path measure using the Medical Subject Headings (MSH): the actual MSH source, not the version integrated in the UMLS. They use the same dataset as Pedersen et al. and evaluate the measures by computing the correlation with the human annotations using Spearman’s rank correlation coefficient.
  • We compare the correlation results obtained by UMLS-Similarity using MSH and the PAR/CHD relations in the 2008AB version of the UMLS.
  • The results show that UMLS-Similarity obtains a lower correlation than those reported by Nguyen and Al-Mubaid for each of the semantic similarity measures.
  • Lastly, we compare the results of the UMLS-Similarity package with those obtained by Caviedes and Cimino. Caviedes and Cimino evaluate the conceptual distance measure proposed by Rada et al. on 10 medical term pairs.
  • Caviedes and Cimino use the PAR/CHD relations in MSH to determine the Conceptual Distance for each pair of terms. We compare the actual conceptual distance score obtained by Caviedes and Cimino with those obtained by UMLS-Similarity using MSH and the PAR/CHD relations from the UMLS version 2008AB.
  • The results show that the UMLS-Similarity obtains the same conceptual distance as Caviedes and Cimino for six out of the ten pairs. The remaining four pairs are only off by one point.
  • The results show that UMLS-Similarity can be used to reproduce the results reported by Pedersen et al. and Caviedes and Cimino. We believe that the minor differences in results are due to the different versions of the source vocabularies that were used.
  • We believe the difference in the results is due to three things: different versions of MSH being used to conduct the experiment; possibly different mappings of the terms to CUIs in MSH; and the fact that the path information used by Nguyen and Al-Mubaid comes directly from MSH, which in the UMLS is located in MRHIER and as PAR/CHD relations in MRREL. With MSH, the exact path-to-root information for each concept cannot be generated from MRREL, because an MSH concept may have different children depending on its position in the hierarchy.
  • In conclusion, UMLS-Interface can be used to obtain path information about a CUI given a specified set of sources and relations. UMLS-Similarity can be used to determine the semantic similarity between two CUIs in the UMLS. Currently, there are five path-based semantic similarity measures included in the package: the conceptual distance proposed by Rada et al.; the measures proposed by Leacock and Chodorow, Wu and Palmer, and Nguyen and Al-Mubaid; and the path measure.
  • In the future, for UMLS-Interface, we plan to improve the efficiency with which the path information of a CUI is stored. For UMLS-Similarity, we plan to incorporate other semantic similarity and relatedness measures, such as the information-content semantic similarity measures proposed by Resnik, Jiang and Conrath, and Lin, and the semantic relatedness measure proposed by Patwardhan.
  • So, this wraps up my presentation about UMLS-Interface and UMLS-Similarity. I have two take-home messages that I hope you remember from this talk. The first is that UMLS-Interface can reliably be used to extract path information about a CUI given a specified set of sources and relations.
  • The second is that UMLS-Similarity can reliably be used to determine the semantic similarity between two CUIs.
  • UMLS-Interface and UMLS-Similarity can both be downloaded from CPAN. I have a demo set up on my laptop, so if you are interested please let me know and I can show you during the break. Thank you. Are there any questions?
  • Transcript

    • 1. UMLS-Interface and UMLS-Similarity: Open Source Software for Measuring Paths and Semantic Similarity
      Bridget McInnes
      Ted Pedersen
      Serguei Pakhomov
      11/17/2009
    • 2. Objective
      Develop tools to automatically compute the semantic similarity between two concepts in the biomedical domain, drawn from the Unified Medical Language System (UMLS), using measures originally developed for general English
    • 3. Motivation
      Clustering symptoms and disorders found in the text of clinical reports for post-marketing medication safety and surveillance
      Identification of patients for clinical studies
      Improving the sensitivity of document retrieval of scientific journals and clinical reports
      Development of terminologies and ontologies
      Clustering of biomedical documents
      Word sense disambiguation
    • 4. Unified Medical Language System
      Knowledge representation framework
      Contains 3 Main components:
      Metathesaurus
      Semantic Network
      SPECIALIST Lexicon
    • 5. Metathesaurus
      Semi-automatically integrates biomedical concepts from over 100 controlled medical terminologies
      Source vocabularies are organized based on their Atom Unique Identifiers (AUIs)
      The Metathesaurus is organized based on Concept Unique Identifiers (CUIs)
    • 6. Concept Unique Identifiers (CUIs)
      CUI C0009264: Cold Temperature
      AUI A15588749: Cold Temperature (MSH)
      AUI A3292554: Low Temperature (SNOMED-CT)
    • 7. CUI Information
      The concepts (AUIs) from the source vocabularies may contain information about the concept such as its
      Definition
      Relation information between the concepts
      The information from the AUIs can be obtained through their respective CUIs
    • 8. Relations between CUIs in MSH
      AUI level (MSH): AUI A15588749 Cold Temperature is-a AUI A0123939 Temperature
      CUI level: CUI C0039476 Temperature is the parent (PAR/CHD, from MSH) of CUI C0009264 Cold Temperature
    • 9. Relations between CUIs in SNOMED-CT
      AUI level (SNOMED-CT): AUI A3292554 Low Temperature is-a AUI A2887140 Temperature
      CUI level: CUI C0039476 Temperature is the parent (PAR/CHD, from SNOMED-CT) of CUI C0009264 Cold Temperature
    • 10. Multiple Relations
      CUI C0039476 Temperature and CUI C0009264 Cold Temperature are linked by two PAR/CHD relations:
      PAR/CHD (SNOMED-CT)
      PAR/CHD (MSH)
    • 11. Relation Information
      MRHIER: relation between AUI A0123939 Temperature and AUI A15588749 Cold Temperature
      MRREL: relation between CUI C0039476 Temperature and CUI C0009264 Cold Temperature
    • 12. MRREL and MRHIER
      MRHIER
      Contains the full path to root relations between AUIs from each of the sources
      is-a
      part-of
      MRREL
      Contains the pairwise relations between CUIs
      Relations:
      PAR/CHD
      RB/RN
      It is possible to generate MRHIER from MRREL except for the following sources:
      AIR
      MSH
      SNM2
      USPMG
      OMS
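To see why a transitive closure over pairwise PAR/CHD relations can fail to reproduce a source's own paths to root, consider the following Python sketch. The hierarchy is entirely hypothetical (the names ROOT, A, B, X, Y and Z are invented and are not MSH concepts): concept X occurs at two tree positions and has a different child at each.

```python
# Hypothetical toy hierarchy; none of these names are real MSH concepts.
# Full paths to root as a position-aware source would list them:
# X has child Y only under A, and child Z only under B.
SOURCE_PATHS = {
    ("ROOT", "A", "X", "Y"),
    ("ROOT", "B", "X", "Z"),
}


def pairwise_parents(paths):
    """Flatten full paths into pairwise child -> {parents} relations,
    discarding the tree position each relation came from."""
    parents = {}
    for path in paths:
        for parent, child in zip(path, path[1:]):
            parents.setdefault(child, set()).add(parent)
    return parents


def paths_to_root(concept, parents):
    """Rebuild every path to the root by following pairwise parent
    relations upward (a transitive closure over parent/child pairs)."""
    if concept not in parents:          # no parent: this is the root
        return {(concept,)}
    rebuilt = set()
    for parent in parents[concept]:
        for path in paths_to_root(parent, parents):
            rebuilt.add(path + (concept,))
    return rebuilt


parents = pairwise_parents(SOURCE_PATHS)
rebuilt = paths_to_root("Y", parents) | paths_to_root("Z", parents)
spurious = rebuilt - SOURCE_PATHS
print(sorted(spurious))
```

The closure rebuilds four paths where the toy source has only two, because the pairwise relations no longer record which position of X each child belongs to.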
    • 13. CUI versus AUI Hierarchy
      The benefit of using CUIs
      Ability to obtain the relation information between concepts across sources
      Ability to obtain the relation information between concepts using more than one type of relation:
      PAR/CHD – parent/child (relation in MRHIER)
      RB/RN – broader/narrower
      SIB – sibling
      RL – concepts are similar or ‘alike’
      The benefit of using AUIs
      Ability to obtain relation information (PAR/CHD) between concepts in the same source very quickly
      incorporates tree positional information for sources such as MSH
      UMLS-Query by Shah and Musen, 2008
    • 14. UMLS-Interface
      Perl interface to the UMLS present locally in a MySQL database.
      Its main purpose is to return path information about CUIs using the relation information in MRREL
      All possible paths to the root
      Shortest path between two concepts
    • 15. UMLS-Similarity
      A suite of Perl modules that implement a number of path-based semantic similarity measures to determine the similarity between two CUIs in the UMLS
      Measures are path-based because they rely on the location of the concepts in a hierarchy
      The path information is obtained using UMLS-Interface
      Semantic Similarity Measures:
      Path measure
      Conceptual Distance (Rada et al., 1989)
      Leacock and Chodorow, 1998
      Wu and Palmer, 1994
      Nguyen and Al-Mubaid, 2006
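For orientation, two of the listed measures are commonly stated as follows. These are the forms usually cited from the original papers, not necessarily the exact variants implemented in UMLS-Similarity; implementations differ, for example, on whether path length is counted in nodes or in links.

```latex
% Leacock and Chodorow (1998): len is the length of the shortest path
% between c_1 and c_2, and D is the maximum depth of the hierarchy.
\mathrm{sim}_{lch}(c_1, c_2) = -\log \frac{\mathrm{len}(c_1, c_2)}{2D}

% Wu and Palmer (1994): lcs is the least common subsumer of c_1 and c_2.
\mathrm{sim}_{wup}(c_1, c_2) =
  \frac{2\,\mathrm{depth}(\mathrm{lcs}(c_1, c_2))}
       {\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}
```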
    • 16. Semantic Similarity Example
      Path measure
      $$\mathrm{Sim}(c_1, c_2) = \frac{1}{N}$$
      where $N$ is the number of links in the shortest path between the two concepts $c_1$ and $c_2$
    • 17. Similarity Given Specified Sources
      Diagram: C0015385 Limbs has parent (PAR, MSH) C0005898 Body Regions
      Diagram: C0015385 Limbs has parent (PAR, SNOMED-CT) C0229962 Anatomic Part, which has parent (PAR, SNOMED-CT) C0005898 Body Regions
    • 18. Similarity Given Specified Sources
      Same diagram as slide 17, using MSH only (one link from C0015385 Limbs to C0005898 Body Regions):
      Similarity = 1/1
    • 19. Similarity Given Specified Sources
      Same diagram as slide 17, using SNOMED-CT only (two links from C0015385 Limbs to C0005898 Body Regions via C0229962 Anatomic Part):
      Similarity = 1/2
    • 20. Similarity Given Specified Sources
      Same diagram as slide 17, using MSH and SNOMED-CT combined (the shortest path is the single MSH link):
      Similarity = 1/1
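The three results above can be reproduced with a short Python sketch: a breadth-first search for the shortest path, restricted to links from the chosen sources. This is an illustration under stated assumptions, not the UMLS-Similarity implementation; the tuple layout and function names are invented, and which concepts each link joins is read off the accompanying example.

```python
from collections import deque

# Parent links as (child CUI, parent CUI, source); layout is an assumption.
LINKS = [
    ("C0015385", "C0005898", "MSH"),        # Limbs -> Body Regions
    ("C0015385", "C0229962", "SNOMED-CT"),  # Limbs -> Anatomic Part
    ("C0229962", "C0005898", "SNOMED-CT"),  # Anatomic Part -> Body Regions
]


def shortest_path_links(start, goal, links, sources):
    """Breadth-first search over links from the allowed sources, treated
    as undirected; returns the number of links, or None if unreachable."""
    neighbours = {}
    for child, parent, source in links:
        if source in sources:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(c1, c2, links, sources):
    """Path measure as defined on slide 16: 1 / N, N = number of links."""
    n = shortest_path_links(c1, c2, links, sources)
    if not n:          # no path found, or c1 == c2 (N = 0)
        return None
    return 1 / n


LIMBS, BODY_REGIONS = "C0015385", "C0005898"
print(path_similarity(LIMBS, BODY_REGIONS, LINKS, {"MSH"}))               # 1.0
print(path_similarity(LIMBS, BODY_REGIONS, LINKS, {"SNOMED-CT"}))         # 0.5
print(path_similarity(LIMBS, BODY_REGIONS, LINKS, {"MSH", "SNOMED-CT"}))  # 1.0
```

Filtering by relation type instead of source would follow the same pattern, with the relation label in place of the source in each tuple.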
    • 21. Similarity Given Specified Relations
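      Diagram: C0015385 Limbs has parent (PAR, MSH) C0005898 Body Regions
      Diagram: C0015385 Limbs has broader concept (RB, MSH) C0229962 Anatomic Part, which has broader concept (RB, MSH) C0005898 Body Regions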
    • 22. Similarity Given Specified Relations
      Same diagram as slide 21, using PAR/CHD relations only (one link from C0015385 Limbs to C0005898 Body Regions):
      Similarity = 1/1
    • 23. Similarity Given Specified Relations
      Same diagram as slide 21, using RB/RN relations only (two links from C0015385 Limbs to C0005898 Body Regions via C0229962 Anatomic Part):
      Similarity = 1/2
    • 24. Similarity Given Specified Relations
      Same diagram as slide 21, using PAR/CHD and RB/RN relations combined (the shortest path is the single PAR link):
      Similarity = 1/1
    • 25. Functional Validation
      Comparison with Previous Work:
      Pedersen, et al. 2007
      Nguyen and Al-Mubaid, 2006
      Caviedes and Cimino, 2004
    • 26. Pedersen, et al.
      Semantic Similarity Measures
      Path
      Leacock and Chodorow, 1998
      Source
      SNOMEDCT
      Data
      29 medical term pairs
      Similarity determined by:
      9 Medical Coders
      3 Physicians
      4 Point Scale
      4 – practically synonymous
      3 – related
      2 – marginally related
      1 – unrelated
      Spearman’s Rank Correlation Coefficient
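As a reminder of what the evaluation computes, here is a minimal Python sketch of Spearman's rank correlation: rank both score lists (tied values share the mean of their ranks), then take the Pearson correlation of the ranks. The function names are invented, and the scores are made-up numbers for illustration, not values from the dataset. Constant input, which would give a zero variance, is not handled.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1      # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented numbers for illustration, not values from the dataset.
human_scores = [4.0, 3.0, 3.0, 1.5, 1.0]
measure_scores = [1.0, 0.5, 0.33, 0.25, 0.2]
print(round(spearman(human_scores, measure_scores), 3))  # 0.975
```

Because only ranks matter, a measure does not need to be on the same scale as the human ratings to correlate well with them.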
    • 27. Comparison with Pedersen, et al.
      Semantic Similarity Measures
      Path
      Leacock and Chodorow, 1998
      Source: SNOMED-CT from UMLS 2008AB
      Relations: PAR/CHD
      Comparison with human annotations
      Spearman Rank Correlation Coefficient
    • 28. Comparison with Pedersen, et al.
    • 29. Comparison with Pedersen, et al.
    • 30. Nguyen and Al-Mubaid
      Semantic Similarity Measures
      Nguyen and Al-Mubaid, 2006
      Leacock and Chodorow, 1998
      Wu and Palmer, 1994
      Path
      Source: MSH
      Same Dataset created by Pedersen, et al.
      Data
      29 medical term pairs
      Similarity determined by:
      9 Medical Coders
      3 Physicians
      Spearman’s Rank Correlation Coefficient
    • 31. Comparison with Nguyen and Al-Mubaid
      Semantic Similarity Measures
      Nguyen and Al-Mubaid, 2006
      Leacock and Chodorow, 1998
      Wu and Palmer, 1994
      Path
      Source: MSH from UMLS 2008AB
      Relations: PAR/CHD
      Comparison with human annotations
      Spearman Rank Correlation Coefficient
    • 32. Comparison with Nguyen and Al-Mubaid
    • 33. Caviedes and Cimino
      Semantic Similarity Measure
      Conceptual Distance – Rada, et al.
      Source: MSH
      Relations: PAR/CHD
      Data
      10 medical term pairs using the following CUIs
      Digestive system disease: C0012242
      Peptic esophagitis: C0014869
      Psychotherapy: C0033968
      Thirst: C0039971
      Thoracic duct: C0039979
    • 34. Comparison with Caviedes and Cimino
      Semantic Similarity Measures
      Conceptual Distance
      Originally proposed by Rada et al., 1989
      Source: MSH from UMLS 2008AB
      Relations: PAR/CHD
      Comparison between the Conceptual Distance Scores
    • 35. Comparison with Caviedes and Cimino
    • 36. Comparison with Caviedes and Cimino
    • 37. Results
      The results show that UMLS-Similarity can be used to reproduce the results reported by:
      Pedersen, et al.
      Caviedes and Cimino
    • 38. Results
      The correlation results obtained by UMLS-Similarity and reported by Nguyen and Al-Mubaid vary
      Different versions of MSH were used to conduct the experiment
      Possibly different mappings of the terms to CUIs in MSH were used
      Information used by Nguyen and Al-Mubaid comes directly from MSH, which in the UMLS is located in MRHIER and as PAR/CHD relations in MRREL
      For MSH it is not possible to generate MRHIER from MRREL: the full path-to-root is a transitive closure of the pairwise PAR/CHD relations, which does not hold for MSH because an MSH concept may have different children depending on its tree position
    • 39. Conclusions
      UMLS-Similarity
      Used to determine the similarity between two concepts given a specified set of sources and relations
      Contains the following similarity measures
      Path measure
      Conceptual Distance proposed by Rada et al., 1989
      Leacock and Chodorow, 1998
      Wu and Palmer, 1994
      Nguyen and Al-Mubaid, 2006
      UMLS-Interface
      Used to obtain path information about a CUI given a specified set of sources and relations
    • 40. Future Work
      UMLS-Interface
      Improve the efficiency with which the path information is stored
      UMLS-Similarity
      Information Content Similarity Measures
      Resnik, 1995
      Jiang and Conrath, 1997
      Lin, 1997
      Relatedness Measures
      Patwardhan, 2003
    • 41. Take Home Message #1
      UMLS-Interface can be used to extract path information about a concept given a specified set of sources and relations.
    • 42. Take Home Message #2
      UMLS-Similarity can be used to compute the semantic similarity between two concepts given a specified set of sources and relations.
    • 43. Availability
      UMLS-Interface
      http://search.cpan.org/dist/UMLS-Interface
      UMLS-Similarity
      http://search.cpan.org/dist/UMLS-Similarity
    • 44. Thank You
      We would like to thank
      Kin Wah Fung,
      Olivier Bodenreider,
      Jan Willis
      Lan Aronson
      The research was supported in parts by:
      Fellowships:
      NLM Research Participation Program
      GAANN fellowship from US Dept. of Ed.
      Grants
      IR01LM009623-01A2 from NIH, NLM
