Talk at UAB, April 12, 2013

These are the slides for a talk given at the University of Alabama, Birmingham on April 19, 2013. The title of the talk is "Measuring Similarity and Relatedness in the Biomedical Domain : Methods and Applications"



Usage Rights

CC Attribution-ShareAlike License


    Presentation Transcript

    • Measuring Semantic Similarity and Relatedness in the Biomedical Domain: Methods and Applications
      Ted Pedersen, Ph.D.
      Department of Computer Science
      University of Minnesota, Duluth
      tpederse@d.umn.edu
      http://www.d.umn.edu/~tpederse
    • Topics
      ● Semantic similarity vs. semantic relatedness
      ● How to measure similarity
        – With ontologies and corpora
      ● How to measure relatedness
        – With definitions and corpora
      ● Applications?
        – Word Sense Disambiguation
        – Sentiment Classification
    • What are we measuring?
      ● Concept pairs
        – Assign a numeric value that quantifies how similar or related two concepts are
      ● Not words
        – Must know concept underlying a word form
        – Cold may be temperature or illness
          ● Concept Mapping
          ● Word Sense Disambiguation
        – This tutorial assumes that's been resolved
    • Why?
      ● Being able to organize concepts by their similarity or relatedness to each other is a fundamental operation in the human mind, and in many problems in Natural Language Processing and Artificial Intelligence
      ● If we know a lot about X, and if we know Y is similar to X, then a lot of what we know about X may apply to Y
        – Use X to explain or categorize Y
    • GOOD NEWS! Free Open Source Software!
      ● WordNet::Similarity
        – http://wn-similarity.sourceforge.net
        – General English
        – Widely used (+750 citations)
      ● UMLS::Similarity
        – http://umls-similarity.sourceforge.net
        – Unified Medical Language System
        – Spun off from WordNet::Similarity
          ● But has added a whole lot!
    • Similar or Related?
      ● Similarity based on is-a relations
        – How much is X like Y?
        – Share ancestor in is-a hierarchy
          ● LCS : least common subsumer
          ● Closer / deeper the ancestor the more similar
      ● Tetanus and strep throat are similar
        – both are kinds-of bacterial infections
    • Least Common Subsumer (LCS)
    • Similar or Related?
      ● Relatedness more general
        – How much is X related to Y?
        – Many ways to be related
          ● is-a, part-of, treats, affects, symptom-of, ...
      ● Tetanus and deep cuts are related but they really aren't similar
        – (deep cuts can cause tetanus)
      ● All similar concepts are related, but not all related concepts are similar
    • Measures of Similarity (WordNet::Similarity & UMLS::Similarity)
      ● Path Based
        – Rada et al., 1989 (path)
        – Caviedes & Cimino, 2004 (cdist)*
          ● cdist only in UMLS::Similarity
      ● Path + Depth
        – Wu & Palmer, 1994 (wup)
        – Leacock & Chodorow, 1998 (lch)
        – Zhong et al., 2002 (zhong)*
        – Nguyen & Al-Mubaid, 2006 (nam)*
          ● zhong and nam only in UMLS::Similarity
    • Measures of Similarity (WordNet::Similarity & UMLS::Similarity)
      ● Path + Information Content
        – Resnik, 1995 (res)
        – Jiang & Conrath, 1997 (jcn)
        – Lin, 1998 (lin)
    • Path Based Measures
      ● Distance between concepts (nodes) in tree intuitively appealing
      ● Spatial orientation, good for networks or maps but not is-a hierarchies
        – Reasonable approximation sometimes
        – Assumes all paths have same “weight”
        – But, more specific (deeper) paths tend to travel less semantic distance
      ● Shortest path a good start, but needs corrections
    • Shortest is-a Path
      ● $\mathrm{path}(a,b) = \dfrac{1}{\text{shortest is-a path}(a,b)}$
    • We count nodes...
      ● Maximum = 1
        – self similarity
        – path(tetanus, tetanus) = 1
      ● Minimum = 1 / (longest path in is-a tree)
        – path(typhoid, oral thrush) = 1/7
        – path(moccasin athlete's foot, strep throat) = 1/7
        – etc...
    • path(strep throat, tetanus) = .25
    • path(bacterial infection, yeast infection) = .25
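The node-counting path measure can be sketched in a few lines of Python. The hierarchy below is hypothetical: the slides' figure is not part of this transcript, so the intermediate node names ("infection", "streptococcal infection", "fungal infection") are assumptions chosen only so that the path lengths agree with the values quoted on the slides.

```python
# Hypothetical is-a hierarchy stored as child -> parent links.
# The intermediate node names are assumed; only the resulting path
# lengths are meant to agree with the slides.
PARENT = {
    "bacterial infection": "infection",
    "fungal infection": "infection",
    "tetanus": "bacterial infection",
    "streptococcal infection": "bacterial infection",
    "strep throat": "streptococcal infection",
    "yeast infection": "fungal infection",
    "oral thrush": "yeast infection",
}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def nodes_on_path(a, b):
    """Count the nodes on the shortest is-a path between a and b."""
    chain_b = ancestors(b)
    for steps_a, node in enumerate(ancestors(a)):
        if node in chain_b:  # first shared ancestor is the LCS
            return steps_a + chain_b.index(node) + 1
    raise ValueError("concepts share no ancestor")


def path_similarity(a, b):
    """path(a, b) = 1 / (number of nodes on the shortest is-a path)."""
    return 1.0 / nodes_on_path(a, b)


print(path_similarity("strep throat", "tetanus"))
print(path_similarity("bacterial infection", "yeast infection"))
```

Because nodes rather than edges are counted, a concept compared with itself has a path of one node and a similarity of 1.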
    • ?
      ● Are bacterial infection and yeast infection similar to the same degree as are tetanus and strep throat?
      ● The path measure says “yes, they are.”
    • Path + Depth
      ● Path only doesn't account for specificity
      ● Deeper concepts more specific
      ● Paths between deeper concepts travel less semantic distance
    • Wu and Palmer, 1994
      ● $\mathrm{wup}(a,b) = \dfrac{2 \cdot \mathrm{depth}(\mathrm{LCS}(a,b))}{\mathrm{depth}(a) + \mathrm{depth}(b)}$
      ● depth(x) = shortest is-a path(root, x)
    • wup(strep throat, tetanus) = (2*2)/(4+3) = .57
    • wup(bacterial infections, yeast infections) = (2*1)/(2+3) = .4
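A minimal Python sketch of the Wu and Palmer measure follows. It assumes a hypothetical hierarchy (the intermediate node names are invented, since the slides' figure is not in this transcript) and assumes depth is counted in nodes with the root at depth 1, which is what the worked values on the slides imply.

```python
# Hypothetical is-a hierarchy (child -> parent); node names other than
# those quoted on the slides are assumptions.
PARENT = {
    "bacterial infection": "infection",
    "fungal infection": "infection",
    "tetanus": "bacterial infection",
    "streptococcal infection": "bacterial infection",
    "strep throat": "streptococcal infection",
    "yeast infection": "fungal infection",
}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(concept):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(ancestors(concept))


def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    shared = set(ancestors(b))
    for node in ancestors(a):
        if node in shared:
            return node
    raise ValueError("concepts share no ancestor")


def wup(a, b):
    """wup(a, b) = 2 * depth(LCS(a, b)) / (depth(a) + depth(b))."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))


print(round(wup("strep throat", "tetanus"), 2))
print(round(wup("bacterial infection", "yeast infection"), 2))
```

Unlike the path measure, the two pairs now receive different scores because the shared ancestor of the first pair sits deeper in the hierarchy.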
    • ?
      ● Wu and Palmer say that strep throat and tetanus (.57) are more similar than are bacterial infections and yeast infections (.4)
      ● Path says that strep throat and tetanus (.25) are equally similar as are bacterial infections and yeast infections (.25)
    • Information Content
      ● ic(concept) = -log p(concept) [Resnik 1995]
        – Need to count concepts
        – Term frequency + Inherited frequency
        – p(concept) = (tf + if) / N
      ● Depth shows specificity but not frequency
      ● Low frequency concepts often much more specific than high frequency ones
        – Related to Zipf's Law of Meaning? (more frequent words have more senses)
    • Information Content: term frequency (tf)
    • Information Content: inherited frequency (if)
    • Information Content (IC = -log(f/N)): final count (f = tf + if, N = 365,820)
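The counting behind information content can be sketched as follows. Everything numeric here is hypothetical: the term frequencies are invented, N is taken to be the final count of the root (an assumption, so that the root has IC 0), and the natural logarithm is assumed because the slides do not state a base.

```python
import math
from collections import defaultdict

# Hypothetical is-a hierarchy (child -> parent) and invented term
# frequencies; neither comes from the slides.
PARENT = {
    "bacterial infection": "infection",
    "yeast infection": "infection",
    "tetanus": "bacterial infection",
    "strep throat": "bacterial infection",
}
TERM_FREQ = {
    "infection": 50,
    "bacterial infection": 20,
    "yeast infection": 10,
    "tetanus": 5,
    "strep throat": 15,
}

# Invert the child -> parent links so counts can be summed downward.
CHILDREN = defaultdict(list)
for child, parent in PARENT.items():
    CHILDREN[parent].append(child)


def final_count(concept):
    """f = tf + if: own frequency plus the counts of all descendants."""
    inherited = sum(final_count(child) for child in CHILDREN[concept])
    return TERM_FREQ.get(concept, 0) + inherited


def information_content(concept, root="infection"):
    """IC = -log(f / N), with N assumed to be the root's final count."""
    return -math.log(final_count(concept) / final_count(root))


for name in ("infection", "bacterial infection", "tetanus"):
    print(name, final_count(name), round(information_content(name), 2))
```

A concept whose final count is zero would make the logarithm undefined, so a real implementation needs some smoothing or a special case; this sketch sidesteps the issue by using only positive counts.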
    • Lin, 1998
      ● $\mathrm{lin}(a,b) = \dfrac{2 \cdot \mathrm{IC}(\mathrm{LCS}(a,b))}{\mathrm{IC}(a) + \mathrm{IC}(b)}$
      ● Look familiar?
    • Lin, 1998
      ● $\mathrm{lin}(a,b) = \dfrac{2 \cdot \mathrm{IC}(\mathrm{LCS}(a,b))}{\mathrm{IC}(a) + \mathrm{IC}(b)}$
      ● Look familiar?
      ● $\mathrm{wup}(a,b) = \dfrac{2 \cdot \mathrm{depth}(\mathrm{LCS}(a,b))}{\mathrm{depth}(a) + \mathrm{depth}(b)}$
    • lin(strep throat, tetanus) = 2 * 2.26 / (5.21 + 4.11) = 0.485
    • lin(bacterial infection, yeast infection) = 2 * 0.71 / (2.26 + 2.81) = 0.280
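The arithmetic on the two slides above can be checked with a short Python function. The function takes information content values directly; the numbers passed in are the ones quoted on the slides, and the function name and argument order are just illustrative choices.

```python
def lin(ic_lcs, ic_a, ic_b):
    """lin(a, b) = 2 * IC(LCS(a, b)) / (IC(a) + IC(b))."""
    return 2 * ic_lcs / (ic_a + ic_b)


# IC values as quoted on the slides.
strep_vs_tetanus = lin(ic_lcs=2.26, ic_a=5.21, ic_b=4.11)
bacterial_vs_yeast = lin(ic_lcs=0.71, ic_a=2.26, ic_b=2.81)

print(round(strep_vs_tetanus, 3))
print(round(bacterial_vs_yeast, 3))
```

The formula has the same shape as Wu and Palmer's, with information content standing in for depth.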
    • ?
      ● Lin says that strep throat and tetanus (.49) are more similar than are bacterial infection and yeast infection (.28)
      ● Wu and Palmer say that strep throat and tetanus (.57) are more similar than are bacterial infection and yeast infection (.4)
      ● Path says that strep throat and tetanus (.25) are equally similar as are bacterial infection and yeast infection (.25)
    • How to decide??
      ● Hierarchies best suited for nouns
      ● If you have a hierarchy of concepts, shortest path can be distorted/misleading
      ● If the hierarchy is carefully developed and well balanced, then wup can perform well
      ● If the hierarchy is not balanced or unevenly developed, the information content measures can help correct that
    • What about concepts not connected via is-a relations?
      ● Connected via other relations?
        – Part-of, treatment-of, causes, etc.
      ● Not connected at all?
        – In different sections (axes) of an ontology (infections and treatments)
        – In different ontologies entirely (SNOMED CT and FMA)
      ● Relatedness!
        – Use definition information
        – No is-a relations so can't be similarity
    • Measures of relatedness
      ● Path based
        – Hirst & St-Onge, 1998 (hso)
      ● Definition based
        – Lesk, 1986
        – Adapted lesk (lesk)
          ● Banerjee & Pedersen, 2003
      ● Definition + corpus
        – Gloss Vector (vector)
          ● Patwardhan & Pedersen, 2006
    • Path based relatedness
      ● Ontologies include relations other than is-a
      ● These can be used to find shortest paths between concepts
        – However, a path made up of different kinds of relations can lead to big semantic jumps
        – Aspirin treats headaches which are a symptom of the flu which can be prevented by a flu vaccine which is recommended for children
          ● …. so aspirin and children are related ??
    • Measuring relatedness with definitions
      ● Related concepts defined using many of the same terms
      ● But, definitions are short, inconsistent
      ● Concepts don't need to be connected via relations or paths to measure them
        – Lesk, 1986
        – Adapted Lesk, Banerjee & Pedersen, 2003
    • Two separate ontologies...
    • Could join them together … ?
    • Each concept has definition
    • Find overlaps in definitions...
    • Overlaps
      ● Oral Thrush and Alopecia
        – side effect of chemotherapy
          ● Can't see this in structure of is-a hierarchies
          ● Oral thrush and folliculitis just as similar
      ● Alopecia and Folliculitis
        – hair disorder & hair
          ● Reflects structure of is-a hierarchies
      ● If you start with text like this maybe you can build is-a hierarchies automatically!
        – Future work...
    • Lesk and Adapted Lesk
      ● Lesk, 1986 : measure overlaps in definitions to assign senses to words
        – The more overlaps between two senses (concepts), the more related
      ● Banerjee & Pedersen, 2003, Adapted Lesk
        – Augment definition of each concept with definitions of related concepts
          ● Build a super gloss
        – Increase chance of finding overlaps
      ● lesk in WordNet::Similarity & UMLS::Similarity
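The idea of counting definition overlaps, and of augmenting a definition with those of related concepts, can be sketched as below. This is a simplification under stated assumptions: the glosses, the relation table and the stopword list are all hypothetical, and the score counts shared single words only. The published Adapted Lesk measure scores multi-word overlaps more heavily, which this sketch does not attempt.

```python
# Hypothetical stopword list, glosses and relations (not from UMLS or WordNet).
STOPWORDS = {"a", "an", "the", "of", "in", "and", "or", "to"}

GLOSSES = {
    "alopecia": "loss of hair",
    "folliculitis": "inflammation of a follicle",
    "hair disorder": "disorder of the hair or follicle",
}
RELATED = {
    "alopecia": ["hair disorder"],
    "folliculitis": ["hair disorder"],
}


def content_words(gloss):
    """Lower-case the gloss and drop stopwords."""
    return {word for word in gloss.lower().split() if word not in STOPWORDS}


def overlap(gloss_a, gloss_b):
    """Number of distinct content words shared by two glosses."""
    return len(content_words(gloss_a) & content_words(gloss_b))


def super_gloss(concept):
    """Concatenate a concept's gloss with the glosses of related concepts."""
    parts = [GLOSSES[concept]]
    parts += [GLOSSES[other] for other in RELATED.get(concept, [])]
    return " ".join(parts)


plain = overlap(GLOSSES["alopecia"], GLOSSES["folliculitis"])
augmented = overlap(super_gloss("alopecia"), super_gloss("folliculitis"))
print(plain, augmented)
```

With these invented glosses the bare definitions share no words, while the super glosses share several, which is the effect the augmentation step is meant to produce.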
    • The problem with definitions ...
      ● Definitions contain variations of terminology that make it impossible to find exact overlaps
      ● Alopecia : … a result of cancer treatment
      ● Thrush : … a side effect of chemotherapy
        – Real life example, I modified the alopecia definition to work better with Lesk!!!
        – NO MATCHES!!
      ● How can we see that “result” and “side effect” are similar, as are “cancer treatment” and “chemotherapy” ?
    • Gloss Vector Measure of Semantic Relatedness
      ● Rely on co-occurrences of terms
        – Terms that occur within some given number of terms of each other
      ● Allows for a fuzzier notion of matching
      ● Exploits second order co-occurrences
        – Friend of a friend relation
        – Suppose cancer_treatment and chemotherapy don't occur in text with each other. But, suppose that “survival” occurs with each.
        – cancer_treatment and chemotherapy are second order co-occurrences via “survival”
    • Gloss Vector Measure of Semantic Relatedness
      ● Replace words or terms in definitions with vector of co-occurrences observed in corpus
      ● Defined concept now represented by an averaged vector of co-occurrences
      ● Measure relatedness of concepts via cosine between their respective vectors
      ● Patwardhan and Pedersen, 2006 (EACL)
        – Inspired by Schütze, 1998 (CL)
      ● vector in WordNet::Similarity & UMLS::Similarity
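The second order co-occurrence idea can be illustrated with a toy example. The three-sentence corpus and the tokenized glosses below are invented, and co-occurrence is counted within a whole sentence rather than within a fixed window of terms; both are assumptions made to keep the sketch short, not a description of how the released packages build their vectors.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus; each inner list is one tokenized sentence.
CORPUS = [
    ["chemotherapy", "survival", "rates"],
    ["cancer_treatment", "survival", "outcomes"],
    ["hair", "follicle", "growth"],
]


def cooccurrence_vectors(corpus):
    """Map each word to counts of the words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for sentence in corpus:
        for word in sentence:
            for other in sentence:
                if other != word:
                    vectors[word][other] += 1
    return vectors


def gloss_vector(gloss_words, vectors):
    """Average the co-occurrence vectors of the words in a gloss."""
    total = Counter()
    for word in gloss_words:
        total.update(vectors.get(word, {}))  # unseen words contribute nothing
    return {term: count / len(gloss_words) for term, count in total.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


VECTORS = cooccurrence_vectors(CORPUS)
thrush = gloss_vector(["side", "effect", "chemotherapy"], VECTORS)
alopecia = gloss_vector(["result", "cancer_treatment"], VECTORS)
hair = gloss_vector(["hair"], VECTORS)

print(round(cosine(thrush, alopecia), 2), round(cosine(thrush, hair), 2))
```

The two glosses share no words, yet their vectors have a positive cosine because "chemotherapy" and "cancer_treatment" both co-occur with "survival" in the toy corpus.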
    • Experimental Results
      ● Vector > Lesk > Info Content > Depth > Path
        – Clear trend across various studies
      ● Dramatic differences when comparing to human reference standards (Vector > Lesk >> Info Content > Depth > Path)
        – Banerjee and Pedersen, 2003 (IJCAI)
        – Pedersen, et al. 2007 (JBI)
      ● Differences less extreme in extrinsic task-based evaluations
        – Human raters mix up similarity & relatedness?
    • So far we've shown that ...
      ● … we can quantify the similarity and relatedness between concepts using a variety of sources of information
        – Paths
        – Depths
        – Information content
        – Definitions
        – Co-occurrence / corpus data
      ● There is open source software to help you!
    • Sounds great! What now?
      ● SenseRelate Hypothesis : Most words in text will have multiple possible senses and will often be used with the sense most related to those of surrounding words
        – He either has a cold or the flu
          ● Cold not likely to mean air temperature
      ● The underlying sentiment of a text can be discovered by determining which emotion is most related to the words in that text
        – I cried a lot after my mother died.
          ● Happy?
    • SenseRelate!
      ● In coherent text words will be used in similar or related senses, and these will also be related to the overall topic or mood of a text
      ● First applied to WSD in 2002
        – Banerjee and Pedersen, 2002 (WordNet)
        – Patwardhan et al., 2003 (WordNet)
        – Pedersen and Kolhatkar 2009 (WordNet)
        – McInnes et al., 2011 (UMLS)
      ● Recently applied to emotion classification
        – Pedersen, 2012 (i2b2 suicide notes challenge)
    • GOOD NEWS! Free Open Source Software!
      ● WordNet::SenseRelate
        – AllWords, TargetWord, WordToSet
        – http://senserelate.sourceforge.net
      ● UMLS::SenseRelate
        – AllWords
        – http://search.cpan.org/dist/UMLS-SenseRelate/
    • SenseRelate for WSD
      ● Assign each word the sense which is most similar or related to one or more of its neighbors
        – Pairwise
        – 2 or more neighbors
      ● Pairwise algorithm results in a trellis much like in HMMs
        – More neighbors adds lots of information and a lot of computational complexity
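The sense-assignment rule described above can be sketched as a small scoring loop. This is an illustration only, not the code of the SenseRelate packages: the sense inventory, the sense labels and the relatedness scores are all invented, and a real system would obtain the scores from one of the measures discussed earlier.

```python
# Invented sense inventory and relatedness scores for illustration.
SENSES = {
    "cold": ["cold#illness", "cold#temperature"],
    "flu": ["flu#illness"],
}
RELATEDNESS = {
    frozenset(["cold#illness", "flu#illness"]): 0.8,
    frozenset(["cold#temperature", "flu#illness"]): 0.1,
}


def relatedness(sense_a, sense_b):
    """Symmetric lookup; unlisted pairs are treated as unrelated."""
    return RELATEDNESS.get(frozenset([sense_a, sense_b]), 0.0)


def sense_score(sense, neighbors):
    """Sum, over neighbors, of the best score against any neighbor sense."""
    return sum(
        max(relatedness(sense, other) for other in SENSES[neighbor])
        for neighbor in neighbors
    )


def disambiguate(target, neighbors):
    """Pick the sense of target most related to its neighboring words."""
    return max(SENSES[target], key=lambda sense: sense_score(sense, neighbors))


# "He either has a cold or the flu"
print(disambiguate("cold", ["flu"]))
```

Each extra neighbor adds one more term to every candidate sense's score, which is where the added information and the added cost of larger windows both come from.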
    • SenseRelate - pairwise
    • SenseRelate – 2 neighbors
    • General Observations on WSD Results
      ● Nouns more accurate; verbs, adjectives, and adverbs less so
      ● Increasing the window size nearly always improves performance
      ● Jiang-Conrath measure often a high performer for nouns (e.g., Patwardhan et al. 2003)
      ● Info content measures perform well with clinical text (McInnes et al. 2011)
      ● Vector and lesk have coverage advantage
        – handle mixed pairs while others don't
    • Recent Specific Experiment
      ● Compare efficacy of different measures when performing WSD using UMLS::SenseRelate
      ● Evaluate on MSH-WSD data (from NLM)
      ● Information Content based on concept counts from Medline (UMLSonMedline, from NLM)
      ● More details available
        – McInnes, et al. 2011 (AMIA)
        – McInnes & Pedersen, in review
    • MSH-WSD data set
      ● Contains 203 ambiguous terms and acronyms
        – Instances are from Medline
        – CUIs from 2009 AB version of UMLS
        – Each word has avg. 187 instances, 2.08 possible senses, and 54.5% majority sense
      ● Leverages fact that Medline is manually indexed with Medical Subject Headings (associated with CUIs)
      ● http://wsd.nlm.nih.gov/collaboration.shtml
    • Results
                      Path based     Information Content   Relatedness
      Window size     path    wup    jcn     lin           lesk    vector
      2               .63     .63    .65     .65           .67     .68
      5               .66     .67    .68     .69           .68     .68
      10              .68     .69    .70     .71           .68     .67
      25              .70     .70    .73     .74           .68     .65
    • SenseRelate for Sentiment Classification
      ● Find emotion most related to context
        – Similarity less effective since many words can be related to an emotion, but fewer are similar
          ● Related to happy? : love, food, success, ...
          ● Similar to happy? : joyful, ecstatic, pleased, …
        – Pairwise comparisons between emotion and senses of words in context
          ● Same form as Naive Bayesian model or Latent Variable model
        – WordNet::SenseRelate::WordToSet
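The pairwise comparison between candidate emotions and the senses of context words can be sketched as follows. This is a hedged illustration rather than the behavior of WordNet::SenseRelate::WordToSet: the sense labels and all relatedness scores are invented, and summing the best per-word score is just one plausible way to combine the pairs.

```python
# Invented senses for two context words and invented relatedness scores.
SENSES = {
    "cried": ["cry#weep", "cry#shout"],
    "died": ["die#perish"],
}
RELATEDNESS = {
    ("sad", "cry#weep"): 0.9,
    ("sad", "cry#shout"): 0.3,
    ("sad", "die#perish"): 0.8,
    ("happy", "cry#weep"): 0.2,
    ("happy", "cry#shout"): 0.3,
    ("happy", "die#perish"): 0.05,
}


def emotion_score(emotion, words):
    """Sum, over context words, of the best score against any word sense."""
    return sum(
        max(RELATEDNESS.get((emotion, sense), 0.0) for sense in SENSES[word])
        for word in words
    )


def classify(words, emotions):
    """Return the emotion most related to the context words."""
    return max(emotions, key=lambda emotion: emotion_score(emotion, words))


# Content words from "I cried a lot after my mother died."
print(classify(["cried", "died"], ["happy", "sad"]))
```

Relatedness rather than similarity drives the choice here: "died" is not a kind of sadness, but it is strongly related to it.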
    • SenseRelate - WordToSet
    • Experimental Results
      ● Sentiment classification results in 2011 i2b2 suicide notes challenge were disappointing (Pedersen, 2012)
        – Suicide notes not very emotional!
        – In many cases reflect a decision made and focus on settling affairs
    • Future Work
      ● Find new domains and types of problems
        – EHR, clinical records, …
      ● Integrate Unsupervised Clustering with WordNet::Similarity and UMLS::Similarity
        – http://senseclusters.sourceforge.net
      ● Exploit graphical nature of SenseRelate
        – e.g., Minimal Spanning Trees / Viterbi Algorithm to solve larger problem spaces?
      ● Attract and support users for all of these tools!
    • UMLS::Similarity Collaborators
      ● Serguei Pakhomov :
        – Assoc. Professor, UMTC
      ● Bridget McInnes :
        – PhD UMTC, 2009
        – Post-doc UMTC, 2009 – 2011
        – Now at Securboration, NC
      ● Ying Liu :
        – PhD UAB, 2007
        – Post-doc UMTC, 2009 – 2011
        – Until recently at City of Hope, LA
    • Acknowledgments
      ● This work on semantic similarity and relatedness has been supported by a National Science Foundation CAREER award (2001 – 2007, #0092784, PI Pedersen) and by the National Library of Medicine, National Institutes of Health (2008 – 2012, 1R01LM009623-01A2, PI Pakhomov)
      ● The contents of this talk are solely my responsibility and do not necessarily represent the official views of the National Science Foundation or the National Institutes of Health.
    • Conclusion
      ● Measures of semantic similarity and relatedness are supported by a rich body of theory, and open source software
        – http://wn-similarity.sourceforge.net
        – http://umls-similarity.sourceforge.net
          ● http://atlas.ahc.umn.edu
      ● These measures can be used as building blocks for many NLP and AI applications
        – Word sense disambiguation
        – Sentiment classification
    • References
      ● S. Banerjee and T. Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, pages 136-145, Mexico City, February 2002.
      ● S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 805-810, Acapulco, August 2003.
      ● J. Caviedes and J. Cimino. Towards the development of a conceptual distance metric for the UMLS. Journal of Biomedical Informatics, 37(2):77-85, April 2004.
      ● J. Jiang and D. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings on International Conference on Research in Computational Linguistics, pages 19-33, Taiwan, 1997.
    • References
      ● C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum, editor, WordNet: An electronic lexical database, pages 265-283. MIT Press, 1998.
      ● M.E. Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation, pages 24-26. ACM Press, 1986.
      ● D. Lin. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, August 1998.
      ● B. McInnes, T. Pedersen, Y. Liu, G. Melton and S. Pakhomov. Knowledge-based Method for Determining the Meaning of Ambiguous Biomedical Terms Using Information Content Measures of Similarity. In Proceedings of the Annual Symposium of the American Medical Informatics Association, pages 895-904, Washington, DC, October 2011.
    • References
      ● H.A. Nguyen and H. Al-Mubaid. New ontology-based semantic similarity measure for the biomedical domain. In Proceedings of the IEEE International Conference on Granular Computing, pages 623-628, Atlanta, GA, May 2006.
      ● S. Patwardhan, S. Banerjee, and T. Pedersen. Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 241-257, Mexico City, February 2003.
      ● S. Patwardhan and T. Pedersen. Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts. In Proceedings of the EACL 2006 Workshop on Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together, pages 1-8, Trento, Italy, April 2006.
      ● T. Pedersen. Rule-based and lightly supervised methods to predict emotions in suicide notes. Biomedical Informatics Insights, 2012:5 (Suppl. 1):185-193, January 2012.
    • References
      ● T. Pedersen and V. Kolhatkar. WordNet::SenseRelate::AllWords - a broad coverage word sense tagger that maximizes semantic relatedness. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies 2009 Conference, pages 17-20, Boulder, CO, June 2009.
      ● T. Pedersen, S. Pakhomov, S. Patwardhan, and C. Chute. Measures of semantic similarity and relatedness in the biomedical domain. Journal of Biomedical Informatics, 40(3):288-299, June 2007.
      ● R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17-30, 1989.
    • References
      ● P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448-453, Montreal, August 1995.
      ● H. Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97-123, 1998.
      ● J. Zhong, H. Zhu, J. Li, and Y. Yu. Conceptual graph matching for semantic search. In Proceedings of the 10th International Conference on Conceptual Structures, pages 92-106, 2002.