Automated Identification of Similar Health Questions


Presentation to the AMIA 2013 Summit on Clinical Research Informatics, March 20, 2013

Published in: Health & Medicine
Geoffrey W. Rutledge MD, PhD
Chief Medical Information Officer

Introduction

People with health questions are increasingly looking for physician answers to their health questions online. Given the repetitive nature of common questions, there is a high value in identifying previously answered questions that are semantically similar (or identical) to each new question, so that an answer can be given without delay and without waiting for a new answer from a physician.

Background

Previous methods to evaluate question pairs were based on sentence similarity [1,2] and are not suitable for consumer health questions, which contain many consumer-health variations and frequent misspellings of medical concepts. We developed a method to identify questions with "high semantic similarity" from a corpus of consumer health questions and answers, in which the questions and answers are character limited to 150 and 400 characters respectively.

Method

We compare the text of new questions to the closest matching question from the Q&A corpus. For a set of 1,000 questions and their closest match, we evaluated the sensitivity and specificity of alternative similarity criteria for the assertion of "high semantic similarity." We first identified the most similar question within the Q&A corpus using a search engine augmented with a semantic-weight driven ontology of consumer health concepts, which includes a rich set of synonyms of consumer health terms, and frequent misspellings of consumer health terms.

The three similarity criteria tested are:

1. Lexical identity after removal of all non-alphanumeric characters
2. Sum of semantic weights of all matching health concepts
3. Sum of weights of only the moderate or high weight matching health concepts

Examples of medical concepts:

moderate weights: antibiotics, heart disease, sharp pain
high weights: penicillin, congestive heart failure, squeezing chest pain

Results

We compared the three similarity criteria against an expert assessment of question pair similarity. The sensitivities and specificities for the three criteria are:

(1) sensitivity 0.47, specificity 1
(2) sensitivity 0.61, specificity 0.99
(3) sensitivity 0.63, specificity 0.97

These values are plotted on the chart of False positive versus True positive rates (ROC). The criterion with the best performance was criterion 2, the sum of semantic weights of all matching concepts.

[Figure: ROC chart of true positive rate against false positive rate, with one point for each of criteria (1), (2), and (3).]

Discussion

The problem of identifying semantically similar health questions is complicated by the variability of consumer health language and the difficulties that consumers have in spelling medical terms. A comprehensive ontology and synonym set of consumer health terms enabled the accurate detection of a large fraction of semantically similar consumer health questions that were entered in an online health site.

The automated identification of similar consumer health questions is challenging because of the common occurrence of complex, colloquial, and often misspelled medical terms in consumer health questions. We collected online health questions and their paired "nearest search result" matching questions to evaluate 3 question similarity metrics. The best performing metric was based on the sum of semantic weights for all matching health concepts from a comprehensive ontology of consumer health terms and common misspellings, with a measured sensitivity of 0.61 and specificity of 0.99.

References

[1] The Evaluation of Sentence Similarity Measures. In: I.-Y. Song, J. Eder, and T.M. Nguyen (Eds.): DaWaK 2008, LNCS 5182, pp. 305–316, 2008.
[2] Finding Similar Questions in Large Question and Answer Archives. Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee. CIKM'05, October 31–November 5, 2005.
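
Illustrative sketch

The three similarity criteria and the sensitivity/specificity evaluation described in the poster can be illustrated with a small Python sketch. This is not the author's implementation. The numeric weights, the low-weight concept "medicine", the case folding in `normalize`, the naive phrase matcher in `concepts_in`, and all function names are assumptions made for illustration; the poster does not publish its ontology, weights, or decision thresholds.

```python
import re

# Hypothetical weights. The poster names these concepts as "moderate" or
# "high" weight but gives no numbers; "medicine" is an invented low-weight
# concept added only to show how criterion 3 differs from criterion 2.
CONCEPT_WEIGHTS = {
    "medicine": 0.25,                  # low (hypothetical)
    "antibiotics": 1.0,                # moderate
    "heart disease": 1.0,              # moderate
    "sharp pain": 1.0,                 # moderate
    "penicillin": 2.0,                 # high
    "congestive heart failure": 2.0,   # high
    "squeezing chest pain": 2.0,       # high
}
MODERATE = 1.0  # assumed cutoff for "moderate or high" weight


def normalize(text):
    """Drop every non-alphanumeric character (case folding is an assumption)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def lexically_identical(q1, q2):
    """Criterion 1: identical after removing non-alphanumeric characters."""
    return normalize(q1) == normalize(q2)


def concepts_in(text):
    """Naive stand-in for ontology lookup: whole-phrase matches only."""
    lowered = text.lower()
    return {c for c in CONCEPT_WEIGHTS
            if re.search(r"\b" + re.escape(c) + r"\b", lowered)}


def shared_weight(q1, q2, min_weight=0.0):
    """Criterion 2 (min_weight=0) or criterion 3 (min_weight=MODERATE):
    sum the weights of the concepts that appear in both questions."""
    shared = concepts_in(q1) & concepts_in(q2)
    return sum(CONCEPT_WEIGHTS[c] for c in shared
               if CONCEPT_WEIGHTS[c] >= min_weight)


def sensitivity_specificity(predicted, expert):
    """Compare predicted 'similar' flags with expert labels.
    Assumes at least one positive and one negative expert label."""
    pairs = list(zip(predicted, expert))
    tp = sum(1 for p, e in pairs if p and e)
    fn = sum(1 for p, e in pairs if not p and e)
    tn = sum(1 for p, e in pairs if not p and not e)
    fp = sum(1 for p, e in pairs if p and not e)
    return tp / (tp + fn), tn / (tn + fp)
```

To turn a weight sum into a "high semantic similarity" decision, one would compare it against a threshold, which the poster does not state. For reading the ROC chart, the false positive rate is 1 minus specificity, so the reported results correspond to the points (0, 0.47), (0.01, 0.61), and (0.03, 0.63) for criteria (1), (2), and (3).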