The Effect of Different Context Representations on Word Sense Discrimination in Biomedical Texts

Presentation slides for ACM IHI 2010 talk "The Effect of Different Context Representations on Word Sense Discrimination in Biomedical Texts"

  1. The Effect of Different Context Representations on Word Sense Discrimination in Biomedical Texts. Ted Pedersen, Department of Computer Science, University of Minnesota, Duluth. http://www.d.umn.edu/~tpederse
  2. Topics • Natural Language Processing − Semantic ambiguity − Disambiguation versus Discrimination • Text (Context) Representations − Latent Semantic Analysis (LSA) − Word Co-occurrence (Schütze) • Cluster contexts to discover word senses • Discrimination Experiments with MedLine
  3. Discrimination or Disambiguation? • Word sense disambiguation assigns a sense to a target word in context by selecting from a pre-existing set of possibilities (classification) − Are you a river bank or a money bank? • Word sense discrimination assigns a target word in context to a cluster without regard to any pre-existing sense inventory (discovery) − How many ways is bank being used?
  4. Discrimination Input: target word contexts • Wellbutrin is used to treat depression. • A deep groove or depression is painful. • It left a tender red depression on his wrist. • Counseling and medication helps depression.
  5. Discrimination Output: clusters of contexts • Wellbutrin is used to treat depression. • Counseling and medication helps depression. • A deep groove or depression is painful. • It left a tender red depression on his wrist.
  6. Word Sense Discrimination! • This is one way to identify senses in the first place - sense inventories aren't static − Word sense discrimination – identify senses − Craft a definition – sense labeling • When doing searches we often only need to know that a word is being used in several distinct senses (but may not care exactly what they are) − depression (economic, indentation, condition) − apache (helicopter, Native American, software)
  7. The Goal • To carry out word sense discrimination based strictly on empirical evidence in the text that contains our target words (local), or other text we can easily obtain (global) • Be knowledge-lean and avoid dependence on existing sense inventories, ontologies, etc. − Language independent − Domain independent − Discover new senses
  8. First Order Methods • Represent each target word context with a feature vector that shows which unigram or bigram features occur within it − Other features can be used, including part of speech tags, syntactic info, etc. • Results in a context by feature matrix − Each row is a context to be clustered − Each context contains a target word • Cluster − All the contexts in the same cluster are presumed to use the target word in the same sense
  9. First Order Representations: context by features
     • (2) A deep groove or depression is painful.
     • (3) It left a tender red depression on his wrist.
     But ... we know that painful and tender are very similar – we just can't see it here ...

            deep  depression  groove  left  painful  red  tender  wrist
       (2)     1           1       1     0        1    0       0      0
       (3)     0           1       0     1        0    1       1      1
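
A minimal sketch of how such a first-order context-by-feature matrix could be built. The tokenizer, variable names, and feature list are illustrative assumptions, not the SenseClusters implementation (which is written in Perl).

```python
# Sketch: build a first-order context-by-feature matrix (illustration only).
import numpy as np

def first_order_vectors(contexts, features):
    """Return a binary context-by-feature matrix: row i marks which
    features occur in context i."""
    index = {f: j for j, f in enumerate(features)}
    matrix = np.zeros((len(contexts), len(features)), dtype=int)
    for i, context in enumerate(contexts):
        for token in context.lower().split():
            j = index.get(token.strip(".,"))
            if j is not None:
                matrix[i, j] = 1
    return matrix

contexts = ["A deep groove or depression is painful.",
            "It left a tender red depression on his wrist."]
features = ["deep", "depression", "groove", "left",
            "painful", "red", "tender", "wrist"]
print(first_order_vectors(contexts, features))
# Rows (2) and (3) share only the 'depression' column, as on the slide.
```
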
  10. Look to the Second Order... • “You shall know a word by the company it keeps” (JR Firth, 1957) − ... know a friend by their friends − ... words co-occur with other words • Words that occur in similar contexts will tend to have similar meanings (Zelig Harris, 1954) − ... know a friend by the places they go − ... “Distributional Hypothesis”
  11. Look to the Second Order • Replace each word in a target word context with a vector that reveals something about that word − Replace a word by the company it keeps • Feature by feature matrix (Schütze) • Word co-occurrence matrix − Replace a word by the places it has been • Feature by context matrix (LSA) • Term by document matrix
  12. Feature by Feature Matrix ... to replace a word by the company it keeps ...

                   hurt  Wellbutrin  sore  bruise ...
       medication     1           1     0       0
       counseling     0           1     0       0
       tender         1           0     1       1
       painful        1           0     1       1
       red            1           0     1       0
  13. Feature by Context Matrix ... to replace a word by the places it has been ...

                   (100)  (101)  (102)  (103) ...
       medication      0      0      1      1
       counseling      0      1      1      0
       tender          1      1      0      1
       painful         1      1      0      1
       red             0      0      1      1
  14. Second Order Representations • Replace each word in a target word context with a vector − Feature by feature (Schütze) • The company it keeps − Feature by context (LSA) • The places it has been • Remove all words that don't have vectors • Average all word vectors together and represent the context with that averaged vector − Do the same with all other target word contexts, then cluster
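
A minimal sketch of the averaging step just described, assuming we already have a word-vector lookup (here a plain dict whose rows could come from either the feature-by-feature or the feature-by-context matrix above). Names and toy values are illustrative, not SenseClusters code.

```python
import numpy as np

def second_order_vector(context, word_vectors):
    """Represent a context by averaging the vectors of its words;
    words with no vector are simply dropped."""
    vecs = [word_vectors[w] for w in context.lower().split() if w in word_vectors]
    if not vecs:
        return None  # no known words in this context
    return np.mean(vecs, axis=0)

# Toy word-by-feature rows (columns: hurt, Wellbutrin, sore, bruise)
word_vectors = {
    "tender":  np.array([1, 0, 1, 1]),
    "painful": np.array([1, 0, 1, 1]),
    "red":     np.array([1, 0, 1, 0]),
}
print(second_order_vector("a deep groove or depression is painful", word_vectors))
print(second_order_vector("it left a tender red depression on his wrist", word_vectors))
# The two averaged vectors now overlap (both load on 'hurt' and 'sore'),
# so the contexts can be clustered together even with no shared surface words.
```
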
  15. Second order representations • (2): A deep groove or depression is painful. • (3): It left a tender red depression on his wrist. • Nothing matches in the first order representation, but in the second order, since painful and tender ... − both occur with hurt, there is some similarity between (2) and (3) − both occur in documents 100, 101, and 103, there is some similarity between (2) and (3)
  16. The Question • Which method of representing contexts is best able to discover and discriminate among senses? − First order feature vectors • Traditional vector space • o1-Ngram − Replace word by the company it keeps • Schütze • o2-SC − Replace word by the places it has been • LSA • o2-LSA
  17. Experimental Methodology • Collect contexts with a given target word • Identify lexical features (unigrams or bigrams) within the contexts or in other global data • Use these features to represent contexts using first or second order methods − Perform SVD (optional) • Cluster − Number of clusters automatically discovered − Generate a label for each cluster • Evaluate
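
The optional SVD step in the methodology above could look roughly like the sketch below: reduce the word-by-feature matrix to fewer latent dimensions before the second-order averaging. The use of scikit-learn's TruncatedSVD and the toy matrix are assumptions for illustration; the actual experiments use SenseClusters' own SVD support.

```python
# Sketch: optional dimensionality reduction of a word-by-feature matrix with SVD.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
word_by_feature = rng.integers(0, 2, size=(500, 2000)).astype(float)  # toy co-occurrence counts

svd = TruncatedSVD(n_components=100, random_state=0)
reduced = svd.fit_transform(word_by_feature)   # 500 words x 100 latent dimensions
print(reduced.shape)
```
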
  18. Lexical Features • Unigrams − High frequency words that aren't in a stop list − Stop list typically made up of function words like the, for, and, but, etc. • Bigrams − Two word sequences (separated by up to 8 words) that occur more often than chance − Selected using Fisher's Exact Test (left sided) • p-value = .99 − Bigrams made up of stop words excluded • Can be identified in the target word contexts
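
As a rough illustration of the bigram selection just described, one could score each candidate bigram's 2x2 contingency table with a left-sided Fisher's exact test and keep bigrams whose p-value reaches the .99 cutoff (a high left-sided p-value means the pair co-occurs far more often than chance). The counts and the scipy call below are my assumptions for illustration; the original experiments use SenseClusters/the Ngram Statistics Package rather than scipy.

```python
# Sketch: keep a bigram if a left-sided Fisher's exact test on its 2x2
# contingency table gives p >= 0.99. Counts are made up for illustration.
from scipy.stats import fisher_exact

def keep_bigram(n11, n12, n21, n22, cutoff=0.99):
    """n11: count of the bigram; n12/n21: counts of each word without the
    other; n22: everything else. alternative='less' gives the left-sided p."""
    _, p_left = fisher_exact([[n11, n12], [n21, n22]], alternative="less")
    return p_left >= cutoff

print(keep_bigram(n11=30, n12=70, n21=120, n22=99780))    # strongly associated pair -> True
print(keep_bigram(n11=5, n12=995, n21=495, n22=98505))    # chance-level pair -> False
```
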
  19. Clustering • Repeated Bisections − Starts with all contexts in a single cluster, then repeatedly partitions a cluster in two to optimize the criterion function − Partitioning is done via k-means with k=2 • I2 criterion function − Finds the average similarity between each context in a cluster and the centroid, and sums across all clusters to find the value • Implemented in Cluto
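
A stripped-down sketch of repeated bisections in the spirit of this slide, using k-means with k=2 and an I2-style score (similarity of each member to its cluster centroid, summed over clusters). This is an assumption-laden re-implementation for illustration only: the experiments use Cluto, whose choice of which cluster to bisect is driven by the criterion function, whereas this sketch simply splits the largest cluster each time.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def i2_score(X, labels):
    """I2-style criterion: sum, over clusters, of each member's cosine
    similarity to the cluster centroid (rows of X are length-normalized)."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = normalize(members.mean(axis=0, keepdims=True))
        total += (members @ centroid.T).sum()
    return total

def repeated_bisections(X, k):
    """Start with one cluster and repeatedly bisect the largest cluster
    with 2-means until k clusters remain (simplified relative to Cluto)."""
    X = normalize(X)
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < k:
        target = np.bincount(labels).argmax()                    # largest cluster
        idx = np.where(labels == target)[0]
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        labels[idx[sub == 1]] = labels.max() + 1                 # spin off one half
    return labels, i2_score(X, labels)
```
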
  20. Cluster Stopping • Find the k where the criterion function stops improving • PK2 (Hartigan, 1975) takes the ratio of the criterion function at successive values of k • PK3 takes twice the criterion function at k divided by the product of the criterion function values at (k-1) and (k+1) − PK2 and PK3 stop when these ratios are within 1 standard deviation of 1 • Gap Statistic (Tibshirani, 2001) compares the observed data with a reference sample of noise, and finds the k with the greatest divergence from noise
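
For concreteness, the PK2 rule described on this slide can be written out as below. The notation and the "first k" reading of the stopping rule are my interpretation of the slide: crfun(k) is the clustering criterion function value at k clusters, and sigma_PK2 is the standard deviation of the observed PK2 values.

```latex
\mathrm{PK2}(k) \;=\; \frac{\mathrm{crfun}(k)}{\mathrm{crfun}(k-1)},
\qquad \text{stop at the first } k \text{ with } \bigl|\mathrm{PK2}(k) - 1\bigr| \le \sigma_{\mathrm{PK2}}
```
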
  21. Evaluation • Map discovered clusters to “actual” clusters, finding the assignment of discovered clusters to actual clusters that maximizes agreement • Assignment Problem − Hungarian Algorithm − Kuhn-Munkres Algorithm • Precision, Recall, F-measure
  22. Experiments • Isolate the context representations, holding everything else as equal as possible • Focus on biomedical text − Ambiguities exist, are often rather fine grained, and are not well represented in existing resources − Automatic mapping of terms to concepts is of great practical importance • Relatively small amounts of manually annotated evaluation data are available, so we create a new collection automatically (imperfectly)
  23. Experimental Data • Randomly select 60 MeSH preferred terms, and pair them randomly − Relatively unambiguous and moderately specific terms − Medical Subject Headings – used to index medical journal articles • “Create” 30 new ambiguous terms (pseudo words) that conflate the terms in a pair, e.g. COLON-&-LEG, PATIENT_CARE-&-OSTEOPOROSIS
  24. Experimental Data • Replace all occurrences of each member of a pair with the new conflated term − Select 1,000 – 10,000 MedLine abstracts that contain each pseudo word − Create a 50/50 split of the two “senses” • Discriminate into some number of clusters − Evaluate with the F-measure − Putting everything in one cluster results in 50%
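
A minimal sketch of the conflation step just described. The regular expression and the abstract sentences are made-up examples in the style of the pair list on slide 32; the real replacement code and patterns are not shown in the deck.

```python
import re

# Sketch: conflate the two members of a pair into one pseudo word.
pair_pattern = re.compile(r"\b(colon(s|ic)?|legs?)\b", re.IGNORECASE)

def conflate(abstract_text, pseudo_word="COLON-&-LEG"):
    """Replace every occurrence of either member of the pair with the
    conflated pseudo word, creating an artificially ambiguous target."""
    return pair_pattern.sub(pseudo_word, abstract_text)

print(conflate("The colon was inflamed."))
print(conflate("Pain radiated down the left leg."))
# Both abstracts now contain the same target word, COLON-&-LEG, whose
# two "senses" are the original terms.
```
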
  25. Experimental Settings • Unigrams and bigrams as features, selected from target word contexts − First order and second order methods • SVD optional with second order methods • Clustering with repeated bisections and I2 • Cluster stopping with PK2, PK3, and Gap • Evaluation relative to 30 pseudo words
  26. Experimental Results: F-Measure
  27. Experimental Results: discovered k
  28. Discussion of Results • Second order methods robust and accurate − o2-SC overall more accurate and better at predicting number of senses − SVD degrades results • First order unigrams effective but brittle • Conflated words not perfect but useful • Knowing the company a word keeps tells us (a bit) more about its meaning than knowing the places it has been
  29. Ongoing and Future Work • Averaging all vectors seems “coarse”; create context representations so that dominant word vectors stand out and noise recedes • Evaluate on more than 2-way distinctions, using manually created gold standard data • Use information in dictionaries and ontologies when we can, but don't be tied to them − UMLS::Similarity – free open source package that measures similarity and relatedness between concepts in the UMLS − http://search.cpan.org/dist/UMLS-Similarity/
  30. Thank you! • All experiments run with SenseClusters, a freely available open source package from the University of Minnesota, Duluth • http://senseclusters.sourceforge.net − Download software (for Linux) − Publications − Web interface for experiments • The creation of SenseClusters was funded by an NSF CAREER Award (#0092784). This particular study was supported by a grant from NIH/NLM (1R01LM009623-01A2).
  31. Extra Slides
  32. Experimental Data: colon(s|ic)? & legs? | patient care & osteoporosis | blood transfusions? & ventricular functions? | randomized controlled trials? & haplotypes? | vasodilations? & bronchoalveolar lavages? | toluenes? & thinking | duodenal ulcers? & clonidines? | myomas? & appetites? | glycolipids? & prenatal care | thoracic surger(y|ies) & cytogenetic analys(is|es) | measles virus(es)? & tissue extracts? | lanthanums? & curiums? | adrenal insufficienc(y|ies) & (recurrent )?laryngeal nerves? | glucokinases? & xeroderma pigmentosums? | polyvinyl alcohols? & polyribosomes? | urethral strictures? & resistance training | cholesterol esters? & premature births? | odontoblasts? & anurias? | brain infarctions? & health resources? | turbinates? & aphids? | cochlear nerves? & (protein )?kinases? inhibitors? | hematemesis & gemfibrozils? | nectars? & work of breathing | fusidic acids? & dicarboxylic acids? | brucellas? & potassium iodides? | walkers? & primidones? | hepatitis( b)? & flavoproteins? | prognathisms? & plant roots? | plant proteins? & (persistent )?vegetative states? | prophages? & porphyrias?
  33. Evaluation

              COLON   LEG
       C1        10    10  |  20
       C2        10    30  |  40
       C3        40     0  |  40
                 60    40  | 100

       After mapping discovered clusters to actual senses:

              COLON   LEG
       C3        40     0  |  40
       C2        10    30  |  40
       C1        10    10  |  20
                 60    40  | 100

     • Precision = 70/80 = 87.5%
     • Recall = 70/100 = 70%
     • F-Measure = 2*(87.5*70)/(87.5 + 70) = 77.8%
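
The mapping and scores on this slide can be reproduced with a standard assignment solver. Below is an illustrative sketch using scipy's linear_sum_assignment; treating the contexts of the unmatched cluster (C1) as unassigned when computing precision is my reading of how the slide arrives at 70/80.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = discovered clusters C1..C3, columns = actual senses (COLON, LEG),
# using the counts from this slide.
confusion = np.array([[10, 10],
                      [10, 30],
                      [40,  0]])

# Find the cluster-to-sense assignment that maximizes agreement
# (Hungarian / Kuhn-Munkres algorithm).
rows, cols = linear_sum_assignment(confusion, maximize=True)
correct = confusion[rows, cols].sum()   # 40 (C3 -> COLON) + 30 (C2 -> LEG) = 70
assigned = confusion[rows].sum()        # contexts in the matched clusters = 80
total = confusion.sum()                 # all contexts = 100

precision = correct / assigned          # 70/80  = 87.5%
recall = correct / total                # 70/100 = 70%
f_measure = 2 * precision * recall / (precision + recall)   # 77.8%
print(precision, recall, f_measure)
```
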
  34. Experimental Results

                 o1-big       o1-uni       o2-sc        o2-sc-svd    o2-lsa       o2-lsa-svd
       PK2       64.63 (5.5)  75.24 (4.0)  90.74 (2.2)  57.52 (5.3)  84.16 (2.9)  57.89 (5.3)
       PK3       75.08 (3.8)  84.24 (3.0)  90.68 (2.4)  69.44 (2.5)  87.43 (2.3)  67.85 (2.4)
       Gap       65.51 (6.2)  87.50 (1.9)  88.57 (2.2)  50.00 (1.0)  83.93 (2.3)  49.56 (1.3)

       (F-measure, with the average number of clusters discovered shown in parentheses)
  35. References • LSI: Deerwester, S., et al. (1990) Improving Information Retrieval with Latent Semantic Indexing. Proceedings of the 51st Annual Meeting of the American Society for Information Science, 25, pp. 36–40. • Word Co-occurrences: Firth, J. R. (1957) Papers in Linguistics 1934-1951. London: Oxford University Press. • Distributional Hypothesis: Harris, Z. (1954) Distributional structure. Word, 10(23): 146-162. • LSA: Landauer, T. K., and Dumais, S. T. (1997) A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240. • Schütze: Schütze, H. (1998) Automatic word sense discrimination. Computational Linguistics, 24(1), pp. 97-123.
