
Pedersen ACL Disco-2011 workshop

ACL 2011 Disco workshop describing shared task systems from Duluth that measure semantic compositionality.
  1. Identifying Collocations to Measure Compositionality: Shared Task System Description. Ted Pedersen, Department of Computer Science, University of Minnesota, Duluth. http://www.d.umn.edu/~tpederse
  2. My original intent... <ul><li>Cluster contexts that contain the candidate pair
  3. Identify the number of clusters automatically </li><ul><li>http://senseclusters.sourceforge.net </li></ul><li>If a small number of clusters is found... </li><ul><li>A single underlying meaning exists
  4. Non-compositional! </li></ul><li>If a larger number of clusters is found... </li><ul><li>No systematic underlying meaning
  5. Compositional! </li></ul></ul>
  6. Revised Idea <ul><li>Methods that identify collocations often claim that what they find is non-compositional, or “interesting” in some way
  7. The Duluth systems seek to evaluate that claim by using various measures of association commonly employed to identify collocations </li><ul><li>Text::NSP – the Ngram Statistics Package
  8. http://ngram.sourceforge.net </li></ul></ul>
  9. Hypothesis #1 <ul><li>An ngram that has a high score according to a measure of association (for identifying collocations) will be less compositional (and less literal) than those that have lower scores </li><ul><li>Note that the hypothesis is stated in relative terms, not absolute
  10. Well suited for a shared task </li></ul></ul>
  11. Measures of Association <ul><li>Log-likelihood ratio (ll)
  12. Mutual Information (tmi)
  13. Pearson's chi-squared test (x2)
  14. Pointwise Mutual Information (pmi)
  15. Poisson-Stirling (ps) </li></ul><ul><li>Fisher's Exact Test (leftFisher)
  16. Jaccard Coefficient (jaccard)
  17. Odds Ratio (odds)
  18. Dice Coefficient (dice)
  19. T-score (tscore) </li></ul>
  20. Measures of Association <ul><li>In general these compare the frequency of a word or pair of words with an expected value based on the assumption of independence </li><ul><li>p(w1,w2) = p(w1) * p(w2)? </li></ul><li>If the frequency of a word or pair of words is about what would be expected under independence, then these measures assign a low score and the pair isn't considered interesting </li><ul><li>Less likely to be non-compositional </li></ul></ul>
  21. Comparing Observed with Expected <ul><li>p(w1,w2) = n_11 / n_++
  22. p(w1) = n_1+ / n_++
  23. p(w2) = n_+1 / n_++
  24. m_11 = (n_1+ * n_+1) / n_++ </li><ul><li>Generalizes to m_ij </li></ul></ul>
              W2      NOT W2
  W1          n_11    n_12    n_1+
  NOT W1      n_21    n_22    n_2+
              n_+1    n_+2    n_++
  25. ACL 2011 abstract corpus: http://www.d.umn.edu/~tpederse/acl2011.abstracts.txt
                   translation                  NOT translation
  machine          n_11 = 65  (m_11 = 1.58)     n_12 = 14     (m_12 = 77.41)      79
  NOT machine      n_21 = 48  (m_21 = 111.42)   n_22 = 5,512  (m_22 = 5,448.58)   5,560
                   113                          5,526                             5,639
  <ul><li>Do n_ij and m_ij diverge enough to reject the model of independence? </li><ul><li>Different measures answer this question in different ways </li></ul></ul>
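The expected counts above follow directly from the marginal totals. A minimal Python sketch (the slide deck's systems used Text::NSP, which is Perl; variable names here are mine):

```python
# Observed counts for (machine, translation) in the ACL 2011 abstract corpus,
# taken from the contingency table on this slide.
n11, n12 = 65, 14        # machine & translation, machine & NOT translation
n21, n22 = 48, 5512      # NOT machine & translation, neither

n1p = n11 + n12          # row total for "machine" (79)
np1 = n11 + n21          # column total for "translation" (113)
npp = n11 + n12 + n21 + n22   # grand total (5,639)

# Expected count under independence: m_11 = (n_1+ * n_+1) / n_++
m11 = n1p * np1 / npp
print(round(m11, 2))     # → 1.58, matching the table
```

The observed 65 is far above the expected 1.58, which is exactly the divergence the measures of association quantify.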
  26. Counting with windows <ul><li>Text: a b c d e f g h i j k
  27. Window 2 (w1 w2) </li><ul><li>a b, b c, c d, d e, e f, f g, … </li></ul><li>Window 4 (w1 * * w2) </li><ul><li>a b, a c, a d, b c, b d, b e, c d, c e, c f, ... </li></ul><li>Window 10 (w1 * * * * * * * * w2) </li><ul><li>a b, a c, a d, a e, a f, a g, a h, a i, a j, b c, b d ... </li></ul></ul>
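The window-based counting above can be sketched as follows, assuming a window of size k means w2 occurs within k−1 positions to the right of w1 (which matches the examples on this slide); `window_pairs` is my own name, not a Text::NSP function:

```python
def window_pairs(tokens, window):
    """Emit ordered pairs (w1, w2) where w2 follows w1 within the window."""
    pairs = []
    for i, w1 in enumerate(tokens):
        # window 2 pairs only adjacent words; window k looks up to k-1 words ahead
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append((w1, tokens[j]))
    return pairs

text = "a b c d e f g h i j k".split()
print(window_pairs(text, 2)[:3])   # [('a','b'), ('b','c'), ('c','d')]
print(window_pairs(text, 4)[:5])   # [('a','b'), ('a','c'), ('a','d'), ('b','c'), ('b','d')]
```

Wider windows generate many more pairs from the same text, which changes both the joint and marginal counts that feed the contingency table.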
  28. Development of Duluth Systems <ul><li>Duluth-1 (aka “The Flagship”): based on the measure that had the highest correlation with the fine-grained gold standard data
  29. Duluth-2 (aka “Coward's Comfort”): use the measure most distinct from Duluth-1
  30. Duluth-3 (aka “Why not?”): three submissions were allowed, so why not... </li></ul>
  31. Rank correlation with fine-grained gold standard

  measure       window 2   window 4   window 10
  tscore         0.1481     0.2114     0.2674
  tmi            0.1335     0.1908     0.2361
  ll             0.1336     0.1913     0.2358
  frequency      0.1865     0.2100     0.2126
  ps             0.0992     0.1554     0.1874
  x2             0.1157     0.1172     0.1654
  phi            0.1253     0.1167     0.1646
  jaccard        0.1253     0.1255     0.1602
  dice           0.1253     0.1255     0.1602
  odds           0.0216     0.0060     0.0257
  pmi           -0.0241    -0.0145     0.0143
  rightFisher   -0.1768    -0.0817     0.0740
  leftFisher     0.1316     0.0686    -0.0870
  twotailed     -0.1445    -0.0651    -0.1064
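Rank correlations like those in the table are Spearman correlations between a measure's scores and the gold-standard ratings. A small pure-Python sketch (the score lists below are hypothetical, not the shared task data):

```python
def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # equal for rx and ry when there are no ties
    return cov / var

# Hypothetical association scores vs. gold-standard literalness ratings
scores = [7.9, 2.3, 5.1, 9.4, 1.0]
gold   = [80,  20,  55,  90,  10]
print(spearman(scores, gold))   # identical orderings → 1.0
```

Because only the orderings matter, the correlation ignores the very different scales of the individual measures.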
  32. “So many tests?? I suggested the t-score in 1991...”
  33. The Flagship: Duluth-1 <ul><li>t-score with a window size of 10 </li><ul><li>Rank correlation of 0.2674 </li></ul><li>t = ( n_11 – m_11 ) / sqrt (n_11) </li></ul>
  34. BUT... <ul><li>t-score with window size of 2 has a huge rank correlation with frequency (0.9857) </li><ul><li>Somewhat less with window size of 10 (0.8477), but still high... </li></ul><li>Can a measure that correlates so well with frequency really be effective? </li></ul>
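Plugging the machine/translation counts from the earlier contingency-table slide into the t-score formula:

```python
from math import sqrt

n11, m11 = 65, 1.58          # observed and expected counts from the earlier slide
t = (n11 - m11) / sqrt(n11)  # t = (n_11 - m_11) / sqrt(n_11)
print(round(t, 2))           # → 7.87
```

A large positive t-score indicates the pair occurs together far more often than independence predicts.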
  35. Hypothesis #2 <ul><li>Very frequent word pairs are more likely to be compositional (i.e., highly literal) than are less frequent word pairs </li><ul><li>Highly frequent word pairs tend to be very literal and compositional (e.g., for the), and it would in general be a surprise for a non-compositional pair to attain as high a frequency </li></ul></ul>
  36. Coward's Comfort? <ul><li>PMI with window size of 2 </li><ul><li>Rank correlation of -0.0241 with gold standard
  37. Low correlation of 0.2487 with frequency </li></ul><li>PMI = log (n_11/m_11)
  38. RightFisher and two-tailed Fisher had lower correlations, but aren't really suitable for collocation discovery
  39. PMI has a long history of use in collocation discovery … </li></ul>
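Using the same machine/translation counts, PMI as defined on this slide (natural log of the ratio of observed to expected):

```python
from math import log

n11, m11 = 65, 1.58           # machine/translation counts from the earlier slide
pmi = log(n11 / m11)          # PMI = log(n_11 / m_11)
print(round(pmi, 2))          # → 3.72
```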
  40. “I'm no coward, and I like PMI!”
  41. Why not? .. Duluth-3 <ul><li>PMI with a window size of 10
  42. PMI very biased towards word pairs that only occur together </li><ul><li>Highest score always for pairs that occur just 1 time and only with each other </li></ul><li>Wide window in Duluth-3 means that pairs with high PMI scores generally occur only together
  43. Narrow window in Duluth-2 might tend to miss other occurrences of words (outside the window) </li></ul>
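The bias toward pairs that only occur together can be seen directly: a pair seen once, and only with each other, always outscores a frequent pair. The toy counts below are my own illustration:

```python
from math import log

def pmi(n11, n1p, np1, npp):
    """PMI = log(n_11 / m_11) with m_11 = (n_1+ * n_+1) / n_++."""
    return log(n11 / (n1p * np1 / npp))

N = 10_000
# Hapax pair: each word occurs exactly once, and only with the other
hapax = pmi(1, 1, 1, N)          # log(N), the maximum possible value
# Frequent pair: occurs together 50 times, each word 100 times overall
frequent = pmi(50, 100, 100, N)  # log(50)
print(hapax > frequent)          # → True
```

So PMI rewards rarity: no pair can score higher than a hapax pair, no matter how strongly associated it is.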
  44. Scoring <ul><li>Shared task scoring is on a scale of 0 – 100, where 100 means highly literal.
  45. In measures of association, higher scores mean less literal (at least according to hypothesis #1)
  46. Association scores converted to the 0 – 100 scale by normalizing and subtracting from 100 </li><ul><li>100 * (1 – m(w1,w2)/max(m(W1,W2)))
  47. Binned: 0-33 low, 34-66 medium, 67-100 high </li></ul></ul>
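The conversion and binning described above, sketched in Python (the helper names and sample scores are mine, not from the shared task code):

```python
def to_literal_scale(score, max_score):
    """Map an association score to 0-100, where 100 means highly literal:
    100 * (1 - m(w1,w2)/max(m(W1,W2))). High association -> low literalness,
    per hypothesis #1."""
    return 100 * (1 - score / max_score)

def bin_score(s):
    """Coarse bins used by the shared task: low / medium / high literalness."""
    return "low" if s <= 33 else "medium" if s <= 66 else "high"

scores = [7.87, 3.72, 0.5]         # hypothetical association scores
top = max(scores)
for s in scores:
    lit = to_literal_scale(s, top)
    print(round(lit), bin_score(lit))
```

Note the inversion: the highest-scoring (most strongly associated) pair maps to 0, the least literal end of the scale.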
  48. Results <ul><li>Duluth-1 top ranked for coarse evaluation </li><ul><li>Duluth-3 top ranked for coarse EN_V_SUBJ (by a large margin...??) </li></ul><li>Duluth-1 middle of the pack for numerical scoring
  49. Duluth-2 and Duluth-3 generally less effective for both coarse and numerical scoring
  50. No correlation with numerical scoring? </li><ul><li>I found correlations with the training data...? </li></ul></ul>
  51. Conclusions <ul><li>Standard techniques for ranking collocations are effective at identifying compositionality
  52. Scoring of compositionality is less successful
  53. The t-score is successful because it balances two potentially competing hypotheses: </li><ul><li>word pairs with high association scores are more likely to be non-compositional, and
  54. more frequent word pairs are more likely to be compositional </li></ul></ul>
  55. Thank You! <ul><li>All experiments conducted with version 1.23 of the Ngram Statistics Package
  56. http://ngram.sourceforge.net </li></ul>
