A pilot on Semantic Textual Similarity

  1. SemEval 2012 task 6: A pilot on Semantic Textual Similarity
     http://www.cs.york.ac.uk/semeval-2012/task6/
     Eneko Agirre (University of the Basque Country), Daniel Cer (Stanford University), Mona Diab (Columbia University), Bill Dolan (Microsoft), Aitor Gonzalez-Agirre (University of the Basque Country)
  2. Outline
     ● Motivation
     ● Description of the task
     ● Source datasets
     ● Definition of similarity and annotation
     ● Results
     ● Conclusions, open issues
  3. Motivation
     ● Word similarity and relatedness highly correlated with humans
       Move to longer text fragments (STS)
       – Li et al. (2006): 65 pairs of glosses
       – Lee et al. (2005): 50 documents on news
     ● Paraphrase datasets judge semantic equivalence between text fragments
     ● Textual entailment (TE) judges whether one fragment entails another
       Move to a graded notion of semantic equivalence (STS)
  4. Motivation
     ● STS has been part of the core implementation of TE and paraphrase systems
     ● Algorithms for STS have been extensively applied
       ● MT, MT evaluation, summarization, generation, distillation, machine reading, textual inference, deep QA
     ● Interest from the application side confirmed in a recent STS workshop:
       ● http://www.cs.columbia.edu/~weiwei/workshop/
  5. Motivation
     ● STS as a unified framework to combine and evaluate semantic (and pragmatic) components:
       word sense disambiguation and induction
       lexical substitution
       semantic role labeling
       multiword expression detection and handling
       anaphora and coreference resolution
       time and date resolution
       named-entity handling
       underspecification
       hedging
       semantic scoping
       discourse analysis
  6. Motivation
     ● Start with a pilot task, with the following goals:
       1. To set a definition of STS as a graded notion which can be easily communicated to non-expert annotators, beyond the Likert scale
       2. To gather a substantial amount of sentence pairs from diverse datasets, and to annotate them with high quality
       3. To explore evaluation measures for STS
       4. To explore the relation of STS to paraphrase and Machine Translation evaluation exercises
  7. Description of the task
     ● Given two sentences, s1 and s2:
       ● Return a similarity score and an optional confidence score
     ● Evaluation
       ● Correlation (Pearson) with the average of human scores
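The evaluation described above is a Pearson correlation between the system's similarity scores and the gold standard, i.e. the average of the human scores per pair. A minimal sketch of that computation follows; the function name and the toy 0-5 scores are illustrative, not part of the task materials.

import numpy as np

def pearson(system_scores, gold_scores):
    """Pearson correlation between system similarity scores and the
    gold standard (the average of the human scores for each pair)."""
    x = np.asarray(system_scores, dtype=float)
    y = np.asarray(gold_scores, dtype=float)
    x, y = x - x.mean(), y - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Toy example: three sentence pairs scored on the 0-5 scale.
print(pearson([4.8, 1.2, 3.0], [5.0, 0.8, 2.5]))  # roughly 0.99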
  8. Data sources
     ● MSR paraphrase: train (750), test (750)
     ● MSR video: train (750), test (750)
     ● WMT 07–08 (EuroParl): train (734), test (499)
     ● Surprise datasets
       ● WMT 2007 news: test (399)
       ● OntoNotes – WordNet glosses: test (750)
  9. Definition of similarity
     Likert scale with definitions
  10. Annotation
     ● Pilot with 200 pairs annotated by three authors
       ● Pairwise (0.84r to 0.87r), with average (0.87r to 0.89r)
     ● Amazon Mechanical Turk
       ● 5 annotations per pair, averaged
       ● Remove turkers with very low correlations with the pilot
       ● Correlation with us: 0.90r to 0.94r
       ● MSR: 2.76 mean, 0.66 sdv.
       ● SMT: 4.05 mean, 0.66 sdv.
  11. Results
     ● Baselines: random, cosine of tokens
     ● Participation: 120 hours to submit three runs
       ● 35 teams, 88 runs
     ● Evaluation
       ● Pearson for each dataset
       ● Concatenate all 5 datasets: ALL
         – Some systems doing well in each dataset still get low results
       ● Weighted mean over 5 datasets (micro-average): MEAN
         – Statistical significance
       ● Normalize each dataset and concatenate (least squares): ALLnorm
         – Corrects errors (random would get 0.59r)
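For illustration, here is a minimal sketch of a token-overlap cosine baseline of the kind listed above; the whitespace tokenization, lowercasing, and rescaling to the 0-5 range are my assumptions and may differ from the official baseline implementation.

import math
from collections import Counter

def token_cosine(s1, s2):
    """Cosine similarity between bag-of-token vectors of the two
    sentences, rescaled to the 0-5 STS range."""
    a, b = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(count * b[tok] for tok, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * \
           math.sqrt(sum(c * c for c in b.values()))
    return 5.0 * dot / norm if norm else 0.0

print(token_cosine("A man is slicing a cucumber.",
                   "A person is slicing a cucumber into pieces."))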
  12. Results
     ● Large majority better than both baselines
     ● Best three runs
       ● ALL: 0.82r UKP, TAKELAB, TAKELAB
       ● MEAN: 0.67r TAKELAB, UKP, TAKELAB
       ● ALLnorm: 0.86r UKP, TAKELAB, SOFT-CARDINALITY
     ● Statistical significance (ALL, 95% confidence interval)
       ● 1st: 0.824r [0.812, 0.835]
       ● 2nd: 0.814r [0.802, 0.825]
  13. Results
     ● Datasets (ALL)
       ● MSRpar: 0.73r TAKELAB
       ● MSRvid: 0.88r TAKELAB
       ● SMT-eur: 0.57r SRANJANS
       ● SMT-news: 0.61r FBK
       ● On-WN: 0.73r WEIWEI
  14. Results
     ● Evaluation using confidence scores
       ● Weighted Pearson correlation
       ● Some systems improve results (IRIT, TIANTIANZHU7)
         – IRIT: 0.48r => 0.55r
       ● Others did not (UNED)
     ● Unfortunately only a few teams sent out confidence scores
     ● Promising direction, potentially useful in applications (Watson)
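A sketch of a confidence-weighted Pearson correlation of the kind referred to above, where each pair contributes in proportion to the system's confidence; this is the textbook weighted formula and may differ in detail from the official scorer.

import numpy as np

def weighted_pearson(sys_scores, gold_scores, confidences):
    """Weighted Pearson: weighted covariance over the product of
    weighted standard deviations, with weights from the confidences."""
    x, y, w = (np.asarray(a, dtype=float)
               for a in (sys_scores, gold_scores, confidences))
    w = w / w.sum()
    mx, my = (w * x).sum(), (w * y).sum()
    cov = (w * (x - mx) * (y - my)).sum()
    var_x = (w * (x - mx) ** 2).sum()
    var_y = (w * (y - my) ** 2).sum()
    return float(cov / np.sqrt(var_x * var_y))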
  15. Tools used
     ● WordNet, corpora and Wikipedia most used
     ● Knowledge-based and distributional methods used about equally
     ● Machine learning widely used for combination and tuning
     ● Best systems used most resources
       ● Exception: SOFT-CARDINALITY
  16. Conclusions
     ● Pilot worked!
       ● Defined STS as a Likert scale with definitions
       ● Produced a wealth of high-quality data (~3750 pairs)
       ● Very successful participation
       ● All data and system outputs are publicly available
     ● Started to explore evaluation of STS
     ● Started to explore the relation to paraphrase and MT evaluation
     ● Planning for STS 2013
  17. Open issues
     ● Data sources: alternatives to the opportunistic method
       ● New pairs of sentences
       ● Possibly related to specific phenomena, e.g. negation
     ● Definition of the task
       ● Agreement for definitions
       ● Compare to a Likert scale with no definitions
       ● Define multiple dimensions of similarity (polarity, sentiment, modality, relatedness, entailment, etc.)
     ● Evaluation
       ● Spearman, Kendall's Tau
       ● Significance tests over multiple datasets (Bergmann & Hommel, 1989)
     ● And more!! Join the STS-semeval Google group
  18. STS presentations
     ● The three best systems will be presented in the last session of SemEval today (4:00pm)
     ● Analysis of runs and some thoughts on evaluation will also be presented
     ● Tomorrow in the poster sessions
  19. Thanks for your attention!
     And thanks to all participants, especially those contributing to the evaluation discussion (Yoan Gutierrez, Michael Heilman, Sergio Jimenez, Nitin Madnani, Diana McCarthy and Shrutiranjan Satpathy).
     Eneko Agirre was partially funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270082 (PATHS project) and the Ministry of Economy under grant TIN2009-14715-C04-01 (KNOW2 project). Daniel Cer gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181 and the support of the DARPA Broad Operational Language Translation (BOLT) program through IBM. The STS annotations were funded by an extension to DARPA GALE subcontract to IBM # W0853748 4911021461.0 to Mona Diab. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of DARPA, AFRL, or the US government.
  20. SemEval 2012 task 6: A pilot on Semantic Textual Similarity
     http://www.cs.york.ac.uk/semeval-2012/task6/
     Eneko Agirre (University of the Basque Country), Daniel Cer (Stanford University), Mona Diab (Columbia University), Bill Dolan (Microsoft), Aitor Gonzalez-Agirre (University of the Basque Country)
  21. MSR paraphrase corpus
     ● Widely used to evaluate text similarity algorithms
     ● Gleaned over a period of 18 months from thousands of news sources on the web
     ● 5801 pairs of sentences
       ● 70% train, 30% test
       ● 67% yes, 33% no
         – ranging from completely unrelated semantically, to partially overlapping, to almost-but-not-quite semantically equivalent
       ● IAA 82%-84%
     ● (Dolan et al. 2004)
  22. MSR paraphrase corpus
     ● The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.
     ● American intelligence leading up to the war on Iraq will be criticised by a powerful US Congressional committee due to report soon, officials said today.
     ● A strong geomagnetic storm was expected to hit Earth today with the potential to affect electrical grids and satellite communications.
     ● A strong geomagnetic storm is expected to hit Earth sometime %%DAY%% and could knock out electrical grids and satellite communications.
  23. MSR paraphrase corpus
     ● Methodology:
       ● Rank pairs according to string similarity
         – "Algorithms for Approximate String Matching", E. Ukkonen, Information and Control, Vol. 64, 1985, pp. 100-118
       ● Five bands (0.8 – 0.4 similarity)
       ● Sample equal number of pairs from each band
       ● Repeat for paraphrases / non-paraphrases
         ● 50% from each
     ● 750 pairs for train, 750 pairs for test
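A rough sketch of the banded sampling described on this slide; the data layout, band edges, per-band sample size, and random seed below are illustrative assumptions, not the exact settings used to build the dataset.

import random

def sample_by_band(scored_pairs, lo=0.4, hi=0.8, n_bands=5,
                   n_per_band=75, seed=0):
    """Split pairs into equal-width string-similarity bands between lo
    and hi, then draw the same number of pairs from each band.
    scored_pairs is a list of (string_similarity, sentence_pair) tuples."""
    rng = random.Random(seed)
    width = (hi - lo) / n_bands
    buckets = [[] for _ in range(n_bands)]
    for sim, pair in scored_pairs:
        if lo <= sim < hi:
            buckets[int((sim - lo) // width)].append(pair)
    return [rng.sample(b, min(n_per_band, len(b))) for b in buckets]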
  24. MSR Video Description Corpus
     ● Show a segment of a YouTube video
       ● Ask for a one-sentence description of the main action/event in the video (AMT)
       ● 120K sentences, 2,000 videos
       ● Roughly parallel descriptions (not only in English)
     ● (Chen and Dolan, 2011)
  25. MSR Video Description Corpus
     ● A person is slicing a cucumber into pieces.
     ● A chef is slicing a vegetable.
     ● A person is slicing a cucumber.
     ● A woman is slicing vegetables.
     ● A woman is slicing a cucumber.
     ● A person is slicing cucumber with a knife.
     ● A person cuts up a piece of cucumber.
     ● A man is slicing cucumber.
     ● A man cutting zucchini.
     ● Someone is slicing fruit.
  26. MSR Video Description Corpus
     ● Methodology:
       ● All possible pairs from the same video
       ● 1% of all possible pairs from different videos
       ● Rank pairs according to string similarity
       ● Four bands (0.8 – 0.5 similarity)
       ● Sample equal number of pairs from each band
       ● Repeat for same video / different video
         ● 50% from each
     ● 750 pairs for train, 750 pairs for test
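A sketch of how such candidate pairs could be generated; the data layout and the sampling style are my own assumptions, and for the full 120K-sentence corpus one would sample cross-video pairs lazily rather than enumerate them all as this toy version does.

import itertools
import random

def candidate_pairs(videos, cross_fraction=0.01, seed=0):
    """Build candidate pairs: all pairs of descriptions of the same
    video, plus roughly 1% of cross-video pairs kept at random.
    videos maps a video id to its list of one-sentence descriptions."""
    rng = random.Random(seed)
    same = [pair for sents in videos.values()
            for pair in itertools.combinations(sents, 2)]
    flat = [(vid, s) for vid, sents in videos.items() for s in sents]
    cross = [(s1, s2)
             for (v1, s1), (v2, s2) in itertools.combinations(flat, 2)
             if v1 != v2 and rng.random() < cross_fraction]
    return same, cross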
  27. WMT: MT evaluation
     ● Pairs of segments (~ sentences) that had been part of the human evaluation of WMT systems:
       ● a reference translation
       ● a machine translation submission
     ● To keep things consistent, we used only French-to-English system submissions
     ● Train contains pairs from WMT 2007
     ● Test contains pairs with fewer than 16 tokens from WMT 2008
     ● Train and test come from Europarl
  28. WMT: MT evaluation
     ● The only instance in which no tax is levied is when the supplier is in a non-EU country and the recipient is in a Member State of the EU.
     ● The only case for which no tax is still perceived "is an example of supply in the European Community from a third country.
     ● Thank you very much, Commissioner.
     ● Thank you very much, Mr Commissioner.
  29. Surprise datasets
     ● Human-ranked fr-en system submissions from the WMT 2007 news conversation test set, resulting in 351 unique system-reference pairs
     ● The second set is radically different, as it comprised 750 pairs of glosses from OntoNotes 4.0 (Hovy et al., 2006) and WordNet 3.1 (Fellbaum, 1998) senses
  30. Pilot
     ● Mona, Dan, Eneko
     ● ~200 pairs from three datasets
     ● Pairwise agreement:
       ● GS: dan    SYS: eneko   N: 188  Pearson: 0.874
       ● GS: dan    SYS: mona    N: 174  Pearson: 0.845
       ● GS: eneko  SYS: mona    N: 184  Pearson: 0.863
     ● Agreement with the average of the rest of us:
       ● GS: average  SYS: dan    N: 188  Pearson: 0.885
       ● GS: average  SYS: eneko  N: 198  Pearson: 0.889
       ● GS: average  SYS: mona   N: 184  Pearson: 0.875
  31. [Figure-only slide; no transcribed text]
  32. Pilot with turkers
     ● Average of turkers with our average:
       ● N: 197  Pearson: 0.959
     ● Each of us with the average of turkers:
       ● dan    N: 187  Pearson: 0.937
       ● eneko  N: 197  Pearson: 0.919
       ● mona   N: 183  Pearson: 0.896
  33. Working with AMT
     ● Requirements:
       ● 95% approval rating for their other HITs on AMT
       ● Pass a qualification test with 80% accuracy
         – 6 example pairs
         – answers were marked correct if they were within +1/-1 of our annotations
       ● Targeting the US, but annotators from all origins were used
     ● HIT: 5 pairs of sentences, $0.20, 5 turkers per HIT
     ● 114.9 seconds per HIT on the most recent data we submitted
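A small sketch of the +1/-1 qualification check mentioned above; the function name, data layout, and toy answers are mine, only the within-one-point rule and the 80% threshold come from the slide.

def passes_qualification(turker_answers, our_answers, min_accuracy=0.8):
    """An answer counts as correct if it is within +/-1 of our own
    annotation; the turker passes with at least 80% correct."""
    correct = sum(abs(a - g) <= 1 for a, g in zip(turker_answers, our_answers))
    return correct / len(our_answers) >= min_accuracy

# Toy example with six qualification pairs.
print(passes_qualification([5, 3, 1, 4, 2, 0], [4, 3, 2, 5, 0, 0]))  # True (5/6 correct)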
  34. Working with AMT
     ● Quality control
       ● Each HIT contained one pair from our pilot
       ● After the tagging, we checked the correlation of individual turkers with our scores
       ● Annotations of low-correlation turkers were removed
         – A2VJKPNDGBSUOK  N: 100  Pearson: -0.003
       ● Later realized that we could use the correlation with the average of the other turkers
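A sketch of that filtering step: compute each turker's Pearson correlation with the pilot annotations on the embedded pairs and drop the low-correlation turkers. The 0.5 cut-off and the data layout are illustrative assumptions; the slides do not state the exact threshold used.

import numpy as np

def filter_turkers(annotations, pilot_gold, min_r=0.5):
    """Keep only turkers whose scores on the embedded pilot pairs
    correlate with our pilot annotations at min_r or better.
    annotations: turker_id -> {pair_id: score}; pilot_gold: pair_id -> score."""
    kept = {}
    for turker, scores in annotations.items():
        shared = [p for p in scores if p in pilot_gold]
        if len(shared) < 2:
            continue  # not enough overlap to estimate a correlation
        x = np.array([scores[p] for p in shared], dtype=float)
        y = np.array([pilot_gold[p] for p in shared], dtype=float)
        r = np.corrcoef(x, y)[0, 1]
        if r >= min_r:
            kept[turker] = scores
    return kept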
  35. Assessing quality of annotation
  36. Assessing quality of annotation
     ● MSR datasets
       ● Average: 2.76
       ● Score distribution: 0: 2228, 1: 1456, 2: 1895, 3: 4072, 4: 3275, 5: 2126
  37. Average (MSR data)
     [Chart of the "ave" series on a 0–6 axis; data not transcribed]
  38. Standard deviation (MSR data)
     [Chart; only axis ticks survived transcription]
  39. Standard deviation (MSR data)
     [Chart of the "sdv" series on a 0–2.5 axis; data not transcribed]
  40. Average SMTeuroparl
     [Chart; data not transcribed]
