
RANLP 2013: DutchSemCor in quest of the ideal corpus

  1. DutchSemCor: in Quest of the Ideal Sense Tagged Corpus
     Piek Vossen piek.vossen@vu.nl, Rubén Izquierdo ruben.izquierdobevia@vu.nl, Attila Görög a.gorog@vu.nl
  2. Outline
     • Main goal of our project
     • WSD and annotated corpora
     • Our approach
     • Balanced-sense corpus and evaluation
     • Balanced-context corpus and evaluation
     • Sense distributions, all-words corpus and evaluation
     • Numbers…
  3. Main goal of DutchSemCor (DSC)
     • Deliver a Dutch corpus enriched with semantic information:
       – Senses of the most frequent and most polysemous words
       – Domains
       – Named entities linked to Wikipedia
     • 1 million sense-tagged tokens:
       – 250K tagged manually by 2 annotators
       – 750K tagged by 1 annotator / automatically through active learning
  4. Current WSD
     • Insights on word sense disambiguation:
       1. Evaluation tasks depend on the corpus / lexicon
          – Results seem to depend more on the evaluation data than on the WSD systems
          – Are the evaluation corpora diverse enough?
       2. The most frequent sense from SemCor is difficult to beat
          – Are evaluation tasks neglecting low-frequency senses?
       3. Predominant senses in specific domains give the best results
       4. Supervised systems beat unsupervised systems
     • Which are the best corpora for WSD? What should the ideal corpus for WSD look like?
     • Our contribution:
       – Define criteria for the ideal sense-tagged corpus
       – Describe a novel approach for building a large-scale sense-tagged corpus that meets these criteria (with as little manual effort as possible)
  5. Criteria for a corpus
     A good corpus for WSD should:
     • Be balanced for different senses
       – An equal number of examples for each meaning
     • Be balanced for different contexts
       – Different usages of the words
     • Provide information on sense frequencies (across domains and genres)
       – The frequency of each sense in representative usage
  6. Annotating a corpus

     | Sequential tagging (all-words corpus)             | Targeted tagging (lexical-sample corpus: balanced-sense / balanced-context) |
     |---------------------------------------------------|------------------------------------------------------------------------------|
     | Whole text                                        | KWIC (keyword in context)                                                     |
     | Reconsider meanings                               | Repeated contexts                                                             |
     | Small number of texts, genres, domains and senses | Usually a large number of contexts and senses                                 |
     | Sense distributions                               |                                                                               |
     | e.g. SemCor                                       | e.g. line-hard-serve, DSO                                                     |
  7. Annotating a corpus

     | Corpus type      | Sense distribution | Sense coverage | Context diversity |
     |------------------|--------------------|----------------|-------------------|
     | All-words        | ✔                  | ✖              | ✖                 |
     | Balanced-sense   | ✖                  | ✔              | ✖                 |
     | Balanced-context | ✖                  | ✖              | ✔                 |
  8. Our main approach
     1. Build an annotated corpus that represents ALL the meanings of an existing lexicon
        – Balanced-sense; manual
     2. Train WSD systems on the annotated corpus
        – They will be trained for all the senses
     3. Extend the annotated corpus to acquire a wider representation of contexts
        – Balanced-context; manual + WSD
     4. Annotate the full raw corpus
        – Sense distributions; WSD
     5. Evaluate the annotations against the 3 criteria
  9. Resources
     • Cornetto database
       – Lexical-semantic database for Dutch
       – Structure and content of WordNet + FrameNet-like data
     • SoNaR (500M tokens)
       – Dutch texts covering a wide range of genres and topics
       – 34 categories: discussion lists, books, chats, autocues…
     • CGN (9M tokens)
       – Transcribed spontaneous Dutch adult speech
     • Internet
  10. WSD systems
     • DSC-timbl
       – Memory-based learning classifier
       – Supervised k-nearest neighbours
     • DSC-svm
       – Linear classifier / support vector machines
       – Binary classifiers, one vs. all
     • DSC-ukb
       – Knowledge-based system
       – Personalized PageRank algorithm (see the sketch below)
       – Synsets → nodes, relations → edges
       – Context words inject mass into word senses
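
The DSC-ukb bullets compress the whole algorithm into a few lines; a toy sketch may help. Everything concrete in it (the miniature graph, the sense ids, the use of networkx) is an illustrative assumption, not the actual UKB implementation; only the idea comes from the slide.

```python
# Toy illustration of WSD via personalized PageRank, as in DSC-UKB:
# synsets are nodes, semantic relations are edges, and the context
# words inject probability mass into their candidate senses.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("bank#1", "institution#1"),   # bank as a financial institution
    ("bank#2", "slope#1"),         # bank as a river bank
    ("institution#1", "company#1"),
    ("institution#1", "money#1"),
    ("slope#1", "terrain#1"),
])

def disambiguate(target_senses, context_senses, graph):
    """Rank the target word's senses by the PageRank mass they receive
    when the random walk restarts at the context words' senses."""
    personalization = {s: 1.0 for s in context_senses if s in graph}
    scores = nx.pagerank(graph, personalization=personalization)
    return max(target_senses, key=lambda s: scores.get(s, 0.0))

# A context containing "money" pulls "bank" towards its financial sense.
print(disambiguate(["bank#1", "bank#2"], ["money#1"], G))  # -> bank#1
```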
  11. Balanced-sense corpus
     • 2,870 most polysemous and frequent words (11,982 meanings, average polysemy 3)
     • 2 years of annotation by student assistants
     • SAT tool and web-snippets tool
     • Targets: 80% agreement, 25 examples per sense (see the counting sketch below)
     • 282,503 tokens double annotated
     • 80% of senses have more than 25 examples
     • 90% of lemmas have 25 examples for each sense
     • Distribution: 67% SoNaR, 5% CGN, 28% web
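
The agreement and coverage figures above come down to simple counting; a minimal sketch, assuming a hypothetical (token, annotator-1 sense, annotator-2 sense) format for the double annotations:

```python
# Minimal sketch of the two statistics reported for the balanced-sense
# corpus: pairwise inter-annotator agreement and per-sense example
# counts. The triple format below is a hypothetical representation of
# the double annotations, not the project's actual data format.
from collections import Counter

annotations = [                      # (token_id, annotator 1, annotator 2)
    ("t1", "band#1", "band#1"),
    ("t2", "band#2", "band#3"),
    ("t3", "band#1", "band#1"),
]

agreed = [a for _, a, b in annotations if a == b]
print(f"observed agreement: {len(agreed) / len(annotations):.0%}")

# Agreed examples per sense, to check the 25-examples-per-sense target.
per_sense = Counter(agreed)
under_target = [s for s, n in per_sense.items() if n < 25]
print(per_sense, under_target)
```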
  14. WSD from the balanced-sense corpus
     • 5-fold cross-validation (5-FCV) at sense level, focusing on nouns
     • Optimized for annotating SoNaR
       – Specific features (word_id)
     • Overall result for nouns: 82.76%
     • Results used to further annotate weakly performing senses
       – Active-learning approach (see the sketch below)
       – Selected 82 lemmas performing under 80%
       – 3 rounds of annotation until reaching 81.62%
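
The active-learning rounds could look roughly as follows. `train_and_score` and `request_annotations` are hypothetical stand-ins for the real TiMBL cross-validation and the manual annotation step:

```python
# Schematic sketch of the active-learning loop: score every lemma with
# 5-fold cross-validation, select the lemmas under the 80% threshold,
# have annotators add fresh examples for just those lemmas, and retrain.

THRESHOLD = 0.80   # accuracy below this triggers re-annotation
MAX_ROUNDS = 3     # the project used 3 rounds

def active_learning(corpus, lemmas, train_and_score, request_annotations):
    scores = {}
    for _ in range(MAX_ROUNDS):
        # 5-FCV accuracy per lemma on the current annotated corpus.
        scores = {lemma: train_and_score(corpus, lemma, folds=5)
                  for lemma in lemmas}
        weak = [lemma for lemma, acc in scores.items() if acc < THRESHOLD]
        if not weak:
            break
        # New manually annotated tokens for the weak lemmas only.
        corpus.extend(request_annotations(weak))
        lemmas = weak
    return corpus, scores
```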
  17. Balanced-context corpus
     • Aim: annotate the whole corpus
       – Cover as many contexts as the whole corpus contains
       – Requires good WSD, so improve the problematic cases
     • Select all words performing under 80%
     • Annotate the whole corpus with the (optimized) DSC-timbl system
     • Select 50 new tokens per sense of the underperforming words, each from a different context (see the sketch below):
       – High confidence
       – Low / high distance to the nearest neighbour
     • Manually annotate these 50 tokens
       – Completely different from the first phase, where annotators could choose their examples
       – Revealed lemmatization errors, PoS errors, and figurative, idiomatic and unknown senses
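
A rough sketch of that token-selection step: run the trained system over the untagged corpus, keep high-confidence predictions for the weak senses, and take tokens from both ends of the nearest-neighbour distance scale so that genuinely new contexts get annotated. The `Candidate` fields and the even low/high split are assumptions, not the project's actual interface:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    token_id: str
    sense: str          # sense predicted by the classifier
    confidence: float   # classifier confidence in that sense
    nn_distance: float  # distance to the nearest training neighbour

def select_for_annotation(candidates, sense, n=50, min_conf=0.9):
    # High-confidence candidates for this sense, ordered by distance.
    pool = sorted((c for c in candidates
                   if c.sense == sense and c.confidence >= min_conf),
                  key=lambda c: c.nn_distance)
    if len(pool) <= n:
        return pool
    # Half close to known contexts (low distance), half far away
    # (high distance), mirroring the LowD / HighD sets on slide 18.
    return pool[: n // 2] + pool[-(n - n // 2):]
```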
  18. Evaluating the balanced-sense corpus and the new annotations

     | Type                            | Accuracy | # examples |
     |---------------------------------|----------|------------|
     | Balanced-sense (BS)             | 81.62    | 8,641      |
     | BS + LowD                       | 78.81    | 13,266     |
     | BS + LowD_agreed                | 85.02    | 11,405     |
     | BS + HighD                      | 76.24    | 19,055     |
     | BS + HighD_agreed               | 83.77    | 13,359     |
     | BS + LowD_agreed + HighD_agreed | 85.33    | 16,123     |

     • DSC-timbl, 5-FCV (folds incremented with the new data), 82 lemmas
     • Better results when using agreed data
     • High vs. low distance does not make a big difference
  19. Evaluating the balanced-context corpus
     • 5-FCV using the agreed new instances
     • Best is majority voting (see the sketch below)

     | System    | Nouns | Verbs | Adjs  |
     |-----------|-------|-------|-------|
     | DSC-timbl | 83.97 | 83.44 | 78.64 |
     | DSC-svm   | 82.69 | 84.93 | 79.03 |
     | DSC-ukb   | 73.04 | 55.84 | 56.36 |
     | Voting    | 88.65 | 87.60 | 83.06 |
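
The voting combination is plain majority voting over the three system outputs; a minimal sketch. How ties between three different answers were broken is not stated on the slides, so falling back to one designated system is an assumption:

```python
from collections import Counter

def vote(predictions, tie_breaker="DSC-svm"):
    """predictions: dict mapping system name -> predicted sense."""
    sense, n = Counter(predictions.values()).most_common(1)[0]
    return sense if n > 1 else predictions[tie_breaker]

print(vote({"DSC-timbl": "bank#1", "DSC-svm": "bank#1",
            "DSC-ukb": "bank#2"}))   # -> bank#1 (2 votes out of 3)
```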
  20. Evaluating representativeness
     • Our manually annotated corpus is probably skewed towards the balanced-sense design
     • We therefore needed to test the performance of our WSD on the rest of SoNaR
     • Random evaluation (see the sampling sketch below):
       – Accuracy ranges: 90-100, 80-90, 70-80, 60-70
       – 5 nouns, 5 verbs and 3 adjectives per range → 52 lemmas
       – 100 tokens per lemma, automatically tagged and manually validated
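
The random evaluation is a stratified sample; a sketch of how the 52 lemmas could be drawn. The (lemma, pos, accuracy) layout is a hypothetical representation of the cross-validation results:

```python
# For every accuracy range, draw 5 nouns, 5 verbs and 3 adjectives,
# giving 4 x 13 = 52 lemmas; 100 tokens per lemma are then tagged
# automatically and validated by hand.
import random

RANGES = [(90, 100), (80, 90), (70, 80), (60, 70)]
QUOTA = {"noun": 5, "verb": 5, "adj": 3}

def pick_lemmas(lemma_stats, rng=random):
    """lemma_stats: list of (lemma, pos, cv_accuracy) tuples."""
    picked = []
    for low, high in RANGES:
        for pos, k in QUOTA.items():
            pool = [lemma for lemma, p, acc in lemma_stats
                    if p == pos and low <= acc < high]
            picked += rng.sample(pool, k)  # raises if a stratum is too small
    return picked
```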
  21. Evaluating representativeness
     • Results are lower than in the previous evaluations
       – The difference between representing the lexicon (balanced-sense) and representing the corpus
     • Results comparable to state-of-the-art English Senseval/SemEval systems

     | System    | Nouns | Verbs | Adjs  |
     |-----------|-------|-------|-------|
     | DSC-timbl | 54.25 | 48.25 | 46.50 |
     | DSC-svm   | 64.10 | 52.20 | 52.00 |
     | DSC-ukb   | 49.37 | 44.15 | 38.13 |
     | Voting    | 60.70 | 53.95 | 50.83 |
  22. Obtaining sense distributions
     • Approach (see the sketch below):
       – Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
       – Assume that the automatic annotation still reflects the real distribution
       – Evaluate this frequency distribution (most frequent sense, MFS)
     • How can this MFS approach be evaluated?
       – Manual annotations: 25 examples per sense, so no sense distribution
       – Random evaluation corpus: only a small selection of words (52 lemmas)
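
Deriving the MFS predictor from the automatic annotation amounts to a frequency count per lemma; a minimal sketch:

```python
# Count how often each sense was assigned to each lemma in the
# automatic annotation of the remaining corpus, and always predict
# the most frequent one.
from collections import Counter, defaultdict

def build_mfs(auto_annotations):
    """auto_annotations: iterable of (lemma, sense) pairs produced by
    the WSD systems over the untagged part of SoNaR."""
    freq = defaultdict(Counter)
    for lemma, sense in auto_annotations:
        freq[lemma][sense] += 1
    return {lemma: counts.most_common(1)[0][0]
            for lemma, counts in freq.items()}

mfs = build_mfs([("bank", "bank#1"), ("bank", "bank#1"),
                 ("bank", "bank#2")])
print(mfs["bank"])   # -> bank#1
```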
  23. Obtaining sense distributions
     • An all-words corpus was created
       – Completely independent texts from Lassy
       – Medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia
       – 23,907 tokens, covering 1,527 of our set of lemmas (53%)
     • Evaluation of:
       – The 3 WSD systems
       – First-sense baseline according to Cornetto
       – Random-sense baseline
       – Most frequent sense (MFS), with sense distributions obtained from the automatic annotation
  24. Obtaining sense distributions
     • The MFS in Dutch performs similarly to the English MFS
     • The MFS is better than the first-sense and random-sense baselines
     • The automatically derived MFS is a good predictor

     | System       | Nouns | Verbs | Adjs  |
     |--------------|-------|-------|-------|
     | 1st sense    | 53.17 | 32.84 | 52.17 |
     | Random sense | 29.52 | 24.99 | 32.16 |
     | MFS          | 61.20 | 50.76 | 54.62 |
     | DSC-timbl    | 55.76 | 37.96 | 49.00 |
     | DSC-svm      | 64.58 | 45.81 | 55.70 |
     | DSC-ukb      | 56.81 | 31.37 | 35.93 |
     | Voting       | 66.09 | 45.68 | 52.24 |
  25. Numbers of DSC
     • Balanced-sense annotated corpus: 274,344 tokens, 2,874 lemmas, annotated by 2 annotators, 90% IAA
     • Balanced-context annotated corpus: 132,666 tokens, 1,133 lemmas, manually annotated by 1 annotator (agreeing with the WSD output in 44% of cases)
     • Random evaluation corpus: 5,200 tokens, 52 lemmas
     • All-words corpus: 23,907 tokens, 1,527 lemmas
     • 3 WSD systems for Dutch: DSC-timbl, DSC-svm, DSC-ukb
     • Automatic annotations by the 3 WSD systems
       – Sense distributions
       – 48 million tokens with confidence scores
     • … and more…
       – 800,000 semantic relations between senses extracted from the manual annotations
       – 28,080 sense groups
       – An improved version of Cornetto
       – The SAT annotation tool
       – The web search tool
       – Statistics on figurative, idiomatic and collocational usage of words
       – …
  26. Thanks for your attention
     Piek Vossen piek.vossen@vu.nl, Rubén Izquierdo ruben.izquierdobevia@vu.nl, Attila Görög a.gorog@vu.nl
