CLIN 2012: DutchSemCor: Building a semantically annotated corpus for Dutch
Transcript

  • 1. DutchSemCor: Building a semantically annotated corpus for Dutch
    Piek Vossen, Attila Görög, VU University Amsterdam
    Fons Laan, ISLA, University of Amsterdam
    Rubén Izquierdo, Tilburg University
    Antal van den Bosch, Maarten van Gompel, Radboud University Nijmegen
    CLIN 22, Tilburg University, 20/01/2012
  • 2. Overview
    - Project goals and planning
    - Current progress
    - Word-sense-disambiguation results
    - Active learning phase
  • 3. Goals and planning
    - Funded by NWO, 2009-2012
    - Create a large semantically tagged corpus for Dutch:
        - Sense tags from the Cornetto database (includes the Dutch wordnet)
        - Domain labels from WordNet Domains
        - Named entities mapped to Wikipedia
  • 4. Global procedure
    - Phase 1:
        - 25 examples per meaning for the 3,000 most polysemous and frequent nouns, verbs and adjectives (average number of meanings = 3)
        - Annotated by two student assistants
        - Minimum inter-annotator agreement (IAA): 80%
    - Phase 2:
        - Word-sense-disambiguation (WSD) systems trained on the phase-1 data
        - Active learning: add examples for low-performing words and meanings until we reach an accuracy of 80% or make no further progress
    - Phase 3:
        - Apply WSD to the rest of the full corpus
  • 5. Corpora
    - SoNaR: 500M tokens of written Dutch
    - CGN: 1M tokens of spoken Dutch
    - Web snippets retrieved through WebCorp (http://www.webcorp.org.uk/)
        - Used when no or insufficient examples are found for particular senses in SoNaR and CGN
        - Students select snippets (target word and context), which are added to the corpus in the SoNaR annotation format
  • 6. Annotation tool
  • 7. Current results Phase 1
    - PoS: nouns, verbs and adjectives
    - Number of annotated lemmas: 2,870
    - Number of word senses: 11,982
    - Number of overlapping annotations: 282,503 (67% SoNaR, 5% CGN, 28% snippets)
    - Inter-annotator agreement: 92%
    - Coverage of senses with 25 examples: 70%
    - Coverage of annotations for words: 79%
  • 8. WSD systems
    - UKB: knowledge-based WSD system that employs semantic relations
    - Tilburg WSD: supervised, machine-learning-based WSD system
  • 9. UKB. Description
    - Knowledge-based (Agirre and Soroa, 2009)
    - WordNet considered as a graph
        - Senses -> nodes
        - Relations -> edges
    - Personalized PageRank algorithm
        - Modification of traditional PageRank
        - Context words act as source nodes injecting mass into word senses
        - Assigns stronger probabilities to certain nodes
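The idea on this slide can be illustrated with a minimal sketch of Personalized PageRank on a toy graph. This is not UKB's code: the function name `personalized_pagerank`, the damping value 0.85, the iteration count and the tiny English graph are all assumptions made for illustration. Words and senses are nodes, relations are undirected edges, and the teleport mass is placed only on the context word.

```python
from collections import defaultdict

def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with teleport mass restricted to `seeds`.

    Hypothetical helper for illustration; damping and iterations are assumed values.
    """
    neighbours = defaultdict(set)
    for a, b in edges:  # relations are treated as undirected edges
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = list(neighbours)
    seeds = [s for s in seeds if s in neighbours]
    if not seeds:
        raise ValueError("no seed node occurs in the graph")
    # Context words are the source nodes: all teleport mass goes to them.
    teleport = {n: 0.0 for n in nodes}
    for s in seeds:
        teleport[s] += 1.0 / len(seeds)
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            share = damping * rank[n] / len(neighbours[n])
            for m in neighbours[n]:
                new_rank[m] += share
        rank = new_rank
    return rank

# Toy graph (invented): word nodes link to their senses, senses link to related senses.
edges = [
    ("bank", "bank#finance"), ("bank", "bank#river"),
    ("money", "money#1"), ("water", "water#1"),
    ("bank#finance", "money#1"), ("bank#river", "water#1"),
]
rank = personalized_pagerank(edges, seeds=["money"])
best_sense = max(["bank#finance", "bank#river"], key=rank.get)
print(best_sense)
```

With "money" as the only context word, the sense of "bank" that is closer to it in the graph receives more mass, which is the effect the slide describes.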
  • 10. UKB. Semantic relations
    - Dutch WordNet
    - English WordNet
    - Dutch WordNet ==> English WordNet
    - WordNet Domains
        - tennis player, tennis ball => tennis => SPORT
        - football player, football => soccer => SPORT
    - Annotation co-occurrence relations
        - Polysemous => monosemous
        - Polysemous => polysemous
  • 11. UKB. Graph relations
    Relation                                   Number
    Dutch synset - Dutch synset               140,219
    Domain - Domain                               125
    Dutch synset - Domain                      86,798
    Dutch synset - English synset              73,935
    English synset - English synset           252,392
    English synset - English gloss synset     419,387
    Annotation co-occurrences polysemous       17,152
    Annotation co-occurrences monosemous      151,598
    TOTAL                                   1,266,481
    Configurations: UKB-1, UKB-2, UKB-3; annotation co-occurrences (AC); UKB-4 = UKB-1 + AC; UKB-5 = UKB-3 + AC
  • 12. UKB. Evaluation
    System   Precision   Recall   F-measure
    UKB-1    0.4557      0.4491   0.4523
    UKB-2    0.4557      0.4491   0.4524
    UKB-3    0.4560      0.4493   0.4526
    UKB-4    0.6360      0.6272   0.6316
    UKB-5    0.6411      0.6322   0.6366
    For comparison, SemEval-2010 task on WSD in a specific domain (all-words task):
    - UKB3: 52.6 precision
    - English UKB: 48.1 precision
    - UKB5 and UKB4 gained 9 points on UKB3 due to co-occurrence relations
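The slide does not state how the F-measure is computed. Assuming the usual balanced F1 (the harmonic mean of precision and recall), the short check below reproduces the reported F-measures from the reported precision and recall to within rounding. The function name `f_measure` and the tolerance are choices made for this sketch.

```python
def f_measure(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure) as given on the slide.
reported = {
    "UKB-1": (0.4557, 0.4491, 0.4523),
    "UKB-2": (0.4557, 0.4491, 0.4524),
    "UKB-3": (0.4560, 0.4493, 0.4526),
    "UKB-4": (0.6360, 0.6272, 0.6316),
    "UKB-5": (0.6411, 0.6322, 0.6366),
}

# Difference between the recomputed and the reported F-measure per system.
deviation = {
    name: abs(f_measure(p, r) - f) for name, (p, r, f) in reported.items()
}
for name, diff in deviation.items():
    print(name, f"{diff:.5f}")
```

Small deviations are expected because the precision and recall values on the slide are themselves rounded to four decimals.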
  • 13. Tilburg WSD System
    - Based on TiMBL, a k-nearest-neighbour classifier (Daelemans et al., 2007)
    - Features:
        - Local context (words in a window around the target)
        - Global context (binary bag of words)
        - SoNaR category (domain label)
    - Parameter search:
        - Using TiMBL's leave-one-out feature
    - Evaluation:
        - 10 examples per sense for TEST
        - >= 15 examples per sense for TRAIN
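As a rough illustration of this setup, the sketch below builds local-context and binary bag-of-words features and classifies with a plain k-nearest-neighbour vote. It does not use TiMBL or its API: the functions `extract_features`, `similarity` and `classify`, the unweighted overlap similarity, k = 3 and the English toy sentences are all assumptions. The SoNaR category feature and the parameter search are left out.

```python
from collections import Counter

def extract_features(tokens, target_index, window=1):
    """Local context words around the target plus a binary bag of words."""
    local = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        local.append(tokens[i] if 0 <= i < len(tokens) else "_")  # "_" pads sentence edges
    bag = frozenset(t for j, t in enumerate(tokens) if j != target_index)
    return tuple(local), bag

def similarity(a, b):
    """Unweighted overlap: matching local slots plus shared bag-of-words items."""
    local_a, bag_a = a
    local_b, bag_b = b
    return sum(x == y for x, y in zip(local_a, local_b)) + len(bag_a & bag_b)

def classify(train, instance, k=3):
    """Majority vote over the k most similar training examples."""
    ranked = sorted(train, key=lambda ex: similarity(ex[0], instance), reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented training sentences: (tokens, index of the target word, sense label).
annotated = [
    ("she deposited money at the bank today".split(), 5, "finance"),
    ("the bank approved the loan".split(), 1, "finance"),
    ("we sat on the bank of the river".split(), 4, "river"),
    ("the river bank was muddy".split(), 2, "river"),
]
train = [(extract_features(toks, idx), sense) for toks, idx, sense in annotated]

test_tokens = "he took a loan from the bank".split()
predicted = classify(train, extract_features(test_tokens, 6))
print(predicted)
```

In this unweighted form, frequent function words count as much as content words; a memory-based learner such as TiMBL offers feature weighting and other distance metrics, which is what the parameter search on the slide tunes.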
  • 14. Tilburg WSD System. First results
    Feature set                     Token accuracy
    Words1                          0.6462
    Words1 + Bag-of-words           0.7259
    Words1 + PoS1 + Bag-of-words    0.7226
    Words1 + Bag-of-words + PS      0.7931
    - Bag-of-words: improvement of 8%
    - Parameter search (PS): improvement of another 7%
    - Previous experiments suggest that the best size for the context window is 1
  • 15. Tilburg WSD System. TiMBL confidence
    - TiMBL confidence threshold 0.55:
        - Precision 0.84 (+0.44 compared to no filtering)
        - F-score 0.78 (only 0.03 less than no filtering)
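The trade-off on this slide can be sketched as follows. The slide does not define the confidence score or the exact evaluation, so this sketch makes assumptions: each prediction carries a confidence in [0, 1], predictions below the threshold are discarded, precision is measured over the kept predictions and recall over all tokens. The function name `evaluate_with_threshold` and the toy predictions are invented; only the threshold 0.55 comes from the slide.

```python
def evaluate_with_threshold(predictions, threshold=0.55):
    """Precision over kept predictions, recall over all tokens, and their F1.

    `predictions` is a list of (gold_sense, predicted_sense, confidence) triples.
    """
    kept = [(gold, pred) for gold, pred, conf in predictions if conf >= threshold]
    correct = sum(gold == pred for gold, pred in kept)
    precision = correct / len(kept) if kept else 0.0
    recall = correct / len(predictions) if predictions else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Invented predictions, not project data.
predictions = [
    ("s1", "s1", 0.9),
    ("s2", "s2", 0.7),
    ("s1", "s2", 0.5),
    ("s2", "s2", 0.6),
    ("s1", "s1", 0.4),
]
filtered = evaluate_with_threshold(predictions, threshold=0.55)
unfiltered = evaluate_with_threshold(predictions, threshold=0.0)
print(filtered, unfiltered)
```

On the toy data, filtering removes the one wrong prediction and one correct low-confidence prediction, so precision rises while recall and F-score drop, the same direction of change as reported on the slide.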
  • 16. Active learning
    1. Obtain annotated data
    2. Train and evaluate the system
    3. Select words with accuracy < 80%
    4. Apply WSD to all tokens of the selected words that are not yet annotated
    5. Select tokens of meanings with F-score < 80%
  • 17. Active learning (continued)
    6. For each word meaning, rank all the tokens according to the combination (F-score) of:
        - TiMBL confidence
        - Distance to the nearest neighbour
    7. Select the 50 top-ranked tokens per meaning to be manually reviewed in 2 weeks
    8. Go to 1
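Steps 3 to 7 of the loop can be sketched as a single selection function. Several details are assumptions, since the slides do not spell them out: the "combination (F-score)" is read as the harmonic mean of the confidence and a closeness score `1 - distance` (with distance normalised to [0, 1]), higher scores rank first, and senses without an F-score count as low-performing. The function name `select_for_review`, the dictionary keys and the toy values are invented for illustration.

```python
from collections import defaultdict

def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

def select_for_review(word_accuracy, sense_fscore, candidates,
                      threshold=0.80, per_sense=50):
    """Pick the top-ranked automatically tagged tokens per weak sense.

    candidates: dicts with keys id, word, sense, confidence, distance (assumed layout).
    """
    # Step 3: words whose accuracy is below the threshold.
    weak_words = {w for w, acc in word_accuracy.items() if acc < threshold}
    by_sense = defaultdict(list)
    for tok in candidates:
        sense = tok["sense"]
        # Step 5: keep only tokens tagged with a low-performing meaning.
        if tok["word"] in weak_words and sense_fscore.get(sense, 0.0) < threshold:
            # Step 6: combined score (assumed: harmonic mean, closeness = 1 - distance).
            score = harmonic_mean(tok["confidence"], 1.0 - tok["distance"])
            by_sense[sense].append((score, tok["id"]))
    # Step 7: best-scoring tokens per meaning, highest score first.
    return {
        sense: [tid for _, tid in sorted(items, reverse=True)[:per_sense]]
        for sense, items in by_sense.items()
    }

# Invented evaluation results and WSD output (step 4), not project data.
word_accuracy = {"bank": 0.70, "huis": 0.92}
sense_fscore = {"bank#1": 0.85, "bank#2": 0.60, "huis#1": 0.95}
candidates = [
    {"id": "t1", "word": "bank", "sense": "bank#2", "confidence": 0.9, "distance": 0.1},
    {"id": "t2", "word": "bank", "sense": "bank#2", "confidence": 0.6, "distance": 0.5},
    {"id": "t3", "word": "bank", "sense": "bank#1", "confidence": 0.8, "distance": 0.2},
    {"id": "t4", "word": "huis", "sense": "huis#1", "confidence": 0.7, "distance": 0.3},
]
review = select_for_review(word_accuracy, sense_fscore, candidates)
print(review)
```

The returned tokens would go to the annotators for manual review, after which the reviewed examples join the training data and the loop restarts at step 1.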
  • 18. Future work
    - Fine-tune the active learning
    - Optimize the WSD systems
    - Combine different WSD systems
    - Test on independent texts in an all-words task
    - Apply the optimal system to the full corpora (over 500M tokens)
  • 19. Thanks to
    - Anneleen Schoen
    - Charlotte van Tongeren
    - Daphne van Kessel
    - Dieke Janssen
    - Elizabeth van Zutphen
    - Gratia Bruining
    - Jonica Kaagman
    - Laura Kipp
    - Lisanne Ranzijn
    - Marlisa Hommel
    - Wilma van Velzen
    - Milou Kerkhof
    - Sam Vossen
    - Niqee Vossen
    - Rosa Scheffer
    - Chantal van Son