
Resources for linguistically motivated Multilingual Anaphora Resolution

PhD defense presentation


  1. Resources for linguistically motivated Multilingual Anaphora Resolution
       • Kepa Joseba Rodríguez
       • Advisor: Massimo Poesio
       • 18 January 2011
  2. Outline
       1. Motivation of the research
       2. Contributions of this dissertation
       3. Limitations of previous annotation schemes
       4. Annotation scheme proposal
       5. Annotated data
       6. Usability of the data for anaphora resolution
       7. Use of the data
       8. Conclusions
  3. Motivation
       • Linguistic research: cross-linguistic studies of anaphora (Poesio et al. 2004)
       • Applications: summarization (Steinberger et al. 2007)
       • Applications: machine translation
           1. German: Peter hat Maria seine Blumen zum Gießen gegeben. Sie hat sie vertrocknen lassen.
           2. English (Babelfish): Peter gave Maria his flowers for pouring. Then it left it to dry.
           3. English (Google translate): Peter gave Mary flowers to his casting. Then she let them dry up.
           4. English (wanted): Peter gave Maria his flowers to water. Then she let them dry out.
  4. Contributions
       • Development of a linguistically motivated annotation scheme for anaphoric relations.
       • Implementation of the scheme for manual annotation of English and Italian data.
       • Creation of annotated data for English and Italian.
       • Use of the corpora for feature extraction and development of anaphora resolution systems in English and Italian.
       • Participation of the systems in SemEval 2010.
  5. Limitations of previous schemes (1)
       • Coverage of the annotation.
       • Annotation of reference.
       • Identification and annotation of discontinuity of semantic material.
       • Problem of multiple interpretations: ambiguity.
  6. Limitations of previous schemes (2)
       • Coverage of the annotation:
           • Annotated relations: only identity.
           • ACE-like annotation schemes constrain the annotation to noun phrases from a list of semantic types.
           • Genres: most annotation schemes focus the annotation on a few genres.
  7. Limitations of previous schemes (3)
       • Annotation of reference
           • Expletives: they are not considered.
               There are two people waiting for the interview.
           • Predication:
               • MUC, ACE: no distinction between predication and identity relation.
               • OntoNotes: no semantic criteria to decide which noun phrase is referring and which is a predicate.
               [The president of the bank] is [John Smith].
               [John Smith] is [the president of the bank].
           • Coordination: coordinated items are considered referring expressions in corpora like MUC or OntoNotes.
               [Milosevic or anyone else]
           • Nominals and proper names in premodifier position.
  8. Limitations of previous schemes (4)
       • Identification of discontinuous semantic material.
           Bill and Hillary Clinton
           black cars and bikes
       • Multiple interpretations are not captured.
           [The house] is on [a long street]. [It] is very dirty.
  9. Annotation scheme
       • Annotation of all noun phrases
       • Distinction between referring and non-referring expressions
       • Annotation of clitics attached to the verb and empty pronouns
       • Introduction of ambiguity
       • Introduction of discontinuous markables
       • Annotation of different kinds of relations: identity, discourse deixis and bridging.
  10. Reference
       • Markables are classified as referring or non-referring.
       • Non-referring markables are annotated with the type of non-referring expression.
       • Referring markables are annotated with:
           • Information status: new or old.
           • Semantic type
  11. Reference
       • Types of non-referring expressions
           • Expletives
               [There] are two people waiting for the interview
               The new car is [there]
           • Predicate: semantic criteria to distinguish predicate and referring expression.
               [Il presidente della Repubblica, [Giorgio Napolitano]]
               [The president of the bank] is [John Smith].
               [John Smith] is [the president of the bank].
           • Quantifiers: [All of [the box cars]]
           • Coordination.
           • Idiomatic expressions: by [the nape of [the neck]]
  12. Semantic types
       1. Person
       2. Animate
       3. Organization
       4. Facility
       5. Geopolitical entity (GPE)
       6. Location
       7. Temporal
       8. Numerical
       9. Concrete
       10. Abstract
       11. Event
       12. Other
       13. Unknown
  13. Annotation of ambiguity
       • A markable does not always have a unique interpretation.
           1. Be careful hooking up [the engine] to [the boxcar] because [it] is faulty.
           2. [The house] is on [a long street]. [It] is very dirty.
       • In case of ambiguity, we tag the markable as ambiguous and annotate the possible interpretations.
       • Other possible ambiguities are:
           • Information status: between new and old.
           • Old and not referring.
  14. List of annotated features
       • Agreement features
           • Gender
           • Number
           • Person
       • Grammatical function
       • Reference and information status
       • Semantic type
       • Type of non-referring
       • Link to antecedent
       • Ambiguity
       • Bridging
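The feature list above can be pictured as one record per markable. The following is a minimal sketch, not the storage format used for ARRAU or LMC: the class name `Markable`, its field names, the token-offset spans and the `validate` checks are all assumptions made for illustration. Only the feature inventory, the semantic types and the non-referring types come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Value inventories taken from the slides; the lowercase spellings are an assumption.
SEMANTIC_TYPES = {
    "person", "animate", "organization", "facility", "gpe", "location",
    "temporal", "numerical", "concrete", "abstract", "event", "other", "unknown",
}
NON_REFERRING_TYPES = {"expletive", "predicate", "quantifier", "coordination", "idiom"}
INFORMATION_STATUS = {"new", "old"}


@dataclass
class Markable:
    """Hypothetical record holding the annotated features of one markable."""
    markable_id: str
    spans: List[Tuple[int, int]]          # token offsets, end exclusive
    referring: bool
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    grammatical_function: Optional[str] = None
    information_status: Optional[str] = None
    semantic_type: Optional[str] = None
    non_referring_type: Optional[str] = None
    antecedents: List[str] = field(default_factory=list)  # candidate antecedent ids
    bridging_anchor: Optional[str] = None

    @property
    def is_discontinuous(self) -> bool:
        # More than one span means the markable skips over intervening tokens.
        return len(self.spans) > 1

    @property
    def is_ambiguous(self) -> bool:
        # More than one candidate antecedent models an ambiguous interpretation.
        return len(self.antecedents) > 1

    def validate(self) -> List[str]:
        """Return a list of consistency problems (empty if none were found)."""
        problems = []
        if self.referring:
            if self.information_status not in INFORMATION_STATUS:
                problems.append("referring markable needs information status new/old")
            if self.semantic_type not in SEMANTIC_TYPES:
                problems.append("referring markable needs a known semantic type")
            if self.non_referring_type is not None:
                problems.append("referring markable must not have a non-referring type")
        elif self.non_referring_type not in NON_REFERRING_TYPES:
            problems.append("non-referring markable needs a non-referring type")
        return problems


# "[The house] is on [a long street]. [It] is very dirty."
house = Markable("m1", [(0, 2)], True, information_status="new", semantic_type="concrete")
street = Markable("m2", [(4, 7)], True, information_status="new", semantic_type="concrete")
it = Markable("m3", [(8, 9)], True, information_status="old",
              semantic_type="concrete", antecedents=["m1", "m2"])

# "Bill and Hillary Clinton": the markable "Bill ... Clinton" skips two tokens.
bill_clinton = Markable("m4", [(0, 1), (3, 4)], True,
                        information_status="new", semantic_type="person")
```

Storing a list of spans and a list of antecedents, rather than a single value of each, is one simple way to make discontinuity and ambiguity representable in the same record.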
  15. Description of the annotated data
       • ARRAU (English)
           • Wall Street Journal texts
           • Trains dialogues
           • Gnome corpus
           • Pear stories
       • Live Memories Corpus for Italian (LMC)
           • Wikipedia sites
           • Blog sites
           • VENEX dataset
  16. Description: English corpus
       • WSJ dataset
           • 205 files
           • 147,600 words in 5,585 sentences. 47,900 markables.
           • 1% of discontinuous markables, 12.6% non-referring.
       • Trains dialogues
           • 35 files
           • 26,000 words in 4,600 sentences. 5,200 markables.
       • GNOME corpus
           • 5 files
           • 21,600 words in 1,000 sentences. 6,100 markables.
       • PEAR stories
           • 20 files
           • 14,000 words in 2,000 sentences. 3,900 markables.
  17. Description: Italian corpus
       • Wikipedia dataset:
           • 144 files.
           • 140,000 words in 4,700 sentences. 44,500 markables.
           • 0.5% discontinuous markables, 0.5% clitics attached to the verb, 4.5% empty subjects. 13.7% non-referring.
       • Blogs dataset:
           • 75 files.
           • 53,000 words in 2,230 sentences. 16,000 markables.
       • VENEX corpus:
           • 30 files
           • 20,300 words in 720 sentences
           • 6,220 markables
  18. Reliability of the annotation – ARRAU
       • A previous study of the annotation of anaphoric links was published by Poesio and Artstein (2008).
       • Metric: Krippendorff's α
       • α = 0.6-0.7
       • Statistics reflect the complexity of the task.
  19. Reliability of the annotation – LMC
       • Metric: Siegel and Castellan's κ
       • Information status and reference (old, new and non-referring): κ = 0.80
       • Basic annotation of the markable (new, phrase antecedent, segment antecedent, predicate, quantifier, expletive, coordination and idiom): κ = 0.79
           • Main disagreement: between discourse-new and predicate
       • Semantic type: κ = 0.85
  20. Reliability of the annotation – LMC
       • Link to the antecedent: κ = 0.88
       • Antecedent of clitics: κ = 0.84
       • Antecedent of empty pronouns: κ = 0.93
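For readers unfamiliar with the κ values reported for LMC, here is a minimal sketch of the coefficient for the two-coder case. It assumes the usual formulation in which expected agreement is computed from the category distribution pooled over both coders. The function name and the toy labels are invented for illustration and are not data from the LMC reliability study.

```python
from collections import Counter
from typing import Hashable, Sequence


def siegel_castellan_kappa(coder_a: Sequence[Hashable], coder_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement for two coders, with expected agreement
    estimated from a single category distribution pooled over both coders."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)

    # Observed agreement: share of items on which the coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Expected agreement: sum of squared pooled category proportions.
    pooled = Counter(coder_a) + Counter(coder_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())

    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (observed - expected) / (1 - expected)


# Invented toy labels (not thesis data): information status of ten markables.
coder_1 = ["old", "old", "old", "new", "new", "new", "new", "non-ref", "non-ref", "old"]
coder_2 = ["old", "old", "new", "new", "new", "new", "new", "non-ref", "non-ref", "old"]
kappa = siegel_castellan_kappa(coder_1, coder_2)
```

On these toy labels the coders agree on nine of ten items, yet κ is lower than 0.9 because part of that agreement would be expected by chance.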
  21. Use of the corpus for anaphora resolution (1)
       • Baseline proposed by Soon et al. (2001)
       • Classifier: MaxEnt
       • English data: ACE02, MUC-7 and ARRAU
       • Italian data: ICAB and LMC
       • Evaluation metrics:
           • MUC (Vilain et al. 1995)
           • CEAF (Luo 2005)
           • Link-based evaluation
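The Soon et al. (2001) baseline is a mention-pair model, and the step that depends most directly on the annotated corpus is the creation of training pairs. The sketch below follows the commonly described scheme: a positive pair for each anaphor and its closest preceding antecedent, and negative pairs for the mentions in between. The function name and input format are assumptions, and the slides do not show how the thesis systems implement this step.

```python
from typing import Dict, List, Optional, Tuple


def soon_training_instances(
    mentions: List[str],
    chain_of: Dict[str, Optional[str]],
) -> List[Tuple[str, str, bool]]:
    """Build (candidate antecedent, anaphor, is_coreferent) training pairs.

    mentions: mention ids in document order.
    chain_of: maps a mention id to its entity id, or None if it is in no chain.
    """
    instances = []
    for j, anaphor in enumerate(mentions):
        entity = chain_of.get(anaphor)
        if entity is None:
            continue  # mention is not part of any coreference chain

        # Search backwards for the closest preceding mention of the same entity.
        closest = None
        for i in range(j - 1, -1, -1):
            if chain_of.get(mentions[i]) == entity:
                closest = i
                break
        if closest is None:
            continue  # first mention of its chain: nothing to resolve

        instances.append((mentions[closest], anaphor, True))
        # Every mention between the antecedent and the anaphor is a negative example.
        for k in range(closest + 1, j):
            instances.append((mentions[k], anaphor, False))
    return instances


# Hypothetical five-mention document with two entities.
doc_mentions = ["m1", "m2", "m3", "m4", "m5"]
doc_chains = {"m1": "e1", "m2": "e2", "m3": None, "m4": "e1", "m5": "e2"}
pairs = soon_training_instances(doc_mentions, doc_chains)
```

Each pair would then be turned into a feature vector and passed to the classifier, which according to the slide is a MaxEnt model here.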
  22. Use of the corpus for anaphora resolution (2)
       English corpora: ARRAU, ACE, MUC

                         ACE Carafe   MUC-7    ACE02    ARRAU
       MUC               0.618        0.585    0.590    0.557
       CEAF-AGGR Φ-3     0.537        0.379    0.393    0.683
       CEAF-AGGR Φ-4     0.506        0.206    0.309    0.717
       Link-based        0.638        0.594    0.532    0.540
       Pronouns          0.686        0.492    0.597    0.558
       Nominals          0.355        0.455    0.239    0.352
       Names             0.638        0.817    0.784    0.763
  23. Use of the corpus for anaphora resolution (3)
       Italian corpora: LMC, ICAB

                         ICAB     LMC-Sys   LMC-Gold
       MUC               0.494    0.456     0.619
       CEAF-AGGR Φ-3     0.557    0.622     0.798
       CEAF-AGGR Φ-4     0.560    0.671     0.869
       Link-based        0.556    0.470     0.580
       Pronouns          0.452    0.520     0.521
       Nominals          0.421    0.303     0.522
       Names             0.741    0.642     0.752
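The MUC rows in the result tables refer to the link-based metric of Vilain et al. (1995). As a reminder of what it measures, here is a minimal sketch of the standard formulation. It assumes each mention belongs to at most one chain per side. The function names and the toy chains are invented, and this is not the scorer used to produce the numbers on the slides.

```python
from typing import Hashable, List, Set, Tuple

Chain = Set[Hashable]


def _muc_side(key_chains: List[Chain], response_chains: List[Chain]) -> float:
    """Recall of response_chains against key_chains (swap arguments for precision)."""
    chain_index = {}
    for idx, chain in enumerate(response_chains):
        for mention in chain:
            chain_index[mention] = idx

    numerator = denominator = 0
    for chain in key_chains:
        # Partition the key chain by the response chains; mentions that the
        # response leaves unlinked each form their own part.
        parts = {chain_index[m] for m in chain if m in chain_index}
        unlinked = sum(1 for m in chain if m not in chain_index)
        numerator += len(chain) - (len(parts) + unlinked)
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc_score(key: List[Chain], response: List[Chain]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under the MUC link-based metric."""
    recall = _muc_side(key, response)
    precision = _muc_side(response, key)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: the response splits one key chain and wrongly merges "c".
gold = [{"a", "b", "c"}, {"d", "e"}]
system = [{"a", "b"}, {"c", "d", "e"}]
precision, recall, f1 = muc_score(gold, system)
```

Because MUC counts links, it gives no credit for singleton chains, which is one reason it is usually reported alongside an entity-based metric such as CEAF.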
  24. Use of the corpus for anaphora resolution (4)
       • Use of C4 decision trees to compare the impact of individual features.
       • The impact of the baseline features is similar for English and Italian, with two exceptions:
           • The impact of gender matching is high in English, but it has no effect for Italian.
           • The use of automatically computed aliases has a high impact for Italian and a low impact for English.
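One way to see how a decision-tree learner ranks individual features is through its split criterion. The sketch below assumes the slide refers to a C4.5-style learner, which selects splits by gain ratio; the slides themselves do not say how feature impact was measured. The function names and the four toy mention pairs are invented and do not reproduce any thesis result.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of a feature divided by its split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)

    # Label entropy that remains after splitting on the feature.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder

    # Split information penalises features with many distinct values.
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


# Invented toy mention pairs (not thesis data): is the pair coreferent?
coreferent = [True, True, False, False]
informative_feature = [1, 1, 0, 0]    # perfectly predicts the label
uninformative_feature = [1, 0, 1, 0]  # independent of the label

high = gain_ratio(informative_feature, coreferent)
low = gain_ratio(uninformative_feature, coreferent)
```

Computing such a score for the same feature on the English and on the Italian training pairs would give a rough per-language comparison of the kind summarised on the slide.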
  25. Use of the data
       • 5th International Workshop on Semantic Evaluations (SemEval 2010)
           • Task: Coreference Resolution in Multiple Languages.
       • Comparative research on zero anaphora in Italian and Japanese
       • Training and evaluation of content extraction models in the Live Memories project.
  26. Conclusions
       • Linguistically motivated annotation scheme applicable to English and Italian.
       • Scheme used to annotate different genres: newspapers, encyclopedic text, dialogue, narrative and weblogs.
       • Corpora are usable to build anaphora resolution models.
       • Datasets have been used for international competitions and for linguistic research.
