Active Annotation of Corpora.

Text Analysis Seminar at the Göttingen Center of Digital Humanities. 02.05.2012


Transcript

  • 1. Active Annotation of Corpora
      Kepa J. Rodriguez
      Text Analysis Seminar at the Göttingen Center of Digital Humanities
      02.05.2012
  • 2. Outline
      • Goal of the presentation.
      • The LUNA corpus.
      • Active annotation.
          – Concept.
          – Algorithm.
          – Evaluation.
      • Potential use of Active Annotation in projects in humanities.
  • 3. Goal of the presentation
      • Introduce the concepts of:
          – Active Learning.
          – Active Annotation.
      • Present their use in the annotation of the LUNA corpus.
      • Discuss the utility of Active Annotation in projects in humanities.
  • 4. The LUNA Corpus (1)
      • The corpus consists of:
          – 3000 Human-Human and 8100 Wizard-of-Oz (WOZ) dialogues.
          – Multiple annotation levels: POS, entities, coreference, predicate structure, dialogue acts, etc.
          – Dialogues in French, Italian and Polish.
      • French subcorpus:
          – Application domains: travel information and reservation, IT help desk, telecom customer care and financial information transaction.
          – Human-Machine dialogues: 7100.
      • Italian subcorpus:
          – Application domain: IT help desk.
          – 2500 Human-Human and 500 WOZ dialogues.
      • Polish subcorpus:
          – Application domain: public transportation information.
          – 500 Human-Human and 500 WOZ dialogues.
      More information about the annotation scheme and levels: http://www.ist-luna.eu/pdf/schemepresentationPdm.pdf
  • 5. The LUNA Corpus (2)
      [Operator:] allora mha detto che [non riusciva]c1 ad [accedere]c2 [al computer]c3 e [le manca]c4 [la procedura]c5
      so, you have told me that you cannot access the computer, and that you need the procedure
          c1 trouble : unable_to
          c2 action : access
          c3 computer-hardware : pc
          c4 trouble : lack_of
          c5 computer-software : procedure
      [Caller:] esatto
      exactly
      [Operator:] allora avrei bisogno [dell RWS]c6 [del PC]c7
      so I need the RWS of the computer
          c6 code-identificationCode : rws
          c7 computer-hardware : pc
      [Caller:] si allora [tredici zero ottantasei]c8
      yes, 13 0 86
          c8 code-identificationCode-rws : 13086
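The annotated operator turn on this slide can be held in a small data structure. The following is a minimal sketch, not the LUNA project's actual format: the class names `Concept` and `Turn` and their field names are hypothetical and chosen only for illustration; the span labels, attributes and values are copied from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Concept:
    span_id: str    # label used on the slide, e.g. "c1"
    text: str       # annotated words of the utterance
    attribute: str  # concept attribute, e.g. "trouble"
    value: str      # concept value, e.g. "unable_to"


@dataclass
class Turn:
    speaker: str
    concepts: List[Concept]


# First operator turn of the slide, with its five annotated spans.
operator_turn = Turn(
    speaker="Operator",
    concepts=[
        Concept("c1", "non riusciva", "trouble", "unable_to"),
        Concept("c2", "accedere", "action", "access"),
        Concept("c3", "al computer", "computer-hardware", "pc"),
        Concept("c4", "le manca", "trouble", "lack_of"),
        Concept("c5", "la procedura", "computer-software", "procedure"),
    ],
)

# Collect the values of every span annotated with the "trouble" attribute.
troubles = [c.value for c in operator_turn.concepts if c.attribute == "trouble"]
print(troubles)  # ['unable_to', 'lack_of']
```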
  • 6. Active annotation (1)
      The components of active annotation are:
      • Active learning paradigm.
          – Selection of examples for annotation.
      • Potential error detection.
          – Cases in which the manual annotation seems to be ambiguous or contradictory.
  • 7. Active annotation (2)
      • Active learning paradigm:
          – A paradigm based on statistical learning.
          – A first small set is randomly chosen and manually annotated.
          – This set is used to train a model and annotate the rest of the samples.
          – The most informative examples are selected to update the statistical model.
              • Most informative = lowest confidence score.
      • Use of active learning:
          – Speed up annotation.
          – Support annotators in their work.
          – Select examples to be annotated: which examples from a large amount of data will be useful for my purposes?
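The selection rule "most informative = lowest confidence score" can be sketched as follows. This is an illustration under assumptions, not the code used in the LUNA project: the function name `select_least_confident`, the dialogue ids and the scores are all hypothetical, and the scores are assumed to come from whatever statistical model is in use.

```python
from typing import Dict, List


def select_least_confident(confidence: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k examples with the lowest confidence score."""
    # Sort by score in ascending order; ties are broken by id so that the
    # result is deterministic.
    ranked = sorted(confidence.items(), key=lambda item: (item[1], item[0]))
    return [example_id for example_id, _ in ranked[:k]]


# Hypothetical confidence scores assigned by a model to unannotated dialogues.
scores = {"dlg_01": 0.93, "dlg_02": 0.41, "dlg_03": 0.78, "dlg_04": 0.35}
batch = select_least_confident(scores, k=2)
print(batch)  # ['dlg_04', 'dlg_02']
```

Random selection would draw the batch uniformly instead; the learning-curve comparison on the next slide contrasts the two strategies.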
  • 8. Active annotation (3)
      Learning curve comparison: active vs. random learning (Riccardi and Hakkani-Tür, 2005)
  • 9. Active annotation (4)
      • Likely error detection:
          – Re-annotate the training data using the statistical model.
          – Extract the examples in which the manual annotation and the automatic annotation differ.
          – Send them to human supervision.
      • Use of the likely error detection:
          – If the manual annotation is correct, the example is hard to learn:
              • Analyze which new features can be implemented to enrich the model.
          – If the annotation is erroneous:
              • Correct it.
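The extraction step of likely error detection amounts to comparing two label assignments over the same training examples. A minimal sketch, assuming both annotations are available as dictionaries from a span id to a label; the function name `find_disagreements` and the sample labels are hypothetical.

```python
from typing import Dict, List, Tuple


def find_disagreements(
    manual: Dict[str, str], automatic: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (span_id, manual_label, automatic_label) for every span on
    which the manual and the automatic annotation differ."""
    return [
        (span_id, manual[span_id], automatic[span_id])
        for span_id in sorted(manual)
        if span_id in automatic and manual[span_id] != automatic[span_id]
    ]


# Manual annotation of three spans, and the model's re-annotation of them.
manual = {"c1": "trouble", "c2": "action", "c3": "computer-hardware"}
automatic = {"c1": "trouble", "c2": "action", "c3": "computer-software"}

# Each disagreement is sent to a human supervisor, who decides whether the
# manual label is wrong or the example is simply hard to learn.
to_review = find_disagreements(manual, automatic)
print(to_review)  # [('c3', 'computer-hardware', 'computer-software')]
```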
  • 10. Annotation algorithm
      1. Randomly select a small number of dialogues and annotate them manually from scratch (SL).
      2. Train a model M using SL.
      3. While (labeler/data available):
          a) Use M to automatically annotate the unannotated part of the corpus (Su).
          b) Rank the automatically annotated examples of Su according to the confidence measure given by M.
          c) Select a batch of k dialogues with the lowest score (Sk).
          d) Ask for human control/correction on Sk.
          e) Use M to automatically annotate SL and produce SaL.
          f) Look at the differences between SL and SaL:
              i. HARD TO LEARN EXAMPLE: add new features when training M.
              ii. ANNOTATION AMBIGUITIES: hire human annotators to disambiguate SL.
          g) SL = SL + Sk.
          h) Train a new model M with SL.
          i) Go to 3a.
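The loop above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not the LUNA implementation: `WordVoteModel` is a hypothetical stand-in for the statistical model M (each known word votes for the labels it was seen with), `ask_human` simulates the annotator by looking up a gold label, and the utterances and labels are invented. Step 3f is reduced to collecting the disagreements; adding features or resolving ambiguities is left to the human.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple


class WordVoteModel:
    """Toy stand-in for the model M: words vote for the labels they co-occurred with."""

    def __init__(self, labelled: Dict[str, str]) -> None:
        self.votes = defaultdict(Counter)
        self.default = Counter(labelled.values()).most_common(1)[0][0]
        for text, label in labelled.items():
            for word in text.split():
                self.votes[word][label] += 1

    def predict(self, text: str) -> Tuple[str, float]:
        """Return (label, confidence); confidence is the winning share of votes."""
        tally = Counter()
        for word in text.split():
            tally.update(self.votes.get(word, {}))
        if not tally:
            return self.default, 0.0  # no known word: lowest confidence
        label, count = tally.most_common(1)[0]
        return label, count / sum(tally.values())


def active_annotation(pool: List[str], ask_human: Callable[[str], str],
                      seed_size: int, k: int, rounds: int, rng: random.Random):
    pool = list(pool)
    rng.shuffle(pool)
    # Steps 1-2: annotate a random seed set SL and train M on it.
    labelled = {text: ask_human(text) for text in pool[:seed_size]}
    unlabelled = pool[seed_size:]
    model = WordVoteModel(labelled)
    flagged: List[str] = []
    for _ in range(rounds):
        if not unlabelled:
            break
        # Steps 3a-3c: score Su with M and take the k least confident examples.
        batch = sorted(unlabelled, key=lambda t: model.predict(t)[1])[:k]
        # Step 3d: the human checks or corrects the batch Sk.
        corrected = {text: ask_human(text) for text in batch}
        # Steps 3e-3f: re-annotate SL with M and flag the disagreements.
        flagged = [t for t, gold in labelled.items() if model.predict(t)[0] != gold]
        # Steps 3g-3h: SL = SL + Sk, then retrain M.
        labelled.update(corrected)
        unlabelled = [t for t in unlabelled if t not in corrected]
        model = WordVoteModel(labelled)
    # flagged holds the disagreements found in the last executed round.
    return model, labelled, flagged


# Invented utterances with gold labels; the lookup plays the human annotator.
UTTERANCES = {
    "cannot access the computer": "trouble",
    "the printer is broken": "trouble",
    "I need the procedure": "request",
    "I need the code": "request",
    "cannot open the file": "trouble",
    "please send the procedure": "request",
}
model, labelled, flagged = active_annotation(
    list(UTTERANCES), UTTERANCES.__getitem__,
    seed_size=2, k=2, rounds=5, rng=random.Random(0),
)
print(len(labelled))  # 6
```

With a seed of two utterances and batches of two, the whole toy pool is annotated after two rounds; in a real setting the loop stops when the annotation budget or the data runs out.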
  • 11. Evaluation (2)
      • Annotator point of view:
          – Annotation from scratch: 80-90 minutes/file.
          – Supervision after the 3rd active annotation loop: 20-25 minutes/file.
          – Annotators concentrate more on:
              • Difficult/interesting issues.
              • Giving feedback about the model.
      • Error detection: no statistics.
          – Most of the reported feedback requests were annotation errors.
          – Some of the reported feedback requests were caused by ambiguities and helped to add features to enrich the model.
  • 12. Evaluation (1)
      • Wizard of Oz dialogues

          Act-turn   Size in turns   Error rate
          1          200             59.2%
          2          400             44.4%
          3          600             39.3%
          4          800             6.4%
          5          1200            0.0%

      • Human-human dialogues

          Act-turn   Size in dialogues   Error rate
          1          10                  71.2%
          2          20                  59.5%
          3          30                  54.0%
          4          40                  51.1%
          5          60                  45.7%
          6          80                  42.4%
  • 13. Discussion
      • Questions.
      • Annotation tasks in the GCDH:
          – Corpus of Coptic Texts.
          – …..
  • 14. References
      • LUNA project: http://www.ist-luna.eu
      • Raymond, Rodriguez and Riccardi (2008): Active Annotation in the LUNA Italian Corpus of Spontaneous Dialogues. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco.
      • Riccardi, G. and Hakkani-Tür, D. (2005): Active learning: theory and applications to automatic speech recognition. In IEEE Transactions on Speech and Audio Processing.
  • 15. Thanks!!!
      Text Analysis Seminar at the Göttingen Center of Digital Humanities