Active Annotation of Corpora
Kepa J. Rodriguez
Text Analysis Seminar at the Göttingen Center of Digital Humanities
02.05.2012
Outline
• Goal of the presentation.
• The LUNA corpus.
• Active annotation.
  – Concept.
  – Algorithm.
  – Evaluation.
• Potential use of Active Annotation in projects in the humanities.
Goal of the presentation
• Introduce the concepts of:
  – Active Learning.
  – Active Annotation.
• Present their use in the annotation of the LUNA corpus.
• Discuss the utility of Active Annotation in projects in the humanities.
The LUNA Corpus (1)
• The corpus consists of:
  – 3000 Human-Human and 8100 Wizard-of-Oz (WOZ) dialogues.
  – Multiple annotation levels: POS, entities, coreference, predicate structure, dialogue acts, etc.
  – Dialogues in French, Italian and Polish.
• French subcorpus:
  – Application domains: travel information and reservation, IT help desk, telecom customer care and financial information transaction.
  – Human-Machine dialogues: 7100.
• Italian subcorpus:
  – Application domain: IT help desk.
  – 2500 Human-Human and 500 WOZ dialogues.
• Polish subcorpus:
  – Application domain: public transportation information.
  – 500 Human-Human and 500 WOZ dialogues.
More information about the annotation scheme and levels:
http://www.ist-luna.eu/pdf/schemepresentationPdm.pdf
The LUNA Corpus (2)
[Operator:] allora mha detto che [non riusciva]c1 ad [accedere]c2 [al computer]c3 e [le manca]c4 [la procedura]c5
so, you have told me that you cannot access the computer, and that you need the procedure
  c1 trouble : unable_to
  c2 action : access
  c3 computer-hardware : pc
  c4 trouble : lack_of
  c5 computer-software : procedure
[Caller:] esatto
exactly
[Operator:] allora avrei bisogno [dell RWS]c6 [del PC]c7
so I need the RWS of the computer
  c6 code-identificationCode : rws
  c7 computer-hardware : pc
[Caller:] si allora [tredici zero ottantasei]c8
yes, 13 0 86
  c8 code-identificationCode-rws : 13086
Active annotation (1)
The components of active annotation are:
• Active learning paradigm
  – Selection of examples for annotation.
• Potential error detection
  – Cases in which the manual annotation seems to be ambiguous or contradictory.
Active annotation (2)
• Active learning paradigm:
  – A paradigm based on statistical learning.
  – A first small set is randomly chosen and manually annotated.
  – This set is used to train a model, which annotates the rest of the samples.
  – The most informative examples are selected to update the statistical model.
    • Most informative = lowest confidence score.
• Use of active learning:
  – Speed up annotation.
  – Support annotators in their work.
  – Select examples to be annotated: which examples from a large amount of data will be useful for my purposes?
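The selection step above ("most informative = lowest confidence score") can be sketched in a few lines of Python. This is only an illustration: the function name `select_least_confident`, the dialogue ids and the confidence values are invented for the example and do not come from the LUNA project.

```python
from typing import Dict, List


def select_least_confident(confidences: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k examples with the lowest model confidence."""
    # Sort by confidence; ties are broken by id so the result is deterministic.
    ranked = sorted(confidences, key=lambda ex_id: (confidences[ex_id], ex_id))
    return ranked[:k]


# Hypothetical confidence scores produced by some model for four dialogues.
scores = {
    "dialogue_01": 0.97,
    "dialogue_02": 0.41,
    "dialogue_03": 0.88,
    "dialogue_04": 0.35,
}

batch = select_least_confident(scores, k=2)
print(batch)  # ['dialogue_04', 'dialogue_02']
```

The two dialogues the model is least sure about are the ones sent to the annotators first.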
Active annotation (3)
[Figure: learning curve comparison, active vs. random learning (Riccardi and Hakkani-Tür, 2005)]
Active annotation (4)
• Likely error detection:
  – Re-annotate the training data using the statistical model.
  – Extract the examples in which the manual and the automatic annotation differ.
  – Send them to human supervision.
• Use of the likely error detection:
  – If the manual annotation is correct, the example is hard to learn:
    • Analyze which new features can be implemented to enrich the model.
  – If the annotation is erroneous:
    • Correct it.
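A minimal sketch of the likely error detection step, assuming the manual and the automatic annotation are both available as mappings from a concept id to a label. The function name `find_disagreements` and the model output below are hypothetical; the manual labels reuse the concept names from the LUNA example dialogue.

```python
from typing import Dict, List, Tuple


def find_disagreements(
    manual: Dict[str, str], automatic: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (id, manual label, automatic label) wherever the two differ."""
    return [
        (ex_id, manual[ex_id], automatic[ex_id])
        for ex_id in sorted(manual)
        if ex_id in automatic and manual[ex_id] != automatic[ex_id]
    ]


# Manual annotation of four concepts from the training data.
manual = {
    "c1": "trouble:unable_to",
    "c2": "action:access",
    "c3": "computer-hardware:pc",
    "c4": "trouble:lack_of",
}
# Hypothetical output of the model when re-annotating the same data.
automatic = {
    "c1": "trouble:unable_to",
    "c2": "action:access",
    "c3": "computer-software:procedure",
    "c4": "trouble:lack_of",
}

to_review = find_disagreements(manual, automatic)
print(to_review)  # [('c3', 'computer-hardware:pc', 'computer-software:procedure')]
```

Each returned triple is a candidate for human supervision: either the manual label is wrong and gets corrected, or it is right and the example is hard for the model to learn.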
Annotation algorithm
1. Randomly select a small number of dialogues and annotate them manually from scratch (SL).
2. Train a model M using SL.
3. While (labeler/data available):
   a) Use M to automatically annotate the unannotated part of the corpus (Su).
   b) Rank the automatically annotated examples of Su according to the confidence measure given by M.
   c) Select a batch of k dialogues with the lowest score (Sk).
   d) Ask for human control/correction of Sk.
   e) Use M to automatically annotate SL and produce SaL.
   f) Look at the differences between SL and SaL:
      i. HARD TO LEARN EXAMPLE: add new features when training M.
      ii. ANNOTATION AMBIGUITIES: hire human annotators to disambiguate SL.
   g) SL = SL + Sk.
   h) Train a new model M with SL.
   i) Go to 3a.
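The loop can be sketched end to end with a toy setup. Everything below is an assumption made for illustration and not the LUNA implementation: the "model" only counts which words co-occur with which label, an `oracle` dictionary stands in for the human annotator, and short phrases stand in for whole dialogues. Comments map each line to the step numbers of the algorithm.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Model = Dict[str, Counter]


def train(labeled: Dict[str, str]) -> Model:
    """Toy model: count how often each word occurs with each label."""
    model: Model = defaultdict(Counter)
    for text, label in labeled.items():
        for word in text.split():
            model[word][label] += 1
    return model


def annotate(model: Model, text: str) -> Tuple[str, float]:
    """Return (predicted label, confidence) for one example."""
    votes: Counter = Counter()
    for word in text.split():
        votes.update(model.get(word, Counter()))
    if not votes:
        return "unknown", 0.0  # no known word: lowest possible confidence
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())


def active_annotation(
    corpus: List[str], oracle: Dict[str, str],
    seed_size: int = 2, k: int = 2, seed: int = 0,
) -> Tuple[Dict[str, str], List[str]]:
    rng = random.Random(seed)
    # Step 1: random seed set, annotated "manually" (here: by the oracle).
    labeled = {t: oracle[t] for t in rng.sample(corpus, seed_size)}
    model = train(labeled)                                    # step 2
    flagged: List[str] = []
    while len(labeled) < len(corpus):                         # step 3
        unlabeled = [t for t in corpus if t not in labeled]
        # 3a-3b: annotate Su and rank by confidence (ties broken by text).
        ranked = sorted(unlabeled, key=lambda t: (annotate(model, t)[1], t))
        batch = ranked[:k]                                    # 3c
        corrected = {t: oracle[t] for t in batch}             # 3d
        # 3e-3f: re-annotate SL and flag disagreements for human review.
        new_flags = [
            t for t, gold in labeled.items()
            if annotate(model, t)[0] != gold and t not in flagged
        ]
        flagged.extend(new_flags)
        labeled.update(corrected)                             # 3g
        model = train(labeled)                                # 3h
    return labeled, flagged


# Hypothetical toy corpus with its correct labels.
oracle = {
    "cannot access computer": "trouble",
    "cannot open file": "trouble",
    "need the procedure": "request",
    "need a password": "request",
    "computer is slow": "trouble",
    "send the password": "request",
}
labeled, flagged = active_annotation(list(oracle), oracle)
print(len(labeled), "examples annotated;", len(flagged), "flagged for review")
```

In a real setting step 3d is interactive, the loop stops when the annotation budget runs out rather than when the corpus is exhausted, and the flagged examples of step 3f feed back into feature design or re-annotation.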
Evaluation (2)
• Annotator point of view:
  – Annotation from scratch: 80-90 minutes/file.
  – Supervision after the 3rd active annotation loop: 20-25 minutes/file.
  – Annotators concentrate more on:
    • Difficult/interesting issues.
    • Giving feedback about the model.
• Error detection: no statistics.
  – Most of the reported feedback requests were annotation errors.
  – Some of the reported feedback requests were caused by ambiguities and helped to add features to enrich the model.
Discussion
• Questions.
• Annotation tasks in the GCDH:
  – Corpus of Coptic Texts.
  – …..
References
• LUNA project: http://www.ist-luna.eu
• Raymond, Rodriguez and Riccardi (2008): Active Annotation in the LUNA Italian Corpus of Spontaneous Dialogues. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco.
• Riccardi, G. and Hakkani-Tür, D. (2005): Active learning: theory and applications to automatic speech recognition. In IEEE Transactions on Speech and Audio Processing.
Thanks!