BenG Update on automatic labelling
1. MM P05 automatic labeling
term extraction
Victor de Boer
Josefien Schuurman
Roeland Ordelman
2. Term extraction from TT888
• Input:
– TT888 subtitles
• Output:
– GTAA terms
• Onderwerpen (subjects)
• Persoonsnamen (person names)
• Namen (names)
• Geografische namen (geographic names)
– For the entire video (corresponds to documentalist tasks)
3. Planning
• version 0.1
– 'naive baseline'
– Test input and output
• version 0.2
– Multiple GTAA axes
– Improve statistics
– Discussion with metadata management
• version 0.3
– More improvements
– Evaluation
• version 1.0
– To be reimplemented
http://www.recensiekoning.nl/2011/09/48928/ondertiteling
4. Implementation details
• Java to make integration easier
• XML and CSV outputs
– URI of GTAA term
– pref-label
– Confidence value
– Axis
• Input comes from the Immix OAI API, where segmentation should already have taken place
– Algorithm expects one OAI identifier (Expressie or Selectie)
• Matching with GTAA using ElasticSearch instance
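As a rough illustration of the per-term XML/CSV output fields listed above, a minimal Java sketch; class and field names (and the example URI and confidence value) are assumptions for illustration, not the actual implementation:

```java
// Minimal sketch of the per-term output record (names are illustrative,
// not the actual implementation).
public class ExtractedTerm {
    final String gtaaUri;      // URI of the GTAA term
    final String prefLabel;    // preferred label, e.g. "theater"
    final double confidence;   // matching confidence (e.g. a normalized ElasticSearch score)
    final String axis;         // GTAA axis, e.g. "Onderwerpen"

    ExtractedTerm(String gtaaUri, String prefLabel, double confidence, String axis) {
        this.gtaaUri = gtaaUri;
        this.prefLabel = prefLabel;
        this.confidence = confidence;
        this.axis = axis;
    }

    // One line of the CSV output described above.
    String toCsv() {
        return String.join(",", gtaaUri, prefLabel, String.valueOf(confidence), axis);
    }

    public static void main(String[] args) {
        ExtractedTerm t = new ExtractedTerm(
                "http://data.beeldengeluid.nl/gtaa/002151", "theater", 0.83, "Onderwerpen");
        System.out.println("uri,prefLabel,confidence,axis");
        System.out.println(t.toCsv());
    }
}
```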
5. version 0.1
For every item
1. Get TT888 words in a frequency list
2. Discard stop words (‘de’, ‘het’, ‘op’, ‘naar’..)
3. Take all words with freq > n
4. Match with GTAA “Onderwerpen” with ElasticSearch score > m
– Preflabel + altlabel
[Diagram: OAI input (TT888) → stop-word filtering → algorithm → GTAA match, e.g. gtaa:002151 "theater"]
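A minimal Java sketch of steps 1-3 of this baseline; the stop-word list, example subtitle text, and threshold value are illustrative, and the ElasticSearch field names mentioned in the comment (prefLabel/altLabel) are assumptions:

```java
import java.util.*;
import java.util.stream.*;

// Minimal sketch of version 0.1, steps 1-3: frequency list, stop-word
// filtering and a frequency threshold.
public class NaiveBaseline {
    static final Set<String> STOP_WORDS = Set.of("de", "het", "op", "naar", "een", "en");

    public static void main(String[] args) {
        String tt888 = "vanavond in het theater een avond over theater en cabaret";
        int n = 1; // frequency threshold from the slide ("freq > n"), illustrative value

        Map<String, Long> freq = Arrays.stream(tt888.toLowerCase().split("\\s+"))
                .filter(w -> !STOP_WORDS.contains(w))                            // step 2: discard stop words
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));  // step 1: frequency list

        List<String> candidates = freq.entrySet().stream()
                .filter(e -> e.getValue() > n)                                   // step 3: freq > n
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        // Step 4 would send each candidate to the ElasticSearch GTAA index,
        // matching on preferred and alternative labels of "Onderwerpen" and
        // keeping hits with score > m, e.g. with a query body like:
        // {"query": {"multi_match": {"query": "theater", "fields": ["prefLabel", "altLabel"]}}}
        System.out.println(candidates); // [theater]
    }
}
```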
6. version 0.1
Informal evaluation:
Compare to existing (historical) labels ("Onderwerpen")
Works somewhat (< 20% correct). Input for version 0.2
[Same pipeline diagram as above]
7. version 0.2
• Intermediate version; uses a Named Entity Recognizer. Results discussed with Lisette and Vincent -> version 0.3
[Diagram: OAI input (TT888) → stop words, Named Entity Recognition, Dutch word frequencies → algorithm → GTAA matches "theater", "Jos Brink", "Amsterdam"]
8. Named Entity Recognition
• Webservice CLTL @ VU
• Input:
– “Hallo, mijn naam is Victor de Boer en ik woon in de mooie stad Haarlem. Ik werk nu bij het
Nederlands Instituut voor Beeld en Geluid in Hilversum. Hiervoor was ik werkzaam bij de
Vrije Universiteit. “
• Output:
[ Victor de Boer | PERSON ],
[ Haarlem | LOCATION ],
[ Nederlands | MISC ],
[ Instituut voor Beeld en Geluid | ORGANIZATION ],
[ Hilversum | LOCATION ],
[ Vrije Universiteit | ORGANIZATION ]
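A minimal sketch of consuming the recognizer output in the bracketed [ text | TYPE ] notation shown above; the actual CLTL webservice response format may differ, so this parser is only an assumption for illustration:

```java
import java.util.regex.*;

// Minimal sketch: parse NER results in the "[ text | TYPE ]" notation shown
// on the slide into (surface form, type) pairs.
public class NerOutputParser {
    private static final Pattern ENTITY = Pattern.compile("\\[\\s*([^|\\]]+?)\\s*\\|\\s*([A-Z]+)\\s*\\]");

    public static void main(String[] args) {
        String nerOutput = "[ Victor de Boer | PERSON ], [ Haarlem | LOCATION ], "
                + "[ Nederlands | MISC ], [ Instituut voor Beeld en Geluid | ORGANIZATION ], "
                + "[ Hilversum | LOCATION ], [ Vrije Universiteit | ORGANIZATION ]";

        Matcher m = ENTITY.matcher(nerOutput);
        while (m.find()) {
            String surfaceForm = m.group(1);
            String type = m.group(2);
            // PERSON entities would then be matched against "Persoonsnamen",
            // LOCATION against "Geografische namen", ORGANIZATION against "Namen".
            System.out.println(type + "\t" + surfaceForm);
        }
    }
}
```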
9. version 0.3
For every item
1. Track 1
1. Get TT888 words in a frequency list
2. Discard stop words (‘de’, ‘het’, ‘op’, ‘naar’..)
3. Take all N-GRAMS with normalized frequency > n
4. Match with GTAA “Onderwerpen” with score > m
2. Track 2
1. Present TT888 to Named Entity Recognizer (VU-webservice)
2. Match the result (with freq > L) with GTAA "Persoonsnamen", "Geografische namen", "Onderwerpen", "Namen"
[Pipeline diagram as in version 0.2: OAI input (TT888) → stop words, Named Entity Recognition, Dutch word frequencies → algorithm → GTAA matches "theater", "Jos Brink", "Amsterdam"]
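A minimal sketch of track 1's n-gram candidates, reading "normalized frequency" simply as count divided by total tokens; the "Word freq NL" box in the diagram suggests the real version may instead weigh against a background Dutch word-frequency list, so this reading and all values are assumptions:

```java
import java.util.*;

// Minimal sketch of version 0.3, track 1: unigram/bigram candidates with a
// normalized frequency (count / total tokens).
public class NgramCandidates {
    public static void main(String[] args) {
        List<String> tokens = List.of("vrije", "universiteit", "theater", "vrije", "universiteit", "cabaret");
        double n = 0.2; // normalized-frequency threshold from the slide ("> n"), illustrative value

        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            counts.merge(tokens.get(i), 1, Integer::sum);                                  // unigrams
            if (i + 1 < tokens.size()) {
                counts.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);    // bigrams
            }
        }

        int total = tokens.size();
        counts.forEach((ngram, count) -> {
            double normFreq = count / (double) total;
            if (normFreq > n) {
                // These candidates would then be matched against GTAA "Onderwerpen"
                // in ElasticSearch, keeping hits with score > m.
                System.out.println(ngram + "\t" + normFreq);
            }
        });
    }
}
```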
11. Evaluation
• Setup
– 4 evaluators (Vincent, Lisette, Alma, Tim)
• 3 in one 50 min session
• 1 in another session
– ~8 minutes per item
– Video + extracted terms
• Videos opened in the IE browser
• GTAA URIs + prefLabels
• Any other info allowed
– Five-point Likert scale
• Only precision, no recall
The evaluation scale used. 0 means truly wrong (e.g. a wrong homonym) or truly not relevant (the wrong person). Since these interact, the scale cannot be split out much further.
0: Term is not relevant at all
1: Term is not relevant
2: Term is somewhat relevant
3: Term is relevant
4: Term is highly relevant
13. Results
• Total of 70 terms for 13 videos (5.4 terms per video)
– Some videos did not start -> discarded
– 38 terms with three evaluations
– 32 with one
15. Example of disagreement
• Term “Milwaukee”
– Top2000 a gogo
Eval 1 -> score = 3
"The term in itself is not very relevant, but in combination with Romme, Gianni it is still valuable. Again: NER gains strength if the user also gets a time code and can play the fragment back to check whether it is relevant for their search/reuse."
Eval 3 -> score = 1
"mentioned twice, not relevant"
Eval 2 -> score = 1
“…”
16. Inter-annotator agreement
Pearson   eval1   eval2   eval3
eval1     1
eval2     0.52    1
eval3     0.67    0.58    1
eval4     0.78    x       0.92

Agreement between evaluators 3 and 4 is large, between 1 and 4 substantial;
between 1 and 2, 1 and 3, and 2 and 3 it is lower but acceptable.
The task is fairly objective, but somewhat subjective.
We mainly look at averages in what follows.
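For reference, a minimal sketch of the Pearson correlation used for the agreement figures above; the score arrays are illustrative only, not the actual evaluation data:

```java
// Minimal sketch of Pearson correlation between two evaluators' Likert scores.
public class PearsonAgreement {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double numerator = n * sumXY - sumX * sumY;
        double denominator = Math.sqrt(n * sumX2 - sumX * sumX) * Math.sqrt(n * sumY2 - sumY * sumY);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Likert scores (0-4) that two evaluators gave to the same terms (illustrative data).
        double[] evalA = {3, 4, 1, 0, 2, 4};
        double[] evalB = {2, 4, 1, 1, 3, 4};
        System.out.printf("Pearson r = %.2f%n", pearson(evalA, evalB));
    }
}
```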
17. Results: average scores
• Total average of 2.15 (“beetje relevant”+)
At threshold of 2: Precision = 0.61
At threshold of 3: Precision = 0.36
[Plot of average evaluator score (0-4) per term]
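A minimal sketch of how precision at a threshold can be read here, assuming a term counts as correct when its average evaluator score is at least the threshold; the average scores below are illustrative, not the real data:

```java
import java.util.List;

// Minimal sketch of "precision at a threshold" over averaged Likert scores.
public class PrecisionAtThreshold {
    static double precisionAt(List<Double> averageScores, double threshold) {
        long correct = averageScores.stream().filter(s -> s >= threshold).count();
        return (double) correct / averageScores.size();
    }

    public static void main(String[] args) {
        List<Double> averages = List.of(3.7, 2.3, 0.3, 4.0, 1.7, 2.6, 3.0, 0.0);
        System.out.println("P@2 = " + precisionAt(averages, 2.0)); // 0.625
        System.out.println("P@3 = " + precisionAt(averages, 3.0)); // 0.375
    }
}
```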
18. Results per video
item      average
item 1    0.33
item 3    2.61
item 5    2.44
item 6    1.75
item 8    1.40
item 9    3.67
item 10   2.45
item 13   2.38
item 14   0.00 (!)
item 15   1.33
item 17   2.08
item 19   4.00 (!)
item 20   1.67
• For some videos we shouldn't do this
– Nederland in Beweging
– Metadata at Reeks (series) level
"Advice: exclude Level 1 programmes from keyword extraction, probably also from NER"
20. Evaluator remarks
• For some videos this shouldn’t be done
– Game shows, drama..
– Annotate at Reeks level
• Some axes seem to work better than others
– Persoonsnamen, Namen, Geografische namen
• More abstraction or combination would be helpful
– Semantic Clustering?
• Subtitles with * are song lyrics
• Still a need for time-coded terms
21. Conclusion and current steps
• Limited evaluation
• But it works (precision 0.61)
– With some tweaks, to 0.7-0.8
• Lower threshold for NEs, higher for Subjects
• Better Elasticsearch matching
– With semantic clustering to 0.8-0.9?
• Currently re-implemented by Arjen as a proper service
• Re-use for annotating program guides
22. A huge thanks to the annotators for their valuable effort!!
Questions?