Multimodal Person Discovery in Broadcast TV
Johann Poignant / Hervé Bredin / Claude Barras
2015
bredin@limsi.fr / herve.niderb.fr / @hbredin
MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

We describe the “Multimodal Person Discovery in Broadcast TV” task of MediaEval 2015 benchmarking initiative. Participants were asked to return the names of people who can be both seen as well as heard in every shot of a collection of videos. The list of people was not known a priori and their names had to be discovered in an unsupervised way from media content using text overlay or speech transcripts. The task was evaluated using information retrieval metrics, based on a posteriori collaborative annotation of the test corpus.

http://ceur-ws.org/Vol-1436/
http://www.multimediaeval.org



  1. Multimodal Person Discovery in Broadcast TV
     Johann Poignant / Hervé Bredin / Claude Barras
     2015
     bredin@limsi.fr / herve.niderb.fr / @hbredin
  2. Outline • Motivations • Definition of the task • Datasets • Baseline & Metadata • Evaluation protocol • Organization • Results • Conclusion
  3. Motivations
  4. "the usual suspects" • Huge TV archives are useless if not searchable • People ❤ people • Need for person-based indexes
  5. REPERE • Evaluation campaigns in 2012, 2013 and 2014 (three French consortia funded by ANR) • Multimodal people recognition in TV documents: "who speaks when?" and "who appears when?" • Led to significant progress in both supervised and unsupervised multimodal person recognition
  6. From REPERE to Person Discovery • Speaking faces: focus on "person of interest" • Unsupervised approaches: people may not be "famous" at indexing time • Evidence: archivist/journalist use case
  7. Definition of the task
  8. Input: broadcast TV pre-segmented into shots.
  9. Speaking face: tag each shot with the names of people both speaking and appearing at the same time.
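The tagging rule above can be sketched as an interval intersection. This is a minimal illustration, not the official baseline; the per-name speech turns and face tracks are hypothetical inputs (e.g. produced by speaker diarization and face tracking).

```python
# Sketch of the "speaking face" rule: a shot is tagged with a name only
# if that person is both speaking and visible during the shot.
# Data structures are hypothetical, for illustration only.

def overlaps(seg, start, end):
    """True if segment seg = (s, e) intersects the [start, end] interval."""
    s, e = seg
    return s < end and e > start

def tag_shot(shot, speech_turns, face_tracks):
    """Return the names of people both speaking and appearing in a shot.

    shot:         (start, end) in seconds
    speech_turns: {name: [(start, end), ...]}  e.g. from speaker diarization
    face_tracks:  {name: [(start, end), ...]}  e.g. from face tracking
    """
    start, end = shot
    speaking = {n for n, segs in speech_turns.items()
                if any(overlaps(s, start, end) for s in segs)}
    visible = {n for n, segs in face_tracks.items()
               if any(overlaps(s, start, end) for s in segs)}
    return speaking & visible
```

For example, a person who speaks before the shot but only appears during it would not be tagged, since both conditions must hold within the same shot.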
  10. Person discovery • Prior biometric models are not allowed. • Person names must be discovered automatically, in text overlay or speech utterances. Unsupervised approaches only.
  11. Evidence: associate each name with a unique shot proving that the person actually holds this name.
  12. Evidence (cont.) • An image evidence is a shot during which a person is visible and their name is written on screen. • An audio evidence is a shot during which a person is visible and their name is pronounced at least once during a [shot start time - 5s, shot end time + 5s] neighborhood. (Figure: shot #3 is evidence for B, shot #1 for A.)
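The audio-evidence rule can be sketched as a temporal neighborhood test. This assumes per-name pronunciation timestamps are already available (e.g. from aligned speech transcripts); the function and its inputs are illustrative, not part of the official tooling.

```python
# Sketch of the audio-evidence rule: the shot counts as audio evidence
# for a name if the person is visible in the shot AND the name is
# pronounced at least once within [shot start - 5s, shot end + 5s].

NEIGHBORHOOD = 5.0  # seconds, from the task definition

def is_audio_evidence(shot, visible, pronounced_at):
    """shot: (start, end) in seconds;
    visible: whether the person appears during the shot;
    pronounced_at: timestamps (seconds) at which the name is spoken."""
    start, end = shot
    return visible and any(start - NEIGHBORHOOD <= t <= end + NEIGHBORHOOD
                           for t in pronounced_at)
```

So a name pronounced 3 seconds before the shot starts still yields audio evidence, while one pronounced 10 seconds after the shot ends does not.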
  13. Datasets
  14. Datasets
      DEV | REPERE: 137 hours • two French TV channels • eight different types • dense audio annotations (speaker diarization, speech transcription) • sparse video annotations (face detection & recognition, optical character recognition)
      TEST | INA: 106 hours (172 videos) • only one French TV channel • only one type (news) • no prior annotation • a posteriori collaborative annotation
  15. Datasets
      TEST | INA: 106 hours (172 videos) • only one French TV channel • only one type (news) • no prior annotation • a posteriori collaborative annotation • http://dataset.ina.fr
  16. Evaluation protocol
  17. Information retrieval task • Queries formatted as firstname_lastname, e.g. francois_hollande: "return all shots where François Hollande is speaking and visible at the same time" • Approximate search among submitted names, e.g. for a misspelled query francois_holande • Select shots tagged with the most similar name
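The approximate-search step can be sketched with Python's difflib as a stand-in string-similarity measure; the official evaluation may use a different distance and threshold, so treat the cutoff below as an arbitrary illustration.

```python
# Sketch of approximate query matching: a misspelled query such as
# "francois_holande" is resolved to the most similar submitted name.
# difflib is a stand-in similarity measure, not the official one.

import difflib

def resolve_query(query, submitted_names):
    """Return the submitted name most similar to the query, or None
    if nothing is similar enough (cutoff chosen for illustration)."""
    matches = difflib.get_close_matches(query, submitted_names,
                                        n=1, cutoff=0.8)
    return matches[0] if matches else None
```

The selected shots are then those tagged with the resolved name, per the slide above.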
  18. Evidence-weighted MAP
      Standard MAP: MAP = (1 / |Q|) Σ_{q ∈ Q} AP(q)
      To ensure participants do provide correct evidence for every hypothesized name, standard MAP is altered into EwMAP (Evidence-weighted Mean Average Precision), the official metric for the task:
      EwMAP = (1 / |Q|) Σ_{q ∈ Q} C(q) · AP(q)
      where C(q) measures the correctness of the provided evidence for query q: C(q) = 1 if the evidence is correct (ρ_q > 0.95), 0 otherwise.
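A minimal sketch of both metrics, assuming per-query average precision AP(q) and evidence correctness C(q) have already been computed upstream:

```python
# Sketch of MAP and EwMAP as defined on the slide: EwMAP is standard MAP
# with each AP(q) weighted by C(q), the correctness of the evidence
# provided for query q. Inputs are assumed precomputed per query.

def mean_average_precision(ap):
    """Standard MAP: mean of AP(q) over all queries q."""
    return sum(ap.values()) / len(ap)

def ewmap(ap, evidence_correct):
    """EwMAP = (1/|Q|) * sum over q of C(q) * AP(q).

    ap:               {query: average precision AP(q)}
    evidence_correct: {query: bool}  -> C(q) in {1, 0}
    """
    return sum(ap[q] * (1.0 if evidence_correct[q] else 0.0)
               for q in ap) / len(ap)
```

A run with perfect retrieval but wrong evidence for a query thus loses that query's entire AP contribution, which is the intended incentive.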
  19. Baseline & Metadata
  20. The "multi" in multimedia: the task necessitates expertise in various domains: • multimedia • computer vision • speech processing • natural language processing. The technological barrier to entry is lowered by the provision of a baseline system: github.com/MediaevalPersonDiscoveryTask
  21. Baseline fusion
  22. Was the baseline useful? (Table: for each of the 9 participants and each module (face tracking, face clustering, speaker diarization, optical character recognition, automatic speech recognition, speaking face detection, fusion), whether the team relied on the baseline module, developed their own module, or tried both.)
  23. (Same table, annotated: some teams focused on monomodal components on top of the baseline, others on multimodal fusion.)
  24. Organization
  25. Schedule
      01.05 development set release
      01.06 test set release
      01.07 "out-domain" submission deadline
      01.07 to 08.07 leaderboard updated every 6 hours
      08.07 "in-domain" submission deadline
      09.07 to 28.07 collaborative annotation
      28.07 test set annotation release
      28.07 to 28.08 adjudication
  26. Leaderboard • computed on a secret subset of the test set • updated every 6 hours • private leaderboard: participants know how they rank, and know their own score, but do not know the score of others
  27. Collaborative annotation
  28. Thanks! Around 50 hours of annotation in total.
  29. Thanks! Half of the test set annotated ✔
  30. Collaborative annotation platform (github.com/camomile-project)
      Objectives: provide an architecture for effective creation and sharing of annotations of multimedia, multimodal and multilingual data • enable novel collaborative and interactive annotation tools • be generic enough to support as many use cases as possible.
      Open source, client/server architecture: bring your own client.
      JSON data model: corpus (homogeneous collection of multimedia documents) > medium (a multimedia document: audio, video, image or text) > layer (homogeneous collection of annotations) > annotation (medium fragment, e.g. a time span, with attached metadata: categorical value, numerical value, free text or raw content).
      REST API: GET /corpus/:idcorpus (obtain information about a corpus) • POST /corpus/:idcorpus/medium (add a medium to a corpus) • PUT /layer/:idlayer/annotation (update an annotation) • DEL /annotation/:idannotation (delete an annotation) • and more: permissions, annotation history, annotation queue, user authentication.
      Online documentation: http://camomile-project.github.io/camomile-server
      (Slide marked as an ADVERTISEMENT.)
  31. Results
  32. Stats • 14 (+ organizers) registered participants received dev and test data • 8 (+ organizers) participants submitted 70 runs • 7 (+ organizers) submitted a working note paper • 7 (+ organizers) attended the workshop
  33. Out-domain results: 28k shots, 2070 queries, 1642 # vs. 428 $. (Chart: EwMAP (%) for PRIMARY runs.)
  34. Out-domain results: 28k shots, 2070 queries, 1642 # vs. 428 $. (Chart: EwMAP (%) for PRIMARY runs.)
  35. In-domain results. (Chart: EwMAP (%) for PRIMARY runs.)
  36. In-domain results. (Chart: EwMAP (%) for PRIMARY runs.)
  37. Out- vs. in-domain. (Chart.)
  38. Number of people that no other team discovered. (Chart: per-team counts, with EwMAP (%) for PRIMARY runs; on-chart annotations: "speech transcripts?", "anchors".)
  39. Conclusion • Had a great (and exhausting) time organizing this task • The winning submission did NOT make any use of the face modality • (Almost) nobody used ASR: most people are introduced by overlaid names in the test set • No cross-show approaches • Poster session later today at 16:00 • Technical retreat tomorrow morning at 9:15
  40. Next (oral) • Mateusz Budnik / LIG at MediaEval 2015 Multimodal Person Discovery in Broadcast TV Task • Meriem Bendris / PERCOLATTE: a multimodal person discovery system in TV broadcast for the MediaEval 2015 evaluation campaign • Rosalia Barros / GTM-UVigo Systems for Person Discovery Task at MediaEval 2015
  41. Next (poster) • Claude Barras / Multimodal Person Discovery in Broadcast TV at MediaEval 2015 • Johann Poignant / LIMSI at MediaEval 2015: Person Discovery in Broadcast TV Task • Paula Lopez Otero / GTM-UVigo Systems for Person Discovery Task at MediaEval 2015 • Javier Hernando / UPC System for the 2015 MediaEval Multimodal Person Discovery in Broadcast TV task • Guillaume Gravier / SSIG and IRISA at Multimodal Person Discovery • Nam Le / EUMSSI team at the MediaEval Person Discovery Challenge
