Czech Malach Cross-lingual
Speech Retrieval Test Collection
Petra Galuščáková
galuscakova@ufal.mff.cuni.cz
Institute of Formal and Applied Linguistics
Charles University in Prague
5. 3. 2016
2
USC Shoah Foundation's
Visual History Archive
● Established to collect and preserve
the testimonies of survivors and other witnesses of the
Holocaust
● Founded in 1994 by Steven Spielberg
● Interviews with Jewish survivors, Roma and Sinti survivors,
liberators, survivors of eugenics policies, political prisoners,
aid providers, homosexual survivors, war crimes trial
participants, ...
● Almost 52 000 videotaped testimonies in 56 countries and 32
languages collected between 1994 and 2000
● One of the largest available audio-visual archives
● http://sfi.usc.edu/
3
Malach Centre
for Visual History
● Provides local access to the
digital archives of the USC Shoah Foundation
● Need to retrieve relevant segments of interviews
● Provide a test collection for the retrieval system
created in the Malach project
● http://ufal.mff.cuni.cz/cvhm
4
Czech Malach Cross-lingual
Speech Retrieval Test Collection
● 353 audio recordings (592 hours of audio) randomly
selected from the set of Czech interviews
● Four automatic transcripts by different providers
● Manual topical annotations
● Manually entered metadata (PIQ, Thesaurus)
● Planned to be published in April 2016
● http://ufal.mff.cuni.cz/malach-test-collection
5
Audience
● Historians, teachers, students
● Information Retrieval (IR)
● Cross-lingual IR
● CLEF 2006, 2007 Cross-Language Speech Retrieval
Track
● Speech processing
● Sentiment analysis
● Machine translation
● Social studies
...
6
Collection
● Form of interviews
● Average length: 1 hour
and 41 minutes
● Recorded on tapes
(~ 30 minutes long),
which were digitized
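The stated average length can be checked against the collection size given earlier (353 recordings, 592 hours of audio). A minimal sketch of that arithmetic, using only those two figures:

```python
# Figures taken from the collection overview: 353 recordings, 592 hours of audio.
recordings = 353
total_hours = 592

# Average length per recording in minutes, rounded to the nearest minute.
avg_minutes = total_hours * 60 / recordings
hours, minutes = divmod(round(avg_minutes), 60)

print(f"{hours} h {minutes} min")  # prints: 1 h 41 min
```

The result agrees with the average of 1 hour and 41 minutes quoted above.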
7
Transcripts
● Provided by IBM (2003), The Johns Hopkins
University (2004, 2006) and
University of West Bohemia (2013)
● In 1-best, MLF and XML formats
● Lattices available for 2013 transcripts
● XML transcripts are morphologically tagged
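A minimal sketch of reading an MLF transcript, assuming the files follow the usual HTK Master Label File layout (a `#!MLF!#` header, a quoted label-file pattern per recording, lines of `start end label`, and a terminating `.`). The slides do not describe the exact layout used in the collection, so the field order, the 100 ns time unit, and the name `tape_0001` are assumptions for illustration only.

```python
from dataclasses import dataclass

# Assumption: times are in HTK's 100 ns units, i.e. 10,000,000 units per second.
HTK_UNITS_PER_SECOND = 10_000_000


@dataclass
class Word:
    start: float  # seconds
    end: float    # seconds
    label: str


def parse_mlf(text):
    """Return {label-file pattern: [Word, ...]} parsed from HTK-style MLF text."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue  # skip blank lines and the file header
        if line.startswith('"') and line.endswith('"'):
            current = line.strip('"')  # start of a new recording entry
            result[current] = []
        elif line == ".":
            current = None  # end of the current entry
        elif current is not None:
            fields = line.split()
            if len(fields) >= 3:  # any further fields (e.g. scores) are ignored
                start, end, label = fields[0], fields[1], fields[2]
                result[current].append(
                    Word(int(start) / HTK_UNITS_PER_SECOND,
                         int(end) / HTK_UNITS_PER_SECOND,
                         label)
                )
    return result


# Hypothetical two-word sample, not taken from the collection.
sample = '''#!MLF!#
"*/tape_0001.lab"
0 4500000 dobrý
4500000 9000000 den
.
'''
words = parse_mlf(sample)["*/tape_0001.lab"]
for w in words:
    print(w.start, w.end, w.label)
```

Keeping the word-level times makes it possible to map retrieved text back to a playback position in the recording.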
8
Topics
● Annotators manually marked topically coherent segments
and assigned a single topic to each detected segment.
● The set of topics was created for the annotation of the VHA.
● Topics for the Czech collection were selected from this set.
● Some of the topics were adapted to better reflect Czech
realities.
● 5,375 annotations for 118 topics by 6 annotators (librarians
and historians)
● Divided into training, test and excluded sets
● All topics are in Czech and English
● Some topics are also in French, German and Spanish
9
Topic Examples I
● Topic 1173: Children's art in Terezin
  Description: We are looking for the description of the art-related
  activities of children in Terezin such as music, plays, paintings,
  writings and poetry
  Narrative: The relevant material should include discussions of such
  activities and how they influenced the survival and following life
  of the children. Any episodes where the interviewee demonstrates
  examples of such an art are highly relevant.
● Topic 1286: Music in the Holocaust
  Description: Tell us if music helped (spiritually or otherwise) or
  hindered the prisoners interned in concentration camps
  Narrative: Descriptions of what role music played in the life of
  the prisoners.
10
Topic Examples II
● Daily life in Terezin
● Jewish children in schools
● The liberation of Buchenwald and Dachau
● Jewish partisans in Italy
● Strengthening faith
● Hidden children and rescuers
● Bombing of Birkenau and Buchenwald
● Minsk ghetto underground
...
11
Annotations I
● Several topics were annotated by two annotators
● 2 topics annotated by all annotators
● Search Guided Relevance Assessments
● Set of possible relevant segments was automatically
restricted by an IR system, Thesaurus keywords, and PIQ
● Annotators entered queries and watched the retrieved
parts of recordings
● Each topic was processed in approximately 20 hours
● Highly-ranked Assessments
● Annotators manually evaluated runs submitted to the CLEF
campaign.
12
Annotations II
● Average segment length is 167 seconds
● On average, 44 relevant segments were found
for each topic.
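Relevance judgments of this kind are typically used to score ranked retrieval results. The slides do not state which evaluation measure is used with the collection; the sketch below uses uninterpolated average precision as one common choice, and the segment identifiers are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision of one ranked list for one topic."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    seen = set()
    for rank, segment_id in enumerate(ranked_ids, start=1):
        # Count each relevant segment only the first time it is retrieved.
        if segment_id in relevant and segment_id not in seen:
            hits += 1
            precision_sum += hits / rank
        seen.add(segment_id)
    # Relevant segments that were never retrieved contribute zero.
    return precision_sum / len(relevant)


# Hypothetical example: two relevant segments, retrieved at ranks 2 and 4.
ranked = ["s3", "s1", "s7", "s2"]
relevant = {"s1", "s2"}
ap = average_precision(ranked, relevant)
print(ap)  # prints: 0.5
```

Averaging this value over all test topics gives mean average precision for a system run.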
13
Thesaurus
● English Thesaurus with 60,000 keywords
● Terms are hierarchically organized
● Label, definition and scope
● Alternative labels (synonyms)
● Czech Thesaurus
● Labels were translated manually
● Some definitions (e.g. the complete categories Culture,
Daily Life, Discrimination, Liberation) and scope notes
were translated manually
● The rest of the Thesaurus was translated automatically
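One way to picture a Thesaurus entry is as a record with a label, definition, scope, alternative labels, and a link to its parent term. The following is only a sketch under that assumption: the field names, identifiers, and the sample entries are hypothetical and do not reflect the actual Thesaurus file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    term_id: int
    label: str
    definition: str = ""
    scope: str = ""
    alt_labels: List[str] = field(default_factory=list)  # synonyms
    parent_id: Optional[int] = None  # None marks a top-level term


def path_to_root(term_id: int, terms: Dict[int, Term]) -> List[str]:
    """Return labels from the given term up to its top-level ancestor."""
    path = []
    visited = set()
    current = terms.get(term_id)
    while current is not None and current.term_id not in visited:
        visited.add(current.term_id)  # guard against accidental cycles
        path.append(current.label)
        if current.parent_id is None:
            break
        current = terms.get(current.parent_id)
    return path


# Hypothetical entries for illustration only.
terms = {
    1: Term(1, "Culture"),
    2: Term(2, "music", alt_labels=["songs"], parent_id=1),
}
print(path_to_root(2, terms))  # prints: ['music', 'Culture']
```

Walking up the hierarchy in this way lets a keyword assigned to a segment also match queries phrased in terms of its broader category.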
14
Conclusion
● Czech Malach Collection
● Cleaned manual topic annotations of segments
in recordings
● Translations of topics
● Partially manually translated Thesaurus
● Cross-Language Speech Retrieval
16
Thank you
http://ufal.mff.cuni.cz/malach-test-collection
