
Czech Malach Cross-lingual Speech Retrieval Test Collection



Presentation of the Czech Malach collection of Holocaust survivors' testimonies with topical annotations, given at the Archives Unleashed Hackathon.

Published in: Data & Analytics


  1. Czech Malach Cross-lingual Speech Retrieval Test Collection
     Petra Galuščáková, Institute of Formal and Applied Linguistics, Charles University in Prague
     5. 3. 2016
  2. USC Shoah Foundation's Visual History Archive
     ● Established to collect and preserve the testimonies of survivors and other witnesses of the Holocaust
     ● Founded in 1994 by Steven Spielberg
     ● Interviews with Jewish survivors, Roma and Sinti survivors, liberators, survivors of the eugenics policies, political prisoners, aid providers, homosexual survivors, war crimes trials participants, ...
     ● Almost 52,000 videotaped testimonies in 56 countries and 32 languages, collected between 1994 and 2000
     ● One of the largest available audio-visual archives
  3. Malach Centre for Visual History
     ● Provides local access to the digital archives of the USC Shoah Foundation
     ● Need to retrieve relevant segments of interviews
     ● Provide a test collection for the retrieval system created in the Malach project
  4. Czech Malach Cross-lingual Speech Retrieval Test Collection
     ● 353 audio recordings (592 hours of audio) randomly selected from the set of Czech interviews
     ● Four automatic transcripts by different providers
     ● Manual topical annotations
     ● Manually entered metadata (PIQ, Thesaurus)
     ● Planned to be published in April 2016
  5. Audience
     ● Historians, teachers, students
     ● Information Retrieval (IR)
     ● Cross-lingual IR
     ● CLEF 2006, 2007 Cross-Language Speech Retrieval Track
     ● Speech processing
     ● Sentiment analysis
     ● Machine translation
     ● Social studies ...
  6. Collection
     ● Recordings are in the form of interviews
     ● Average length: 1 hour and 41 minutes
     ● Recorded on tapes (~ 30 minutes long), which were digitized
  7. Transcripts
     ● Provided by IBM (2003), The Johns Hopkins University (2004, 2006) and University of West Bohemia (2013)
     ● In 1-best, MLF and XML format
     ● Lattices available for 2013 transcripts
     ● XML transcripts are morphologically tagged
  8. Topics
     ● Annotators manually marked topically coherent segments and assigned a single topic to each detected segment.
     ● The set of topics was created for the annotation of the VHA.
     ● Topics for the Czech collection were selected from this set.
     ● Some of the topics were adapted to better reflect the Czech realities.
     ● 5,375 annotations for 118 topics by 6 annotators (librarians and historians)
     ● Divided into training, test and excluded sets
     ● All topics are in Czech and English
     ● Some topics are also in French, German and Spanish
  9. Topic Examples I
     ● Number: 1173
       Name: Children's art in Terezin
       Description: We are looking for the description of the art-related activities of children in Terezin such as music, plays, paintings, writings and poetry
       Narrative: The relevant material should include discussions of such activities and how they influenced the survival and following life of the children. Any episodes where the interviewee demonstrates examples of such an art are highly relevant.
     ● Number: 1286
       Name: Music in the Holocaust
       Description: Tell us if music helped (spiritually or otherwise) or hindered the prisoners interned in concentration camps
       Narrative: Descriptions of what role music played in the life of the prisoners.
  10. Topic Examples II
     ● Daily life in Terezin
     ● Jewish children in schools
     ● The liberation of Buchenwald and Dachau
     ● Jewish partisans in Italy
     ● Strengthening faith
     ● Hidden children and rescuers
     ● Bombing of Birkenau and Buchenwald
     ● Minsk ghetto underground ...
  11. Annotations I
     ● Several topics were doubly annotated
     ● 2 topics were annotated by all annotators
     ● Search Guided Relevance Assessments
       ● The set of possibly relevant segments was automatically restricted by an IR system, Thesaurus keywords, and PIQ
       ● Annotators entered queries and watched the retrieved parts of recordings
       ● Each topic was processed in approximately 20 hours
     ● Highly-ranked Assessments
       ● Annotators manually evaluated runs submitted to the CLEF campaign.
  12. Annotations II
     ● The average segment length is 167 seconds
     ● On average, 44 relevant segments were found for each topic.
  13. Thesaurus
     ● English Thesaurus with 60,000 keywords
       ● Terms are hierarchically organized
       ● Label, definition and scope
       ● Alternative labels (synonyms)
     ● Czech Thesaurus
       ● Labels were translated manually
       ● Part of the definitions (e.g. complete categories Culture, Daily Life, Discrimination, Liberation) and scope translated manually
       ● The rest of the Thesaurus was translated automatically
  14. Conclusion
  15. Conclusion
     ● Czech Malach Collection
     ● Cleared manual annotations of topics of segments in recordings
     ● Translations of topics
     ● Partially manually translated Thesaurus
     ● Cross-Language Speech Retrieval
  16. Thank you
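
The collection figures quoted on the slides can be cross-checked with a little arithmetic. The sketch below is not part of the original presentation; the function names are hypothetical. It derives the mean recording length from the 353 recordings and 592 hours of audio, and estimates how much relevant audio an average topic covers, assuming (this is an assumption, the slides do not say so) that the 167-second average segment length also holds for relevant segments.

```python
# Figures quoted on the slides.
NUM_RECORDINGS = 353            # audio recordings in the collection
TOTAL_HOURS = 592               # total hours of audio
AVG_SEGMENT_SECONDS = 167       # average annotated segment length
AVG_RELEVANT_PER_TOPIC = 44     # average number of relevant segments per topic


def average_recording_length(total_hours, num_recordings):
    """Return the mean recording length as (hours, minutes),
    rounded to the nearest minute."""
    total_minutes = round(total_hours * 60 / num_recordings)
    return divmod(total_minutes, 60)


def relevant_audio_minutes_per_topic(segments, seconds_per_segment):
    """Estimate minutes of relevant audio per topic.

    Assumption: the average segment length applies to relevant
    segments as well; the slides do not confirm this.
    """
    return segments * seconds_per_segment / 60


if __name__ == "__main__":
    hours, minutes = average_recording_length(TOTAL_HOURS, NUM_RECORDINGS)
    print(f"Mean recording length: {hours} h {minutes} min")
    estimate = relevant_audio_minutes_per_topic(
        AVG_RELEVANT_PER_TOPIC, AVG_SEGMENT_SECONDS
    )
    print(f"Estimated relevant audio per topic: {estimate:.0f} min")
```

The first result reproduces the "1 hour and 41 minutes" average stated on the Collection slide. The second is only a rough estimate of about two hours of relevant audio per topic under the stated assumption.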