EMMA Summer School - Jorge Civera - Being multilingual with EMMA

Jorge plans to run this session so that it meets the following objectives:

To motivate the need for transcriptions and translations in the context of MOOCs.
To motivate the need for automatic transcriptions and translations to reduce user review effort: automatic transcriptions and translations are not perfect, but reviewing them is faster than generating them from scratch.
To show how automatic transcription and translation systems are built, mainly to convey the necessity of adapting these systems to the content to be transcribed and translated.
To provide a hands-on, step-by-step tutorial on the transcription and translation platform (TTP), simulating the transcription and translation of a course. This tutorial will be divided into two parts corresponding to the two workflows on TTP (transcription and translation of video content, and translation of text content).
To set up the tutorial, Jorge plans to create an account on the TTP for each participant in the Summer School, so that they can experiment with it. For this he will need to know the native language(s) of all participants to select the appropriate content for each one.

This presentation was given during the EMMA Summer School, which took place in Ischia (Italy) on 4-11 July 2015.

More info on the website: http://project.europeanmoocs.eu/project/get-involved/summer-school/

Follow our MOOCs: http://platform.europeanmoocs.eu/MOOCs

Design and deliver your MOOC with EMMA: http://project.europeanmoocs.eu/project/get-involved/become-an-emma-mooc-provider/

Published in: Education

  1. Being multilingual with EMMA
     Jorge Civera
     EMMA Summer School
     jcivera@dsic.upv.es
     Tuesday 7th July, 2015
  2. Index
     1. Presentation
     2. Multilingual access to MOOCs
     3. Video subtitling
        • Transcription
        • Translation
     4. Document translation
     5. Conclusions and Discussion
     UPV - Being multilingual with EMMA 2 / 21
  3. Presentation
     • Lecturer at the Department of Computer Systems and Computation
     • Machine Learning and Language Processing (MLLP) group (mllp.upv.es)
     • Automatic Speech Recognition:
       – Already supported: English (En), Spanish (Es), Italian (It), Dutch (Nl), Estonian (Et), Portuguese (Pt), French (Fr) and Catalan (Ca)
       – In progress: German (De) and Slovene (Sl)
     • Machine Translation:
       – Language pairs available: En → {Es, It, Fr, Ca} and {Es, It, Nl, Et, Pt, Fr, Ca} → En
     • Speech Synthesis:
       – Already supported: English (En) and Spanish (Es)
     • Experience in EU projects providing multilingual access to educational content:
       – transLectures and EMMA
  4. Presentation
     • transLectures (Nov 2011 - Oct 2014)
       – Lowering the language barrier to access video repositories by providing multilingual subtitles
       – Improving subtitles by massive adaptation and intelligent interaction
       – VideoLectures.NET (VL) and poliMedia (pM) video repositories with thousands of hours
       – Source languages: English and Slovene in VL and Spanish in pM
       – Target languages: Spanish, French, German, Slovene and English
     • EMMA (Feb 2014 - Jul 2016)
       – Providing multilingual access to MOOCs (videos and documents)
       – A few hours of video in 7 languages: En, Es, It, Nl, Et, Pt and Fr
       – Source language is the national language of the MOOC provider
       – Target languages: English, Spanish and Italian
  5. Multilingual access to MOOCs
     • Most MOOCs are offered in only a few languages
       – English (45%), Spanish (32%), French (14%) and other languages (9%)
     • The language barrier is keeping millions of potential learners from taking MOOCs
     • What components in a MOOC need to be translated?
       – Texts
       – Images
       – Videos
       – Conversations (Forums)
     • EMMA currently tackles the translation of texts and videos
     • Videos are translated by providing subtitles in the target language
  6. Cost of translating MOOCs
     Texts
     • Manual translation rate is approximately 2,500 words per day
     • A 6-week course with 75,000 words takes 1.5 person-months (PM) to be translated
     Videos
     • Before translating, videos are manually transcribed (10 RTF)
     • Then, transcriptions are translated into the desired language (30 RTF)
     • A course including 2 hours of video takes 0.5 PM to be translated
     Solutions to lower costs
     • Crowdsourcing (TED talks)
     • Speech Recognition and Machine Translation to generate draft translations
       – User effort to translate a course is reduced to 30% - 50% (0.6 - 1 PM)
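The figures on the cost slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the slide does not say how a person-month is defined, so it assumes 20 working days per person-month and 8 working hours per day, which happen to match the slide's totals. The function names are hypothetical.

```python
# Assumptions (NOT stated on the slide): a person-month (PM) is
# 20 working days of 8 hours each.
WORK_DAYS_PER_PM = 20
WORK_HOURS_PER_DAY = 8


def text_cost_pm(words, words_per_day=2500):
    """Person-months needed to translate `words` words manually."""
    return words / words_per_day / WORK_DAYS_PER_PM


def video_cost_pm(video_hours, transcription_rtf=10, translation_rtf=30):
    """Person-months needed to transcribe and then translate video manually.

    RTF (Real Time Factor) is working time divided by video duration.
    """
    work_hours = video_hours * (transcription_rtf + translation_rtf)
    return work_hours / (WORK_HOURS_PER_DAY * WORK_DAYS_PER_PM)


text_pm = text_cost_pm(75_000)    # 6-week course with 75,000 words
video_pm = video_cost_pm(2)       # 2 hours of video
total_pm = text_pm + video_pm     # manual cost of the whole course

# Reviewing automatic drafts is reported to take 30% - 50% of the manual effort.
reviewed_low_pm = 0.3 * total_pm
reviewed_high_pm = 0.5 * total_pm

print(f"text: {text_pm:.1f} PM, video: {video_pm:.1f} PM, total: {total_pm:.1f} PM")
print(f"with automatic drafts: {reviewed_low_pm:.1f} - {reviewed_high_pm:.1f} PM")
```

Under these assumptions the text costs 1.5 PM, the video 0.5 PM, and the reviewed workflow 0.6 to 1 PM, in line with the slide.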
  7. Overview of automatic video subtitling
     • Step-by-step process:
       1. Generation of automatic transcriptions from video
       2. Manual review of automatic transcriptions to correct transcription errors
       3. Generation of automatic translations from the manually reviewed transcription
       4. Manual review of automatic translations to generate final subtitles
     • State-of-the-art technology cannot provide perfect automatic subtitles
     • However, it significantly reduces the effort to generate multilingual subtitles
     • User effort saving depends on automatic transcription+translation accuracy
     • You can contribute to improving transcription and translation accuracy
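The four-step subtitling process can be sketched as a small pipeline. This is not the TTP implementation or its API: `subtitle_video` and its three callable parameters are hypothetical placeholders that stand in for the automatic systems and the human reviewer.

```python
def subtitle_video(video, transcribe, review, translate, target_langs):
    """Illustrative workflow only; every name here is hypothetical.

    transcribe(video) -> draft transcription           (step 1, automatic)
    review(text) -> corrected text                     (steps 2 and 4, manual)
    translate(text, lang) -> draft translation         (step 3, automatic)
    """
    draft_transcript = transcribe(video)      # step 1
    transcript = review(draft_transcript)     # step 2

    subtitles = {}
    for lang in target_langs:
        # Step 3 starts from the reviewed transcription, not from the draft,
        # so transcription errors are not carried into the translation.
        draft_translation = translate(transcript, lang)
        subtitles[lang] = review(draft_translation)   # step 4
    return transcript, subtitles
```

The ordering is the point of the sketch: translation is only run after the transcription has been reviewed, which is why reviewed transcriptions matter for the quality of the draft translations.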
  8. How to improve transcription accuracy
     • Transcription systems learn to transcribe from examples
       – At least 50 hours of videos (audio) previously transcribed to learn the acoustic model
       – Texts in millions of words to learn the language model

       Language      Videos (hours)   Text (Mwords)
       Dutch         532              628
       English       620              464000
       Estonian      130              410
       French        88               1800
       German        36               135
       Portuguese    54               573
       Italian       54               868
       Slovene       27               224
       Spanish       128              654

     • Adaptation of transcription systems to the specific videos is key for high accuracy
       – Availability of videos manually transcribed with similar acoustic conditions
       – Availability of text resources related to the video in question
         ∗ Title is used to retrieve related documents from Google
         ∗ Slides contain most of the words uttered by the lecturer
         ∗ Documents: text content from the course, additional text resources (bibliography)
  9. Why automatic transcriptions
     • Quality of automatic transcription can be impressive, but it greatly depends on:
       – Availability of transcribed videos and related text materials
       – Sound quality of the video
       – Complexity of language involved (phonetics and grammar)
     • All in all, high-accuracy fully automatic transcription is not possible
     • Automatic transcriptions need to be manually reviewed
     • Reviewing automatic transcription is much faster than doing it from scratch
     • Transcriptions are not only needed to generate automatic translations:
       – Non-native speakers and hearing-impaired persons
       – Text searchability and analysis
       – Summarisation
       – Video recommendation and relation
     • Reviewed transcriptions are important to generate usable draft automatic translations
  10. Reviewing automatic transcriptions
     • Once a video is ingested into the system, a draft transcription is automatically generated
     • Transcribed videos are available for review using a web interface
     • Yet another slide and hands-on reviewing an automatic transcription
  11. Evaluating transcription review process
     • Review of automatic transcriptions is evaluated from two viewpoints:
       – Transcription accuracy
       – Time spent to review automatic transcriptions measured as Real Time Factor (RTF)

       Language      Accuracy (92%)     RTF (10)
       Spanish       Excellent (86%)    3
       Estonian      Good (70%)         3
       Portuguese    Average (57%)      5
       Italian       Good (82%)         5
       English       Good (81%)         6
       Catalan       Good (83%)         6
       Dutch         Good (75%)         6
       French        Good (75%)         6
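As a rough illustration of the RTF column, the sketch below computes a Real Time Factor as review time divided by video duration, and the relative saving against a manual baseline. It assumes that the "(10)" in the table header is the manual transcription RTF quoted on the cost slide; the slide does not say so explicitly. Function names are hypothetical.

```python
def real_time_factor(review_seconds, video_seconds):
    """RTF: time spent working on a video divided by the video's duration."""
    if video_seconds <= 0:
        raise ValueError("video duration must be positive")
    return review_seconds / video_seconds


def effort_saving(review_rtf, manual_rtf=10.0):
    """Fraction of time saved by reviewing a draft instead of starting from scratch.

    manual_rtf=10 is an assumption taken from the manual transcription
    figure on the cost slide.
    """
    return 1.0 - review_rtf / manual_rtf


# A 10-minute video whose draft transcription is reviewed in 30 minutes.
rtf = real_time_factor(review_seconds=30 * 60, video_seconds=10 * 60)
saving = effort_saving(rtf)
print(f"RTF = {rtf:.1f}, saving = {saving:.0%}")
```

With the table's best case (RTF 3, Spanish and Estonian), reviewing takes roughly 70% less time than the assumed manual baseline; with RTF 6 the saving drops to roughly 40%.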
  12. Demo on transcription
     1. Overview of the Transcription and Translation Platform (ttp.mllp.upv.es)
     2. Uploading a video
     3. Reviewing video transcription
     4. Reviewing video translation
     5. Reviewing document translation
  13. How to improve translation accuracy
     • Translation systems learn to translate from parallel texts
       – Millions of sentences previously translated to learn the translation model
       – Texts in millions of words to learn the language model
     • Parallel texts are collected from public multilingual organisations (EU, UN, TED, etc.)
     • Not all available parallel text is useful to translate your MOOC: domain adaptation is needed

       Language pairs        All (Msents)   Selection (Msents)
       Dutch-English         27.3           1.7
       English-Spanish       14.0           3.2
       English-Italian       24.5           6.4
       English-French        28.8           3.2
       Estonian-English      10.5           10.5
       French-English        28.8           0.5
       Portuguese-English    27.5           6.4
       Italian-English       24.5           6.4
       Spanish-English       14.0           6.4

     • Adaptation of translation systems to the domain of the MOOC
       – Text of the course to be translated
       – Domain-related materials previously translated
       – Bibliography of the course in the target language
  14. Reviewing automatic translations
     • Speech Recognition technology is in a more mature stage than Machine Translation
     • Machine Translation has improved over the last years, but it is still far from perfect
     • Quality of automatic translation depends on:
       – Proximity between source and target languages
       – Complexity of grammar structures used by the speaker
       – How specific the vocabulary employed is
       – Availability of parallel texts in the same field
     • Evaluation of translation is cumbersome, since there is no single correct translation
     • Translations need to be manually reviewed before publishing them
     • Reviewing translations is faster than generating them from scratch
  15. Reviewing automatic video translations
     • Reviewed video transcriptions are automatically translated into the desired languages
     • The same web interface allows you to review source and target subtitles in parallel
     • Reviewed subtitles can be exported as SRT files
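For readers unfamiliar with SRT, the sketch below shows what such an export looks like: each subtitle is a running index, a time range written as HH:MM:SS,mmm --> HH:MM:SS,mmm, and the text, with a blank line between entries. This is a generic illustration of the file format, not the platform's own exporter; the function names and the (start, end, text) segment layout are assumptions.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render (start_seconds, end_seconds, text) tuples as SRT content."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text}\n"
        )
    # Entries are separated by one blank line.
    return "\n".join(blocks)


reviewed_subtitles = [
    (0.0, 2.5, "Welcome to the course."),
    (2.5, 6.0, "Today we talk about multilingual MOOCs."),
]
print(to_srt(reviewed_subtitles))
```

Keeping one such file per target language lets a video player switch subtitle languages without touching the video itself.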
  16. Reviewing automatic document translations
     • Text included in the course is ingested into the translation system
     • A similar web interface allows you to review source and target texts in parallel
     • Preview of source and target texts also available
     • Translated text is imported back into the EMMA platform
  17. Evaluating translation review process
     • Review of translations is evaluated from two viewpoints:
       – Translation accuracy automatically computed from single reference translation
       – Time spent to review automatic translations (in RTF)

       Language pairs        Accuracy           RTF (30)
       Spanish → English     Good (64%)         7
       Spanish → Catalan     Excellent (73%)    9
       English → Italian     Good (59%)         10
       Dutch → English       Good (52%)         13
       Italian → English     Good (53%)         14
       Estonian → English    Poor (13%)         16
       English → Spanish     Good (62%)         17
       French → English      Average (22%)      26
  18. Demo on translation
     1. Overview of the Transcription and Translation Platform
     2. Uploading a video
     3. Reviewing video transcription
     4. Reviewing video translation
     5. Reviewing document translation
  19. Conclusions and Discussion
     • Multilingual access to your course boosts visibility
     • The cost of manually translating your course is high (2 PM)
     • Automatic translation can reduce the time cost to 30% - 50% of the manual effort
     • Accuracy of automatic translation depends on several factors:
       – Languages involved
       – Availability of annotated data resources related to your course
       – Specificity of the course
     • Designing a multilingual MOOC should also take into account:
       – Slides
       – Images
       – Application interfaces (demos)
       – Bibliography
       – In general, language-dependent content that is difficult or too costly to edit
  20. Thank you for your attention!
  21. Comparative results with YouTube/Google
     • Comparison with YouTube in terms of Word Error Rate

       Word Error Rate
       Language      EMMA    YouTube
       Dutch         25.7    38.6
       English       39.2    70.8
       Italian       28.9    31.6
       Portuguese    49.8    62.3
       Spanish       14.4    34.3

     • Comparison with Google Translate in terms of BLEU

       Quality - BLEU
       Language pairs          EMMA    Google
       Dutch → English         41.6    33.4
       English → Spanish       42.5    39.0
       Italian → English       46.9    27.9
       Portuguese → English    47.6    45.4
       Spanish → English       28.2    27.6
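Word Error Rate, used in the first comparison, is the number of word substitutions, deletions and insertions needed to turn the automatic transcription into the reference, divided by the number of reference words (lower is better). The sketch below is a textbook edit-distance implementation for illustration; it is not the evaluation script behind the reported figures, and it assumes plain whitespace tokenisation with no text normalisation.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent between a reference and an automatic transcription."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


wer = word_error_rate(
    reference="the cat sat on the mat",
    hypothesis="the cat sit on mat",
)
print(f"WER = {wer:.1f}%")
```

In this toy example one word is substituted and one is dropped out of six reference words, so the WER is about 33%. Because insertions count as errors, WER can exceed 100% for very poor output.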
