
Luísa Coheur - PT-STAR Project


Presentation by Dr. Luísa Coheur at the I International Conference on Translation and Technology, 13 and 14 May, Faculdade de Letras do Porto.


  1. Speech-to-Speech Machine Translation in the PT-STAR Project
      Luísa Coheur (L2F/INESC-ID)
  2. INESC-ID and L2F
  3. INESC-ID
      • Brief history
        - Established January 2000 (owned by IST and INESC)
        - Private not-for-profit research institute of public interest
        - Associated Laboratory since December 2004
      • Facilities
        - Alameda
        - Tagus Park
  4. The Spoken Language Systems Lab
      • History
        - Work on speech processing for Portuguese since the 90s
        - Creation: 2001
      • Mission
        - Creating technology to bridge the gap between natural spoken language and the underlying semantic information
      • Interdisciplinary background: signal processing, natural language processing, linguistics, etc.
  5. Core Technologies
      • Speech processing
        - Text-to-speech synthesis
          - Automatic process for building new voices
          - Limited domain synthesis
          - Expressive speech synthesis
          - Audio-visual synthesis
        - Automatic speech recognition
          - Robust speech recognition
          - Speaker adaptation
          - Large vocabulary continuous recognition
          - Rich transcription of spontaneous speech
        - Speech coding
        - Speech enhancement
        - Speaker and language identification
      • Text processing
        - Morphological analysis
        - Syntactic analysis
        - Semantic analysis
        - Discourse analysis
        - NL generation
        - Named entity extraction
        - Information retrieval
        - Summarization
        - Question answering
        - Machine translation
      • Spoken language processing
        - Speech understanding
        - Spoken dialog systems
        - Speech-to-speech machine translation
        - Summarization of spoken documents
        - Question answering on spoken documents
        - Classification of multimedia documents
        - Language tutoring
        - etc.
  6. Statistical Machine Translation
  7. Statistical Machine Translation
      • Automatic translators aim to maximize:
        - Faithfulness or fidelity: how close the meaning of the translation is to the meaning of the original
        - Fluency or naturalness: how natural the translation is, considering only its fluency in the target language
      • Developed by researchers from IBM
      • $\hat{T} = \operatorname{argmax}_{T} \; \mathrm{fluency}(T) \cdot \mathrm{faithfulness}(T, S)$
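  8. Statistical Machine Translation
      • $\hat{T} = \operatorname{argmax}_{T} \; \mathrm{fluency}(T) \cdot \mathrm{faithfulness}(T, S)$
        - fluency(T): Language Model
        - faithfulness(T, S): Translation Model
      • Source sentence: Estou cansado

        Candidate         Fluency   Fidelity
        I'm exhausted     5         3
        Tired me          2         5
        I love cookies    5         0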
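The selection rule on this slide can be sketched in a few lines. The scores below are copied from the slide's table; the function names (`combined_score`, `pick_best`) are illustrative and not part of any toolkit mentioned in the talk. Real systems use probabilities from trained models rather than hand-assigned scores.

```python
# Hand-assigned scores from the slide for the source sentence "Estou cansado".
CANDIDATES = {
    "I'm exhausted": {"fluency": 5, "faithfulness": 3},
    "Tired me": {"fluency": 2, "faithfulness": 5},
    "I love cookies": {"fluency": 5, "faithfulness": 0},
}


def combined_score(scores):
    """Product of the two criteria, as in the argmax formula."""
    return scores["fluency"] * scores["faithfulness"]


def pick_best(candidates):
    """Return the candidate translation with the highest combined score."""
    return max(candidates, key=lambda text: combined_score(candidates[text]))


if __name__ == "__main__":
    for text, scores in CANDIDATES.items():
        print(f"{text!r}: {combined_score(scores)}")
    print("best:", pick_best(CANDIDATES))
```

Because the two scores are multiplied, a perfectly fluent but unfaithful candidate ("I love cookies") scores zero, and a faithful but disfluent one ("Tired me") loses to the candidate that balances both.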
  9. Language model: fluency
      • Which sentence is the most fluent?
        - This becomes: "which sentence is the most probable?"
        - We can use language models built from N-grams, for example
      • Advantage: this is monolingual knowledge!
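As a rough illustration of "most fluent" becoming "most probable", here is a minimal bigram language model with add-one smoothing. The training sentences are invented for the example, and add-one smoothing is an assumption chosen for brevity; the slides do not say which N-gram order or smoothing method was used.

```python
import math
from collections import Counter

# Invented monolingual training data (no translations needed).
TRAIN = ["i am tired", "i am exhausted", "i am hungry", "she is tired"]


def train_bigram_lm(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def log_prob(sentence, model):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    contexts, bigrams, vocab = model
    size = len(vocab) + 1  # one extra slot for unseen words
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + size))
        for prev, word in zip(tokens, tokens[1:])
    )


if __name__ == "__main__":
    lm = train_bigram_lm(TRAIN)
    for candidate in ("i am tired", "tired am i"):
        print(candidate, round(log_prob(candidate, lm), 3))
```

The same three words receive a higher score in the order seen in training, which is the sense in which the model measures fluency.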
  10. Translation model: fidelity
      • Which sentence is the most faithful?
        - Here we need to observe how sentences in the source language are translated into the target language
      • Problem: this requires parallel corpora
        - European Parliament
        - TED Talks
        - …
  11. Centauri/Arcturan [Knight 97]
      1a. ok-voon ororok sprok .
      1b. at-voon bichat dat .
      2a. ok-drubel ok-voon anok plok sprok .
      2b. at-drubel at-voon pippat rrat dat .
      3a. erok sprok izok hihok ghirok .
      3b. totat dat arrat vat hilat .
      4a. ok-voon anok drok brok jok .
      4b. at-voon krat pippat sat lat .
      5a. wiwok farok izok stok .
      5b. totat jjat quat cat .
      6a. lalok sprok izok jok stok .
      6b. wat dat krat quat cat .
      7a. lalok farok ororok lalok sprok izok enemok .
      7b. wat jjat bichat wat dat vat eneat .
      8a. lalok brok anok plok nok .
      8b. iat lat pippat rrat nnat .
      9a. wiwok nok izok kantok ok-yurp .
      9b. totat nnat quat oloat at-yurp .
      10a. lalok mok nok yorok ghirok clok .
      10b. wat nnat gat mat bat hilat .
      11a. lalok nok crrrok hihok yorok zanzanok .
      11b. wat nnat arrat mat zanzanat .
      12a. lalok rarok nok izok hihok mok .
      12b. wat nnat forat arrat vat gat .

      Translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
  14. Spanish/English corpus
      1a. Garcia and associates .
      1b. Garcia y asociados .
      2a. Carlos Garcia has three associates .
      2b. Carlos Garcia tiene tres asociados .
      3a. his associates are not strong .
      3b. sus asociados no son fuertes .
      4a. Garcia has a company also .
      4b. Garcia tambien tiene una empresa .
      5a. its clients are angry .
      5b. sus clientes estan enfadados .
      6a. the associates are also angry .
      6b. los asociados tambien estan enfadados .
      7a. the clients and the associates are enemies .
      7b. los clientes y los asociados son enemigos .
      8a. the company has three groups .
      8b. la empresa tiene tres grupos .
      9a. its groups are in Europe .
      9b. sus grupos estan en Europa .
      10a. the modern groups sell strong pharmaceuticals .
      10b. los grupos modernos venden medicinas fuertes .
      11a. the groups do not sell zenzanine .
      11b. los grupos no venden zanzanina .
      12a. the small groups are not modern .
      12b. los grupos pequenos no son modernos .
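The puzzle on these slides is solved by noticing which words keep appearing together across sentence pairs. One standard way to automate that is IBM Model 1 trained with expectation-maximization; the slides do not name the algorithm, so treat it as a swapped-in illustration. The sketch below uses six pairs from the Spanish/English corpus (lowercased, final full stops dropped) and omits the NULL word for brevity.

```python
from collections import defaultdict

# Sentence pairs 1, 2, 3, 4, 8 and 9 from the Spanish/English corpus slide.
PAIRS = [
    ("garcia and associates", "garcia y asociados"),
    ("carlos garcia has three associates", "carlos garcia tiene tres asociados"),
    ("his associates are not strong", "sus asociados no son fuertes"),
    ("garcia has a company also", "garcia tambien tiene una empresa"),
    ("the company has three groups", "la empresa tiene tres grupos"),
    ("its groups are in europe", "sus grupos estan en europa"),
]


def train_ibm1(pairs, iterations=10):
    """Estimate t[(f, e)]: probability that English word e yields Spanish word f."""
    corpus = [(en.split(), es.split()) for en, es in pairs]
    es_vocab = {f for _, es in corpus for f in es}
    t = defaultdict(lambda: 1.0 / len(es_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for en, es in corpus:
            for f in es:
                # E-step: share each Spanish word among the English words.
                norm = sum(t[(f, e)] for e in en)
                for e in en:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per English word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def most_likely_translation(word, t):
    """Spanish word with the highest translation probability for `word`."""
    options = {f: p for (f, e), p in t.items() if e == word}
    return max(options, key=options.get)


if __name__ == "__main__":
    table = train_ibm1(PAIRS)
    for english in ("associates", "groups", "garcia"):
        print(english, "->", most_likely_translation(english, table))
```

No dictionary is supplied: the word correspondences emerge only from co-occurrence in the parallel sentences, which is why the translation model needs parallel corpora.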
  15. Speech to Speech Machine Translation
  16. Speech to speech machine translation
      • Speech-to-Speech Machine Translation (S2SMT) technologies aim at enabling natural language communication between people who do not share the same language
  17. Speech to speech machine translation
      • S2SMT can be seen as a cascade of three major components:
        - Automatic Speech Recognition
        - Machine Translation
        - Text-to-Speech Synthesis
  18. Speech to speech machine translation
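The cascade of three components can be pictured as plain function composition. Everything in this sketch is a toy stand-in: the function names, the single phrase-table entry, and the use of text bytes in place of audio are assumptions made only to show how the output of each stage feeds the next.

```python
# Hypothetical one-entry phrase table standing in for a trained MT system.
TOY_PHRASE_TABLE = {"estou cansado": "I'm exhausted"}


def recognize(audio: bytes) -> str:
    """ASR stand-in: pretends the audio bytes already hold the spoken words."""
    return audio.decode("utf-8").strip().lower()


def translate(text: str) -> str:
    """MT stand-in: whole-sentence lookup; unknown input passes through unchanged."""
    return TOY_PHRASE_TABLE.get(text, text)


def synthesize(text: str) -> bytes:
    """TTS stand-in: returns the text as bytes instead of a waveform."""
    return text.encode("utf-8")


def speech_to_speech(audio: bytes) -> bytes:
    """Cascade: ASR output feeds MT, and MT output feeds TTS."""
    return synthesize(translate(recognize(audio)))


if __name__ == "__main__":
    print(speech_to_speech(b"Estou cansado"))
```

In such a cascade each stage sees only the text produced by the previous one, so errors made early propagate downstream.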
  19. The PT-STAR project
  20. The PT-STAR project
      • Team:
        - L2F/INESC-ID
        - LTI/CMU
        - UBI
        - FLUL
  21. The PT-STAR project
      • One of the main problems of S2SMT is the still weak integration between the three components
      • The main goal of PT-STAR (Speech Translation Advanced Research to and from Portuguese) is to improve speech translation systems for Portuguese by strengthening this integration
  22. Task 1: ASR/MT
  23. Task 1: ASR/MT
      • Challenge
        - Improve the insertion of full stops and commas: segmentation is a hard problem in automatic translation
        - Improve capitalization: important for disambiguation, since an uncapitalized proper name may be translated word by word (e.g. "Pedro Steps Rabbit")
        - Detect interrogatives: important if you target synthesis
        - Port everything to English: try to make everything as language independent as possible
  24. Rich transcriptions (raw recognizer output, without punctuation or capitalization)
      boa tarde o governo considera que as medidas de austeridade aprovadas e em vigor só para já adequadas às necessidades financeiras de portugal o ministro das finanças mostra-se confiante com as metas traçadas no programa de estabilidade e crescimento apesar de não fechar as portas à hipótese de medidas adicionais de controlo orçamental em dois mil e doze é desta forma que teixeira dos santos responde a pressão dos países da moeda única querem que portugal e espanha avança com mais medidas de austeridade dentro de ano e meio ainda em mês passou diz que o governo decidiu apertar o cinto aos portugueses e já europa vem pedir mais para depois de dois mil e onze o ministro das finanças não fecha a porta, mas defende cada ano a seu tempo acho que estamos de em condições de alimentar digamos confessa estar confiantes de que o objectivo para dois mil e dez vai ser conseguido com as medidas adicionais que foram entretanto já decididas
  25. Rich transcriptions (with punctuation, capitalization, speaker turns, and topics)
      [anchor 150] Boa tarde o governo considera que as medidas de austeridade aprovadas e em vigor. Só para já adequadas às necessidades financeiras de Portugal. O ministro das Finanças mostra-se confiante com as metas traçadas no programa de Estabilidade e Crescimento. Apesar de não fechar as portas à hipótese de medidas adicionais de controlo orçamental, em dois mil e doze. É desta forma que Teixeira dos Santos responde a pressão dos países da moeda única, querem que Portugal e Espanha avança com mais medidas de austeridade, dentro de ano e meio.
      [spk 2000] Ainda em mês passou diz que o Governo decidiu apertar o cinto aos portugueses e já Europa vem pedir mais para depois de dois mil e onze. O ministro das Finanças não fecha a porta, mas defende cada ano, a seu tempo.
      [spk 1000] Acho que estamos de em condições de alimentar, digamos confessa estar confiantes, de que o objectivo para dois mil e dez, vai ser conseguido com as medidas adicionais que foram entretanto já decididas.
      Tópicos: Política; Economia; Nacional;
  27. Translation
      [anchor 150] Good afternoon, the government believes that the austerity measures approved and in force. Only for already suited to financial needs of Portugal. The finance minister seems confident with the targets set out in the stability and growth programme. Despite not close the door to the possibility of additional measures of budgetery control in two thousand, twelve. This is the way that Teixeira dos Santos responds the pressure of the countries of the single currency, they want Spain and Portugal progresses with more austerity measures, within a year and a half.
      [spk 2000] Still in month passed says that the government has decided to tighten their belts the Portuguese and already Europe comes to ask for more for after two thousand and eleven. The finance minister is not closes the door, but defends each year, the his time.
      [spk 1000] I think that we are in conditions of food, say admits be trusted, that the objective for two thousand, ten, will be achieved with the additional measures that were in the meantime, has already decided.
      Topic: Politics; Economy; National;
  28. Task 1: ASR/MT
      • Challenge
        - Take advantage of in-domain texts to build domain-adapted language models for ASR and MT
        - Domain adaptation is one of the major problems in SMT (if a word is not seen during training, the system will not be able to translate it)
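The parenthetical remark about unseen words can be made concrete with a small out-of-vocabulary check. The training sentences and the test sentence are invented for illustration, and `oov_words` is a hypothetical helper, not a function from any toolkit used in the project.

```python
# Invented out-of-domain training text (parliament-style sentences).
TRAINING_TEXT = [
    "the parliament approved the budget",
    "the government approved austerity measures",
]


def build_vocabulary(sentences):
    """Set of lowercased word forms seen in training."""
    return {word for sentence in sentences for word in sentence.lower().split()}


def oov_words(sentence, vocabulary):
    """Words of `sentence` never seen in training, in order of appearance."""
    return [word for word in sentence.lower().split() if word not in vocabulary]


if __name__ == "__main__":
    vocab = build_vocabulary(TRAINING_TEXT)
    print(oov_words("the groups do not sell zenzanine", vocab))
```

Words reported here have no entry the system could use, which is why adding in-domain text to the training data matters.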
  29. Task 1: ASR/MT
      • Challenge
        - Take advantage of imperfect transcriptions (in which annotations do not include laughter, applause, filled pauses, repetitions, or other disfluencies, and sometimes contain errors) to build acoustic models for ASR
      • Example:
        … In my opinion the many options to solve the...
        … In my opinion ++BREATH++ the ++UH++ many options to solve the...
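The two example lines differ only in the `++…++` event markers. A small sketch of how the detailed line maps back to the imperfect one: the marker pattern is inferred from the two tokens shown on the slide (`++BREATH++`, `++UH++`) and may not cover the project's full annotation scheme.

```python
import re

# Pattern inferred from the slide's example tokens; an assumption, not a spec.
EVENT_MARKER = re.compile(r"\+\+[A-Z_]+\+\+")


def strip_event_markers(text):
    """Remove ++EVENT++ tokens and collapse the leftover whitespace."""
    return " ".join(EVENT_MARKER.sub(" ", text).split())


if __name__ == "__main__":
    detailed = "In my opinion ++BREATH++ the ++UH++ many options to solve the"
    print(strip_event_markers(detailed))
```

Going in the opposite direction, from the clean text to the detailed one, is the hard part, since the events must be found in the audio itself.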
  30. Task 2: MT/TTS
  31. Task 2: MT/TTS
      • Challenges
        - Built statistical parametric synthetic voices for Portuguese
        - How to deal with translation errors when you target synthesis?
          - Techniques for optimal synchronization using the MT N-best list
          - Grammar-based phrasing strategies to improve synthesis of disfluent MT output
        - Voice morphing
          - Cross-lingual voice morphing to match the source speaker
  32. Task 3: MT
  33. Task 3: MT
      • Challenges
        - Alignments
          - New algorithms to generate the well-known lexicalized reordering model using weighted alignment matrices
          - Geppetto: a toolkit for word alignments and phrase extraction
          - Users can improve the phrase extraction algorithm, because key control points can be manipulated
          - Available at Google Code
  34. Task 3: MT
      • Challenges
        - Error analysis
          - Taxonomy and detailed analysis of Moses vs. Google
        - From BP to EP (Brazilian Portuguese to European Portuguese)
          - Built the BP2EP translator
        - Corpora:
          - TAP-UP corpus: flight magazine with parallel corpora PT/EN
          - 6000 questions translated into PT
            - Original corpus in EN, from TREC
            - Translation model adapted with the questions' corpus
            - Important BLEU improvements (EN/PT 9, PT/EN 8)
  35. Task 3: MT
      • Challenges
        - Participated in IWSLT 2010 (Evaluation Campaign)
          - CN-EN, EN-CN
          - FR-EN
  36. Task 4: Proof of concept
  37. Proof-of-concept
      • Prototype development (pt, en, cn)
        - Broadcast news (S2T)
        - TED Talks (S2S)
        - Real-time demo (S2S)
  38. Demo and references
      • Video demonstration: https://www.l2f.inesc-id.pt/demos/pt-star/Demo_S2S.mov
      • Media coverage:
        - Report on SIC Notícias
        - Article in "Ciência Hoje"
        - Report in Sábado magazine
