Language Use And Preservation Online


TEDx presentation on the latest advances in mainstream language technology and how this affects "minor" languages.

Transcript

  • 1. Language use and preservation online Tadej Gregorčič
  • 2. “Minor” languages • 6,912+ languages altogether • 3,500 spoken by 0.2% of the world’s speakers • 40% endangered • Only 600 non-extinct within 100 years?
  • 3. Endangered languages
  • 4. Internet • 90% of content in just 12 languages • How big an issue is extinction? • Language transformation vs. transformation of old media (TV, newspapers, radio) • Unicode - first major breakthrough
  • 5. Slovenian (my language) • Roughly 2 million speakers • More speakers than 96% of languages • Official EU language - enforcement policies • Endangerment?
  • 6. Use of foreign words in scientific text where appropriate Slovenian counterparts exist.
  • 7. Preservation of language
  • 8. The Rosetta Project • http://rosettaproject.org/ • Publicly accessible digital library • Aiming to preserve information about eventually all human languages
  • 9. Preservation of knowledge contained in a language • Smithsonian Institution • Rosetta Project • UNESCO • Revitalization (non-extinct) • Resurrection (extinct) • Only successful known example: Hebrew
  • 10. Keeping use of a language viable/economical • Consistent use • Dictionaries, tools • Translation tools • Advanced language software (TTS, SR)
  • 11. Language technologies • Machine translation • Speech synthesis • Speech recognition • ... • Advance in one field accelerates advances in others through increased feasibility
  • 12. Language technologies • Machine translation • Speech synthesis • Speech recognition • ... • Advance in one field accelerates advances in others through increased feasibility
  • 13. 2005 • Systran (fr.) • Yahoo!, Altavista Babelfish • Google • Rule-based + statistical approach
  • 14. Live translation • Done in 2005 as Ethnocon project (presented at MS Imagine Cup) • Speech recognition (language 1) • Text machine translation (Systran API) • Speech synthesis (language 2) • MT quality poor
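
The three stages on slide 14 chain into a simple pipeline. The sketch below is only an illustration of that chaining: `live_translate` and the three stand-in stages are hypothetical names, not the Ethnocon code or the Systran API, and the "audio" is just UTF-8 text so the example runs without any speech engine.

```python
from typing import Callable

# Hypothetical stage types (assumptions for illustration, not a real API).
Recognizer = Callable[[bytes], str]    # audio in language 1 -> text in language 1
Translator = Callable[[str], str]      # text in language 1 -> text in language 2
Synthesizer = Callable[[str], bytes]   # text in language 2 -> audio in language 2


def live_translate(audio: bytes,
                   recognize: Recognizer,
                   translate: Translator,
                   synthesize: Synthesizer) -> bytes:
    """Run speech recognition, machine translation, and speech synthesis in order."""
    text = recognize(audio)
    translated = translate(text)
    return synthesize(translated)


# Stand-in stages so the sketch is runnable; real engines would replace these.
def fake_recognize(audio: bytes) -> str:
    # Pretend the audio bytes are already UTF-8 text.
    return audio.decode("utf-8")


def fake_translate(text: str) -> str:
    # Word-for-word lookup in a tiny made-up Slovenian-English lexicon.
    lexicon = {"dober": "good", "dan": "day"}
    return " ".join(lexicon.get(word, word) for word in text.lower().split())


def fake_synthesize(text: str) -> bytes:
    # Pretend that encoding the text produces audio.
    return text.encode("utf-8")


result = live_translate("Dober dan".encode("utf-8"),
                        fake_recognize, fake_translate, fake_synthesize)
```

Because each stage only consumes the previous stage's output, an error in recognition or translation carries straight through to the synthesized speech, which is why poor MT quality limits the whole chain.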
  • 15. 2006+ • Google Translate Systran • Google obtained United Nations parallel corpora • Words = data, grammar = code • Purely statistical approach (a huge amount of data, code )
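
To make "words = data" concrete, here is a minimal sketch of learning word translations from sentence pairs alone, using a simplified IBM Model 1 (expectation-maximization, no NULL word). This is a textbook technique chosen for illustration, not a description of Google's actual system; the three-sentence Slovenian-English corpus and the names `train_ibm1` and `best_translation` are made up for the example.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=20):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word)."""
    target_vocab = {tw for _, tgt in pairs for tw in tgt}
    # Start from a uniform distribution over every co-occurring word pair.
    t = {(tw, sw): 1.0 / len(target_vocab)
         for src, tgt in pairs for sw in src for tw in tgt}
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts of (target, source) links
        total = defaultdict(float)   # expected counts per source word
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # Re-estimate the probabilities from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest learned probability."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


# Toy parallel corpus (illustrative only).
corpus = [
    ("ta hiša".split(), "this house".split()),
    ("ta knjiga".split(), "this book".split()),
    ("ena knjiga".split(), "one book".split()),
]
table = train_ibm1(corpus)
```

No grammar rules or dictionary are supplied: the model separates "hiša" from "ta" only because "ta" also appears in a sentence without "house". More parallel text gives the statistics more such contrasts to work with.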
  • 16. Parallel corpus • evrokorpus.gov.si • Translation memory (Trados etc.) • TM from governmental institutions • Open TM projects • ...
  • 17. Parallel corpus • evrokorpus.gov.si • Translation memory (Trados etc.) • TM from governmental institutions • Open TM projects • Example: the Bible
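
A translation memory is essentially a store of aligned segment pairs that is queried by similarity. The sketch below shows one possible fuzzy lookup using Python's standard `difflib`; the function name `tm_lookup`, the 0.75 threshold, and the two stored segments are assumptions for illustration and do not reflect how Trados or any other TM tool scores matches.

```python
from difflib import SequenceMatcher


def tm_lookup(memory, segment, threshold=0.75):
    """Return (score, source, target) for the closest stored segment, or None.

    memory is a list of (source_segment, target_segment) pairs.
    """
    best = None
    for source, target in memory:
        # Character-level similarity between the query and the stored source.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None


# Tiny made-up English-Slovenian memory (illustrative only).
memory = [
    ("The meeting starts at noon.", "Sestanek se začne opoldne."),
    ("Please sign the contract.", "Prosimo, podpišite pogodbo."),
]

match = tm_lookup(memory, "The meeting starts at nine.")
miss = tm_lookup(memory, "Completely unrelated text")
```

A near match like the one above gives the translator a stored translation to adapt rather than a blank segment, while a query with no close neighbour returns nothing.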
  • 18. Google Translate
  • 19. Crowdsourcing • It works (Wikipedia) • An incorrect translation is a natural motivator • Relatively fast improvement of data • But: unprofessional
  • 20. June, 2009
  • 21. Google Translator Toolkit • June 2009 (200+ languages in October) • “Open Trados” • Global parallel TM • Google TT + Google Translate • 345 languages, 10,664 language pairs
  • 22. Google Translator Toolkit • Incentive for professionals: productivity • Motivated to contribute to global TM • GT pre-translates text with • Huge parallel corpora • Professional translation!
  • 23. Professional translations are fed into the crowdsourced Google Translate parallel corpora. Like Wikipedia with professional editors. Huge quality gains over time if Google Translator Toolkit takes off.
  • 24. Results today:
  • 25. Automatic subtitling (think hearing impaired users)
  • 26. Results soon:
  • 27. AR, “augmented reality”
  • 28. November 2009 Thank you! Tadej Gregorcic Software developer, entrepreneur and amateur linguist twitter.com/tadej linkedin.com/in/tadejgregorcic www.facebook.com/tadej