Language use and preservation online

    Tadej Gregorčič

TEDx presentation on the latest advances in mainstream language technology and how this affects "minor" languages.

    1. Language use and preservation online
       Tadej Gregorčič
    2. “Minor” languages
       • 6912+ languages altogether
       • 3500 spoken by 0.2% of the world’s speakers
       • 40% endangered
       • Only 600 non-extinct within 100 years?
    3. Endangered languages
    4. Internet
       • 90% of content in just 12 languages
       • How big an issue is extinction?
       • Language transformation vs. transformation of old media (TV, newspapers, radio)
       • Unicode - first major breakthrough
    5. Slovenian (my language)
       • Roughly 2 million speakers
       • More speakers than 96% of languages
       • Official EU language - enforcement policies
       • Endangerment?
    6. Use of foreign words in scientific text where appropriate Slovenian counterparts exist.
    7. Preservation of language
    8. The Rosetta Project
       • http://rosettaproject.org/
       • Publicly accessible digital library
       • Aiming to preserve information about eventually all human languages
    9. Preservation of knowledge contained in a language
       • Smithsonian Institution
       • Rosetta Project
       • UNESCO
       • Revitalization (non-extinct)
       • Resurrection (extinct)
       • Only successful known example: Hebrew
    10. Keeping use of a language viable/economical
       • Consistent use
       • Dictionaries, tools
       • Translation tools
       • Advanced language software (TTS, SR)
    11. Language technologies
       • Machine translation
       • Speech synthesis
       • Speech recognition
       • ...
       • An advance in one field accelerates advances in others through increased feasibility
    12. Language technologies
       • Machine translation
       • Speech synthesis
       • Speech recognition
       • ...
       • An advance in one field accelerates advances in others through increased feasibility
    13. 2005
       • Systran (fr.)
       • Yahoo!, Altavista Babelfish
       • Google
       • Rule-based + statistical approach
    14. Live translation
       • Done in 2005 as Ethnocon project (presented at MS Imagine Cup)
       • Speech recognition (language 1)
       • Text machine translation (Systran API)
       • Speech synthesis (language 2)
       • MT quality poor
    15. 2006+
       • Google Translate Systran
       • Google obtained United Nations parallel corpora
       • Words = data, grammar = code
       • Purely statistical approach (a huge amount of data, code )
    16. Parallel corpus
       • evrokorpus.gov.si
       • Translation memory (Trados etc.)
       • TM from governmental institutions
       • Open TM projects
       • ...
    17. Parallel corpus
       • evrokorpus.gov.si
       • Translation memory (Trados etc.)
       • TM from governmental institutions
       • Open TM projects
       • Example: the Bible
    18. Google Translate
    19. Crowdsourcing
       • It works (Wikipedia)
       • An incorrect translation is a natural motivator
       • Relatively fast improvement of data
       • But: unprofessional
    20. June 2009
    21. Google Translator Toolkit
       • June 2009 (200+ languages in October)
       • “Open Trados”
       • Global parallel TM
       • Google TT + Google Translate
       • 345 languages, 10,664 language pairs
    22. Google Translator Toolkit
       • Incentive for professionals: productivity
       • Motivated to contribute to global TM
       • GT pre-translates text with
       • Huge parallel corpora
       • Professional translation!
    23. Professional translations are fed into the crowdsourced Google Translate parallel corpora. Like Wikipedia with professional editors. Huge quality gains over time if Google Translator Toolkit takes off.
    24. Results today:
    25. Automatic subtitling (think hearing impaired users)
    26. Results soon:
    27. AR, “augmented reality”
    28. November 2009
       Thank you!
       Tadej Gregorcic
       Software developer, entrepreneur and amateur linguist
       twitter.com/tadej linkedin.com/in/tadejgregorcic www.facebook.com/tadej
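
The translation memory (TM) idea from slides 16, 17 and 21 can be illustrated with a small sketch. This is only an assumption-laden toy, not how Trados, Google Translator Toolkit or Google Translate actually work: the sentence pairs, the `lookup` function and the 0.75 threshold are all hypothetical, and the fuzzy matching uses Python's standard `difflib` rather than any real TM engine.

```python
from difflib import SequenceMatcher

# Hypothetical aligned English-Slovenian segments. A real parallel corpus
# or translation memory would hold far more pairs than this.
MEMORY = [
    ("Good morning.", "Dobro jutro."),
    ("Thank you very much.", "Najlepša hvala."),
    ("Where is the station?", "Kje je postaja?"),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two segments, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup(segment, memory=MEMORY, threshold=0.75):
    """Return (score, source, target) for the closest stored segment,
    or None when nothing in the memory reaches the threshold."""
    candidates = ((similarity(segment, src), src, tgt) for src, tgt in memory)
    best = max(candidates, key=lambda match: match[0], default=None)
    if best is None or best[0] < threshold:
        return None
    return best


if __name__ == "__main__":
    for text in ("Good morning.", "good morning", "Unrelated sentence here"):
        print(repr(text), "->", lookup(text))
```

An exact hit can be reused as is, a fuzzy hit gives the translator a starting point to correct, and a miss falls back to translating from scratch. Every corrected segment that is stored again makes the memory more useful, which is the feedback loop slides 22 and 23 describe.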