Language Use And Preservation Online
 


TEDx presentation on the latest advances in mainstream language technology and how this affects "minor" languages.


Usage Rights

© All Rights Reserved

Language Use And Preservation Online: Presentation Transcript

  • Language use and preservation online Tadej Gregorčič
  • “Minor” languages • 6912+ languages altogether • 3500 spoken by 0.2% of the world’s speakers • 40% endangered • Only 600 non-extinct within 100 years?
  • Endangered languages
  • Internet • 90% of content in just 12 languages • How big an issue is extinction? • Language transformation vs. transformation of old media (TV, newspapers, radio) • Unicode - first major breakthrough
  • Slovenian (my language) • Roughly 2 million speakers • More speakers than 96% of languages • Official EU language - enforcement policies • Endangerment?
  • Use of foreign words in scientific text where appropriate Slovenian counterparts exist.
  • Preservation of language
  • The Rosetta Project • http://rosettaproject.org/ • Publicly accessible digital library • Aims to eventually preserve information about all human languages
  • Preservation of knowledge contained in a language • Smithsonian Institution • Rosetta Project • UNESCO • Revitalization (non-extinct) • Resurrection (extinct) • Only known successful example: Hebrew
  • Keeping use of a language viable/economical • Consistent use • Dictionaries, tools • Translation tools • Advanced language software (text-to-speech, speech recognition)
  • Language technologies • Machine translation • Speech synthesis • Speech recognition • ... • Advance in one field accelerates advances in others through increased feasibility
  • 2005 • Systran (fr.) • Yahoo!, Altavista Babelfish • Google • Rule based + statistical approach
  • Live translation • Built in 2005 as the Ethnocon project (presented at the Microsoft Imagine Cup) • Speech recognition (language 1) • Text machine translation (Systran API) • Speech synthesis (language 2) • MT quality poor
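
The three-stage chain on this slide (speech recognition, then text machine translation, then speech synthesis) can be sketched as plain function composition. This is only an illustrative sketch, not the Ethnocon code: the names `live_translate`, `fake_recognize`, `fake_translate`, `fake_synthesize` and the two-word dictionary are all assumptions made up for the example, and the stand-in stages just pass text through as bytes.

```python
from typing import Callable

# Hypothetical stage signatures. The real project used actual speech engines
# and the Systran API; none of that is reproduced here.
Recognizer = Callable[[bytes], str]    # audio in language 1 -> text in language 1
Translator = Callable[[str], str]      # text in language 1 -> text in language 2
Synthesizer = Callable[[str], bytes]   # text in language 2 -> audio in language 2


def live_translate(audio: bytes,
                   recognize: Recognizer,
                   translate: Translator,
                   synthesize: Synthesizer) -> bytes:
    """Chain the three stages; any error in an early stage propagates."""
    text_l1 = recognize(audio)
    text_l2 = translate(text_l1)
    return synthesize(text_l2)


# Toy stand-ins so the sketch runs: "audio" is just UTF-8 encoded text.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


TOY_DICTIONARY = {"dober": "good", "dan": "day"}  # invented two-word lexicon


def fake_translate(text: str) -> str:
    # Word-by-word substitution; unknown words are passed through unchanged.
    return " ".join(TOY_DICTIONARY.get(w, w) for w in text.lower().split())


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    out = live_translate("Dober dan".encode("utf-8"),
                         fake_recognize, fake_translate, fake_synthesize)
    print(out)
```

Because each stage feeds the next, the weakest stage bounds the result, which is consistent with the slide's note that poor MT quality limited the whole system.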
  • 2006+ • Google Translate moves away from Systran • Google obtained United Nations parallel corpora • Words = data, grammar = code • Purely statistical approach (a huge amount of data, little code)
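
To make "words = data" concrete, here is a minimal sketch of how word correspondences can be guessed from sentence-aligned text alone, with no grammar rules. It is not Google's method: it uses a simple Dice co-occurrence score as a stand-in for real statistical alignment models, and the function names and the three-sentence Slovenian-English corpus are invented for illustration.

```python
from collections import Counter
from itertools import product


def train_cooccurrence(pairs):
    """Count how often each source word shares a sentence pair with each target word."""
    src_counts, tgt_counts, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return src_counts, tgt_counts, joint


def best_translation(word, model):
    """Return the target word with the highest Dice score, or None if unseen."""
    src_counts, tgt_counts, joint = model
    scores = {
        t: 2 * c / (src_counts[s] + tgt_counts[t])
        for (s, t), c in joint.items()
        if s == word
    }
    return max(scores, key=scores.get) if scores else None


# Invented toy parallel corpus (Slovenian, English).
TOY_CORPUS = [
    ("dober dan", "good day"),
    ("dober večer", "good evening"),
    ("lep dan", "nice day"),
]

if __name__ == "__main__":
    model = train_cooccurrence(TOY_CORPUS)
    for w in ("dober", "dan", "večer", "lep"):
        print(w, "->", best_translation(w, model))
```

All of the "knowledge" here lives in the corpus; adding more aligned sentences improves the guesses without changing a line of code, which is the trade the slide describes.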
  • Parallel corpus • evrokorpus.gov.si • Translation memory (Trados etc.) • TM from governmental institutions • Open TM projects • Example: the Bible
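
A translation memory is essentially a store of previously translated segments that is queried with fuzzy matching. The sketch below shows the idea using Python's standard `difflib.SequenceMatcher`; it does not reflect how Trados or any other product scores matches, and `tm_lookup`, the 0.75 threshold and the two memory entries are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def tm_lookup(segment, memory, threshold=0.75):
    """Return (stored translation, similarity) for the closest stored source segment.

    If no stored segment reaches the threshold, the translation is None.
    """
    best_score, best_target = 0.0, None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_score, best_target = score, target
    if best_score >= threshold:
        return best_target, best_score
    return None, best_score


# Invented toy memory (English source -> Slovenian target).
TOY_MEMORY = {
    "The committee adopted the report.": "Odbor je sprejel poročilo.",
    "The meeting is closed.": "Seja je zaključena.",
}

if __name__ == "__main__":
    print(tm_lookup("The committee adopted the report.", TOY_MEMORY))   # exact match
    print(tm_lookup("The committee adopted the reports.", TOY_MEMORY))  # fuzzy match
    print(tm_lookup("xyz qqq", TOY_MEMORY))                             # no usable match
```

An exact hit can be reused as-is, while a fuzzy hit is typically offered to the translator as a starting point to edit.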
  • Google Translate
  • Crowdsourcing • It works (Wikipedia) • An incorrect translation is a natural motivator • Relatively fast improvement of data • But: unprofessional
  • June, 2009
  • Google Translator Toolkit • June 2009 (200+ languages in October) • “Open Trados” • Global parallel TM • Google TT + Google Translate • 345 languages, 10,664 language pairs
  • Google Translator Toolkit • Incentive for professionals: productivity • Motivated to contribute to global TM • GT pre-translates text using huge parallel corpora • Professional translation!
  • Professional translations are fed into the crowdsourced Google Translate parallel corpora. Like Wikipedia with professional editors. Huge quality gains over time if Google Translator Toolkit takes off.
  • Results today:
  • Automatic subtitling (think hearing impaired users)
  • Results soon:
  • AR, “augmented reality”
  • November 2009 • Thank you! • Tadej Gregorčič • Software developer, entrepreneur and amateur linguist • twitter.com/tadej • linkedin.com/in/tadejgregorcic • www.facebook.com/tadej