Bederson and Resnik, June 2010

On June 10-11, 2010, the University of Maryland hosted a Workshop on Crowdsourcing and Translation. Originally put together in connection with a project on collaborative translation, the goal was to bring together a set of people whose work is helping to define this new and exciting area, and to create an opportunity for discussions that will help define its future directions.

  • ICDL has 4,418 books in 54 languages. Our goal is to have 10,000 books translated into 100 languages, for 1M book-language combinations. This means not only a lot of translation, but also translation between some uncommon language pairs, for example Croatian into Japanese. How do we do that?
  • 15% of visitors visited 5 or more times this month. 45% of visitors visit for 3 minutes or longer. 21% of visitors view 20 or more pages.
  • Some of you might say: machine translation.
  • How about giving translation to humans? Indeed, professional translators can provide high-quality translation, but they are slow and expensive.
  • Compared to bilingual people, there are many more monolingual people who could probably help.
  • To give a rough idea of what we could do with monolingual crowd translation, let’s look at the methods of translation in this space. … It is the scalability that I am interested in: bilingual translation doesn’t scale well.
  • The translation is from Russian to Chinese. The Russian version is a volunteer's translation from the "original" English book (which is in itself a translation from the original story in Croatian). The 28 Russian sentences contain 213 words. The Chinese translations contain 410 characters. That is roughly 50 words per hour in Russian, 100 characters per hour in Chinese.
  • Shift from inaccurate to more accurate. Grade the sentences from not translated to fully translated. Look at the number of sentences with each grade.
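
As a rough check on the throughput figures in the notes (213 Russian words and 410 Chinese characters over 28 finished sentences), the arithmetic can be sketched in Python. The four hours of total working time is an assumption, taken from the "four pairs, one hour per pair" setup on slide 25; all variable names are illustrative.

```python
# Figures from the notes and slide 25. PAIR_HOURS is an assumption:
# four translation pairs working one hour each.
RUSSIAN_WORDS = 213
CHINESE_CHARS = 410
SENTENCES_FINISHED = 28
PAIR_HOURS = 4

total_minutes = PAIR_HOURS * 60

# Crowd throughput in the pilot
words_per_hour = RUSSIAN_WORDS / PAIR_HOURS
chars_per_hour = CHINESE_CHARS / PAIR_HOURS
minutes_per_sentence = total_minutes / SENTENCES_FINISHED
words_per_minute = RUSSIAN_WORDS / total_minutes

# Baseline quoted on slide 25: ~2500 words per 8-hour day
professional_words_per_minute = 2500 / (8 * 60)

print(f"Russian words per hour:     {words_per_hour:.1f}")
print(f"Chinese characters per hour: {chars_per_hour:.1f}")
print(f"Minutes per sentence:        {minutes_per_sentence:.1f}")
print(f"Words per minute (pilot):    {words_per_minute:.2f}")
print(f"Words per minute (baseline): {professional_words_per_minute:.2f}")
```

Under that assumption the results agree with the rounded values quoted in the notes and on the slide (roughly 50 words and 100 characters per hour, and about 8.5 minutes per sentence).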

    1. Translation as a Collaborative Activity. Benjamin B. Bederson & Philip Resnik. Computer Science Department; Human-Computer Interaction Lab; Department of Linguistics; Computational Linguistics and Information Processing Lab; University of Maryland.
    2. Languages on Internet by Population. Source: Global Reach, Internet World Stats.
    3. International Children’s Digital Library: www.childrenslibrary.org
    4. A real-world problem: ICDL. Now: 4,386 books; 54 languages; some translations in a few languages; 49,420 adults and 29,102 children registered; ~100K unique visitors/month. Goal: 10,000 books; 100 languages; every book in every language!
    5. Machine Translation (MT). (餐厅 = restaurant, dining hall.) Large volume, cheap, fast. Unreliable quality.
    6. Professional Translators. High quality, but slow and expensive (even for common language pairs).
    7. Amateur Translators
    8. The key idea
    9. Translation with the Crowd. Translate with the Monolingual Crowd. Wikipedia: 800 translators vs. 75,000 contributors.
    10. [Chart relating Affordability and Quality for: Machine Translation; Monolingual Human Participation; Amateur Bilingual Human Participation; Professional Bilingual Human Participation.]
    11. Monolingual translation protocol. [Diagram: original source sentence (F0) → MT → noisy target hypothesis (E0) → monolingual post-editing → fluent target hypothesis (E1) → MT → noisy back translation (F1) → HTER editing → F2 → MT → E2 (fluent, accurate) → et cetera…]
    12. Monolingual translation protocol. Each participant is performing a monolingual task: infer partner’s intended meaning as well as possible; express that meaning grammatically in own language. Source language participant has extra constraint: expressed meaning must match source sentence. Conflict? Original meaning wins.
    13. Three Types of Errors: detectable and correctable.
       Source: Tout le monde doit entendre l'histoire de Cendrillon.
       MT: Everybody must to hear story about Cinderella
       Monolingual post-editing: Everybody must hear the story about Cinderella
    14. Three Types of Errors: detectable but not correctable.
       Source: Tout le monde doit entendre l'histoire de Cendrillon.
       MT: Everybody must heard the business by Cinderella
       Monolingual post-editing: ?
    15. Three Types of Errors: not detectable.
       Source: Tout le monde doit entendre l'histoire de Cendrillon.
       MT: Everybody loves the story about Cinderella
    16. Enrich Translations. Increase redundancy and shared context … to help make detectable errors correctable … to help make undetectable errors detectable.
    17. [Diagram: MT]
    18. [Diagram: MT, MT]
    19. [Diagram: MT, MT, MT, enrichment]
    20. [Diagram: MT, MT, MT, enrichment, MT]
    21. [Diagram: MT, MT, MT, enrichment, MT]
    22.
    23.
    24. [Screenshot labels: Web link; Image; Mark ok; Mark problematic]
    25. Preliminary validation of the protocol. Language pair: Russian to Chinese (hard case: no orthographic cues; easy to find local volunteers). Two Russian speakers and four Chinese speakers. Four Russian-Chinese translation pairs (Russians twice). One hour per pair. Worked on 44 sentences (6 pages), finished 28 = ~8.5 minutes per sentence (~1 word per minute). N.B. average translators: ~2500 words per 8hr day (= ~5 words per minute).
    26. Google Translate …
    27. … Monotrans
    28. Ishida, Lin and colleagues at Kyoto University Department of Social Informatics have independently developed a very similar back-and-forth protocol. In their protocol, there is no enrichment to increase redundancy: if the target participant cannot make sense of the whole sentence, he or she requests that the entire original sentence be rephrased.
    29. Global Internet User Population. Source: http://www.internetworldstats.com/stats7.htm
    30. Announced a Popular Movement for the Liberation of Sudan to withdraw its candidate in the presidential elections scheduled in April this as confirmation of the leaders in the movement.
       اعلنت الحركة الشعبية لتحرير السودان سحب مرشحها في الانتخابات الرئاسية المقرر اجراؤها في نيسان/ابريل الجاري، حسب تاكيدات لقياديين في الحركة.
       Announced SPLM withdraw its candidate in the presidential elections in April by assurances to leaders in the movement.
       http://news.bbc.co.uk/2/hi/africa/8597996.stm
    31.
    32. Do I really need to worry about this?
    33. With some things, it’s very hard to say no.
    34. Monolingual post-editing. Callison-Burch (2005).
    35. Enriching context.
       Ppl r rlly gd t gttng mst f th mnng rght, vn frm dfcnt txt.
       {minus, weak, deficient}
    36. One more observation. The original source sentence is not the only way the intended meaning could have been expressed. Suppose this phrasing is difficult to translate correctly. [Example phrase: a restaurant close by.] Perhaps one of these alternatives can be more successful.
    37. Polls indicate Brown, a state senator, and Coakley, Massachusetts’ Attorney General, are locked in a virtual tie to fill the late Sen. Ted Kennedy’s Senate seat
       Les sondages indiquent Brown, un sénateur d’état, et Coakley, Massachusetts’ Procureur général, sont enfermés dans une cravate virtuel à remplir le regretté sénateur Ted Kennedy’s siège au Sénat.
       Polls indicate that Brown, a state senator, and Coakley, the Attorney General of Massachusetts, are locked in a virtual tie to fill the Senate seat of the Sen. Ted Kennedy, who died recently.
       Les sondages indiquent que Brown, un sénateur d’état, et Coakley, le procureur général du Massachusetts, sont enfermés dans une cravate virtuel pourvoir le siège au Sénat de Sen. Ted Kennedy, qui est décédé récemment
    38.
    39. Automatically determining where the errors are. [Diagram: the sentence E, "the most recent probe to visit jupiter was the pluto-bound new horizons spacecraft", passes through MT twice. Parse trees (labels NP, NP, PP, D, S) are shown for F, "The most recent probe to visit Jupiter was the Pluto-bound new horizons spacecraft", and F’, "The latest research visit Jupiter was the Pluto-bound new horizons spacecraft". Mismatches?]
    40. [Source sentence with paraphrase alternatives listed in brackets, original phrasing first:]
       the press trust of india quoted
       [the government minister for relief and rehabilitation kadam | kadam, the government’s relief and rehabilitation minister (2/3) | the government minister concerned with relief and rehabilitation kadam (1/3)]
       as revealing today that in the last week, the monsoon has started in
       [all of india’s states one | every one of india’s state, one (3/3) | each of India’s states one (2/3) | all states of india one (1/3)]
       after another, and that the financial losses and casualties have been serious in all areas. just in maharashtra, the state which includes
       [mumbai, india’s largest city, | india's largest city, mumbai (3/3) | the largest city in India, Mumbai, (3/3) | mumbai, the largest city of india, (3/3)]
       the number of people
       [known to have died | who died (3/3) | identified to have died (2/3) | known to have passed away (2/3)]
       has now reached 358.
       For 31% of the sentences in an English-to-Chinese experiment, at least one new version of the sentence leads to better translation. Often the gains are quite substantial.
    41. [Repeats slide 37.]
    42. Where to from here? Larger and more formal validation of the protocol. Exploring the space of richer annotations. Reconsidering UI for: ease of use; throughput; parallel, multi-person contribution. Exploring the space of automatic and human error detection and paraphrase.
    43. Collaborators and Sponsors. Chang Hu (CS Ph.D. student); Olivia Buzek (CS/Linguistics undergrad); Alex Quinn (CS Ph.D. student); Yakov Kronrod (Linguistics Ph.D. student).
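
The back-and-forth loop of the monolingual translation protocol (slides 11 and 12) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `run_protocol`, its parameters, the lookup tables standing in for MT and for the human participants, and the stopping rule (stop once the source-side participant leaves the back translation unchanged) are all assumptions made for the example. The sentences come from the Cinderella example on slide 13.

```python
from typing import Callable

# Hypothetical alias: any string-to-string function can stand in for an MT system.
Translate = Callable[[str], str]


def run_protocol(source: str,
                 mt_to_target: Translate,
                 mt_to_source: Translate,
                 target_edit: Callable[[str], str],
                 source_edit: Callable[[str, str], str],
                 max_rounds: int = 3) -> str:
    """Alternate MT with monolingual edits on each side (illustrative sketch)."""
    source_side = source
    target_side = ""
    for _ in range(max_rounds):
        noisy_target = mt_to_target(source_side)       # noisy target hypothesis
        target_side = target_edit(noisy_target)        # monolingual post-editing
        back_translation = mt_to_source(target_side)   # noisy back translation
        # The source-side participant corrects the back translation so that it
        # matches the original meaning ("original meaning wins").
        corrected = source_edit(source, back_translation)
        if corrected == back_translation:
            break  # assumed stopping rule: nothing left to fix
        source_side = corrected
    return target_side


# Toy stand-ins for MT and for the two participants (not real systems).
FRENCH = "Tout le monde doit entendre l'histoire de Cendrillon."
NOISY = "Everybody must to hear story about Cinderella"
FLUENT = "Everybody must hear the story about Cinderella"

forward_table = {FRENCH: NOISY}
backward_table = {FLUENT: FRENCH}
post_edit_table = {NOISY: FLUENT}

result = run_protocol(
    FRENCH,
    mt_to_target=lambda s: forward_table.get(s, s),
    mt_to_source=lambda s: backward_table.get(s, s),
    target_edit=lambda s: post_edit_table.get(s, s),
    # The toy source participant simply restores the original if they differ.
    source_edit=lambda original, back: back if back == original else original,
)
print(result)
```

Passing the MT systems and the two editors as plain functions keeps the sketch independent of any particular MT service or user interface; in a real deployment the editing steps would be interactive tasks rather than table lookups.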