
AI based language learning tools

In this session, Rakuten Institute of Technology Singapore will be showcasing AI based language learning tools on top of authentic foreign language content. The prototypes employ state-of-the-art technologies and the treasure trove of Rakuten’s multilingual data. See how an interdisciplinary team of experts in machine translation, computational linguistics, platform engineering, and cognitive psychology comes together to blend education with entertainment, transforming passive TV viewing into an opportunity for active learning.

Published in: Technology

  1. 1. Oct. 28, 2017. Ewa Szymanska, PhD, Head of Rakuten Institute of Technology Singapore
  2. 2. Source: https://unsplash.com/ by Element5 Digital
  3. 3. “I am watching shows in Chinese to get used to ‘actual’ spoken Mandarin, and not just what I see in my textbooks.” (VIKI user)
  4. 4. * Images from Rakuten VIKI, Rakuten TV
  5. 5. 1.8 billion people are learning foreign languages. Charts: languages with most native speakers; most commonly studied foreign languages. Source: The Washington Post: https://www.washingtonpost.com/news/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts
  6. 6. Online individual language learning market is growing at 12% CAGR. Source: Rosetta Stone Investor Day 2017
  7. 7. I. Entertaining Content; II. Global Users; III. Technology. * Photo by Jakob Owens on Unsplash
  8. 8. Three tools: interactive subtitles, video dictionary, quizzes. * Images from Rakuten VIKI
  9. 9. Tool 1: Interactive subtitles. Fast adoption: 30,000 DAU (daily active users). High engagement: Korean Learn Mode users view 10% more than the Viki average. High satisfaction: 83 NPS (net promoter score). * cnet.com @ CBS Interactive Inc., Apr 13, 2017; Keia.org, Korean Economic Institute, Apr 2017; Forbes, Oct 24, 2017; The Verge, Sep 28, 2017
  10. 10. Shows availability: Learn Chinese (Japan); Learn Korean (USA). Shows: “Daughter Back”, “Return of Happiness”, “Ice and Fire of Youth”, “My Love from the Star”, “Boys Over Flowers”, “Descendants of the Sun”. [ Learn Mode collection on viki.com ] * Images from Rakuten VIKI
  11. 11. Tool 2: Drama Vocab Quiz [ languagequiz.viki.com ]. 60,000+ quizzes taken; 35,000+ users completed the quiz; very positive social media engagement.
  12. 12. Tool 3: Video-based Dictionary. Integrate with the classroom curriculum.
  13. 13. “If you talk to a man in a language he understands, that goes to his head. If you talk to him in his language, that goes to his heart.” - Nelson Mandela
  14. 14. 14
  15. 15. Oct 28, 2017. Stanley Kok, Principal Research Scientist, Rakuten Institute of Technology (Singapore)
  16. 16. Splitting a sentence into pieces, each preserving its original semantics. Sentence: 你是辣妹,也是名门贵族. Good split: 你 是 辣妹 , 也是 名门贵族 = “you are (a) hot chick and also (of) the gentry”. Bad split: 你 是 辣妹 , 也是 名门贵 族 = “you are (a) hot chick and also tribe”.
  17. 17. Sentence: 努力的人才会成功. Split as 努力 的 人 才 会 成功 = “only hardworking people will succeed”. Split as 努力 的 人才 会 成功 = “hardworking talent will succeed”.
  18. 18. 18
  19. 19. Tokenization; Dictionary Lookup
  20. 20. Many open-source tokenizers are available. They are good, but not perfect, and they make different mistakes. Why not use more (or all) of them to improve tokenization? The strengths of one tokenizer overcome the shortcomings of another.
  21. 21. How to quantify the “goodness” of a tokenization? Take the human learner’s perspective: the number of dictionary look-ups needed to understand all tokens. Non-existent tokens are assumed to need a large number of look-ups (10). Examples: 你 是 辣妹 (you, are, hot chick): 1 + 1 + 1 = 3. 你 是 辣 妹 (you, are, spicy, younger sister): 1 + 1 + 1 + 1 = 4. 你 是辣 妹 (you, ?, younger sister): 1 + 10 + 1 = 12.
  22. 22. Can do better than picking the lowest-cost tokenization from the tokenizers. Treat common tokens as “anchor points”. Pick the best tokens from the remaining ones.
  23. 23. Tokenizations with look-up costs: 你 是 辣妹 也是 名门贵 族 (“you are hot chick and also tribe”), cost 15; 你 是辣 妹 也是 名门贵族 (“you younger sister and also (of) the gentry”), cost 14; 你 是 辣妹 也是 名门贵族, cost 5.
  24. 24. Dictionaries are important for language learning. The manual approach provides a high-quality dictionary, but is not scalable: with about 7000 languages in the world, there are about 49 million possible bilingual dictionaries (7000 × 7000 language pairs). Thus an automatic approach is needed.
  25. 25. Lots of online dictionaries are available. Could we automatically learn new dictionaries from them? Focus on Chinese-English (C-E) and Korean-English (K-E) bilingual dictionaries.
  26. 26. Lots of dictionaries are online. Some are C-E and K-E, but many are not. Many dictionaries are C-X and X-E. Use language X as a bridge/pivot: C-X + X-E => C-E, e.g., 辣妹 -> fille sexy + fille sexy -> hot chick => 辣妹 -> hot chick.
  27. 27. Take 2 hops for now. The Chinese-English dictionary has 750K entries, 90% correct. The Korean-English dictionary has 100K entries, 99% correct.
  28. 28. Learn a bilingual dictionary using a seed lexicon and monolingual data (plentiful). Maps bilingual phrases to a vector space, e.g., dolphin / 海豚, Tokyo / 东京, sushi / 寿司.
  29. 29. 29
  30. 30. 30
  31. 31. Artifact of the standard machine translation pipeline. Parallel sentences are aligned word for word. Compute the probability of mapping tokens of a source language to those of a target language. A correct source token will be more consistently aligned to its corresponding target token(s). Add high-probability mappings to the dictionary.
  32. 32. Chinese | English | P(C|E) | P(E|C) | AveProb. 辣妹 | hot chick | 0.8 | 0.9 | 0.85. 是辣 | is curry | 0.1 | 0.1 | 0.1.
  33. 33. Chinese-English Dictionary: 3 million Chinese tokens (Jan ’17), 89% in dictionary. Korean-English Dictionary: 4 million Korean tokens (Jan ’17), 86% in dictionary.
  34. 34. [Charts: #KoreanTokens vs. #Definitions (axis ticks 0 to 500,000 and 1 to 39); #ChineseTokens vs. #Definitions (axis ticks 0 to 450,000 and 1 to 28)]
  35. 35. Match parallel sentences to: phrase table; dictionary.
  36. 36. Parallel sentence: 他 放弃 梦想 / He gave up his dreams. Phrase Table (Chinese | English | AveProb): 放弃 | gave up his | 0.74; 放弃 | quit | 0.83; 放弃 | abdicate | 0.68.
  37. 37. Parallel sentence: 他 放弃 梦想 / He gave up his dreams. Phrase Table (Chinese | English | AveProb): 放弃 | gave up his | 0.74; 放弃 | quit | 0.83; 放弃 | abdicate | 0.68. Best match highlighted.
  38. 38. Parallel sentence: 他 放弃 梦想 / He gave up his dreams. Phrase Table (Chinese | English | AveProb): 放弃 | gave up his | 0.74; 放弃 | quit | 0.83; 放弃 | abdicate | 0.68. Dictionary (Chinese | English): 放弃 | abandon; 放弃 | give up; 放弃 | abdicate. Best match highlighted in both the phrase table and the dictionary.
  39. 39. Drama Vocabulary Quiz. Liling Tan, Rakuten Institute of Technology (Singapore). 28 Oct 2017 @ Rakuten Tech. Conference
  40. 40. Overview: Introduction; Demo; How did We Create the Quiz?
  41. 41. Introduction: Quizzes are fun and could be viral, but manually creating quizzes is tedious. We created #DramaVocabQuiz, which generates new vocabulary quizzes automatically.
  42. 42. 42
  43. 43. 43
  44. 44. 44
  45. 45. 45
  46. 46. 46
  47. 47. 47
  48. 48. How do we Generate Quizzes Automatically?
  49. 49. Korean Drama Word List. The word 미남 [minam] “handsome guy” can be followed by multiple suffixes at once, -이시라구요 [-issilaguyo], to form a single word meaning “someone said that he is handsome”. We only extract the root word 미남 [minam], and count it as a unique word type.
  50. 50. Korean Drama Word List
  51. 51. Korean Drama Word List
  52. 52. Korean Drama Word List
  53. 53. Splitting Word List into 3 Difficulty Levels
  54. 54. Generate the Distractors. Distractor 1: select the top 5th to 20th closest words (cosine). Distractor 2: use Distractor 1 as negative and the question word as positive, select the 1st to 20th closest word (cosmul). References: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In ICLR. Omer Levy and Yoav Goldberg. 2014. Linguistic Regularities in Sparse and Explicit Word Representations. In CoNLL.
  55. 55. Language Learners Like Quizzes!! 60,000+ quizzes taken; 35,000+ unique users completed the quiz; 16% of the users repeated the quiz.
  56. 56. Word Frequency is a Good Indicator of Difficulty. [Chart over Easy, Medium, Hard; axis ticks 0 to 10] Easy = Frequent words; Medium = Less Frequent words; Hard = Least Frequent words.
  57. 57. Conclusion: Watch Drama, Learn Language. Quiz: https://languagequiz.viki.com Techblog: https://techblog.rakuten.co.jp/2017/05/26/lang-quiz/
  58. 58. Oct. 28, 2017. Pang Zineng, Senior Technologist, Rakuten Institute of Technology Singapore
  59. 59. * Images from Rakuten VIKI
  60. 60. Web Search (pages); In-Video Search (clips). * Images from Rakuten VIKI
  61. 61. Web Search uses: the meta data of the site; the meta data of the page; the word tokens in the page; the topic of the page; the originality of the page; hyperlinks (page rank). Diagram labels: site identifier, page identifier, content ranking, search relevancy. In-Video Search uses: the meta data of the video; the meta data of this clip (timestamp, length, URI, etc.); the caption text of the clip; the frames & audio signal; complexity of the sentence; diversity of the clips. Diagram labels: video identifier, clip identifier, search relevancy, content ranking. * Images from Rakuten VIKI
  62. 62. Job: Make some data ready for consumption. Questions: How does the data come? What needs to be done for it to be ready? How will the data be consumed? Diagram components: database; Pre-processing function; Trigger / monitor function; Raw Data; Data access function; FTP; API; Data provider; Data consumer.
  63. 63. Job: Let outsiders use a function. Questions: How frequently will the function be used? What data does the function need? Diagram components: Application logic; API Endpoint; Web Application; API Cache; Request Queue; Application Cache; Internal/External Data.
  64. 64. Diagram components: Rakuten TV video contents; Other video contents; Rakuten VIKI video contents; Search function; 3rd Party Platform; Motion Dictionary. * Images from Rakuten VIKI
  65. 65. [Diagram: Interactive Subtitles (version 2) and Interactive Subtitles (version 3). Functions: dictionary function; voice function; tokenization function. Data: Japanese, Korean, and Chinese Dictionary Data; Japanese, Korean, and Chinese Tokenization Data. Sources shown: 3rd party solution; open source framework; In-house solution (Korean and Chinese Tokenization Data).] * Images from Rakuten VIKI
  66. 66. [Diagram: Interactive Subtitles (version 2) and Interactive Subtitles (version 3). Functions: dictionary function; voice function; tokenization function. Data: Japanese, Korean, and Chinese Dictionary Data; Japanese, Korean, and Chinese Tokenization Data; Global Tokenization Data; Global Dictionary Data. Sources shown: 3rd party solution; open source framework; In-house solution (Korean, Chinese, and Global Tokenization Data; Global Dictionary Data).] * Images from Rakuten VIKI
  67. 67. Vocab Quiz (version 1): Take Quiz function; Chinese Quiz Data; Korean Quiz Data. * Images from Rakuten VIKI
  68. 68. Vocab Quiz (version 2): Take Quiz function; voice function; Chinese Quiz Data; Korean Quiz Data. * Images from Rakuten VIKI
  69. 69. “Fast iteration in R&D won’t be possible if we had many things bundled or coupled.” -- Pang. Vocab Quiz: https://languagequiz.viki.com/ Learn Mode (PC/Mac only): https://www.viki.com/collections/316981l-learn-the-basics-chinese and https://www.viki.com/collections/316939l-learn-the-basics-korean Motion Dictionary: TBD
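
Slide 21 scores a tokenization by the number of dictionary look-ups a learner would need, charging 10 for a token that is not in the dictionary. The following is a minimal sketch of that scoring, not the speakers' implementation. The toy dictionary and the names `TOY_DICTIONARY`, `lookup_cost`, and `cheapest_tokenization` are assumptions made for illustration. It covers only the baseline of picking the lowest-cost candidate, not the anchor-point refinement mentioned on slide 22.

```python
# Hypothetical toy dictionary holding only the entries from the slide-21 example.
TOY_DICTIONARY = {
    "你": "you",
    "是": "are",
    "辣妹": "hot chick",
    "辣": "spicy",
    "妹": "younger sister",
}

# Penalty from slide 21 for a token that does not exist in the dictionary.
MISSING_TOKEN_COST = 10


def lookup_cost(tokens, dictionary=TOY_DICTIONARY, missing_cost=MISSING_TOKEN_COST):
    """Charge one look-up per known token and missing_cost per unknown token."""
    return sum(1 if token in dictionary else missing_cost for token in tokens)


def cheapest_tokenization(candidates):
    """Return the candidate tokenization with the lowest look-up cost."""
    return min(candidates, key=lookup_cost)


# The three candidate tokenizations shown on slide 21.
candidates = [
    ["你", "是", "辣妹"],
    ["你", "是", "辣", "妹"],
    ["你", "是辣", "妹"],
]
costs = [lookup_cost(tokens) for tokens in candidates]
best = cheapest_tokenization(candidates)
```

With this toy dictionary the costs match the slide (3, 4, and 12), and the first candidate is selected.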

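Slide 26 builds a C-E dictionary by chaining a C-X dictionary with an X-E dictionary through a pivot language X. A minimal two-hop sketch follows, assuming each dictionary maps a word to a set of translations. The function name `compose_via_pivot` and the variable names are hypothetical, and the sketch does no filtering of the wrong entries that pivoting can introduce.

```python
def compose_via_pivot(source_to_pivot, pivot_to_target):
    """Chain two bilingual dictionaries through a shared pivot language."""
    composed = {}
    for source_word, pivot_words in source_to_pivot.items():
        target_words = set()
        for pivot_word in pivot_words:
            # Pivot words without an entry in the second dictionary contribute nothing.
            target_words.update(pivot_to_target.get(pivot_word, ()))
        if target_words:
            composed[source_word] = target_words
    return composed


# The example from slide 26, with French playing the role of language X.
chinese_to_french = {"辣妹": {"fille sexy"}}
french_to_english = {"fille sexy": {"hot chick"}}
chinese_to_english = compose_via_pivot(chinese_to_french, french_to_english)
```

Here `chinese_to_english` maps 辣妹 to "hot chick", as on the slide.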
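
Slides 31 and 32 add a mapping to the dictionary when its alignment probability is high in both directions. The sketch below assumes that AveProb is the arithmetic mean of P(C|E) and P(E|C), which is consistent with the two rows on slide 32. The cut-off `THRESHOLD = 0.5` is a hypothetical value, since the slides do not state one.

```python
def average_probability(p_c_given_e, p_e_given_c):
    """Arithmetic mean of the two directional alignment probabilities."""
    return (p_c_given_e + p_e_given_c) / 2


# Hypothetical cut-off; the slides do not give the value actually used.
THRESHOLD = 0.5

# Rows from slide 32: (Chinese, English, P(C|E), P(E|C)).
candidates = [
    ("辣妹", "hot chick", 0.8, 0.9),
    ("是辣", "is curry", 0.1, 0.1),
]

# Keep only the mappings whose averaged probability clears the cut-off.
accepted = [
    (chinese, english)
    for chinese, english, p_c_given_e, p_e_given_c in candidates
    if average_probability(p_c_given_e, p_e_given_c) >= THRESHOLD
]
```

Under these assumptions only the 辣妹 -> hot chick pair is kept, and the mis-tokenized 是辣 -> is curry pair is dropped.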