
Lingvist - Statistical Methods in Language Learning

Machine Learning Estonia Meetup 28.02.2017

Published in: Data & Analytics


  1. LINGVIST | Language learning meets AI
     STATISTICAL METHODS IN LANGUAGE LEARNING
     Machine Learning Estonia Meetup 2017-02-28
  2. Last time… “We have very little Machine Learning” (paraphrasing Ahti)
     lingvist.com | @lingvist
  3. Let’s fix it!
  4. At the same time in the Marketing team…
  5. Lingvist Intro
     • Foreign language learning application
     • We are obsessed with learning speed
     • Currently free to use
     • Web, iOS, Android versions
     • 16 courses (language pairs) publicly available: ET-EN, ET-FR, RU-EN, RU-FR, EN-DE, EN-ES, EN-FR, EN-RU, AR-EN, DE-EN, FR-EN, ES-EN, JA-EN, PT-EN, ZH-Hant-EN, ZH-Hans-EN
     Homepage: lingvist.com
  6. (No text on this slide.)
  7. You are expected to type in the correct answer
  8. If you don’t know it, we show the correct answer
  9. Well done!
  10. We use statistics to…
      • Prepare the course material
      • Predict what learners already know
      • Choose optimal repetition intervals during learning
      • Analyze common mistakes learners make (and help them avoid these)
      We also use conversion, retention, and engagement statistics to drive most product decisions, but I will not talk about that today.
  11. Course material preparation
  12. Frequency-based vocabulary
      Objective:
      • Teach vocabulary in order of frequency
      • Quickly reach a level that is practically useful
      • French: ~2000 words cover ~80% of the words in a typical text
      Solution:
      • Acquire a big text corpus
      • Parse and tag (noun, verb, …) all words
      • Build a word list in frequency order
      • Adjust the ranking (down-rank pronouns, articles, …)
      • Review and adjustments by linguists
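The frequency ordering and the coverage figure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Lingvist's pipeline: the toy corpus, the `DOWN_RANKED` set, and both function names are assumptions, and a real corpus would be parsed and PoS-tagged first.

```python
from collections import Counter

# Hypothetical function words to push down the list (assumption, not from the talk).
DOWN_RANKED = {"the", "a", "of"}

def frequency_list(tokens, down_ranked=DOWN_RANKED):
    """Return words in teaching order: most frequent first, down-ranked words last."""
    counts = Counter(tokens)
    # Sort key: down-ranked flag, then higher counts first, then alphabetical for determinism.
    ordered = sorted(counts, key=lambda w: (w in down_ranked, -counts[w], w))
    return ordered, counts

def coverage(ordered, counts, top_n):
    """Share of all running words covered by the first top_n entries of the list."""
    total = sum(counts.values())
    return sum(counts[w] for w in ordered[:top_n]) / total

tokens = "the cat sat on the mat the cat ran".split()
ordered, counts = frequency_list(tokens)
print(ordered)
print(coverage(ordered, counts, 2))
```

On a real corpus, calling `coverage` with increasing `top_n` gives the kind of curve behind the "~2000 words cover ~80%" estimate.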
  13. Sample sentence extraction
      Objective:
      • Sentences should represent a typical context
      • Manual production is very time-consuming
      Solution:
      • Extract candidate sentences/phrases from the text corpus
      • Rank sentences based on a set of criteria
      • Linguists choose the most suitable
      • Sentences are edited for consistency and completeness
  14. Sample sentence ranking
      Ranking criteria:
      • C1. Sentence length
      • C2. Complete sentence
      • C3. Previously learned words in the course
      • C4. Natural sequence of words (“fast car” vs “brave car”)
      • C5. Contains relevant context words (“go home”)
      • C6. Thematically consistent (“flower” and “bloom”)
      The total score is a weighted sum of sub-scores.
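As a rough sketch of "total score is a weighted sum of sub-scores": the weights, the sub-score names, and the example values below are invented for illustration, since the talk gives neither the weights nor how each criterion is measured.

```python
# Hypothetical weights for criteria C1..C6; the talk does not give actual values.
WEIGHTS = {"length": 1.0, "complete": 2.0, "known_words": 1.5,
           "natural": 1.0, "context": 1.0, "theme": 0.5}

def total_score(sub_scores, weights=WEIGHTS):
    """Weighted sum of sub-scores; each sub-score is assumed to lie in [0, 1]."""
    return sum(weights[name] * sub_scores.get(name, 0.0) for name in weights)

def rank_sentences(candidates):
    """Sort (sentence, sub_scores) pairs from best to worst total score."""
    return sorted(candidates, key=lambda c: total_score(c[1]), reverse=True)

candidates = [
    ("car brave the", {"length": 0.6, "complete": 0.0, "known_words": 0.8,
                       "natural": 0.1, "context": 0.2, "theme": 0.5}),
    ("He drives a fast car.", {"length": 0.9, "complete": 1.0, "known_words": 0.8,
                               "natural": 0.9, "context": 0.7, "theme": 0.5}),
]
best = rank_sentences(candidates)[0][0]
print(best)
```

The ranked list would then go to linguists, who pick and edit the final sentence.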
  15. Extracted sample sentences (sample)
  16. Dr. Haystack
      • The English corpus used was ~3.7 billion words
      • There are no conversational corpora of the required size
      • The number of criteria leads to “the curse of dimensionality”
      • Words are rarely used in a context that linguists consider a good example
      • Harder than finding a needle in a haystack
  17. Predicting what user already knows
  18. Predicting what user already knows
      Objective:
      • We have many users with previous knowledge of the language
      • If we could predict what they already know...
        - then we can exclude these words
        - save time
        - avoid boredom
      • We have had a placement test feature for about a year
        - prediction is based on word frequencies
        - but this correlation is not high and we miss many known words
        - it still has a big positive impact on user retention
        - can we do better?
  19. Predicting what user already knows

      User    wait  doubt  letter  between  son  wait  Target word: wonder
      User 1  1     1      1       0        1    0     0
      User 2  1     0      1       0        1    1     1
      User 3  0     0      0       1        1    1     1

      How?
      • We don't teach new words – we ask first
      • What a person already knows is valuable information
      Training the models:
      • Take all first answers from the learning history (correct answer = user already knows the word)
      • Train a model per word to predict knowledge of that word
      • Rank words by their predictive power
      • Train a second model for each word using a fixed set of the most predictive words as inputs
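The talk does not say how "predictive power" is measured. One simple stand-in, used here purely as an assumption, is the information gain between a user's first answer on a candidate word and their first answer on the target word. The toy answer vectors are invented.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(feature, target):
    """Reduction in target entropy after splitting users by a 0/1 feature."""
    gain = entropy(target)
    for value in (0, 1):
        subset = [t for f, t in zip(feature, target) if f == value]
        gain -= len(subset) / len(target) * entropy(subset)
    return gain

# Toy first-answer data: one entry per user, 1 = answered correctly on first exposure.
answers = {
    "letter": [1, 1, 0, 0, 1, 0],
    "son":    [1, 0, 1, 0, 1, 0],
}
target = [1, 1, 0, 0, 1, 0]   # first answers for the target word, e.g. "wonder"

# Candidate words ordered from most to least informative about the target.
ranking = sorted(answers, key=lambda w: information_gain(answers[w], target), reverse=True)
print(ranking)
```

Aggregating such scores over all target words would give one fixed set of placement-test words, which is what the second round of models takes as input.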
  20. Predicting what user already knows
      • 5000 models for each course (one model for each word in the course)
      • User answers the most predictive words (up to 50 words)
      • For each word in the course, feed the answers as input
      • Get the prediction for each word
      • Include or exclude the word in the course based on the prediction
      • Include a small % of excluded words anyway (for validation)
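The include/exclude step with a validation holdout might look like the sketch below. The function name, the 0.5 threshold, and the 5% validation rate are illustrative assumptions; the per-word probabilities would come from the trained models.

```python
import random

def build_course(predictions, threshold=0.5, validation_rate=0.05, rng=None):
    """Split words into those to teach and those to skip.

    predictions: dict word -> predicted probability that the user knows it.
    threshold and validation_rate are illustrative values, not from the talk.
    """
    rng = rng or random.Random()
    teach, skip = [], []
    for word, p_known in predictions.items():
        predicted_known = p_known >= threshold
        # Keep a small random share of "known" words in the course for validation.
        if predicted_known and rng.random() >= validation_rate:
            skip.append(word)
        else:
            teach.append(word)
    return teach, skip

predictions = {"house": 0.9, "wonder": 0.3, "letter": 0.8}
teach, skip = build_course(predictions, validation_rate=0.0)
print(teach, skip)
```

The words kept for validation are the only source of ground truth about predicted-known words, so dropping the holdout entirely would make the model impossible to evaluate on live data.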
  21. Predicting what user already knows
      Averages of performance metrics, RU-EN course:

      Metric                   Random Forest,     Random Forest,
                               first 4000 words   first 2000 words
      Accuracy                 0.74               0.72
      Precision for “known”    0.67               0.72
      Recall for “known”       0.69               0.72
      Precision for “unknown”  0.52               0.52
      Recall for “unknown”     0.54               0.57
      Training samples         2440               4959
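For readers less familiar with the metrics in this table, here is a minimal sketch of how accuracy and per-class precision and recall are computed from true and predicted labels. The label lists are invented and unrelated to the reported numbers.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision and recall for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = ["known", "known", "unknown", "unknown", "known"]
y_pred = ["known", "unknown", "unknown", "known", "known"]
print(accuracy(y_true, y_pred))
print(class_metrics(y_true, y_pred, "known"))
print(class_metrics(y_true, y_pred, "unknown"))
```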
  22. Predicting what user already knows
      Challenges:
      • Distribution of samples is heavily skewed towards the beginning of the course
      • Dataset is biased due to the current placement test implementation:
        - we excluded a word if we predicted the user knows it
        - so we have little data about true positives and false positives
      • Model performs worse for some language pairs
      • Order of the words in the course influences the model
  23. Predicting optimal repetition interval
  24. Predicting optimal repetition interval
  25. Predicting optimal repetition interval
      Based on:
      • Forgetting curve: exponential decay, Hermann Ebbinghaus ~1885
      • Spaced repetition: C. A. Mace ~1932
      Forgetting curve parameters are:
      • highly individual (depend on the person)
      • highly contextual (depend on the fact being learned)
      Challenge: measure or estimate forgetting curve parameters
      • for this particular person
      • for this particular word or skill
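A minimal sketch of the exponential forgetting curve and of inverting it to get a repetition interval. The single-parameter form exp(-t / stability) and the name `stability` are assumptions made for illustration; the talk only says the curve is an exponential decay whose parameters vary by person and by word.

```python
from math import exp, log

def recall_probability(minutes, stability):
    """Exponential forgetting curve: probability of recall after `minutes`."""
    return exp(-minutes / stability)

def interval_for(target_probability, stability):
    """Longest wait (minutes) at which recall probability still equals the target."""
    return -stability * log(target_probability)

stability = 120.0  # hypothetical value for one user and one word
wait = interval_for(0.85, stability)
print(wait, recall_probability(wait, stability))
```

A larger `stability` means slower forgetting and therefore a longer interval for the same target probability, which is why the parameter has to be estimated per person and per word.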
  26. Predicting optimal repetition interval
      Objective:
      • Given a target word with a learning history (e.g. 3 answers; intervals of 1/10/50 min; wrong/correct/wrong)
      • Predict the interval after which the user answers correctly with the desired probability (~80-90%)
      Method:
      • Take the user's learning history (all answers and their preceding histories)
      • Calculate each history's distance to the target word's history
      • Choose up to ~100 learning histories most similar to the target word's
      • Fit a curve through their next repetition intervals and answers
      • Calculate the interval for the desired probability that the user answers correctly
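The method above can be sketched end to end on toy data. Everything specific here is an assumption: the three features, Euclidean distance, the one-parameter curve exp(-t / stability), the maximum-likelihood grid search, and k=3 (the talk uses up to ~100 histories and does not describe its distance or fitting procedure).

```python
from math import exp, log

def distance(a, b):
    """Euclidean distance between two history feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest(target, histories, k):
    """The k histories whose features are closest to the target's."""
    return sorted(histories, key=lambda h: distance(h["features"], target))[:k]

def log_likelihood(stability, outcomes):
    """Bernoulli log-likelihood of (interval, correct) pairs under exp(-t/s)."""
    total = 0.0
    for minutes, correct in outcomes:
        # Clamp to keep log() finite at the extremes.
        p = min(max(exp(-minutes / stability), 1e-9), 1 - 1e-9)
        total += log(p) if correct else log(1 - p)
    return total

def fit_stability(outcomes, grid):
    """Pick the grid value that maximizes the likelihood (crude 1-D search)."""
    return max(grid, key=lambda s: log_likelihood(s, outcomes))

# Toy histories: features = (answers so far, log of last interval in minutes, last correct);
# "next" = (next interval in minutes, whether the next answer was correct).
histories = [
    {"features": (3, log(30), 1), "next": (60, True)},
    {"features": (3, log(40), 1), "next": (180, False)},
    {"features": (3, log(60), 1), "next": (90, True)},
    {"features": (12, log(20160), 1), "next": (100800, False)},
]
target = (3, log(50), 1)                       # 3 answers, last interval 50 min, correct
similar = nearest(target, histories, k=3)
grid = [2 ** i for i in range(3, 15)]          # candidate stabilities: 8 .. 16384 minutes
stability = fit_stability([h["next"] for h in similar], grid)
interval = -stability * log(0.85)              # wait for ~85% recall probability
print(stability, interval)
```

In practice the features would need scaling so that no single one dominates the distance, and a continuous optimizer would replace the grid.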
  27. Clustering similar histories

      Word     # answers  Last interval  Last correct  + N parameters  Next interval  Next correct
      voiture  3          50 min         Yes           …               ???            80-90%
      reste    2          6 min          No                            4 min          Yes
      reste    3          4 min          Yes                           1 hr           Yes
      voyage   3          30 min         Yes                           3 hrs          No
      voyage   4          3 hrs          No                            2 hrs          Yes
      …        …
      devriez  12         2 wk           Yes                           10 wk          No
  30. Curve fitting
  31. Mistake classification
  32. Mistake classification
      • Extract all wrong answers
      • Classify wrong answers: typos, wrong grammar form, synonyms, false friends, …
      • Sort by most common mistakes
      • … and figure out what we can do about it
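A rough sketch of the classify-and-count step. The rules, their order, the edit-distance threshold of 1, and the lookup sets passed to `classify` are all assumptions for illustration; the talk does not describe how its classifier works.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def classify(answer, expected, other_forms=(), synonyms=()):
    """Assign a wrong answer to one coarse category (order of checks matters)."""
    if answer in other_forms:
        return "wrong grammar form"
    if answer in synonyms:
        return "synonym"
    if edit_distance(answer, expected) <= 1:
        return "typo"
    return "other"

# (given answer, expected answer) pairs; toy data.
wrong_answers = [("voitur", "voiture"), ("voitures", "voiture"),
                 ("auto", "voiture"), ("voitur", "voiture")]
labels = Counter(
    classify(a, e, other_forms={"voitures"}, synonyms={"auto"}) for a, e in wrong_answers
)
print(labels.most_common())
```

Checking grammar forms before typos matters because an inflected form is often only one edit away from the expected word.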
  33. Reducing mistakes
      • Improve the sample sentence
      • Give hints to the user
      • Allow the user to try again
  34. Concluding remarks
  35. Some learnings
      • Deterministic history leads to biases
      • Adding some randomization is good for discovery
      • Each language pair is analyzed separately (RU-EN vs FR-EN)
      • Noise (typos, bad samples, etc.) must be accounted for
  36. Technology
      • Python (3.x)
      • NumPy, SciPy, Pandas – statistics, clustering, calculations
      • scikit-learn – machine learning (Random Forest, Multinomial Bayes, feature extraction)
      • Gensim – distributional semantics (CBOW, word2vec, skip-gram, …)
      • Semspaces – functions for working with semantic spaces
      • NLTK, Freeling, Stanford NLP – parsing, PoS tagging, pre-processing
  37. THANK YOU! Credits go to the team, mistakes are mine!
