LINGVIST | Language learning meets AI
STATISTICAL METHODS IN
LANGUAGE LEARNING
Machine Learning Estonia Meetup
2017-02-28
Last time…
“We have very little Machine Learning”
- paraphrasing Ahti
Let's fix it!
At the same time in the Marketing team…
Lingvist Intro
• Foreign language learning application
• We are obsessed with learning speed
• Currently free to use
• Web, iOS, Android versions
• 16 courses (language pairs) publicly available
ET-EN, ET-FR,
RU-EN, RU-FR,
EN-DE, EN-ES, EN-FR, EN-RU,
AR-EN, DE-EN, FR-EN, ES-EN, JA-EN, PT-EN, ZH-Hant-EN, ZH-Hans-EN
Homepage: lingvist.com
You are expected to type in the correct answer.
If you don’t know, we show the correct answer.
Well done!
We use statistics to…
• Prepare the course material
• Predict what learners already know
• Choose optimal repetition intervals during learning
• Analyze common mistakes learners make (and help them avoid these)
We also use conversion, retention, and engagement statistics to drive most product decisions, but I will not talk about that today.
Course material preparation
Frequency-based vocabulary
Objective:
• Teach vocabulary based on frequency
• Quickly reach a level that is practically useful
• French: ~2000 words cover ~80% of the words in a typical text
Solution:
• Acquire big text corpus
• Parse and tag (noun, verb, …) all words
• Build word list in frequency order
• Adjust ranking (down-rank pronouns, articles, …)
• Review and adjustment by linguists
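A rough sketch of this pipeline, assuming NLTK for tokenization and PoS tagging (the actual corpus, tagger, and down-ranking rules are Lingvist's own and not shown here):

```python
# Sketch only: build a PoS-tagged frequency list and down-rank function words.
# Assumes NLTK with the 'punkt' and 'averaged_perceptron_tagger' data installed.
from collections import Counter
import nltk

def frequency_list(corpus_text, downrank_tags=("PRP", "DT"), penalty=0.5):
    tokens = nltk.word_tokenize(corpus_text.lower())
    tagged = nltk.pos_tag(tokens)                       # (word, PoS tag) pairs
    counts = Counter(word for word, tag in tagged if word.isalpha())
    tags = {word: tag for word, tag in tagged}
    # Adjusted score: raw frequency, with pronouns/articles pushed down the list.
    scored = {w: c * (penalty if tags.get(w) in downrank_tags else 1.0)
              for w, c in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)

# ranked = frequency_list(open("corpus.txt").read())   # linguists then review the head of the list
```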
Sample sentence extraction
Objective:
• Sentences should represent a typical context
• Manual production is very time-consuming
Solution:
• Extract candidate sentences/phrases from the text corpus
• Rank sentences based on a set of criteria
• Linguists choose the most suitable
• Sentences are edited for consistency and completeness
Sample sentence ranking
Ranking criteria:
• C1. Sentence length
• C2. Complete sentence
• C3. Previously learned words in course
• C4. Natural sequence of words (“fast car” vs “brave car”)
• C5. Contain relevant context words (“go home”)
• C6. Thematically consistent (“flower” and “bloom”)
The total score is a weighted sum of the sub-scores.
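A minimal sketch of that weighted sum, with illustrative weights (the real weights and sub-score functions are not given in the slides):

```python
# Sketch: combine the C1..C6 sub-scores into one ranking score.
# The weights below are made up for illustration, not Lingvist's actual values.
WEIGHTS = {"C1": 1.0, "C2": 2.0, "C3": 1.5, "C4": 1.0, "C5": 1.0, "C6": 0.5}

def sentence_score(subscores):
    """subscores maps criterion name ('C1'..'C6') to a value in [0, 1]."""
    return sum(WEIGHTS[c] * subscores.get(c, 0.0) for c in WEIGHTS)

# Candidate sentences are sorted by sentence_score and handed to the linguists.
```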
Sample of extracted sentences
Dr. Haystack
• The English corpus used was ~3.7 billion words
• There are no conversational corpora of the required size
• The number of criteria leads to “the curse of dimensionality”
• Words are rarely used in a context that linguists consider a good example
• Harder than finding a needle in a haystack
Predicting what user already knows
Predicting what user already knows
Objective:
• We have many users with previous knowledge of the language
• If we could predict what they know already...
- then we can exclude these words
- save time
- avoid boredom
• We have had a placement test feature for about a year
- prediction is based on word frequencies
- but this correlation is not high and we miss many known words
- it still has a big positive impact on user retention
- can we do better?
Predicting what user already knows
User     wait   doubt   letter   between   son   wait   Target word: wonder
User 1   1      1       1        0         1     0      0
User 2   1      0       1        0         1     1      1
User 3   0      0       0        1         1     1      1
How?
• We don't teach new words – we ask first
• What person already knows is valuable information
Training the models:
• Take all first answers from learning history (correct answer = user knows the word already)
• Train model per word to predict knowledge of that word
• Rank words by their predictive power
• Train second model for each word using fixed set of most predictive words as inputs
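A simplified sketch of the training step with scikit-learn, assuming first answers are already pivoted into a users × words 0/1 matrix like the one above; reading “predictive power” as Random Forest feature importances is my interpretation, not necessarily Lingvist's exact ranking:

```python
# Sketch: one Random Forest per target word, trained on 0/1 first-answer correctness.
# `answers` is a users x words DataFrame of first answers (hypothetical input format).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_word_models(answers: pd.DataFrame, n_predictors: int = 50):
    models = {}
    for target in answers.columns:
        X, y = answers.drop(columns=[target]), answers[target]
        # First pass: find the most predictive words for this target word...
        probe = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
        top = X.columns[probe.feature_importances_.argsort()[::-1][:n_predictors]]
        # ...then train the final model on that fixed set of predictor words.
        final = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[top], y)
        models[target] = (list(top), final)
    return models
```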
Predicting what user already knows
• 5000 models for each course (one model for each word in course)
• User answers most predictive words (up to 50 words)
• For each word in the course feed answers as input
• Get the prediction for each word
• Include or exclude word in course based on prediction
• Include a small % of excluded words anyway (for validation)
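At placement time the flow could look roughly like this (a sketch; the probability threshold and the validation share are assumed values):

```python
# Sketch: decide which course words to skip for a new user, keeping a small
# validation sample of "predicted known" words in the course anyway.
import random

def place_user(user_answers, models, threshold=0.5, validation_share=0.05):
    """user_answers maps probe word -> 0/1; models comes from train_word_models()."""
    kept, skipped = [], []
    for word, (predictors, model) in models.items():
        x = [[user_answers.get(p, 0) for p in predictors]]
        p_known = model.predict_proba(x)[0][1]          # assumes classes are [0, 1]
        if p_known > threshold and random.random() > validation_share:
            skipped.append(word)                        # predicted known -> excluded
        else:
            kept.append(word)                           # taught (incl. validation words)
    return kept, skipped
```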
Predicting what user already knows
Averages of performance metrics:
RU-EN course               Random Forest,        Random Forest,
                           first 4000 words      first 2000 words
Accuracy                   0.74                  0.72
Precision for “known”      0.67                  0.72
Recall for “known”         0.69                  0.72
Precision for “unknown”    0.52                  0.52
Recall for “unknown”       0.54                  0.57
Training samples           2440                  4959
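Precision/recall figures of this kind are what scikit-learn's classification_report produces on a held-out split; a sketch of such an evaluation (the split protocol here is an assumption, not necessarily the one behind the table):

```python
# Sketch: evaluate one word model on held-out users ("known" = 1, "unknown" = 0).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def evaluate_word_model(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    return classification_report(y_te, model.predict(X_te),
                                 target_names=["unknown", "known"])
```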
Predicting what user already knows
Challenges:
• Distribution of samples is heavily skewed toward the beginning of the course
• Dataset is biased due to the current placement test implementation:
- we excluded a word if we predicted the user knows it
- so we have little data about true positives and false positives
• Model has worse performance for some language pairs
• Order of the words in the course influences the model
Predicting optimal repetition interval
Predicting optimal repetition interval
Based on:
• Forgetting curve: exponential decay, Hermann Ebbinghaus, ~1885
• Spaced repetition: C. A. Mace, ~1932
Forgetting curve parameters are:
• highly individual (depends on the person)
• highly contextual (depends on what is being learned)
Challenge:
Measure or estimate forgetting curve parameters
• for this particular person
• for this particular word or skill
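Under the usual one-parameter exponential assumption, the forgetting curve and the interval for a target recall probability look like this (a sketch; Lingvist's actual parameterization may differ):

```python
# Sketch: one-parameter exponential forgetting curve, R(t) = exp(-t / s),
# where s ("stability") is the per-person, per-word parameter to estimate.
import math

def recall_probability(t_minutes, s):
    return math.exp(-t_minutes / s)

def interval_for_target(p_target, s):
    """Interval (minutes) at which recall drops to p_target, e.g. 0.8-0.9."""
    return -s * math.log(p_target)

# With s = 200 min: interval_for_target(0.9, 200) is about 21 minutes.
```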
Predicting optimal repetition interval
Objective:
• Target word with a learning history (e.g. seen 3 times at 1/10/50 min intervals, answered wrong/correct/wrong)
• Predict the interval at which the user will answer correctly with the desired probability (~80-90%)
Method:
• Take the user's learning history (all answers and their preceding histories)
• Calculate the distance of each history to our target word's history
• Choose up to ~100 learning histories most similar to the target word's
• Fit a curve through their next repetition intervals and answers
• Calculate the interval for the desired probability that the user answers correctly
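A sketch of the neighbour-selection step, assuming each learning history is summarized as a small numeric feature vector (e.g. number of answers, last interval, last answer); the real features and distance function are not specified in the slides:

```python
# Sketch: pick the ~100 learning histories closest to the target word's history.
import numpy as np

def similar_histories(target_features, history_features, k=100):
    """target_features: 1-D vector, history_features: 2-D array (one row per history)."""
    target = np.asarray(target_features, dtype=float)
    matrix = np.asarray(history_features, dtype=float)
    dists = np.linalg.norm(matrix - target, axis=1)     # Euclidean distance per history
    return np.argsort(dists)[:k]                        # indices of the nearest histories
```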
Clustering similar histories

Word      # answers   Last interval   Last correct   + N parameters   Next interval   Next correct
voiture   3           50 min          Yes            …                ???             80-90%
reste     2           6 min           No                              4 min           Yes
reste     3           4 min           Yes                             1 hr            Yes
voyage    3           30 min          Yes                             3 hrs           No
voyage    4           3 hrs           No                              2 hrs           Yes
…
devriez   12          2 wk            Yes                             10 wk           No
Curve fitting
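The curve-fitting step can be sketched with scipy.optimize.curve_fit over the selected neighbours' (next interval, next correct) pairs, again assuming the exponential form above:

```python
# Sketch: fit exp(-t/s) to the neighbours' outcomes and read off the next interval.
import numpy as np
from scipy.optimize import curve_fit

def fit_next_interval(next_intervals_min, next_correct, p_target=0.85):
    """next_intervals_min: intervals of the similar histories; next_correct: 0/1 outcomes."""
    decay = lambda t, s: np.exp(-t / s)
    (s,), _ = curve_fit(decay,
                        np.asarray(next_intervals_min, dtype=float),
                        np.asarray(next_correct, dtype=float),
                        p0=[60.0])
    return -s * np.log(p_target)    # interval at which P(correct) is about p_target

# fit_next_interval([4, 60, 180, 120], [1, 1, 0, 1])  -> suggested interval in minutes
```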
Mistake classification
Mistake classification
• Extract all wrong answers
• Classify wrong answers: typos, wrong grammar form, synonyms, false-friends, …
• Sort by most common mistakes
• … and figure out what we can do about it
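One simple way to separate typos from other mistakes is an edit-distance check against the expected answer; a sketch (the real classification also has to recognize grammar forms, synonyms, and false friends):

```python
# Sketch: flag near-miss answers as typos via Levenshtein distance (pure Python).
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def classify_wrong_answer(answer, expected):
    if edit_distance(answer.lower(), expected.lower()) <= 1:
        return "typo"
    return "other"   # wrong grammar form, synonym, false friend, ... need richer checks
```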
Reducing mistakes
• Improve the sample sentence
• Give hints to the user
• Allow the user to try again
Concluding remarks
Some learnings
• Deterministic history leads to biases
• Adding some randomization is good for discovery
• Each language pair is analyzed separately (RU-EN vs FR-EN)
• Noise (typos, bad samples, etc.) must be accounted for
Technology
• Python (3.x)
• NumPy, SciPy, pandas – statistics, clustering, calculations
• scikit-learn – machine learning (Random Forest, Multinomial Naive Bayes, feature extraction)
• Gensim – distributional semantics (CBOW, word2vec, skip-gram, …)
• Semspaces – functions for working with semantic spaces
• NLTK, FreeLing, Stanford NLP – parsing, PoS tagging, pre-processing
THANK YOU!
Credit goes to the team, mistakes are mine!
