This document describes how Lingvist, a language learning application, uses machine learning and statistical methods to optimize the learning process. The team prepares course material from frequency-based vocabulary and sentences extracted from text corpora; predicts what users already know, using models trained on learning histories, so that known words can be excluded; chooses optimal repetition intervals by finding similar learning histories and fitting forgetting curves; and classifies common mistakes so that hints can reduce errors. These techniques aim to teach practical vocabulary quickly, keep users engaged, and turn performance data into product improvements. Python libraries such as scikit-learn are used for the machine learning tasks.
Language ability & content knowledge by Ivana Vidakovic at IATEFL BESIG TEASI...IATEFL BESIG
In an academic or work environment, the ability to communicate on study- or work-related matters is vital. This requires specific-purpose language ability which combines content knowledge and language ability. Teasing out language ability from content knowledge is a delicate and sometimes impossible task in English for Specific Purposes (ESP) assessment and teaching. The key questions which arise are:
• What effect does content knowledge have on linguistic performance?
• What roles do content knowledge and language ability play in an ESP test?
• How much does an ESP teacher need to know about the subject content?
Many studies reveal that content knowledge has a facilitating effect on reading, listening, speaking and writing performance in a foreign language (L2). However, this effect may not always be consistent - test takers with a certain academic background may not always do better on a text/task from their field than test takers from a different discipline. The effect of content knowledge on linguistic performance varies with language proficiency which makes it important to see how content knowledge and language ability interact. How specific an ESP test is will determine the roles of content knowledge and language ability, and how much content knowledge is necessary to pass. All this will be discussed through a critical examination of Cambridge English ESP tests. The presentation will also address some of the key challenges in ESP teaching – overcoming the teacher’s lack of content knowledge and bridging the gap between English language ability and specific-purpose language ability – when the ESP teacher is ‘just’ an ELT professional.
Is acquiring knowledge of verb subcategorization in English easier? A partial...Yu Tamura
Tamura, Y. (2016). Is acquiring knowledge of verb subcategorization in English easier? A partial replication of Jiang (2007). Paper presented at PacSLRF2016. Chuo University, Tokyo Japan. September 11, 2016
Language ability & content knowledge by Ivana Vidakovic at IATEFL BESIG TEASI...IATEFL BESIG
In an academic or work environment, the ability to communicate on study- or work-related matters is vital. This requires specific-purpose language ability which combines content knowledge and language ability. Teasing out language ability from content knowledge is a delicate and sometimes impossible task in English for Specific Purposes (ESP) assessment and teaching. The key questions which arise are:
• What effect does content knowledge have on linguistic performance?
• What roles do content knowledge and language ability play in an ESP test?
• How much does an ESP teacher need to know about the subject content?
Many studies reveal that content knowledge has a facilitating effect on reading, listening, speaking and writing performance in a foreign language (L2). However, this effect may not always be consistent - test takers with a certain academic background may not always do better on a text/task from their field than test takers from a different discipline. The effect of content knowledge on linguistic performance varies with language proficiency which makes it important to see how content knowledge and language ability interact. How specific an ESP test is will determine the roles of content knowledge and language ability, and how much content knowledge is necessary to pass. All this will be discussed through a critical examination of Cambridge English ESP tests. The presentation will also address some of the key challenges in ESP teaching – overcoming the teacher’s lack of content knowledge and bridging the gap between English language ability and specific-purpose language ability – when the ESP teacher is ‘just’ an ELT professional.
Is acquiring knowledge of verb subcategorization in English easier? A partial...Yu Tamura
Tamura, Y. (2016). Is acquiring knowledge of verb subcategorization in English easier? A partial replication of Jiang (2007). Paper presented at PacSLRF2016. Chuo University, Tokyo Japan. September 11, 2016
Natural Language Processing: L01 introductionananth
This presentation introduces the course Natural Language Processing (NLP) by enumerating a number of applications, course positioning, challenges presented by Natural Language text and emerging approaches to topics like word representation.
Presented by Ted Xiao at RobotXSpace on 4/18/2017. This workshop covers the fundamentals of Natural Language Processing, crucial NLP approaches, and an overview of NLP in industry.
A short tutorial of the OCS freeware, which is used by psychologists to score creativity assessments.
We originally presented these slides in Thessaloniki in August of 2023.
The best known natural language processing tool is GPT-3, from OpenAI, which uses AI and statistics to predict the next word in a sentence based on the preceding words. NLP practitioners call tools like this “language models,” and they can be used for simple analytics tasks, such as classifying documents and analyzing the sentiment in blocks of text, as well as more advanced tasks, such as answering questions and summarizing reports. Language models are already reshaping traditional text analytics, but GPT-3 was an especially pivotal language model because, at 10x larger than any previous model upon release, it was the first large language model, which enabled it to perform even more advanced tasks like programming and solving high school–level math problems. The latest version, called InstructGPT, has been fine-tuned by humans to generate responses that are much better aligned with human values and user intentions, and Google’s latest model shows further impressive breakthroughs on language and reasoning.
For businesses, the three areas where GPT-3 has appeared most promising are writing, coding, and discipline-specific reasoning. OpenAI, the Microsoft-funded creator of GPT-3, has developed a GPT-3-based language model intended to act as an assistant for programmers by generating code from natural language input. This tool, Codex, is already powering products like Copilot for Microsoft’s subsidiary GitHub and is capable of creating a basic video game simply by typing instructions. This transformative capability was already expected to change the nature of how programmers do their jobs, but models continue to improve — the latest from Google’s DeepMind AI lab, for example, demonstrates the critical thinking and logic skills necessary to outperform most humans in programming competitions.
Models like GPT-3 are considered to be foundation models — an emerging AI research area — which also work for other types of data such as images and video. Foundation models can even be trained on multiple forms of data at the same time, like OpenAI’s DALL·E 2, which is trained on language and images to generate high-resolution renderings of imaginary scenes or objects simply from text prompts. Due to their potential to transform the nature of cognitive work, economists expect that foundation models may affect every part of the economy and could lead to increases in economic growth similar to the industrial revolution.
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
This presentation delves into the world of Natural Language Processing (NLP), exploring its goal to make human language understandable to machines. The complexities of language, such as ambiguity and complex structures, are highlighted as major challenges. The talk underscores the evolution of NLP through deep learning methodologies, leading to a new era defined by large-scale language models. However, obstacles like low-resource languages and ethical issues including bias and hallucination are acknowledged as enduring challenges in the field. Overall, the presentation provides a condensed, yet comprehensive view of NLP's accomplishments and ongoing hurdles.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.Lifeng (Aaron) Han
Invited Presentation in NLP lab of Soochow University, about my NLP journey and ADAPT Centre. NLP part covers Machine Translation Evaluation, Quality Estimation, Multiword Expression Identification, Named Entity Recognition, Word Segmentation, Treebanks, Parsing.
How effective is speech recognition software for improving pronunciation skillsBindi Clements
For a summary of the research and link to published research paper: https://www.wallstreetenglish.com/blog/speech-recognition-for-improving-pronunciation-skills/
An Introduction to Kurzweil 3000 software to support reading, writing, study skills and test-taking. Focus on how Kurzweil 3000's highlighters, graphic organizer, column notes, outliner, customizable writing rubric and ability to seemlessly move from reading content to writing can support students in the classroom.
Creating Simple Web Text for People with Intellectual Disabilities and to Tra...John Rochford
A study that shows significantly-improved comprehension, by people with intellectual disabilities, of Web text simplified with operationalized plain-language standards. This work has significant promise for training artificial intelligence how to create simplified text. Presented at the CSUN Assistive Technology Conference 2019.
Natural Language Processing: L01 introductionananth
This presentation introduces the course Natural Language Processing (NLP) by enumerating a number of applications, course positioning, challenges presented by Natural Language text and emerging approaches to topics like word representation.
Presented by Ted Xiao at RobotXSpace on 4/18/2017. This workshop covers the fundamentals of Natural Language Processing, crucial NLP approaches, and an overview of NLP in industry.
A short tutorial of the OCS freeware, which is used by psychologists to score creativity assessments.
We originally presented these slides in Thessaloniki in August of 2023.
The best known natural language processing tool is GPT-3, from OpenAI, which uses AI and statistics to predict the next word in a sentence based on the preceding words. NLP practitioners call tools like this “language models,” and they can be used for simple analytics tasks, such as classifying documents and analyzing the sentiment in blocks of text, as well as more advanced tasks, such as answering questions and summarizing reports. Language models are already reshaping traditional text analytics, but GPT-3 was an especially pivotal language model because, at 10x larger than any previous model upon release, it was the first large language model, which enabled it to perform even more advanced tasks like programming and solving high school–level math problems. The latest version, called InstructGPT, has been fine-tuned by humans to generate responses that are much better aligned with human values and user intentions, and Google’s latest model shows further impressive breakthroughs on language and reasoning.
For businesses, the three areas where GPT-3 has appeared most promising are writing, coding, and discipline-specific reasoning. OpenAI, the Microsoft-funded creator of GPT-3, has developed a GPT-3-based language model intended to act as an assistant for programmers by generating code from natural language input. This tool, Codex, is already powering products like Copilot for Microsoft’s subsidiary GitHub and is capable of creating a basic video game simply by typing instructions. This transformative capability was already expected to change the nature of how programmers do their jobs, but models continue to improve — the latest from Google’s DeepMind AI lab, for example, demonstrates the critical thinking and logic skills necessary to outperform most humans in programming competitions.
Models like GPT-3 are considered to be foundation models — an emerging AI research area — which also work for other types of data such as images and video. Foundation models can even be trained on multiple forms of data at the same time, like OpenAI’s DALL·E 2, which is trained on language and images to generate high-resolution renderings of imaginary scenes or objects simply from text prompts. Due to their potential to transform the nature of cognitive work, economists expect that foundation models may affect every part of the economy and could lead to increases in economic growth similar to the industrial revolution.
Beyond the Symbols: A 30-minute Overview of NLPMENGSAYLOEM1
This presentation delves into the world of Natural Language Processing (NLP), exploring its goal to make human language understandable to machines. The complexities of language, such as ambiguity and complex structures, are highlighted as major challenges. The talk underscores the evolution of NLP through deep learning methodologies, leading to a new era defined by large-scale language models. However, obstacles like low-resource languages and ethical issues including bias and hallucination are acknowledged as enduring challenges in the field. Overall, the presentation provides a condensed, yet comprehensive view of NLP's accomplishments and ongoing hurdles.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.Lifeng (Aaron) Han
Invited Presentation in NLP lab of Soochow University, about my NLP journey and ADAPT Centre. NLP part covers Machine Translation Evaluation, Quality Estimation, Multiword Expression Identification, Named Entity Recognition, Word Segmentation, Treebanks, Parsing.
How effective is speech recognition software for improving pronunciation skillsBindi Clements
For a summary of the research and link to published research paper: https://www.wallstreetenglish.com/blog/speech-recognition-for-improving-pronunciation-skills/
An Introduction to Kurzweil 3000 software to support reading, writing, study skills and test-taking. Focus on how Kurzweil 3000's highlighters, graphic organizer, column notes, outliner, customizable writing rubric and ability to seemlessly move from reading content to writing can support students in the classroom.
Creating Simple Web Text for People with Intellectual Disabilities and to Tra...John Rochford
A study that shows significantly-improved comprehension, by people with intellectual disabilities, of Web text simplified with operationalized plain-language standards. This work has significant promise for training artificial intelligence how to create simplified text. Presented at the CSUN Assistive Technology Conference 2019.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
10. We use statistics to…
• Prepare the course material
• Predict what learners already know
• Choose optimal repetition intervals during learning
• Analyze common mistakes learners make (and help them avoid these)
We also use conversion, retention, and engagement statistics to drive most product decisions, but I will not talk about that today.
11. Course material preparation
12. Frequency-based vocabulary
Objective:
• Teach vocabulary based on frequency
• Quickly reach a level that is practically useful
• French: ~2,000 words cover ~80% of the words in a typical text
Solution:
• Acquire big text corpus
• Parse and tag (noun, verb, …) all words
• Build word list in frequency order
• Adjust ranking (down-rank pronouns, articles, …)
• Review and adjustment by linguists
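A minimal Python sketch of this pipeline, not Lingvist's actual code: it assumes a plain-text corpus, uses spaCy for tagging and lemmatization, and the down-rank weights for function words are illustrative.

```python
# Minimal sketch of frequency-based vocabulary building; assumes a
# plain-text corpus and spaCy for POS tagging. Weights are illustrative.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # swap in the course's source-language model

# Down-rank function words so they don't crowd the top of the list.
POS_WEIGHTS = {"PRON": 0.3, "DET": 0.3, "ADP": 0.5, "AUX": 0.5}

def build_vocabulary(lines):
    counts, pos_of = Counter(), {}
    for doc in nlp.pipe(lines):
        for tok in doc:
            if tok.is_alpha:
                lemma = tok.lemma_.lower()
                counts[lemma] += 1
                pos_of[lemma] = tok.pos_
    # Adjusted score = raw frequency x POS weight (down-ranks articles etc.)
    score = {w: c * POS_WEIGHTS.get(pos_of[w], 1.0) for w, c in counts.items()}
    return sorted(score, key=score.get, reverse=True)

# vocab = build_vocabulary(open("corpus.txt", encoding="utf-8"))
# print(vocab[:2000])  # candidate teaching order, to be reviewed by linguists
```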
13. Sample sentence extraction
Objective:
• Sentences should represent typical context
• Manual production is very time-consuming
Solution:
• Extract candidate sentences/phrases from a text corpus
• Rank sentences based on a set of criteria
• Linguists choose the most suitable ones
• Sentences are edited for consistency and completeness
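To make the extraction step concrete, here is a hedged sketch that pulls short candidate sentences containing a target word out of a corpus; spaCy's sentence segmentation and the length cutoff are assumptions, not the production criteria.

```python
# Hedged sketch of candidate sentence extraction; segmentation via spaCy
# and the length cutoff are illustrative, not Lingvist's actual criteria.
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_sentences(corpus_text, target, max_words=12):
    for sent in nlp(corpus_text).sents:
        words = [t.text.lower() for t in sent if t.is_alpha]
        if target in words and len(words) <= max_words:
            yield sent.text.strip()

for s in candidate_sentences("He drives a fast car. The car is red.", "car"):
    print(s)
```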
14. Sample sentence ranking
Ranking criteria:
• C1. Sentence length
• C2. Complete sentence
• C3. Previously learned words in the course
• C4. Natural sequence of words ("fast car" vs. "brave car")
• C5. Contains relevant context words ("go home")
• C6. Thematically consistent ("flower" and "bloom")
The total score is a weighted sum of the sub-scores.
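As a concrete, purely illustrative example of the weighted sum, assuming each criterion C1-C6 is normalized to [0, 1] and the weights are placeholders rather than Lingvist's tuned values:

```python
# Illustrative weighted-sum ranking over the six criteria above; the
# weights and sub-scores are placeholders, not Lingvist's tuned values.
WEIGHTS = {"C1_length": 1.0, "C2_complete": 2.0, "C3_known_words": 1.5,
           "C4_natural": 2.0, "C5_context": 1.0, "C6_thematic": 0.5}

def total_score(subscores):
    """subscores: criterion name -> value in [0, 1]."""
    return sum(WEIGHTS[name] * value for name, value in subscores.items())

print(total_score({"C1_length": 0.9, "C2_complete": 1.0, "C3_known_words": 0.8,
                   "C4_natural": 0.7, "C5_context": 0.6, "C6_thematic": 0.5}))
```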
15. Sample of extracted sentences
16. Dr. Haystack
• The English corpus used was ~3.7 billion words
• There are no conversational corpora of the required size
• The number of criteria leads to "the curse of dimensionality"
• Words are rarely used in contexts that linguists consider good examples
• Harder than finding a needle in a haystack
17. Predicting what the user already knows
18. Predicting what the user already knows
Objective:
• We have many users with previous knowledge of the language
• If we could predict what they know already...
- then we can exclude these words
- save time
- avoid boredom
• We have had a placement test feature for about a year
- prediction is based on word frequencies
- but the correlation is weak and we miss many known words
- it still has a big positive impact on user retention
- can we do better?
19. Predicting what the user already knows
User   | wait | doubt | letter | between | son | wait | Target word: wonder
User 1 | 1    | 1     | 1      | 0       | 1   | 0    | 0
User 2 | 1    | 0     | 1      | 0       | 1   | 1    | 1
User 3 | 0    | 0     | 0      | 1       | 1   | 1    | 1
How?
• We don't teach new words – we ask first
• What a person already knows is valuable information
Training the models:
• Take all first answers from learning histories (a correct first answer means the user already knows the word)
• Train a model per word to predict knowledge of that word
• Rank words by their predictive power
• Train a second model for each word, using a fixed set of the most predictive words as inputs
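A hedged scikit-learn sketch of the two-stage training described above; the Random Forests match the results slide, but the matrix layout and the variance-based stand-in for "predictive power" are simplifying assumptions.

```python
# Two-stage per-word training, sketched with scikit-learn. The variance
# ranking is a stand-in for the slide's "predictive power"; shapes and
# names are assumptions, not Lingvist's production pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_word_models(first_answers, n_probe=50):
    """first_answers: users x words matrix of first answers (1 = correct)."""
    first_answers = np.asarray(first_answers)
    # Stage 1: pick the most informative probe words.
    probe = np.argsort(first_answers.var(axis=0))[::-1][:n_probe]
    # Stage 2: one classifier per course word, probe answers as features.
    models = {}
    for w in range(first_answers.shape[1]):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(first_answers[:, probe], first_answers[:, w])
        models[w] = clf
    return probe, models
```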
20. Predicting what the user already knows
• 5,000 models per course (one model for each word in the course)
• The user answers the most predictive words (up to 50 words)
• For each word in the course, feed those answers into its model
• Get a prediction for each word
• Include or exclude the word from the course based on the prediction
• Include a small % of excluded words anyway (for validation)
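Continuing the sketch above, the placement flow might look like this; `models` comes from the training sketch, and the validation rate and 0/1 labelling are assumptions.

```python
# Hedged sketch of the placement flow: predict each course word from the
# probe answers, exclude predicted-known words, and re-include a small
# share for validation. Thresholds and names are illustrative.
import numpy as np

def plan_course(models, probe_answers, validation_rate=0.05,
                rng=np.random.default_rng(0)):
    x = np.asarray(probe_answers).reshape(1, -1)  # answers to the probe words
    course = []
    for word, clf in models.items():
        predicted_known = clf.predict(x)[0] == 1
        # Keep a small % of predicted-known words to validate the model.
        if not predicted_known or rng.random() < validation_rate:
            course.append(word)
    return course
```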
21. Predicting what the user already knows
Averages of performance metrics:

RU-EN course            | Random Forest, first 4000 words | Random Forest, first 2000 words
Accuracy                | 0.74                            | 0.72
Precision for "known"   | 0.67                            | 0.72
Recall for "known"      | 0.69                            | 0.72
Precision for "unknown" | 0.52                            | 0.52
Recall for "unknown"    | 0.54                            | 0.57
Training samples        | 2440                            | 4959
22. Predicting what the user already knows
Challenges:
• The distribution of samples is heavily skewed toward the beginning of the course
• The dataset is biased by the current placement test implementation:
- we excluded a word if we predicted the user knew it
- so we have little data about true positives and false positives
• The model performs worse for some language pairs
• The order of the words in the course influences the model
23. Predicting optimal repetition interval
24. Predicting optimal repetition interval
25. Predicting optimal repetition interval
Based on:
• The forgetting curve: exponential decay, Hermann Ebbinghaus, ~1885
• Spaced repetition: C. A. Mace, ~1932
Forgetting curve parameters are:
• highly individual (they depend on the person)
• highly contextual (they depend on the fact being learned)
Challenge:
Measure or estimate the forgetting curve parameters
• for this particular person
• for this particular word or skill
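For reference, the Ebbinghaus curve models retention as exponential decay. With memory stability S (the parameter that is individual and contextual), the probability of recalling an item after interval t is approximately

R(t) = e^(-t/S)

so the interval that reaches a desired recall probability p is t = -S · ln(p). This is the standard textbook form; the deck does not give Lingvist's exact parameterization.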
26. Predicting optimal repetition interval
Objective:
• A target word with a learning history (3 answers, at 1/10/50 min, wrong/correct/wrong)
• Predict the interval at which the user answers correctly with the desired probability (~80-90%)
Method:
• Take the user's learning history (all answers and their preceding histories)
• Calculate the distance to our target word's history
• Choose up to ~100 learning histories most similar to the target word's
• Fit a curve through the next repetition intervals and answers
• Calculate the interval for the desired probability that the user answers correctly
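A hedged sketch of the curve-fitting step, assuming the exponential forgetting curve above. Fitting a least-squares curve directly to binary outcomes is a simplification, and the sample data is invented for illustration.

```python
# Fit an exponential forgetting curve to (interval, correct?) pairs from
# similar learning histories, then solve for the next interval. Least
# squares on binary outcomes is a simplification; the data is invented.
import numpy as np
from scipy.optimize import curve_fit

def recall(t, s):
    return np.exp(-t / s)  # probability of recall after interval t (minutes)

intervals = np.array([4.0, 60.0, 180.0, 120.0, 14 * 24 * 60.0])  # from neighbors
correct = np.array([1.0, 1.0, 0.0, 1.0, 0.0])

(s_hat,), _ = curve_fit(recall, intervals, correct, p0=[60.0])

target_p = 0.85  # desired probability of a correct answer (~80-90%)
next_interval = -s_hat * np.log(target_p)
print(f"stability ~{s_hat:.0f} min -> next repetition in ~{next_interval:.0f} min")
```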
27. Clustering similar histories

Word    | # answers | Last interval | Last correct | + N parameters | Next interval | Next correct
voiture | 3         | 50 min        | Yes          | …              | ???           | 80-90%
reste   | 2         | 6 min         | No           |                | 4 min         | Yes
reste   | 3         | 4 min         | Yes          |                | 1 hr          | Yes
voyage  | 3         | 30 min        | Yes          |                | 3 hrs         | No
voyage  | 4         | 3 hrs         | No           |                | 2 hrs         | Yes
…       | …         |               |              |                |               |
devriez | 12        | 2 wk          | Yes          |                | 10 wk         | No
31. Mistake classification
32. Mistake classification
• Extract all wrong answers
• Classify the wrong answers: typos, wrong grammar forms, synonyms, false friends, …
• Sort by the most common mistakes
• … and figure out what we can do about them
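One simple classification rule, sketched under the assumption that near-miss answers are typos; real grammar-form and synonym detection would need morphological analysis and lexical resources.

```python
# Hedged sketch of one mistake-classification rule: treat near-miss
# answers as typos via string similarity. Grammar forms, synonyms and
# false friends would need morphology and lexicons; threshold is invented.
from difflib import SequenceMatcher

def classify_answer(answer, expected):
    if answer == expected:
        return "correct"
    if SequenceMatcher(None, answer.lower(), expected.lower()).ratio() > 0.8:
        return "typo"  # e.g. "wunder" for "wonder"
    return "other"     # wrong grammar form, synonym, false friend, ...

print(classify_answer("wunder", "wonder"))  # -> typo
```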
33. Reducing mistakes
• Improve the sample sentence
• Give hints to the user
• Allow the user to try again
34. Concluding remarks
35. Some learnings
• A deterministic history leads to biases
• Adding some randomization is good for discovery
• Each language pair is analyzed separately (RU-EN vs. FR-EN)
• Noise (typos, bad samples, etc.) must be accounted for