Data-Driven Development and Evaluation of Enskill English
W. Lewis Johnson
www.alelo.com/enskill-english
W. Lewis Johnson, PhD, CEO, Alelo
• Entrepreneur, thought leader, author
• DARPA Significant Technical Achievement Award
• IFAAMAS Influential Paper Award
• Host of webinar series on the Future of AI in Education and Training
• Past President, International AI in Education Society
• Linguistics: Princeton; Computer Science: Yale
Data-Driven Development (D3)
• Design is informed by learner data
• The system is as much a data collection tool as a learning tool
• AI models are iteratively trained on learner data, using machine learning techniques
• Learning evaluation and system evaluation are iterative and continuous
[Cycle diagram: Deployment → Data mining and analysis → Development and model updates → Deployment]
Johnson, W.L. (2019). Data-Driven Development and Evaluation of Enskill English. International Journal of Artificial Intelligence in Education, 29, 425-457.
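To make the cycle concrete, here is a minimal sketch of the D3 loop in Python. All function names and bodies are hypothetical placeholders standing in for Alelo's actual pipeline, which is described in the paper cited above.

```python
# Minimal sketch of the D3 cycle; all functions are hypothetical
# placeholders, not Alelo's actual pipeline.

def deploy(system):
    """Deployment: make the current system available to learners."""
    print(f"Deploying version {system['version']}")

def collect_logs(system):
    """Data mining and analysis: gather learner interaction data."""
    return [{"exchanges": 17, "repeats": 4}]  # placeholder log records

def retrain_models(system, logs):
    """Development and model updates: retrain AI models on learner data."""
    return {"version": system["version"] + 1}

system = {"version": 1}
for _ in range(3):  # in practice the cycle runs continuously
    deploy(system)
    logs = collect_logs(system)
    system = retrain_models(system, logs)
```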
D3 vs. ADDIE vs. SAM
[Comparison of D3 with the ADDIE and SAM instructional design models]
Alelo Enskill: An AI-driven learning architecture
• Communicative practice with AI avatars in a safe environment
• Formative assessments
• Feedback
• Personalized practice
• Analytics for teachers, learners, administrators, and developers
Alelo Enskill around the world
Brazil • Chile • China • Colombia • Costa Rica • Croatia • Honduras • Malaysia • Mexico • Panama
Paraguay • Peru • Portugal • Serbia • Spain • Sweden • Thailand • Turkey • United States
Over 300,000 users to date
Performance dashboard
[Screenshots: the performance dashboard and a view of performance on each attempt]
Enskill Architecture
Continuous Evaluation in D3
• The system is evolving, so evaluations must produce findings quickly.
• Types of tests and evaluations:
  • Instant tests: automated data analyses plus spot checks of data; we perform these weekly (a sketch follows this list)
  • Snapshot evaluations: performed over a limited period of time with a limited population, often to test specific hypotheses, and sometimes with new populations to collect data and design improvements
  • A/B evaluations: comparison of different learner populations
  • Regression tests: tests of the current system on archived data sets
• Evaluations are in the field, not in the lab
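As an illustration of an instant test, the sketch below scans weekly interaction logs and flags simulations whose raw understanding rate falls below a threshold. The log schema, values, and threshold are invented for the example.

```python
# Hypothetical instant test: flag simulations whose raw understanding
# rate falls below a threshold. Log schema and values are illustrative.

weekly_logs = [
    {"simulation": "Train Ticket", "understood": 1940, "total": 2733},
    {"simulation": "Plan a Party", "understood": 1631, "total": 2589},
]

THRESHOLD = 0.70  # minimum acceptable raw understanding rate

for record in weekly_logs:
    rate = record["understood"] / record["total"]
    status = "ALERT" if rate < THRESHOLD else "ok"
    print(f"{status}: {record['simulation']} understanding rate {rate:.0%}")
```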
Snapshot Evaluation: University of Novi Sad
• April-May 2018
• Study population: 80 CEFR B-level English learners
• Study materials: CEFR A-level conversational simulations
• Questions:
• Would the learners find it useful?
• Would their performance improve with practice?
• Was the speech & language technology adequate for this population?
Quick Summary of Results

Category      Exchanges   Repeats   Meaningful Exchanges   Meaningful Exchange Rate   Turns per Minute
All trials    17.26       4.56      12.43                  74.62%                     2.75
First trial   18.86       7.52      11.33                  62.65%                     2.62
Last trial    17.86       3.52      14.33                  82.35%                     4.31

Simulation Name     CEFR Level   Total Exchanges   Raw Understanding Rate
Class Interview     A1           2187              85%
Plan a Party        A1           2589              63%
Jerry’s Spaghetti   A2           725               67%
Train Ticket        A2           2733              71%
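The per-trial metrics in the first table can be read as follows: meaningful exchanges are exchanges minus repeats, and the meaningful exchange rate is the per-trial ratio of the two, averaged across trials (which is why it is not exactly the ratio of the column averages). The sketch below shows one plausible computation under an assumed log schema; see the IJAIED paper for the exact definitions.

```python
# One plausible computation of the per-trial metrics above, under an
# assumed log schema; see the IJAIED paper for the exact definitions.

trials = [
    {"exchanges": 18, "repeats": 7, "minutes": 6.9},  # illustrative values
    {"exchanges": 17, "repeats": 3, "minutes": 4.1},
]

for t in trials:
    meaningful = t["exchanges"] - t["repeats"]   # meaningful exchanges
    rate = meaningful / t["exchanges"]           # meaningful exchange rate
    pace = t["exchanges"] / t["minutes"]         # turns per minute
    print(f"{meaningful} meaningful ({rate:.2%}), {pace:.2f} turns/min")
```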
Snapshot Evaluation: University of Split
• December 2018
• Study population: 39 CEFR B-level English learners; somewhat higher proficiency than the Novi Sad learners
• Study materials: 5 simulations, with upgraded ASR and NLU models
Quick Summary of Results

Category      Exchanges   Repeats   Meaningful Exchanges   Meaningful Exchange Rate   Exchanges per Minute
All trials    15.22       2.79      12.43                  82.44%                     3.80
First trial   15.00       3.21      11.79                  77.78%                     3.37
Last trial    16.07       1.82      14.31                  88.78%                     4.26

Simulation Name     CEFR Level   Total Exchanges   Raw Understanding Rate
Class Interview     A1           1045              83%
Plan a Party        A1           973               83%
School Newspaper    A1           1032              94%
Jerry’s Spaghetti   A2           407               63%
Train Ticket        A2           725               81%
UVM (Universidad del Valle de México) Toluca Campus, Fall 2019
UVM Level 2 Simulation Usage
• Analysis of 67 users
• 37 users tried all 10 simulations
• 7.31 average simulations tried
• 17.72 average simulation runs per user
• 2.27 average runs per simulation
• 75.69% average maximum score
• Students who practiced simulations multiple times achieved a 14.15% increase in mastery score per trial, ignoring trials with 0% scores (see the sketch below)
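The per-trial gain figure above could be computed along these lines. The helper below averages the score increase between consecutive non-zero trials; the data values are made up for illustration.

```python
# Illustrative computation of average mastery gain per trial, excluding
# 0% scores as in the analysis above. Scores are made up.

def avg_gain_per_trial(scores):
    """Average score increase between consecutive non-zero trials."""
    nonzero = [s for s in scores if s > 0]
    if len(nonzero) < 2:
        return None  # not enough non-zero trials to measure a gain
    gains = [b - a for a, b in zip(nonzero, nonzero[1:])]
    return sum(gains) / len(gains)

print(round(avg_gain_per_trial([0.40, 0.0, 0.55, 0.68]), 2))  # -> 0.14 (14 points per trial)
```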
UVM Students’ Mastery Decays Very Slowly After Training

          Average         Average      First after   Average after   First after   Average after
          mastery score   before gap   1-week gap    1-week gap      1-month gap   1-month gap
Level 1   32.67%          32.67%       26.29%        29.47%          22.78%        24.55%
Level 2   44.16%          41.47%       45.03%        45.28%          23.43%        23.43%

Analysis of 40 UVM Level 1 students and 22 UVM Level 2 students who practiced simulations, stopped practicing for at least a week, then resumed practicing. 0% scores excluded from analysis, except when scores before or after a gap were ALL 0%.
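One way to carry out this gap analysis: scan each student's timestamped scores for a pause of at least a week, then compare scores before and after the gap. The sketch below assumes simple (day, score) records, an invented format.

```python
# Illustrative gap detection for the decay analysis above, assuming
# invented (day, score) records per student.

def split_at_gap(records, min_gap_days=7):
    """Split scores around the first practice gap of at least min_gap_days."""
    for i in range(1, len(records)):
        if records[i][0] - records[i - 1][0] >= min_gap_days:
            before = [score for _, score in records[:i]]
            after = [score for _, score in records[i:]]
            return before, after
    return None  # no qualifying gap found

records = [(1, 0.42), (2, 0.45), (3, 0.44), (12, 0.43), (14, 0.46)]
before, after = split_at_gap(records)
print(sum(before) / len(before),  # average before gap
      after[0],                   # first score after gap
      sum(after) / len(after))    # average after gap
```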
Case Study at UVM Campus Toluca
• Students using Alelo Enskill at UVM, a Laureate institution in Mexico, developed greater proficiency and self-confidence in English communication.
• “It helps me to improve my classes and also it makes my classes very very short and very very communicative.”
• “If you want your students to have more self-confidence, Alelo is going to be your best option.”
• “When I was a kid I wish I had this kind of platform because it helps in confidence.”
WhatsApp Student Recordings
UVM Toluca Campus, April-May 2020
• 25 students participated in the trial
• System improvements:
• Restart options in case students get stuck
• Revised mastery score at the low end of the scale
• Disabled “click-through” feature so students must speak to progress
Summary of results
• 23 students completed all 10 simulations at least once
• 1 completed 9 simulations
• 1 completed 1 simulation
• Students completed 63% of simulations only once.
• Average maximum score: 46.6%
• Average for multiple trials: 64.3%
• Average increase per trial: 13.1%
Conclusions
• AIED can and should adopt a data-driven approach
• Evaluations can and should be more agile and iterative
• A series of small evaluations can be more informative than one large evaluation, and can guide development
• Development and evaluation can and should go hand in hand
www.alelo.com/AIED
ljohnson@alelo.com

Editor's Notes

• #9-#10 Commentary: At the end of the dialogue, Enskill gives the learner feedback about which objectives they met and where they need to improve.