Data-Driven Development (D3) and Evaluation of Enskill English

Data-Driven Development and Evaluation of Enskill English
W. Lewis Johnson
www.alelo.com/enskill-english

W. Lewis Johnson, PhD, CEO, Alelo
• Entrepreneur, thought leader, author
• DARPA Significant Technical Achievement Award
• IFAAMAS Influential Paper Award
• Host of webinar series on the Future of AI in Education and Training
• Past President, Intl. AI in Education Society
• Linguistics: Princeton; Computer Sci.: Yale

Data-driven development (D3)
• Design is informed by learner data
• System is as much a data collection tool as a learning tool
• AI models are iteratively trained on learner data, using machine learning techniques
• Learning evaluation and system evaluation are iterative and continuous
[Figure: D3 iteration diagram; labels: Data mining and analysis; Development and model updates; Deployment]
Johnson, W.L. (2019). Data-Driven Development and Evaluation of Enskill English. Int. Journal of AI in Education, 29, pp. 425-457.

Alelo Enskill: An AI-driven learning architecture
• Communicative practice with AI avatars in safe environment
• Formative assessments
• Feedback
• Personalized practice
• Analytics for teachers, learners, administrators, and developers

Alelo Enskill around the world
Brazil • Chile • China • Colombia • Costa Rica • Croatia • Honduras • Malaysia • Mexico • Panama
Paraguay • Peru • Portugal • Serbia • Spain • Sweden • Thailand • Turkey • United States
Over 300,000 users to date

Continuous Evaluation in D3
• The system is evolving, so evaluations must produce findings quickly.
• Types of tests and evaluations:
  • Instant tests: Automated data analyses plus spot checks of data
    • We perform these weekly
  • Snapshot evaluations: Performed over a limited period of time with a limited population
    • Often to test specific hypotheses
    • Sometimes with new populations, to collect data and design improvements
  • A/B evaluations: comparison of different learner populations
  • Regression tests: Tests of current system on archived data sets
• Evaluations are in the field, not in the lab
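
A regression test of the kind listed above could look roughly like the following Python sketch. It is an illustration only, not Alelo's implementation: the record shape, the intent labels, `toy_model`, and the `tolerance` parameter are all hypothetical. The idea is to replay archived learner utterances through the current model and check that the understanding rate has not dropped below a stored baseline.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical archive record: (learner utterance, label stored with the archive).
ArchivedExchange = Tuple[str, str]


def understanding_rate(model: Callable[[str], str],
                       archive: Iterable[ArchivedExchange]) -> float:
    """Fraction of archived utterances for which the model returns the stored label."""
    records: List[ArchivedExchange] = list(archive)
    if not records:
        raise ValueError("archive is empty")
    hits = sum(1 for utterance, label in records if model(utterance) == label)
    return hits / len(records)


def regression_check(model: Callable[[str], str],
                     baseline_rate: float,
                     archive: Iterable[ArchivedExchange],
                     tolerance: float = 0.01) -> Tuple[bool, float]:
    """Return (passed, rate); passes if the rate is within tolerance of the baseline."""
    rate = understanding_rate(model, archive)
    return rate >= baseline_rate - tolerance, rate


# Made-up archive and a toy keyword model, for illustration only.
archive = [
    ("i want a ticket to boston", "buy_ticket"),
    ("how much is it", "ask_price"),
    ("thank you", "thanks"),
    ("one ticket please", "buy_ticket"),
]


def toy_model(utterance: str) -> str:
    if "ticket" in utterance:
        return "buy_ticket"
    if "how much" in utterance:
        return "ask_price"
    return "unknown"


passed, rate = regression_check(toy_model, baseline_rate=0.75, archive=archive)
print(passed, rate)
```

Because the archive is fixed, the same check can be rerun after every model update, which fits the weekly cadence of the instant tests.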

Snapshot Evaluation: Univ. of Novi Sad
• April-May 2018
• Study population: 80 CEFR B-level English learners
• Study materials: CEFR A-level conversational simulations
• Questions:
  • Would the learners find it useful?
  • Would their performance improve with practice?
  • Was the speech & language technology adequate for this population?

Quick Summary of Results
Category     Exchanges  Repeats  Meaningful Exchanges  Meaningful Exchange Rate  Turns per Minute
All trials   17.26      4.56     12.43                 74.62%                    2.75
First trial  18.86      7.52     11.33                 62.65%                    2.62
Last trial   17.86      3.52     14.33                 82.35%                    4.31

Simulation Name    CEFR Level  Total Exchanges  Raw Understanding Rate
Class Interview    A1          2187             85%
Plan a Party       A1          2589             63%
Jerry’s Spaghetti  A2          725              67%
Train Ticket       A2          2733             71%
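
The slides do not define how the per-trial metrics are computed. The Python sketch below assumes that a meaningful exchange is any exchange that is not a repeat, and that the meaningful exchange rate is meaningful exchanges divided by total exchanges; both are assumptions, and the `Trial` class and its data are made up. It also shows that averaging per-trial rates gives a different number from dividing the averaged counts, which is one possible reason a tabulated rate need not equal the ratio of the tabulated averages.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    """One run of a simulation (hypothetical record shape)."""
    exchanges: int
    repeats: int

    @property
    def meaningful(self) -> int:
        # Assumption: every exchange that is not a repeat counts as meaningful.
        return self.exchanges - self.repeats

    @property
    def meaningful_rate(self) -> float:
        # Guard against empty trials to avoid division by zero.
        return self.meaningful / self.exchanges if self.exchanges else 0.0


# Made-up trials, not data from the study.
trials = [Trial(20, 10), Trial(10, 1), Trial(15, 3)]

# Option 1: average the per-trial rates.
mean_of_rates = mean(t.meaningful_rate for t in trials)

# Option 2: divide the average meaningful count by the average exchange count.
rate_of_means = mean(t.meaningful for t in trials) / mean(t.exchanges for t in trials)

print(f"{mean_of_rates:.2%} vs {rate_of_means:.2%}")
```

Whichever aggregation is used, stating it alongside the table makes snapshot evaluations easier to compare across sites.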

Snapshot Evaluation: University of Split
• December 2018
• Study population: 39 CEFR B-level English learners
  • Somewhat higher proficiency than the Novi Sad learners
• Study materials: 5 simulations, with upgraded NLU models

Quick Summary of Results
Category     Exchanges  Repeats  Meaningful Exchanges  Meaningful Exchange Rate  Exchanges per Minute
All trials   15.22      2.79     12.43                 82.44%                    3.80
First trial  15.00      3.21     11.79                 77.78%                    3.37
Last trial   16.07      1.82     14.31                 88.78%                    4.26

Simulation Name    CEFR Level  Total Exchanges  Raw Understanding Rate
Class Interview    A1          1045             83%
Plan a Party       A1          973              83%
School Newspaper   A1          1032             94%
Jerry’s Spaghetti  A2          407              63%
Train Ticket       A2          725              81%

UVM Level 2 Simulation Usage
• Analysis of 67 users
• 37 users tried all 10 simulations
• 7.31 average simulations tried
• 17.72 average simulation runs per user
• 2.27 average runs per simulation
• 75.69% average maximum score
• Students who practiced simulations multiple times achieved 14.15% increase in mastery score per trial (ignoring trials with 0% scores)
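
The slide does not say how the increase per trial was calculated. One plausible reading, sketched below in Python, is to drop 0% scores, take the gain from the first to the last remaining score on each simulation, divide by the number of intervening steps, and average over all student and simulation pairs with at least two nonzero scores. The function name, the data layout, and the choice of percentage points rather than relative change are all assumptions.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def increase_per_trial(scores: List[float]) -> Optional[float]:
    """Average gain per repeated trial, in percentage points, ignoring 0% scores.

    Returns None when fewer than two nonzero scores are available.
    """
    nonzero = [s for s in scores if s > 0]
    if len(nonzero) < 2:
        return None
    return (nonzero[-1] - nonzero[0]) / (len(nonzero) - 1)


# Made-up mastery scores keyed by (student, simulation), in order of attempt.
runs: Dict[Tuple[str, str], List[float]] = {
    ("s1", "Class Interview"): [40.0, 0.0, 55.0, 70.0],
    ("s1", "Train Ticket"): [60.0],            # single run: no gain defined
    ("s2", "Class Interview"): [30.0, 50.0],
}

gains = [g for g in map(increase_per_trial, runs.values()) if g is not None]
average_gain = mean(gains)
print(average_gain)
```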
© 2020 Alelo Inc.

UVM Students’ Mastery Decays Very Slowly After Training
Measure                    Level 1   Level 2
Average mastery score      32.67%    44.16%
Average before gap         32.67%    41.47%
First after 1-week gap     26.29%    45.03%
Average after 1-week gap   29.47%    45.28%
First after 1-month gap    22.78%    23.43%
Average after 1-month gap  24.55%    23.43%
Analysis of 40 UVM Level 1 students and 22 UVM Level 2 students who practiced simulations, stopped practicing for at least a week, then resumed practicing. 0% scores excluded from analysis, except when scores before or after gap were ALL 0%.
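
The gap analysis described in the note can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analysis code used for the slides: the record shape, the function names, and the rule of using a student's first qualifying pause are hypothetical. Scores of 0% are ignored unless every score on one side of the gap is 0%, in which case that side is reported as 0.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Hypothetical record: (day of the run, mastery score in percent).
Run = Tuple[date, float]


def average_ignoring_zeros(scores: List[float]) -> float:
    """Mean of the nonzero scores; 0.0 when every score is 0%."""
    nonzero = [s for s in scores if s > 0]
    return mean(nonzero) if nonzero else 0.0


def gap_summary(runs: List[Run], min_gap: timedelta) -> Optional[Dict[str, float]]:
    """Summarize scores around the first pause of at least min_gap.

    Returns None if the student never paused that long.
    """
    ordered = sorted(runs)
    for i in range(1, len(ordered)):
        if ordered[i][0] - ordered[i - 1][0] >= min_gap:
            before = [score for _, score in ordered[:i]]
            after = [score for _, score in ordered[i:]]
            # First nonzero score after the gap, or 0.0 if all were 0%.
            first_after = next((s for s in after if s > 0), 0.0)
            return {
                "avg_before": average_ignoring_zeros(before),
                "first_after": first_after,
                "avg_after": average_ignoring_zeros(after),
            }
    return None


# Made-up runs for one student, with a 12-day pause in the middle.
runs = [
    (date(2020, 3, 2), 30.0),
    (date(2020, 3, 3), 0.0),
    (date(2020, 3, 4), 50.0),
    (date(2020, 3, 16), 35.0),
    (date(2020, 3, 17), 45.0),
]
summary = gap_summary(runs, min_gap=timedelta(days=7))
print(summary)
```

Calling the same function with `min_gap=timedelta(days=30)` would select only students with a pause of a month or more, mirroring the 1-month measures in the table.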

Case Study at UVM Campus Toluca
• Students using Alelo Enskill at UVM, a Laureate institution in Mexico, developed greater proficiency and self-confidence in English communication.
• “It helps me to improve my classes and also it makes my classes very very short and very very communicative.”
• “If you want your students to have more self-confidence, Alelo is going to be your best option.”
• “When I was a kid I wish I had this kind of platform because it helps in confidence.”

UVM Toluca Campus, April-May 2020
• 25 students participated in the trial
• System improvements:
  • Restart options in case students get stuck
  • Revised mastery score at the low end of the scale
  • Disabled “click-through” feature so students must speak to progress

Summary of results
• 23 students completed all 10 simulations at least once
• 1 completed 9 simulations
• 1 completed 1 simulation
• Students completed 63% of simulations only once.
• Average maximum score: 46.6%
• Average for multiple trials: 64.3%
• Average increase per trial: 13.1%

Conclusions
• AIED can and should adopt a data-driven approach
• Evaluations can and should be more agile and iterative
• A series of small evaluations can be more informative than one large evaluation, and can guide development
• Development and evaluation can and should go hand in hand

www.alelo.com/AIED
ljohnson@alelo.com
