Invited conference presentation at the 2020 International Conference on Artificial Intelligence in Education, based on an article published in the International Journal of Artificial Intelligence in Education.
2. W. Lewis Johnson, PhD, CEO, Alelo
• Entrepreneur, thought leader, author
• DARPA Significant Technical Achievement Award
• IFAAMAS Influential Paper Award
• Host of webinar series on the Future of AI in Education and Training
• Past President, Intl. AI in Education Society
• Linguistics: Princeton; Computer Sci.: Yale
3. Data-driven development (D3)
• Design is informed by learner data
• System is as much a data collection tool as a learning tool
• AI models are iteratively trained on learner data, using machine learning techniques
• Learning evaluation and system evaluation are iterative and continuous
[Figure: D3 cycle linking data mining and analysis, development and model updates, and deployment]
Johnson, W.L. (2019). Data-Driven Development and Evaluation of Enskill English. Int. Journal of AI in Education, 29, pp. 425-457.
5. Alelo Enskill: An AI-driven learning architecture
• Communicative practice with AI avatars in safe environment
• Formative assessments
• Feedback
• Personalized practice
• Analytics for teachers, learners, administrators, and developers
6. Alelo Enskill around the world
Brazil • Chile • China • Colombia • Costa Rica • Croatia • Honduras • Malaysia • Mexico • Panama
Paraguay • Peru • Portugal • Serbia • Spain • Sweden • Thailand • Turkey • United States
Over 300,000 users to date
13. Continuous Evaluation in D3
• The system is evolving, so evaluations must produce findings quickly.
• Types of tests and evaluations:
  • Instant tests: Automated data analyses plus spot checks of data
    • We perform these weekly
  • Snapshot evaluations: Performed over a limited period of time with a limited population
    • Often to test specific hypotheses
    • Sometimes with new populations, to collect data and design improvements
  • A/B evaluations: Comparison of different learner populations
  • Regression tests: Tests of current system on archived data sets
• Evaluations are in the field, not in the lab
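As an illustration of the regression-test idea above, here is a minimal Python sketch that replays archived learner utterances through the current interpreter and compares the resulting understanding rate with a stored baseline. Everything named here is an assumption made for the example, not Alelo's actual tooling: `ArchivedUtterance`, `understanding_rate`, `passes_regression`, the keyword-based `toy_interpret`, the sample utterances, and the 2-point tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ArchivedUtterance:
    """One archived learner utterance and the interpretation accepted for it (hypothetical schema)."""
    text: str
    expected_intent: str


def understanding_rate(
    interpret: Callable[[str], str],
    archive: Iterable[ArchivedUtterance],
) -> float:
    """Fraction of archived utterances whose predicted intent matches the expected one."""
    items = list(archive)
    if not items:
        raise ValueError("archive is empty")
    hits = sum(1 for u in items if interpret(u.text) == u.expected_intent)
    return hits / len(items)


def passes_regression(current_rate: float, baseline_rate: float, tolerance: float = 0.02) -> bool:
    """True if the current rate is no more than `tolerance` below the baseline (assumed threshold)."""
    return current_rate >= baseline_rate - tolerance


def toy_interpret(text: str) -> str:
    """Hypothetical stand-in for the current language-understanding model: a keyword lookup."""
    lowered = text.lower()
    if "ticket" in lowered:
        return "buy_ticket"
    if "party" in lowered:
        return "plan_party"
    return "unknown"


# Invented archive, for illustration only.
ARCHIVE = [
    ArchivedUtterance("I would like a ticket to Boston", "buy_ticket"),
    ArchivedUtterance("One ticket, please", "buy_ticket"),
    ArchivedUtterance("Let's have the party on Friday", "plan_party"),
    ArchivedUtterance("Can we invite the whole class?", "plan_party"),
]

rate = understanding_rate(toy_interpret, ARCHIVE)
print(f"understanding rate: {rate:.0%}")
print("regression check passed:", passes_regression(rate, baseline_rate=0.75))
```

Because the archive is fixed, the same check can be rerun after every model update, which fits the requirement that evaluations produce findings quickly.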
14. Snapshot Evaluation: Univ. of Novi Sad
• April-May 2018
• Study population: 80 CEFR B-level English learners
• Study materials: CEFR A-level conversational simulations
• Questions:
  • Would the learners find it useful?
  • Would their performance improve with practice?
  • Was the speech & language technology adequate for this population?
15. Quick Summary of Results
Category    | Exchanges | Repeats | Meaningful Exchanges | Meaningful Exchange Rate | Turns per Minute
All trials  | 17.26     | 4.56    | 12.43                | 74.62%                   | 2.75
First trial | 18.86     | 7.52    | 11.33                | 62.65%                   | 2.62
Last trial  | 17.86     | 3.52    | 14.33                | 82.35%                   | 4.31

Simulation Name   | CEFR Level | Total Exchanges | Raw Understanding Rate
Class Interview   | A1         | 2187            | 85%
Plan a Party      | A1         | 2589            | 63%
Jerry’s Spaghetti | A2         | 725             | 67%
Train Ticket      | A2         | 2733            | 71%
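The slides do not define how the per-trial metrics are computed. The Python sketch below shows one plausible reading, under two labeled assumptions: that meaningful exchanges are exchanges minus repeats, and that the meaningful exchange rate is computed per session and then averaged. The second assumption would be consistent with the reported first-trial rate (62.65%) differing from the ratio of the column averages (11.33 / 18.86 ≈ 60.1%), but the source does not confirm it. The `Session` class, both helper functions, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Session:
    """Counts for one learner session (hypothetical schema)."""
    exchanges: int
    repeats: int

    def __post_init__(self) -> None:
        if self.exchanges <= 0 or not 0 <= self.repeats <= self.exchanges:
            raise ValueError("need exchanges > 0 and 0 <= repeats <= exchanges")

    @property
    def meaningful(self) -> int:
        # Assumption: a meaningful exchange is any exchange that is not a repeat.
        return self.exchanges - self.repeats

    @property
    def meaningful_rate(self) -> float:
        return self.meaningful / self.exchanges


def mean_of_rates(sessions: Sequence[Session]) -> float:
    """Average the per-session rates, so every session counts equally."""
    return mean(s.meaningful_rate for s in sessions)


def rate_of_totals(sessions: Sequence[Session]) -> float:
    """Pool all exchanges first, so longer sessions weigh more."""
    return sum(s.meaningful for s in sessions) / sum(s.exchanges for s in sessions)


# Invented data: one short session with many repeats, one long session with few.
SESSIONS = [Session(exchanges=10, repeats=5), Session(exchanges=30, repeats=3)]

print(f"mean of per-session rates: {mean_of_rates(SESSIONS):.1%}")
print(f"rate of pooled totals:     {rate_of_totals(SESSIONS):.1%}")
```

The two aggregates differ whenever session lengths vary, so a report should state which one it uses.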
16. Snapshot Evaluation: University of Split
• December 2018
• Study population: 39 CEFR B-level English learners
  • Somewhat higher proficiency than the Novi Sad learners
• Study materials: 5 simulations, upgraded NLU models
17. Quick Summary of Results
Category    | Exchanges | Repeats | Meaningful Exchanges | Meaningful Exchange Rate | Exchanges per Minute
All trials  | 15.22     | 2.79    | 12.43                | 82.44%                   | 3.80
First trial | 15.00     | 3.21    | 11.79                | 77.78%                   | 3.37
Last trial  | 16.07     | 1.82    | 14.31                | 88.78%                   | 4.26

Simulation Name   | CEFR Level | Total Exchanges | Raw Understanding Rate
Class Interview   | A1         | 1045            | 83%
Plan a Party      | A1         | 973             | 83%
School Newspaper  | A1         | 1032            | 94%
Jerry’s Spaghetti | A2         | 407             | 63%
Train Ticket      | A2         | 725             | 81%
21. Case Study at UVM Campus Toluca
• Students using Alelo Enskill at UVM, a Laureate institution in Mexico, developed greater proficiency and self-confidence in English communication.
• “It helps me to improve my classes and also it makes my classes very very short and very very communicative.”
• “If you want your students to have more self-confidence, Alelo is going to be your best option.”
• “When I was a kid I wish I had this kind of platform because it helps in confidence.”
23. UVM Toluca Campus, April-May 2020
• 25 students participated in the trial
• System improvements:
  • Restart options in case students get stuck
  • Revised mastery score at the low end of the scale
  • Disabled “click-through” feature so students must speak to progress
24. Summary of results
• 23 students completed all 10 simulations at least once
• 1 completed 9 simulations
• 1 completed 1 simulation
• Students completed 63% of simulations only once.
• Average maximum score: 46.6%
• Average for multiple trials: 64.3%
• Average increase per trial: 13.1%
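The slides do not say how the average increase per trial was calculated. As one possible reading, the Python sketch below pools the score change between consecutive trials of the same simulation by the same student and takes the mean, skipping sequences with only one trial. The function name, the data layout, and the scores are hypothetical, and are not the study's data.

```python
from statistics import mean
from typing import List, Sequence


def average_increase_per_trial(score_sequences: Sequence[Sequence[float]]) -> float:
    """Mean score change between consecutive trials, pooled over all sequences.

    Each inner sequence holds one student's scores on one simulation, in trial order.
    Sequences with a single trial contribute nothing.
    """
    deltas: List[float] = [
        later - earlier
        for scores in score_sequences
        for earlier, later in zip(scores, scores[1:])
    ]
    if not deltas:
        raise ValueError("no sequence has more than one trial")
    return mean(deltas)


# Invented scores in percent: three trials, two trials, and a single trial.
TRIALS = [[40.0, 55.0, 70.0], [50.0, 60.0], [45.0]]

avg_gain = average_increase_per_trial(TRIALS)
print(f"average increase per trial: {avg_gain:.1f} points")
```

Other definitions are equally defensible, for example last score minus first score divided by the number of repeat trials, so the choice should be stated alongside the result.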
25. Conclusions
• AIED can and should adopt a data-driven approach
• Evaluations can and should be more agile and iterative
• A series of small evaluations can be more informative than one large evaluation, and can guide development
• Development and evaluation can and should go hand in hand