Deep Learning to Help Students' Deep Learning
Hak Kim | AI Team
The return of the MOOC
● Education is no longer a one-time event but
a lifelong experience
● Working lives are now so lengthy and
fast-changing that people must be able to
acquire new skills throughout their careers
● MOOCs are broadening access to cutting-edge
vocational subjects
Background
Challenges
● Significant increase in student numbers
● Impractical for human instructors to assess
each student's level of engagement
● Increasingly fast development and update
cycles of course content
● Diverse demographics of students in each
online classroom
In the World of MOOCs
● Forecasting the future performance of
students as they interact with online
coursework is a challenging problem
● Reliable early-stage prediction of a student's
future performance is critical for
timely (educational) interventions during a
course
Only a few prior works have explored the problem
from a deep learning perspective!
Student Perf. Prediction
GritNet
Graduate (user 1122)
● First views four lecture videos, then gets a quiz correct
● In the subsequent 13 events, watches a series of videos, answers quizzes,
submits the first project three times, and finally gets it passed
Drop-out (user 2214)
● Views four lectures sparsely and fails the first project three times in a row
From this raw student activity data, GritNet can predict whether
or not a student will graduate (or any other student outcome)
Student Event Sequence
Input
● Feed students' time-stamped raw event sequences
(w/o engineered features)
● Encode them by one-hot encoding of the possible event tuples
{action, timestamp} - to preserve ordering information and to
capture each student's learning speed
Output
● Predicted probability of a student outcome
● E.g., graduation or any other student outcome
Model
● The global max pooling (GMP) layer captures the most relevant parts of the input
event sequence and handles the imbalanced nature of the data! (model sketched below)
GritNet
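To make the architecture concrete, here is a minimal sketch of a GritNet-style model in PyTorch. The class name, dimensions, and vocabulary size are illustrative assumptions, not the authors' exact implementation; integer event IDs stand in for the one-hot {action, timestamp} tuples, since an embedding lookup is equivalent to multiplying a one-hot vector by a weight matrix.

```python
import torch
import torch.nn as nn

class GritNet(nn.Module):
    """Sketch: embedding -> BLSTM -> global max pooling (GMP) -> FC."""

    def __init__(self, num_event_tuples, emb_dim=512, hidden_dim=128, dropout=0.5):
        super().__init__()
        # One embedding row per possible {action, timestamp} tuple.
        self.embedding = nn.Embedding(num_event_tuples, emb_dim)
        self.blstm = nn.LSTM(emb_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)       # dropout on the BLSTM/GMP output
        self.fc = nn.Linear(2 * hidden_dim, 1)   # one outcome logit per student

    def forward(self, event_ids):
        # event_ids: (batch, seq_len) integer-encoded event tuples
        x = self.embedding(event_ids)   # (batch, seq_len, emb_dim)
        h, _ = self.blstm(x)            # (batch, seq_len, 2 * hidden_dim)
        g, _ = h.max(dim=1)             # GMP over time keeps the most salient activations
        return self.fc(self.dropout(g)).squeeze(-1)
```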
The training objective is the negative log likelihood of the observed
event sequences of student activities
● Binary cross entropy is minimized using stochastic gradient descent on mini-batches of size 32 (training loop sketched below)
Model optimization
● BLSTM layers of 128 cell dimensions per direction
● Embedding layer dimension grid-searched
● Dropout applied to the BLSTM output
● Note: The baseline model (logistic-regression-based) was also optimized extensively
Trained a different model for each week, based on students' week-by-week event records
GritNet Training
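A hedged sketch of the training loop matching the settings above (binary cross entropy, SGD, mini-batches of size 32), reusing the `GritNet` module sketched earlier. The vocabulary size, learning rate, epoch count, and `train_loader` are assumptions for illustration.

```python
import torch

model = GritNet(num_event_tuples=10_000)   # vocabulary size is illustrative
criterion = torch.nn.BCEWithLogitsLoss()   # binary cross entropy on the logit
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # learning rate assumed

for epoch in range(20):                         # epoch count assumed
    for event_ids, labels in train_loader:      # assumed loader yielding batches of 32
        optimizer.zero_grad()
        loss = criterion(model(event_ids), labels.float())
        loss.backward()
        optimizer.step()
```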
On Udacity Data
● ND-A program: 777 graduated out of 1,853 students (41.9%)
● ND-B program: 1,005 graduated out of 8,301 students (12.1%)
Results
● Compared student graduation prediction accuracy in terms of
mean area under the ROC curve (AUC) over 5-fold cross-validation
(protocol sketched below)
● GritNet consistently outperforms the baseline, delivering gains of
more than 5% absolute AUC on both the ND-A and ND-B
datasets
● On the ND-B dataset, the baseline model needs two months of
data to reach the performance that GritNet achieves
within three weeks
I.e., GritNet's improvements are most pronounced in the first few
weeks, when accurate predictions are most challenging!
Prediction Performance
[Charts: week-by-week prediction AUC; GritNet gains over the baseline of 5.3% abs (7.7% rel) and 5% abs]
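The evaluation protocol, sketched with scikit-learn: mean ROC AUC over 5-fold cross-validation. `sequences` and `labels` are assumed NumPy arrays, and `train_gritnet` / `predict_proba` are hypothetical helpers wrapping the training loop above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

aucs = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(sequences, labels):
    # Hypothetical helpers standing in for the training/inference code above.
    model = train_gritnet(sequences[train_idx], labels[train_idx])
    scores = predict_proba(model, sequences[test_idx])
    aucs.append(roc_auc_score(labels[test_idx], scores))
print(f"mean AUC over 5 folds: {np.mean(aucs):.4f}")
```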
Intermission
GritNet is the current state of the art for the student performance prediction problem
● P1: Does not need any feature engineering (learns from raw input)
● P2: Can operate on any student event data associated with a timestamp
(even when highly imbalanced)
Remaining challenges
● Significant increase in student numbers
● Impractical for human instructors to assess
each student's level of engagement
● Increasingly fast development and update
cycles of course content
● Diverse demographics of students in each
online classroom
In the World of MOOCs
What’s Next
● So far, GritNet has been trained and tested on data
from the same course
● Next: real-time prediction during an ongoing course
(before the course finishes)
Real-Time Prediction
Observation
● The GMP layer output captures a generalizable sequential
representation of the input event sequence
● The FC layer learns course-specific features
GritNet can be viewed as a sequence-level feature (or
embedding) extractor plus a classifier
Intuition
Unsupervised DA framework
● Trained on a previous (source) course
but deployed on another (target) course
● E.g., v1 → v2, or an existing → a newly-launched ND program
Ordinal Input Encoding
● Group (raw) content IDs into subcategories
● Normalize them using the natural ordering implicit
in the content paths within each category (sketched below)
Domain Adaptation
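A rough sketch of the ordinal input encoding idea, under an assumed content-path format: raw content IDs are grouped by subcategory and replaced by their normalized rank within it, so source and target courses with disjoint raw IDs share one input space. The function name and path layout are illustrative.

```python
from collections import defaultdict

def ordinal_encode(content_paths):
    """content_paths: {raw_content_id: "lesson-1/quiz-2", ...} (format assumed)."""
    by_category = defaultdict(list)
    for cid, path in content_paths.items():
        by_category[path.split("/")[0]].append((path, cid))  # group by subcategory
    encoding = {}
    for category, items in by_category.items():
        items.sort()  # sorting by path stands in for the natural content ordering
        for rank, (_, cid) in enumerate(items):
            # Normalized position within the category, in [0, 1]
            encoding[cid] = (category, rank / max(len(items) - 1, 1))
    return encoding
```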
Pseudo-label generation
● No outcome labels are available from the
target course
● Use a trained GritNet to assign hard
one-hot labels (Step 4; sketched below)
Domain Adaptation
(cont’d)
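A minimal sketch of the pseudo-labeling step, assuming the `GritNet` module sketched earlier: the source-trained model scores unlabeled target-course sequences, and the probabilities are thresholded into hard 0/1 labels. The function name and threshold are assumptions.

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(source_model, target_event_ids, threshold=0.5):
    """Score target-course students with the source-trained GritNet."""
    source_model.eval()
    probs = torch.sigmoid(source_model(target_event_ids))
    return (probs >= threshold).long()  # hard one-hot (0/1) pseudo-labels
```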
Transfer Learning
● Continue training the last FC layer on the
target course while keeping the GritNet
core fixed (Step 6 and Step 7)
● The last FC layer limits the number of
parameters to learn
A very practical solution: fast adaptation, and applicable even to smaller target courses! (adaptation step sketched below)
Domain Adaptation
(cont’d)
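The adaptation step, sketched under the same assumed module: freeze the GritNet core (embedding and BLSTM) and continue training only the last FC layer on pseudo-labeled target data. `target_loader` is an assumed loader over target sequences and their pseudo-labels; the learning rate is illustrative.

```python
import torch

for param in model.parameters():
    param.requires_grad = False      # freeze the GritNet core
for param in model.fc.parameters():
    param.requires_grad = True       # only the last FC layer adapts

optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.1)  # learning rate assumed
criterion = torch.nn.BCEWithLogitsLoss()
for event_ids, pseudo_labels in target_loader:  # assumed pseudo-labeled loader
    optimizer.zero_grad()
    loss = criterion(model(event_ids), pseudo_labels.float())
    loss.backward()
    optimizer.step()
```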
Udacity Data
Setup
● Vanilla baseline: logistic regression trained on the source course, evaluated on the target course
● GritNet baseline: GritNet trained on the source course, evaluated on the target course
○ BLSTM layers of 256 cell dimensions per direction and a 512-dimensional embedding layer
● Domain adaptation: the proposed approach (pseudo-labels plus FC-layer transfer)
● Oracle: uses true target labels instead of pseudo-labels
Experimental Setup
Generalization Performance
[Charts: AUC recovery by week; 70.60% ARR and 58.06% ARR overall; up to 84.88% ARR and 75.07% ARR at week 3]
Results
● AUC Recovery Rate (ARR): the absolute AUC improvement that unsupervised domain adaptation provides, divided by
the absolute improvement that oracle training produces (computed as sketched below)
○ Clear wins for the GritNet baseline over the vanilla baseline
○ Domain adaptation improves real-time predictions especially in the first few weeks!
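Written out, ARR is the fraction of the oracle's AUC gain over the transferred baseline that unsupervised adaptation recovers; a tiny sketch with illustrative numbers:

```python
def auc_recovery_rate(auc_baseline, auc_adapted, auc_oracle):
    """Share of the baseline-to-oracle AUC gap closed by domain adaptation."""
    return (auc_adapted - auc_baseline) / (auc_oracle - auc_baseline)

# Illustrative numbers only: baseline 0.70, adapted 0.78, oracle 0.80 -> ~0.8 (80% ARR)
print(auc_recovery_rate(0.70, 0.78, 0.80))
```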
Summary
GritNet is the current state of the art for the student performance prediction problem
● P1: Does not need any feature engineering (learns from raw input)
● P2: Can operate on any student event data associated with a timestamp
(even when highly imbalanced)
● P3: Can be transferred to new courses without any (student outcome) labels!
Full papers available on arXiv (GritNet, GritNet2)
Looking ahead
Is a student likely to benefit from (sequential) interventions?
[Diagram: student X, intervention T, learning outcomes Y]
Wrap Up
Questions?
hak@udacity.com
Thank You!
