Think-Aloud Protocols

Transcript

    • 1. Educational Data Mining
      March 3, 2010
    • 2. Today’s Class
      EDM
      Assignment#5
      Mega-Survey
    • 3. Educational Data Mining
      “Educational Data Mining is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students, and the settings which they learn in.”
      www.educationaldatamining.org
    • 4. Classes of EDM Method (Romero & Ventura, 2007)
      Information Visualization
      Web mining
      Clustering, Classification, Outlier Detection
      Association Rule Mining/Sequential Pattern Mining
      Text Mining
    • 5. Classes of EDM Method (Baker & Yacef, 2009)
      Prediction
      Clustering
      Relationship Mining
      Discovery with Models
      Distillation of Data For Human Judgment
    • 6. Prediction
      Develop a model which can infer a single aspect of the data (predicted variable) from some combination of other aspects of the data (predictor variables)
      Which students are using CVS?
      Which students will fail the class?
    • 7. Clustering
      Find points that naturally group together, splitting full data set into set of clusters
      Usually used when nothing is known about the structure of the data
      What behaviors are prominent in the domain?
      What are the main groups of students?
    • 8. Relationship Mining
      Discover relationships between variables in a data set with many variables
      Association rule mining
      Correlation mining
      Sequential pattern mining
      Causal data mining
      Beck & Mostow (2008) article is a great example of this
    • 9. Discovery with Models
      Pre-existing model (developed with EDM prediction methods… or clustering… or knowledge engineering)
      Applied to data and used as a component in another analysis
    • 10. Distillation of Data for Human Judgment
      Making complex data understandable by humans to leverage their judgment
      Text replays are a simple example of this
    • 11. Focus of today’s class
      Prediction
      Clustering
      Relationship Mining
      Discovery with Models
      Distillation of Data For Human Judgment
      There will be a term-long class on this, taught by Joe Beck, in coordination with Carolina Ruiz’s Data Mining class, in a future year
      Strongly recommended
    • 12. Prediction
      Pretty much what it says
      A student is using a tutor right now. Is he gaming the system or not?
      A student has used the tutor for the last half hour.
      How likely is it that she knows the knowledge component in the next step?
      A student has completed three years of high school.
      What will be her score on the SAT-Math exam?
    • 13. Two Key Types of Prediction
      This slide adapted from slide by Andrew W. Moore, Google http://www.cs.cmu.edu/~awm/tutorials
    • 14. Classification
      General Idea
      Canonical Methods
      Assessment
      Ways to do assessment wrong
    • 15. Classification
      There is something you want to predict (“the label”)
      The thing you want to predict is categorical
      The answer is one of a set of categories, not a number
      CORRECT/WRONG (sometimes expressed as 0,1)
      HELP REQUEST/WORKED EXAMPLE REQUEST/ATTEMPT TO SOLVE
      WILL DROP OUT/WON’T DROP OUT
      WILL SELECT PROBLEM A,B,C,D,E,F, or G
    • 16. Classification
      Associated with each label is a set of “features”, which you can maybe use to predict the label
      Skill pknow time totalactions right
      ENTERINGGIVEN 0.704 9 1 WRONG
      ENTERINGGIVEN 0.502 10 2 RIGHT
      USEDIFFNUM 0.049 6 1 WRONG
      ENTERINGGIVEN 0.967 7 3 RIGHT
      REMOVECOEFF 0.792 16 1 WRONG
      REMOVECOEFF 0.792 13 2 RIGHT
      USEDIFFNUM 0.073 5 2 RIGHT
      ….
    • 17. Classification
      The basic idea of a classifier is to determine which features, in which combination, can predict the label
      Skill pknow time totalactions right
      ENTERINGGIVEN 0.704 9 1 WRONG
      ENTERINGGIVEN 0.502 10 2 RIGHT
      USEDIFFNUM 0.049 6 1 WRONG
      ENTERINGGIVEN 0.967 7 3 RIGHT
      REMOVECOEFF 0.792 16 1 WRONG
      REMOVECOEFF 0.792 13 2 RIGHT
      USEDIFFNUM 0.073 5 2 RIGHT
      ….
    • 18. Classification
      Of course, usually there are more than 4 features
      And more than 7 actions/data points
      I’ve recently done analyses with 800,000 student actions, and 26 features
    • 19. Classification
      Of course, usually there are more than 4 features
      And more than 7 actions/data points
      I’ve recently done analyses with 800,000 student actions, and 26 features
      5 years ago that would’ve been a lot of data
      These days, in the EDM world, it’s just a medium-sized data set
    • 20. Classification
      One way to classify is with a Decision Tree (like J48)
      [Decision tree diagram: PKNOW split at 0.5; one branch then splits on TIME (<6 s / >=6 s), the other on TOTALACTIONS (<4 / >=4); leaves are RIGHT and WRONG]
    • 21. Classification
      One way to classify is with a Decision Tree (like J48)
      [Decision tree diagram: PKNOW split at 0.5; one branch then splits on TIME (<6 s / >=6 s), the other on TOTALACTIONS (<4 / >=4); leaves are RIGHT and WRONG]
      Skill pknow time totalactions right
      COMPUTESLOPE 0.544 9 1 ?
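
Below is a minimal sketch of this kind of classifier in Python, using scikit-learn's DecisionTreeClassifier (a CART learner rather than WEKA's J48/C4.5, but the same idea) on the toy feature table from slides 16-17, then asking it to label the unlabeled COMPUTESLOPE action above.

```python
# A minimal sketch (not the slides' actual analysis): train a small decision
# tree on the toy table from slides 16-17 and label the COMPUTESLOPE action.
from sklearn.tree import DecisionTreeClassifier

# Columns: pknow, time, totalactions (skill name omitted for simplicity)
X = [[0.704, 9, 1],
     [0.502, 10, 2],
     [0.049, 6, 1],
     [0.967, 7, 3],
     [0.792, 16, 1],
     [0.792, 13, 2],
     [0.073, 5, 2]]
y = ["WRONG", "RIGHT", "WRONG", "RIGHT", "WRONG", "RIGHT", "RIGHT"]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The unlabeled action from this slide: pknow=0.544, time=9, totalactions=1
print(clf.predict([[0.544, 9, 1]]))
```
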
    • 22. Classification
      Another way to classify is with step regression
      Linear regression (discussed later), with a cut-off
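
As a rough sketch of the cut-off idea (an assumed minimal version, not the full step regression procedure, which also performs stepwise feature selection): fit an ordinary linear regression to 0/1 labels and threshold the prediction at 0.5.

```python
# Rough sketch: linear regression on 0/1 labels, thresholded at 0.5.
# Real step regression also performs stepwise feature selection (omitted here).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.704, 9, 1], [0.502, 10, 2], [0.049, 6, 1], [0.967, 7, 3]])
y = np.array([0, 1, 0, 1])  # WRONG = 0, RIGHT = 1

reg = LinearRegression().fit(X, y)
predicted_labels = (reg.predict(X) >= 0.5).astype(int)  # the cut-off
print(predicted_labels)
```
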
    • 23. And of course…
      There are lots of other classification algorithms you can use...
      SMO (support vector machine)
      In your favorite Machine Learning package
    • 24. And of course…
      There are lots of other classification algorithms you can use...
      SMO (support vector machine)
      In your favorite Machine Learning package
      WEKA
    • 25. And of course…
      There are lots of other classification algorithms you can use...
      SMO (support vector machine)
      In your favorite Machine Learning package
      WEKA
      RapidMiner
    • 26. And of course…
      There are lots of other classification algorithms you can use...
      SMO (support vector machine)
      In your favorite Machine Learning package
      WEKA
      RapidMiner
      KEEL
    • 27. And of course…
      There are lots of other classification algorithms you can use...
      SMO (support vector machine)
      In your favorite Machine Learning package
      WEKA
      RapidMiner
      KEEL
      RapidMiner
    • 28. And of course…
      There are lots of other classification algorithms you can use...
      SMO (support vector machine)
      In your favorite Machine Learning package
      WEKA
      RapidMiner
      KEEL
      RapidMiner
      RapidMiner
    • 29. And of course…
      There are lots of other classification algorithms you can use...
      SMO (support vector machine)
      In your favorite Machine Learning package
      WEKA
      RapidMiner
      KEEL
      RapidMiner
      RapidMiner
      RapidMiner
    • 30. Comments? Questions?
    • 31. How can you tell if a classifier is any good?
    • 32. How can you tell if a classifier is any good?
      What about accuracy?
      Accuracy = (# correct classifications) / (total number of classifications)
      9200 actions were classified correctly, out of 10000 actions = 92% accuracy, and we declare victory.
    • 33. What are some limitations of accuracy?
    • 34. Biased training set
      What if the underlying distribution that you were trying to predict was:
      9200 correct actions, 800 wrong actions
      And your model predicts that every action is correct
      Your model will have an accuracy of 92%
      Is the model actually any good?
    • 35. What are some alternate metrics you could use?
    • 36. What are some alternate metrics you could use?
      Kappa = (Accuracy – Expected Accuracy) / (1 – Expected Accuracy)
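
Plugging the biased example from slide 34 into this formula: a detector that labels every action "correct" gets 92% accuracy, but its expected (chance) agreement is also 92%, so kappa comes out to 0.

```python
# Worked example: 10,000 actions, 9,200 actually correct, model always says "correct".
accuracy = 9200 / 10000                 # observed agreement = 0.92
expected = 0.92 * 1.0 + 0.08 * 0.0      # chance agreement, given the model always predicts "correct"
kappa = (accuracy - expected) / (1 - expected)
print(kappa)                            # 0.0 -- no better than chance
```
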
    • 37. What are some alternate metrics you could use?
      A’
      The probability that if the model is given an example from each category, it will accurately identify which is which
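
A' can be computed straight from this definition; a minimal sketch follows (the confidence scores are invented for illustration, not data from the slides).

```python
# A' as a pairwise probability: for every (positive, negative) pair of examples,
# count how often the model's confidence ranks the positive one higher (ties = 0.5).
def a_prime(pos_scores, neg_scores):
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model confidences for three positive and three negative examples
print(a_prime([0.9, 0.7, 0.6], [0.8, 0.3, 0.2]))  # ~0.78 (7 of 9 pairs ranked correctly)
```
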
    • 38. Comparison
      Kappa
      easier to compute
      works for an unlimited number of categories
      wacky behavior when things are worse than chance
      difficult to compare two kappas in different data sets (K=0.6 is not always better than K=0.5)
    • 39. Comparison
      A’
      more difficult to compute
      only works for two categories (without complicated extensions)
      meaning is invariant across data sets (A’=0.6 is always better than A’=0.55)
      very easy to interpret statistically
    • 40. Comments? Questions?
    • 41. What data set should you generally test on?
      A vote…
      Raise your hands as many times as you like
    • 42. What data set should you generally test on?
      The data set you trained your classifier on
      A data set from a different tutor
      Split your data set in half (by students), train on one half, test on the other half
      Split your data set in ten (by actions). Train on each set of 9 sets, test on the tenth. Do this ten times.
      Votes?
    • 43. What data set should you generally test on?
      The data set you trained your classifier on
      A data set from a different tutor
      Split your data set in half (by students), train on one half, test on the other half
      Split your data set in ten (by actions). Train on each set of 9 sets, test on the tenth. Do this ten times.
      What are the benefits and drawbacks of each?
    • 44. The dangerous one (though still sometimes OK)
      The data set you trained your classifier on
      If you do this, there is serious danger of over-fitting
    • 45. The dangerous one (though still sometimes OK)
      You have ten thousand data points.
      You fit a parameter for each data point.
      “If data point 1, RIGHT. If data point 78, WRONG…”
      Your accuracy is 100%
      Your kappa is 1
      Your model will neither work on new data, nor will it tell you anything.
    • 46. The dangerous one (though still sometimes OK)
      The data set you trained your classifier on
      When might this one still be OK?
    • 47. K-fold cross validation (standard)
      Split your data set in ten (by action). Train on each set of 9 sets, test on the tenth. Do this ten times.
      What can you infer from this?
    • 48. K-fold cross validation (standard)
      Split your data set in ten (by action). Train on each set of 9 sets, test on the tenth. Do this ten times.
      What can you infer from this?
      Your detector will work with new data from the same students
    • 49. K-fold cross validation (student-level)
      Split your data set in half (by student), train on one half, test on the other half
      What can you infer from this?
    • 50. K-fold cross validation (student-level)
      Split your data set in half (by student), train on one half, test on the other half
      What can you infer from this?
      Your detector will work with data from new students from the same population (whatever it was)
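
A sketch of the two setups in scikit-learn, on stand-in data: plain KFold mixes each student's actions across folds (slides 47-48), while GroupKFold holds out whole students (slides 49-50).

```python
# Sketch with stand-in data: compare action-level and student-level cross-validation.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # toy features (e.g., pknow, time, totalactions)
y = rng.integers(0, 2, size=200)              # toy labels
student_ids = np.repeat(np.arange(20), 10)    # 20 students, 10 actions each

clf = DecisionTreeClassifier(max_depth=3)

# Action-level CV: a student's actions can land in both train and test folds
action_level = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))

# Student-level CV: every test fold contains only students the model never saw in training
student_level = cross_val_score(clf, X, y, groups=student_ids, cv=GroupKFold(n_splits=10))
```
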
    • 51. A data set from a different tutor
      The most stringent test
      When your model succeeds at this test, you know you have a good/general model
      When it fails, it’s sometimes hard to know why
    • 52. An interesting alternative
      Leave-out-one-tutor-cross-validation (cf. Baker, Corbett, & Koedinger, 2006)
      Train on data from 3 or more tutors
      Test on data from a different tutor
      (Repeat for all possible combinations)
      Good for giving a picture of how well your model will perform in new lessons
    • 53. Comments? Questions?
    • 54. Regression
    • 55. Regression
      There is something you want to predict (“the label”)
      The thing you want to predict is numerical
      Number of hints student requests
      How long student takes to answer
      What will the student’s test score be
    • 56. Regression
      Associated with each label is a set of “features”, which you can maybe use to predict the label
      Skill pknow time totalactions numhints
      ENTERINGGIVEN 0.704 9 1 0
      ENTERINGGIVEN 0.502 10 2 0
      USEDIFFNUM 0.049 6 1 3
      ENTERINGGIVEN 0.967 7 3 0
      REMOVECOEFF 0.792 16 1 1
      REMOVECOEFF 0.792 13 2 0
      USEDIFFNUM 0.073 5 2 0
      ….
    • 57. Regression
      The basic idea of regression is to determine which features, in which combination, can predict the label’s value
      Skill pknow time totalactions numhints
      ENTERINGGIVEN 0.704 9 1 0
      ENTERINGGIVEN 0.502 10 2 0
      USEDIFFNUM 0.049 6 1 3
      ENTERINGGIVEN 0.967 7 3 0
      REMOVECOEFF 0.792 16 1 1
      REMOVECOEFF 0.792 13 2 0
      USEDIFFNUM 0.073 5 2 0
      ….
    • 58. Linear Regression
      The most classic form of regression is linear regression
      Alternatives include Poisson regression, Neural Networks...
    • 59. Linear Regression
      The most classic form of regression is linear regression
      Numhints = 0.12*Pknow + 0.932*Time – 0.11*Totalactions
      Skill pknow time totalactionsnumhints
      COMPUTESLOPE 0.544 9 1 ?
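
A minimal sketch of fitting such a model to the numhints table from slide 56 and predicting the unlabeled COMPUTESLOPE action (the coefficients it learns will not match the illustrative equation above, which only shows the model's form).

```python
# Minimal sketch: linear regression on the toy numhints table from slide 56.
from sklearn.linear_model import LinearRegression

# Columns: pknow, time, totalactions (skill name omitted for simplicity)
X = [[0.704, 9, 1],
     [0.502, 10, 2],
     [0.049, 6, 1],
     [0.967, 7, 3],
     [0.792, 16, 1],
     [0.792, 13, 2],
     [0.073, 5, 2]]
numhints = [0, 0, 3, 0, 1, 0, 0]

reg = LinearRegression().fit(X, numhints)
print(reg.coef_, reg.intercept_)          # learned weights for pknow, time, totalactions
print(reg.predict([[0.544, 9, 1]]))       # predicted hints for the COMPUTESLOPE action
```
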
    • 60. Linear Regression
      Linear regression only fits linear functions (except when you apply transforms to the input variables, which RapidMiner can do for you…)
    • 61. Linear Regression
      However…
      It is blazing fast
      It is often more accurate than more complex models, particularly once you cross-validate
      Data Mining’s “Dirty Little Secret”
      It is feasible to understand your model (with the caveat that the second feature in your model is in the context of the first feature, and so on)
    • 62. Example of Caveat
      Let’s study a classic example
    • 63. Example of Caveat
      Let’s study a classic example
      Drinking too much prune nog at a party, and having an emergency trip to the Little Researcher’s Room
    • 64. Data
    • 65. Data
      Some people are resistant to the deleterious effects of prunes and can safely enjoy high quantities of prune nog!
    • 66. Learned Function
      Probability of “emergency” = 0.25 * (# Drinks of nog last 3 hours) - 0.018 * (Drinks of nog last 3 hours)²
      But does that actually mean that (Drinks of nog last 3 hours)² is associated with fewer “emergencies”?
    • 67. Learned Function
      Probability of “emergency” = 0.25 * (# Drinks of nog last 3 hours) - 0.018 * (Drinks of nog last 3 hours)²
      But does that actually mean that (Drinks of nog last 3 hours)² is associated with fewer “emergencies”?
      No!
    • 68. Example of Caveat
      (Drinks of nog last 3 hours)² is actually positively correlated with emergencies!
      r=0.59
    • 69. Example of Caveat
      The relationship is only in the negative direction when (Drinks of nog last 3 hours) is already in the model…
    • 70. Example of Caveat
      So be careful when interpreting linear regression models (or almost any other type of model)
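
A small simulation in the spirit of the prune-nog example (the numbers are invented, not the slides' data) makes the caveat concrete: the squared term correlates positively with the outcome on its own, yet receives a negative coefficient once the linear term is already in the model.

```python
# Invented data shaped like the prune-nog example: outcome roughly follows
# 0.25*d - 0.018*d^2 plus noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
drinks = rng.uniform(0, 12, size=500)
outcome = 0.25 * drinks - 0.018 * drinks**2 + rng.normal(0, 0.1, size=500)

# On its own, drinks^2 is positively correlated with the outcome...
print(np.corrcoef(drinks**2, outcome)[0, 1])                    # positive

# ...but in a model that already contains drinks, its coefficient is negative
model = LinearRegression().fit(np.column_stack([drinks, drinks**2]), outcome)
print(model.coef_)                                              # roughly [0.25, -0.018]
```
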
    • 71. Comments? Questions?
    • 72. Discovery with Models
    • 73. Why do Discovery with Models?
      Let’s say you have a model of some construct of interest or importance
      Knowledge
      Meta-Cognition
      Motivation
      Affect
      Inquiry Skill
      Collaborative Behavior
      Etc.
    • 74. Why do Discovery with Models?
      You can use that model to
      Find outliers of interest by finding out where the model makes extreme predictions
      Inspect the model to learn what factors are involved in predicting the construct
      Find out the construct’s relationship to other constructs of interest, by studying its correlations/associations/causal relationships with data/models on the other constructs
      Study the construct across contexts or students, by applying the model within data from those contexts or students
      And more…
    • 75. Most frequently
      Done using prediction models
      Though other types of models (in particular knowledge engineering models) are amenable to this as well!
    • 76. Boosting
    • 77. Boosting
      Let’s say that you have 300 labeled actions randomly sampled from 600,000 overall actions
      Not a terribly unusual case, in these days of massive data sets, like those in the PSLC DataShop
      You can train the model on the 300, cross-validate it, and then apply it to all 600,000
      And then analyze the model across all actions
      Makes it possible to study larger-scale problems than a human could do without computer assistance
      Especially nice if you have some unlabeled data set with nice properties
      For example, additional data such as questionnaire data (cf. Baker, Walonoski, Heffernan, Roll, Corbett, & Koedinger, 2008)
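
A sketch of that workflow with random stand-ins for the 300 labeled actions and the 600,000-action log: cross-validate the detector on the labeled sample, then apply it to everything.

```python
# Sketch with stand-in data: label a small sample, validate, then apply broadly.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
full_X = rng.normal(size=(600_000, 3))              # stand-in for the full action log
labeled_idx = rng.choice(600_000, size=300, replace=False)
labeled_X = full_X[labeled_idx]
labeled_y = rng.integers(0, 2, size=300)            # stand-in for the 300 hand labels

detector = DecisionTreeClassifier(max_depth=3)
print(cross_val_score(detector, labeled_X, labeled_y, cv=10).mean())  # validate first

detector.fit(labeled_X, labeled_y)
predictions = detector.predict(full_X)              # then apply to all 600,000 actions
```
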
    • 78. However…
      To do this and trust the result,
      You should validate that the model can transfer across students, populations, and to the learning software you’re using
      As discussed earlier
    • 79. A few examples…
    • 80. Middle School Gaming Detector
    • 81. Skills from the Algebra Tutor
      Probability of learning skill at each opportunity
      Initial probability of knowing skill
    • 82. Which skills could probably be removed from the tutor?
    • 83. Which skills could use better instruction?
    • 84. Comments? Questions?
    • 85. A lengthier example (if there’s time)
      Applying Baker et al.'s (2008) gaming detector across contexts
    • 86. Research Question
      Do students game the system because of state or trait factors?
      If trait factors are the main explanation, differences between students will explain much of the variance in gaming
      If state factors are the main explanation, differences between lessons could account for many (but not all) state factors, and explain much of the variance in gaming
      So: is the student or the lesson a better predictor of gaming?
    • 87. Application of Detector
      After validating its transfer
      We applied the gaming detector across 35 lessons, used by 240 students, from a single Cognitive Tutor
      Giving us, for each student in each lesson, a gaming frequency
    • 88. Model
      Linear Regression models
      Gaming frequency = Lesson + a0
      Gaming frequency = Student + a0
    • 89. Model
      Categorical variables transformed to a set of binaries
      i.e. Lesson = Scatterplot becomes
      3DGeometry = 0
      Percents = 0
      Probability = 0
      Scatterplot = 1
      Boxplot = 0
      Etc…
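
A minimal sketch of this dummy coding with pandas, using made-up lesson names and gaming frequencies:

```python
# Made-up data: turn the categorical Lesson variable into one 0/1 column per lesson.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "lesson": ["Scatterplot", "3DGeometry", "Percents", "Scatterplot", "Boxplot"],
    "gaming_frequency": [0.12, 0.03, 0.05, 0.10, 0.02],
})

X = pd.get_dummies(data["lesson"])     # one binary column per lesson
print(X)

model = LinearRegression().fit(X, data["gaming_frequency"])
```
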
    • 90. Metrics
    • 91. r²
      The correlation, squared
      The proportion of variability in the data set that is accounted for by a statistical model
    • 92. r²
      The correlation, squared
      The proportion of variability in the data set that is accounted for by a statistical model
    • 93. r²
      However, a limitation
      The more variables you have, the more variance you should expect to predict, just by chance
    • 94. r²
      We should expect
      240 students
      To predict gaming better than
      35 lessons
      Just by overfitting
    • 95. So what can we do?
    • 96. BiC
      Bayesian Information Criterion (Raftery, 1995)
      Makes a trade-off between goodness of fit and flexibility of fit (number of parameters)
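
For concreteness, a sketch of the standard BIC formula for a regression model (an assumption for illustration; the BiC' statistic on the following slides is Raftery's (1995) variant, which is referenced against a null model, but it trades off the same two quantities):

```python
# Standard BIC for a regression fit: reward low residual error, penalize parameters.
import numpy as np

def bic(y_true, y_pred, n_params):
    n = len(y_true)
    rss = np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)  # lower is better
```
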
    • 97. Predictors
    • 98. The Lesson
      Gaming frequency = Lesson + a0
      35 parameters
      r² = 0.55
      BiC’ = -2370
      Model is significantly better than chance would predict given model size & data set size
    • 99. The Student
      Gaming frequency = Student + a0
      240 parameters
      r² = 0.16
      BiC’ = 1382
      Model is worse than chance would predict given model size & data set size!
    • 100. Standard deviation bars, not standard error bars
    • 101. Comments? Questions?
    • 102. EDM – where?
      Holistic
      Existentialist
      Essentialist
      Entitative
    • 103. Today’s Class
      EDM
      Assignment#5
      Mega-Survey
    • 104. Any questions?
    • 105. Today’s Class
      EDM
      Assignment#5
      Mega-Survey
    • 106. Mega-Survey
      I need a volunteer to bring these surveys to Jim Doyle after class
      *NOT THE REGISTRAR*
    • 107. Mega-Survey Additional Questions (See back)
      #1: In future years, should this class be given
      1: In half a semester, as part of a unified semester class, along with Professor Skorinko’s Research Methods class
      3: Unsure/neutral
      5: As a full-semester class, with Professor Skorinko’s class as a prerequisite
      #2: Are there any topics you think should be dropped from this class? [write your answer in the space to the right]
      #3: Are there any topics you think should be added to this class? [write your answer in the space to the right]
