Putting the world to work for ITS



Presentation at Intelligent Tutoring Systems conference in 2008 on open community authoring of targeted worked example problems.

Published in: Technology, Education

  • [Insert a graphic to start this off]
  • Developing ITS is expensive, and it’s done in small groups. Lots of work by skilled experts in the groups; let’s figure out how to distribute it. Open source (Linux) and open content (Wikipedia) show us it can be done. There are no large-scale collaboration systems for ITS authoring. The goal here is something of a Wikipedia for tutoring. Then next slide, Wikipedia on PT.
  • But Wikipedia itself is not the right model. E.g. this Wikipedia entry is geared to people who already know the math and want more details. No learning by doing. No doing in Wikipedia at all. If you put such information into Wikipedia, you get a note telling you to move it to Wikiversity and Wikibooks. But if you look at those, they have hardly any content. Wikis are awkward for instructional material because they attempt to be canonical. But students learn in diverse ways and at different rates. Let’s allow divergence of resources so that materials can be tailored specifically to each student.
  • The work here is part of a larger study into a working collaborative community. The vision is for a model of development that is cheaper than existing methods, leads people to think more about learning, and can evolve to be the best. [walk through the cycle] The study I’m going to tell you about is in the Generation phase of the cycle, where people submit new material. [save for end: Here Improve leads to Generate because each improvement is actually a new artifact that then gets evaluated on its own. We can come back to full cycle at the end.]
  • To build such a system requires an understanding of the social context, so we begin by studying it empirically. [Emphasize why we want these things] E.g. if we made it, would enough people use it? Would picking out the good stuff be feasible? How can the system foster quality adaptive materials?
  • So to examine these, we created a prototype authoring system that works over the web. Participants show up to the site and contribute new worked example problems. We wanted to see the quality of what people contributed and how hard it is to pick out the good stuff. We also wanted to see if we could increase the quality and diversity of the contributions by manipulating the authoring tool, so some participants were asked to target their worked example to a specific person.
  • In more detail… ARTIFACT: Worked examples improve learning, particularly when coupled with interactive tutoring. Not very different from a simple inner loop (which, by VanLehn’s Law, simple may be good enough). DOMAIN: PT was the most difficult skill in the ASSISTment data at the time. Perhaps machines could make exercise text more efficiently than volunteers, but not drawings.
  • To get an idea of what was made, here is a worked example that one of the participants submitted.
  • This is the tool they used to make them.
  • Specific hypotheses in separating wheat from chaff are that
  • Let me describe the experimental condition. Half the participants, randomly assigned, were asked to help “the student above” in understanding the Pythagorean Theorem. They would see one of 16 different student profiles at the top of the authoring tool. [go slowly through pics, read them out] [use pointer to contrast the features. Flip back and forth.]
  • We expect these differences would do these three things.
  • Even when I took away the money, people contributed at roughly the same quality levels. (wait for them to ask for the final slide) Of the 1130, we filtered out the contributions that didn’t follow the form. We call this “machine filtering” because it is a simple SQL query without human intervention. So it’s very easy to do a first-pass quality filter. Here we compare depth of participation across three types of participants: math teachers, other teachers, and amateurs.
  • Easy to filter.
  • After the machine filter, the remaining submissions were coded for quality by two geometry teachers. Here are the ratings and definitions they used. [read them] Three components of each problem were rated: the problem statement (Statement), the work shown (Work), and the explanation of the work (Explanation). Median time to rate was ~40 seconds and they agreed at alpha = 0.8. So it’s pretty easy for people to accurately rate the quality. Overall then, separating the wheat from the chaff in a production system will be feasible.
  • [read out the legend] [put screenshots into this the way I did with the Wikipedia page] [figure out what I want them to get from the examples. Scrolling back and forth is impossible. Cut down to two. Zoom into parts to talk about. Include where the three components differed and point that out. Explain the color codes.]
  • Here is the quality distribution of all the original 1130 contributions after machine filtering and human ratings. Whole here is the value from averaging the statement and the solution. See in the Filtered column: over half were filtered instantly by SQL query. The other columns are the 551 human rated. In general the statements were of higher quality than the solutions. Over 300 were worthy without modification. We see that solutions were the most difficult parts to author well. And there were effects by expertise…
  • As predicted in H2, math teachers did write the best problem statements. See the A and B groupings of significant differences. Surprisingly, their solutions weren’t any better than the amateurs’. Amateurs did slightly though not significantly better than math teachers. Comparing amateurs to teachers all together, amateurs did significantly better. The take-away from this is that non-professional educators produce valuable contributions, which can exceed those of professionals. And educational content systems can benefit from opening the channels of contribution to all comers.
  • Here are the results of the student profiles manipulation. Focus on these columns. [explain] Most remarkable is the use of gender pronouns. Pronoun attributes mean presence of that pronoun in the problem statement. The Generic condition is like a normal authoring tool: 19% of problems discuss males but only 5% discuss females. When you show a student profile that is female, female pronouns are included in 16% of those problems. Though this is still less than the 19% for males, which is the same rate even when you show a male. Clearly males are the default mindset. Another strong effect came from including the sports hobby: discussion of sports went from 9% to 24%. Same pattern for all the other social attributes in the profile (well, except favorite color).
  • To give you an idea of the tailoring. [read out loud] Here they used the 3-4-5 Pythagorean triple.
  • Another drawing on the profile details. [read out loud]
  • So when shown a student profile, people tailor their contributions to the social attributes shown. What about the cognitive attributes? We expect the difficulty measures in contributions for high skill profiles to differ from the low. And that’s in fact what we see. Comparing the reading level of the contributions, High verbal profiles were significantly different from Low, by almost a grade level. Also significantly different from Generic. Same situation with math skill, measured by the probability of building the problem around the 3-4-5 triangle, the simplest Pythagorean triple. So in the Generic condition, 21% of the problems used the integers 3, 4 and 5.
  • Here’s another example, one of my favorites. The student was High in verbal, “top of their English class”. The author customized not just the difficulty but the engagement of the content.
  • The last two hypotheses were not confirmed. No clear effect on effort. While problem statements in the profile condition were 25% longer, this may not be a good measure. Another measure, time spent, had no significant difference. Profiles had no effect on quality. There was no difference in quality between the conditions.
  • More parts of the design to study.
  • Right now I’m running a second web study of how people evaluate and improve the problems from the study described here.
  • I plan to develop a production web site in the fall for educators to create, use, improve, and discuss worked example problems. Part of this will be how to motivate contributions (in the form of original works, improvements, feedback, etc.). If the system grows enough, I look forward to classroom studies in which students are involved in making, rating and improving problems. I also intend to provide open data APIs and linking in with other projects. I think this collaborative system will best be built collaboratively.
  • The end
  • Putting the world to work for ITS
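
The notes describe “machine filtering” as a simple SQL query that drops contributions that didn’t follow the form, with no human intervention. A minimal sketch of that kind of first-pass filter is below. The table name `contributions`, its columns, and the rule “statement and solution must both be non-blank” are assumptions made for illustration; the actual schema and query used in the study are not given in the slides.

```python
import sqlite3


def machine_filter(conn):
    """Return ids of contributions whose statement and solution are both non-blank.

    Hypothetical schema: contributions(id, statement, solution).
    """
    rows = conn.execute(
        """
        SELECT id
        FROM contributions
        WHERE TRIM(COALESCE(statement, '')) <> ''
          AND TRIM(COALESCE(solution, '')) <> ''
        ORDER BY id
        """
    ).fetchall()
    return [row[0] for row in rows]


def demo():
    """Build a tiny in-memory table and run the filter on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contributions ("
        "id INTEGER PRIMARY KEY, statement TEXT, solution TEXT)"
    )
    conn.executemany(
        "INSERT INTO contributions VALUES (?, ?, ?)",
        [
            (1, "A ramp has legs of 3 ft and 4 ft. How long is the third side?",
             "3^2 + 4^2 = 25, so c = 5 ft."),
            (2, "", ""),                         # nothing at all
            (3, "   ", "c = 5"),                 # blank statement
            (4, "Find the hypotenuse.", None),   # no solution submitted
        ],
    )
    return machine_filter(conn)


if __name__ == "__main__":
    print(demo())
```

Only the first row passes in this toy data. A real filter would likely also need a rule for drawings, which the slides mention but do not specify.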

    1. 1. Putting the world to work for ITS: Open community authoring of targeted worked example problems. Aleahmad, Aleven and Kraut. 6/27/2008, ITS 2008
    2. 2. Current situation in tutoring systems • Development is very laborious (e.g. estimates of 200-300 hrs for 1 hr instruction) • Small groups with much effort per person • Distribute the development • Open source • Open content • How to make a “Wikipedia” for ITS?
    3. 3. Wikipedia not the right model
    4. 4. Towards a collaborative community • Generate: Volunteers submit new material • Evaluate: Others rate and critique • Improve: Others make the contribution better or create new ones • Use: Link resources into tutoring systems
    5. 5. Broad research questions • If you make it, will they come? • Can the wheat be separated from the chaff? • How to structure and support authoring? • For quality • For diversity to engage students – Contextualization, personalization, and provision of choices can improve student motivation and engagement in learning (Cordova and Lepper, 1996) – Personalization improves performance gains and even at start (Anand and Ross, 1987; Ku and Sullivan 2002; López and Sullivan 1992)
    6. 6. Overview of the study • Web site where people contribute worked example problems • In registering, indicated their professional status • Tested a mechanism to increase quality and diversity • Asked some authors to target to a specific person • Increase their effort? • Increase diversity/adaptivity of corpus?
    7. 7. Task • Artifact: Worked example problem – Leads to better and more efficient learning when added to interactive tutoring (McLaren et al., 2006; Schwonke et al., 2007) – Instruct and foster self-explanation (Renkl and Atkinson, 2002) – Customizability – both to the student and the interaction • Domain: Pythagorean Theorem – Most difficult skill on the Massachusetts Comprehensive Assessment System curriculum standards (ASSISTment data)
    8. 8. Whole worked example
       Problem statement: Zack and Slater want to build a bike jump. They have two parts of the ramp constructed but they need to know the length of the final piece of the jump. They have two parts of the ramp built, one is 3 ft long and the other is 4 ft long and they are constructed as shown in the diagram. What is the length of the missing section that Zack and Slater still need to construct?
       Solution steps (Work + Explanation):
       Work: 3^2 + 4^2. Explanation: The unknown is the hypotenus which is represented by c in the equation. Therefore I input both a and b into the equation first. Following the equation I square both of these numbers.
       Work: = 9 + 16 = 25. Explanation: These two numbers are added together first because of the parenthesis.
       Work: Square root of 25 is 5 and this is the solution. Explanation: To complete the equation I take the square root of 25 which is five. This problem also demonstrates the common Pythagoras triangle.
    9. 9. Authoring tool
    10. 10. Open authoring hypotheses • H1: Identifying the good from the bad contributions is easy. We expect that all contributions are good, easily fixed, or easily filtered. • H2: Math teachers submit the best contributions.
    11. 11. Student profiles • Goal of realism • Varied on social and cognitive attributes • 16 profiles • 4 Hobbies x 4 Homes • 4 realistic skill profiles distributed • 2 genders distributed
    12. 12. Profile hypotheses. Profiles in experimental condition versus generic control condition • H3: Student profiles lead to tailored contributions. • H4: Student profiles increase the effort of authors. • H5: Student profiles lead to higher quality
    13. 13. Participants and contributions • Participation URL posted on web sites (educational and otherwise) offering $4-12 • 1427 people registered, of which 570 used the tool to submit 1130 contributions • After machine filtering, 281 participants were left having submitted 551 contributions
        Participation          Math teachers   Other teachers   Amateurs
        Registered             131             170              1126
        Also contributed       70              72               428
        Also passed vetting    26              35               220
    14. 14. Machine filtered. Some have just a worthless drawing. Or nothing at all.
    15. 15. Quality ratings. Human experts rated the machine vetted submissions.
        Numerical value (rating category): definition
        0 (Useless): No use in teaching and it would be easier to write a new one than improve this one.
        1 (Easy fix): Has some faults, but they are obvious and can be fixed easily, in under 5 minutes.
        2 (Worthy): Worthy of being given to a student who matches on the difficulty and subject matter. Assume that the system knows what’s in the problem and what is appropriate for each student, based on their skills and interests.
        3 (Excellent): Excellent example to provide to some student. Again, assume that the system knows what’s in the problem and what is appropriate for each student, based on their
    16. 16. Quality rating examples • Excellent statement with poor solution (1124) • Worthy statement with excellent solution (337)
    17. 17. Open authoring
    18. 18. Quality of pool
    19. 19. Quality by contributor expertise
        Statement quality
        Teacher status   Sign. diffs   Mean quality   Std Err
        Math teacher     A             1.80           0.12
        Other teacher    B             1.54           0.09
        Not teacher      B             1.48           0.09
        Solution quality
        Teacher status   Sign. diffs   Mean quality   Std Err
        Math teacher     A B           0.70           0.10
        Other teacher    B             0.53           0.08
        Not teacher      B             0.76           0.03
    20. 20. Student profiles
    21. 21. Tailoring to social attributes. Probabilities of authoring matching an attribute.
        Columns: GENERIC (G); with profiles not mentioning attribute (N); with profiles mentioning attribute (M)
        Attribute        G     N     M     F-test (G-M)   F-test (N-M)
        Female pronoun   5%    4%    16%   9.68*          12.82**
        Male pronoun     19%   14%   19%   0.004          1.19
        Sports word      9%    9%    24%   18.01**        11.89**
        TV word          4%    4%    10%   8.36*          2.63†
        Music word       2%    2%    9%    6.92*          8.93**
        Home word        14%   n/a   20%   3.60*          n/a
        †p<.10 *p<.05 **p<.001
    22. 22. For profile with a home outside town
    23. 23. For profile who lives in tall apartment building
    24. 24. Tailoring to cognitive attributes. Correspondence of verbal and math skill levels with the authoring interface.
        Verbal skill in profile
        Verbal skill shown   Sign. diffs   Mean reading level of contribution   Std Err
        High                 A             3.78                                 0.24
        Medium               A B           3.56                                 0.32
        Low                  B             2.93                                 0.33
        GENERIC              B             3.20                                 0.16
        General math skill in profile
        Math skill shown     Sign. diffs   Probability of using 3-4-5 triangle   Std Err
        High                 A             16%                                   0.05
        Medium               A B           26%                                   0.05
        Low                  B             27%                                   0.04
        GENERIC              A B           21%                                   0.03
    25. 25. Shakespeare for profile in “top of English class”
    26. 26. Effects of profiles
        On effort • Problem statements in profile condition were 25% longer • No significant difference in time spent (median 5 minutes each on statement and solution)
        On quality • No main effect of profiles on quality • No interaction with teacher status either
    27. 27. Conclusions
    28. 28. Recap of Hypotheses
        Hypothesis, short answer, long answer:
        1. Quality control is easy. Yes. Filtering trivial; rating by experts takes less than a minute.
        2. Math teachers contribute the best worked examples. Partly. Amateurs and non-math teachers wrote okay problem statements and amateurs wrote better solutions.
        3. Profiles lead to tailoring. Yes. Every aspect of profiles was tailored to.
        4. Profiles increase effort. Inconclusive. A quarter longer problem statement, but no difference in time.
        5. Profiles lead to higher quality contributions. No. No difference in machine filtering or human rated quality.
    29. 29. Current and future work • Generate: Volunteers submit new material • Evaluate: Others rate and critique • Improve: Others make the contribution better or create new ones • Use: Link resources into tutoring systems
    30. 30. Current and future work • Generate: Volunteers submit new material • Evaluate: Others rate and critique • Improve: Others make the contribution better or create new ones • Use: Link resources into tutoring systems
    31. 31. Current and future work • Generate: Volunteers submit new material • Evaluate: Others rate and critique • Improve: Others make the contribution better or create new ones • Use: Link resources into tutoring systems
    32. 32. Acknowledgements • Thanks to ASSISTment project, Ken Koedinger and Sara Kiesler for data and feedback • Work supported by IES and NSF • It’s going to take a lot of connected work to build a scalable shared ITS for the world • Let’s talk more about how • http://OpenEducationResearch.org
    33. 33. Gratis participants • Still 93 submissions from 92 participants • Of these, 38 submissions from 21 participants pass machine vetting • 41% pass rate of machine vetting compared to 49% rate in experiment • Not significantly different by Fisher’s exact test (p=0.16)
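
The last slide compares machine-vetting pass rates with Fisher’s exact test. As a rough illustration of how such a comparison can be computed, here is a small pure-Python sketch of a two-sided Fisher’s exact test on a 2x2 table. The table used in the example (38 of 93 gratis submissions passing versus 551 of 1130 experiment submissions passing, treated as two separate groups) is an assumption pieced together from the slides; the slides do not state the exact counts that went into the test, so the result need not reproduce the reported p=0.16.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * tolerance
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Assumed counts: rows are (passed, failed) machine vetting.
    gratis = (38, 93 - 38)
    experiment = (551, 1130 - 551)
    print(fisher_exact_two_sided(*gratis, *experiment))
```

Summing all tables at most as likely as the observed one is one common definition of the two-sided p-value; other conventions exist and can give slightly different numbers.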