
Quantifying reflection

Creating a gold-standard for evaluating automated reflection detection



  1. Quantifying reflection: Creating a gold-standard for evaluating automated reflection detection. Thomas Ullmann, Fridolin Wild, Peter Scott; Knowledge Media Institute, The Open University.
  2. Outline
     • A model for reflection
     • Related work on the quantification of reflection
     • Methodology
     • Data collected
     • Results and discussion
     • Outlook
  3. Reflection is creative sense-making of the past.
  4. State of the art in quantifying reflection (Reference | Scales | Unit of analysis | Findings):
     • Dyment & O'Connell (2011) | Depth of reflection | Studies (writings) | Meta-review: five studies found low, four medium, and two high levels of reflection.
     • Wong et al. (1995) | Depth of reflection: habitual to critical | 45 students | Content analysis and interviews: 76% reflectors, 11% critical reflectors.
     • Wald et al. (2012) | Reflective to non-reflective | 93 writings | 2nd-year students, self-selected best of their reflective field notes: 30% critically reflective, 11% transformative reflective.
     • Plack et al. (2005) | Frequencies of elements and depth of reflection | 43 journals | 43% reflection, 42% critical reflection; for frequencies see next slide.
     • Hatton & Smith (1995) | Units of reflection; dialogic versus descriptive | 'Units' in the writings of 60 students | After instruction: 30% dialogic reflection; 19 reflective units on average per 8-12 pages.
     • Ross (1989) | Depth of reflection | 134 papers of 25 students | 22% highly reflective, 34% moderately reflective.
     • Williams et al. (2002) | Action classification | 56 student journals | 23% verify learning, 36% new understanding, 39% future behaviour.
  5. [Chart: Plack et al. (2005), frequencies in %]
  6. [Chart: Williams et al. (2002)]
  7. Summary of related work
     • More research on the level of reflection than on its elements
     • Wide range of results for 'level of depth'
     • Measurements at the level of students or of writings/journals
     • Mostly in the context of instructed reflective writing
     • Typically a mapping from evidence to depth/breadth
     => No re-usable instrument to measure reflection
  8. The dimensions of reflection (Ullmann, Wild, Scott (2012): Comparing automatically detected reflective texts with human judgements):
     • Documentation of insights, plans, and intentions
     • Switch of point of view
     • Argumentation and reasoning
     • Identification of a conflict
     • Awareness building over affective factors
     • Explication of self-awareness, e.g. inner monologues, description of feelings
  9. Example accounts (anonymised; dimension and type, followed by the example):
     • SA: Identification of a conflict. "[Victor] and [Morgan], you are right that I should have applied better my own learning instead of using the Uni ones."
     • CA: Reasoning. "I imagine this is probably in order to have a focus and provide enough detail rather than skim over the whole area."
     • TP: Switch of point of view. "When I am doing FRT work, I often think about how the parents view me when they know I haven't got children!"
     • OD: Documentation of an insight. "After I saw how this lifted her mood and eased her anxiety, I will remember that what we can view sometimes to be small can actually make a significant difference."
     • OD: Intention. "I would like to be involved in helping with the site, too - although I'm a novice!"
     • OD: New understanding. "This has helped me reflect on my own life and experiences whilst allowing me to empathise with others in their own circumstances; I feel proud of what I have achieved so far as the work/life/study balance is always difficult to navigate, but I'm lucky that I have a supportive family to help."
     • None. "Bye the way, Audacity is also run under the CC Attribution."
  10. Methodology: creating a gold standard
      • Corpus selection: OU LMS forum posts; 4 subjects, 2 years; mid-range-length postings
      • Sanitise: de-identification
      • Chunking (for cues): sentence level
      • Sample: 1,000 random sentences (500 personal, 500 non-personal)
      • Batching: expanded grid, 10 batches
      • Crowdsourcing: 5 raters per sentence, with control questions
      • 'Spam' filtering: valid justification given, 'gold questions' passed
      • Objectification: 'majority vote', inter-rater reliability
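The 'majority vote' objectification step can be sketched in Python as below. This is a minimal illustration, not the authors' implementation; the function name `objectify` and the input format (a mapping from sentence id to the list of rater labels) are assumptions.

```python
from collections import Counter

def objectify(ratings, threshold=3):
    """Keep only sentences where at least `threshold` raters chose the
    same category; return {sentence_id: winning_category}.
    Note: `objectify` and the input layout are illustrative assumptions."""
    gold = {}
    for sentence_id, labels in ratings.items():
        category, votes = Counter(labels).most_common(1)[0]
        if votes >= threshold:
            gold[sentence_id] = category
    return gold
```

With 5 raters per sentence, raising `threshold` from 3 towards 5 trades coverage for agreement: fewer sentences survive, but those that do are more reliable.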
  11. Crowdsourcing
      • Crowdflower: the 'virtual pedestrian area'
      • Pre-tests showed:
        – Really simple questions are needed for HITs
        – But quick answer options increase spam
        – Short texts are easier than long texts (less spam, lower cost)
        – Answer options were shuffled to avoid position artefacts
      • Check: a larger-than-usual number of raters (5+) to see how reliable the judgements are
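The answer-shuffling idea can be sketched as below; a hypothetical helper (not from the original setup) that seeds the shuffle with the unit's id, so option order varies across sentences but stays identical for every rater who sees the same sentence.

```python
import random

def shuffled_options(options, unit_id):
    """Return the answer options in a per-unit order: seeding the RNG
    with the unit's id varies the order across units (avoiding position
    artefacts) while keeping it stable for all raters of that unit.
    Illustrative helper only."""
    rng = random.Random(unit_id)
    opts = list(options)
    rng.shuffle(opts)
    return opts
```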
  12. Example questionnaire
  13. OU Forum Corpus
  14. Countries (origin of request)
      • 411 raters in total
      • Most from the USA (N=202)
      • GB (N=94)
      • India (N=45)
      • 14 other nations (N=70)
  15. Across batches (3M)
  16. Frequency distribution (3M)
  17. Frequencies by courses (3M)
  18. Inter-rater reliability
      • Raw data:
        – Baseline, control questions only: Krippendorff's α = 0.43
        – Control questions + survey data: α = 0.32
        – Survey data only: α = 0.22
      • 'Objectified' data (survey data):
        – Majority vote of 3 up to all raters agreeing: α = 0.36 (623 out of 1,000 sentences)
        – Majority vote of 4 up to all raters agreeing: α = 0.581 (301 sentences)
        – Majority vote of 5 (all agree): α = 0.98 with outliers (107 out of 1,000 sentences)
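Krippendorff's α for nominal data can be computed from a coincidence matrix; a self-contained sketch follows (the slide does not say which tool the authors used, so this is an illustration, not their code). `units` holds one list of category labels per sentence.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data. `units` is a list with one
    inner list of category labels per unit (e.g. per sentence); raters
    who skipped a unit are simply absent from its list."""
    # Coincidence matrix: every ordered pair of labels within a unit,
    # weighted by 1/(m-1) for a unit with m ratings.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    n_total = sum(coincidences.values())
    if n_total <= 1:
        return None
    # Marginal totals per category.
    n_c = Counter()
    for (a, _), w in coincidences.items():
        n_c[a] += w
    # Observed disagreement: off-diagonal mass of the coincidence matrix.
    d_o = sum(w for (a, b), w in coincidences.items() if a != b)
    # Expected disagreement if labels were paired at random.
    d_e = sum(n_c[a] * n_c[b] for a, b in permutations(n_c, 2)) / (n_total - 1)
    return 1.0 - d_o / d_e if d_e > 0 else None
```

Applied before and after the majority-vote filter, such a function makes the trade-off on this slide measurable: discarding low-agreement sentences raises α while shrinking the retained sample.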
  19. Discussion
      • An agreement of 5 of course increases the IRR
        – to α = 0.98 unfiltered, and to 1.0 when omitting 'over-answering'
        – but it reduces the data to single-category sentences
      • An agreement of 3 was deemed good enough, since the questions were single choice whereas multiple answers can be correct
      • Sentences are a reduction, but they allow zooming in on markers
      • Context: forum texts
      • Personal vs. non-personal sentences
  20. Questions? Answers?