
Measuring Learning During Search - ACM SIGIR CHIIR 2019

Slides from my talk presenting our paper "Measuring Learning During Search: Differences in Interactions, Eye-Gaze, and Semantic Similarity to Expert Knowledge" at ACM SIGIR CHIIR 2019, Glasgow.
https://doi.org/10.1145/3295750.3298926

  1. 1. NILAVRA BHATTACHARYA, JACEK GWIZDKA School of Information, The University of Texas at Austin ACM SIGIR CHIIR 2019 • GLASGOW, SCOTLAND, UK? EU? MEASURING LEARNING DURING SEARCH Differences in Interactions, Eye-Gaze, and Semantic Similarity to Expert Knowledge
  2. 2. Why is the sky blue? The sky is blue because … (“learning”, or change in knowledge) Big Idea: to measure this knowledge-change, and (eventually) infer when it is happening. Benefits: can be extended to a wide variety of fields, independent of topic and content, e.g. online learning environments, which will become more popular.
  3. 3. Overview: 1. Introduction & Background 2. Method 3. Measures 4. Results 5. Summary
  4. 4. 1.1 What is Learning? Change in verbal knowledge, from before to after a search session. Image: Revised Bloom’s Taxonomy, http://thepeakperformancecenter.com/educational-learning/thinking/blooms-taxonomy/blooms-taxonomy-revised
  5. 5. 1.2 Measuring Learning Existing Methods of Assessing Knowledge-Change: • asking explicit fact-checking questions – can be disruptive for web-searching • SVT: Sentence Verification Technique – requires creating specific questions for each document consumed (Freund et al., 2016) • (Automated) Essay Scoring – requires a training set of carefully hand-scored essays (Yang et al., 2002) • concept-maps and mind-mapping – difficult to score for non-experts • common drawbacks in the context of online information search: – topic specific – time consuming to measure – difficult to scale up
  6. 6. 1.3 Prior Work • Goal: implicit measurement of learning or knowledge-gain. Implicit Measures: • Cole et al. (2013): eye gaze patterns can assess differences in users’ domain knowledge level (for text search). – behavioural features are topic-independent predictive cues of domain knowledge • Collins-Thompson et al. (2016): diversity in search queries is an indicator of increased knowledge gain. • Vakkari (2016): suggested a set of predictors for knowledge-change during search.
  7. 7. 1.3.1 Prior Work Gadiraju et al. (CHIIR 2018): • topic-specific pre- and post-tests involving True/False questionnaires – may not be generalizable for all topics – exposes users to search-topic and possible answers – correct answer for multiple-choice questions can be selected by guesswork Image: Gadiraju, U., Yu, R., Dietze, S., & Holtz, P. (2018). Analyzing knowledge gain of users in informational search sessions on the web. CHIIR ’18
  8. 8. 1.3.2 Prior Work Ghosh et al. (CHIIR 2018): • users were asked to self-rate their perceived change in knowledge – subjective – may not reflect true change in knowledge Image: Ghosh, S., Rath, M., & Shah, C. (2018). Searching as Learning: Exploring Search Behavior and Learning Outcomes in Learning-related Tasks. CHIIR ’18
  9. 9. 1.4 Research Aims • explore knowledge-change measures that – do not require domain-specific comprehension tests – do not expose users to the search-topic before the actual search begins – attempt to measure a searcher’s knowledge-change, minimizing guessing and subjective differences • investigate differences in search behaviour and gaze-patterns of users showing low versus high knowledge-change
  11. 11. 2.1 Experimental Design • Eye-tracking user study (n=30; 16 females) • Within-subjects design • Searched for health-related information on the web • Participants were pre-screened for - non-expert topic familiarity - uncorrected eye-sight - proficiency in online searching
  12. 12. 2.2 Task Description • Two search tasks, on health-related topics, using the simulated work-task approach (Borlund, 2003) – tried to trigger realistic information-need in participants (e.g., helping a cousin, and a friend) • Topics: – Vitamin A – Hypotension • Each task had 4 questions from multiple facets – e.g. for Vitamin A, participants had to find: • recommended dosage • health benefits • consequences of excess and deficiency • food sources. Section 3.2 of our paper contains the full texts of the task prompts. Borlund, P. (2003). The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research, 8(3).
  13. 13. 2.3 Procedure • Memory-span test (Working Memory Capacity, WMC) • Online health literacy test (eHealth Literacy Scale, eHEALS) • Training task to familiarize with interface • Search tasks (in counterbalanced order)
  14. 14. 2.3 Procedure Figure: components of a search task, labelled a to e: Pre-task Knowledge (free-text), Customized Google SERP, CONTENT pages, Bookmarking, Note-taking, Post-task Knowledge (free-text)
  15. 15. 2.3 Custom Google SERP • results retrieved in background from Google • 7 results per page – increases font-size and visual angle for proper eye-tracking • no ads
  16. 16. 2.3 Bookmarking & Note-taking Figure panels: Bookmarking, Note-taking
  17. 17. 2.4 Procedure • Memory-span test (Working Memory Capacity, WMC) • Online health literacy test (eHealth Literacy Scale, eHEALS) • Training task to familiarize with interface • Search tasks (in counterbalanced order) • Perceived workload test after each task (NASA-TLX)
  19. 19. 3 Measures • Pre- and post-task prompts – Pre-task: “Think of what you already know on the topic of this search and list as many phrases or words as you can that come to your mind.” – Post-task: “Now that you have completed this search task, think of the information that you found and list as many words or phrases as you can on the topic of the search task.” – the difference between the two responses is the change in knowledge • Aim: to measure this change, using implicit-feedback measures • Challenge: user input is open-ended text, via free-recall from memory (no time-limit). This is the key difference of our study from prior works (Gadiraju et al., 2018; Yu et al., 2018; Ghosh et al., 2018).
  20. 20. 3 Measures • Knowledge Change (KC) – simple – sophisticated (using semantic similarity) • Eye-tracking (ET) • Search Interactions (SI) • Unit of analysis: <user, task> pair
  21. 21. 3.1.1 KC Measures - Simple $KC\_Simple = \frac{items_{post} - items_{pre}}{items_{post}}$ where items = words and phrases entered by users before and after each task, separated by ENTER key presses (“\n”)
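As a rough illustration of the simple measure on this slide, the sketch below counts ENTER-separated items in the pre- and post-task free-text responses and applies (items_post − items_pre) / items_post. The function names, the choice to ignore blank lines, and the sample responses are assumptions made for illustration; they are not taken from the paper.

```python
def count_items(free_text: str) -> int:
    """Count items in a free-recall response, one item per line.

    Assumption: blank lines (repeated ENTER presses) are not counted.
    """
    return sum(1 for line in free_text.splitlines() if line.strip())


def kc_simple(pre_text: str, post_text: str) -> float:
    """Return (items_post - items_pre) / items_post."""
    items_pre = count_items(pre_text)
    items_post = count_items(post_text)
    if items_post == 0:
        # Assumption: the slides do not say how an empty post-task
        # response is handled, so this sketch refuses to guess.
        raise ValueError("post-task response contains no items")
    return (items_post - items_pre) / items_post


# Hypothetical responses for the Vitamin A task (made-up data).
pre_answers = "carrots\ngood for eyes"
post_answers = "retinol\nbeta-carotene\nnight blindness\nliver toxicity"

print(kc_simple(pre_answers, post_answers))
```

Because the denominator is the post-task count, a user who lists many more items after searching approaches a value of 1, while a user who lists fewer items after searching gets a negative value.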
  22. 22. 3.1.2 KC Measures - Sophisticated Figure: Expert Knowledge (or “Correct” Answers), User’s Pre-task answers, User’s Post-task answers; knowledge change is the movement from the pre-task to the post-task answers
  23. 23. 3.1.2 KC Measures - Sophisticated Step 1: Curating expert knowledge vocabulary: – crowdsourced answers to each question from the search task (MTurk) . . . – answers were cleaned and verified by a medical doctor (expert) – final vocabulary size: • 115 phrases / words for Task 1 • 105 phrases / words for Task 2
  24. 24. 3.1.2 KC Measures - Sophisticated Step 2: Measuring semantic similarity between texts. Step 2(a): Turn natural text into numbers with a sentence embedding: "user's pre-task answers" → [0.3, 5.6, 0.7, …]; "user's post-task answers" → [0.7, 1.2, 0.1, …]; "answers from expert" → [0.9, 3.6, 0.5, …]. Universal Sentence Encoder (image: https://tfhub.dev/google/universal-sentence-encoder/2): • encoder of greater-than-word length text: phrases, sentences, short paragraphs • trained on a variety of large text corpora: Google News, entire English Wikipedia, etc.
  25. 25. 3.1.2 KC Measures - Sophisticated Step 2: Measuring semantic similarity between texts. Step 2(b): Measure distance between vectors: $\mathrm{sim}(\vec{u}, \vec{v}) = 1 - \arccos\left(\frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \, \|\vec{v}\|}\right) / \pi$, where $\vec{u}$ is the vector of a user's answers and $\vec{v}$ = [0.9, 3.6, 0.5, …] is the expert’s knowledge vector
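The similarity in Step 2(b) is one minus the angle between the two vectors divided by π. The sketch below is a minimal pure-Python version of that formula; the function name `angular_similarity` and the clamping of the cosine against rounding error are choices made for this sketch, and the short vectors are toy values rather than real sentence embeddings.

```python
import math
from typing import Sequence


def angular_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - arccos(cosine(u, v)) / pi.

    The result is 1.0 for vectors pointing the same way, 0.5 for
    orthogonal vectors and 0.0 for opposite vectors.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("similarity is undefined for a zero vector")
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return 1.0 - math.acos(cosine) / math.pi


# Toy 3-dimensional vectors (made up), standing in for embeddings.
post_task = [0.7, 1.2, 0.1]
expert = [0.9, 3.6, 0.5]
print(angular_similarity(post_task, expert))
```

Unlike raw cosine similarity, this score changes linearly with the angle between the vectors, which spreads out values for vectors that are already close together.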
  26. 26. 3.1.2 KC Measures - Sophisticated Figure: Expert Knowledge (or “Correct” Answers), User’s Pre-task answers (initial knowledge state), User’s Post-task answers (final knowledge state); knowledge change is the movement from the initial to the final knowledge state
  27. 27. 3.1.2 KC Measures - Sophisticated • How to measure change between two numbers? $KC\_Sem\_Diff = \mathrm{sim}(post\_task, expert) - \mathrm{sim}(pre\_task, expert)$ and $KC\_Sem\_Ratio = \frac{\mathrm{sim}(post\_task, expert)}{\mathrm{sim}(pre\_task, expert)}$
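Given the two similarity scores, sim(pre_task, expert) and sim(post_task, expert), the two semantic measures on this slide reduce to a difference and a ratio. The sketch below assumes the similarities have already been computed; the function names and the example scores are hypothetical.

```python
def kc_sem_diff(sim_pre: float, sim_post: float) -> float:
    """sim(post_task, expert) - sim(pre_task, expert)."""
    return sim_post - sim_pre


def kc_sem_ratio(sim_pre: float, sim_post: float) -> float:
    """sim(post_task, expert) / sim(pre_task, expert)."""
    if sim_pre == 0.0:
        # Assumption: the slides do not cover a zero pre-task
        # similarity, so this sketch raises instead of guessing.
        raise ValueError("pre-task similarity is zero")
    return sim_post / sim_pre


# Hypothetical similarity scores for one <user, task> pair.
sim_pre_expert = 0.5
sim_post_expert = 0.75

print(kc_sem_diff(sim_pre_expert, sim_post_expert))
print(kc_sem_ratio(sim_pre_expert, sim_post_expert))
```

The difference treats equal absolute gains the same regardless of the starting point, whereas the ratio rewards the same gain more when the initial similarity was low.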
  28. 28. 3.2 Eye-tracking (ET) • we used “reading” eye-fixations only (Gwizdka, 2014): reading occurs on relevant content, scanning on irrelevant content – fixation count (fix_n) and duration (fix_dur_sum, fix_dur_avg) – length of reading sequences (rseq_len) – regression count (regr_n) and length (regr_len). Gwizdka, J. (2014). Characterizing Relevance with Eye-tracking Measures. IIiX ’14
  29. 29. 3.3 Search Interactions (SI) • Webpage based: – count of pages visited (pg_n) • Search-query based: – count of queries (query_n) – count of new queries in query-reformulations (qr_new_n) – how “specialized” were the words used in queries (q_word_freq) • "cure for low blood pressure" (less specialized) • "mayoclinic hypotension treatment" (more specialized) • Table 1 in our paper describes how to compute all the measures.
  31. 31. 4.1 Data Analysis Figure: each KC measure (KC_Simple, KC_Sem_Ratio, KC_Sem_Diff) yields a LO group and a HI group, and the ET and SI measures of the two groups are compared. Question: do LO and HI groups differ significantly in terms of their Eye-tracking (ET) and Search Interaction (SI) measures? • Quasi-independent Vars: – Knowledge Change (KC) groups (LO and HI) • Dependent Vars: – Eye-tracking (ET) – Search Interactions (SI) • Statistical Test: – Mann-Whitney U • Group-membership was fairly consistent: – 2 / 49 mismatches between _Ratio and _Diff – 9 / 49 mismatches between _Simple and _Sem
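To make the group comparison concrete, the sketch below computes the Mann-Whitney U statistic for one measure across a LO and a HI group, using average ranks for ties. It is a hand-rolled illustration in pure Python, not the authors' analysis code: it stops at the U statistic, and a statistics package (for example `scipy.stats.mannwhitneyu`) would normally be used to obtain a p-value. The fixation counts are made-up numbers.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def mann_whitney_u(lo: Sequence[float], hi: Sequence[float]) -> Tuple[float, float]:
    """Return (U_lo, U_hi) computed from the rank sum of the LO group."""
    n_lo, n_hi = len(lo), len(hi)
    ranks = average_ranks(list(lo) + list(hi))
    rank_sum_lo = sum(ranks[:n_lo])
    u_lo = rank_sum_lo - n_lo * (n_lo + 1) / 2
    u_hi = n_lo * n_hi - u_lo
    return u_lo, u_hi


# Made-up fixation counts for a handful of <user, task> pairs.
fix_n_lo = [120.0, 135.0, 150.0, 160.0]
fix_n_hi = [90.0, 100.0, 125.0]

print(mann_whitney_u(fix_n_lo, fix_n_hi))
```

Here U_lo counts the (LO, HI) pairs in which the LO value is larger, with ties counted as one half, so a U_lo close to n_lo × n_hi means the LO group tends to have higher values on that measure.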
  32. 32. 4.2.1 ET Measures - Fixations • LO group had higher (!) eye-tracking fixation-measures than HI group: – fixated more on CONTENT pages (fix_n_content_avg, .05 ≤ p ≤ .1) – fixated longer in total (fix_dur_content_sum, p < .01) and on average (fix_n_content_avg) • Yu et al. (SIGIR 2018) similarly found: – total, average, and max time spent on webpages have highest predictive power for knowledge-gain prediction
  33. 33. 4.2.1 ET Measures - Movement • Again, LO group differed significantly by having: – longer reading sequences (rseq_n); higher probability of reading (pRR_serp) – longer (regr_len) and more frequent (regr_n) backward regressions • Eye-tracking measures show LO group put more effort into reading, yet our Knowledge-Change measures reflect that they learnt less
  34. 34. 4.2.2 SI Measures • LO and HI users entered a similar number of search queries – LO group entered fewer new queries in reformulations (qr_new_n) – LO group used more common (or less specialized) words in queries (q_words_freq) • Yu et al. (SIGIR 2018) similarly observed: – count of unique terms in queries was the only query-related feature that showed predictive power
  35. 35. 4.3 Other Measures • HI group reported higher mental workload (NASA_TLX) • LO and HI groups did not have any significant differences in – eHealth literacy (knowledge, comfort, and skills at finding, evaluating, and applying electronic health information) – working-memory capacity – number of webpages visited • Yu et al. (SIGIR 2018) similarly illustrate: – counts of webpages visited are very weak predictors of knowledge-gain (Fig 1 of Yu et al. (2018): feature importance of random forest model).
  36. 36. 4.4 Final Knowledge State (FKS) Figure: final knowledge state (FKS) = sim(post_task, expert), i.e. the similarity of post-task answers to expert knowledge (post_exp_sim); ET and SI measures are compared between LO-FKS and HI-FKS groups. • LO-FKS group: – spent longer time on reading SERPs (pRR_serp) – opened fewer CONTENT pages (pg_content_n); thus found fewer relevant CONTENT pages (pg_content_rel_n) • similar phenomenon observed by Gwizdka (CHIIR 2017) and Collins-Thompson et al. (2016) – reported lower mental workload after task (NASA_TLX)
  38. 38. 5.1 Takeaways Groups: LO = Low Knowledge-Change (KC); LO-FKS = Low Final-Knowledge-State (FKS). • LO group read more, yet they learnt less – possibly due to difficulty in acquiring information • LO-FKS group spent more time in reading SERPs – yet they opened fewer relevant search results • LO group used less specialized words in their queries • LO group reported lower mental workload after each task • No significant differences in – total number of pages visited – eHealth Literacy Score – Working Memory Capacity
  39. 39. 5.2 In terms of Research Aims • explore knowledge-change measures that – do not require domain-specific comprehension tests – do not expose users to the search-topic before the actual search begins • we introduce a topic-independent, free-recall based method of knowledge assessment – expert vocabulary can be curated from online knowledge bases (e.g. Wikipedia) – attempt to measure a searcher’s knowledge-change, while minimizing guessing and subjective differences • we used semantic similarity of user-responses to expert-knowledge to measure knowledge-change – advances in measuring semantic-similarity will help in this direction • investigate differences in search behaviour and gaze-patterns of users showing low versus high knowledge-change – results show Eye-tracking (ET) and Search-Interaction (SI) measures differ significantly with varying levels of knowledge-change => ET & SI: good candidate measures of verbal-learning
  40. 40. 5.3 Limitations & Future Work • Limitations: – only 2 search-tasks, of similar nature (health information search) – data-analysis at task-level (not participant level) – relatively uniform group of participants (young-adult college students) – short time-frame • Future Directions: – wider range of search tasks – more diverse participants – additional individual-difference tests – multiple-session study (to assess knowledge-change over longer period of time)
  41. 41. 5.4 Summary Figure: Verbal Knowledge Change shown together with the measures examined: specialized words in queries; NASA-TLX mental workload; eye-tracking measures; search interactions (webpage counts, durations); Working Memory Capacity; eHealth Literacy Score
  42. 42. THANK YOU. Questions? Acknowledgements: Student Travel Grant; Career Award; expert-knowledge curation: Dr. Andrzej Kahl; crowdsourcing and data collection: Yinglong Zhang
  43. 43. References • Collins-Thompson, K., Rieh, S. Y., Haynes, C. C., & Syed, R. (2016). Assessing learning outcomes in web search: A comparison of tasks and query strategies. CHIIR ’16 • Gadiraju, U., Yu, R., Dietze, S., & Holtz, P. (2018). Analyzing knowledge gain of users in informational search sessions on the web. CHIIR ’18 • Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., & Dietze, S. (2018). Predicting user knowledge gain in informational search sessions. SIGIR ’18 • Ghosh, S., Rath, M., & Shah, C. (2018). Searching as Learning: Exploring Search Behavior and Learning Outcomes in Learning-related Tasks. CHIIR ’18 • Gwizdka, J. (2014). Characterizing Relevance with Eye-tracking Measures. IIiX ’14 • Cole, M. J., Gwizdka, J., Liu, C., Belkin, N. J., & Zhang, X. (2013). Inferring user knowledge level from eye movement patterns. Information Processing & Management, 49(5), 1075-1091. • Gwizdka, J. (2017). I can and so I search more: effects of memory span on search behavior. CHIIR ’17
  44. 44. References • Borlund, P. (2003). The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research, 8(3). • Wildemuth, B. M. (2004). The effects of domain knowledge on search tactic formulation. Journal of the American Society for Information Science and Technology, 55(3), 246-258. • Vakkari, P. (2016). Searching as learning: A systematization based on literature. Journal of Information Science, 42(1), 7-18. • Cer, D., Yang, Y., Kong, S. Y., Hua, N., Limtiaco, N., John, R. S., ... & Sung, Y. H. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175. • Franz, A., & Brants, T. (2006). All our n-gram are belong to you. Google Machine Translation Team, 20. • Freund, L., Kopak, R., & O’Brien, H. (2016). The effects of textual environment on reading comprehension: Implications for searching as learning. Journal of Information Science, 42(1), 79-93. • Yang, Y., Buckendahl, C. W., Juszkiewicz, P. J., & Bhola, D. S. (2002). A review of strategies for validating computer-automated scoring. Applied Measurement in Education, 15(4), 391-412. • Francis, G., MacKewn, A., & Goldthwaite, D. (2004). CogLab on a CD. Wadsworth Publishing Company.
