Ordinary Search Engine Users Assessing Difficulty, Effort and Outcome for Simple and Complex Search Tasks

Transcript

  • 1. Ordinary Search Engine Users Assessing Difficulty, Effort, and Outcome for Simple and Complex Search Tasks
    Georg Singer, Ulrich Norbisrath (Institute of Computer Science, University of Tartu, Estonia)
    Dirk Lewandowski (Department of Information, Hamburg University of Applied Sciences, Germany)
    IIiX 2012, Nijmegen, 23 August 2012
  • 2. Agenda 1.  Introduction 2.  Definitions 3.  Research questions 4.  Methods 5.  Results 6.  Discussion and conclusion
  • 3. Agenda 1.  Introduction 2.  Definitions 3.  Research questions 4.  Methods 5.  Results 6.  Discussion and conclusion
  • 4. Introduction
    •  People use search engines for all kinds of tasks, from simply looking up trivia to planning their holiday trips. Complex tasks can usually take much longer than expected.
    •  There is a disparity between the expected and the real effort for such tasks.
    •  Definition of a complex search task (Singer, Norbisrath & Danilov, 2012):
       –  It requires at least one of the elements aggregation (finding several documents for a known aspect), discovery (detecting new aspects), and synthesis (synthesizing the found information into a single document).
       –  Complex tasks typically require going through those steps multiple times.
  • 5. Motivation
    •  We observed that users are, on the one hand, highly satisfied with search engines; on the other hand, everybody knows many cases where "the search engine failed".
    •  Users' dissatisfaction in terms of unsuccessful queries might partly be caused by users not being able to judge task effort properly, so that their experience is not in line with their expectations.
    •  There is little research, based on reasonably large and diverse samples, on users carrying out complex search tasks and on their ability to judge task effort.
  • 6. Motivation
    •  Conduct a study with a reasonable sample of ordinary Web search engine users.
    •  Collect self-reported measures of task difficulty, task effort, and task outcome, before and after working on the task.
  • 7. Agenda 1.  Introduction 2.  Definitions 3.  Research questions 4.  Methods 5.  Results 6.  Discussion and conclusion
  • 8. Definitions
    •  A search task is complex if it requires at least one of the elements aggregation, discovery, and synthesis.
    •  A search task is difficult if a lot of cognitive input is needed to carry out the task.
    •  A search task requires increased effort if the user needs either
       –  more cognitive effort to understand the task and formulate queries (time effort),
       –  or more mechanical effort (number of queries, number of pages visited, browser tabs opened and closed).
  • 9. Agenda 1.  Introduction 2.  Definitions 3.  Research questions 4.  Methods 5.  Results 6.  Discussion and conclusion
  • 10. Research questions
    RQ1: Can users assess difficulty, effort, and task outcome for simple search tasks?
    RQ2: Can users assess difficulty, effort, and task outcome for complex search tasks?
    RQ3: Are there significant performance differences between assessing simple and complex search tasks?
    RQ4: Does the users' ability to judge if the information they have found is correct or not depend on task complexity?
    RQ5: Is there a correlation between the overall search performance (ranking in the experiment) and the ability to assess difficulty, time effort, query effort, and task outcome for complex tasks?
    RQ6: Does the judging performance depend on task complexity or simply on the individual user?
  • 11. Agenda 1.  Introduction 2.  Definitions 3.  Research questions 4.  Methods 5.  Results 6.  Discussion and conclusion
  • 12. Methods
    •  Laboratory study with 60 participants, conducted in August 2011 in Hamburg, Germany.
    •  Each user was given 12 tasks (6 simple, 6 complex).
    •  Browser interactions were collected using the Search Logger plugin for Firefox (http://www.search-logger.com).
    •  Pre- and post-questionnaires for each task.
    •  Although tasks were presented in a certain order (switching between simple and complex), participants were able to switch between tasks as they liked.
    •  Users were allowed to use any search engine (or other Web resource) they liked.
  • 13. Research design
    Pre-task questionnaire:
       1. This task is easy.
       2. It will take me less than five minutes to complete the task.
       3. I will need fewer than five queries to complete the task.
       4. I will find the correct information.
    Work on task independently (collecting browser interaction data).
    Post-task questionnaire:
       1. This task was easy.
       2. It took me less than five minutes to complete the task.
       3. I needed fewer than five queries to complete the task.
       4. I have found the correct information.
    Objective results assessment:
       1. Result is correct.
       2. Result is partly correct.
       3. Result is wrong.
  • 14. User sample
    •  60 users
    •  Recruitment followed a demographic structure model.
    •  A sample of that size cannot be representative, but it is a vast improvement over samples usually used (i.e., self-selection, students).
    •  Data from 4 users was corrupted and therefore was not analysed.

    Table 1: Demography of user sample
    Age span   Female   Male   Total
    18-24         5       4       9
    25-34         9       7      16
    35-44         7       8      15
    45-54         8       8      16
    55-59         3       1       4
    Total        32      28      60
  • 15. Tasks
    Simple vs. complex tasks (examples):
    –  (S) When was the composer of the piece "The Magic Flute" born?
    –  (S) When and by whom was penicillin discovered?
    –  (C) Are there any differences regarding the distribution of religious affiliations between Austria, Germany, and Switzerland? Which ones?
    –  (C) There are five countries whose names are also carried by chemical elements. France has two (31. Ga – Gallium and 87. Fr – Francium), Germany has one (32. Ge – Germanium), Russia has one (44. Ru – Ruthenium) and Poland has one (84. Po – Polonium). Please name the fifth country.
  • 16. Agenda 1.  Introduction 2.  Definitions 3.  Research questions 4.  Methods 5.  Results 6.  Discussion and conclusion
  • 17. Results on RQ 1 ("Can users assess difficulty, effort and task outcome for simple search tasks?")
    •  An assessment was graded "correct" when the user's self-judged values were the same in the pre-task and the post-task questionnaire (see the grading sketch below).
    •  For all measures, users were generally able to judge the simple tasks correctly (around 90%).

    Table 2: Users judging simple search tasks ("# of tasks" is the number of simple tasks processed by the study participants)
                                        # of tasks      %
    difficulty            incorrect         29          9.8
                          correct          266         90.2
    time effort           incorrect         27          9.1
                          correct          268         90.8
    query effort          incorrect         38         12.9
                          correct          257         87.1
    ability to find       incorrect         16          5.4
    right result          correct          279         94.6
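    The grading rule above (pre-task estimate matching post-task experience) is easy to operationalize. Below is a minimal illustrative sketch in Python; the record structure and field names (pre_difficulty, post_difficulty) are assumptions for illustration, not the authors' actual analysis code.

```python
# Minimal sketch of the agreement-based grading described above.
# Data, record structure, and field names are illustrative assumptions,
# not taken from the study.
tasks = [
    {"task_id": 1, "pre_difficulty": "easy", "post_difficulty": "easy"},
    {"task_id": 2, "pre_difficulty": "easy", "post_difficulty": "difficult"},
    {"task_id": 3, "pre_difficulty": "difficult", "post_difficulty": "difficult"},
]

# A judgment counts as "correct" when the pre-task estimate matches
# the post-task experience for the same task.
correct = sum(t["pre_difficulty"] == t["post_difficulty"] for t in tasks)
share = correct / len(tasks)
print(f"{correct} of {len(tasks)} difficulty judgments correct ({share:.0%})")
```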
  • 18. Results on RQ 2 ("Can users assess difficulty, effort and task outcome for complex search tasks?")
    •  The ratio of correctly judged tasks drops to approx. 2/3 when considering complex tasks.

    Table 3: Users judging complex search tasks
                                        # of tasks      %
    difficulty            incorrect         95         33.2
                          correct          191         66.8
    time effort           incorrect         99         34.6
                          correct          187         65.3
    query effort          incorrect         91         31.8
                          correct          195         68.2
    ability to find       incorrect         78         27.2
    right result          correct          208         72.8
  • 19. Results on RQ3 ("Are there significant performance differences between assessing simple and complex search tasks?")
    •  Comparison of the differences between pre-task and post-task judgments over all tasks (i.e., user-independent).
    •  Paired sample t-tests were used to compare the results (see the sketch below).
    •  Users are significantly better at judging simple tasks on all four parameters (difficulty, time effort, query effort, and task outcome).

    Table 4: Correctly judged tasks per dependent variable (mean values over tasks)
                              difficulty (%)   time effort (%)   query effort (%)   task outcome (%)
    Simple tasks (n=295)         90±2             91±2              87±2               95±1
    Complex tasks (n=286)        67±3             65±3              68±3               73±3
    p-value                     <0.001           <0.001            <0.001             <0.001
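    A hedged sketch of the paired-comparison step mentioned above, using scipy.stats.ttest_rel; the pairing unit and all numbers below are illustrative assumptions, not values from the study.

```python
# Sketch of a paired sample t-test comparing the fraction of correctly
# judged simple vs. complex tasks. The per-participant fractions below
# are invented for illustration; they are not the study data.
from scipy import stats

simple_correct  = [1.00, 0.83, 0.83, 1.00, 0.83, 1.00, 0.67, 1.00]
complex_correct = [0.67, 0.50, 0.83, 0.67, 0.50, 0.83, 0.33, 0.67]

t_stat, p_value = stats.ttest_rel(simple_correct, complex_correct)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```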
  • 20. Results on RQ 4 ("Does the users' ability to judge if the information they have found is correct or not depend on task complexity?")
    •  Comparison between subjective judgments (from the questionnaires) and the objective result (as judged by a research assistant).
    •  Users' judging ability depends on task complexity:
       –  Using the pre-task data, 87% of simple tasks are predicted correctly, whereas for complex tasks only 52% were predicted correctly.
       –  When considering post-task data, 88% of simple tasks are predicted correctly, whereas for complex tasks this number is also much lower (60%).

    Table 5: Fraction of users correctly judging task parameters per task
    Task             difficulty (%)   time effort (%)   query effort (%)   task outcome (%)
    1 (S) (n=51)        92±4             90±4              88±5               98±2
    2 (S) (n=48)        81±6             85±5              73±6               88±5
    3 (S) (n=47)        89±5             91±4              87±5               96±3
    4 (S) (n=51)        98±2            100±0              98±3               96±3
    5 (S) (n=49)        84±5             82±6              82±6               94±3
    6 (S) (n=49)        96±3             96±3              94±3               96±3
    7 (C) (n=47)        62±7             51±7              60±7               85±5
    8 (C) (n=48)        67±7             69±7              71±7               77±6
    9 (C) (n=48)        60±7             81±6              73±6               81±6
    10 (C) (n=46)       72±7             67±7              72±7               52±7
    11 (C) (n=49)       65±7             67±7              65±7               61±7
    12 (C) (n=47)       74±6             55±7              68±7               79±6

    Table 6: Judgments of expected search outcome (in the pre-task questionnaire) compared to the correctness of manually evaluated search results (mean values over tasks)
    Task type           Correctly estimated tasks (%)
    simple (n=259)             87±2
    complex (n=233)            52±3
    p-value                   <0.001

    Table 7: Assessments of self-judged search results (in the post-task questionnaire) compared to the correctness of manually evaluated search results (mean values over tasks)
    Task type           Correctly estimated tasks (%)
    simple (n=259)             88±2
    complex (n=230)            60±3
    p-value                   <0.001
  • 21. Results on RQ5 ("Is there a correlation between the overall search performance (ranking in the experiment) and the ability to assess difficulty, time effort, query effort, and task outcome for complex tasks?")
    •  Users were ranked first by the number of correct answers given and then, in cases of ties, by answers containing the right elements (simple and complex tasks).
    •  Comparison of the top-performing users and the worst-performing users (1st quartile vs. 4th quartile); see the sketch below.
    •  Good searchers are not significantly better at judging difficulty and effort for complex tasks, but they are significantly better at judging the task outcome.

    Table 8: Correct estimations of the best and worst quartile for expected and experienced task parameters
                            Avg. difficulty (%)   Avg. time effort (%)   Avg. query effort (%)   Avg. task outcome (%)
    1st quartile (n=67)           67±6                   64±6                   67±6                    85±4
    4th quartile (n=59)           73±6                   75±6                   73±6                    64±6
    p-value                       n.s.                   n.s.                   n.s.                   <0.05
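    A minimal sketch of the ranking and quartile split described above; the user records, field names, and scores are illustrative assumptions, not the study data or the authors' code.

```python
# Sketch: rank users by search performance, then compare the judging
# performance of the best vs. the worst quartile. All values and
# field names are invented for illustration.
users = [
    {"fully_correct": fc, "right_elements": re_, "judge_score": js}
    for fc, re_, js in [
        (10, 1, 0.70), (9, 2, 0.80), (4, 5, 0.60), (7, 3, 0.90),
        (3, 6, 0.50), (11, 0, 0.70), (6, 2, 0.65), (2, 7, 0.75),
    ]
]

# Rank primarily by fully correct answers; break ties by answers that
# at least contained the right elements (as described on the slide).
ranked = sorted(users, key=lambda u: (u["fully_correct"], u["right_elements"]), reverse=True)

q = max(1, len(ranked) // 4)
best, worst = ranked[:q], ranked[-q:]

def mean(values):
    return sum(values) / len(values)

print("best quartile, avg. judging score: ", mean([u["judge_score"] for u in best]))
print("worst quartile, avg. judging score:", mean([u["judge_score"] for u in worst]))
```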
  • 22. Results on RQ 6 ("Does the judging performance depend on task complexity or simply the individual user?")
    •  There are some users who are able to correctly judge all task parameters for both simple and complex tasks, but these are only a few.
    •  The majority of users get only some parameters right (for simple as well as complex tasks).
    •  These findings hold true for all task parameters considered.
    •  Judging performance is much more dependent on task complexity than on the individual user.
  • 23. Agenda 1.  Introduction 2.  Definitions 3.  Research questions 4.  Methods 5.  Results 6.  Discussion and conclusion
  • 24. Discussion
    •  No problem with assessing the difficulty of simple tasks.
       –  The reason might be that all users have sufficient experience with such tasks.
    •  Two thirds are able to sufficiently judge the subjective difficulty of complex tasks.
       –  However, there is a large gap between self-judged success and objective results!
    •  Only 47% of submitted results for complex tasks were completely correct.
       –  The problem with complex tasks might not be users finding no results, but the results found being only seemingly correct. This may, to a certain extent, explain users' satisfaction with search engine outcomes.
    →  More research is needed on when users only think they have found correct results.
  • 25. Conclusion and limitations
    •  Users tend to overestimate their own search capabilities in the case of complex tasks. Search engines should offer more support for these tasks.
    •  While our experiences with recruiting using a demographic structure model are good, it led to some problems in this study:
       –  Users' strategies vary greatly, which resulted in high standard errors for some indicators.
       –  A solution would be either to increase the sample size or to focus on more specific user groups.
       –  However, using specific user groups (e.g., students) might produce significant results, but these will not hold true for other populations. After all, search engines are used by "everyone", and research in this area should aim at producing results relevant to the whole user population.
  • 26. Thank you.
    Dirk Lewandowski, dirk.lewandowski@haw-hamburg.de
    Part of this research was supported by the European Union Regional Development Fund through the Estonian Centre of Excellence in Computer Science and by the target funding scheme SF0180008s12. In addition, the research was supported by the Estonian Information Technology Foundation (EITSA) and the Tiger University program, as well as by Archimedes Estonia.
