Ordinary Search Engine Users assessing Difficulty, Effort,
and Outcome for Simple and Complex Search Tasks
Georg Singer
Ulrich Norbisrath
Institute of Computer Science, University of Tartu, Estonia
Dirk Lewandowski
Department of Information, Hamburg University of Applied Sciences, Germany
IIiX 2012, Nijmegen
23 August, 2012
Agenda
1.  Introduction
2.  Definitions
3.  Research questions
4.  Methods
5.  Results
6.  Discussion and conclusion
Introduction
•  People use search engines for all kinds of tasks, from simply looking up trivia to planning their holiday trips. More complex tasks can take much longer than expected.
•  Disparity between expected and real effort for such tasks
•  Definition of a complex search task (Singer, Norbisrath & Danilov, 2012):
–  Requires at least one of the elements aggregation (finding several documents on a known aspect), discovery (detecting new aspects), and synthesis (synthesizing the found information into a single document).
–  Complex tasks typically require going through those steps multiple times.
Motivation
•  We observed that users are on the one hand highly satisfied with search
engines, but on the other hand, everybody knows many cases where “the
search engine failed”.
•  Users’ dissatisfaction in terms of unsuccessful queries might partly be
caused by users not being able to judge the task effort properly and
therefore their experience not being in line with their expectations.
•  There is little research, based on a reasonably large and diverse sample, that examines users carrying out complex search tasks and their ability to judge task effort.
Motivation
•  Conduct a study with a reasonable sample of ordinary Web search engine
users
•  Collect self-reported measures of task difficulty, task effort and task
outcome, before and after working on the task.
Definitions
§  A search task is complex if it requires at least one of the elements
aggregation, discovery and synthesis.
§  A search task is difficult if a lot of cognitive input is needed to carry out
the task.
§  A search task requires increased effort if the user needs either
§  more cognitive effort to understand the task and formulate queries
(time effort),
§  or more mechanical effort (number of queries, number of pages
visited, browser tabs opened and closed).
Research questions
RQ1: Can users assess difficulty, effort and task outcome for simple search tasks?
RQ2: Can users assess difficulty, effort and task outcome for complex search tasks?
RQ3: Are there significant performance differences between assessing simple and
complex search tasks?
RQ4: Does the users’ ability to judge if the information they have found is correct or not
depend on task complexity?
RQ5: Is there a correlation between the overall search performance (ranking in the
experiment) and the ability to assess difficulty, time effort, query effort, and task outcome
for complex tasks?
RQ6: Does the judging performance depend on task complexity or simply the individual
user?
Methods
•  Laboratory study with 60 participants, conducted in August 2011 in Hamburg, Germany
•  Each user was given 12 tasks (6 simple, 6 complex)
•  Browser interactions were collected using the Search Logger plugin for
Firefox (http://www.search-logger.com).
•  Pre- and post-questionnaires for each task
•  Although tasks were presented in a certain order (switching between simple
and complex), participants were able to switch between tasks as they liked.
•  Users were allowed to use any search engine (or other Web resource) they
liked.
Research design
Study flow: pre-task questionnaire → work on task independently → post-task questionnaire → objective results assessment; browser interaction data was collected throughout.

Pre-task questionnaire:
1. This task is easy
2. It will take me less than five minutes to complete the task
3. I will need fewer than five queries to complete the task
4. I will find the correct information

Post-task questionnaire:
1. This task was easy
2. It took me less than five minutes to complete the task
3. I needed fewer than five queries to complete the task
4. I have found the correct information

Objective results assessment:
1. Result is correct
2. Result is partly correct
3. Result is wrong
User sample
•  60 users
•  Recruitment followed a demographic
structure model
•  A sample of that size cannot be representative, but it is a vast improvement over the samples usually used (e.g., self-selection, students)
•  Data from 4 users was corrupted and therefore not analysed
Table 1: Demography of user sample
Age span    Female    Male    Total
18-24       5         4       9
25-34       9         7       16
35-44       7         8       15
45-54       8         8       16
55-59       3         1       4
Total       32        28      60
Tasks
•  Simple vs. complex tasks (examples)
–  (S) When was the composer of the piece “The Magic Flute” born?
–  (S) When and by whom was penicillin discovered?
–  (C) Are there any differences regarding the distribution of religious
affiliations between Austria, Germany, and Switzerland? Which ones?
–  (C) There are five countries whose names are also carried by chemical
elements. France has two (31. Ga – Gallium and 87. Fr – Francium),
Germany has one (32. Ge – Germanium), Russia has one (44. Ru –
Ruthenium) and Poland has one (84. Po – Polonium). Please name the
fifth country.
Results on RQ 1 (“Can users assess difficulty, effort and task
outcome for simple search tasks?”)
•  An assessment was graded
“correct” when the user’s self-
judged values were the same in
the pre-task and the post-task
questionnaire.
•  For all measures, users were
generally able to judge the simple
tasks correctly (around 90%).
Table 2: Users judging simple search tasks
                                            # of tasks    %
difficulty                incorrect         29            9.8
                          correct           266           90.2
time effort               incorrect         27            9.1
                          correct           268           90.8
query effort              incorrect         38            12.9
                          correct           257           87.1
ability to find           incorrect         16            5.4
right result              correct           279           94.6
("# of tasks" is the number of simple tasks processed by the study participants.)
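To make this grading rule concrete, the following minimal Python sketch grades one task from a pre-/post-task answer pair. The item names, the True/False answer encoding, and the handling of missing answers are illustrative assumptions, not the study's actual analysis code.

# Grading rule sketch: a judgment counts as "correct" when the pre-task and
# post-task answers to the same questionnaire item agree. Item names, the
# boolean encoding, and missing-answer handling are assumptions.

ITEMS = ["difficulty", "time_effort", "query_effort", "outcome"]

def grade_task(pre: dict, post: dict) -> dict:
    """Compare pre- and post-task answers item by item for one task.

    `pre` and `post` map item names to True/False (agreement with the
    corresponding statement, e.g. "this task is easy"). A missing answer
    yields None, i.e. the judgment is treated as invalid.
    """
    graded = {}
    for item in ITEMS:
        a, b = pre.get(item), post.get(item)
        graded[item] = None if a is None or b is None else (a == b)
    return graded

# Example: the task was expected to be easy and turned out easy,
# but took longer than the estimated five minutes.
pre_answers = {"difficulty": True, "time_effort": True,
               "query_effort": True, "outcome": True}
post_answers = {"difficulty": True, "time_effort": False,
                "query_effort": True, "outcome": True}
print(grade_task(pre_answers, post_answers))
# {'difficulty': True, 'time_effort': False, 'query_effort': True, 'outcome': True}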
Results on RQ 2 (“Can users assess difficulty, effort and task
outcome for complex search tasks?”)
•  Ratio of correctly judged tasks
drops to approx. 2/3 when
considering complex tasks.
Table 3: Users judging complex search tasks
                                            # of tasks    %
difficulty                incorrect         95            33.2
                          correct           191           66.8
time effort               incorrect         99            34.6
                          correct           187           65.3
query effort              incorrect         91            31.8
                          correct           195           68.2
ability to find           incorrect         78            27.2
right result              correct           208           72.8
Results on RQ3 (“Are there significant performance differences
between assessing simple and complex search tasks?”)
•  Comparison of the differences
between pre-task and post-task
judgments over all tasks (i.e., user-
independent)
•  Paired sample t-tests to compare
the results
•  Users are significantly better at judging simple tasks than complex tasks (for all four parameters).
For simple tasks, the total number of judgments (correct plus incorrect) should have been 56*6 = 336 (56 valid users x 6 tasks); it is slightly lower due to invalid or missing answers and unfulfilled tasks. The percentages relate the judged tasks to the total number of valid answers. A judgment was graded as correct when the user's self-judged values were the same in the pre-task and the post-task questionnaire (e.g., a task judged difficult both before and after carrying it out). For all parameters (difficulty, time effort, query effort, and result-finding ability), approximately 90% of users matched estimated and experienced values for simple tasks; they had slightly more trouble estimating time effort than query effort (in terms of estimating whether a threshold number of queries would be exceeded).
Table 4: Correctly judged tasks per dependent variable (mean values over tasks)
                           difficulty (%)   time effort (%)   query effort (%)   task outcome (%)
Simple tasks (n=295)       90±2             91±2              87±2               95±1
Complex tasks (n=286)      67±3             65±3              68±3               73±3
p-value                    <0.001           <0.001            <0.001             <0.001
For the comparison, the average correctness of difficulty judgments for simple tasks was paired with that for complex tasks; the same procedure was followed for time effort, query effort and task outcome. In the case of simple tasks, users are significantly better at estimating all four parameters (difficulty, time effort, query effort and search success), i.e. the difference between their pre-task estimate and their post-task, experience-based judgment is smaller.
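As a rough illustration of the paired sample t-tests mentioned above, the sketch below compares per-user fractions of correctly judged tasks on simple vs. complex tasks with scipy. The rates and the pairing unit (per user, for the difficulty item) are invented for illustration and are not the study's data.

# Paired comparison sketch: each pair is one hypothetical user's fraction of
# correctly judged difficulty assessments on simple vs. complex tasks.
from scipy import stats

simple_correct_rate = [1.00, 0.83, 1.00, 0.83, 1.00, 0.67, 1.00, 0.83]
complex_correct_rate = [0.67, 0.50, 0.83, 0.67, 0.67, 0.50, 0.83, 0.67]

t_stat, p_value = stats.ttest_rel(simple_correct_rate, complex_correct_rate)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")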
Results on RQ 4 (“Does the users’ ability to judge if the information
they have found is correct or not depend on task complexity?”)
•  Comparison between subjective judgments (from the questionnaires) and objective results (as judged by a research assistant).
•  Users' judging ability depends on task complexity:
–  Using the pre-task data, 87% of simple tasks are predicted correctly, while for complex tasks only 52% are predicted correctly.
–  When considering post-task data, 88% of simple tasks are predicted correctly, while for complex tasks this number is also much lower (60%).
Table 5: Fraction of users correctly judging task parameters per task
Task            difficulty (%)   time effort (%)   query effort (%)   task outcome (%)
1 (S) (n=51)    92±4             90±4              88±5               98±2
2 (S) (n=48)    81±6             85±5              73±6               88±5
3 (S) (n=47)    89±5             91±4              87±5               96±3
4 (S) (n=51)    98±2             100±0             98±3               96±3
5 (S) (n=49)    84±5             82±6              82±6               94±3
6 (S) (n=49)    96±3             96±3              94±3               96±3
7 (C) (n=47)    62±7             51±7              60±7               85±5
8 (C) (n=48)    67±7             69±7              71±7               77±6
9 (C) (n=48)    60±7             81±6              73±6               81±6
10 (C) (n=46)   72±7             67±7              72±7               52±7
11 (C) (n=49)   65±7             67±7              65±7               61±7
12 (C) (n=47)   74±6             55±7              68±7               79±6
Table 6: Judgments of expected search outcome (in pre-task questionnaire) compared to correctness of manually evaluated search results (mean values over tasks)
Task type           Correctly estimated tasks (%)
simple (n=259)      87±2
complex (n=233)     52±3
p-value             <0.001
These evaluations give us an estimate of how well users can judge that a result they found on the Internet is actually correct, and how well they can judge in advance whether they will be able to find the correct result. Table 6 shows that the ability to predict whether it is possible to find the correct information is significantly higher for simple search tasks (87%) than for complex search tasks (52%). We used paired sample t-tests to compare the results from simple and complex tasks, pairing average correctness for simple tasks with average correctness for complex tasks.
Table 7: Assessments of self-judged search results (in post-task questionnaire) compared to correctness of manually evaluated search results (mean values over tasks)
Task type           Correctly estimated tasks (%)
simple (n=259)      88±2
complex (n=230)     60±3
p-value             <0.001
Results on RQ5 (“Is there a correlation between the overall search
performance (ranking in the experiment) and the ability to assess difficulty,
time effort, query effort, and task outcome for complex tasks?”)
•  Comparison of the top performing
users and the worst performing
users (1st quartile vs. 4th quartile).
•  Good searchers are not
significantly better at judging
difficulty and effort for complex
tasks, but they are significantly
better at judging the task
outcome.
Table 8: Correct estimations of best and worst quartile for expected and experienced task parameters
                        Avg. difficulty (%)   Avg. time effort (%)   Avg. query effort (%)   Avg. task outcome (%)
1st quartile (n=67)     67±6                  64±6                   67±6                    85±4
4th quartile (n=59)     73±6                  75±6                   73±6                    64±6
p-value                 n.s.                  n.s.                   n.s.                    <0.05
Quartiles are based on the users' ranking in the experiment: users were ranked first by the number of correct answers given and then, for users with the same number of correct answers, by the number of answers containing right elements (over simple and complex tasks).
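A minimal sketch of this ranking and quartile split: users are ordered by the number of correct answers, ties are broken by answers containing right elements, and the top and bottom quarters are taken. The data layout and field names are illustrative assumptions, not the study's exact procedure.

# Quartile split sketch: rank users by correct answers, break ties by answers
# with right elements, then take the best and worst quarter of the ranking.

def best_and_worst_quartile(users):
    """`users` is a list of dicts with 'id', 'correct' and 'right_elements'."""
    ranked = sorted(users,
                    key=lambda u: (u["correct"], u["right_elements"]),
                    reverse=True)
    q = max(1, len(ranked) // 4)
    return ranked[:q], ranked[-q:]

users = [
    {"id": "u01", "correct": 10, "right_elements": 1},
    {"id": "u02", "correct": 8,  "right_elements": 3},
    {"id": "u03", "correct": 10, "right_elements": 2},
    {"id": "u04", "correct": 5,  "right_elements": 4},
    {"id": "u05", "correct": 7,  "right_elements": 2},
    {"id": "u06", "correct": 4,  "right_elements": 1},
    {"id": "u07", "correct": 9,  "right_elements": 0},
    {"id": "u08", "correct": 6,  "right_elements": 2},
]

best, worst = best_and_worst_quartile(users)
print([u["id"] for u in best])   # e.g. ['u03', 'u01']
print([u["id"] for u in worst])  # e.g. ['u04', 'u06']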
Results on RQ 6 (“Does the judging performance depend on task
complexity or simply the individual user?”)
•  There are some users who are able to correctly judge all task parameters
for both simple and complex tasks, but these are only a few.
•  The majority of users get only some parameters right (for simple and for complex tasks alike).
•  These findings hold true for all task parameters considered.
•  Judging performance is much more dependent on task complexity
than on the individual user.
Discussion
•  No problem with assessing the difficulty of simple tasks.
–  The reason might be that all users have sufficient experience with such tasks.
•  Two thirds are able to sufficiently judge the subjective difficulty of
complex tasks.
–  However, large gap between self-judged success and objective results!
•  Only 47% of submitted results for complex tasks were completely
correct.
–  The problem with complex tasks might not be users finding no results, but the
results found only seemingly being correct. This may to a certain extent explain
users’ satisfaction with search engines’ outcomes.
à More research needed on when users only think they have found correct results.
Conclusion and limitations
•  Users tend to overestimate their own search capabilities in case of complex
tasks. Search engines should offer more support on these tasks.
•  While our experiences with recruiting using a demographic structure model are good, it led to some problems in this study:
–  Users’ strategies vary greatly, which resulted in high standard errors for some
indicators.
–  A solution would be either to increase sample size, or to focus on more specific
user groups.
–  However, using specific user groups (e.g., students) might produce significant
results, but they will not hold true for other populations. After all, search engines
are used by “everyone”, and research in that area should aim at producing
results relevant to the whole user population.
Thank you.
Dirk Lewandowski, dirk.lewandowski@haw-hamburg.de
Part of this research was supported by the European Union Regional Development Fund through the
Estonian Centre of Excellence in Computer Science and by the target funding scheme SF0180008s12.
In addition the research was supported by the Estonian Information Technology Foundation (EITSA)
and the Tiger University program, as well as by Archimedes Estonia.