Dr Max L. Wilson (http://cs.nott.ac.uk/~mlw/)
Extended Searching Sessions
and Evaluating Success
Dr Max L. Wilson
Mixed Reality Lab
University of Nottingham, UK
Friday, 10 May 13
Studying Extended Search Success In Observable Natural Sessions (SESSIONS)
Extended Searching Sessions and Evaluating Sensemaking Success
About Me
Study 1: The Real Nature of Sessions
Study 2: Evaluating Sensemaking Success
About me
MEng & PhD in Southampton
Taught in Swansea for 3 years
Moved to Nottingham April 2012
About Me
UIST 2008, JCDL 2008
My PhD
Bates, M. J. (1979a). Idea tactics. Journal of
the American Society for Information
Science, 30(5):280–289.
Bates, M. J. (1979b). Information search
tactics. Journal of the American Society for
Information Science, 30(4):205–214.
Belkin, N. J., Marchetti, P. G., and Cool, C.
(1993). Braque: design of an interface to support
user interaction in information retrieval.
Information Processing and Management, 29(3):
325–344.
My PhD
Wilson, M. L., schraefel, m. c., and White, R. W. (2009). Evaluating advanced
search interfaces using established information-seeking models. Journal of the
American Society for Information Science and Technology, 60(7):1407–1422.
Search User Interface Design
My Team
Horia Maior, Matthew Pike, Jon Hurlock, Paul Brindley, Zenah Alkubaisy
Chaoyu (Kelvin) Ye (Study 1)
Mathew Wilson (Study 2)
Extended Searching Sessions and Evaluating Sensemaking Success
About Me
Study 1: The Real Nature of Sessions
Study 2: Evaluating Sensemaking Success
People Searching the Web
Elsweiler, D., Wilson, M. L. and Kirkegaard-Lunn, B. (2011) Understanding Casual-leisure Information Behaviour. In Spink, A. and Heinstrom, J. (Eds) New Directions in Information Behaviour. Emerald Group Publishing Limited, pp 211-241.
The Search Communities
Ingwersen, P., Jarvelin, K., 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Springer, Berlin, Germany.
The IR Community
• Focused on Accuracy
• Are these results relevant?
• How many are relevant?
• Did we get all the relevant ones?
The Search Communities
The IS Community
• Focused on Success
• Did they find the right result?
• How long did they take?
• How many interactions?
Ingwersen, P., Jarvelin, K., 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Springer, Berlin, Germany.
The Search Communities
The IB Community
• Focused on Quality
• Did they do a good job?
• How did the UI affect the task?
• Was the higher-level motivating task achieved more successfully?
Ingwersen, P., Jarvelin, K., 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Springer, Berlin, Germany.
The Search Communities
“Relatively” well known
“Naively estimated”
- Study 1
“Simplistically” measured
- Study 2
Work Tasks
• Work tasks: typically considered work-led, information-intensive activities that lead to searching
• Can be out-of-work, like planning holidays or buying a car
• We’ve begun looking at motivating ‘tasks’ outside of work
Casual Leisure Work Tasks
behaviours documented so far.
4.1 Need-less browsing
Much like the desire to pass time at the television, we saw
many examples (some shown in Table 3) of people passing
time typically associated with the ‘browsing’ keyword.
1) ... I’m not even *doing* anything useful... just browsing
eBay aimlessly...
2) to do list today: browse the Internet until fasting break
time..
3) ... just got done eating dinner and my family is watching the football. Rather browse on the laptop
4) I’m at the dolphin mall. Just browsing.
Table 3: Example tweets where the browsing activity is need-less.
From the collected tweets it is clear that often the information-need in these situations is not only fuzzy, but typically absent. The aim appears to be focused on the activity, where the measure of success would be in how much they ...
Wilson, M. L. and Elsweiler, D. (2010) Casual-leisure Searching: the Exploratory Search scenarios that break our current models. In: 4th HCIR Workshop, Aug 22 2010, pp 28-31.
People Searching the Web
Elsweiler, D., Wilson, M. L. and Kirkegaard-Lunn, B. (2011) Understanding Casual-leisure Information Behaviour. In Spink, A. and Heinstrom, J. (Eds) New Directions in Information Behaviour. Emerald Group Publishing Limited, pp 211-241.
Sessions
• Traditionally examined by analysing logs for stats
• In the 90s, it was suggested sessions be split at gaps of ~25mins
- More recently at ~5mins (a minimal splitting sketch follows below)
• BUT evidence shows web use typically interleaves tasks
- AND tabs make this all much harder
• Has become a big focus of Dagstuhls/workshops
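To make the time-gap approach concrete, here is a minimal sketch (Python, not from any of the cited work) that splits a time-ordered query log into sessions at a configurable gap; the log format and the example queries are assumptions, and only the ~25-minute and ~5-minute thresholds come from this slide.

```python
from datetime import datetime, timedelta

def split_sessions(log, gap=timedelta(minutes=5)):
    """Split a time-ordered list of (timestamp, query) pairs into sessions,
    starting a new session whenever the gap between consecutive entries
    exceeds `gap` (~25 minutes in 1990s work, ~5 minutes more recently)."""
    sessions, current = [], []
    for entry in log:
        if current and entry[0] - current[-1][0] > gap:
            sessions.append(current)  # gap exceeded: close the current session
            current = []
        current.append(entry)
    if current:
        sessions.append(current)
    return sessions

# hypothetical log: the third query follows a >5 minute gap, so it starts a new session
log = [
    (datetime(2013, 5, 10, 9, 0), "session definition"),
    (datetime(2013, 5, 10, 9, 2), "search session boundaries"),
    (datetime(2013, 5, 10, 10, 30), "holiday flights"),
]
print(len(split_sessions(log, gap=timedelta(minutes=5))))  # -> 2
```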
Search Trails
• Aimed at finding common end locations for queries
• An interesting step towards sessions though
• Most involved some trail features (not just query+click)
White, Ryen W. and Steven M. Drucker. "Investigating behavioral variability in web search." In Proc. WWW 2007. ACM.
Top Sessions
as Seen by Bing
Bailey et al, User task understanding: a web search engine perspective, NII Shonan, 8 Oct 2012
Top Sessions
as Seen by Bing
Bailey et al, User task understanding: a web search engine perspective, NII Shonan, 8 Oct 2012
Top Sessions
as Seen by Bing
Bailey et al, User task understanding: a web search engine perspective, NII Shonan, 8 Oct 2012
Study 1: Investigating Extended Sessions
What on earth is happening here?
Study 1: Interview Method
Send & Preprocess History → a history artefact of approx 300 items (a hypothetical preprocessing sketch follows below)
“How would you define a session?” (10mins)
Mark out history into sessions, starting recently + create ‘cards’ of varying types of ‘sessions’ (20-30mins; 15-20 cards)
Open Card Sort + Closed Card Sort (30-50mins)
Data collected: interview recording, cards, card sorts, marked history file, log data
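Purely to illustrate the "Send & Preprocess History" step above, the sketch below shows one way a browser-history export could be trimmed into a ~300-item artefact; the CSV column names (visit_time, url, title) and the exact cut-off are assumptions for the example, not details reported in the study.

```python
import csv
from datetime import datetime

def load_history(path, limit=300):
    """Read a hypothetical browser-history CSV export and return the most
    recent `limit` visits as dictionaries (column names are assumed)."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = [
            {
                "time": datetime.fromisoformat(r["visit_time"]),
                "url": r["url"],
                "title": r["title"],
            }
            for r in csv.DictReader(f)
        ]
    rows.sort(key=lambda r: r["time"], reverse=True)  # most recent first
    return rows[:limit]  # trim to roughly the artefact size used in the interviews
```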
Study 1: Data
• Rich discussion of ~20 Sessions per participant
• Currently: 7 participants and ~120 sessions
- richly described and compared
• Aiming for: 12 participants and 200+ sessions at first
Study 1: Questions for Sessions
1) Where was this done (e.g. work vs home vs mobile)
2) With whom (collaborative?)
3) For whom (shared task?)
4) Devices involved (whether devices affect things)
5) Length of the session (how do they define long?)
6) Successful or not (for future measurement insights)
At some point we tried to learn these for each session
Study 1: A Card
Study 1: A Card
Study 1: Card Sorting
• We aimed first to let them define the dimensions
- this lets us see how they define things
- how do they self-categorise different sessions
• We then had some targeted card sorts
- for whom, duration, difficulty, importance, location
- what’s short vs long?
- what’s important vs not?
- how do people divide work vs home, etc.
Study 1: Example Card Sorts
Study 1: Preliminary Findings
• avg 21 cards per person, inc. ~8 sessions of 5+ mins
- ~4 work & ~4 leisure
• 18.6% of those extended sessions involved task switches
• avg length: 17.5 mins; avg #queries: 3.55
• short: a third said <30s, a third said <1min, a third said <30mins
• long: a third said >1 hour, a third said >5mins
Study 1: Preliminary Findings
• longest sessions: entertainment, work prep, news, shopping
• longest leisure: 22-76mins YouTube, 28mins news
• most important: work, money, urgent shopping
• least important: leisure, entertainment, free time
• most difficult: technical work prep
Study 1: Preliminary Findings
• Huge divide over where sessions start or stop
- many people considered a session to span a large break
- paused and left in tabs
• One person divided a single topical episode by phases
- and phases were sessions
- e.g. broadening/confused stage vs successful focus stage
• One person divided a single topical episode by major sources
- moved from web searching to video searching on same topic
What is a session?
Implications for where/when to measure success
Study 1: What is a session?
Single topic - changing purpose
Study 1: What is a session?
Single topic - pausing sessions
Study 1: What is a session?
Low-query extended sessions
Study 1: Other observations
• Seeing an informal relationship between who tasks are for
- and skewed importance
- including for another person, or for a group
- and slow sequential interactions (as they talk to others)
• Seeing a strong low-query correlation with entertainment
- seeing serious-leisure more similar to work tasks
• Hard tasks have high query loads
- and are related to rare or new areas
Study 1: Summary
• We’re beginning to get some real insight into real sessions
• Already identifying examples where time-splitting isn’t sufficient
- but intention changing is common
• We’re seeing possible common patterns of overlapping sessions
• We haven’t finished!
Study 2: Evaluating Sensemaking
“Simplistically” measured
- Study 2
Wilson, M. J. and Wilson, M. L. (2012) A Comparison of Techniques for Measuring Sensemaking and Learning within Participant-Generated Summaries. In: JASIST (accepted).
Study 2: “Simplistically” measured
• If learning is closed: then a quiz
- “closed” determines WHAT should be learned
- can measure recall, but also recognition if cued by the question
• If learning is open:
a) sub-topic count (integer) & topic quality (judged Likert)
b) simple count of facts (integer) and statements (integer)
• These do not measure how “good” the learning was
Study 2: Measuring “Depth” of Learning
• A theory from Education
• As learning improves
you progress up the diagram
• You begin to ‘understand’
- then critically ‘analyze’
- then ‘evaluate’ information
etc.
Image from: http://www.nwlink.com/~donclark/hrd/bloom.html
Study 2: Developed 3 Scales
• 12 participants performed 3 learning tasks
- mix of high and low prior knowledge
• 1) Write summary of knowledge, 2) Learn, 3) Write summary
• 36 pairs of pre/post summaries
- 18 high prior knowledge
- 18 low prior knowledge
Study 2: Developed 3 Scales
• Inductive Grounded Theory analysis
• 3 rounds of 6 high and 6 low pairs analysed by 2 researchers
• Validated by an external judge
• Until high Fleiss kappa scores, i.e. ‘substantial agreement’ (formula sketched below)
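For readers unfamiliar with the statistic, Fleiss' kappa can be computed from an item-by-category count matrix as in this minimal sketch (Python); the small ratings matrix is invented for illustration and is not data from the study.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j;
    every item must be rated by the same number of raters."""
    N = len(counts)                 # number of items
    n = sum(counts[0])              # raters per item
    k = len(counts[0])              # number of categories
    # proportion of all assignments that fall in each category
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # observed per-item agreement, then its mean
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P) / N
    P_e = sum(pj * pj for pj in p)  # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# three hypothetical raters scoring five summaries on a 0-3 scale
ratings = [[0, 0, 3, 0], [0, 2, 1, 0], [3, 0, 0, 0], [0, 0, 0, 3], [0, 1, 2, 0]]
print(round(fleiss_kappa(ratings), 2))  # -> 0.63 for this made-up example
```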
Study 2: Measure 1: D-Qual
scale ranging from irrelevant or useless facts (0 points) to facts that showed a level of
technical understanding (3 points). The emphasis of usefulness in this measure meant that it
was closer to the “understanding” level of Bloom’s revised taxonomy, rather than simply
“remembering”. It was important to differentiate between the two levels as many poor
summaries, as determined by the authors during the coding session, simply listed many
redundantly obvious facts (“A labrador is a dog”) rather than describing them in sentences
and summaries. For D-Qual, the judges achieved a Fleiss kappa of 0.64.
Rating Description
0 Facts are irrelevant to the subject; Facts hold no useful information or advice.
1 Facts are generalised to the overall subject matter; Facts hold little useful information or
advice.
2 Facts fulfil the required information need and are useful.
3 A level of technical detail is given via at least one key term associated with the technology
of the subject; Statistics are given.
Table 1: Quality of Facts (D-Qual).
Many of the better summaries interpreted facts into more intelligent statements. To
identify this, D-Intrp (Table 2) measured summaries in how they synthesised facts and
statements to draw conclusions and deductions (Bloom’s “analysing”) using a 3-point scale.
Measure understanding rather than remembering
Study 2: Measure 2: D-Intrp
Rating Description
0 Facts contained within one statement with no association.
1 Association of two useful or detailed facts: ‘A -> B’
2 Association of multiple useful or detailed facts: ‘A+B->C’; ‘A->B->C’; ‘A->B∴C’
Table 2: Interpretation of data into statements (D-Intrp).
D-Crit reflected Bloom’s concept of “evaluating” by identifying statements that
compared facts, or used facts to raise questions about other statements. The measurement for
D-Crit was either true (1 point) or false (0 points), as shown in Table 3. A Fleiss kappa of
0.74 was achieved.
Measure analysing capabilities
Study 2: Measure 3: D-Crit
Measure evaluating capabilities
Table 2: Interpretation of data into statements (D-Intrp).
D-Crit reflected Bloom’s concept of “evaluating” by identifying statements that
compared facts, or used facts to raise questions about other statements. The measurement for
D-Crit was either true (1 point) or false (0 points), as shown in Table 3. A Fleiss kappa of
0.74 was achieved.
Rating Description
0 Facts are listed with no further thought or analysis.
1 Both advantages and disadvantages listed; Comparisons drawn between items;
Participant deduced his or her own questions.
Table 3: Use of critique (D-Crit).
We did not produce a scale for level three of Anderson’s revised version of Bloom’s
taxonomy, “applying”, since the act of writing a summary would not involve the participant
to carry out a procedure that has been learned. This level of learning was thus not identifiable
in our corpus of summaries. Similarly, the highest level, “creating”, also goes beyond writing
Study 2: Evaluating these measures
Compare against Counting & Topic measures (a toy sketch of the Table 4 coding scheme follows below)
measure depth (‘T-Depth’), each topic was measured on a 4-point scale ranging from not
covered (0 points) to detailed focused coverage (3 points) and averaged.
As the process of learning is primarily internal it is difficult to measure it objectively.
For this reason our measures of learning focused on the difference between pre- and post-task
knowledge held by the participant.
Code Measurement Scale
D-Qual Recall of facts 0 – 3 points
D-Intrp Interpretation of data into statements 0 – 2 points
D-Crit Critique 0 – 1 point
F-Fact Number of facts Count
F-State Number of statements Count
F-Ratio Ratio of facts per statement Average
T-Count Number of topics covered (breadth of knowledge) Count
T-Depth Level of topic focus (depth of knowledge) 0 – 3 points, averaged
Table 4: Outline of coding scheme used for analysis.
5 Results
Before beginning, the data from two participants were removed from the analysis. A
first-pass sanity check over the collected summaries revealed that they had misunderstood the
tasks set. One chose to describe their own feelings and history relating to the task topic, rather
than trying to answer the task. Another described what they intended to search for in their
• Can you differentiate pre- & post- task summaries?
• Can you differentiate high & low prior knowledge?
• How long do summaries need to be?
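As a toy illustration (not the authors' analysis code), the Table 4 coding scheme could be represented and pre/post learning gains computed per participant-task pair as below; the measure names come from Table 4, but the helper function and the example scores are hypothetical.

```python
# Measures from Table 4: three depth scales (D-*), three simple counts (F-*),
# and two topic measures (T-*). Judged scores come from human coders.
MEASURES = ["D-Qual", "D-Intrp", "D-Crit", "F-Fact",
            "F-State", "F-Ratio", "T-Count", "T-Depth"]

def learning_gain(pre, post):
    """Difference between post- and pre-task summary scores for each measure."""
    return {m: post[m] - pre[m] for m in MEASURES}

# invented scores for one participant-task pair
pre  = {"D-Qual": 1, "D-Intrp": 0, "D-Crit": 0, "F-Fact": 4,
        "F-State": 2, "F-Ratio": 2.0, "T-Count": 2, "T-Depth": 1.0}
post = {"D-Qual": 2, "D-Intrp": 1, "D-Crit": 1, "F-Fact": 9,
        "F-State": 5, "F-Ratio": 1.8, "T-Count": 4, "T-Depth": 2.0}
print(learning_gain(pre, post))
```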
Study 2: Analysing summaries
Pre-task example
Study 2: Analysing summaries
Post-task example
Study 2: Results
knowledge, especially for pre-task summaries, which can possibly be explained by the fact that participants who wrote shorter summaries based on high prior knowledge are more likely to concentrate on a single topic.
All Pre-task Post-task
D-Qual U(68) = 537.5, p = 0.32 U(34) = 125, p = 0.28 U(34) = 148, p = 0.46
D-Intrp U(68) = 642, p = 0.21 U(34) = 145, p = 0.47 U(34) = 174, p = 0.16
D-Crit U(68) = 570, p = 0.47 U(34) = 140, p = 0.47 U(34) = 144.5, p = 0.49
F-Fact t(66) = -0.4, p = 0.35 t(32) = -0.75, p = 0.23 t(32) = -0.25, p = 0.4
F-State t(66) = -0.21, p = 0.42 t(32) = -0.4, p = 0.35 t(32) = -0.17, p = 0.43
F-Ratio t(66) = 0.2, p = 0.42 t(32) = 0.31, p = 0.38 t(32) = -0.04, p = 0.48
T-Count t(66) = -0.35, p = 0.36 t(32) = 0.43, p = 0.34 t(32) = -1.01, p = 0.16
T-Depth U(68) = 721, p = 0.04 * U(34) = 194.5, p = 0.04 * U(34) = 168, p = 0.21
Table 12: Comparing high and low prior knowledge in shorter summaries. * Indicates significant results.
All Pre-task Post-task
D-Qual U(68) = 390, p = 0.01 * U(34) = 89.5, p = 0.03 * U(34) = 113.5, p = 0.18
D-Intrp U(68) = 497.5, p = 0.16 U(34) = 158.5, p = 0.29 U(34) = 95, p = 0.06
D-Crit U(68) = 693.5, p = 0.08 U(34) = 189, p = 0.05 * U(34) = 154, p = 0.32
F-Fact t(66) = 1.62, p = 0.06 t(32) = 0.64, p = 0.26 t(32) = 1, p = 0.16
F-State t(66) = 1, p = 0.16 t(32) = 0.29, p = 0.39 t(32) = 0.79, p = 0.22
F-Ratio t(66) = 0.86, p = 0.2 t(32) = 0.31, p = 0.38 t(32) = 0.21, p = 0.42
T-Count t(66) = 3.44, p = 0.0005 * t(32) = 1.92, p = 0.03 * t(32) = 2.82, p = 0.004 *
T-Depth U(68) = 572, p = 0.48 U(34) = 163, p = 0.25 U(34) = 142, p = 0.48
Table 13: Comparing high and low prior knowledge in longer summaries. * Indicates significant results.
Pretty obvious - as you can see
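Tables 12 and 13 report Mann-Whitney U tests (for the ordinal judged measures) and t-tests (for the count-based measures) comparing high and low prior knowledge; below is a minimal sketch of that style of comparison using scipy, with invented score lists, and the two-sided alternative is an assumption rather than the paper's exact configuration.

```python
from scipy.stats import mannwhitneyu, ttest_ind

# Illustrative only: per-summary scores for two groups (high vs low prior knowledge).
d_qual_high = [2, 3, 2, 1, 3, 2]   # ordinal judged measure -> Mann-Whitney U
d_qual_low  = [1, 1, 2, 0, 1, 2]
u, p = mannwhitneyu(d_qual_high, d_qual_low, alternative="two-sided")
print(f"D-Qual: U = {u}, p = {p:.3f}")

f_fact_high = [9, 7, 11, 8, 10, 6]  # count measure -> independent-samples t-test
f_fact_low  = [5, 6, 8, 4, 7, 5]
t, p = ttest_ind(f_fact_high, f_fact_low)
print(f"F-Fact: t = {t:.2f}, p = {p:.3f}")
```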
Study 2: Results
• 1) Most measures could identify learning (between pre-post)
- more robust with longer summaries
the summaries and the prior knowledge held by the participant should be taken into consideration. Table 14 provides an overview of the strengths and weaknesses of each measure and recommendations are made below. While serving as a guide, readers should refer back to the full text in our results section for more detail before using them in a study.
Table 14: Overview of measure suitability (the check-mark cells are not legible in the slide image). Rows: D-Qual, D-Intrp, D-Crit, F-Fact, F-State, F-Ratio, T-Count, T-Depth. Column groups: Identifies Learning (High, Low, Short, Long, Pre, Post), Identifies Prior Knowledge (Short, Long, Pre, Post), Ignores Length.
If participants have written shorter summaries (here averaged to around 90 words) then learning is only really noticeable if those participants began with low prior knowledge, where measures such as the quality of facts (D-Qual), simple fact and statement counting (F-Fact, F-State) and topic coverage (T-Count) can be used to determine an increase of knowledge. If short summaries are written based on high prior knowledge then only simple fact and
Study 2: Results
• 2) Only some were good at identifying prior knowledge
- these required long pre-task summaries to be written
Study 2: Results
• 3) Our measures were the most robust to length of summary
- others require pushing participants beyond 200 words
Study 2: Conclusions
• We proposed a new measure based on depth of learning
- demonstrating higher levels of thinking
• This was more robust to size of written summary,
- good at long and short, while measuring learning
- able to determine if someone has existing high knowledge
• All measures did surprisingly well, for measuring learning
• Ours was most robust for determining prior knowledge level
• Future work: behaviour between good vs bad learners
Talk Summary
• Search communities are trying to move beyond simple tasks
- more than result quality, and time to target
• Currently focusing on understanding sessions
- which has primarily been splitting logs by time gaps
• Our work
1) moving beyond assumptions about sessions
2) introducing new methods to evaluate sensemaking
Talk Summary
• There’s a long way to go before search engines know what
we’re doing beyond a query (and immediate refinements)
- there’s a long way before we do
• Also - we still need to measure:
- success in decision making (like online shopping)
- success in entertainment sessions
Friday, 10 May 13

More Related Content

Similar to Understanding & Evaluating Search Sessions

Academic Book Review Format. Book Review Exam
Academic Book Review Format. Book Review ExamAcademic Book Review Format. Book Review Exam
Academic Book Review Format. Book Review Exam
Julie Gonzalez
 
Individual Analysis and Testing
Individual Analysis and TestingIndividual Analysis and Testing
Individual Analysis and Testing
Swilley Library
 
Nature and Functions of Research part 3
Nature and Functions of Research part 3Nature and Functions of Research part 3
Nature and Functions of Research part 3
MJezza Ledesma
 
EBP : what do we expect of 1st years in HE
EBP : what do we expect of 1st years in HE EBP : what do we expect of 1st years in HE
EBP : what do we expect of 1st years in HE
Academic and Research Libraries Group Yorkshire & Humberside
 
Eblip7 keynote pdf
Eblip7 keynote pdfEblip7 keynote pdf
Eblip7 keynote pdf
Denise Koufogiannakis
 
The twin purposes of guided inquiry final
The twin purposes of guided inquiry finalThe twin purposes of guided inquiry final
The twin purposes of guided inquiry final
Leonne FitzGerald
 
Introduction to information literacy part 1
Introduction to information literacy part 1Introduction to information literacy part 1
Introduction to information literacy part 1
mhayes2006
 
Pick College Essay Writing Services With Care - Research Master Essays
Pick College Essay Writing Services With Care - Research Master EssaysPick College Essay Writing Services With Care - Research Master Essays
Pick College Essay Writing Services With Care - Research Master Essays
Carla Bennington
 
Reverse instruction inquiry 2
Reverse instruction inquiry 2Reverse instruction inquiry 2
Reverse instruction inquiry 2
George Phillip
 
DeBari and Julin slides
DeBari and Julin slidesDeBari and Julin slides
DeBari and Julin slides
SERC at Carleton College
 
Stat 1040, Recitation packet 11. A 1999 study claimed that.docx
Stat 1040, Recitation packet 11. A 1999 study claimed that.docxStat 1040, Recitation packet 11. A 1999 study claimed that.docx
Stat 1040, Recitation packet 11. A 1999 study claimed that.docx
dessiechisomjj4
 
Reverse instruction inquiry
Reverse instruction inquiryReverse instruction inquiry
Reverse instruction inquiry
George Phillip
 
Reverse instruction inquiry
Reverse instruction inquiryReverse instruction inquiry
Reverse instruction inquiry
George Phillip
 
Ziang Wang Q&A: NSHSS 2015 Earth Day Award Recipient- National Society of Hig...
Ziang Wang Q&A: NSHSS 2015 Earth Day Award Recipient- National Society of Hig...Ziang Wang Q&A: NSHSS 2015 Earth Day Award Recipient- National Society of Hig...
Ziang Wang Q&A: NSHSS 2015 Earth Day Award Recipient- National Society of Hig...
The National Society of High School Scholars (NSHSS)
 
Michael Pocock: Citizen Science Project Design
Michael Pocock: Citizen Science Project DesignMichael Pocock: Citizen Science Project Design
Michael Pocock: Citizen Science Project Design
Alice Sheppard
 
NS_Jul_Aug_2016_Final
NS_Jul_Aug_2016_FinalNS_Jul_Aug_2016_Final
NS_Jul_Aug_2016_Final
Fernando Bejarano
 
Experiments: The Good, the Bad, and the Beautiful
Experiments: The Good, the Bad, and the BeautifulExperiments: The Good, the Bad, and the Beautiful
Experiments: The Good, the Bad, and the Beautiful
TechWell
 
The Revelation of the Father - Week 10
The Revelation of the Father - Week 10The Revelation of the Father - Week 10
The Revelation of the Father - Week 10
PDEI
 
Is ‘Open Science’ a solution or a threat?
Is ‘Open Science’ a solution or a threat?Is ‘Open Science’ a solution or a threat?
Is ‘Open Science’ a solution or a threat?
Danny Kingsley
 
Future of Data Sharing
Future of Data SharingFuture of Data Sharing
Future of Data Sharing
CTSciNet .org
 

Similar to Understanding & Evaluating Search Sessions (20)

Academic Book Review Format. Book Review Exam
Academic Book Review Format. Book Review ExamAcademic Book Review Format. Book Review Exam
Academic Book Review Format. Book Review Exam
 
Individual Analysis and Testing
Individual Analysis and TestingIndividual Analysis and Testing
Individual Analysis and Testing
 
Nature and Functions of Research part 3
Nature and Functions of Research part 3Nature and Functions of Research part 3
Nature and Functions of Research part 3
 
EBP : what do we expect of 1st years in HE
EBP : what do we expect of 1st years in HE EBP : what do we expect of 1st years in HE
EBP : what do we expect of 1st years in HE
 
Eblip7 keynote pdf
Eblip7 keynote pdfEblip7 keynote pdf
Eblip7 keynote pdf
 
The twin purposes of guided inquiry final
The twin purposes of guided inquiry finalThe twin purposes of guided inquiry final
The twin purposes of guided inquiry final
 
Introduction to information literacy part 1
Introduction to information literacy part 1Introduction to information literacy part 1
Introduction to information literacy part 1
 
Pick College Essay Writing Services With Care - Research Master Essays
Pick College Essay Writing Services With Care - Research Master EssaysPick College Essay Writing Services With Care - Research Master Essays
Pick College Essay Writing Services With Care - Research Master Essays
 
Reverse instruction inquiry 2
Reverse instruction inquiry 2Reverse instruction inquiry 2
Reverse instruction inquiry 2
 
DeBari and Julin slides
DeBari and Julin slidesDeBari and Julin slides
DeBari and Julin slides
 
Stat 1040, Recitation packet 11. A 1999 study claimed that.docx
Stat 1040, Recitation packet 11. A 1999 study claimed that.docxStat 1040, Recitation packet 11. A 1999 study claimed that.docx
Stat 1040, Recitation packet 11. A 1999 study claimed that.docx
 
Reverse instruction inquiry
Reverse instruction inquiryReverse instruction inquiry
Reverse instruction inquiry
 
Reverse instruction inquiry
Reverse instruction inquiryReverse instruction inquiry
Reverse instruction inquiry
 
Ziang Wang Q&A: NSHSS 2015 Earth Day Award Recipient- National Society of Hig...
Ziang Wang Q&A: NSHSS 2015 Earth Day Award Recipient- National Society of Hig...Ziang Wang Q&A: NSHSS 2015 Earth Day Award Recipient- National Society of Hig...
Ziang Wang Q&A: NSHSS 2015 Earth Day Award Recipient- National Society of Hig...
 
Michael Pocock: Citizen Science Project Design
Michael Pocock: Citizen Science Project DesignMichael Pocock: Citizen Science Project Design
Michael Pocock: Citizen Science Project Design
 
NS_Jul_Aug_2016_Final
NS_Jul_Aug_2016_FinalNS_Jul_Aug_2016_Final
NS_Jul_Aug_2016_Final
 
Experiments: The Good, the Bad, and the Beautiful
Experiments: The Good, the Bad, and the BeautifulExperiments: The Good, the Bad, and the Beautiful
Experiments: The Good, the Bad, and the Beautiful
 
The Revelation of the Father - Week 10
The Revelation of the Father - Week 10The Revelation of the Father - Week 10
The Revelation of the Father - Week 10
 
Is ‘Open Science’ a solution or a threat?
Is ‘Open Science’ a solution or a threat?Is ‘Open Science’ a solution or a threat?
Is ‘Open Science’ a solution or a threat?
 
Future of Data Sharing
Future of Data SharingFuture of Data Sharing
Future of Data Sharing
 

More from Max L. Wilson

Brain Data as Cognitive Personal Informatics - UCL 2022
Brain Data as Cognitive Personal Informatics - UCL 2022Brain Data as Cognitive Personal Informatics - UCL 2022
Brain Data as Cognitive Personal Informatics - UCL 2022
Max L. Wilson
 
Brain Data as Cognitive Personal Informatics - Bell Labs 2022
Brain Data as Cognitive Personal Informatics - Bell Labs 2022Brain Data as Cognitive Personal Informatics - Bell Labs 2022
Brain Data as Cognitive Personal Informatics - Bell Labs 2022
Max L. Wilson
 
Physiological indicators of task demand, fatigue, and cognition during Work T...
Physiological indicators of task demand, fatigue, and cognition during Work T...Physiological indicators of task demand, fatigue, and cognition during Work T...
Physiological indicators of task demand, fatigue, and cognition during Work T...
Max L. Wilson
 
Brain-based HCI - What brain data can tell us about HCI - St Andrews, 2019
Brain-based HCI - What brain data can tell us about HCI - St Andrews, 2019Brain-based HCI - What brain data can tell us about HCI - St Andrews, 2019
Brain-based HCI - What brain data can tell us about HCI - St Andrews, 2019
Max L. Wilson
 
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Lei...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Lei...Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Lei...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Lei...
Max L. Wilson
 
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
Max L. Wilson
 
Measuring & Reflecting on Mental Workload - Birmingham Uni, May 2017
Measuring & Reflecting on Mental Workload - Birmingham Uni, May 2017Measuring & Reflecting on Mental Workload - Birmingham Uni, May 2017
Measuring & Reflecting on Mental Workload - Birmingham Uni, May 2017
Max L. Wilson
 
CHIIR2017 - Tetris Model of Resolving Information Needs
CHIIR2017 - Tetris Model of Resolving Information NeedsCHIIR2017 - Tetris Model of Resolving Information Needs
CHIIR2017 - Tetris Model of Resolving Information Needs
Max L. Wilson
 
The HCI Perspective on IR (DIR2016 Keynote)
The HCI Perspective on IR (DIR2016 Keynote)The HCI Perspective on IR (DIR2016 Keynote)
The HCI Perspective on IR (DIR2016 Keynote)
Max L. Wilson
 
Why People Favourite Tweets (and a bit about usefulness and style) - Content ...
Why People Favourite Tweets (and a bit about usefulness and style) - Content ...Why People Favourite Tweets (and a bit about usefulness and style) - Content ...
Why People Favourite Tweets (and a bit about usefulness and style) - Content ...
Max L. Wilson
 
Fun information Interaction #Seaching4fun
Fun information Interaction #Seaching4funFun information Interaction #Seaching4fun
Fun information Interaction #Seaching4fun
Max L. Wilson
 
RepliCHI - 8 Challenges in Replicating a Study
RepliCHI - 8 Challenges in Replicating a StudyRepliCHI - 8 Challenges in Replicating a Study
RepliCHI - 8 Challenges in Replicating a Study
Max L. Wilson
 
Search User Interface Design
Search User Interface DesignSearch User Interface Design
Search User Interface Design
Max L. Wilson
 
ASIST2010 - The Revisit Rack - Group Web Search Thumbnails
ASIST2010 - The Revisit Rack - Group Web Search ThumbnailsASIST2010 - The Revisit Rack - Group Web Search Thumbnails
ASIST2010 - The Revisit Rack - Group Web Search Thumbnails
Max L. Wilson
 
Investigating Alternative Forms of Search
Investigating Alternative Forms of SearchInvestigating Alternative Forms of Search
Investigating Alternative Forms of Search
Max L. Wilson
 
Hcir2010 - Casual-Leisure Search
Hcir2010 - Casual-Leisure SearchHcir2010 - Casual-Leisure Search
Hcir2010 - Casual-Leisure Search
Max L. Wilson
 

More from Max L. Wilson (16)

Brain Data as Cognitive Personal Informatics - UCL 2022
Brain Data as Cognitive Personal Informatics - UCL 2022Brain Data as Cognitive Personal Informatics - UCL 2022
Brain Data as Cognitive Personal Informatics - UCL 2022
 
Brain Data as Cognitive Personal Informatics - Bell Labs 2022
Brain Data as Cognitive Personal Informatics - Bell Labs 2022Brain Data as Cognitive Personal Informatics - Bell Labs 2022
Brain Data as Cognitive Personal Informatics - Bell Labs 2022
 
Physiological indicators of task demand, fatigue, and cognition during Work T...
Physiological indicators of task demand, fatigue, and cognition during Work T...Physiological indicators of task demand, fatigue, and cognition during Work T...
Physiological indicators of task demand, fatigue, and cognition during Work T...
 
Brain-based HCI - What brain data can tell us about HCI - St Andrews, 2019
Brain-based HCI - What brain data can tell us about HCI - St Andrews, 2019Brain-based HCI - What brain data can tell us about HCI - St Andrews, 2019
Brain-based HCI - What brain data can tell us about HCI - St Andrews, 2019
 
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Lei...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Lei...Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Lei...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Lei...
 
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
Mental Workload Alerts - Reliable Brain Measurements of HCI using fNIRS - Uni...
 
Measuring & Reflecting on Mental Workload - Birmingham Uni, May 2017
Measuring & Reflecting on Mental Workload - Birmingham Uni, May 2017Measuring & Reflecting on Mental Workload - Birmingham Uni, May 2017
Measuring & Reflecting on Mental Workload - Birmingham Uni, May 2017
 
CHIIR2017 - Tetris Model of Resolving Information Needs
CHIIR2017 - Tetris Model of Resolving Information NeedsCHIIR2017 - Tetris Model of Resolving Information Needs
CHIIR2017 - Tetris Model of Resolving Information Needs
 
The HCI Perspective on IR (DIR2016 Keynote)
The HCI Perspective on IR (DIR2016 Keynote)The HCI Perspective on IR (DIR2016 Keynote)
The HCI Perspective on IR (DIR2016 Keynote)
 
Why People Favourite Tweets (and a bit about usefulness and style) - Content ...
Why People Favourite Tweets (and a bit about usefulness and style) - Content ...Why People Favourite Tweets (and a bit about usefulness and style) - Content ...
Why People Favourite Tweets (and a bit about usefulness and style) - Content ...
 
Fun information Interaction #Seaching4fun
Fun information Interaction #Seaching4funFun information Interaction #Seaching4fun
Fun information Interaction #Seaching4fun
 
RepliCHI - 8 Challenges in Replicating a Study
RepliCHI - 8 Challenges in Replicating a StudyRepliCHI - 8 Challenges in Replicating a Study
RepliCHI - 8 Challenges in Replicating a Study
 
Search User Interface Design
Search User Interface DesignSearch User Interface Design
Search User Interface Design
 
ASIST2010 - The Revisit Rack - Group Web Search Thumbnails
ASIST2010 - The Revisit Rack - Group Web Search ThumbnailsASIST2010 - The Revisit Rack - Group Web Search Thumbnails
ASIST2010 - The Revisit Rack - Group Web Search Thumbnails
 
Investigating Alternative Forms of Search
Investigating Alternative Forms of SearchInvestigating Alternative Forms of Search
Investigating Alternative Forms of Search
 
Hcir2010 - Casual-Leisure Search
Hcir2010 - Casual-Leisure SearchHcir2010 - Casual-Leisure Search
Hcir2010 - Casual-Leisure Search
 

Recently uploaded

20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
TIPNGVN2
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Zilliz
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 

Recently uploaded (20)

20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 

Understanding & Evaluating Search Sessions

  • 1. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Extended Searching Sessions and Evaluating Success Dr Max L.Wilson Mixed Reality Lab University of Nottingham, UK Friday, 10 May 13
  • 2. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Studying Extended Search Success In Observable Natural Sessions SESSIONS Friday, 10 May 13
  • 3. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Extended Searching Sessions and Evaluating Sensemaking Success About Me Study 1:The Real Nature of Sessions Study 2: Evaluating Sensemaking Success Friday, 10 May 13
  • 4. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ About me MEng & Phd in Southampton Taught in Swansea for 3 years Moved to Nottingham April 2012 Friday, 10 May 13
  • 5. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ About Me Friday, 10 May 13
  • 6. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Friday, 10 May 13
  • 7. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ UIST 2008 JCDL 2008 Friday, 10 May 13
  • 8. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ My PhD Bates, M. J. (1979a). Idea tactics. Journal of the American Society for Information Science, 30(5):280–289. Bates, M. J. (1979b). Information search tactics. Journal of the American Society for Information Science, 30(4):205–214. Belkin, N. J., Marchetti, P. G., and Cool, C. (1993). Braque: design of an interface to support user interaction in information retrieval. Information Processing and Management, 29(3): 325–344. Friday, 10 May 13
  • 9. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ My PhD Wilson, M. L., schraefel, m. c., and White, R. W. (2009). Evaluating advanced search interfaces using established information-seeking models. Journal of the American Society for Information Science and Technology, 60(7):1407–1422. Friday, 10 May 13
  • 10. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Search User Interface Design Friday, 10 May 13
  • 11. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ MyTeam Horia Maior Matthew Pike Jon Hurlock Paul BrindleyZenah Alkubaisy Chaoyu (Kelvin)Ye (Study 1) Mathew Wilson (Study 2) Friday, 10 May 13
  • 12. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Extended Searching Sessions and Evaluating Sensemaking Success About Me Study 1:The Real Nature of Sessions Study 2: Evaluating Sensemaking Success Friday, 10 May 13
  • 13. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ People Searching the Web Elsweiler, D.,Wilson M. L. and Kirkegaard-Lunn, B. (2011) Understanding Casual-leisure Information Behaviour. In Spink,A. and Heinstrom, J. (Eds) New Directions in Information Behaviour. Emerald Group Publishing Limited, pp 211-241. Friday, 10 May 13
  • 14. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ The Search Communities Ingwersen, P., Jarvelin, K., 2005.The turn: integration of information seeking and retrieval in context. Springer, Berlin, Germany. The IR Community •Focused on Accuracy •Are these results relevant? •How many are relevant? •Did we get all the relevant ones? Friday, 10 May 13
  • 15. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ The Search Communities The IS Community •Focused on Success •Did they find the right result? •How long did they take •How many interactions? Ingwersen, P., Jarvelin, K., 2005.The turn: integration of information seeking and retrieval in context. Springer, Berlin, Germany. Friday, 10 May 13
  • 16. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ The Search Communities The IB Community •Focused on Quality •Did they do a good job? •How did the UI affect the task? •Was the higher level motivating task achieved more successfully? Ingwersen, P., Jarvelin, K., 2005.The turn: integration of information seeking and retrieval in context. Springer, Berlin, Germany. Friday, 10 May 13
  • 17. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ The Search Communities “Relatively” well known “Naively estimated” - Study 1 “Simplistically” measured - Study 2 Friday, 10 May 13
  • 18. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ WorkTasks • Work tasks - typically considered work-led information- intensive activities the lead to searching • Can be out-of-work - like planning holidays, or buying a car • We’ve begun looking at motivating ‘tasks’ outside of work Friday, 10 May 13
  • 19. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Casual Leisure WorkTasks behaviours documented so far. 4.1 Need-less browsing Much like the desire to pass time at the television, we saw many examples (some shown in Table 3) of people passing time typically associated with the ‘browsing’ keyword. 1) ... I’m not even *doing* anything useful... just browsing eBay aimlessly... 2) to do list today: browse the Internet until fasting break time.. 3) ... just got done eating dinner and my family is watch- ing the football. Rather browse on the laptop 4) I’m at the dolphin mall. Just browsing. Table 3: Example tweets where the browsing activ- ity is need-less. From the collected tweets it is clear that often the inform- ation-need in these situations are not only fuzzy, but typi- cally absent. The aim appears to be focused on the activity, where the measure of success would be in how much they D d a 5 h f o S i b a d f t s W t Wilson, M. L. and Elsweiler, D. (2010) Casual-leisure Searching: the Exploratory Search scenarios that break our current models. In: 4th HCIR Workshop ,Aug 22 2010. pp 28-31. Friday, 10 May 13
  • 20. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ People Searching the Web Elsweiler, D.,Wilson M. L. and Kirkegaard-Lunn, B. (2011) Understanding Casual-leisure Information Behaviour. In Spink,A. and Heinstrom, J. (Eds) New Directions in Information Behaviour. Emerald Group Publishing Limited, pp 211-241. Friday, 10 May 13
  • 21. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ • Traditionally examined by analysing logs for stats • In the 90s, suggested they are broken by ~25mins - More recently by ~5mins • BUT evidence shows web use typically interleaves tasks - AND tabs make this all much harder • Become a big focus as Dagstuhls/workshops Sessions Friday, 10 May 13
  • 22. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ SearchTrails • Aimed at finding common end locations for queries • An interesting step towards sessions though • most involved some trail features (not query+click) White, Ryen W., and Steven M. Drucker. "Investigating behavioral variability in web search." in Proc WWW 2007 .ACM Friday, 10 May 13
  • 23. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Top Sessions as Seen by Bing Bailey et al, User task understanding: a web search engine perspective, NII Shonan, 8 Oct 2012 Friday, 10 May 13
  • 24. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Top Sessions as Seen by Bing Bailey et al, User task understanding: a web search engine perspective, NII Shonan, 8 Oct 2012 Friday, 10 May 13
  • 25. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Top Sessions as Seen by Bing Bailey et al, User task understanding: a web search engine perspective, NII Shonan, 8 Oct 2012 Friday, 10 May 13
  • 26. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Investigating Extended Sessions What on earth is happening here? Friday, 10 May 13
  • 27. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Interview Method
Send & preprocess history (a history artefact - approx 300 items), then interview:
- How would you define a session?
- Mark out history into sessions, starting recently + create ‘cards’ of varying types of ‘sessions’
- Open card sort + closed card sort
Stage timings: 10mins / 20-30mins / 30-50mins; 15-20 cards
Outputs: recording, cards, card sorts, marked history file, log data
Friday, 10 May 13
  • 28. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Data • Rich discussion of ~20 Sessions per participant • Currently: 7 participants and ~120 sessions - richly described and compared • Aiming for: 12 participants and 200+ sessions at first Friday, 10 May 13
  • 29. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Questions for Sessions 1) Where was this done (e.g. work vs home vs mobile) 2) With whom (collaborative?) 3) For whom (shared task?) 4) Devices involved (whether devices affect things) 5) Length of the session (how do they define long?) 6) Successful or not (for future measurement insights) At some point: tried to learn these for each session Friday, 10 May 13
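For concreteness, the six questions above can be imagined as a per-session annotation record. The field names and types below are purely my own illustration of that shape, not the study's actual coding sheet.

from dataclasses import dataclass
from typing import Optional, List

# Hypothetical per-session annotation capturing the six interview questions.
# All names and types are illustrative assumptions.
@dataclass
class SessionAnnotation:
    location: str               # 1) where: e.g. "work", "home", "mobile"
    with_whom: Optional[str]    # 2) collaborative partner(s), if any
    for_whom: str               # 3) self, another person, or a group
    devices: List[str]          # 4) devices involved
    duration_minutes: float     # 5) participant's estimate of length
    successful: Optional[bool]  # 6) participant's judgement of success

example = SessionAnnotation(
    location="home", with_whom=None, for_whom="family",
    devices=["laptop"], duration_minutes=25.0, successful=True)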
  • 30. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: A Card Friday, 10 May 13
  • 31. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: A Card Friday, 10 May 13
  • 32. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Card Sorting • We aimed first to let them define the dimensions - this lets us see how they define things - how do they self-categorise different sessions • We then had some targeted card sorts - For whom, duration, difficulty, importance, location - what’s short vs long? - what’s important vs not? - how do people divide work vs home etc. Friday, 10 May 13
  • 33. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Example Card Sorts Friday, 10 May 13
  • 34. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Friday, 10 May 13
  • 35. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Preliminary Findings • avg 21 cards per person, inc. ~8 sessions of 5+ mins - ~4 work & ~4 leisure • 18.6% of those extended sessions involved task switches • avg length: 17.5mins avg #queries: 3.55 • short: a third said <30s, a third said <1m, a third said <30m • long: a third said >1hour, a third said >5mins Friday, 10 May 13
  • 36. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Preliminary Findings • longest sessions: entertainment, work prep, news, shopping • longest leisure: 22-76mins youtube, 28mins news • most important: work, money, urgent shopping • least important: leisure, entertainment, free time • most difficult: technical work prep Friday, 10 May 13
  • 37. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Preliminary Findings • Huge divide over where sessions start or stop - many people considered a session to span a large break - paused and left in tabs • One person divided a single topical episode into phases - and phases were sessions - e.g. broadening/confused stage vs successful focus stage • One person divided a single topical episode by major sources - moved from web searching to video searching on the same topic What is a session? Implications for where/when to measure success Friday, 10 May 13
  • 38. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: What is a session? Single topic - changing purpose Friday, 10 May 13
  • 39. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: What is a session? Single topic - pausing sessions Friday, 10 May 13
  • 40. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: What is a session? Low-query extended sessions Friday, 10 May 13
  • 41. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Other observations • Seeing an informal relationship between whom tasks are for - and skewed importance - including for another person, or for a group - and slow sequential interactions (as they talk to others) • Seeing a strong low-query correlation with entertainment - seeing serious-leisure more similar to work tasks • Hard tasks have high query loads - and are related to rare or new areas Friday, 10 May 13
  • 42. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 1: Summary • We’re beginning to get some real insight into real sessions • Already identifying examples where time-splitting isn’t sufficient - but intention changing is common • We’re seeing possible common patterns of overlapping sessions • We haven’t finished! Friday, 10 May 13
  • 43. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Evaluating Sensemaking “Simplistically” measured - Study 2 Wilson, M. J. and Wilson, M. L. (2012) A Comparison of Techniques for Measuring Sensemaking and Learning within Participant-Generated Summaries. In: JASIST (accepted). Friday, 10 May 13
  • 44. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: “Simplistically” measured • If learning is closed: then a quiz - “closed” determines WHAT should be learned - can measure recall, but also recognition if cued by the question • If learning is open: a) sub-topic count (integer) & topic quality (judged likert) b) simple count of facts (integer) and statements (integer) • These do not measure how “good” the learning was Friday, 10 May 13
  • 45. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Measuring “Depth” of Learning • A theory from Education • As learning improves you progress up the diagram • You begin to ‘understand’ - then critically ‘analyze’ - then ‘evaluate’ information etc. Image from: http://www.nwlink.com/~donclark/hrd/bloom.html Friday, 10 May 13
  • 46. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Developed 3 Scales • 12 participants performed 3 learning tasks - mix of high and low prior knowledge • 1) Write summary of knowledge, 2) Learn, 3) Write summary • 36 pairs of pre/post summaries - 18 high prior knowledge - 18 low prior knowledge Friday, 10 May 13
  • 47. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Developed 3 Scales • Inductive Grounded Theory analysis • 3 rounds of 6 high and 6 low pairs analysed by 2 researchers • Validated by an external judge • Until high Fleiss Kappa scores, i.e. ‘substantial agreement’ Friday, 10 May 13
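For readers unfamiliar with the agreement statistic mentioned above, here is a minimal, self-contained sketch of Fleiss' kappa for multiple coders assigning ordinal ratings. The rating matrix in the example is invented purely to show the calculation and is not data from the study (the statsmodels package also provides a fleiss_kappa function if a library implementation is preferred).

import numpy as np

# Fleiss' kappa from a subjects x categories matrix of counts, where each
# cell holds how many coders placed that summary in that rating category.
def fleiss_kappa(ratings):
    ratings = np.asarray(ratings, dtype=float)
    n_subjects = ratings.shape[0]
    n_raters = ratings[0].sum()
    p_j = ratings.sum(axis=0) / (n_subjects * n_raters)   # category proportions
    P_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Example (made up): 3 coders rating 5 summaries on a 0-3 D-Qual-style scale.
table = [[3, 0, 0, 0],
         [0, 2, 1, 0],
         [0, 0, 3, 0],
         [0, 0, 1, 2],
         [0, 3, 0, 0]]
print(round(fleiss_kappa(table), 2))  # ~0.63 for this made-up matrix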
  • 48. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Measure 1: D-Qual
...scale ranging from irrelevant or useless facts (0 points) to facts that showed a level of technical understanding (3 points). The emphasis of usefulness in this measure meant that it was closer to the “understanding” level of Bloom’s revised taxonomy, rather than simply “remembering”. It was important to differentiate between the two levels as many poor summaries, as determined by the authors during the coding session, simply listed many redundantly obvious facts (“A labrador is a dog”) rather than describing them in sentences and summaries. For D-Qual, the judges achieved a Fleiss kappa of 0.64.
Table 1: Quality of Facts (D-Qual)
  0 - Facts are irrelevant to the subject; Facts hold no useful information or advice.
  1 - Facts are generalised to the overall subject matter; Facts hold little useful information or advice.
  2 - Facts fulfil the required information need and are useful.
  3 - A level of technical detail is given via at least one key term associated with the technology of the subject; Statistics are given.
Many of the better summaries interpreted facts into more intelligent statements. To identify this, D-Intrp (Table 2) measured summaries in how they synthesised facts and statements to draw conclusions and deductions (Bloom’s “analysing”) using a 3-point scale.
Measure understanding rather than remembering Friday, 10 May 13
  • 49. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Measure 2: D-Intrp
Table 2: Interpretation of data into statements (D-Intrp)
  0 - Facts contained within one statement with no association.
  1 - Association of two useful or detailed facts: ‘A -> B’
  2 - Association of multiple useful or detailed facts: ‘A+B->C’; ‘A->B->C’; ‘A->B∴C’
Measure analysing capabilities Friday, 10 May 13
  • 50. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Measure 3: D-Crit
Measure evaluating capabilities
D-Crit reflected Bloom’s concept of “evaluating” by identifying statements that compared facts, or used facts to raise questions about other statements. The measurement for D-Crit was either true (1 point) or false (0 points), as shown in Table 3. A Fleiss kappa of 0.74 was achieved.
Table 3: Use of critique (D-Crit)
  0 - Facts are listed with no further thought or analysis.
  1 - Both advantages and disadvantages listed; Comparisons drawn between items; Participant deduced his or her own questions.
We did not produce a scale for level three of Anderson’s revised version of Bloom’s taxonomy, “applying”, since the act of writing a summary would not involve the participant to carry out a procedure that has been learned. This level of learning was thus not identifiable in our corpus of summaries. Similarly, the highest level, “creating”, also goes beyond writing [...]
Friday, 10 May 13
  • 51. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Evaluating these measures
Compare against Counting & Topic measures
• Can you differentiate pre- & post-task summaries?
• Can you differentiate high & low prior knowledge?
• How long do summaries need to be?
...measure depth (‘T-Depth’), each topic was measured on a 4-point scale ranging from not covered (0 points) to detailed focused coverage (3 points) and averaged. As the process of learning is primarily internal it is difficult to measure it objectively. For this reason our measures of learning focused on the difference between pre- and post-task knowledge held by the participant.
Table 4: Outline of coding scheme used for analysis
  D-Qual | Recall of facts | 0 – 3 points
  D-Intrp | Interpretation of data into statements | 0 – 2 points
  D-Crit | Critique | 0 – 1 point
  F-Fact | Number of facts | Count
  F-State | Number of statements | Count
  F-Ratio | Ratio of facts per statement | Average
  T-Count | Number of topics covered (breadth of knowledge) | Count
  T-Depth | Level of topic focus (depth of knowledge) | 0 – 3 points, averaged
5 Results: Before beginning, the data from two participants were removed from the analysis. A first-pass sanity check over the collected summaries revealed that they had misunderstood the tasks set. One chose to describe their own feelings and history relating to the task topic, rather than trying to answer the task. Another described what they intended to search for in their [...]
Friday, 10 May 13
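As a concrete illustration of how the coding scheme in Table 4 could be applied, the sketch below stores the codes for one pre-task and one post-task summary and reads learning off as the per-measure difference. The class, field names, and judged values are my own invented example, not the paper's data or tooling.

from dataclasses import dataclass

# Hypothetical container for the Table 4 codes assigned to one summary.
@dataclass
class SummaryCodes:
    d_qual: int     # quality of facts, 0-3
    d_intrp: int    # interpretation of facts into statements, 0-2
    d_crit: int     # presence of critique, 0-1
    f_fact: int     # number of facts
    f_state: int    # number of statements
    t_count: int    # number of topics covered (breadth)
    t_depth: float  # depth of topic focus, 0-3, averaged over topics

    @property
    def f_ratio(self):
        # facts per statement (guarding against a summary with no statements)
        return self.f_fact / self.f_state if self.f_state else 0.0

def learning_delta(pre, post):
    """Per-measure difference between post- and pre-task summaries."""
    fields = ("d_qual", "d_intrp", "d_crit", "f_fact", "f_state", "t_count", "t_depth")
    return {name: getattr(post, name) - getattr(pre, name) for name in fields}

# Invented example: a low-prior-knowledge participant before and after searching.
pre = SummaryCodes(d_qual=1, d_intrp=0, d_crit=0, f_fact=4, f_state=2, t_count=2, t_depth=1.0)
post = SummaryCodes(d_qual=3, d_intrp=2, d_crit=1, f_fact=9, f_state=5, t_count=4, t_depth=2.5)
print(learning_delta(pre, post))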
  • 52. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Analysing summaries Pre-task example Friday, 10 May 13
  • 53. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Analysing summaries Post-task example Friday, 10 May 13
  • 54. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Results
...knowledge, especially for pre-task summaries, which can possibly be explained that the participants who wrote shorter summaries based on high prior knowledge are more likely to concentrate on a single topic.
Table 12: Comparing high and low prior knowledge in shorter summaries (* indicates significant results)
  Measure | All | Pre-task | Post-task
  D-Qual | U(68) = 537.5, p = 0.32 | U(34) = 125, p = 0.28 | U(34) = 148, p = 0.46
  D-Intrp | U(68) = 642, p = 0.21 | U(34) = 145, p = 0.47 | U(34) = 174, p = 0.16
  D-Crit | U(68) = 570, p = 0.47 | U(34) = 140, p = 0.47 | U(34) = 144.5, p = 0.49
  F-Fact | t(66) = -0.4, p = 0.35 | t(32) = -0.75, p = 0.23 | t(32) = -0.25, p = 0.4
  F-State | t(66) = -0.21, p = 0.42 | t(32) = -0.4, p = 0.35 | t(32) = -0.17, p = 0.43
  F-Ratio | t(66) = 0.2, p = 0.42 | t(32) = 0.31, p = 0.38 | t(32) = -0.04, p = 0.48
  T-Count | t(66) = -0.35, p = 0.36 | t(32) = 0.43, p = 0.34 | t(32) = -1.01, p = 0.16
  T-Depth | U(68) = 721, p = 0.04 * | U(34) = 194.5, p = 0.04 * | U(34) = 168, p = 0.21
Table 13: Comparing high and low prior knowledge in longer summaries (* indicates significant results)
  Measure | All | Pre-task | Post-task
  D-Qual | U(68) = 390, p = 0.01 * | U(34) = 89.5, p = 0.03 * | U(34) = 113.5, p = 0.18
  D-Intrp | U(68) = 497.5, p = 0.16 | U(34) = 158.5, p = 0.29 | U(34) = 95, p = 0.06
  D-Crit | U(68) = 693.5, p = 0.08 | U(34) = 189, p = 0.05 * | U(34) = 154, p = 0.32
  F-Fact | t(66) = 1.62, p = 0.06 | t(32) = 0.64, p = 0.26 | t(32) = 1, p = 0.16
  F-State | t(66) = 1, p = 0.16 | t(32) = 0.29, p = 0.39 | t(32) = 0.79, p = 0.22
  F-Ratio | t(66) = 0.86, p = 0.2 | t(32) = 0.31, p = 0.38 | t(32) = 0.21, p = 0.42
  T-Count | t(66) = 3.44, p = 0.0005 * | t(32) = 1.92, p = 0.03 * | t(32) = 2.82, p = 0.004 *
  T-Depth | U(68) = 572, p = 0.48 | U(34) = 163, p = 0.25 | U(34) = 142, p = 0.48
Pretty obvious - as you can see Friday, 10 May 13
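The comparisons in Tables 12 and 13 pair Mann-Whitney U tests for the judged, ordinal measures (D-Qual, D-Intrp, D-Crit, T-Depth) with independent-samples t-tests for the count-based ones (F-Fact, F-State, F-Ratio, T-Count). A minimal sketch of that analysis pattern is below; the score lists are placeholder data, not the study's summaries.

from scipy.stats import mannwhitneyu, ttest_ind

# Placeholder scores for two groups of participants (high vs low prior knowledge).
high_prior_dqual = [3, 2, 3, 2, 3, 2, 1, 3]   # ordinal 0-3 judgements
low_prior_dqual = [1, 0, 2, 1, 1, 0, 2, 1]

# Ordinal, judged measure -> Mann-Whitney U
u, p = mannwhitneyu(high_prior_dqual, low_prior_dqual, alternative="two-sided")
print(f"D-Qual: U = {u}, p = {p:.3f}")

high_prior_ffact = [9, 7, 11, 8, 10, 6, 9, 12]  # simple fact counts
low_prior_ffact = [5, 6, 4, 7, 5, 8, 6, 5]

# Count-based measure -> independent-samples t-test
t, p = ttest_ind(high_prior_ffact, low_prior_ffact)
print(f"F-Fact: t = {t:.2f}, p = {p:.3f}")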
  • 55. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Results • 1) Most measures could identify learning (between pre-post) - more robust with longer summaries
...the summaries and the prior knowledge held by the participant should be taken in to consideration. Table 14 provides an overview of the strengths and weaknesses of each measure and recommendations are made below. While serving as a guide readers should refer back to the full text in our results section for more detail before using them in a study.
[Table 14: Overview of measure suitability - a checkmark grid scoring each measure (D-Qual, D-Intrp, D-Crit, F-Fact, F-State, F-Ratio, T-Count, T-Depth) on whether it identifies learning, identifies prior knowledge, and ignores summary length, split by high/low prior knowledge, short/long summaries and pre/post task]
If participants have written shorter summaries (here averaged to around 90 words) then learning is only really noticeable if those participants began with low prior knowledge, where measures such as the quality of facts (D-Qual), simple fact and statement counting (F-Fact, F-State) and topic coverage (T-Count) can be used to determine an increase of knowledge. If short summaries are written based on high prior knowledge then only simple fact and [...]
Friday, 10 May 13
  • 56. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Results • 2) Only some were good at identifying prior knowledge - these required long pre-task summaries to be written (Table 14: Overview of measure suitability, as on the previous slide) Friday, 10 May 13
  • 57. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Results • 3) Our measures were the most robust to length of summary - others require pushing participants beyond 200 words (Table 14: Overview of measure suitability, as on the previous slides) Friday, 10 May 13
  • 58. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Study 2: Conclusions • We proposed a new measure based on depth of learning - demonstrating higher levels of thinking • This was more robust to size of written summary, - good at long and short, while measuring learning - able to determine if someone has existing high knowledge • All measures did surprisingly well, for measuring learning • Ours was most robust for determining prior knowledge level • Future work: behaviour between good vs bad learners Friday, 10 May 13
  • 59. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Talk Summary • Search communities are trying to move beyond simple tasks - more than result quality, and time to target • Currently focusing on understanding sessions - which has primarily been splitting logs by time gaps • Our work: 1) moving beyond assumptions about sessions 2) introducing new methods to evaluate sensemaking Friday, 10 May 13
  • 60. Dr Max L.Wilson http://cs.nott.ac.uk/~mlw/ Talk Summary • There’s a long way to go before search engines know what we’re doing beyond a query (and immediate refinements) - there’s a long way before we do • Also - we still need to measure: - success in decision making (like online shopping) - success in entertainment sessions Friday, 10 May 13