Two-layered Summaries for Mobile Search: Does the Evaluation Measure Reflect User Preferences? (at EVIA 2016)

Two-layered Summaries for Mobile Search:
Does the Evaluation Measure Reflect User Preferences?
Makoto P. Kato (Kyoto U.), Tetsuya Sakai (Waseda U.),
Takehiro Yamamoto (Kyoto U.), Virgil Pavlu (Northeastern U.),
and Hajime Morita (Kyoto U.)

IR Systems in Ten-Blue-Link Paradigm
Enter query
Click SEARCH button
Scan ranked list of URLs
Click URL
Read URL contents
Get all desired information
Long way to get all desired information

MobileClick System
Enter query
Click SEARCH button
Get all desired information
Go beyond the "ten-blue-link" paradigm, and tackle
information retrieval rather than document retrieval
LCD is better in terms of the weight, size and
energy saving. OLED shows a better black
color, a faster response speed, and a wider
view angle.
Advantage of OLED
Advantage of LCD
Task: Given a search query,
return a two-layered textual output
System output
OLED LCD difference
Phone: 046-223-3636.
Fax: 046-223-3630.
Address: 118-1
Nurumizu, Atsugi,
243-8551. Email:
soumu@shonan-
atsugi.jp. Visiting
hours: general ward
Mon-Fri 15-20;
Sat&Holidays 13-20 /
Intensive Care Unit
(ICU) 11-11:30, 15:30,
19-19:30.
Skip

• Given a query, a set of iUnits, and a set of intents,
generate a two-layered summary
iUnit Summarization Subtask at NTCIR-12
iUnit
A series of evaluation workshops
Designed to enhance IA research
…
NTCIR
Input: Query
Input: iUnit set
Intents
News
Schedule
…
Input: Intents
M-measure
0.5
The NTCIR Workshop is a
series of evaluation
workshops designed to
enhance research in
information access
technologies including
information retrieval,
summarization, extraction,
question answering, etc.
News
Schedule
Tasks
2nd layer
20/Jan./2016: Task Registration Due
06/Jan./2016: Document Set Release
Jan.-May/2016: Dry Run
Mar.-July/2016: Formal Run
01/Aug./2016: Evaluation Results Due
01/Aug./2016: Task overview release
15/Sep./2016: Paper submission Due
01/Nov./2016: All paper Due
09-12/Dec./2016: NTCIR-11 Conference
Output: Two-layered summary
Evaluation metric
designed for mobile
information access
Lay out iUnits so that
any type of user can be immediately satisfied
Challenge

Two-layered Summary in Action

Does the Evaluation Measure
Reflect User Preferences?
Research Question Addressed in This Work
                                                  A      B
M-measure                                         0.5    0.4    0.5 > 0.4, so A > B (Which is higher?)
User preference (# of users who prefer A (B))     10     4      10 > 4, so A > B (Which is better?)
Same?

Overview of Data
napoleon
Queries
Documents
Web search
Born on the island of Corsica
Defeated at the Battle of Waterloo
Established legal equality and religious
toleration an innovator
iUnits
Extraction
Achievement
Skill
Career
Clustering
Intents
iUnit
summarization
Input
Input

• Queries
– 100 English/Japanese queries
– Most of which were ambiguous/underspecified
– Selected from four categories:
celebrity, location, definition, and QA (similar to NTCIR 1CLICK-2)
• Documents
– 500 commercial search engine results for each query
from which iUnits were extracted
Queries and Documents
CELEBRITY      LOCATION                DEFINITION       QA
hulk hogan     bank adelanto           bitcoin          what is mirror made of
bruno mars     cafe killeen            divers disease   how to cook coleslaw
sharon stone   cincinnati art museum   windows 7        role of animal tail
Examples

• Definition
– Atomic information pieces relevant to a given query
• The number of iUnits
– 2,317 (23.8 iUnits per query) for English
– 4,169 (41.7 iUnits per query) for Japanese
iUnits
Born on the island of Corsica
General of the Army of Italy
Defeated at the Battle of Waterloo
One of the most controversial political figures
Won at the Battle of Wagram
Baptised as a Catholic
Absent during Peninsular War
Cut off European trade with Britain
Examples of iUnits for query “Napoleon”

• An intent can be defined as
– A specific interpretation of an ambiguous query
(“Mac OS” and “car brand” for “jaguar”), or
– An aspect of a faceted query
(“windows 8” and “windows 10” for “windows”)
• Obtained by clustering iUnits
Intents
Achievement
Skill
Career
Born on the island of Corsica
Defeated at the Battle of Waterloo
Absent during Peninsular War
iUnits Intents
Clustering

• Importance of iUnits in terms of an intent
• Intent probability P(i|q)
– Probability of having intent i for a given query q
Per-intent iUnit Importance and Intent Probability
In terms of intent “Definition”
iUnit                                   Importance
A series of evaluation workshops        5
Task Registration Due 20/Jun./2016      3
In terms of intent “Schedule”
iUnit                                   Importance
A series of evaluation workshops        2
Task Registration Due 20/Jun./2016      5
Intent        Prob.
Definition    0.4
Schedule      0.3
Tasks         0.3
For details, see our MobileClick-2 overview paper
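The per-intent importance values and intent probabilities on this slide can be held in plain dictionaries. The following is a minimal sketch using only the toy values shown here; the names `IMPORTANCE`, `INTENT_PROB`, and `gain`, and the choice to return 0 for an unlisted iUnit or intent, are assumptions for illustration and not part of any MobileClick tooling.

```python
import math

# Per-intent iUnit importance, copied from the slide's toy example.
IMPORTANCE = {
    "Definition": {
        "A series of evaluation workshops": 5,
        "Task Registration Due 20/Jun./2016": 3,
    },
    "Schedule": {
        "A series of evaluation workshops": 2,
        "Task Registration Due 20/Jun./2016": 5,
    },
}

# Intent probability P(i|q), also copied from the slide.
INTENT_PROB = {"Definition": 0.4, "Schedule": 0.3, "Tasks": 0.3}


def gain(iunit: str, intent: str) -> int:
    """Importance of an iUnit in terms of one intent.

    Returning 0 for an unlisted iUnit or intent is an assumption.
    """
    return IMPORTANCE.get(intent, {}).get(iunit, 0)


# P(i|q) is a distribution over intents, so the values should sum to 1.
assert math.isclose(sum(INTENT_PROB.values()), 1.0)
```

The same iUnit can therefore carry a different gain depending on which intent the reader has in mind.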

• Consider single-layered summary evaluation
• U-measure [Sakai and Dou. SIGIR2013]
– Higher if more important iUnits appear earlier
Evaluation of iUnit Summarization (Single-layer Case)
[Figure: a summary with iUnits u1, u2 on the first line and u3 on the second line, and its trailtext (reading path) u1 (10 chars), u2 (5 chars), u3 (10 chars)]
Create a list of iUnits
by assuming that users
read text from left to right,
from top to bottom
U-measure
U = \sum_{r=1} G(u_r) \left( 1 - \frac{\mathrm{pos}(u_r)}{L} \right)
u_r: r-th iUnit
G(u): importance of u
pos(u): offset of u from the beginning
L: patience parameter
For the trailtext in the figure:
U = G(u_1)(1 - 10/L) + G(u_2)(1 - 15/L) + G(u_3)(1 - 25/L)
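The U-measure on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official evaluation script: `u_measure` is a hypothetical name, pos(u) is taken as the offset of the end of each iUnit (which matches the slide's offsets 10, 15, 25), the gains 3, 2, 1 are made up, and the discount is clamped at zero as in the form given on the later "Settings of M-measure" slide.

```python
from typing import List, Tuple


def u_measure(trailtext: List[Tuple[float, int]], patience: float) -> float:
    """U-measure of a trailtext given as (gain, length in characters) pairs.

    pos(u) is taken to be the offset of the END of each iUnit, which
    matches the slide's example (10, 15, 25 for lengths 10, 5, 10).
    """
    total, pos = 0.0, 0
    for gain, length in trailtext:
        pos += length
        # Clamping at 0 keeps iUnits beyond L characters from contributing
        # negative gain; it has no effect while pos <= L.
        total += gain * max(0.0, 1.0 - pos / patience)
    return total


# Lengths follow the slide; the gains 3, 2, 1 are made-up values.
example = [(3, 10), (2, 5), (1, 10)]
score = u_measure(example, patience=100)
```

With these values the score is 3(1 - 10/100) + 2(1 - 15/100) + 1(1 - 25/100) = 5.15; moving the most important iUnit later in the trailtext lowers it.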

• M-measure
– Expectation of U-measure over multiple trailtexts
M = \sum_{\mathbf{t}} P(\mathbf{t}) U(\mathbf{t})
1. Generate trailtexts by assuming that
– Users read a summary from the top of the first layer
– Users click on an intent if they are interested in it
M-measure
P(t): probability of trailtext t
U(t): U-measure of trailtext t
[Figure: a two-layered summary with link l1 and iUnits u1, u2, u3, u4]
Trailtext of a user interested in Intent 1 (P(i1|q)): u1 u2 u3 u4
Trailtext of a user interested in Intent 2 (P(i2|q)): u1 u2 u3

2. Compute the expectation of U-measure
Evaluation of iUnit Summarization (Two-layer Case)
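[Figure: a first layer with iUnits u1, u2, u3 and links l1, l2; the second layer of l1 holds u4, u5 and the second layer of l2 holds u6]
Trailtext (t) (reading path), its probability, and its U:
t1 = u1 u2 u3 u4 u5    P(t1) = P(i1|q) = 0.75    U(t1) = 0.44
t2 = u1 u2 u3 u6       P(t2) = P(i2|q) = 0.25    U(t2) = 0.12
(P(t2) = P(i2|q) because trailtext t2 is read by users interested in i2)
M-measure
M = \sum_{\mathbf{t}} P(\mathbf{t}) U(\mathbf{t}) = 0.36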
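The two steps (generate trailtexts, then take the expectation of U) can be sketched as follows, reusing the numbers from this slide. The function names are hypothetical, and `build_trailtexts` makes the simplifying assumption that a user reads the whole first layer before the second layer of the intent of interest, which matches the reading paths shown here but may not cover other link positions.

```python
import math
from typing import Dict, List


def build_trailtexts(first_layer: List[str],
                     second_layers: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """One trailtext per intent: the first layer, then that intent's second layer.

    Simplifying assumption: the link is followed after the whole first layer.
    """
    return {intent: first_layer + units for intent, units in second_layers.items()}


def m_measure(intent_prob: Dict[str, float], u_value: Dict[str, float]) -> float:
    """Expectation of U over trailtexts, taking P(t) = P(i|q)."""
    return sum(p * u_value[intent] for intent, p in intent_prob.items())


trailtexts = build_trailtexts(["u1", "u2", "u3"],
                              {"i1": ["u4", "u5"], "i2": ["u6"]})

# P(i|q) and U(t) are the values shown on the slide.
m = m_measure({"i1": 0.75, "i2": 0.25}, {"i1": 0.44, "i2": 0.12})
assert math.isclose(m, 0.36)  # 0.75 * 0.44 + 0.25 * 0.12
```

In a fuller implementation the U values would themselves be computed from each trailtext with the per-intent iUnit importance, rather than passed in.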

Pairwise Comparison
All possible pairs of 7 summaries for 25 queries
were presented to about 14 users
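As a rough size check on the pairwise comparison, the number of pairs can be enumerated directly. This sketch assumes unordered pairs (left/right placement is not counted separately) and a single language; the summary names `S1` to `S7` are placeholders.

```python
from itertools import combinations

NUM_SUMMARIES = 7
NUM_QUERIES = 25

# Placeholder names for the 7 summaries of one query.
summaries = [f"S{k}" for k in range(1, NUM_SUMMARIES + 1)]

# All unordered pairs of summaries for a single query.
pairs_per_query = list(combinations(summaries, 2))

# Pairs across all queries.
total_pairs = len(pairs_per_query) * NUM_QUERIES
```

Under these assumptions there are 21 pairs per query and 525 pairs over the 25 queries.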

• Users were asked to select either
the left one is better,
the right one is better,
equally good, or
equally bad
• Criteria:
(1) How much useful information you can get
from the summary, and
(2) How quickly you can get useful information
from the summary
Instruction in Pairwise Comparison

• 𝑳 of U-measure in M-measure
– U = \sum_{r=1} G(u_r) \max\left( 0, 1 - \frac{\mathrm{pos}(u_r)}{L} \right)
– 𝐿 is a patience parameter that controls how the
gain of iUnits decreases as the user reads the text
• Simple variants of M-measure
– Use only first layer
– Use only second layer
– Use a uniform distribution for 𝑃 𝑖 𝑞
Settings of M-measure
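[Figure: the discount 1 - pos(u_r)/L plotted against pos(u_r) for L = 100 and L = 200, reaching zero at pos(u_r) = 100 and 200 respectively, next to a summary with link l1 and iUnits u1 to u4]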
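The effect of the patience parameter can be seen by evaluating the discount term alone. A minimal sketch, where `discount` is a hypothetical helper name and the offsets 50, 100, 150 are chosen only for illustration:

```python
def discount(pos: float, patience: float) -> float:
    """Position-based discount max(0, 1 - pos / L) applied to an iUnit's gain."""
    return max(0.0, 1.0 - pos / patience)


# Compare the two patience values from the slide at a few offsets.
for pos in (50, 100, 150):
    print(pos, discount(pos, 100), discount(pos, 200))
```

An iUnit at offset 150 contributes nothing when L = 100 but still keeps a quarter of its gain when L = 200, so a larger L models a more patient reader.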

Interpretation of Results
[Figure: scatter plot. x-axis: Diff. of M-measure (M(A) - M(B)); y-axis: (Num. of votes for A) / (Total num. of votes)]
Each dot represents a pair of systems (A, B) for a particular query
Agree: A is better by both user preference and M-measure, or B is better by both
Disagree: user preference and M-measure favor different systems
Agreement = (#dots in Agree) / (#dots)
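The agreement rate described on this slide can be sketched as below. The function name, the 0.5 threshold on the vote share, the treatment of ties (counted as not agreeing), and the sample dots are all assumptions for illustration; the slide only states that agreement is the fraction of dots falling in the "Agree" regions.

```python
from typing import List, Tuple


def agreement(dots: List[Tuple[float, float]]) -> float:
    """Fraction of dots where users and the M-measure prefer the same system.

    Each dot is (share of votes for A, M(A) - M(B)) for one pair of
    systems on one query. Ties on either axis count as not agreeing.
    """
    if not dots:
        raise ValueError("at least one dot is required")
    agree = 0
    for vote_share_a, m_diff in dots:
        users_prefer_a = vote_share_a > 0.5
        users_prefer_b = vote_share_a < 0.5
        if (users_prefer_a and m_diff > 0) or (users_prefer_b and m_diff < 0):
            agree += 1
    return agree / len(dots)


# Made-up dots: two fall in the "Agree" regions, two in "Disagree".
dots = [(10 / 14, 0.10), (4 / 14, -0.05), (9 / 14, -0.02), (3 / 14, 0.20)]
rate = agreement(dots)
```

For these four made-up dots the rate is 0.5, the level expected when the measure carries no information about user preference.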

Experimental Results for Different Patience Parameters
[Figure: agreement plots for patience parameters L = 93.75, 750, 6000, 24000 and L = 31.25, 125, 2000, 8000, with panels labeled English and Japanese]
LOW agreement for LOW patience parameter (L=93.75)
HIGH agreement for HIGH patience parameter (L=24000)
Agreement is high (70-74%) for both languages

Experimental Results for Simple Variants of M-measure
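[Figure: agreement plots for the original M-measure and its simple variants, annotated Original, Worse, Slightly worse, and Close; the plots carry the labels 24000 and 2000]
Use of the second layer and intent probability
improves the agreement (but the first layer doesn’t)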
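To make the "uniform distribution for P(i|q)" variant concrete, the sketch below swaps the intent probabilities for a uniform distribution. It reuses the toy values from the earlier two-layer example slide (U = 0.44 and 0.12, P = 0.75 and 0.25); the function names are hypothetical and the resulting numbers are illustrative, not experimental results.

```python
from typing import Dict, Iterable


def m_measure(intent_prob: Dict[str, float], u_value: Dict[str, float]) -> float:
    """Expectation of U over trailtexts, taking P(t) = P(i|q)."""
    return sum(p * u_value[intent] for intent, p in intent_prob.items())


def uniform_intent_prob(intents: Iterable[str]) -> Dict[str, float]:
    """Variant: ignore the estimated P(i|q) and weight every intent equally."""
    names = list(intents)
    return {name: 1.0 / len(names) for name in names}


# Toy U values and intent probabilities from the two-layer example slide.
u_value = {"i1": 0.44, "i2": 0.12}
original = m_measure({"i1": 0.75, "i2": 0.25}, u_value)
uniform = m_measure(uniform_intent_prob(u_value), u_value)
```

Here the original M-measure is 0.36 and the uniform variant is 0.28: dropping the intent probabilities changes the score whenever the intents are not equally likely, which is one way the variant can rank summaries differently from users.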

• Possible explanations include
– The quality of the second layer correlates with the
quality of the whole summary
– Users decided the quality of the summary mainly
based on the second layer
• We asked the users to look at the second layer in the
assessment
Why did only the 2nd layer correlate well with the user pref.?

• Conclusions
– Proposed M-measure
• A special case of intent-aware U-measure for two-
layered summarization
– Measured the agreement between
M-measure and user preferences
• Agreement was high (70-74%)
• Future work
– Error analysis
– Address “why did only the second layer correlate
well with the user preferences?”
Conclusions and Future Work
