The Water Filling Model and The Cube Test: Multi-Dimensional Evaluation for Professional Search (CIKM 2013)

THE WATER FILLING MODEL AND
THE CUBE TEST:
Multi-Dimensional Evaluation
for Professional Search
Jiyun Luo1 Christopher Wing1 Grace Hui Yang1
Marti A. Hearst2
1Department of Computer Science
Georgetown University
Washington, DC, USA
{jl1749, cpw26}@georgetown.edu
huiyang@cs.georgetown.edu
CIKM 2013
2School of Information
University of California, Berkeley
Berkeley, CA, USA
hearst@berkeley.edu

INTRODUCTION
¢  Complicated search has recently received much
attention
¢  Professional search activities are usually
complicated search tasks
—  Examples: Medical record search, Legal search,
Patent prior art search
¢  Evaluation metrics need to reflect this complexity
—  U-measure for whole-session evaluation [Sakai et al., SIGIR’13]
—  Time-biased gain [Smucker and Clarke, SIGIR’12]
—  α-nDCG for diversity and novelty [Clarke et al., SIGIR’08]
—  PRES for recall-oriented search tasks [Magdy and Jones, SIGIR’10]

PROFESSIONAL SEARCH
¢  Rich information needs
—  Multiple aspects or subtopics
¢  Time-sensitive
—  Professional searchers, e.g., lawyers, do not want to wade
through irrelevant documents; it is a myth that, being paid by
the hour, they care only about recall
¢  Novelty
—  Once one relevant document has been examined, subsequent
relevant documents are perceived as less relevant
¢  Stopping criteria
—  Once a sub-information-need has been fulfilled, further relevant
documents about it contribute little
¢  A mix of unranked and ranked retrieval
—  Boolean search and proximity search are still popular

Patent Prior Art Search
Example patent application: Fenestration Segment Stent-Graft and Fenestration Method
US 20090259290 A1
ABSTRACT
A method includes deploying a fenestration
segment stent-graft into a main vessel such
that a fenestration section …
1. A fenestration segment stent-graft comprising : a proximal
section comprising a woven graft cloth; …
2. The fenestration segment stent-graft of claim 1 wherein said
proximal section comprises a proximal end and a distal end, …
attachment means comprises stitching.
…
20. A fenestration segment stent-graft comprising : a
proximal section; a distal section; …
fenestration section comprises : graft material comprising loose woven
fibers…
Claims
Looking for published literature that can be
used to “say no” to a patent application. A
granted patent should be novel and non-trivial.
Ø  Time constraint: less than 6 hours
Figure labels: each claim is marked as Independent or Dependent

We draw an analogy between Professional Search
and Filling Water into a Cube

Professional Search
¢  Information need with multiple subtopics
¢  Goal: fulfill the information need with relevant documents as soon as possible
¢  A document can cover different subtopics
¢  Stop finding more relevant documents for a subtopic or for the entire information need

The Water-filling Model
¢  A cube with multiple segments
¢  Goal: fill up the cube with water as soon as possible
¢  “Document water” can flow into different segments
¢  Once a segment reaches its cap, no more water can go there

How do we judge whether a search system is good?
Ø  We assume the searcher wants the multiple subtopics of a task
to be fulfilled as quickly as possible and as fully as possible

The Task Cube
Ø  The Cube with unit length
represents the entire
information need
Ø  Each cuboid in the Cube
represents a subtopic
Ø  The top of the Cube is the
cap that limits the maximum
amount of relevant
information needed
Ø  Stopping criterion
Ø  The bottom is segmented into different areas.
Ø  The area size indicates the importance of each
subtopic.
Ø  E.g. in prior art search, independent claims are
assigned more weights than dependent claims
6
An empty
task cube for
a search task
with 6 subtopics

The Water Filling Model
Ø  Each newly examined relevant
document raises the water level
in every subtopic it is relevant to
Ø  The height increment is the
relevance gain from that
document with regard to that
subtopic
Ø  The total height of the water
in one cuboid represents the
accumulated relevance gain
for a subtopic
Ø  Total volume in the task
Cube is the total Gain

The Cube Test
Ø  Based on the water-filling model, we design
a new multi-dimensional evaluation metric
for professional search: the Cube Test (CT)
Ø  CT calculates the rate at which a search
system fills up the task cube
Ø  It is a speed function

The Gain Function
$$Gain(Q, d_j) = \sum_i area_i \times height_{i,j} \times KeepFilling_i$$
Ø  Document $d_j$’s gain is calculated as the
volume of relevant “document water” that it
adds across all matching subtopics in the task cube.
Ø  A more concrete equation:
$$Gain(Q, d_j) = \sum_i \Gamma \, \theta_i \, rel(d_j, c_i) \, I\left(\sum_{k=1}^{j-1} rel(d_k, c_i) < MaxHeight\right)$$
where - $\Gamma$ is a discounting factor for subtopic novelty, $\Gamma = \gamma^{nrel(c_i, j-1)}$,
where $nrel(c_i, j-1)$ is the number of relevant documents for subtopic $c_i$ among the
previously examined documents ($d_1$ to $d_{j-1}$),
- $\theta_i$ is the importance of the $i$-th subtopic, $\sum_i \theta_i = 1$,
- $rel(d_j, c_i)$ is the water height, i.e., document $d_j$’s
relevance grade towards subtopic $c_i$,
- $I$ is the indicator function,
- $MaxHeight$ is the cap for subtopic relevance (set to 1).
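
To make the definitions concrete, here is a minimal Python sketch of the per-document gain. It assumes the concrete form Gain(Q, d_j) = Σ_i Γ θ_i rel(d_j, c_i) I(accumulated relevance of c_i in d_1..d_{j-1} < MaxHeight) with Γ = γ^nrel(c_i, j-1); in particular, the argument of the indicator is an assumption here. The function name `document_gains` and the example values are hypothetical.

```python
from typing import List, Sequence

def document_gains(run: Sequence[Sequence[float]],
                   theta: Sequence[float],
                   gamma: float = 0.5,
                   max_height: float = 1.0) -> List[float]:
    """Return Gain(Q, d_j) for each document of `run`, in examination order.

    run[j][i] is rel(d_j, c_i); theta[i] is the importance of subtopic c_i.
    """
    height = [0.0] * len(theta)  # accumulated relevance per subtopic
    nrel = [0] * len(theta)      # relevant documents seen so far per subtopic
    gains = []
    for doc in run:
        gain = 0.0
        for i, rel in enumerate(doc):
            if rel <= 0:
                continue
            # Assumed indicator: keep filling while the cap is not reached.
            if height[i] < max_height:
                gain += (gamma ** nrel[i]) * theta[i] * rel
            # Update after scoring, so d_j only sees d_1..d_{j-1}.
            height[i] += rel
            nrel[i] += 1
        gains.append(gain)
    return gains

# Hypothetical example: two subtopics, three documents.
theta = [0.6, 0.4]
run = [[0.5, 0.0],   # d_1: first relevant document for subtopic 0
       [0.5, 0.5],   # d_2: discounted on subtopic 0, fresh on subtopic 1
       [0.5, 0.0]]   # d_3: subtopic 0 has already reached the cap
gains = document_gains(run, theta, gamma=0.5)
```

In this example the third document earns nothing, because the two earlier documents have already filled subtopic 0 up to MaxHeight.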

The Total Gain Function
Ø  Total gain for the list $D$ of documents that
have been examined:
$$Gain(Q, D) = \sum_{j=1}^{|D|} Gain(Q, d_j)$$
Ø  Note that it does not assume any
traversal order
Ø  It does not even assume ranked
retrieval
Ø  This allows us to support both ranked
and unranked retrieval, or a mix of
them

The Cube Test - Recap
Ø  It is a speed function: total gain divided by time
$$CT(Q, D) = \frac{Gain(Q, D)}{Time(D)}$$
Ø  The time function is the amount of time taken from the
beginning up to the $t$-th document; it can be
Ø  actual reading time
Ø  a formulation similar to TBG [Smucker &
Clarke, SIGIR’12], taking into account document length $l_j$:
$$\sum_{j=1}^{t} \left( 4.4 + r_i \times (0.018\, l_j + 7.8) \right)$$
Ø  or simply the number of documents examined so far

EXPERIMENTS
Datasets
USPTO
•  It consists of three million US patent applications and
publications from 2001 to 2013 in XML with images removed.
•  We created 33 runs for 49 prior art finding tasks.
•  Office actions written by US Patent Examiners are parsed
and the ground truth is extracted automatically from them
(PublicPair)
CLEF-IP 2012
•  XML patent documents from the European Patent Office
(EPO) prior to 2002 and 400,000+ documents published by
the World Intellectual Property Organization (WIPO).
•  We evaluate the 31 official runs from 5 teams who
participated in CLEF-IP 2012.

Discriminative Power
Ø  We compare the new metric with
a few well-known metrics:
•  Recall
•  I-rec [Sakai et al., EVIA’10]
•  nDCG
•  α-nDCG [Clarke et al., SIGIR’08]
•  PRES [Magdy and Jones, SIGIR’10]
•  MAP
•  TBG [Smucker & Clarke, SIGIR’12]
•  nERR-IA [Sakai & Song, SIGIR’11]
Ø  We evaluate the evaluation metrics
by their discriminative power
[Sakai, SIGIR’06]
Ø  We test a few variations of CT
Ø  In the CLEF-IP dataset, all CT
metrics show high
discriminative power.
Ø  For the USPTO dataset, Recall and
I-rec show the best discriminative
power. CT metrics show good
discriminative power.

Tradeoff between coverage and single relevance
Ø  CT is able to adjust its bias between
recall-oriented tasks and precision-
oriented tasks
Ø  We create two artificial runs
Ø  coverage run It arranges relevant
documents to each subtopic in a round-
robin fashion.
Ø  single relevance run It puts all relevant
documents ordered by rel(d, ci) for a
subtopic first, then for the next subtopic.
CT vs. γ for the coverage run
CT vs. γ for the single
relevance run
The novelty discount base γ ranges in
[0.1,0.9].
When γ is small, CT has a big novelty
discount, is biased towards coverage and
rewards more for runs that spread relevant
documents across different subtopics;
When γ is big, CT is biased towards precision
and rewards more for runs that produce highly
relevant documents early.
14
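
The two artificial runs can be mimicked on toy data to see how γ shifts the balance. The sketch below is not the experiment from the talk: the relevance grades, the equal subtopic weights, the cutoff of three documents, the use of document count as time, and the helper names `one_hot` and `ct_at` are all assumptions made for illustration.

```python
from itertools import zip_longest
from typing import List, Sequence

def one_hot(i: int, rel: float, n: int) -> List[float]:
    """A document relevant to subtopic i only, with grade rel."""
    vec = [0.0] * n
    vec[i] = rel
    return vec

def ct_at(run: Sequence[Sequence[float]], theta: Sequence[float],
          gamma: float, cutoff: int, max_height: float = 1.0) -> float:
    """Assumed CT at a cutoff: discounted gain of the first `cutoff`
    documents divided by the number of documents examined."""
    height = [0.0] * len(theta)
    nrel = [0] * len(theta)
    total = 0.0
    for doc in run[:cutoff]:
        for i, rel in enumerate(doc):
            if rel <= 0:
                continue
            if height[i] < max_height:      # keep filling below the cap
                total += (gamma ** nrel[i]) * theta[i] * rel
            height[i] += rel
            nrel[i] += 1
    return total / cutoff

# Hypothetical judgments: relevance grades of the relevant documents per subtopic.
per_subtopic = [[0.4, 0.3, 0.2], [0.4, 0.3, 0.2]]
theta = [0.5, 0.5]
n = len(per_subtopic)

# Single relevance run: all documents of subtopic 0 by grade, then subtopic 1.
single_run = [one_hot(i, rel, n)
              for i, grades in enumerate(per_subtopic)
              for rel in sorted(grades, reverse=True)]

# Coverage run: round-robin over subtopics.
coverage_run = [one_hot(i, rel, n)
                for rank in zip_longest(*per_subtopic)
                for i, rel in enumerate(rank) if rel is not None]

for gamma in (0.1, 0.5, 0.9):
    cov = ct_at(coverage_run, theta, gamma, cutoff=3)
    sing = ct_at(single_run, theta, gamma, cutoff=3)
    print(f"gamma={gamma}: coverage={cov:.3f} single={sing:.3f} ratio={sing / cov:.2f}")
```

On this toy data the coverage run scores higher at every γ, but the single relevance run closes part of the gap as γ grows, which is in line with the bias described above.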

Conclusions
Ø  This paper presents a novel evaluation metric (the Cube
Test), based on a novel utility model (the water filling model)
Ø  It addresses several important dimensions in professional
search, and in complicated search in general
Ø  Covers different aspects or subtopics
Ø  Subtopics no need to be equally important
Ø  Allows for single document to cover several subtopics
Ø  Is time-sensitive
Ø  Handles the stopping criterion
Ø  Adding more relevant documents to certain subtopic
will not help to improve the overall gain
Ø  Expresses the tradeoff between time, quality of
documents, and diverse coverage of subtopics
15
Acknowledgments: Portions of this work were conducted to explore
new concepts under the umbrella of a larger project at the US Patent
and Trademark Office.

THANK YOU
