Ben Carterette — Advances in Information Retrieval Evaluation
 

Presentation Transcript

• System Effectiveness, User Models, and User Utility: A Conceptual Framework for Investigation
  Ben Carterette, University of Delaware
  carteret@cis.udel.edu

• Effectiveness Evaluation
  • Determine how good the system is at finding and ranking relevant documents
  • An effectiveness measure should be correlated with the user's experience
    – Value increases when the user experience gets better; decreases when it gets worse
  • Hence the interest in effectiveness measures based on explicit models of user interaction
    – RBP [Moffat & Zobel], DCG [Järvelin & Kekäläinen], ERR [Chapelle et al.], EBU [Yilmaz et al.], sessions [Kanoulas et al.], etc.

• Discounted Gain Model
  • Simple model of user interaction:
    – The user steps down the ranked results one by one
    – Gains something from relevant documents
    – Is increasingly less likely to see documents deeper in the ranking
  • Implementation of the model:
    – Gain is a function of relevance at rank k
    – Ranks k are increasingly discounted
    – Effectiveness = sum over ranks of gain times discount (sketched in code below)
  • Most measures can be made to fit this framework

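A minimal sketch of the gain-times-discount sum the slide describes. The particular gain and discount functions here (binary gain, a 1/log2(k+1) discount) are illustrative choices, not prescribed by the slide; the example ranking is the one used in the DCG example later in the deck.

```python
import math

def discounted_gain(relevances, gain, discount):
    """Effectiveness = sum over ranks k of gain(rel_k) * discount(k)."""
    return sum(gain(rel) * discount(k)
               for k, rel in enumerate(relevances, start=1))

# Illustrative choices: binary gain and a DCG-style log discount.
ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]   # relevance at ranks 1..10
score = discounted_gain(ranking,
                        gain=lambda rel: rel,
                        discount=lambda k: 1 / math.log2(k + 1))
print(round(score, 3))  # 2.689, matching the DCG example later in the deck
```
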
• Rank-Biased Precision [Moffat and Zobel, TOIS 2008]
  • Toss a biased coin (θ)
  • If HEADS, observe the next document
  • If TAILS, stop
  [Figure: a ranked result list (ranks 1-10, …) for the query "black powder ammunition"]

• Rank-Biased Precision
  • Example with θ = 0.8: draw a random number at each rank; continue while the draw is below θ (e.g. 0.532 < θ) and stop when it is not (e.g. 0.933 ≥ θ)
  [Figure: the ranked list for "black powder ammunition" annotated with the draws at each rank]

• Rank-Biased Precision
  [Figure: the RBP browsing model as a state diagram: Query → View Next Item (repeat) → Stop]

• Rank-Biased Precision

  RBP = (1 - \theta) \sum_{k=1}^{\infty} rel_k \, \theta^{k-1}
      = \sum_{k=1}^{\infty} rel_k \, \theta^{k-1} (1 - \theta)

  • Relevance is discounted by a geometric distribution

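A small sketch of RBP exactly as defined above; the ranking and θ = 0.8 (the value from the earlier example) are illustrative inputs.

```python
def rbp(relevances, theta):
    """RBP = (1 - theta) * sum_k rel_k * theta^(k-1)."""
    return (1 - theta) * sum(rel * theta ** (k - 1)
                             for k, rel in enumerate(relevances, start=1))

ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
print(rbp(ranking, theta=0.8))  # deeper relevant documents contribute less
```
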
• Discounted Cumulative Gain [Järvelin and Kekäläinen, SIGIR 2000]
  Query: "black powder ammunition"

  Rank   Relevance   Gain   Discounted score
    1    R           1      1
    2    R           1      0.63
    3    N           0      0
    4    N           0      0
    5    R           1      0.38
    6    R           1      0.35
    7    N           0      0
    8    R           1      0.31
    9    N           0      0
   10    N           0      0

  • Discount by rank: 1 / \log_2(r+1)
  • DCG = 2.689
  • NDCG = DCG / optDCG = 0.91

• Discounted Cumulative Gain

  DCG = \sum_{i=1}^{\infty} rel_i \cdot \frac{1}{\log_2(1+i)}

  [Figure: the 1/\log_2(1+i) discount curve over ranks 1-10 for the example ranking R R N N R R N R N N]

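A sketch reproducing the worked example from the previous slide: binary gain with the 1/log2(1+i) discount as given, and NDCG normalized by the ideal (relevance-sorted) reordering of the same list.

```python
import math

def dcg(relevances):
    """DCG = sum_i rel_i / log2(1 + i), with ranks i starting at 1."""
    return sum(rel / math.log2(1 + i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the DCG of the ideal (relevance-sorted) ranking."""
    return dcg(relevances) / dcg(sorted(relevances, reverse=True))

ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]   # R R N N R R N R N N
print(round(dcg(ranking), 3), round(ndcg(ranking), 2))  # 2.689 0.91
```
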
• Expected Reciprocal Rank [Chapelle et al., CIKM 2009]
  [Figure: browsing model for the query "black powder ammunition": Query → View Next Item (repeat) → Stop]

• Expected Reciprocal Rank
  [Figure: extended browsing model: after viewing each item the user judges "Relevant?" (highly / somewhat / no); sufficient relevance leads to Stop, otherwise the user views the next item]

• Models of Browsing Behavior
  • Position-based models: the chance of observing a document depends on the position of the document in the ranked list.
  • Cascade models: the chance of observing a document depends on its position as well as the relevance of the documents ranked above it.

• A More Formal Model
  • My claim: this implementation conflates at least four distinct models of user interaction
  • Formalize it a bit:
    – Change the rank discount to a stopping probability density P(k)
    – Change the gain function to either a utility function or a cost function
  • Then effectiveness = expected utility or cost over stopping points (see the sketch below):

  M = \sum_{k=1}^{\infty} f(k) P(k)

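The formal model as a higher-order function. The instantiation plugged in below (geometric stopping density, utility = relevance at the stopping rank) is the RBP interpretation described a few slides later, used here only to show the shape of the framework; the truncation point n is an assumption needed to approximate the infinite sum.

```python
def expected_value(f, P, n):
    """M = sum_{k=1}^{n} f(k) * P(k), truncating the infinite sum at n."""
    return sum(f(k) * P(k) for k in range(1, n + 1))

# RBP-style instantiation: geometric stopping density, utility = relevance
# of the document at the stopping rank.
theta = 0.8
ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
P = lambda k: theta ** (k - 1) * (1 - theta)              # P(k): stop at rank k
f = lambda k: ranking[k - 1] if k <= len(ranking) else 0  # f(k): rel at rank k
print(expected_value(f, P, n=1000))  # approximately RBP at theta = 0.8
```
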
• Our Framework
  • The components of a measure are:
    – the stopping rank probability P(k)
      • position-based vs. cascade is a feature of this distribution
    – the document utility model (binary relevance)
    – the utility accumulation model or cost model
  • We can test hypotheses about general properties of the stopping distribution and the utility/cost model
    – Instead of trying to evaluate every possible measure on its own, evaluate properties of the measure

• Model Families
  • Depending on these choices, we get four distinct families of user models
    – Each family is characterized by its utility/cost model
    – Within a family, there is freedom to choose P(k) and the document utility model
  • Model 1: expected utility at stopping point
  • Model 2: expected total utility
  • Model 3: expected cost
  • Model 4: expected total utility per unit cost

• Model 1: Expected Utility at Stopping Point
  • Exemplar: Rank-Biased Precision (RBP)

  RBP = (1 - \theta) \sum_{k=1}^{\infty} rel_k \, \theta^{k-1}
      = \sum_{k=1}^{\infty} rel_k \, \theta^{k-1} (1 - \theta)

  • Interpretation:
    – P(k) = the geometric density function
    – f(k) = relevance of the document at the stopping rank
    – Effectiveness = expected relevance at the stopping rank

• Model 2: Expected Total Utility
  • Instead of stopping probability, think about viewing probability:

  P(\text{view doc at } k) = \sum_{i=k}^{\infty} P(i) = F(k)

  • This fits in the discounted gain model framework:

  M = \sum_{k=1}^{\infty} rel_k F(k)

  • Does it fit in the expected utility framework?
    – Yes, and Discounted Cumulative Gain (DCG; Järvelin et al.) is the exemplar for this class

• Model 2: Expected Total Utility

  M = \sum_{k=1}^{\infty} rel_k F(k)
    = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} P(i)
    = \sum_{k=1}^{\infty} P(k) \sum_{i=1}^{k} rel_i
    = \sum_{k=1}^{\infty} R_k P(k)

  • f(k) = R_k (total summed relevance)
  • Let F_DCG(k) = 1/\log_2(k+1)
    – Then P_DCG(k) = F_DCG(k) - F_DCG(k+1) = 1/\log_2(k+1) - 1/\log_2(k+2)
  • Work the algebra backwards to show that this recovers binary-relevance DCG (if summing to infinity); a numeric check follows below

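A numeric check of this identity on an arbitrary example ranking. One detail the check has to handle (my addition, not on the slide): truncating the stopping-form sum at rank n leaves a tail of R_n · F_DCG(n+1), which is added back so both forms agree exactly for a finite ranking.

```python
import math

F = lambda k: 1 / math.log2(k + 1)   # F_DCG(k), viewing probability
P = lambda k: F(k) - F(k + 1)        # P_DCG(k), stopping probability

ranking = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
n = len(ranking)
R = [0]
for rel in ranking:
    R.append(R[-1] + rel)            # R_k = rel_1 + ... + rel_k

viewing_form = sum(rel * F(k) for k, rel in enumerate(ranking, start=1))
stopping_form = sum(R[k] * P(k) for k in range(1, n + 1)) + R[n] * F(n + 1)
print(round(viewing_form, 6) == round(stopping_form, 6))  # True
```
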
• Model 3: Expected Cost
  • The user stops with probability based on accumulated utility rather than rank alone
    – P(k) = P(R_k) if the document at rank k is relevant, 0 otherwise
  • Then f(k) models the cost of going to rank k
  • Exemplar measure: Expected Reciprocal Rank (ERR; Chapelle et al.), with binary relevance (sketched below)
    – P(k) = rel_k \cdot \theta^{R_k - 1} (1 - \theta)
    – 1/cost = f(k) = 1/k

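A sketch of binary-relevance ERR exactly as parameterized on this slide, with P(k) = rel_k · θ^(R_k − 1)(1 − θ) and f(k) = 1/k; the θ value is an arbitrary example.

```python
def err_binary(relevances, theta):
    """ERR = sum_k (1/k) * rel_k * theta^(R_k - 1) * (1 - theta),
    where R_k counts the relevant documents at ranks 1..k."""
    score, R = 0.0, 0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            R += 1
            score += (1 / k) * theta ** (R - 1) * (1 - theta)
    return score

print(err_binary([1, 1, 0, 0, 1, 1, 0, 1, 0, 0], theta=0.8))
```
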
• Model 4: Expected Utility per Unit Cost
  • The user considers the expected effort of further browsing after each relevant document:

  M = \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i)

  • Similar to the M2 family, manipulate algebraically:

  \sum_{k=1}^{\infty} rel_k \sum_{i=k}^{\infty} f(i) P(i)
    = \sum_{k=1}^{\infty} f(k) P(k) \sum_{i=1}^{k} rel_i
    = \sum_{k=1}^{\infty} f(k) R_k P(k)

• Model 4: Expected Utility per Unit Cost
  • When f(k) = 1/k, we get:

  M = \sum_{k=1}^{\infty} \text{prec@}k \cdot P(k)

  • Average Precision (AP) is the exemplar for this class (sketched below)
    – P(k) = rel_k / R
    – utility/cost = f(k) = prec@k

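AP in this expected-utility-per-unit-cost form: summing prec@k against the stopping density P(k) = rel_k / R, which reduces to the familiar average-precision formula. The ranking is an arbitrary example; the zero-relevance guard is my addition.

```python
def average_precision(relevances):
    """AP = sum_k prec@k * P(k), with P(k) = rel_k / R."""
    R = sum(relevances)            # total number of relevant documents
    if R == 0:
        return 0.0                 # guard: no relevant documents (assumption)
    score, R_k = 0.0, 0
    for k, rel in enumerate(relevances, start=1):
        R_k += rel
        score += (R_k / k) * (rel / R)   # P(k) nonzero only at relevant docs
    return score

print(round(average_precision([1, 1, 0, 0, 1, 1, 0, 1, 0, 0]), 3))
```
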
• Summary So Far
  • Four ways to turn a sum over gain times discount into an expectation over stopping ranks
    – M1, M2, M3, M4
  • Four exemplar measures from the IR literature
    – RBP, DCG, ERR, AP
  • Four stopping probability distributions
    – P_RBP, P_DCG, P_ERR, P_AP
    – Add two more:
      • P_RR(k) = 1/(k(k+1)), P_RRR(k) = 1/(R_k(R_k+1))

• Stopping Probability Densities
  [Figure: probability and cumulative probability of stopping, by rank (1-25), for the six densities: P_RBP(k) = \theta^{k-1}(1-\theta); P_DCG(k) = 1/\log_2(k+1) - 1/\log_2(k+2); P_RR(k) = 1/(k(k+1)); P_RRR(k) = 1/(R_k(R_k+1)); P_ERR(k) = rel_k \theta^{R_k-1}(1-\theta); P_AP(k) = rel_k/R]

• From Models to Measures
  • Six stopping probability distributions, four model families
  • Mix and match to create up to 24 new measures
    – Many of these are uninteresting: isomorphic to precision/recall, or constant-valued
    – 15 turn out to be interesting

    • Measures  
• Some Brief Asides
  • From geometric to reciprocal rank (worked below)
    – Integrate the geometric density with respect to the parameter θ
    – The result is 1/(k(k+1))
    – The cumulative form is approximately 1/k
  • Normalization
    – Every measure in the M2 family must be normalized by its maximum possible value
    – Other measures may not fall between 0 and 1

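As a quick check of the first aside: integrating the geometric stopping density over a uniform prior on θ is a standard Beta integral, and the tail sum of the result telescopes to the cumulative form.

```latex
\int_0^1 \theta^{k-1} (1 - \theta)\, d\theta
  = \frac{1}{k} - \frac{1}{k+1}
  = \frac{1}{k(k+1)},
\qquad
\sum_{i=k}^{\infty} \frac{1}{i(i+1)}
  = \sum_{i=k}^{\infty} \left( \frac{1}{i} - \frac{1}{i+1} \right)
  = \frac{1}{k}.
```
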
• Some Brief Asides
  • Rank cut-offs
    – The DCG formulation only works for n going to infinity
    – In reality we usually calculate DCG@K for small K
    – This fits our user model if we make a worst-case assumption about the relevance of documents below rank K

• Analyzing Measures
  • Some questions raised:
    – Are models based on utility better than models based on effort? (Hypothesis: no difference)
    – Are measures based on stopping probabilities better than measures based on viewing probabilities? (Hypothesis: the latter are more robust)
    – What properties should the stopping distribution have? (Hypothesis: fatter tail, static more robust)

• How to Analyze Measures
  • There are many possible ways, none widely accepted:
    – How well they correlate with user satisfaction
    – How robust they are to changes in the underlying data
    – How good they are for optimizing systems
    – How informative they are

• Fit to Click Logs
  • How well does a stopping distribution fit empirical click probabilities?
    – A click does not mean the end of a search
    – But we need some model of the stopping point, and a click is a decent proxy
  • A good fit may indicate a good stopping model

• Fit to Logged Clicks
  [Figure: log-log plot of stopping probability P(k) (1e-06 to 1e-02) against rank k (1 to 500), comparing the empirical click distribution with P_RBP(k) = \theta^{k-1}(1-\theta), P_RR(k) = 1/(k(k+1)), and P_DCG(k) = 1/\log_2(k+1) - 1/\log_2(k+2)]

• Robustness and Stability
  • How robust is the measure to changes in the underlying test collection data?
    – If one of the following changes:
      • topic sample
      • relevance judgments
      • pool depth of judgments
    – how different are the decisions about relative system effectiveness?

• Data
  • Three test collections plus evaluation data:
    – TREC-6 ad hoc: 50 topics, 72,270 judgments, 550,000-document corpus; 74 runs submitted to TREC
      • A second set of judgments from Waterloo
    – TREC 2006 Terabyte named page: 180 topics, 2,361 judgments, 25M-document corpus; 43 runs submitted to TREC
    – TREC 2009 Web ad hoc: 50 topics, 18,666 judgments, 500M-document corpus; 37 runs submitted to TREC

• Experimental Methodology
  • Pick some part of the collection to vary (e.g. judgments, topic sample size, pool depth)
  • Evaluate all submitted systems with TREC's gold-standard data
  • Evaluate all submitted systems with the modified data
  • Compare the first evaluation to the second using Kendall's tau rank correlation (see the sketch below)
  • Determine which properties are most robust
    – Model family, tail fatness, static/dynamic distribution

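A sketch of the comparison step, assuming each submitted run's effectiveness score under both judgment sets is already computed; the scores below are made-up examples, and scipy.stats.kendalltau supplies the rank correlation.

```python
from scipy.stats import kendalltau

# Hypothetical per-run scores: the same submitted systems evaluated first
# with TREC's gold-standard data, then with the modified data.
gold_eval     = [0.41, 0.38, 0.35, 0.33, 0.29, 0.27, 0.22]
modified_eval = [0.39, 0.40, 0.31, 0.34, 0.27, 0.24, 0.23]

tau, _ = kendalltau(gold_eval, modified_eval)
print(f"Kendall's tau = {tau:.3f}")  # tau = 1.0 would mean identical orderings
```
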
• Varying Assessments
  • Compare evaluation with TREC's judgments to evaluation with Waterloo's
  • Kendall's tau between the two evaluations, by stopping distribution (row means in parentheses):
    – static P_RBP: RBP = 0.813, RBTR = 0.816, RBAP = 0.801 (0.810)
    – static P_DCG: CDG = 0.831, DCG = 0.920, DAG = 0.819 (0.857)
    – static P_RR: RR = 0.859, RRG = 0.819, RAP = 0.812 (0.830)
    – dynamic P_ERR: ERR = 0.829, EPR = 0.836 (0.833)
    – dynamic P_AP: ARR = 0.847, AP = 0.896 (0.872)
    – dynamic P_RRR: 0.826, RRAP = 0.844 (0.835)
  • Tentative conclusions:
    – M2 most robust, followed by M3 (after removing the AP outlier)
    – Fatter-tailed distributions more robust
    – Dynamic a bit more robust than static

• Varying Topic Sample Size
  • Sample a subset of N topics from the original 50; evaluate systems over that set
  [Figure: mean Kendall's tau (0.5-1.0) against number of topics (10-40) for M1-M4; fat tail: P_DCG, P_AP; medium tail: P_RR, P_RRR; slim tail: P_RBP, P_ERR]

• Varying Pool Depth
  • Take only the judgments on documents appearing at ranks 1 to depth D in submitted systems
    – D = 1, 2, 4, 8, 16, 32, 64
  [Figure: mean Kendall's tau (0.5-1.0) against pool depth (1-64) for M1-M4]

• Conclusions
  • Fatter-tailed distributions are generally more robust
    – Maybe better for mitigating the risk of not satisfying tail users
  • M2 (expected total utility; DCG) is generally more robust
    – But does it model users better?
  • M3 (expected cost; ERR) is more robust than expected
  • M4 (expected utility per cost; AP) is not as robust as expected
    – AP is an outlier with a very fat tail
  • DCG may be based on a more realistic user model than commonly thought

• Conclusions
  • The gain-times-discount formulation conflates four distinct models of user behavior
  • Teasing these apart allows us to test hypotheses about general properties of measures
  • This is a conceptual framework: it organizes and describes measures in order to provide structure for reasoning about general properties
  • Hopefully it will provide directions for future research on evaluation measures