The document discusses lessons learned from the author's personal journey in search engineering. It covers insights from library science about treating search as an information-seeking context and communicating with users. It also discusses the importance of entity detection and how to leverage corpus features to improve extraction. The author realized that queries vary in difficulty and systems need to recognize this and adapt accordingly. The key takeaway is that search should be treated as a communication problem rather than just a ranking task.
20.
for i in [1..n]
    s ← w1 w2 … wi
    if Pc(s) > 0
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]
        for b in B[j]
            s ← wj+1 wj+2 … wi
            if Pc(s) > 0
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob · Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
People search for entities. Recognize them!
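The pseudocode above can be sketched in Python. This is an illustrative translation, not the talk's implementation: an empty-segmentation base case B[0] replaces the slide's separate handling of the whole-prefix segment, and the phrase scores in Pc are invented toy numbers.

```python
def segmentations(words, Pc, k=10):
    """Top-k query segmentations via the slide's dynamic program.

    B[i] holds the k most probable segmentations of words[0:i].
    Pc(s) scores a candidate segment (e.g., a normalized phrase
    frequency); a segmentation's probability is the product of its
    segments' scores.
    """
    n = len(words)
    B = [[] for _ in range(n + 1)]
    B[0] = [([], 1.0)]  # empty segmentation with probability 1
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):  # new segment covers words[j:i]
            s = " ".join(words[j:i])
            p = Pc(s)
            if p > 0:
                for segs, prob in B[j]:
                    candidates.append((segs + [s], prob * p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        B[i] = candidates[:k]  # keep only the k best
    return B[n]

# Toy phrase probabilities, purely illustrative.
PHRASES = {"new york": 0.3, "new": 0.1, "york": 0.1, "times": 0.2,
           "new york times": 0.25}
top = segmentations("new york times".split(),
                    lambda s: PHRASES.get(s, 0.0))
```

With these toy scores the single-segment reading "new york times" outranks "new york" + "times".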
22. Problem: entity detection systems process each document separately. Why not take advantage of corpus features?
23. Give your documents the right to vote!
Use a high-recall method to collect candidates.
• e.g., all title-case spans of words other than a single word beginning a sentence.
Process each document separately.
• Each candidate is assigned an entity type, or no type at all.
If a candidate is mostly assigned a single entity type, extrapolate to all its occurrences.
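The voting step above can be sketched in a few lines. A minimal sketch, with assumptions: the "mostly" criterion is taken to be a simple >50% majority, and the labels and type names are invented.

```python
from collections import Counter, defaultdict

def vote_entity_types(per_doc_labels, majority=0.5):
    """Let the documents vote on each candidate's entity type.

    per_doc_labels: (candidate, type_or_None) pairs emitted by a
    high-recall collector run on each document separately. If a
    candidate is mostly assigned a single type, extrapolate that
    type to all of its occurrences.
    """
    votes = defaultdict(Counter)
    for candidate, etype in per_doc_labels:
        votes[candidate][etype] += 1
    decided = {}
    for candidate, counter in votes.items():
        etype, count = counter.most_common(1)[0]
        # Extrapolate only when one real type wins a clear majority.
        if etype is not None and count / sum(counter.values()) > majority:
            decided[candidate] = etype
    return decided

# Hypothetical per-document labels: the local detector misses one
# "San Francisco" occurrence and disagrees with itself on "Paris".
labels = [("San Francisco", "LOC"), ("San Francisco", "LOC"),
          ("San Francisco", None), ("Paris", "LOC"), ("Paris", "PER")]
decided = vote_entity_types(labels)
```

The corpus-level vote recovers the missed "San Francisco" occurrence, while the split vote on "Paris" leaves it undecided.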
24. Looking for topics? Use idf, and its cousin ridf.
Inverse document frequency (idf)
• Too low? Probably a stop word.
• Too high? Could be noise.
Residual inverse document frequency (ridf)
• Predict idf using a Poisson model.
• ridf is the difference between observed idf and predicted idf.
"a good keyword is far from Poisson" [Church and Gale, 1995]
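Residual idf can be computed directly: predict idf from a Poisson model in which a term's occurrences scatter independently across documents, then take the difference. A minimal sketch, with invented collection statistics:

```python
import math

def idf(df, D):
    """Inverse document frequency: -log2(df / D)."""
    return -math.log2(df / D)

def ridf(df, cf, D):
    """Residual idf: observed idf minus Poisson-predicted idf.

    A Poisson model with rate cf / D (collection frequency over the
    number of documents) predicts that the term appears in
    D * (1 - e^(-cf/D)) documents. Good keywords are burstier than
    Poisson: their observed df is lower than predicted, so their
    ridf is high.
    """
    predicted_df_fraction = 1 - math.exp(-cf / D)
    return idf(df, D) + math.log2(predicted_df_fraction)

# Invented statistics: two terms with the same collection frequency,
# one concentrated in a few documents, one spread thinly.
D, cf = 100_000, 1_000
bursty = ridf(df=100, cf=cf, D=D)   # topical term: high ridf
diffuse = ridf(df=990, cf=cf, D=D)  # Poisson-like term: ridf near zero
```

Both terms occur 1,000 times, so plain frequency cannot separate them; ridf can.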
34.
We can segment the information need from the query.
36. And we can look at our relevance scores.
[Score distribution plots: Navigational vs. Exploratory]
37. There are many pre- and post-retrieval signals. [Claudia Hauff, Query Difficulty for Digital Libraries, 2009]
38. Take-away for search engine developers: queries vary in difficulty. Recognize and adapt.
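Two illustrative difficulty signals, one pre-retrieval and one post-retrieval. The specific formulas and the example numbers are assumptions for the sketch, not the talk's recommendations; the literature offers many variants.

```python
import math
import statistics

def avg_idf(query_terms, df, D):
    """Pre-retrieval signal: mean idf of the query's terms.

    df maps each term to its document frequency; D is the collection
    size. A query made entirely of common words tends to be harder.
    """
    return statistics.mean(-math.log2(df.get(t, 1) / D)
                           for t in query_terms)

def top_score_drop(scores):
    """Post-retrieval signal: relative drop from the top score to the
    mean of the remaining scores. A sharp drop suggests a navigational
    query with one clear answer; a flat curve suggests an exploratory
    query.
    """
    best, rest = scores[0], scores[1:]
    return (best - statistics.mean(rest)) / best

# Invented examples on both sides of each signal.
common = avg_idf(["the", "who"], {"the": 90_000, "who": 50_000},
                 D=100_000)
rare = avg_idf(["coltrane"], {"coltrane": 50}, D=100_000)
navigational = top_score_drop([9.5, 3.1, 2.9, 2.8, 2.7])
exploratory = top_score_drop([4.1, 4.0, 3.9, 3.9, 3.8])
```

A system that computes signals like these before and after retrieval can route hard queries to heavier processing, which is one way to "recognize and adapt."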
39. Review
1. Lessons from Library Science
• Act like a librarian. Communicate with users.
2. Adventures with Information Extraction
• Entity detection is crucial. And it isn't that hard.
3. A Moment of Clarity
• Queries vary in difficulty. Recognize and adapt.