Conversion Models: A Systematic Method of Building Learning to Rank Training Data - Doug Turnbull, OpenSource Connections

Conversion Models
ABSOLUTELY AMAZING learning to rank training data ?
Activate 2019
Discount Code ctwact19 for 40% off!
Doug Turnbull, http://o19s.com
WE'RE HIRING!

Relevance Cornucopia🦃 Training Event:
http://o19s.com/blog/2019/09/11/announcing-relevance-cornucopia/
(Early Bird (gobble gobble) till end of Sept)
● Week of Nov 10
● "Think Like a Relevance Engineer" for Solr or Elasticsearch
● "Learning to Rank" & "Natural Language Search" training
● Delivered by our crack team of expert relevance consultants!

What I'm currently up to...
THEY'RE
HIRING!
(see Dennis Chaney's talk)
https://www.lexisnexis.com/en-us/about-us/careers.page

Outline
1. What holds orgs back from AI-Powered Search?
2. Click Models help?
3. Click Models for The Rest of Us

What holds orgs back from this?
http://aipoweredsearch.com
Discount Code: ctwact19

How most 'Machine Learning Search' Projects Fail
Garbage Training Data In → Our Jerk-face AI Search → Garbage Results Out

This difficulty is a major theme in our community
● From User Actions to Better Rankings, Agnes Van Belle, Haystack EU 2018
● Learning Learning To Rank, Torsten Köster, Fabian Klenk & René Kriegler, Haystack EU 2018
● Learning to rank (LTR) in an Activity Marketplace, Ashraf Aaref & Felipe Besson, MICES 2018
Through 4 iterations of LtR: "Consistent theme of being hindered by judgment quality"
V1 LTR model failed. We need to "Redefine our criteria for measuring relevance" and "Judge the judgements very often"
(entire talk about this problem)

First: what is the training data?
Judgment List:
grade,keywords,docId
4,Rambo,7555 # Rambo
3,Rambo,1370 # Rambo III
0,Rambo,102947 # First Daughter
4,Rocky,1366 # Rocky
...
Doc 7555 is perfectly relevant for query "Rambo"
Doc 102947 very irrelevant for "Rambo"
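A minimal sketch of reading a judgment list like the one on this slide into a lookup table. The function name `load_judgments` and the nested-dict layout are assumptions made for illustration; the only thing taken from the slide is the `grade,keywords,docId` column order and the trailing `# title` comments.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the slide; the trailing "# ..." text is the doc title.
JUDGMENTS_CSV = """grade,keywords,docId
4,Rambo,7555 # Rambo
3,Rambo,1370 # Rambo III
0,Rambo,102947 # First Daughter
4,Rocky,1366 # Rocky
"""

def load_judgments(text):
    """Return {keywords: {doc_id: grade}} from judgment-list CSV text."""
    judgments = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        doc_id = row["docId"].split("#")[0].strip()  # drop the title comment
        judgments[row["keywords"]][doc_id] = int(row["grade"])
    return dict(judgments)

judgments = load_judgments(JUDGMENTS_CSV)
print(judgments["Rambo"])  # {'7555': 4, '1370': 3, '102947': 0}
```

Keying by query first makes it easy to pull out all graded docs for one query, which is what both offline metrics and LtR training need.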

Measuring how good is search...
...
Our
Search
Solution
Keywords   NDCG@5   ERR@5
Rambo      0.95     0.56
Rocky      0.58     0.21
Offline testing: How is our tuning going?
Rambo: going pretty good!
Rocky: not so great… let's focus here
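To make the NDCG column concrete, here is a small sketch of NDCG@k computed from graded judgments. It assumes the common exponential-gain form of DCG, (2^grade - 1) / log2(rank + 1) with 1-based ranks; other gain and discount choices exist, and the talk does not say which variant produced the numbers above. The function names are hypothetical.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades (in rank order)."""
    return sum((2 ** g - 1) / math.log2(rank + 2)  # rank is 0-based here
               for rank, g in enumerate(grades[:k]))

def ndcg_at_k(ranked_grades, judged_grades, k=5):
    """DCG of the returned ranking divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(judged_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0

# Grades for "Rambo" from the judgment list: 7555 -> 4, 1370 -> 3, 102947 -> 0
judged = [4, 3, 0]
perfect = ndcg_at_k([4, 3, 0], judged)  # search returned the ideal order
flipped = ndcg_at_k([0, 3, 4], judged)  # irrelevant doc ranked first
print(perfect, round(flipped, 3))
```

The metric is only as trustworthy as the grades fed into it, which is the point the rest of the talk builds on.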

… and for training Learning to Rank
...
Our
LtR
Model
Keywords NDCG@5
Rambo 0.95
Rocky 0.58
Train model
Judgments are training data...
Analyze
Results
Elite
Search
Team

Of course there are manual judgments
http://github.com/o19s/quepid
For a good talk on a robust human judgment program, see Tito Sierra and Tara
Diedrichson's Haystack Talk "Making the Case for Human Judgment Relevance
Testing" https://haystackconf.com/2019/human-judgement/
(Usually not enough data for LtR training data)

For LtR: use implicit data from user behavior
Less
'Opinion'?

How to do this - maybe something like this!?
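# Naive first attempt: grade a query/doc pair by the deepest event seen
def naive_grade(shown, clicked=False, dwell_secs=0.0, purchased=False):
    if purchased:
        return 4
    if clicked and dwell_secs >= 5:
        return 3
    if clicked:
        return 2
    if shown:  # shown, but not clicked
        return 1
    return None  # never shown: no evidence either way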
Clickstream
...
Is this a good approach?
Thoughts?

Self reinforcing bad search
Search
Engine
'Santa Claus Conquers
Martians' most relevant!
Users only interact with what
the search engine shows them
ML reinforces search's current
(bad?) behavior
Position bias: 'Santa Claus…' clicked more as it's in position 1
Presentation bias: where is "The Martian"?
q=stuck on mars

Domain-specific considerations
● Lack of a clear 'Conversion' - what if this is just IMDB getting info on the movie? What if users just want to research an expensive purchase first?
● What are YOUR users' goals? Shopping vs research vs known-item search vs passive browsing vs … all have different fingerprints
● UI layout? How does a grid vs a list influence users' click behavior? What about a chat-bot system or Alexa-style question answering!??
● 'Good Abandonments' - what if your snippets answer the user's question without them clicking on a thing!

How you get judgments is a model too!
Your
Intuition
<your assumptions go
here>
Clickstream
...

This means when you hear...
"I think that clicking and spending > 5 seconds on the page indicates relevant document!"
"I think that we should oversample clicks farther down the page to compensate for position bias"
"Carefully inspecting the product is an indication of relevance"

NDCG - but based on what judgment methodology?
"We improved NDCG 20% through X ML search technique!"
(Overconfident search consultant)

We need to study these models too
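[Diagram: the clickstream feeds two judgment aggregation approaches, Solution 1 and Solution 2. Each produces a judgment list, and users are shown a hard-coded ranking corresponding to that judgment list: Hard-Coded Ranking 1 as variant A, Hard-Coded Ranking 2 as variant B.]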
- A/B test the judgment system
- Consensus with other judgment systems (i.e. manual)
- Continue to evolve & improve

This is why this is so hard
- Search behaviors / UIs constantly evolving
- Your domain & product considerations dominate
- SERP UIs have biases

What is a click model
CLICKS
q=waffle maker
So hot right now
Really really really
ridiculously good
looking
What is this? A search
result for q=ANTS?
Click Models for Web Search by Chuklin, Markov, de Rijke
https://www.morganclaypool.com/doi/abs/10.2200/S00654ED1V01Y201507ICR043

Attractiveness vs Satisfaction
Attractiveness ~ Perceived Relevance, denoted 'A'
The snippet *looked* useful/interesting for what I need - tied to clicks
All click models provide A
≠
Satisfaction ~ Actual Relevance, denoted 'S'
The document satisfied my information need
Some click models attempt S

CTR: The World's Dumbest Click Model
A=0.45
A=0.25
A=0.15
(we know this is dominated by position bias)
So Hot Right Now
CTR / Avg Posn CTR: The World's Second Simplest Click Model
(aka COEC - clicks over expected clicks)
A = (this query's CTR for posn 1) / (avg CTR for posn 1 over all queries)
A = 0.45 / 0.50 = 0.9
A = 0.25 / 0.20 = 1.25
A = 0.15 / 0.16 = 0.9375
So Hot Right Now
Personalized Click Prediction in Sponsored Search, Cheng, Cantu-Paz
http://www.wsdm-conference.org/2010/proceedings/docs/p351.pdf
Probabilistic Models ~ e.g. Position Based Model
● $C_d$: document d clicked (observed)
● $E_d$: user examined document d
● $A_d$: user found doc d attractive
● $\gamma_r$: examine probability for rank r, across all queries
● $\alpha_{dq}$: attractiveness for doc d, query q
$P(C_d) = P(E_d) \cdot P(A_d) \sim \gamma_r \cdot \alpha_{dq}$
(rank examine prob times doc attractiveness for query)

PBM ~ Two Unknowns, One Equation
$P(C_d) \sim \gamma_r \cdot \alpha_{dq}$
● $\gamma_r$: find best examine for observed clicks
● $\alpha_{dq}$: find best attractiveness for doc/query pair
Assumptions:
● It's definitely examined, $P(E_d)=1$, if it's clicked!
● It's definitely attractive, $P(A_d)=1$, if it's clicked!
● Unlikely something was examined if users never click on that position (or is the document unattractive?)
● Unlikely something is attractive if users seem to examine that position (see posn clicked a lot) but don't click this particular document

Assumptions -> TERRIFYING MAAAAAATH!!!
Iteratively improve attractiveness & examine probabilities over the search sessions until they converge to the most likely values (t - iteration)
For each session with query/doc pair:
● Clicked: follows the 'assumptions'
● Not clicked: then probably not attractive if this posn is examined a lot (trust me 😊)
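The iteration described above is expectation-maximization. Below is a minimal sketch of the standard EM updates for the Position Based Model, following the form given in Chuklin, Markov and de Rijke's book; the function name, the session format, and the 0.5 starting values are assumptions for illustration, and this is not the speaker's code.

```python
from collections import defaultdict

def fit_pbm(sessions, iterations=50):
    """EM for the Position Based Model.

    sessions: list of (query, [(doc_id, clicked), ...]), results in rank order.
    Returns (alpha, gamma): attractiveness per (query, doc), examine prob per rank.
    """
    alpha = defaultdict(lambda: 0.5)  # initial guess for every query/doc pair
    gamma = defaultdict(lambda: 0.5)  # initial guess for every rank
    for _ in range(iterations):
        a_sum, a_cnt = defaultdict(float), defaultdict(int)
        g_sum, g_cnt = defaultdict(float), defaultdict(int)
        for query, results in sessions:
            for rank, (doc, clicked) in enumerate(results):
                a, g = alpha[(query, doc)], gamma[rank]
                if clicked:
                    # A click implies the doc was both examined and attractive.
                    p_attr = p_exam = 1.0
                else:
                    # Posteriors given no click, where P(no click) = 1 - g * a.
                    no_click = max(1.0 - g * a, 1e-12)
                    p_attr = (1.0 - g) * a / no_click
                    p_exam = g * (1.0 - a) / no_click
                a_sum[(query, doc)] += p_attr
                a_cnt[(query, doc)] += 1
                g_sum[rank] += p_exam
                g_cnt[rank] += 1
        # New estimates are the averages of the per-session posteriors.
        alpha = {key: a_sum[key] / a_cnt[key] for key in a_sum}
        gamma = {rank: g_sum[rank] / g_cnt[rank] for rank in g_sum}
    return alpha, gamma

# Toy data: docs "x" and "y" swap positions; "x" is clicked twice as often.
sessions = [("q", [("x", i < 8), ("y", i < 2)]) for i in range(10)]
sessions += [("q", [("y", i < 4), ("x", i < 4)]) for i in range(10)]
alpha, gamma = fit_pbm(sessions)
```

Clicks only pin down the product of examine probability and attractiveness, so the relative ordering of the estimates is more meaningful than their absolute values.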

Solving for satisfaction
Shoutout: Solving for Satisfaction, Liz Haubert
https://haystackconf.com/2019/satisfaction/

Dynamic Bayesian Network
A Dynamic Bayesian Network Click Model for Web Search Ranking by Chapelle, Zhang
http://olivier.chapelle.cc/pub/DBN_www2009.pdf
Wikimedia Foundation's use of DBN:
https://blog.wikimedia.org/2017/10/17/elasticsearch-learning-to-rank-plugin/
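[Diagram: a chain of result positions ..., r-1, r, ... At each position, examination ($E_{r-1}$, $E_r$) and attractiveness ($A_{r-1}$, $A_r$, parameter $\alpha_{dq}$) determine the click $C_d$; a click can lead to satisfaction ($S_{r-1}$, $S_r$, parameter $s_{dq}$); examination of one position leads to examination of the next, governed by $\gamma$.]
We can compute 'attractiveness' and 'satisfaction' of doc for query
You stop if you were satisfied; otherwise you examine the next result with probability $\gamma$
Simplified DBN: last clicked result satisfied me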
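Under the simplified DBN, estimation reduces to counting: everything at or above the last click counts as examined, and the last clicked result counts as the satisfying one. A minimal sketch under those assumptions follows; the function name and session format are hypothetical, and treating every result as examined when a session has no click is one convention among several, assumed here.

```python
from collections import defaultdict

def fit_sdbn(sessions):
    """Simplified DBN by counting.

    sessions: list of (query, [(doc_id, clicked), ...]), results in rank order.
    Returns (attractiveness, satisfaction) dicts keyed by (query, doc_id).
    """
    examined = defaultdict(int)
    clicks = defaultdict(int)
    last_clicks = defaultdict(int)
    for query, results in sessions:
        clicked_ranks = [i for i, (_, c) in enumerate(results) if c]
        # Assumption: with no click at all, treat every result as examined.
        last = clicked_ranks[-1] if clicked_ranks else len(results) - 1
        for rank, (doc, clicked) in enumerate(results[:last + 1]):
            key = (query, doc)
            examined[key] += 1
            if clicked:
                clicks[key] += 1
                if rank == last:
                    last_clicks[key] += 1
    # Attractiveness: clicks per examination. Satisfaction: last clicks per click.
    attractiveness = {k: clicks.get(k, 0) / n for k, n in examined.items()}
    satisfaction = {k: last_clicks.get(k, 0) / n for k, n in clicks.items()}
    return attractiveness, satisfaction

sessions = [
    ("q", [("a", True), ("b", False), ("c", True), ("d", False)]),
    ("q", [("a", False), ("b", False)]),
]
attractiveness, satisfaction = fit_sdbn(sessions)
```

In the first session, "d" sits below the last click, so it is never counted as examined and gets no estimate at all.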

We are not building Web Search
Web Search:
● Low visibility: just the SERP clicks, we don't see what happens beyond...
● High volume: simpler assumptions help map just clicks to satisfaction

Most of us - 'Average Joes'
Most other search apps:
● More visibility: clicks, conversions, and more from the session after search!
● Lower volume: may not be able to rely on simpler assumptions for satisfaction

Click models for the rest of us

Click models for the rest of us
● Click models CAN be used to overcome SERP UI biases to derive attractiveness for Average Joes
● What about satisfaction? Aka 'actual relevance'
● Can we use our advantage to measure that directly?

q=waffle maker
0.7
0.9
0.4
Avg Joes have enough data to derive attractiveness
Attractiveness:

Most of us have some kind of 'post click' tracking
● Conversions: direct/explicit goal completed by user - like "purchase"
● Pseudo-conversions: "goals" not directly recognized by user or clear in analytics - like "read article" or "add to cart"
● Indications of interest: not quite "goals" but indications user is happy - like "click plus dwell"

q=heart attack
0.7
'Shallow' events dense; 'deeper' events sparse
Attractiveness: click!
These clicks are fleeting to
users
Top of funnel/path → Click+Dwell → Click+Dwell+Scroll → Read Reviews → Add to Cart → Checkout → End of funnel/path
Most people should get here... (top of funnel)
...a few will get all the way through... (end of funnel)

q=waffle maker
If the user can't be bothered to do a shallow event, attractiveness is discounted
Attractiveness: 0.7
User immediately hits back button! Time on page = 0.001s
Not actually relevant
q=waffle maker
If the user moves deep into the page, attractiveness is confirmed
Attractiveness: 0.7
Add to Cart, Bought
Definitely relevant

q=heart attack
0.7
Discount attractiveness based on event not achieved
Attractiveness: click!
Click+Dwell → Click+Dwell+Scroll → Read Reviews → Add to Cart → Checkout
Quit here (early in the funnel)? Discount A: 0.01
Quit here (late in the funnel)? Discount A: 0.95

Update over multiple sessions...
q=waffle maker
Attractiveness: 0.7
Further 'post query' evidence:
● Session 1: Bought (D=0.65)
● Session 2: Immediately returned to SERP (D=0.01)
● Session 3: Stayed on page, read reviews (D=0.20)
$J = \frac{\sum_{\text{sessions}} \text{Discount} \times \text{Attractiveness}}{\text{num\_sessions}}$
$J = 0.7 \times \frac{0.65 + 0.01 + 0.20}{3} = 0.7 \times 0.287 \approx 0.20$
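A small sketch of this averaging step, using the slide's numbers. The function name and argument layout are assumptions. The mean discount over the three sessions is about 0.29, and multiplying by the attractiveness of 0.7 gives a judgment of about 0.20.

```python
def discounted_judgment(attractiveness, session_discounts):
    """Average of Discount * Attractiveness over the sessions for a query/doc pair."""
    if not session_discounts:
        return 0.0
    total = sum(d * attractiveness for d in session_discounts)
    return total / len(session_discounts)

# Session 1: bought; session 2: back to SERP immediately; session 3: read reviews
discounts = [0.65, 0.01, 0.20]
mean_discount = sum(discounts) / len(discounts)
j = discounted_judgment(0.7, discounts)
print(round(mean_discount, 2), round(j, 2))  # 0.29 0.2
```

Because attractiveness is constant across sessions for one query/doc pair, this is the same as attractiveness times the mean discount.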

User Value-Cost Model
What is the value of a page for the user?
We can't really measure the value, but we can indirectly measure the cost to the user in time & money
● Back immediately: "...I can't be bothered..." → Discount heavily
● Click+Dwell → Click+Dwell+Scroll → Read Reviews: "… this was at least worth some of my time towards my goal..." → Discount moderately

Bayes justification to judgments
$P(J \mid V) = \frac{P(V \mid J) \, P(J)}{P(V)}$
● $P(J \mid V)$: judgment in the context of value
● $P(J)$: prior, earlier belief in relevance given by attractiveness as derived from click model
● $P(V \mid J)$: probability of user getting value in the context of it being deemed relevant to this query
● $P(V)$: probability of user getting value regardless of query

Bayes approach to judgments
$J = \frac{\text{avgPageValueForThisQuery} \times A}{\text{avgPageValue}}$

When avg_page_value = 0.3
q=waffle maker
Attractiveness: 0.7
Further 'post query' evidence:
● Session 1: Bought (D=0.65)
● Session 2: Immediately returned to SERP (user_value=0.01)
● Session 3: Stayed on page, read reviews (user_value=0.20)
$J = \frac{\left(\sum_{\text{sessions}} \text{Discount} \times \text{Attractiveness}\right) / \text{num\_sessions}}{\text{avg\_page\_value}}$
$J = \frac{0.7 \times (0.65 + 0.01 + 0.20) / 3}{0.3} \approx 0.67$
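The normalized version can be sketched the same way, reading the Bayes terms as: prior = attractiveness, likelihood = average page value for this query, evidence = average page value overall. The function name and error handling are assumptions. With the slide's numbers, the value ratio is about 0.96 and the resulting judgment is about 0.67.

```python
def bayes_judgment(attractiveness, session_values, avg_page_value):
    """J = avgPageValueForThisQuery * A / avgPageValue."""
    if not session_values or avg_page_value <= 0:
        raise ValueError("need at least one session and a positive avg_page_value")
    value_for_query = sum(session_values) / len(session_values)
    return value_for_query * attractiveness / avg_page_value

values = [0.65, 0.01, 0.20]
value_ratio = (sum(values) / len(values)) / 0.3
j = bayes_judgment(0.7, values, avg_page_value=0.3)
print(round(value_ratio, 2), round(j, 2))  # 0.96 0.67
```

Nothing in this ratio bounds the result by 1: a page that delivers far more value for this query than it does on average can push J above the attractiveness prior, so a real pipeline would likely need to clip or rescale before mapping to grades.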

Your take-home reading
Zhong et al., Incorporating Post-Click Behaviors into a Click Model
https://zhangyuc.github.io/files/zhang11kdd.pdf
