Search engineers have many tools to address relevance. Older tools are typically unsupervised (statistical, rule based) and require large investments in manual tuning effort. Newer ones involve training or fine-tuning machine learning models and vector search, which require large investments in labeling documents with their relevance to queries.
Learning to Rank (LTR) models are in the latter category. However, their popularity has traditionally been limited to domains where user data can be harnessed to generate labels that are cheap and plentiful, such as e-commerce sites. In domains where this is not true, labeling often involves human experts, and results in labels that are neither cheap nor plentiful. This effectively becomes a roadblock to adoption of LTR models in these domains, in spite of their effectiveness in general.
Generative Large Language Models (LLMs) with parameters in the 70B+ range have been found to perform well at tasks that require mimicking human preferences. Labeling query-document pairs with relevance judgements for training LTR models is one such task. Using LLMs for this task opens up the possibility of obtaining a potentially unlimited number of query-document judgment labels, and makes LTR models a viable approach to improving search relevance in these domains.
In this presentation, we describe work that was done to train and evaluate four LTR-based re-rankers against lexical, vector, and heuristic search baselines. The models were a mix of pointwise, pairwise, and listwise, and required different strategies to generate labels for them. All four models outperformed the lexical baseline, and one of the four also outperformed the vector search baseline. None of the models beat the heuristics baseline, although two came close. However, it is important to note that the heuristics were built up over months of trial and error and required familiarity with the search domain, whereas the LTR models were built in days and required much less familiarity.
Designing IA for AI - Information Architecture Conference 2024
Building Learning to Rank (LTR) search reranking models using Large Language Models (LLM)
1. PyData Global 2023
Sujit Pal, Elsevier Health
ORCID Id: https://orcid.org/0000-0002-6225-110X
Building Learning to Rank models for search using Large Language Models
2023
2. About Me
• Work at the intersection of search and machine learning
• Interested in Information Retrieval, Natural Language Processing, Knowledge Graphs and Machine Learning, and now LLMs and Generative AI
sujit.pal@elsevier.com
https://www.linkedin.com/in/sujitpal
@palsujit@hachyderm.io
2
5. Basic Idea (what)
Use LLMs to generate relevance judgements → Use relevance judgements to train LTR models → Use LTR models to rerank query results → Profit!
5
6. Rationale (why)
LTR:
• Easy way to jumpstart relevance model
• Practical for situations where judgement data is cheap and plentiful
LLM:
• Potentially unlimited source of judgement data
• 70B+ LLM models capable of mimicking human preferences
Large language models can accurately predict searcher preferences (Thomas et al, 2023)
6
10. Query Sampler
[Pipeline diagram. Training: the Query Sampler draws queries q from gold set queries and query logs; an Elasticsearch index (BM25) retrieves candidates (q, dk); the Label Generator produces (q, dk, yk) and the Feature Generator turns these into (Xk, yk) to train the LTR model. Inference: q → Elasticsearch Index (BM25) → (q, dk) → Feature Generator → (q, dk, Xk) → Trained LTR Model → (q, dk, Xk, y'k) → Reranker. Evaluation: reranked results are judged by the Label Generator as (q, dk, yk) and scored with P@10.]
10
11. Query Sampler
• Determine a set of representative queries the system is expected to answer (for training the LTR model)
• This was for a specialized search component that answers long queries, so we sampled from our query log
• Pretend #-tokens and #-concepts follow a normal distribution; calculate mean and standard deviation
• Set up boundaries: mean ± s.d.
• Filter queries from the query log whose #-tokens and #-concepts fall within the (mean ± s.d.) boundary
11
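A minimal sketch of this sampling step, assuming hypothetical count_tokens and count_concepts helpers (e.g. a tokenizer and the custom NER mentioned later); the actual pipeline may differ in details:

import statistics

def sample_queries(query_log, count_tokens, count_concepts):
    # Keep queries whose #-tokens and #-concepts both fall within mean +/- 1 s.d.
    tok_counts = [count_tokens(q) for q in query_log]
    con_counts = [count_concepts(q) for q in query_log]
    tok_mu, tok_sd = statistics.mean(tok_counts), statistics.stdev(tok_counts)
    con_mu, con_sd = statistics.mean(con_counts), statistics.stdev(con_counts)
    return [
        q for q, nt, nc in zip(query_log, tok_counts, con_counts)
        if abs(nt - tok_mu) <= tok_sd and abs(nc - con_mu) <= con_sd
    ]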
12. Label Generator
[Pipeline diagram repeated from slide 10; this slide highlights the Label Generator stages: labeling (q, dk) pairs as (q, dk, yk) for training, and judging reranked results for evaluation.]
12
13. Label Generation (pointwise)
Flow: q → (q, dk) → (q, dk, yk)
Prompt used:
Human: You are a medical expert tasked with identifying if the provided DOCUMENT addresses the information needs for the provided QUERY.
QUERY: `{query}`
DOCUMENT: `{document}`
Your RESPONSE should be:
- RELEVANT if the DOCUMENT addresses the information needs for the QUERY
- IRRELEVANT otherwise
Explain your REASONING.
Format the output as follows:
<output>
<response>RESPONSE</response>
<reasoning>REASONING</reasoning>
</output>
Assistant: <output>
Output label: RELEVANT or IRRELEVANT (binary)
13
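A minimal sketch of how this prompt might be turned into binary labels, assuming a call_llm(prompt) helper that wraps whatever LLM client is in use and a prompt_template holding the text above (neither is specified in the deck):

import re

def pointwise_label(query, document, call_llm, prompt_template):
    # prompt_template is the pointwise prompt above, with {query} and {document} placeholders.
    completion = call_llm(prompt_template.format(query=query, document=document))
    match = re.search(r"<response>\s*(RELEVANT|IRRELEVANT)\s*</response>", completion)
    if match is None:
        return None          # unparseable response; skip or retry in practice
    return 1 if match.group(1) == "RELEVANT" else 0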
14. Label Generation (pairwise)
Flow: q → (q, dk) → generate pairs → (q, dki, dkj) → (q, dki, dkj, yk)
Prompt used:
Human: You are a medical expert who has to judge which of two DOCUMENTs shown below are relevant for the given QUERY. Provide your JUDGEMENT as DOCUMENT-1 or DOCUMENT-2 depending on which DOCUMENT you think is relevant for the QUERY.
QUERY: `{query}`
DOCUMENT-1: `{document_1}`
DOCUMENT-2: `{document_2}`
Explain your REASONING.
Format your output as follows:
<output>
<response>JUDGEMENT</response>
<reasoning>REASONING</reasoning>
</output>
Assistant: <output>
Output label: DOCUMENT-1 or DOCUMENT-2
14
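The (q, dki, dkj) pairs can be produced by enumerating pairs of retrieved candidates; the deck does not spell out the pairing strategy, so this is only a sketch under that assumption, with the ±1 coding taken from the pairwise recap on slide 22:

import re
from itertools import combinations

def pairwise_labels(query, candidates, call_llm, prompt_template):
    # Ask the LLM which of each candidate pair is more relevant to the query.
    # Returns (query, doc_i, doc_j, y) with y = +1 if DOCUMENT-1 wins, -1 if DOCUMENT-2 wins.
    labels = []
    for doc_i, doc_j in combinations(candidates, 2):  # in practice pairs may be sampled/capped
        completion = call_llm(prompt_template.format(
            query=query, document_1=doc_i, document_2=doc_j))
        match = re.search(r"<response>\s*(DOCUMENT-1|DOCUMENT-2)\s*</response>", completion)
        if match is None:
            continue  # skip unparseable responses
        y = 1 if match.group(1) == "DOCUMENT-1" else -1
        labels.append((query, doc_i, doc_j, y))
    return labels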
15. Label Generation (listwise)
Flow: q → (q, dk) → (q, dk, yk)
Prompt used:
Human: You are a medical expert tasked with assigning a SCORE indicating how relevant the given DOCUMENT is to the given QUERY.
QUERY: `{query}`
DOCUMENT: `{document}`
Assign the SCORE as follows:
1 - DOCUMENT is completely unrelated to QUERY
2 - DOCUMENT has some relation to QUERY, but mostly off-topic
3 - DOCUMENT is relevant to QUERY, but lacking focus or key details
4 - DOCUMENT is highly relevant, addressing the main aspects of QUERY
5 - DOCUMENT is directly relevant and precisely targeted to QUERY
Explain your REASONING for assigning the SCORE.
Format the output as follows:
<output>
<score>SCORE</score>
<reasoning>REASONING</reasoning>
</output>
Assistant: <output>
Output label: SCORE on a 5-point scale (1-5, numeric)
15
16. Feature Generator
[Pipeline diagram repeated from slide 10; this slide highlights the Feature Generator stages, which turn (q, dk) pairs into feature vectors Xk for both training and inference.]
16
17. Feature Generation
Feature groups: Query Features, Document Features, Query-Document Features
• #-tokens in query
• Total Term Frequency (TTF) for field
• TF (min, max, mean, var) for field
• TF*IDF (min, max, mean, var) for field
• #-overlapping query tokens w/field
• #-overlapping query concepts w/field
• #-overlapping query semantic groups w/field
• BM25 scores for matching query w/field
• Cosine similarity between query and field
Idea Source: Learning to Rank Datasets page from Microsoft Research
17
18. Feature Generation (same feature list as slide 17)
Document Fields: title, section title, breadcrumbs, text
18
19. Feature Generation (same feature list as slide 17)
Multiple point estimates for the same feature
19
20. Feature Generation (same feature list as slide 17)
Callouts on the feature list: Count; Custom NER
61 features in all
20
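A minimal sketch of a few of the query-document features listed above, assuming hypothetical tokenize, extract_concepts, and embed helpers (the real pipeline derived its features from Elasticsearch term statistics and a custom NER):

import math

def overlap(a, b):
    # Number of distinct items shared by two token/concept lists.
    return len(set(a) & set(b))

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0

def query_doc_features(query, field_text, tokenize, extract_concepts, embed):
    # Returns a small dict of query-document features for one document field.
    q_tokens, f_tokens = tokenize(query), tokenize(field_text)
    return {
        "num_query_tokens": len(q_tokens),
        "token_overlap": overlap(q_tokens, f_tokens),
        "concept_overlap": overlap(extract_concepts(query), extract_concepts(field_text)),
        "cosine_sim": cosine(embed(query), embed(field_text)),
    }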
21. Model
[Pipeline diagram repeated from slide 10; this slide highlights the LTR Model, trained on (Xk, yk) and applied at inference as the Trained LTR Model feeding the Reranker.]
21
22. LTR Models Recap
• Pointwise Models: take query and document as input and return a relevance judgment between 0 and 1.
• Pairwise Models: take a query and pair of documents as input and return a judgment between -1 and 1.
• Listwise Models (not used): take a query and list of documents and return the list of documents ordered by relevance.
• Feature generator takes query and document and returns a feature vector.
[Diagrams: (query, doc) → Generate features → Pointwise LTR Model → judgment; (query, doc-1, doc-2) → Generate features → Pairwise LTR Model → judgment]
22
23. Model Performance
• Pointwise
− 2-layer FCN for binary classification, uses binary relevance data
• RankNet
− 3-layer Siamese network for binary classification, uses pairwise relevance data
• LambdaRank
− Pairwise model, needs listwise (scored) input, internally converts to pairwise
− Code adapted from houchenyu/L2R
− Also available via XGBoost using rank:pairwise objective
• LambdaMART
− Also available via XGBoost using rank:ndcg objective
Label types: binary, pairwise, scored
23
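As a concrete illustration of the XGBoost route mentioned above (not the exact configuration used in this work), a LambdaMART-style ranker can be trained with the rank:ndcg objective; here X, y, and qid stand in for the Feature Generator output, the LLM labels, and the query grouping, and a recent xgboost with qid support in fit() is assumed:

import numpy as np
import xgboost as xgb

# Toy data: 8 query-document rows, 61-dim feature vectors, graded labels, grouped by query id.
rng = np.random.default_rng(42)
X = rng.normal(size=(8, 61))
y = np.array([3, 1, 0, 2, 4, 0, 1, 2])
qid = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # rows must be grouped by query

ranker = xgb.XGBRanker(objective="rank:ndcg", n_estimators=50, max_depth=4)
ranker.fit(X, y, qid=qid)

scores = ranker.predict(X[qid == 1])        # higher score = rank earlier
reranked = np.argsort(-scores)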
26. LTR Models Evolution
• RankNet
− Trains using gradient descent
− Gradient computed as ∂C/∂S, where C = cross-entropy cost that penalizes differences between desired and actual ranking, and S = model score
• LambdaRank
− Multiplies RankNet ∂C/∂S (λ) values by |ΔNDCG|, the change in NDCG caused by swapping a pair of inputs
• LambdaMART
− Combines Gradient Boosting (MART = Multiple Additive Regression Trees) with LambdaRank gradient computation
Paper ref: From RankNet to LambdaRank to LambdaMART – an Overview (Burges, 2010)
26
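For reference, the lambda computation from Burges (2010) for a single pair where document i should rank above document j looks like the sketch below; sigma is a shape parameter (typically 1) and delta_ndcg is the |ΔNDCG| obtained by swapping the pair:

import math

def ranknet_lambda(s_i, s_j, sigma=1.0):
    # RankNet gradient dC/ds_i for a pair where i should be ranked above j.
    return -sigma / (1.0 + math.exp(sigma * (s_i - s_j)))

def lambdarank_lambda(s_i, s_j, delta_ndcg, sigma=1.0):
    # LambdaRank scales the RankNet lambda by |delta NDCG| of swapping i and j.
    return ranknet_lambda(s_i, s_j, sigma) * abs(delta_ndcg)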
27. Evaluation
[Pipeline diagram repeated from slide 10; this slide highlights the Evaluation path: reranked results (q, dk, Xk, y'k) are judged by the Label Generator and scored with P@10.]
27
28. Evaluation
• Generate top 50 results for query from ES index (lexical search)
• Re-rank using trained LTR model and return top 10 results
• Use LLM (same prompt as pointwise label generation) to determine relevant / irrelevant judgments
• Aggregate judgments across results, e.g. 7 / 10 relevant → P@10 = 0.7
• Average P@10 scores across all eval queries → MAP@10
• Our application called for top 10 results equally ranked
• But the pipeline could also generate ranked lists and compute rank-aware metrics such as MRR@k or NDCG@k if needed
28
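A minimal sketch of the aggregation described above, assuming per_query_labels holds, for each evaluation query, the 0/1 judgments returned by the pointwise prompt for the top 10 reranked results:

def precision_at_k(labels, k=10):
    # Fraction of the top-k results judged relevant (labels are 0/1).
    return sum(labels[:k]) / float(k)

def mean_precision_at_k(per_query_labels, k=10):
    # Average P@k across all evaluation queries (MAP@10 in the deck's terminology).
    scores = [precision_at_k(labels, k) for labels in per_query_labels]
    return sum(scores) / len(scores)

# Example: 7 of 10 results judged relevant -> P@10 = 0.7
assert precision_at_k([1, 1, 1, 0, 1, 1, 0, 1, 1, 0]) == 0.7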
30. Conclusions
• Low-effort way to quickly build medium- to high-relevance LTR models
• LLMs provide (relatively) cheap and plentiful judgment labels to train LTR models
• Can be used to jumpstart development of search pipelines
• Feature engineering used to inject informative features from different search modalities: lexical, vector, knowledge graph, etc.
30
31. Alignment with human judgments
• Human judgments are hard to acquire, so using an LLM makes sense
• Human and LLM judgments show a similar trend (accuracy: 71%), but the LLM is more "lenient" than humans
• Overall correlation 0.43, but it decreases with increasing scores
• Observation: the LLM tries too hard to conclude "RELEVANT", making leaps of reasoning humans would not
31
32. Active Learning
1. Train LTR model with fully automated pipeline
2. Deploy as re-ranker
3. Generate search results for user queries
4. Identify low confidence predictions and re-annotate using human experts
5. Retrain LTR model with additional labels from step 4
6. Go to step 2
Image Credit: SuperAnnotate Webinars Page
32
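A minimal sketch of the selection in step 4, assuming a pointwise re-ranker whose score can be read as a probability of relevance; the 0.5 decision boundary and margin are illustrative choices, not from the deck:

def select_for_reannotation(scored_results, margin=0.15):
    # Pick (query, doc) pairs whose predicted relevance is closest to the 0.5 decision
    # boundary, i.e. where the model is least confident, for human re-annotation.
    # scored_results: iterable of (query, doc, predicted_relevance) tuples.
    low_confidence = [
        (query, doc, score)
        for query, doc, score in scored_results
        if abs(score - 0.5) <= margin
    ]
    # Most uncertain first, so experts see the highest-value items first.
    return sorted(low_confidence, key=lambda item: abs(item[2] - 0.5))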