Search engineers have many tools to address relevance. Older tools are typically unsupervised (statistical, rule based) and require large investments in manual tuning effort. Newer ones involve training or fine-tuning machine learning models and vector search, which require large investments in labeling documents with their relevance to queries.
Learning to Rank (LTR) models are in the latter category. However, their popularity has traditionally been limited to domains where user data can be harnessed to generate labels that are cheap and plentiful, such as e-commerce sites. In domains where this is not true, labeling often involves human experts, and results in labels that are neither cheap nor plentiful. This effectively becomes a roadblock to adoption of LTR models in these domains, in spite of their effectiveness in general.
Generative Large Language Models (LLMs) with parameters in the 70B+ range have been found to perform well at tasks that require mimicking human preferences. Labeling query-document pairs with relevance judgements for training LTR models is one such task. Using LLMs for this task opens up the possibility of obtaining a potentially unlimited number of query-document judgment labels, and makes LTR models a viable approach to improving search relevance in these domains.
In this presentation, we describe work that was done to train and evaluate four LTR-based re-rankers against lexical, vector, and heuristic search baselines. The models were a mix of pointwise, pairwise, and listwise, and required different strategies to generate labels for them. All four models outperformed the lexical baseline, and one of the four also outperformed the vector search baseline. None of the models beat the heuristics baseline, although two came close. However, it is important to note that the heuristics were built up over months of trial and error and required familiarity with the search domain, whereas the LTR models were built in days and required much less familiarity.
Designing IA for AI - Information Architecture Conference 2024
Building Learning to Rank (LTR) search reranking models using Large Language Models (LLM)
1. PyData Global 2023
Sujit Pal, Elsevier Health
ORCID Id: https://orcid.org/0000-0002-6225-110X
Building Learning to Rank models for search using Large Language Models
2023
2. About Me
• Work at the intersection of search and machine learning
• Interested in Information Retrieval, Natural Language Processing, Knowledge Graphs and Machine Learning, and now LLMs and Generative AI
sujit.pal@elsevier.com
https://www.linkedin.com/in/sujitpal
@palsujit@hachyderm.io
2
5. Basic Idea (what)
Use LLMs to generate relevance judgements → Use relevance judgements to train LTR models → Use LTR models to rerank query results → Profit!
5
6. Rationale (why)
LTR:
• Easy way to jumpstart relevance model
• Practical for situations where judgement data is cheap and plentiful
LLM:
• Potentially unlimited source of judgement data
• 70B+ LLM models capable of mimicking human preferences
Large language models can accurately predict searcher preferences (Thomas et al, 2023)
6
10. Query Sampler
[Pipeline diagram. Training: the Query Sampler draws queries q from gold set queries and query logs; an Elasticsearch index (BM25) retrieves candidates (q, dk); the Label Generator produces (q, dk, yk) and the Feature Generator turns these into (Xk, yk) to train the LTR model. Inference: q → Elasticsearch Index (BM25) → (q, dk) → Feature Generator → (q, dk, Xk) → Trained LTR Model → (q, dk, Xk, y'k) → Reranker. Evaluation: reranked results are judged by the Label Generator as (q, dk, yk) and scored with P@10.]
10
11. Query Sampler
• Determine a set of representative queries the system is expected to answer (for training the LTR model)
• This was for a specialized search component that answers long queries, so we sampled from our query log
• Pretend #-tokens and #-concepts follow a normal distribution; calculate mean and standard deviation
• Set up boundaries: mean ± s.d.
• Filter queries from the query log whose #-tokens and #-concepts fall within the (mean ± s.d.) boundary
11
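A minimal sketch of this sampling step, assuming hypothetical count_tokens and count_concepts helpers (e.g. a tokenizer and the custom NER mentioned later); the actual pipeline may differ in details:

import statistics

def sample_queries(query_log, count_tokens, count_concepts):
    # Keep queries whose #-tokens and #-concepts both fall within mean +/- 1 s.d.
    tok_counts = [count_tokens(q) for q in query_log]
    con_counts = [count_concepts(q) for q in query_log]
    tok_mu, tok_sd = statistics.mean(tok_counts), statistics.stdev(tok_counts)
    con_mu, con_sd = statistics.mean(con_counts), statistics.stdev(con_counts)
    return [
        q for q, nt, nc in zip(query_log, tok_counts, con_counts)
        if abs(nt - tok_mu) <= tok_sd and abs(nc - con_mu) <= con_sd
    ]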
12. Label Generator
[Pipeline diagram repeated from slide 10; this slide highlights the Label Generator stages: labeling (q, dk) pairs as (q, dk, yk) for training, and judging reranked results for evaluation.]
12
13. Label Generation (pointwise)
Flow: q → (q, dk) → (q, dk, yk)
Prompt used:
Human: You are a medical expert tasked with identifying if the provided DOCUMENT addresses the information needs for the provided QUERY.
QUERY: `{query}`
DOCUMENT: `{document}`
Your RESPONSE should be:
- RELEVANT if the DOCUMENT addresses the information needs for the QUERY
- IRRELEVANT otherwise
Explain your REASONING.
Format the output as follows:
<output>
<response>RESPONSE</response>
<reasoning>REASONING</reasoning>
</output>
Assistant: <output>
Output label: RELEVANT or IRRELEVANT (binary)
13
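A minimal sketch of how this prompt might be turned into binary labels, assuming a call_llm(prompt) helper that wraps whatever LLM client is in use and a prompt_template holding the text above (neither is specified in the deck):

import re

def pointwise_label(query, document, call_llm, prompt_template):
    # prompt_template is the pointwise prompt above, with {query} and {document} placeholders.
    completion = call_llm(prompt_template.format(query=query, document=document))
    match = re.search(r"<response>\s*(RELEVANT|IRRELEVANT)\s*</response>", completion)
    if match is None:
        return None          # unparseable response; skip or retry in practice
    return 1 if match.group(1) == "RELEVANT" else 0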
14. Label Generation (pairwise)
Flow: q → (q, dk) → generate pairs → (q, dki, dkj) → (q, dki, dkj, yk)
Prompt used:
Human: You are a medical expert who has to judge which of two DOCUMENTs shown below are relevant for the given QUERY. Provide your JUDGEMENT as DOCUMENT-1 or DOCUMENT-2 depending on which DOCUMENT you think is relevant for the QUERY.
QUERY: `{query}`
DOCUMENT-1: `{document_1}`
DOCUMENT-2: `{document_2}`
Explain your REASONING.
Format your output as follows:
<output>
<response>JUDGEMENT</response>
<reasoning>REASONING</reasoning>
</output>
Assistant: <output>
Output label: DOCUMENT-1 or DOCUMENT-2
14
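The (q, dki, dkj) pairs can be produced by enumerating pairs of retrieved candidates; the deck does not spell out the pairing strategy, so this is only a sketch under that assumption, with the ±1 coding taken from the pairwise recap on slide 22:

import re
from itertools import combinations

def pairwise_labels(query, candidates, call_llm, prompt_template):
    # Ask the LLM which of each candidate pair is more relevant to the query.
    # Returns (query, doc_i, doc_j, y) with y = +1 if DOCUMENT-1 wins, -1 if DOCUMENT-2 wins.
    labels = []
    for doc_i, doc_j in combinations(candidates, 2):  # in practice pairs may be sampled/capped
        completion = call_llm(prompt_template.format(
            query=query, document_1=doc_i, document_2=doc_j))
        match = re.search(r"<response>\s*(DOCUMENT-1|DOCUMENT-2)\s*</response>", completion)
        if match is None:
            continue  # skip unparseable responses
        y = 1 if match.group(1) == "DOCUMENT-1" else -1
        labels.append((query, doc_i, doc_j, y))
    return labels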
15. Label Generation (listwise)
Flow: q → (q, dk) → (q, dk, yk)
Prompt used:
Human: You are a medical expert tasked with assigning a SCORE indicating how relevant the given DOCUMENT is to the given QUERY.
QUERY: `{query}`
DOCUMENT: `{document}`
Assign the SCORE as follows:
1 - DOCUMENT is completely unrelated to QUERY
2 - DOCUMENT has some relation to QUERY, but mostly off-topic
3 - DOCUMENT is relevant to QUERY, but lacking focus or key details
4 - DOCUMENT is highly relevant, addressing the main aspects of QUERY
5 - DOCUMENT is directly relevant and precisely targeted to QUERY
Explain your REASONING for assigning the SCORE.
Format the output as follows:
<output>
<score>SCORE</score>
<reasoning>REASONING</reasoning>
</output>
Assistant: <output>
Output label: SCORE on a 5-point scale (1-5, numeric)
15
16. Feature Generator
[Pipeline diagram repeated from slide 10; this slide highlights the Feature Generator stages, which turn (q, dk) pairs into feature vectors Xk for both training and inference.]
16
17. Feature Generation
Feature groups: Query Features, Document Features, Query-Document Features
• #-tokens in query
• Total Term Frequency (TTF) for field
• TF (min, max, mean, var) for field
• TF*IDF (min, max, mean, var) for field
• #-overlapping query tokens w/field
• #-overlapping query concepts w/field
• #-overlapping query semantic groups w/field
• BM25 scores for matching query w/field
• Cosine similarity between query and field
Idea Source: Learning to Rank Datasets page from Microsoft Research
17
18. Feature Generation (same feature list as slide 17)
Document Fields: title, section title, breadcrumbs, text
18
19. Feature Generation (same feature list as slide 17)
Multiple point estimates for the same feature
19
20. Feature Generation (same feature list as slide 17)
Callouts on the feature list: Count; Custom NER
61 features in all
20
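A minimal sketch of a few of the query-document features listed above, assuming hypothetical tokenize, extract_concepts, and embed helpers (the real pipeline derived its features from Elasticsearch term statistics and a custom NER):

import math

def overlap(a, b):
    # Number of distinct items shared by two token/concept lists.
    return len(set(a) & set(b))

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0

def query_doc_features(query, field_text, tokenize, extract_concepts, embed):
    # Returns a small dict of query-document features for one document field.
    q_tokens, f_tokens = tokenize(query), tokenize(field_text)
    return {
        "num_query_tokens": len(q_tokens),
        "token_overlap": overlap(q_tokens, f_tokens),
        "concept_overlap": overlap(extract_concepts(query), extract_concepts(field_text)),
        "cosine_sim": cosine(embed(query), embed(field_text)),
    }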
21. Model
[Pipeline diagram repeated from slide 10; this slide highlights the LTR Model, trained on (Xk, yk) and applied at inference as the Trained LTR Model feeding the Reranker.]
21
22. LTR Models Recap
• Pointwise Models: take query and document as input and return a relevance judgment between 0 and 1.
• Pairwise Models: take a query and pair of documents as input and return a judgment between -1 and 1.
• Listwise Models (not used): take a query and list of documents and return the list of documents ordered by relevance.
• Feature generator takes query and document and returns a feature vector.
[Diagrams: (query, doc) → Generate features → Pointwise LTR Model → judgment; (query, doc-1, doc-2) → Generate features → Pairwise LTR Model → judgment]
22
23. Model Performance
• Pointwise
− 2-layer FCN for binary classification, uses binary relevance data
• RankNet
− 3-layer Siamese network for binary classification, uses pairwise relevance data
• LambdaRank
− Pairwise model, needs listwise (scored) input, internally converts to pairwise
− Code adapted from houchenyu/L2R
− Also available via XGBoost using rank:pairwise objective
• LambdaMART
− Also available via XGBoost using rank:ndcg objective
Label types: binary, pairwise, scored
23
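As a concrete illustration of the XGBoost route mentioned above (not the exact configuration used in this work), a LambdaMART-style ranker can be trained with the rank:ndcg objective; here X, y, and qid stand in for the Feature Generator output, the LLM labels, and the query grouping, and a recent xgboost with qid support in fit() is assumed:

import numpy as np
import xgboost as xgb

# Toy data: 8 query-document rows, 61-dim feature vectors, graded labels, grouped by query id.
rng = np.random.default_rng(42)
X = rng.normal(size=(8, 61))
y = np.array([3, 1, 0, 2, 4, 0, 1, 2])
qid = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # rows must be grouped by query

ranker = xgb.XGBRanker(objective="rank:ndcg", n_estimators=50, max_depth=4)
ranker.fit(X, y, qid=qid)

scores = ranker.predict(X[qid == 1])        # higher score = rank earlier
reranked = np.argsort(-scores)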
26. LTR Models Evolution
• RankNet
− Trains using gradient descent
− Gradient computed as ∂C/∂S, where C = cross-entropy cost that penalizes differences between desired and actual ranking, and S = model score
• LambdaRank
− Multiplies RankNet ∂C/∂S (λ) values by |ΔNDCG|, the change in NDCG caused by swapping a pair of inputs
• LambdaMART
− Combines Gradient Boosting (MART = Multiple Additive Regression Trees) with LambdaRank gradient computation
Paper ref: From RankNet to LambdaRank to LambdaMART – an Overview (Burges, 2010)
26
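For reference, the lambda computation from Burges (2010) for a single pair where document i should rank above document j looks like the sketch below; sigma is a shape parameter (typically 1) and delta_ndcg is the |ΔNDCG| obtained by swapping the pair:

import math

def ranknet_lambda(s_i, s_j, sigma=1.0):
    # RankNet gradient dC/ds_i for a pair where i should be ranked above j.
    return -sigma / (1.0 + math.exp(sigma * (s_i - s_j)))

def lambdarank_lambda(s_i, s_j, delta_ndcg, sigma=1.0):
    # LambdaRank scales the RankNet lambda by |delta NDCG| of swapping i and j.
    return ranknet_lambda(s_i, s_j, sigma) * abs(delta_ndcg)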
27. Evaluation
[Pipeline diagram repeated from slide 10; this slide highlights the Evaluation path: reranked results (q, dk, Xk, y'k) are judged by the Label Generator and scored with P@10.]
27
28. Evaluation
• Generate top 50 results for query from ES index (lexical search)
• Re-rank using trained LTR model and return top 10 results
• Use LLM (same prompt as pointwise label generation) to determine relevant / irrelevant judgments
• Aggregate judgments across results, e.g. 7 / 10 relevant → P@10 = 0.7
• Average P@10 scores across all eval queries → MAP@10
• Our application called for top 10 results equally ranked
• But the pipeline could also generate ranked lists and compute rank-aware metrics such as MRR@k or NDCG@k if needed
28
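A minimal sketch of the aggregation described above, assuming per_query_labels holds, for each evaluation query, the 0/1 judgments returned by the pointwise prompt for the top 10 reranked results:

def precision_at_k(labels, k=10):
    # Fraction of the top-k results judged relevant (labels are 0/1).
    return sum(labels[:k]) / float(k)

def mean_precision_at_k(per_query_labels, k=10):
    # Average P@k across all evaluation queries (MAP@10 in the deck's terminology).
    scores = [precision_at_k(labels, k) for labels in per_query_labels]
    return sum(scores) / len(scores)

# Example: 7 of 10 results judged relevant -> P@10 = 0.7
assert precision_at_k([1, 1, 1, 0, 1, 1, 0, 1, 1, 0]) == 0.7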
30. Conclusions
• Low-effort way to quickly build medium- to high-relevance LTR models
• LLMs provide (relatively) cheap and plentiful judgment labels to train LTR models
• Can be used to jumpstart development of search pipelines
• Feature engineering used to inject informative features from different search modalities: lexical, vector, knowledge graph, etc.
30
31. Alignment with human judgments
• Human judgments are hard to acquire, so using an LLM makes sense
• Human and LLM judgments show a similar trend (accuracy: 71%), but the LLM is more "lenient" than humans
• Overall correlation 0.43, but it decreases with increasing scores
• Observation: the LLM tries too hard to conclude "RELEVANT", making leaps of reasoning humans would not
31
32. Active Learning
1. Train LTR model with fully automated pipeline
2. Deploy as re-ranker
3. Generate search results for user queries
4. Identify low confidence predictions and re-annotate using human experts
5. Retrain LTR model with additional labels from step 4
6. Go to step 2
Image Credit: SuperAnnotate Webinars Page
32
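A minimal sketch of the selection in step 4, assuming a pointwise re-ranker whose score can be read as a probability of relevance; the 0.5 decision boundary and margin are illustrative choices, not from the deck:

def select_for_reannotation(scored_results, margin=0.15):
    # Pick (query, doc) pairs whose predicted relevance is closest to the 0.5 decision
    # boundary, i.e. where the model is least confident, for human re-annotation.
    # scored_results: iterable of (query, doc, predicted_relevance) tuples.
    low_confidence = [
        (query, doc, score)
        for query, doc, score in scored_results
        if abs(score - 0.5) <= margin
    ]
    # Most uncertain first, so experts see the highest-value items first.
    return sorted(low_confidence, key=lambda item: abs(item[2] - 0.5))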