Haystack LIVE!
21 May 2020
How to Build your Training Set
for a Learning to Rank Project
Alessandro Benedetti, Software Engineer
21st May 2020
www.sease.io
● London based - Italian made :)
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Elasticsearch experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank,
Document Similarity,
Search Quality Evaluation,
Relevancy Tuning
SEArch SErvices
Clients
Who I am
▪ Born in Tarquinia (ancient Etruscan city
located in central Italy, not far from Rome)
▪ R&D Software Engineer
▪ Search Consultant
▪ Director
▪ Master in Computer Science
▪ Apache Lucene/Solr Committer
▪ Passionate about Semantic, NLP and
Machine Learning technologies
▪ Beach Volleyball Player & Snowboarder
Alessandro Benedetti
Agenda
• Learning To Rank
• Training Set Definition
• Implicit/Explicit feedback
• Feature Engineering
• Relevance Label Estimation
• Metric Evaluation/Loss Function
• Training/Test Set Split
Learning To Rank
What is it?
Learning from user implicit/explicit feedback
To
Rank documents (sensu lato)
“Learning to rank is the application of machine
learning, typically supervised, semi-supervised or
reinforcement learning, in the construction of
ranking models for information retrieval
systems.” Wikipedia
Learning To Rank
[Architecture diagram: UI -> User Interactions -> Interactions Logger / Judgement Collector -> Training]
• [sci-fi] A sentient system that learns by itself
“Machine Learning stands for that, doesn’t it?” Unknown
• [Offline] Continuously improving itself by ingesting additional feedback*
• [Integration] Easy to set up and tune -> it takes patience, time and multiple
experiments
• [Explainability] Easy to give a human-understandable explanation of why the model
operates in certain ways
Learning To Rank - What is NOT
*this is true for offline Learning To Rank models (the majority so far),
online Learning To Rank models behave differently
• It is the ground truth we are building the ranking model from
• A set of labelled <query, document> pairs
• Each <query, document> example
consists of:
- a relevance rating
- a query Id
- a feature vector
• The feature vector is composed of N features (<id>:<value>); a parsing sketch follows the example below
Training Set: What does it look like?
3 qid:1 0:3.4 1:0.7 2:1.5 3:0
2 qid:1 0:5.0 1:0.4 2:1.3 3:0
0 qid:1 0:2.4 1:0.7 2:1.5 3:1
1 qid:2 0:5.7 1:0.2 2:1.1 3:0
3 qid:2 0:0.0 1:0.5 2:4.0 3:0
0 qid:3 0:1.0 1:0.7 2:1.5 3:1
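A minimal sketch of how such a file can be parsed (plain Python, no LTR library assumed; the format matches the SVMRank/LibSVM-style lines above):

```python
# Parse one line of the training set: <label> qid:<id> <featureId>:<value> ...
def parse_line(line: str):
    tokens = line.split()
    label = int(tokens[0])                   # relevance rating
    query_id = int(tokens[1].split(":")[1])  # qid:<id>
    features = {}
    for token in tokens[2:]:                 # <featureId>:<value> pairs
        feature_id, value = token.split(":")
        features[int(feature_id)] = float(value)
    return label, query_id, features

samples = [parse_line(l) for l in [
    "3 qid:1 0:3.4 1:0.7 2:1.5 3:0",
    "2 qid:1 0:5.0 1:0.4 2:1.3 3:0",
]]
print(samples[0])  # (3, 1, {0: 3.4, 1: 0.7, 2: 1.5, 3: 0.0})
```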
Training Set: Collect Feedback
Ratings Set <- Explicit Feedback (Judgements Collector) + Implicit Feedback (Interactions Logger)
[Slide visual: two mock search UIs for the query “Queen music”, returning “Bohemian Rhapsody”, “Dancing Queen” and “Queen Albums”; one collects explicit judgements, the other logs implicit interactions]
Training Set Building
3 qid:1 0:3.4 1:0.7 2:1.5 3:0
2 qid:1 0:5.0 1:0.4 2:1.3 3:0
0 qid:1 0:2.4 1:0.7 2:1.5 3:1
1 qid:2 0:5.7 1:0.2 2:1.1 3:0
3 qid:2 0:0.0 1:0.5 2:4.0 3:0
0 qid:3 0:1.0 1:0.7 2:1.5 3:1
Interactions
Training Set Building
[{
"productId": 206,
"interactionRelevance": 0,
"interactionType": "impression",
"timestamp": "2019-03-15T18:19:34Z",
"userId": "id4",
"query": "free text query",
"userDevice": "item9",
"querySelectedBrands": [209, 201, 204],
"cartCategories": [200, 206, 204],
"userFavouriteColours": [208, 208, 202, 202],
"userAvgPrice": 43,
"productBrand": 207,
"productPrice": 22.0,
"productDiscount": 0.7,
"productReviewAvg": 4.5,
"productReviews": 200,
"productSales": 207,
"productSalesLastWeek": 203
},{…},{…},{…},{…},{…},{…},{…}]
The Most Important Thing
Interactions
(the same interactions JSON as in Training Set Building, with productId 206)
• !!! The Interactions/Ratings must be
syntactically correct !!!
• The features associated with the Interaction/Rating
must reflect the real <query,document> pair at the time
the interaction/rating happened on the human side
e.g.
the query json element must reflect the query the
user typed when we got the product 206
impression in response
• This sounds obvious, but it is not: depending on
the technological stack it may be challenging
• Test, test and test!
Feature Engineering
(recurring visual: the interactions JSON and the resulting training set, as shown in Training Set Building)
Feature Engineering : Feature Level
Each sample is a <query,document> pair; the feature vector is its numerical representation.

Document level
This feature describes a property of the DOCUMENT.
The value of the feature depends only on the document instance.
e.g.
Document Type = E-commerce Product
<Product price> is a Document Level feature.
<Product colour> is a Document Level feature.
<Product size> is a Document Level feature.
Document Type = Hotel Stay
<Hotel star rating> is a Document Level feature.
<Hotel price> is a Document Level feature.
<Hotel food rating> is a Document Level feature.

Query level
This feature describes a property of the QUERY.
The value of the feature depends only on the query instance.
e.g.
Query Type = E-commerce Search
<Query length> is a Query Level feature.
<User device> is a Query Level feature.
<User budget> is a Query Level feature.

Query dependent
This feature describes a property of the QUERY in correlation with the DOCUMENT.
The value of the feature depends on both the query and the document instance.
e.g.
Query Type = E-commerce Search
Document Type = E-commerce Product
<first Query Term TF in Product title> is a Query dependent feature.
<first Query Term DF in Product title> is a Query dependent feature.
<query selected categories intersecting the product categories> is a Query dependent feature.
Feature Engineering : Feature Types
Quantitative
A quantitative feature describes a property
for which the possible values are a measurable quantity.
e.g.
Document Type = E-commerce Product
<Product price> is a quantity
e.g.
Document Type = Hotel Stay
<Hotel distance from city center> is a quantity

Ordinal
An ordinal feature describes a property
for which the possible values are ordered.
Ordinal variables can be considered
“in between” categorical and quantitative variables.
e.g.
Educational level might be categorized as
1: Elementary school education
2: High school graduate
3: Some college
4: College graduate
5: Graduate degree
1 < 2 < 3 < 4 < 5

Categorical
A categorical feature represents an attribute of an
object that has a set of distinct possible values.
In computer science it is common to call the possible
values of a categorical feature an Enumeration.
e.g.
Document Type = E-commerce Product
<Product colour> is a categorical feature
<Product brand> is a categorical feature
N.B. It is easy to observe that giving an order
to the values of a categorical feature
does not bring any benefit.
For the Colour feature:
red < blue < black has no general meaning.
Feature Engineering : One Hot Encoding
Categorical Features
e.g.
Document Type = E-commerce Product
<Product colour> is a categorical feature
Values: Red, Green, Blue, Other
Encoded Features:
Given a cardinality of N, we build N encoded
binary features (or N-1, dropping one to avoid the dummy variable trap):
product_colour_red = 0/1
product_colour_green = 0/1
product_colour_blue = 0/1
product_colour_other = 0/1
Dummy Variable Trap
arises when you have highly correlated features ->
you can predict one feature's value from the others, e.g.
gender_male
gender_female
High Cardinality Categoricals
you may need to encode only the most frequent values
-> information loss
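A minimal one hot encoding sketch with pandas (the column name and values are the hypothetical ones from the slide; drop_first=True drops one of the N columns to avoid the dummy variable trap):

```python
import pandas as pd

df = pd.DataFrame({"product_colour": ["Red", "Green", "Blue", "Other", "Red"]})

# One binary column per value; drop_first=True keeps N-1 of them.
encoded = pd.get_dummies(df, columns=["product_colour"], drop_first=True)
print(encoded)
```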
Feature Engineering : Binary Encoding
Categorical Features
e.g.
Document Type = E-commerce Product
<Product colour> is a categorical feature
Values: Red, Green, Blue, Other
Encoded Features:
1) Ordinal Encoding
Red=0, Green=1 Blue=2 Other=3
2) Binary Encoding
product_colour_bit1 = 0/1
product_colour_bit2 = 0/1
Better for high cardinality categorical features
Multi Valued?
you may have collisions and be unable to use binary features
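A minimal binary encoding sketch in plain pandas, following the two steps above (the ordinal mapping itself is arbitrary, as on the slide):

```python
import math
import pandas as pd

df = pd.DataFrame({"product_colour": ["Red", "Green", "Blue", "Other"]})

# Step 1: ordinal encoding (the order itself carries no meaning).
mapping = {"Red": 0, "Green": 1, "Blue": 2, "Other": 3}
codes = df["product_colour"].map(mapping)

# Step 2: spell each ordinal code out in binary: 2 bits cover 4 values.
n_bits = math.ceil(math.log2(len(mapping)))
for bit in range(n_bits):
    df[f"product_colour_bit{bit + 1}"] = (codes // (2 ** bit)) % 2

print(df)
```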
Feature Engineering : Hash Encoding
Categorical Features
e.g.
Document Type = E-commerce Product
<Product colour> is a categorical feature
Values: Red, Green, Blue, Yellow, Purple, Violet
Encoded Features:
Size = 3
The hashing function takes each category in input
and allocates it to a bucket of a vector of
the specified size, e.g.
product_colour_hash1 = +2
product_colour_hash2 = -1
product_colour_hash3 = +1
• Choose a hash function
• Specify a size for the hash
(a dedicated number of output features)
-> we are going to add just this number of features;
we are basically defining the granularity of the representation
• In total / per feature
• Good for dealing with large-scale cardinality features
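A minimal hash encoding sketch using scikit-learn's FeatureHasher (one possible implementation; its signed hashing produces +/- bucket values like those above, and multi-valued rows work naturally):

```python
from sklearn.feature_extraction import FeatureHasher

# 3 output features for 6 colour values: collisions are accepted by design.
hasher = FeatureHasher(n_features=3, input_type="string")

# Each row is a list of category tokens, so multi-valued features just work.
rows = [["Red"], ["Green", "Blue"], ["Yellow", "Purple", "Violet"]]
hashed = hasher.transform(rows).toarray()
print(hashed)  # signed counts per bucket (exact values depend on the hash)
```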
Feature Engineering : Best Encoding
• Doing One Hot encoding properly => no information loss
• … but it is expensive in time and memory
• So the recommendation is to experiment:
• One Hot Encoding
• One Hot Encoding with a frequency threshold
• Binary (if it fits your use case)
• Hash per feature
• Hash per feature set
• Hashes of different sizes
You want to observe at least: time, memory consumption and offline model
evaluation impact (and potentially run an online experiment)
Feature Engineering : Encoding
When?
It is very likely you have defined a pipeline of steps executed to build your training set from
some sort of implicit/explicit feedback.
When should you encode your features?
Ideally on the smallest data set possible: encoding is expensive in time and memory,
especially for high cardinality features.
So if you do any reduction, collapse or manipulation that shrinks your training
data, do the feature encoding as late as possible.
Feature Engineering : Missing Values
● Sometimes a missing value is equivalent to a 0 value semantic
e.g.
Domain: e-commerce products
Feature: Discount Percentage - [quantitative, document level feature]
a missing discount percentage could model a 0 discount percentage,
so missing values can be filled with 0 values
● Sometimes a missing feature value has a completely different semantic
e.g.
Domain: Hotel Stay
Feature: Star Rating - [quantitative, document level feature]
a missing star rating is not equivalent to a 0 star rating, so an additional feature
should be added to distinguish the two cases
● Take Away:
discuss with the business layer and try to understand your specific use case requirements;
check null and zero counts across your features, you may discover interesting anomalies
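A minimal sketch of both cases with pandas (the column names are the hypothetical ones from the examples above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "discount_percentage": [10.0, np.nan, 25.0],  # missing ~ "no discount"
    "star_rating": [4.0, np.nan, 3.0],            # missing != 0 stars
})

# Case 1: the missing value really means 0.
df["discount_percentage"] = df["discount_percentage"].fillna(0)

# Case 2: preserve the "value was absent" information in an indicator
# feature before filling, so the model can distinguish the two cases.
df["star_rating_missing"] = df["star_rating"].isna().astype(int)
df["star_rating"] = df["star_rating"].fillna(0)

print(df)
```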
Relevance Label Estimation
(recurring visual: the interactions JSON and the resulting training set, as shown in Training Set Building)
Relevance Label : Signal Intensity
Discordant training samples
● Each sample is a user interaction (click, add to cart, sale, etc.)
● Some samples are impressions (we showed the document to the user)
● A rank is attributed to the user interaction types
e.g.
0-Impression < 1-click < 2-add to cart < 3-sale
● The rank becomes the relevance label for the sample
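A minimal sketch of the mapping (the interaction type strings are assumptions; the JSON shown earlier uses the "interactionType" element):

```python
# Rank of each interaction type; the rank becomes the relevance label.
INTERACTION_RANK = {"impression": 0, "click": 1, "add_to_cart": 2, "sale": 3}

def relevance_label(interaction: dict) -> int:
    return INTERACTION_RANK[interaction["interactionType"]]

print(relevance_label({"interactionType": "impression"}))  # 0
print(relevance_label({"interactionType": "sale"}))        # 3
```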
Relevance Label : Simple Click Model
● Each sample is a user interaction (click, add to cart, sale, etc.)
● Some samples are impressions (we showed the document to the user)
● One interaction type is set as the target of optimisation
● Identical samples are aggregated; the new sample generated will have a new feature:
[Interaction Type Count / Impressions]
e.g. CTR (Click Through Rate) = for the sample, number of clicks / number of impressions
● We then take the resulting score (for CTR, 0 < x < 1) and normalise it to get the relevance label.
The relevance label scale will depend on the training algorithm chosen.
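A minimal CTR aggregation sketch with pandas (toy log; the column names are assumptions):

```python
import pandas as pd

# Toy interaction log: one row per logged <query, document> interaction.
log = pd.DataFrame({
    "query_id": [1, 1, 1, 1, 2, 2],
    "doc_id":   ["a", "a", "a", "b", "a", "a"],
    "type":     ["impression", "impression", "click",
                 "impression", "impression", "click"],
})

# Aggregate identical <query, document> samples: CTR = clicks / impressions.
ctr = (log.groupby(["query_id", "doc_id"])["type"]
          .apply(lambda t: (t == "click").sum() / (t == "impression").sum())
          .rename("ctr")
          .reset_index())
print(ctr)
```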
Relevance Label : Advanced Click Model
Given a sample:
● We have the CTR from the previous model
● We compare it with the avg CTR of all samples
● We take into account the statistical significance of each sample
(how likely it is that we got that result by chance, given the number of observations)
● The relevance label will be CTR / avg CTR
(and we drop samples deemed to be statistically irrelevant)
Yet to test it online in production, stay tuned!
More info in John Berryman's blog:
http://blog.jnbrymn.com/2018/04/16/better-click-tracking-1/
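A possible sketch of the significance filter, using a plain binomial test as a stand-in for the exact statistics described in the blog post (scipy assumed):

```python
from scipy.stats import binomtest

def relevance(clicks: int, impressions: int, avg_ctr: float,
              alpha: float = 0.05):
    """CTR / avg CTR, or None when the sample is not significantly
    different from the average CTR (i.e. drop the sample)."""
    test = binomtest(clicks, impressions, p=avg_ctr)
    if test.pvalue > alpha:
        return None  # plausibly observed by chance on this sample size
    return (clicks / impressions) / avg_ctr

print(relevance(clicks=40, impressions=100, avg_ctr=0.1))  # strong signal
print(relevance(clicks=1, impressions=3, avg_ctr=0.1))     # dropped -> None
```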
Relevance Label : Normalisation
● Min/Max normalisation based on local values (local to your data set)
e.g.
your max CTR is 0.8, that is the max relevance you can expect;
your min CTR is 0.2, that is the 0 relevance
Risk of overfitting - not simple to compare offline
● Min/Max normalisation based on absolute values
e.g.
max CTR is 1.0, that is the max relevance you can expect;
min CTR is 0, that is the 0 relevance
Using absolute values you have consistent data sets over time,
but you are flattening the relevance labels in your current data set
● Scale of relevance?
0-1
0-4
0-10?
Experiment!
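A minimal normalisation sketch showing both flavours (the 0-4 label scale is one of the options listed above):

```python
def normalise(ctr: float, min_ctr: float, max_ctr: float,
              scale: int = 4) -> int:
    """Map a CTR into an integer relevance label on [0, scale]."""
    clipped = min(max(ctr, min_ctr), max_ctr)
    return round((clipped - min_ctr) / (max_ctr - min_ctr) * scale)

# Local min/max, observed in this data set: 0.2 -> label 0, 0.8 -> label 4.
print(normalise(0.5, min_ctr=0.2, max_ctr=0.8))  # 2
# Absolute min/max: consistent across data sets, but flatter labels.
print(normalise(0.5, min_ctr=0.0, max_ctr=1.0))  # 2
```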
Point-wise / Pair-wise / List-wise
How many documents do you consider at a time when calculating the loss function
for your Learning To Rank model?

Point-wise
● Single document
● You estimate a function that predicts the best score for the document
● Rank the results on the predicted score
● The score of the doc is independent of the other scores
in the same result list
● You can use any regression or classification algorithm

Pair-wise
● Pair of documents
● You estimate the optimal local ordering to improve the global one
● The objective is to choose local orderings that minimise the
number of inversions across all pairs
● Generally works better than point-wise, because predicting
a local ordering is closer to solving the ranking problem
than just estimating a regression score

List-wise
● Entire list of documents for a given query
● Direct optimisation of IR measures such as NDCG
● Minimise a specific loss function
● The evaluation measure is averaged across the queries
● Generally works better than pair-wise
Learning To Rank : Metric Evaluation
Offline Evaluation Metrics [1/3]
• precision = TruePositives / (TruePositives + FalsePositives)
• precision@K = TruePositives / (TruePositives + FalsePositives), computed over the top K results
• (precision@1, precision@2, precision@10)
• recall = TruePositives / (TruePositives + FalseNegatives)
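A minimal sketch of precision@K and recall for a single ranked list (plain Python; "relevant" is the set of known relevant document ids):

```python
def precision_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of the top-K results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall(relevant: set, retrieved: list) -> float:
    """Fraction of all relevant documents that were retrieved."""
    return sum(1 for doc in retrieved if doc in relevant) / len(relevant)

relevant = {"a", "b", "c"}
ranked = ["a", "x", "b", "y", "z"]
print(precision_at_k(relevant, ranked, k=2))  # 0.5
print(recall(relevant, ranked))               # 0.666...
```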
Learning To Rank : Metric Evaluation
Offline Evaluation Metrics [2/3]
Let’s combine Precision and recall:
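The combined metric (rendered as an image on the original slide) is, in all likelihood, the F-measure:

```latex
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}
           {\mathrm{precision} + \mathrm{recall}}
\qquad
F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}
               {\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}
```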
Learning To Rank : Metric Evaluation
Offline Evaluation Metrics [3/3]
• DCG@K = Discounted Cumulative Gain @ K
• NDCG@K = DCG@K / Ideal DCG@K
Relevance labels by rank position (last row = resulting NDCG):

Position   Model1   Model2   Model3   Ideal
1          1        2        2        4
2          2        3        4        3
3          3        2        3        2
4          4        4        2        2
5          2        1        1        1
6          0        0        0        0
7          0        0        0        0
NDCG       0.64     0.73     0.79     1.0
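A minimal NDCG@K sketch with one common gain/discount formulation (the slide's exact numbers depend on the formulation chosen):

```python
import math

def dcg_at_k(labels: list, k: int) -> float:
    """DCG@K with the common (2^rel - 1) / log2(pos + 1) gain."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(labels[:k]))

def ndcg_at_k(labels: list, k: int) -> float:
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

model1 = [1, 2, 3, 4, 2, 0, 0]  # Model1's relevance labels in ranked order
print(round(ndcg_at_k(model1, k=7), 2))  # ~0.62 with this formulation
```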
Learning To Rank : Metric Evaluation
Learning To Rank : List-wise and NDCG
● The list is the result set of documents
for a given query Id
● In LambdaMART, NDCG@K per list is often used
● The evaluation metric is averaged over the query Ids
when evaluating a training iteration
What happens if you have very small ranked lists per query?
What happens if all the documents in a ranked list have
the same relevance label?
It is extremely important to assess
the distribution of training samples
per query Id.

Query1:
Model1 Model2 Model3 Ideal
1      1      1      1
1      1      1      1

Query2:
Model1 Model2 Model3 Ideal
3      3      3      3
7      7      7      7

Under-sampled query Ids can
potentially skyrocket your NDCG
avg
Query Id Generation
(recurring visual: the interactions JSON and the resulting training set, as shown in Training Set Building)
Build the Lists: QueryId Hashing 1/3
● Carefully design how you calculate your query Id:
it represents the Q of <Q,D> in each query-document sample.
The same query across your data set must have the same Id
● No free text query? Group query level features (simple hashing, clustering)
● The free text query is additional information to build a proper query Id where available
(it can be used as the Id on its own)
● Target: reaching a uniform distribution of samples per query Id
● Drop training samples if query Ids are under-sampled
● Experiment!
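A minimal sketch of a stable query id from concatenated query level features (the feature choice is the hyper-parameter; md5 is just one deterministic hash):

```python
import hashlib

def query_id(interaction: dict, feature_names: list) -> str:
    """Hash the selected query level features (free text query included,
    when available) into a query id that is stable across the data set."""
    key = "|".join(str(interaction.get(name, "")) for name in feature_names)
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:12]

interaction = {"query": "free text query", "userDevice": "item9",
               "querySelectedBrands": [209, 201, 204]}
print(query_id(interaction, ["query", "userDevice", "querySelectedBrands"]))
```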
Build the Lists: QueryId Hashing 2/3
How to Generate the Query Id
• Concatenate the most prominent query level selected features (chosen with the business)
• Concatenate all query level selected features
• Clustering-based query id generation, with 3 different approaches.
In particular, for k-means, with many different numbers of clusters
(the number of clusters is a required k-means input parameter;
in all the clustering approaches a cluster corresponds to a query id group).
• A lot of hyper-parameters! -> a lot of experiments
Future Works
• What happens if we keep the under-sampled query ids instead of dropping them
(assigning a new queryId to all of them, for example)?
Build the Lists: QueryId Hashing 3/3
How to Compare the Results?
• The average number of samples per query id in the data set
• The standard deviation of the query id distribution
(which tells us whether the number of rows per query id is evenly distributed or not)
• The number of distinct query ids
• The number of query ids with a low number of rows (below a threshold)
• The number of query ids with 1 sample
• The number of query ids, and their percentage, inside some ranges of values.
For example:
(0 to 2] - 3 query ids - 30%
(2 to threshold] - 2 query ids - 20%
(threshold to 99] - 5 query ids - 50%
• The drop rate (the percentage of rows dropped during the removal of the under-sampled query ids)
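A minimal sketch computing these statistics over a toy query id column (pandas assumed):

```python
import pandas as pd

# Toy query id column; in practice, one value per training sample.
qids = pd.Series([1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5])
counts = qids.value_counts()

threshold = 3
print("avg samples per query id:", counts.mean())
print("std dev:", counts.std())
print("distinct query ids:", counts.size)
print("query ids below threshold:", (counts < threshold).sum())
print("query ids with 1 sample:", (counts == 1).sum())
# Drop rate: percentage of rows removed with the under-sampled query ids.
dropped = qids.isin(counts[counts < threshold].index).mean() * 100
print(f"drop rate: {dropped:.0f}%")
```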
Split the Set: Training/Test
● Each training iteration will assess the evaluation metric on the training and validation sets
● At the end of the iterations the final model will be evaluated on an unknown Test Set
● This split could be random
● This split could depend on the time the interactions were collected
Split the Set: Training/Test
K-fold Cross Validation
Split the Set: Training/Test
• Be careful if you clean your data set and then split, because you may end up
with an unfair Test Set
• the Test set MUST NOT contain under-sampled query Ids
• the Test set MUST NOT contain query Ids with a single relevance label
• the Test set MUST be representative and of an acceptable size
• It is useful to use the same Test Set for multiple refinements
of the training set building/training, for offline evaluation
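A minimal sketch of a query-aware split using scikit-learn's GroupShuffleSplit (one way to keep each ranked list whole; a time-based split would instead order the queries by timestamp):

```python
from sklearn.model_selection import GroupShuffleSplit

# Toy data: X = feature vectors, y = relevance labels, groups = query ids.
X = [[3.4, 0.7], [5.0, 0.4], [2.4, 0.7], [5.7, 0.2], [0.0, 0.5], [1.0, 0.7]]
y = [3, 2, 0, 1, 3, 0]
groups = [1, 1, 1, 2, 2, 3]

# Splitting on query ids guarantees a query never appears in both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups))
print("train queries:", sorted({groups[i] for i in train_idx}))
print("test queries:", sorted({groups[i] for i in test_idx}))
```

After the split, the resulting Test Set should still be checked against the constraints above (no under-sampled query Ids, no single-label query Ids).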
Thanks!

More Related Content

What's hot

Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachSearch Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachAlessandro Benedetti
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveAndrea Gazzarini
 
Haystack London - Search Quality Evaluation, Tools and Techniques
Haystack London - Search Quality Evaluation, Tools and Techniques Haystack London - Search Quality Evaluation, Tools and Techniques
Haystack London - Search Quality Evaluation, Tools and Techniques Andrea Gazzarini
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupSease
 
Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)Andrea Gazzarini
 
From Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank StoryFrom Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank StoryAlessandro Benedetti
 
Enterprise Search – How Relevant Is Relevance?
Enterprise Search – How Relevant Is Relevance?Enterprise Search – How Relevant Is Relevance?
Enterprise Search – How Relevant Is Relevance?Sease
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAlessandro Benedetti
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to RankSease
 
How the Lucene More Like This Works
How the Lucene More Like This WorksHow the Lucene More Like This Works
How the Lucene More Like This WorksSease
 
probabilistic ranking
probabilistic rankingprobabilistic ranking
probabilistic rankingFELIX75
 
Lucene And Solr Document Classification
Lucene And Solr Document ClassificationLucene And Solr Document Classification
Lucene And Solr Document ClassificationAlessandro Benedetti
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Lucidworks
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Lucidworks
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchTrey Grainger
 

What's hot (16)

Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachSearch Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Haystack London - Search Quality Evaluation, Tools and Techniques
Haystack London - Search Quality Evaluation, Tools and Techniques Haystack London - Search Quality Evaluation, Tools and Techniques
Haystack London - Search Quality Evaluation, Tools and Techniques
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
 
Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)Rated Ranking Evaluator (FOSDEM 2019)
Rated Ranking Evaluator (FOSDEM 2019)
 
From Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank StoryFrom Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank Story
 
Enterprise Search – How Relevant Is Relevance?
Enterprise Search – How Relevant Is Relevance?Enterprise Search – How Relevant Is Relevance?
Enterprise Search – How Relevant Is Relevance?
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache Lucene
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to Rank
 
How the Lucene More Like This Works
How the Lucene More Like This WorksHow the Lucene More Like This Works
How the Lucene More Like This Works
 
probabilistic ranking
probabilistic rankingprobabilistic ranking
probabilistic ranking
 
Lucene And Solr Document Classification
Lucene And Solr Document ClassificationLucene And Solr Document Classification
Lucene And Solr Document Classification
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 

Similar to How to Build your Training Set for a Learning To Rank Project - Haystack

Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseFeatureByte
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Spark Summit
 
Iterative Methodology for Personalization Models Optimization
 Iterative Methodology for Personalization Models Optimization Iterative Methodology for Personalization Models Optimization
Iterative Methodology for Personalization Models OptimizationSonya Liberman
 
Improving the Quality of Existing Software
Improving the Quality of Existing SoftwareImproving the Quality of Existing Software
Improving the Quality of Existing SoftwareSteven Smith
 
A journey to_be_a_software_craftsman
A journey to_be_a_software_craftsmanA journey to_be_a_software_craftsman
A journey to_be_a_software_craftsmanJaehoon Oh
 
Managing an Experimentation Platform by LinkedIn Product Leader
Managing an Experimentation Platform by LinkedIn Product LeaderManaging an Experimentation Platform by LinkedIn Product Leader
Managing an Experimentation Platform by LinkedIn Product LeaderProduct School
 
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector QuantizationPR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector QuantizationSunghoon Joo
 
Done in 60 seconds - Creating Web 2.0 applications made easy
Done in 60 seconds - Creating Web 2.0 applications made easyDone in 60 seconds - Creating Web 2.0 applications made easy
Done in 60 seconds - Creating Web 2.0 applications made easyRoel Hartman
 
CAD Certification
CAD CertificationCAD Certification
CAD CertificationVskills
 
A modern architecturereview–usingcodereviewtools-ver-3.5
A modern architecturereview–usingcodereviewtools-ver-3.5A modern architecturereview–usingcodereviewtools-ver-3.5
A modern architecturereview–usingcodereviewtools-ver-3.5SSW
 
Business Functional Requirements
Business Functional RequirementsBusiness Functional Requirements
Business Functional RequirementsSunil-QA
 
AI improves software testing by Kari Kakkonen at TQS
AI improves software testing by Kari Kakkonen at TQSAI improves software testing by Kari Kakkonen at TQS
AI improves software testing by Kari Kakkonen at TQSKari Kakkonen
 
Improving the Quality of Existing Software - DevIntersection April 2016
Improving the Quality of Existing Software - DevIntersection April 2016Improving the Quality of Existing Software - DevIntersection April 2016
Improving the Quality of Existing Software - DevIntersection April 2016Steven Smith
 
Telecom datascience master_public
Telecom datascience master_publicTelecom datascience master_public
Telecom datascience master_publicVincent Michel
 
Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Johann de Boer
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
Mark Tortoricci - Talent42 2015
Mark Tortoricci - Talent42 2015Mark Tortoricci - Talent42 2015
Mark Tortoricci - Talent42 2015Talent42
 

Similar to How to Build your Training Set for a Learning To Rank Project - Haystack (20)

Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data Warehouse
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
 
Iterative Methodology for Personalization Models Optimization
 Iterative Methodology for Personalization Models Optimization Iterative Methodology for Personalization Models Optimization
Iterative Methodology for Personalization Models Optimization
 
Improving the Quality of Existing Software
Improving the Quality of Existing SoftwareImproving the Quality of Existing Software
Improving the Quality of Existing Software
 
A journey to_be_a_software_craftsman
A journey to_be_a_software_craftsmanA journey to_be_a_software_craftsman
A journey to_be_a_software_craftsman
 
Managing an Experimentation Platform by LinkedIn Product Leader
Managing an Experimentation Platform by LinkedIn Product LeaderManaging an Experimentation Platform by LinkedIn Product Leader
Managing an Experimentation Platform by LinkedIn Product Leader
 
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector QuantizationPR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
 
Done in 60 seconds - Creating Web 2.0 applications made easy
Done in 60 seconds - Creating Web 2.0 applications made easyDone in 60 seconds - Creating Web 2.0 applications made easy
Done in 60 seconds - Creating Web 2.0 applications made easy
 
CAD Certification
CAD CertificationCAD Certification
CAD Certification
 
Code Refactoring
Code RefactoringCode Refactoring
Code Refactoring
 
A modern architecturereview–usingcodereviewtools-ver-3.5
A modern architecturereview–usingcodereviewtools-ver-3.5A modern architecturereview–usingcodereviewtools-ver-3.5
A modern architecturereview–usingcodereviewtools-ver-3.5
 
Business Functional Requirements
Business Functional RequirementsBusiness Functional Requirements
Business Functional Requirements
 
AI improves software testing by Kari Kakkonen at TQS
AI improves software testing by Kari Kakkonen at TQSAI improves software testing by Kari Kakkonen at TQS
AI improves software testing by Kari Kakkonen at TQS
 
Improving the Quality of Existing Software - DevIntersection April 2016
Improving the Quality of Existing Software - DevIntersection April 2016Improving the Quality of Existing Software - DevIntersection April 2016
Improving the Quality of Existing Software - DevIntersection April 2016
 
BDD Primer
BDD PrimerBDD Primer
BDD Primer
 
Telecom datascience master_public
Telecom datascience master_publicTelecom datascience master_public
Telecom datascience master_public
 
Bdd with m spec
Bdd with m specBdd with m spec
Bdd with m spec
 
Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015
 
Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
Mark Tortoricci - Talent42 2015
Mark Tortoricci - Talent42 2015Mark Tortoricci - Talent42 2015
Mark Tortoricci - Talent42 2015
 

More from Sease

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors LuceneSease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneSease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveSease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache SolrSease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale IndexingSease
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfSease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfSease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxSease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingSease
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationSease
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneSease
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSease
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information RetrievalSease
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationSease
 
Feature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsFeature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsSease
 

More from Sease (20)

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache Lucene
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information Retrieval
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
 
Feature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsFeature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text Collections
 

Recently uploaded

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

How to Build your Training Set for a Learning To Rank Project - Haystack

  • 12. The Most Important Thing Interactions [{ "productId": 206, "interactionRelevance": 0, "interactionType": "impression", "timestamp": "2019-03-15T18:19:34Z", "userId": "id4", "query": "free text query", "userDevice": "item9", "querySelectedBrands": [209, 201, 204], "cartCategories": [200, 206, 204], "userFavouriteColours": [208, 208, 202, 202], "userAvgPrice": 43, "productBrand": 207, "productPrice": 22.0, "productDiscount": 0.7, "productReviewAvg": 4.5, "productReviews": 200, "productSales": 207, "productSalesLastWeek": 203 },{…},{…},{…},{…},{…},{…},{…}] • !!! The Interactions/Ratings must be syntactically correct !!! • The features associated with the Interaction/Rating must reflect the real <query,document> pair as the user experienced it when the interaction/rating happened e.g. the query JSON element must contain the query the user actually typed when we logged the impression for product 206 • This sounds obvious, but it is not: depending on the technology stack it may be challenging • Test, test and test!
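As a hedged illustration of that testing step, here is a minimal sketch (Python) of a sanity check over logged interactions; the field names mirror the example JSON above, while the required fields and the set of valid interaction types are assumptions:

# Minimal sanity check over logged interactions. Field names follow the
# example JSON above; the required fields and valid types are assumptions.
REQUIRED_FIELDS = {"productId", "interactionRelevance", "interactionType",
                   "timestamp", "userId", "query"}
VALID_TYPES = {"impression", "click", "add_to_cart", "sale"}

def validate_interaction(interaction: dict) -> list:
    """Return a list of problems found in a single logged interaction."""
    problems = []
    missing = REQUIRED_FIELDS - interaction.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if interaction.get("interactionType") not in VALID_TYPES:
        problems.append(f"unknown interactionType: {interaction.get('interactionType')}")
    # The logged query must be the one the user actually typed at
    # interaction time, not a later re-normalised version of it.
    if not interaction.get("query"):
        problems.append("empty query: the <query,document> pair cannot be reconstructed")
    return problems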
  • 13. Feature Engineering 3 qid:1 0:3.4 1:0.7 2:1.5 3:0 2 qid:1 0:5.0 1:0.4 2:1.3 3:0 0 qid:1 0:2.4 1:0.7 2:1.5 3:1 1 qid:2 0:5.7 1:0.2 2:1.1 3:0 3 qid:2 0:0.0 1:0.5 2:4.0 3:0 0 qid:3 0:1.0 1:0.7 2:1.5 3:1 Interactions Training Set Building [{ "productId": 206, "interactionRelevance": 0, "interactionType": "impression", "timestamp": "2019-03-15T18:19:34Z", "userId": "id4", "query": "free text query", "userDevice": "item9", "querySelectedBrands": [209, 201, 204], "cartCategories": [200, 206, 204], "userFavouriteColours": [208, 208, 202, 202], "userAvgPrice": 43, "productBrand": 207, "productPrice": 22.0, "productDiscount": 0.7, "productReviewAvg": 4.5, "productReviews": 200, "productSales": 207, "productSalesLastWeek": 203 },{…},{…},{…},{…},{…},{…},{…}]
  • 14. Feature Engineering : Feature Level. Each sample is a <query,document> pair, and the feature vector is its numerical representation. Features belong to one of three levels:
Document level: this feature describes a property of the DOCUMENT; its value depends only on the document instance. e.g. Document Type = E-commerce Product: <Product price>, <Product colour> and <Product size> are Document Level features. Document Type = Hotel Stay: <Hotel star rating>, <Hotel price> and <Hotel food rating> are Document Level features.
Query level: this feature describes a property of the QUERY; its value depends only on the query instance. e.g. Query Type = E-commerce Search: <Query length>, <User device> and <User budget> are Query Level features.
Query dependent: this feature describes a property of the QUERY in correlation with the DOCUMENT; its value depends on both the query and the document instance. e.g. Query Type = E-commerce Search, Document Type = E-commerce Product: <first Query Term TF in Product title>, <first Query Term DF in Product title> and <query selected categories intersecting the product categories> are Query Dependent features.
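A hedged sketch of the three levels for one sample; the field names and the specific features are hypothetical:

# Hypothetical sketch: one <query,document> feature vector combining the
# three feature levels described above.
def feature_vector(query: dict, product: dict) -> list:
    query_terms = set(query["text"].lower().split())
    title_terms = set(product["title"].lower().split())
    return [
        product["price"],                # document level: depends on the product only
        len(query_terms),                # query level: depends on the query only
        len(query_terms & title_terms),  # query dependent: depends on both
    ]

feature_vector({"text": "Queen albums"}, {"title": "Queen Greatest Hits", "price": 9.99})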
  • 15. Feature Engineering : Feature Types. Features belong to one of three types:
Quantitative: describes a property whose possible values are a measurable quantity. e.g. Document Type = E-commerce Product: <Product price> is a quantity. Document Type = Hotel Stay: <Hotel distance from city center> is a quantity.
Ordinal: describes a property whose possible values are ordered; ordinal variables can be considered "in between" categorical and quantitative variables. e.g. Educational level might be categorized as 1: Elementary school education, 2: High school graduate, 3: Some college, 4: College graduate, 5: Graduate degree, with 1<2<3<4<5.
Categorical: represents an attribute of an object that has a set of distinct possible values; in computer science the possible values of a categorical feature are commonly called Enumerations. e.g. Document Type = E-commerce Product: <Product colour> and <Product brand> are categorical features. N.B. It is easy to observe that ordering the values of a categorical feature brings no benefit: for the Colour feature, red < blue < black has no general meaning.
  • 16. Feature Engineering : One Hot Encoding Categorical Features. e.g. Document Type = E-commerce Product: <Product colour> is a categorical feature with values Red, Green, Blue, Other. Encoded Features: given a cardinality of N, we build N binary features (product_colour_red = 0/1, product_colour_green = 0/1, product_colour_blue = 0/1, product_colour_other = 0/1), or N-1 if we drop one to avoid the trap below. Dummy Variable Trap: keeping all N binary features yields highly correlated features, since any one of them can be predicted from the others (e.g. gender_male and gender_female). High Cardinality Categoricals: you may need to encode only the most frequent values -> information loss.
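A hedged one-hot sketch with pandas; drop_first=True keeps N-1 columns, which is one way to sidestep the dummy variable trap (the data values are illustrative):

import pandas as pd

# One-hot encoding sketch: drop_first=True emits N-1 binary columns,
# avoiding the dummy variable trap described above.
df = pd.DataFrame({"product_colour": ["red", "green", "blue", "other", "red"]})
encoded = pd.get_dummies(df["product_colour"], prefix="product_colour",
                         drop_first=True)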
  • 17. Feature Engineering : Binary Encoding Categorical Features. e.g. Document Type = E-commerce Product: <Product colour> is a categorical feature with values Red, Green, Blue, Other. Encoded Features: 1) Ordinal Encoding: Red=0, Green=1, Blue=2, Other=3; 2) Binary Encoding: product_colour_bit1 = 0/1, product_colour_bit2 = 0/1. Better for high cardinality categoricals. Multi valued? You may have collisions and be unable to use binary features.
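A minimal binary-encoding sketch, assuming the ordinal mapping shown above:

import math

# Binary encoding sketch: map each category to an ordinal, then spread the
# ordinal's bits over ceil(log2(N)) binary features.
values = ["red", "green", "blue", "other"]
ordinal = {v: i for i, v in enumerate(values)}  # red=0, green=1, blue=2, other=3
n_bits = max(1, math.ceil(math.log2(len(values))))

def binary_encode(colour: str) -> list:
    code = ordinal[colour]
    return [(code >> bit) & 1 for bit in range(n_bits)]

binary_encode("blue")  # -> [0, 1] under this ordinal mapping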
  • 18. Feature Engineering : Hash Encoding Categorical Features. e.g. Document Type = E-commerce Product: <Product colour> is a categorical feature with values Red, Green, Blue, Yellow, Purple, Violet. Encoded Features (size=3): the hashing function takes each category in input and allocates it to a bucket of a vector of the specified size, e.g. product_colour_hash1 = +2, product_colour_hash2 = -1, product_colour_hash3 = +1. • Choose a hash function • Specify a size for the hash (a dedicated number of output features) -> we only add this number of features, effectively defining the granularity of the representation • The size can be set in total or per feature • Good for dealing with large cardinality features.
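A hedged sketch with scikit-learn's FeatureHasher, which uses a signed hash so multi-valued fields sum their +1/-1 contributions into a fixed number of buckets:

from sklearn.feature_extraction import FeatureHasher

# Hash encoding sketch: each category lands in one of n_features buckets
# with a signed (+1/-1) contribution; collisions are the price of the
# fixed representation size.
hasher = FeatureHasher(n_features=3, input_type="string")
encoded = hasher.transform([["red", "blue"], ["yellow"]]).toarray()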
  • 19. Feature Engineering : Best Encoding • Doing One Hot encoding properly => no information loss • … but it is expensive in time and memory • So the recommendation is to experiment : • One Hot Encoding • One Hot Encoding with freq threshold • Binary (if it fits your use case) • Hash per feature • Hash per feature set • Hash of different sizes You want to observe: time, memory consumption, offline model evaluation impact at least (and potentially run an online experiment)
  • 20. Feature Engineering : Encoding When? It is very likely you have defined a pipeline of steps that builds your training set from some sort of implicit/explicit feedback. When should you encode your features? Ideally on the smallest data set possible: encoding is expensive in time and memory, especially for high cardinality features. So if any reduction, collapse or manipulation of the data shrinks your training set, do the feature encoding as late as possible.
  • 21. Feature Engineering : Missing Values ● Sometimes a missing value is equivalent to a 0 value semantic e.g. Domain: e-commerce products, Feature: Discount Percentage [quantitative, document level feature]: a missing discount percentage can model a 0 discount percentage, so missing values can be filled with 0 ● Sometimes a missing feature value has a completely different semantic e.g. Domain: Hotel Stay, Feature: Star Rating [quantitative, document level feature]: a missing star rating is not equivalent to a 0 star rating, so an additional feature should be added to distinguish the two cases ● Take Away: discuss with the business layer and try to understand your specific use case requirements; check null and zero counts across your features, you may discover interesting anomalies
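A minimal sketch of the two cases above (column names are hypothetical): a missing discount becomes 0, while a missing star rating gets a companion indicator feature:

import numpy as np
import pandas as pd

# Missing-value handling sketch for the two semantics described above.
df = pd.DataFrame({"discount": [0.1, np.nan, 0.3],
                   "star_rating": [4.0, np.nan, 3.0]})
df["discount"] = df["discount"].fillna(0.0)            # missing == no discount
df["star_rating_missing"] = df["star_rating"].isna().astype(int)
df["star_rating"] = df["star_rating"].fillna(df["star_rating"].median())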
  • 22. Relevance Label Estimation 3 qid:1 0:3.4 1:0.7 2:1.5 3:0 2 qid:1 0:5.0 1:0.4 2:1.3 3:0 0 qid:1 0:2.4 1:0.7 2:1.5 3:1 1 qid:2 0:5.7 1:0.2 2:1.1 3:0 3 qid:2 0:0.0 1:0.5 2:4.0 3:0 0 qid:3 0:1.0 1:0.7 2:1.5 3:1 Interactions Training Set Building [{ "productId": 206, "interactionRelevance": 0, "interactionType": "impression", "timestamp": "2019-03-15T18:19:34Z", "userId": "id4", "query": "free text query", "userDevice": "item9", "querySelectedBrands": [209, 201, 204], "cartCategories": [200, 206, 204], "userFavouriteColours": [208, 208, 202, 202], "userAvgPrice": 43, "productBrand": 207, "productPrice": 22.0, "productDiscount": 0.7, "productReviewAvg": 4.5, "productReviews": 200, "productSales": 207, "productSalesLastWeek": 203 },{…},{…},{…},{…},{…},{…},{…}]
  • 23. Relevance Label : Signal Intensity ● Each sample is a user interaction (click, add to cart, sale, etc.) ● Some samples are impressions (we showed the document to the user) ● A rank is attributed to the user interaction types e.g. 0-Impression < 1-click < 2-add to cart < 3-sale ● The rank becomes the relevance label for the sample ● Beware of discordant training samples: the same <query,document> pair can be observed with different interaction types
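A minimal sketch of the ranking idea, using the example ordering from the slide (the interaction type names are assumptions); keeping the strongest signal is one hypothetical way to resolve discordant samples:

# Signal-intensity sketch: the interaction type's rank becomes the label.
INTERACTION_RANK = {"impression": 0, "click": 1, "add_to_cart": 2, "sale": 3}

def relevance_label(interaction: dict) -> int:
    return INTERACTION_RANK[interaction["interactionType"]]

def resolve_discordant(labels: list) -> int:
    # One possible policy: trust the strongest observed signal.
    return max(labels)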
  • 24. Relevance Label : Simple Click Model ● Each sample is a user interaction (click, add to cart, sale, etc.) ● Some samples are impressions (we showed the document to the user) ● One interaction type is set as the target of the optimisation ● Identical samples are aggregated; the newly generated sample gains a new feature: [Interaction Type Count / Impressions] e.g. CTR (Click Through Rate) = for the sample, number of clicks / number of impressions ● We then take the resulting score (for CTR, 0<x<1) and normalise it to get the relevance label. The relevance label scale will depend on the training algorithm chosen
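A hedged pandas sketch of this aggregation; the column names and the 0-4 label scale are assumptions:

import pandas as pd

# Simple click model sketch: aggregate identical <query,document> samples,
# compute CTR, then min/max-normalise it to an integer label in [0, 4].
df = pd.DataFrame({
    "query_id":   [1, 1, 1, 1, 2, 2],
    "product_id": [10, 10, 11, 11, 10, 10],
    "clicked":    [1, 0, 0, 0, 1, 1],
})
agg = (df.groupby(["query_id", "product_id"])["clicked"]
         .agg(clicks="sum", impressions="count"))
agg["ctr"] = agg["clicks"] / agg["impressions"]
span = (agg["ctr"].max() - agg["ctr"].min()) or 1.0  # guard against a flat CTR
agg["label"] = ((agg["ctr"] - agg["ctr"].min()) / span * 4).round().astype(int)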
  • 25. Relevance Label : Advanced Click Model Given a sample: ● We have the CTR from the previous model ● We compare it with the avg CTR of all samples ● We take into account the statistical significance of each sample (how likely it is that we got that result by chance, given the number of observations) ● The relevance label will be CTR / avg CTR (and we drop samples deemed statistically insignificant) Yet to test it online in production, stay tuned! More info in John Berryman's blog: http://blog.jnbrymn.com/2018/04/16/better-click-tracking-1/
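The exact model is in the blog post above; as a hedged stand-in for the significance idea, a Wilson lower bound on the CTR pulls low-traffic samples toward zero instead of trusting their raw rate:

import math

# Wilson lower bound sketch: a conservative CTR estimate that discounts
# samples with few impressions (z=1.96 ~ 95% confidence).
def wilson_lower_bound(clicks: int, impressions: int, z: float = 1.96) -> float:
    if impressions == 0:
        return 0.0
    p = clicks / impressions
    denominator = 1 + z * z / impressions
    centre = p + z * z / (2 * impressions)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * impressions)) / impressions)
    return (centre - margin) / denominator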
  • 26. Relevance Label : Normalisation ● Min/Max normalisation based on local values (local to your data set) e.g. your max CTR is 0.8: that is the max relevance you can expect; your min CTR is 0.2: that is relevance 0 ● Min/Max normalisation based on absolute values e.g. the max CTR is 1.0: that is the max relevance; the min CTR is 0: that is relevance 0 ● Scale of relevance? 0-1, 0-4, 0-10? Experiment! Local values carry a risk of overfitting and are not simple to compare offline across data sets; with absolute values you get consistent data sets over time, but you flatten the relevance labels in your current data set
  • 27. Learning To Rank : Metric Evaluation. Point-wise / Pair-wise / List-wise: how many documents do you consider at a time when calculating the loss function for your Learning To Rank model? ● Point-wise: a single document ● You estimate a function that predicts the best score for the document ● Rank the results by the predicted score ● The score of a doc is independent of the other scores in the same result list ● You can use any regression or classification algorithm ● Pair-wise: a pair of documents ● You estimate the optimal local orderings to maximise the quality of the global ordering ● The objective is to set local orderings so as to minimise the number of inversions across all pairs ● Generally works better than point-wise, because predicting a local ordering is closer to solving the ranking problem than just estimating a regression score ● List-wise: the entire list of documents for a given query ● Direct optimisation of IR measures such as NDCG ● Minimise a specific loss function ● The evaluation measure is averaged across the queries ● Generally works better than pair-wise
  • 28. Offline Evaluation Metrics [1/3] • precision = TruePositives / (TruePositives + FalsePositives) • precision@K = TruePositives / (TruePositives + FalsePositives), computed over the top K results (e.g. precision@1, precision@2, precision@10) • recall = TruePositives / (TruePositives + FalseNegatives) Learning To Rank : Metric Evaluation
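A minimal sketch of these metrics; `ranked` is the result list and `relevant` the set of relevant document ids, both hypothetical names:

# Precision@K and recall sketch over a ranked list of document ids.
def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall(ranked: list, relevant: set) -> float:
    return sum(1 for doc in ranked if doc in relevant) / len(relevant) if relevant else 0.0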
  • 29. Offline Evaluation Metrics [2/3] Let's combine Precision and Recall: the F-measure is their harmonic mean, F1 = 2 * (Precision * Recall) / (Precision + Recall) Learning To Rank : Metric Evaluation
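And the corresponding one-liner, continuing the sketch above:

# F1 sketch: harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0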
  • 30. Offline Evaluation Metrics [3/3] • DCG@K = Discounted Cumulative Gain@K, commonly computed as DCG@K = sum for i=1..K of (2^rel_i - 1) / log2(i + 1) • NDCG@K = DCG@K / Ideal DCG@K. Example relevance labels by rank position, with the resulting NDCG:

Rank   Model1  Model2  Model3  Ideal
1        1       2       2      4
2        2       3       4      3
3        3       2       3      2
4        4       4       2      2
5        2       1       1      1
6        0       0       0      0
7        0       0       0      0
NDCG   0.64    0.73    0.79    1.0

Learning To Rank : Metric Evaluation
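A hedged NDCG@K sketch using the exponential-gain formulation given above; a linear gain (rel_i itself) is also common and would change the numbers:

import math

# NDCG@K sketch with gain 2^rel - 1 and discount log2(position + 1).
def dcg_at_k(relevance: list, k: int) -> float:
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevance[:k]))

def ndcg_at_k(relevance: list, k: int) -> float:
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

ndcg_at_k([1, 2, 3, 4, 2, 0, 0], k=7)  # Model1 column of the table above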
  • 31. Learning To Rank : List-wise and NDCG ● The list is the result set of documents for a given query Id ● In LambdaMART, NDCG@K per list is often used ● The evaluation metric is averaged over the query Ids when evaluating a training iteration. What happens if you have very small ranked lists per query? What happens if all the documents in a ranked list have the same relevance label? It is extremely important to assess the distribution of training samples per query Id: a list where every document shares the same relevance label (e.g. Query1, all labels 1 for Model1, Model2, Model3 and Ideal alike) scores NDCG = 1 for any model, and very short lists (e.g. Query2, just two documents) are easy to rank perfectly. Under-sampled query Ids can potentially skyrocket your NDCG avg
  • 32. Query Id Generation 3 qid:1 0:3.4 1:0.7 2:1.5 3:0 2 qid:1 0:5.0 1:0.4 2:1.3 3:0 0 qid:1 0:2.4 1:0.7 2:1.5 3:1 1 qid:2 0:5.7 1:0.2 2:1.1 3:0 3 qid:2 0:0.0 1:0.5 2:4.0 3:0 0 qid:3 0:1.0 1:0.7 2:1.5 3:1 Interactions Training Set Building [{ "productId": 206, "interactionRelevance": 0, "interactionType": "impression", "timestamp": "2019-03-15T18:19:34Z", "userId": "id4", "query": "free text query", "userDevice": "item9", "querySelectedBrands": [209, 201, 204], "cartCategories": [200, 206, 204], "userFavouriteColours": [208, 208, 202, 202], "userAvgPrice": 43, "productBrand": 207, "productPrice": 22.0, "productDiscount": 0.7, "productReviewAvg": 4.5, "productReviews": 200, "productSales": 207, "productSalesLastWeek": 203 },{…},{…},{…},{…},{…},{…},{…}]
  • 33. Build the Lists: QueryId Hashing 1/3 ● Carefully design how you calculate your query Id: it represents the Q of the <Q,D> pair in each query-document sample, and the same query across your data set must get the same Id ● No free text query? Group query level features (simple hashing, clustering) ● The free text query, where available, is additional information for building a proper query Id (it can be used as the Id on its own) ● Target: reaching a uniform distribution of samples per query Id ● Drop training samples if their query Ids are under-sampled ● Experiment! A minimal hashing sketch follows.
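The field names below are hypothetical, and which query level features to concatenate is exactly the experimentation surface:

import hashlib

# Query id sketch: hash the free text query (when available) together with
# the query level features chosen for grouping. The same query must always
# map to the same id.
def query_id(interaction: dict,
             fields: tuple = ("query", "userDevice")) -> str:
    key = "|".join(str(interaction.get(f, "")) for f in fields)
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:12]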
  • 34. Build the Lists: QueryId Hashing 2/3 How to Generate the Query Id • Concatenate the most prominent query level selected features (chosen with the business) • Concatenate all query level selected features • Clustering-based query id generation with the 3 different approaches; in particular, for k-means, with many different numbers of clusters (the number of clusters is a required k-means input parameter; in all the clustering approaches a cluster corresponds to a query id group) • A lot of hyper-parameters -> a lot of experiments. Future Works • What happens if we keep the under-sampled query ids instead of dropping them (for example, assigning a single new queryId to all of them)?
  • 35. Build the Lists: QueryId Hashing 3/3 How to Compare the Results? • The average number of samples per query id in the data set • The standard deviation of the query id distribution (which tells us whether the rows per query id are equally distributed or not) • The number of distinct query ids • The number of query ids with a low number of rows (below a threshold) • The number of query ids with 1 sample • The number of query ids and their percentage inside some ranges of values, for example: (0 to 2] - 3 query ids - 30%; (2 to threshold] - 2 query ids - 20%; (threshold to 99] - 5 query ids - 50% • The drop rate (the percentage of rows dropped during the removal of the under-sampled query ids)
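A sketch of that report, assuming a DataFrame with one row per training sample and a `query_id` column (both names hypothetical):

import pandas as pd

# Query id distribution report sketch, mirroring the checklist above.
def query_id_report(df: pd.DataFrame, threshold: int = 10) -> dict:
    sizes = df.groupby("query_id").size()
    return {
        "avg_samples_per_query": float(sizes.mean()),
        "std_samples_per_query": float(sizes.std()),
        "distinct_query_ids": int(len(sizes)),
        "below_threshold": int((sizes < threshold).sum()),
        "singletons": int((sizes == 1).sum()),
    }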
  • 36. Split the Set: Training/Test ● Each training iteration will assess the evaluation metric on the training and validation sets ● At the end of the iterations the final model is evaluated on an unseen Test Set ● This split could be random ● This split could depend on the time the interactions were collected
  • 37. Split the Set: Training/Test K-fold Cross Validation
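For Learning To Rank the folds should be grouped by query Id, so that a query's result list never straddles train and test; a hedged sketch with scikit-learn on toy data:

import numpy as np
from sklearn.model_selection import GroupKFold

# K-fold sketch grouped by query id: GroupKFold guarantees that no query
# appears in both the training and the test side of a fold.
X = np.random.rand(12, 4)                  # toy feature vectors
y = np.random.randint(0, 4, size=12)       # toy relevance labels
query_ids = np.repeat([1, 2, 3, 4], 3)     # 4 queries, 3 samples each

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=query_ids):
    assert set(query_ids[train_idx]).isdisjoint(query_ids[test_idx])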
  • 38. Split the Set: Training/Test • Be careful if you clean your data set and then split, because you may end up with an unfair Test Set • the Test Set MUST NOT contain under-sampled query Ids • the Test Set MUST NOT contain query Ids with a single relevance label • the Test Set MUST be representative and of an acceptable size • It is useful to reuse the same Test Set across multiple refinements of the training set building/training, for offline evaluation