How to Build your Training Set for a Learning To Rank Project - Haystack

Presented by Alessandro Benedetti of Sease. Learning to Rank (LTR) is the application of machine learning techniques (typically supervised) to the formulation of ranking models for information retrieval systems.

With LTR becoming more and more popular, organizations struggle with the problem of how to collect and structure relevance signals necessary to train their ranking models.

This talk is a technical guide to explore and master various techniques to generate your training set(s) correctly and efficiently.
Expect to learn how to:
- model and collect the necessary feedback from users (implicit or explicit)
- calculate, for each training sample, a relevance label that is meaningful and unambiguous (Click Through Rate, Sales Rate, ...)
- transform the raw collected data into an effective training set (in the numerical vector format most LTR training libraries expect)

Join us as we explore real-world scenarios and dos and don'ts from the e-commerce industry.


  1. 1. Haystack LIVE! 21 May 2020 How to Build your Training Set for a Learning to Rank Project Alessandro Benedetti, Software Engineer 21st May 2020
  2. 2. www.sease.io ● London based - Italian made :) ● Open Source Enthusiasts ● Apache Lucene/Solr experts ● Elasticsearch experts ● Community Contributors ● Active Researchers ● Hot Trends : Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning SEArch SErvices
  3. 3. Clients
  4. 4. Who I am ▪ Born in Tarquinia (ancient Etruscan city located in central Italy, not far from Rome) ▪ R&D Software Engineer ▪ Search Consultant ▪ Director ▪ Master in Computer Science ▪ Apache Lucene/Solr Committer ▪ Passionate about semantic, NLP and machine learning technologies ▪ Beach Volleyball Player & Snowboarder Alessandro Benedetti
  5. 5. Agenda • Learning To Rank • Training Set Definition • Implicit/Explicit Feedback • Feature Engineering • Relevance Label Estimation • Metric Evaluation/Loss Function • Training/Test Set Split
  6. 6. Learning To Rank What is it ? Learning from user implicit/explicit feedback To Rank documents (sensu lato)
  7. 7. “Learning to rank is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems.” Wikipedia [Diagram: users' UI interactions are captured by an interactions logger and a judgement collector, which feed the training of the Learning To Rank model]
  8. 8. Learning To Rank - What it is NOT • [sci-fi] A sentient system that learns by itself (“Machine Learning stands for that, doesn’t it?” Unknown) • [Offline] Continuously improving itself by ingesting additional feedback* • [Integration] Easy to set up and tune -> it takes patience, time and multiple experiments • [Explainability] Easy to give a human-understandable explanation of why the model operates in certain ways *this is true for offline Learning To Rank models (the majority so far); online Learning To Rank models behave differently
  9. 9. Training Set: What does it look like? • It is the ground truth we are building the ranking model from • A set of labelled <query, document> pairs • Each <query, document> example is composed of: - relevancy rating - query Id - feature vector • The feature vector is composed of N features (<id>:<value>) 3 qid:1 0:3.4 1:0.7 2:1.5 3:0 2 qid:1 0:5.0 1:0.4 2:1.3 3:0 0 qid:1 0:2.4 1:0.7 2:1.5 3:1 1 qid:2 0:5.7 1:0.2 2:1.1 3:0 3 qid:2 0:0.0 1:0.5 2:4.0 3:0 0 qid:3 0:1.0 1:0.7 2:1.5 3:1
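To make the format concrete, here is a minimal Python sketch (not from the talk) that parses one row of the SVMrank-style training set shown above into its three components:

```python
# Parse one line of the "<label> qid:<id> <featureId>:<value> ..." format above.
def parse_line(line):
    tokens = line.split()
    relevance = int(tokens[0])              # e.g. 3
    query_id = tokens[1].split(":")[1]      # e.g. "1" from "qid:1"
    features = {}
    for token in tokens[2:]:                # e.g. "0:3.4"
        feature_id, value = token.split(":")
        features[int(feature_id)] = float(value)
    return relevance, query_id, features

print(parse_line("3 qid:1 0:3.4 1:0.7 2:1.5 3:0"))
# (3, '1', {0: 3.4, 1: 0.7, 2: 1.5, 3: 0.0})
```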
  10. 10. Training Set: Collect Feedback. The Ratings Set is built from Explicit Feedback (Judgements Collector) and Implicit Feedback (Interactions Logger). [Diagram: example queries “Queen music”, “Bohemian Rhapsody”, “Dancing Queen”, “Queen Albums” flowing into both collectors]
  11. 11. Training Set Building 3 qid:1 0:3.4 1:0.7 2:1.5 3:0 2 qid:1 0:5.0 1:0.4 2:1.3 3:0 0 qid:1 0:2.4 1:0.7 2:1.5 3:1 1 qid:2 0:5.7 1:0.2 2:1.1 3:0 3 qid:2 0:0.0 1:0.5 2:4.0 3:0 0 qid:3 0:1.0 1:0.7 2:1.5 3:1 Interactions Training Set Building [{ "productId": 206, "interactionRelevance": 0, "interactionType": "impression", "timestamp": "2019-03-15T18:19:34Z", "userId": "id4", "query": "free text query", "userDevice": "item9", "querySelectedBrands": [209, 201, 204], "cartCategories": [200, 206, 204], "userFavouriteColours": [208, 208, 202, 202], "userAvgPrice": 43, "productBrand": 207, "productPrice": 22.0, "productDiscount": 0.7, "productReviewAvg": 4.5, "productReviews": 200, "productSales": 207, "productSalesLastWeek": 203 },{…},{…},{…},{…},{…},{…},{…}]
  12. 12. The Most Important Thing Interactions [{ "productId": 206, "interactionRelevance": 0, "interactionType": "impression", "timestamp": "2019-03-15T18:19:34Z", "userId": "id4", "query": "free text query", "userDevice": "item9", "querySelectedBrands": [209, 201, 204], "cartCategories": [200, 206, 204], "userFavouriteColours": [208, 208, 202, 202], "userAvgPrice": 43, "productBrand": 207, "productPrice": 22.0, "productDiscount": 0.7, "productReviewAvg": 4.5, "productReviews": 200, "productSales": 207, "productSalesLastWeek": 203 },{…},{…},{…},{…},{…},{…},{…}] • !!! The Interactions/Ratings must be syntactically correct !!! • The features associated with the Interaction/Rating must reflect the real <query,document> pair as it was when the interaction/rating happened on the user side, e.g. the query json element must contain the query the user actually typed when we got the impression for product 206 in response • This sounds obvious, but it is not: depending on the technological stack it may be challenging • Test, test and test!
  13. 13. Feature Engineering 3 qid:1 0:3.4 1:0.7 2:1.5 3:0 2 qid:1 0:5.0 1:0.4 2:1.3 3:0 0 qid:1 0:2.4 1:0.7 2:1.5 3:1 1 qid:2 0:5.7 1:0.2 2:1.1 3:0 3 qid:2 0:0.0 1:0.5 2:4.0 3:0 0 qid:3 0:1.0 1:0.7 2:1.5 3:1 Interactions Training Set Building [{ "productId": 206, "interactionRelevance": 0, "interactionType": "impression", "timestamp": "2019-03-15T18:19:34Z", "userId": "id4", "query": "free text query", "userDevice": "item9", "querySelectedBrands": [209, 201, 204], "cartCategories": [200, 206, 204], "userFavouriteColours": [208, 208, 202, 202], "userAvgPrice": 43, "productBrand": 207, "productPrice": 22.0, "productDiscount": 0.7, "productReviewAvg": 4.5, "productReviews": 200, "productSales": 207, "productSalesLastWeek": 203 },{…},{…},{…},{…},{…},{…},{…}]
  14. 14. Feature Engineering : Feature Level. Each sample is a <query,document> pair; the feature vector is its numerical representation. Document level: the feature describes a property of the DOCUMENT, and its value depends only on the document instance. e.g. Document Type = E-commerce Product: <Product price>, <Product colour> and <Product size> are Document Level features. Document Type = Hotel Stay: <Hotel star rating>, <Hotel price> and <Hotel food rating> are Document Level features. Query level: the feature describes a property of the QUERY, and its value depends only on the query instance. e.g. Query Type = E-commerce Search: <Query length>, <User device> and <User budget> are Query Level features. Query dependent: the feature describes a property of the QUERY in correlation with the DOCUMENT, and its value depends on both the query and the document instance. e.g. Query Type = E-commerce Search, Document Type = E-commerce Product: <first Query Term TF in Product title>, <first Query Term DF in Product title> and <query selected categories intersecting the product categories> are Query Dependent features.
  15. 15. Feature Engineering : Feature Types. Quantitative: a quantitative feature describes a property for which the possible values are a measurable quantity. e.g. Document Type = E-commerce Product: <Product price> is a quantity. Document Type = Hotel Stay: <Hotel distance from city center> is a quantity. Ordinal: an ordinal feature describes a property for which the possible values are ordered; ordinal variables can be considered “in between” categorical and quantitative variables. e.g. Educational level might be categorized as 1: Elementary school education, 2: High school graduate, 3: Some college, 4: College graduate, 5: Graduate degree, with 1<2<3<4<5. Categorical: a categorical feature represents an attribute of an object that has a set of distinct possible values; in computer science it is common to call the possible values of a categorical feature an Enumeration. e.g. Document Type = E-commerce Product: <Product colour> and <Product brand> are categorical features. N.B. It is easy to observe that giving an order to the values of a categorical feature does not bring any benefit: for the Colour feature, red < blue < black has no general meaning.
  16. 16. Feature Engineering : One Hot Encoding. Categorical Features e.g. Document Type = E-commerce Product: <Product colour> is a categorical feature with values Red, Green, Blue, Other. Encoded Features: given a cardinality of N, we build N encoded binary features (or N-1 if we drop one to avoid the dummy variable trap): product_colour_red = 0/1, product_colour_green = 0/1, product_colour_blue = 0/1, product_colour_other = 0/1. Dummy Variable Trap: it occurs when you have highly correlated features, i.e. one feature's value can be predicted from the others (e.g. gender_male / gender_female). High Cardinality Categoricals: you may need to encode only the most frequent values -> information loss.
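A minimal sketch of one hot encoding the <Product colour> example, assuming pandas is available (the drop_first flag implements the N-1 variant that avoids the dummy variable trap):

```python
import pandas as pd

df = pd.DataFrame({"product_colour": ["Red", "Green", "Blue", "Other", "Red"]})

# One binary column per colour; drop_first=True drops one column (N-1 encoding)
# to avoid the dummy variable trap.
encoded = pd.get_dummies(df["product_colour"], prefix="product_colour", drop_first=True)
print(encoded)
```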
  17. 17. Feature Engineering : Binary Encoding. Categorical Features e.g. Document Type = E-commerce Product: <Product colour> is a categorical feature with values Red, Green, Blue, Other. Encoded Features: 1) Ordinal Encoding: Red=0, Green=1, Blue=2, Other=3 2) Binary Encoding: product_colour_bit1 = 0/1, product_colour_bit2 = 0/1. Better for high cardinality categoricals. Multi valued? You may have collisions and not be able to use binary features.
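A minimal sketch of the two-step binary encoding above (ordinal first, then binary digits; two bits cover the four colours):

```python
# Step 1: ordinal encoding of the categories.
ordinal = {"Red": 0, "Green": 1, "Blue": 2, "Other": 3}

# Step 2: expand the ordinal code into its binary digits, most significant bit first.
def binary_encode(colour, n_bits=2):
    code = ordinal[colour]
    return [(code >> bit) & 1 for bit in reversed(range(n_bits))]

for colour in ordinal:
    print(colour, binary_encode(colour))
# Red [0, 0]  Green [0, 1]  Blue [1, 0]  Other [1, 1]
```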
  18. 18. Feature Engineering : Hash Encoding. Categorical Features e.g. Document Type = E-commerce Product: <Product colour> is a categorical feature with values Red, Green, Blue, Yellow, Purple, Violet. Encoded Features (size=3): the hashing function takes in input each category and allocates it to a bucket of a vector of the specified size, e.g. product_colour_hash1 = +2, product_colour_hash2 = -1, product_colour_hash3 = +1. • Choose a hash function • Specify a size for the hash (a dedicated number of output features) -> we are going to add just this number of features; we are basically defining the granularity of the representation • In total / per feature • Good for dealing with large scale cardinality features
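A minimal sketch of the hashing trick, assuming scikit-learn's FeatureHasher with 3 output features (the signed values come from the hashing, as in the +2/-1/+1 example above):

```python
from sklearn.feature_extraction import FeatureHasher

# 3 output features regardless of how many distinct colours exist.
hasher = FeatureHasher(n_features=3, input_type="string")
colours = [["Red"], ["Green"], ["Blue"], ["Yellow"], ["Purple"], ["Violet"]]

hashed = hasher.transform(colours).toarray()
for colour, vector in zip(colours, hashed):
    print(colour[0], vector)   # e.g. Red [ 0. -1.  0.]
```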
  19. 19. Feature Engineering : Best Encoding • Doing One Hot encoding properly => no information loss • … but it is expensive in time and memory • So the recommendation is to experiment : • One Hot Encoding • One Hot Encoding with freq threshold • Binary (if it fits your use case) • Hash per feature • Hash per feature set • Hash of different sizes You want to observe: time, memory consumption, offline model evaluation impact at least (and potentially run an online experiment)
  20. 20. Feature Engineering : Encoding When? It is very likely you have defined a pipeline of steps executed to build your training set from some sort of implicit/explicit feedback. When should you encode your features? Ideally on the smallest data-set possible: encoding is expensive in time/memory, so if you do any reduction, collapse or manipulation of data that shrinks your training data, do feature encoding as late as possible, especially for high cardinality features.
  21. 21. Feature Engineering : Missing Values ● Sometimes a missing value is equivalent to a 0 value semantic e.g. Domain: e-commerce products, Feature: Discount Percentage - [quantitative, document level feature]: a missing discount percentage could model a 0 discount percentage, so missing values can be filled with 0 values ● Sometimes a missing feature value can have a completely different semantic e.g. Domain: Hotel Stay, Feature: Star Rating - [quantitative, document level feature]: a missing star rating is not equivalent to a 0 star rating, so an additional feature should be added to distinguish the two cases ● Take Away: discuss with the business layer and try to understand your specific use case requirements; check null and zero counts across your features, you may discover interesting anomalies
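A minimal sketch (assuming pandas) of the two strategies above: fill a missing discount with 0, but flag a missing star rating with an extra feature:

```python
import pandas as pd

df = pd.DataFrame({
    "product_discount": [0.7, None, 0.2],
    "hotel_star_rating": [4.0, None, 3.0],
})

# A missing discount really means "no discount": fill with 0.
df["product_discount"] = df["product_discount"].fillna(0.0)

# A missing star rating is NOT a 0-star rating: add an explicit indicator feature.
df["hotel_star_rating_missing"] = df["hotel_star_rating"].isna().astype(int)
df["hotel_star_rating"] = df["hotel_star_rating"].fillna(0.0)

print(df)
```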
  22. 22. Relevance Label Estimation 3 qid:1 0:3.4 1:0.7 2:1.5 3:0 2 qid:1 0:5.0 1:0.4 2:1.3 3:0 0 qid:1 0:2.4 1:0.7 2:1.5 3:1 1 qid:2 0:5.7 1:0.2 2:1.1 3:0 3 qid:2 0:0.0 1:0.5 2:4.0 3:0 0 qid:3 0:1.0 1:0.7 2:1.5 3:1 Interactions Training Set Building [{ "productId": 206, "interactionRelevance": 0, "interactionType": "impression", "timestamp": "2019-03-15T18:19:34Z", "userId": "id4", "query": "free text query", "userDevice": "item9", "querySelectedBrands": [209, 201, 204], "cartCategories": [200, 206, 204], "userFavouriteColours": [208, 208, 202, 202], "userAvgPrice": 43, "productBrand": 207, "productPrice": 22.0, "productDiscount": 0.7, "productReviewAvg": 4.5, "productReviews": 200, "productSales": 207, "productSalesLastWeek": 203 },{…},{…},{…},{…},{…},{…},{…}]
  23. 23. Relevance Label : Signal Intensity. Discordant training samples ● Each sample is a user interaction (click, add to cart, sale, etc.) ● Some samples are impressions (we have shown the document to the user) ● A rank is attributed to the user interaction types e.g. 0-Impression < 1-click < 2-add to cart < 3-sale ● The rank becomes the relevance label for the sample
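A minimal sketch of this labelling: rank the interaction types and, for each <query, document> pair, keep the strongest observed signal (keeping the maximum is my choice for resolving discordant samples, not something the talk prescribes):

```python
# Rank the interaction types: the rank is the relevance label.
INTERACTION_RANK = {"impression": 0, "click": 1, "add_to_cart": 2, "sale": 3}

interactions = [
    {"query": "queen music", "productId": 206, "interactionType": "impression"},
    {"query": "queen music", "productId": 206, "interactionType": "click"},
    {"query": "queen music", "productId": 207, "interactionType": "impression"},
]

labels = {}
for event in interactions:
    key = (event["query"], event["productId"])
    rank = INTERACTION_RANK[event["interactionType"]]
    labels[key] = max(labels.get(key, 0), rank)   # strongest signal wins

print(labels)   # {('queen music', 206): 1, ('queen music', 207): 0}
```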
  24. 24. Relevance Label : Simple Click Model ● Each sample is a user interaction (click, add to cart, sale, etc.) ● Some samples are impressions (we have shown the document to the user) ● One interaction type is set as the target of optimisation ● Identical samples are aggregated, and the new sample generated will have a new feature: [Interaction Type Count / Impressions] e.g. CTR (Click Through Rate) = for the sample, number of clicks / number of impressions ● We then take the resulting score (for CTR, 0<x<1) and normalise it to get the relevance label. The relevance label scale will depend on the training algorithm chosen
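A minimal sketch of the aggregation step, with clicks as the target interaction type (the log format is simplified from the JSON example earlier):

```python
from collections import defaultdict

interactions = [
    {"query": "queen music", "productId": 206, "interactionType": "impression"},
    {"query": "queen music", "productId": 206, "interactionType": "click"},
    {"query": "queen music", "productId": 206, "interactionType": "impression"},
    {"query": "queen music", "productId": 207, "interactionType": "impression"},
]

# Aggregate identical <query, document> samples and count clicks vs impressions.
counts = defaultdict(lambda: {"impressions": 0, "clicks": 0})
for event in interactions:
    key = (event["query"], event["productId"])
    if event["interactionType"] == "impression":
        counts[key]["impressions"] += 1
    elif event["interactionType"] == "click":
        counts[key]["clicks"] += 1

for key, c in counts.items():
    ctr = c["clicks"] / c["impressions"] if c["impressions"] else 0.0
    print(key, round(ctr, 2))   # ('queen music', 206) 0.5, ('queen music', 207) 0.0
```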
  25. 25. Relevance Label : Advanced Click Model. Given a sample: ● We have the CTR from the previous model ● We compare it with the avg CTR of all samples ● We take into account the statistical significance of each sample (how likely it is that we got that result by chance on that number of observations) ● The relevance label will be CTR/avg CTR (and we drop samples deemed to be statistically irrelevant) Yet to test it online in production, stay tuned! More info in John Berryman's blog: http://blog.jnbrymn.com/2018/04/16/better-click-tracking-1/
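A minimal sketch of this idea. The Wilson lower bound and the impression threshold are my own choices for the significance check, not the talk's (see the linked blog post for the original treatment):

```python
import math

def wilson_lower_bound(clicks, impressions, z=1.96):
    """Lower bound of the CTR confidence interval: small samples get discounted."""
    if impressions == 0:
        return 0.0
    p = clicks / impressions
    denom = 1 + z * z / impressions
    centre = p + z * z / (2 * impressions)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * impressions)) / impressions)
    return (centre - margin) / denom

samples = [("doc_a", 40, 100), ("doc_b", 2, 3), ("doc_c", 5, 50)]   # (doc, clicks, impressions)
avg_ctr = sum(c for _, c, _ in samples) / sum(i for _, _, i in samples)

MIN_IMPRESSIONS = 10   # arbitrary threshold for "statistically irrelevant"
for doc, clicks, impressions in samples:
    if impressions < MIN_IMPRESSIONS:
        continue   # too few observations, the CTR could easily be chance: drop
    relevance = wilson_lower_bound(clicks, impressions) / avg_ctr
    print(doc, "relevance ~", round(relevance, 2))
```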
  26. 26. Relevance Label : Normalisation ● Min/Max normalisation based on local values (local to your data-set) e.g. your max CTR is 0.8, that is the max relevance you can expect; your min CTR is 0.2, that is the 0 relevance. Risk of overfitting, and not simple to compare offline ● Min/Max normalisation based on absolute values e.g. max CTR is 1.0, that is the max relevance you can expect; min CTR is 0, that is the 0 relevance. Using absolute values you have consistent data-sets over time, but you are flattening the relevance labels in your current data set ● Scale of relevance? 0-1, 0-4, 0-10? Experiment!
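A minimal sketch of min/max normalisation of a CTR onto a 0-4 relevance label scale, comparing the local and the absolute variants (values are illustrative):

```python
def normalise(ctr, ctr_min, ctr_max, scale=4):
    """Map a CTR into an integer relevance label on a 0..scale range."""
    if ctr_max == ctr_min:
        return 0
    clipped = min(max(ctr, ctr_min), ctr_max)
    return round((clipped - ctr_min) / (ctr_max - ctr_min) * scale)

# Local min/max observed in this data-set: 0.2 is the 0 relevance, 0.8 the max.
print(normalise(0.75, ctr_min=0.2, ctr_max=0.8))   # 4
# Absolute min/max: consistent across data-sets, but the labels are flattened.
print(normalise(0.75, ctr_min=0.0, ctr_max=1.0))   # 3
```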
  27. 27. Learning To Rank : Metric Evaluation. Point-wise / Pair-wise / List-wise: how many documents do you consider at a time when calculating the loss function for your Learning To Rank model? Point-wise: ● Single document ● You estimate a function that predicts the best score for the document ● Rank the results on the predicted score ● The score of a doc is independent of the other scores in the same result list ● You can use any regression or classification algorithm Pair-wise: ● Pair of documents ● You estimate the optimal local ordering to maximise the global ordering ● The objective is to set the local ordering so as to minimise the number of inversions across all pairs ● Generally works better than point-wise because predicting a local ordering is closer to solving the ranking problem than just estimating a regression score List-wise: ● Entire list of documents for a given query ● Direct optimisation of IR measures such as NDCG ● Minimise a specific loss function ● The evaluation measure is averaged across the queries ● Generally works better than pair-wise
  28. 28. Learning To Rank : Metric Evaluation. Offline Evaluation Metrics [1/3] • precision = TruePositives / (TruePositives + FalsePositives) • precision@K = TruePositives / (TruePositives + FalsePositives), computed on the top K results (precision@1, precision@2, precision@10, ...) • recall = TruePositives / (TruePositives + FalseNegatives)
  29. 29. Learning To Rank : Metric Evaluation. Offline Evaluation Metrics [2/3] Let’s combine precision and recall into the F-measure: F1 = 2 · (precision · recall) / (precision + recall)
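For reference, a minimal sketch (not from the talk) computing these set-based metrics for one ranked result list with binary relevance judgements:

```python
def precision_at_k(relevance, k):
    top_k = relevance[:k]
    return sum(top_k) / len(top_k)

# 1 = relevant, 0 = not relevant, in ranked order; 5 relevant docs exist overall.
ranked_relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
total_relevant = 5

precision = sum(ranked_relevance) / len(ranked_relevance)    # 0.4
recall = sum(ranked_relevance) / total_relevant              # 0.8
f1 = 2 * precision * recall / (precision + recall)           # ~0.53
print(precision, recall, f1, precision_at_k(ranked_relevance, 2))   # precision@2 = 0.5
```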
  30. 30. Learning To Rank : Metric Evaluation. Offline Evaluation Metrics [3/3] • DCG@K = Discounted Cumulative Gain@K • NDCG@K = DCG@K / Ideal DCG@K. Example (relevance labels by rank position for three models and the ideal ranking, NDCG in the last row):
  Rank | Model1 | Model2 | Model3 | Ideal
  1    | 1      | 2      | 2      | 4
  2    | 2      | 3      | 4      | 3
  3    | 3      | 2      | 3      | 2
  4    | 4      | 4      | 2      | 2
  5    | 2      | 1      | 1      | 1
  6    | 0      | 0      | 0      | 0
  7    | 0      | 0      | 0      | 0
  NDCG | 0.64   | 0.73   | 0.79   | 1.0
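A minimal sketch of NDCG@K with the common logarithmic discount; gain and discount conventions differ between LTR libraries, so the exact numbers may not match the table above:

```python
import math

def dcg_at_k(labels, k):
    # labels are graded relevance values in ranked order
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

model1 = [1, 2, 3, 4, 2, 0, 0]   # Model1's ranking from the table above
print(round(ndcg_at_k(model1, 7), 2))
```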
  31. 31. Learning To Rank : List-wise and NDCG ● The list is the result set of documents for a given query Id ● In LambdaMART, NDCG@K per list is often used ● The evaluation metric is averaged over the query Ids when evaluating a training iteration. What happens if you have very small ranked lists per query? What happens if all the documents in a ranked list have the same relevance label? It is extremely important to assess the distribution of training samples per query Id. (Illustration: Query1 has two documents, all labelled 1, identical for Model1, Model2, Model3 and Ideal; Query2 has two documents labelled 3 and 7, again identical across models.) Under-sampled query Ids can potentially skyrocket your NDCG avg
  32. 32. Query Id Generation 3 qid:1 0:3.4 1:0.7 2:1.5 3:0 2 qid:1 0:5.0 1:0.4 2:1.3 3:0 0 qid:1 0:2.4 1:0.7 2:1.5 3:1 1 qid:2 0:5.7 1:0.2 2:1.1 3:0 3 qid:2 0:0.0 1:0.5 2:4.0 3:0 0 qid:3 0:1.0 1:0.7 2:1.5 3:1 Interactions Training Set Building [{ "productId": 206, "interactionRelevance": 0, "interactionType": "impression", "timestamp": "2019-03-15T18:19:34Z", "userId": "id4", "query": "free text query", "userDevice": "item9", "querySelectedBrands": [209, 201, 204], "cartCategories": [200, 206, 204], "userFavouriteColours": [208, 208, 202, 202], "userAvgPrice": 43, "productBrand": 207, "productPrice": 22.0, "productDiscount": 0.7, "productReviewAvg": 4.5, "productReviews": 200, "productSales": 207, "productSalesLastWeek": 203 },{…},{…},{…},{…},{…},{…},{…}]
  33. 33. Build the Lists: QueryId Hashing 1/3 ● Carefully design how you calculate your query Id (it represents the Q of <Q,D> in each query-document sample); the same query across your data set must have the same Id ● No free text query? Group query level features (simple hashing, clustering) ● The free text query is additional info to build a proper query Id where available (it can be used as the Id on its own) ● Target: reaching a uniform distribution of samples per query Id ● Drop training samples if their query Ids are under-sampled ● Experiment!
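A minimal sketch of generating a query Id by hashing query level features of an interaction (the choice of fields here is illustrative, reusing names from the JSON example above):

```python
import hashlib
import json

def query_id(interaction):
    # Identical query level features must always produce the identical query Id.
    key = {
        "query": interaction.get("query", "").strip().lower(),
        "querySelectedBrands": sorted(interaction.get("querySelectedBrands", [])),
        "userDevice": interaction.get("userDevice"),
    }
    return hashlib.md5(json.dumps(key, sort_keys=True).encode("utf-8")).hexdigest()[:8]

print(query_id({"query": "free text query",
                "querySelectedBrands": [209, 201, 204],
                "userDevice": "item9"}))
```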
  34. 34. Build the Lists: QueryId Hashing 2/3 How to Generate the Query Id • Concatenate the most prominent query level selected features (chosen with the business) • Concatenate all query level selected features • Clustering-based query id generation (with the 3 different approaches); in particular, for k-means, with many different numbers of clusters (the number of clusters is a required k-means input parameter; in all the clustering approaches a cluster corresponds to a query id group) • A lot of hyper-parameters! -> a lot of experiments Future Works • What happens if we keep the under-sampled query ids instead of dropping them (for example, assigning a new queryId to all of them)?
  35. 35. Build the Lists: QueryId Hashing 3/3 How to Compare the Results? • The average number of samples per query id in the data set • The standard deviation of the query id distribution (which tells us whether the number of rows per query id is equally distributed or not) • The number of distinct query ids • The number of query ids with a low number of rows (below a threshold) • The number of query ids with 1 sample • The number of query ids and the percentage inside some ranges of values, for example: (0 to 2] - 3 query ids - 30%; (2 to threshold] - 2 query ids - 20%; (threshold to 99] - 5 query ids - 50% • The drop rate (the percentage of rows dropped during the removal of the under-sampled query ids)
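A minimal sketch (assuming pandas) of a few of the distribution checks listed above:

```python
import pandas as pd

samples = pd.DataFrame({"query_id": ["q1", "q1", "q1", "q2", "q2", "q3"]})
per_query = samples.groupby("query_id").size()

THRESHOLD = 2
print("avg samples per query id:", per_query.mean())
print("std of the distribution:", per_query.std())
print("distinct query ids:", per_query.size)
print("query ids below threshold:", (per_query < THRESHOLD).sum())
print("query ids with 1 sample:", (per_query == 1).sum())
```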
  36. 36. Split the Set: Training/Test ● Each training iteration will assess the evaluation metric on the training and validation sets ● At the end of the iterations the final model will be evaluated on an unseen Test Set ● This split could be random ● This split could depend on the time the interactions were collected
  37. 37. Split the Set: Training/Test K-fold Cross Validation
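A minimal sketch (assuming scikit-learn) of k-fold cross validation that keeps all the samples of a query Id in the same fold, so ranked lists are never split across folds:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(12, 4)                       # 12 samples, 4 features
y = np.random.randint(0, 4, size=12)            # relevance labels 0-3
query_ids = np.repeat(["q1", "q2", "q3", "q4"], 3)

for fold, (train_idx, test_idx) in enumerate(GroupKFold(n_splits=4).split(X, y, groups=query_ids)):
    print(f"fold {fold}: test queries {sorted(set(query_ids[test_idx]))}")
```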
  38. 38. Split the Set: Training/Test • Be careful if you clean your data-set and then split, because you may end up having an unfair Test Set • the Test Set MUST NOT have under-sampled query Ids • the Test Set MUST NOT have query Ids with a single relevance label • the Test Set MUST be representative and of an acceptable size • It is useful to use the same Test Set across multiple refinements of the training set building/training, for offline evaluation
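A minimal sketch (assuming pandas) of the two test set sanity checks above, flagging under-sampled query Ids and query Ids whose samples all share a single relevance label (the threshold is an arbitrary choice):

```python
import pandas as pd

test_set = pd.DataFrame({
    "query_id": ["q1", "q1", "q1", "q2", "q2", "q3"],
    "relevance": [3, 1, 0, 2, 2, 1],
})

MIN_SAMPLES = 3   # arbitrary definition of "under-sampled"

stats = test_set.groupby("query_id")["relevance"].agg(["size", "nunique"])
under_sampled = stats[stats["size"] < MIN_SAMPLES].index.tolist()
single_label = stats[stats["nunique"] == 1].index.tolist()

print("under-sampled query ids:", under_sampled)   # ['q2', 'q3']
print("single-label query ids:", single_label)     # ['q2', 'q3']
```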
  39. 39. Thanks!
