
From Academic Papers To Production : A Learning To Rank Story

This talk is about the journey of bringing Learning To Rank (LTR from now on) to the e-commerce domain in a real-world scenario, including all the pitfalls and disillusions involved.
LTR is a fantastic approach for solving complex ranking problems, but industry domains are far from the ideal world where those technologies were designed and experimented with: open source implementations do not work perfectly out of the box and require advanced tuning; industry training data is dirty, noisy and incomplete.
This talk will guide you through the different phases and technologies involved in a LTR project with a pragmatic approach.
Feature Engineering, Domain Modelling, Training Set Building, Model Training, Search Integration and Online Evaluation: each of these phases presents different challenges in the real world and must be approached carefully.



  1. From Academic Papers To Production: A Learning To Rank Story
     Alessandro Benedetti, Software Engineer, Sease Ltd.

  2. Who I am
     ● Search Consultant
     ● R&D Software Engineer
     ● Master in Computer Science
     ● Apache Lucene/Solr Enthusiast
     ● Passionate about Semantic, NLP and Machine Learning technologies
     ● Beach Volleyball Player & Snowboarder
  3. Sease Ltd
     ● Search Services
     ● Open Source Enthusiasts
     ● Apache Lucene/Solr Experts
     ● Community Contributors
     ● Active Researchers
     ● Hot Trends: Learning To Rank, Document Similarity, Measuring Search Quality, Relevancy Tuning
  4. Agenda
     ● Learning To Rank
     ● Technologies Involved
     ● Data Preparation
     ● Model Training
     ● Apache Solr Integration
     ● Conclusions
  5. Learning To Rank - What is it?
     Learning from implicit/explicit user feedback
     To Rank documents (sensu lato)
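For reference, the training data consumed by tools like RankLib follows the LETOR/SVMLight format: one line per query-document pair, carrying a graded relevance label, a query id and the feature vector. The feature ids and values below are purely illustrative:

     3 qid:101 1:0.53 2:1.0 3:0.02  # relevant doc for query 101
     1 qid:101 1:0.13 2:0.0 3:0.95  # marginally relevant doc for query 101
     0 qid:102 1:0.87 2:1.0 3:0.37  # non-relevant doc for query 102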
  6. Learning To Rank - What it is NOT
     - A sentient system that learns by itself
     - Continuously improving itself by ingesting additional feedback
     - Easy to set up and tune
     - Easy to give a human-understandable explanation of why the model operates in certain ways
  7. Learning To Rank - Technologies Used
     - Spring Boot [1]
     - RankLib [2]
     - Apache Solr >= 6.4 [3]
     [1] https://projects.spring.io/spring-boot/
     [2] https://sourceforge.net/p/lemur/wiki/RankLib/
     [3] http://lucene.apache.org/solr/
  8. Data Preparation
     - User feedback harvesting
     - Feature Engineering
     - Dataset clean-up
     - Training/Validation/Test split
  9. User Feedback Harvesting
     - Explicit user feedback (experts/crowdsourcing)
     - Implicit user feedback (eCommerce sales funnel)
     How to assign the relevance label?
     - Signal intensity to model relevance (sale > add to cart)
     - Identify a target signal, calculate rates and normalise (a worked example follows)
     - Watch out for discordant training samples
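To make the rate-based approach concrete, here is a purely illustrative labelling scheme (the signal ordering comes from the talk, the thresholds do not): take the sale as the target signal, compute a sale rate for each query-product pair, normalise it per query, and bucket the result into graded labels:

     sale rate = sales / impressions   (per query-product pair)
     normalised rate >= 0.66  -> label 3
     normalised rate >= 0.33  -> label 2
     any sale at all          -> label 1
     impressions, no sale     -> label 0

A discordant sample is then, for instance, the same query-product pair receiving label 3 in one time window and label 0 in another.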
  10. Feature Engineering
     - Query level / Document level / Query dependent
     - Ordinal/Categorical features -> one hot encoding (sketched below)
     - Missing values
     - High cardinality categorical features
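A minimal sketch of the one hot encoding mentioned above, for a hypothetical productColour attribute with values {red, green, blue}: the single categorical feature becomes one binary feature per distinct value.

     productColour = "green"  ->  colour_red:0  colour_green:1  colour_blue:0

This is also why high cardinality categorical features deserve their own bullet: a value set with thousands of distinct entries explodes into thousands of mostly-zero binary features.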
  11. Data Set Cleanup
     - Resample the dataset (oversampling by duplication -> over-fitting)
     - Query Id hashing
     - You need bad examples! (otherwise NDCG does not reflect real quality, and this strongly affects the evaluation metric)
  12. Training/Validation/Test Split
     - K-fold Cross Validation (see the sketch below)
     - Temporal split
     - Manual split after shuffling
     Splits happen per rankList (subset of queryIds)
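RankLib can perform the k-fold cross validation itself via the -kcv parameter; a minimal sketch of a 5-fold LambdaMART run (file name illustrative):

     java -jar RankLib.jar -train training_set.txt -ranker 6 -kcv 5 -metric2t NDCG@10

Whichever strategy is chosen, all the samples of a given queryId must end up in the same partition, otherwise knowledge about that query leaks from the training set into the validation/test sets.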
  13. Model Training
     - LambdaMART + NDCG@K (see the RankLib sketch below)
     - Threshold Candidates Count For Splitting -> simplify!
     - Minimum Leaf Support -> remove outliers
     Reason: a searched location missing from the training set
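The two tuning knobs above map directly onto RankLib's LambdaMART options: -tc, the number of threshold candidates examined when splitting a node (fewer candidates -> simpler splits), and -mls, the minimum leaf support, i.e. the minimum number of samples a leaf must hold (raising it prevents isolated outliers from carving out their own leaves). A sketch of a training run, with illustrative values:

     java -jar RankLib.jar -train train.txt -validate vali.txt \
          -ranker 6 -metric2t NDCG@10 \
          -tc 128 -mls 50 \
          -save lambdamart_model.txt

-ranker 6 selects LambdaMART and -metric2t sets the metric optimised during training, NDCG@10 in this case.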
  14. Apache Solr
     Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL.
  15. Apache Solr Integration
     - Features definition (JSON + Solr syntax; see the sketch below)
     - Model(s) definition (JSON)
     - Sharded LTR
     - Pagination (in a sharded environment)
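A minimal sketch of the artefacts involved, assuming a collection named mycollection and illustrative feature and model names (the Java classes are the ones shipped with the Solr LTR contrib module, available since Solr 6.4):

     features.json, uploaded to the feature store endpoint:
     [
       { "store" : "ecommerceStore",
         "name"  : "originalScore",
         "class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
         "params": {} },
       { "store" : "ecommerceStore",
         "name"  : "isInStock",
         "class" : "org.apache.solr.ltr.feature.SolrFeature",
         "params": { "fq": ["{!terms f=inStock}true"] } }
     ]

     curl -XPUT 'http://localhost:8983/solr/mycollection/schema/feature-store' \
          --data-binary @features.json -H 'Content-type:application/json'

The trained model is uploaded the same way to /schema/model-store (a LambdaMART model uses the class org.apache.solr.ltr.model.MultipleAdditiveTreesModel) and is applied at query time as a re-rank query:

     http://localhost:8983/solr/mycollection/query?q=red+shoes&rq={!ltr model=myLambdaMARTModel reRankDocs=100}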
  16. Classic Business Level Questions
     - "Given X, Y, Z input features, I would have expected a different ranking -> can you fix this?"
       Solution: no single-line manual fix -> trial and error!
     - "How does the model work? What are the most important features?"
       Solution: index the model to extract information such as:
       - most frequent features in splits
       - unique thresholds
  17. Classic Business Level Questions
     - "What are good items generally?"
       Solution: a simple tool [1] was developed to extract the top scoring leaves from the model
     - "Why, for a given query X, is doc Y scored higher than doc Z?"
       Solution: debug the Solr score and investigate the tree paths (see the sketch below)
     [1] https://github.com/alessandrobenedetti/ltr-tools
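For the second question, a sketch using the same illustrative names as above: the [features] document transformer returns the extracted feature values next to each document, and debug=results exposes how each re-rank score was computed, so the tree paths followed by doc Y and doc Z can be compared.

     http://localhost:8983/solr/mycollection/query?q=red+shoes
          &rq={!ltr model=myLambdaMARTModel reRankDocs=100}
          &fl=id,score,[features]
          &debug=results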
  18. Conclusions
     - LTR is a promising and deep technology
     - It requires effort! (it is not as automatic as you might think)
     - Start collecting user feedback (if you plan to use LTR)
     - Good open source support is available (Apache Solr + ES)
     - Not easy to debug/explain
  19. Questions?
