This talk is about the journey of bringing Learning To Rank (LTR from now on) to the e-commerce domain in a real-world scenario, including all the pitfalls and disillusions involved.
LTR is a fantastic approach to solving complex ranking problems, but industry domains are far from the ideal world where these technologies were designed and experimented with: open source implementations do not work perfectly out of the box and require advanced tuning, and industry training data is dirty, noisy and incomplete.
This talk will guide you through the different phases and technologies involved in a LTR project with a pragmatic approach.
Feature Engineering, Domain Modelling, Training Set Building, Model Training, Search Integration and Online Evaluation: each of them presents different challenges in the real world and must be carefully approached.
From Academic Papers To Production : A Learning To Rank Story
Alessandro Benedetti, Software Engineer, Sease Ltd.
● Search Consultant
● R&D Software Engineer
● Master in Computer Science
● Apache Lucene/Solr Enthusiast
● Passionate about Semantic, NLP and Machine Learning technologies
● Beach Volleyball Player & Snowboarder
Who I am
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank, Document Similarity, Measuring Search Quality, Relevancy Tuning
● Learning To Rank
● Technologies Involved
● Data Preparation
● Model Training
● Apache Solr Integration
Learning To Rank - What is it?
Learning from user implicit/explicit feedback
Rank documents (in a broad sense)
Learning To Rank - What is NOT
- A sentient system that learns by itself
- A system that continuously improves itself by ingesting additional data
- Easy to set up and tune
- Easy to give a human-understandable explanation of why the model operates in certain ways
Learning To Rank - Technologies Used
- Spring Boot
- Apache Solr >=6.4 
- User feedback harvesting
- Feature Engineering
- Dataset clean-up
- Training/Validation/Test split
User Feedback Harvesting
- Explicit user feedback (Experts/Crowdsourcing)
- Implicit user feedback (eCommerce Sales Funnel)
How to assign the relevance label?
- Signal intensity to model relevance (sale > add to cart)
- Identify a target signal, calculate rates and normalise
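The rate-based labelling above can be sketched as follows; the signal names, thresholds and grade scale are all hypothetical and must be tuned per domain:

```python
# Sketch: derive graded relevance labels from implicit feedback.
# Assumes raw interaction counts per (query, document) pair;
# the thresholds below are illustrative, not recommendations.

def relevance_label(impressions, carts, sales):
    """Map signal intensity to an ordinal label: sale > add-to-cart > click."""
    if impressions == 0:
        return 0
    sale_rate = sales / impressions
    cart_rate = carts / impressions
    if sale_rate > 0.05:
        return 3          # strong signal: the item sells for this query
    if cart_rate > 0.10:
        return 2          # medium signal: often added to cart
    if carts or sales:
        return 1          # weak positive signal
    return 0              # shown but ignored -> a needed "bad example"

label = relevance_label(impressions=200, carts=30, sales=15)  # -> 3
```

Normalising by impressions keeps popular items from dominating purely because they were shown more often.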
Discordant training samples
- Query level / Document level / Query dependent
- Ordinal/Categorical features -> one-hot encoding
- Missing values
- High cardinality categorical features
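The three encoding issues above can be sketched in plain Python; the category list and feature names are hypothetical, and the hash-bucket approach for high-cardinality categoricals is one common option among several (grouping rare values is another):

```python
import hashlib

CATEGORIES = ["shoes", "shirts", "accessories"]  # known, low-cardinality

def one_hot(value, categories=CATEGORIES):
    """Ordinal/categorical feature -> one binary feature per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def hashed_bucket(value, buckets=16):
    """High-cardinality categorical (e.g. brand) -> stable hash bucket,
    so the feature space stays small and new values still get a slot."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def impute(value, default=0.0):
    """Missing value -> explicit default the model can still split on."""
    return default if value is None else value
```

Using a cryptographic hash here is only for a stable, platform-independent bucket; Python's built-in `hash()` is randomised per process.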
Data Set Cleanup
- Resample the dataset
- Query Id Hashing
- You need bad examples! (NDCG -> not reflecting real quality)
Oversampling by duplication -> over-fitting
This strongly affects the evaluation metric (NDCG)
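Query id hashing, mentioned above, can be as simple as normalising the query text and hashing it, so that all samples of textually equivalent queries end up in the same rankList; the normalisation rules here are illustrative:

```python
import hashlib

def query_id(query_text):
    """Collapse textually equivalent queries into one compact query id."""
    normalised = " ".join(query_text.lower().split())  # lowercase, squash whitespace
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()[:12]

# "Red Shoes" and "  red   shoes " share one id and hence one rankList
same = query_id("Red Shoes") == query_id("  red   shoes ")
```

In practice the normalisation (stemming, synonym folding) should match what the search engine itself does at query time.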
- K-fold Cross Validation
- Temporal Split
- Manual split after shuffling
Per rankList (the subset of samples sharing one query id)
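The split strategies above can be sketched in plain Python; the sample fields (`qid`, `timestamp`) are hypothetical, and the key property is that a rankList is never split across train and test:

```python
def temporal_split(samples, cutoff):
    """Train on the past, evaluate on the future."""
    train = [s for s in samples if s["timestamp"] < cutoff]
    test = [s for s in samples if s["timestamp"] >= cutoff]
    return train, test

def split_by_query(samples, test_ratio=0.2):
    """Keep every rankList (all samples of one query id) in a single set.
    Deterministic here for clarity; shuffle qids first in practice."""
    qids = sorted({s["qid"] for s in samples})
    test_qids = set(qids[: int(len(qids) * test_ratio)])
    train = [s for s in samples if s["qid"] not in test_qids]
    test = [s for s in samples if s["qid"] in test_qids]
    return train, test
```

Splitting per query rather than per sample avoids leaking documents of the same query into both sets, which would inflate offline metrics.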
- LambdaMART + NDCG@K
- Threshold Candidates Count For Splitting -> simplify!
- Minimum Leaf Support -> remove outliers
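Since LambdaMART optimises NDCG@K directly, a compact reference implementation of the metric helps sanity-check the numbers the training tool reports (this is the standard `2^rel - 1` gain formulation, not tied to any specific library):

```python
import math

def dcg_at_k(labels, k):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    """NDCG@K: 1.0 means the ranking matches the ideal ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

# Without bad examples every labelled document is "relevant", so any
# ordering of them scores a perfect NDCG - the metric stops discriminating:
inflated = ndcg_at_k([2, 2, 2], 3)  # -> 1.0 regardless of order
```

This also illustrates the earlier caveat: rankLists without negative examples make NDCG overstate real quality.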
Reason: the searched location was missing from the training set
Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search and extensive REST APIs.
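With the Solr LTR contrib module, integration boils down to a rerank query: the collection name, model name and `efi.*` parameter below are examples, while the `{!ltr ...}` syntax itself comes from the module:

```python
from urllib.parse import urlencode

# Sketch: rerank the top-100 results of a normal query with a trained model.
params = {
    "q": "red shoes",
    # {!ltr ...} is Solr's LTR rerank-query parser; efi.* passes
    # external feature information (here the raw user query) to features.
    "rq": "{!ltr model=myLambdaMartModel reRankDocs=100 "
          "efi.user_query='red shoes'}",
    "fl": "id,score,[features]",  # [features] returns extracted feature values
}
url = "http://localhost:8983/solr/products/select?" + urlencode(params)
```

Requesting `[features]` in `fl` is also how feature vectors are logged at query time to build the training set in the first place.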
Classic Business Level Questions
- Given X, Y, Z input features, I would have expected a
different ranking -> can you fix this?
Solution: no single-line manual fix -> trial and error!
- How does the model work? What are the most important features?
Solution: index the model to extract information such as
- most frequent features in splits
- unique thresholds
Classic Business Level Questions
- What are good items generally?
Solution: developed a simple tool to extract the top scoring
leaves from the model
- Why, for query X, is doc Y scored higher than doc Z?
Solution: debug the Solr score and investigate the tree paths
- LTR is a promising and deep technology
- It requires effort! (it’s not as automatic as you think)
- Start collecting user feedback! (if you plan to use LTR)
- Good open source support available (Apache Solr + Elasticsearch)
- Not easy to debug/explain