This talk is about the journey of bringing Learning To Rank (LTR from now on) to the e-commerce domain in a real-world scenario, including all the pitfalls and disillusions involved.
LTR is a fantastic approach to solving complex ranking problems, but industry domains are far from the ideal world where these technologies were designed and experimented with: open source implementations do not work perfectly out of the box and require advanced tuning, and industry training data is dirty, noisy and incomplete.
This talk will guide you through the different phases and technologies involved in a LTR project with a pragmatic approach.
Feature Engineering, Domain Modelling, Training Set Building, Model Training, Search Integration and Online Evaluation: each of them presents different challenges in the real world and must be carefully approached.
From Academic Papers To Production : A Learning To Rank Story
Alessandro Benedetti, Software Engineer, Sease Ltd.
● Search Consultant
● R&D Software Engineer
● Master in Computer Science
● Apache Lucene/Solr Enthusiast
● Passionate about Semantic, NLP and Machine Learning technologies
● Beach Volleyball Player & Snowboarder
Who I am
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank, Document Similarity, Measuring Search Quality, Relevancy Tuning
● Learning To Rank
● Technologies Involved
● Data Preparation
● Model Training
● Apache Solr Integration
Learning To Rank - What is it?
Learning from user implicit/explicit feedback
Rank documents (in a broad sense)
Learning To Rank - What is NOT
- A sentient system that learns by itself
- A system that continuously improves itself by ingesting additional data
- Easy to set up and tune
- Easy to give a human-understandable explanation of why the model operates in certain ways
Learning To Rank - Technologies Used
- Spring Boot
- Apache Solr >=6.4 
- User feedback harvesting
- Feature Engineering
- Dataset clean-up
- Training/Validation/Test split
User Feedback Harvesting
- Explicit user feedback (Experts/Crowdsourcing)
- Implicit user feedback (eCommerce Sales Funnel)
How to assign the relevance label?
- Signal intensity to model relevance (sale > add to cart)
- Identify a target signal, calculate rates and normalise
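The rate-based labelling above can be sketched as follows; the signal names, thresholds and grade scale are all hypothetical and must be tuned per domain:

```python
# Sketch: derive graded relevance labels from implicit feedback.
# Assumes raw interaction counts per (query, document) pair;
# the thresholds below are illustrative, not recommendations.

def relevance_label(impressions, carts, sales):
    """Map signal intensity to an ordinal label: sale > add-to-cart > click."""
    if impressions == 0:
        return 0
    sale_rate = sales / impressions
    cart_rate = carts / impressions
    if sale_rate > 0.05:
        return 3          # strong signal: the item sells for this query
    if cart_rate > 0.10:
        return 2          # medium signal: often added to cart
    if carts or sales:
        return 1          # weak positive signal
    return 0              # shown but ignored -> a needed "bad example"

label = relevance_label(impressions=200, carts=30, sales=15)  # -> 3
```

Normalising by impressions keeps popular items from dominating purely because they were shown more often.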
Discordant training samples
- Query level / Document level / Query dependent
- Ordinal/Categorical features -> one-hot encoding
- Missing values
- High cardinality categorical features
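The three encoding issues above can be sketched in plain Python; the category list and feature names are hypothetical, and the hash-bucket approach for high-cardinality categoricals is one common option among several (grouping rare values is another):

```python
import hashlib

CATEGORIES = ["shoes", "shirts", "accessories"]  # known, low-cardinality

def one_hot(value, categories=CATEGORIES):
    """Ordinal/categorical feature -> one binary feature per category."""
    return [1.0 if value == c else 0.0 for c in categories]

def hashed_bucket(value, buckets=16):
    """High-cardinality categorical (e.g. brand) -> stable hash bucket,
    so the feature space stays small and new values still get a slot."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def impute(value, default=0.0):
    """Missing value -> explicit default the model can still split on."""
    return default if value is None else value
```

Using a cryptographic hash here is only for a stable, platform-independent bucket; Python's built-in `hash()` is randomised per process.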
Data Set Cleanup
- Resample the dataset
- Query Id Hashing
- You need bad examples! (NDCG -> not reflecting real quality)
Oversampling by duplication -> over-fitting
This strongly affects the evaluation metric (NDCG)
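Query id hashing, mentioned above, can be as simple as normalising the query text and hashing it, so that all samples of textually equivalent queries end up in the same rankList; the normalisation rules here are illustrative:

```python
import hashlib

def query_id(query_text):
    """Collapse textually equivalent queries into one compact query id."""
    normalised = " ".join(query_text.lower().split())  # lowercase, squash whitespace
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()[:12]

# "Red Shoes" and "  red   shoes " share one id and hence one rankList
same = query_id("Red Shoes") == query_id("  red   shoes ")
```

In practice the normalisation (stemming, synonym folding) should match what the search engine itself does at query time.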
- K-fold Cross Validation
- Temporal Split
- Manual split after shuffling
Per rankList (the subset of samples sharing one query id)
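The split strategies above can be sketched in plain Python; the sample fields (`qid`, `timestamp`) are hypothetical, and the key property is that a rankList is never split across train and test:

```python
def temporal_split(samples, cutoff):
    """Train on the past, evaluate on the future."""
    train = [s for s in samples if s["timestamp"] < cutoff]
    test = [s for s in samples if s["timestamp"] >= cutoff]
    return train, test

def split_by_query(samples, test_ratio=0.2):
    """Keep every rankList (all samples of one query id) in a single set.
    Deterministic here for clarity; shuffle qids first in practice."""
    qids = sorted({s["qid"] for s in samples})
    test_qids = set(qids[: int(len(qids) * test_ratio)])
    train = [s for s in samples if s["qid"] not in test_qids]
    test = [s for s in samples if s["qid"] in test_qids]
    return train, test
```

Splitting per query rather than per sample avoids leaking documents of the same query into both sets, which would inflate offline metrics.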
- LambdaMART + NDCG@K
- Threshold Candidates Count For Splitting -> simplify!
- Minimum Leaf Support -> remove outliers
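Since LambdaMART optimises NDCG@K directly, a compact reference implementation of the metric helps sanity-check the numbers the training tool reports (this is the standard `2^rel - 1` gain formulation, not tied to any specific library):

```python
import math

def dcg_at_k(labels, k):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    """NDCG@K: 1.0 means the ranking matches the ideal ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

# Without bad examples every labelled document is "relevant", so any
# ordering of them scores a perfect NDCG - the metric stops discriminating:
inflated = ndcg_at_k([2, 2, 2], 3)  # -> 1.0 regardless of order
```

This also illustrates the earlier caveat: rankLists without negative examples make NDCG overstate real quality.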
Reason: the searched location was missing from the training set
Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search and extensive REST APIs.
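With the Solr LTR contrib module, integration boils down to a rerank query: the collection name, model name and `efi.*` parameter below are examples, while the `{!ltr ...}` syntax itself comes from the module:

```python
from urllib.parse import urlencode

# Sketch: rerank the top-100 results of a normal query with a trained model.
params = {
    "q": "red shoes",
    # {!ltr ...} is Solr's LTR rerank-query parser; efi.* passes
    # external feature information (here the raw user query) to features.
    "rq": "{!ltr model=myLambdaMartModel reRankDocs=100 "
          "efi.user_query='red shoes'}",
    "fl": "id,score,[features]",  # [features] returns extracted feature values
}
url = "http://localhost:8983/solr/products/select?" + urlencode(params)
```

Requesting `[features]` in `fl` is also how feature vectors are logged at query time to build the training set in the first place.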
Classic Business Level Questions
- Given X, Y, Z input features, I would have expected a
different ranking -> can you fix this?
Solution: no single-line manual fix -> trial and error!
- How does the model work? What are the most important features?
Solution: index the model to extract information such as
- most frequent features in splits
- unique thresholds
Classic Business Level Questions
- What are good items generally?
Solution: developed a simple tool to extract the top scoring
leaves from the model
- Why, for query X, is doc Y scored higher than doc Z?
Solution: debug the Solr score and investigate the tree paths
- LTR is a promising and deep technology
- It requires effort! (it’s not as automatic as you think)
- Start collecting user feedback! (if you plan to use LTR)
- Good open source support available (Apache Solr + Elasticsearch)
- Not easy to debug/explain