Online testing remains the best way to prove how your ranking model performs in your real-world scenario. It offers many advantages, such as a direct interpretation of the results and confirmation of the estimates produced by offline tests. It gives a better understanding of the ranking model's behaviour and builds a solid foundation for improving it.
The evaluation tools available today have some limitations, so in this talk we describe an alternative, customised approach to evaluating ranking models with Kibana.
First of all, we give an overview of online testing, highlighting the pros and cons and describing the state of the art.
We then dive into our Kibana implementation and the reasons behind it. We explore the tools Kibana provides, along with their constraints in real-world applications, and show, through practical examples, how to create dashboards (with queries and code) to compare different models.
How To Implement Your Online Search Quality Evaluation With Kibana
1. LONDON INFORMATION RETRIEVAL MEETUP
21/11/2022
Anna Ruggero, R&D Software Engineer @ Sease
Ilaria Petreti, ML Software Engineer @ Sease
How To Implement Your Online Search
Quality Evaluation With Kibana
2. ‣ Born in Padova
‣ R&D Search Software Engineer
‣ Master's in Computer Science at Padova
‣ SIGIR Artifact Evaluation Committee member
‣ Solr, Elasticsearch, OpenSearch expert
‣ Recommender Systems, Big Data, Machine Learning, Ranking Systems Evaluation
‣ Organist, Latin Music Dancer and Orchid Lover
ANNA RUGGERO
WHO WE ARE
3. ‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ ML Search Software Engineer
‣ Master's in Data Science in Rome
‣ Big Data, Machine Learning, Ranking Systems Evaluation
‣ Sport Lover (Basketball player)
ILARIA PETRETI
WHO WE ARE
6. WHAT IS ONLINE EVALUATION
Online evaluation is one of the most common approaches to measuring the effectiveness of an information retrieval system.
It remains the optimal way to prove how your system performs in a real-world scenario.
7. ONLINE vs OFFLINE
Both offline and online evaluations are vital for a business.
OFFLINE
● Find anomalies in data
● Set up a controlled environment not influenced by the external world
● Check how the model performs before using it in production
● Save time and money
ONLINE
● Assess how the system performs online with real users
● Measure profit/improvements/regressions on live data
● Can identify anomalies for unexpected queries/user behaviours
● More expensive (in time and cost)
8. WHY IS IT IMPORTANT
There are several problems that are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set produces model evaluation results that do not reflect the real model improvement:
➡ One sample per query group
➡ One relevance label for all the samples of a query group
➡ Interactions considered for the data set creation
➡ Too small (unrepresentative)
9. WHY IS IT IMPORTANT
There are several problems that are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set produces model evaluation results that do not reflect the real model improvement.
2. It is hard to find a direct correlation between the offline evaluation metrics and the key performance indicator used for the online model performance evaluation (e.g. revenue).
10. WHY IS IT IMPORTANT
There are several problems that are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set produces model evaluation results that do not reflect the real model improvement.
2. It is hard to find a direct correlation between the offline evaluation metrics and the key performance indicator used for the online model performance evaluation (e.g. revenue).
3. Offline evaluation is based on implicit/explicit relevance labels that do not always reflect the real user need.
11. WHY IS IT IMPORTANT
ADVANTAGES (of online evaluation)
● The interpretability of the results
● The reliability of the results
● The possibility to observe the model behaviour and improve it
13. COMPARE MODELS ONLINE
A/B TESTING: the population is divided in 2 groups
● easier to implement
● most popular
INTERLEAVING: users are shown both variants by interleaving the results
● requires a smaller amount of traffic
● requires a smaller amount of time
● less influenced by user variance
● more sensitive to differences between models
● prevents users from being exposed to a bad system
15. A/B TESTING NOISE
MODEL A
10 sales from the homepage
5 sales from the search page
MODEL B
4 sales from the homepage
10 sales from the search page
Model A is better than Model B(?)
Be sure to consider ONLY interactions from result pages ranked by the models you are comparing
16. A/B TESTING NOISE
MODEL A
10 sales from the homepage
5 sales from the search page
MODEL B
4 sales from the homepage
10 sales from the search page
Be sure to consider ONLY interactions from result pages ranked by the models you are comparing
Model A is better than Model B(?)
17. SIGNALS TO MEASURE
● QUERY REFORMULATIONS/BOUNCE RATES
● SALE/REVENUE RATES
● CLICK THROUGH RATES (views, download, add to favourite, ...)
● DWELL TIME (time spent on a search result after the click)
● ….
Recommendation: test for direct correlation!
18. EXPERIMENT DESIGN
● HOW LONG TO TEST
- once the statistical significance is high enough
- generally not less than 2 weeks
● TESTING DIFFERENT PLATFORMS
- desktop, mobile, tablet independently
● HOW MANY MODELS
- keep it as simple as possible
- depends on the amount of available traffic
20. OUR KIBANA IMPLEMENTATION
MAIN MOTIVATING FACTORS
● Limitations of the available online evaluation software
● Being able to use the same metric the model is optimised for
● Being able to rule out external factors/corrupted interactions
21. WHAT IS KIBANA
Kibana is an open-source(?) software that provides search and visualization capabilities for data indexed in Elasticsearch, allowing you to explore and visualize large volumes of data and create detailed reporting dashboards.
22. OUR KIBANA IMPLEMENTATION
STEPS
● Creation of an Elasticsearch instance
● Creation of an index (see the sketch after this list)
● Indexing of user interaction data
● Creation of a Data View
● Creation of visualizations and dashboards for model comparison
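A minimal sketch of the index-creation step with the elasticsearch Python client (v8.x). The index name user_interactions is made up (and reused in the later sketches), the mapping mirrors the fields listed on the next slide, and query stands in for the user query field, whose exact name is not given in the slides:

from elasticsearch import Elasticsearch

# Assumed local instance; adjust the URL to your deployment.
es = Elasticsearch("http://localhost:9200")

# keyword/date/integer types keep the fields aggregatable in Kibana.
es.indices.create(
    index="user_interactions",
    mappings={
        "properties": {
            "testGroup": {"type": "keyword"},
            "query": {"type": "keyword"},
            "timestamp": {"type": "date"},
            "interactionType": {"type": "keyword"},
            "bookId": {"type": "keyword"},
            "queryResultCount": {"type": "integer"},
        }
    },
)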
23. OUR KIBANA IMPLEMENTATION
COLLECT USER INTERACTIONS
FEATURES/FIELDS (an example document follows the list):
● testGroup: identifies the model (name) assigned to a user group
● user query
● timestamp
● interactionType
○ impression (when a document/product is shown to the user)
○ click
○ addToCart
○ sale
● bookId
● queryResultCount (query hits)
● …
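An interaction event indexed against the sketched mapping might look as follows; all values are made up, and query again stands in for the user query field:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# A single click event; every value here is illustrative.
es.index(
    index="user_interactions",
    document={
        "testGroup": "modelA",
        "query": "dystopian novels",
        "timestamp": "2022-11-21T10:15:00Z",
        "interactionType": "click",
        "bookId": "9780451524935",
        "queryResultCount": 87,
    },
)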
24. KIBANA TOOLS
● DISCOVER: the starting point for data exploration
● VISUALIZE: different kinds of visualizations:
- Metric, Line, Bar, Pie, Area, Heat map, Table, etc.
- Custom panels with editors (like Time Series Visual Builder and VEGA)
● DASHBOARD: combine data visualizations into functional dashboards
● ALERTING: monitor data in real time and create customized alert triggers
● MANAGEMENT: manage indices and adjust runtime configuration
https://www.elastic.co/guide/en/kibana/current/get-started.html
30. MODEL EVALUATION
GENERAL EVALUATION (TSVB)
► used for the overall evaluation of the models
► helps to evaluate each model on common metrics (based on domain requirements)
► helps to easily compare them
► useful for discovering bugs in the online testing setup (a nonexistent model name → frontend application issues)
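The kind of per-model metric such a panel shows can be reproduced with a plain aggregation. A sketch (not the talk's actual query) computing impressions, clicks and a derived CTR per testGroup:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Impression/click counts per model, plus CTR as a derived bucket metric.
resp = es.search(
    index="user_interactions",
    size=0,
    aggs={
        "per_model": {
            "terms": {"field": "testGroup"},
            "aggs": {
                "impressions": {"filter": {"term": {"interactionType": "impression"}}},
                "clicks": {"filter": {"term": {"interactionType": "click"}}},
                "ctr": {
                    "bucket_script": {
                        "buckets_path": {
                            "clicks": "clicks._count",
                            "impr": "impressions._count",
                        },
                        "script": "params.impr > 0 ? params.clicks / params.impr : 0",
                    }
                },
            },
        }
    },
)
for bucket in resp["aggregations"]["per_model"]["buckets"]:
    print(bucket["key"], bucket["ctr"]["value"])

An unexpected key among the per_model buckets is exactly the kind of nonexistent model name mentioned above.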
37. MODEL EVALUATION
SINGLE PRODUCT EVALUATION (TSVB)
► used to evaluate model performance on a specific product (document)
► useful for seeing how the model behaves on a product of interest: e.g. best sellers, new products, most reviewed, sponsored, on-sale promotional items, etc.
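Restricting the same per-model breakdown to one product is a matter of adding a filter; a sketch with a made-up bookId:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Interaction counts per model and per interaction type for one product.
resp = es.search(
    index="user_interactions",
    size=0,
    query={"term": {"bookId": "9780451524935"}},
    aggs={
        "per_model": {
            "terms": {"field": "testGroup"},
            "aggs": {"by_type": {"terms": {"field": "interactionType"}}},
        }
    },
)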
39. MODEL EVALUATION
PER MODEL TOP 5 QUERIES (TSVB)
► used to evaluate model performance on specific queries
► calculates multiple metrics for each query
► helpful for in-depth analysis
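A sketch of the underlying data: the five most frequent queries per model, with a click count per query as one example of the "multiple metrics" (query stands in for the user query field):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Top 5 queries per model, each with a click sub-count.
resp = es.search(
    index="user_interactions",
    size=0,
    aggs={
        "per_model": {
            "terms": {"field": "testGroup"},
            "aggs": {
                "top_queries": {
                    "terms": {"field": "query", "size": 5},
                    "aggs": {
                        "clicks": {"filter": {"term": {"interactionType": "click"}}}
                    },
                }
            },
        }
    },
)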
42. MODEL EVALUATION
PER MODEL QUERY COUNT DISTRIBUTION (VERTICAL BAR)
► used to evaluate the distribution of interactions (impressions) on individual queries
► helps identify the most frequent queries for later use in other analyses:
▹ to group by frequency
▹ to calculate specific metrics
45. MODEL EVALUATION
PER MODEL INTERACTIONS DAILY COUNT (DATA TABLE)
► used to evaluate the distribution of data by model and by day
► useful for understanding whether the testing process is distributing the models equally or in the desired percentage
► useful for discovering any imbalances/disruptions
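The table's data reduces to a date histogram with a per-model breakdown; a sketch:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Daily interaction counts broken down by model.
resp = es.search(
    index="user_interactions",
    size=0,
    aggs={
        "per_day": {
            "date_histogram": {"field": "timestamp", "calendar_interval": "day"},
            "aggs": {"per_model": {"terms": {"field": "testGroup"}}},
        }
    },
)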
48. MODEL EVALUATION
BASED ON QUERIES' FREQUENCY (VEGA)
► used to evaluate the performance of models on queries grouped according to the search demand curve
► it may be more interesting to consider the model on the most frequent queries
► may help to consider other approaches (e.g. multiple models)
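One way to build such groups outside Kibana is to split the queries of the demand curve into head/torso/tail by cumulative impression share. A sketch with arbitrary 20%/50% cut-offs (the talk does not specify its grouping):

# Illustrative grouping of queries along the search demand curve.
def demand_curve_groups(buckets, head_share=0.2, torso_share=0.5):
    """`buckets` is a terms-aggregation result: [{"key": ..., "doc_count": ...}]."""
    total = sum(b["doc_count"] for b in buckets)
    groups = {"head": [], "torso": [], "tail": []}
    running = 0
    for b in sorted(buckets, key=lambda b: b["doc_count"], reverse=True):
        running += b["doc_count"]
        if running <= head_share * total:
            groups["head"].append(b["key"])
        elif running <= (head_share + torso_share) * total:
            groups["torso"].append(b["key"])
        else:
            groups["tail"].append(b["key"])
    return groups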
53. COMMON QUERIES
HOW TO COMPARE MODELS ON COMMON QUERIES
1. Query to extract the unique query ids per model → save only the buckets from the response in separate files (e.g. unique_query_ids.json)
2. unique_query_ids_modelA.json and unique_query_ids_modelB.json are the input files for the query_elaboration.py Python script (see the sketch after this list)
3. The stdout of the Python script is the query to be copied into the Kibana filter of the visualizations
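The script itself is not shown in the slides; a hypothetical reconstruction that intersects the two bucket files and prints a terms filter (the query field name is assumed):

import json
import sys

def load_query_ids(path):
    # Each file holds the saved "buckets" array of the aggregation response.
    with open(path) as f:
        return {bucket["key"] for bucket in json.load(f)}

# Query ids seen by both models.
common = sorted(
    load_query_ids("unique_query_ids_modelA.json")
    & load_query_ids("unique_query_ids_modelB.json")
)

# A terms filter over the shared ids, ready to paste into a Kibana filter.
json.dump({"terms": {"query": common}}, sys.stdout, indent=2)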
54. COMMON QUERIES
Query to extract the unique query ids per model (e.g. modelB → unique_query_ids_modelB.json)
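A sketch of such an extraction query and of saving its buckets; the size of 10000 is an arbitrary upper bound on the number of distinct queries, and query again stands in for the user query field:

import json

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Unique query ids seen by modelB; only the buckets are written to file.
resp = es.search(
    index="user_interactions",
    size=0,
    query={"term": {"testGroup": "modelB"}},
    aggs={"unique_query_ids": {"terms": {"field": "query", "size": 10000}}},
)
with open("unique_query_ids_modelB.json", "w") as f:
    json.dump(resp["aggregations"]["unique_query_ids"]["buckets"], f)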
57. MODEL EVALUATION
BASED ON QUERIES' RESULTS/HITS: use the resulting query in the Kibana filter of the visualizations
58. MODEL EVALUATION
BASED ON QUERIES' RESULTS/HITS (VEGA)
► used to evaluate the performance of models on a set of common queries
► used to evaluate the performance of models on queries grouped according to the number of search results returned
► it may be more interesting to consider the model on queries with a larger number of results
► with few search results (like 1 or 3), we expect the difference between models to be almost absent
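Grouping by the number of hits can be sketched with a range aggregation on queryResultCount; the boundaries below are illustrative, not from the talk:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Per-model counts for queries bucketed by the number of returned hits.
resp = es.search(
    index="user_interactions",
    size=0,
    aggs={
        "by_hits": {
            "range": {
                "field": "queryResultCount",
                "ranges": [{"to": 4}, {"from": 4, "to": 50}, {"from": 50}],
            },
            "aggs": {"per_model": {"terms": {"field": "testGroup"}}},
        }
    },
)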
62. KIBANA PROS & CONS
PROS
● Easy-to-use GUI
● Detailed reporting dashboards (aggregating several visualizations)
● Filtering of unwanted data
● Automatic update of visualizations and dashboards
● Export (and import) of visualizations and dashboards (as NDJSON) using the Export objects API
CONS
● Some features are still not available
● Limitations for aggregated data (more complex queries)
● VEGA “limitations” (not simple to use)
● Manual editing of filters (if the data view is renamed or the model is changed)
63. ADDITIONAL REFERENCES
OUR BLOG POSTS ABOUT SEARCH QUALITY EVALUATION
● The importance of online testing in learning to rank part 1
● Online Testing for Learning To Rank: Interleaving
● Offline Search Quality Evaluation: Rated Ranking Evaluator (RRE)
● Apache Solr Learning To Rank Interleaving
Keep an eye on our Blog page, as more is coming!