Online testing is a fundamental method for assessing the performance of a ranking model in practical applications, providing the information needed to improve the model and better understand its behavior. Despite these advantages, the currently available evaluation tools have certain limitations. For this reason, we present an alternative, customized approach to evaluating ranking models using Kibana. The talk begins with an overview of online testing, including its benefits and drawbacks. We then provide an in-depth exploration of our Kibana implementation, detailing the reasons behind our approach. Attendees will learn about the various tools provided by Kibana and, through practical examples complete with queries and code, see how to create visualizations and dashboards to compare different rankers. Participants will leave with practical knowledge of how to leverage Kibana to evaluate ranking models on custom metrics and in specific contexts, such as the most popular queries and the most "populous" ones (those returning the most results).
How To Implement Your Online Search Quality Evaluation With Kibana
1. Berlin Buzzwords 2023
20/06/2023
Anna Ruggero, R&D Software Engineer @ Sease
Ilaria Petreti, ML Software Engineer @ Sease
How To Implement Online Search Quality Evaluation With Kibana
2. WHO WE ARE
ANNA RUGGERO
‣ Born in Padova
‣ R&D Search Software Engineer
‣ Master in Computer Science at Padova
‣ SIGIR Artifact Evaluation Committee member
‣ Solr, Elasticsearch, OpenSearch expert
‣ Recommender Systems, Big Data, Machine Learning, Ranking Systems Evaluation
‣ Organist, Latin Music Dancer and Orchid Lover
3. WHO WE ARE
ILARIA PETRETI
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ ML Search Software Engineer
‣ Master in Data Science in Rome
‣ Big Data, Machine Learning, Ranking Systems Evaluation
‣ Sport Lover (Basketball player)
4. SEArch SErvices (www.sease.io)
● Headquartered in London / distributed
● Open Source Enthusiasts
● Apache Lucene/Solr/Elasticsearch experts
● Community Contributors
● Active Researchers
● Hot Trends: Neural Search, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
7. WHAT IS ONLINE EVALUATION
Online evaluation is one of the most common approaches to measuring the effectiveness of an information retrieval system. It remains the optimal way to prove how your system performs in a real-world scenario.
8. ONLINE vs OFFLINE
Both offline and online evaluations are vital for a business.
OFFLINE
● Find anomalies in the data
● Set up a controlled environment that is not influenced by the external world
● Check how the model performs before using it in production
● Save time and money
ONLINE
● Assess how the system performs online with real users
● Measure profit/improvements/regressions on live data
● Can identify anomalies for unexpected queries/user behaviors
● More expensive (in time and cost)
9. WHY IT IS IMPORTANT
Several problems are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set produces model evaluation results that do not reflect the real model improvement. Typical causes:
➡ One sample per query
➡ One relevance label for all the samples of a query
➡ The choice of interactions considered for the data set creation
➡ A data set that is too small (unrepresentative)
2. It is hard to find a direct correlation between the offline evaluation metrics and the key performance indicator used for the online model performance evaluation (e.g. revenue).
3. Offline evaluation is based on implicit/explicit relevance labels that do not always reflect the real user need.
14. A/B TESTING NOISE
MODEL A: 10 sales from the homepage, 5 sales from the search page
MODEL B: 3 sales from the homepage, 10 sales from the search page
Is Model A better than Model B? Counting all sales, Model A looks better (15 vs 13); counting only the search-page sales that the rankers actually influenced, Model B wins (10 vs 5).
Be sure to consider ONLY interactions from result pages ranked by the models you are comparing.
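As a minimal sketch of that filtering, assuming the interaction schema introduced later in this talk (a hypothetical book-interactions index where only search-originated events carry a queryId), the noise can be excluded directly in the query behind a visualization:

# Count only sales that originated from a search results page ranked
# by one of the models under test. Assumes non-search interactions
# (e.g. homepage sales) are indexed without a queryId; adapt the
# filter to however your tracking distinguishes the source page.
GET book-interactions/_count
{
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "queryId" } },
        { "term": { "interactionType": "sale" } }
      ]
    }
  }
}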
15. SIGNALS TO MEASURE
● QUERY REFORMULATIONS / BOUNCE RATES
● SALE/REVENUE RATES
● CLICK-THROUGH RATES (views, downloads, add to favourites, ...)
● DWELL TIME (time spent on a search result after the click)
● …
Recommendation: test for a direct correlation with your business KPI!
17. OUR KIBANA IMPLEMENTATION
MAIN MOTIVATING FACTORS
● Limitations of the available online evaluation software
● The ability to use the same metric the model is optimized for
● The ability to rule out external factors and corrupt interactions
18. WHAT IS KIBANA
"Kibana is a software that provides search and visualization capabilities for data indexed in Elasticsearch, allowing you to explore and visualize large volumes of data and create detailed reporting dashboards."
19. OUR KIBANA IMPLEMENTATION
STEPS
● Creation of an Elasticsearch instance
● Creation of an index
● A/B testing set-up
● Indexing of the user interaction data
● Creation of a Data View
● Creation of visualizations and dashboards for model comparison
20. OUR KIBANA IMPLEMENTATION
SCENARIO
● An e-commerce site selling books
● Each indexed document is one user interaction and contains:
○ bookId
○ testGroup: identifies the model (by name) assigned to a user group
○ queryId
○ timestamp
○ interactionType: impression, click, addToCart or sale
○ queryResultCount: the number of hits returned for the query
○ userDevice
A minimal index mapping and a sample document are sketched below.
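A minimal sketch of the corresponding index mapping and one indexed interaction; the field list mirrors the scenario above, while the index name (book-interactions) and the field values are purely illustrative:

# Create the interaction index with keyword fields for the identifiers.
PUT book-interactions
{
  "mappings": {
    "properties": {
      "bookId":           { "type": "keyword" },
      "testGroup":        { "type": "keyword" },
      "queryId":          { "type": "keyword" },
      "timestamp":        { "type": "date" },
      "interactionType":  { "type": "keyword" },
      "queryResultCount": { "type": "integer" },
      "userDevice":       { "type": "keyword" }
    }
  }
}

# One hypothetical user interaction: a click on a book returned by a
# query served to the user group assigned to "modelA".
POST book-interactions/_doc
{
  "bookId": "B00123",
  "testGroup": "modelA",
  "queryId": "42",
  "timestamp": "2023-06-20T10:15:00Z",
  "interactionType": "click",
  "queryResultCount": 57,
  "userDevice": "mobile"
}

A Data View matching book-interactions* (with timestamp as the time field) then exposes the index to Kibana visualizations.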
23. MODEL EVALUATION
GENERAL EVALUATION (TSVB)
► Used for the overall evaluation of the models
► Helps to evaluate each model on common metrics (based on domain requirements)
► Helps to compare the models easily
► Useful for discovering bugs in the online testing set-up (frontend application issues)
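TSVB is configured through the Kibana UI, but the per-model metrics it displays reduce to filtered event counts and their ratios. A sketch of an equivalent raw aggregation, using the hypothetical index and fields from the scenario above:

# Per-model CTR and add-to-cart rate (ATR) from filtered event counts.
GET book-interactions/_search
{
  "size": 0,
  "aggs": {
    "per_model": {
      "terms": { "field": "testGroup" },
      "aggs": {
        "impressions": { "filter": { "term": { "interactionType": "impression" } } },
        "clicks":      { "filter": { "term": { "interactionType": "click" } } },
        "addToCarts":  { "filter": { "term": { "interactionType": "addToCart" } } },
        "ctr": {
          "bucket_script": {
            "buckets_path": { "clicks": "clicks._count", "impr": "impressions._count" },
            "script": "params.impr > 0 ? params.clicks / params.impr : 0"
          }
        },
        "atr": {
          "bucket_script": {
            "buckets_path": { "carts": "addToCarts._count", "clicks": "clicks._count" },
            "script": "params.clicks > 0 ? params.carts / params.clicks : 0"
          }
        }
      }
    }
  }
}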
30. MODEL EVALUATION
SINGLE PRODUCT EVALUATION (TSVB)
► Used to evaluate model performance on a specific product (document)
► Useful for seeing how the model behaves on a product of interest, e.g. best sellers, new products, most reviewed, sponsored, or on-sale promotional items
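To scope these metrics to one product, the evaluation can be restricted with a filter on bookId. Pasted as Query DSL into the visualization filter, following the same convention as the query shown later in the talk, it might look like this (the id is hypothetical):

{
  "bool": {
    "filter": [
      { "term": { "bookId": "B00123" } }
    ]
  }
}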
32. MODEL EVALUATION
PER-MODEL TOP 5 QUERIES (TSVB)
► Used to evaluate model performance on specific queries
► Calculates multiple metrics for each query
► Helpful for in-depth analysis
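A sketch of the same idea as a raw aggregation: the five most frequent queries per model (the terms aggregation orders buckets by document count by default), each with its own CTR. Index and field names follow the hypothetical scenario above:

GET book-interactions/_search
{
  "size": 0,
  "aggs": {
    "per_model": {
      "terms": { "field": "testGroup" },
      "aggs": {
        "top_queries": {
          "terms": { "field": "queryId", "size": 5 },
          "aggs": {
            "impressions": { "filter": { "term": { "interactionType": "impression" } } },
            "clicks":      { "filter": { "term": { "interactionType": "click" } } },
            "ctr": {
              "bucket_script": {
                "buckets_path": { "clicks": "clicks._count", "impr": "impressions._count" },
                "script": "params.impr > 0 ? params.clicks / params.impr : 0"
              }
            }
          }
        }
      }
    }
  }
}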
36. MODEL EVALUATION
BY QUERIES' FREQUENCY RANGES
► Used to evaluate the performance of models on queries grouped according to the search demand curve
► It may be more interesting to consider the model on the most frequent queries
► May help in considering other approaches (e.g. multiple models)
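This grouping is what we build with Vega (see the key takeaways). As a minimal Vega-Lite sketch, not the exact visualization from the slides, this draws the search demand curve by binning queries by their impression count; the index name is the hypothetical one used above, and the $schema version may need to match what your Kibana bundles:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "title": "Queries binned by search frequency (demand curve)",
  "data": {
    "url": {
      // Kibana resolves this object into an Elasticsearch request.
      "index": "book-interactions",
      "body": {
        "size": 0,
        "query": { "term": { "interactionType": "impression" } },
        "aggs": {
          "per_query": { "terms": { "field": "queryId", "size": 10000 } }
        }
      }
    },
    "format": { "property": "aggregations.per_query.buckets" }
  },
  "mark": "bar",
  "encoding": {
    "x": { "field": "doc_count", "bin": true, "title": "query frequency (impressions)" },
    "y": { "aggregate": "count", "title": "number of distinct queries" }
  }
}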
41. MODEL EVALUATION
BY QUERIES' RESULTS/HITS RANGES
► Used to evaluate the performance of models on queries grouped according to the number of search results returned
► It may be more interesting to consider the model on queries with a larger number of results
► With few search results (e.g. 1 or 3) we expect the difference between models to be almost absent
► Also used to evaluate the performance of models on a set of common queries:
{"bool":{"should":[{"match_phrase":{"queryId":"0"}},{"match_phrase":{"queryId":"10"}},{"match_phrase":{"queryId":"3"}}],...
42. QUERIES IN COMMON
1. Extract the unique query ids per model
2. Extract the queries in common between the two sets from step 1
3. Format the result as an Elasticsearch query to be copied into the Kibana visualization filter (see the sketch below)
https://sease.io/2023/06/online-search-quality-evaluation-with-kibana-queries-in-common.html
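A sketch of step 1 against the hypothetical book-interactions index: a single request returns the distinct query ids observed for each model; the intersection (step 2) and the assembly of the bool query (step 3) are then done client-side, as described in the blog post linked above:

# Step 1: distinct query ids per model.
GET book-interactions/_search
{
  "size": 0,
  "aggs": {
    "per_model": {
      "terms": { "field": "testGroup" },
      "aggs": {
        "unique_queries": { "terms": { "field": "queryId", "size": 10000 } }
      }
    }
  }
}
# Steps 2-3 (client side): intersect the two unique_queries bucket
# lists and wrap each common id in a match clause inside a
# bool/should query, as shown on the next slide.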
43. MODEL EVALUATION
BY QUERIES' RESULTS/HITS RANGES
Use the query in the Kibana filter of the visualizations:
{
  "bool": {
    "minimum_should_match": 1,
    "should": [
      { "match": { "queryId": "0" } },
      { "match": { "queryId": "10" } },
      { "match": { "queryId": "3" } }
    ]
  }
}
46. KEY TAKEAWAYS
► Online evaluation allows you to assess how the system performs online with real users
► A/B testing: remove the noise
► Why Kibana?
○ Easy-to-use GUI
○ Detailed reporting dashboards
○ Filtering of unwanted data
○ Visualizations that update automatically
○ Export (and import) of visualizations and dashboards (as NDJSON) using the Export objects API
47. KEY TAKEAWAYS
► Time Series Visual Builder (TSVB) table: good for basic comparisons with custom metrics and easy to set up.
○ General evaluation
○ Single product evaluation
○ Per-model most frequent queries
► Vega: highly customizable in both data and graphical aspects.
○ Evaluation by queries' frequency ranges
○ Evaluation by queries' hits/results ranges
48. ADDITIONAL REFERENCES
OUR BLOG POSTS ABOUT THIS TALK
● https://sease.io/2023/03/online-search-quality-evaluation-with-kibana-introduction.html
● https://sease.io/2023/06/online-search-quality-evaluation-with-kibana-visualization-examples.html
● https://sease.io/2023/06/online-search-quality-evaluation-with-kibana-queries-in-common.html
OUR BLOG POSTS ABOUT SEARCH QUALITY EVALUATION
● https://sease.io/2020/04/the-importance-of-online-testing-in-learning-to-rank-part-1.html
● https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html
50. NOTES
The more observant among you might have noticed that in some of our visualizations the CTR/ATR is higher than 1. This is due to how the dataset was created (it was generated purely as an example for this presentation). In a real scenario you should never have more clicks than impressions (since the user should have seen the product before clicking on it), nor more add-to-carts than clicks.