Online testing is a fundamental method for assessing the performance of a ranking model in practical applications, providing the information needed to improve the model and better understand its behavior. Despite these advantages, the currently available evaluation tools have certain limitations. For this reason, we present an alternative, customized approach to evaluating ranking models using Kibana. The talk begins with an overview of online testing, including its benefits and drawbacks. We then provide an in-depth exploration of our Kibana implementation, detailing the reasons behind our approach. Attendees will learn about the various tools provided by Kibana and, through practical examples complete with queries and code, see how to create visualizations and dashboards to compare different rankers. Participants will leave with practical knowledge of how to leverage Kibana to evaluate ranking models on custom metrics and in specific contexts, such as the most popular queries and the most "populous" ones (those returning the most results).
How To Implement Your Online Search Quality Evaluation With Kibana
1. Berlin Buzzwords 2023
20/06/2023
Anna Ruggero, R&D Software Engineer @ Sease
Ilaria Petreti, ML Software Engineer @ Sease
How To Implement Online Search Quality Evaluation With Kibana
2. WHO WE ARE
ANNA RUGGERO
‣ Born in Padova
‣ R&D Search Software Engineer
‣ Master in Computer Science at Padova
‣ SIGIR Artifact Evaluation Committee member
‣ Solr, Elasticsearch, OpenSearch expert
‣ Recommender Systems, Big Data, Machine Learning, Ranking Systems Evaluation
‣ Organist, Latin Music Dancer and Orchid Lover
3. WHO WE ARE
ILARIA PETRETI
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ ML Search Software Engineer
‣ Master in Data Science in Rome
‣ Big Data, Machine Learning, Ranking Systems Evaluation
‣ Sport Lover (Basketball player)
4. SEArch SErvices (www.sease.io)
● Headquartered in London / distributed
● Open Source Enthusiasts
● Apache Lucene/Solr/Elasticsearch experts
● Community Contributors
● Active Researchers
● Hot Trends: Neural Search, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
7. WHAT IS ONLINE EVALUATION
Online evaluation is one of the most common approaches to measuring the effectiveness of an information retrieval system. It remains the optimal way to prove how your system performs in a real-world scenario.
8. ONLINE vs OFFLINE
Both offline and online evaluations are vital for a business.
OFFLINE
● Find anomalies in the data
● Set up a controlled environment that is not influenced by the external world
● Check how the model performs before using it in production
● Save time and money
ONLINE
● Assess how the system performs online with real users
● Measure profit/improvements/regressions on live data
● Can identify anomalies for unexpected queries/user behaviors
● More expensive (in time and cost)
9. WHY IT IS IMPORTANT
Several problems are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set produces model evaluation results that do not reflect the real model improvement. Typical causes:
➡ One sample per query
➡ One relevance label for all the samples of a query
➡ The choice of interactions considered for the data set creation
➡ A data set that is too small (unrepresentative)
2. It is hard to find a direct correlation between the offline evaluation metrics and the key performance indicator used for the online model performance evaluation (e.g. revenue).
3. Offline evaluation is based on implicit/explicit relevance labels that do not always reflect the real user need.
14. A/B TESTING NOISE
MODEL A: 10 sales from the homepage, 5 sales from the search page
MODEL B: 3 sales from the homepage, 10 sales from the search page
Is Model A better than Model B? Counting all sales, Model A looks better (15 vs 13); counting only the search-page sales that the rankers actually influenced, Model B wins (10 vs 5).
Be sure to consider ONLY interactions from result pages ranked by the models you are comparing.
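As a minimal sketch of that filtering, assuming the interaction schema introduced later in this talk (a hypothetical book-interactions index where only search-originated events carry a queryId), the noise can be excluded directly in the query behind a visualization:

# Count only sales that originated from a search results page ranked
# by one of the models under test. Assumes non-search interactions
# (e.g. homepage sales) are indexed without a queryId; adapt the
# filter to however your tracking distinguishes the source page.
GET book-interactions/_count
{
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "queryId" } },
        { "term": { "interactionType": "sale" } }
      ]
    }
  }
}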
15. SIGNALS TO MEASURE
● QUERY REFORMULATIONS / BOUNCE RATES
● SALE/REVENUE RATES
● CLICK-THROUGH RATES (views, downloads, add to favourites, ...)
● DWELL TIME (time spent on a search result after the click)
● …
Recommendation: test for a direct correlation with your business KPI!
17. OUR KIBANA IMPLEMENTATION
MAIN MOTIVATING FACTORS
● Limitations of the available online evaluation software
● The ability to use the same metric the model is optimized for
● The ability to rule out external factors and corrupt interactions
18. WHAT IS KIBANA
"Kibana is a software that provides search and visualization capabilities for data indexed in Elasticsearch, allowing you to explore and visualize large volumes of data and create detailed reporting dashboards."
19. OUR KIBANA IMPLEMENTATION
STEPS
● Creation of an Elasticsearch instance
● Creation of an index
● A/B testing set-up
● Indexing of the user interaction data
● Creation of a Data View
● Creation of visualizations and dashboards for model comparison
20. OUR KIBANA IMPLEMENTATION
SCENARIO
● An e-commerce site selling books
● Each indexed document is one user interaction and contains:
○ bookId
○ testGroup: identifies the model (by name) assigned to a user group
○ queryId
○ timestamp
○ interactionType: impression, click, addToCart or sale
○ queryResultCount: the number of hits returned for the query
○ userDevice
A minimal index mapping and a sample document are sketched below.
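A minimal sketch of the corresponding index mapping and one indexed interaction; the field list mirrors the scenario above, while the index name (book-interactions) and the field values are purely illustrative:

# Create the interaction index with keyword fields for the identifiers.
PUT book-interactions
{
  "mappings": {
    "properties": {
      "bookId":           { "type": "keyword" },
      "testGroup":        { "type": "keyword" },
      "queryId":          { "type": "keyword" },
      "timestamp":        { "type": "date" },
      "interactionType":  { "type": "keyword" },
      "queryResultCount": { "type": "integer" },
      "userDevice":       { "type": "keyword" }
    }
  }
}

# One hypothetical user interaction: a click on a book returned by a
# query served to the user group assigned to "modelA".
POST book-interactions/_doc
{
  "bookId": "B00123",
  "testGroup": "modelA",
  "queryId": "42",
  "timestamp": "2023-06-20T10:15:00Z",
  "interactionType": "click",
  "queryResultCount": 57,
  "userDevice": "mobile"
}

A Data View matching book-interactions* (with timestamp as the time field) then exposes the index to Kibana visualizations.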
23. MODEL EVALUATION
GENERAL EVALUATION (TSVB)
► Used for the overall evaluation of the models
► Helps to evaluate each model on common metrics (based on domain requirements)
► Helps to compare the models easily
► Useful for discovering bugs in the online testing set-up (frontend application issues)
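TSVB is configured through the Kibana UI, but the per-model metrics it displays reduce to filtered event counts and their ratios. A sketch of an equivalent raw aggregation, using the hypothetical index and fields from the scenario above:

# Per-model CTR and add-to-cart rate (ATR) from filtered event counts.
GET book-interactions/_search
{
  "size": 0,
  "aggs": {
    "per_model": {
      "terms": { "field": "testGroup" },
      "aggs": {
        "impressions": { "filter": { "term": { "interactionType": "impression" } } },
        "clicks":      { "filter": { "term": { "interactionType": "click" } } },
        "addToCarts":  { "filter": { "term": { "interactionType": "addToCart" } } },
        "ctr": {
          "bucket_script": {
            "buckets_path": { "clicks": "clicks._count", "impr": "impressions._count" },
            "script": "params.impr > 0 ? params.clicks / params.impr : 0"
          }
        },
        "atr": {
          "bucket_script": {
            "buckets_path": { "carts": "addToCarts._count", "clicks": "clicks._count" },
            "script": "params.clicks > 0 ? params.carts / params.clicks : 0"
          }
        }
      }
    }
  }
}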
30. MODEL EVALUATION
SINGLE PRODUCT EVALUATION (TSVB)
► Used to evaluate model performance on a specific product (document)
► Useful for seeing how the model behaves on a product of interest, e.g. best sellers, new products, most reviewed, sponsored, or on-sale promotional items
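To scope these metrics to one product, the evaluation can be restricted with a filter on bookId. Pasted as Query DSL into the visualization filter, following the same convention as the query shown later in the talk, it might look like this (the id is hypothetical):

{
  "bool": {
    "filter": [
      { "term": { "bookId": "B00123" } }
    ]
  }
}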
32. MODEL EVALUATION
PER-MODEL TOP 5 QUERIES (TSVB)
► Used to evaluate model performance on specific queries
► Calculates multiple metrics for each query
► Helpful for in-depth analysis
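A sketch of the same idea as a raw aggregation: the five most frequent queries per model (the terms aggregation orders buckets by document count by default), each with its own CTR. Index and field names follow the hypothetical scenario above:

GET book-interactions/_search
{
  "size": 0,
  "aggs": {
    "per_model": {
      "terms": { "field": "testGroup" },
      "aggs": {
        "top_queries": {
          "terms": { "field": "queryId", "size": 5 },
          "aggs": {
            "impressions": { "filter": { "term": { "interactionType": "impression" } } },
            "clicks":      { "filter": { "term": { "interactionType": "click" } } },
            "ctr": {
              "bucket_script": {
                "buckets_path": { "clicks": "clicks._count", "impr": "impressions._count" },
                "script": "params.impr > 0 ? params.clicks / params.impr : 0"
              }
            }
          }
        }
      }
    }
  }
}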
36. MODEL EVALUATION
BY QUERIES' FREQUENCY RANGES
► Used to evaluate the performance of models on queries grouped according to the search demand curve
► It may be more interesting to consider the model on the most frequent queries
► May help in considering other approaches (e.g. multiple models)
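This grouping is what we build with Vega (see the key takeaways). As a minimal Vega-Lite sketch, not the exact visualization from the slides, this draws the search demand curve by binning queries by their impression count; the index name is the hypothetical one used above, and the $schema version may need to match what your Kibana bundles:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "title": "Queries binned by search frequency (demand curve)",
  "data": {
    "url": {
      // Kibana resolves this object into an Elasticsearch request.
      "index": "book-interactions",
      "body": {
        "size": 0,
        "query": { "term": { "interactionType": "impression" } },
        "aggs": {
          "per_query": { "terms": { "field": "queryId", "size": 10000 } }
        }
      }
    },
    "format": { "property": "aggregations.per_query.buckets" }
  },
  "mark": "bar",
  "encoding": {
    "x": { "field": "doc_count", "bin": true, "title": "query frequency (impressions)" },
    "y": { "aggregate": "count", "title": "number of distinct queries" }
  }
}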
41. MODEL EVALUATION
BY QUERIES' RESULTS/HITS RANGES
► Used to evaluate the performance of models on queries grouped according to the number of search results returned
► It may be more interesting to consider the model on queries with a larger number of results
► With few search results (e.g. 1 or 3) we expect the difference between models to be almost absent
► Also used to evaluate the performance of models on a set of common queries:
{"bool":{"should":[{"match_phrase":{"queryId":"0"}},{"match_phrase":{"queryId":"10"}},{"match_phrase":{"queryId":"3"}}],...
42. QUERIES IN COMMON
1. Extract the unique query ids per model
2. Extract the queries in common between the two sets from step 1
3. Format the result as an Elasticsearch query to be copied into the Kibana visualization filter (see the sketch below)
https://sease.io/2023/06/online-search-quality-evaluation-with-kibana-queries-in-common.html
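A sketch of step 1 against the hypothetical book-interactions index: a single request returns the distinct query ids observed for each model; the intersection (step 2) and the assembly of the bool query (step 3) are then done client-side, as described in the blog post linked above:

# Step 1: distinct query ids per model.
GET book-interactions/_search
{
  "size": 0,
  "aggs": {
    "per_model": {
      "terms": { "field": "testGroup" },
      "aggs": {
        "unique_queries": { "terms": { "field": "queryId", "size": 10000 } }
      }
    }
  }
}
# Steps 2-3 (client side): intersect the two unique_queries bucket
# lists and wrap each common id in a match clause inside a
# bool/should query, as shown on the next slide.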
43. MODEL EVALUATION
BY QUERIES' RESULTS/HITS RANGES
Use the query in the Kibana filter of the visualizations:
{
  "bool": {
    "minimum_should_match": 1,
    "should": [
      { "match": { "queryId": "0" } },
      { "match": { "queryId": "10" } },
      { "match": { "queryId": "3" } }
    ]
  }
}
46. KEY TAKEAWAYS
► Online evaluation allows you to assess how the system performs online with real users
► A/B testing: remove the noise
► Why Kibana?
○ Easy-to-use GUI
○ Detailed reporting dashboards
○ Filtering of unwanted data
○ Visualizations that update automatically
○ Export (and import) of visualizations and dashboards (as NDJSON) using the Export objects API
47. KEY TAKEAWAYS
► Time Series Visual Builder (TSVB) table: good for basic comparisons with custom metrics and easy to set up.
○ General evaluation
○ Single product evaluation
○ Per-model most frequent queries
► Vega: highly customizable in both data and graphical aspects.
○ Evaluation by queries' frequency ranges
○ Evaluation by queries' hits/results ranges
48. ADDITIONAL REFERENCES
OUR BLOG POSTS ABOUT THIS TALK
● https://sease.io/2023/03/online-search-quality-evaluation-with-kibana-introduction.html
● https://sease.io/2023/06/online-search-quality-evaluation-with-kibana-visualization-examples.html
● https://sease.io/2023/06/online-search-quality-evaluation-with-kibana-queries-in-common.html
OUR BLOG POSTS ABOUT SEARCH QUALITY EVALUATION
● https://sease.io/2020/04/the-importance-of-online-testing-in-learning-to-rank-part-1.html
● https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html
50. NOTES
The more observant among you might have noticed that in some of our visualizations the CTR/ATR is higher than 1. This is due to how the dataset was created (it was generated purely as an example for this presentation). In a real scenario you should never have more clicks than impressions (since the user should have seen the product before clicking on it), nor more add-to-carts than clicks.