Online testing remains the best way to prove how your ranking model performs in your real-world scenario. It offers many advantages, such as a direct interpretation of the results and confirmation of the estimates produced by offline tests. It gives a better understanding of the ranking model's behaviour and builds a solid foundation for improving it.
The evaluation tools available today have some limitations, so in this talk we describe an alternative, customised approach to evaluating ranking models with Kibana.
First of all, we give an overview of online testing, highlighting the pros and cons and describing the state of the art.
We then dive into our Kibana implementation and the reasons behind it. We explore the tools Kibana provides, along with their constraints in real-world applications, and show, through practical examples, how to create dashboards (with queries and code) to compare different models.
How To Implement Your Online Search Quality Evaluation With Kibana
1. LONDON INFORMATION RETRIEVAL MEETUP
21/11/2022
Anna Ruggero, R&D Software Engineer @ Sease
Ilaria Petreti, ML Software Engineer @ Sease
How To Implement Your Online Search
Quality Evaluation With Kibana
2. ‣ Born in Padova
‣ R&D Search Software Engineer
‣ Master's in Computer Science at Padova
‣ SIGIR Artifact Evaluation Committee member
‣ Solr, Elasticsearch, OpenSearch expert
‣ Recommender Systems, Big Data, Machine Learning, Ranking Systems Evaluation
‣ Organist, Latin Music Dancer and Orchid Lover
ANNA RUGGERO
WHO WE ARE
3. ‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ ML Search Software Engineer
‣ Master's in Data Science in Rome
‣ Big Data, Machine Learning, Ranking Systems Evaluation
‣ Sport Lover (Basketball player)
ILARIA PETRETI
WHO WE ARE
6. WHAT IS ONLINE EVALUATION
Online evaluation is one of the most common approaches to measuring the effectiveness of an information retrieval system.
It remains the optimal way to prove how your system performs in a real-world scenario.
7. ONLINE vs OFFLINE
Both offline and online evaluations are vital for a business.
OFFLINE
● Find anomalies in data
● Set up a controlled environment not influenced by the external world
● Check how the model performs before using it in production
● Save time and money
ONLINE
● Assess how the system performs online with real users
● Measure profit/improvements/regressions on live data
● Can identify anomalies for unexpected queries/user behaviours
● More expensive (in time and cost)
8. WHY IS IT IMPORTANT
There are several problems that are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set produces model evaluation results that do not reflect the real model improvement:
➡ One sample per query group
➡ One relevance label for all the samples of a query group
➡ Interactions considered for the data set creation
➡ Too small (unrepresentative)
9. WHY IS IT IMPORTANT
There are several problems that are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set produces model evaluation results that do not reflect the real model improvement.
2. It is hard to find a direct correlation between the offline evaluation metrics and the key performance indicator used for the online model performance evaluation (e.g. revenue).
10. WHY IS IT IMPORTANT
There are several problems that are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set produces model evaluation results that do not reflect the real model improvement.
2. It is hard to find a direct correlation between the offline evaluation metrics and the key performance indicator used for the online model performance evaluation (e.g. revenue).
3. Offline evaluation is based on implicit/explicit relevance labels that do not always reflect the real user need.
11. WHY IS IT IMPORTANT
ADVANTAGES (of online evaluation)
● The interpretability of the results
● The reliability of the results
● The possibility to observe the model behaviour and improve it
13. COMPARE MODELS ONLINE
A/B TESTING: the population is divided in 2 groups
● easier to implement
● most popular
INTERLEAVING: users are shown both variants by interleaving the results
● requires a smaller amount of traffic
● requires a smaller amount of time
● less influenced by user variance
● more sensitive to differences between models
● prevents users from being exposed to a bad system
15. A/B TESTING NOISE
MODEL A
10 sales from the homepage
5 sales from the search page
MODEL B
4 sales from the homepage
10 sales from the search page
Model A is better than Model B(?)
Be sure to consider ONLY interactions from result pages ranked by the models you are comparing
16. A/B TESTING NOISE
MODEL A
10 sales from the homepage
5 sales from the search page
MODEL B
4 sales from the homepage
10 sales from the search page
Be sure to consider ONLY interactions from result pages ranked by the models you are comparing
Model A is better than Model B(?)
17. SIGNALS TO MEASURE
● QUERY REFORMULATIONS/BOUNCE RATES
● SALE/REVENUE RATES
● CLICK THROUGH RATES (views, download, add to favourite, ...)
● DWELL TIME (time spent on a search result after the click)
● ….
Recommendation: test for direct correlation!
18. EXPERIMENT DESIGN
● HOW LONG TO TEST
- once the statistical significance is high enough
- generally not less than 2 weeks
● TESTING DIFFERENT PLATFORMS
- desktop, mobile, tablet independently
● HOW MANY MODELS
- keep it as simple as possible
- depends on the amount of available traffic
20. OUR KIBANA IMPLEMENTATION
MAIN MOTIVATING FACTORS
● Limitations of the available online evaluation software
● Being able to use the same metric the model is optimised for
● Being able to rule out external factors/corrupted interactions
21. WHAT IS KIBANA
Kibana is an open-source(?) software that provides search and visualization capabilities for data indexed in Elasticsearch, allowing you to explore and visualize large volumes of data and create detailed reporting dashboards.
22. OUR KIBANA IMPLEMENTATION
STEPS
● Creation of an Elasticsearch instance
● Creation of an index (see the sketch after this list)
● Indexing of user interaction data
● Creation of a Data View
● Creation of visualizations and dashboards for model comparison
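A minimal sketch of the index-creation step with the elasticsearch Python client (v8.x). The index name user_interactions is made up (and reused in the later sketches), the mapping mirrors the fields listed on the next slide, and query stands in for the user query field, whose exact name is not given in the slides:

from elasticsearch import Elasticsearch

# Assumed local instance; adjust the URL to your deployment.
es = Elasticsearch("http://localhost:9200")

# keyword/date/integer types keep the fields aggregatable in Kibana.
es.indices.create(
    index="user_interactions",
    mappings={
        "properties": {
            "testGroup": {"type": "keyword"},
            "query": {"type": "keyword"},
            "timestamp": {"type": "date"},
            "interactionType": {"type": "keyword"},
            "bookId": {"type": "keyword"},
            "queryResultCount": {"type": "integer"},
        }
    },
)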
23. OUR KIBANA IMPLEMENTATION
COLLECT USER INTERACTIONS
FEATURES/FIELDS (an example document follows the list):
● testGroup: identifies the model (name) assigned to a user group
● user query
● timestamp
● interactionType
○ impression (when a document/product is shown to the user)
○ click
○ addToCart
○ sale
● bookId
● queryResultCount (query hits)
● …
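An interaction event indexed against the sketched mapping might look as follows; all values are made up, and query again stands in for the user query field:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# A single click event; every value here is illustrative.
es.index(
    index="user_interactions",
    document={
        "testGroup": "modelA",
        "query": "dystopian novels",
        "timestamp": "2022-11-21T10:15:00Z",
        "interactionType": "click",
        "bookId": "9780451524935",
        "queryResultCount": 87,
    },
)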
24. KIBANA TOOLS
● DISCOVER: the starting point for data exploration
● VISUALIZE: different kinds of visualizations:
- Metric, Line, Bar, Pie, Area, Heat map, Table, etc.
- Custom panels with editors (like Time Series Visual Builder and VEGA)
● DASHBOARD: combine data visualizations into functional dashboards
● ALERTING: monitor data in real time and create customized alert triggers
● MANAGEMENT: manage indices and adjust runtime configuration
https://www.elastic.co/guide/en/kibana/current/get-started.html
30. MODEL EVALUATION
GENERAL EVALUATION (TSVB)
► used for the overall evaluation of the models
► helps to evaluate each model on common metrics (based on domain requirements)
► helps to easily compare them
► useful for discovering bugs in the online testing setup (a nonexistent model name → frontend application issues)
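The kind of per-model metric such a panel shows can be reproduced with a plain aggregation. A sketch (not the talk's actual query) computing impressions, clicks and a derived CTR per testGroup:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Impression/click counts per model, plus CTR as a derived bucket metric.
resp = es.search(
    index="user_interactions",
    size=0,
    aggs={
        "per_model": {
            "terms": {"field": "testGroup"},
            "aggs": {
                "impressions": {"filter": {"term": {"interactionType": "impression"}}},
                "clicks": {"filter": {"term": {"interactionType": "click"}}},
                "ctr": {
                    "bucket_script": {
                        "buckets_path": {
                            "clicks": "clicks._count",
                            "impr": "impressions._count",
                        },
                        "script": "params.impr > 0 ? params.clicks / params.impr : 0",
                    }
                },
            },
        }
    },
)
for bucket in resp["aggregations"]["per_model"]["buckets"]:
    print(bucket["key"], bucket["ctr"]["value"])

An unexpected key among the per_model buckets is exactly the kind of nonexistent model name mentioned above.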
37. MODEL EVALUATION
SINGLE PRODUCT EVALUATION (TSVB)
► used to evaluate model performance on a specific product (document)
► useful for seeing how the model behaves on a product of interest: e.g. best sellers, new products, most reviewed, sponsored, on-sale promotional items, etc.
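Restricting the same per-model breakdown to one product is a matter of adding a filter; a sketch with a made-up bookId:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Interaction counts per model and per interaction type for one product.
resp = es.search(
    index="user_interactions",
    size=0,
    query={"term": {"bookId": "9780451524935"}},
    aggs={
        "per_model": {
            "terms": {"field": "testGroup"},
            "aggs": {"by_type": {"terms": {"field": "interactionType"}}},
        }
    },
)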
39. MODEL EVALUATION
PER MODEL TOP 5 QUERIES (TSVB)
► used to evaluate model performance on specific queries
► calculates multiple metrics for each query
► helpful for in-depth analysis
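A sketch of the underlying data: the five most frequent queries per model, with a click count per query as one example of the "multiple metrics" (query stands in for the user query field):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Top 5 queries per model, each with a click sub-count.
resp = es.search(
    index="user_interactions",
    size=0,
    aggs={
        "per_model": {
            "terms": {"field": "testGroup"},
            "aggs": {
                "top_queries": {
                    "terms": {"field": "query", "size": 5},
                    "aggs": {
                        "clicks": {"filter": {"term": {"interactionType": "click"}}}
                    },
                }
            },
        }
    },
)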
42. MODEL EVALUATION
PER MODEL QUERY COUNT DISTRIBUTION (VERTICAL BAR)
► used to evaluate the distribution of interactions (impressions) on individual queries
► helps identify the most frequent queries for later use in other analyses:
▹ to group by frequency
▹ to calculate specific metrics
45. MODEL EVALUATION
PER MODEL INTERACTIONS DAILY COUNT (DATA TABLE)
► used to evaluate the distribution of data by model and by day
► useful for understanding whether the testing process is distributing the models equally or in the desired percentage
► useful for discovering any imbalances/disruptions
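The table's data reduces to a date histogram with a per-model breakdown; a sketch:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Daily interaction counts broken down by model.
resp = es.search(
    index="user_interactions",
    size=0,
    aggs={
        "per_day": {
            "date_histogram": {"field": "timestamp", "calendar_interval": "day"},
            "aggs": {"per_model": {"terms": {"field": "testGroup"}}},
        }
    },
)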
48. MODEL EVALUATION
BASED ON QUERIES' FREQUENCY (VEGA)
► used to evaluate the performance of models on queries grouped according to the search demand curve
► it may be more interesting to consider the model on the most frequent queries
► may help to consider other approaches (e.g. multiple models)
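One way to build such groups outside Kibana is to split the queries of the demand curve into head/torso/tail by cumulative impression share. A sketch with arbitrary 20%/50% cut-offs (the talk does not specify its grouping):

# Illustrative grouping of queries along the search demand curve.
def demand_curve_groups(buckets, head_share=0.2, torso_share=0.5):
    """`buckets` is a terms-aggregation result: [{"key": ..., "doc_count": ...}]."""
    total = sum(b["doc_count"] for b in buckets)
    groups = {"head": [], "torso": [], "tail": []}
    running = 0
    for b in sorted(buckets, key=lambda b: b["doc_count"], reverse=True):
        running += b["doc_count"]
        if running <= head_share * total:
            groups["head"].append(b["key"])
        elif running <= (head_share + torso_share) * total:
            groups["torso"].append(b["key"])
        else:
            groups["tail"].append(b["key"])
    return groups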
53. COMMON QUERIES
HOW TO COMPARE MODELS ON COMMON QUERIES
1. Query to extract the unique query ids per model → save only the buckets from the response in separate files (e.g. unique_query_ids.json)
2. unique_query_ids_modelA.json and unique_query_ids_modelB.json are the input files for the query_elaboration.py Python script (see the sketch after this list)
3. The stdout of the Python script is the query to be copied into the Kibana filter of the visualizations
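The script itself is not shown in the slides; a hypothetical reconstruction that intersects the two bucket files and prints a terms filter (the query field name is assumed):

import json
import sys

def load_query_ids(path):
    # Each file holds the saved "buckets" array of the aggregation response.
    with open(path) as f:
        return {bucket["key"] for bucket in json.load(f)}

# Query ids seen by both models.
common = sorted(
    load_query_ids("unique_query_ids_modelA.json")
    & load_query_ids("unique_query_ids_modelB.json")
)

# A terms filter over the shared ids, ready to paste into a Kibana filter.
json.dump({"terms": {"query": common}}, sys.stdout, indent=2)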
54. COMMON QUERIES
Query to extract the unique query ids per model (e.g. modelB → unique_query_ids_modelB.json)
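A sketch of such an extraction query and of saving its buckets; the size of 10000 is an arbitrary upper bound on the number of distinct queries, and query again stands in for the user query field:

import json

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Unique query ids seen by modelB; only the buckets are written to file.
resp = es.search(
    index="user_interactions",
    size=0,
    query={"term": {"testGroup": "modelB"}},
    aggs={"unique_query_ids": {"terms": {"field": "query", "size": 10000}}},
)
with open("unique_query_ids_modelB.json", "w") as f:
    json.dump(resp["aggregations"]["unique_query_ids"]["buckets"], f)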
57. MODEL EVALUATION
BASED ON QUERIES' RESULTS/HITS: use the resulting query in the Kibana filter of the visualizations
58. MODEL EVALUATION
BASED ON QUERIES' RESULTS/HITS (VEGA)
► used to evaluate the performance of models on a set of common queries
► used to evaluate the performance of models on queries grouped according to the number of search results returned
► it may be more interesting to consider the model on queries with a larger number of results
► with few search results (like 1 or 3), we expect the difference between models to be almost absent
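Grouping by the number of hits can be sketched with a range aggregation on queryResultCount; the boundaries below are illustrative, not from the talk:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Per-model counts for queries bucketed by the number of returned hits.
resp = es.search(
    index="user_interactions",
    size=0,
    aggs={
        "by_hits": {
            "range": {
                "field": "queryResultCount",
                "ranges": [{"to": 4}, {"from": 4, "to": 50}, {"from": 50}],
            },
            "aggs": {"per_model": {"terms": {"field": "testGroup"}}},
        }
    },
)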
62. KIBANA PROS & CONS
PROS
● Easy-to-use GUI
● Detailed reporting dashboards (aggregating several visualizations)
● Filtering of unwanted data
● Automatic update of visualizations and dashboards
● Export (and import) of visualizations and dashboards (as NDJSON) using the Export objects API
CONS
● Some features are still not available
● Limitations for aggregated data (more complex queries)
● VEGA “limitations” (not simple to use)
● Manual editing of filters (if the data view is renamed or the model is changed)
63. ADDITIONAL REFERENCES
OUR BLOG POSTS ABOUT SEARCH QUALITY EVALUATION
● The importance of online testing in learning to rank part 1
● Online Testing for Learning To Rank: Interleaving
● Offline Search Quality Evaluation: Rated Ranking Evaluator (RRE)
● Apache Solr Learning To Rank Interleaving
Keep an eye on our Blog page, as more is coming!