LONDON INFORMATION RETRIEVAL MEETUP
21/11/2022
Anna Ruggero, R&D Software Engineer @ Sease
Ilaria Petreti, ML Software Engineer @ Sease
How To Implement Your Online Search Quality Evaluation With Kibana
WHO WE ARE
ANNA RUGGERO
‣ Born in Padova
‣ R&D Search Software Engineer
‣ Master in Computer Science at Padova
‣ Member of the SIGIR Artifact Evaluation Committee
‣ Solr, Elasticsearch, OpenSearch expert
‣ Recommender Systems, Big Data, Machine Learning, Ranking Systems Evaluation
‣ Organist, Latin Music Dancer and Orchid Lover
WHO WE ARE
ILARIA PETRETI
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ ML Search Software Engineer
‣ Master in Data Science in Rome
‣ Big Data, Machine Learning, Ranking Systems Evaluation
‣ Sport Lover (Basketball player)
OVERVIEW
Online Evaluation
A/B Testing
Our Kibana Implementation
Visualization Examples
Kibana Pros & Cons
ONLINE EVALUATION
WHAT IS ONLINE EVALUATION
Online evaluation is one of the most common approaches to measuring the effectiveness of an information retrieval system, and it remains the best way to prove how your system performs in a real-world scenario.
ONLINE vs OFFLINE
Both offline and online evaluations are vital for a business.
OFFLINE
● Find anomalies in data
● Set up a controlled environment not influenced by the external world
● Check how the model performs before using it in production
● Save time and money
ONLINE
● Assess how the system performs online with real users
● Measure profit/improvements/regressions on live data
● Can identify anomalies for unexpected queries/user behaviors
● More expensive (in time and cost)
WHY IT IS IMPORTANT
There are several problems that are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set yields model evaluation results that do not reflect the real model improvement:
➡ One sample per query group
➡ One relevance label for all the samples of a query group
➡ Which interactions are considered for the data set creation
➡ A set that is too small (unrepresentative)
2. It is hard to find a direct correlation between the offline evaluation metrics and the key performance indicator used for the online model performance evaluation (e.g. revenue).
3. Offline evaluation is based on implicit/explicit relevance labels that do not always reflect the real user need.
WHY IT IS IMPORTANT
ADVANTAGES
● The interpretability of the results
● The reliability of the results
● The possibility to observe the model behavior and improve it
A/B TESTING
COMPARE MODELS ONLINE
A/B TESTING
● the population is divided into 2 groups
● easier to implement
● the most popular approach
INTERLEAVING
● users are shown both variants by interleaving the results (see the sketch below)
● requires a smaller amount of traffic
● requires a smaller amount of time
● less influenced by user variance
● more sensitive to differences between models
● prevents users from being exposed to a bad system
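The slides do not prescribe a specific interleaving algorithm; as an illustration only, here is a minimal Python sketch of one popular scheme, team-draft interleaving (function and document names are ours, not from the talk):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Merge two rankings: in each round a coin flip decides which model
    drafts first, and each model contributes its highest-ranked document
    not already in the merged list."""
    a, b = list(ranking_a), list(ranking_b)  # copies: don't mutate callers' lists
    interleaved, credit = [], {}             # credit: doc -> model that drafted it
    while len(interleaved) < k and (a or b):
        teams = [("A", a), ("B", b)]
        random.shuffle(teams)                # coin flip for this round
        for team, ranking in teams:
            while ranking and ranking[0] in credit:
                ranking.pop(0)               # skip docs already drafted
            if ranking and len(interleaved) < k:
                doc = ranking.pop(0)
                interleaved.append(doc)
                credit[doc] = team
    return interleaved, credit

merged, credit = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d5"])
# A click on a merged document is credited to credit[doc];
# the model collecting more credited clicks wins the comparison.
```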
A/B TESTING
A/B TESTING NOISE
MODEL A
10 sales from the homepage
5 sales from the search page
MODEL B
4 sales from the homepage
10 sales from the search page
Is Model A better than Model B? Counting total sales it looks so, but on the search page alone Model B wins (10 vs 5).
Be sure to consider ONLY interactions from result pages ranked by the models you are comparing.
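To make the fix concrete: if every logged interaction also carried the page it came from, a bool filter like the following would keep only search-page sales. This is a hypothetical sketch; the sourcePage field is our assumption and is not part of the interaction schema shown later in the talk.

```python
# Hypothetical Elasticsearch request body: count only sale interactions
# that originated from the search results page ranked by the models.
search_page_sales_only = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"interactionType": "sale"}},
                {"term": {"sourcePage": "search"}},  # assumed field, for illustration
            ]
        }
    }
}
```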
SIGNALS TO MEASURE
● QUERY REFORMULATIONS/BOUNCE RATES
● SALE/REVENUE RATES
● CLICK THROUGH RATES (views, download, add to favourite, ...)
● DWELL TIME (time spent on a search result after the click)
● ….
Recommendation: test for direct correlation!
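As a minimal sketch of two such signals (assuming raw interaction tallies per model are available), a click-through rate and a sale rate could be computed as:

```python
def rates(counts):
    """counts: raw interaction tallies for one model,
    e.g. {"impression": 120000, "click": 5400, "sale": 310}."""
    impressions = counts.get("impression", 0)
    clicks = counts.get("click", 0)
    return {
        "ctr": clicks / impressions if impressions else 0.0,             # clicks per impression
        "sale_rate": counts.get("sale", 0) / clicks if clicks else 0.0,  # sales per click
    }

print(rates({"impression": 120000, "click": 5400, "sale": 310}))
# {'ctr': 0.045, 'sale_rate': 0.057...}
```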
EXPERIMENT DESIGN
● HOW LONG TO TEST
- once the statistical significance is high enough
- generally not less than 2 weeks
● TESTING DIFFERENT PLATFORMS
- desktop, mobile, tablet independently
● HOW MANY MODELS
- keep it as simple as possible
- depends on the amount of available traffic
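The slides leave "high enough" statistical significance open; one common choice for a rate metric (our assumption, not something the talk prescribes) is a two-proportion z-test:

```python
import math

def two_proportion_z(success_a, total_a, success_b, total_b):
    """z statistic for comparing two rates, e.g. the CTR of model A vs model B."""
    p_a, p_b = success_a / total_a, success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

z = two_proportion_z(5400, 120000, 5750, 119500)  # illustrative numbers
print(abs(z) > 1.96)  # True -> significant at the 5% level (two-sided)
```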
OUR KIBANA IMPLEMENTATION
OUR KIBANA IMPLEMENTATION
MAIN MOTIVATING FACTORS
● Limitations of the available online evaluation software
● Being able to use the same metric the model was optimized for
● Being able to rule out external factors/corrupt interactions
WHAT IS KIBANA
Kibana is an open-source(?) tool that provides search and visualization capabilities for data indexed in Elasticsearch, allowing you to explore and visualize large volumes of data and create detailed reporting dashboards.
OUR KIBANA IMPLEMENTATION
STEPS
● Creation of an Elasticsearch instance
● Creation of an index
● Indexing of the user interaction data
● Creation of a Data View
● Creation of visualizations and dashboards for model comparison
OUR KIBANA IMPLEMENTATION
COLLECT USER INTERACTIONS
FEATURES/FIELDS (indexed as sketched below):
● testGroup: identifies the model (name) assigned to a user group
● query: the user query
● timestamp
● interactionType:
○ impression (when a document/product is shown to the user)
○ click
○ addToCart
○ sale
● bookId
● queryResultCount (query hits)
● ….
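A minimal sketch of the indexing steps with these fields, using the Python Elasticsearch client (8.x-style API; the index name, URL and sample values are assumptions):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the index with keyword fields so Kibana can aggregate on them.
es.indices.create(
    index="user_interactions",
    mappings={
        "properties": {
            "testGroup": {"type": "keyword"},         # model name assigned to the user group
            "query": {"type": "keyword"},             # user query
            "timestamp": {"type": "date"},
            "interactionType": {"type": "keyword"},   # impression / click / addToCart / sale
            "bookId": {"type": "keyword"},
            "queryResultCount": {"type": "integer"},  # query hits
        }
    },
)

# Index one sample interaction.
es.index(
    index="user_interactions",
    document={
        "testGroup": "modelA",
        "query": "harry potter",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "interactionType": "click",
        "bookId": "b-42",
        "queryResultCount": 135,
    },
)
```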
KIBANA TOOLS
● DISCOVER: the starting point for data exploration
● VISUALIZE: different kinds of visualizations:
- Metric, Line, Bar, Pie, Area, Heat map, Table, etc.
- custom panels with editors (like Time Series Visual Builder and VEGA)
● DASHBOARD: combine data visualizations into functional dashboards
● ALERTING: monitor data in real time and create customized alert triggers
● MANAGEMENT: manage indices and adjust runtime configuration
https://www.elastic.co/guide/en/kibana/current/get-started.html
KIBANA TOOLS
https://www.elastic.co/guide/en/kibana/current/aggregation-reference.html
USED PANEL TYPES
● DATA TABLE
- Tabular form
- Table columns: different predefined metric calculations
- Table rows: terms aggregations (e.g. testGroup)
- Sub-aggregations (e.g. timestamp per day)
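The request body behind such a panel might look like the following sketch (index and field names as assumed above): a terms aggregation on testGroup with a daily date_histogram sub-aggregation.

```python
# Request body for GET user_interactions/_search
per_model_daily = {
    "size": 0,  # we only need the aggregations, not the hits
    "aggs": {
        "per_model": {
            "terms": {"field": "testGroup"},
            "aggs": {
                "per_day": {
                    "date_histogram": {
                        "field": "timestamp",
                        "calendar_interval": "day",
                    }
                }
            },
        }
    },
}
```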
KIBANA TOOLS
https://www.elastic.co/guide/en/kibana/current/aggregation-reference.html
USED PANEL TYPES
● TIME SERIES VISUAL BUILDER (TSVB) - TABLE
- Tabular form
- Table columns: different custom metric calculations
- Rows: terms aggregations (e.g. testGroup)
KIBANA TOOLS
https://www.elastic.co/guide/en/kibana/current/aggregation-reference.html
USED PANEL TYPES
● VERTICAL BAR
- Vertical bar form
- Y-axis: predefined metric calculation
- X-axis: aggregated data (buckets)
KIBANA TOOLS
https://www.elastic.co/guide/en/kibana/current/aggregation-reference.html
USED PANEL TYPES
● VEGA
- Custom visualizations
- Query response manipulation
- Highly customizable formats
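For reference, a Kibana Vega-Lite panel of this kind might start from a spec like the following minimal sketch (index and field names as assumed above; this is illustrative, not the spec used in the talk):

```
{
  $schema: "https://vega.github.io/schema/vega-lite/v5.json"
  data: {
    url: {
      %context%: true            // apply the dashboard's filters
      %timefield%: "timestamp"   // apply the dashboard's time range
      index: "user_interactions"
      body: {size: 0, aggs: {models: {terms: {field: "testGroup"}}}}
    }
    format: {property: "aggregations.models.buckets"}
  }
  mark: "bar"
  encoding: {
    x: {field: "key", type: "nominal", title: "model"}
    y: {field: "doc_count", type: "quantitative", title: "interactions"}
  }
}
```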
VISUALIZATION EXAMPLES
MODEL EVALUATION
GENERAL EVALUATION
TSVB
► used for the overall evaluation of the models
► helps to evaluate each model on common metrics (based on domain requirements)
► helps to easily compare them
► useful for discovering bugs in the online testing setup (a nonexistent model name → frontend application issues)
KIBANA TOOLS
DASHBOARD
MODEL EVALUATION
SINGLE PRODUCT EVALUATION
TSVB
► used to evaluate model performance on a specific product (document)
► useful for seeing how the model behaves on a product of interest, e.g. best sellers, new products, most reviewed, sponsored, on-sale promotional items, etc.
MODEL EVALUATION
PER MODEL TOP 5 QUERIES
TSVB
► used to evaluate model performance on specific queries
► calculates multiple metrics for each query
► helpful for in-depth analysis
MODEL EVALUATION
PER MODEL QUERY COUNT DISTRIBUTION
VERTICAL BAR
► used to evaluate the distribution of interactions (impressions) over individual queries
► helps identify the most frequent queries for later use in other analyses:
▹ to group by frequency
▹ to calculate specific metrics
MODEL EVALUATION
PER MODEL INTERACTIONS DAILY COUNT
DATA TABLE
► used to evaluate the distribution of data by model and by day
► useful for understanding whether the testing process is distributing the models equally or in the desired percentage
► useful for discovering any imbalances/disruptions
MODEL EVALUATION
BASED ON QUERIES’ FREQUENCY
VEGA
► used to evaluate the performance of models on queries grouped according to the search demand curve
► it may be more interesting to consider the model on the most frequent queries
► may help in considering other approaches (e.g. multiple models)
COMMON QUERIES
HOW TO COMPARE MODELS ON COMMON QUERIES
1. Run a query to extract the unique query ids per model → save only the buckets from the response in separate files (e.g. unique_query_ids.json)
2. unique_query_ids_modelA.json and unique_query_ids_modelB.json are the input files for the query_elaboration.py Python script (sketches of both steps follow below)
3. The stdout of the Python script is the query to copy into the Kibana filter of the visualizations
COMMON QUERIES
Query to extract the unique query ids per model (e.g. modelB → unique_query_ids_modelB.json)
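A plausible shape for that extraction (the exact query was shown as a screenshot; index and field names are the ones assumed earlier) is a filtered terms aggregation with a large size:

```python
# Request body for GET user_interactions/_search: unique queries for modelB.
unique_query_ids_modelB = {
    "size": 0,
    "query": {"bool": {"filter": [{"term": {"testGroup": "modelB"}}]}},
    "aggs": {"unique_queries": {"terms": {"field": "query", "size": 10000}}},
}
# Save only response["aggregations"]["unique_queries"]["buckets"]
# to unique_query_ids_modelB.json (and likewise for modelA).
```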
COMMON QUERIES
query_elaboration.py
INPUT:
- modelA_query_file: unique_query_ids_modelA.json
- modelB_query_file: unique_query_ids_modelB.json
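The script itself was not shown in the deck; under the assumptions above (bucket files as saved in step 1), a minimal sketch of what query_elaboration.py might do is:

```python
# query_elaboration.py -- hypothetical sketch: intersect the unique query
# ids of two models and print a filter to paste into Kibana
# (Add filter -> Edit as Query DSL).
import json
import sys

def load_keys(path):
    """Read a saved buckets file: a JSON list of {"key": ..., "doc_count": ...}."""
    with open(path) as f:
        return {bucket["key"] for bucket in json.load(f)}

common = sorted(load_keys(sys.argv[1]) & load_keys(sys.argv[2]))
print(json.dumps({"terms": {"query": common}}))

# Usage:
#   python query_elaboration.py unique_query_ids_modelA.json unique_query_ids_modelB.json
```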
MODEL EVALUATION
BASED ON QUERIES’ RESULTS/HITS
Use the resulting query in the Kibana filter of the visualizations.
MODEL EVALUATION
BASED ON QUERIES’ RESULTS/HITS
VEGA
► used to evaluate the performance of models on a set of common queries
► used to evaluate the performance of models on queries grouped according to the number of search results returned
► it may be more interesting to consider the model on queries with a larger number of results
► with few search results (e.g. 1 or 3) we expect the difference between models to be almost absent
KIBANA TOOLS
DASHBOARD
KIBANA PROS & CONS
PROS
● Easy-to-use GUI
● Detailed reporting dashboards (aggregating several visualizations)
● Filtering of unwanted data
● Automatic update of visualizations and dashboards
● Export (and import) of visualizations and dashboards (as NDJSON) using the Export objects API (see the sketch below)
CONS
● Some features are still not available
● Limitations for aggregated data (more complex queries)
● VEGA “limitations” (not simple to use)
● Manual editing of filters (if the data view is renamed or the model is changed)
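A sketch of that export, assuming a local Kibana instance (URL and object type are assumptions; the endpoint is Kibana's saved objects export API):

```python
import requests

# Export all dashboards (and the objects they reference) as NDJSON.
resp = requests.post(
    "http://localhost:5601/api/saved_objects/_export",
    headers={"kbn-xsrf": "true", "Content-Type": "application/json"},
    json={"type": ["dashboard"], "includeReferencesDeep": True},
)
with open("dashboards.ndjson", "wb") as f:
    f.write(resp.content)
```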
ADDITIONAL REFERENCES
● The importance of online testing in learning to rank part 1
● Online Testing for Learning To Rank: Interleaving
● Offline Search Quality Evaluation: Rated Ranking Evaluator (RRE)
● Apache Solr Learning To Rank Interleaving
Keep an eye on our Blog page, as more is coming!
OUR BLOG POSTS ABOUT SEARCH QUALITY EVALUATION
THANK YOU!