Berlin Buzzwords 2023
20/06/2023
Anna Ruggero, R&D Software Engineer @ Sease
Ilaria Petreti, ML Software Engineer @ Sease
How To Implement Online Search
Quality Evaluation With Kibana
‣ Born in Padova
‣ R&D Search Software Engineer
‣ Master's in Computer Science at Padova
‣ SIGIR Artifact Evaluation Committee member
‣ Solr, Elasticsearch, OpenSearch expert
‣ Recommender Systems, Big Data, Machine
Learning, Ranking Systems Evaluation
‣ Organist, Latin Music Dancer and Orchid Lover
ANNA RUGGERO
WHO WE ARE
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ ML Search Software Engineer
‣ Master's in Data Science in Rome
‣ Big Data, Machine Learning, Ranking Systems
Evaluation
‣ Sport Lover (Basketball player)
ILARIA PETRETI
WHO WE ARE
● Headquartered in London / distributed team
● Open Source Enthusiasts
● Apache Lucene/Solr/Elasticsearch experts
● Community Contributors
● Active Researchers
● Hot Trends: Neural Search,
Learning To Rank,
Document Similarity,
Search Quality Evaluation,
Relevance Tuning
www.sease.io
SEArch SErvices
OVERVIEW
Online Evaluation
A/B Testing
Our Kibana Implementation
Visualization Examples
Key Takeaways
ONLINE EVALUATION
WHAT IS ONLINE EVALUATION
Online evaluation is one of the most common approaches to measuring the effectiveness of an information retrieval system.
It remains the optimal way to prove how your system performs in a real-world scenario.
ONLINE vs OFFLINE
Both offline and online evaluations are vital for a business.
OFFLINE
● Find anomalies in the data
● Set up a controlled environment, not influenced by the external world
● Check how the model performs before using it in production
● Save time and money
ONLINE
● Assess how the system performs online, with real users
● Measure profit/improvements/regressions on live data
● Can identify anomalies for unexpected queries/user behaviors
● More expensive (time and cost)
WHY IT IS IMPORTANT
There are several problems that are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set yields model evaluation results that do not reflect the real model improvement:
➡ One sample per query
➡ One relevance label for all the samples of a query
➡ Interactions considered for the data set creation
➡ Too small (unrepresentative)
WHY IT IS IMPORTANT
There are several problems that are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set yields model evaluation results that do not reflect the real model improvement.
2. It is hard to find a direct correlation between the offline evaluation metrics and the key performance indicator used for the online model performance evaluation (e.g. revenues).
WHY IT IS IMPORTANT
There are several problems that are hard to detect with an offline evaluation:
1. An incorrect relevance judgement set yields model evaluation results that do not reflect the real model improvement.
2. It is hard to find a direct correlation between the offline evaluation metrics and the key performance indicator used for the online model performance evaluation (e.g. revenues).
3. The offline evaluation is based on implicit/explicit relevance labels that do not always reflect the real user need.
A/B TESTING
A/B TESTING
A/B TESTING NOISE
MODEL A
10 sales from the homepage
5 sales from the search page
MODEL B
3 sales from the homepage
10 sales from the search page
Is Model A better than Model B?
Be sure to consider ONLY interactions from result pages ranked by the models you are comparing
SIGNALS TO MEASURE
● QUERY REFORMULATIONS/BOUNCE RATES
● SALE/REVENUE RATES
● CLICK-THROUGH RATES (views, downloads, add to favourites, ...)
● DWELL TIME (time spent on a search result after the click)
● ….
Recommendation: test for direct correlation!
OUR KIBANA
IMPLEMENTATION
OUR KIBANA IMPLEMENTATION
MAIN MOTIVATING FACTORS
● Limitations of the available online evaluation software
● Be able to use the same metric the model is optimized for
● Be able to rule out external factors/corrupt interactions
WHAT IS KIBANA
“Kibana is software that provides search
and visualization capabilities for indexed
data in Elasticsearch, allowing you to
explore and visualize a large volume of
data and create detailed reporting
dashboards”
OUR KIBANA IMPLEMENTATION
● Creation of an Elasticsearch instance
● Creation of an index (see the sketch below)
● A/B testing setup
● Indexing of the user interaction data
● Creation of a Data View
● Creation of visualizations and dashboards for model comparison
STEPS
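As a rough illustration of the first steps, here is a minimal sketch, assuming the official elasticsearch Python client (8.x-style keyword arguments), a local single-node instance, and a hypothetical index name and mapping types; the fields follow the scenario described on the next slide.

from elasticsearch import Elasticsearch

# Minimal setup sketch: connect to a local instance and create the index.
# Index name and mapping types are assumptions, not the talk's exact setup.
es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="ab-test-interactions",  # hypothetical index name
    mappings={
        "properties": {
            "bookId": {"type": "keyword"},
            "testGroup": {"type": "keyword"},        # model assigned to the user group
            "queryId": {"type": "keyword"},
            "timestamp": {"type": "date"},
            "interactionType": {"type": "keyword"},  # impression | click | addToCart | sale
            "queryResultCount": {"type": "integer"},
            "userDevice": {"type": "keyword"},
        }
    },
)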
OUR KIBANA IMPLEMENTATION
● An e-commerce site selling books
● The index contains:
○ bookId
○ testGroup: the model (name) assigned to a user group
○ queryId
○ timestamp
○ interactionType (impression, click, addToCart, sale)
○ queryResultCount (query hits)
○ userDevice
SCENARIO
Interaction funnel: impression → click → addToCart → sale (a sample document is indexed in the sketch below)
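A hypothetical interaction event, indexed as a single document (one document per event in the funnel); the field values are invented purely for illustration, under the same client and index-name assumptions as the previous sketch.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One document per user interaction event; values are invented examples.
es.index(
    index="ab-test-interactions",
    document={
        "bookId": "book-42",
        "testGroup": "modelA",
        "queryId": "3",
        "timestamp": "2023-06-20T10:15:00Z",
        "interactionType": "click",
        "queryResultCount": 27,
        "userDevice": "mobile",
    },
)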
VISUALIZATION
EXAMPLES
KIBANA TOOLS
https://www.elastic.co/guide/en/kibana/current/tsvb.html
● Tabular form
● Table columns: different custom metric calculations
● Rows: terms aggregations (e.g. testGroup)
TIME SERIES VISUAL BUILDER (TSVB) - TABLE
MODEL EVALUATION
GENERAL EVALUATION
► Used for the overall evaluation of the models
► Helps to evaluate each model on common metrics (based on domain requirements)
► Helps to compare the models easily (an equivalent aggregation is sketched below)
TSVB
► Useful for discovering bugs in the online testing setup (frontend application issues)
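This is not the TSVB configuration itself, only a sketch of an equivalent Elasticsearch aggregation under the same assumptions as the earlier sketches: per test group, count each interaction type and derive a click-through rate client-side. Ratios such as CTR can typically also be configured directly in TSVB.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Per test group, count impressions/clicks/sales with filter sub-aggregations.
resp = es.search(
    index="ab-test-interactions",
    size=0,
    aggs={
        "models": {
            "terms": {"field": "testGroup"},
            "aggs": {
                "impressions": {"filter": {"term": {"interactionType": "impression"}}},
                "clicks": {"filter": {"term": {"interactionType": "click"}}},
                "sales": {"filter": {"term": {"interactionType": "sale"}}},
            },
        }
    },
)

# Derive the per-model metrics from the bucket counts, e.g. click-through rate.
for bucket in resp["aggregations"]["models"]["buckets"]:
    impressions = bucket["impressions"]["doc_count"]
    clicks = bucket["clicks"]["doc_count"]
    ctr = clicks / impressions if impressions else 0.0
    print(bucket["key"], "CTR:", round(ctr, 3))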
MODEL EVALUATION
GENERAL EVALUATION
MODEL EVALUATION
GENERAL EVALUATION
MODEL EVALUATION
GENERAL EVALUATION
MODEL EVALUATION
GENERAL EVALUATION
MODEL EVALUATION
GENERAL EVALUATION
KIBANA TOOLS
DASHBOARD
MODEL EVALUATION
SINGLE PRODUCT EVALUATION
► Used to evaluate model performance on a specific product (document)
► Useful for seeing how the model behaves on a product of interest:
e.g. best sellers, new products, most reviewed, sponsored, on-sale promotional items, etc.
TSVB
MODEL EVALUATION
SINGLE PRODUCT EVALUATION
MODEL EVALUATION
PER MODEL TOP 5 QUERIES
► Used to evaluate model performance on specific queries
► Calculates multiple metrics for each query (see the aggregation sketch below)
► Helpful for in-depth analysis
TSVB (rows keyed by queryId)
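A sketch of the kind of query behind this view (again an Elasticsearch aggregation rather than the actual TSVB configuration, under the same assumptions as the earlier sketches): for each test group, take the five most frequent queryIds and count impressions and clicks per query.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Top 5 queryIds per test group, with per-query interaction counts.
resp = es.search(
    index="ab-test-interactions",
    size=0,
    aggs={
        "models": {
            "terms": {"field": "testGroup"},
            "aggs": {
                "top_queries": {
                    "terms": {"field": "queryId", "size": 5},
                    "aggs": {
                        "impressions": {"filter": {"term": {"interactionType": "impression"}}},
                        "clicks": {"filter": {"term": {"interactionType": "click"}}},
                    },
                }
            },
        }
    },
)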
MODEL EVALUATION
PER MODEL TOP 5 QUERIES
MODEL EVALUATION
PER MODEL TOP 5 QUERIES
KIBANA TOOLS
https://www.elastic.co/guide/en/kibana/current/vega.html
● Custom visualizations
● Query response manipulation
● Highly customizable format
● Query DSL (Domain Specific Language)
VEGA
MODEL EVALUATION
BY QUERIES’ FREQUENCY RANGES
► Used to evaluate the performance of models on queries grouped according to the search demand curve (a grouping sketch follows below)
► May be more interesting to consider the model on the most frequent queries
► May suggest considering other approaches (e.g. multiple models)
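In the talk this grouping is built as a Vega visualization; the sketch below only illustrates the idea in Python under the earlier assumptions: count impressions per queryId, then split the queries into head/torso/tail buckets (the thresholds are arbitrary examples).

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Impression count per queryId (assumes fewer than 10000 distinct queries).
resp = es.search(
    index="ab-test-interactions",
    size=0,
    query={"term": {"interactionType": "impression"}},
    aggs={"per_query": {"terms": {"field": "queryId", "size": 10000}}},
)

# Group the queries along the search demand curve; thresholds are examples.
frequency_ranges = {"head": [], "torso": [], "tail": []}
for bucket in resp["aggregations"]["per_query"]["buckets"]:
    if bucket["doc_count"] >= 100:
        frequency_ranges["head"].append(bucket["key"])
    elif bucket["doc_count"] >= 10:
        frequency_ranges["torso"].append(bucket["key"])
    else:
        frequency_ranges["tail"].append(bucket["key"])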
MODEL EVALUATION
BY QUERIES’ FREQUENCY RANGES
MODEL EVALUATION
BY QUERIES’ FREQUENCY RANGES
MODEL EVALUATION
BY QUERIES’ FREQUENCY RANGES
MODEL EVALUATION
BY QUERIES’ FREQUENCY RANGES
MODEL EVALUATION
BY QUERIES’ RESULTS/HITS RANGES
► Used to evaluate the performance of models on queries grouped according to the number of search results returned (a range-aggregation sketch follows below)
► May be more interesting to consider the model on queries with a larger number of results
► With few search results (e.g. 1 or 3) we expect the difference between models to be almost absent
► Also used to evaluate the performance of models on a set of common queries:
{"bool":{"should":[{"match_phrase":{"queryId":"0"}},{"match_phrase":{"queryId":"10"}},{"match_phrase":{"queryId":"3"}}],...
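Before moving on to the common-queries filter introduced above, here is a sketch of the hits-range grouping as a plain Elasticsearch range aggregation on queryResultCount (not the Vega visualization used in the talk; the range boundaries are arbitrary examples, under the earlier assumptions).

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Bucket interactions by how many results the query returned,
# then break each bucket down per test group.
resp = es.search(
    index="ab-test-interactions",
    size=0,
    aggs={
        "hits_ranges": {
            "range": {
                "field": "queryResultCount",
                "ranges": [
                    {"to": 4},              # 1-3 results
                    {"from": 4, "to": 21},  # 4-20 results
                    {"from": 21},           # 21+ results
                ],
            },
            "aggs": {"models": {"terms": {"field": "testGroup"}}},
        }
    },
)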
QUERIES IN COMMON
1. Extract the unique query ids per model
2. Extract the queries in common between the two sets extracted in 1)
3. Format the result as an Elasticsearch query to be copied into the Kibana visualization filter (see the sketch below)
https://sease.io/2023/06/online-search-quality-evaluation-with-kibana-queries-in-common.html
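A sketch of the three steps above, assuming the official elasticsearch Python client, the hypothetical index from the earlier sketches, and at most 10000 distinct queryIds per model; the resulting JSON is what gets pasted into the Kibana filter (the full query is shown on the next slide).

import json

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def unique_query_ids(model_name):
    # 1. Unique query ids for one model (terms aggregation on queryId).
    resp = es.search(
        index="ab-test-interactions",
        size=0,
        query={"term": {"testGroup": model_name}},
        aggs={"queries": {"terms": {"field": "queryId", "size": 10000}}},
    )
    return {bucket["key"] for bucket in resp["aggregations"]["queries"]["buckets"]}

# 2. Queries in common between the two models under test.
common = unique_query_ids("modelA") & unique_query_ids("modelB")

# 3. Build the bool/should query to paste into the Kibana visualizations filter.
kibana_filter = {
    "bool": {
        "minimum_should_match": 1,
        "should": [{"match": {"queryId": query_id}} for query_id in sorted(common)],
    }
}
print(json.dumps(kibana_filter, indent=2))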
MODEL EVALUATION
BY QUERIES’ RESULTS/HITS RANGES
Use the query in the Kibana visualization filter:
{
  "bool": {
    "minimum_should_match": 1,
    "should": [
      { "match": { "queryId": "0" } },
      { "match": { "queryId": "10" } },
      { "match": { "queryId": "3" } }
    ]
  }
}
KIBANA TOOLS
DASHBOARD
TAKEAWAYS
► Online evaluation allows you to assess how the system performs online with real users
► A/B testing: remove noise
► Why Kibana?
○ Easy-to-use GUI
○ Detailed reporting dashboards
○ Filtering of unwanted data
○ Automatic update of visualizations
○ Export (and import) of visualizations and dashboards (as NDJSON) using the Export objects API (see the sketch below)
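A hedged sketch of exporting dashboards as NDJSON through Kibana's saved objects export API, as documented for recent Kibana versions; the host, credentials and object types are assumptions to adjust for your deployment.

import requests

# Export all dashboards (with their referenced visualizations) as NDJSON.
resp = requests.post(
    "http://localhost:5601/api/saved_objects/_export",
    headers={"kbn-xsrf": "true", "Content-Type": "application/json"},
    json={"type": ["dashboard"], "includeReferencesDeep": True},
)
resp.raise_for_status()

with open("dashboards.ndjson", "w", encoding="utf-8") as output_file:
    output_file.write(resp.text)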
KEY TAKEAWAYS
KEY TAKEAWAYS
► Time Series Visual Builder Table: good for basic comparisons with custom
metrics and easy to set up.
○ General evaluation
○ Single product evaluation
○ Per model most frequent queries
► VEGA: highly customizable in both data and graphical aspects.
○ Evaluation for queries’ frequency ranges
○ Evaluation for queries’ hits/results ranges
ADDITIONAL REFERENCES
OUR BLOG POSTS ABOUT THIS TALK
● https://sease.io/2023/03/online-search-quality-evaluation-with-kibana-introduction.html
● https://sease.io/2023/06/online-search-quality-evaluation-with-kibana-visualization-examples.html
● https://sease.io/2023/06/online-search-quality-evaluation-with-kibana-queries-in-common.html
● https://sease.io/2020/04/the-importance-of-online-testing-in-learning-to-rank-part-1.html
● https://sease.io/2020/05/online-testing-for-learning-to-rank-interleaving.html
OUR BLOG POSTS ABOUT SEARCH QUALITY EVALUATION
THANK YOU!
NOTES
The more observant among you might have noticed that in some of our visualizations the CTR/ATR is higher than 1.
This is due to how the dataset was created (it was generated purely as an example for this presentation).
In a real scenario you should never have a number of clicks greater than the number of impressions (since the user should have seen the product before clicking on it), or a number of add-to-carts greater than the number of clicks.