Search diversity
Daniel Wärnå, AI Engineer
Dmitry Kan, Principal AI Researcher
Who we are
What we do
Trusted AI partner. We deliver AI-driven solutions
and products to our clients by providing world-class
expertise and tooling.
Vision
AI for people. A world with safe human-centric AI
that frees the human mind for meaningful work.
200+ Experts
100+ PhDs
Network of 500+
Nordics
Helsinki, Turku, Tampere,
Jyväskylä, Oulu,
Stockholm, Copenhagen
US
Palo Alto
Largest private AI lab from the Nordics
DACH
Switzerland
UK
London
Daniel
● First AI engineer at Silo.AI
● Full-stack engineer
● Practices circus :)
Search diversity
Silo.AI
Dmitry
● Built a few search engines: vertical and web
● Hosts Vector Podcast
● Helps out with Quepid
About us
A search application takes a user’s search query and
attempts to return content that the user is satisfied with
A search engine should:
- Understand the query
- Find relevant items
- Show found items in a relevant order
- But sometimes imprecise is more useful
Related work
1. Daniel Tunkelang
2. Jo Kristian Bergum (Vespa)
3. Eric Pugh
Thoughts on Search Result Diversity
● https://dtunkelang.medium.com/thoughts-on-search-result-diversity-1df54cb5bf4a
● Diversity sometimes fits broad queries (“men shoes”), but not all broad queries (“shoes” -> show facets)
● Diversity never works for ambiguous queries (“mixer”: a kitchen appliance or an audio product?)
● Diversity works best for aspects that aren’t constraints: style, color or material
● Balance between result desirability and result diversity
● Use stratified sampling, not necessarily with a uniform distribution
● Use KL-divergence as a penalty for how far off our distribution is from the desired one (greedy vs combinatorial explosion)
Daniel Tunkelang
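The KL-divergence penalty from the last bullet can be sketched in a few lines. This is a minimal illustration, not code from the article: the function names, the colour categories, the desired distribution and the direction of the divergence (observed relative to desired) are all assumptions made for the example.

```python
import math
from collections import Counter

def category_distribution(categories):
    """Share of each category among the shown results."""
    counts = Counter(categories)
    n = len(categories)
    return {c: k / n for c, k in counts.items()}

def kl_divergence(observed, desired, eps=1e-9):
    """KL(observed || desired) in nats; eps guards categories missing from `desired`."""
    total = 0.0
    for c in set(observed) | set(desired):
        p = observed.get(c, 0.0)
        q = max(desired.get(c, 0.0), eps)
        if p > 0:
            total += p * math.log(p / q)
    return total

# Hypothetical target mix for a colour facet.
desired = {"black": 0.5, "brown": 0.3, "white": 0.2}

top_a = ["black"] * 5 + ["brown"] * 3 + ["white"] * 2   # matches the target mix
top_b = ["black"] * 10                                   # no diversity at all

penalty_a = kl_divergence(category_distribution(top_a), desired)
penalty_b = kl_divergence(category_distribution(top_b), desired)
```

A result list that matches the target mix gets a penalty of zero, and the penalty grows as the shown mix drifts away from it, so it can be subtracted from a desirability score when comparing candidate lists.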
Result Diversification with Vespa
● https://blog.vespa.ai/result-diversification-with-vespa/
● Types of retrieval: keyword, filters and nearest neighbors
● Grouping method is used for diversification
● Group by a predefined category: say colour
● Complex rules for group reordering, e.g. order(-max(relevance())*count())
● Scaling diversity is an important consideration
Jo Kristian Bergum
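To make the grouping idea concrete, here is a rough emulation in plain Python. It is not Vespa's grouping API: the hit dictionaries, the field names `colour` and `relevance`, and the helper `group_and_order` are hypothetical, and the ordering only mirrors the expression order(-max(relevance())*count()) from the slide.

```python
from collections import defaultdict

def group_and_order(hits, key="colour", per_group=1):
    """Group hits by `key`, order groups by -max(relevance) * count, keep the best hits per group."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[key]].append(hit)

    def group_rank(members):
        # Ascending sort on a negative value puts strong, large groups first.
        return -max(h["relevance"] for h in members) * len(members)

    result = []
    for members in sorted(groups.values(), key=group_rank):
        best_first = sorted(members, key=lambda h: h["relevance"], reverse=True)
        result.extend(best_first[:per_group])
    return result

hits = [
    {"id": "a", "colour": "red", "relevance": 0.9},
    {"id": "b", "colour": "red", "relevance": 0.8},
    {"id": "c", "colour": "blue", "relevance": 0.7},
    {"id": "d", "colour": "green", "relevance": 0.95},
]
diverse = group_and_order(hits)  # one hit per colour
```

With one hit per group, the second red item is dropped and every colour is represented once in the output.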
https://www.youtube.com/watch?v=pnSv2fLC3lU
Entropy = Importance x Uncertainty
The more important an event in the system, the
more uncertainty it can introduce into the system
http://bit.ly/measure-diversity
Search Diversity
[Figure: a scale running from “More precise” to “More diverse”, with a “Sweet spot” marked on it]
Search Diversity
[Diagram: Query -> Search engine -> Diversification -> Result]
Search Diversity
In practice we need to:
- Define what makes two items similar/dissimilar
- Reorder a list of items so that the end result is more
diverse
Diversification methods
■ Randomization
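The slides do not spell out how randomization is applied. One possible reading, sketched here purely as an assumption, is to shuffle only a window at the top of the ranking and leave the tail untouched; `shuffle_window`, `window` and `seed` are hypothetical names.

```python
import random

def shuffle_window(ranked, window=10, seed=None):
    """Shuffle the first `window` results; the rest keep their original order."""
    rng = random.Random(seed)
    head = list(ranked[:window])
    rng.shuffle(head)
    return head + list(ranked[window:])

reordered = shuffle_window(list(range(20)), window=5, seed=42)
```

Keeping the window small limits how far a highly relevant hit can fall.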
Diversification methods
- Randomization
■ Rule based
Rule based diversity
Alternate between different sources, taking K items from each in turn
For example, if we have sources S1, S2, S3
For K = 1, result [S1, S2, S3, S1, S2, S3, …]
For K = 2, result [S1, S1, S2, S2, S3, S3, …]
Which source goes first (here S1) could be decided by, e.g., the _score or a weight
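The alternation rule can be sketched as a simple interleave. The function name `interleave_sources` and the sample item labels are made up for illustration; the sources are assumed to arrive already in priority order (for example sorted by a weight).

```python
def interleave_sources(sources, k=1):
    """Take up to k items from each source in turn until every source is exhausted."""
    queues = [list(s) for s in sources]
    positions = [0] * len(queues)
    result = []
    while any(pos < len(q) for pos, q in zip(positions, queues)):
        for i, q in enumerate(queues):
            chunk = q[positions[i]:positions[i] + k]
            result.extend(chunk)
            positions[i] += len(chunk)
    return result

s1 = ["S1-a", "S1-b"]
s2 = ["S2-a", "S2-b"]
s3 = ["S3-a", "S3-b"]

k1 = interleave_sources([s1, s2, s3], k=1)
k2 = interleave_sources([s1, s2, s3], k=2)
```

Sources of unequal length are handled by skipping whichever source has run out.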
Diversification methods
- Randomization
- Rule based
■ Entropy based
Entropy based diversity
Weighted and Entropy@K
■ One key takeaway from entropy-based methods: high entropy -> diversity
■ For a set of hits from a query, one can compute entropy, given that we have probabilities
■ Multiple ways to compute probabilities
  ■ By aggregating counts of a category
  ■ By normalizing the scores of the hits
■ Entropy@K
  ■ Given the top 50 hits, we can generate new samples by reordering the indices
  ■ Compute the entropy of the top K (K=10) of each sample
  ■ Pick the sample with the highest entropy
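A minimal sketch of Entropy@K, assuming probabilities come from category counts and that samples are generated by plain random shuffles of the hit list. The names `entropy`, `entropy_at_k` and the toy "news"/"blog" categories are illustrative, not taken from the talk.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the category counts in `labels`."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_at_k(hits, category_of, k=10, samples=100, seed=0):
    """Sample reorderings of `hits` and keep the one whose top-k has the highest entropy."""
    rng = random.Random(seed)
    best = list(hits)
    best_h = entropy([category_of(h) for h in best[:k]])
    for _ in range(samples):
        candidate = list(hits)
        rng.shuffle(candidate)
        h = entropy([category_of(x) for x in candidate[:k]])
        if h > best_h:
            best, best_h = candidate, h
    return best, best_h

# 50 hits: the original top 10 are all "news", so their entropy is zero.
hits = [("doc%d" % i, "news" if i < 40 else "blog") for i in range(50)]
baseline = entropy([cat for _, cat in hits[:10]])
best, best_h = entropy_at_k(hits, category_of=lambda h: h[1], k=10)
```

A full shuffle ignores relevance, so a practical variant would likely restrict how far items may move or combine the entropy with a relevance score.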
Entropy based diversity
Weighted and Entropy@K
■ Weighted entropy
  ■ By weighting the document rank, one can force similar hits not to appear at the top of the results
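The slide does not give a formula for the weighting, so the following is only one possible interpretation: each hit contributes a rank-discounted weight of 1 / log2(rank + 1) to its category, and the entropy is computed over the resulting category masses. Both the discount and the name `rank_weighted_entropy` are assumptions.

```python
import math
from collections import defaultdict

def rank_weighted_entropy(categories):
    """Entropy (bits) of category masses, where top ranks weigh more than lower ranks."""
    mass = defaultdict(float)
    for rank, category in enumerate(categories, start=1):
        mass[category] += 1.0 / math.log2(rank + 1)
    total = sum(mass.values())
    return -sum((m / total) * math.log2(m / total) for m in mass.values())

grouped = ["A", "A", "A", "B", "B", "B"]       # similar hits stacked at the top
alternating = ["A", "B", "A", "B", "A", "B"]   # categories mixed from the first rank

h_grouped = rank_weighted_entropy(grouped)
h_alternating = rank_weighted_entropy(alternating)
```

Both lists have the same unweighted entropy, but under this weighting the alternating list scores higher, which penalizes runs of similar hits at the top.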
Diversification methods
- Randomization
- Rule based
- Entropy based
■ K-Means clustering
K-Means clustering
■ Clustering: a machine learning tool for discovering the group structure underlying the data
■ K-means clustering algorithm: creates the cluster allocations by iteratively computing the centroids/means of the data points
■ Data points within the same cluster are considered to be related to each other
■ To bring diversity:
  ■ We can show data points allocated to different clusters first and repeat until we exhaust all cluster allocations
  ■ Or, by increasing the number of clusters, we can just display the different cluster groups
Simplest clustering model
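The first option (show one item per cluster, then repeat) could look roughly like the sketch below. It uses a tiny pure-Python k-means seeded with the first k points so that it runs without dependencies; in practice a library implementation would be the more likely choice. The 2-D points, document ids and function names are invented for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def round_robin_by_cluster(ranked_ids, labels):
    """Re-index a ranked list so consecutive items come from different clusters."""
    buckets = {}
    for doc_id, label in zip(ranked_ids, labels):
        buckets.setdefault(label, []).append(doc_id)
    queues = list(buckets.values())
    result = []
    while any(queues):
        for q in queues:
            if q:
                result.append(q.pop(0))
    return result

ranked_ids = ["d0", "d1", "d2", "d3", "d4", "d5"]
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

labels = kmeans(points, k=2)
reordered = round_robin_by_cluster(ranked_ids, labels)
```

Within each cluster the original ranking order is preserved, so the most relevant item of every cluster surfaces first.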
K-Means clustering
[Figure: simplest clustering-based diversification with K-Means (K=2), showing the indices before re-indexing and the indices after k-means re-indexing]
Diversification methods
- Randomization
- Rule based
- Entropy based
- K-Means clustering
■ DPP - Determinantal point process
DPP - Determinantal Point Process
■ DPP has connections to physics: thermal equilibrium, quantum physics
■ Also used in ML: image search, document / video summarization, product recommendation
■ Maximum a posteriori (MAP) inference is NP-hard
■ Greedy methods try to speed things up: O(M^4) -> O(M^3) with approximations
■ We used the exact method, which gives O(M^3)
■ Runs much faster than the approximate O(M^3) method in practice
■ Code: https://github.com/laming-chen/fast-map-dpp based on fast greedy MAP inference
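For intuition only, here is a naive greedy MAP selection that recomputes a determinant for every candidate. It is not the fast algorithm from the linked repository and is far slower; the kernel construction L[i][j] = q[i] * S[i][j] * q[j] (relevance times similarity) and all names and numbers are assumptions made for the example.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def greedy_dpp(kernel, max_items):
    """Repeatedly add the item that maximizes the determinant of the selected submatrix."""
    selected = []
    remaining = list(range(len(kernel)))
    while remaining and len(selected) < max_items:
        def det_with(i):
            idx = selected + [i]
            return determinant([[kernel[r][c] for c in idx] for r in idx])
        best = max(remaining, key=det_with)
        if det_with(best) <= 1e-12:
            break  # every remaining item is (numerically) redundant
        selected.append(best)
        remaining.remove(best)
    return selected

# Items 0 and 1 are near-duplicates; item 2 is different but slightly less relevant.
q = [1.0, 0.9, 0.8]
S = [[1.0, 0.95, 0.1],
     [0.95, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
kernel = [[q[i] * S[i][j] * q[j] for j in range(3)] for i in range(3)]

picked = greedy_dpp(kernel, max_items=2)
```

The second pick skips the near-duplicate of the first item even though it is more relevant than the third, which is the trade-off between relevance and diversity that the determinant encodes.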
How to integrate DPP/K-Means
■ Both methods require some kind of similarity measure
■ Similarity can be derived from, e.g.:
  ■ Sentence embeddings
  ■ Encoding categorical data as a vector
  ■ Recommender system output
■ Over-querying results from the underlying search engine might produce more diverse results
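As a small illustration of the categorical case, the sketch below one-hot encodes a category and compares vectors with cosine similarity. The colour vocabulary and helper names are hypothetical; the same `cosine_similarity` would apply to sentence embeddings from any model.

```python
import math

def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

colours = ["red", "green", "blue"]
same = cosine_similarity(one_hot("red", colours), one_hot("red", colours))
different = cosine_similarity(one_hot("red", colours), one_hot("blue", colours))
```

Pairwise similarities like these can fill the similarity matrix of a DPP kernel, and the vectors themselves can serve as input points for K-Means.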
Measuring results
- Manually evaluate query relevance: did relevance drop due to diversification?
- User metrics: did click-through rate (or some other metric) increase or decrease?
Thank you!
- Daniel’s twitter: @danielwarna
- Dmitry’s twitter: @DmitryKan
- Vector Podcast: https://bit.ly/3HVsvcg
Sources and links
Quepid: https://quepid.com/
Entropy tool: http://bit.ly/measure-diversity
How to measure diversity talk: https://2021.berlinbuzzwords.de/session/how-measure-diversity-search-results
https://towardsdatascience.com/what-is-shannons-entropy-5ad1b5a83ce1
https://dtunkelang.medium.com/thoughts-on-search-result-diversity-1df54cb5bf4a
https://blog.vespa.ai/result-diversification-with-vespa/
6th International Conference on Machine Learning & Applications (CMLA 2024)
ClaraZara1
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Soumen Santra
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Christina Lin
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
Kamal Acharya
 

Recently uploaded (20)

Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdfThe Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)6th International Conference on Machine Learning & Applications (CMLA 2024)
6th International Conference on Machine Learning & Applications (CMLA 2024)
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesHarnessing WebAssembly for Real-time Stateless Streaming Pipelines
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines
 
Water billing management system project report.pdf
Water billing management system project report.pdfWater billing management system project report.pdf
Water billing management system project report.pdf
 

Haystack LIVE! - 5 ways to increase result diversity at web-scale - Dmitry Kan & Daniel Wärnå

  • 1. Search diversity Daniel Wärnå, AI Engineer Dmitry Kan, Principal AI Researcher
  • 2. Who we are What we do Trusted AI partner. We deliver AI-driven solutions and products to our clients by providing world-class expertise and tooling. Vision AI for people. A world with safe human-centric AI that frees the human mind for meaningful work. 200+ Experts 100+ PhDs Network of 500+ Nordics Helsinki, Turku, Tampere, Jyväskylä, Oulu, Stockholm, Copenhagen US Palo Alto Largest private AI lab from the Nordics DACH Switzerland UK London
  • 3. Daniel ● First AI engineer at Silo.AI ● Full-stack engineer ● Practices circus :) Search diversity Silo.AI Dmitry ● Built a few search engines: vertical and web ● Hosts Vector Podcast ● Helps out with Quepid About us
  • 4. A search application takes a user’s search query and attempts to return content that the user is satisfied with. A search engine should: understand the query; find relevant items; show found items in a relevant order. But sometimes imprecise is more useful.
  • 5. Related work: 1. Daniel Tunkelang 2. Jo Kristian Bergum (Vespa) 3. Eric Pugh
  • 6. Thoughts on Search Result Diversity (Daniel Tunkelang) ● https://dtunkelang.medium.com/thoughts-on-search-result-diversity-1df54cb5bf4a ● Diversity sometimes fits broad queries (“men shoes”), but not all broad queries (“shoes” -> show facets) ● Diversity never works for ambiguous queries (“mixer”: a kitchen or an audio product item?) ● Diversity works best for aspects that aren’t constraints: style, color or material ● Balance between result desirability and result diversity ● Use stratified sampling, not necessarily with uniform distribution ● Use KL-divergence as a penalty for how far off our distribution is from a desired one (greedy vs combinatorial explosion)
  • 7. Result Diversification with Vespa (Jo Kristian Bergum) ● https://blog.vespa.ai/result-diversification-with-vespa/ ● Types of retrieval: keyword, filters and nearest neighbors ● Grouping method is used for diversification ● Group by a predefined category: say colour ● Complex rules for group reordering, e.g. order(-max(relevance())*count()) ● Scaling diversity is an important consideration
  • 12. Entropy = Importance x Uncertainty. The more important an event in the system, the more uncertainty it can introduce into the system.
  • 14. Search Diversity [figure labels: More precise, More diverse, Sweet spot]
  • 16. Search Diversity [diagram labels: Search engine, Query, Diversification, Result]
  • 17. Search Diversity. In practice we need to: define what makes two items similar or dissimilar; reorder a list of items so that the end result is more diverse.
  • 20. Rule based diversity. Alternate between different sources. For example, if we have sources S1, S2, S3: for K = 1, the result is [S1, S2, S3, S1, S2, S3, …]; for K = 2, the result is [S1, S1, S2, S2, S3, S3, …]. The choice of selecting S1 could be based on the _score or a weight.
  • 21. Diversification methods: Randomization; Rule based; ■ Entropy based
  • 22. Entropy based diversity: Weighted and Entropy@K ■ One key takeaway from entropy based methods is: high entropy -> diversity ■ For a set of hits from a query, one can compute the entropy, given that we have probabilities ■ There are multiple ways to compute probabilities: by aggregating counts of a category, or by normalizing the scores of the hits ■ Entropy@K: given the top 50 hits, we can generate new samples by reordering the indices; compute the entropy of the top K (K=10) of each sample; pick the sampled index with the highest entropy
  • 23. Entropy based diversity: Weighted and Entropy@K ■ Weighted entropy: by weighting the document rank, one can force similar hits not to appear at the top of the results
  • 24. Diversification methods: Randomization; Rule based; Entropy based; ■ K-Means clustering
  • 25. K-Means clustering (simplest clustering model) ■ Clustering: a machine learning tool for discovering the group structure underlying the data ■ K-means clustering algorithm: creates the cluster allocations by iteratively computing the centroids (means) of the data points ■ Data points within a cluster are considered to be relevant ■ To bring diversity: we can show data points allocated to different clusters first and repeat until we exhaust all cluster allocations; or, by increasing the number of clusters, we can simply display different cluster groups
  • 26. K-Means clustering: simplest clustering based diversification [figure labels: Indices before re-indexing, K-Means (K=2), Indices after k-means re-indexing]
  • 27. K-Means clustering: simplest clustering based diversification [figure labels: Indices before re-indexing, K-Means (K=2), Indices after k-means re-indexing]
  • 28. Diversification methods: Randomization; Rule based; Entropy based; K-Means clustering; ■ DPP - Determinantal point process
  • 29. DPP - Determinantal Point Process ■ DPP has a connection to physics: thermal equilibrium, quantum physics ■ Also used in ML: image search, document / video summarization, product recommendation ■ Maximum a posteriori (MAP) inference is NP-hard ■ Greedy methods try to speed things up: O(M^4) -> O(M^3) with approximations ■ We used the exact method, which gives O(M^3) ■ It runs much faster than the approximate O(M^3) in practice ■ Code: https://github.com/laming-chen/fast-map-dpp based on fast greedy MAP inference
  • 30. How to integrate DPP/K-Means ■ Both methods require some kind of similarity measure ■ Similarity can be derived from, e.g.: sentence embeddings; encoding categorical data as a vector; recommender system output ■ Over-querying results from the underlying search engine might produce more diverse results
  • 31. Measuring results: manually evaluate query relevancy (did relevancy drop due to diversification?); user metrics (did click-through rate, or some other metric, increase or decrease?)
  • 33. Thank you! Daniel’s twitter: @danielwarna; Dmitry’s twitter: @DmitryKan; Vector Podcast: https://bit.ly/3HVsvcg
  • 34. Sources and links. Quepid: https://quepid.com/ Entropy tool: http://bit.ly/measure-diversity How to measure diversity talk: https://2021.berlinbuzzwords.de/session/how-measure-diversity-search-results https://towardsdatascience.com/what-is-shannons-entropy-5ad1b5a83ce1 https://dtunkelang.medium.com/thoughts-on-search-result-diversity-1df54cb5bf4a https://blog.vespa.ai/result-diversification-with-vespa/
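
The rule based alternation from slide 20 (take up to K hits from each source in turn) can be sketched roughly as follows. This is only an illustration, not the presenters' code: the function name `interleave_by_source` and the `(source, doc_id)` hit layout are assumptions, and the source order is simply the order of first appearance in the relevance-sorted input, standing in for the "_score or weight" choice mentioned on the slide.

```python
from collections import deque


def interleave_by_source(hits, k=1):
    """Reorder hits so that at most k consecutive hits come from one source.

    hits: list of (source, doc_id) tuples, assumed already sorted by relevance.
    Sources are visited in order of first appearance (hypothetical choice).
    """
    # One FIFO queue per source; dicts keep insertion order.
    queues = {}
    for source, doc_id in hits:
        queues.setdefault(source, deque()).append((source, doc_id))

    result = []
    while queues:
        # Iterate over a snapshot so exhausted sources can be dropped safely.
        for source in list(queues):
            queue = queues[source]
            for _ in range(min(k, len(queue))):
                result.append(queue.popleft())
            if not queue:
                del queues[source]
    return result


if __name__ == "__main__":
    hits = [("S1", "a"), ("S1", "b"), ("S2", "c"),
            ("S3", "d"), ("S2", "e"), ("S3", "f")]
    print([s for s, _ in interleave_by_source(hits, k=1)])
    print([s for s, _ in interleave_by_source(hits, k=2)])
```

Within each source the original relevance order is preserved, so the reordering only changes how sources are mixed, not which hit of a source is shown first.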

Editor's Notes

  1. From DT: Focus on demand more than on supply
  2. Given a set of hits from a query, one can compute its entropy, provided we have probabilities for each hit. There are two ways to compute probabilities (there can be more): either by aggregating counts of a given category, or by normalizing the scores of the hits. Entropy@K, where K is 10: given the top 50 hits, randomly generate new samples by reordering the indices, compute the entropy of the top K=10 hits of each sample, and pick the sampled index with the highest entropy.
  3. Weighted entropy can also be used as a measure of diversity, by weighting the index locations: for example, the first entry gets a high weight and subsequent entries get lower weights. If we then sample the indices, compute the weighted entropy of each sample and pick a sample with high weighted entropy, we end up with a diverse enough sample. This essentially penalizes similar hits being displayed at the top.
  4. Pick items from clusters starting from centroids, then paginate. Or show cluster after cluster, starting from the most narrow cluster: it has higher diversity / sparsity in data.
  5. The numbers mean the rank of an item on the screen
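
The Entropy@K procedure in note 2 could look roughly like the sketch below. It assumes probabilities come from category counts (the first of the two options in the note); the names `category_entropy` and `entropy_at_k`, the `(doc_id, category)` hit layout, and the `n_samples` and `seed` parameters are hypothetical and not taken from the talk. Plain random shuffling ignores relevance, so in practice the candidate pool would be the already-relevant top hits (50 in the note).

```python
import math
import random
from collections import Counter


def category_entropy(categories):
    """Shannon entropy (in bits) of the category distribution of a list."""
    counts = Counter(categories)
    total = len(categories)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_at_k(hits, k=10, n_samples=200, seed=0):
    """Sample random reorderings of hits; keep the one whose top k is most diverse.

    hits: list of (doc_id, category) tuples, e.g. the top 50 hits of a query.
    Returns (best_order, entropy_of_its_top_k). The original order is the
    baseline, so the result is never less diverse than the input.
    """
    rng = random.Random(seed)
    best_order = list(hits)
    best_entropy = category_entropy([cat for _, cat in best_order[:k]])
    for _ in range(n_samples):
        candidate = list(hits)
        rng.shuffle(candidate)  # one random reordering of the indices
        entropy = category_entropy([cat for _, cat in candidate[:k]])
        if entropy > best_entropy:
            best_order, best_entropy = candidate, entropy
    return best_order, best_entropy


if __name__ == "__main__":
    # 40 "red" hits ranked above 10 "blue" hits: the original top 10 has entropy 0.
    hits = [(i, "red") for i in range(40)] + [(i, "blue") for i in range(40, 50)]
    order, entropy = entropy_at_k(hits, k=10)
    print(entropy, [cat for _, cat in order[:10]])
```

The weighted variant in note 3 would replace `category_entropy` with a version that weights each position by its rank; that is left out here to keep the sketch short.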