SlideShare a Scribd company logo

How the Lucene More Like This Works

Sease
Sease

The More Like This search functionality is a key feature in Apache Lucene that allows to find similar documents to an input one (text or document). Being widely used but rarely explored, this presentation will start introducing how the MLT works internally. The focus of the talk is to improve the general understanding of MLT and the way you could benefit from it. Building on the introduction the focus will be on the BM25 text similarity function and how this has been (tentatively) included in the MLT through a conspicious refactor and testing process, to improve the identification of the most interesting terms from the input that can drive the similarity search. The presentation will include real world usage examples, proposed patches, pending contributions and future developments such as improved query building through positional phrase queries.

1 of 30
Download to read offline
Apache Lucene/Solr London User Group
How the Lucene More Like
This Works
Alessandro Benedetti, Software Engineer
16th May 2019
Apache Lucene/Solr London User GroupWho I am
▪ Search Consultant
▪ R&D Software Engineer
▪ Master in Computer Science
▪ Apache Lucene/Solr Enthusiast
▪ Semantic, NLP, Machine Learning
Technologies passionate
▪ Beach Volleyball Player & Snowboarder
Alessandro Benedetti
Apache Lucene/Solr London User GroupSease
Search Services
● Open Source Enthusiasts
● Apache Lucene/Solr experts
! Community Contributors
● Active Researchers
● Hot Trends : Learning To Rank, Document Similarity,
Search Quality Evaluation, Relevancy Tuning
Apache Lucene/Solr London User Group
● Document Similarity
● Apache Lucene More Like This
! Term Scorer
! BM25
● Interesting Terms Retrieval
● Query Building
! DEMO
! Future Work
! JIRA References
Agenda
Apache Lucene/Solr London User Group
Document Similarity
Problem : find similar documents to a seed one
Solution(s) :

● Collaborative approach 

(users interactions)
● Content Based
● Hybrid
Similar ? 

● Documents accessed in
similar manners by similar
people
● Terms distributions
● All of above
Apache Lucene/Solr London User Group
Real World Use Cases - Streaming Services
Ad

Recommended

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Solr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW TechnologySolr Graph Query: Presented by Kevin Watters, KMW Technology
Solr Graph Query: Presented by Kevin Watters, KMW TechnologyLucidworks
 
Text tagging with finite state transducers
Text tagging with finite state transducersText tagging with finite state transducers
Text tagging with finite state transducerslucenerevolution
 
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.jsHeeJung Hwang
 
PostgreSQL Performance Tuning
PostgreSQL Performance TuningPostgreSQL Performance Tuning
PostgreSQL Performance Tuningelliando dias
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 

More Related Content

What's hot

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene Dawid Weiss- Finite state automata in lucene
Dawid Weiss- Finite state automata in luceneLucidworks (Archived)
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
 
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Jukka Zitting
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]MongoDB
 
Tweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities ExplainedTweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities ExplainedSematext Group, Inc.
 
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정PgDay.Seoul
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
OpenSearch.pdf
OpenSearch.pdfOpenSearch.pdf
OpenSearch.pdfAbhi Jain
 
Architecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash WorkshopArchitecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash WorkshopSudhir Tonse
 
SQL Monitoring in Oracle Database 12c
SQL Monitoring in Oracle Database 12cSQL Monitoring in Oracle Database 12c
SQL Monitoring in Oracle Database 12cTanel Poder
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfSease
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 

What's hot (20)

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Dawid Weiss- Finite state automata in lucene
 Dawid Weiss- Finite state automata in lucene Dawid Weiss- Finite state automata in lucene
Dawid Weiss- Finite state automata in lucene
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
Naver속도의, 속도에 의한, 속도를 위한 몽고DB (네이버 컨텐츠검색과 몽고DB) [Naver]
 
Tweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities ExplainedTweaking the Base Score: Lucene/Solr Similarities Explained
Tweaking the Base Score: Lucene/Solr Similarities Explained
 
PostgreSQL
PostgreSQLPostgreSQL
PostgreSQL
 
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정
[pgday.Seoul 2022] 서비스개편시 PostgreSQL 도입기 - 진소린 & 김태정
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
OpenSearch.pdf
OpenSearch.pdfOpenSearch.pdf
OpenSearch.pdf
 
Architecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash WorkshopArchitecting for the Cloud using NetflixOSS - Codemash Workshop
Architecting for the Cloud using NetflixOSS - Codemash Workshop
 
SQL Monitoring in Oracle Database 12c
SQL Monitoring in Oracle Database 12cSQL Monitoring in Oracle Database 12c
SQL Monitoring in Oracle Database 12c
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Elk
Elk Elk
Elk
 

Similar to How the Lucene More Like This Works

Faceted search using Solr and Ontopia
Faceted search using Solr and OntopiaFaceted search using Solr and Ontopia
Faceted search using Solr and OntopiaGeir Ove Grønmo
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneSease
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAlessandro Benedetti
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1 GokulD
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5israelekpo
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...Andrei Lopatenko
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6DEEPAK KHETAWAT
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSease
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveAndrea Gazzarini
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in SolrTommaso Teofili
 
Search explained T3DD15
Search explained T3DD15Search explained T3DD15
Search explained T3DD15Hans Höchtl
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemTrey Grainger
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solrsagar chaturvedi
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Search Intelligence @elo7.com
Search Intelligence @elo7.comSearch Intelligence @elo7.com
Search Intelligence @elo7.comFernando Meyer
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studyCharlie Hull
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4LKoji Sekiguchi
 
ApacheCon NA 2011 report
ApacheCon NA 2011 reportApacheCon NA 2011 report
ApacheCon NA 2011 reportKoji Kawamura
 

Similar to How the Lucene More Like This Works (20)

Faceted search using Solr and Ontopia
Faceted search using Solr and OntopiaFaceted search using Solr and Ontopia
Faceted search using Solr and Ontopia
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache Lucene
 
Lucene Bootcamp -1
Lucene Bootcamp -1 Lucene Bootcamp -1
Lucene Bootcamp -1
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Search explained T3DD15
Search explained T3DD15Search explained T3DD15
Search explained T3DD15
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Search Intelligence @elo7.com
Search Intelligence @elo7.comSearch Intelligence @elo7.com
Search Intelligence @elo7.com
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4L
 
ApacheCon NA 2011 report
ApacheCon NA 2011 reportApacheCon NA 2011 report
ApacheCon NA 2011 report
 

More from Sease

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors LuceneSease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneSease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveSease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache SolrSease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale IndexingSease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfSease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxSease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingSease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Sease
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationSease
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information RetrievalSease
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationSease
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to RankSease
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusSease
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupSease
 

More from Sease (20)

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache Lucene
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information Retrieval
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to Rank
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
 

Recently uploaded

"Testing of Helm Charts or There and Back Again", Yura Rochniak
"Testing of Helm Charts or There and Back Again", Yura Rochniak"Testing of Helm Charts or There and Back Again", Yura Rochniak
"Testing of Helm Charts or There and Back Again", Yura RochniakFwdays
 
IT Nation Evolve event 2024 - Quarter 1
IT Nation Evolve event 2024  - Quarter 1IT Nation Evolve event 2024  - Quarter 1
IT Nation Evolve event 2024 - Quarter 1Inbay UK
 
Relationship Counselling: From Disjointed Features to Product-First Thinking ...
Relationship Counselling: From Disjointed Features to Product-First Thinking ...Relationship Counselling: From Disjointed Features to Product-First Thinking ...
Relationship Counselling: From Disjointed Features to Product-First Thinking ...Product School
 
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...Neo4j
 
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...Product School
 
From Challenger to Champion: How SpiraPlan Outperforms JIRA+Plugins
From Challenger to Champion: How SpiraPlan Outperforms JIRA+PluginsFrom Challenger to Champion: How SpiraPlan Outperforms JIRA+Plugins
From Challenger to Champion: How SpiraPlan Outperforms JIRA+PluginsInflectra
 
"How we created an SRE team in Temabit as a part of FOZZY Group in conditions...
"How we created an SRE team in Temabit as a part of FOZZY Group in conditions..."How we created an SRE team in Temabit as a part of FOZZY Group in conditions...
"How we created an SRE team in Temabit as a part of FOZZY Group in conditions...Fwdays
 
Digital Transformation Strategy & Plan Templates - www.beyondthecloud.digital...
Digital Transformation Strategy & Plan Templates - www.beyondthecloud.digital...Digital Transformation Strategy & Plan Templates - www.beyondthecloud.digital...
Digital Transformation Strategy & Plan Templates - www.beyondthecloud.digital...MarcovanHurne2
 
Power of 2024 - WITforce Odyssey.pptx.pdf
Power of 2024 - WITforce Odyssey.pptx.pdfPower of 2024 - WITforce Odyssey.pptx.pdf
Power of 2024 - WITforce Odyssey.pptx.pdfkatalinjordans1
 
LF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIELF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIEDanBrown980551
 
21ST CENTURY LITERACY FROM TRADITIONAL TO MODERN
21ST CENTURY LITERACY FROM TRADITIONAL TO MODERN21ST CENTURY LITERACY FROM TRADITIONAL TO MODERN
21ST CENTURY LITERACY FROM TRADITIONAL TO MODERNRonnelBaroc
 
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptxThe Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptxNeo4j
 
Introduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVAIntroduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVARobert McDermott
 
Apex Replay Debugger and Salesforce Platform Events.pptx
Apex Replay Debugger and Salesforce Platform Events.pptxApex Replay Debugger and Salesforce Platform Events.pptx
Apex Replay Debugger and Salesforce Platform Events.pptxmohayyudin7826
 
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...Adrian Sanabria
 
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...UiPathCommunity
 
Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...
Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...
Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...Product School
 
"AIRe - AI Reliability Engineering", Denys Vasyliev
"AIRe - AI Reliability Engineering", Denys Vasyliev"AIRe - AI Reliability Engineering", Denys Vasyliev
"AIRe - AI Reliability Engineering", Denys VasylievFwdays
 
How we think about an advisor tech stack
How we think about an advisor tech stackHow we think about an advisor tech stack
How we think about an advisor tech stackSummit
 

Recently uploaded (20)

"Testing of Helm Charts or There and Back Again", Yura Rochniak
"Testing of Helm Charts or There and Back Again", Yura Rochniak"Testing of Helm Charts or There and Back Again", Yura Rochniak
"Testing of Helm Charts or There and Back Again", Yura Rochniak
 
IT Nation Evolve event 2024 - Quarter 1
IT Nation Evolve event 2024  - Quarter 1IT Nation Evolve event 2024  - Quarter 1
IT Nation Evolve event 2024 - Quarter 1
 
Relationship Counselling: From Disjointed Features to Product-First Thinking ...
Relationship Counselling: From Disjointed Features to Product-First Thinking ...Relationship Counselling: From Disjointed Features to Product-First Thinking ...
Relationship Counselling: From Disjointed Features to Product-First Thinking ...
 
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
ASTRAZENECA. Knowledge Graphs Powering a Fast-moving Global Life Sciences Org...
 
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
Harnessing the Power of GenAI for Exceptional Product Outcomes by Booking.com...
 
From Challenger to Champion: How SpiraPlan Outperforms JIRA+Plugins
From Challenger to Champion: How SpiraPlan Outperforms JIRA+PluginsFrom Challenger to Champion: How SpiraPlan Outperforms JIRA+Plugins
From Challenger to Champion: How SpiraPlan Outperforms JIRA+Plugins
 
"How we created an SRE team in Temabit as a part of FOZZY Group in conditions...
"How we created an SRE team in Temabit as a part of FOZZY Group in conditions..."How we created an SRE team in Temabit as a part of FOZZY Group in conditions...
"How we created an SRE team in Temabit as a part of FOZZY Group in conditions...
 
Digital Transformation Strategy & Plan Templates - www.beyondthecloud.digital...
Digital Transformation Strategy & Plan Templates - www.beyondthecloud.digital...Digital Transformation Strategy & Plan Templates - www.beyondthecloud.digital...
Digital Transformation Strategy & Plan Templates - www.beyondthecloud.digital...
 
Power of 2024 - WITforce Odyssey.pptx.pdf
Power of 2024 - WITforce Odyssey.pptx.pdfPower of 2024 - WITforce Odyssey.pptx.pdf
Power of 2024 - WITforce Odyssey.pptx.pdf
 
LF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIELF Energy Webinar: Introduction to TROLIE
LF Energy Webinar: Introduction to TROLIE
 
21ST CENTURY LITERACY FROM TRADITIONAL TO MODERN
21ST CENTURY LITERACY FROM TRADITIONAL TO MODERN21ST CENTURY LITERACY FROM TRADITIONAL TO MODERN
21ST CENTURY LITERACY FROM TRADITIONAL TO MODERN
 
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptxThe Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
The Art of the Possible with Graph by Dr Jim Webber Neo4j.pptx
 
Introduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVAIntroduction to Multimodal LLMs with LLaVA
Introduction to Multimodal LLMs with LLaVA
 
Apex Replay Debugger and Salesforce Platform Events.pptx
Apex Replay Debugger and Salesforce Platform Events.pptxApex Replay Debugger and Salesforce Platform Events.pptx
Apex Replay Debugger and Salesforce Platform Events.pptx
 
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
Early Tech Adoption: Foolish or Pragmatic? - 17th ISACA South Florida WOW Con...
 
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
Dev Dives: Leverage APIs and Gen AI to power automations for RPA and software...
 
In sharing we trust. Taking advantage of a diverse consortium to build a tran...
In sharing we trust. Taking advantage of a diverse consortium to build a tran...In sharing we trust. Taking advantage of a diverse consortium to build a tran...
In sharing we trust. Taking advantage of a diverse consortium to build a tran...
 
Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...
Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...
Cultivating Entrepreneurial Mindset in Product Management: Strategies for Suc...
 
"AIRe - AI Reliability Engineering", Denys Vasyliev
"AIRe - AI Reliability Engineering", Denys Vasyliev"AIRe - AI Reliability Engineering", Denys Vasyliev
"AIRe - AI Reliability Engineering", Denys Vasyliev
 
How we think about an advisor tech stack
How we think about an advisor tech stackHow we think about an advisor tech stack
How we think about an advisor tech stack
 

How the Lucene More Like This Works

  • 1. Apache Lucene/Solr London User Group How the Lucene More Like This Works Alessandro Benedetti, Software Engineer 16th May 2019
  • 2. Apache Lucene/Solr London User GroupWho I am ▪ Search Consultant ▪ R&D Software Engineer ▪ Master in Computer Science ▪ Apache Lucene/Solr Enthusiast ▪ Semantic, NLP, Machine Learning Technologies passionate ▪ Beach Volleyball Player & Snowboarder Alessandro Benedetti
  • 3. Apache Lucene/Solr London User GroupSease Search Services ● Open Source Enthusiasts ● Apache Lucene/Solr experts ! Community Contributors ● Active Researchers ● Hot Trends : Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
  • 4. Apache Lucene/Solr London User Group ● Document Similarity ● Apache Lucene More Like This ! Term Scorer ! BM25 ● Interesting Terms Retrieval ● Query Building ! DEMO ! Future Work ! JIRA References Agenda
  • 5. Apache Lucene/Solr London User Group Document Similarity Problem : find similar documents to a seed one Solution(s) :
 ● Collaborative approach 
 (users interactions) ● Content Based ● Hybrid Similar ? 
 ● Documents accessed in similar manners by similar people ● Terms distributions ● All of above
  • 6. Apache Lucene/Solr London User Group Real World Use Cases - Streaming Services
  • 7. Apache Lucene/Solr London User Group Real World Use Cases - Hotels
  • 8. Apache Lucene/Solr London User Group Apache Lucene Apache LuceneTM is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.
  • 9. Apache Lucene/Solr London User Group ● Search Library (java) ● Structured Documents ! Inverted Index ! Similarity Metrics ( TF-IDF, BM25) ! Fast Search ! Support for advanced queries ! Relevancy tuning Apache Lucene
  • 10. Apache Lucene/Solr London User Group Inverted Index Indexing
  • 11. Apache Lucene/Solr London User Group Input Document More Like This Params Interesting Terms
 Retriever Term Scorer Query Builder QUERY More Like This - Break Up
  • 12. Apache Lucene/Solr London User Group Responsibility : define a set of parameters (and defaults) that affect the various components of the More Like This module ● Regulate MLT behavior ● Groups parameters specific to each component ● Javadoc documentation ● Default values ! Useful container for various parameters to be passed More Like This Params
  • 13. Apache Lucene/Solr London User Group ● Field Name ● Field Stats ( Document Count) ● Term Stats ( Document Frequency) ! Term Frequency ! TF-IDF -> tf * (log ( numDocs / docFreq + 1) + 1) ! BM25 Term Scorer Responsibility : assign a score to a term that measure how distinctive is the term for the document in input
  • 14. Apache Lucene/Solr London User Group ! Origin from Probabilistic Information Retrieval ! Default Similarity from Lucene 6.0 [1] ! 25th iteration in improving TF-IDF ! TF ! IDF ! Document Length [1] LUCENE-6789 BM25 Term Scorer
  • 15. Apache Lucene/Solr London User Group BM25 Term Scorer - Inverse Document Frequency IDF Score
 has very similar behavior
  • 16. Apache Lucene/Solr London User Group BM25 Term Scorer - Term Frequency TF Score
 approaches
 asymptotically (k+1)
 
 k=1.2 in this example
  • 17. Apache Lucene/Solr London User Group BM25 Term Scorer - Document Length Document Length /
 Avg Document Length
 
 affects how fast we saturate TF score
  • 18. Apache Lucene/Solr London User Group Responsibility : retrieve from the document a queue of weighted interesting terms Params Used ! Analyzer ! Max Num Token Parsed ! Min Term Frequency ! Min/Max Document Frequency ! Max Query Terms ! Query Time Field Boost Interesting Term Retriever ! Analyze content / Term Vector ! Skip Tokens ! Score Tokens ! Build Queue of Top Scored terms
  • 19. Apache Lucene/Solr London User Group Params Used ! Term Boost Enabled More Like This Query Builder Field1 :
 Term1 Field2 :
 Term2 Field1 :
 Term3 Field1 :
 Term4 Field3 :
 Term5 3.0 4.0 4.5 4.8 7.5 Q = Field1:Term1^3.0 Field2:Term2^4.0 Field1:Term3^4.5 Field1:Term4^4.8 Field3:Term5^7.5
  • 20. Apache Lucene/Solr London User Group Term Boost ! on/off ! Affect each term weight in the MLT query ! It is the term score 
 ( it depends of the Term Scorer implementation chosen) More Like This Boost Field Boost ! field1^5.0 field2^2.0 field3^1.5 ! Affect Term Scorer ! Affect the interesting terms retrieved N.B. a highly boosted field can dominate the interesting terms retrieval
  • 21. Apache Lucene/Solr London User Group More Like This Usage - Lucene Classification ! Given a document D to classify ! K Nearest Neighbours Classifier ! Find Top K similar documents to D ( MLT) ! Classes are extracted ! Class Frequency + Class ranking -> Class probability
  • 22. Apache Lucene/Solr London User Group More Like This Usage - Apache Solr ! More Like This query parser
 ( can be concatenated with other queries) ! More Like This search component
 ( can be assigned to a Request Handler) ! More Like This handler
 ( handler with specific request parameters)
  • 23. Apache Lucene/Solr London User Group More Like This Demo - Movie Data Set This data consists of the following fields: ● id - unique identifier for the movie ● Title - Name of the movie ● Directors - The person(s) who directed the making of the film ● Genres - The genre(s) that the movie belongs to
  • 24. Apache Lucene/Solr London User Group More Like This Demo - Tuned ! Enable/Disable Term Boost ! Min Term Frequency ! Min Document Frequency ! Field Boost
  • 25. Apache Lucene/Solr London User Group Future Work ! Query Builder just use Terms and Term Score ! Term Positions ? ! Phrase Queries Boost
 (for terms close in position) ! Sentence boundaries ! Field centric vs Document centric
 ( should high boosted fields kick out
 relevant terms from low boosted fields)
  • 26. Apache Lucene/Solr London User Group Future Work - More Like These ! Multiple documents in input ! Interesting terms across documents ● Useful for Content Based recommender engines
  • 27. Apache Lucene/Solr London User Group Pros ● Apache Lucene Module ! Advanced Params ! Input : 
 - structured document
 - just text ! Build an advanced query ! Leverage the Inverted Index
 ( and additional data structures) More Like This Cons ● Massive single class ! Low cohesion ! Low readability ! Minimum test coverage ! Difficult to extend
 ( and improve)
  • 28. Apache Lucene/Solr London User Group ● LUCENE-7498 - Introducing BM25 Term Scorer ● LUCENE-7802 - Architectural Refactor ● LUCENE-8326 - MLT Params Refactor JIRA References
  • 29. Apache Lucene/Solr London User Group Questions ?
  • 30. Apache Lucene/Solr London User GroupThanks!