SlideShare a Scribd company logo
1 of 41
Download to read offline
Grouping & Joining
                Martijn van Groningen
                martijn.vangroningen@searchworkings.com
                Lucene Committer & PMC Member




Thursday, May 17, 2012
Grouping & Joining

       Overview
      ‣ Background



      ‣ Joining



      ‣ Result grouping



      ‣ Conclusion




                          Searchworkings.org - The online search community   2
Thursday, May 17, 2012
Background

       Lucene’s model
      ‣ Lucene is document based.



      ‣ Lucene doesn’t store information about relations between documents.



      ‣ Data often holds relations.



      ‣ Good free text search over relational data.




                          Searchworkings.org - The online search community    3
Thursday, May 17, 2012
Background

       Example
         ‣ Product

               ‣ Name

               ‣ Description

         ‣ Product-item

               ‣ Color

               ‣ Size

               ‣ Price



         ‣ Goal: Show the most applicable product based on product-item criteria.
                               Searchworkings.org - The online search community   4
Thursday, May 17, 2012
Background

       Common Lucene solutions
      ‣ Compound documents.

            ‣ May result in documents with many fields.

      ‣ Subsequent searches.

            ‣ May cause a lot network overhead.




      ‣ Non Lucene based approach:

            ‣ If free text search isn’t very important use a relational database.


                             Searchworkings.org - The online search community       5
Thursday, May 17, 2012
Background

       Example domain
      ‣ Compound Product & Product-items document.

      ‣ Each product-item has its own field prefix.




                          Searchworkings.org - The online search community   6
Thursday, May 17, 2012
Background

       Different solutions
      ‣ Lucene offers solutions to have a 'relational' like search.

            ‣ Parent child



      ‣ Grouping & joining aren't naturally supported.

            ‣ All the solutions do increase the search time.



      ‣ Some scenarios grouping and joining isn't the right solution.




                             Searchworkings.org - The online search community   7
Thursday, May 17, 2012
Joining
                Modelling relations




Thursday, May 17, 2012
Joining

       Introduction
      ‣ Support for parent child like search from Lucene 3.4

            ‣ Not a SQL join.



      ‣ The parent and each children are stored as documents.



      ‣ Two types:

            ‣ Index time join

            ‣ Query time join


                                Searchworkings.org - The online search community   9
Thursday, May 17, 2012
Joining

       Index time join
      ‣ Two block join queries:

            ‣ ToParentBlockJoinQuery

            ‣ ToChildBlockJoinQuery



      ‣ One Lucene collector:

            ‣ ToParentBlockJoinCollector



      ‣ Index time join requires block indexing.


                            Searchworkings.org - The online search community   10
Thursday, May 17, 2012
Joining

       Block indexing
      ‣ Atomically adding documents.

            ‣ A block of documents.



      ‣ Each document gets sequentially assigned Lucene document id.



      ‣ IndexWriter#addDocuments(docs);




                            Searchworkings.org - The online search community   11
Thursday, May 17, 2012
Joining

       Block indexing
      ‣ Index doesn't record blocks.

            ‣ Segment merging doesn’t re-order documents in a segment.



      ‣ App is responsible for identifying block documents.

            ‣ Marking the last document in a block.



      ‣ Adding a document to a block requires you to reindex the whole block.

      ‣ Removing a document from a block doesn’t requires reindexing a block.


                            Searchworkings.org - The online search community    12
Thursday, May 17, 2012
Joining

       Example domain
      ‣ Parent is the last document in a block.




                          Searchworkings.org - The online search community   13
Thursday, May 17, 2012
Joining

       Block indexing

                             Marking parent documents




                         Searchworkings.org - The online search community   14
Thursday, May 17, 2012
Joining

       Block indexing




                                                                            Add block




                                                                             Add block



                         Searchworkings.org - The online search community                15
Thursday, May 17, 2012
Joining

       ToParentBlockJoinQuery
      ‣ Parent filter marks the parent documents.



      ‣ Child query is executed in the parent space.




      ‣ ToChildBlockJoinQuery works in the opposite direction.
                         Searchworkings.org - The online search community   16
Thursday, May 17, 2012
Joining

       Query time joining
      ‣ Query time joining is executed in two phases and is field based:

            ‣ fromField

            ‣ toField



      ‣ Doesn’t require block indexing.




                          Searchworkings.org - The online search community   17
Thursday, May 17, 2012
Joining

       Query time joining
      ‣ First phase collects all the terms in the fromField for the documents
        that match with the original query.

            ‣ Currently doesn’t take the score from original query into account.



      ‣ The second phase returns the documents that match with the collected
        terms from the previous phase in the toField.



      ‣ Two different implementations:

            ‣ JoinUtil - Lucene (≥ 3.6)

            ‣ Join query parser - Solr (trunk)
                             Searchworkings.org - The online search community      18
Thursday, May 17, 2012
Joining

       Query time joining - Indexing




                           Referrer the product id.
                         Searchworkings.org - The online search community   19
Thursday, May 17, 2012
Joining

       Query time joining - Indexing




                         Searchworkings.org - The online search community   20
Thursday, May 17, 2012
Joining

       Query time joining




      ‣ Result will contain one product.

      ‣ Possible to join over two indices.




                          Searchworkings.org - The online search community   21
Thursday, May 17, 2012
Joining

       Final thoughts
      ‣ Joining module has good solutions to model parent child relations.



      ‣ Use block join if you care about scoring.

            ‣ Frequent updates can be problematic.

      ‣ Use query time join for parent child filtering.

            ‣ Query time join is slower than index time join.



      ‣ Mostly a Lucene feature only.

      ‣ All code is annotated as experimental.
                             Searchworkings.org - The online search community   22
Thursday, May 17, 2012
Result grouping
                Previously known as Field Collapsing.




Thursday, May 17, 2012
Result grouping

       Introduction
      ‣ Group matching documents that share a common property.



      ‣ Search hit represents a group.

            ‣ Facet counts & total hit count represent groups.



      ‣ Per group collect information

            ‣ Most relevant document.

            ‣ Top three documents.

            ‣ Aggregated counts
                             Searchworkings.org - The online search community   24
Thursday, May 17, 2012
Result grouping

       Usages
      ‣ Group documents by a shared property

            ‣ Product-item by product id (Parent child)



      ‣ Collapse similar looking documents

            ‣ E.g. all results from the Wikipedia domains.



      ‣ Remove duplicates from the search result.

            ‣ Based on a field that contains a hash


                             Searchworkings.org - The online search community   25
Thursday, May 17, 2012
Result grouping

       Example domain




      ‣ Each Product-item is a document, but includes the product data.


                         Searchworkings.org - The online search community   26
Thursday, May 17, 2012
Result grouping

       Implementation
      ‣ Result grouping implemented with Lucene collectors.

            ‣ Module in trunk and a contrib in 3.x versions.



      ‣ Two pass result grouping.

            ‣ Grouping by indexed field, function or doc values.



      ‣ Single pass result grouping.

            ‣ Requires block indexing.


                             Searchworkings.org - The online search community   27
Thursday, May 17, 2012
Result grouping

       Two pass implementation
      ‣ First pass collects the top N groups.

            ‣ Per group: group value + sort value



      ‣ Second pass collects data for each top group.

            ‣ The top N documents per group.

            ‣ Possible other aggregated information.



      ‣ Second pass search ignores all documents outside topN groups.


                            Searchworkings.org - The online search community   28
Thursday, May 17, 2012
Result grouping

       Result grouping - Indexing




                         Searchworkings.org - The online search community   29
Thursday, May 17, 2012
Result grouping

       Result grouping - Searching




                         Searchworkings.org - The online search community   30
Thursday, May 17, 2012
Result grouping

       Result grouping made easier
     ‣ GroupingSearch




     ‣ Solr

         ‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id

     ‣ Many more options:

         ‣ http://wiki.apache.org/solr/FieldCollapsing
                            Searchworkings.org - The online search community     31
Thursday, May 17, 2012
Result grouping

       Parent child result
      ‣ TopGroups - Equivalent to TopDocs.

            ‣ Hit count

            ‣ Group count

            ‣ Groups

                  ‣ Top documents



      ‣ Facet and total count can represent groups instead of documents.

            ‣ But requires more query time.


                              Searchworkings.org - The online search community   32
Thursday, May 17, 2012
Conclusion
                Compare...




Thursday, May 17, 2012
Conclusion

       Compare the parent child solutions
      ‣ Result grouping

            ‣ + Distributed support & Parent child relation as hit.

            ‣ - Parent data duplication

            ‣ - Impact on query time



      ‣ Joining

            ‣ + Fast & no data duplication

            ‣ - Index time join not optimal for updates

            ‣ - Query time join is limited.
                              Searchworkings.org - The online search community   34
Thursday, May 17, 2012
Conclusion

       Compare the parent child solutions
      ‣ Compound documents.

            ‣ + Fast and works out-of-the box with all features.

            ‣ - Not flexible when it comes to updates.

            ‣ - Document granularity is set in stone.




                             Searchworkings.org - The online search community   35
Thursday, May 17, 2012
Any questions?




                                          36

Thursday, May 17, 2012
Extra slides
                We have time left!




Thursday, May 17, 2012
Conclusion

       Future work
       ‣ Higher level parent-child API.

             ‣ Needs to cover search & indexing.



       ‣ Joining

             ‣ Distributed support.

             ‣ Represent a hit as a parent child relation in the search result.



       ‣ Result grouping

             ‣ Aggregated grouped information like: sum, avg, min, max etc.
                              Searchworkings.org - The online search community    38
Thursday, May 17, 2012
Joining

       ToParentBlockJoinCollector




         ‣ TopGroups contains a group per top N parent document.

         ‣ Each group contains a parent and child documents.

                          Searchworkings.org - The online search community   39
Thursday, May 17, 2012
Result grouping

       Groups & facet counts
      ‣ Faceting and result grouping are different features.

            ‣ But are often used together!



      ‣ Facet counts can be based on:

            ‣ Found documents.

            ‣ Found groups.

            ‣ Combination of facet value and group.



      ‣ All options are supported in Solr.
                              Searchworkings.org - The online search community   40
Thursday, May 17, 2012
Result grouping

       Doc values
      ‣ Doc values / Column Stride values



      ‣ Prevents the creation of expensive data structures in FieldCache.



      ‣ Inverted index is meant for free text search.



      ‣ All grouping collectors have doc values based implementations!




                          Searchworkings.org - The online search community   41
Thursday, May 17, 2012

More Related Content

What's hot

Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
An introduction to React.js
An introduction to React.jsAn introduction to React.js
An introduction to React.jsEmanuele DelBono
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at ScaleMongoDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingDatabricks
 
Redux Toolkit - Quick Intro - 2022
Redux Toolkit - Quick Intro - 2022Redux Toolkit - Quick Intro - 2022
Redux Toolkit - Quick Intro - 2022Fabio Biondi
 
Introduction to PostgreSQL
Introduction to PostgreSQLIntroduction to PostgreSQL
Introduction to PostgreSQLJoel Brewer
 
Redis cluster
Redis clusterRedis cluster
Redis clusteriammutex
 
Streams In C# Tutorial
Streams In C# TutorialStreams In C# Tutorial
Streams In C# TutorialSimplilearn
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Jay Patel
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Databricks
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023Steve Pember
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 

What's hot (20)

Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
An introduction to React.js
An introduction to React.jsAn introduction to React.js
An introduction to React.js
 
MongoDB at Scale
MongoDB at ScaleMongoDB at Scale
MongoDB at Scale
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Postgresql
PostgresqlPostgresql
Postgresql
 
Ngrx slides
Ngrx slidesNgrx slides
Ngrx slides
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Redux Toolkit - Quick Intro - 2022
Redux Toolkit - Quick Intro - 2022Redux Toolkit - Quick Intro - 2022
Redux Toolkit - Quick Intro - 2022
 
Introduction to PostgreSQL
Introduction to PostgreSQLIntroduction to PostgreSQL
Introduction to PostgreSQL
 
Redis cluster
Redis clusterRedis cluster
Redis cluster
 
Streams In C# Tutorial
Streams In C# TutorialStreams In C# Tutorial
Streams In C# Tutorial
 
MongoDB
MongoDBMongoDB
MongoDB
 
Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012Cassandra at eBay - Cassandra Summit 2012
Cassandra at eBay - Cassandra Summit 2012
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?Koalas: How Well Does Koalas Work?
Koalas: How Well Does Koalas Work?
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
Anatomy of a Spring Boot App with Clean Architecture - Spring I/O 2023
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Introduction to Redux
Introduction to ReduxIntroduction to Redux
Introduction to Redux
 

Similar to Grouping and Joining in Lucene/Solr

Symfony2 and MongoDB
Symfony2 and MongoDBSymfony2 and MongoDB
Symfony2 and MongoDBPablo Godel
 
Prateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State UniversityPrateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State UniversityPrateek Jain
 
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesLinked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesPrateek Jain
 
Pedagogy coloda
Pedagogy colodaPedagogy coloda
Pedagogy colodaKBTNHKU
 
The W3C and the web design ecosystem
The W3C and the web design ecosystemThe W3C and the web design ecosystem
The W3C and the web design ecosystemChris Mills
 
ORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout sessionORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout sessionGudmundur Thorisson
 
Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012Pablo Godel
 
DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing VU University Amsterdam
 
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsSplendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsOlafGoerlitz
 
Linked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable developmentLinked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable developmentMartin Kaltenböck
 
Linked Open Data: The Essentials
Linked Open Data: The EssentialsLinked Open Data: The Essentials
Linked Open Data: The Essentialsreeep
 
Moving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked DataMoving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked DataPrateek Jain
 
Introduction to the BioJS project
Introduction to the BioJS projectIntroduction to the BioJS project
Introduction to the BioJS projectRafael C. Jimenez
 
Open standards and open source mean open for business cms expo session mc-k...
Open standards and open source mean open for business   cms expo session mc-k...Open standards and open source mean open for business   cms expo session mc-k...
Open standards and open source mean open for business cms expo session mc-k...Cheryl McKinnon
 
Late March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar UpdatedLate March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar UpdatedB. Hamilton
 
Construction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsConstruction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsSutanay Choudhury
 

Similar to Grouping and Joining in Lucene/Solr (20)

Symfony2 and MongoDB
Symfony2 and MongoDBSymfony2 and MongoDB
Symfony2 and MongoDB
 
Prateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State UniversityPrateek Jain dissertation defense, Kno.e.sis, Wright State University
Prateek Jain dissertation defense, Kno.e.sis, Wright State University
 
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and QueryingPrateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
Prateek Jain's Dissertation Defense - Linked Open Data Alignment and Querying
 
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based TechniquesLinked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
Linked Open Data Alignment and Enrichment Using Bootstrapping Based Techniques
 
PhD Proposal Defense - Prateek Jain
PhD Proposal Defense - Prateek JainPhD Proposal Defense - Prateek Jain
PhD Proposal Defense - Prateek Jain
 
Pedagogy coloda
Pedagogy colodaPedagogy coloda
Pedagogy coloda
 
NoTube: Models & Semantics
NoTube: Models & SemanticsNoTube: Models & Semantics
NoTube: Models & Semantics
 
The W3C and the web design ecosystem
The W3C and the web design ecosystemThe W3C and the web design ecosystem
The W3C and the web design ecosystem
 
ORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout sessionORCID Outreach Meeting dev breakout session
ORCID Outreach Meeting dev breakout session
 
Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012Symfony2 y MongoDB - deSymfony 2012
Symfony2 y MongoDB - deSymfony 2012
 
DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing DigiBird: on the fly collection integration using crowdsourcing
DigiBird: on the fly collection integration using crowdsourcing
 
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID DescriptionsSplendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
Splendid: SPARQL Endpoint Federation Exploiting VOID Descriptions
 
Linked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable developmentLinked Open Data Principles, benefits of LOD for sustainable development
Linked Open Data Principles, benefits of LOD for sustainable development
 
Linked Open Data: The Essentials
Linked Open Data: The EssentialsLinked Open Data: The Essentials
Linked Open Data: The Essentials
 
Moving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked DataMoving beyond sameAs with PLATO: Partonomy detection for Linked Data
Moving beyond sameAs with PLATO: Partonomy detection for Linked Data
 
Introduction to the BioJS project
Introduction to the BioJS projectIntroduction to the BioJS project
Introduction to the BioJS project
 
Open standards and open source mean open for business cms expo session mc-k...
Open standards and open source mean open for business   cms expo session mc-k...Open standards and open source mean open for business   cms expo session mc-k...
Open standards and open source mean open for business cms expo session mc-k...
 
Late March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar UpdatedLate March and April 2012 Media 21 Calendar Updated
Late March and April 2012 Media 21 Calendar Updated
 
No More SQL
No More SQLNo More SQL
No More SQL
 
Construction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge GraphsConstruction and Querying of Dynamic Knowledge Graphs
Construction and Querying of Dynamic Knowledge Graphs
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Recently uploaded

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Recently uploaded (20)

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Grouping and Joining in Lucene/Solr

  • 1. Grouping & Joining Martijn van Groningen martijn.vangroningen@searchworkings.com Lucene Committer & PMC Member Thursday, May 17, 2012
  • 2. Grouping & Joining Overview ‣ Background ‣ Joining ‣ Result grouping ‣ Conclusion Searchworkings.org - The online search community 2 Thursday, May 17, 2012
  • 3. Background Lucene’s model ‣ Lucene is document based. ‣ Lucene doesn’t store information about relations between documents. ‣ Data often holds relations. ‣ Good free text search over relational data. Searchworkings.org - The online search community 3 Thursday, May 17, 2012
  • 4. Background Example ‣ Product ‣ Name ‣ Description ‣ Product-item ‣ Color ‣ Size ‣ Price ‣ Goal: Show the most applicable product based on product-item criteria. Searchworkings.org - The online search community 4 Thursday, May 17, 2012
  • 5. Background Common Lucene solutions ‣ Compound documents. ‣ May result in documents with many fields. ‣ Subsequent searches. ‣ May cause a lot network overhead. ‣ Non Lucene based approach: ‣ If free text search isn’t very important use a relational database. Searchworkings.org - The online search community 5 Thursday, May 17, 2012
  • 6. Background Example domain ‣ Compound Product & Product-items document. ‣ Each product-item has its own field prefix. Searchworkings.org - The online search community 6 Thursday, May 17, 2012
  • 7. Background Different solutions ‣ Lucene offers solutions to have a 'relational' like search. ‣ Parent child ‣ Grouping & joining aren't naturally supported. ‣ All the solutions do increase the search time. ‣ Some scenarios grouping and joining isn't the right solution. Searchworkings.org - The online search community 7 Thursday, May 17, 2012
  • 8. Joining Modelling relations Thursday, May 17, 2012
  • 9. Joining Introduction ‣ Support for parent child like search from Lucene 3.4 ‣ Not a SQL join. ‣ The parent and each children are stored as documents. ‣ Two types: ‣ Index time join ‣ Query time join Searchworkings.org - The online search community 9 Thursday, May 17, 2012
  • 10. Joining Index time join ‣ Two block join queries: ‣ ToParentBlockJoinQuery ‣ ToChildBlockJoinQuery ‣ One Lucene collector: ‣ ToParentBlockJoinCollector ‣ Index time join requires block indexing. Searchworkings.org - The online search community 10 Thursday, May 17, 2012
  • 11. Joining Block indexing ‣ Atomically adding documents. ‣ A block of documents. ‣ Each document gets sequentially assigned Lucene document id. ‣ IndexWriter#addDocuments(docs); Searchworkings.org - The online search community 11 Thursday, May 17, 2012
  • 12. Joining Block indexing ‣ Index doesn't record blocks. ‣ Segment merging doesn’t re-order documents in a segment. ‣ App is responsible for identifying block documents. ‣ Marking the last document in a block. ‣ Adding a document to a block requires you to reindex the whole block. ‣ Removing a document from a block doesn’t requires reindexing a block. Searchworkings.org - The online search community 12 Thursday, May 17, 2012
  • 13. Joining Example domain ‣ Parent is the last document in a block. Searchworkings.org - The online search community 13 Thursday, May 17, 2012
  • 14. Joining Block indexing Marking parent documents Searchworkings.org - The online search community 14 Thursday, May 17, 2012
  • 15. Joining Block indexing Add block Add block Searchworkings.org - The online search community 15 Thursday, May 17, 2012
  • 16. Joining ToParentBlockJoinQuery ‣ Parent filter marks the parent documents. ‣ Child query is executed in the parent space. ‣ ToChildBlockJoinQuery works in the opposite direction. Searchworkings.org - The online search community 16 Thursday, May 17, 2012
  • 17. Joining Query time joining ‣ Query time joining is executed in two phases and is field based: ‣ fromField ‣ toField ‣ Doesn’t require block indexing. Searchworkings.org - The online search community 17 Thursday, May 17, 2012
  • 18. Joining Query time joining ‣ First phase collects all the terms in the fromField for the documents that match with the original query. ‣ Currently doesn’t take the score from original query into account. ‣ The second phase returns the documents that match with the collected terms from the previous phase in the toField. ‣ Two different implementations: ‣ JoinUtil - Lucene (≥ 3.6) ‣ Join query parser - Solr (trunk) Searchworkings.org - The online search community 18 Thursday, May 17, 2012
  • 19. Joining Query time joining - Indexing Referrer the product id. Searchworkings.org - The online search community 19 Thursday, May 17, 2012
  • 20. Joining Query time joining - Indexing Searchworkings.org - The online search community 20 Thursday, May 17, 2012
  • 21. Joining Query time joining ‣ Result will contain one product. ‣ Possible to join over two indices. Searchworkings.org - The online search community 21 Thursday, May 17, 2012
  • 22. Joining Final thoughts ‣ Joining module has good solutions to model parent child relations. ‣ Use block join if you care about scoring. ‣ Frequent updates can be problematic. ‣ Use query time join for parent child filtering. ‣ Query time join is slower than index time join. ‣ Mostly a Lucene feature only. ‣ All code is annotated as experimental. Searchworkings.org - The online search community 22 Thursday, May 17, 2012
  • 23. Result grouping Previously known as Field Collapsing. Thursday, May 17, 2012
  • 24. Result grouping Introduction ‣ Group matching documents that share a common property. ‣ Search hit represents a group. ‣ Facet counts & total hit count represent groups. ‣ Per group collect information ‣ Most relevant document. ‣ Top three documents. ‣ Aggregated counts Searchworkings.org - The online search community 24 Thursday, May 17, 2012
  • 25. Result grouping Usages ‣ Group documents by a shared property ‣ Product-item by product id (Parent child) ‣ Collapse similar looking documents ‣ E.g. all results from the Wikipedia domains. ‣ Remove duplicates from the search result. ‣ Based on a field that contains a hash Searchworkings.org - The online search community 25 Thursday, May 17, 2012
  • 26. Result grouping Example domain ‣ Each Product-item is a document, but includes the product data. Searchworkings.org - The online search community 26 Thursday, May 17, 2012
  • 27. Result grouping Implementation ‣ Result grouping implemented with Lucene collectors. ‣ Module in trunk and a contrib in 3.x versions. ‣ Two pass result grouping. ‣ Grouping by indexed field, function or doc values. ‣ Single pass result grouping. ‣ Requires block indexing. Searchworkings.org - The online search community 27 Thursday, May 17, 2012
  • 28. Result grouping Two pass implementation ‣ First pass collects the top N groups. ‣ Per group: group value + sort value ‣ Second pass collects data for each top group. ‣ The top N documents per group. ‣ Possible other aggregated information. ‣ Second pass search ignores all documents outside topN groups. Searchworkings.org - The online search community 28 Thursday, May 17, 2012
  • 29. Result grouping Result grouping - Indexing Searchworkings.org - The online search community 29 Thursday, May 17, 2012
  • 30. Result grouping Result grouping - Searching Searchworkings.org - The online search community 30 Thursday, May 17, 2012
  • 31. Result grouping Result grouping made easier ‣ GroupingSearch ‣ Solr ‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id ‣ Many more options: ‣ http://wiki.apache.org/solr/FieldCollapsing Searchworkings.org - The online search community 31 Thursday, May 17, 2012
  • 32. Result grouping Parent child result ‣ TopGroups - Equivalent to TopDocs. ‣ Hit count ‣ Group count ‣ Groups ‣ Top documents ‣ Facet and total count can represent groups instead of documents. ‣ But requires more query time. Searchworkings.org - The online search community 32 Thursday, May 17, 2012
  • 33. Conclusion Compare... Thursday, May 17, 2012
  • 34. Conclusion Compare the parent child solutions ‣ Result grouping ‣ + Distributed support & Parent child relation as hit. ‣ - Parent data duplication ‣ - Impact on query time ‣ Joining ‣ + Fast & no data duplication ‣ - Index time join not optimal for updates ‣ - Query time join is limited. Searchworkings.org - The online search community 34 Thursday, May 17, 2012
  • 35. Conclusion Compare the parent child solutions ‣ Compound documents. ‣ + Fast and works out-of-the box with all features. ‣ - Not flexible when it comes to updates. ‣ - Document granularity is set in stone. Searchworkings.org - The online search community 35 Thursday, May 17, 2012
  • 36. Any questions? 36 Thursday, May 17, 2012
  • 37. Extra slides We have time left! Thursday, May 17, 2012
  • 38. Conclusion Future work ‣ Higher level parent-child API. ‣ Needs to cover search & indexing. ‣ Joining ‣ Distributed support. ‣ Represent a hit as a parent child relation in the search result. ‣ Result grouping ‣ Aggregated grouped information like: sum, avg, min, max etc. Searchworkings.org - The online search community 38 Thursday, May 17, 2012
  • 39. Joining ToParentBlockJoinCollector ‣ TopGroups contains a group per top N parent document. ‣ Each group contains a parent and child documents. Searchworkings.org - The online search community 39 Thursday, May 17, 2012
  • 40. Result grouping Groups & facet counts ‣ Faceting and result grouping are different features. ‣ But are often used together! ‣ Facet counts can be based on: ‣ Found documents. ‣ Found groups. ‣ Combination of facet value and group. ‣ All options are supported in Solr. Searchworkings.org - The online search community 40 Thursday, May 17, 2012
  • 41. Result grouping Doc values ‣ Doc values / Column Stride values ‣ Prevents the creation of expensive data structures in FieldCache. ‣ Inverted index is meant for free text search. ‣ All grouping collectors have doc values based implementations! Searchworkings.org - The online search community 41 Thursday, May 17, 2012