Grouping and Joining in Lucene/Solr

Grouping & Joining
Martijn van Groningen
martijn.vangroningen@searchworkings.com
Lucene Committer & PMC Member

Thursday, May 17, 2012


Overview
‣ Background

‣ Joining

‣ Result grouping

‣ Conclusion

Searchworkings.org - The online search community

Background

Lucene’s model
‣ Lucene is document based.

‣ Lucene doesn’t store information about relations between documents.

‣ Data often holds relations.

‣ Good free text search over relational data.


Background

Example
‣ Product

‣ Name

‣ Description

‣ Product-item

‣ Color

‣ Size

‣ Price

‣ Goal: Show the most applicable product based on product-item criteria.

Background

Common Lucene solutions
‣ Compound documents.

‣ May result in documents with many fields.

‣ Subsequent searches.

‣ May cause a lot of network overhead.

‣ Non-Lucene-based approach:

‣ If free text search isn’t very important, use a relational database.


Background

Example domain
‣ Compound Product & Product-items document.

‣ Each product-item has its own field prefix.
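
The compound-document idea above can be sketched with plain Java maps instead of Lucene documents. This is only an illustration: the class name, the `item<N>_` prefix scheme and the field names are assumptions, not something the slides prescribe.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CompoundDocumentSketch {

    // Flattens a product and its product-items into one flat field map.
    // Each item's fields get a hypothetical "item<N>_" prefix so that
    // fields of different items do not collide.
    static Map<String, String> flatten(Map<String, String> product,
                                       List<Map<String, String>> items) {
        Map<String, String> doc = new LinkedHashMap<>(product);
        for (int i = 0; i < items.size(); i++) {
            for (Map.Entry<String, String> field : items.get(i).entrySet()) {
                doc.put("item" + i + "_" + field.getKey(), field.getValue());
            }
        }
        return doc;
    }
}
```

The number of fields grows with the number of product-items, which is the "many fields" drawback of compound documents.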


Background

Different solutions
‣ Lucene offers solutions for 'relational'-like search.

‣ Parent child

‣ Grouping & joining aren't naturally supported.

‣ All of these solutions increase search time.

‣ In some scenarios, grouping and joining aren't the right solution.


Joining
Modelling relations


Joining

Introduction
‣ Support for parent-child-like search since Lucene 3.4.

‣ Not a SQL join.

‣ The parent and each child are stored as separate documents.

‣ Two types:

‣ Index time join

‣ Query time join


Joining

Index time join
‣ Two block join queries:

‣ ToParentBlockJoinQuery

‣ ToChildBlockJoinQuery

‣ One Lucene collector:

‣ ToParentBlockJoinCollector

‣ Index time join requires block indexing.


Joining

Block indexing
‣ Atomically adding documents.

‣ A block of documents.

‣ Each document in the block gets a sequentially assigned Lucene document id.

‣ IndexWriter#addDocuments(docs);


Joining

Block indexing
‣ Index doesn't record blocks.

‣ Segment merging doesn’t re-order documents in a segment.

‣ App is responsible for identifying block documents.

‣ Marking the last document in a block.

‣ Adding a document to a block requires you to reindex the whole block.

‣ Removing a document from a block doesn’t require reindexing the block.


Joining

Example domain
‣ Parent is the last document in a block.
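
A toy model of block indexing, assuming an in-memory list stands in for the index. It is not Lucene code: `BlockIndexSketch` and `addBlock` are hypothetical names, and the `BitSet` plays the role of whatever marker the application uses to identify parent documents.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class BlockIndexSketch {
    final List<String> docs = new ArrayList<>(); // position in the list = doc id
    final BitSet parents = new BitSet();         // marks the last doc of each block

    // Appends the children followed by their parent as one contiguous block
    // and returns the doc id assigned to the parent.
    int addBlock(List<String> children, String parent) {
        docs.addAll(children);
        docs.add(parent);
        int parentId = docs.size() - 1;
        parents.set(parentId);
        return parentId;
    }
}
```

Because ids are handed out sequentially and the parent closes the block, the marker on the parent is enough to recover the block boundaries later.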


Joining

Block indexing

Marking parent documents


Joining

Block indexing

Add block


Joining

ToParentBlockJoinQuery
‣ Parent filter marks the parent documents.

‣ Child query is executed in the parent space.

‣ ToChildBlockJoinQuery works in the opposite direction.
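
The join in both directions can be illustrated on bare doc ids. The sketch below is a simplified model, not the Lucene implementation: it assumes the parent filter is available as a `BitSet`, that each parent is the last document of its block, and it ignores scoring entirely. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.TreeSet;

final class BlockJoinSketch {

    // To-parent direction: map each matching child doc id to the next
    // parent doc id at or after it (the parent closes the block).
    static TreeSet<Integer> toParents(BitSet parents, int[] matchingChildren) {
        TreeSet<Integer> result = new TreeSet<>();
        for (int child : matchingChildren) {
            int parent = parents.nextSetBit(child);
            if (parent >= 0) {
                result.add(parent);
            }
        }
        return result;
    }

    // To-child direction: the children of a parent are the doc ids after
    // the previous parent and before this parent.
    static List<Integer> toChildren(BitSet parents, int parent) {
        List<Integer> children = new ArrayList<>();
        int firstChild = parents.previousSetBit(parent - 1) + 1;
        for (int doc = firstChild; doc < parent; doc++) {
            children.add(doc);
        }
        return children;
    }
}
```

With parents at doc ids 2 and 4, child matches at 0, 1 and 3 are reported as the two parents 2 and 4.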

Joining

Query time joining
‣ Query time joining is executed in two phases and is field based:

‣ fromField

‣ toField

‣ Doesn’t require block indexing.


Joining

Query time joining
‣ The first phase collects all the terms in the fromField for the documents that match the original query.

‣ Currently doesn’t take the score from the original query into account.

‣ The second phase returns the documents whose toField matches the terms collected in the first phase.

‣ Two different implementations:

‣ JoinUtil - Lucene (≥ 3.6)

‣ Join query parser - Solr (trunk)
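
The two phases can be sketched over in-memory documents. This is a toy model under stated assumptions, not JoinUtil or the Solr join parser: documents are plain maps, the original query is a `Predicate`, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

final class QueryTimeJoinSketch {

    static List<Map<String, String>> join(List<Map<String, String>> fromDocs,
                                          Predicate<Map<String, String>> fromQuery,
                                          String fromField,
                                          List<Map<String, String>> toDocs,
                                          String toField) {
        // Phase 1: collect the fromField terms of the documents that match
        // the original query. Scores are not carried over.
        Set<String> terms = new HashSet<>();
        for (Map<String, String> doc : fromDocs) {
            String term = doc.get(fromField);
            if (term != null && fromQuery.test(doc)) {
                terms.add(term);
            }
        }
        // Phase 2: return the documents whose toField holds a collected term.
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : toDocs) {
            if (terms.contains(doc.get(toField))) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

Keeping the "from" and "to" documents in separate lists mirrors the fact that the two sides do not have to live in the same block, since only the field values connect them.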

Joining

Query time joining - Indexing

Each product-item refers to the product id.

Joining

Query time joining - Indexing


Joining

Query time joining

‣ Result will contain one product.

‣ Possible to join over two indices.


Joining

Final thoughts
‣ Joining module has good solutions to model parent child relations.

‣ Use block join if you care about scoring.

‣ Frequent updates can be problematic.

‣ Use query time join for parent child filtering.

‣ Query time join is slower than index time join.

‣ Mostly a Lucene-only feature.

‣ All code is annotated as experimental.

Result grouping
Previously known as Field Collapsing.


Result grouping

Introduction
‣ Group matching documents that share a common property.

‣ Search hit represents a group.

‣ Facet counts & total hit count represent groups.

‣ Per group collect information

‣ Most relevant document.

‣ Top three documents.

‣ Aggregated counts

Result grouping

Usages
‣ Group documents by a shared property

‣ Product-item by product id (Parent child)

‣ Collapse similar looking documents

‣ E.g. all results from the Wikipedia domains.

‣ Remove duplicates from the search result.

‣ Based on a field that contains a hash


Result grouping

Example domain

‣ Each Product-item is a document, but includes the product data.


Result grouping

Implementation
‣ Result grouping implemented with Lucene collectors.

‣ Module in trunk and a contrib in 3.x versions.

‣ Two pass result grouping.

‣ Grouping by indexed field, function or doc values.

‣ Single pass result grouping.

‣ Requires block indexing.


Result grouping

Two pass implementation
‣ First pass collects the top N groups.

‣ Per group: group value + sort value

‣ Second pass collects data for each top group.

‣ The top N documents per group.

‣ Possible other aggregated information.

‣ Second pass search ignores all documents outside the top N groups.
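
A minimal sketch of the two passes over an in-memory hit list, assuming groups are ranked by their best document score. It is not the Lucene collector code: `TwoPassGroupingSketch`, `Hit` and the parameter names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TwoPassGroupingSketch {

    static final class Hit {
        final String group;
        final int docId;
        final double score;

        Hit(String group, int docId, double score) {
            this.group = group;
            this.docId = docId;
            this.score = score;
        }
    }

    static Map<String, List<Hit>> search(List<Hit> matches, int topNGroups, int docsPerGroup) {
        // First pass: remember group value + sort value (best score),
        // then keep only the top N groups.
        Map<String, Double> best = new LinkedHashMap<>();
        for (Hit hit : matches) {
            best.merge(hit.group, hit.score, Double::max);
        }
        List<String> topGroups = new ArrayList<>(best.keySet());
        topGroups.sort(Comparator.comparing((String g) -> best.get(g)).reversed());
        if (topGroups.size() > topNGroups) {
            topGroups = topGroups.subList(0, topNGroups);
        }

        // Second pass: collect documents for the top groups only.
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String group : topGroups) {
            result.put(group, new ArrayList<>());
        }
        for (Hit hit : matches) {
            List<Hit> docs = result.get(hit.group);
            if (docs != null) { // documents outside the top groups are ignored
                docs.add(hit);
            }
        }
        // Keep the top documents per group, best score first.
        for (List<Hit> docs : result.values()) {
            docs.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            if (docs.size() > docsPerGroup) {
                docs.subList(docsPerGroup, docs.size()).clear();
            }
        }
        return result;
    }
}
```

The sketch walks the match list twice, which is the cost the two-pass approach pays for not needing any special index layout.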


Result grouping

Result grouping - Indexing


Result grouping

Result grouping - Searching


Result grouping

Result grouping made easier
‣ GroupingSearch

‣ Solr

‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id

‣ Many more options:

‣ http://wiki.apache.org/solr/FieldCollapsing

Result grouping

Parent child result
‣ TopGroups - Equivalent to TopDocs.

‣ Hit count

‣ Group count

‣ Groups

‣ Top documents

‣ Facet and total count can represent groups instead of documents.

‣ But requires more query time.


Conclusion
Compare...


Conclusion

Compare the parent child solutions
‣ Result grouping

‣ + Distributed support & Parent child relation as hit.

‣ - Parent data duplication

‣ - Impact on query time

‣ Joining

‣ + Fast & no data duplication

‣ - Index time join not optimal for updates

‣ - Query time join is limited.

Conclusion

Compare the parent child solutions
‣ Compound documents.

‣ + Fast and works out of the box with all features.

‣ - Not flexible when it comes to updates.

‣ - Document granularity is set in stone.


Any questions?



Extra slides
We have time left!


Conclusion

Future work
‣ Higher level parent-child API.

‣ Needs to cover search & indexing.

‣ Joining

‣ Distributed support.

‣ Represent a hit as a parent child relation in the search result.

‣ Result grouping

‣ Aggregated grouped information like: sum, avg, min, max etc.

Joining

ToParentBlockJoinCollector

‣ TopGroups contains one group for each of the top N parent documents.

‣ Each group contains a parent and its child documents.


Result grouping

Groups & facet counts
‣ Faceting and result grouping are different features.

‣ But are often used together!

‣ Facet counts can be based on:

‣ Found documents.

‣ Found groups.

‣ Combination of facet value and group.

‣ All options are supported in Solr.
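
The three ways of counting can be compared on a small in-memory example. This is an illustrative sketch only: it assumes "found groups" means counting the most relevant document of each group, and the class, method and field names are hypothetical rather than Solr or Lucene API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class GroupedFacetSketch {

    static final class Doc {
        final String group;
        final String facetValue;
        final double score;

        Doc(String group, String facetValue, double score) {
            this.group = group;
            this.facetValue = facetValue;
            this.score = score;
        }
    }

    // Counts every found document.
    static Map<String, Integer> byDocuments(List<Doc> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc doc : docs) {
            counts.merge(doc.facetValue, 1, Integer::sum);
        }
        return counts;
    }

    // Counts only the most relevant document of each found group.
    static Map<String, Integer> byGroupHead(List<Doc> docs) {
        Map<String, Doc> heads = new LinkedHashMap<>();
        for (Doc doc : docs) {
            heads.merge(doc.group, doc, (a, b) -> b.score > a.score ? b : a);
        }
        return byDocuments(new ArrayList<>(heads.values()));
    }

    // Counts each distinct combination of facet value and group once.
    static Map<String, Integer> byValueAndGroup(List<Doc> docs) {
        Map<String, Set<String>> groupsPerValue = new TreeMap<>();
        for (Doc doc : docs) {
            groupsPerValue.computeIfAbsent(doc.facetValue, k -> new HashSet<>()).add(doc.group);
        }
        Map<String, Integer> counts = new TreeMap<>();
        groupsPerValue.forEach((value, groups) -> counts.put(value, groups.size()));
        return counts;
    }
}
```

For a product with two red items and one blue item plus a second product with one red item, the three methods give different counts for the same result set, which is why the counting mode has to be chosen deliberately.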

Result grouping

Doc values
‣ Doc values / column-stride fields

‣ Prevents the creation of expensive data structures in FieldCache.

‣ Inverted index is meant for free text search.

‣ All grouping collectors have doc values based implementations!

