Grouping and Joining in Lucene/Solr

Presented by Martijn van Groningen, SearchWorkings - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

In the real world, data isn’t flat; it is often organized into complex models. Lucene is document oriented and doesn’t support relations natively. Traditionally, the only ways to index such data were to de-normalize the relations into a single document with many fields or to execute subsequent queries. Subsequent queries can be expensive, and de-normalizing duplicates data, so neither is always ideal. Recent versions of Solr and Lucene provide features that allow you to join and group. You can join and group on fields across documents and still have the power of Lucene’s free text search. In this presentation, we’ll look at these new alternatives, their advantages and disadvantages, and how these features can be used. We’ll also look at how these new capabilities impact the design of Solr-based search applications, primarily from infrastructure and operational perspectives.

Presentation Transcript

  • Grouping & Joining
      Martijn van Groningen
      martijn.vangroningen@searchworkings.com
      Lucene Committer & PMC Member
      Thursday, May 17, 2012
  • Grouping & Joining: Overview
      ‣ Background
      ‣ Joining
      ‣ Result grouping
      ‣ Conclusion
      Searchworkings.org - The online search community
  • Background: Lucene’s model
      ‣ Lucene is document based.
      ‣ Lucene doesn’t store information about relations between documents.
      ‣ Data often holds relations.
      ‣ Good free text search over relational data.
  • Background: Example
      ‣ Product
          ‣ Name
          ‣ Description
      ‣ Product-item
          ‣ Color
          ‣ Size
          ‣ Price
      ‣ Goal: Show the most applicable product based on product-item criteria.
  • Background: Common Lucene solutions
      ‣ Compound documents.
          ‣ May result in documents with many fields.
      ‣ Subsequent searches.
          ‣ May cause a lot of network overhead.
      ‣ Non-Lucene-based approach:
          ‣ If free text search isn’t very important, use a relational database.
  • Background: Example domain
      ‣ Compound Product & Product-items document.
      ‣ Each product-item has its own field prefix.
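The compound-document approach can be illustrated with a minimal sketch in plain Python. This is not Lucene API; the `item<N>_` prefix scheme and the sample field values are assumptions made for illustration, since the slide only says that each product-item has its own field prefix.

```python
def to_compound_document(product, items):
    """Flatten a product and its product-items into one flat field map.

    Each product-item gets its own field prefix (item0_, item1_, ...),
    a hypothetical naming scheme chosen for this sketch.
    """
    doc = {"name": product["name"], "description": product["description"]}
    for i, item in enumerate(items):
        for field, value in item.items():
            doc[f"item{i}_{field}"] = value
    return doc


product = {"name": "Polo shirt", "description": "Cotton polo"}
items = [
    {"color": "red", "size": "M", "price": 20},
    {"color": "blue", "size": "L", "price": 25},
]
compound = to_compound_document(product, items)
```

The field count grows with every product-item, which is the "many fields" drawback mentioned earlier.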
  • Background: Different solutions
      ‣ Lucene offers solutions to have a relational-like search.
      ‣ Parent child
      ‣ Grouping & joining aren’t naturally supported.
      ‣ All the solutions do increase the search time.
      ‣ In some scenarios grouping and joining aren’t the right solution.
  • Joining: Modelling relations
  • Joining: Introduction
      ‣ Support for parent-child-like search since Lucene 3.4.
      ‣ Not a SQL join.
      ‣ The parent and each child are stored as documents.
      ‣ Two types:
          ‣ Index time join
          ‣ Query time join
  • Joining: Index time join
      ‣ Two block join queries:
          ‣ ToParentBlockJoinQuery
          ‣ ToChildBlockJoinQuery
      ‣ One Lucene collector:
          ‣ ToParentBlockJoinCollector
      ‣ Index time join requires block indexing.
  • Joining: Block indexing
      ‣ Atomically adding documents.
      ‣ A block of documents.
      ‣ Each document is assigned a sequential Lucene document id.
      ‣ IndexWriter#addDocuments(docs);
  • Joining: Block indexing
      ‣ The index doesn’t record blocks.
      ‣ Segment merging doesn’t re-order documents in a segment.
      ‣ The app is responsible for identifying block documents.
          ‣ Marking the last document in a block.
      ‣ Adding a document to a block requires you to reindex the whole block.
      ‣ Removing a document from a block doesn’t require reindexing the block.
  • Joining: Example domain
      ‣ Parent is the last document in a block.
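Block indexing can be modelled with a toy index in plain Python. This sketch does not call Lucene; `ToyIndex`, `product_block` and the `doc_type` marker field are hypothetical names used only to show that a block receives contiguous document ids and that the parent is added last.

```python
class ToyIndex:
    """Toy stand-in for an index segment: doc id == position in a list."""

    def __init__(self):
        self.docs = []

    def add_documents(self, block):
        """Append a whole block at once so its doc ids are contiguous."""
        first_id = len(self.docs)
        self.docs.extend(block)
        return list(range(first_id, len(self.docs)))


def product_block(product, items):
    """Children first, parent last; the parent carries a marker field."""
    children = [dict(item, doc_type="item") for item in items]
    parent = dict(product, doc_type="product")  # hypothetical marker field
    return children + [parent]


index = ToyIndex()
ids_a = index.add_documents(
    product_block({"name": "Polo"}, [{"color": "red"}, {"color": "blue"}])
)
ids_b = index.add_documents(product_block({"name": "Jeans"}, [{"color": "black"}]))

# The index itself does not record blocks; the marker field is the only
# way to find where each block ends.
parent_ids = [i for i, d in enumerate(index.docs) if d["doc_type"] == "product"]
```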
  • Joining: Block indexing
      Marking parent documents
  • Joining: Block indexing
      Add block
  • Joining: ToParentBlockJoinQuery
      ‣ Parent filter marks the parent documents.
      ‣ Child query is executed in the parent space.
      ‣ ToChildBlockJoinQuery works in the opposite direction.
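The idea behind mapping child hits to parents can be sketched in plain Python. This is a simplified model, not the Lucene implementation: it assumes documents are laid out in blocks with the parent last, and `join_to_parents` is a hypothetical helper name. Scoring is left out.

```python
from bisect import bisect_left


def join_to_parents(child_hits, parent_ids):
    """Map matching child doc ids to the parent that closes their block.

    parent_ids must be sorted. Because the parent is indexed last, the
    parent of a child is the first parent doc id greater than the child's.
    """
    parents = {}
    for child in sorted(child_hits):
        pos = bisect_left(parent_ids, child)
        if pos == len(parent_ids) or parent_ids[pos] == child:
            continue  # no enclosing block, or the hit is itself a parent
        parents.setdefault(parent_ids[pos], []).append(child)
    return parents


docs = [
    {"type": "item", "color": "red"},      # doc 0
    {"type": "item", "color": "blue"},     # doc 1
    {"type": "product", "name": "Polo"},   # doc 2, closes block 0-2
    {"type": "item", "color": "black"},    # doc 3
    {"type": "product", "name": "Jeans"},  # doc 4, closes block 3-4
    {"type": "item", "color": "red"},      # doc 5
    {"type": "item", "color": "red"},      # doc 6
    {"type": "product", "name": "Tee"},    # doc 7, closes block 5-7
]
# "Parent filter": marks which doc ids are parents.
parent_ids = [i for i, d in enumerate(docs) if d["type"] == "product"]
# "Child query": matches product-items by color.
child_hits = [i for i, d in enumerate(docs) if d.get("color") == "red"]
matches = join_to_parents(child_hits, parent_ids)
```

The result maps each matching parent to its matching children, which is also roughly the shape of a group per parent document.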
  • Joining: Query time joining
      ‣ Query time joining is executed in two phases and is field based:
          ‣ fromField
          ‣ toField
      ‣ Doesn’t require block indexing.
  • Joining: Query time joining
      ‣ The first phase collects all the terms in the fromField for the documents that match the original query.
          ‣ Currently doesn’t take the score from the original query into account.
      ‣ The second phase returns the documents whose toField matches the terms collected in the first phase.
      ‣ Two different implementations:
          ‣ JoinUtil - Lucene (≥ 3.6)
          ‣ Join query parser - Solr (trunk)
  • Joining: Query time joining - Indexing
      Refers to the product id.
  • Joining: Query time joining - Indexing
  • Joining: Query time joining
      ‣ Result will contain one product.
      ‣ Possible to join over two indices.
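The two phases of a query time join can be sketched in plain Python. This is a model of the idea, not the JoinUtil or Solr code; the `query_time_join` helper, the field names and the sample data are assumptions. The two lists stand in for two separate indices.

```python
def query_time_join(from_docs, from_query, from_field, to_docs, to_field):
    """Two-phase, field-based join.

    Phase 1: collect the from_field values of documents matching from_query.
    Phase 2: return the to_docs whose to_field holds one of those values.
    Scores from phase 1 are not carried over.
    """
    terms = {doc[from_field] for doc in from_docs if from_query(doc)}
    return [doc for doc in to_docs if doc[to_field] in terms]


products = [{"id": "p1", "name": "Polo"}, {"id": "p2", "name": "Jeans"}]
items = [
    {"product_id": "p1", "color": "red", "size": "M"},
    {"product_id": "p1", "color": "red", "size": "L"},
    {"product_id": "p2", "color": "black", "size": "M"},
]

# Join from product-items (fromField = product_id) to products (toField = id).
result = query_time_join(
    items, lambda doc: doc["color"] == "red", "product_id", products, "id"
)
```

Two red product-items match in the first phase, but both refer to the same product, so the result contains one product.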
  • Joining: Final thoughts
      ‣ Joining module has good solutions to model parent child relations.
      ‣ Use block join if you care about scoring.
          ‣ Frequent updates can be problematic.
      ‣ Use query time join for parent child filtering.
          ‣ Query time join is slower than index time join.
      ‣ Mostly a Lucene feature only.
      ‣ All code is annotated as experimental.
  • Result grouping: Previously known as Field Collapsing.
  • Result grouping: Introduction
      ‣ Group matching documents that share a common property.
      ‣ Search hit represents a group.
      ‣ Facet counts & total hit count represent groups.
      ‣ Per group collect information:
          ‣ Most relevant document.
          ‣ Top three documents.
          ‣ Aggregated counts.
  • Result grouping: Usages
      ‣ Group documents by a shared property
          ‣ Product-item by product id (Parent child)
      ‣ Collapse similar looking documents
          ‣ E.g. all results from the Wikipedia domains.
      ‣ Remove duplicates from the search result.
          ‣ Based on a field that contains a hash
  • Result grouping: Example domain
      ‣ Each Product-item is a document, but includes the product data.
  • Result grouping: Implementation
      ‣ Result grouping implemented with Lucene collectors.
      ‣ Module in trunk and a contrib in 3.x versions.
      ‣ Two pass result grouping.
          ‣ Grouping by indexed field, function or doc values.
      ‣ Single pass result grouping.
          ‣ Requires block indexing.
  • Result grouping: Two pass implementation
      ‣ First pass collects the top N groups.
          ‣ Per group: group value + sort value
      ‣ Second pass collects data for each top group.
          ‣ The top N documents per group.
          ‣ Possible other aggregated information.
      ‣ Second pass search ignores all documents outside the top N groups.
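The two passes can be sketched in plain Python over a list of scored hits. This is a simplified model rather than the Lucene collectors: it sorts by score only, and `two_pass_grouping` and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def two_pass_grouping(hits, group_field, top_n_groups, docs_per_group):
    """hits: list of (score, doc) pairs; a higher score is more relevant."""
    # First pass: per group keep only the group value and its best sort value.
    best = {}
    for score, doc in hits:
        key = doc[group_field]
        if key not in best or score > best[key]:
            best[key] = score
    top_groups = sorted(best, key=best.get, reverse=True)[:top_n_groups]

    # Second pass: collect documents, ignoring anything outside the top groups.
    wanted = set(top_groups)
    members = defaultdict(list)
    for score, doc in hits:
        if doc[group_field] in wanted:
            members[doc[group_field]].append((score, doc))
    return [
        (key, sorted(members[key], key=lambda hit: hit[0], reverse=True)[:docs_per_group])
        for key in top_groups
    ]


hits = [
    (0.9, {"product_id": "p1", "color": "red"}),
    (0.4, {"product_id": "p2", "color": "red"}),
    (0.7, {"product_id": "p1", "color": "blue"}),
    (0.8, {"product_id": "p3", "color": "green"}),
    (0.2, {"product_id": "p1", "color": "white"}),
]
groups = two_pass_grouping(hits, "product_id", top_n_groups=2, docs_per_group=2)
```

The first pass only needs one small entry per group, which is why the per-group documents are deferred to a second pass over the same hits.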
  • Result grouping: Result grouping - Indexing
  • Result grouping: Result grouping - Searching
  • Result grouping: Result grouping made easier
      ‣ GroupingSearch
      ‣ Solr
          ‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id
      ‣ Many more options:
          ‣ http://wiki.apache.org/solr/FieldCollapsing
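A small sketch of building the Solr grouping request URL from the slide with Python's standard library. The host and field names come from the slide's example; `grouping_url` is a hypothetical helper, and only the three parameters shown on the slide are used.

```python
from urllib.parse import urlencode


def grouping_url(base_url, query, group_field):
    """Build a select URL that turns on result grouping for one field."""
    params = [("q", query), ("group", "true"), ("group.field", group_field)]
    return base_url + "?" + urlencode(params)


url = grouping_url("http://myhost/solr/select", "shirt", "product_id")
```

Using `urlencode` rather than string concatenation keeps queries with spaces or special characters correctly escaped.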
  • Result grouping: Parent child result
      ‣ TopGroups - Equivalent to TopDocs.
          ‣ Hit count
          ‣ Group count
          ‣ Groups
          ‣ Top documents
      ‣ Facet and total count can represent groups instead of documents.
          ‣ But requires more query time.
  • Conclusion: Compare...
  • Conclusion: Compare the parent child solutions
      ‣ Result grouping
          ‣ + Distributed support & Parent child relation as hit.
          ‣ - Parent data duplication
          ‣ - Impact on query time
      ‣ Joining
          ‣ + Fast & no data duplication
          ‣ - Index time join not optimal for updates
          ‣ - Query time join is limited.
  • Conclusion: Compare the parent child solutions
      ‣ Compound documents.
          ‣ + Fast and works out of the box with all features.
          ‣ - Not flexible when it comes to updates.
          ‣ - Document granularity is set in stone.
  • Any questions?
  • Extra slides: We have time left!
  • Conclusion: Future work
      ‣ Higher level parent-child API.
          ‣ Needs to cover search & indexing.
      ‣ Joining
          ‣ Distributed support.
          ‣ Represent a hit as a parent child relation in the search result.
      ‣ Result grouping
          ‣ Aggregated grouped information like: sum, avg, min, max etc.
  • Joining: ToParentBlockJoinCollector
      ‣ TopGroups contains a group per top N parent document.
      ‣ Each group contains a parent and child documents.
  • Result grouping: Groups & facet counts
      ‣ Faceting and result grouping are different features.
          ‣ But are often used together!
      ‣ Facet counts can be based on:
          ‣ Found documents.
          ‣ Found groups.
          ‣ Combination of facet value and group.
      ‣ All options are supported in Solr.
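The three ways of counting facets can be compared in a plain-Python sketch. This is one interpretation of the slide, not Solr code: "found groups" is modelled here as counting only the most relevant document of each group, and the function names and sample data are assumptions.

```python
from collections import Counter


def facet_by_documents(hits, facet_field):
    """Count every matching document."""
    return Counter(doc[facet_field] for _, doc in hits)


def facet_by_group_heads(hits, facet_field, group_field):
    """Count only the most relevant document of each group."""
    heads = {}
    for score, doc in hits:
        key = doc[group_field]
        if key not in heads or score > heads[key][0]:
            heads[key] = (score, doc)
    return Counter(doc[facet_field] for _, doc in heads.values())


def facet_by_value_and_group(hits, facet_field, group_field):
    """Count each (facet value, group) combination once."""
    pairs = {(doc[facet_field], doc[group_field]) for _, doc in hits}
    return Counter(value for value, _ in pairs)


hits = [
    (0.9, {"product_id": "p1", "color": "red"}),
    (0.7, {"product_id": "p1", "color": "red"}),
    (0.6, {"product_id": "p1", "color": "blue"}),
    (0.8, {"product_id": "p2", "color": "blue"}),
]
by_docs = facet_by_documents(hits, "color")
by_heads = facet_by_group_heads(hits, "color", "product_id")
by_pairs = facet_by_value_and_group(hits, "color", "product_id")
```

With this data the three modes give different counts for the same hits: red 2 / blue 2 by documents, red 1 / blue 1 by group heads, and red 1 / blue 2 by value and group combination.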
  • Result grouping: Doc values
      ‣ Doc values / Column Stride values
          ‣ Prevents the creation of expensive data structures in FieldCache.
          ‣ Inverted index is meant for free text search.
      ‣ All grouping collectors have doc values based implementations!
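The difference between the two layouts can be shown with a toy model in plain Python. This is a conceptual sketch, not Lucene's storage format: `uninvert` is a hypothetical helper that mimics the kind of work needed to derive per-document values from an inverted index, while `doc_values` stands for a column written once at index time.

```python
def uninvert(postings, max_doc):
    """Rebuild a doc id -> value column from term -> doc ids postings.

    Grouping needs the value of a field for a given document, so an
    inverted index has to be walked in full before grouping can start.
    """
    column = [None] * max_doc
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            column[doc_id] = term
    return column


# Inverted index: term -> doc ids (good for "which docs contain this term?").
postings = {"p1": [0, 1], "p2": [2], "p3": [3]}

# Column-stride storage: doc id -> value, already in the shape grouping needs.
doc_values = ["p1", "p1", "p2", "p3"]

rebuilt = uninvert(postings, max_doc=4)
group_of_doc_2 = doc_values[2]
```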