Presented by Martijn van Groningen, SearchWorkings - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
In the real world data isn’t flat. Data is often modelled into complex models. Lucene is document oriented and doesn’t support relations natively. The only way you could index this data is by de-normalizing the relations in a document with many fields and execute subsequent queries. Subsequent queries can be expensive and data gets duplicated. This isn’t always ideal. Recently Solr and Lucene provide features that allow you to join and group. You can join and group on fields across documents and still have the power of Lucene’s awesome free text search. In this presentation, we’ll look at these new alternatives, the advantages and disadvantages and how these features can be utilized. how these new capabilities impact the design of Solr-based search applications primarily from infrastructure and operational perspectives.
1. Grouping & Joining
Martijn van Groningen
martijn.vangroningen@searchworkings.com
Lucene Committer & PMC Member
Thursday, May 17, 2012
2. Grouping & Joining
Overview
‣ Background
‣ Joining
‣ Result grouping
‣ Conclusion
Searchworkings.org - The online search community 2
Thursday, May 17, 2012
3. Background
Lucene’s model
‣ Lucene is document based.
‣ Lucene doesn’t store information about relations between documents.
‣ Data often holds relations.
‣ Good free text search over relational data.
Searchworkings.org - The online search community 3
Thursday, May 17, 2012
4. Background
Example
‣ Product
‣ Name
‣ Description
‣ Product-item
‣ Color
‣ Size
‣ Price
‣ Goal: Show the most applicable product based on product-item criteria.
Searchworkings.org - The online search community 4
Thursday, May 17, 2012
5. Background
Common Lucene solutions
‣ Compound documents.
‣ May result in documents with many fields.
‣ Subsequent searches.
‣ May cause a lot network overhead.
‣ Non Lucene based approach:
‣ If free text search isn’t very important use a relational database.
Searchworkings.org - The online search community 5
Thursday, May 17, 2012
6. Background
Example domain
‣ Compound Product & Product-items document.
‣ Each product-item has its own field prefix.
Searchworkings.org - The online search community 6
Thursday, May 17, 2012
7. Background
Different solutions
‣ Lucene offers solutions to have a 'relational' like search.
‣ Parent child
‣ Grouping & joining aren't naturally supported.
‣ All the solutions do increase the search time.
‣ Some scenarios grouping and joining isn't the right solution.
Searchworkings.org - The online search community 7
Thursday, May 17, 2012
8. Joining
Modelling relations
Thursday, May 17, 2012
9. Joining
Introduction
‣ Support for parent child like search from Lucene 3.4
‣ Not a SQL join.
‣ The parent and each children are stored as documents.
‣ Two types:
‣ Index time join
‣ Query time join
Searchworkings.org - The online search community 9
Thursday, May 17, 2012
10. Joining
Index time join
‣ Two block join queries:
‣ ToParentBlockJoinQuery
‣ ToChildBlockJoinQuery
‣ One Lucene collector:
‣ ToParentBlockJoinCollector
‣ Index time join requires block indexing.
Searchworkings.org - The online search community 10
Thursday, May 17, 2012
11. Joining
Block indexing
‣ Atomically adding documents.
‣ A block of documents.
‣ Each document gets sequentially assigned Lucene document id.
‣ IndexWriter#addDocuments(docs);
Searchworkings.org - The online search community 11
Thursday, May 17, 2012
12. Joining
Block indexing
‣ Index doesn't record blocks.
‣ Segment merging doesn’t re-order documents in a segment.
‣ App is responsible for identifying block documents.
‣ Marking the last document in a block.
‣ Adding a document to a block requires you to reindex the whole block.
‣ Removing a document from a block doesn’t requires reindexing a block.
Searchworkings.org - The online search community 12
Thursday, May 17, 2012
13. Joining
Example domain
‣ Parent is the last document in a block.
Searchworkings.org - The online search community 13
Thursday, May 17, 2012
14. Joining
Block indexing
Marking parent documents
Searchworkings.org - The online search community 14
Thursday, May 17, 2012
15. Joining
Block indexing
Add block
Add block
Searchworkings.org - The online search community 15
Thursday, May 17, 2012
16. Joining
ToParentBlockJoinQuery
‣ Parent filter marks the parent documents.
‣ Child query is executed in the parent space.
‣ ToChildBlockJoinQuery works in the opposite direction.
Searchworkings.org - The online search community 16
Thursday, May 17, 2012
17. Joining
Query time joining
‣ Query time joining is executed in two phases and is field based:
‣ fromField
‣ toField
‣ Doesn’t require block indexing.
Searchworkings.org - The online search community 17
Thursday, May 17, 2012
18. Joining
Query time joining
‣ First phase collects all the terms in the fromField for the documents
that match with the original query.
‣ Currently doesn’t take the score from original query into account.
‣ The second phase returns the documents that match with the collected
terms from the previous phase in the toField.
‣ Two different implementations:
‣ JoinUtil - Lucene (≥ 3.6)
‣ Join query parser - Solr (trunk)
Searchworkings.org - The online search community 18
Thursday, May 17, 2012
19. Joining
Query time joining - Indexing
Referrer the product id.
Searchworkings.org - The online search community 19
Thursday, May 17, 2012
20. Joining
Query time joining - Indexing
Searchworkings.org - The online search community 20
Thursday, May 17, 2012
21. Joining
Query time joining
‣ Result will contain one product.
‣ Possible to join over two indices.
Searchworkings.org - The online search community 21
Thursday, May 17, 2012
22. Joining
Final thoughts
‣ Joining module has good solutions to model parent child relations.
‣ Use block join if you care about scoring.
‣ Frequent updates can be problematic.
‣ Use query time join for parent child filtering.
‣ Query time join is slower than index time join.
‣ Mostly a Lucene feature only.
‣ All code is annotated as experimental.
Searchworkings.org - The online search community 22
Thursday, May 17, 2012
23. Result grouping
Previously known as Field Collapsing.
Thursday, May 17, 2012
24. Result grouping
Introduction
‣ Group matching documents that share a common property.
‣ Search hit represents a group.
‣ Facet counts & total hit count represent groups.
‣ Per group collect information
‣ Most relevant document.
‣ Top three documents.
‣ Aggregated counts
Searchworkings.org - The online search community 24
Thursday, May 17, 2012
25. Result grouping
Usages
‣ Group documents by a shared property
‣ Product-item by product id (Parent child)
‣ Collapse similar looking documents
‣ E.g. all results from the Wikipedia domains.
‣ Remove duplicates from the search result.
‣ Based on a field that contains a hash
Searchworkings.org - The online search community 25
Thursday, May 17, 2012
26. Result grouping
Example domain
‣ Each Product-item is a document, but includes the product data.
Searchworkings.org - The online search community 26
Thursday, May 17, 2012
27. Result grouping
Implementation
‣ Result grouping implemented with Lucene collectors.
‣ Module in trunk and a contrib in 3.x versions.
‣ Two pass result grouping.
‣ Grouping by indexed field, function or doc values.
‣ Single pass result grouping.
‣ Requires block indexing.
Searchworkings.org - The online search community 27
Thursday, May 17, 2012
28. Result grouping
Two pass implementation
‣ First pass collects the top N groups.
‣ Per group: group value + sort value
‣ Second pass collects data for each top group.
‣ The top N documents per group.
‣ Possible other aggregated information.
‣ Second pass search ignores all documents outside topN groups.
Searchworkings.org - The online search community 28
Thursday, May 17, 2012
29. Result grouping
Result grouping - Indexing
Searchworkings.org - The online search community 29
Thursday, May 17, 2012
30. Result grouping
Result grouping - Searching
Searchworkings.org - The online search community 30
Thursday, May 17, 2012
31. Result grouping
Result grouping made easier
‣ GroupingSearch
‣ Solr
‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id
‣ Many more options:
‣ http://wiki.apache.org/solr/FieldCollapsing
Searchworkings.org - The online search community 31
Thursday, May 17, 2012
32. Result grouping
Parent child result
‣ TopGroups - Equivalent to TopDocs.
‣ Hit count
‣ Group count
‣ Groups
‣ Top documents
‣ Facet and total count can represent groups instead of documents.
‣ But requires more query time.
Searchworkings.org - The online search community 32
Thursday, May 17, 2012
34. Conclusion
Compare the parent child solutions
‣ Result grouping
‣ + Distributed support & Parent child relation as hit.
‣ - Parent data duplication
‣ - Impact on query time
‣ Joining
‣ + Fast & no data duplication
‣ - Index time join not optimal for updates
‣ - Query time join is limited.
Searchworkings.org - The online search community 34
Thursday, May 17, 2012
35. Conclusion
Compare the parent child solutions
‣ Compound documents.
‣ + Fast and works out-of-the box with all features.
‣ - Not flexible when it comes to updates.
‣ - Document granularity is set in stone.
Searchworkings.org - The online search community 35
Thursday, May 17, 2012
37. Extra slides
We have time left!
Thursday, May 17, 2012
38. Conclusion
Future work
‣ Higher level parent-child API.
‣ Needs to cover search & indexing.
‣ Joining
‣ Distributed support.
‣ Represent a hit as a parent child relation in the search result.
‣ Result grouping
‣ Aggregated grouped information like: sum, avg, min, max etc.
Searchworkings.org - The online search community 38
Thursday, May 17, 2012
39. Joining
ToParentBlockJoinCollector
‣ TopGroups contains a group per top N parent document.
‣ Each group contains a parent and child documents.
Searchworkings.org - The online search community 39
Thursday, May 17, 2012
40. Result grouping
Groups & facet counts
‣ Faceting and result grouping are different features.
‣ But are often used together!
‣ Facet counts can be based on:
‣ Found documents.
‣ Found groups.
‣ Combination of facet value and group.
‣ All options are supported in Solr.
Searchworkings.org - The online search community 40
Thursday, May 17, 2012
41. Result grouping
Doc values
‣ Doc values / Column Stride values
‣ Prevents the creation of expensive data structures in FieldCache.
‣ Inverted index is meant for free text search.
‣ All grouping collectors have doc values based implementations!
Searchworkings.org - The online search community 41
Thursday, May 17, 2012