Grouping & Joining
                Martijn van Groningen
                martijn.vangroningen@searchworkings.com
                Lucene Committer & PMC Member




Thursday, May 17, 2012
Grouping & Joining

       Overview
      ‣ Background



      ‣ Joining



      ‣ Result grouping



      ‣ Conclusion




                          Searchworkings.org - The online search community   2
Background

       Lucene’s model
      ‣ Lucene is document based.



      ‣ Lucene doesn’t store information about relations between documents.



      ‣ Data often holds relations.



      ‣ Goal: good free-text search over relational data.




Background

       Example
         ‣ Product

               ‣ Name

               ‣ Description

         ‣ Product-item

               ‣ Color

               ‣ Size

               ‣ Price



         ‣ Goal: Show the most applicable product based on product-item criteria.
Background

       Common Lucene solutions
      ‣ Compound documents.

            ‣ May result in documents with many fields.

      ‣ Subsequent searches.

            ‣ May cause a lot of network overhead.




      ‣ Non-Lucene-based approach:

            ‣ If free-text search isn’t very important, use a relational database.


Background

       Example domain
      ‣ Compound Product & Product-items document.

      ‣ Each product-item has its own field prefix.




Background

       Different solutions
      ‣ Lucene offers solutions for 'relational'-like search.

            ‣ Parent child



      ‣ Grouping & joining aren't naturally supported.

            ‣ All of these solutions increase search time.



      ‣ In some scenarios, grouping and joining aren't the right solution.




Joining
                Modelling relations




Joining

       Introduction
      ‣ Support for parent-child-like search since Lucene 3.4.

            ‣ Not a SQL join.



      ‣ The parent and each child are stored as separate documents.



      ‣ Two types:

            ‣ Index time join

            ‣ Query time join


Joining

       Index time join
      ‣ Two block join queries:

            ‣ ToParentBlockJoinQuery

            ‣ ToChildBlockJoinQuery



      ‣ One Lucene collector:

            ‣ ToParentBlockJoinCollector



      ‣ Index time join requires block indexing.


Joining

       Block indexing
      ‣ Atomically adding documents.

            ‣ A block of documents.



      ‣ Each document gets a sequentially assigned Lucene document id.



      ‣ IndexWriter#addDocuments(docs);




Joining

       Block indexing
      ‣ Index doesn't record blocks.

            ‣ Segment merging doesn’t re-order documents in a segment.



      ‣ App is responsible for identifying block documents.

            ‣ Marking the last document in a block.



      ‣ Adding a document to a block requires you to reindex the whole block.

      ‣ Removing a document from a block doesn’t require reindexing the block.


Joining

       Example domain
      ‣ Parent is the last document in a block.




Joining

       Block indexing

                             Marking parent documents




Joining

       Block indexing




       [Code listing not preserved; both of its callouts read "Add block".]



Joining

       ToParentBlockJoinQuery
      ‣ Parent filter marks the parent documents.



      ‣ Child query is executed in the parent space.




      ‣ ToChildBlockJoinQuery works in the opposite direction.
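
The relation between the parent filter and the block layout can be modelled without Lucene. What follows is a minimal plain-Java sketch, not the Lucene API: it assumes sequential document ids, that every block ends with its parent, and that a `java.util.BitSet` stands in for the parent filter. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of block joins. Not the Lucene API; all names are hypothetical.
class BlockJoinSketch {

    // To-parent direction: map each matching child to the parent closing its block.
    // `parents` has one bit per document id, set for parent documents.
    static Set<Integer> toParents(BitSet parents, int[] matchingChildren) {
        Set<Integer> result = new LinkedHashSet<>();
        for (int child : matchingChildren) {
            if (parents.get(child)) {
                continue; // this id is a parent, not a child hit
            }
            // The parent is the first marked document after the child.
            int parent = parents.nextSetBit(child + 1);
            if (parent >= 0) {
                result.add(parent);
            }
        }
        return result;
    }

    // To-child direction: the children are the documents between the previous
    // parent and this parent, both excluded.
    static List<Integer> childrenOf(BitSet parents, int parent) {
        List<Integer> children = new ArrayList<>();
        int firstChild = parents.previousSetBit(parent - 1) + 1;
        for (int doc = firstChild; doc < parent; doc++) {
            children.add(doc);
        }
        return children;
    }

    public static void main(String[] args) {
        // Two blocks: docs 0-2 (parent is 2) and docs 3-6 (parent is 6).
        BitSet parents = new BitSet();
        parents.set(2);
        parents.set(6);
        // Suppose the child query matched docs 0, 1 and 4.
        System.out.println(toParents(parents, new int[] {0, 1, 4})); // prints [2, 6]
        System.out.println(childrenOf(parents, 6));                  // prints [3, 4, 5]
    }
}
```

Because the lookup depends only on document order, it stops working as soon as the documents of a block are no longer adjacent, which is why index time join requires block indexing.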
Joining

       Query time joining
      ‣ Query time joining is executed in two phases and is field based:

            ‣ fromField

            ‣ toField



      ‣ Doesn’t require block indexing.




Joining

       Query time joining
      ‣ First phase collects all the terms in the fromField for the documents
        that match with the original query.

            ‣ Currently doesn’t take the score from the original query into account.



      ‣ The second phase returns the documents that match with the collected
        terms from the previous phase in the toField.



      ‣ Two different implementations:

            ‣ JoinUtil - Lucene (≥ 3.6)

            ‣ Join query parser - Solr (trunk)
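
The two phases can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not `JoinUtil` or the Solr join query parser: documents are plain `Map<String, String>` instances, the original query is a `Predicate`, and the field names `product_id`, `id`, `color` and `name` are illustrative. Like the first phase described above, it ignores scores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Plain-Java model of a two-phase query time join. Names are hypothetical.
class QueryTimeJoinSketch {

    // Builds a document from alternating field names and values.
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    static List<Map<String, String>> join(List<Map<String, String>> index,
                                          Predicate<Map<String, String>> fromQuery,
                                          String fromField,
                                          String toField) {
        // Phase 1: collect the fromField terms of documents matching the query.
        Set<String> terms = new HashSet<>();
        for (Map<String, String> d : index) {
            String term = d.get(fromField);
            if (term != null && fromQuery.test(d)) {
                terms.add(term);
            }
        }
        // Phase 2: return documents whose toField holds a collected term.
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> d : index) {
            String value = d.get(toField);
            if (value != null && terms.contains(value)) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> index = new ArrayList<>();
        index.add(doc("id", "1", "name", "polo shirt"));    // product
        index.add(doc("id", "2", "name", "jacket"));        // product
        index.add(doc("product_id", "1", "color", "red"));  // product-item
        index.add(doc("product_id", "2", "color", "blue")); // product-item
        List<Map<String, String>> products =
                join(index, d -> "red".equals(d.get("color")), "product_id", "id");
        System.out.println(products.size() + " product: " + products.get(0).get("name"));
        // prints: 1 product: polo shirt
    }
}
```

Nothing in this model relies on document order, which matches the point that query time joining doesn’t require block indexing.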
Joining

       Query time joining - Indexing




                           Refers to the product id.
Joining

       Query time joining - Indexing




Joining

       Query time joining




      ‣ Result will contain one product.

      ‣ Possible to join over two indices.




Joining

       Final thoughts
      ‣ Joining module has good solutions to model parent child relations.



      ‣ Use block join if you care about scoring.

            ‣ Frequent updates can be problematic.

      ‣ Use query time join for parent child filtering.

            ‣ Query time join is slower than index time join.



      ‣ Mostly a Lucene-only feature.

      ‣ All code is annotated as experimental.
Result grouping
                Previously known as Field Collapsing.




Result grouping

       Introduction
      ‣ Group matching documents that share a common property.



      ‣ Search hit represents a group.

            ‣ Facet counts & total hit count represent groups.



      ‣ Collect information per group:

            ‣ Most relevant document.

            ‣ Top three documents.

            ‣ Aggregated counts
Result grouping

       Usages
      ‣ Group documents by a shared property

            ‣ Product-item by product id (Parent child)



      ‣ Collapse similar looking documents

            ‣ E.g. all results from the Wikipedia domains.



      ‣ Remove duplicates from the search result.

            ‣ Based on a field that contains a hash


Result grouping

       Example domain




      ‣ Each Product-item is a document, but includes the product data.


Result grouping

       Implementation
      ‣ Result grouping implemented with Lucene collectors.

            ‣ Module in trunk and a contrib in 3.x versions.



      ‣ Two pass result grouping.

            ‣ Grouping by indexed field, function or doc values.



      ‣ Single pass result grouping.

            ‣ Requires block indexing.


Result grouping

       Two pass implementation
      ‣ First pass collects the top N groups.

            ‣ Per group: group value + sort value



      ‣ Second pass collects data for each top group.

            ‣ The top N documents per group.

            ‣ Possibly other aggregated information.



      ‣ Second pass search ignores all documents outside topN groups.
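
The two passes can be sketched in plain Java. This is an illustrative model, not the Lucene grouping collectors: it assumes every hit already carries a group value and a score, sorts groups by their best score, and uses hypothetical class and method names throughout.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of two pass result grouping. Names are hypothetical.
class TwoPassGroupingSketch {

    static final class Hit {
        final int doc;
        final String group;
        final double score;

        Hit(int doc, String group, double score) {
            this.doc = doc;
            this.group = group;
            this.score = score;
        }
    }

    // First pass: keep one sort value (the best score) per group, return the top N groups.
    static List<String> topGroups(List<Hit> hits, int topN) {
        Map<String, Double> best = new HashMap<>();
        for (Hit h : hits) {
            best.merge(h.group, h.score, Double::max);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> Double.compare(best.get(b), best.get(a)));
        return new ArrayList<>(groups.subList(0, Math.min(topN, groups.size())));
    }

    // Second pass: collect the best documents for each top group only.
    static Map<String, List<Hit>> topDocsPerGroup(List<Hit> hits,
                                                  List<String> groups,
                                                  int docsPerGroup) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String g : groups) {
            result.put(g, new ArrayList<>());
        }
        for (Hit h : hits) {
            List<Hit> bucket = result.get(h.group);
            if (bucket != null) { // documents outside the top groups are ignored
                bucket.add(h);
            }
        }
        for (List<Hit> bucket : result.values()) {
            bucket.sort((a, b) -> Double.compare(b.score, a.score));
            if (bucket.size() > docsPerGroup) {
                bucket.subList(docsPerGroup, bucket.size()).clear();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(0, "p1", 0.9));
        hits.add(new Hit(1, "p2", 0.5));
        hits.add(new Hit(2, "p1", 0.4));
        hits.add(new Hit(3, "p3", 0.7));
        hits.add(new Hit(4, "p2", 0.8));
        List<String> groups = topGroups(hits, 2);
        System.out.println(groups); // prints [p1, p2]
        Map<String, List<Hit>> top = topDocsPerGroup(hits, groups, 1);
        System.out.println(top.get("p2").get(0).doc); // prints 4
    }
}
```

The model walks the hit list twice, which mirrors why the two pass approach costs more query time than a plain search.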


Result grouping

       Result grouping - Indexing




                         Searchworkings.org - The online search community   29
Thursday, May 17, 2012
Result grouping

       Result grouping - Searching




Result grouping

       Result grouping made easier
     ‣ GroupingSearch




     ‣ Solr

         ‣ http://myhost/solr/select?q=shirt&group=true&group.field=product_id

     ‣ Many more options:

         ‣ http://wiki.apache.org/solr/FieldCollapsing
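
The Solr request shown on this slide is an ordinary query string, so further grouping options are simply extra parameters. The sketch below builds such a URL with the Java standard library only. `group.limit` (documents returned per group) and `group.ngroups` (include the number of groups) are two of the documented Solr grouping parameters; the values chosen here and the helper class name are assumptions for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Builds a Solr select URL with grouping parameters. The class name is hypothetical.
class SolrGroupingUrlSketch {

    static String selectUrl(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl).append("/select");
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(encode(p.getKey()))
               .append('=')
               .append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "shirt");
        params.put("group", "true");
        params.put("group.field", "product_id");
        params.put("group.limit", "3");      // assumed value: three documents per group
        params.put("group.ngroups", "true"); // ask Solr to report the group count
        System.out.println(selectUrl("http://myhost/solr", params));
    }
}
```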
Result grouping

       Parent child result
      ‣ TopGroups - Equivalent to TopDocs.

            ‣ Hit count

            ‣ Group count

            ‣ Groups

                  ‣ Top documents



      ‣ Facet and total count can represent groups instead of documents.

            ‣ But requires more query time.


Conclusion
                Compare...




Conclusion

       Compare the parent child solutions
      ‣ Result grouping

            ‣ + Distributed support & Parent child relation as hit.

            ‣ - Parent data duplication

            ‣ - Impact on query time



      ‣ Joining

            ‣ + Fast & no data duplication

            ‣ - Index time join not optimal for updates

            ‣ - Query time join is limited.
Conclusion

       Compare the parent child solutions
      ‣ Compound documents.

            ‣ + Fast and works out of the box with all features.

            ‣ - Not flexible when it comes to updates.

            ‣ - Document granularity is set in stone.




Any questions?




Extra slides
                We have time left!




Conclusion

       Future work
       ‣ Higher level parent-child API.

             ‣ Needs to cover search & indexing.



       ‣ Joining

             ‣ Distributed support.

             ‣ Represent a hit as a parent child relation in the search result.



       ‣ Result grouping

             ‣ Aggregated grouped information like: sum, avg, min, max etc.
Joining

       ToParentBlockJoinCollector




         ‣ TopGroups contains one group per top-N parent document.

         ‣ Each group contains a parent and its child documents.

Result grouping

       Groups & facet counts
      ‣ Faceting and result grouping are different features.

            ‣ But are often used together!



      ‣ Facet counts can be based on:

            ‣ Found documents.

            ‣ Found groups.

            ‣ Combination of facet value and group.



      ‣ All options are supported in Solr.
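
The three ways of counting give different numbers for the same hits. The following plain-Java sketch, which is not Solr code, makes the difference concrete. It interprets "found groups" as counting only the most relevant document of each group, and "combination of facet value and group" as counting each group once per facet value; both interpretations, the `color` facet and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java model of facet counting over grouped results. Names are hypothetical.
class GroupedFacetSketch {

    static final class Hit {
        final String group;
        final String color;
        final double score;

        Hit(String group, String color, double score) {
            this.group = group;
            this.color = color;
            this.score = score;
        }
    }

    // Counts every found document.
    static Map<String, Integer> byDocuments(List<Hit> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Hit h : hits) {
            counts.merge(h.color, 1, Integer::sum);
        }
        return counts;
    }

    // Counts only the most relevant document of each group.
    static Map<String, Integer> byTopDocumentPerGroup(List<Hit> hits) {
        Map<String, Hit> best = new HashMap<>();
        for (Hit h : hits) {
            Hit current = best.get(h.group);
            if (current == null || h.score > current.score) {
                best.put(h.group, h);
            }
        }
        return byDocuments(new ArrayList<>(best.values()));
    }

    // Counts each group once per facet value it contains.
    static Map<String, Integer> byFacetValueAndGroup(List<Hit> hits) {
        Map<String, Set<String>> groupsPerValue = new TreeMap<>();
        for (Hit h : hits) {
            groupsPerValue.computeIfAbsent(h.color, k -> new HashSet<>()).add(h.group);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : groupsPerValue.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("p1", "red", 0.9));
        hits.add(new Hit("p1", "red", 0.6));
        hits.add(new Hit("p1", "blue", 0.5));
        hits.add(new Hit("p2", "blue", 0.8));
        hits.add(new Hit("p2", "blue", 0.3));
        System.out.println(byDocuments(hits));           // prints {blue=3, red=2}
        System.out.println(byTopDocumentPerGroup(hits)); // prints {blue=1, red=1}
        System.out.println(byFacetValueAndGroup(hits));  // prints {blue=2, red=1}
    }
}
```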
Result grouping

       Doc values
      ‣ Doc values / Column Stride values



      ‣ Prevents the creation of expensive data structures in FieldCache.



      ‣ Inverted index is meant for free text search.



      ‣ All grouping collectors have doc values based implementations!





Grouping and Joining in Lucene/Solr
