SlideShare a Scribd company logo
Document relations
                              Martijn van Groningen
                                 @mvgroningen




Tuesday, November 6, 12
Overview
     • Background
     • Document relations with joining.
     • Various solutions in Lucene and Elasticsearch



Tuesday, November 6, 12
Background - Lucene model
     • Lucene is document based.
     • Lucene doesn’t store information about relations
       between documents.


     • Data often holds relations.
     • Good free text search over relational data.

Tuesday, November 6, 12
Background - Common solutions
     • Compound documents.
           •     May result in documents with many fields.

     • Subsequent searches.
           •     May cause a lot of network overhead.




     • Non Lucene based approach:
           •     Use Lucene in combination with a relational database.




Tuesday, November 6, 12
Background - Example
     • Product
           •     Name
           •     Description

     • Product-item
           •     Color
           •     Size
           •     Price




Tuesday, November 6, 12
Background - Example
     • Compound Product & Product-items document.
     • Each product-item has its own field prefix.




Tuesday, November 6, 12
Background - Other solutions
     • Lucene offers solutions to have a 'relational' like
       search.
           •     Joining
           •     Result grouping


     • Elasticsearch builds on top of the joining
       capabilities.


     • These solutions aren't naturally supported.

Tuesday, November 6, 12
Joining



Tuesday, November 6, 12
Joining
     • Join support available since Lucene 3.4
           •     Not a SQL join!


     • Two distinct joining types:
           •     Index time join
           •     Query time join


     • Joining provides a solution to handle document
       relations.


Tuesday, November 6, 12
Joining - What is out there?
     • Index time:
           •     Lucene’s block join implementation.
           •     Elasticsearch’s nested filter, query and facets.
                •         Built on top of the Lucene’s block join support.


     • Query time:
           •     Lucene’s query time join utility.
           •     Solr’s join query parser.
           •     Elasticsearch’s various parent-child queries and filters.



Tuesday, November 6, 12
Index time join
                            And nested documents.




Tuesday, November 6, 12
Joining - Block join query
     • Lucene block join queries:
           •     ToParentBlockJoinQuery
           •     ToChildBlockJoinQuery


     • Lucene collector:
           •     ToParentBlockJoinCollector


     • Index time join requires block indexing.

Tuesday, November 6, 12
Joining - Block indexing
     • Atomically adding documents.
           •     A block of documents.


     • Each document gets sequentially assigned Lucene
       document id.


     • IndexWriter#addDocuments(docs);


Tuesday, November 6, 12
Joining - Block indexing
     • Index doesn't record blocks.
     • App is responsible for identifying block documents.
     • Segment merging doesn’t re-order documents in a
       segment.


     • Adding athe whole block.block requires you to
       reindex
                document to a



Tuesday, November 6, 12
Joining - block join query

      • Parent is the last document in a block.




Tuesday, November 6, 12
Block join - ToChildBlockJoinQuery

                          Marking parent documents




Tuesday, November 6, 12
Block join - ToChildBlockJoinQuery



                                  Add block




                                   Add block




Tuesday, November 6, 12
Block join - ToChildBlockJoinQuery
     • Parent filter marks the parent documents.

     • Child query is executed in the parent space.




Tuesday, November 6, 12
Block join & Elasticsearch
     • In Elasticsearch exposed as nested objects.
     • Documents are constructed as JSON.
           •     JSON’s nested structure works nicely with block
                 indexing.


     • Elasticsearchoftakes nested documents. and also
       keeps track the
                            care of block indexing




Tuesday, November 6, 12
Elasticsearch’s nested support
     • Support for a nested type in mappings.
     • Nested query.
     • Nested filter.
     • Nested facets.


Tuesday, November 6, 12
Nested type
     • The nested types enables Lucene’s block indexing.
                                           index



                          curl -XPUT 'localhost:9200/products' -d '{
     type                    "mappings" : {
                               "product" : {
                                  "properties" : {
Nested offers                       "offers" : { "type" : "nested" }
                                  }
                               }
                             }
                          }'




Tuesday, November 6, 12
Indexing nested objects
                                                 index         type

                          curl -XPOST 'localhost:9200/products/product' -d '{
                             "name" : "Polo shirt",
                             "description" : "Made of 100% cotton",
 nested objects              "offers" : [
                                  {
                                     "color" : "red",
                                     "size" : "s",
                                     "price" : 999
                                  },
                                  {
                                     "color" : "red",
                                     "size" : "m",
                                     "price" : 1099
                                  },
                                  {
                                     "color" : "blue",
                                     "size" : "s",
                                     "price" : 999
                                  }
                             ]
                          }'




Tuesday, November 6, 12
Nested query
                                                                   The nested field
     curl -XPOST 'localhost:9200/products/product/_search' -d '{
        "query" : {
                                                                   path in mapping.
          "nested" : {
              "path" : "offers",
              "score_mode" : "total",                               Sum the individual
              "query" : {                                            nested matches.
                  "bool" : {
                       "must" : [
                           {
                               "term" : {
                                   "color" : "blue"
                               }
                           },
                                                                    Color red would match
                           {                                        the previous document.
                               "term" : {
                                   "size" : "m"
                               }
                           }
                       ]
                  }
              }
          }
        }
     }'




Tuesday, November 6, 12
Nested facets
     curl -XPOST 'localhost:9200/products/product/_search' -d '{
        "facets" : {
          "color" : {
             "terms_stats" : {
                "key_field" : "size",
                "value_field" : "price"
             },
             "nested" : "offers"
          }                                              "facets":{
        }                                                    "color":{
     }'                                                          "_type":"terms_stats",
                                                                 "missing":0,
                                                                 "terms":[
                                                                     {
        A facet for nested field                                          "term":"s",
                                                                         "count":2,
                 offers.                                                 "total_count":2,
                                                                         "min":999.0,
                                                                         "max":999.0,
                                                                         "total":1998.0,
                                                                         "mean":999.0
                                                                     },
                                                                     ...
                        Counts 2 nested documents                ]
                                 for term: s                 }
                                                         }




Tuesday, November 6, 12
Query time join
                           and parent & child relations.




Tuesday, November 6, 12
Query time joining
     • Documents are joined during query time.
           •     More expensive, but more flexible.


     • Two types of query time joins:
           •     Parent child joining.
           •     Field based joining.




Tuesday, November 6, 12
Lucene’s query time join
     • Query time joining is executed in two phases.
     • Field based joining:
           •     ‘from’ field
           •     ‘to’ field




     • Doesn’t require block indexing.
Tuesday, November 6, 12
Query time join - JoinUtil
     • Firstthe documents thatthe terms in the fromField
       for
              phase collects all
                                 match with the original
            query.


     • The second the collected termsdocumentsprevious
       match with
                   phase returns the
                                      from the
                                               that
            phase in the toField.


     • One public method:
           •     JoinUtil#createJoinQuery(...)



Tuesday, November 6, 12
Joining - JoinUtil




                          Referrer the product id.



Tuesday, November 6, 12
Joining - JoinUtil




Tuesday, November 6, 12
Joining - JoinUtil



                                           Join utility




 • Result will contain one products.
 • Possible to do ‘join’ across indices.

Tuesday, November 6, 12
Elasticsearch’s query time join
     • A parent child solution.
     • Not related to Lucene’s query time join.
     • Support consists out of:
           •     The _parent field.
           •     The top_children query.
           •     The has_parent & has_child filter & query.
           •     Scoped facets.


Tuesday, November 6, 12
The _parent field
     • Points to the parent type.
     • Mapping attribute to be define on the child type.
                           curl -XPUT 'localhost:9200/products' -d '{
                              "mappings" : {
                                "offer" : {
                                   "_parent" : {
                                     "type" : "product"
                                   }
                                }
                              }
                           }'



     • Elasticsearch uses the _parent field to build an id
       cache.
           •       Makes parent/child queries & filters fast.



Tuesday, November 6, 12
Indexing parent & child documents
     • Parent document:
              curl -XPOST 'localhost:9200/products/product/1' -d '{
                 "name" : "Polo shirt",
                 "description" : "Made of 100% cotton"
              }'                                                            The id of the parent
                                                                          document. Also used for
     • Child documents:                                                           routing.

             curl -XPOST 'localhost:9200/products/offer?parent=1' -d '{
                "color" : "red",
                "size" : "s",
                "price" : 999
             }'



             curl -XPOST 'localhost:9200/products/offer?parent=1' -d '{
                "color" : "blue",
                          "red",
                "size" : "s",
                         "m",
                "price" : 999
                          1099
             }'




Tuesday, November 6, 12
The ‘top_children’ query
      curl -XPOST 'localhost:9200/products/_search' -d '{
         "query" : {
                                                            Child type
           "top_children" : {
              "type" : "offer",
              "query" : {
                 "term" : {                                 Child query
                   "size" : "m"
                 }
              },
              "score" : "sum"                               Score mode
           }
         }
      }'




     • Internally the in order to getpotentially executed
       several times
                      child query is
                                      enough parent hits.



Tuesday, November 6, 12
The ‘has_child’ query
      curl -XPOST 'localhost:9200/products/_search' -d '{   Child type
         "query" : {
           "has_child" : {
              "type" : "offer",
              "query" : {
                "term" : {
                   "size" : "m"
                                                            Child query
                }
              }
           }
         }
      }'




     • Doesn’tdoc. Works as a filter. into the matching
       parent
               map the child scores



     • The has_parent query matches child document
       instead.

Tuesday, November 6, 12
Scoped facets
     curl -XPOST 'localhost:9200/products/_search' -d '{
        "query" : {
           "has_child" : {
             "type" : "offer",
             "query" : {
                "term" : {
                  "size" : "m"
                }
             },
             "_scope" : "my_scope"
           }                                               Execute facets inside
        },
        "facets" : {
                                                             a specific scope.
           "color" : {
             "terms_stats" : {
                "key_field" : "size",
                "value_field" : "price"
             },
             "scope" : "my_scope"
           }
        }
     }'




Tuesday, November 6, 12
Conclusion
     • Block join & nested object are fast and efficient,
       but lack flexibility.


     • Query time and parent child join are flexible at the
       cost of performance and memory.
           •     Field based query time joining is the most flexible.
           •     Parent child based joining is the fastest.


     • Faceting in combination with document relations
       gives a nice analytical view.


Tuesday, November 6, 12
Any questions?




Tuesday, November 6, 12

More Related Content

What's hot

Azure SQL Database Managed Instance - technical overview
Azure SQL Database Managed Instance - technical overviewAzure SQL Database Managed Instance - technical overview
Azure SQL Database Managed Instance - technical overview
George Walters
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
DATAVERSITY
 
Google Presentation
Google PresentationGoogle Presentation
Google Presentation
Ali-Productionz
 
Business Continuity & Disaster Recovery with Microsoft Azure
Business Continuity & Disaster Recovery with Microsoft AzureBusiness Continuity & Disaster Recovery with Microsoft Azure
Business Continuity & Disaster Recovery with Microsoft Azure
Aymen Mami
 
Marketing 7Ps Presentation
Marketing 7Ps PresentationMarketing 7Ps Presentation
Marketing 7Ps Presentationhappiness daniel
 
Just brigadeiro business plan
Just brigadeiro business planJust brigadeiro business plan
Just brigadeiro business planCaroline Szpira
 
Microsoft Azure Platform-as-a-Service (PaaS)
Microsoft Azure Platform-as-a-Service (PaaS)Microsoft Azure Platform-as-a-Service (PaaS)
Microsoft Azure Platform-as-a-Service (PaaS)
Chris Dufour
 
Introduction of Windows azure and overview
Introduction of Windows azure and overviewIntroduction of Windows azure and overview
Introduction of Windows azure and overview
Vishal Tandel
 
Azure Training | Microsoft Azure Tutorial | Microsoft Azure Certification | E...
Azure Training | Microsoft Azure Tutorial | Microsoft Azure Certification | E...Azure Training | Microsoft Azure Tutorial | Microsoft Azure Certification | E...
Azure Training | Microsoft Azure Tutorial | Microsoft Azure Certification | E...
Edureka!
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL Database
James Serra
 
Azure purview
Azure purviewAzure purview
Azure purview
Shafqat Turza
 
Adelaide Global Azure Bootcamp 2018 - Azure 101
Adelaide Global Azure Bootcamp 2018 - Azure 101Adelaide Global Azure Bootcamp 2018 - Azure 101
Adelaide Global Azure Bootcamp 2018 - Azure 101
Balabiju
 
Sources of Entrepreneurial Ideas
Sources of Entrepreneurial IdeasSources of Entrepreneurial Ideas
Sources of Entrepreneurial Ideas
Jamel Aves
 
Microsoft System Center Service Manager on a Single Computer
Microsoft System Center Service Manager on a Single ComputerMicrosoft System Center Service Manager on a Single Computer
Microsoft System Center Service Manager on a Single Computer
Shahab Al Yamin Chawdhury
 
Azure Stack Overview
Azure Stack OverviewAzure Stack Overview
Azure Stack Overview
PT Datacomm Diangraha
 
The State of Enterprise Software
The State of Enterprise SoftwareThe State of Enterprise Software
The State of Enterprise Software
Greylock Partners
 
10 steps marketing plan Jollibee
10 steps marketing plan Jollibee10 steps marketing plan Jollibee
10 steps marketing plan Jollibee
Elainrose Esberto
 
Microsoft Azure Technical Overview
Microsoft Azure Technical OverviewMicrosoft Azure Technical Overview
Microsoft Azure Technical Overview
gjuljo
 
Introduction to Oracle Cloud Infrastructure Services
Introduction to Oracle Cloud Infrastructure ServicesIntroduction to Oracle Cloud Infrastructure Services
Introduction to Oracle Cloud Infrastructure Services
Knoldus Inc.
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Cathrine Wilhelmsen
 

What's hot (20)

Azure SQL Database Managed Instance - technical overview
Azure SQL Database Managed Instance - technical overviewAzure SQL Database Managed Instance - technical overview
Azure SQL Database Managed Instance - technical overview
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
 
Google Presentation
Google PresentationGoogle Presentation
Google Presentation
 
Business Continuity & Disaster Recovery with Microsoft Azure
Business Continuity & Disaster Recovery with Microsoft AzureBusiness Continuity & Disaster Recovery with Microsoft Azure
Business Continuity & Disaster Recovery with Microsoft Azure
 
Marketing 7Ps Presentation
Marketing 7Ps PresentationMarketing 7Ps Presentation
Marketing 7Ps Presentation
 
Just brigadeiro business plan
Just brigadeiro business planJust brigadeiro business plan
Just brigadeiro business plan
 
Microsoft Azure Platform-as-a-Service (PaaS)
Microsoft Azure Platform-as-a-Service (PaaS)Microsoft Azure Platform-as-a-Service (PaaS)
Microsoft Azure Platform-as-a-Service (PaaS)
 
Introduction of Windows azure and overview
Introduction of Windows azure and overviewIntroduction of Windows azure and overview
Introduction of Windows azure and overview
 
Azure Training | Microsoft Azure Tutorial | Microsoft Azure Certification | E...
Azure Training | Microsoft Azure Tutorial | Microsoft Azure Certification | E...Azure Training | Microsoft Azure Tutorial | Microsoft Azure Certification | E...
Azure Training | Microsoft Azure Tutorial | Microsoft Azure Certification | E...
 
Introducing Azure SQL Database
Introducing Azure SQL DatabaseIntroducing Azure SQL Database
Introducing Azure SQL Database
 
Azure purview
Azure purviewAzure purview
Azure purview
 
Adelaide Global Azure Bootcamp 2018 - Azure 101
Adelaide Global Azure Bootcamp 2018 - Azure 101Adelaide Global Azure Bootcamp 2018 - Azure 101
Adelaide Global Azure Bootcamp 2018 - Azure 101
 
Sources of Entrepreneurial Ideas
Sources of Entrepreneurial IdeasSources of Entrepreneurial Ideas
Sources of Entrepreneurial Ideas
 
Microsoft System Center Service Manager on a Single Computer
Microsoft System Center Service Manager on a Single ComputerMicrosoft System Center Service Manager on a Single Computer
Microsoft System Center Service Manager on a Single Computer
 
Azure Stack Overview
Azure Stack OverviewAzure Stack Overview
Azure Stack Overview
 
The State of Enterprise Software
The State of Enterprise SoftwareThe State of Enterprise Software
The State of Enterprise Software
 
10 steps marketing plan Jollibee
10 steps marketing plan Jollibee10 steps marketing plan Jollibee
10 steps marketing plan Jollibee
 
Microsoft Azure Technical Overview
Microsoft Azure Technical OverviewMicrosoft Azure Technical Overview
Microsoft Azure Technical Overview
 
Introduction to Oracle Cloud Infrastructure Services
Introduction to Oracle Cloud Infrastructure ServicesIntroduction to Oracle Cloud Infrastructure Services
Introduction to Oracle Cloud Infrastructure Services
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 

Viewers also liked

Searching Relational Data with Elasticsearch
Searching Relational Data with ElasticsearchSearching Relational Data with Elasticsearch
Searching Relational Data with Elasticsearch
sirensolutions
 
Elasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational databaseElasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational database
Kristijan Duvnjak
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
Clifford James
 
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid DynamicsFaceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
Lucidworks
 
Nested and Parent/Child Docs in ElasticSearch
Nested and Parent/Child Docs in ElasticSearchNested and Parent/Child Docs in ElasticSearch
Nested and Parent/Child Docs in ElasticSearch
BeyondTrees
 
Data modeling for Elasticsearch
Data modeling for ElasticsearchData modeling for Elasticsearch
Data modeling for Elasticsearch
Florian Hopf
 

Viewers also liked (6)

Searching Relational Data with Elasticsearch
Searching Relational Data with ElasticsearchSearching Relational Data with Elasticsearch
Searching Relational Data with Elasticsearch
 
Elasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational databaseElasticsearch as a search alternative to a relational database
Elasticsearch as a search alternative to a relational database
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid DynamicsFaceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
Faceting with Lucene Block Join Query: Presented by Oleg Savrasov, Grid Dynamics
 
Nested and Parent/Child Docs in ElasticSearch
Nested and Parent/Child Docs in ElasticSearchNested and Parent/Child Docs in ElasticSearch
Nested and Parent/Child Docs in ElasticSearch
 
Data modeling for Elasticsearch
Data modeling for ElasticsearchData modeling for Elasticsearch
Data modeling for Elasticsearch
 

Similar to Document relations

Mongo db
Mongo dbMongo db
Mongo db
Girish Talekar
 
elasticsearch - advanced features in practice
elasticsearch - advanced features in practiceelasticsearch - advanced features in practice
elasticsearch - advanced features in practice
Jano Suchal
 
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)""Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
MongoDB
 
MVP Cloud OS Week Track 1 9 Sept: Data liberty
MVP Cloud OS Week Track 1 9 Sept: Data libertyMVP Cloud OS Week Track 1 9 Sept: Data liberty
MVP Cloud OS Week Track 1 9 Sept: Data libertycsmyth501
 
MVP Cloud OS Week: 9 Sept, Track 1 Data Liberty
MVP Cloud OS Week: 9 Sept, Track 1 Data LibertyMVP Cloud OS Week: 9 Sept, Track 1 Data Liberty
MVP Cloud OS Week: 9 Sept, Track 1 Data Liberty
csmyth501
 
Schema design mongo_boston
Schema design mongo_bostonSchema design mongo_boston
Schema design mongo_bostonMongoDB
 
Schema Design
Schema DesignSchema Design
Schema Design
MongoDB
 
elasticsearch basics workshop
elasticsearch basics workshopelasticsearch basics workshop
elasticsearch basics workshop
Mathieu Elie
 
MongoSV Schema Workshop
MongoSV Schema WorkshopMongoSV Schema Workshop
MongoSV Schema WorkshopMongoDB
 
Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"
George Stathis
 
CouchDB at New York PHP
CouchDB at New York PHPCouchDB at New York PHP
CouchDB at New York PHP
Bradley Holt
 
CouchDB at JAOO Århus 2009
CouchDB at JAOO Århus 2009CouchDB at JAOO Århus 2009
CouchDB at JAOO Århus 2009Jason Davies
 
Schema & Design
Schema & DesignSchema & Design
Schema & DesignMongoDB
 
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
MongoDB
 
Elasticsearch in 15 Minutes
Elasticsearch in 15 MinutesElasticsearch in 15 Minutes
Elasticsearch in 15 Minutes
Karel Minarik
 
Modeling JSON data for NoSQL document databases
Modeling JSON data for NoSQL document databasesModeling JSON data for NoSQL document databases
Modeling JSON data for NoSQL document databases
Ryan CrawCour
 
Couchbase Korea User Group 2nd Meetup #2
Couchbase Korea User Group 2nd Meetup #2Couchbase Korea User Group 2nd Meetup #2
Couchbase Korea User Group 2nd Meetup #2
won min jang
 
Schema Design
Schema DesignSchema Design
Schema DesignMongoDB
 
Elasticsearch in 15 minutes
Elasticsearch in 15 minutesElasticsearch in 15 minutes
Elasticsearch in 15 minutes
David Pilato
 
Semi Formal Model for Document Oriented Databases
Semi Formal Model for Document Oriented DatabasesSemi Formal Model for Document Oriented Databases
Semi Formal Model for Document Oriented Databases
Daniel Coupal
 

Similar to Document relations (20)

Mongo db
Mongo dbMongo db
Mongo db
 
elasticsearch - advanced features in practice
elasticsearch - advanced features in practiceelasticsearch - advanced features in practice
elasticsearch - advanced features in practice
 
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)""Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
 
MVP Cloud OS Week Track 1 9 Sept: Data liberty
MVP Cloud OS Week Track 1 9 Sept: Data libertyMVP Cloud OS Week Track 1 9 Sept: Data liberty
MVP Cloud OS Week Track 1 9 Sept: Data liberty
 
MVP Cloud OS Week: 9 Sept, Track 1 Data Liberty
MVP Cloud OS Week: 9 Sept, Track 1 Data LibertyMVP Cloud OS Week: 9 Sept, Track 1 Data Liberty
MVP Cloud OS Week: 9 Sept, Track 1 Data Liberty
 
Schema design mongo_boston
Schema design mongo_bostonSchema design mongo_boston
Schema design mongo_boston
 
Schema Design
Schema DesignSchema Design
Schema Design
 
elasticsearch basics workshop
elasticsearch basics workshopelasticsearch basics workshop
elasticsearch basics workshop
 
MongoSV Schema Workshop
MongoSV Schema WorkshopMongoSV Schema Workshop
MongoSV Schema Workshop
 
Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"Elasticsearch & "PeopleSearch"
Elasticsearch & "PeopleSearch"
 
CouchDB at New York PHP
CouchDB at New York PHPCouchDB at New York PHP
CouchDB at New York PHP
 
CouchDB at JAOO Århus 2009
CouchDB at JAOO Århus 2009CouchDB at JAOO Århus 2009
CouchDB at JAOO Århus 2009
 
Schema & Design
Schema & DesignSchema & Design
Schema & Design
 
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
Conceptos básicos. Seminario web 4: Indexación avanzada, índices de texto y g...
 
Elasticsearch in 15 Minutes
Elasticsearch in 15 MinutesElasticsearch in 15 Minutes
Elasticsearch in 15 Minutes
 
Modeling JSON data for NoSQL document databases
Modeling JSON data for NoSQL document databasesModeling JSON data for NoSQL document databases
Modeling JSON data for NoSQL document databases
 
Couchbase Korea User Group 2nd Meetup #2
Couchbase Korea User Group 2nd Meetup #2Couchbase Korea User Group 2nd Meetup #2
Couchbase Korea User Group 2nd Meetup #2
 
Schema Design
Schema DesignSchema Design
Schema Design
 
Elasticsearch in 15 minutes
Elasticsearch in 15 minutesElasticsearch in 15 minutes
Elasticsearch in 15 minutes
 
Semi Formal Model for Document Oriented Databases
Semi Formal Model for Document Oriented DatabasesSemi Formal Model for Document Oriented Databases
Semi Formal Model for Document Oriented Databases
 

Document relations

  • 1. Document relations Martijn van Groningen @mvgroningen Tuesday, November 6, 12
  • 2. Overview • Background • Document relations with joining. • Various solutions in Lucene and Elasticsearch Tuesday, November 6, 12
  • 3. Background - Lucene model • Lucene is document based. • Lucene doesn’t store information about relations between documents. • Data often holds relations. • Good free text search over relational data. Tuesday, November 6, 12
  • 4. Background - Common solutions • Compound documents. • May result in documents with many fields. • Subsequent searches. • May cause a lot of network overhead. • Non Lucene based approach: • Use Lucene in combination with a relational database. Tuesday, November 6, 12
  • 5. Background - Example • Product • Name • Description • Product-item • Color • Size • Price Tuesday, November 6, 12
  • 6. Background - Example • Compound Product & Product-items document. • Each product-item has its own field prefix. Tuesday, November 6, 12
  • 7. Background - Other solutions • Lucene offers solutions to have a 'relational' like search. • Joining • Result grouping • Elasticsearch builds on top of the joining capabilities. • These solutions aren't naturally supported. Tuesday, November 6, 12
  • 9. Joining • Join support available since Lucene 3.4 • Not a SQL join! • Two distinct joining types: • Index time join • Query time join • Joining provides a solution to handle document relations. Tuesday, November 6, 12
  • 10. Joining - What is out there? • Index time: • Lucene’s block join implementation. • Elasticsearch’s nested filter, query and facets. • Built on top of the Lucene’s block join support. • Query time: • Lucene’s query time join utility. • Solr’s join query parser. • Elasticsearch’s various parent-child queries and filters. Tuesday, November 6, 12
  • 11. Index time join And nested documents. Tuesday, November 6, 12
  • 12. Joining - Block join query • Lucene block join queries: • ToParentBlockJoinQuery • ToChildBlockJoinQuery • Lucene collector: • ToParentBlockJoinCollector • Index time join requires block indexing. Tuesday, November 6, 12
  • 13. Joining - Block indexing • Atomically adding documents. • A block of documents. • Each document gets sequentially assigned Lucene document id. • IndexWriter#addDocuments(docs); Tuesday, November 6, 12
  • 14. Joining - Block indexing • Index doesn't record blocks. • App is responsible for identifying block documents. • Segment merging doesn’t re-order documents in a segment. • Adding athe whole block.block requires you to reindex document to a Tuesday, November 6, 12
  • 15. Joining - block join query • Parent is the last document in a block. Tuesday, November 6, 12
  • 16. Block join - ToChildBlockJoinQuery Marking parent documents Tuesday, November 6, 12
  • 17. Block join - ToChildBlockJoinQuery Add block Add block Tuesday, November 6, 12
  • 18. Block join - ToChildBlockJoinQuery • Parent filter marks the parent documents. • Child query is executed in the parent space. Tuesday, November 6, 12
  • 19. Block join & Elasticsearch • In Elasticsearch exposed as nested objects. • Documents are constructed as JSON. • JSON’s nested structure works nicely with block indexing. • Elasticsearchoftakes nested documents. and also keeps track the care of block indexing Tuesday, November 6, 12
  • 20. Elasticsearch’s nested support • Support for a nested type in mappings. • Nested query. • Nested filter. • Nested facets. Tuesday, November 6, 12
  • 21. Nested type • The nested types enables Lucene’s block indexing. index curl -XPUT 'localhost:9200/products' -d '{ type "mappings" : { "product" : { "properties" : { Nested offers "offers" : { "type" : "nested" } } } } }' Tuesday, November 6, 12
  • 22. Indexing nested objects index type curl -XPOST 'localhost:9200/products/product' -d '{ "name" : "Polo shirt", "description" : "Made of 100% cotton", nested objects "offers" : [ { "color" : "red", "size" : "s", "price" : 999 }, { "color" : "red", "size" : "m", "price" : 1099 }, { "color" : "blue", "size" : "s", "price" : 999 } ] }' Tuesday, November 6, 12
  • 23. Nested query The nested field curl -XPOST 'localhost:9200/products/product/_search' -d '{ "query" : { path in mapping. "nested" : { "path" : "offers", "score_mode" : "total", Sum the individual "query" : { nested matches. "bool" : { "must" : [ { "term" : { "color" : "blue" } }, Color red would match { the previous document. "term" : { "size" : "m" } } ] } } } } }' Tuesday, November 6, 12
  • 24. Nested facets curl -XPOST 'localhost:9200/products/product/_search' -d '{ "facets" : { "color" : { "terms_stats" : { "key_field" : "size", "value_field" : "price" }, "nested" : "offers" } "facets":{ } "color":{ }' "_type":"terms_stats", "missing":0, "terms":[ { A facet for nested field "term":"s", "count":2, offers. "total_count":2, "min":999.0, "max":999.0, "total":1998.0, "mean":999.0 }, ... Counts 2 nested documents ] for term: s } } Tuesday, November 6, 12
  • 25. Query time join and parent & child relations. Tuesday, November 6, 12
  • 26. Query time joining • Documents are joined during query time. • More expensive, but more flexible. • Two types of query time joins: • Parent child joining. • Field based joining. Tuesday, November 6, 12
  • 27. Lucene’s query time join • Query time joining is executed in two phases. • Field based joining: • ‘from’ field • ‘to’ field • Doesn’t require block indexing. Tuesday, November 6, 12
  • 28. Query time join - JoinUtil • Firstthe documents thatthe terms in the fromField for phase collects all match with the original query. • The second the collected termsdocumentsprevious match with phase returns the from the that phase in the toField. • One public method: • JoinUtil#createJoinQuery(...) Tuesday, November 6, 12
  • 29. Joining - JoinUtil Referrer the product id. Tuesday, November 6, 12
  • 30. Joining - JoinUtil Tuesday, November 6, 12
  • 31. Joining - JoinUtil Join utility • Result will contain one products. • Possible to do ‘join’ across indices. Tuesday, November 6, 12
  • 32. Elasticsearch’s query time join • A parent child solution. • Not related to Lucene’s query time join. • Support consists out of: • The _parent field. • The top_children query. • The has_parent & has_child filter & query. • Scoped facets. Tuesday, November 6, 12
  • 33. The _parent field • Points to the parent type. • Mapping attribute to be define on the child type. curl -XPUT 'localhost:9200/products' -d '{ "mappings" : { "offer" : { "_parent" : { "type" : "product" } } } }' • Elasticsearch uses the _parent field to build an id cache. • Makes parent/child queries & filters fast. Tuesday, November 6, 12
  • 34. Indexing parent & child documents • Parent document: curl -XPOST 'localhost:9200/products/product/1' -d '{ "name" : "Polo shirt", "description" : "Made of 100% cotton" }' The id of the parent document. Also used for • Child documents: routing. curl -XPOST 'localhost:9200/products/offer?parent=1' -d '{ "color" : "red", "size" : "s", "price" : 999 }' curl -XPOST 'localhost:9200/products/offer?parent=1' -d '{ "color" : "blue", "red", "size" : "s", "m", "price" : 999 1099 }' Tuesday, November 6, 12
  • 35. The ‘top_children’ query curl -XPOST 'localhost:9200/products/_search' -d '{ "query" : { Child type "top_children" : { "type" : "offer", "query" : { "term" : { Child query "size" : "m" } }, "score" : "sum" Score mode } } }' • Internally the in order to getpotentially executed several times child query is enough parent hits. Tuesday, November 6, 12
  • 36. The ‘has_child’ query curl -XPOST 'localhost:9200/products/_search' -d '{ Child type "query" : { "has_child" : { "type" : "offer", "query" : { "term" : { "size" : "m" Child query } } } } }' • Doesn’tdoc. Works as a filter. into the matching parent map the child scores • The has_parent query matches child document instead. Tuesday, November 6, 12
  • 37. Scoped facets curl -XPOST 'localhost:9200/products/_search' -d '{ "query" : { "has_child" : { "type" : "offer", "query" : { "term" : { "size" : "m" } }, "_scope" : "my_scope" } Execute facets inside }, "facets" : { a specific scope. "color" : { "terms_stats" : { "key_field" : "size", "value_field" : "price" }, "scope" : "my_scope" } } }' Tuesday, November 6, 12
  • 38. Conclusion • Block join & nested object are fast and efficient, but lack flexibility. • Query time and parent child join are flexible at the cost of performance and memory. • Field based query time joining is the most flexible. • Parent child based joining is the fastest. • Faceting in combination with document relations gives a nice analytical view. Tuesday, November 6, 12