Document relations
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

Document relations

on

  • 3,719 views

Talk about document relations in lucene and elasticsearch at ApacheCon.

Talk about document relations in lucene and elasticsearch at ApacheCon.

Statistics

Views

Total Views
3,719
Views on SlideShare
3,689
Embed Views
30

Actions

Likes
13
Downloads
71
Comments
0

2 Embeds 30

https://twitter.com 29
http://twitter.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Document relations Presentation Transcript

  • 1. Document relations Martijn van Groningen @mvgroningenTuesday, November 6, 12
  • 2. Overview • Background • Document relations with joining. • Various solutions in Lucene and ElasticsearchTuesday, November 6, 12
  • 3. Background - Lucene model • Lucene is document based. • Lucene doesn’t store information about relations between documents. • Data often holds relations. • Good free text search over relational data.Tuesday, November 6, 12
  • 4. Background - Common solutions • Compound documents. • May result in documents with many fields. • Subsequent searches. • May cause a lot of network overhead. • Non Lucene based approach: • Use Lucene in combination with a relational database.Tuesday, November 6, 12
  • 5. Background - Example • Product • Name • Description • Product-item • Color • Size • PriceTuesday, November 6, 12
  • 6. Background - Example • Compound Product & Product-items document. • Each product-item has its own field prefix.Tuesday, November 6, 12
  • 7. Background - Other solutions • Lucene offers solutions to have a relational like search. • Joining • Result grouping • Elasticsearch builds on top of the joining capabilities. • These solutions arent naturally supported.Tuesday, November 6, 12
  • 8. JoiningTuesday, November 6, 12
  • 9. Joining • Join support available since Lucene 3.4 • Not a SQL join! • Two distinct joining types: • Index time join • Query time join • Joining provides a solution to handle document relations.Tuesday, November 6, 12
  • 10. Joining - What is out there? • Index time: • Lucene’s block join implementation. • Elasticsearch’s nested filter, query and facets. • Built on top of the Lucene’s block join support. • Query time: • Lucene’s query time join utility. • Solr’s join query parser. • Elasticsearch’s various parent-child queries and filters.Tuesday, November 6, 12
  • 11. Index time join And nested documents.Tuesday, November 6, 12
  • 12. Joining - Block join query • Lucene block join queries: • ToParentBlockJoinQuery • ToChildBlockJoinQuery • Lucene collector: • ToParentBlockJoinCollector • Index time join requires block indexing.Tuesday, November 6, 12
  • 13. Joining - Block indexing • Atomically adding documents. • A block of documents. • Each document gets sequentially assigned Lucene document id. • IndexWriter#addDocuments(docs);Tuesday, November 6, 12
  • 14. Joining - Block indexing • Index doesnt record blocks. • App is responsible for identifying block documents. • Segment merging doesn’t re-order documents in a segment. • Adding athe whole block.block requires you to reindex document to aTuesday, November 6, 12
  • 15. Joining - block join query • Parent is the last document in a block.Tuesday, November 6, 12
  • 16. Block join - ToChildBlockJoinQuery Marking parent documentsTuesday, November 6, 12
  • 17. Block join - ToChildBlockJoinQuery Add block Add blockTuesday, November 6, 12
  • 18. Block join - ToChildBlockJoinQuery • Parent filter marks the parent documents. • Child query is executed in the parent space.Tuesday, November 6, 12
  • 19. Block join & Elasticsearch • In Elasticsearch exposed as nested objects. • Documents are constructed as JSON. • JSON’s nested structure works nicely with block indexing. • Elasticsearchoftakes nested documents. and also keeps track the care of block indexingTuesday, November 6, 12
  • 20. Elasticsearch’s nested support • Support for a nested type in mappings. • Nested query. • Nested filter. • Nested facets.Tuesday, November 6, 12
  • 21. Nested type • The nested types enables Lucene’s block indexing. index curl -XPUT localhost:9200/products -d { type "mappings" : { "product" : { "properties" : {Nested offers "offers" : { "type" : "nested" } } } } }Tuesday, November 6, 12
  • 22. Indexing nested objects index type curl -XPOST localhost:9200/products/product -d { "name" : "Polo shirt", "description" : "Made of 100% cotton", nested objects "offers" : [ { "color" : "red", "size" : "s", "price" : 999 }, { "color" : "red", "size" : "m", "price" : 1099 }, { "color" : "blue", "size" : "s", "price" : 999 } ] }Tuesday, November 6, 12
  • 23. Nested query The nested field curl -XPOST localhost:9200/products/product/_search -d { "query" : { path in mapping. "nested" : { "path" : "offers", "score_mode" : "total", Sum the individual "query" : { nested matches. "bool" : { "must" : [ { "term" : { "color" : "blue" } }, Color red would match { the previous document. "term" : { "size" : "m" } } ] } } } } }Tuesday, November 6, 12
  • 24. Nested facets curl -XPOST localhost:9200/products/product/_search -d { "facets" : { "color" : { "terms_stats" : { "key_field" : "size", "value_field" : "price" }, "nested" : "offers" } "facets":{ } "color":{ } "_type":"terms_stats", "missing":0, "terms":[ { A facet for nested field "term":"s", "count":2, offers. "total_count":2, "min":999.0, "max":999.0, "total":1998.0, "mean":999.0 }, ... Counts 2 nested documents ] for term: s } }Tuesday, November 6, 12
  • 25. Query time join and parent & child relations.Tuesday, November 6, 12
  • 26. Query time joining • Documents are joined during query time. • More expensive, but more flexible. • Two types of query time joins: • Parent child joining. • Field based joining.Tuesday, November 6, 12
  • 27. Lucene’s query time join • Query time joining is executed in two phases. • Field based joining: • ‘from’ field • ‘to’ field • Doesn’t require block indexing.Tuesday, November 6, 12
  • 28. Query time join - JoinUtil • Firstthe documents thatthe terms in the fromField for phase collects all match with the original query. • The second the collected termsdocumentsprevious match with phase returns the from the that phase in the toField. • One public method: • JoinUtil#createJoinQuery(...)Tuesday, November 6, 12
  • 29. Joining - JoinUtil Referrer the product id.Tuesday, November 6, 12
  • 30. Joining - JoinUtilTuesday, November 6, 12
  • 31. Joining - JoinUtil Join utility • Result will contain one products. • Possible to do ‘join’ across indices.Tuesday, November 6, 12
  • 32. Elasticsearch’s query time join • A parent child solution. • Not related to Lucene’s query time join. • Support consists out of: • The _parent field. • The top_children query. • The has_parent & has_child filter & query. • Scoped facets.Tuesday, November 6, 12
  • 33. The _parent field • Points to the parent type. • Mapping attribute to be define on the child type. curl -XPUT localhost:9200/products -d { "mappings" : { "offer" : { "_parent" : { "type" : "product" } } } } • Elasticsearch uses the _parent field to build an id cache. • Makes parent/child queries & filters fast.Tuesday, November 6, 12
  • 34. Indexing parent & child documents • Parent document: curl -XPOST localhost:9200/products/product/1 -d { "name" : "Polo shirt", "description" : "Made of 100% cotton" } The id of the parent document. Also used for • Child documents: routing. curl -XPOST localhost:9200/products/offer?parent=1 -d { "color" : "red", "size" : "s", "price" : 999 } curl -XPOST localhost:9200/products/offer?parent=1 -d { "color" : "blue", "red", "size" : "s", "m", "price" : 999 1099 }Tuesday, November 6, 12
  • 35. The ‘top_children’ query curl -XPOST localhost:9200/products/_search -d { "query" : { Child type "top_children" : { "type" : "offer", "query" : { "term" : { Child query "size" : "m" } }, "score" : "sum" Score mode } } } • Internally the in order to getpotentially executed several times child query is enough parent hits.Tuesday, November 6, 12
  • 36. The ‘has_child’ query curl -XPOST localhost:9200/products/_search -d { Child type "query" : { "has_child" : { "type" : "offer", "query" : { "term" : { "size" : "m" Child query } } } } } • Doesn’tdoc. Works as a filter. into the matching parent map the child scores • The has_parent query matches child document instead.Tuesday, November 6, 12
  • 37. Scoped facets curl -XPOST localhost:9200/products/_search -d { "query" : { "has_child" : { "type" : "offer", "query" : { "term" : { "size" : "m" } }, "_scope" : "my_scope" } Execute facets inside }, "facets" : { a specific scope. "color" : { "terms_stats" : { "key_field" : "size", "value_field" : "price" }, "scope" : "my_scope" } } }Tuesday, November 6, 12
  • 38. Conclusion • Block join & nested object are fast and efficient, but lack flexibility. • Query time and parent child join are flexible at the cost of performance and memory. • Field based query time joining is the most flexible. • Parent child based joining is the fastest. • Faceting in combination with document relations gives a nice analytical view.Tuesday, November 6, 12
  • 39. Any questions?Tuesday, November 6, 12