Nested and Parent/Child Docs in ElasticSearch

A key part of the architecture of RefWorks Flow, a new document workflow tool for researchers, is an ElasticSearch cluster used for citation canonicalization. We present our findings on how to use the "nested" type and parent-child relations in ElasticSearch to run complex where-clause queries efficiently.

  1. Nested & Parent/Child Docs: hidden gems in ElasticSearch. Anne Veling | ElasticSearch NL Meetup | February 26, 2013
  2. Agenda
     • RefWorks Flow: reference manager for researchers
     • Use of ElasticSearch in Flow
     • Use Case 1: Nested documents
     • Use Case 2: Parent/Child relations
     • Lessons learned
  3. Introduction: Anne Veling, @anneveling
     • Self-employed contractor
     • Software architect
     • Agile process management
     • Performance optimization
     • Lucene/SOLR/ElasticSearch implementations & training
  4. Tech stack
  5. Architecture (diagram): Flow, Mongo, ElasticSearch, Citation Authority, PDF Pipeline
  6. Use Case 1: Citation Canonicalization
  7. Reference Canonicalization
     • We built a large Citation Authority index in ElasticSearch, with full, deduped metadata for a large portion of English scientific research
     • In the Reference Edit screen: try to find high-quality matches in a large index of canonical references of scientific articles
     • Based on known fields: title (possibly partial and incorrect), author(s), other identifying fields (journal, year, …)
  8. Query:
     {
       "query": {
         "bool": {
           "must": [
             { "text": { "title": "market elasticity" } },
             { "text": { "authors.lastName": "Russell" } },
             { "text": { "authors.firstNames": "G" } }
           ]
         }
       }
     }
  9. Problem: searching on a sub-document
     • Searching for all documents where authors.lastName: "Russell" and authors.firstNames: "G"
     • Also matches documents by "Jack Russell and Frederickson, G"
     • We need a sub-document JOIN query…
     • Combined with other information on the parent document (title)
     • Oh noes! We're using a NoSQL database, so we can't… Can't we?
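The false positive described on this slide can be reproduced in a few lines. This is a minimal sketch in plain Python, not ElasticSearch code: it assumes that an array of plain sub-objects is effectively flattened into multi-valued fields at index time, and the sample document and function names are made up for illustration.

```python
# Hypothetical document, shaped like the example on the slide.
doc = {
    "title": "Market elasticity",
    "authors": [
        {"lastName": "Russell", "firstNames": "Jack"},
        {"lastName": "Frederickson", "firstNames": "G"},
    ],
}


def flatten(document):
    """Collapse the author sub-documents into multi-valued fields."""
    flat = {}
    for author in document["authors"]:
        for field, value in author.items():
            flat.setdefault("authors." + field, []).append(value)
    return flat


def matches_flattened(document, last_name, first_names):
    """Each clause is checked against the whole document, not one author."""
    flat = flatten(document)
    return (last_name in flat.get("authors.lastName", [])
            and first_names in flat.get("authors.firstNames", []))


def matches_per_author(document, last_name, first_names):
    """Both clauses must hold for the same author sub-document."""
    return any(a["lastName"] == last_name and a["firstNames"] == first_names
               for a in document["authors"])


print(matches_flattened(doc, "Russell", "G"))   # True: the unwanted match
print(matches_per_author(doc, "Russell", "G"))  # False: what we actually want
```

The second function is the behaviour a sub-document JOIN has to provide: the clauses are evaluated per author, and only then combined with conditions on the parent.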
  10. Lucene block indexing (diagram labels: query, term, term, Lucene documents)
      • Save "children" documents always right before their "parent" document
      • Requires you to write a BlockJoinQuery: ParentsFilter, ChildQuery, ToParentBlockJoinQuery
      • This means: all children (and the parent!) need to be reindexed upon any change in them…
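The block layout can be illustrated without Lucene. The sketch below is a plain-Python simulation of the idea only: children are appended right before their parent, and a child hit is mapped to the first parent doc ID that follows it. The function names are hypothetical and this is not the Lucene API.

```python
from bisect import bisect_left


def index_block(segment, parent_ids, children, parent):
    """Append the children first, then the parent, as one contiguous block."""
    for child in children:
        segment.append(child)
    segment.append(parent)
    parent_ids.append(len(segment) - 1)  # parent doc IDs stay sorted


def to_parent(child_hits, parent_ids):
    """Map sorted child doc IDs to the parent that closes their block."""
    parents = []
    for child_id in child_hits:
        parent_id = parent_ids[bisect_left(parent_ids, child_id)]
        if not parents or parents[-1] != parent_id:
            parents.append(parent_id)
    return parents


segment, parent_ids = [], []
index_block(segment, parent_ids,
            [{"lastName": "Russell", "firstNames": "Jack"},
             {"lastName": "Frederickson", "firstNames": "G"}],
            {"title": "Paper A"})
index_block(segment, parent_ids,
            [{"lastName": "Russell", "firstNames": "G"}],
            {"title": "Paper B"})

# Child query: both conditions must hold on the same child document.
child_hits = [doc_id for doc_id, d in enumerate(segment)
              if d.get("lastName") == "Russell" and d.get("firstNames") == "G"]
print([segment[p]["title"] for p in to_parent(child_hits, parent_ids)])
```

Because the join relies purely on document order, changing any child means rewriting the whole block, which is the reindexing cost noted on the slide.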
  11. Mapping:
      "authors": {
        "type": "nested",
        "properties": {
          "rawName": { "type": "string", "analyzer": "caName" },
          "lastName": { "type": "string", "analyzer": "caName" },
          "firstNames": { "type": "string", "analyzer": "caName", "null_value": "__NONAME" }
        }
      },
      "title": { "type": "string", "analyzer": "caText" }
  12. Query (shown in two columns on the slide):
      {
        "bool" : {
          "must" : [ {
            "text" : {
              "title" : { "query" : "market elasticity", "type" : "phrase", "slop" : 2 }
            }
          }, {
            "bool" : {
              "must" : {
                "nested" : {
                  "query" : {
                    "bool" : {
                      "should" : [ {
                        "bool" : {
                          "must" : [ {
                            "text" : { "lastName" : { "query" : "Russell", "type" : "boolean" } }
                          }, {
                            "bool" : {
                              "must" : {
                                "bool" : {
                                  "should" : [ {
                                    "text" : { "firstNames" : { "query" : "G", "type" : "boolean" } }
                                  }, {
                                    "prefix" : { "firstNames" : "g" }
                                  } ]
                                }
                              }
                            }
                          } ]
                        }
                      }, {
                        "filtered" : {
                          "query" : {
                            "text" : { "lastName" : { "query" : "Russell", "type" : "boolean", "operator" : "AND" } }
                          },
                          "filter" : {
                            "missing" : { "field" : "firstNames" }
                          }
                        }
                      } ]
                    }
                  },
                  "path" : "authors"
                }
              }
            }
          } ]
        }
      }
      Summary overlaid on the slide:
      (title:"market elasticity") AND (
        authors: (
          (lastName:"Russell") AND (
            (firstNames:"G") OR
            (firstNames:"g*") OR
            (lastName:"Russell" AND NOT(firstNames))
          )
        )
      )
  13. "nested"
      • Just set the sub-document type to "nested" in the mapping
      • Combine the parent query with a "nested" query that specifies the path
      • Complex subcombination JOIN operations
      • Automatic hiding of "nested" sub-documents
      • This will increase your index size
  14. "nested"
      • Efficient!
          • ElasticSearch handles document updates
          • Child where-clauses handled INSIDE the parent query docEnum
          • Children are sharded with their parents => locality!
      • Facet counts (on parent) still correct!
      • Limitations
          • Combinations of nested sub-documents with other queries, like "dis_max" or "text"
          • No automatic recognition of "authors.lastName" in other queries as a "nested" subquery
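Combining a parent query with a "nested" query that names the path can be sketched as a small builder. This is an illustration only: the helper name is made up, the query syntax mirrors the `text` queries used in these slides (later ElasticSearch releases renamed `text` to `match`), and the branch for authors without first names is left out to keep the example short.

```python
import json


def nested_author_query(title, last_name, first_initial):
    """Build a parent query on title plus a nested query on the authors path."""
    author_clause = {
        "bool": {
            "must": [
                {"text": {"lastName": {"query": last_name, "type": "boolean"}}},
                {"bool": {"should": [
                    {"text": {"firstNames": {"query": first_initial,
                                             "type": "boolean"}}},
                    {"prefix": {"firstNames": first_initial.lower()}},
                ]}},
            ]
        }
    }
    return {
        "bool": {
            "must": [
                # Condition on the parent document.
                {"text": {"title": {"query": title, "type": "phrase", "slop": 2}}},
                # Conditions that must hold within one author sub-document.
                {"nested": {"path": "authors", "query": author_clause}},
            ]
        }
    }


query = nested_author_query("market elasticity", "Russell", "G")
print(json.dumps(query, indent=2))
```

Keeping the author conditions inside a single `nested` clause is what ties the last name and first name to the same author.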
  15. Use Case 2: Multipage Indexing
  16. Architecture (diagram): Flow, Mongo, ElasticSearch, Citation Authority, PDF Pipeline, S3; a doc with its pages
  17. Problem: how to index both doc metadata and pages text
      • Doc in Flow app
      • Pages only in PDF pipeline and on S3
      • Docs updated frequently, in the Flow app
      • Reindexing a page would require downloading its text content from S3…
      • Nested docs? No; too slow for updates here…
  18. Solution: Parent/Child documents in ElasticSearch!
      • Store the parent type on the child type's mapping
      • To index a child, specify the parent ID
      • Stored as the "_parent" field on the child
      • Query: combine the parent query with a "has_child" child query
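Indexing a child under its parent can be sketched as follows. The `parent` URL parameter follows the ElasticSearch REST API of the era of this talk; the index name `flow`, the document IDs, and both helper functions are assumptions for illustration. The shard function is only a stand-in (CRC32 is not the hash ElasticSearch uses) to show why routing a child by its parent ID puts both on the same shard.

```python
import json
import zlib
from urllib.parse import quote


def child_index_request(index, child_type, child_id, parent_id, source):
    """Return (method, path, body) for indexing a child under a parent."""
    path = "/{}/{}/{}?parent={}".format(
        index, child_type, quote(child_id, safe=""), quote(parent_id, safe=""))
    return "PUT", path, json.dumps(source)


def shard_for(routing_value, number_of_shards):
    """Stand-in shard selection: hash of the routing value modulo shard count."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


method, path, body = child_index_request(
    "flow", "itemtext", "item-1-page-3", "item-1", {"text": "page text here"})
print(method, path)

# The parent is routed by its own ID, the child by its parent's ID.
parent_shard = shard_for("item-1", 5)
child_shard = shard_for("item-1", 5)  # routing value is the parent ID
print(parent_shard == child_shard)
```

Only the child request needs the page text, so the frequently updated parent document can be reindexed without fetching anything from S3.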
  19. Mapping:
      "itemtext": {
        "properties": {
          "text": { "type": "string", "analyzer": "pqdText" }
        },
        "_parent": { "type": "item" }
      }
  20. Query:
      {
        "bool" : {
          "must" : [ {
            "bool" : {
              "should" : [ {
                "query_string" : {
                  "query" : "elasticity",
                  "fields" : [
                    "item.reference.title^2.0",
                    "item.reference.authors.lastName^1.5",
                    "item.reference.authors.firstNames",
                    "item.reference.authors.rawName",
                    "item.reference.contributors.lastName",
                    "item.reference.contributors.firstNames",
                    "item.reference.contributors.rawName",
                    "item.reference.abstr",
                    "item.reference.publication.title^1.5",
                    "item.reference.publication.issn",
                    "item.reference.publication.isbn",
                    "item.reference.publication.abbrev",
                    "item.reference.series.editors.lastName",
                    "item.reference.series.editors.firstNames",
                    "item.reference.series.rawName",
                    "item.reference.series.title",
                    "item.reference.publisher.name",
                    "item.reference.publisher.location",
                    "item.reference.publisher.department",
                    "item.reference.userNotes",
                    "item.annotations.note^0.5"
                  ],
                  "use_dis_max" : true,
                  "default_operator" : "and"
                }
              }, {
                "has_child" : {
                  "query" : {
                    "text" : {
                      "text" : { "query" : "elasticity", "type" : "boolean", "operator" : "AND" }
                    }
                  },
                  "type" : "itemtext",
                  "boost" : 0.1
                }
              } ]
            }
          }, {
            "term" : { "userId" : "user:50a3bd090364f635f24c713c" }
          } ]
        }
      }
  21. (Image) NOT SO SURE WHO IS PARENT, WHO IS CHILD IN PARENT-CHILD RELATIONS
  22. Conclusions
      • Parent/Child "remote key" solution in ElasticSearch
          • Easy connection of two types of documents with separate update cycles
          • Complex JOIN queries possible, combining parent & child fields
          • Slower than "nested"
          • Locality principle: children are always sharded with their parent
      • Limitations
          • The has_child filter returns only parents; it cannot return child data
          • But: there is a has_parent filter
          • ElasticSearch caches the parent-child ID table in heap…
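The two properties listed here, separate update cycles and a has_child that returns parents only, can be illustrated with an in-memory model. This is a plain-Python sketch of the semantics, not ElasticSearch code; the data and function names are hypothetical.

```python
# Parents (items) and children (page texts) live in separate collections.
items = {
    "item-1": {"title": "Market elasticity"},
    "item-2": {"title": "Unrelated paper"},
}
pages = {
    "page-1": {"_parent": "item-1", "text": "introduction"},
    "page-2": {"_parent": "item-2", "text": "price elasticity of demand"},
}


def has_child(children, predicate):
    """Return the IDs of parents with a matching child; no child data."""
    return {child["_parent"] for child in children.values() if predicate(child)}


def search(parents, children, term):
    """Match the term on the parent title OR on any of its child pages."""
    via_children = has_child(children, lambda c: term in c["text"])
    return sorted(pid for pid, parent in parents.items()
                  if term in parent["title"].lower() or pid in via_children)


print(search(items, pages, "elasticity"))  # both items match

# Updating a parent does not touch its children.
items["item-2"] = {"title": "Renamed paper"}
print(search(items, pages, "elasticity"))  # item-2 still matches via page-2
```

The result only ever contains item IDs, which mirrors the limitation above: to show the matching page itself, a separate query on the child type would be needed.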
  23. Conclusions
      • Complex join-style queries can be done with ElasticSearch
          • Easily
          • Efficiently
      • Use "nested" types
          • If data can be duplicated
          • Very efficient
      • Use "parent/child" types
          • For real independently updateable documents
      SQL shown on the slide:
      SELECT * FROM ARTICLES
      LEFT JOIN AUTHORS ON AUTHORS.ARTICLEID = ARTICLES.ID
      WHERE ARTICLES.TITLE MATCHES "market elasticity"
        AND AUTHORS.LASTNAME MATCHES "Russell"
        AND AUTHORS.FIRSTNAME MATCHES "G"
  24. Conclusions
      • ElasticSearch rocks
          • Hides the mapping from complex JSON documents to Lucene's key/value model
          • Allows you to easily use more of Lucene's greatness
          • So you can focus on actual queries and use cases
      • NoSQL does not mean NoJoins
          • It just forces you to model in such a way that joins will be efficient
  25. ElasticSearch "nested" types: the best thing since sliced bread. Thank you. anne@beyondtrees.com, @anneveling