Nested and Parent/Child Docs in ElasticSearch

A key part of the architecture of RefWorks Flow, a new document workflow tool for researchers, is an ElasticSearch cluster used for citation canonicalization. We will present our findings on how to use the "nested" type and parent-child relations in ElasticSearch to do complex where-clause queries in an efficient way.


Transcript

  • 1. Nested & Parent/Child Docs: hidden gems in ElasticSearch. Anne Veling | ElasticSearch NL Meetup | February 26, 2013
  • 2. Agenda: RefWorks Flow, a reference manager for researchers; use of ElasticSearch in Flow; Use Case 1: nested documents; Use Case 2: parent/child relations; lessons learned.
  • 3. Introduction: Anne Veling, @anneveling. Self-employed contractor: software architect; agile process management; performance optimization; Lucene/SOLR/ElasticSearch implementations & training.
  • 4. tech stack
  • 5. Architecture (diagram labels): Flow, Mongo, ElasticSearch, Citation Authority, PDF Pipeline.
  • 6. Citation Canonicalization: Use Case 1
  • 7. Reference Canonicalization. We built a large Citation Authority index in ElasticSearch, with full, deduped metadata for a large portion of English scientific research. In the Reference Edit screen, we try to find high-quality matches in a large index of canonical references of scientific articles, based on known fields: title (possibly partial and incorrect), author(s), and other identifying fields (journal, year, …).
  • 8. Query:
      {
        "query": {
          "bool": {
            "must": [
              { "text": { "title": "market elasticity" } },
              { "text": { "authors.lastName": "Russell" } },
              { "text": { "authors.firstNames": "G" } }
            ]
          }
        }
      }
  • 9. Problem: searching on a sub-document. Searching for all documents where authors.lastName: “Russell” and authors.firstNames: “G” also matches documents by “Jack Russell and Frederickson, G”. We need a sub-document JOIN query, combined with other information on the parent document (title). Oh noes! We're using a NoSQL database, so we can't… Can't we?
  • 10. Lucene block indexing. Save “children” documents always right before their “parent” document. Requires you to write a BlockJoinQuery: ParentsFilter, ChildQuery, ToParentBlockJoinQuery. This means all children (and the parent!) need to be reindexed upon any change in them… (Diagram labels: query, term, term, lucene documents.)
  • 11. Mapping:
      "authors": {
        "properties": {
          "rawName":    { "analyzer": "caName", "type": "string" },
          "lastName":   { "analyzer": "caName", "type": "string" },
          "firstNames": { "null_value": "__NONAME", "analyzer": "caName", "type": "string" }
        },
        "type": "nested"
      },
      "title": { "analyzer": "caText", "type": "string" }
  • 12. Nested query:
      {
        "bool" : {
          "must" : [ {
            "text" : {
              "title" : { "query" : "market elasticity", "type" : "phrase", "slop" : 2 }
            }
          }, {
            "bool" : {
              "must" : {
                "nested" : {
                  "query" : {
                    "bool" : {
                      "should" : [ {
                        "bool" : {
                          "must" : [ {
                            "text" : { "lastName" : { "query" : "Russell", "type" : "boolean" } }
                          }, {
                            "bool" : {
                              "must" : {
                                "bool" : {
                                  "should" : [ {
                                    "text" : { "firstNames" : { "query" : "G", "type" : "boolean" } }
                                  }, {
                                    "prefix" : { "firstNames" : "g" }
                                  } ]
                                }
                              }
                            }
                          } ]
                        }
                      }, {
                        "filtered" : {
                          "query" : {
                            "text" : { "lastName" : { "query" : "Russell", "type" : "boolean", "operator" : "AND" } }
                          },
                          "filter" : {
                            "missing" : { "field" : "firstNames" }
                          }
                        }
                      } ]
                    }
                  },
                  "path" : "authors"
                }
              }
            }
          } ]
        }
      }
      Summary of the query as shown on the slide:
      (title:"market elasticity") AND (
        authors: (
          (lastName:"Russell") AND (
            (firstNames:"G") OR
            (firstNames:"g*") OR
            (lastName:"Russell" AND NOT(firstNames))
          )
        )
      )
  • 13. “nested”: Just set the subdocument type to “nested” in the mapping. Combine the parent query with a “nested” query that specifies the path. Complex subcombination JOIN operations. Automatic hiding of “nested” subdocuments. This will increase your index size.
  • 14. “nested”: Efficient! ElasticSearch handles document updates. Child where-clauses are handled INSIDE the parent query docEnum. Children are sharded with their parents => locality! Facet counts (on the parent) are still correct! Limitations: combinations of nested subdocuments with other queries, like “dis_max” or “text”; no automatic recognition of “authors.lastName” in other queries as a “nested” subquery.
  • 15. Multipage Indexing: Use Case 2
  • 16. Architecture (diagram labels): Flow, Mongo, ElasticSearch, Citation Authority, PDF Pipeline, S3; a doc with three pages.
  • 17. Problem: how to index both Doc metadata and Pages text? The Doc lives in the Flow app; the Pages exist only in the PDF pipeline and on S3. Docs are updated frequently, in the Flow app. Reindexing a Page would require downloading the text content from S3… Nested docs? No; too slow for updates here…
  • 18. Solution: Parent/Child documents in ElasticSearch! Store the parent type on the children's type mapping. To index a child, specify the parent ID; it is stored as the “_parent” field on the child. Query: combine the parent query with a “has_child” child query.
  • 19. Mapping:
      "itemtext": {
        "properties": {
          "text": { "analyzer": "pqdText", "type": "string" }
        },
        "_parent": { "type": "item" }
      }
  • 20. Query:
      {
        "bool" : {
          "must" : [ {
            "bool" : {
              "should" : [ {
                "query_string" : {
                  "query" : "elasticity",
                  "fields" : [
                    "item.reference.title^2.0",
                    "item.reference.authors.lastName^1.5",
                    "item.reference.authors.firstNames",
                    "item.reference.authors.rawName",
                    "item.reference.contributors.lastName",
                    "item.reference.contributors.firstNames",
                    "item.reference.contributors.rawName",
                    "item.reference.abstr",
                    "item.reference.publication.title^1.5",
                    "item.reference.publication.issn",
                    "item.reference.publication.isbn",
                    "item.reference.publication.abbrev",
                    "item.reference.series.editors.lastName",
                    "item.reference.series.editors.firstNames",
                    "item.reference.series.rawName",
                    "item.reference.series.title",
                    "item.reference.publisher.name",
                    "item.reference.publisher.location",
                    "item.reference.publisher.department",
                    "item.reference.userNotes",
                    "item.annotations.note^0.5"
                  ],
                  "use_dis_max" : true,
                  "default_operator" : "and"
                }
              }, {
                "has_child" : {
                  "query" : {
                    "text" : {
                      "text" : { "query" : "elasticity", "type" : "boolean", "operator" : "AND" }
                    }
                  },
                  "type" : "itemtext",
                  "boost" : 0.1
                }
              } ]
            }
          }, {
            "term" : { "userId" : "user:50a3bd090364f635f24c713c" }
          } ]
        }
      }
  • 21. NOT SO SURE WHO IS PARENT, WHO IS CHILD IN PARENT-CHILD RELATIONS
  • 22. Conclusions: Parent/Child is a 'remote key' solution in ElasticSearch. Easy connection of two types of documents with separate update cycles. Complex JOIN queries are possible, combining parent & child fields. Slower than “nested”. Locality principle: children are always sharded with their parent. Limitations: the has_child filter returns only parents and cannot return child data (but: there is a has_parent filter); ElasticSearch caches the parent-child ID table in heap…
  • 23. Conclusions: Complex join-style queries can be done with ElasticSearch, easily and efficiently:
      SELECT * FROM ARTICLES
      LEFT JOIN AUTHORS ON AUTHORS.ARTICLEID = ARTICLES.ID
      WHERE ARTICLES.TITLE MATCHES "market elasticity"
        AND AUTHORS.LASTNAME MATCHES "Russell"
        AND AUTHORS.FIRSTNAME MATCHES "G"
      Use “nested” types if data can be duplicated: very efficient. Use “parent/child” types for real, independently updateable documents.
  • 24. Conclusions: ElasticSearch rocks. It hides the complex mapping from JSON documents to the Lucene key/value model and allows you to easily use more of Lucene's greatness, so you can focus on actual queries and use cases. NoSQL does not mean NoJoins: it just forces you to model in such a way that joins will be efficient.
  • 25. ElasticSearch “nested” types: the best thing since sliced bread. Thank you. anne@beyondtrees.com, @anneveling
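
The false match described on slide 9 can be illustrated with a small in-memory sketch. This is not how ElasticSearch or Lucene is implemented; it assumes exact, case-insensitive matching instead of analyzed text, and the function names `flattened_match` and `nested_match` are hypothetical.

```python
from typing import Dict, List

Author = Dict[str, str]


def flattened_match(authors: List[Author], last: str, first: str) -> bool:
    """Non-nested matching: each field is checked against the values pooled
    from ALL authors, so the two hits may come from different authors."""
    last_names = {a["lastName"].lower() for a in authors}
    first_names = {a["firstNames"].lower() for a in authors}
    return last.lower() in last_names and first.lower() in first_names


def nested_match(authors: List[Author], last: str, first: str) -> bool:
    """Nested-style matching: both conditions must hold for the SAME author."""
    return any(
        a["lastName"].lower() == last.lower()
        and a["firstNames"].lower() == first.lower()
        for a in authors
    )


# The example from the slide: "Jack Russell and Frederickson, G".
doc_authors = [
    {"lastName": "Russell", "firstNames": "Jack"},
    {"lastName": "Frederickson", "firstNames": "G"},
]

flat_result = flattened_match(doc_authors, "Russell", "G")
nested_result = nested_match(doc_authors, "Russell", "G")
print(flat_result, nested_result)
```

The pooled check accepts the document even though no single author is "Russell, G"; the per-author check rejects it, which is the behavior the “nested” type is meant to provide.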
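
The combination described on slide 13 (a parent-level clause plus a “nested” clause that names its path) can be sketched as a small query builder. This is a simplified version of the slide 12 query: it omits the branch for authors without first names, it only builds the request body and does not contact a cluster, and the helper name `nested_author_query` is hypothetical. The `text` query type and the field names are taken from the slides; which query types are available depends on the ElasticSearch version in use.

```python
import json
from typing import Any, Dict


def nested_author_query(title: str, last_name: str, first_initial: str) -> Dict[str, Any]:
    """Build a title clause on the parent document combined with a
    "nested" clause evaluated per author sub-document."""
    # Conditions that must all hold for one and the same author.
    author_clause = {
        "bool": {
            "must": [
                {"text": {"lastName": {"query": last_name, "type": "boolean"}}},
                {
                    "bool": {
                        "should": [
                            {"text": {"firstNames": {"query": first_initial, "type": "boolean"}}},
                            {"prefix": {"firstNames": first_initial.lower()}},
                        ]
                    }
                },
            ]
        }
    }
    return {
        "bool": {
            "must": [
                # Parent-level condition on the title.
                {"text": {"title": {"query": title, "type": "phrase", "slop": 2}}},
                # Sub-document condition, scoped to the "authors" path.
                {"nested": {"path": "authors", "query": author_clause}},
            ]
        }
    }


query = nested_author_query("market elasticity", "Russell", "G")
print(json.dumps(query, indent=2))
```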
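
The parent/child behavior summarized on slides 18 and 22 can be modeled with a toy in-memory example. It is a sketch under stated assumptions, not ElasticSearch's implementation: matching is a plain whitespace-token comparison, the sample documents are invented for illustration, and the helper names `contains_term`, `has_child`, and `search` are hypothetical. It shows two points from the slides: a has_child-style lookup returns parent IDs only, and a parent can be updated without touching its children.

```python
from typing import Dict, List, Set

# Parent documents ("item"), keyed by ID; these are updated often by the app.
items: Dict[str, dict] = {
    "item1": {"title": "Market elasticity", "userId": "u1"},
    "item2": {"title": "Price theory", "userId": "u1"},
    "item3": {"title": "Elasticity", "userId": "u2"},
}

# Child documents ("itemtext"); each carries its page text and a _parent ID.
pages: List[dict] = [
    {"_parent": "item2", "text": "notes on elasticity of demand"},
    {"_parent": "item3", "text": "unrelated page"},
]


def contains_term(text: str, term: str) -> bool:
    """Case-insensitive whole-token match (a stand-in for analyzed text search)."""
    return term.lower() in text.lower().split()


def has_child(children: List[dict], term: str) -> Set[str]:
    """Return the IDs of parents with at least one matching child.
    Only parent IDs come back; the child data itself is not returned."""
    return {c["_parent"] for c in children if contains_term(c["text"], term)}


def search(term: str, user_id: str) -> List[str]:
    """Match on the parent title OR on any child page, filtered by user."""
    child_hits = has_child(pages, term)
    return sorted(
        pid
        for pid, item in items.items()
        if item["userId"] == user_id
        and (contains_term(item["title"], term) or pid in child_hits)
    )


before = search("elasticity", "u1")
items["item2"]["title"] = "Pricing"  # update the parent only; pages are untouched
after = search("elasticity", "u1")
print(before, after)
```

Because the page text lives in separate child records, changing a parent's metadata does not require re-reading or reindexing its pages, which is the reason the slides give for choosing parent/child over “nested” in this use case.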