Advertisement
Advertisement

More Related Content

Similar to Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupta & Alisa Zhila, IBM Watson(20)

Advertisement

More from Lucidworks(20)

Advertisement

Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupta & Alisa Zhila, IBM Watson

  1. O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A
  2. Working with deeply nested documents in Apache Solr Anshum Gupta, Alisa Zhila IBM Watson
  3. 3 Anshum Gupta • Apache Lucene/Solr committer and PMC member • Search guy @ IBM Watson. • Interested in search and related stuff. • Apache Lucene since 2006 and Solr since 2010.
  4. 4 Alisa Zhila • Apache Lucene/Solr supporter :) • Natural Language Processing technologies @ IBM Watson • Interested in search and related stuff
  5. 5 Agenda • Hierarchical Data/Nested Documents • Indexing Nested Documents • Querying Nested Documents • Faceting on Nested Documents
  6. Hierarchical Documents
  7. 7 • Social media comments, Email threads, Annotated data - AI • Relationship between documents • Possibility to flatten Need for nested data EXAMPLE: Blog Post with Comments Peter Navarro outlines the Trump economic plan Tyler Cowen, September 27, 2016 at 3:07am Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports. 1 Ray Lopez September 27, 2016 at 3:21 am I’ll be the first to say this, but the analysis is flawed. {negative} 2 Brian Donohue September 27, 2016 at 9:20 am The math checks out. Solid. {positive} examples from http://marginalrevolution.com
  8. 8 • Can not flatten, need to retain context • Relationship between documents • Get all 'positive comments' to 'posts about Trump' -- IMPOSSIBLE!!! Nested Documents EXAMPLE: Data Flattening Title: Peter Navarro outlines the Trump economic plan Author: Tyler Cowen Date: September 27, 2016 at 3:07am Body: Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports. Comment_authors: [Ray Lopez, Brian Donohue] Comment_dates: [September 27, 2016 at 3:21 am, September 27, 2016 at 9:20 am] Comment_texts: ["I’ll be the first to say this, but the analysis is flawed.", "The math checks out. Solid."] Comment_sentiments: [negative, positive]
  9. 9 • Can not flatten, need to retain context • Relationship between documents • Get all 'positive comments' to 'posts about Trump' -- POSSIBLE!!! (stay tuned) Nested Documents EXAMPLE: Hierarchical Documents Type: Post Title: Peter Navarro outlines the Trump economic plan Author: Tyler Cowen Date: September 27, 2016 at 3:07am Body: Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports. Type: Comment Author: Ray Lopez Date: September 27, 2016 at 3:21 am Text: I’ll be the first to say this, but the analysis is flawed. Sentiment: negative Type: Comment Author: Brian Donohue Date: September 27, 2016 at 9:20 am Text: The math checks out. Solid. Sentiment: positive
  10. 10 • Blog Post Data with Comments and Replies from http://marginalrevolution.com (cured) • 2 posts, 2-3 comments per post, 0-3 replies per comment • Extracted keywords & sentiment data • 4 levels of "nesting" • Too big to show on slides • Data + Scripts + Demo Queries: • https://github.com/alisa-ipn/solr- revolution-2016-nested-demo Running Example
  11. Indexing Nested Documents
  12. 12 • Nested XML • JSON Documents • Add _childDocument_ tags for all children • Pre-process field names to FQNs • Lose information, or add that as meta-data during pre-processing • JSON Document endpoint (6x only) - /update/json/docs • Field name mappings • Child Document splitting - Enhanced support coming soon. Sending Documents to Solr
  13. 13 solr-6.2.1$ bin/post -c demo-xml ./data/example-data.xml Sending Documents to Solr: Nested XML <add> <doc> <field name="type">post</field> <field name="author"> "Alex Tabarrok"</field> <field name="title">"The Irony of Hillary Clinton’s Data Analytics"</ field> <field name="body">"Barack Obama’s campaign adopted data but Hillary Clinton’s campaign has been molded by data from birth."</field> <field name="id">"12015-24204"</field> <doc> <field name="type">comment</field> <field name="author">"Todd"</field> <field name="text">"Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world."</field> <field name="sentiment">"positive"</field> <field name="id">"29798-24171"</field> <doc> <field name="type">reply</field> <field name="author">"The Other Jim"</field> <field name="text">"No, she lost because (1) she is thoroughly detested person and (2) the DNC decided Obama should therefore win."</field> <field name="sentiment">"negative"</field> <field name="id">"29798-21232"</field> </doc> </doc> </doc> </add>
  14. 14 • Add _childDocument_ tags for all children • Pre-process field names to FQNs • Lose information, or add that as meta-data during pre-processing solr-6.2.1$ bin/post -c demo-solr-json ./data/small-example-data-solr.json -format solr Sending Documents to Solr: JSON Documents [{ "path": "1.posts", "id": "28711", "author": "Alex Tabarrok", "title": "The Irony of Hillary Clinton’s Data Analytics", "body": "Barack Obama’s campaign adopted data but Hillary Clinton’s campaign has been molded by data from birth.", "_childDocuments_": [ { "path": "2.posts.comments", "id": "28711-19237", "author": "Todd", "text": "Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world.", "sentiment": "positive", "_childDocuments_": [ { "path": "3.posts.comments.replies", "author": "The Other Jim", "id": "28711-12444", "sentiment": "negative", "text": "No, she lost because (1) she is thoroughly detested person and (2) the DNC decided Obama should therefore win." }]}]}]
  15. 15 • JSON Document endpoint (6x only) - /update/json/docs • Field name mappings • Child Document splitting - Enhanced support coming soon. solr-6.2.1$ curl 'http://localhost:8983/solr/gettingstarted/update/json/docs? split=/|/posts|/posts/comments|/posts/comments/replies&commit=true' --data- binary @small-example-data.json -H ‘Content-type:application/json' NOTE: All documents must contain a unique ID. Sending Documents to Solr: JSON Endpoint
  16. 16 • Update Request Processors don’t work with nested documents • Example: • UUID update processor does not auto-add an id for a child document. • Workaround: • Take responsibility at the client layer to handle the computation for nested documents. • Change the update processor in Solr to handle nested documents. Update Processors and Nested Documents
  17. 17 • The entire block needs reindexing • Forgot to add a meta-data field that might be useful? Complete reindex • Store everything in Solr IF • it’s too expensive to reconstruct the doc from original data source • No access to data anymore e.g. streaming data Re-Indexing Your Documents
  18. 18 • Various ways to index nested documents • Need to re-index entire block Nested Document Indexing Summary
  19. Let’s ask some interesting questions
  20. 20 { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "path":["3.posts.comments.keywords"], "text":["Trump"]}, { "path":["2.posts.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports."], "path":["1.posts"]}, { "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]} Easy question first Find all documents that mention Trump q=text:Trump
  21. 21 { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]}, { "text":["No one goes to Clinton rallies while tens of thousands line up to see Trump, data-mining leads to a fantasy view of the World."], "path":["2.posts.comments"]} Returning certain types of documents Find all comments and replies that mention Trump q=(path:2.posts.comments OR path:3.posts.comments.replies) AND text:Trump Recipe: At the data pre-processing stage, add a field that indicates document type and also its path in the hierarchy (-- stay tuned):
  22. 22 { "path":["3.posts.comments.keywords"], "sentiment":["positive"], "text":["Hillary"]}, { "path":["4.posts.comments.replies.keywords"], "sentiment":["negative"], "text":["Hillary"]}, { "path":["2.posts.keywords"], "text":["Hillary"]} Returning similar type from different level Find all keywords that are Hillary q=path:*.keywords AND text:Hillary Recipe: Use wild-cards in the field that stores the hierarchy path
  23. Cross-Level Querying
  24. 24 { "path":["3.posts.comments.keywords"], "sentiment":["positive"], "text":["Hillary"]}, { "path":["4.posts.comments.replies.keywords"], "sentiment":["negative"], "text":["Hillary"]}, { "path":["2.posts.keywords"], "text":["Hillary"]} Recap so far... Find all keywords that are Hillary q=path:*.keywords AND text:Hillary We're querying precisely for documents which we provide a search condition for Query Level 3 Result Level 3 Query Level 4 Result Level 4 Query Level 2 Result Level 2
  25. 25 Returning parents by querying children: Block Join Parent Query Find all comments whose keywords detected positive sentiment towards Hillary q={!parent which="path:2.posts.comments"}path:3.posts.comments.keywords AND text:Hillary AND sentiment:positive Query Level 3 Result Level 2 { "author":["Brian Donohue"], "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]}, { "author":["Todd"], "text":["Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world."], "path":["2.posts.comments"]}
  26. 26 { "sentiment":["negative"], "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "sentiment":["neutral"], "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]}, { "sentiment":["positive"], "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"]} Returning children by querying parents: Block Join Child Query Find replies to negative comments q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative&fq=path:3.posts.comments.replies Query Level 2 Result Level 3
  27. 27 { "sentiment":["negative"], "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "sentiment":["neutral"], "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]}, { "sentiment":["positive"], "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"]} Returning children by querying parents: Block Join Child Query Find replies to negative comments q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative&fq=path:3.posts.comments.replies Query Level 2 Result Level 3 Block Join Child Query + Filtering Query A bit counterintuitive and non-symmetrical to the BJPQ
  28. 28 { "path":["4.posts.comments.replies.keywords"], "id":"17413-13550", "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"], "id":"17413-66188"}, { "path":["3.posts.comments.keywords"], "id":"12413-12487", "text":["Hillary"]}, { "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"], "id":"12413-10998"} Returning all document's descendants Block Join Child Query Find all descendants of negative comments q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative Query Level 2 Results Level 3 Results Level 4
  29. 29 Returning all document's descendants Block Join Child Query Find all descendants of negative comments q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative Query Level 2 Results Level 3 Results Level 4 { "path":["4.posts.comments.replies.keywords"], "id":"17413-13550", "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"], "id":"17413-66188"}, { "path":["3.posts.comments.keywords"], "id":"12413-12487", "text":["Hillary"]}, { "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"], "id":"12413-10998"} Issue: no grouping by parent What if we want to bring the whole sub-structure?
  30. 30 Find all negative comments and return them with all their descendants q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*] Query Level 2 Result Level 2 sub- hierarchy Returning document with all descendants: ChildDocTransformer { "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "path":["4.posts.comments.replies.keywords"], "text":["U.S."]}, { "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]} ] }, ... Issue: the "sub-hierarchy" is flat
  31. • Returns all descendant documents along with the queried document • flattens the sub-hierarchy • Workarounds: • Reconstruct the document using path ("path":["3.posts.comments.replies"]) information in case you want the entire subtree (result post-processing) • use childFilter in case you want a specific level 31 “This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document." (ChildDocTransformer cwiki) Returning document with all descendants: ChildDocTransformer
  32. 32 Find all negative comments and return them with all replies to them q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=path:3.posts.comments.replies] { "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]} ] }, ... Returning document with specific descendants: ChildDocTransformer + childFilter Query Level 2:comments Result Level 2:comments + Level 3:replies
  33. 33 Find all negative comments and return them with all their descendants that mention Trump q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=text:Trump] { "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]} ] }, ... Returning document with queried descendants: ChildDocTransformer + childFilter Query Level 2:comments Result Level 2:comments + sub-levels Issue: cannot use boolean expressions in childFilter query
  34. 34 Cross-Level Querying Mechanisms: • Block Join Parent Query • Block Join Children Query • ChildDocTransformer Good points: • overlapping & complementary features • good capabilities of querying direct ancestors/descendants • possible to query on siblings of different type Drawbacks: • need for data-preprocessing for better querying flexibility • limited support of querying over non-directly related branches (overcome with graphs?) • flattening nested data (additional post-processing is needed for reconstruction) Nested Document Querying Summary
  35. Faceting on Nested Documents
  36. 36 • Solr allows faceting on nested documents! • Two mechanisms for faceting: • Faceting with JSON Facet API (since Solr 5.3) • Block Join Faceting (since Solr 5.5) Faceting on Nested Documents
  37. 37 q=path:2.posts.comments AND sentiment:positive& json.facet={ most_liked_authors : { type: terms, field: author, domain: { blockParent : "path:1.posts"} }} Faceting on parents by descendants JSON Facet API: Parent Domain Count authors of the posts that received positive comments "most_liked_authors":{ "buckets":[ { "val":"Alex Tabarrok", "count":1}, { "val":"Tyler Cowen", "count":1} ] } Query Level 2 Facet Level 1
  38. 38 Faceting on descendants by ancestors JSON Facet API: Child Domain Distribution of keywords that appear in comments and replies by the top-level posts Query Level 1 Facet Descendant Levels "top_keywords":{ "buckets":[{ "val":"hillary", "count":4, "counts_by_posts":2}, { "val":"trump", "count":3, "counts_by_posts":2}, { "val":"dnc", "count":1, "counts_by_posts":1}, { "val":"obama", "count":2, "counts_by_posts":1}, { "val":"u.s", "count":1, "counts_by_posts":1} ]}
  39. 39 q=path:1.posts&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:1.posts" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_posts desc", facet: { counts_by_posts: "unique(_root_)" }}}}} Faceting on descendants by ancestors JSON Facet API: Child Domain Distribution of keywords that appear in comments and replies by the top-level posts Query Level 1 Facet Descendant Levels
  40. 40 Faceting on descendants by top-level ancestor JSON Facet API: Child Domain Distribution of keywords that appear in comments and replies by the top-level posts Query Level 1 Facet Descendant Levels Issue: only the top-ancestor gets the unique "_root_" field by default q=path:1.posts&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:1.posts" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_posts desc", facet: { counts_by_posts: "unique(_root_)" }}}}}
  41. 41 q=path:2.posts.comments&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:2.posts.comments" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_comments desc", facet: { counts_by_comments: "unique(2.posts.comments-id)" }}}}} Faceting on descendants by intermediate ancestors JSON Facet API: Child Domain + unique fields Distribution of keywords that appear in comments and replies by the comments Query Level 2 Facet Descendant Levels At pre-processing, introduce unique fields for each level
  42. 42 Faceting on descendants by intermediate ancestors JSON Facet API: Child Domain + unique fields Distribution of keywords that appear in comments and replies by the comments Query Level 2 Facet Descendant Levels "top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, { "val":"Obama", "count":2, "counts_by_comments":1}, { "val":"U.S.", "count":1, "counts_by_comments":1} ]}
  43. Now let's try the same using Block Join Faceting
  44. 44 • Experimental Feature • Needs to be turned on explicitly in solrconfig.xml More info: https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting Block Join Faceting
  45. 45 bjqfacet?q={!parent which=path:2.posts.comments} path:*.comments*keywords&rows=0&facet=true&child.facet.field=text Faceting on descendants by ancestors #2: Block Join Faceting on Children Domain Distribution of keywords that appear in comments and replies by the comments "facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] } Query Level 2 Facet Descendant Levels
  46. 46 bjqfacet?q={!parent which=path:2.posts.comments} path:*.comments*keywords&rows=0&facet=true&child.facet.field=text Faceting on descendants by ancestors #2: Block Join Faceting on Children Domain Distribution of keywords that appear in comments and replies by the comments "facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] } Query Level 2 Facet Descendant Levels bjqfacet request handler instead of query
  47. 47 Output Comparison Block Join Facet JSON Facet API "facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] } "top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, { "val":"Obama", "count":2, "counts_by_comments":1}, { "val":"U.S.", "count":1, "counts_by_comments":1} ]} Distribution of keywords that appear in comments and replies by the comments
  48. 48 Output Comparison Block Join Facet JSON Facet API "facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] } "top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, ... Distribution of keywords that appear in comments and replies by the comments Output is sorted in alphabetical order. It cannot be changed facet:{ top_keywords : { ... sort: "counts_by_comments desc" }}}
  49. 49 JSON Facet API: • Experimental - but more mature • More developed and established feature • bulky JSON syntax • faceting on children by non-top level ancestors requires introducing unique branch identifiers similar to "_root_" on each level Block Join Facet: • Experimental feature • Lacks controls: sorting, limit... • traditional query-style syntax • proper handling of faceting on children by non-top level ancestors Hierarchical Faceting Summary
  50. 50 • Returning hierarchical structure • JSON facet rollups is in the works - SOLR-8998 • Graph querying might replace a lot of functionalities of cross-level querying - No distributed support right now. • There’s more but the community would love to have more people involved! Community Roadmap
  51. Thank you! Anshum Gupta anshum@apache.org | @anshumgupta Alisa Zhila alisa.zhila@gmail.com https://github.com/alisa-ipn/solr-revolution-2016-nested-demo
Advertisement