Working with deeply nested documents in
Apache Solr
Anshum Gupta
Apache Lucene/Solr PMC member & committer
IBM Watson
2
Anshum Gupta
• Apache Lucene/Solr committer and PMC member
• Search guy @ IBM Watson.
• Interested in search and related stuff.
• Apache Lucene since 2006 and Solr since 2010.
3
Agenda
• Hierarchical Data/Nested Documents
• Indexing Nested Documents
• Querying Nested Documents
• Faceting on Nested Documents
Thanks to my fellow IBMer Alisa Zhila for working on this with me!
Hierarchical Documents
5
• Social media comments, Email threads,
Annotated data - AI
• Relationship between documents
• Possibility to flatten
Need for nested data
EXAMPLE: Blog Post with Comments
Peter Navarro outlines the Trump economic plan
Tyler Cowen, September 27, 2016 at 3:07am
Trump proposes eliminating America’s $500 billion
trade deficit through a combination of increased
exports and reduced imports.
1 Ray Lopez September 27, 2016 at 3:21 am
I’ll be the first to say this, but the analysis is flawed.
{negative}
2 Brian Donohue September 27, 2016 at 9:20 am
The math checks out. Solid.
{positive}
examples from http://marginalrevolution.com
6
• Can not flatten, need to retain context
• Relationship between documents
• Get all 'positive comments' to 'posts about
Trump' -- IMPOSSIBLE!!!
Nested Documents
EXAMPLE: Data Flattening
Title: Peter Navarro outlines the Trump economic plan
Author: Tyler Cowen
Date: September 27, 2016 at 3:07am
Body: Trump proposes eliminating America’s $500 billion
trade deficit through a combination of increased exports and
reduced imports.
Comment_authors: [Ray Lopez, Brian Donohue]
Comment_dates: [September 27, 2016 at 3:21 am,
September 27, 2016 at 9:20 am]
Comment_texts: ["I’ll be the first to say this, but the analysis is
flawed.", "The math checks out. Solid."]
Comment_sentiments: [negative, positive]
7
• Can not flatten, need to retain context
• Relationship between documents
• Get all 'positive comments' to 'posts about
Trump' -- POSSIBLE!!! (stay tuned)
Nested Documents
EXAMPLE: Hierarchical Documents
Type: Post
Title: Peter Navarro outlines the Trump economic plan
Author: Tyler Cowen
Date: September 27, 2016 at 3:07am
Body: Trump proposes eliminating America’s $500 billion
trade deficit through a combination of increased exports and
reduced imports.
Type: Comment
Author: Ray Lopez
Date: September 27, 2016 at 3:21 am
Text: I’ll be the first to say this, but the analysis is flawed.
Sentiment: negative
Type: Comment
Author: Brian Donohue
Date: September 27, 2016 at 9:20 am
Text: The math checks out. Solid.
Sentiment: positive
8
• Blog Post Data with Comments and Replies
from http://marginalrevolution.com (cured)
• 2 posts, 2-3 comments per post, 0-3 replies
per comment
• Extracted keywords & sentiment data
• 4 levels of "nesting"
• Too big to show on slides
• Data + Scripts + Demo Queries:
• https://github.com/alisa-ipn/solr-
revolution-2016-nested-demo
Running Example
Indexing Nested Documents
10
• Nested XML
• JSON Documents
• Add _childDocument_ tags for all children
• Pre-process field names to FQNs
• Lose information, or add that as meta-data during pre-processing
• JSON Document endpoint (6x only) - /update/json/docs
• Field name mappings
• Child Document splitting - Enhanced support coming soon.
Sending Documents to Solr
11
solr-6.2.1$ bin/post -c demo-xml ./data/example-data.xml
Sending Documents to Solr: Nested XML
<add>
<doc>
<field name="type">post</field>
<field name="author"> "Alex Tabarrok"</field>
<field name="title">"The Irony of Hillary Clinton’s Data Analytics"</
field>
<field name="body">"Barack Obama’s campaign adopted data but
Hillary Clinton’s campaign has been molded by data from birth."</field>
<field name="id">"12015-24204"</field>
<doc>
<field name="type">comment</field>
<field name="author">"Todd"</field>
<field name="text">"Clinton got out data-ed and out organized in
2008 by Obama. She seems at least to learn over time, and apply the
lessons learned to the real world."</field>
<field name="sentiment">"positive"</field>
<field name="id">"29798-24171"</field>
<doc>
<field name="type">reply</field>
<field name="author">"The Other Jim"</field>
<field name="text">"No, she lost because (1) she is thoroughly
detested person and (2) the DNC decided Obama should therefore
win."</field>
<field name="sentiment">"negative"</field>
<field name="id">"29798-21232"</field>
</doc>
</doc>
</doc>
</add>
12
• Add _childDocument_ tags for all children
• Pre-process field names to FQNs
• Lose information, or add that as meta-data during pre-processing
solr-6.2.1$ bin/post -c demo-solr-json ./data/small-example-data-solr.json -format solr
Sending Documents to Solr: JSON Documents
[{ "path": "1.posts",
"id": "28711",
"author": "Alex Tabarrok",
"title": "The Irony of Hillary Clinton’s Data Analytics",
"body": "Barack Obama’s campaign adopted data but Hillary Clinton’s campaign
has been molded by data from birth.",
"_childDocuments_": [
{
"path": "2.posts.comments",
"id": "28711-19237",
"author": "Todd",
"text": "Clinton got out data-ed and out organized in 2008 by Obama. She
seems at least to learn over time, and apply the lessons learned to the real world.",
"sentiment": "positive",
"_childDocuments_": [
{
"path": "3.posts.comments.replies",
"author": "The Other Jim",
"id": "28711-12444",
"sentiment": "negative",
"text": "No, she lost because (1) she is thoroughly detested person and
(2) the DNC decided Obama should therefore win."
}]}]}]
13
• JSON Document endpoint (6x only) - /update/json/docs
• Field name mappings
• Child Document splitting - Enhanced support coming soon.
solr-6.2.1$ curl 'http://localhost:8983/solr/gettingstarted/update/json/docs?
split=/|/posts|/posts/comments|/posts/comments/replies&commit=true' --data-
binary @small-example-data.json -H ‘Content-type:application/json'
NOTE: All documents must contain a unique ID.
Sending Documents to Solr: JSON Endpoint
14
• Update Request Processors don’t work with nested documents
• Example:
• UUID update processor does not auto-add an id for a child document.
• Workaround:
• Take responsibility at the client layer to handle the computation for nested
documents.
• Change the update processor in Solr to handle nested documents.
Update Processors and Nested Documents
15
• The entire block needs reindexing
• Forgot to add a meta-data field that might be useful? Complete reindex
• Store everything in Solr IF
• it’s too expensive to reconstruct the doc from original data source
• No access to data anymore e.g. streaming data
Re-Indexing Your Documents
16
• Various ways to index nested documents
• Need to re-index entire block
Nested Document Indexing Summary
Let’s ask some interesting questions
18
{
"path":["4.posts.comments.replies.keywords"],
"text":["Trump"]},
{
"path":["3.posts.comments.keywords"],
"text":["Trump"]},
{
"path":["2.posts.keywords"],
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"text":["Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports."],
"path":["1.posts"]},
{
"text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."],
"path":["2.posts.comments"]}
Easy question first
Find all documents that mention Trump
q=text:Trump
19
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."],
"path":["2.posts.comments"]},
{
"text":["No one goes to Clinton rallies while tens of thousands line up to see Trump, data-mining leads to a fantasy view of the World."],
"path":["2.posts.comments"]}
Returning certain types of documents
Find all comments and replies that mention Trump
q=(path:2.posts.comments OR path:3.posts.comments.replies) AND text:Trump
Recipe:
At the data pre-processing stage, add a field that indicates document type
and also its path in the hierarchy (-- stay tuned):
20
{
"path":["3.posts.comments.keywords"],
"sentiment":["positive"],
"text":["Hillary"]},
{
"path":["4.posts.comments.replies.keywords"],
"sentiment":["negative"],
"text":["Hillary"]},
{
"path":["2.posts.keywords"],
"text":["Hillary"]}
Returning similar type from different level
Find all keywords that are Hillary
q=path:*.keywords AND text:Hillary
Recipe:
Use wild-cards in the field that stores the hierarchy path
Cross-Level Querying
22
{
"path":["3.posts.comments.keywords"],
"sentiment":["positive"],
"text":["Hillary"]},
{
"path":["4.posts.comments.replies.keywords"],
"sentiment":["negative"],
"text":["Hillary"]},
{
"path":["2.posts.keywords"],
"text":["Hillary"]}
Recap so far...
Find all keywords that are Hillary
q=path:*.keywords AND text:Hillary
We're querying precisely for documents
which we provide a search condition for
Query
Level 3
Result
Level 3
Query
Level 4
Result
Level 4
Query
Level 2
Result
Level 2
23
Returning parents by querying children:
Block Join Parent Query
Find all comments whose keywords detected positive sentiment towards Hillary
q={!parent which="path:2.posts.comments"}path:3.posts.comments.keywords AND text:Hillary AND sentiment:positive
Query
Level 3
Result
Level 2
{
"author":["Brian Donohue"],
"text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering,
but he was actually able to find his feet and score some points."],
"path":["2.posts.comments"]},
{
"author":["Todd"],
"text":["Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to
learn over time, and apply the lessons learned to the real world."],
"path":["2.posts.comments"]}
24
{
"sentiment":["negative"],
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"sentiment":["neutral"],
"text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S.
asset values?"],
"path":["3.posts.comments.replies"]},
{
"sentiment":["positive"],
"text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see
a fantasy in person?"],
"path":["3.posts.comments.replies"]}
Returning children by querying parents:
Block Join Child Query
Find replies to negative comments
q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative&fq=path:3.posts.comments.replies
Query
Level 2
Result
Level 3
Block Join Child Query + Filter Query
A bit counterintuitive and non-symmetrical to the BJPQ
25
Returning all document's descendants
Block Join Child Query
Find all descendants of negative comments
q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative
Query
Level 2
Results
Level 3
Results
Level 4
{
"path":["4.posts.comments.replies.keywords"],
"id":"17413-13550",
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"],
"id":"17413-66188"},
{
"path":["3.posts.comments.keywords"],
"id":"12413-12487",
"text":["Hillary"]},
{
"text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see
a fantasy in person?"],
"path":["3.posts.comments.replies"],
"id":"12413-10998"}
Issue: no grouping by parent
What if we want to bring the whole sub-structure?
26
Find all negative comments and return them with all their descendants
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*]
Query
Level 2
Result
Level 2
sub-
hierarchy
Returning document with all descendants:
ChildDocTransformer
{
"sentiment":["negative"],
"text":["I’ll be the first to say this, but the analysis is flawed."],
"path":["2.posts.comments"],
"_childDocuments_":[
{
"path":["4.posts.comments.replies.keywords"],
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"path":["4.posts.comments.replies.keywords"],
"text":["U.S."]},
{
"text":["So then I guess he will also eliminate the current account surplus? What
will happen to U.S. asset values?"],
"path":["3.posts.comments.replies"]}
]
},
...
Issue: the "sub-hierarchy" is flat
• Returns all descendant documents along with the queried document
• flattens the sub-hierarchy
• Workarounds:
• Reconstruct the document using path ("path":["3.posts.comments.replies"])
information in case you want the entire subtree (result post-processing)
• use childFilter in case you want a specific level
27
“This transformer returns all descendant documents of each parent document matching your query in
a flat list nested inside the matching parent document." (ChildDocTransformer cwiki)
Returning document with all descendants:
ChildDocTransformer
28
Find all negative comments and return them with all replies to them
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*
childFilter=path:3.posts.comments.replies]
{
"sentiment":["negative"],
"text":["I’ll be the first to say this, but the analysis is flawed."],
"path":["2.posts.comments"],
"_childDocuments_":[
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is
funnier."],
"path":["3.posts.comments.replies"]},
{
"text":["So then I guess he will also eliminate the current account surplus? What
will happen to U.S. asset values?"],
"path":["3.posts.comments.replies"]}
]
},
...
Returning document with specific descendants:
ChildDocTransformer + childFilter
Query
Level 2:comments
Result
Level 2:comments
+ Level 3:replies
29
Find all negative comments and return them with all their descendants that mention Trump
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=text:Trump]
{
"sentiment":["negative"],
"text":["I’ll be the first to say this, but the analysis is flawed."],
"path":["2.posts.comments"],
"_childDocuments_":[
{
"path":["4.posts.comments.replies.keywords"],
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is
funnier."],
"path":["3.posts.comments.replies"]}
]
},
...
Returning document with queried descendants:
ChildDocTransformer + childFilter
Query
Level 2:comments
Result
Level 2:comments
+ sub-levels
Issue: cannot use boolean expressions in childFilter query
30
Cross-Level Querying Mechanisms:
• Block Join Parent Query
• Block Join Children Query
• ChildDocTransformer
Good points:
• overlapping & complementary features
• good capabilities of querying direct ancestors/descendants
• possible to query on siblings of different type
Drawbacks:
• need for data-preprocessing for better querying flexibility
• limited support of querying over non-directly related branches (overcome with graphs?)
• flattening nested data (additional post-processing is needed for reconstruction)
Nested Document Querying Summary
Faceting on Nested Documents
32
• Solr allows faceting on nested documents!
• Two mechanisms for faceting:
• Faceting with JSON Facet API (since Solr 5.3)
• Block Join Faceting (since Solr 5.5)
Faceting on Nested Documents
33
q=path:2.posts.comments AND sentiment:positive&
json.facet={
most_liked_authors : {
type: terms,
field: author,
domain: { blockParent : "path:1.posts"}
}}
Faceting on parents by descendants
JSON Facet API: Parent Domain
Count authors of the posts that received positive comments
"most_liked_authors":{
"buckets":[
{
"val":"Alex Tabarrok",
"count":1},
{
"val":"Tyler Cowen",
"count":1}
]
}
Query
Level 2
Facet
Level 1
34
Faceting on descendants by ancestors
JSON Facet API: Child Domain
Distribution of keywords that appear in comments and replies by the top-level posts
Query
Level 1
Facet
Descendant
Levels
"top_keywords":{
"buckets":[{
"val":"hillary",
"count":4,
"counts_by_posts":2},
{
"val":"trump",
"count":3,
"counts_by_posts":2},
{
"val":"dnc",
"count":1,
"counts_by_posts":1},
{
"val":"obama",
"count":2,
"counts_by_posts":1},
{
"val":"u.s",
"count":1,
"counts_by_posts":1}
]}
35
q=path:1.posts&rows=0&
json.facet={
filter_by_child_type :{
type:query,
q:"path:*comments*keywords",
domain: { blockChildren : "path:1.posts" },
facet:{
top_keywords : {
type: terms,
field: text,
sort: "counts_by_posts desc",
facet: {
counts_by_posts: "unique(_root_)"
}}}}}
Faceting on descendants by ancestors
JSON Facet API: Child Domain
Distribution of keywords that appear in comments and replies by the top-level posts
Query
Level 1
Facet
Descendant
Levels
36
Faceting on descendants by top-level ancestor
JSON Facet API: Child Domain
Distribution of keywords that appear in comments and replies by the top-level posts
Query
Level 1
Facet
Descendant
Levels
Issue: only the top-ancestor gets the unique "_root_" field by default
q=path:1.posts&rows=0&
json.facet={
filter_by_child_type :{
type:query,
q:"path:*comments*keywords",
domain: { blockChildren : "path:1.posts" },
facet:{
top_keywords : {
type: terms,
field: text,
sort: "counts_by_posts desc",
facet: {
counts_by_posts: "unique(_root_)"
}}}}}
37
q=path:2.posts.comments&rows=0&
json.facet={
filter_by_child_type :{
type:query,
q:"path:*comments*keywords",
domain: { blockChildren : "path:2.posts.comments" },
facet:{
top_keywords : {
type: terms,
field: text,
sort: "counts_by_comments desc",
facet: {
counts_by_comments: "unique(2.posts.comments-id)"
}}}}}
Faceting on descendants by intermediate ancestors
JSON Facet API: Child Domain + unique fields
Distribution of keywords that appear in comments and replies by the comments
Query
Level 2
Facet
Descendant
Levels
At pre-processing, introduce unique fields for each level
38
Faceting on descendants by intermediate ancestors
JSON Facet API: Child Domain + unique fields
Distribution of keywords that appear in comments and replies by the comments
Query
Level 2
Facet
Descendant
Levels
"top_keywords":{
"buckets":[{
"val":"Hillary",
"count":4,
"counts_by_comments":3},
{
"val":"Trump",
"count":3,
"counts_by_comments":3},
{
"val":"DNC",
"count":1,
"counts_by_comments":1},
{
"val":"Obama",
"count":2,
"counts_by_comments":1},
{
"val":"U.S.",
"count":1,
"counts_by_comments":1}
]}
Now let's try the same using Block Join Faceting
40
• Experimental Feature
• Needs to be turned on explicitly in solrconfig.xml
More info: https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting
Block Join Faceting
41
bjqfacet?q={!parent which=path:2.posts.comments}
path:*.comments*keywords&rows=0&facet=true&child.facet.field=text
Faceting on descendants by ancestors #2:
Block Join Faceting on Children Domain
Distribution of keywords that appear in comments and replies by the comments
"facet_fields":{
"text":[
"dnc",1,
"hillary",3,
"obama",1,
"trump",3,
"u.s",1
]
}
Query
Level 2
Facet
Descendant
Levels
bjqfacet request handler instead of query
42
Output Comparison
Block Join Facet JSON Facet API
"facet_fields":{
"text":[
"dnc",1,
"hillary",3,
"obama",1,
"trump",3,
"u.s",1
]
}
"top_keywords":{
"buckets":[{
"val":"Hillary",
"count":4,
"counts_by_comments":3},
{
"val":"Trump",
"count":3,
"counts_by_comments":3},
{
"val":"DNC",
"count":1,
"counts_by_comments":1},
{
"val":"Obama",
"count":2,
"counts_by_comments":1},
{
"val":"U.S.",
"count":1,
"counts_by_comments":1}
]}
Distribution of keywords that appear in comments and replies by the comments
43
Output Comparison
Block Join Facet JSON Facet API
"facet_fields":{
"text":[
"dnc",1,
"hillary",3,
"obama",1,
"trump",3,
"u.s",1
]
}
"top_keywords":{
"buckets":[{
"val":"Hillary",
"count":4,
"counts_by_comments":3},
{
"val":"Trump",
"count":3,
"counts_by_comments":3},
{
"val":"DNC",
"count":1,
"counts_by_comments":1},
...
Distribution of keywords that appear in comments and replies by the comments
Output is sorted in alphabetical
order. It cannot be changed
facet:{
top_keywords : {
...
sort: "counts_by_comments desc"
}}}
44
JSON Facet API:
• Experimental - but more mature
• More developed and established feature
• bulky JSON syntax
• faceting on children by non-top level ancestors requires introducing unique branch
identifiers similar to "_root_" on each level
Block Join Facet:
• Experimental feature
• Lacks controls: sorting, limit...
• traditional query-style syntax
• proper handling of faceting on children by non-top level ancestors
Hierarchical Faceting Summary
45
• Returning hierarchical structure
• JSON facet rollups is in the works - SOLR-8998
• Graph query might replace a lot of functionalities of cross-level querying - No
distributed support right now.
• There’s more but the community would love to have more people involved!
Community Roadmap
Thank you!
Anshum Gupta anshum@apache.org | @anshumgupta
https://github.com/alisa-ipn/solr-revolution-2016-nested-demo

Working with deeply nested documents in Apache Solr

  • 1.
    Working with deeplynested documents in Apache Solr Anshum Gupta Apache Lucene/Solr PMC member & committer IBM Watson
  • 2.
    2 Anshum Gupta • ApacheLucene/Solr committer and PMC member • Search guy @ IBM Watson. • Interested in search and related stuff. • Apache Lucene since 2006 and Solr since 2010.
  • 3.
    3 Agenda • Hierarchical Data/NestedDocuments • Indexing Nested Documents • Querying Nested Documents • Faceting on Nested Documents Thanks to my fellow IBMer Alisa Zhila for working on this with me!
  • 4.
  • 5.
    5 • Social mediacomments, Email threads, Annotated data - AI • Relationship between documents • Possibility to flatten Need for nested data EXAMPLE: Blog Post with Comments Peter Navarro outlines the Trump economic plan Tyler Cowen, September 27, 2016 at 3:07am Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports. 1 Ray Lopez September 27, 2016 at 3:21 am I’ll be the first to say this, but the analysis is flawed. {negative} 2 Brian Donohue September 27, 2016 at 9:20 am The math checks out. Solid. {positive} examples from http://marginalrevolution.com
  • 6.
    6 • Can notflatten, need to retain context • Relationship between documents • Get all 'positive comments' to 'posts about Trump' -- IMPOSSIBLE!!! Nested Documents EXAMPLE: Data Flattening Title: Peter Navarro outlines the Trump economic plan Author: Tyler Cowen Date: September 27, 2016 at 3:07am Body: Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports. Comment_authors: [Ray Lopez, Brian Donohue] Comment_dates: [September 27, 2016 at 3:21 am, September 27, 2016 at 9:20 am] Comment_texts: ["I’ll be the first to say this, but the analysis is flawed.", "The math checks out. Solid."] Comment_sentiments: [negative, positive]
  • 7.
    7 • Can notflatten, need to retain context • Relationship between documents • Get all 'positive comments' to 'posts about Trump' -- POSSIBLE!!! (stay tuned) Nested Documents EXAMPLE: Hierarchical Documents Type: Post Title: Peter Navarro outlines the Trump economic plan Author: Tyler Cowen Date: September 27, 2016 at 3:07am Body: Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports. Type: Comment Author: Ray Lopez Date: September 27, 2016 at 3:21 am Text: I’ll be the first to say this, but the analysis is flawed. Sentiment: negative Type: Comment Author: Brian Donohue Date: September 27, 2016 at 9:20 am Text: The math checks out. Solid. Sentiment: positive
  • 8.
    8 • Blog PostData with Comments and Replies from http://marginalrevolution.com (cured) • 2 posts, 2-3 comments per post, 0-3 replies per comment • Extracted keywords & sentiment data • 4 levels of "nesting" • Too big to show on slides • Data + Scripts + Demo Queries: • https://github.com/alisa-ipn/solr- revolution-2016-nested-demo Running Example
  • 9.
  • 10.
    10 • Nested XML •JSON Documents • Add _childDocument_ tags for all children • Pre-process field names to FQNs • Lose information, or add that as meta-data during pre-processing • JSON Document endpoint (6x only) - /update/json/docs • Field name mappings • Child Document splitting - Enhanced support coming soon. Sending Documents to Solr
  • 11.
    11 solr-6.2.1$ bin/post -cdemo-xml ./data/example-data.xml Sending Documents to Solr: Nested XML <add> <doc> <field name="type">post</field> <field name="author"> "Alex Tabarrok"</field> <field name="title">"The Irony of Hillary Clinton’s Data Analytics"</ field> <field name="body">"Barack Obama’s campaign adopted data but Hillary Clinton’s campaign has been molded by data from birth."</field> <field name="id">"12015-24204"</field> <doc> <field name="type">comment</field> <field name="author">"Todd"</field> <field name="text">"Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world."</field> <field name="sentiment">"positive"</field> <field name="id">"29798-24171"</field> <doc> <field name="type">reply</field> <field name="author">"The Other Jim"</field> <field name="text">"No, she lost because (1) she is thoroughly detested person and (2) the DNC decided Obama should therefore win."</field> <field name="sentiment">"negative"</field> <field name="id">"29798-21232"</field> </doc> </doc> </doc> </add>
  • 12.
    12 • Add _childDocument_tags for all children • Pre-process field names to FQNs • Lose information, or add that as meta-data during pre-processing solr-6.2.1$ bin/post -c demo-solr-json ./data/small-example-data-solr.json -format solr Sending Documents to Solr: JSON Documents [{ "path": "1.posts", "id": "28711", "author": "Alex Tabarrok", "title": "The Irony of Hillary Clinton’s Data Analytics", "body": "Barack Obama’s campaign adopted data but Hillary Clinton’s campaign has been molded by data from birth.", "_childDocuments_": [ { "path": "2.posts.comments", "id": "28711-19237", "author": "Todd", "text": "Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world.", "sentiment": "positive", "_childDocuments_": [ { "path": "3.posts.comments.replies", "author": "The Other Jim", "id": "28711-12444", "sentiment": "negative", "text": "No, she lost because (1) she is thoroughly detested person and (2) the DNC decided Obama should therefore win." }]}]}]
  • 13.
    13 • JSON Documentendpoint (6x only) - /update/json/docs • Field name mappings • Child Document splitting - Enhanced support coming soon. solr-6.2.1$ curl 'http://localhost:8983/solr/gettingstarted/update/json/docs? split=/|/posts|/posts/comments|/posts/comments/replies&commit=true' --data- binary @small-example-data.json -H ‘Content-type:application/json' NOTE: All documents must contain a unique ID. Sending Documents to Solr: JSON Endpoint
  • 14.
    14 • Update RequestProcessors don’t work with nested documents • Example: • UUID update processor does not auto-add an id for a child document. • Workaround: • Take responsibility at the client layer to handle the computation for nested documents. • Change the update processor in Solr to handle nested documents. Update Processors and Nested Documents
  • 15.
    15 • The entireblock needs reindexing • Forgot to add a meta-data field that might be useful? Complete reindex • Store everything in Solr IF • it’s too expensive to reconstruct the doc from original data source • No access to data anymore e.g. streaming data Re-Indexing Your Documents
  • 16.
    16 • Various waysto index nested documents • Need to re-index entire block Nested Document Indexing Summary
  • 17.
    Let’s ask someinteresting questions
  • 18.
    18 { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "path":["3.posts.comments.keywords"], "text":["Trump"]}, { "path":["2.posts.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyedTrump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports."], "path":["1.posts"]}, { "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]} Easy question first Find all documents that mention Trump q=text:Trump
  • 19.
    19 { "text":["LOL. I enjoyedTrump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]}, { "text":["No one goes to Clinton rallies while tens of thousands line up to see Trump, data-mining leads to a fantasy view of the World."], "path":["2.posts.comments"]} Returning certain types of documents Find all comments and replies that mention Trump q=(path:2.posts.comments OR path:3.posts.comments.replies) AND text:Trump Recipe: At the data pre-processing stage, add a field that indicates document type and also its path in the hierarchy (-- stay tuned):
  • 20.
  • 21.
  • 22.
    22 { "path":["3.posts.comments.keywords"], "sentiment":["positive"], "text":["Hillary"]}, { "path":["4.posts.comments.replies.keywords"], "sentiment":["negative"], "text":["Hillary"]}, { "path":["2.posts.keywords"], "text":["Hillary"]} Recap so far... Findall keywords that are Hillary q=path:*.keywords AND text:Hillary We're querying precisely for documents which we provide a search condition for Query Level 3 Result Level 3 Query Level 4 Result Level 4 Query Level 2 Result Level 2
  • 23.
    23 Returning parents byquerying children: Block Join Parent Query Find all comments whose keywords detected positive sentiment towards Hillary q={!parent which="path:2.posts.comments"}path:3.posts.comments.keywords AND text:Hillary AND sentiment:positive Query Level 3 Result Level 2 { "author":["Brian Donohue"], "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]}, { "author":["Todd"], "text":["Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world."], "path":["2.posts.comments"]}
  • 24.
    24 { "sentiment":["negative"], "text":["LOL. I enjoyedTrump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "sentiment":["neutral"], "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]}, { "sentiment":["positive"], "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"]} Returning children by querying parents: Block Join Child Query Find replies to negative comments q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative&fq=path:3.posts.comments.replies Query Level 2 Result Level 3 Block Join Child Query + Filter Query A bit counterintuitive and non-symmetrical to the BJPQ
  • 25.
    25 Returning all document'sdescendants Block Join Child Query Find all descendants of negative comments q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative Query Level 2 Results Level 3 Results Level 4 { "path":["4.posts.comments.replies.keywords"], "id":"17413-13550", "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"], "id":"17413-66188"}, { "path":["3.posts.comments.keywords"], "id":"12413-12487", "text":["Hillary"]}, { "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"], "id":"12413-10998"} Issue: no grouping by parent What if we want to bring the whole sub-structure?
  • 26.
    26 Find all negativecomments and return them with all their descendants q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*] Query Level 2 Result Level 2 sub- hierarchy Returning document with all descendants: ChildDocTransformer { "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "path":["4.posts.comments.replies.keywords"], "text":["U.S."]}, { "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]} ] }, ... Issue: the "sub-hierarchy" is flat
  • 27.
    • Returns alldescendant documents along with the queried document • flattens the sub-hierarchy • Workarounds: • Reconstruct the document using path ("path":["3.posts.comments.replies"]) information in case you want the entire subtree (result post-processing) • use childFilter in case you want a specific level 27 “This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document." (ChildDocTransformer cwiki) Returning document with all descendants: ChildDocTransformer
  • 28.
    28 Find all negativecomments and return them with all replies to them q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=path:3.posts.comments.replies] { "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]} ] }, ... Returning document with specific descendants: ChildDocTransformer + childFilter Query Level 2:comments Result Level 2:comments + Level 3:replies
  • 29.
    29 Find all negativecomments and return them with all their descendants that mention Trump q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=text:Trump] { "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]} ] }, ... Returning document with queried descendants: ChildDocTransformer + childFilter Query Level 2:comments Result Level 2:comments + sub-levels Issue: cannot use boolean expressions in childFilter query
  • 30.
    30 Cross-Level Querying Mechanisms: •Block Join Parent Query • Block Join Children Query • ChildDocTransformer Good points: • overlapping & complementary features • good capabilities of querying direct ancestors/descendants • possible to query on siblings of different type Drawbacks: • need for data-preprocessing for better querying flexibility • limited support of querying over non-directly related branches (overcome with graphs?) • flattening nested data (additional post-processing is needed for reconstruction) Nested Document Querying Summary
  • 31.
  • 32.
    32 • Solr allowsfaceting on nested documents! • Two mechanisms for faceting: • Faceting with JSON Facet API (since Solr 5.3) • Block Join Faceting (since Solr 5.5) Faceting on Nested Documents
  • 33.
    33 q=path:2.posts.comments AND sentiment:positive& json.facet={ most_liked_authors: { type: terms, field: author, domain: { blockParent : "path:1.posts"} }} Faceting on parents by descendants JSON Facet API: Parent Domain Count authors of the posts that received positive comments "most_liked_authors":{ "buckets":[ { "val":"Alex Tabarrok", "count":1}, { "val":"Tyler Cowen", "count":1} ] } Query Level 2 Facet Level 1
  • 34.
    34 Faceting on descendantsby ancestors JSON Facet API: Child Domain Distribution of keywords that appear in comments and replies by the top-level posts Query Level 1 Facet Descendant Levels "top_keywords":{ "buckets":[{ "val":"hillary", "count":4, "counts_by_posts":2}, { "val":"trump", "count":3, "counts_by_posts":2}, { "val":"dnc", "count":1, "counts_by_posts":1}, { "val":"obama", "count":2, "counts_by_posts":1}, { "val":"u.s", "count":1, "counts_by_posts":1} ]}
  • 35.
    35 q=path:1.posts&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: {blockChildren : "path:1.posts" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_posts desc", facet: { counts_by_posts: "unique(_root_)" }}}}} Faceting on descendants by ancestors JSON Facet API: Child Domain Distribution of keywords that appear in comments and replies by the top-level posts Query Level 1 Facet Descendant Levels
  • 36.
    36 Faceting on descendantsby top-level ancestor JSON Facet API: Child Domain Distribution of keywords that appear in comments and replies by the top-level posts Query Level 1 Facet Descendant Levels Issue: only the top-ancestor gets the unique "_root_" field by default q=path:1.posts&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:1.posts" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_posts desc", facet: { counts_by_posts: "unique(_root_)" }}}}}
  • 37.
    37 q=path:2.posts.comments&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: {blockChildren : "path:2.posts.comments" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_comments desc", facet: { counts_by_comments: "unique(2.posts.comments-id)" }}}}} Faceting on descendants by intermediate ancestors JSON Facet API: Child Domain + unique fields Distribution of keywords that appear in comments and replies by the comments Query Level 2 Facet Descendant Levels At pre-processing, introduce unique fields for each level
  • 38.
    38 Faceting on descendantsby intermediate ancestors JSON Facet API: Child Domain + unique fields Distribution of keywords that appear in comments and replies by the comments Query Level 2 Facet Descendant Levels "top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, { "val":"Obama", "count":2, "counts_by_comments":1}, { "val":"U.S.", "count":1, "counts_by_comments":1} ]}
  • 39.
    Now let's trythe same using Block Join Faceting
  • 40.
    40 • Experimental Feature •Needs to be turned on explicitly in solrconfig.xml More info: https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting Block Join Faceting
  • 41.
    41 bjqfacet?q={!parent which=path:2.posts.comments} path:*.comments*keywords&rows=0&facet=true&child.facet.field=text Faceting ondescendants by ancestors #2: Block Join Faceting on Children Domain Distribution of keywords that appear in comments and replies by the comments "facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] } Query Level 2 Facet Descendant Levels bjqfacet request handler instead of query
  • 42.
    42 Output Comparison Block JoinFacet JSON Facet API "facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] } "top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, { "val":"Obama", "count":2, "counts_by_comments":1}, { "val":"U.S.", "count":1, "counts_by_comments":1} ]} Distribution of keywords that appear in comments and replies by the comments
  • 43.
    43 Output Comparison Block JoinFacet JSON Facet API "facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] } "top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, ... Distribution of keywords that appear in comments and replies by the comments Output is sorted in alphabetical order. It cannot be changed facet:{ top_keywords : { ... sort: "counts_by_comments desc" }}}
  • 44.
    44 JSON Facet API: •Experimental - but more mature • More developed and established feature • bulky JSON syntax • faceting on children by non-top level ancestors requires introducing unique branch identifiers similar to "_root_" on each level Block Join Facet: • Experimental feature • Lacks controls: sorting, limit... • traditional query-style syntax • proper handling of faceting on children by non-top level ancestors Hierarchical Faceting Summary
  • 45.
    45 • Returning hierarchicalstructure • JSON facet rollups is in the works - SOLR-8998 • Graph query might replace a lot of functionalities of cross-level querying - No distributed support right now. • There’s more but the community would love to have more people involved! Community Roadmap
  • 46.
    Thank you! Anshum Guptaanshum@apache.org | @anshumgupta https://github.com/alisa-ipn/solr-revolution-2016-nested-demo