This document discusses high performance JSON search and relational faceted browsing using Lucene. It introduces SIREn, a Lucene plugin for indexing and searching JSON documents with a nested data model. SIREn uses tree labeling techniques to represent the JSON document structure and enable both full-text and structural queries. It also allows for relational faceted browsing across multiple record collections through pivot navigation and query rewriting. While BlockJoin supports some nested data in Lucene, SIREn has better scalability through its compression techniques and more flexibility through its schema-agnostic approach.
High Performance JSON Search and Faceted Browsing with Lucene
2. HIGH PERFORMANCE JSON SEARCH AND RELATIONAL FACETED BROWSING WITH LUCENE
Renaud Delbru
renaud@sindicetech.com
renaud.delbru@deri.org
Co-Founder, SindiceTech
Post-Doctoral Researcher, NUIG
3. My Background
• Lucene / Solr
– User for 7 years
– Built a web search engine, sindice.com (700M documents)
• Academia & Research
– Ph.D. in Information Retrieval and Semantic Web
– Post-doctoral researcher at the National University of Ireland, Galway
• Industry
– Technical co-founder of SindiceTech
– Management Platform for Enterprise Knowledge Graph
8. Nested Data Model: Why is it important?
• SQL
– Query-time joins carry a performance penalty
• NoSQL
– Denormalisation of relational data into nested data
– Converts many-to-one and many-to-many relationships into one-to-many relationships
– Duplicates data …
– … but avoids joins
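The denormalisation step can be sketched in a few lines of Python. This is only an illustration: the row layout, the `company_id` key and the second funding round are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical relational rows (layout and values are illustrative only).
companies = [{"id": 1, "name": "LucidWorks", "category_code": "analytics"}]
funding_rounds = [
    {"company_id": 1, "round_code": "a", "raised_amount": 6000000},
    {"company_id": 1, "round_code": "b", "raised_amount": 10000000},
]

def denormalise(companies, funding_rounds):
    """Embed each company's funding rounds as a nested array (no join at query time)."""
    rounds_by_company = defaultdict(list)
    for row in funding_rounds:
        # Drop the foreign key: the nesting itself now carries the relationship.
        nested = {k: v for k, v in row.items() if k != "company_id"}
        rounds_by_company[row["company_id"]].append(nested)
    return [
        {
            **{k: v for k, v in company.items() if k != "id"},
            "funding_rounds": rounds_by_company[company["id"]],
        }
        for company in companies
    ]

documents = denormalise(companies, funding_rounds)
```

Each resulting document carries its own copy of the related rows, which is the "duplicate data, avoid joins" trade-off listed above.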
9. Schema-Less Nested Data Model
• Model becoming prevalent: JSON, XML, Avro, …
– Can be arbitrarily nested and large
– No strict schema / structure enforced
• Schema-less brings
– Flexibility
– Ease of development
• Developers do not have to invest significant modelling effort upfront
10. Introducing SIREn
• Lucene/Solr plugin for indexing and searching JSON
• Rich data model (JSON)
– Nested objects, nested arrays, datatypes
• Schema-agnostic
– No need to define structure (nested model)
– No need to define schema (fields)
11. Overview of the SIREn API
Document
{
  "name" : "LucidWorks",
  "category_code" : "analytics",
  "funding_rounds" : [
    {
      "round_code" : "a",
      "raised_amount" : 6000000,
      "funded_year" : 2009,
      "investments" : [
        {
          "name" : "Granite Ventures",
          "type" : "financial-org"
        },
        …
      ]
    },
    …
  ]
}
Query
(category_code : analytics)
AND
(funding_rounds : {
  round_code : seed OR a OR angel,
  raised_amount : [0 TO 12000000],
  * : {
    type : financial-org
  }
})
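To make the intent of the query concrete, here is a naive Python check of the same conditions against the parsed document. It does not use SIREn or its query parser. The reading of `* : { … }` as "any field holding an object that matches" is an assumption, and the entries elided with "…" in the slide are simply left out.

```python
doc = {
    "name": "LucidWorks",
    "category_code": "analytics",
    "funding_rounds": [
        {
            "round_code": "a",
            "raised_amount": 6000000,
            "funded_year": 2009,
            "investments": [
                {"name": "Granite Ventures", "type": "financial-org"},
            ],
        },
    ],
}

def as_list(value):
    """Treat a single nested value and an array of values uniformly."""
    return value if isinstance(value, list) else [value]

def has_typed_child(obj, wanted_type):
    """Assumed meaning of '* : { type : ... }': any field holds such an object."""
    return any(
        isinstance(child, dict) and child.get("type") == wanted_type
        for value in obj.values()
        for child in as_list(value)
    )

def matches(doc):
    """All conditions on funding_rounds must hold within the same nested object."""
    if doc.get("category_code") != "analytics":
        return False
    return any(
        isinstance(r, dict)
        and r.get("round_code") in {"seed", "a", "angel"}
        and 0 <= r.get("raised_amount", -1) <= 12000000
        and has_typed_child(r, "financial-org")
        for r in as_list(doc.get("funding_rounds", []))
    )
```

The point of the sketch is the scoping: the three conditions on a funding round have to be satisfied by one and the same round, not by different rounds of the same company.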
12. Theory behind SIREn
• Inspired by tree-labelling techniques (XML IR)
– Label each node with a hierarchical identifier (here, Dewey identifiers)
• Full-text search operators over the content of a node
• Structural search operators over the nodes of the tree
– Ancestor-Descendant, Parent-Child, Sibling, …
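With Dewey identifiers, the structural operators reduce to comparisons on the labels themselves. A minimal sketch, assuming labels are written as dot-separated integers; the function names are illustrative and not part of SIREn.

```python
def parse(label):
    """Turn a Dewey label such as '1.2.2.1' into a tuple of integers."""
    return tuple(int(part) for part in label.split("."))

def is_ancestor(a, d):
    """a is a proper ancestor of d if a's label is a strict prefix of d's."""
    return len(a) < len(d) and d[:len(a)] == a

def is_parent(p, c):
    """p is the parent of c if c's label is p's label plus one component."""
    return len(c) == len(p) + 1 and c[:-1] == p

def are_siblings(x, y):
    """Distinct nodes are siblings if they share the same parent label."""
    return x != y and len(x) == len(y) and x[:-1] == y[:-1]

name_field = parse("1.1")
name_value = parse("1.1.1")
rounds_field = parse("1.2")
round_code = parse("1.2.2.1")
```

No tree has to be materialised at query time: the relationship between two nodes is decided from their two labels alone.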
25. JSON Tokenizer
• Traverses the JSON tree using depth-first search
• Generates one token per JSON node
• Attaches metadata attributes (Dewey id, datatype, …) to each token
Tokenizer Output
Token            Dewey id   Datatype
name             1.1        Field
LucidWorks       1.1.1      String
funding_rounds   1.2        Field
round_code       1.2.2.1    String
…
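A depth-first traversal that emits one labelled token per node can be sketched as a Python generator. The numbering scheme below (children numbered from 1, array items treated as nested levels) is an assumption and may not produce exactly the identifiers SIREn assigns.

```python
def tokenize(value, label=(1,)):
    """Yield (text, dewey_label, datatype) tuples in depth-first order."""
    if isinstance(value, dict):
        for i, (key, child) in enumerate(value.items(), start=1):
            field_label = label + (i,)
            yield key, field_label, "Field"
            # Descend into the field's value before visiting the next field.
            yield from tokenize(child, field_label)
    elif isinstance(value, list):
        for i, item in enumerate(value, start=1):
            yield from tokenize(item, label + (i,))
    else:
        # Leaf value: "String" for text, otherwise the Python type name.
        datatype = "String" if isinstance(value, str) else type(value).__name__
        yield str(value), label + (1,), datatype

doc = {"name": "LucidWorks", "funding_rounds": [{"round_code": "a"}]}
tokens = list(tokenize(doc))
```

Each token carries its position in the tree, so later stages never need to see the original JSON again.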
26. JSON Analyzer – NodeTokenizerFilter
• Tokenizes the content of a node token based on its datatype
Input
Token            Dewey id   Datatype
name             1.1        Field
funding_rounds   1.2        Field
LucidWorks       1.1.1      String
round_code       1.2.2.1    String
…
Output
name
funding_rounds → funding, rounds (tokenized with the Field datatype analyzer)
LucidWorks → lucid, works (tokenized with the String datatype analyzer)
…
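The dispatch on datatype can be sketched as follows. Both analyzers are stand-ins chosen only to reproduce the splits shown on the slide (LucidWorks into lucid and works, funding_rounds into funding and rounds); SIREn's actual analyzers are not described in the deck.

```python
import re

def analyze_string(text):
    """Assumed String analyzer: split on case changes and digits, then lowercase."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)
    return [part.lower() for part in parts]

def analyze_field(text):
    """Assumed Field analyzer: split on non-alphanumeric characters, then lowercase."""
    return [part.lower() for part in re.split(r"[^A-Za-z0-9]+", text) if part]

# One analyzer per datatype; unknown datatypes would raise a KeyError here.
ANALYZERS = {"String": analyze_string, "Field": analyze_field}

def node_tokenizer_filter(tokens):
    """Re-tokenize each node token with the analyzer of its datatype,
    keeping the node's Dewey label on every resulting term."""
    for text, label, datatype in tokens:
        for term in ANALYZERS[datatype](text):
            yield term, label, datatype

terms = list(node_tokenizer_filter([
    ("LucidWorks", (1, 1, 1), "String"),
    ("funding_rounds", (1, 2), "Field"),
]))
```

All terms produced from one node inherit that node's label, which is what lets full-text operators be evaluated inside a single node.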
29. JSON Analyzer – NodePayloadFilter
• Encodes metadata attributes into a term payload
• Leverages the Payload API to transfer attributes to the Codec API
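One way to picture the payload step is to serialise the node label and a datatype id into a compact byte string. The layout below (label length, label components, datatype id, each as a variable-length integer) is hypothetical and is not SIREn's actual payload format.

```python
def encode_vint(n):
    """Variable-byte encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_payload(label, datatype_id):
    """Hypothetical layout: label length, label components, then datatype id."""
    out = bytearray(encode_vint(len(label)))
    for part in label:
        out += encode_vint(part)
    out += encode_vint(datatype_id)
    return bytes(out)

def decode_payload(data):
    """Inverse of encode_payload: returns (label, datatype_id)."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    length = values[0]
    return tuple(values[1:1 + length]), values[1 + length]

payload = encode_payload((1, 2, 2, 1), 2)
```

Small integers, which dominate in Dewey labels, take a single byte each under this encoding.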
32. Tree-Labelling Codec – Compression
• Adaptive Frame Of Reference (AFOR)
– Adapts the encoding to the integer distribution
– Better tolerance against outliers
– Very effective with frequencies, node identifiers and positions (higher compression rate)
[Figure: FOR and AFOR encodings compared; frames labelled BFS]
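The outlier tolerance can be illustrated with a simplified cost model. This is not the codec itself: the 8-bit frame header, the candidate frame sizes and the rule that a frame's bit width is set by its largest value are all assumptions made for the sketch.

```python
HEADER_BITS = 8  # assumed per-frame header (frame size and bit width)

def frame_bits(frame):
    """Bits per value needed so that the largest value in the frame fits."""
    return max(max(frame).bit_length(), 1)

def for_cost(values, frame_size=32):
    """Total bits when every frame has the same fixed size."""
    total = 0
    for start in range(0, len(values), frame_size):
        frame = values[start:start + frame_size]
        total += HEADER_BITS + len(frame) * frame_bits(frame)
    return total

def afor_cost(values, frame_sizes=(8, 16, 32)):
    """Minimum total bits when each frame may choose its own size.
    Dynamic programming over prefixes; assumes len(values) can be tiled
    by the given frame sizes."""
    best = [0] + [float("inf")] * len(values)
    for end in range(1, len(values) + 1):
        for size in frame_sizes:
            start = end - size
            if start >= 0:
                cost = best[start] + HEADER_BITS + size * frame_bits(values[start:end])
                best[end] = min(best[end], cost)
    return best[len(values)]

# 64 small values with a single outlier.
values = [1, 2, 3, 1] * 16
values[5] = 1000
fixed, adaptive = for_cost(values), afor_cost(values)
```

On this input the fixed-size model needs 400 bits and the adaptive one 224, because the outlier only inflates the bit width of a short frame instead of a long one.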
38. Node Query
• Query Processing
– Collects matching document and node identifiers
– Posting list traversal order: document ids, node ids, then positions
• Adaptation of all Lucene’s Query classes to the new file structure
– NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …
• TwigQuery
– Consists of a root query and one or more descendant or child queries
– Can be nested to form a complex tree structure
– Can be rewritten as a pure boolean query
[Figure: example query tree combining Boolean, Phrase, Twig and Range queries through MUST, NOT and SHOULD clauses]
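The parent-child part of a twig query can be sketched over postings of (document id, Dewey label) pairs. This is a naive in-memory illustration, not SIREn's scorer-based implementation; SHOULD clauses are left out because they affect ranking rather than matching, and descendant clauses would use a prefix test instead of the parent test.

```python
def has_child(doc, node, postings):
    """True if some posting in the same document is a direct child of node."""
    return any(d == doc and label[:-1] == node for d, label in postings)

def twig_match(root_postings, child_clauses):
    """Return the (doc, node) root matches whose children satisfy every clause.

    root_postings: set of (doc_id, dewey_label) matched by the root query.
    child_clauses: list of (occur, postings), occur being "MUST" or "NOT".
    """
    results = []
    for doc, node in sorted(root_postings):
        if all(
            has_child(doc, node, postings) == (occur == "MUST")
            for occur, postings in child_clauses
        ):
            results.append((doc, node))
    return results

# Hypothetical postings: the root node (1, 2) appears in documents 1 and 2.
root = {(1, (1, 2)), (2, (1, 2))}
must_child = {(1, (1, 2, 1)), (2, (1, 3, 1))}
not_child = {(2, (1, 2, 2))}
hits = twig_match(root, [("MUST", must_child), ("NOT", not_child)])
```

Document 2 is rejected twice over: its only MUST posting sits under a different parent, and it has a child matched by the NOT clause.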
39. Application: Relational Faceted Navigation
• Faceted Navigation
– Data-driven exploratory interface
– User incrementally adds constraints
– Restricted to one record collection
• Relational Faceted Navigation
– Enables navigation of interrelated record collections
– Constraints affect all record collections
– New navigation operation: Pivot
  • Switches the user view to a record collection
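The pivot operation can be pictured with a few in-memory collections. The records, field names and helper functions are hypothetical (only LucidWorks and Granite Ventures appear in the slides); SIREn evaluates the equivalent constraints through query rewriting rather than Python loops.

```python
# Hypothetical records for three interrelated collections.
investors = [
    {"name": "Granite Ventures", "type": "financial-org"},
    {"name": "Example Angel", "type": "person"},
]
investments = [
    {"investor": "Granite Ventures", "company": "LucidWorks"},
    {"investor": "Example Angel", "company": "ExampleCo"},
]
companies = [
    {"name": "LucidWorks", "category_code": "analytics"},
    {"name": "ExampleCo", "category_code": "games"},
]

def constrain(records, field, value):
    """Facet selection: keep the records whose field has the chosen value."""
    return [r for r in records if r.get(field) == value]

def pivot(records, links, from_key, to_key, targets):
    """Switch the view to the target collection, keeping only the records
    linked to the current result set."""
    names = {r["name"] for r in records}
    linked = {link[to_key] for link in links if link[from_key] in names}
    return [t for t in targets if t["name"] in linked]

# Constrain investors by type, then pivot to the companies they funded.
selected = constrain(investors, "type", "financial-org")
view = pivot(selected, investments, "investor", "company", companies)
```

The constraint set on investors keeps restricting the companies shown after the pivot, which is what distinguishes this from faceting over a single collection.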
41. Data Model
• Each collection has its own data model (document)
• Lucene fields for facets
• JSON field for relationships with records from other collections
[Figure: the Company, Investment and Investor collections; facet fields Country, Year, Type, Category and Amount; one JSON field per collection]
42. JSON Model
• JSON field: tree covering all the relationships with records from other collections
• Resulting tree can be very large
[Figure: one relationship tree per collection; the collection reached at each nested level is given in parentheses]
Company
  category_code
  country_code
  funding_rounds […] (Investment)
    round_code
    raised_amount
    investments […] (Investor)
      type
Investment
  round_code
  raised_amount
  funding_rounds -1 […] (Company)
    category_code
    country_code
  investments […] (Investor)
    type
Investor
  type
  investments -1 […] (Investment)
    round_code
    raised_amount
    funding_rounds -1 […] (Company)
      category_code
      country_code
57. Comparison with BlockJoin
• Lucene BlockJoin
– Introduced support for indexing and searching nested data …
– … for small and well-defined schemas
58. Lucene BlockJoin - Scalability
• Artificially increases the number of documents in the index
– One document per nested data record
• Cache size is linear in the number of nested data records
– Increased memory usage
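The growth in document count is easy to estimate. The sketch below counts one index document for a record plus one per nested object, recursively; it only models the "one document per nested data record" rule stated above and does not call Lucene. The sample record shape is hypothetical.

```python
def count_block_documents(record):
    """Index documents needed for one record under a block-join layout:
    one for the record itself plus one per nested object, recursively."""
    total = 1
    for value in record.values():
        children = value if isinstance(value, list) else [value]
        total += sum(
            count_block_documents(child)
            for child in children
            if isinstance(child, dict)
        )
    return total

# Hypothetical company with 2 funding rounds of 3 investments each.
company = {
    "name": "LucidWorks",
    "funding_rounds": [
        {
            "round_code": code,
            "investments": [{"name": f"investor-{i}"} for i in range(3)],
        }
        for code in ("a", "b")
    ],
}
needed = count_block_documents(company)
```

Here a single company record turns into 9 index documents (1 + 2 × (1 + 3)), and any cache keyed by document grows with that number rather than with the number of companies.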
59. Lucene BlockJoin - Flexibility
• Developers must be aware of the relations between nested data records
– At indexing time, to tag parent records
– At query time, to filter parent records
• Upfront effort required to design and configure the system
– Define Parent-Child relationships between record collections
– Define attributes for each record collection
• If not properly designed, risk of incorrect matches
60. Comparison with BlockJoin
• BlockJoin
+ Works out of the box with all Lucene’s features
‒ Requires upfront design effort
‒ Memory usage depends on the nested data structure
• Tree-Labelling
+ Can handle arbitrary and large nested models
+ Memory friendly
‒ Lucene’s features have to be re-thought and re-implemented
61. Conclusion
• Nested data models are becoming more and more prevalent
• Searching nested data brings new challenges: performance, scalability, flexibility
• Different approaches exist, each with pros and cons
• SIREn plugin based on tree-labelling techniques
• Enables new kinds of search applications, e.g., a relational faceted browser, with subsecond response time
• SIREn Availability
– Trial license currently available
– In negotiation with the University to open-source
62. Acknowledgement
This material is based upon works supported by the European FP7 project LOD2 (257943) and the Irish Research Council for Science, Engineering and Technology.