High Performance JSON Search and Relational Faceted Browsing with Lucene

Renaud Delbru
renaud@sindicetech.com
renaud.delbru@deri.org

Co-Founder, SindiceTech
Post-Doctoral Researcher, NUIG

My Background
• Lucene / Solr
– User for 7 years
– Built a web search engine – sindice.com (700M documents)
• Academia & Research
– Ph.D. in Information Retrieval and Semantic Web
– Post-doctoral researcher at National University of Ireland, Galway
• Industry
– Technical co-founder of SindiceTech
– Management Platform for Enterprise Knowledge Graph

Agenda
• Nested Data Model
• SIREn Overview & Theory
• SIREn Plugin Architecture
• Relational Faceted Browsing
• Comparison with BlockJoin

Nested Data Model: Why is it important ?
• SQL
– Query-time join performance penalty
• NoSQL
– Denormalisation of relational data into nested data
– Convert many-to-one/many into one-to-many relationships

Denormalising Relational Data

[Figure: relational view. Nodes: LucidWorks, Series A, Series B and a single Granite Ventures record.]

[Figure: denormalised view. Nodes: LucidWorks, Series A and Series B, with a separate copy of Granite Ventures for each round.]

Nested Data Model: Why is it important ?
• SQL
– Query-time join performance penalty
• NoSQL
– Denormalisation of relational data into nested data
– Convert many-to-one/many into one-to-many relationships
– Duplicate data …
– … but avoid joins
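
As a rough illustration of the denormalisation described above, the Python sketch below nests a copy of each investor record under every funding round of a company. The table layout, the field names and the helper `denormalise` are assumptions made for this example; only the company, round and investor names come from the slides.

```python
# Hypothetical relational tables (layout assumed for illustration only).
companies = {1: {"name": "LucidWorks"}}
investors = {10: {"name": "Granite Ventures"}}
funding_rounds = [
    {"company_id": 1, "round": "Series A", "investor_ids": [10]},
    {"company_id": 1, "round": "Series B", "investor_ids": [10]},
]


def denormalise(company_id):
    """Build one nested document per company, copying investors into each round."""
    doc = dict(companies[company_id])
    doc["funding_rounds"] = [
        {
            "round": fr["round"],
            # Each round receives its own copy of the investor record:
            # data is duplicated, but no join is needed at query time.
            "investments": [dict(investors[i]) for i in fr["investor_ids"]],
        }
        for fr in funding_rounds
        if fr["company_id"] == company_id
    ]
    return doc


lucidworks = denormalise(1)
```

The many-to-many link between rounds and investors becomes a one-to-many nesting, at the price of storing Granite Ventures twice.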

Schema-Less Nested Data Model
• Model becoming prevalent: JSON, XML, Avro, …
– Can be arbitrarily nested and large
– No strict schema / structure enforced
• Schema-less brings
– Flexibility
– Ease of development
• Developers do not have to invest significant modelling effort upfront

Introducing SIREn
• Lucene/Solr plugin for indexing and searching JSON
• Rich data model (JSON)
– Nested objects, nested arrays, datatypes
• Schema-agnostic
– No need to define structure (nested model)
– No need to define schema (fields)

Overview of the SIREn API
Document

{
  "name" : "LucidWorks",
  "category_code" : "analytics",
  "funding_rounds" : [
    {
      "round_code" : "a",
      "raised_amount" : 6000000,
      "funded_year" : 2009,
      "investments" : [
        {
          "name" : "Granite Ventures",
          "type" : "financial-org"
        },
        …
      ]
    },
    …
  ]
}

Query

(category_code : analytics)
AND
(funding_rounds : {
  round_code : seed OR a OR angel,
  raised_amount : [0 TO 12000000],
  * : {
    type : financial-org
  }
})

Theory behind SIREn
• Inspired by tree-labelling scheme techniques (XML IR)
– Label each node with a hierarchical id (here Dewey’s identifiers)
• Full-text search operators over the content of a node
• Structural search operators over the nodes of the tree
– Ancestor-Descendant, Parent-Child, Sibling, …
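
With Dewey identifiers, the structural operators listed above reduce to prefix comparisons on the labels. The following Python sketch shows one possible formulation; the function names are hypothetical and are not taken from the SIREn API.

```python
def parse(label):
    """Turn a Dewey label such as '1.2.2.1' into a tuple of ints."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a, b):
    """True if node a is a proper ancestor of node b (a is a strict prefix of b)."""
    a, b = parse(a), parse(b)
    return len(a) < len(b) and b[:len(a)] == a


def is_parent(a, b):
    """True if node a is the direct parent of node b."""
    a, b = parse(a), parse(b)
    return len(b) == len(a) + 1 and b[:-1] == a


def is_sibling(a, b):
    """True if a and b are distinct nodes sharing the same parent."""
    a, b = parse(a), parse(b)
    return a != b and len(a) == len(b) and a[:-1] == b[:-1]
```

Comparing parsed integer components rather than raw strings avoids treating `1.1` as a prefix of `1.10.1`.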

Theory behind SIREn: Tree-Labelling

[Figure: the JSON document drawn as a tree, with each node labelled by its Dewey identifier.]

Node id       Node
1             (root)
1.1           name
1.1.1         LucidWorks
1.2           funding_rounds
1.2.1, 1.2.2  nodes nested under funding_rounds
1.2.2.1       round_code
1.2.2.1.1     a
1.2.2.2       raised_amount
1.2.2.2.1     6000000
…

Theory behind SIREn: Query Processing

[Figure: a query over the terms name and LucidWorks is evaluated against the inverted index.]

Inverted Index
Term          Node identifiers
name          1.1, 2.2, 2.5
LucidWorks    1.5.3, 2.2.1, 4.2.1
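
A minimal sketch of how such a lookup could be resolved, assuming the query asks for LucidWorks as a direct child of a name node. The posting lists are copied from the slide; the function names and the use of a hash set (instead of a merge over sorted posting lists, which a real index would more likely use) are simplifications made for this example.

```python
# Posting lists from the slide: term -> node identifiers (Dewey labels).
inverted_index = {
    "name": ["1.1", "2.2", "2.5"],
    "LucidWorks": ["1.5.3", "2.2.1", "4.2.1"],
}


def to_path(label):
    """Parse a Dewey label into a tuple of ints."""
    return tuple(int(p) for p in label.split("."))


def parent_child_matches(index, field_term, value_term):
    """Return (field_node, value_node) pairs where the value node is a child of the field node."""
    field_nodes = {to_path(label) for label in index.get(field_term, [])}
    matches = []
    for label in index.get(value_term, []):
        parent = to_path(label)[:-1]
        if parent in field_nodes:  # the parent of the value node carries the field term
            matches.append((".".join(map(str, parent)), label))
    return matches


result = parent_child_matches(inverted_index, "name", "LucidWorks")
```

Only the pair `2.2` / `2.2.1` satisfies the parent-child constraint; the other occurrences of LucidWorks sit under nodes that do not carry the term name.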

SIREn Plugin Architecture - Overview

[Figure: architecture diagram; the legend distinguishes Lucene components from SIREn components.]
• Document → Analysis: JSON Analyzer
• Query → Flexible Query Parser: JSON Query Parser → Node Query
• Codec: Tree-Labelling Codec

JSON Field

<fields>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="json" type="json" indexed="true" stored="false"/>
  …
</fields>
<types>
  <fieldType name="json"
             class="org.sindice.siren.solr.schema.JsonField"
             datatypeConfig="datatypes.xml"/>
  …
</types>

schema.xml sample

Datatypes

<datatype name="http://www.w3.org/2001/XMLSchema#String"
          class="org.sindice.siren.solr.schema.TextDatatype">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</datatype>
<datatype name="http://www.w3.org/2001/XMLSchema#int"
          class="org.sindice.siren.solr.schema.TrieDatatype"
          precisionStep="8"
          type="integer"/>

datatypes.xml sample

JSON Tokenizer
• Traverses JSON tree using Depth-First Search
• Generates one token per JSON node
• Attaches metadata attributes (Dewey id, datatype, …) to each token

Tokenizer Output
Token            Node id    Type
name             1.1        Field
LucidWorks       1.1.1      String
funding_rounds   1.2        Field
round_code       1.2.2.1    String
…
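
The traversal can be sketched in Python as a recursive generator. This is not the SIREn tokenizer: the numbering convention (1-based children, values numbered under their field), the datatype names for non-string values and the decision to emit no token for object nodes are assumptions of this example, so the identifiers it produces may differ from those on the slides. Arrays nested directly inside arrays are not handled.

```python
def dewey(path):
    """Render a tuple of ints as a dotted Dewey label."""
    return ".".join(map(str, path))


def tokenize(obj, path=(1,)):
    """Depth-first traversal yielding (text, dewey_id, type) for fields and values."""
    for i, (key, value) in enumerate(obj.items(), start=1):
        field_path = path + (i,)
        yield key, dewey(field_path), "Field"
        # A scalar or a single object is treated like a one-element array.
        items = value if isinstance(value, list) else [value]
        for j, item in enumerate(items, start=1):
            child_path = field_path + (j,)
            if isinstance(item, dict):
                yield from tokenize(item, child_path)  # nested object: recurse
            else:
                datatype = "String" if isinstance(item, str) else type(item).__name__
                yield str(item), dewey(child_path), datatype


doc = {
    "name": "LucidWorks",
    "funding_rounds": [{"round_code": "a", "raised_amount": 6000000}],
}
tokens = list(tokenize(doc))
```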

JSON Analyzer – NodeTokenizerFilter
• Tokenize the content of a node token based on its datatype

Input
Token            Node id    Type
name             1.1        Field
funding_rounds   1.2        Field
LucidWorks       1.1.1      String
round_code       1.2.2.1    String
…

Output
name
funding_rounds, funding, rounds   (tokenized with Field datatype analyzer)
LucidWorks, lucid, works          (tokenized with String datatype analyzer)
…
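
The dispatch on datatype can be sketched as follows. The two analyzers are stand-ins written to reproduce the output shown on the slide (splitting on underscores and on case changes); they are not the analyzers shipped with SIREn, and all names are hypothetical.

```python
import re


def analyze_field(text):
    """Hypothetical Field analyzer: keep the name and add its underscore-separated parts."""
    parts = text.split("_")
    return [text] + [p for p in parts if p and p != text]


def analyze_string(text):
    """Hypothetical String analyzer: keep the value and add lowercased case-change parts."""
    parts = [p.lower() for p in re.findall(r"[A-Z][a-z]+|[a-z]+|\d+", text)]
    return [text] + [p for p in parts if p != text]


ANALYZERS = {"Field": analyze_field, "String": analyze_string}


def node_tokenizer_filter(tokens):
    """Re-tokenize each node token with the analyzer registered for its datatype."""
    for text, node_id, datatype in tokens:
        for term in ANALYZERS[datatype](text):
            yield term, node_id, datatype  # sub-tokens inherit the node metadata


node_tokens = [
    ("name", "1.1", "Field"),
    ("funding_rounds", "1.2", "Field"),
    ("LucidWorks", "1.1.1", "String"),
]
terms = [term for term, _, _ in node_tokenizer_filter(node_tokens)]
```

Every sub-token keeps the node identifier of the node it came from, which is what later allows full-text matches to be tied back to a position in the tree.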

JSON Analyzer – NodePayloadFilter
• Encode metadata attributes into a term payload
• Leverage Payload API to transfer attributes to the Codec API
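
To make the idea of packing node metadata into a byte payload concrete, here is a small Python sketch using variable-byte integers. The layout (datatype id, path length, path components) is invented for this example and should not be read as SIREn's actual payload format.

```python
def encode_vint(n):
    """Encode a non-negative int in 7-bit groups, low-order group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_payload(node_path, datatype_id):
    """Hypothetical layout: datatype id, path length, then each path component."""
    payload = bytearray(encode_vint(datatype_id))
    payload += encode_vint(len(node_path))
    for component in node_path:
        payload += encode_vint(component)
    return bytes(payload)


def decode_payload(data):
    """Inverse of encode_payload: returns (node_path, datatype_id)."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    datatype_id, length, *path = values
    return tuple(path[:length]), datatype_id


payload = encode_payload((1, 2, 2, 1), datatype_id=3)
```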

Tree-Labelling Codec – File Structure

File    Block content
.doc    Header, Doc identifiers, Node frequencies
.nod    Header, Node identifiers, Term frequencies
.pos    Header, Term positions

Tree-Labelling Codec – Compression
• Adaptive Frame Of Reference
– Adapt the encoding to the integer distribution
– Better tolerance against outliers
– Very effective with frequencies, node identifiers and positions (higher compression rate)

[Figure: a FOR block shown with one BFS; an AFOR block shown with four BFS.]
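
The benefit of adapting the frame to the data can be illustrated by estimating encoded sizes. The Python sketch below only counts bits, it does not pack them, and the frame sizes, the 8-bit frame header and the greedy selection are assumptions of this example rather than a description of the codec's implementation.

```python
def bits_needed(values):
    """Bit width required for the largest value in a frame (at least 1)."""
    return max(1, max(values).bit_length())


def for_cost(block, header_bits=8):
    """Frame Of Reference: a single bit width for the whole block."""
    return header_bits + bits_needed(block) * len(block)


def afor_cost(block, frame_sizes=(8, 16, 32), header_bits=8):
    """Simplified adaptive variant: greedily pick, frame by frame, the frame
    size with the lowest cost per integer, paying one header per frame."""
    total, pos = 0, 0
    while pos < len(block):
        best = None
        for size in frame_sizes:
            frame = block[pos:pos + size]
            cost = header_bits + bits_needed(frame) * len(frame)
            per_int = cost / len(frame)
            if best is None or per_int < best[0]:
                best = (per_int, cost, len(frame))
        total += best[1]
        pos += best[2]
    return total


# 31 small values followed by one outlier.
block = [1] * 31 + [1000]
single_frame_bits = for_cost(block)
adaptive_bits = afor_cost(block)
```

With a single frame the outlier forces 10 bits on every value, while the adaptive variant confines that width to the last short frame.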

Node Query
• Query Processing
– Collects matching document and node identifiers
– Posting list traversal order: document ids, node ids then positions
• Adaptation of all Lucene’s Query classes to the new file structure
– NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …

Node Query
• Query Processing
• TwigQuery
– Consists of a root query and one or more descendant or child queries
– Can be nested to form a complex tree structure
– Can be rewritten as a pure boolean query

[Figure: example twig query trees. Node labels: Boolean, Phrase, Twig, Range; edge labels: MUST, SHOULD, NOT.]
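
The semantics of a simple twig can be sketched over an index of Dewey paths. This Python example supports only a term at the root and term children with MUST / NOT occurrences, matched as direct children; nesting, SHOULD clauses, scoring and the class name `Twig` are outside what the slides specify and are simplified or invented here.

```python
from dataclasses import dataclass, field


@dataclass
class Twig:
    root: str                                     # term required at the root node
    must: list = field(default_factory=list)      # child terms that must occur
    must_not: list = field(default_factory=list)  # child terms that must not occur


def has_child(index, term, parent):
    """True if term occurs at some node whose parent is the given node."""
    return any(node[:-1] == parent for node in index.get(term, ()))


def evaluate(twig, index):
    """Return the root node paths matching the twig, in sorted order."""
    hits = []
    for node in sorted(index.get(twig.root, ())):
        required = all(has_child(index, t, node) for t in twig.must)
        excluded = any(has_child(index, t, node) for t in twig.must_not)
        if required and not excluded:
            hits.append(node)
    return hits


# Toy index: term -> set of node paths (first component is the document).
index = {
    "round_code": {(1, 2, 1, 1), (2, 2, 1, 1)},
    "a": {(1, 2, 1, 1, 1)},
    "seed": {(2, 2, 1, 1, 1)},
}
with_a = evaluate(Twig("round_code", must=["a"]), index)
without_a = evaluate(Twig("round_code", must_not=["a"]), index)
```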

Application: Relational Faceted Navigation
• Faceted Navigation
– Data-driven exploratory interface
– User incrementally adds constraints
– Restricted to one record collection
• Relational Faceted Navigation
– Enables navigation of interrelated record collections
– Constraints affect all record collections
– New navigation operation: Pivot (switch user view to a record collection)

Relational Faceted Navigation – Demo

HCLS Demo: http://hcls.sindice.com/pivot-browser/

Data Model
• Each collection has its own data model (document)
• Lucene fields for facets
• JSON field for relationships with records from other collections

[Figure: the Company, Investment and Investor collections, with facet fields Country, Year, Type, Category and Amount, and one JSON field per collection.]

JSON Model
• JSON field: Tree covering all the relationships with records from other collections
• Resulting tree can be very large

Company
  category_code
  country_code
  funding_rounds […]
    round_code
    raised_amount
    investments […]
      type

Investment
  round_code
  raised_amount
  funding_rounds -1 […]
    category_code
    country_code
  investments […]
    type

Investor
  type
  investments -1 […]
    round_code
    raised_amount
    funding_rounds -1 […]
      category_code
      country_code

Navigation Model: Drill-Down

Lucene query
collection : Company
AND
country_code : irl
AND
category_code : software

Navigation Model: Pivot

Query Rewriting

Preceding Lucene query
  collection : Company
  AND
  country_code : irl
  AND
  …

Pivot to Investment
  Lucene query
    collection : Investment
  JSON query
    funding_rounds -1 : {
      country_code : irl,
      …
    }

Pivot to Investor
  Lucene query
    collection : Investor
  JSON query
    investments -1 : {
      founded_year : 2012,
      … {
        country_code : irl,
        …
      }
    }
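
The rewriting step can be pictured as wrapping the constraints gathered so far under the inverse relationship that leads from the target collection back to the previous one. The Python sketch below uses a plain dictionary as navigation state; this representation and the function `pivot` are assumptions made for illustration, and the constraint values are those of the drill-down example.

```python
def pivot(state, target_collection, inverse_relation):
    """Rewrite the navigation state when the user pivots to another collection.

    Constraints accumulated so far are nested under the inverse relationship
    linking the target collection back to the previous one.
    """
    return {
        "collection": target_collection,
        "constraints": {inverse_relation: dict(state["constraints"])},
    }


company_view = {
    "collection": "Company",
    "constraints": {"country_code": "irl", "category_code": "software"},
}
investment_view = pivot(company_view, "Investment", "funding_rounds -1")
investor_view = pivot(investment_view, "Investor", "investments -1")
```

Each further pivot adds one level of nesting, which is why the JSON query grows deeper as the user moves from Company to Investment to Investor.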

• Lucene BlockJoin
– Introduced support for indexing and searching nested data …
– … for small and well-defined schema

Lucene BlockJoin - Scalability
• Artificially increases the number of documents in the index
– One document per nested data record
• Cache size linear with the number of nested data records
– Increased memory usage
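
To give a feel for the growth in document count, the Python sketch below counts how many index documents a block-style layout would need for one JSON record, assuming one index document per JSON object (the record itself and every nested object). The helper name and the sample record are illustrative.

```python
def count_block_documents(node):
    """Count index documents under the assumption of one document per JSON object."""
    if isinstance(node, dict):
        return 1 + sum(count_block_documents(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_block_documents(v) for v in node)
    return 0  # scalar values do not become documents


record = {
    "name": "LucidWorks",
    "funding_rounds": [
        {"round_code": "a", "investments": [{"name": "Granite Ventures"}]},
    ],
}
block_docs = count_block_documents(record)
```

A single company record with one round and one investment already maps to three index documents under this assumption.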

Lucene BlockJoin - Flexibility
• Developers must be aware of the relations between nested data records
– At indexing time to tag parent records
– At querying time to filter parent records
• Upfront effort required to design and configure the system
– Define Parent-Child relationships between record collections
– Define attributes for each record collection
• If not properly designed, risk of incorrect matches

• BlockJoin
+ Works out of the box with all Lucene’s features
‒ Requires upfront design effort
‒ Memory usage dependent on nested data structure
• Tree-Labelling
+ Can handle arbitrary and large nested model
+ Memory friendly
‒ Have to re-think and re-implement Lucene’s features

Conclusion
• Nested data models are becoming increasingly prevalent
• Searching nested data brings new challenges: performance, scalability, flexibility
• Different approaches exist, each one with pros and cons
• SIREn plugin based on tree-labelling techniques
• Enables new kinds of search applications, e.g., relational faceted browser, with subsecond response time

• SIREn Availability
– Trial license currently available
– In negotiation with the University to open-source

Acknowledgement
This material is based upon works supported by the European FP7 project LOD2
(257943) and the Irish Research Council for Science, Engineering and Technology.
