HIGH PERFORMANCE JSON SEARCH AND
RELATIONAL FACETED BROWSING WITH LUCENE

Renaud Delbru
renaud@sindicetech.com
Presented by Renaud Delbru, Co-Founder, SindiceTech

In this presentation, we will discuss how Lucene and Solr can be used for very efficient search of tree-shaped schemaless documents, e.g. JSON or XML, and how they can then be made to address both graph and relational data search. We will discuss the capabilities of SIREn, a Lucene/Solr plugin we have developed to deal with huge collections of tree-shaped schemaless documents, and how SIREn is built using Lucene's extensibility capabilities (Analysis, Codec, Flexible Query Parser). We will compare it with Lucene's BlockJoin Query API in scenarios that involve large amounts of nested, schemaless data. We will then go through use cases that show how relational or graph data can be turned into JSON documents using Hadoop and Pig, and how this can be used in conjunction with SIREn to create relational faceting systems with unprecedented performance. Takeaways from this session are an awareness of how Lucene/Solr and Hadoop can be used for relational and graph data search, and that it is now possible to build relational faceted browsers with sub-second response times on commodity hardware.

Published in: Technology
Transcript of "High Performance JSON Search and Relational Faceted Browsing with Lucene"

  1. HIGH PERFORMANCE JSON SEARCH AND RELATIONAL FACETED BROWSING WITH LUCENE
     Renaud Delbru
     renaud@sindicetech.com, renaud.delbru@deri.org
     Co-Founder, SindiceTech
     Post-Doctoral Researcher, NUIG
  2. My Background
     • Lucene / Solr
       – User for 7 years
       – Built a web search engine, sindice.com (700M documents)
     • Academia & Research
       – Ph.D. in Information Retrieval and Semantic Web
       – Post-doctoral researcher at National University of Ireland, Galway
     • Industry
       – Technical co-founder of SindiceTech
       – Management Platform for Enterprise Knowledge Graph
  3. Agenda
     • Nested Data Model
     • SIREn Overview & Theory
     • SIREn Plugin Architecture
     • Relational Faceted Browsing
     • Comparison with BlockJoin
  4. Nested Data Model: Why is it important?
     • SQL
       – Query-time join performance penalty
     • NoSQL
       – Denormalisation of relational data into nested data
       – Convert many-to-one/many into one-to-many relationships
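  5. Denormalising Relational Data
     [Figure: relational graph linking LucidWorks, Series A, Series B and Granite Ventures; Granite Ventures appears once]
  6. Denormalising Relational Data
     [Figure: nested tree under LucidWorks; Series A and Series B each carry their own copy of Granite Ventures]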
  7. Nested Data Model: Why is it important?
     • SQL
       – Query-time join performance penalty
     • NoSQL
       – Denormalisation of relational data into nested data
       – Convert many-to-one/many into one-to-many relationships
       – Duplicate data …
       – … but avoid joins
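
The denormalisation described on slides 5 to 7 can be sketched in a few lines of Python. This is only an illustration: the table layout, the column names (`company_id`, `round_id`, `investor`) and the `denormalise` helper are assumptions made for the example and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical relational rows; table and column names are illustrative only.
companies = [{"id": 1, "name": "LucidWorks"}]
funding_rounds = [
    {"id": 10, "company_id": 1, "round_code": "a"},
    {"id": 11, "company_id": 1, "round_code": "b"},
]
investments = [
    {"round_id": 10, "investor": "Granite Ventures"},
    {"round_id": 11, "investor": "Granite Ventures"},
]

def denormalise(companies, funding_rounds, investments):
    """Fold the join tables into one nested document per company."""
    investors_by_round = defaultdict(list)
    for inv in investments:
        investors_by_round[inv["round_id"]].append({"name": inv["investor"]})

    rounds_by_company = defaultdict(list)
    for rnd in funding_rounds:
        rounds_by_company[rnd["company_id"]].append({
            "round_code": rnd["round_code"],
            # The investor record is copied into every round it took part in.
            "investments": investors_by_round[rnd["id"]],
        })

    return [
        {"name": c["name"], "funding_rounds": rounds_by_company[c["id"]]}
        for c in companies
    ]

docs = denormalise(companies, funding_rounds, investments)
```

The investor is stored twice, once per round, which is the "duplicate data but avoid joins" trade-off named on the slide.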
  8. Schema-Less Nested Data Model
     • Model becoming prevalent: JSON, XML, Avro, …
       – Can be arbitrarily nested and large
       – No strict schema / structure enforced
     • Schema-less brings
       – Flexibility
       – Ease of development
     • Developers do not have to invest significant modelling effort upfront
  9. Introducing SIREn
     • Lucene/Solr plugin for indexing and searching JSON
     • Rich data model (JSON)
       – Nested objects, nested arrays, datatypes
     • Schema-agnostic
       – No need to define structure (nested model)
       – No need to define schema (fields)
  10. Overview of the SIREn API
      Document:
        {
          "name" : "LucidWorks",
          "category_code" : "analytics",
          "funding_rounds" : [
            {
              "round_code" : "a",
              "raised_amount" : 6000000,
              "funded_year" : 2009,
              "investments" : [
                {
                  "name" : "Granite Ventures",
                  "type" : "financial-org"
                },
                …
              ]
            },
            …
          ]
        }
      Query:
        (category_code : analytics) AND
        (funding_rounds : {
          round_code : seed OR a OR angel,
          raised_amount : [0 TO 12000000],
          * : { type : financial-org }
        })
  11. Theory behind SIREn
      • Inspired by tree-labelling scheme techniques (XML IR)
        – Label each node with a hierarchical identifier (here, Dewey identifiers)
      • Full-text search operators over the content of a node
      • Structural search operators over the nodes of the tree
        – Ancestor-Descendant, Parent-Child, Sibling, …
  12. Theory behind SIREn: Tree-Labelling
        {
          "name" : "LucidWorks",
          "category_code" : "analytics",
          "funding_rounds" : [
            {
              "round_code" : "a",
              "raised_amount" : 6000000,
              "funded_year" : 2009,
              …
            },
            …
          ]
        }
      [Figure: the document drawn as a tree; nodes shown: name, LucidWorks, funding_rounds, round_code, a, raised_amount, 6000000, …]
  13. Theory behind SIREn: Tree-Labelling
        {
          "name" : "LucidWorks",
          "category_code" : "analytics",
          "funding_rounds" : [
            {
              "round_code" : "a",
              "raised_amount" : 6000000,
              "funded_year" : 2009,
              …
            },
            …
          ]
        }
      [Figure: the same tree with a Dewey identifier on each node]
        root            1
        name            1.1
        LucidWorks      1.1.1
        funding_rounds  1.2
        (intermediate)  1.2.1, 1.2.2
        round_code      1.2.2.1
        a               1.2.2.1.1
        raised_amount   1.2.2.2
        6000000         1.2.2.2.1
        …
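
As a rough illustration of tree-labelling, the sketch below walks a parsed JSON document depth-first and gives every field name and value a Dewey identifier. The function `label_tree` and its numbering rules are assumptions made for this example; SIREn's own numbering of array elements may differ from what is produced here and from what the slide shows.

```python
def label_tree(value, label=(1,), out=None):
    """Depth-first walk that assigns a Dewey label to every field and value.

    Returns a list of (label, content) pairs. The root (label "1") has no
    content of its own, so it is not emitted.
    """
    if out is None:
        out = []
    if isinstance(value, dict):
        for i, (key, child) in enumerate(value.items(), start=1):
            key_label = label + (i,)
            out.append((".".join(map(str, key_label)), key))
            label_tree(child, key_label, out)
    elif isinstance(value, list):
        # Assumption: each array element takes one level in the label.
        for i, child in enumerate(value, start=1):
            label_tree(child, label + (i,), out)
    else:
        # A scalar value becomes the first child of its field node.
        out.append((".".join(map(str, label + (1,))), value))
    return out

doc = {
    "name": "LucidWorks",
    "funding_rounds": [{"round_code": "a", "raised_amount": 6000000}],
}
labels = dict(label_tree(doc))
```

Because a label is a prefix of the labels of all its descendants, structural relationships such as parent-child can be tested by comparing identifiers alone.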
  14. Theory behind SIREn: Query Processing
      Query: name, LucidWorks
      Inverted Index:
        name        1.1    2.2    2.5
        LucidWorks  1.5.3  2.2.1  4.2.1
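
One way to read the posting lists on this slide is as a structural join: find the occurrences of "LucidWorks" whose parent node is an occurrence of "name". The sketch below does that under two assumptions that the transcript does not confirm: that the first component of each identifier is the document number, and that the query asks for a parent-child relationship. The helper names are hypothetical, and a real engine would merge the sorted lists rather than build a set.

```python
def parse(label):
    """Turn a dotted identifier such as "2.2.1" into a tuple of ints."""
    return tuple(int(part) for part in label.split("."))

# Posting lists as shown on the slide.
postings = {
    "name": [parse(l) for l in ("1.1", "2.2", "2.5")],
    "LucidWorks": [parse(l) for l in ("1.5.3", "2.2.1", "4.2.1")],
}

def parent_child_join(parents, children):
    """Return (parent, child) pairs where child is a direct child of parent."""
    parent_set = set(parents)
    # Dropping the last component of a Dewey identifier yields its parent.
    return [(c[:-1], c) for c in children if c[:-1] in parent_set]

matches = parent_child_join(postings["name"], postings["LucidWorks"])
```

With these lists only 2.2.1 qualifies, because 2.2 is the one "name" node that has a "LucidWorks" child.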
  21. SIREn Plugin Architecture - Overview
      [Figure: architecture diagram; input: Document]
        Lucene                  SIREn
        Analysis                JSON Analyzer
        Flexible Query Parser   JSON Query Parser
        Query                   Node Query
        Codec                   Tree-Labelling Codec
  22. JSON Field
      schema.xml sample:
        <fields>
          <field name="id" type="string" indexed="true" stored="true"/>
          <field name="json" type="json" indexed="true" stored="false"/>
          …
        </fields>
        <types>
          <fieldType name="json"
                     class="org.sindice.siren.solr.schema.JsonField"
                     datatypeConfig="datatypes.xml"/>
          …
        </types>
  23. Datatypes
      datatypes.xml sample:
        <datatype name="http://www.w3.org/2001/XMLSchema#String"
                  class="org.sindice.siren.solr.schema.TextDatatype">
          <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
          </analyzer>
        </datatype>
        <datatype name="http://www.w3.org/2001/XMLSchema#int"
                  class="org.sindice.siren.solr.schema.TrieDatatype"
                  precisionStep="8"
                  type="integer"/>
  24. JSON Tokenizer
      • Traverses JSON tree using Depth-First Search
      • Generates one token per JSON node
      • Attaches metadata attributes (Dewey id, datatype, …) to each token
      Tokenizer Output:
        Token           Dewey id  Datatype
        name            1.1       Field
        LucidWorks      1.1.1     String
        funding_rounds  1.2       Field
        round_code      1.2.2.1   String
        …
  25. JSON Analyzer – NodeTokenizerFilter
      • Tokenize the content of a node token based on its datatype
      Input:
        name            1.1       Field
        funding_rounds  1.2       Field
        LucidWorks      1.1.1     String
        round_code      1.2.2.1   String
        …
      Output:
        name            →  name
        funding_rounds  →  funding, rounds
        LucidWorks      →  lucid, works
        …
  26. JSON Analyzer – NodeTokenizerFilter (continued)
      LucidWorks → lucid, works: tokenized with String datatype analyzer
  27. JSON Analyzer – NodeTokenizerFilter (continued)
      funding_rounds → funding, rounds: tokenized with Field datatype analyzer
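
The per-datatype dispatch on slides 25 to 27 could look roughly like the sketch below. It is not SIREn's implementation: the two analyzers are stand-ins written to reproduce the slide's output (splitting on underscores for Field, on camel-case boundaries for String), and every function name here is hypothetical.

```python
import re

def field_analyzer(text):
    # Assumed behaviour: split field names on underscores and lowercase them.
    return [t.lower() for t in text.split("_") if t]

def string_analyzer(text):
    # Assumed behaviour: split on camel-case boundaries and lowercase.
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", text)]

# One analyzer per datatype, mirroring the idea of datatypes.xml.
ANALYZERS = {"Field": field_analyzer, "String": string_analyzer}

def node_tokenizer_filter(node_tokens):
    """Split each node token with the analyzer registered for its datatype.

    Every emitted term keeps the Dewey id of the node it came from.
    """
    for text, node_id, datatype in node_tokens:
        for term in ANALYZERS[datatype](text):
            yield term, node_id

tokens = list(node_tokenizer_filter([
    ("name", "1.1", "Field"),
    ("LucidWorks", "1.1.1", "String"),
    ("funding_rounds", "1.2", "Field"),
]))
```

Keeping the node id on every term is what later allows a full-text match to be tied back to a position in the tree.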
  28. JSON Analyzer – NodePayloadFilter
      • Encode metadata attributes into a term payload
      • Leverage Payload API to transfer attributes to the Codec API
  30. Tree-Labelling Codec – File Structure
      Block layout per file:
        .doc   Header | Doc identifiers | Node frequencies
        .nod   Header | Node identifiers | Term frequencies
        .pos   Header | Term positions
  31. Tree-Labelling Codec – Compression
      • Adaptive Frame Of Reference
        – Adapt the encoding to the integer distribution
        – Better tolerance against outliers
        – Very effective with frequencies, node identifiers and positions (higher compression rate)
      [Figure: a block encoded with FOR (one BFS) compared with AFOR (four BFS)]
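
To see why adapting the frame to the data helps with outliers, the sketch below compares the bit cost of one fixed width per block with the cost when each frame chooses its own width. It is a simplified model, not the AFOR algorithm: it ignores the reference-value subtraction of real Frame Of Reference coding, uses fixed-length frames where AFOR varies the frame length, and the `frame_len` and `header_bits` values are arbitrary assumptions.

```python
def bits_needed(values):
    """Bits per value so that the largest value in the frame fits."""
    return max(max(values).bit_length(), 1)

def for_cost(block):
    """Single frame: one bit width for the whole block."""
    return bits_needed(block) * len(block)

def afor_cost(block, frame_len=8, header_bits=8):
    """Simplified adaptive variant: each frame picks its own bit width."""
    total = 0
    for start in range(0, len(block), frame_len):
        frame = block[start:start + frame_len]
        # Each frame pays a small header that records its bit width.
        total += header_bits + bits_needed(frame) * len(frame)
    return total

# 32 small integers with a single outlier (1000) in the last frame.
block = [3, 1, 2, 4, 1, 2, 3, 1] * 3 + [2, 1, 1000, 3, 2, 1, 4, 2]

single_frame_bits = for_cost(block)   # the outlier forces 10 bits on all 32 values
adaptive_bits = afor_cost(block)      # only the last frame pays for the outlier
```

In this toy block the outlier raises the cost of every value under a single frame, while the adaptive layout confines it to one frame.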
  33. Node Query
      • Query Processing
        – Collects matching document and node identifiers
        – Posting list traversal order: document ids, node ids, then positions
      • Adaptation of all Lucene’s Query classes to the new file structure
        – NodeTermQuery, NodeBooleanQuery, NodePhraseQuery, …
  34. Node Query (continued)
      • TwigQuery
        – Consists of a root query and one or more descendant or child queries
      [Figure: example twig; labels: Boolean, Phrase, MUST, Boolean, SHOULD]
  36. Node Query (continued)
      • TwigQuery
        – Consists of a root query and one or more descendant or child queries
        – Can be nested to form complex tree structure
      [Figure: example twig; labels: Boolean, Phrase, Twig, MUST NOT, Boolean, Range, SHOULD, SHOULD]
  37. Node Query (continued)
      • TwigQuery
        – Consists of a root query and one or more descendant or child queries
        – Can be nested to form complex tree structure
        – Can be rewritten as a pure boolean query
      [Figure: example twig; labels: Boolean, Phrase, Twig, MUST NOT, Boolean, Range, SHOULD, SHOULD]
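
The shape of a twig query, a root with nested child constraints, can be illustrated by evaluating one directly against a parsed JSON document. This sketch is not how SIREn evaluates queries (SIREn works on posting lists of node identifiers); `matches_twig` is a hypothetical helper, every clause is treated as MUST, and the `*` wildcard from the slide 10 query is left out.

```python
def matches_twig(node, twig):
    """True if `node` (a dict) satisfies every child constraint of `twig`.

    A constraint is either a nested twig (dict), a predicate (callable),
    or a literal value that must equal the field's value.
    """
    for key, constraint in twig.items():
        if key not in node:
            return False
        value = node[key]
        # Arrays match if at least one element satisfies the constraint.
        candidates = value if isinstance(value, list) else [value]
        if isinstance(constraint, dict):
            if not any(isinstance(c, dict) and matches_twig(c, constraint)
                       for c in candidates):
                return False
        elif callable(constraint):
            if not any(constraint(c) for c in candidates):
                return False
        elif constraint not in candidates:
            return False
    return True

doc = {
    "name": "LucidWorks",
    "category_code": "analytics",
    "funding_rounds": [
        {"round_code": "a", "raised_amount": 6000000, "funded_year": 2009},
    ],
}

# Roughly the query from slide 10, without the wildcard clause.
twig = {
    "category_code": "analytics",
    "funding_rounds": {
        "round_code": lambda code: code in {"seed", "a", "angel"},
        "raised_amount": lambda amount: 0 <= amount <= 12000000,
    },
}

is_match = matches_twig(doc, twig)
```

Both constraints under `funding_rounds` must hold for the same array element, which is the property a flat, non-nested index cannot guarantee.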
  38. Application: Relational Faceted Navigation
      • Faceted Navigation
        – Data-driven exploratory interface
        – User incrementally adds constraints
        – Restricted to one record collection
      • Relational Faceted Navigation
        – Enables navigation of interrelated record collections
        – Constraints affect all record collections
        – New navigation operation: Pivot
          • Switch user view to a record collection
  39. Relational Faceted Navigation – Demo
      HCLS Demo: http://hcls.sindice.com/pivot-browser/
  40. Data Model
      • Each collection has its own data model (document)
      • Lucene fields for facets
      • JSON field for relationships with records from other collections
        Company    Investment  Investor
        Country    Year        Type
        Category   Amount
        JSON       JSON        JSON
  41. JSON Model
      • JSON field: Tree covering all the relationships with records from other collections
      • Resulting tree can be very large
      [Figure: the JSON trees of the Company, Investment and Investor collections side by side]
  42. JSON Model: Company
        category_code
        country_code
        funding_rounds […]
          round_code
          raised_amount
          investments […]
            type
  44. JSON Model: Investment
        round_code
        raised_amount
        funding_rounds -1 […]
          category_code
          country_code
        investments […]
          type
  46. JSON Model: Investor
        type
        investments -1 […]
          round_code
          raised_amount
          funding_rounds -1 […]
            category_code
            country_code
  48. Navigation Model: Drill-Down
  49. Navigation Model: Drill-Down
      Lucene query:
        collection : Company
        AND country_code : irl
        AND category_code : software
  50. Navigation Model: Pivot
  51. Navigation Model: Pivot
      Lucene query:
        collection : Investment
  52. Navigation Model: Pivot
      Preceding Lucene query:
        collection : Company
        AND country_code : irl
        AND category_code : software
      Query Rewriting
      Lucene query:
        collection : Investment
      JSON query:
        funding_rounds -1 : {
          country_code : irl,
          category_code : software
        }
  53. Navigation Model: Pivot
      Lucene query:
        collection : Investment
      JSON query:
        funding_rounds -1 : {
          country_code : irl,
          category_code : software
        }
  54. Navigation Model: Pivot
  55. Navigation Model: Pivot
      Lucene query:
        collection : Investor
      JSON query:
        investments -1 : {
          founded_year : 2012,
          funding_rounds -1 : {
            country_code : irl,
            category_code : software
          }
        }
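
The query rewriting on slides 52 to 55 can be sketched as follows: on each pivot, the constraints collected so far are wrapped under the inverse relationship that leads back to the previous collection. The `pivot` function and the state layout (`collection`, `facets`, `json`) are assumptions made for this illustration and are not SIREn's API.

```python
def pivot(state, relation, target):
    """Switch the view to `target`, nesting all current constraints under
    `relation`, the inverse relationship back to the previous collection."""
    nested = dict(state["facets"])   # facet constraints of the current view
    nested.update(state["json"])     # constraints inherited from earlier pivots
    return {"collection": target, "facets": {}, "json": {relation: nested}}

# Drill-down on Company (slide 49).
company = {
    "collection": "Company",
    "facets": {"country_code": "irl", "category_code": "software"},
    "json": {},
}

# Pivot to Investment (slides 52 and 53), then drill down on the year.
investment = pivot(company, "funding_rounds -1", "Investment")
investment["facets"]["founded_year"] = 2012

# Pivot to Investor (slide 55).
investor = pivot(investment, "investments -1", "Investor")
```

After the second pivot, `investor["json"]` has the same nesting as the JSON query on slide 55, while the Lucene part of the query reduces to the collection filter.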
  56. Comparison with BlockJoin
      • Lucene BlockJoin
        – Introduced support for indexing and searching nested data …
        – … for small and well-defined schemas
  57. Lucene BlockJoin - Scalability
      • Artificially increases the number of documents in the index
        – One document per nested data record
      • Cache size is linear in the number of nested data records
        – Increased memory usage
  58. Lucene BlockJoin - Flexibility
      • Developers must be aware of the relations between nested data records
        – At indexing time to tag parent records
        – At querying time to filter parent records
      • Upfront effort required to design and configure the system
        – Define Parent-Child relationships between record collections
        – Define attributes for each record collection
      • If not properly designed, risk of incorrect matches
  59. Comparison with BlockJoin
      • BlockJoin
        + Works out of the box with all Lucene’s features
        ‒ Requires upfront design effort
        ‒ Memory usage dependent on nested data structure
      • Tree-Labelling
        + Can handle arbitrary and large nested models
        + Memory friendly
        ‒ Have to re-think and re-implement Lucene’s features
  60. Conclusion
      • Nested data models are becoming more and more prevalent
      • Searching nested data brings new challenges: performance, scalability, flexibility
      • Different approaches exist, each one with pros and cons
      • SIREn plugin based on tree-labelling techniques
      • Enables new kinds of search applications, e.g., relational faceted browsers, with sub-second response time
      • SIREn Availability
        – Trial license currently available
        – In negotiation with the University to open-source
  61. Acknowledgement
      This material is based upon works supported by the European FP7 project LOD2 (257943) and the Irish Research Council for Science, Engineering and Technology.