Proposal for nested document support in Lucene
Published in: Technology
Transcript

  • 1. Nested Documents in Lucene
    High-performance support for parent/child document relations
    mark@searcharea.co.uk
  • 2. Problem:
    The Lucene data model is based on Documents, Fields and Terms. However, many real-world data structures cannot be properly represented when collapsed into a single Lucene document.
    [Diagram label: Single Lucene document]
  • 3. Problem: “Cross-matching”
    When two or more data structures of the same type are jumbled up into a single Lucene field, the matching logic becomes confused, e.g. more than one qualification in a resume.
    Source data: John's resume, with A1 in Maths and E1 in Science
    Single Lucene document:
      Name: John
      Grade: A1, E1
      Subject: Maths, Science
    False match for the query:
    Grade:A1 AND Subject:Science
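The false match can be reproduced without Lucene. The following is a minimal sketch, assuming a flattened document can be modeled as a map from field name to a set of terms; the class and method names (`CrossMatchDemo`, `matchesFlattened`, `matchesPairs`) are hypothetical and exist only for this illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a flattened Lucene document: field name -> set of terms.
class CrossMatchDemo {

    // John's two qualifications collapsed into multi-valued fields, as on the slide.
    static Map<String, Set<String>> flattenedResume() {
        return Map.of(
            "Name", Set.of("John"),
            "Grade", Set.of("A1", "E1"),
            "Subject", Set.of("Maths", "Science"));
    }

    // True if the document holds the given term anywhere in the given field.
    static boolean hasTerm(Map<String, Set<String>> doc, String field, String term) {
        return doc.getOrDefault(field, Set.of()).contains(term);
    }

    // Equivalent of Grade:<grade> AND Subject:<subject> on the flattened document.
    static boolean matchesFlattened(Map<String, Set<String>> doc, String grade, String subject) {
        return hasTerm(doc, "Grade", grade) && hasTerm(doc, "Subject", subject);
    }

    // The same question asked against the original {grade, subject} pairs.
    static boolean matchesPairs(List<String[]> qualifications, String grade, String subject) {
        for (String[] q : qualifications) {
            if (q[0].equals(grade) && q[1].equals(subject)) {
                return true;
            }
        }
        return false;
    }
}
```

On the flattened document, A1 and Science both appear, so the AND query succeeds even though no single qualification pairs them. Checking the original pairs gives the intended answer.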
  • 4. Unacceptable solution #1
    One modeling approach is to store related items in the same field and use proximity operators in queries.
    Source data: John's resume, with A1 in Maths and E1 in Science
    Single Lucene document:
      Name: John
      GradeAndSubject: A1 Maths….E1 Science
    Example query:
    GradeAndSubject:"A1 Science"~2
    Drawbacks:
    • Slow
    • Not scalable with number of fields:
      • Loss of fieldnames as context in query
      • Proximity distances must grow
      • Only one choice of Analyzer for a given field
  • 5. Unacceptable solution #2
    Use numbered fieldnames to group related structures.
    Source data: John's resume, with A1 in Maths and E1 in Science
    Single Lucene document:
      Name: John
      Grade1: A1
      Subject1: Maths
      Grade2: E1
      Subject2: Science
    Example query:
    (Grade1:A1 AND Subject1:Science) OR (Grade2:A1 AND Subject2:Science)
    Drawbacks:
    • Slow
    • Not scalable with number of nested structures:
      • More numbered fieldnames = more query complexity and more unique tokens in index
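To make the query growth concrete, here is a small sketch that generates the query string for a given number of numbered slots. `NumberedFieldQuery.build` is a hypothetical helper written for this illustration, not part of Lucene.

```java
// Hypothetical helper: builds the OR-of-ANDs query needed to cover every numbered slot.
class NumberedFieldQuery {

    // One (GradeN:... AND SubjectN:...) clause is required per slot, joined with OR.
    static String build(String grade, String subject, int slots) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= slots; i++) {
            if (i > 1) {
                sb.append(" OR ");
            }
            sb.append("(Grade").append(i).append(':').append(grade)
              .append(" AND Subject").append(i).append(':').append(subject)
              .append(')');
        }
        return sb.toString();
    }
}
```

With two slots this produces the example query from the slide. Every extra qualification a resume may hold adds another clause and another pair of field names to the index.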
  • 6. Solution: Nested documents
    The existing Lucene codebase can be used to store multiple “nested” documents that represent arbitrarily complex structures. Related documents are simply added in sequence.
    Source data: John's resume, with A1 in Maths and E1 in Science
    Documents added in sequence:
      Name: John
      Grade: A1, Subject: Maths
      Grade: E1, Subject: Science
    But how to query?
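A rough sketch of the "added in sequence" layout, assuming an in-memory list stands in for an index segment and the list position plays the role of the document id. The class `NestedIndexSketch` and its methods are hypothetical. The `docType` marker field is taken from the diagrams on the following slides, and using its presence to flag parents is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for an index segment; list position = doc id.
class NestedIndexSketch {
    final List<Map<String, String>> docs = new ArrayList<>();

    // Adds a parent followed immediately by its children so the block stays contiguous.
    void addBlock(Map<String, String> parent, List<Map<String, String>> children) {
        docs.add(parent);
        docs.addAll(children);
    }

    // One bit per document; a set bit marks a parent (assumed: any doc with a docType field).
    BitSet parentBits() {
        BitSet bits = new BitSet(docs.size());
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).containsKey("docType")) {
                bits.set(i);
            }
        }
        return bits;
    }
}
```

Adding John's resume as one block yields three consecutive documents: the parent at position 0 and the two qualifications at positions 1 and 2.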
  • 7. Solution: Nested Document Queries
    Nested documents need to be queried using a new NestedDocumentQuery class which understands document relationships.
    Documents in the index:
      docType: resume, Name: John
      Grade: A1, Subject: Maths
      Grade: E1, Subject: Science
    The new NestedDocumentQuery:
    • Executes the child search using any arbitrary Lucene Query object, e.g. Boolean, fuzzy, numeric etc.
    • Reports any matches as a match on the parent document, not the child
    • Super-fast evaluation of joins between child and parent
    • Requires an indexed field to identify parent documents
  • 8. Solution: Example Query
    Find the resume of a person called “John” with an A1 grade in Maths.
    Documents in the index:
      docType: resume, Name: John
      Grade: A1, Subject: Maths
      Grade: E1, Subject: Science
    The NestedDocumentQuery wrapper simply translates the stream of matches reported by the child-level query criteria into matches on the parent, where all the parent-level logic is then evaluated.
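The translation of child matches into parent matches can be sketched in plain Java. This is not the proposed NestedDocumentQuery implementation: it is a simplified illustration that scans documents in order, assumes a parent is any document carrying a `docType` field, and uses `Predicate` objects in place of Lucene queries. The second resume in `sampleDocs` (Jane) is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical sketch: predicates stand in for Lucene Query objects.
class NestedQuerySketch {

    // Returns ids of parents that satisfy parentQuery and own a child satisfying childQuery.
    static List<Integer> search(List<Map<String, String>> docs,
                                Predicate<Map<String, String>> parentQuery,
                                Predicate<Map<String, String>> childQuery) {
        TreeSet<Integer> hits = new TreeSet<>();
        int currentParent = -1;
        for (int id = 0; id < docs.size(); id++) {
            Map<String, String> doc = docs.get(id);
            if (doc.containsKey("docType")) {
                currentParent = id;            // a new parent/child block starts here
            } else if (currentParent >= 0
                    && childQuery.test(doc)
                    && parentQuery.test(docs.get(currentParent))) {
                hits.add(currentParent);       // child match is reported on the parent
            }
        }
        return new ArrayList<>(hits);
    }

    // John's resume from the slides, plus an invented second resume for contrast.
    static List<Map<String, String>> sampleDocs() {
        return List.of(
            Map.of("docType", "resume", "Name", "John"),
            Map.of("Grade", "A1", "Subject", "Maths"),
            Map.of("Grade", "E1", "Subject", "Science"),
            Map.of("docType", "resume", "Name", "Jane"),
            Map.of("Grade", "A1", "Subject", "Science"));
    }
}
```

Because each child is tested as a whole document, a search for John with A1 in Science finds nothing, while John with A1 in Maths returns the parent at position 0.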
  • 9. Solution: Join speed
    Unlike in a database, a join (child to parent) is blisteringly fast:
    1) Match reported on document #356,675
    2) Index directly into cached BitSet at position #356,675
    3) Find first prior set bit, e.g. position #356,670
    4) Attribute match to doc #356,670
    Parent BitSet (excerpt from the diagram):
    100000100000000100000001000000010000001000010000000001000000100000100001
    Query layers shown in the diagram: ParentQuery, NestedDocumentQuery, ChildQuery
    The BitSet that defines parents is obtained from a Filter and can be cached aggressively at minimal memory cost (one bit per document in the index).
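The four steps map closely onto `java.util.BitSet.previousSetBit`, which returns the nearest set bit at or before a given index. The sketch below uses the standard JDK class rather than Lucene's own bit set types, and the names `ParentJoin`, `parentOf` and `bytesFor` are hypothetical.

```java
import java.util.BitSet;

// Hypothetical helper illustrating the child-to-parent join with a JDK BitSet.
class ParentJoin {

    // Maps a matching child doc id to its parent: the nearest set bit at or before it.
    // Returns -1 if no parent bit precedes the child.
    static int parentOf(BitSet parentBits, int childDoc) {
        return parentBits.previousSetBit(childDoc);
    }

    // Approximate cache cost in bytes at one bit per document, rounded down.
    static int bytesFor(int maxDoc) {
        return (maxDoc + 7) / 8;
    }
}
```

With parent bits set at positions 356,670 and 356,680, a child match at 356,675 is attributed to 356,670. At one bit per document, an index of 10 million documents needs roughly 1.25 MB for the cached parent set.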
  • 10. Other advantages
    Parent-child document relationships can also be used to limit the child results from any one parent (e.g. efficiently control the maximum number of pages returned from any one website).
    Nesting levels can be arbitrarily deep.
    Very powerful multi-child queries are possible, e.g. find people likely to know person X using the employment histories in resumes (multiple employer names/URLs and related date ranges).
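The per-parent limit mentioned above can be sketched as a filter over an already ranked list of child hits. This is an illustration under assumptions, not the proposal's code: `PerParentLimit.limit` is a hypothetical helper, and parents are looked up with a JDK `BitSet` in which a set bit marks a parent document.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: caps how many child hits are kept for any one parent.
class PerParentLimit {

    // Walks the hits in rank order and keeps a child only while its parent is under the cap.
    static List<Integer> limit(List<Integer> rankedChildHits, BitSet parentBits, int maxPerParent) {
        Map<Integer, Integer> seen = new HashMap<>();
        List<Integer> kept = new ArrayList<>();
        for (int child : rankedChildHits) {
            int parent = parentBits.previousSetBit(child);   // nearest parent at or before the child
            int count = seen.merge(parent, 1, Integer::sum); // hits seen so far for this parent
            if (count <= maxPerParent) {
                kept.add(child);
            }
        }
        return kept;
    }
}
```

In the website example, each site would be a parent and each page a child, so a cap of two keeps at most two pages per site while preserving the original ranking order.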
  • 11. “Lucene is not a database”, but…
    Structure matters
    Many data sources are a mix of structured and unstructured content (e.g. microformats). This is unlikely to change. Lucene has historically been about unstructured text but has steadily been adding structured capability (Trie, spatial, facets) and has become a great solution for hybrid data. However, support for modeling and querying non-trivial data structures is currently missing.
    Relationships matter
    This proposal is not to recreate the full capabilities of a SQL database with arbitrary relationships. However, we can benefit greatly from providing simple parent-child relationships.
    We have some unique capabilities
    • Parent-child joins are very fast
    • Unlike SQL, we can return partial, relevance-ranked matches
    • Probably more akin to XML databases than SQL databases
  • 12. Next steps
    Existing code and unit tests can be released to the Lucene project if there is sufficient interest. This software has been deployed in production on large datasets.
    The matching approach relies on parents and children being held in the same Lucene index segment. Additional control is needed to enforce this more rigorously, either by:
    • Adding more user control over IndexWriter segment creation, where applications understand/control parent-child dependencies, OR
    • Making Lucene aware of parent-child relationships, e.g. a new method Document.add(Document)
    Query parser support
    • XML Query Parser support is available
    • An end-user query parser could add new syntax, e.g. +candidateLocale:UK +child(grade:A1 AND subject:music)
  • 13. Thoughts?
    Feedback encouraged on dev@lucene.apache.org
