Andrzej Bialecki – Lucene Revolution 2013, Dublin
Presented by Andrzej Bialecki, LucidWorks

This session presents a set of Solr components for easy management of "sidecar indexes" - indexes that extend the main index with additional stored and / or indexed fields. Conceptually this can be viewed as an extension of the ExternalFileField or as a static join between documents from two collections. This functionality is useful in applications that require very different update regimes for the two parts of the index (e.g. main catalogue items combined with clickthroughs).

Usage Rights

© All Rights Reserved

Presentation Transcript

  • 1. SOLR SIDE-CAR INDEX – Andrzej Bialecki, LucidWorks – ab@lucidworks.com
  • 2. About the speaker
    • Started using Lucene in 2003 (1.2-dev…)
    • Created Luke – the Lucene Index Toolbox
    • Apache Nutch, Hadoop, Solr committer, Lucene PMC member
    • LucidWorks engineer
  • 3. Agenda
    • Challenge: incremental document updates
    • Existing solutions and workarounds
    • Sidecar index strategy and components
    • Scalability and performance
    • Q&A
  • 4. Challenge: incremental document updates
    • Incremental update (field-level update): modification of a part of a document
    • Sounds like fundamentally useful functionality!
    • But Lucene / Solr doesn’t offer true field-level updates (yet!)
      – “Update” is really a sequence of “retrieve old document, update fields, add updated document, delete old document”
      – “Atomic update” functionality in Solr is (useful) syntactic sugar
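The retrieve-modify-re-add cycle above can be pictured with a small Python sketch. The dict-based `index` is a stand-in for stored fields keyed by unique id, not Solr's actual API; the point is that one changed field still costs a full document rewrite.

```python
def atomic_update(index, doc_id, field, value):
    """Simulate 'retrieve old document, update fields, re-add, delete old'."""
    old_doc = index.pop(doc_id)          # retrieve + delete the old document
    new_doc = dict(old_doc)              # copy every stored field
    new_doc[field] = value               # apply the single-field change
    index[doc_id] = new_doc              # re-add the whole document
    return new_doc

index = {"d1": {"id": "d1", "title": "Solr", "popularity": 1}}
atomic_update(index, "d1", "popularity", 42)
print(index["d1"])   # the entire document was rewritten, not just one field
```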
  • 5. Common use cases for field updates
    • Documents composed logically of two parts with different update schedules
      – E.g. mostly static documents with some quickly changing fields
    • Two different classes of data in changing fields
      – Numeric / boolean fields: e.g. popularity, in-stock status, promo campaigns
      – Text fields: e.g. reviews, tags, click-through feedback, user profiles
    • Challenge: how to integrate these modifications with the main index content?
      – Re-indexing whole documents isn’t always an option
  • 6. True full-text (inverted fields) incremental updates
    • Very complex issue, broad impact on many Lucene internals
      – Inverted index structure is not optimized for partial document updates
      – At least another 6-12 months away?
    • LUCENE-4258 – work in progress
  • 7. Handling updates via full re-index
    • If the corpus is small, or incremental updates infrequent… just re-index everything!
    • Pros:
      – Relatively easy to implement – update source documents and re-index
      – Allows adding all types of data, including e.g. labels as searchable text
    • Cons:
      – Infeasible for larger corpora or frequent updates, time-wise and cost-wise
      – Requires keeping around the source documents
        • Sometimes inconvenient, when documents are assembled in a complex pipeline
  • 8. Handling updates via Solr’s ExternalFileField
    • Pros:
      – Simple to implement
      – Updates are easy – just file edits, no need to re-index
    • Cons:
      – Only docId => field : number
      – Not suitable for full-text searchable field updates
        • E.g. can’t support user-generated labels attached to a doc
      – Still useful if a simple “popularity”-type metric is sufficient
    • Internally implemented as an in-memory ValueSource usable by function queries
      – doc0=1.5 doc1=2.5 doc2=0.5 …
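The docId => number idea can be sketched as a flat file loaded into an in-memory lookup table, which is roughly what the ValueSource exposes to function queries. This is an illustrative parser for the `docN=value` lines shown above, not Solr's actual loader:

```python
def load_external_file(lines):
    """Parse `key=value` lines into a key -> float lookup table."""
    values = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        key, _, val = line.partition("=")
        values[key] = float(val)          # numeric values only, as in the slide
    return values

scores = load_external_file(["doc0=1.5", "doc1=2.5", "doc2=0.5"])
print(scores["doc1"])   # usable as a boost factor, no re-indexing needed
```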
  • 9. Numeric DocValues updates
    • Since Lucene/Solr 4.6 … to be released Really Soon
    • Details can be found in LUCENE-5189
    • As simple as: indexWriter.updateNumericDocValue(term, field, value)
    • Neatly solves the problem of numeric updates: popularity, in-stock, etc.
    • Some limitations:
      – Massive updates still somewhat costly until the next merge (like deletes)
      – Can only update existing fields
    • Obviously doesn’t address the full-text inverted field updates
  • 10. Lucene ParallelReader overview
    • Pretends that two or more IndexReader-s are slices of the same index
      – Slices contain data for different fields
      – Both stored and inverted parts are supported
      – Data for matching docs is joined on the fly
    • Structure of all indexes MUST match 1:1 !!!
      – The same number of segments
      – The same count of docs per segment
      – Internal document ID-s must match 1:1
      – List of deletes is taken from the first index
    • Sounds cool … but in practice it’s rarely used:
      – It’s very difficult to meet these requirements
      – This is even more difficult in the presence of index updates and merges
    [Diagram: a ParallelReader over a main IR (fields f1, f2) and a parallel IR (fields f3, f4); each of the three segments holds the same doc count in both indexes]
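The 1:1 alignment requirement follows directly from how the join works: fields are merged purely by document position, with no key lookup at read time. A toy sketch of that positional join, with plain dicts standing in for the per-index stored fields:

```python
def parallel_read(main_docs, parallel_docs, doc_id):
    """Join fields from two position-aligned indexes for one internal docId."""
    if len(main_docs) != len(parallel_docs):
        raise ValueError("indexes must align 1:1")   # the hard ParallelReader rule
    merged = dict(main_docs[doc_id])
    merged.update(parallel_docs[doc_id])             # fields joined on the fly
    return merged

main = [{"f1": "a", "f2": "b"}, {"f1": "c", "f2": "d"}]
para = [{"f3": 1, "f4": 2}, {"f3": 3, "f4": 4}]
print(parallel_read(main, para, 1))   # {'f1': 'c', 'f2': 'd', 'f3': 3, 'f4': 4}
```

Any insertion, deletion, or merge that shifts positions in one index but not the other silently joins the wrong documents, which is why the structure constraints are so strict.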
  • 11. Handling updates via ParallelReader
    • Pros:
      – All types of data (e.g. searchable full-text labels) can be added
    • Cons:
      – Must ensure that the other index always matches the structure of the main index
      – Complicated and fragile (rebuild on every update?)
      – No tools to manage this parallel index in Solr
    [Diagram: ParallelReader joining main IR segments (fields f1, f2) with parallel IR segments (fields f3, f4)]
  • 12. Sidecar Index Components for Solr
    • Uses the ParallelReader strategy for field updates
      – “Main” and “sidecar” data comes from two different Solr collections
      – “Sidecar” collection is updated independently of the main collection
      – “Sidecar” collection is used as a source of document fields for building and updating a parallel index
    • Integrates the management of ParallelReader (“sidecar index”) into Solr
      – Initial creation of ParallelReader, including synchronization of internal ID-s
      – Tracking of updates and IndexReader.reopen(…) events
    • Partly based on a version of the Click Framework in LucidWorks Search
    • Available under the Apache License here: http://github.com/LucidWorks/sidecar_index
  • 13. “Main”, “sidecar” collections and parallel index
    • “Main” collection contains only the parts of documents with “main” fields
    • “Sidecar” collection is a source of documents with “sidecar” fields
    • SidecarIndexReaderFactory creates and maintains the parallel index (sidecar index)
    • “Main” collection uses SidecarIndexReader that acts as a ParallelReader
    • Main index is updated as usual, via the “main” collection’s IndexWriter
    [Diagram: Solr Main_collection served by a SidecarIndexReader over the main index and the sidecar index, which is built from Sidecar_collection]
  • 14. Implementation details
    • SidecarIndexReaderFactory extends Solr’s IndexReaderFactory
      – newReader(Directory, SolrCore) – initial open
      – newReader(IndexWriter, SolrCore) – NRT open
    • SidecarIndexReader acts like a ParallelReader
      – Solr wants a DirectoryReader, but ParallelReader is not a DirectoryReader
      – Basically had to re-implement the logic from ParallelReader
    • ParallelReader challenges:
      – How to synchronize internal ID-s?
      – How to create segments that are of the same size as those of the main index?
      – How to handle deleted documents?
      – How to handle updates to the main index?
      – How to handle updates to the sidecar data?
  • 15. ParallelReader challenges and solutions
    • How to synchronize internal ID-s?
      – “Main” collection is traversed sequentially by internal docId
      – Primary key is retrieved for each document
      – Matching document is found in the “sidecar” collection
      – Matching document is added to the “sidecar” index
    • Very costly phase!
      – Random seek and retrieval from the “sidecar” collection
      – Primary key lookup is fast … but stored field retrieval and indexing isn’t
    [Diagram: query q=id:D against the Main collection; documents G B C E A F D matched by primary key between the main IR and the sidecar IR]
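The synchronization pass above can be sketched as a sequential walk over the main index's primary keys with one random sidecar lookup per key. Dicts and lists stand in for the two collections; a missing key yields an empty placeholder so positions still line up:

```python
def build_sidecar(main_keys, sidecar_by_key):
    """main_keys: primary keys in main-index internal docId order.
    sidecar_by_key: the sidecar collection, keyed by primary key."""
    aligned = []
    for key in main_keys:                   # sequential traversal of the main index
        doc = sidecar_by_key.get(key, {})   # random lookup -- the costly part
        aligned.append(doc)                 # docId N now matches in both indexes
    return aligned

main_keys = ["D", "B", "A"]
sidecar = {"A": {"labels": "cheap"}, "D": {"labels": "popular"}}
print(build_sidecar(main_keys, sidecar))
# [{'labels': 'popular'}, {}, {'labels': 'cheap'}]
```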
  • 16. ParallelReader challenges and solutions
    • Optimization 1: don’t rebuild data for unmodified segments
    • Optimization 2 (cheating): ignore NRT segments
    • How to handle deleted docs?
      – Insert dummy (empty) documents so that the number and the order of documents still match
    [Diagram: a deleted document in the main IR paired with a dummy entry in the sidecar IR; the NRT segment has no sidecar counterpart]
  • 17. Implementation: SidecarMergePolicy
    • How to create segments that are of the same size as the “main” index?
    • Carefully manage the “sidecar” index creation:
      – IndexWriter uses SerialMergeScheduler to prevent out-of-order merges
      – Force a flush when reaching the next target count of documents
      – Merges are enforced using SidecarMergePolicy, which tracks the sizes of the “main” index segments
    [Diagram: SidecarMergePolicy target sizes: Seg0 – 4 docs, Seg1 – 2 docs, Seg2 – 1 doc]
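The goal the merge policy enforces can be sketched as slicing the aligned sidecar documents into segments whose doc counts copy the main index's segments exactly (4, 2 and 1 docs in the diagram). This illustrates the target layout only, not the SidecarMergePolicy code itself:

```python
def slice_into_segments(docs, target_sizes):
    """Split docs into consecutive segments matching the main index's sizes."""
    if sum(target_sizes) != len(docs):
        raise ValueError("total doc count must match the main index")
    segments, pos = [], 0
    for size in target_sizes:                 # force a flush at each target count
        segments.append(docs[pos:pos + size])
        pos += size
    return segments

docs = list("ABCDEFG")
print(slice_into_segments(docs, [4, 2, 1]))
# [['A', 'B', 'C', 'D'], ['E', 'F'], ['G']]
```

Out-of-order merges would break this consecutive slicing, which is why the real implementation pins IndexWriter to SerialMergeScheduler.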
  • 18. Implementation: SidecarIndexReader
    • Re-implements the logic of ParallelReader
      – ParallelReader != DirectoryReader
    • Exposes the Directory of the “main” index for replication
      – Replicas need the “sidecar” collection replica to rebuild the sidecar index locally
      – If document routing and shard placement are the same, then we don’t have to use distributed search – all data will be local
    • Reopen(…) avoids rebuilding unmodified segments
    • Reopen(…) uses SidecarIndexReaderFactory to rebuild the sidecar index when necessary
      – When there’s a major merge in the “main” index
      – When “sidecar” data is updated
    • Ref-counting of IndexReaders at different levels is very tricky!
  • 19. Example configuration in solrconfig.xml

    <indexReaderFactory name="IndexReaderFactory"
                        class="com.lucid.solr.sidecar.SidecarIndexReaderFactory">
      <str name="docIdField">id</str>
      <str name="sourceCollection">source</str>
      <bool name="enabled">true</bool>
    </indexReaderFactory>
  • 20. Example use case: integration of click-through data
    • Raw click-through data:
      – query, query_time, docId, click_time [, user]
    • Aggregated click-through data:
      – User-generated popularity score: F(number and timing of clicks per docId)
        • Numeric updates
      – User-generated labels: F(top-N queries that led to clicks on docId)
        • Full-text searchable updates
      – User profiles: F(top-N queries per user, top-N docId-s clicked, etc.)
      – …
    • Queries can now be expanded to score based on TF/IDF in user-generated labels
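The aggregation step can be sketched as a single pass over click events, producing both kinds of sidecar fields: a numeric popularity count and the query text that becomes searchable labels. The event shape and names here are assumptions for illustration, not the talk's actual pipeline:

```python
from collections import defaultdict

def aggregate_clicks(events):
    """events: iterable of (query, doc_id) click records."""
    popularity = defaultdict(int)
    labels = defaultdict(set)
    for query, doc_id in events:
        popularity[doc_id] += 1      # numeric sidecar field (popularity score)
        labels[doc_id].add(query)    # full-text sidecar field (searchable labels)
    return dict(popularity), {d: sorted(q) for d, q in labels.items()}

events = [("cheap phone", "d1"), ("phone", "d1"), ("laptop", "d2")]
pop, lab = aggregate_clicks(events)
print(pop["d1"], lab["d1"])   # 2 ['cheap phone', 'phone']
```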
  • 21. Scalability and performance
  • 22. Scalability and performance
    • Initial full rebuild is very costly
      – ~0.6 ms / document
      – 1 mln docs = 600 sec = 10 min
      – Not even close to “real time” …
    • Cost related to new segments in the “main” index depends on the size of the segments
    • Major merge events will trigger a full rebuild
    • BUT: search-time cost is negligible
  • 23. Caveats
    • Combination of ref-counting in Lucene, Solr and ParallelReader is difficult to track
      – The sidecar code is still unstable and occasionally explodes
    • Performance of a full rebuild quickly becomes the bottleneck with frequent updates
      – So the main use case is massive but infrequent updates of “sidecar” data
    • Code: http://github.com/LucidWorks/sidecar_index
    • Fixes and contributions are welcome – the code is Apache licensed
  • 24. Agenda
    • Challenge: incremental document updates
    • Existing solutions and workarounds
    • Sidecar index strategy and components
    • Scalability and performance
    • Q&A
  • 25. Q&A – Andrzej Bialecki, ab@lucidworks.com