Consuming RealTime Signals in Solr


Slides from the Solr/Lucene Meetup in Bangalore on July 25th.

Published in: Technology

  1. 1. Consuming Real-Time Signals in Solr. Umesh Prasad, SDE 3 @ Flipkart
  2. 2. Flipkart’s Index
       Flipkart’s index
         1. Data is organized in multiple indexes/Solr cores, with a couple of million documents.
         2. SKUs are the documents.
         3. Extensive use of facets and filters.
         4. Not every search allows faceting.
       Lots of custom components
         1. Custom collectors (to blend results for diversity / personalization)
         2. Custom query parsers (to enable heavily customized scoring)
         3. Custom fields
  3. 3. Typical Ecommerce Document
       ● Catalogue data
         ○ Static
         ○ Largely textual
       ● Pricing-related data
         ○ Dynamic
         ○ Faster moving
       ● Offers
         ○ Channel specific, based on the nature of the event
       ● Availability
         ○ Dynamic
         ○ Faster moving
       and more...
  4. 4. First Cut Integration
       1. Catalogue Management System, aka CMS
          a. Single source of truth for all systems.
          b. Merges data from multiple sources by doing joins, and keeps the latest snapshot keyed by Product Id.
          c. Raises a notification whenever the data changes.
       [Diagram labels: Catalogue Management System (static and dynamic); Notification; Data Import Handler (Fetch, Transform, Dedup, Update); SOLR; Sales Signals, Custom tags]
  5. 5. But ...
       1. Limitations
          a. Too much data (and more than 80% of it is of no interest to the search system).
          b. The CMS has to keep data forever (remember, it is the source of truth), but the search system doesn’t need to index all documents (e.g. obsolete products). So lots of updates get dropped.
          c. Merging becomes too much for the CMS and introduces lag.
       2. DIH limitations
          a. Single threaded. (Multithreaded mode had bugs and was removed in 4.x, SOLR-3262.)
          b. Too many notifications from the CMS. Fetch, transform, compare and discard still costs, and being single threaded doesn’t help.
          c. Some signals are of interest to the search system only (normalized revenue, tag pages), but they are difficult to integrate proactively.
  6. 6. So the CMS is refactored
       [Diagram labels: CMS (service); Dynamic Field 1 Service (service); Notification stream; Notification stream; dynamic sorting fields (sparse, but a lot of them) (MySQL db); Snapshot; SOLR Master; External Field, consumed through DIH; Solr Slaves]
  7. 7. Why are partial updates a challenge in Lucene?
       1. Update
          a. Lucene doesn’t support partial updates. They are tough to do with an inverted index, because all terms for that document need to be updated. There are lots of open tickets.
          b. LUCENE-4272 (term vector based), LUCENE-3837, LUCENE-4258 (overlay segment based), Incremental Field Updates through Stacked Segments
          c. Document @ t1 → Term vectors {T1, T2, T3, T4, T5}
          d. Document @ t2 → Term vectors {T1, T4, T10}
          e. The inverted index actually stores a posting list for each of its terms. These posting lists are quite sparse and are compressed using delta encoding for efficiency.
          f. T1 → {1, 5, 7} etc.
          g. T2 → {2, 5, 6}
          h. To support a partial update, the document has to be removed from the posting lists of all its previous terms. That is non-trivial, because it involves remembering and storing all terms for a given document.
          i. So instead, Lucene and other inverted index systems mark the old document as deleted in another data structure (live docs).
  8. 8. Why are partial updates a challenge in Lucene?
       1. What this means is that an update is actually
          a. Delete + Add (regardless of which attribute changed).
          b. Deleted documents are compacted by a background merge thread.
       2. Updates become visible only after a commit
          a. A soft commit will create a new segment in memory.
          b. A hard commit will do an fsync to the directory.
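The delete-plus-add behaviour described in the two slides above can be illustrated with a toy model. This is a minimal sketch in plain Java and not Lucene's implementation: `ToyInvertedIndex`, `update`, `rawPostings` and `search` are hypothetical names, and a `BitSet` stands in for the live-docs structure.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only (hypothetical); real Lucene segments are far more involved. */
class ToyInvertedIndex {
    // term -> ascending list of internal doc ids (append-only)
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // external key (e.g. a product id) -> doc id of its current version
    private final Map<String, Integer> currentDocId = new HashMap<>();
    // one bit per doc id; a cleared bit means "deleted"
    private final BitSet liveDocs = new BitSet();
    private int nextDocId = 0;

    /** An update never edits old postings: it marks the old doc deleted and adds a new doc. */
    int update(String key, String... terms) {
        Integer old = currentDocId.get(key);
        if (old != null) {
            liveDocs.clear(old); // "delete": old postings stay where they are
        }
        int docId = nextDocId++;
        for (String term : terms) {
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // skip a repeated term so a doc id appears once per posting list
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        liveDocs.set(docId); // "add"
        currentDocId.put(key, docId);
        return docId;
    }

    /** Posting list exactly as stored, including deleted docs. */
    List<Integer> rawPostings(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }

    /** Posting list filtered through live docs, which is what a search sees. */
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : rawPostings(term)) {
            if (liveDocs.get(docId)) {
                hits.add(docId);
            }
        }
        return hits;
    }
}
```

In this sketch, updating one key from terms {T1, T2, T3} to {T1, T4} leaves the old doc id in the stored posting list of T2; only the live-docs filter hides it from search. Reclaiming that space corresponds to the background merge mentioned on the slide.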
  9. 9. But do we need to re-index a document? Let’s evaluate
       1. Lucene might hold 3 kinds of data
          a. Data used for actual search (analyzed, converted into tokens)
          b. Data used for plain filtering (not analyzed, e.g. price, discount)
          c. Data used for ranking (e.g. relevancy signals, and there can be a lot of them)
       2. Searchable attributes ⇒ Need to be inverted ⇒ Slow changing
          a. The pipeline can be spam filtering → text cleaning → duplicate detection → NLP → entity extraction, etc.
       3. Facetable/filterable attributes ⇒ Little analysis ⇒ Numeric or tags, usually with enumerated values
          a. Can be dynamic.
          b. Can be governed by policies and business constraints.
  10. 10. But do we need to re-index a document? Let’s evaluate
       1. Ranking signals ⇒ Need to be row oriented
          a. Can be batch updates (e.g. category-specific ranks, ratings) or real-time updates (e.g. availability).
          b. Lucene actually un-inverts such fields using FieldCache.
          c. Doc values were introduced to manage the cost of FieldCache and to provide better updatability.
          d. Updatable NumericDocValues (LUCENE-5189, since 4.6), updatable binary doc values (LUCENE-5513, since 4.8).
          e. Solr still doesn’t have updatable doc values. A Jira ticket is open, but there are issues around update/write-ahead logs (SOLR-5944).
  11. 11. First Approach: Leverage Updatable Numeric DocValues
       1. Solr limitation: easily overcome in a master-slave model by plugging in your own update chain and accessing IndexWriter directly.
       2. But:
          a. You need a commit for doc value updates to be reflected. (Not real time!)
          b. Filtering on doc values is inefficient, especially on numeric fields.
          c. Making it work in SolrCloud is non-trivial. For details please see SOLR-5944.
          d. Doc values are dense. Updates are not stacked. Every commit dumps the full view of each modified field’s doc values (optimizing for search performance). (http://shaierera.
          e. But what if we had doc values for 500 fields across millions of docs?
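A rough upper bound helps put the "500 fields, millions of docs" question in perspective. The sketch below is back-of-the-envelope arithmetic only: the class and method names are hypothetical, and the inputs in the usage note (5 million documents, 8 bytes per value, all 500 fields modified) are assumptions chosen for illustration, not figures from the talk. Real doc values are packed and compressed, so actual sizes would be smaller.

```java
/** Hypothetical helper: worst-case bytes rewritten when dense doc values are dumped in full. */
class DocValuesRewriteCost {
    /**
     * Each modified field rewrites one value per document, whether or not that
     * document changed. multiplyExact throws instead of silently overflowing.
     */
    static long bytesRewrittenPerCommit(long maxDoc, int modifiedFields, int bytesPerValue) {
        long valuesWritten = Math.multiplyExact(maxDoc, (long) modifiedFields);
        return Math.multiplyExact(valuesWritten, (long) bytesPerValue);
    }
}
```

Under those assumed inputs, `DocValuesRewriteCost.bytesRewrittenPerCommit(5_000_000L, 500, 8)` comes to 20,000,000,000 bytes, i.e. about 20 GB written per commit in the worst case, even if only a handful of documents actually changed.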
  12. 12. First Approach: Leverage Updatable Numeric DocValues
       1. Commit caveats:
          a. Soft commits are NOT free. A soft commit in Solr = IndexWriter.getReader() in Lucene = flush + open. NRTCachingDirectory caches the small segments produced and makes soft commits cheaper. Details can be found in McCandless’s post.
          b. In Solr, a commit invalidates all caches, and they have to be regenerated on every commit. Some caches, like filterCache, have a huge impact on performance. Warming them up can itself take 2-3 minutes at times.
          c. Warmup puts memory pressure on the JVM and causes spikes in allocations. Some caches, like documentCache, can’t even be warmed up.
          d. More commits ⇒ more segments ⇒ more merges
  13. 13. 2nd Approach: NRT Store and Value Sources
       ValueSource
         - abstract FunctionValues getValues(Map context, AtomicReaderContext readerContext): gets the values for this reader and the context that was previously passed to createWeight()
       FunctionValues
         - boolean exists(int doc): returns true if there is a value for this document
         - double doubleVal(int doc)
       Value sources allowed us to plug external data sources right inside Solr. The external data need not be part of the index itself, but it must be cheap to retrieve, because the value source is called millions of times, right inside a loop.
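The idea of serving ranking signals from an external, immediately updatable store can be sketched without Lucene on the classpath. This is a minimal illustration under stated assumptions, not the implementation from the talk: `SignalStore` and `SegmentSignalValues` are hypothetical classes, the latter only mirrors the two `FunctionValues` methods quoted above, and the mapping from segment-local doc id to a store slot (`docToOrd`) is assumed to be built once per segment by code not shown here.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical in-memory signal store that is updated outside the index. */
class SignalStore {
    // doubles are kept as raw long bits so single-slot reads and writes are atomic
    private final AtomicLongArray bits;

    SignalStore(int capacity) {
        bits = new AtomicLongArray(capacity);
        long missing = Double.doubleToRawLongBits(Double.NaN);
        for (int i = 0; i < capacity; i++) {
            bits.set(i, missing); // NaN marks "no value yet" (so NaN cannot be stored as a signal)
        }
    }

    /** Visible to readers immediately; no commit or new searcher involved. */
    void put(int ord, double value) {
        bits.set(ord, Double.doubleToRawLongBits(value));
    }

    /** Returns NaN when the slot has never been written. */
    double get(int ord) {
        return Double.longBitsToDouble(bits.get(ord));
    }
}

/** Mirrors exists(int) and doubleVal(int) from the slide; it is not the Lucene class. */
class SegmentSignalValues {
    private final int[] docToOrd; // segment-local doc id -> slot in the store
    private final SignalStore store;

    SegmentSignalValues(int[] docToOrd, SignalStore store) {
        this.docToOrd = docToOrd;
        this.store = store;
    }

    boolean exists(int doc) {
        return !Double.isNaN(store.get(docToOrd[doc]));
    }

    double doubleVal(int doc) {
        double v = store.get(docToOrd[doc]); // one array lookup per call
        return Double.isNaN(v) ? 0.0 : v;    // default for docs without a signal
    }
}
```

The per-document path is two array reads and no map lookup or allocation, which matters when the method sits inside the scoring loop. A value written with `put` is seen by the next `doubleVal` call without any commit.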
  14. 14. The Challenge
       1. Entries in Solr caches have no expiry time, and there is no way to invalidate individual entries.
       2. Solution: get rid of the query cache altogether. But we still have filterCache.
       3. So now matching and scoring had to be really fast.
          a. Calls to the value source need to be extremely fast. We optimized them until they were as fast as accessing doc values.
          b. The ranking functions themselves have a cost. Some of the optimizations involved measuring and reducing the cost of the math functions themselves.
  15. 15. So, the learnings
       1. Understand your data, its change rate, and what you want to do with it.
       2. Solr and Lucene have really good abstractions around both indexing and querying. Both provide a lot of hooks and plugins. Think them through and take advantage of them.
       3. Experiment, profile and benchmark. Delve into the APIs and internals.
       4. The experts do help. The points about doc values being dense and soft commits not being free came directly out of discussions with Shalin.
       5. Learnt the hard way: it is really difficult to keep an inverted index in sync. We actually built a Lucene codec (which built and updated an inverted index in Redis).