Consuming Real-time Signals in Solr
Umesh Prasad
SDE 3 @ Flipkart
Flipkart’s Index
1. Data is organized across multiple indexes/Solr cores, holding a couple of million documents.
2. SKUs are the documents.
3. Extensive use of facets and filters.
4. Not all searches allow faceting.
Lots of custom components
1. Custom collectors (to enable blending of results for diversity / personalization)
2. Custom query parsers (to enable heavily customized scoring)
3. Custom fields
Typical Ecommerce Document
● Catalogue data
○ Static
○ Largely textual
● Pricing related data
○ Dynamic
○ Faster moving
● Offers
○ Channel-specific, depending on the nature of the event
● Availability
○ Dynamic
○ Faster moving
and more...
First Cut Integration
1. Catalogue Management System, aka CMS
a. Single source of truth for all systems.
b. Merges data from multiple sources, performs joins, and keeps the latest snapshot, keyed by Product Id.
c. Raises a notification whenever the data changes.
Data flow: Catalogue Management System (static and dynamic) → notification → Data Import Handler (fetch, transform, dedup, update) → SOLR, with sales signals and custom tags feeding in alongside.
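A minimal sketch of the CMS's role as described above (illustrative only, not Flipkart code; class and field names are ours): merge partial updates into a latest snapshot keyed by product id, and notify downstream (the DIH) only when the merged view actually changes.

```java
import java.util.*;

// Toy CMS: join-by-key snapshot plus change notifications.
public class CmsSnapshotDemo {
    // Latest merged view of each product, keyed by product id.
    Map<String, Map<String, String>> snapshot = new HashMap<>();
    // Product ids whose merged view changed (what the DIH would fetch).
    List<String> notifications = new ArrayList<>();

    // Merge a partial update from one source into the snapshot.
    void upsert(String productId, Map<String, String> partial) {
        Map<String, String> merged =
            snapshot.computeIfAbsent(productId, k -> new HashMap<>());
        // Only notify if the update actually changes the merged view.
        if (!merged.entrySet().containsAll(partial.entrySet())) {
            merged.putAll(partial);       // keep the latest values
            notifications.add(productId); // raise a notification
        }
    }
}
```

Note the drawback the next slide calls out: the CMS must do this merge for every source and every product, whether or not search cares about the field that changed.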
But ….
1. CMS limitations
a. Too much data (more than 80% of it is of no interest to the search system).
b. CMS has to keep data forever (remember, it is the source of truth), but the search system doesn't need to index all documents (e.g. obsolete products), so there are lots of drops.
c. Merging becomes too expensive for CMS and introduces lag.
2. DIH limitations
a. Single-threaded (the multithreaded mode had bugs and was removed in 4.x, SOLR-3262).
b. Too many notifications from CMS (fetch, transform, compare, discard still has a cost), and single-threaded processing doesn't help.
c. Some signals are of interest to the search system only (normalized revenue, tag pages), but they are difficult to integrate proactively.
So CMS is re-factored
Data flow: the CMS (service) and the Dynamic Field 1 Service (service) each emit a notification stream. Dynamic sorting fields (sparse, but a lot of them) are snapshotted into a MySQL DB, which the SOLR master consumes as an external field through DIH and replicates to the Solr slaves.
Why are partial updates a challenge in Lucene?
1. Update
a. Lucene doesn't support partial updates. They are tough to do with an inverted index, because all the terms for that document would need to be updated. There are lots of open tickets:
b. LUCENE-4272 (term-vector based); LUCENE-3837 and LUCENE-4258 (overlay-segment based): incremental field updates through stacked segments.
c. Document @ t1 → term vector {T1, T2, T3, T4, T5}
d. Document @ t2 → term vector {T1, T4, T10}
e. An inverted index stores a posting list for each of its terms. These posting lists are quite sparse and are compressed using delta encodings for efficiency.
f. T1 → {1, 5, 7} etc.
g. T2 → {2, 5, 6}
h. To support a partial update, the document would have to be removed from the posting lists of all its previous terms. That is non-trivial, because it would involve remembering and storing all terms for a given document.
i. So instead, Lucene and other inverted-index systems mark the old document as deleted in another data structure (live docs).
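The posting-list problem above can be sketched with a toy inverted index (plain Java for illustration; this is not Lucene's implementation). Removing doc 5 requires scanning every posting list, because nothing maps a document back to its terms; the deltaEncode helper shows the gap encoding the slide mentions.

```java
import java.util.*;

// Toy inverted index: term -> sorted posting list of docIDs.
public class PostingDemo {
    static Map<String, List<Integer>> postings = new HashMap<>();

    static void add(int doc, String... terms) {
        for (String t : terms)
            postings.computeIfAbsent(t, k -> new ArrayList<>()).add(doc);
    }

    // Removing a doc means touching EVERY posting list -- exactly the
    // non-trivial part the slide describes.
    static void remove(int doc) {
        for (List<Integer> pl : postings.values())
            pl.removeIf(d -> d == doc);
    }

    // Delta (gap) encoding of a sorted posting list.
    static List<Integer> deltaEncode(List<Integer> docs) {
        List<Integer> gaps = new ArrayList<>();
        int prev = 0;
        for (int d : docs) { gaps.add(d - prev); prev = d; }
        return gaps;
    }

    public static void main(String[] args) {
        add(1, "T1"); add(2, "T2"); add(5, "T1", "T2"); add(6, "T2"); add(7, "T1");
        System.out.println(postings.get("T1"));              // [1, 5, 7]
        System.out.println(deltaEncode(postings.get("T1"))); // [1, 4, 2]
        remove(5);
        System.out.println(postings.get("T1"));              // [1, 7]
    }
}
```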
Why are partial updates a challenge in Lucene?
1. What this means is that an update is actually
a. Delete + Add (regardless of which attribute changed).
b. Deleted documents are compacted by a background merge thread.
2. Updates become visible only after a commit
a. A soft commit creates a new segment in memory.
b. A hard commit does an fsync to the directory.
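The delete + add semantics can be sketched like this (a simplified model of our own, not the Lucene API): storage is append-only, a live-docs bitset records deletions, and changing a single field still writes a brand-new copy of the whole document.

```java
import java.util.*;

// Simplified "segment": append-only docs plus a live-docs bitset.
public class SegmentDemo {
    List<Map<String, String>> docs = new ArrayList<>(); // append-only store
    BitSet live = new BitSet();                          // live docs

    int add(Map<String, String> doc) {
        docs.add(doc);
        int id = docs.size() - 1;
        live.set(id);
        return id;
    }

    // A "partial" update of one field is really delete + add of the
    // full document, regardless of which attribute changed.
    int update(int oldId, String field, String value) {
        Map<String, String> fresh = new HashMap<>(docs.get(oldId));
        fresh.put(field, value);
        live.clear(oldId);   // delete: mark old doc dead
        return add(fresh);   // add: append the full new copy
    }
}
```

The dead copy lingers until a background merge compacts it away, which is why update-heavy workloads generate merge pressure.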
But do we need to re-index a document? Let's evaluate
1. Lucene might hold 3 kinds of data:
a. Data used for actual search (analyzed, converted into tokens).
b. Data used for plain filtering (not analyzed, e.g. price, discount).
c. Data used for ranking (e.g. relevancy signals, and there can be a lot of them).
2. Searchable attributes ⇒ need to be inverted ⇒ slow changing.
a. The pipeline can be spam filtering → text cleaning → duplicate detection → NLP → entity extraction, etc.
3. Facetable/filterable attributes ⇒ little analysis ⇒ numeric or tags, usually with enumerated values.
a. Can be dynamic
b. Can be governed by policies and business constraints.
But do we need to re-index a document? Let's evaluate
1. Ranking signals ⇒ need to be row-oriented.
a. Can be batch updates (e.g. category-specific ranks, ratings) or real-time updates (e.g. availability).
b. Lucene actually un-inverts such fields using FieldCache.
c. Doc values were introduced to manage the cost of FieldCache and to provide better updatability.
d. Updatable NumericDocValues (LUCENE-5189, since 4.6); updatable binary doc values (LUCENE-5513, since 4.8).
e. Solr still doesn't have updatable doc values. A Jira ticket is open, but there are issues around the update/write-ahead logs (SOLR-5944).
First Approach: Leverage Updatable Numeric DocValues
1. Solr limitation: easily overcome in the master/slave model by plugging in your own update chain and accessing the IndexWriter directly.
2. But:
a. You need a commit for doc values to be reflected (not real time!).
b. Filtering on doc values is inefficient, especially on numeric fields.
c. Making it work in SolrCloud is non-trivial; for details, see SOLR-5944.
d. Doc values are dense, and updates are not stacked: every commit dumps the full view of the modified field's doc values (optimizing for search performance). (http://shaierera.blogspot.in/2014/04/updatable-docvalues-under-hood.html)
e. But what if we had doc values for 500 fields across millions of docs?
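A back-of-the-envelope calculation makes the point (the 2M-doc and 8-byte figures below are illustrative assumptions, not numbers from the talk): because the doc-values column is dense, even a single changed value rewrites the field's full column on commit.

```java
// Rough write cost of dense doc-values rewrites per commit.
public class DocValuesCost {
    // Each updated field rewrites one full column: docs * bytesPerValue.
    static long bytesPerCommit(long docs, int updatedFields, int bytesPerValue) {
        return docs * updatedFields * (long) bytesPerValue;
    }

    public static void main(String[] args) {
        // One touched field over 2M docs: 16 MB rewritten per commit.
        System.out.println(bytesPerCommit(2_000_000L, 1, 8));
        // 500 touched fields over 2M docs: ~8 GB rewritten per commit.
        System.out.println(bytesPerCommit(2_000_000L, 500, 8));
    }
}
```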
First Approach: Leverage Updatable Numeric DocValues
1. Commit caveats:
a. A soft commit is NOT free. A soft commit in Solr = IndexWriter.getReader() in Lucene == flush + open. There is NRTCachingDirectory, which caches the small segments produced and makes soft commits cheaper; details can be found in McCandless's post.
b. Solr invalidates all caches on every commit, and they have to be regenerated. Some caches, like filterCache, have a huge impact on performance, and warming them up can itself take 2-3 minutes at times.
c. Warm-up puts memory pressure on the JVM and causes allocation spikes. Some caches, like documentCache, can't even be warmed up.
d. More commits ⇒ more segments ⇒ more merges.
2nd Approach: NRT Store and Value Sources
http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/ValueSource.html
- abstract FunctionValues getValues(Map context, AtomicReaderContext readerContext): gets the values for this reader and the context that was previously passed to createWeight().
http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/FunctionValues.html
FunctionValues
- boolean exists(int doc): returns true if there is a value for this document.
- double doubleVal(int doc)
Value sources allowed us to plug external data sources right inside Solr. The external data need not be part of the index itself, but it should be easily retrievable, because these methods are called millions of times, right inside the collection loop.
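The idea can be sketched in plain Java (the class and method names below are ours; the real Lucene types are ValueSource and FunctionValues): the values object answers exists/doubleVal per docID from an external snapshot, and because it is called once per candidate document, the lookup must be a cheap in-memory read rather than a remote call per document.

```java
// Plain-Java sketch of the FunctionValues contract backed by external data.
public class ExternalValuesDemo {
    // Mirrors the two FunctionValues methods quoted above.
    interface Values {
        boolean exists(int doc);
        double doubleVal(int doc);
    }

    // External signal (e.g. availability) snapshotted into dense arrays
    // indexed by docID, refreshed outside the query path -- so the hot
    // loop only does array reads.
    static Values snapshot(double[] byDocId, boolean[] has) {
        return new Values() {
            public boolean exists(int doc) { return has[doc]; }
            public double doubleVal(int doc) { return byDocId[doc]; }
        };
    }

    public static void main(String[] args) {
        Values v = snapshot(new double[]{0.0, 4.2}, new boolean[]{false, true});
        // A scoring loop would call these millions of times.
        System.out.println(v.exists(1) + " " + v.doubleVal(1)); // true 4.2
    }
}
```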
The Challenge
1. Entries in Solr caches have no expiry time, and there is no way to invalidate individual entries.
2. Solution: get rid of the query cache altogether. But we still have the filterCache.
3. So now, matching and scoring had to be really fast.
a. Calls to the value source need to be extremely fast. We optimized them to be as fast as accessing doc values.
b. The cost of the ranking functions themselves matters too; some of the optimizations involved reducing the cost of the Math functions themselves.
So the learnings
1. Understand your data, its rate of change, and what you want to do with it.
2. Solr/Lucene have really good abstractions around both indexing and querying, and both provide a lot of hooks and plugins. Think them through and take advantage of them.
3. Experiment, profile and benchmark. Delve into the APIs and internals.
4. The experts do help: the points about dense doc values and soft commits not being free came directly from discussions with Shalin.
5. Learnt the hard way: it is really difficult to keep an inverted index in sync. We actually built a lucene-codecs module which built and updated an inverted index in Redis.
