YMMV Not necessarily true for you
Enterprise software – shipping stuff to people
Fine grained events – logs, user behavior, etc.
For everything – solving the problem of “enterprise wide” ops, so it’s everything from everywhere from everyone for all time (until they run out of money for nodes).
This isn’t condemnation of general purpose search engines as much as what we had to do for our domain
It does most of what you want for most cases most of the time.
They’ve solved some really hard problems.
Content search (e.g. news sites, document repos), finite size datasets (e.g. product catalogs), low cardinality datasets that fit in memory. Not us.
Flexible systems with a bevy of full text search features
Moderate and fixed document count: big by historical standards, small by ours.
Design reflects these assumptions.
Fixed sharding at index creation. Partition events into N buckets. For long-retention, time-based systems, this isn't how we think. We keep data until it's painful, then we add boxes. When that's painful, we prune (we're not sure yet what that looks like). Repartitioning is not feasible at scale. Partition count should be dynamic.
Multi-level partitioning (by range(time), then hash(region) or identity(region)) is painful without building your own query layer.
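A minimal sketch of the two-level scheme just described (names, the daily granularity, and the bucket count are all hypothetical, not our actual layout):

```python
import zlib
from datetime import datetime

HASH_BUCKETS = 4  # hypothetical second-level fan-out


def partition_for(event_time: datetime, region: str) -> str:
    """Partition first by time range (one bucket per day here), then by a
    stable hash of a low-cardinality field such as region."""
    day = event_time.strftime("%Y-%m-%d")
    bucket = zlib.crc32(region.encode("utf-8")) % HASH_BUCKETS
    return f"{day}/r{bucket}"
```

The custom query layer then has to prune on both levels: drop time buckets outside the query's range, and drop hash buckets when the query pins a region.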
All shards are open all the time. Implicit assumption that either you 1. have queries that touch the data evenly or 2. have infinite resources. Recent events are hotter than distant ones, but distant data still needs to be available for query.
Poor cache control. Recent data should be in cache. Historical scans shouldn’t push recent data out of cache.
APIs are extremely "single record" focused. REST with record-at-a-time is absolutely abysmal for high throughput systems. Batch (offline) indexing isn't useful either. There's no in-between.
Read replicas are expensive and homogeneous. Ideally we'd have three read replicas for the last N days and one for everything else. Replicas added for performance should take up space in memory, but not on disk.
Ingest concurrency tends to be wonky; whole lotta locking going on. Anecdotally, it’s difficult to get Solr Cloud to light up all cores on a box without running multiple JVMs; something is weird.
We can get the benefits of NRT indexing speed with fewer writer checkpoints because our ingest pipeline acts as a reliable log. We recover from Kafka based on the last time the writer checkpointed so we can checkpoint very infrequently if we want.
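A sketch of that recovery contract (names and the checkpoint interval are made up, not our implementation): the writer tracks the Kafka offset it has durably indexed, checkpoints rarely, and after a crash resumes consuming from the last checkpointed offset, re-indexing anything after it.

```python
class CheckpointingWriter:
    """Sketch: index from a reliable log, checkpoint (fsync) infrequently.
    Recovery replays from the last checkpointed offset, so everything
    after it is simply re-indexed."""

    def __init__(self, checkpoint_every: int = 100_000):
        self.checkpoint_every = checkpoint_every
        self.checkpointed_offset = 0  # durably stored in real life
        self.in_memory_offset = 0

    def index(self, offset: int) -> None:
        self.in_memory_offset = offset
        if offset - self.checkpointed_offset >= self.checkpoint_every:
            self.checkpointed_offset = offset  # the expensive durable write

    def recovery_offset(self) -> int:
        # After a crash, resume consuming the log from here.
        return self.checkpointed_offset
```

The trade is cheap and explicit: a larger checkpoint interval means less write overhead in steady state and more re-indexing on recovery, and the log makes that safe.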
We know our data doesn’t change, or changes very little, after a certain point, so we can optimize and freeze indexes, reducing write amplification from compactions.
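The freeze rule can be as simple as this sketch (the seven-day horizon is a hypothetical threshold, not ours):

```python
from datetime import datetime, timedelta

FREEZE_AFTER = timedelta(days=7)  # hypothetical immutability horizon


def should_freeze(newest_event_time: datetime, now: datetime) -> bool:
    """Once a partition's newest event is older than the horizon, optimize
    it down to its final form and mark it read-only; it never participates
    in compaction again, so it's never rewritten again."""
    return now - newest_event_time > FREEZE_AFTER
```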
There are plenty of ways we could have pushed the general purpose systems, and we did.
We layered our own partitioning and shard selection on top of Solr Cloud with time-based collection round robining. That got us pretty far, but not far enough. We were starting to do a lot of query rewriting and scheduling.
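The shard-selection part of that layer amounts to something like this sketch (the collection naming convention and daily granularity are hypothetical): map a query's time range onto the time-bucketed collections and fan out only to those.

```python
from datetime import date, timedelta


def collections_for_range(start: date, end: date, prefix: str = "events") -> list:
    """One collection per day; a query touches only the collections
    overlapping its time range instead of every shard in the cluster."""
    out = []
    d = start
    while d <= end:
        out.append(f"{prefix}_{d.isoformat()}")
        d += timedelta(days=1)
    return out
```

This is also where the query rewriting and scheduling creep in: every query has to be rewritten against the matching collections, and cross-collection results have to be merged.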
Run multiple JVMs per box. Gross. Unsupportable.
Push historical queries out of search to a system such as Spark.
Build weird caches of frequent data sets.
At some point, the cost of hacking outweighed the cost of building.