Solr/Lucene
• Standard Lucene library bolted on HBase
• Not commonly used
• Lots of formats/codecs already written
Considerations for HBase
What do we need to do?
Built-in vs. external library vs. semi-supported (e.g. security)
Which should I use?
• HBase experts write a single ‘right’ impl
• Officially endorse a ‘correct’ version
• What changes do we need to make?
• How close to the core is the project?
  – Written in everywhere
  – hbase-index module
  – External library
Key Observation
“Secondary indexing is inherently an easier problem than full transactions… secondary index updates are idempotent.” - Lars Hofhansl
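One way to read the idempotence point: an index entry is fully determined by the indexed value, the data row key, and the timestamp, so a retried write lands on the same cell and changes nothing. The sketch below is only an illustration under that assumption. It uses a plain Python dict as a stand-in for an index table, and `index_row_key` and `apply_index_update` are hypothetical names, not HBase APIs.

```python
# Hypothetical in-memory stand-in for an index table: a dict keyed by
# (index row key, timestamp). None of these names come from the HBase API.

def index_row_key(value: str, data_row: str) -> str:
    """Build the index row key from the indexed value and the data row key."""
    return f"{value}\x00{data_row}"

def apply_index_update(index: dict, value: str, data_row: str, ts: int) -> None:
    """Write one index entry; re-applying the same update leaves the state unchanged."""
    index[(index_row_key(value, data_row), ts)] = b""

index = {}
apply_index_update(index, "10.0.0.7", "row-42", ts=1000)
snapshot = dict(index)

# Simulate a client retry after a timeout: the same update arrives again.
apply_index_update(index, "10.0.0.7", "row-42", ts=1000)
unchanged = (index == snapshot)
```

Because a replayed update is harmless, a failed index write can simply be retried, with no need for the coordination a full transaction would require.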
Async vs. Synchronous vs. Transactional
• We don’t need full transactions
  – Transactions are slow
  – Transactions fail with increasing probability as number of servers increases
• Optionally async or sync
  – Async
    • Inherently ‘dirty’ index
• How does index cleanup work?
  – Inherently different for each type
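To make the 'dirty' index idea concrete, here is one possible cleanup strategy, sketched with in-memory structures: the reader re-checks each index hit against the data table, returns only live rows, and hands stale entries to a cleanup step. This is an assumed approach chosen for illustration, not the one the slides prescribe. `lookup` and `cleanup` are hypothetical helpers, and the dict and set stand in for the data and index tables.

```python
def lookup(index, data, column, value):
    """Return (live, stale): data rows whose current `column` equals `value`,
    plus the index entries that no longer match the data table."""
    live, stale = [], []
    for (indexed_value, row) in sorted(index):
        if indexed_value != value:
            continue
        if data.get(row, {}).get(column) == value:
            live.append(row)                     # index entry still valid
        else:
            stale.append((indexed_value, row))   # dirty entry left by an async update
    return live, stale

def cleanup(index, stale):
    """Remove stale entries; discard() makes repeated cleanup harmless."""
    for entry in stale:
        index.discard(entry)

# row-2 changed from "click" to "view", but its old index entry was never removed.
data = {"row-1": {"eventType": "click"}, "row-2": {"eventType": "view"}}
index = {("click", "row-1"), ("click", "row-2"), ("view", "row-2")}

live, stale = lookup(index, data, "eventType", "click")
cleanup(index, stale)
```

The extra read against the data table is the price of tolerating a dirty index; a synchronous index would pay that cost at write time instead.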
Where’s my data?
• Extra columns vs. index table
• HBase Region-pinning
  – Has to be best-effort or will decrease availability
  – Helps minimize RPC overhead
  – Cross-table region-pinning
  – Needs a coprocessor hook to be useful
• HDFS block allocation
  – Keep index and data blocks on same HDFS node
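For the index-table option, a common layout is to make the index row key the indexed value followed by the data row key, so that a lookup becomes a range scan over sorted keys. The sketch below imitates that with a sorted Python list and `bisect`. The separator byte and the helper names `index_key` and `prefix_scan` are assumptions for illustration only.

```python
import bisect

SEP = "\x00"  # assumed separator; must not appear inside indexed values

def index_key(value: str, data_row: str) -> str:
    """Index row key: indexed value, separator, then the data row key."""
    return value + SEP + data_row

def prefix_scan(sorted_keys, value):
    """Return data row keys for `value` using a range scan over sorted index keys."""
    start = value + SEP
    stop = value + chr(ord(SEP) + 1)  # smallest key past every key with this prefix
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return [key[len(start):] for key in sorted_keys[lo:hi]]

keys = sorted(index_key(v, r) for v, r in [
    ("10.0.0.7", "row-3"),
    ("10.0.0.7", "row-1"),
    ("10.0.0.70", "row-2"),
    ("10.0.0.9", "row-5"),
])

rows = prefix_scan(keys, "10.0.0.7")
```

The separator keeps `10.0.0.70` out of the results for `10.0.0.7`, which a bare prefix match on the value would not.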
How much data are we talking?
“Seems like there are 3 categories of sparseness:
1. sparse indexes (like ipAddress) where a per-table approach is more efficient for reads
2. dense indexes (like eventType) where there are likely values of every index key on each region
3. very dense indexes (like male/female) where you should just be doing a table scan anyway”
- Matt Corgan (9/10/12)
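One rough way to see which category a column falls into is to measure, for each indexed value, the fraction of regions that hold at least one row with that value. The sketch below does this over toy in-memory data. `region_coverage` and the sample regions are hypothetical, and no thresholds are implied, since the quote does not define any.

```python
def region_coverage(regions, column):
    """For each value of `column`, return the fraction of regions
    that contain at least one row with that value."""
    seen = {}
    for region_id, rows in regions.items():
        for row in rows:
            seen.setdefault(row[column], set()).add(region_id)
    return {value: len(ids) / len(regions) for value, ids in seen.items()}

# Toy data: two regions, two rows each.
regions = {
    "r1": [{"ip": "10.0.0.1", "eventType": "click"},
           {"ip": "10.0.0.2", "eventType": "view"}],
    "r2": [{"ip": "10.0.0.3", "eventType": "click"},
           {"ip": "10.0.0.4", "eventType": "view"}],
}

ip_cov = region_coverage(regions, "ip")            # each address lives in one region
event_cov = region_coverage(regions, "eventType")  # each event type appears in every region
```

Low coverage per value corresponds to the sparse case in the quote, while coverage near 1.0 for every value corresponds to the dense case.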
Impact on implementation
• Need a lot of knowledge of data to pick the right kind of index
  – User knows their data, let them do the hard work of picking indexes
What should it look like?
• Minimal changes to the top-level interfaces
  – Add a single new flag?
  – Configuration based?
• Enough that the user gets to be smart about what should be used
  – We can’t get all cases right; just provide building blocks
• Automatically use an index?
• Scanner/Filter style use?
Properties for the client
• Should the user even see the index lookups?
• ACID?
• Ordering of results?
  – Support the current sorted order?
  – Batch lookup?
• Implications on current features
  – Replication
  – Splitting
Schema(less)
• Schema enforced?
  – Rigid usage of index matching an expected schema?
  – Schema table? Reserved schema columns? .META.?
• Schema-less
  – Let the user apply whatever they think and use only what actually works
• Best-effort
  – Use client-hinted schema and try to apply all the known indexes
My random thoughts…
• Client-side managed indexes are efficient
  – Minimal RPC overhead
    • Cleanup is async to client and rarely misses
  – Solves the cross-region/server problem
    • Region-pinning is a nice-to-have optimization
  – Scales without concern for locality
  – Flexible enough to support custom codecs
  – Can be built to provide server-side optimizations
    • Locality-aware indexes to minimize RPCs