Musings on Secondary Indexing in HBase

Presentation on Secondary Indexes from the 9/11/12 HBase Contributor's Meetup. It discusses the current state of the discussion and some possible future directions.

Published in: Technology

  1. Jesse Yates, Salesforce.com: Secondary Indexing, the discussion so far… 9/11/12 HBase Pow-wow
  2. What is it?
  3. Problem
     • HBase rows are multi-dimensional
       – Only sorted on the row key
     • How do you efficiently look up deeper into the row key?
  4. Example
     Row   Family   Qualifier   Timestamp   Value
     1     Name     First       0           Babe
     1     Name     Last        0           Ruth
     How do we find all people with the last name ‘Ruth’? Full table scan!
  5. Indexing!
     Row   Family   Qualifier   Timestamp   Value
     Ruth  Name     Last        0           1
     Store the property we need to search for as the primary key
     • pointer back to the primary row
     • fast lookup - O(lg(n))
  6. Use Cases
     • Point lookups
       – Volume of data influences usefulness of index
         • Let user decide if they need to use an index
     • Scan lookup
       – WHERE age > 16
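The two use cases above (a point lookup on last name and a scan such as WHERE age > 16) can be sketched with a toy in-memory model. This is not the HBase client API: the `SecondaryIndex` class, the `Info:Age` column, and every row other than Babe Ruth are assumptions made up for illustration. The index is kept sorted by the indexed value, so both lookups use binary search instead of a full table scan.

```python
import bisect

# Toy stand-in for an HBase table: row key -> {(family, qualifier): value}.
# Only row "1" (Babe Ruth) comes from the slides; the rest is hypothetical.
primary = {
    "1": {("Name", "First"): "Babe", ("Name", "Last"): "Ruth", ("Info", "Age"): 40},
    "2": {("Name", "First"): "Lou", ("Name", "Last"): "Gehrig", ("Info", "Age"): 36},
    "3": {("Name", "First"): "Claire", ("Name", "Last"): "Ruth", ("Info", "Age"): 15},
}


class SecondaryIndex:
    """Index entries sorted by (indexed value, primary row key)."""

    def __init__(self, table, column):
        entries = sorted(
            (cells[column], row_key)
            for row_key, cells in table.items()
            if column in cells
        )
        # Parallel lists: the sorted values and the primary row keys they point to.
        self._values = [value for value, _ in entries]
        self._rows = [row_key for _, row_key in entries]

    def point_lookup(self, value):
        """Return the primary row keys whose indexed column equals value."""
        lo = bisect.bisect_left(self._values, value)
        hi = bisect.bisect_right(self._values, value)
        return self._rows[lo:hi]

    def scan_greater_than(self, threshold):
        """Return the primary row keys whose indexed column is > threshold."""
        start = bisect.bisect_right(self._values, threshold)
        return self._rows[start:]


last_name_index = SecondaryIndex(primary, ("Name", "Last"))
age_index = SecondaryIndex(primary, ("Info", "Age"))

ruths = last_name_index.point_lookup("Ruth")
older_than_16 = age_index.scan_greater_than(16)
```

Each result is a list of pointers back to the primary rows, which a client would then fetch by row key.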
  7. Implementations
  8. Omid: full transactional support; centralized oracle
  9. Lily: WAL implementation on top of HBase; 100-500 writes/sec
  10. Percolator: full transactions; distributed, optimistic locking; ~10 sec latencies possible
  11. Culvert: async; dead project, incomplete
  12. http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html
      Client-side coordinated index; use timestamps to coordinate; not yet implemented
  13. Trend Micro implementation: still just POC ???
  14. Solr/Lucene: standard Lucene library bolted on HBase; not commonly used; lots of formats/codecs already written
  15. Considerations for HBase: what do we need to do?
  16. Built-in vs. external library vs. semi-supported (e.g. security)
  17. Which should I use??
      • HBase experts write a single ‘right’ impl
      • Officially endorse a ‘correct’ version
      • What changes do we need to make
      • How close to the core is the project
        – Written in everywhere
        – hbase-index module
        – External library
  18. Async vs. Synchronous vs. Transactional
  19. Key Observation
      “Secondary indexing is inherently an easier problem than full transactions… secondary index updates are idempotent.” - Lars Hofhansl
  20. Async vs. Synchronous vs. Transactional
      • We don’t need full transactions
        – Transactions are slow
        – Transactions fail with increasing probability as number of servers increases
      • Optionally async or sync
        – Async
          • Inherently ‘dirty’ index
      • How does index cleanup work?
        – Inherently different for each type
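A minimal sketch of why idempotent index updates make this easier than full transactions, again using plain dictionaries rather than HBase. The function names and the cleanup-on-read strategy are assumptions for illustration; the slides leave the cleanup mechanism as an open question. Re-applying the same index put changes nothing, so a failed write can simply be retried, and a stale ('dirty') entry is detected by checking it against the primary row.

```python
def index_put(index, value, row_key, ts):
    """Idempotent: re-applying the same (value, row_key, ts) changes nothing."""
    key = (value, row_key)
    index[key] = max(index.get(key, ts), ts)


def write(primary, index, row_key, value, ts):
    """Write the primary row, then the index entry. The old index entry is
    deliberately left behind, which models a 'dirty' index."""
    primary[row_key] = (value, ts)
    index_put(index, value, row_key, ts)  # safe to retry on failure


def lookup(primary, index, value):
    """Read through the index, dropping stale entries as they are found.
    Scans the whole dict for simplicity; a real index would be sorted."""
    rows = []
    for key in [k for k in index if k[0] == value]:
        row_key = key[1]
        current = primary.get(row_key)
        if current is not None and current[0] == value:
            rows.append(row_key)
        else:
            del index[key]  # lazy cleanup of a dirty entry
    return sorted(rows)


primary, index = {}, {}
write(primary, index, "1", "Ruth", ts=1)
index_put(index, "Ruth", "1", ts=1)        # simulated retry, no effect
write(primary, index, "1", "Gehrig", ts=2)  # leaves ("Ruth", "1") stale
```

After the second write the index still contains the stale `("Ruth", "1")` entry until a lookup for `"Ruth"` validates it against the primary row and removes it.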
  21. Locality
  22. Where’s my data?
      • Extra columns vs. index table
      • HBase Region-pinning
        – Has to be best-effort or will decrease availability
        – Helps minimize RPC overhead
        – Cross-table region-pinning
        – Needs a coprocessor hook to be useful
      • HDFS block allocation
        – Keep index and data blocks on same HDFS node
  23. Index Cardinality
  24. How much data are we talking?
      “Seems like there are 3 categories of sparseness:
      1. sparse indexes (like ipAddress) where a per-table approach is more efficient for reads
      2. dense indexes (like eventType) where there are likely values of every index key on each region
      3. very dense indexes (like male/female) where you should just be doing a table scan anyway”
      - Matt Corgan (9/10/12)
  25. Impact on implementation
      • Need a lot of knowledge of data to pick the right kind of index
        – User knows their data, let them do the hard work of picking indexes
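To make the three categories concrete, here is a rough sketch that classifies a column from a sample of its values. The function name and both thresholds (a key matching at least 10% of rows counts as 'very dense'; at least one row per region on average counts as 'dense') are assumptions chosen for illustration, not rules from the slides or from HBase.

```python
def classify_index(values, num_regions, very_dense_fraction=0.1):
    """Classify a column as 'sparse', 'dense', or 'very dense'.

    values: sample of the indexed column, one entry per row.
    num_regions: number of regions the table is split into.
    very_dense_fraction: hypothetical cut-off; if an average key matches at
        least this fraction of all rows, a table scan is likely just as good.
    """
    if not values:
        raise ValueError("need at least one indexed value")
    total = len(values)
    rows_per_key = total / len(set(values))
    if rows_per_key / total >= very_dense_fraction:
        return "very dense"
    if rows_per_key >= num_regions:
        # Enough rows per key that every region probably holds some of them.
        return "dense"
    return "sparse"


# Made-up samples mirroring the three examples in the quote.
ip_addresses = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
event_types = [f"type-{i % 50}" for i in range(1000)]
genders = ["male" if i % 2 else "female" for i in range(1000)]

ip_kind = classify_index(ip_addresses, num_regions=10)
event_kind = classify_index(event_types, num_regions=10)
gender_kind = classify_index(genders, num_regions=10)
```

In practice the user, who knows the data, would make this call directly; the heuristic only shows how the categories relate to rows per key and region count.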
  26. Pluggability
  27. Everyone’s got an impl already
      • We need to make HBase flexible enough to support (most) current indexing formats with minimal overhead for switching
        – Lucene style Codec/CodecProvider?
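One way to picture a Codec/CodecProvider style plug-in point is sketched below. All names here (`IndexCodec`, `ExactValueCodec`, `LowercaseCodec`, `CodecProvider`) are hypothetical and belong to neither HBase nor Lucene; the sketch only shows the shape of the idea, in which a codec decides how a row is turned into index keys and a provider lets the format be swapped by name.

```python
from abc import ABC, abstractmethod


class IndexCodec(ABC):
    """Turns one primary row into zero or more index row keys."""

    @abstractmethod
    def index_keys(self, row_key, cells):
        ...


class ExactValueCodec(IndexCodec):
    """Index key is the raw column value, a separator, then the primary row key."""

    def __init__(self, column):
        self.column = column

    def normalize(self, value):
        return str(value)

    def index_keys(self, row_key, cells):
        value = cells.get(self.column)
        if value is None:
            return []
        return [f"{self.normalize(value)}\x00{row_key}"]


class LowercaseCodec(ExactValueCodec):
    """Same layout, but case-insensitive."""

    def normalize(self, value):
        return str(value).lower()


class CodecProvider:
    """Registry that maps a configured codec name to a codec factory."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def create(self, name, **kwargs):
        return self._factories[name](**kwargs)


provider = CodecProvider()
provider.register("exact", ExactValueCodec)
provider.register("lowercase", LowercaseCodec)

row = {("Name", "Last"): "Ruth"}
exact_keys = provider.create("exact", column=("Name", "Last")).index_keys("1", row)
lower_keys = provider.create("lowercase", column=("Name", "Last")).index_keys("1", row)
```

Switching index formats then means changing the registered name rather than the write path, which is the 'minimal overhead for switching' the slide asks for.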
  28. Client-interface
  29. What should it look like?
      • Minimal changes to the top-level interfaces
        – Add a single new flag?
        – Configuration based?
      • Enough that the user gets to be smart about what should be used
        – We can’t get all cases right; just provide building blocks
      • Automatically use an index?
      • Scanner/Filter style use?
  30. Properties for the client
      • Should the user even see the index lookups?
      • ACID?
      • Ordering of results?
        – Support the current sorted order?
        – Batch lookup?
      • Implications on current features
        – Replication
        – Splitting
  31. Schema(less)
      • Schema enforced?
        – Rigid usage of index matching an expected schema?
        – Schema table? Reserved schema columns? .META.?
      • Schema-less
        – Let the user apply whatever they think and use only what actually works
      • Best-effort
        – Use client-hinted schema and try to apply all the known indexes
  32. My random thoughts…
      • Client-side managed indexes are efficient
        – Minimal RPC overhead
          • Cleanup is async to client and rarely misses
        – Solves the cross-region/server problem
          • Region-pinning is a nice-to-have optimization
        – Scales without concern for locality
        – Flexible enough to support custom codecs
        – Can be built to provide server-side optimizations
          • Locality aware indexes to minimize RPCs
  33. Discussion!