Transactions Over Apache HBase

8,652 views

Published on

Learn how to add transactions support for Apache HBase

Published in: Engineering

Transactions Over Apache HBase

  1. 1. TRANSACTIONS OVER HBASE Alex Baranau @abaranau Gary Helmling @gario ! Continuuity
  2. 2. WHO WE ARE • We’ve built Continuuity Reactor: the world’s first scale-out application server for Hadoop • Fast, easy development, deployment and management of Hadoop and HBase apps • Continuuity team has years of experience in using and contributing to Open Source, and we intend to continue doing so.
  3. 3. AGENDA • Transactions over HBase: Why? What? • Implementation: How? • The approach • Transaction Manager • Transaction in batch jobs
  4. 4. THE REACTOR • Continuuity Reactor is an app platform built on Hadoop and HBase • Collect, Process, Store, and Query data. • A Flow is a real-time processor with exactly-once guarantee • A flow is composed of flowlets, connected via queues • All processing happens with ACID guarantees in transactions
  5. 5. HBase Table PROCESSING IN A FLOW ...Queue ... ... Flowlet ... ...
  6. 6. HBase Table PROCESSING IN A FLOW ...Queue ... ... Flowlet ... ...
  7. 7. HBase Table WRAP IN TRANSACTION! ...Queue ... ... Flowlet
  8. 8. TRANSACTIONS: WHAT? • Atomic - Entire transaction is committed as one • Consistent - No partial state change due to failure • Isolated - No dirty reads, transaction is only visible after commit • Durable - Once committed, data is persisted reliably
  9. 9. HBASE: QUICK “WHAT?” • Open-source non-relational distributed column- oriented database modeled after Google’s BigTable • Data is stored in named Tables of Rows • Row consists of Cells:
 (row key, column family, column, timestamp) -> value • Table is split into Regions, each served by a single node in HBase cluster
  10. 10. WHAT ABOUT HBASE? • Atomic operations on cell value: 
 checkAndPut, checkAndDelete, increment, append • Atomic batch of operations on rows within region • No cross region atomic operations support • No cross table atomic operations support • No multi-RPC atomic operations support
  11. 11. TRANSACTIONS: WHY ELSE? • Complex data access patterns • Secondary Indexes • Design for greater performance • Efficient batching of data operations • Client-side buffer reliability fix
  12. 12. IMPLEMENTATION • Multi-Version Concurrency Control • Cell version (timestamp) = transaction ID • All writes in the same transaction use the transaction ID as timestamp • Reads exclude other, uncommitted transactions (for isolation) • Optimistic Concurrency Control • Conflict detection at commit of transaction • Write Conflict: two overlapping transactions write the same row • Rollback of one transaction in case of conflict (whichever commits later)
  13. 13. OPTIMISTIC CONCURRENCY CONTROL • Avoids cost of locking rows and tables • No deadlocks or lock escalations • Cost of conflict detection and possible rollback is higher • Good if conflicts are rare: short transaction, disjoint partitioning of work
  14. 14. ZooKeeper TRANSACTIONS IN CONTEXT Tx Manager (standby) HBase Master 1 Master 2 RS 1 RS 2 RS 4 RS 3 Client 1 Client 2 Client N Tx Manager (active)
  15. 15. TRANSACTION LIFE CYCLE time out try abort failed roll back in HBase write to HBase do work Client Tx Manager none complete V abortsucceeded in progress start tx start start tx commit try commit check conflicts RPC API invalid X invalidate failed
  16. 16. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 Tx Manager Client 1 Client 2 write = 1002 read = 1001
  17. 17. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 Tx Manager Client 1 start write = 1002 read = 1001 Client 2 write = 1002 read = 1001
  18. 18. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 Tx Manager Client 1 start write = 1002 read = 1001 Client 2 write = 1003 read = 1001 inprogress=[1002]
  19. 19. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 Tx Manager Client 1 increment write = 1002 read = 1001 Client 2 write = 1003 read = 1001 inprogress=[1002]
  20. 20. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 Tx Manager Client 1 increment write = 1002 read = 1001 Client 2 write = 1003 read = 1001 inprogress=[1002]
  21. 21. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 Tx Manager Client 1 start write = 1002 read = 1001 Client 2 write = 1003 read = 1001 inprogress=[1002] write = 1003 read = 1001 excluded=[1002]
  22. 22. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 Tx Manager Client 1 start write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002, 1003] write = 1003 read = 1001 excluded=[1002]
  23. 23. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 increment write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002, 1003] write = 1003 read = 1001 excluded=[1002]
  24. 24. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 commit write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002, 1003] write = 1003 read = 1001 excluded=[1002]
  25. 25. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 commit write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002] write = 1003 read = 1001 excluded=[1002]
  26. 26. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 commit write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  27. 27. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 conflict! write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  28. 28. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 rollback write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  29. 29. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 rollback write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  30. 30. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 abort write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  31. 31. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 abort write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[]
  32. 32. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 abort write = 1002 read = 1001 Client 2 write = 1004 read = 1003 inprogress=[]
  33. 33. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 start Client 2 write = 1005 read = 1003 inprogress=[] write = 1004 read = 1003
  34. 34. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 read Client 2 write = 1005 read = 1003 inprogress=[] write = 1004 read = 1003
  35. 35. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 conflict! write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  36. 36. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 rollback write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  37. 37. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 rollback failed write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  38. 38. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 invalidate write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  39. 39. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 invalidate write = 1002 read = 1001 Client 2 write = 1004 read = 1003 inprogress=[] invalid=[1002]
  40. 40. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 start Client 2 write = 1005 read = 1003 inprogress=[] invalid=[1002] write = 1004 read = 1003 exclude = [1002]
  41. 41. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 read Client 2 write = 1005 read = 1003 inprogress=[] invalid=[1002] write = 1004 read = 1003 exclude = [1002] invisible!
  42. 42. TRANSACTION MANAGER • Create new transactions • Provides monotonically increasing write pointers • Maintains all in-progress, committed, and invalid transactions • Detect conflicts • Transaction = Write Pointer: Timestamp for HBase writes Read pointer: Upper bound timestamp for reads Excludes: List of timestamps to exclude from reads
  43. 43. TRANSACTION MANAGER • Simple & Fast • All required state is in-memory • Single point of failure? • Persist all state to a write-ahead log • Secondary Tx Manager watches for failure of Primary • Failover can happen quickly
  44. 44. TRANSACTION MANAGER Tx Manager Current State in progress committed invalid read point write point start()
  45. 45. TRANSACTION MANAGER Tx Manager Current State in progress (+) committed invalid read point write point ++ start() Tx Log started, <write pt> HDFS
  46. 46. TRANSACTION MANAGER Tx Manager Current State in progress (-) committed (+) invalid read point write point commit() Tx Log start, <write pt> commit, <write pt> HDFS
  47. 47. TRANSACTION SNAPSHOTS • Write-ahead log provides persistence • Guarantees point-in-time recovery • Longer the log grows, longer recovery takes • Periodically write snapshot of full transaction state • Snapshot + all new logs provides full state
  48. 48. Tx Manager Current State TRANSACTION SNAPSHOTS Tx Log A in progress committed invalid read point write point HDFS
  49. 49. Tx Manager Current State TRANSACTION SNAPSHOTS Tx Log A in progress committed invalid read point write point Tx Log B1 HDFS
  50. 50. TRANSACTION SNAPSHOTS Tx Log ATx Manager in progress committed invalid read point write point Current State State Snapshot in progress committed invalid read point write point Tx Log B2 HDFS
  51. 51. TRANSACTION SNAPSHOTS Tx Log ATx Manager in progress committed invalid read point write point Current State State Snapshot in progress committed invalid read point write point Tx Log B Tx Snapshot in progress committed invalid read point write point 3 HDFS
  52. 52. TRANSACTION SNAPSHOTS Tx Log ATx Manager in progress committed invalid read point write point Current State State Snapshot in progress committed invalid read point write point Tx Log B Tx Snapshot in progress committed invalid read point write point 4 HDFS
  53. 53. HBase TRANSACTION CLEANUP Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 rollback failed write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  54. 54. TRANSACTION CLEANUP: DATA JANITOR • RegionObserver coprocessor • Maintains in-memory snapshot of recent invalid & in-progress sets • Periodically updates from transaction snapshot in HDFS • Purges data from invalid transactions and older versions on flush & compaction
  55. 55. HBase TRANSACTION CLEANUP: DATA JANITOR Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] refresh Data Janitor (RegionObserver) MemStore preFlush() read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Cell TS Value row1:col1 1004 12 1003 11 1002 11 Custom RegionScanner HFile Cell TS Value
  56. 56. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) HFile Cell TS Value Custom RegionScanner MemStore read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] preFlush() Cell TS Value row1:col1 1004 12 1003 11 1002 11
  57. 57. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) HFile Cell TS Value row1:col1 1004 12Custom RegionScanner read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] MemStore preFlush() Cell TS Value row1:col1 1004 12 1003 11 1002 11
  58. 58. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) Custom RegionScanner read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] MemStore preFlush() Cell TS Value row1:col1 1004 12 1003 11 1002 11 HFile Cell TS Value row1:col1 1004 12
  59. 59. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) Custom RegionScanner read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] MemStore preFlush() Cell TS Value row1:col1 1004 12 1003 11 1002 11 HFile Cell TS Value row1:col1 1004 12 1003 11
  60. 60. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) Custom RegionScanner read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] MemStore preFlush() Cell TS Value row1:col1 1004 12 1003 11 1002 11 HFile Cell TS Value row1:col1 1004 12 1003 11
  61. 61. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Custom RegionScanner preFlush() MemStore Cell TS Value row1:col1 1004 12 1003 11 1002 11 HFile Cell TS Value row1:col1 1004 12 1003 11
  62. 62. REDUNDANT VERSIONS CLEANUP • Every write is new version of a cell • Heavily updated values (e.g. counters) result in lots of versions • Cleanup: keep at most one version of a cell visible to all transactions • Plug-in cleanup into existing Data Janitor CP
  63. 63. TTL • Timestamp of a value is overwritten with tx id • Breaks support for “native TTL” feature in HBase • Tx ids and timestamps cannot be aligned: more than 1K tx/sec wanted
  64. 64. TTL • Encode timestamp in tx id • tx id = (current ts) * 1,000,000 + counter- within-ms • 1B txs/sec max • long value overflow in ~200 years • Plug-in cleanup into existing Data Janitor CP • Apply TTL logic on reads
  65. 65. TRANSACTIONS IN BATCH • Batch jobs are long running • No changes should be visible before job is done • Batch jobs are multi-step • Individual step can be restarted • All steps succeed or all fail semantics needed • Every step retry should be isolated • Uncommitted changes should be visible within same step but not to other steps
  66. 66. TRANSACTIONS IN BATCH • Batch jobs are long-running • No fine-grained conflict detection: last started among successfully committed wins • Delayed rollback execution • Batch jobs are multi-step • For now: batch job uses single transaction, every step re-uses it • For future: nested transactions for steps
  67. 67. TX OVERHEAD • 2 RPC calls, each faster than Put in HBase • Can be improved with batching • To avoid overhead of transactions where appropriate the ACID guarantees can be relaxed • No need to change data format • When reading: read latest version of a cell • When writing: use ts = (current ts) * 1M + counter
  68. 68. WHAT’S NEXT? • Open Source • Continue Scaling Tx Manager • Transaction Groups? • Integration across other transactional stores
  69. 69. QS? Looking for the chance to work with a team that is defining a new category within Big Data? ! We are hiring! http://continuuity.com/careers careers[at]continuuity.com ! ! Alex Baranau @abaranau alexb[at]continuuity.com

×