Your SlideShare is downloading. ×

Transactions Over Apache HBase

3,673

Published on

Learn how to add transactions support for Apache HBase

Learn how to add transactions support for Apache HBase

Published in: Engineering
0 Comments
16 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,673
On Slideshare
0
From Embeds
0
Number of Embeds
9
Actions
Shares
0
Downloads
0
Comments
0
Likes
16
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. TRANSACTIONS OVER HBASE Alex Baranau @abaranau Gary Helmling @gario ! Continuuity
  • 2. WHO WE ARE • We’ve built Continuuity Reactor: the world’s first scale-out application server for Hadoop • Fast, easy development, deployment and management of Hadoop and HBase apps • Continuuity team has years of experience in using and contributing to Open Source, and we intend to continue doing so.
  • 3. AGENDA • Transactions over HBase: Why? What? • Implementation: How? • The approach • Transaction Manager • Transaction in batch jobs
  • 4. THE REACTOR • Continuuity Reactor is an app platform built on Hadoop and HBase • Collect, Process, Store, and Query data. • A Flow is a real-time processor with exactly-once guarantee • A flow is composed of flowlets, connected via queues • All processing happens with ACID guarantees in transactions
  • 5. HBase Table PROCESSING IN A FLOW ...Queue ... ... Flowlet ... ...
  • 6. HBase Table PROCESSING IN A FLOW ...Queue ... ... Flowlet ... ...
  • 7. HBase Table WRAP IN TRANSACTION! ...Queue ... ... Flowlet
  • 8. TRANSACTIONS: WHAT? • Atomic - Entire transaction is committed as one • Consistent - No partial state change due to failure • Isolated - No dirty reads, transaction is only visible after commit • Durable - Once committed, data is persisted reliably
  • 9. HBASE: QUICK “WHAT?” • Open-source non-relational distributed column- oriented database modeled after Google’s BigTable • Data is stored in named Tables of Rows • Row consists of Cells:
 (row key, column family, column, timestamp) -> value • Table is split into Regions, each served by a single node in HBase cluster
  • 10. WHAT ABOUT HBASE? • Atomic operations on cell value: 
 checkAndPut, checkAndDelete, increment, append • Atomic batch of operations on rows within region • No cross region atomic operations support • No cross table atomic operations support • No multi-RPC atomic operations support
  • 11. TRANSACTIONS: WHY ELSE? • Complex data access patterns • Secondary Indexes • Design for greater performance • Efficient batching of data operations • Client-side buffer reliability fix
  • 12. IMPLEMENTATION • Multi-Version Concurrency Control • Cell version (timestamp) = transaction ID • All writes in the same transaction use the transaction ID as timestamp • Reads exclude other, uncommitted transactions (for isolation) • Optimistic Concurrency Control • Conflict detection at commit of transaction • Write Conflict: two overlapping transactions write the same row • Rollback of one transaction in case of conflict (whichever commits later)
  • 13. OPTIMISTIC CONCURRENCY CONTROL • Avoids cost of locking rows and tables • No deadlocks or lock escalations • Cost of conflict detection and possible rollback is higher • Good if conflicts are rare: short transaction, disjoint partitioning of work
  • 14. ZooKeeper TRANSACTIONS IN CONTEXT Tx Manager (standby) HBase Master 1 Master 2 RS 1 RS 2 RS 4 RS 3 Client 1 Client 2 Client N Tx Manager (active)
  • 15. TRANSACTION LIFE CYCLE time out try abort failed roll back in HBase write to HBase do work Client Tx Manager none complete V abortsucceeded in progress start tx start start tx commit try commit check conflicts RPC API invalid X invalidate failed
  • 16. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 Tx Manager Client 1 Client 2 write = 1002 read = 1001
  • 17. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 Tx Manager Client 1 start write = 1002 read = 1001 Client 2 write = 1002 read = 1001
  • 18. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 Tx Manager Client 1 start write = 1002 read = 1001 Client 2 write = 1003 read = 1001 inprogress=[1002]
  • 19. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 Tx Manager Client 1 increment write = 1002 read = 1001 Client 2 write = 1003 read = 1001 inprogress=[1002]
  • 20. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 Tx Manager Client 1 increment write = 1002 read = 1001 Client 2 write = 1003 read = 1001 inprogress=[1002]
  • 21. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 Tx Manager Client 1 start write = 1002 read = 1001 Client 2 write = 1003 read = 1001 inprogress=[1002] write = 1003 read = 1001 excluded=[1002]
  • 22. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 Tx Manager Client 1 start write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002, 1003] write = 1003 read = 1001 excluded=[1002]
  • 23. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 increment write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002, 1003] write = 1003 read = 1001 excluded=[1002]
  • 24. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 commit write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002, 1003] write = 1003 read = 1001 excluded=[1002]
  • 25. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 commit write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002] write = 1003 read = 1001 excluded=[1002]
  • 26. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 commit write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  • 27. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 conflict! write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  • 28. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 rollback write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  • 29. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 rollback write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  • 30. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 abort write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  • 31. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 abort write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[]
  • 32. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 abort write = 1002 read = 1001 Client 2 write = 1004 read = 1003 inprogress=[]
  • 33. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 start Client 2 write = 1005 read = 1003 inprogress=[] write = 1004 read = 1003
  • 34. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1003 11 Tx Manager Client 1 read Client 2 write = 1005 read = 1003 inprogress=[] write = 1004 read = 1003
  • 35. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 conflict! write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  • 36. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 rollback write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  • 37. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 rollback failed write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  • 38. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 invalidate write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  • 39. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 invalidate write = 1002 read = 1001 Client 2 write = 1004 read = 1003 inprogress=[] invalid=[1002]
  • 40. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 start Client 2 write = 1005 read = 1003 inprogress=[] invalid=[1002] write = 1004 read = 1003 exclude = [1002]
  • 41. HBase CLIENT SIDE: TX AWARE Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 read Client 2 write = 1005 read = 1003 inprogress=[] invalid=[1002] write = 1004 read = 1003 exclude = [1002] invisible!
  • 42. TRANSACTION MANAGER • Create new transactions • Provides monotonically increasing write pointers • Maintains all in-progress, committed, and invalid transactions • Detect conflicts • Transaction = Write Pointer: Timestamp for HBase writes Read pointer: Upper bound timestamp for reads Excludes: List of timestamps to exclude from reads
  • 43. TRANSACTION MANAGER • Simple & Fast • All required state is in-memory • Single point of failure? • Persist all state to a write-ahead log • Secondary Tx Manager watches for failure of Primary • Failover can happen quickly
  • 44. TRANSACTION MANAGER Tx Manager Current State in progress committed invalid read point write point start()
  • 45. TRANSACTION MANAGER Tx Manager Current State in progress (+) committed invalid read point write point ++ start() Tx Log started, <write pt> HDFS
  • 46. TRANSACTION MANAGER Tx Manager Current State in progress (-) committed (+) invalid read point write point commit() Tx Log start, <write pt> commit, <write pt> HDFS
  • 47. TRANSACTION SNAPSHOTS • Write-ahead log provides persistence • Guarantees point-in-time recovery • Longer the log grows, longer recovery takes • Periodically write snapshot of full transaction state • Snapshot + all new logs provides full state
  • 48. Tx Manager Current State TRANSACTION SNAPSHOTS Tx Log A in progress committed invalid read point write point HDFS
  • 49. Tx Manager Current State TRANSACTION SNAPSHOTS Tx Log A in progress committed invalid read point write point Tx Log B1 HDFS
  • 50. TRANSACTION SNAPSHOTS Tx Log ATx Manager in progress committed invalid read point write point Current State State Snapshot in progress committed invalid read point write point Tx Log B2 HDFS
  • 51. TRANSACTION SNAPSHOTS Tx Log ATx Manager in progress committed invalid read point write point Current State State Snapshot in progress committed invalid read point write point Tx Log B Tx Snapshot in progress committed invalid read point write point 3 HDFS
  • 52. TRANSACTION SNAPSHOTS Tx Log ATx Manager in progress committed invalid read point write point Current State State Snapshot in progress committed invalid read point write point Tx Log B Tx Snapshot in progress committed invalid read point write point 4 HDFS
  • 53. HBase TRANSACTION CLEANUP Cell TS Value row1:col1 1001 10 row1:col1 1002 11 row1:col1 1003 11 Tx Manager Client 1 rollback failed write = 1002 read = 1001 Client 2 write = 1004 read = 1001 inprogress=[1002]
  • 54. TRANSACTION CLEANUP: DATA JANITOR • RegionObserver coprocessor • Maintains in-memory snapshot of recent invalid & in-progress sets • Periodically updates from transaction snapshot in HDFS • Purges data from invalid transactions and older versions on flush & compaction
  • 55. HBase TRANSACTION CLEANUP: DATA JANITOR Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] refresh Data Janitor (RegionObserver) MemStore preFlush() read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Cell TS Value row1:col1 1004 12 1003 11 1002 11 Custom RegionScanner HFile Cell TS Value
  • 56. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) HFile Cell TS Value Custom RegionScanner MemStore read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] preFlush() Cell TS Value row1:col1 1004 12 1003 11 1002 11
  • 57. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) HFile Cell TS Value row1:col1 1004 12Custom RegionScanner read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] MemStore preFlush() Cell TS Value row1:col1 1004 12 1003 11 1002 11
  • 58. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) Custom RegionScanner read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] MemStore preFlush() Cell TS Value row1:col1 1004 12 1003 11 1002 11 HFile Cell TS Value row1:col1 1004 12
  • 59. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) Custom RegionScanner read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] MemStore preFlush() Cell TS Value row1:col1 1004 12 1003 11 1002 11 HFile Cell TS Value row1:col1 1004 12 1003 11
  • 60. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) Custom RegionScanner read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] MemStore preFlush() Cell TS Value row1:col1 1004 12 1003 11 1002 11 HFile Cell TS Value row1:col1 1004 12 1003 11
  • 61. HBase TRANSACTION CLEANUP: DATA JANITOR Data Janitor (RegionObserver) read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Tx Snapshot read point = 1003 write point = 1005 in progress = [1004] committed = [] invalid = [1002] Custom RegionScanner preFlush() MemStore Cell TS Value row1:col1 1004 12 1003 11 1002 11 HFile Cell TS Value row1:col1 1004 12 1003 11
  • 62. REDUNDANT VERSIONS CLEANUP • Every write is new version of a cell • Heavily updated values (e.g. counters) result in lots of versions • Cleanup: keep at most one version of a cell visible to all transactions • Plug-in cleanup into existing Data Janitor CP
  • 63. TTL • Timestamp of a value is overwritten with tx id • Breaks support for “native TTL” feature in HBase • Tx ids and timestamps cannot be aligned: more than 1K tx/sec wanted
  • 64. TTL • Encode timestamp in tx id • tx id = (current ts) * 1,000,000 + counter- within-ms • 1B txs/sec max • long value overflow in ~200 years • Plug-in cleanup into existing Data Janitor CP • Apply TTL logic on reads
  • 65. TRANSACTIONS IN BATCH • Batch jobs are long running • No changes should be visible before job is done • Batch jobs are multi-step • Individual step can be restarted • All steps succeed or all fail semantics needed • Every step retry should be isolated • Uncommitted changes should be visible within same step but not to other steps
  • 66. TRANSACTIONS IN BATCH • Batch jobs are long-running • No fine-grained conflict detection: last started among successfully committed wins • Delayed rollback execution • Batch jobs are multi-step • For now: batch job uses single transaction, every step re-uses it • For future: nested transactions for steps
  • 67. TX OVERHEAD • 2 RPC calls, each faster than Put in HBase • Can be improved with batching • To avoid overhead of transactions where appropriate the ACID guarantees can be relaxed • No need to change data format • When reading: read latest version of a cell • When writing: use ts = (current ts) * 1M + counter
  • 68. WHAT’S NEXT? • Open Source • Continue Scaling Tx Manager • Transaction Groups? • Integration across other transactional stores
  • 69. QS? Looking for the chance to work with a team that is defining a new category within Big Data? ! We are hiring! http://continuuity.com/careers careers[at]continuuity.com ! ! Alex Baranau @abaranau alexb[at]continuuity.com

×