Omid: Scalable and Highly Available
Transaction Processing for Phoenix

Ohad Shacham, Edward Bortnikov ⎪ PhoenixCon, Jun 13, 2017
Let’s Get Started …
2
Our Yahoo Journey with Transactions over HBase



Omid for Users: Semantics, API, Integration with Phoenix



Omid for Programmers: Architecture and Use Cases



Omid, Advanced: Scalability, HA, Low-Latency
Transaction Processing in NoSQL @Yahoo
3
Motivation: Data Pipelines (Search, Mail, etc.)



Stream Processing is a Popular Pattern

Compute Tasks process Data Items that arrive in Real Time

Intermediate Artifacts stored in NoSQL (KV-)Storage



Extensive Use of Hadoop Technologies (Storm, HBase)



Scale: Thousands of Hadoop Nodes
Content Indexing for Search
[Pipeline diagram: Crawl, Docproc, and Link Analysis stages connected through the Crawl Schedule, Content, Queue, and Links stores; compute runs in Storm, state lives in HBase]
Zooming in on Tasks
Document processing


Read page content from the store 


Compute search index features


Update computed features

Link processing


Read outgoing links for a page


Update reference for all linked-to pages



[Each task runs as a transaction: begin, read/update, commit; see the sketch below]
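For illustration, here is a minimal sketch of the document-processing task as a single Omid transaction, written in the style of the programming example later in this deck (table name, row key, column names, and the computeFeatures() helper are made up):

// Illustrative sketch only: table, row key, columns, and computeFeatures() are hypothetical
TransactionManager tm = HBaseTransactionManager.newInstance();
TTable content = new TTable("CONTENT");

Transaction tx = tm.begin();                                         // begin

Result page = content.get(tx, new Get(Bytes.toBytes("page-42")));    // read page content from the store
byte[] features = computeFeatures(page);                             // compute search index features

Put update = new Put(Bytes.toBytes("page-42"));                      // update computed features
update.add(Bytes.toBytes("meta"), Bytes.toBytes("features"), features);
content.put(tx, update);

tm.commit(tx);                                                       // commit: all updates appear atomically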
Transaction Processing: ACID 101
6
Multiple data accesses in a single logical operation

Atomic 


“All or nothing” – no partial effect observable

Consistent


The DB transitions from one valid state to another

Isolated


Appear to execute in isolation 

Durable


Committed data cannot disappear
Omid (‫)امید‬
7
2011: Incepted @Yahoo Research ("Omid1")

2014: Large-Scale Deployment @Yahoo

2014/5: Major Re-Design for Scalability & HA ("Omid2")

2016: Apache Incubator

2017: Prototype Integration with Phoenix

Transaction Processing Service for Apache HBase
Contributors
8
Ohad Shacham (Yahoo Research)

Francisco Perez Sorrosal (Yahoo)

Edward Bortnikov (Yahoo Research)

Eshcar Hillel (Yahoo Research)

Idit Keidar (Yahoo, Technion)

Ivan Kelly (Midokura)

Sameer Paranjpye (Databricks)

Matthieu Morel (Skyscanner)

Igor Katkov (Atlassian)

Yonatan Gottesman (Yahoo Research)
Omid 101
9
Client Library + Runtime Service



Database Agnostic (can work with other backends)



Snapshot Isolation consistency 



Very Scalable (>380K peak tps) and Highly Available
Omid Programming Example
10
TransactionManager tm = HBaseTransactionManager.newInstance();

TTable txTable = new TTable("MY_TX_TABLE");



Transaction tx = tm.begin(); // Control path



Put row1 = new Put(Bytes.toBytes("EXAMPLE_ROW1"));

row1.add(family, qualifier, Bytes.toBytes("val1"));

txTable.put(tx, row1); // Data path



Put row2 = new Put(Bytes.toBytes("EXAMPLE_ROW2"));

row2.add(family, qualifier, Bytes.toBytes("val2")); 

txTable.put(tx, row2); // Data path



tm.commit(tx); // Control path
Snapshot Isolation (SI) Semantics
Distinct read (snapshot) and write (commit) points

No write-write conflicts allowed (illustrated in the sketch below)
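To make the second point concrete, a hedged sketch reusing the tm, txTable, family, and qualifier names from the programming example (it assumes commit signals a conflict by throwing Omid's RollbackException):

Transaction txA = tm.begin();
Transaction txB = tm.begin();                       // concurrent with txA

Put a = new Put(Bytes.toBytes("EXAMPLE_ROW1"));
a.add(family, qualifier, Bytes.toBytes("A"));
txTable.put(txA, a);                                // txA writes the row

Put b = new Put(Bytes.toBytes("EXAMPLE_ROW1"));
b.add(family, qualifier, Bytes.toBytes("B"));
txTable.put(txB, b);                                // txB writes the same row

tm.commit(txA);                                     // first committer wins
try {
    tm.commit(txB);
} catch (RollbackException e) {
    // write-write conflict with txA: txB is aborted, its tentative writes are discarded
}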
Tephra: Sibling Technology
12
Transaction Processing technology for HBase



SI Semantics. Design Similar to Omid1 



Apache Incubator since 2016



Integrated with Phoenix to provide ACID semantics (BETA)

Implements some Phoenix-specific scenarios
Phoenix-Omid Integration
13
Work in Progress under JIRA PHOENIX-3623



Backward Compatible – Configurable TP Provider Choice

Current Options: Tephra and Omid



How?

Internal Transaction Abstraction Layer (TAL) API

Multiple Implementations, Configurable Instantiation (see the sketch below)
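As an illustration only (these names are hypothetical, not the actual Phoenix TAL API), the idea is a narrow provider interface with one implementation per engine, chosen by configuration:

// Hypothetical sketch of a Transaction Abstraction Layer; not the real Phoenix interface
interface TransactionProvider {
    Object begin() throws Exception;            // returns a provider-specific transaction handle
    void commit(Object tx) throws Exception;    // may throw on a write-write conflict
    void abort(Object tx) throws Exception;
}

// One implementation per engine (e.g. an Omid-backed and a Tephra-backed provider),
// instantiated from configuration, e.g. a made-up key such as "txn.provider" = OMID | TEPHRA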
Transaction Processing, Refactored
14
[Diagram: before the refactoring, Phoenix calls the Tephra client directly; after it, Phoenix talks to the Transaction Abstraction Layer, which delegates to either a Tephra client or an Omid client]
How Omid Works
[Architecture diagram: the Client sends Begin/Commit requests to the Transaction Manager (TSO), which performs conflict detection and persists commit records to the Commit Table; the Client reads and writes the data tables directly and verifies commits against the Commit Table]

15

Lock-Free SI Implementation. Exploits Built-in MVCC.
Execution Example

[Diagram, tr = t1: on Begin, the Transaction Manager issues read timestamp t1 to the Client. The Client writes tentative versions (k1, v1, t1) and (k2, v2, t1) to the data table, and reads any other key k' at the last committed version t' < t1.]

16
Execution Example

[Diagram, tr = t1, tc = t2: on Commit, the Client sends Commit: t1, {k1, k2} to the Transaction Manager; after conflict detection the Transaction Manager assigns commit timestamp t2 and writes the record (t1, t2) to the Commit Table. The data table still holds the tentative versions (k1, v1, t1) and (k2, v2, t1).]

17
Execution Example

[Diagram, tr = t3: a later reader with snapshot t3 reads k1, finds the tentative version (k1, v1, t1), and must look up t1 in the Commit Table (Read(t1)) to learn whether and when it committed. This extra Commit Table lookup on every read of a tentative version is a bottleneck!]

18
Post-Commit Timestamp Replication

[Diagram, tr = t1, tc = t2: after committing, the Client updates the commit cells in the data table, turning (k1, v1, t1) and (k2, v2, t1) into (k1, v1, t1, t2) and (k2, v2, t1, t2), and then deletes the (t1, t2) entry from the Commit Table (Delete(t1)).]

19
Using Commit Cells

[Diagram, tr = t3: a reader with snapshot t3 now reads k1 and finds (k1, v1, t1, t2) with the commit timestamp already embedded, so it can decide visibility locally, with no Commit Table lookup.]

20
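A simplified, illustrative sketch of the read-side decision implied by slides 15 through 20 (the names here are not Omid's actual internals): use the commit cell when present, and fall back to the Commit Table otherwise.

// Illustrative visibility check for a version written at startTs, read under snapshot tr.
// commitTsFromCell is the commit cell value if present (null otherwise);
// commitTable.get(startTs) returns the commit timestamp, or null if still in flight or aborted.
boolean isVisible(long startTs, Long commitTsFromCell, long tr) {
    if (commitTsFromCell != null) {
        return commitTsFromCell < tr;              // fast path: no Commit Table lookup
    }
    Long commitTs = commitTable.get(startTs);      // tentative version: consult the Commit Table
    return commitTs != null && commitTs < tr;      // visible only if it committed before our snapshot
}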
Phoenix – New Scenarios for Omid
21
Secondary Indexes

On-the-Fly Index Creation

Atomic Updates

Query by Secondary Key



Extended Snapshot Isolation 

Read-Your-Own-Writes Queries
On-the-Fly Secondary Index Creation
22
CREATE INDEX (CI) in parallel with writes to the base table



How? Distinguish between the pre-CI and post-CI data



CREATE INDEX command issue time defines a timestamp

1. All data committed before snapshot: scanned, bulk-inserted into index 

2. All data generated after snapshot: triggers random update of index

3. All transactions in flight at snapshot time: aborted (FENCE); see the sketch below
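The three cases can be summarized in a small, purely illustrative classifier keyed off the fence timestamp (this is not Phoenix or Omid code):

// fenceTs: timestamp taken when CREATE INDEX is issued; commitTs == 0 means "not committed yet"
static String classifyForIndexCreation(long startTs, long commitTs, long fenceTs) {
    if (commitTs != 0 && commitTs < fenceTs)
        return "committed before the fence: bulk-inserted into the index by the CI scan";   // case 1
    if (startTs > fenceTs)
        return "started after the fence: index maintained on each write by the coprocessor"; // case 2
    return "in flight at the fence: aborted when it attempts to commit";                     // case 3
}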
Secondary Index: Creation and Maintenance
23
[Timeline diagram: T1 and T2 (committed before CREATE INDEX started) are bulk-inserted into the index; T3 (in flight when CREATE INDEX started) is aborted, enforced upon commit; T4, T5, and T6 (started after the fence, T4 during index creation and T5, T6 after CREATE INDEX completed) have their index entries added by a coprocessor, with the index update running as a stored procedure]
Extended Snapshot Isolation
24
CREATE TABLE T (ID INT);

BEGIN;

1: INSERT INTO T SELECT ID+10 FROM T;

2: INSERT INTO T SELECT ID+100 FROM T;

COMMIT;

Traditional SI: Read-Your-Writes



Challenge: Circular Dependency

(a statement that reads its own inserts keeps generating new rows, i.e., runs in an infinite loop)



Solution: Moving Snapshot

(a series of checkpoint snapshots: each statement reads as of the previous checkpoint, so it never sees its own writes, while the next statement does)
Moving Snapshot Implementation
25
[Diagram: the transaction advances through checkpoints; writes by Statement 1 are made at the checkpoint for Statement 1 and become visible only from the checkpoint for Statement 2 onward]

Timestamps allocated by TM in blocks.

Client promotes the checkpoint (see the sketch below).
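A small illustrative model of the moving snapshot (not the actual Omid API): the Transaction Manager hands out a block of timestamps, each statement writes at the current checkpoint, and a statement only sees versions strictly below its checkpoint, so it never reads its own inserts while the next statement does.

// Illustrative model only: timestamps come in a block allocated by the TM, starting at base
class MovingSnapshot {
    private long checkpoint;                     // current statement reads versions < checkpoint

    MovingSnapshot(long base) { this.checkpoint = base; }

    long writeTimestamp()  { return checkpoint; }     // this statement's writes land at the checkpoint
    void promote()         { checkpoint++; }          // client promotes before the next statement
    boolean visible(long versionTs) {
        return versionTs < checkpoint;                // own writes (== checkpoint) stay invisible
    }
}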
Omid Scalability
26
Extremely lean Client-Transaction Manager protocol

Omid1 and Tephra replicate the entire transaction state to the client upon BEGIN



Aggressive batching of writes to the Commit Table (CT) in the Transaction Manager (see the sketch after this list)



Concurrent conflict detection (experimental)



HA algorithm incurs zero overhead on the common (non-failover) path
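A rough illustration of the batching idea (CommitRecord, writeAll, and ackClient are made-up names, not Omid internals): many commit records share one Commit Table write, and each commit is acknowledged only after the batch containing it is durable.

// Illustrative batching of Commit Table writes inside the Transaction Manager
class CommitTableBatcher {
    private final java.util.List<CommitRecord> pending = new java.util.ArrayList<>();
    private final int maxBatch;

    CommitTableBatcher(int maxBatch) { this.maxBatch = maxBatch; }

    synchronized void add(CommitRecord record) {
        pending.add(record);
        if (pending.size() >= maxBatch) flush();
    }

    synchronized void flush() {
        commitTable.writeAll(pending);                   // one I/O for many commits
        for (CommitRecord r : pending) r.ackClient();    // reply only once the batch is durable
        pending.clear();
    }
}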
Throughput Benchmark

YCSB workload driver

12-core Transaction Manager

1G network

[Bar chart: throughput in tps * 10^3 (0 to 550) for Omid1, Omid1 Non-Durable, Omid, and Omid Non-Durable]
Overhead in Production: Web Search Indexing

[Stacked bar chart: task latency in ms (0 to 2500) for document inversion, duplicate detection, out-link processing, in-link processing, and stream-to-runtime tasks, broken down into Begin, Read, Compute, Update, and Commit + CT update phases]
Low-Latency Omid (Experimental)
29
Original Design: Throughput-Oriented Applications in Mind

Sometimes, this comes at the expense of latency 

Example: writes to Commit Table batched at the Transaction Manager



Key: Dissolve the Transaction Manager I/O Bottleneck

Distribute the Commit Table and the Writes to it



How? 

The client, rather than the TM, persists the Commit Timestamp (CTS)

CTS embedded in the first row written by the transaction (see the sketch below)
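A hedged sketch of that commit path (tso.commitRequest, COMMIT_FAMILY, and the helper names are illustrative, not the shipped API): the Transaction Manager still detects conflicts and allocates the commit timestamp, but the client persists it as a commit cell in the first row it wrote.

// Illustrative low-latency commit: the client, not the TM, persists the commit timestamp (CTS)
void lowLatencyCommit(Transaction tx) throws Exception {
    long cts = tso.commitRequest(startTimestampOf(tx), writeSetOf(tx));  // conflict check + CTS
    byte[] firstRow = firstRowWrittenBy(tx);
    Put commitCell = new Put(firstRow);                  // the commit record lives with the data
    commitCell.add(COMMIT_FAMILY, COMMIT_QUALIFIER, Bytes.toBytes(cts));
    dataTable.put(commitCell);                           // no write to a centralized Commit Table
    // post-commit: replicate the CTS into the remaining written rows, as in regular Omid
}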
Benchmark: Single-Write Transaction Workload
[Chart: latency in msec (0 to 80) vs. throughput in tps * 10^3 (0 to 300) for Omid and Low-Latency Omid]
Summary
31
Scalable, Highly Available Open Source Transaction Processing



Battle-Tested, Ready for Public Cloud



Integration with Apache Phoenix Underway (GA in 2017)
Thanks to Our Partners for Being Awesome

32
Backup

33
Architecture, Recapped
[Architecture diagram: the Client sends Begin/Commit to the Transaction Manager (TSO) and reads/writes the data tables directly; the TSO persists commit records to the Commit Table and the Client verifies commits against it. The single TSO is a Single Point of Failure (SPoF).]

34
HA: Primary-Backup Transaction Manager
[Diagram: a Primary and a Backup Transaction Manager (TSO); recovery state is kept in ZooKeeper; the Client and the Commit Table interact with whichever instance is currently Primary]

35
Split Brain
[Diagram: split brain, where both the old and the new Transaction Manager believe they are Primary; racing writes to the Commit Table can violate SI. Take I: fence the Commit Table upon every write (slow!)]

36
HA Algorithm – Key Ideas
37
Old and New Primaries may write conflicting commit records

No Locks!



Client detects inconsistencies, invalidates problematic records



Lease-Based Leader Election 

Optimization: Local lease check before/after writing to CT (see the sketch below)

Zero Overhead in Non-Recovery Scenarios
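For illustration only (lease, commitTable, and CommitRecord are made-up names), the local lease check brackets the Commit Table write so a demoted primary never acknowledges a commit that a new primary might not know about:

// Illustrative lease check around a Commit Table write on the primary TM
boolean persistCommitSafely(CommitRecord record) {
    if (!lease.stillValidLocally()) return false;   // cheap local check, no extra coordination
    commitTable.write(record);
    if (!lease.stillValidLocally()) return false;   // a new primary may have taken over meanwhile:
                                                    // do not ack; the record can be invalidated later
    return true;                                    // common path: zero extra I/O
}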
