1. Omid: Scalable and Highly Available
Transaction Processing for Phoenix
Ohad Shacham, Edward Bortnikov | PhoenixCon, Jun 13, 2017
2. Let’s Get Started …
Our Yahoo Journey with Transactions over HBase
Omid for Users: Semantics, API, Integration with Phoenix
Omid for Programmers: Architecture and Use Cases
Omid, Advanced: Scalability, HA, Low-Latency
3. Transaction Processing in NoSQL @Yahoo
Motivation: Data Pipelines (Search, Mail, etc.)
Stream Processing a Popular Pattern
Compute Tasks process Data Items that arrive in Real Time
Intermediate Artifacts stored in NoSQL (KV-)Storage
Extensive Use of Hadoop Technologies (Storm, HBase)
Scale: Thousands of Hadoop Nodes
4. Content Indexing for Search
[Diagram: content indexing pipeline; a crawl schedule drives Crawl, which feeds Docproc through a content queue, and Link Analysis processes a stream of links; tasks run on Storm over HBase]
5. Zooming in on Tasks
Document processing (one transaction: begin … commit)
Read page content from the store
Compute search index features
Update computed features
Link processing (one transaction: begin … commit)
Read outgoing links for a page
Update references for all linked-to pages
6. Transaction Processing: ACID 101
Multiple data accesses in a single logical operation
Atomic
“All or nothing” – no partial effect observable
Consistent
The DB transitions from one valid state to another
Isolated
Appear to execute in isolation
Durable
Committed data cannot disappear
7. Omid (امید)
2011: Incepted @Yahoo Research (“Omid1”)
2014: Large-Scale Deployment @Yahoo
2014/5: Major Re-Design for Scalability & HA (“Omid2”)
2016: Apache Incubator
2017: Prototype Integration with Phoenix
Transaction Processing Service for Apache HBase
8. Contributors
Ohad Shacham (Yahoo Research)
Francisco Perez Sorrosal (Yahoo)
Edward Bortnikov (Yahoo Research)
Eshcar Hillel (Yahoo Research)
Idit Keidar (Yahoo, Technion)
Ivan Kelly (Midokura)
Sameer Paranjpye (Databricks)
Matthieu Morel (Skyscanner)
Igor Katkov (Atlassian)
Yonatan Gottesman (Yahoo Research)
9. Omid 101
Client Library + Runtime Service
Database Agnostic (can work with other backends)
Snapshot Isolation consistency
Very Scalable (>380K peak tps) and Highly Available
10. Omid Programming Example
TransactionManager tm = HBaseTransactionManager.newInstance();
TTable txTable = new TTable("MY_TX_TABLE");
Transaction tx = tm.begin(); // Control path
Put row1 = new Put(Bytes.toBytes("EXAMPLE_ROW1"));
row1.add(family, qualifier, Bytes.toBytes("val1"));
txTable.put(tx, row1); // Data path
Put row2 = new Put(Bytes.toBytes("EXAMPLE_ROW2"));
row2.add(family, qualifier, Bytes.toBytes("val2"));
txTable.put(tx, row2); // Data path
tm.commit(tx); // Control path
11. Snapshot Isolation (SI) Semantics
Distinct read (snapshot) and write (commit) points
No write-write conflicts allowed
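The write-write conflict rule can be sketched in a few lines of toy Java (all names here are invented for illustration; in real Omid, the TSO performs this check centrally at commit time):

```java
import java.util.*;

// Toy snapshot-isolation conflict detector (illustrative only; not Omid code).
class ToySI {
    // lastCommit.get(k) = commit timestamp of the latest committed write to k
    private final Map<String, Long> lastCommit = new HashMap<>();
    private long clock = 0;

    // A transaction's read timestamp is the clock value at begin.
    long begin() { return clock; }

    // Returns a commit timestamp, or -1 if a write-write conflict forces an abort.
    long commit(long readTs, Set<String> writeSet) {
        for (String k : writeSet) {
            Long ts = lastCommit.get(k);
            if (ts != null && ts > readTs) return -1;  // k committed after our snapshot
        }
        long commitTs = ++clock;
        for (String k : writeSet) lastCommit.put(k, commitTs);
        return commitTs;
    }
}
```

Of two concurrent transactions writing the same key, the one that tries to commit second aborts; transactions on disjoint keys never conflict.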
12. Tephra: Sibling Technology
Transaction Processing technology for HBase
SI Semantics. Design Similar to Omid1
Apache Incubator since 2016
Integrated with Phoenix to provide ACID semantics (BETA)
Implements some Phoenix-specific scenarios
13. Phoenix-Omid Integration
Work in Progress under JIRA PHOENIX-3623
Backward Compatible – Configurable TP Provider Choice
Current Options: Tephra and Omid
How?
Internal Transaction Abstraction Layer (TAL) API
Multiple Implementations, Configurable Instantiation
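A minimal sketch of the TAL idea, with a factory keyed by a configuration string (interface and class names are invented for illustration; this is not the actual Phoenix API):

```java
// Hypothetical sketch of a Transaction Abstraction Layer (TAL): the engine codes
// against one interface, and the provider is chosen by configuration.
interface TalProvider {
    String name();
    String begin();   // returns a transaction handle (toy: a string)
}

class TephraProvider implements TalProvider {
    public String name() { return "tephra"; }
    public String begin() { return "tephra-tx"; }
}

class OmidProvider implements TalProvider {
    public String name() { return "omid"; }
    public String begin() { return "omid-tx"; }
}

class TalFactory {
    // Configurable instantiation: pick the provider named in the configuration.
    static TalProvider create(String configured) {
        switch (configured) {
            case "omid":   return new OmidProvider();
            case "tephra": return new TephraProvider();
            default: throw new IllegalArgumentException("unknown provider: " + configured);
        }
    }
}
```

Backward compatibility falls out of the design: switching providers changes only the configured name, not the calling code.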
15. How Omid Works
[Diagram: the Client sends Begin/Commit requests to the Transaction Manager (TSO), which runs conflict detection and persists commits to the Commit Table; the Client reads and writes the data tables directly and verifies commits against the Commit Table]
Lock-Free SI Implementation. Exploits Built-in MVCC.
17. Execution Example
[Diagram: a transaction with tr = t1 commits at tc = t2. The Client sends Commit(t1, {k1, k2}) to the Transaction Manager, which writes (t1, t2) to the Commit Table; the data table holds (k1, v1, t1) and (k2, v2, t1)]
18. Execution Example (continued)
[Diagram: a reader with tr = t3 issues Read(k1, t3), finds (k1, v1, t1), and must read (t1, t2) from the Commit Table to resolve t1's commit status; this Commit Table lookup is a bottleneck!]
19. Post-Commit Timestamp Replication
[Diagram: after commit (tr = t1, tc = t2), the commit timestamp is copied into commit cells next to the data, yielding (k1, v1, t1, t2) and (k2, v2, t1, t2), and the entry (t1, t2) is then removed from the Commit Table with Delete(t1)]
20. Using Commit Cells
[Diagram: a reader with tr = t3 issues Read(k1, t3) and finds (k1, v1, t1, t2); the commit cell supplies the commit timestamp t2 directly, so no Commit Table lookup is needed]
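The read path of slides 17-20 can be sketched as follows (simplified toy structures, not Omid's actual classes): the reader resolves each version's commit timestamp from its commit cell if present, falls back to the Commit Table otherwise, and returns the newest value committed before its snapshot.

```java
import java.util.*;

// Toy read path over Omid-style versioned cells (illustrative, not Omid code).
class ToyReader {
    static class Cell {
        final String value; final long writeTs; final Long commitTs; // null = no commit cell yet
        Cell(String value, long writeTs, Long commitTs) {
            this.value = value; this.writeTs = writeTs; this.commitTs = commitTs;
        }
    }

    // Returns the value visible at snapshot readTs, or null if none is visible.
    static String read(List<Cell> versions, Map<Long, Long> commitTable, long readTs) {
        String result = null;
        long best = -1;
        for (Cell c : versions) {
            // Resolve commit timestamp: commit cell first, Commit Table second.
            Long ct = (c.commitTs != null) ? c.commitTs : commitTable.get(c.writeTs);
            if (ct == null) continue;                  // still uncommitted: invisible
            if (ct < readTs && ct > best) { best = ct; result = c.value; }
        }
        return result;
    }
}
```

When the commit cell is already replicated, the Commit Table map can be empty and the read still resolves, which is exactly why post-commit replication removes the bottleneck.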
21. Phoenix – New Scenarios for Omid
Secondary Indexes
On-the-Fly Index Creation
Atomic Updates
Query by Secondary Key
Extended Snapshot Isolation
Read-Your-Own-Writes Queries
22. On-the-Fly Secondary Index Creation
CREATE INDEX (CI) in parallel with writes to the base table
How? Distinguish between the pre-CI and post-CI data
The CREATE INDEX command's issue time defines a snapshot timestamp
1. All data committed before snapshot: scanned, bulk-inserted into index
2. All data generated after snapshot: triggers random update of index
3. All transactions in flight at snapshot time: aborted (FENCE)
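The three cases can be sketched as a timestamp classification against the fence (toy code with invented names):

```java
// Toy classification of a transaction relative to a CREATE INDEX fence timestamp.
class ToyFence {
    static String classify(long beginTs, long commitTs, long fenceTs) {
        if (commitTs < fenceTs) return "bulk-insert";  // committed before the snapshot: scanned into the index
        if (beginTs > fenceTs)  return "coprocessor";  // started after the snapshot: updates the index inline
        return "abort";                                 // in flight at the fence: aborted upon commit
    }
}
```

Aborting the in-flight transactions is what makes the fence clean: every write lands in exactly one of the two index-population paths.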
23. Secondary Index: Creation and Maintenance
[Timeline: T1 and T2 commit before CREATE INDEX starts and are bulk-inserted into the index; T3, still in flight when CREATE INDEX starts, is aborted (enforced upon commit); T4 and T5, running while the index is built, have their entries added by a coprocessor; T6, after CREATE INDEX completes, updates the index via a stored procedure]
24. Extended Snapshot Isolation
CREATE TABLE T (ID INT);
BEGIN;
1: INSERT INTO T SELECT ID+10 FROM T;
2: INSERT INTO T SELECT ID+100 FROM T;
COMMIT;
Traditional SI: Read-Your-Writes
Challenge: Circular Dependency (statement in an infinite loop)
Solution: Moving Snapshot (a series of checkpoint snapshots)
25. Moving Snapshot Implementation
[Diagram: checkpoints for Statement 1 and Statement 2 on the timestamp axis, with the writes by Statement 1 falling between them]
Timestamps allocated by TM in blocks.
Client promotes the checkpoint.
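A toy run of the slide-24 statement sequence, assuming each statement's read snapshot is frozen at its checkpoint so that it sees earlier statements' writes but never its own (invented names, not Omid code):

```java
import java.util.*;

// Toy moving-snapshot execution of:  BEGIN;
//   1: INSERT INTO T SELECT ID+10  FROM T;
//   2: INSERT INTO T SELECT ID+100 FROM T;  COMMIT;
class ToyMovingSnapshot {
    static List<Long> run(List<Long> initial) {
        List<Long> table = new ArrayList<>(initial);
        for (long delta : new long[] {10, 100}) {
            // Checkpoint: freeze the statement's read snapshot *before* it writes,
            // so it never scans its own inserts (no infinite loop).
            List<Long> snapshot = new ArrayList<>(table);
            for (long id : snapshot) table.add(id + delta);
        }
        Collections.sort(table);
        return table;
    }
}
```

Starting from {1}, statement 1 sees {1} and adds 11; statement 2 sees {1, 11} and adds 101 and 111, matching read-your-writes across statements without self-reads within one.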
26. Omid Scalability
Extremely lean Client-Transaction Manager protocol
Omid1 and Tephra replicate the entire transaction state to the client upon BEGIN
Aggressive batching of writes to CT in Transaction Manager
Concurrent conflict detection (experimental)
HA algorithm incurs zero overhead in the common (non-failure) case
28. Overhead in Production: Web Search Indexing
[Chart: task latency (ms, 0 to 2500) for document inversion, duplicate detection, out-link processing, in-link processing, and stream-to-runtime tasks, broken down into Begin, Read, Compute, Update, and Commit + CT update phases]
29. Low-Latency Omid (Experimental)
Original Design: Throughput-Oriented Applications in Mind
Sometimes, this comes at the expense of latency
Example: writes to Commit Table batched at the Transaction Manager
Key: Dissolve the Transaction Manager I/O Bottleneck
Distribute the Commit Table and the Writes to it
How?
The client, rather than the TM, persists the Commit Timestamp (CTS)
CTS embedded in the first row written by the transaction
31. Summary
Scalable, Highly Available Open Source Transaction Processing
Battle-Tested, Ready for Public Cloud
Integration with Apache Phoenix Underway (GA in 2017)
37. HA Algorithm – Key Ideas
Old and New Primaries may write conflicting commit records
No Locks!
Client detects inconsistencies, invalidates problematic records
Lease-Based Leader Election
Optimization: Local lease check before/after writing to CT
Zero Overhead in Non-Recovery Scenarios
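The no-locks invalidation can be sketched with an atomic put-if-absent on the commit table (a toy single-process model with invented names; the real protocol would use an atomic check-and-mutate on the HBase-backed Commit Table):

```java
import java.util.*;

// Toy commit table with client-side invalidation (illustrative, not Omid code).
class ToyCommitTable {
    static final long INVALID = -1;
    private final Map<Long, Long> table = new HashMap<>();  // txStartTs -> commitTs

    // A TM persists a commit record unless a client already invalidated it.
    boolean tryCommit(long startTs, long commitTs) {
        return table.putIfAbsent(startTs, commitTs) == null;
    }

    // A client that suspects the old primary failed marks the record invalid,
    // unless a commit record already exists (atomic check-and-set, no locks).
    boolean invalidate(long startTs) {
        return table.putIfAbsent(startTs, INVALID) == null;
    }

    Long get(long startTs) { return table.get(startTs); }
}
```

Exactly one side wins the race: either the invalidation lands first and the stale primary's commit fails, or the commit record lands first and the client simply uses it.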