This talk will outline the Scylla implementation of Lightweight Transactions (LWT) that brings us to parity with Apache Cassandra. We will cover how to use it, what is working, and what is left to be done. We will also cover what other improvements are in store to improve Scylla's transactional capabilities and why it matters.
2. Presenter
Konstantin Osipov, Software Team Lead
Kostja is a well-known expert in the DBMS world, spending most
of his career developing open-source DBMS including Tarantool
and MySQL. At ScyllaDB his focus is transaction support and
synchronous replication.
7. CQL avoids slow reads
> UPDATE employees SET join_date = '2018-05-19' WHERE
firstname = 'John' AND lastname = 'Doe';
> SELECT * FROM employees ...;
 firstname | lastname | join_date
-----------+----------+------------
      John |      Doe | 2018-05-19
8. CQL conditional statement
> UPDATE employees SET join_date = '2018-05-19' WHERE
firstname = 'John' AND lastname = 'Doe'
IF join_date != null;
[applied]
-----------
False
9. What statements can be conditional?
Any INSERT, UPDATE or DELETE can have an IF clause:
> UPDATE employees SET join_date = … IF EXISTS;
> INSERT INTO bookings (id, item, client, quantity) VALUES
(…) IF NOT EXISTS;
> UPDATE inventory SET state = 'Used' WHERE itemid = ?
IF state = 'Unused' AND check = 'Passed';
> DELETE FROM tasks WHERE project_id = ? AND task_id = ?
IF task['state'] IN ('Complete', 'Abandoned');
11. Conditional batches
> BEGIN BATCH
> UPDATE tasks SET n_abandoned = 0 WHERE project_id = 1
> IF n_abandoned > 0
> DELETE FROM tasks WHERE project_id = 1
> AND state = 'Abandoned'
> APPLY BATCH;
 [applied] | project_id | state     | task_id | n_abandoned
-----------+------------+-----------+---------+-------------
      True |          1 | Abandoned |     693 |           2
12. Consistency considerations
■ New consistency command:
SERIAL CONSISTENCY [SERIAL|LOCAL_SERIAL]
■ Eventual CONSISTENCY is still used
■ Consistency settings can be combined to reduce LWT latency
13. IF is the new WHERE?
                                           WHERE   IF
Relation expressions >, <, >=, <=, ==, !=  Yes     Yes
IN condition                               Yes     Yes
Collection element subscript, a['key']     Yes     Yes
UDT member subscript, a.key                Yes     No
Uses secondary index for search            Yes     No
TOKEN(), LIKE, UDF                         Yes     No
14. What you CAN’T DO
■ Use counter data type ⛔
■ Access multiple partitions ⛔
■ Supply custom TIMESTAMP ⛔
■ Use UNLOGGED ⛔
15. Differences with Cassandra
Difference                           Workaround
Per-core partitioning                Use a shard-aware driver for optimal performance
Scylla always provides a result set  No need
No Thrift support                    Don't use Thrift
Hints are not used                   No need
34. Caveats
Issue                          Remedy
4 round trips are very costly  Optimize propose and read rounds
Contention/starvation          Implement Paxos leases
Uncertainty on timeout         Improved diagnostics
system.paxos state             Account for in capacity planning
Hi, my name is Konstantin Osipov, and I am working on lightweight transaction support in Scylla.
I've been involved with databases for nearly two decades, most notably MySQL, where I worked on prepared statements, stored procedures, foreign key constraints, and metadata locking, and Tarantool, an in-memory database, where I served ~9 years as lead engineer and CTO.
This talk is about lightweight transaction support in Scylla, and since this is a much wished-for feature, many of you have burning questions like "is it there?" and "how can I get it?" - which I'll answer first.
It is there, in Scylla trunk and you can download it at https://hub.docker.com/r/scylladb/scylla-nightly/tags.
It is going to be included in the upcoming 3.2 release, which is planned for later this year. The implementation is nearly fully compatible with Cassandra, so those of you who are familiar with Cassandra perhaps now have sufficient information to skip this talk and get a coffee and/or a cigarette instead. Enjoy.
Those of you who are interested in the inner workings of LWT, how to best use it, benchmarks, caveats, and future work, please stay on.
And I am here to learn too - about your LWT usage patterns, wishes, and pet peeves.
I will structure the talk as follows.
* We'll start by looking at the LWT feature: its syntax, semantics, strengths and weaknesses
* We'll continue with a few benchmarks and a discussion of how to use the feature optimally, including what metrics we provide to monitor its usage
* We'll look at Scylla architecture background and possible approaches to LWT implementation
* Then we'll study the implementation, which is based on an infamously difficult yet very elegant and minimal distributed algorithm (a distributed consensus protocol) called Paxos
* We'll end by discussing the state of Paxos implementation in Scylla: why it is marked with --experimental, what we plan to do before we remove the mark, and what we plan to do after
If you're familiar with Scylla data modification language, you know that a modification statement never reports back whether it actually changes any rows.
This property follows from two design choices in Scylla:
- using log-structured merge trees for storage, which are significantly more efficient for write-heavy work than for reads. You can safely assume that a cold read is 10-100x more expensive than a write, even on an SSD device.
- accepting client-supplied timestamps as "transaction" identifiers: even if Scylla performed a read of the existing value before applying a change, the end result might still differ, because a similar transaction on the same key is allowed to proceed on a different node without any coordination, and even a later transaction may supply an earlier timestamp and thus retroactively change history.
As an example that commonly tricks SQL users adopting CQL, the following UPDATE statement always succeeds:
UPDATE employees SET join_date = '2010-04-28'
WHERE firstname = 'John' AND lastname = 'Doe'
- you'd better know what you're doing, because if John Doe was not employed before this statement, he surely will be after.
Well, I guess this is not always what you need. Sometimes you just need a scalable and reliable database that can provide the classical transactional consistency model for at least some of your updates - if John Doe is not employed, he should not be hired by an update.
Since the WHERE clause is already taken, a new IF clause was added to express this intent (mirroring the earlier slide example):
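UPDATE employees SET join_date = '2010-04-28'
WHERE firstname = 'John' AND lastname = 'Doe'
IF join_date != null;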
Now the statement does what it is supposed to do and will *not* accidentally hire our friend John Doe.
But what else can you do with LWT?
The IF clause is available for all existing data modification statements: INSERT, UPDATE and DELETE. If you just wish to check that a certain row exists or doesn't exist, you can write IF EXISTS or IF NOT EXISTS:
INSERT INTO bookings (id, item, client, quantity) VALUES (...) IF NOT EXISTS
or you could provide a collection of predicates on different row cells:
UPDATE inventory SET state = 'Used' WHERE itemid = ? IF
state = 'Unused' AND check = 'Passed'
- all such changes will be consistent and durable.
You can also query individual cells, or collection elements, use IN and relation operators, such as <, >, >=, <=, ==, !=.
A popular design pattern with lightweight transactions is having a registry for critical information (AKA process or state metadata), for example a task-worker assignment table, alongside an eventually consistent table with the actual data:
INSERT INTO tasks (task_id, task) VALUES (1002, { ... });
INSERT INTO tasks_assigned (task_id, worker_id)
VALUES (1001, 'west-1')
IF NOT EXISTS; -- Only take the task if it is not taken
UPDATE tasks_assigned
SET worker_id = 'west-2'
WHERE task_id = 1001
IF worker_id = 'west-1'; -- Atomically reassign the task from a failed worker
In addition to a single statement, it is possible to combine multiple conditional statements into a batch. A batch can have non-conditional statements as well, but all statements of such a batch may span only a single partition.
This is useful when it is desired to update multiple rows in a partition or atomically erase all or a range of rows in it.
If any statement in a batch has conditions, the entire batch is considered "conditional": it is applied atomically if and only if *all* conditions of all statements in the batch evaluate to TRUE.
LWT batches are very similar to multi-statement transactions in relational databases, since they provide multi-row read consistency, durability and isolation. Yes, with atomic batches in Scylla clients don't see partial changes, as the entire partition mutation is applied all or nothing.
The only difference from real transactions is that the batch logic cannot "branch": there is only one ELSE branch, and it is "do nothing".
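For illustration, here is a minimal sketch of an all-or-nothing transfer, assuming a hypothetical accounts table with PRIMARY KEY ((bank), id), so both rows share one partition:
BEGIN BATCH
UPDATE accounts SET balance = 50 WHERE bank = 'B' AND id = 1 IF balance = 100;
UPDATE accounts SET balance = 150 WHERE bank = 'B' AND id = 2 IF balance = 100;
APPLY BATCH;
-- Applied only if *both* balances are still 100; otherwise neither row changes.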
If you wish to avoid an extra learn round, set CONSISTENCY to ANY and SERIAL CONSISTENCY to SERIAL.
If you wish to have transactional semantics within the current DC and asynchronously apply the mutation to the remote DC, you can use LOCAL_SERIAL serial consistency and QUORUM eventual consistency.
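In cqlsh, the two combinations above look like this (a sketch of session-level settings):
-- Avoid the extra learn round:
SERIAL CONSISTENCY SERIAL
CONSISTENCY ANY
-- Transactional semantics within the local DC, asynchronous elsewhere:
SERIAL CONSISTENCY LOCAL_SERIAL
CONSISTENCY QUORUM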
One may think that the IF clause is a new WHERE - and this is true to a large extent: both accept expressions and are applied to the searched row.
Unlike the WHERE clause, IF conditions never use a secondary index - the rows are fetched before the condition is evaluated.
An IF condition applies only to a fully qualified row, i.e. you still must specify the partition key, and in many cases the clustering key: in the WHERE clause for DELETE and UPDATE, or in the column list and VALUES clause for INSERT.
If your restrictions yield multiple rows, your IF condition cannot be ambiguous, i.e. it cannot evaluate to TRUE for one row and to FALSE for another. In practice this means that for statements restricting only the partition key and not the clustering key, or the partition key and multiple clustering keys (pk = ? AND ck IN (?, ?, ?)), only conditions on static cells are accepted.
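To make this concrete, here is a sketch with a hypothetical schema in which owner is a static cell shared by all rows of a partition:
CREATE TABLE tasks_by_project (
    project_id int,
    task_id int,
    owner text STATIC,
    state text,
    PRIMARY KEY (project_id, task_id)
);
-- Restricts the partition key and multiple clustering keys, so per the rule
-- above only a condition on the static cell is unambiguous:
UPDATE tasks_by_project SET state = 'Abandoned'
WHERE project_id = 1 AND task_id IN (1, 2, 3)
IF owner = 'west-1';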
A current limitation, which we plan to lift, is that not all predicates are available in conditions: LIKE, TOKEN() and user-defined functions are not supported.
Finally, beware of null semantics for collection values. For a frozen collection, null is a stored value, i.e. it is distinct from an absent value and is treated accordingly in relations. For a non-frozen collection, != null and = null return the same result for stored null values and absent data.
There is no reason for this other than Cassandra compatibility.
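A sketch to illustrate, with a hypothetical table carrying one frozen and one non-frozen collection column:
CREATE TABLE profiles (
    id int PRIMARY KEY,
    tags_frozen frozen<set<text>>,
    tags set<text>,
    active boolean
);
-- Frozen: null is a stored value, distinct from an absent cell.
UPDATE profiles SET active = true WHERE id = 0 IF tags_frozen != null;
-- Non-frozen: = null and != null treat a stored null and absent data the same.
UPDATE profiles SET active = true WHERE id = 0 IF tags != null;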
Now to what you can't do with LWT. You can't:
* use LWT with counter tables,
* use LWT with statements that span multiple partitions, batched or not,
* use user-supplied timestamps: guaranteeing consistency requires that the timestamp be assigned by the transaction coordinator,
* mix conditional and non-conditional statements on the same data and expect the conditional statements to remain consistent.
You can actually use non-conditional statements on some cells of a row and conditional ones on the others - but in practice this is hardly useful, since eventually you'll have to work with the entire row, e.g. insert or delete it, and the two will conflict. It is better to split such an object into a "transactional" part and an "eventually consistent" part and store them in two different tables.
Other limitations are relatively minor (see the examples after this list):
* while a non-LWT batch can be UNLOGGED, a conditional batch cannot,
* IF conditions must be a pure conjunction (... AND ... AND ...),
* while UPDATE is actually an upsert, UPDATE ... IF NOT EXISTS is not allowed, since it doesn't make any sense when read as English rather than as CQL.
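A few concrete examples of statements that are rejected, assuming a hypothetical table t (pk int PRIMARY KEY, v int):
UPDATE t USING TIMESTAMP 1234 SET v = 1 WHERE pk = 0 IF v = 0;  -- custom timestamp with a condition
UPDATE t SET v = 1 WHERE pk = 0 IF v = 0 OR v = 2;              -- not a pure conjunction
UPDATE t SET v = 1 WHERE pk = 0 IF NOT EXISTS;                  -- UPDATE ... IF NOT EXISTS
BEGIN UNLOGGED BATCH
UPDATE t SET v = 1 WHERE pk = 0 IF v = 0;                       -- conditional UNLOGGED batch
APPLY BATCH;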
Scylla makes an effort to be compatible with Cassandra, down to the level of the limitations of the implementation. So how is it different?
unlike Cassandra, we use per-core data partitioning, so the RPC performed for a transaction talks directly to the right core on a peer replica, avoiding concurrency overhead. That is, of course, only true if a shard-aware driver is used - otherwise we add an extra hop to the right core at the coordinator node.
just like the first implementation of LWT in Cassandra, we do not store hints for lightweight transaction writes. Cassandra later added hint support, while we have no plans for it, since the hints appear to be redundant.
Unlike Cassandra, Scylla doesn't have LWT support in the Thrift protocol and doesn't plan to add it.
conditional statements return a result set, and unlike Cassandra, Scylla returns the result set metadata to the client at prepare time if a statement has conditions. While the columns of the result set are the same as in Cassandra, Scylla always returns the old version of the row, so as not to confuse the driver, whereas Cassandra returns the result set only if the statement is applied.
Let's illustrate this:
(go back to the batch statement example)
Remember that an i3.2xlarge is considered a small node for Scylla.
* A new label {conditional="yes"|"no"} allows separate accounting of statements with and without conditions.
* A batch is accounted as conditional if it has at least one statement with conditions.
* All statements of a batch are accounted to cql_statements_in_batches, and to cql_inserts, cql_deletes or cql_updates with the label {conditional="yes"|"no"}, depending on whether the batch is conditional or not.
Serial reads are exported under scylla_storage_proxy_coordinator_cas_read_*, conditional writes under scylla_storage_proxy_coordinator_cas_write_*:
* latency – latency histogram
* timeouts – number of timeout errors
* unavailable – number of failed attempts to form a Paxos quorum
* unfinished_commit – number of Paxos rounds finished by the next request
* condition_not_met – number of CAS failures due to a failed IF condition (writes only)
* contention – histogram showing how many requests were retried internally due to contention
What to look out for: timeouts, growing latency, contention, unfinished commits, conditions not met - all of these indicate that something is wrong with your application.
This screenshot is taken from our Grafana monitoring while running the benchmark.
We plan to add these metrics to our standard dashboards: https://github.com/scylladb/scylla-monitoring/issues/775
Let's take a look at the Scylla implementation - understanding the internal workings of the code helps identify the limits of applying this feature in your projects. It will also let us reason about the next steps for strong consistency in Scylla.
Scylla is a shared-nothing system with no central authority or repository of knowledge. Each node owns a fraction of the data, called a token range, and all nodes form a mesh to deliver the database service.
To avoid uneven distribution of data, the consistent hash ring contains not nodes, but vnodes - virtual node identifiers - with each node owning multiple vnodes.
One important way in which Scylla differs from Cassandra is its partitioning scheme, in which each token range owned by a node is sub-partitioned into hundreds of sub-ranges, so that every CPU core solely owns its own subset of the data. This allows for very little coordination between the cores of a single node - just as there is very little coordination between the nodes of the entire cluster.
So Scylla adds an extra slicing layer that splits vnodes into per-shard chunks called cnodes.
For each token range of a cnode, its peers, or secondary replicas, are selected as a product of the hash function, so each transaction ultimately involves a unique set of peers.
This approach works very well for building a scalable, fault-tolerant system that minimizes hot spots and reduces impact of a single node failure.
Yet it creates tens if not hundreds of thousands of replication "groups" - *distinct* sets of peers participating in a given transaction.
Distributed systems theory offers two broad families of algorithms for peer coordination: with a designated leader, which may change once in a while to provide high availability, and leaderless - or, in fact, selecting a leader independently for every transaction.
Some of you will have already recognized that I am speaking in very broad terms about the Raft vs Paxos families of algorithms.
Thanks to the Scylla approach to data partitioning, using a leader-based algorithm would require maintaining replication state for every distinct replication group, which means a lot of additional runtime state and a lot of implementation complexity to manage, especially when the number of nodes, or of cores on a node, changes and many replication groups are re-formed.
A leaderless algorithm trades the need to maintain extra state for an extra negotiation round to select a leader for each transaction.
Since this approach allowed us to shorten the time to market, we settled on it first, somewhat reassured that Cassandra uses the same technique.
So what is Paxos and how does it work?
Paxos was invented as an algorithm for achieving consensus on a single value over unreliable communication channels. Many parts of the algorithm are left to implementers, so it can be tailored to solving the problem of database replication.
In Scylla, the algorithm participants are replicas responsible for a given partition key. When a client suggests a change to the key (any modification statement can be represented as a partition mutation), a coordinator node acting on the client's behalf ensures that the majority of replicas holding the key accept the change. Any node in the cluster can be a coordinator for some change.
This is done in two steps: first, a majority of the replicas responsible for the key promise the coordinator to accept the change if the coordinator decides to make it. This step is necessary to make sure that no two concurrent coordinators "split" the history, with some replicas accepting changes from one coordinator and others from another. Essentially, it temporarily locks out other changes and allows them to happen one at a time. After the coordinator receives a majority of promises, it proposes the change. If the change is accepted by a majority, the algorithm has made progress.
Please note that this illustration assumes a shard-aware driver, with the first replica acting as the coordinator and successfully sending and acknowledging all messages.
In addition to the two steps mandated by the protocol, Scylla has to retrieve the old row to check conditions. And once a proposal is accepted, and the coordinator knows it has been accepted (it got responses from a majority), another round is performed to make sure the change is applied to the base table on each replica.
Overall, this makes up to 4 rounds per transaction, excluding retries and repairs.
The algorithm uses a system table, called system.paxos, to store its state. The table is replica-local, i.e. it is not partitioned but contains its own data on each replica. The table's primary key is a blob capable of storing the partition key of any user table. This ensures that any Paxos round can find a designated, unique slot in the system table to store its state.
Once a round is over, the state can be cleared or overwritten - the table has a TTL attached to it, to ensure old rounds expire.
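As a rough sketch, the table looks approximately like this (modeled on Cassandra's system.paxos; Scylla's actual column set may differ in detail):
CREATE TABLE system.paxos (
    row_key blob,                    -- serialized partition key of the user table
    cf_id uuid,                      -- identifies the user table
    promise timeuuid,                -- highest ballot promised (assumed column name)
    proposal_ballot timeuuid,        -- ballot of the accepted proposal
    proposal blob,                   -- the proposed mutation
    most_recent_commit_at timeuuid,  -- ballot of the last committed mutation
    most_recent_commit blob,         -- the last committed mutation
    PRIMARY KEY (row_key, cf_id)
);
-- Rows are written with a TTL so that the state of old rounds expires.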
While a node acting as a coordinator is leading the effort in achieving resolution, other nodes are free to do the same and may even hijack the efforts of their peers.
In particular, all coordinators share the responsibility of carrying out an unfinished round when they encounter it. This makes Paxos resilient against failures such as machine crashes and network outages. It does, however, lead to contention under load, since it can be difficult to distinguish a round that has an active coordinator pushing it to completion from a round that was abandoned because the coordinator that started it failed.
One can already guess from this brief sketch of the implementation that achieving consensus using Paxos is a saga, and many things may break along the way.
Let's try to look critically at the Scylla implementation and summarize its current flaws, so that users are aware of them:
the main issue, of course, is that the protocol is very expensive: 4 times more expensive than a regular write in terms of network latency, and a hundred times in terms of I/O, since it incurs a read of the old row. By any measure the network latency dominates the I/O costs, but those should not be discounted either: fetching whole pages of an LSM tree can saturate I/O bandwidth well before the network bandwidth limit is reached. Some of the protocol RPCs could be collapsed, and work in Scylla has begun to this end.
the second largest issue is the high contention overhead when multiple coordinators attempt to work on the same key. The contention is innate in the liveness property of Paxos: when two coordinators have a row over concurrently changing the same key, each needs to guess whether the other coordinator is still alive, which may be difficult and costly. So when they encounter contention, they back off and wait for a random interval, which introduces exponentially growing delays as the key becomes hotter and hotter. It should be noted that research has advanced enough to provide industry-grade solutions to the problem (Paxos leases), and our team is looking into applying them.
there are circumstances in which the client cannot reliably know whether a value was applied or not. In one infamous Cassandra bug, a user complains about getting a timeout exception from a query that actually succeeds - and the timeout is returned before it has even expired!
Let's consider a case when a coordinator attempts to perform a change but another coordinator hijacks its partially completed change. Has the change been applied? Maybe, but the coordinator has no time to find out - it has to return "timeout" state to the client, which in turn has to figure out the outcome itself.
In other words, it is possible to have a write operation report a failure to the client, but still actually persist the write to a replica.
While this situation is also quite possible with any other kind of update, it is aggravated by the contended and prolonged nature of Paxos - so any user of LWT has to take it seriously. The Cassandra 4.0 release delivers better diagnostics of this "unknown" state, so that the client can make a more educated decision about how to proceed (and we plan to do the same).
The system.paxos table state is extra temporary state that DBAs must take into account when sizing their deployments. It can store up to 3 hours of in-progress transactions on each node, which can be quite hefty under high load.
We intend to address or otherwise mitigate these issues before the feature becomes generally available. Meanwhile, please bear in mind the high cost of lightweight transactions when designing your applications and use them sparingly, i.e. avoid using them for all your data. As already mentioned, a good design pattern is to use LWT for the control plane of your application while the data plane remains eventually consistent.
As you may have sensed, I'm not actually very happy with many of these issues, and I somewhat regret that we had to inherit some of them from Cassandra to preserve compatibility.
The good news is that Scylla is not just a Cassandra clone: CQL is the first front-end to its fantastic massively parallel database technology, the DynamoDB-compatible API is the second, and others are quite likely to appear.
We plan to continue our efforts to introduce leader-based synchronous replication to Scylla, which is now the prevalent trend in the industry.
To do it right, Scylla will need to change its data partitioning scheme to ensure more data locality, and also bring down the number of replication groups in the cluster from tens of thousands to hundreds (we still need to keep the number of groups reasonably high to ensure the workload is spread evenly).
To avoid making our existing users perform painful migrations, we will begin by using a new partitioning and data replication scheme for new tables created with these options enabled.
For such tables we will always mandate a server-assigned timestamp as the transaction identifier.
One advantage of this approach is that it will make all CQL statements, not just conditional ones, strongly consistent. Ensuring isolation will not require a read of the old row or multiple network round trips, so it will come at a much lower cost.
This is not an official commitment but the current state of mind of some key people on the engineering team.