This talk will outline the Scylla implementation of Lightweight Transactions (LWT) that brings us to parity with Apache Cassandra. We will cover how to use it, what is working, and what is left to be done. We will also cover what other improvements are in store to improve Scylla's transactional capabilities and why it matters.
2. Presenter
Konstantin Osipov, Software Team Lead
Kostja is a well-known expert in the DBMS world, spending most
of his career developing open-source DBMS including Tarantool
and MySQL. At ScyllaDB his focus is transaction support and
synchronous replication.
7. CQL avoids slow reads
> UPDATE employees SET join_date = '2018-05-19' WHERE
firstname = 'John' AND lastname = 'Doe';
> SELECT * FROM employees ...;
 firstname | lastname | join_date
-----------+----------+------------
      John |      Doe | 2018-05-19
8. CQL conditional statement
> UPDATE employees SET join_date = '2018-05-19' WHERE
firstname = 'John' AND lastname = 'Doe'
IF join_date != null;
[applied]
-----------
False
9. What statements can be conditional?
Any INSERT, UPDATE or DELETE can have an IF clause:
> UPDATE employees SET join_date = … IF EXISTS;
> INSERT INTO bookings (id, item, client, quantity) VALUES
(…) IF NOT EXISTS;
> UPDATE inventory SET state = 'Used' WHERE itemid = ?
IF state = 'Unused' AND check = 'Passed';
> DELETE FROM tasks WHERE project_id = ? AND task_id = ?
IF task['state'] IN ('Complete', 'Abandoned');
11. Conditional batches
> BEGIN BATCH
> UPDATE tasks SET n_abandoned = 0 WHERE project_id = 1
> IF n_abandoned > 0
> DELETE FROM tasks WHERE project_id = 1
> AND state = 'Abandoned'
> APPLY BATCH;
 [applied] | project_id | state     | task_id | n_abandoned
-----------+------------+-----------+---------+-------------
      True |          1 | Abandoned |     693 |           2
12. Consistency considerations
■ New consistency command:
SERIAL CONSISTENCY [SERIAL|LOCAL_SERIAL]
■ Eventual CONSISTENCY is still used
■ Consistency settings can be combined to reduce LWT latency
13. IF is the new WHERE?
                                           WHERE   IF
Relation expressions >, <, >=, <=, ==, !=  Yes     Yes
IN condition                               Yes     Yes
Collection element subscript, a['key']     Yes     Yes
UDT member subscript, a.key                Yes     No
Uses secondary index for search            Yes     No
TOKEN(), LIKE, UDF                         Yes     No
14. What you CAN’T DO
■ Use counter data type ⛔
■ Access multiple partitions ⛔
■ Supply custom TIMESTAMP ⛔
■ Use UNLOGGED ⛔
15. Differences with Cassandra
Difference                           Workaround
Per-core partitioning                Use a shard-aware driver for optimal performance
Scylla always provides a result set  No need
No Thrift support                    Don't use Thrift
Hints are not used                   No need
34. Caveats
Issue                          Remedy
4 round trips are very costly  Optimize propose and read rounds
Contention/starvation          Implement Paxos leases
Uncertainty on timeout         Improved diagnostics
system.paxos state             Account for in capacity planning
Hi, my name is Konstantin Osipov, and I am working on lightweight transaction support in Scylla.
I've been involved with databases for nearly two decades, most notably MySQL, where I worked on prepared statements, stored procedures, foreign key constraints, and metadata locking, and Tarantool, an in-memory database, where I served ~9 years as lead engineer and CTO.
This talk is about lightweight transaction support in Scylla, and since this is a much wished-for feature, many of you have burning questions like "is it there?" and "how can I get it?" - which I'll answer first.
It is there, in Scylla trunk and you can download it at https://hub.docker.com/r/scylladb/scylla-nightly/tags.
It is going to be included in the upcoming 3.2 release, which is planned for later this year. The implementation is nearly fully compatible with Cassandra, so those of you who are familiar with Cassandra perhaps now have sufficient information to skip this talk and get a coffee and/or a cigarette instead. Enjoy.
Those of you who are interested in the inner workings of LWT, how to best use it, benchmarks, caveats, and future work, please stay on.
And I am here to learn too - about your LWT usage patterns, wishes, and pet peeves.
I will structure the talk as follows.
* We'll start by looking at the LWT feature: its syntax, semantics, strengths and weaknesses
* We'll continue with a few benchmarks and a discussion of how to use the feature optimally, including what metrics we provide to monitor its usage
* We'll look at Scylla architecture background and possible approaches to LWT implementation
* Then we'll study the implementation, which is based on an infamously difficult yet very elegant and minimal distributed algorithm (a distributed consensus protocol) called Paxos
* We'll end by discussing the state of Paxos implementation in Scylla: why it is marked with --experimental, what we plan to do before we remove the mark, and what we plan to do after
If you're familiar with Scylla data modification language, you know that a modification statement never reports back whether it actually changes any rows.
This property follows from two design choices in Scylla:
- using log-structured merge trees for storage, which are significantly more efficient for write-heavy work than for reads. You can safely assume that a cold read is 10-100x more expensive than a write, even on an SSD device.
- accepting client-supplied timestamps as "transaction" identifiers: even if Scylla performed a read of the existing value before applying a change, the end result might still differ, because a similar transaction on the same key is allowed to proceed on a different node without any coordination, and even a later transaction may supply an earlier timestamp and thus retroactively change history.
As an example that commonly tricks SQL users adopting CQL, the following UPDATE statement always succeeds:
UPDATE employees SET join_date = '2010-04-28'
WHERE firstname = 'John' AND lastname = 'Doe'
- you'd better know what you're doing, because if John Doe was not employed before this statement, he surely will be after.
Well, I guess this is not always what you need. Sometimes you just need a scalable and reliable database that can provide the classical transactional consistency model for at least some of your updates - if John Doe is not employed, he should not be hired by an update.
Since the WHERE clause is already taken, a new IF clause was added to express this intent (mirroring the earlier slide example):
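UPDATE employees SET join_date = '2010-04-28'
WHERE firstname = 'John' AND lastname = 'Doe'
IF join_date != null;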
Now the statement does what it is supposed to do and will *not* accidentally hire our friend John Doe.
But what else can you do with LWT?
The IF clause is available for all existing data modification statements: INSERT, UPDATE and DELETE. If you just wish to check that a certain row exists or doesn't exist, you can write IF EXISTS or IF NOT EXISTS:
INSERT INTO bookings (id, item, client, quantity) VALUES (...) IF NOT EXISTS
or you could provide a collection of predicates on different row cells:
UPDATE inventory SET state = 'Used' WHERE itemid = ? IF
state = 'Unused' AND check = 'Passed'
- all such changes will be consistent and durable.
You can also query individual cells, or collection elements, use IN and relation operators, such as <, >, >=, <=, ==, !=.
A popular design pattern with lightweight transactions is having a registry for critical information (AKA process or state metadata), for example a task-worker assignment table, alongside an eventually consistent table with the actual data:
INSERT INTO tasks (task_id, task) VALUES (1002, { ... });
INSERT INTO tasks_assigned (task_id, worker_id)
VALUES (1001, 'west-1')
IF NOT EXISTS; -- Only take the task if it is not taken
UPDATE tasks_assigned
SET worker_id = 'west-2'
WHERE task_id = 1001
IF worker_id = 'west-1'; -- Atomically reassign the task from a failed worker
In addition to a single statement, it is possible to combine multiple conditional statements into a batch. A batch can have non-conditional statements as well, but all statements of such a batch may span only a single partition.
This is useful when it is desired to update multiple rows in a partition or atomically erase all or a range of rows in it.
If any statement in a batch has conditions, the entire batch is considered "conditional": it is applied atomically if and only if *all* conditions of all statements in the batch evaluate to TRUE.
LWT batches are very similar to multi-statement transactions in relational databases, since they provide multi-row read consistency, durability and isolation. Yes, with atomic batches in Scylla clients don't see partial changes, as the entire partition mutation is applied all or nothing.
The only difference from real transactions is that the batch logic cannot "branch": there is only one ELSE branch, and it is "do nothing".
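For illustration, here is a minimal sketch of an all-or-nothing transfer, assuming a hypothetical accounts table with PRIMARY KEY ((bank), id), so both rows share one partition:
BEGIN BATCH
UPDATE accounts SET balance = 50 WHERE bank = 'B' AND id = 1 IF balance = 100;
UPDATE accounts SET balance = 150 WHERE bank = 'B' AND id = 2 IF balance = 100;
APPLY BATCH;
-- Applied only if *both* balances are still 100; otherwise neither row changes.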
If you wish to avoid an extra learn round, set CONSISTENCY to ANY and SERIAL CONSISTENCY to SERIAL.
If you wish to have transactional semantics within the current DC and asynchronously apply the mutation to the remote DC, you can use LOCAL_SERIAL serial consistency and QUORUM eventual consistency.
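In cqlsh, the two combinations above look like this (a sketch of session-level settings):
-- Avoid the extra learn round:
SERIAL CONSISTENCY SERIAL
CONSISTENCY ANY
-- Transactional semantics within the local DC, asynchronous elsewhere:
SERIAL CONSISTENCY LOCAL_SERIAL
CONSISTENCY QUORUM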
One may think that the IF clause is a new WHERE - and this is true to a large extent: both accept expressions and are applied to the searched row.
Unlike the WHERE clause, IF conditions never use a secondary index - the rows are fetched before the condition is evaluated.
An IF condition applies only to a fully qualified row, i.e. you still must specify the partition key, and in many cases the clustering key: in the WHERE clause for DELETE and UPDATE, or in the column list and VALUES clause for INSERT.
If your restrictions yield multiple rows, your IF condition cannot be ambiguous, i.e. it cannot evaluate to TRUE for one row and to FALSE for another. In practice this means that for statements restricting only the partition key and not the clustering key, or the partition key and multiple clustering keys (pk = ? AND ck IN (?, ?, ?)), only conditions on static cells are accepted.
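To make this concrete, here is a sketch with a hypothetical schema in which owner is a static cell shared by all rows of a partition:
CREATE TABLE tasks_by_project (
    project_id int,
    task_id int,
    owner text STATIC,
    state text,
    PRIMARY KEY (project_id, task_id)
);
-- Restricts the partition key and multiple clustering keys, so per the rule
-- above only a condition on the static cell is unambiguous:
UPDATE tasks_by_project SET state = 'Abandoned'
WHERE project_id = 1 AND task_id IN (1, 2, 3)
IF owner = 'west-1';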
A current limitation, which we plan to lift, is that not all predicates are available in conditions: LIKE, TOKEN() and user-defined functions are not supported.
Finally, beware of null semantics for collection values. For a frozen collection, null is a stored value, i.e. it is distinct from an absent value and is treated accordingly in relations. For a non-frozen collection, != null and = null return the same result for stored null values and absent data.
There is no reason for this other than Cassandra compatibility.
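A sketch to illustrate, with a hypothetical table carrying one frozen and one non-frozen collection column:
CREATE TABLE profiles (
    id int PRIMARY KEY,
    tags_frozen frozen<set<text>>,
    tags set<text>,
    active boolean
);
-- Frozen: null is a stored value, distinct from an absent cell.
UPDATE profiles SET active = true WHERE id = 0 IF tags_frozen != null;
-- Non-frozen: = null and != null treat a stored null and absent data the same.
UPDATE profiles SET active = true WHERE id = 0 IF tags != null;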
Now to what you can't do with LWT. You can't:
* use LWT with counter tables,
* use LWT with statements that span multiple partitions, batched or not,
* use user-supplied timestamps: guaranteeing consistency requires that the timestamp be assigned by the transaction coordinator,
* mix conditional and non-conditional statements on the same data and expect the conditional statements to remain consistent.
You can actually use non-conditional statements on some cells of a row and conditional ones on the others - but in practice this is hardly useful, since eventually you'll have to work with the entire row, e.g. insert or delete it, and the two will conflict. It is better to split such an object into a "transactional" part and an "eventually consistent" part and store them in two different tables.
Other limitations are relatively minor (see the examples after this list):
* while a non-LWT batch can be UNLOGGED, a conditional batch cannot,
* IF conditions must be a pure conjunction (... AND ... AND ...),
* while UPDATE is actually an upsert, UPDATE ... IF NOT EXISTS is not allowed, since it doesn't make any sense when read as English rather than as CQL.
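A few concrete examples of statements that are rejected, assuming a hypothetical table t (pk int PRIMARY KEY, v int):
UPDATE t USING TIMESTAMP 1234 SET v = 1 WHERE pk = 0 IF v = 0;  -- custom timestamp with a condition
UPDATE t SET v = 1 WHERE pk = 0 IF v = 0 OR v = 2;              -- not a pure conjunction
UPDATE t SET v = 1 WHERE pk = 0 IF NOT EXISTS;                  -- UPDATE ... IF NOT EXISTS
BEGIN UNLOGGED BATCH
UPDATE t SET v = 1 WHERE pk = 0 IF v = 0;                       -- conditional UNLOGGED batch
APPLY BATCH;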
Scylla makes an effort to be compatible with Cassandra, down to the level of the limitations of the implementation. So how is it different?
unlike Cassandra, we use per-core data partitioning, so the RPC performed for a transaction talks directly to the right core on a peer replica, avoiding concurrency overhead. That is, of course, only true if a shard-aware driver is used - otherwise we add an extra hop to the right core at the coordinator node.
just like the first implementation of LWT in Cassandra, we do not store hints for lightweight transaction writes. Cassandra later added hint support, while we have no plans for it, since the hints appear to be redundant.
Unlike Cassandra, Scylla doesn't have LWT support in the Thrift protocol and doesn't plan to add it.
conditional statements return a result set, and unlike Cassandra, Scylla returns the result set metadata to the client at prepare time if a statement has conditions. While the columns of the result set are the same as in Cassandra, Scylla always returns the old version of the row, so as not to confuse the driver, whereas Cassandra returns the result set only if the statement is applied.
Let's illustrate this:
(go back to the batch statement example)
Remember that an i3.2xlarge is considered a small node for Scylla.
* A new label {conditional="yes"|"no"} allows separate accounting of statements with and without conditions.
* A batch is accounted as conditional if it has at least one statement with conditions.
* All statements of a batch are accounted to cql_statements_in_batches, and to cql_inserts, cql_deletes or cql_updates with the label {conditional="yes"|"no"}, depending on whether the batch is conditional or not.
Serial reads are exported under scylla_storage_proxy_coordinator_cas_read_*, conditional writes under scylla_storage_proxy_coordinator_cas_write_*:
* latency – latency histogram
* timeouts – number of timeout errors
* unavailable – number of failed attempts to form a Paxos quorum
* unfinished_commit – number of Paxos rounds finished by the next request
* condition_not_met – number of CAS failures due to a failed IF condition (writes only)
* contention – histogram showing how many requests were retried internally due to contention
What to look out for: timeouts, growing latency, contention, unfinished commits, conditions not met - all of these indicate that something is wrong with your application.
This screenshot is taken from our Grafana monitoring while running the benchmark.
We plan to add these metrics to our standard dashboards: https://github.com/scylladb/scylla-monitoring/issues/775
Let's take a look at the Scylla implementation - understanding the internal workings of the code helps identify the limits of applying this feature in your projects. It will also let us reason about the next steps for strong consistency in Scylla.
Scylla is a shared-nothing system with no central authority or repository of knowledge. Each node owns a fraction of the data, called a token range, and all nodes form a mesh to deliver the database service.
To avoid uneven distribution of data, the consistent hash ring contains not nodes, but vnodes - virtual node identifiers - with each node owning multiple vnodes.
One important way in which Scylla differs from Cassandra is its partitioning scheme, in which each token range owned by a node is sub-partitioned into hundreds of sub-ranges, so that every CPU core solely owns its own subset of the data. This allows for very little coordination between the cores of a single node - just as there is very little coordination between the nodes of the entire cluster.
So Scylla adds an extra slicing layer that splits vnodes into per-shard chunks called cnodes.
For each token range of a cnode, its peers, or secondary replicas, are selected as a product of the hash function, so each transaction ultimately involves a unique set of peers.
This approach works very well for building a scalable, fault-tolerant system that minimizes hot spots and reduces impact of a single node failure.
Yet it creates tens if not hundreds of thousands of replication "groups" - *distinct* sets of peers participating in a given transaction.
Distributed systems theory offers two broad families of algorithms for peer coordination: with a designated leader, which may change once in a while to provide high availability, and leaderless - or, in fact, selecting a leader independently for every transaction.
Some of you will have already recognized that I am speaking in very broad terms about the Raft vs Paxos families of algorithms.
Thanks to the Scylla approach to data partitioning, using a leader-based algorithm would require maintaining replication state for every distinct replication group, which means a lot of additional runtime state and a lot of implementation complexity to manage, especially when the number of nodes, or of cores on a node, changes and many replication groups are re-formed.
A leaderless algorithm trades the need to maintain extra state for an extra negotiation round to select a leader for each transaction.
Since this approach allowed us to shorten the time to market, we settled on it first, somewhat reassured that Cassandra uses the same technique.
So what is Paxos and how does it work?
Paxos was invented as an algorithm for achieving consensus on a single value over unreliable communication channels. Many parts of the algorithm are left to implementers, so it can be tailored to solving the problem of database replication.
In Scylla, the algorithm participants are replicas responsible for a given partition key. When a client suggests a change to the key (any modification statement can be represented as a partition mutation), a coordinator node acting on the client's behalf ensures that the majority of replicas holding the key accept the change. Any node in the cluster can be a coordinator for some change.
This is done in two steps: first, a majority of the replicas responsible for the key promise the coordinator to accept the change if the coordinator decides to make it. This step is necessary to make sure that no two concurrent coordinators "split" the history, with some replicas accepting changes from one coordinator and others from another. Essentially, it temporarily locks out other changes and allows them to happen one at a time. After the coordinator receives a majority of promises, it proposes the change. If the change is accepted by a majority, the algorithm has made progress.
Please note that this illustration assumes a shard-aware driver, with the first replica acting as the coordinator and successfully sending and acknowledging all messages.
In addition to the two steps mandated by the protocol, Scylla has to retrieve the old row to check conditions. And once a proposal is accepted, and the coordinator knows it has been accepted (it got responses from a majority), another round is performed to make sure the change is applied to the base table on each replica.
Overall, this makes up to 4 rounds per transaction, excluding retries and repairs.
The algorithm uses a system table, called system.paxos, to store its state. The table is replica-local, i.e. it is not partitioned but contains its own data on each replica. The table's primary key is a blob capable of storing the partition key of any user table. This ensures that any Paxos round can find a designated, unique slot in the system table to store its state.
Once a round is over, the state can be cleared or overwritten - the table has a TTL attached to it, to ensure old rounds expire.
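As a rough sketch, the table looks approximately like this (modeled on Cassandra's system.paxos; Scylla's actual column set may differ in detail):
CREATE TABLE system.paxos (
    row_key blob,                    -- serialized partition key of the user table
    cf_id uuid,                      -- identifies the user table
    promise timeuuid,                -- highest ballot promised (assumed column name)
    proposal_ballot timeuuid,        -- ballot of the accepted proposal
    proposal blob,                   -- the proposed mutation
    most_recent_commit_at timeuuid,  -- ballot of the last committed mutation
    most_recent_commit blob,         -- the last committed mutation
    PRIMARY KEY (row_key, cf_id)
);
-- Rows are written with a TTL so that the state of old rounds expires.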
While a node acting as a coordinator is leading the effort in achieving resolution, other nodes are free to do the same and may even hijack the efforts of their peers.
In particular, all coordinators share the responsibility of carrying out an unfinished round when they encounter it. This makes Paxos resilient against failures such as machine crashes and network outages. It does, however, lead to contention under load, since it can be difficult to distinguish a round that has an active coordinator pushing it to completion from a round that was abandoned because the coordinator that started it failed.
One can already guess from this brief sketch of the implementation that achieving consensus using Paxos is a saga, and many things may break along the way.
Let's try to look critically at the Scylla implementation and summarize its current flaws, so that users are aware of them:
the main issue, of course, is that the protocol is very expensive: 4 times more expensive than a regular write in terms of network latency, and a hundred times in terms of I/O, since it incurs a read of the old row. By any measure the network latency dominates the I/O costs, but those should not be discounted either: fetching whole pages of an LSM tree can saturate I/O bandwidth well before the network bandwidth limit is reached. Some of the protocol RPCs could be collapsed, and work in Scylla has begun to this end.
the second largest issue is the high contention overhead when multiple coordinators attempt to work on the same key. The contention is innate in the liveness property of Paxos: when two coordinators have a row over concurrently changing the same key, each needs to guess whether the other coordinator is still alive, which may be difficult and costly. So when they encounter contention, they back off and wait for a random interval, which introduces exponentially growing delays as the key becomes hotter and hotter. It should be noted that research has advanced enough to provide industry-grade solutions to the problem (Paxos leases), and our team is looking into applying them.
there are circumstances in which the client cannot reliably know whether a value was applied or not. In one infamous Cassandra bug, a user complains about getting a timeout exception from a query that actually succeeds - and the timeout is returned before it has even expired!
Let's consider a case when a coordinator attempts to perform a change but another coordinator hijacks its partially completed change. Has the change been applied? Maybe, but the coordinator has no time to find out - it has to return "timeout" state to the client, which in turn has to figure out the outcome itself.
In other words, it is possible to have a write operation report a failure to the client, but still actually persist the write to a replica.
While this situation is also quite possible with any other kind of update, it is aggravated by the contended and prolonged nature of Paxos - so any user of LWT has to take it seriously. The Cassandra 4.0 release delivers better diagnostics of this "unknown" state, so that the client can make a more educated decision about how to proceed (and we plan to do the same).
The system.paxos table state is extra temporary state that DBAs must take into account when sizing their deployments. It can store up to 3 hours of in-progress transactions on each node, which can be quite hefty under high load.
We intend to address or otherwise mitigate these issues before the feature becomes generally available. Meanwhile, please bear in mind the high cost of lightweight transactions when designing your applications and use them sparingly, i.e. avoid using them for all your data. As already mentioned, a good design pattern is to use LWT for the control plane of your application while the data plane remains eventually consistent.
As you may have sensed, I'm not actually very happy with many of these issues, and I somewhat regret that we had to inherit some of them from Cassandra to preserve compatibility.
The good news is that Scylla is not just a Cassandra clone: CQL is the first front-end to its fantastic massively parallel database technology, the DynamoDB-compatible API is the second, and others are quite likely to appear.
We plan to continue our efforts to introduce leader-based synchronous replication to Scylla, which is now the prevalent trend in the industry.
To do it right, Scylla will need to change its data partitioning scheme to ensure more data locality, and also bring down the number of replication groups in the cluster from tens of thousands to hundreds (we still need to keep the number of groups reasonably high to ensure the workload is spread evenly).
To avoid making our existing users perform painful migrations, we will begin by using a new partitioning and data replication scheme for new tables created with these options enabled.
For such tables we will always mandate a server-assigned timestamp as the transaction identifier.
One advantage of this approach is that it will make all CQL statements, not just conditional ones, strongly consistent. Ensuring isolation will not require a read of the old row or multiple network round trips, so it will come at a much lower cost.
This is not an official commitment but the current state of mind of some key people on the engineering team.