Eventually, Scylla Chooses Consistency

Editor's Notes

  • #2 Hi! This talk is about Raft in Scylla - our effort to improve a lot of existing Cassandra functionality and add new strongly consistent features.
  • #3 I’m Konstantin Osipov; I live in Moscow and work on open source databases. In Scylla I’ve been involved with the implementation of lightweight transactions.
  • #4 Before discussing Raft, let’s recap the items we delivered recently. Back at Scylla Summit 2019 we announced support for Cassandra lightweight transactions. Lightweight transactions allow all clients to agree on the state of the database before making a change to it. Prior to that, Scylla lacked any strongly consistent features. We made a considerable effort testing LWT, and just recently completed industry-standard Jepsen testing for it.
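    To make the compare-and-set semantics concrete, here is a minimal sketch of a conditional write using the Python driver (which also works with Scylla); the contact point, keyspace and users table are assumptions for illustration, not part of the talk:

        from cassandra.cluster import Cluster

        # Hypothetical cluster and schema, used only for illustration.
        cluster = Cluster(['127.0.0.1'])
        session = cluster.connect('ks')

        # A conditional insert: replicas first agree (via Paxos) on whether the
        # row already exists, and only then is the write applied.
        result = session.execute(
            "INSERT INTO users (id, name) VALUES (%s, %s) IF NOT EXISTS",
            (1, 'alice'))
        row = result.one()
        print(row[0])   # the first column, [applied], is True only if the insert won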
  • #5 In Scylla, LWT is based on the Paxos consensus algorithm. Paxos is a leaderless protocol in which each participant stores little state. This was an advantage: to be compatible with Cassandra, Scylla needed each partition to be independently available. Paxos runs 3 rounds of network messages to commit each transaction. This is one round trip fewer than Cassandra, but still more than necessary in the optimal case. An important property of LWT is that it works over existing tables and alongside eventually consistent operations. If LWT is not used, the overhead on the rest of the operations is zero. This gain comes at a fairly high implementation cost. We mentioned at the 2019 Summit that Scylla is committed to providing an optimized implementation of strongly consistent reads and writes based on Raft. In this talk I will discuss our progress with Raft and what else we’re going to improve using it.
  • #6 So what is Raft anyway? It is a leader-based log replication protocol. A very crude explanation of what Raft does: it elects a leader once, and then the leader is responsible for making all the decisions about the state of the database. This helps avoid extra communication between replicas during individual reads and writes. Each node maintains the state of who the current leader is and forwards requests to the leader. Scylla clients are unaffected - except that the leader does more work than the replicas, so the load distribution may be less even. This means Scylla will need to run multiple Raft instances side by side.
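    As a minimal sketch of the forwarding just described (an illustration, not Scylla code): every node remembers who the current leader is and forwards writes to it, so clients can keep talking to any node.

        class Node:
            def __init__(self, name, nodes):
                self.name = name
                self.nodes = nodes      # shared name -> Node cluster view
                self.leader = None      # learned from elections/heartbeats
                self.log = []

            def handle_write(self, command):
                if self.leader != self.name:
                    # Not the leader: forward; the client is unaffected.
                    return self.nodes[self.leader].handle_write(command)
                # The leader decides and replicates (replication omitted here).
                self.log.append(command)
                return 'accepted by ' + self.name

        nodes = {}
        for name in ('a', 'b', 'c'):
            nodes[name] = Node(name, nodes)
        for n in nodes.values():
            n.leader = 'a'              # assume node 'a' won the election
        print(nodes['c'].handle_write('UPDATE ...'))   # forwarded to 'a'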
  • #7 Raft is built around the notion of a replicated log. When the leader receives a request, it first stores an entry for it in its log. Then it pushes the entry to the replicas’ copies of the log. Once the majority of replicas store the entry, the leader applies the entry and instructs the replicas to do the same. In the event of leader failure, the replica with the most up-to-date log becomes the leader.
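    The commit rule can be shown with a toy model (a sketch for illustration, not Scylla’s implementation): an entry is applied only once a majority of the copies of the log contain it.

        class Replica:
            def __init__(self):
                self.log = []
                self.applied = []

            def append(self, entry):
                # A follower simply stores what the leader ships.
                self.log.append(entry)
                return True

        class Leader(Replica):
            def __init__(self, followers):
                super().__init__()
                self.followers = followers

            def replicate(self, entry):
                self.log.append(entry)
                acks = 1                       # the leader's own copy counts
                for f in self.followers:
                    if f.append(entry):
                        acks += 1
                majority = (len(self.followers) + 1) // 2 + 1
                if acks >= majority:
                    self.applied.append(entry)          # apply on the leader
                    for f in self.followers:
                        f.applied.append(entry)         # tell replicas to apply
                    return True
                return False

        leader = Leader([Replica(), Replica()])
        print(leader.replicate('UPDATE ...'))   # True: 3 of 3 copies stored it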
  • #8 Raft defines not only how the group makes a decision, but also the protocol for adding new members to the group and removing existing ones. This lays a solid foundation for Scylla topology changes: they translate naturally to Raft configuration changes, assuming there is a Raft group spanning all of the nodes, and no longer need a proprietary protocol.
  • #9 Schema changes translate to simply storing a command in the global Raft log and then applying the change on each node which has a copy of the log.
  • #10 Because of the additional state (the current leader) stored at each peer, it’s not as straightforward to apply Raft to Scylla data manipulation statements. Maintaining a separate leader for each partition would be just too much overhead, considering that individual partition updates may be rare. This is why Scylla, alongside Raft, is working on a new partitioner, which would reduce the total number of partitions - while still keeping the number high enough to guarantee even distribution of data and work - and would allow balancing data between partitions more flexibly. For each such partition, called a tablet, Scylla will run its own instance of the Raft algorithm. In the rest of the talk I will discuss these three applications of Raft in more detail.
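    A rough sketch of the routing this implies (the structure and numbers are assumptions, not the actual design): tablets are contiguous token ranges, and each tablet owns its own Raft group.

        import bisect

        class RaftGroup:
            def __init__(self, replicas):
                self.replicas = replicas
                self.log = []

            def propose(self, command):
                # Leader election and replication omitted in this sketch.
                self.log.append(command)
                return True

        tablet_bounds = [100, 200, 300]     # upper bounds of the first 3 tablets
        tablet_groups = [RaftGroup(['n1', 'n2', 'n3']) for _ in range(4)]

        def group_for_token(token):
            return tablet_groups[bisect.bisect_left(tablet_bounds, token)]

        group_for_token(150).propose('INSERT ...')   # lands in the second tablet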
  • #11 Let’s begin with the subject of topology changes and discuss how Raft could be used to improve them.
  • #12 Presently, topology changes in Scylla are eventually consistent. Let’s use node addition as an example. A node wishing to join the cluster advertises itself to the rest of the members through Gossip. For those of you not familiar with the way Gossip works, it’s a great protocol for distributing infrequently changing information at low cost. It’s very commonly used for failure detection: healthy clusters enjoy the low network overhead induced by a failure detector, and the state of a faulty node spreads across the cluster reasonably quickly - a few to several seconds would be a typical interval. Knowing that Gossip is not too fast, the joining node waits (by default) 30 seconds to let the news spread. Nodes begin forwarding relevant updates to the new node once they are aware of it. With updates coming in, the node can start data rebalancing. Node removal or decommission works similarly, except the node spreading the rumour (aka the change coordinator) is not necessarily the same node the rumour is about (just as we are used to in real life). This poses some challenges. The actions performed by the change coordinator are unilateral, and assume the operator avoids starting a conflicting change concurrently. The joining node will proceed after the 30-second interval even if one of the nodes in the cluster is down and did not get the news about the new member. Such nodes, once they are back online, will continue serving queries using the old topology until Gossip messages reach them. A repair will then be necessary to restore the configured data replication factor. If a joining node dies mid-way, the data ranges it added will remain in the cluster topology and the operator will need to clean them up manually before proceeding with the next change. Since the procedure relies on a fairly slow vehicle to spread the information, it’s hard to split into multiple steps. When we at Scylla discuss how to add multiple nodes concurrently, we consider breaking a single topology change into smaller, persistent and resumable steps, such as first adding an empty node, then assigning it some data ranges, then actually moving these ranges. Having to wait 30 seconds for each step to settle in through Gossip is not very practical.
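    The join flow just described, reduced to schematic code (a simplification for illustration; apart from the 30-second default, the names and structure are assumptions):

        import time

        RING_DELAY_SECONDS = 30              # default Gossip settle time

        class Gossiper:
            def advertise(self, tokens):
                print('advertising tokens', tokens, 'to live members')

        class JoiningNode:
            tokens = [17, 42]
            def stream_ranges(self):
                print('streaming data for', self.tokens)

        def join_cluster(node, gossiper, settle=RING_DELAY_SECONDS):
            gossiper.advertise(node.tokens)  # spread the rumour
            time.sleep(settle)               # hope every member heard it by now
            node.stream_ranges()             # members that heard forward writes
            # Members that were down during the wait keep using the old topology
            # until Gossip reaches them - hence the need for a repair afterwards.

        join_cluster(JoiningNode(), Gossiper(), settle=0)   # settle=0 for the demo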
  • #13 Raft handles these challenges by including topology changes (called configuration changes there) in the protocol core. This part of the Raft protocol is also widely adopted and has undergone extensive scrutiny, so it should naturally be preferred to Scylla’s proprietary solution inherited from Cassandra. The way Raft treats topology changes is similar to the way it handles standard strongly consistent reads and writes. A topology change is done by appending two records to the distributed Raft log. The first record introduces the new topology to the cluster. After the first record is appended to the leader’s log, and until the log with this record is shipped to the majority of nodes, the cluster takes the new topology (e.g. a new node) into account for all writes, but doesn’t abandon the old topology yet - it’s also used for all reads and writes. Once the majority of replicas have got the information about the new topology, the leader adds the second record to the log. This informs the replicas that it’s now safe to discard the old topology and fully switch to the new one. This two-step procedure ensures that no two parts of the cluster operate in two different configurations - worst case, some nodes may still be using the joint topology and the old one, or the joint topology and the new one, both of which are safe, but never only the old and only the new topology. With Raft, Scylla topology changes can be split into multiple steps: first, add the new node to the global Raft group configuration, using the procedure just described; then, commit a record to token_metadata with the new node’s token - this will be linearizable with all other topology changes; then, stream ranges to the added node, updating the state of each range as it is streamed. Since all the steps are linearized through the Raft log, it is now possible to permit concurrent topology changes, as long as they don’t conflict. The only conceivable downside is that if the majority of the cluster nodes are down, it may not be possible to perform topology changes at all. Scylla will need to provide an emergency instrument to recover clusters that are so significantly degraded. One possible solution would be directly editing topology information on the remaining nodes, to let them continue in the state that remains.
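    A condensed sketch of the two-record scheme (an illustration of joint consensus, not the real implementation): while the joint configuration is in effect, a decision needs a majority of both the old and the new member sets.

        def majority(config, acks):
            return len(acks & config) > len(config) // 2

        def can_commit(acks, old_cfg, new_cfg, joint):
            if joint:
                # Record 1 appended: both configurations must agree.
                return majority(old_cfg, acks) and majority(new_cfg, acks)
            # Record 2 committed: the old configuration is discarded.
            return majority(new_cfg, acks)

        old_cfg = {'a', 'b', 'c'}
        new_cfg = {'a', 'b', 'c', 'd'}      # adding node 'd'

        print(can_commit({'a', 'b', 'd'}, old_cfg, new_cfg, joint=True))    # True
        print(can_commit({'a', 'd'}, old_cfg, new_cfg, joint=True))         # False
        print(can_commit({'a', 'b', 'd'}, old_cfg, new_cfg, joint=False))   # True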
  • #14 Schema changes are operations such as creating and dropping keyspaces, tables, user-defined types or functions. If they go through Raft, they also benefit from linearizability.
  • #15 Currently, schema changes in Scylla are eventually consistent. Each Scylla node has its own copy of the schema. Requests to change the schema are validated against the local copy and then applied locally. A new object can be used - e.g. data added to a new table - immediately following the change, before any other cluster node knows about it. There is no coordination between changes at different nodes, and any node is free to propose a change. The change is eventually propagated to the rest of the cluster. Last-timestamp-wins is used to resolve conflicts if two changes to the same object happen concurrently.
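    A toy illustration of the last-timestamp-wins merge (the names and timestamps are invented for the example): whichever change to the same object carries the later timestamp wins, regardless of intent.

        def merge_schema(local, incoming):
            # Each entry: object name -> (definition, mutation timestamp)
            merged = dict(local)
            for name, (definition, ts) in incoming.items():
                if name not in merged or ts > merged[name][1]:
                    merged[name] = (definition, ts)
            return merged

        node_a = {'ks.t': ('CREATE TABLE t (pk int PRIMARY KEY, v int)', 1000)}
        node_b = {'ks.t': ('CREATE TABLE t (pk int PRIMARY KEY, v text)', 1005)}

        print(merge_schema(node_a, node_b)['ks.t'][0])   # node_b's change wins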
  • #16 Data manipulation is aware of possible schema inconsistency. Each request carries a schema version with it. Scylla is able to execute requests with a divergent schema history, and will fetch a particular schema version just to execute a request. This guarantees that schema changes are fully available in the presence of network failures. It has some downsides as well. It is possible to submit changes that conflict: e.g. define a table based on a UDT, and drop that UDT. New features, such as triggers, stored functions and UDFs, aggravate the consistency problem.
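    A sketch of executing a request that carries a schema version (the names here are hypothetical, not Scylla’s API): if the version is not known locally, it is fetched before the request runs.

        class SchemaRegistry:
            def __init__(self):
                self.versions = {}          # version id -> table definition

            def get(self, version, fetch_remote):
                if version not in self.versions:
                    # Divergent history: pull exactly the version the request needs.
                    self.versions[version] = fetch_remote(version)
                return self.versions[version]

        def fetch_from_coordinator(version):
            return {'name': 'ks.t', 'columns': ['pk', 'v'], 'version': version}

        registry = SchemaRegistry()
        schema = registry.get('a1b2-c3d4', fetch_from_coordinator)
        print(schema['columns'])    # the request runs against this exact version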
  • #17 After switching schema changes to Raft, any node will still be able to propose a change. However, the change will now be forwarded to the Raft leader, where it will be validated against the latest version of the schema. Then, the leader will persist it in a global Raft log, replicated to all nodes of the cluster. Once the majority of replicas confirm persisting their copies of the log, the change will be applied on all replicas. With this approach, all schema changes will form a linear history, and divergent/conflicting changes will be impossible. This should open the way to complex but safe dependencies between schema objects, i.e. triggers, constraints or functional indexes. A replica which was down while the cluster has been performing schema changes will catch up with them on boot, by streaming the entire history of changes from the leader. There is also a downside: it will no longer be possible to perform a schema change if the majority of the cluster is unreachable or down. It is still possible that a node gets a request for a schema version it has not seen yet, and will need to fetch the schema for it. For older schemas we will maintain a version history. For newer schemas, we will need to make sure that the history can be fetched from any node, not just the leader. https://docs.google.com/presentation/d/1ZazssA802_bUHcJKy7yPUbiVby8acFxbebf-VbmXRDk/edit#slide=id.ga3bc8bcbea_0_131
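    An outline of the flow above as a sketch (assumed names, greatly simplified; not Scylla’s code): validate on the leader, append to the global log, apply everywhere once a majority has the entry.

        class SchemaStateMachine:
            def __init__(self):
                self.objects = {}

            def validate(self, change):
                # Checked against the latest schema, on the leader only.
                if change['op'] == 'create' and change['name'] in self.objects:
                    raise ValueError('already exists')

            def apply(self, change):
                if change['op'] == 'create':
                    self.objects[change['name']] = change['definition']
                elif change['op'] == 'drop':
                    self.objects.pop(change['name'], None)

        def propose_schema_change(leader_sm, raft_log, replicas, change):
            leader_sm.validate(change)            # 1. validate on the leader
            raft_log.append(change)               # 2. persist in the global log
            acks = 1 + len(replicas)              #    (assume all replicas stored it)
            if acks > (len(replicas) + 1) // 2:   # 3. majority confirmed
                leader_sm.apply(change)
                for sm in replicas:
                    sm.apply(change)              # 4. applied on every replica

        leader = SchemaStateMachine()
        followers = [SchemaStateMachine(), SchemaStateMachine()]
        propose_schema_change(leader, [], followers,
                              {'op': 'create', 'name': 'ks.t', 'definition': '...'})
        print(followers[0].objects)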
  • #18 Finally, the ultimate feature enabled by Raft is fast & efficient yet strongly consistent tables. A tablet is a database’s unit of data distribution and load balancing, a term first introduced in the Google BigTable paper from 2006. Let’s see how tablets work.
  • #19 Today, Scylla’s partitioning strategy is not pluggable. Compare with the replication strategy: you can change how many replicas a keyspace has, and where these replicas are located. You can also use QUORUM/LOCAL_QUORUM and SERIAL/LOCAL_SERIAL to work efficiently in a cross-DC setup. The Scylla partitioner is not like that: all you can choose is what makes up the partition key. The key is always hashed to a token, and the token is mapped to a replica set/shard. Thanks to hashing and the use of vnodes (tokens), the data is evenly distributed across the cluster. Most write and read scenarios produce an even load on all nodes of the cluster. Hotspots, while possible, are unlikely. Unfortunately, one size still cannot fit all. Using the same partitioner for all tables can be rather a hindrance if there are a lot of small tables which are frequently scanned. Frequent range scans also require an extra step of merging streams produced by multiple nodes. Certain partitions tend to get hot no matter how good the choice of the partition key is. https://docs.google.com/document/d/1flYRliD-VXNlrdPR2IT_rswXRW_55CySlXnEcw7qqtY/edit#heading=h.ly4c9p67vgne https://docs.google.com/presentation/d/1Pm1hIGza4RcSEzlV_bRSYv9AmUyAGRv6cuNmVuEmt9g/edit#slide=id.g51b14e1223_0_432
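    The fixed pipeline described here, reduced to a sketch (the hash function and ring layout are simplifications, not Scylla’s murmur3 token ring):

        import hashlib

        ring = sorted([(0, 'n1'), (2**63, 'n2'), (3 * 2**62, 'n3')])

        def token_of(partition_key):
            # The partition key is always hashed to a token.
            digest = hashlib.md5(partition_key.encode()).digest()
            return int.from_bytes(digest[:8], 'big')

        def replica_of(partition_key):
            token = token_of(partition_key)
            for boundary, node in reversed(ring):
                if token >= boundary:
                    return node             # the token is mapped to a replica
            return ring[-1][1]

        print(replica_of('user:42'))   # evenly spread, but not tunable per table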
  • #20 So in Scylla we would like to make the partitioning strategy a user choice, like the replication factor is today. If a user chooses tablet partitioning, Scylla will store small tables using only a few tablets. Large tablets will be automatically split, and small tablets coalesced, if necessary. Other databases that support range-based partitioners include MongoDB, Couchbase, Cockroach… https://docs.google.com/document/d/1flYRliD-VXNlrdPR2IT_rswXRW_55CySlXnEcw7qqtY/edit#heading=h.ly4c9p67vgne https://docs.google.com/presentation/d/1Pm1hIGza4RcSEzlV_bRSYv9AmUyAGRv6cuNmVuEmt9g/edit#slide=id.g51b14e1223_0_432
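    A schematic sketch of the split/coalesce behaviour this enables (the thresholds and data structure are assumptions for illustration only):

        SPLIT_BYTES = 10 * 1024**3     # split tablets larger than ~10 GiB
        MERGE_BYTES = 512 * 1024**2    # merge neighbours smaller than ~512 MiB

        def rebalance(tablets):
            # tablets: list of dicts {'range': (lo, hi), 'size': bytes}
            out = []
            for t in tablets:
                lo, hi = t['range']
                if t['size'] > SPLIT_BYTES:
                    mid = (lo + hi) // 2
                    half = t['size'] // 2
                    out.append({'range': (lo, mid), 'size': half})
                    out.append({'range': (mid, hi), 'size': t['size'] - half})
                elif out and out[-1]['size'] + t['size'] < MERGE_BYTES:
                    prev = out.pop()       # coalesce two small neighbours
                    out.append({'range': (prev['range'][0], hi),
                                'size': prev['size'] + t['size']})
                else:
                    out.append(t)
            return out

        print(len(rebalance([{'range': (0, 100), 'size': 12 * 1024**3},
                             {'range': (100, 200), 'size': 1024**2}])))   # 3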
  • #21 Tables partitioned using tablets will work efficiently with Raft. When Raft is used, the change is stored in the log before it’s applied to the table, so no repair in the Cassandra sense is needed - we may still want to “repair” (i.e. sync up) the logs between replicas, but the base tables will stay consistent at all times. This addresses the problem of consistency of derived data, which has been open in Cassandra for a long time (many of you who track Cassandra development are familiar with the materialized view consistency issues). https://docs.google.com/document/d/1flYRliD-VXNlrdPR2IT_rswXRW_55CySlXnEcw7qqtY/edit#heading=h.ly4c9p67vgne https://docs.google.com/presentation/d/1Pm1hIGza4RcSEzlV_bRSYv9AmUyAGRv6cuNmVuEmt9g/edit#slide=id.g51b14e1223_0_432
  • #22 The original Raft does not know about partitions, tokens or shards. It is an abstract algorithm describing replication of an abstract state machine. In Scylla we have more than one state machine (schema information, topology information, and then each tablet and its replica set is an independent Raft instance), so we want to run many copies of the Raft algorithm simultaneously. This poses new challenges: How do we spawn new copies consistently? How much state will the algorithm take? Can we share the overhead of the algorithm, such as the cost of distributed failure detection, between Raft instances? Where do we store the Raft replication log? Could we avoid the overhead of double logging - the Raft log and the commit log? Could we make these decisions configurable, depending on the balance of performance and ease of use? We have already addressed many of these issues in Scylla Raft - a reusable library which supports joint consensus configuration changes, a pluggable state machine, logging and failure detection. We’re working on rebuilding Scylla schema changes on top of it. The first user-visible impact of the effort is expected in the upcoming year. Stay tuned.
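    To give a feel for what a reusable Raft library with a pluggable state machine and a shared failure detector can look like, here is a rough sketch (the names are illustrative assumptions only, not the actual Scylla Raft API):

        from abc import ABC, abstractmethod

        class StateMachine(ABC):
            @abstractmethod
            def apply(self, command):
                # Called for every committed log entry.
                ...

        class SchemaStateMachine(StateMachine):
            def __init__(self):
                self.schema = {}
            def apply(self, command):
                self.schema[command['name']] = command['definition']

        class FailureDetector:
            # Shared between all Raft instances to avoid per-group heartbeats.
            def __init__(self):
                self.alive = set()
            def is_alive(self, node):
                return node in self.alive

        class RaftServer:
            def __init__(self, state_machine, failure_detector):
                self.sm = state_machine
                self.fd = failure_detector
                self.log = []
            def commit(self, command):
                self.log.append(command)    # replication to a majority omitted
                self.sm.apply(command)

        fd = FailureDetector()              # one detector, many Raft groups
        schema_group = RaftServer(SchemaStateMachine(), fd)
        schema_group.commit({'name': 'ks.t', 'definition': '...'})
        print(schema_group.sm.schema)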