Distributed Database
Consistency: Architectural
Considerations and Tradeoffs
Presented by:
Tzach Livyatan, VP of Product, ScyllaDB
Konstantin Osipov, Director, Software Engineering, ScyllaDB
Poll
Where are you in your
NoSQL Adoption?
Tzach Livyatan
VP of Product, ScyllaDB
+ Leads the product team at ScyllaDB
+ Appreciates distributed system testing
+ Lives in Tel Aviv, father of two
Konstantin Osipov
Director of Engineering, ScyllaDB
+ Worked on consensus algorithms in ScyllaDB
+ Crazy about distributed system testing
+ Lives in Moscow, father of two
ScyllaDB is the database for data-intensive apps
that require high performance and low latency
+ InfoWorld 2020 Technology of the Year!
+ Founded by designers of KVM Hypervisor
The Database Built for Gamechangers
“ScyllaDB stands apart...It’s the rare product
that exceeds my expectations.”
– Martin Heller, InfoWorld contributing editor and reviewer
“For 99.9% of applications, ScyllaDB delivers all the
power a customer will ever need, on workloads that other
databases can’t touch – and at a fraction of the cost of
an in-memory solution.”
– Adrian Bridgewater, Forbes senior contributor
+ Resolves challenges of legacy NoSQL databases
+ >5x higher throughput
+ >20x lower latency
+ >75% TCO savings
+ DBaaS/Cloud, Enterprise and Open Source solutions
+ Proven globally at scale
400+ Gamechangers Leverage ScyllaDB
Seamless experiences
across content + devices
Fast computation of flight
pricing
Corporate fleet
management
Real-time analytics
2,000,000 SKU e-commerce
management
Real-time location tracking
for friends/family
Video recommendation
management
IoT for industrial
machines
Synchronize browser
properties for millions
Threat intelligence service
using JanusGraph
Real time fraud detection
across 6M transactions/day
Uber scale, mission critical
chat & messaging app
Network security threat
detection
Power ~50M X1 DVRs with
billions of reqs/day
Precision healthcare via
Edison AI
Inventory hub for retail
operations
Property listings and
updates
Unified ML feature store
across the business
Cryptocurrency exchange
app
Geography-based
recommendations
Distributed storage for
distributed ledger tech
Global operations: Avon,
Body Shop + more
Predictable performance for
on-sale surges
GPS-based exercise
tracking
Agenda
■ Introduction to ScyllaDB
■ Consistency vs Availability
■ Problem statement: Schema and Topology Consistency
■ Raft in ScyllaDB
■ Schema and Topology Consistency in ScyllaDB 5.x
■ Next steps
■ Q&A
NoSQL – By Data Model (in order of increasing complexity)
Key / Value: Redis, Aerospike, RocksDB
Document store: MongoDB, Couchbase
Wide column store: Scylla, Apache Cassandra, HBase, DynamoDB
Graph: Neo4j, JanusGraph
NoSQL – By Availability vs Consistency
Pick two: Consistency, Availability, Partition Tolerance
PACELC: Latency vs Consistency
Data - Tunable, Eventual Consistency
Active/Active, replicated, auto-sharded
[Diagram: applications in multiple data centers read and write the same replicated, auto-sharded cluster, choosing a consistency level per request, e.g. CL=LOCAL_QUORUM or CL=ONE.]
Metadata Consistency - Gossip Protocol
Scylla Architecture
The Problem with Metadata Eventual Consistency
What is a Database Schema?
Replicating Schema Changes
[Diagram: a CREATE KEYSPACE clicks WITH { replication … } statement propagates node by node; during propagation the nodes hold different schema versions (7, 7, 6, 6, 5).]
Consistency Model of Schema Changes
[Diagram: starting from a table t(id, first, last) on nodes A and B, one node's schema evolves over time to (id, first, last, email, phone) while the other's evolves to (id, first, last, phone); a split brain leaves the nodes with divergent schemas for the same table.]
(In)consistency of Schema Changes
cqlsh:test> create table t (a int primary key);
----------------------------------------------- split ------------------------------------------
cqlsh:test> alter table t rename a to d;
Warning: schema version mismatch detected
cqlsh:test> insert into t (d) values (1);
Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance.
cqlsh:test> insert into t (a) values (1);
Unknown identifier a
Eventual Consistency of
Topology Changes
What is Topology?
Topology is defined as all of the following:
the set of nodes in the cluster,
the location of those nodes in DCs and racks,
and the assignment of ownership of data to nodes
Token Metadata
+ Members, data partitioning and distribution
+ Where does each key live in the cluster?
Token Partitioning
+ token = hash(partition key)
+ token ring: space of all tokens, set of all partition keys
+ token range: set of partition keys
[Diagram: a token ring with individual tokens marked and one token range highlighted.]
Token Metadata
[Diagram: a token ring whose ranges are owned by nodes A, B, and C.]
Token metadata:
+ Each node has a set of tokens assigned during bootstrap
(vnodes)
+ Tokens combined determine primary owning replicas for key
ranges
Token Metadata
[Diagram: token metadata (the ring owned by A, B, and C) combined with the replication strategy from CREATE KEYSPACE … WITH { replication = … } yields replication metadata: each token range maps to a replica set such as {A, C}, {C, B}, {B, A}, {C, A}, {A, B}, {B, C}.]
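To make the mapping concrete, here is a small sketch (not ScyllaDB's code: a generic hash stands in for the murmur3 partitioner, and the ring layout and replication factor are made up) of how a partition key hashes to a token and how walking the ring yields the replica set.

import bisect
import hashlib

# Illustrative token ring: sorted (token, owner) pairs; vnodes omitted.
RING = [(100, "A"), (200, "C"), (300, "B"), (400, "C"), (500, "A"), (600, "B")]
TOKENS = [t for t, _ in RING]

def token(partition_key: str) -> int:
    # Stand-in hash; ScyllaDB uses a murmur3-based partitioner.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 700

def replicas(partition_key: str, rf: int = 2) -> list[str]:
    # Find the first token >= the key's token, then walk the ring clockwise
    # collecting distinct owners until the replication factor is satisfied.
    start = bisect.bisect_left(TOKENS, token(partition_key)) % len(RING)
    owners: list[str] = []
    i = start
    while len(owners) < rf:
        node = RING[i % len(RING)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners

print(replicas("user:42"))  # e.g. ['C', 'A'], depending on the hash

Every coordinator runs this lookup locally, which is exactly why all coordinators must agree on the token metadata.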
Eventually (In)consistent Topology
+ To ensure data consistency, all coordinators need to agree on
topology
+ Eventually consistent propagation -> stale topology
Eventually (In)consistent Topology
[Diagram: nodes A, B, and C each hold identical token metadata: a ring of ranges owned by A, B, and C.]
Eventually (In)consistent Topology
[Diagram: the whole cluster goes down.]
Eventually (In)consistent Topology
[Diagram: the cluster comes back up, except node C.]
Eventually (In)consistent Topology
[Diagram: in gossip, the token metadata on nodes A and B now lists only A's and B's tokens; node C's tokens are missing from the local view.]
Eventually (In)consistent Topology
[Diagram: the full token metadata still covers A, B, and C, while the local view in gossip covers only A and B.]
Eventually (In)consistent Topology
[Diagram: node D starts bootstrapping and learns the stale local view from gossip, which contains only A's and B's tokens.]
Eventually (In)consistent Topology
[Diagram: node D completes bootstrap with a local view that omits node C, while the true ring still includes C.]
Eventually (In)consistent Topology
[Diagram: the resulting token metadata views side by side: the true ring with A, B, and C versus local gossip views that contain only A and B.]
+ Different token metadata -> different replica sets
+ Different nodes use different quorums -> inconsistent reads
+ Writes go to the wrong replica set temporarily
+ etc.
Eventually (In)consistent Topology
“Cannot” happen:
“Before adding the new node,
check the node’s status in the cluster using nodetool status
command.
You cannot add new nodes to the cluster if any of the nodes are
down.” [1]
[1] https://docs.scylladb.com/operating-scylla/procedures/cluster-management/add-node-to-cluster/
Strongly Consistent Topology
The plan:
+ Make the database responsible for consistency under all conditions
Why:
+ Gives a reliable safety net for admins
+ Reduces stress
+ Increases confidence
+ Simplifies procedures
Strong needs for strong consistency
+ Reliable, concurrent topology changes
+ Linearizable cluster-wide schema
+ Strongly consistent, partitioned storage
What is Raft?
Raft Intro
Raft is a protocol for state machine replication.
What does that mean?
+ A majority of nodes share the same state
+ State transitions happen in the same order on all nodes
Cluster topology is part of the state
How Raft Achieves Consistency
[Diagram: nodes A, B, and C each run a state machine.]
How Raft Achieves Consistency
[Diagram: each node adds a log of commands (x←1, y←2, z←3) that feeds its state machine.]
How Raft Achieves Consistency
[Diagram: a consensus module on each node keeps the logs identical, so all state machines apply the same commands in the same order.]
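A minimal sketch of the log-plus-state-machine idea from the diagrams above (simplified and assumed: real Raft adds terms, elections, persistence, and the consensus module that keeps the logs identical): once entries are committed, every node applies them in log order, so all state machines converge.

from dataclasses import dataclass, field

@dataclass
class Replica:
    log: list[tuple[str, int]] = field(default_factory=list)  # committed entries, e.g. ("x", 1)
    state: dict[str, int] = field(default_factory=dict)       # the replicated state machine
    applied: int = 0                                           # index of the last applied entry

    def apply_committed(self) -> None:
        # Apply entries strictly in log order; every replica does the same.
        while self.applied < len(self.log):
            key, value = self.log[self.applied]
            self.state[key] = value
            self.applied += 1

entries = [("x", 1), ("y", 2), ("z", 3)]
nodes = [Replica(log=list(entries)) for _ in range(3)]  # nodes A, B, C
for n in nodes:
    n.apply_committed()
assert all(n.state == {"x": 1, "y": 2, "z": 3} for n in nodes)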
Leader Based Replication
[Diagram: the leader, already at version 7, replicates a CREATE KEYSPACE clicks WITH { replication … } entry to followers still at version 6.]
Detecting a leader failure
[Diagram: the leader fails; followers B, C, D, and E stop receiving pings.]
● The leader regularly pings followers
● Followers become candidates when they don't receive pings
● 1 ping = 1/10 of a second
● Missing 10 pings = 1 second = election timeout
● A bigger timeout = decreased liveness
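The follower side of this can be sketched as follows (the numbers come from the slide; the structure is an illustration, not ScyllaDB's implementation):

import time

PING_INTERVAL = 0.1       # 1 ping = 1/10 of a second
ELECTION_TIMEOUT = 1.0    # missing 10 pings = 1 second

class Follower:
    def __init__(self) -> None:
        self.last_ping = time.monotonic()
        self.role = "follower"

    def on_ping(self) -> None:
        self.last_ping = time.monotonic()  # the leader is alive: reset the timer

    def tick(self) -> None:
        # Called periodically: if no ping arrived within the election timeout,
        # assume the leader failed and start campaigning.
        if self.role == "follower" and time.monotonic() - self.last_ping >= ELECTION_TIMEOUT:
            self.role = "candidate"

A larger timeout tolerates slower networks but delays the election, which is the liveness trade-off mentioned above.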
Raft Leadership Changes
[Diagram: over time, an election starts, S1 becomes a candidate, more candidates appear, and S1 is elected leader.]
Randomizing the election timeout
[Diagram: after a leader failure, nodes E, D, B, and C become candidates at randomized moments spread between the election timeout and the election threshold.]
Why split votes happen
[Diagram: in a larger cluster, many nodes (B, C, D, E, G, H, I, K, L, M, P) become candidates within the same window after the leader fails, so votes are split.]
Use Gaussian timeout distribution ?
[Diagram: the same scenario with election timeouts drawn from a Gaussian distribution; candidates still cluster together.]
Uniform, scaled threshold
[Diagram: election timeouts drawn uniformly, with the election threshold scaled by size(cluster); after the leader failure, candidates are spread out across the window.]
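One way to read this slide as code (an assumption about the intent, not ScyllaDB's exact formula): each node adds a uniformly random delay drawn from a window that scales with the cluster size, so even in a very large cluster the earliest candidates are spread apart and split votes stay unlikely.

import random

BASE_ELECTION_TIMEOUT = 1.0  # seconds, from the failure-detection slide
PING_INTERVAL = 0.1

def election_delay(cluster_size: int) -> float:
    # The randomization window ("election threshold") grows with the cluster
    # size; the expected gap between the first and second candidate stays
    # roughly constant, so collisions remain rare as the cluster grows.
    threshold = PING_INTERVAL * cluster_size
    return BASE_ELECTION_TIMEOUT + random.uniform(0.0, threshold)

print([round(election_delay(1000), 1) for _ in range(3)])  # spread over a wide window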
Ping traffic in Multi-Raft
[Diagram: nodes A-E, each leading many Raft groups.]
● Each node is a leader of many groups
● Each leader has to ping its followers
● G² pings
Solution: shared failure detector
+ Each node pings other nodes
+ Groups share ping information
+ N² pings
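A sketch of the shared failure detector (structure assumed): each node keeps one liveness table that every Raft group on that node consults, instead of each group leader pinging its own followers.

import time

class SharedFailureDetector:
    """One instance per node; all Raft groups on the node consult it."""

    def __init__(self, timeout: float = 1.0) -> None:
        self.timeout = timeout
        self.last_heard: dict[str, float] = {}

    def on_ping(self, node_id: str) -> None:
        # Node-to-node pings update a single table, so ping traffic is O(N²)
        # for N nodes, independent of the number of Raft groups.
        self.last_heard[node_id] = time.monotonic()

    def is_alive(self, node_id: str) -> bool:
        last = self.last_heard.get(node_id)
        return last is not None and time.monotonic() - last < self.timeout

fd = SharedFailureDetector()
fd.on_ping("B")
print(fd.is_alive("B"), fd.is_alive("C"))  # True False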
Problem: removing a leader
[Diagram: nodes A-E; the removed leader is still reported alive.]
● A removed leader is seen as alive if the failure detector is shared
● Vanilla Raft doesn't ping nodes outside the configuration, so it is not affected
● Nodes do not become candidates
Solution: search for a leader
● The follower forwards requests to the leader if it is known
● If there is a request but the leader is not known, the follower sends AppendEntriesReply to all nodes to find the leader
Raft Configuration Changes
[Diagram: configuration changes (add node D, del node A) are entries in the replicated log alongside normal commands (x←1, y←2, z←3), ordered in time.]
Scylla Raft: only use Joint configuration
[Diagram: every configuration change appears in the replicated log as a pair of entries, "Begin add A" followed later by "End add A", interleaved with normal commands (x←1, y←2, z←3).]
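The commit rule during a joint configuration can be illustrated like this (a sketch of standard Raft joint consensus, which the slide says ScyllaDB uses for every change; not ScyllaDB's code): between "Begin add A" and "End add A", an entry commits only with a majority in both the old and the new member set.

def has_majority(voters: set[str], acks: set[str]) -> bool:
    return 2 * len(acks & voters) > len(voters)

def committed_in_joint(old_cfg: set[str], new_cfg: set[str], acks: set[str]) -> bool:
    # During the joint phase an entry needs a majority of BOTH configurations.
    return has_majority(old_cfg, acks) and has_majority(new_cfg, acks)

old = {"B", "C", "D"}
new = {"A", "B", "C", "D"}                            # Begin add A
print(committed_in_joint(old, new, {"A", "B", "C"}))  # True: 2/3 of old and 3/4 of new
print(committed_in_joint(old, new, {"A", "C"}))       # False: no majority of the old config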
Non-voting Members
[Diagram: ADD NODE B adds node B as a non-voting member alongside node A.]
Pre-voting
Sticky leadership rule removed
Summary
Scylla Raft implements a number of important extensions
+ Resilience against asymmetric network failures with pre-voting
+ Increased liveness for very large clusters (1000+ nodes)
+ Efficient multi-raft: every node can replicate many state machines
+ Read and write support on all cluster nodes (barriers and forwarding)
+ Non-voting members
Scylla Raft removes some Raft features as redundant:
+ Simple configuration changes
+ Sticky leadership
Setting up a Fresh Cluster
[Diagram: five nodes (1-5), each configured with seeds {2, 3}, start up; over time they discover each other and converge on a single cluster containing nodes 1-5.]
Setting up a Fresh Cluster
On a fresh start, a ScyllaDB node:
+ Generates and persists a unique random Server ID (UUID)
+ Contacts all known peers. Strictly after:
+ contacting all peers in the seeds: list
+ exchanging all known Server IDs
+ AND not finding an existing cluster
+ AND if this Server ID is lexicographically the smallest
+ Creates a new Raft Group ID and a new cluster
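A sketch of the final decision (simplified; names are illustrative, and the real procedure also persists its state and exchanges peer lists transitively): after contacting every peer and finding no existing cluster, only the node with the lexicographically smallest Server ID creates the new group.

import uuid

def should_create_cluster(my_id: uuid.UUID,
                          peer_ids: set[uuid.UUID],
                          existing_cluster_found: bool) -> bool:
    # Create a new cluster only if no cluster was discovered AND this node's
    # Server ID is the smallest of all Server IDs exchanged so far.
    if existing_cluster_found:
        return False
    return str(my_id) == min(str(i) for i in peer_ids | {my_id})

my_id = uuid.uuid4()
peers = {uuid.uuid4(), uuid.uuid4()}
if should_create_cluster(my_id, peers, existing_cluster_found=False):
    group0_id = uuid.uuid4()  # new Raft Group ID for the fresh cluster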
Topology Changes on Raft
system.token_metadata
+ Have a Raft group which includes all cluster members (raft_group0)
+ Token metadata is the state machine replicated by Raft
+ Changes to token metadata are Raft commands
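Put differently (an illustrative sketch; the command shapes and names are assumptions, not ScyllaDB's internals): token metadata becomes a state machine whose commands travel through the raft_group0 log, so every member applies topology changes in the same order.

from dataclasses import dataclass, field

@dataclass
class TokenMetadata:
    tokens: dict[str, set[int]] = field(default_factory=dict)  # node id -> owned tokens

    def apply(self, cmd: dict) -> None:
        # Commands arrive through the raft_group0 log in the same order on
        # every member, so all nodes converge on the same topology.
        if cmd["op"] == "add_node":
            self.tokens[cmd["node"]] = set(cmd["tokens"])
        elif cmd["op"] == "remove_node":
            self.tokens.pop(cmd["node"], None)

tm = TokenMetadata()
tm.apply({"op": "add_node", "node": "D", "tokens": [150, 450]})
tm.apply({"op": "remove_node", "node": "A"})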
Schema changes on Raft
To execute a DDL statement, the server:
+ Takes a Raft read barrier
+ Reads the latest schema and validates the CQL
+ Builds a Raft command and signs it with the old and new schema IDs
+ Once the command is committed, it is applied only if the old schema ID still matches
+ Retries if the commit or apply failed
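The same steps as a sketch of the retry loop (everything here is a stand-in: the fake group, the integer schema IDs, and the command shape are illustrative, not ScyllaDB's implementation):

class Group0Stub:
    """Stand-in for the Raft group; just enough behaviour for the sketch."""

    def __init__(self) -> None:
        self.schema_id = 0

    def read_barrier(self) -> None:
        pass  # wait until the local state machine has caught up with the log

    def commit(self, cmd: dict) -> bool:
        # The committed command is applied only if the schema ID it was built
        # against is still current; otherwise it is a no-op and the caller retries.
        if cmd["old_id"] != self.schema_id:
            return False
        self.schema_id = cmd["new_id"]
        return True

def execute_ddl(group0: Group0Stub, statement: str) -> None:
    while True:
        group0.read_barrier()                 # 1. take a Raft read barrier
        old_id = group0.schema_id             # 2. read the latest schema, validate the CQL
        cmd = {"old_id": old_id, "new_id": old_id + 1, "ddl": statement}
        if group0.commit(cmd):                # 3-4. commit; applied only if old_id still matches
            return
        # 5. retry: another DDL won the race or the commit failed

execute_ddl(Group0Stub(), "ALTER TABLE t ADD b int")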
The Balance Between
Consistency and Availability
Availability of DML
[Diagram: servers S1-S3 share a Raft log of schema changes (CREATE TABLE t, ADD COLUMN b, CREATE INDEX t_i1); DML statements (INSERT INTO t, SET b = 2, SELECT b) execute against the locally known schema, with a schema fetch when a newer version is needed.]
Replacing Gossip with Raft
+ Raft eagerly replicates to every node
+ Like RF=ALL tables with auto-repair
+ Request coordinators still use the local view of topology
+ No extra coordination when executing user requests
+ Topology changes use linearizable access for learning and modification
+ No need for sleep(30s)
+ Faster topology changes
Solved Issues
+ Concurrent DDL is now safe
+ Safe topology changes enable elasticity
+ still under --experimental-features-raft
+ Enabled if all nodes are 5.0
Split Brain Problem
[Diagram: applications connected to a cluster that has split into two partitions.]
Introduced Issues
Raft prefers CONSISTENCY over AVAILABILITY. What does that mean?
+ 2-data-center setups become more fragile
+ Prefer an odd number of DCs to avoid split brain
+ Import sstables into a new cluster after a permanent loss of majority
+ A 5.0 cluster with Raft can't downgrade to 4.x
Steps to Stronger Consistency in ScyllaDB
+ Tests, tests and more tests
+ Schema consistency - Experimental in 5.0
+ Topology consistency - Coming in 5.x
+ Tablets consistency - Coming in 5.x
Questions?
Poll
How much data do you have under
management in your own
transactional database?
Thank you
for joining us today.
@scylladb scylladb/
slack.scylladb.com
@scylladb company/scylladb/
scylladb/