Distributed Database
Consistency: Architectural
Considerations and Tradeoffs
Presented by:
Tzach Livyatan, VP of Product, ScyllaDB
Konstantin Osipov, Director, Software Engineering, ScyllaDB
Poll
Where are you in your
NoSQL Adoption?
Tzach Livyatan
VP of Product, ScyllaDB
+ Leads the product team at ScyllaDB
+ Appreciates distributed system testing
+ Lives in Tel Aviv, father of two
Konstantin Osipov
Director of Engineering, ScyllaDB
+ Worked on consensus algorithms in ScyllaDB
+ Crazy about distributed system testing
+ Lives in Moscow, father of two
ScyllaDB is the database for data-intensive apps
that require high performance and low latency
+ InfoWorld 2020 Technology of the Year!
+ Founded by designers of KVM Hypervisor
The Database Built for Gamechangers
“ScyllaDB stands apart...It’s the rare product
that exceeds my expectations.”
– Martin Heller, InfoWorld contributing editor and reviewer
“For 99.9% of applications, ScyllaDB delivers all the
power a customer will ever need, on workloads that other
databases can’t touch – and at a fraction of the cost of
an in-memory solution.”
– Adrian Bridgewater, Forbes senior contributor
+ Resolves challenges of legacy NoSQL databases
+ >5x higher throughput
+ >20x lower latency
+ >75% TCO savings
+ DBaaS/Cloud, Enterprise and Open Source solutions
+ Proven globally at scale
400+ Gamechangers Leverage ScyllaDB
Seamless experiences
across content + devices
Fast computation of flight
pricing
Corporate fleet
management
Real-time analytics
2,000,000 SKU e-commerce
management
Real-time location tracking
for friends/family
Video recommendation
management
IoT for industrial
machines
Synchronize browser
properties for millions
Threat intelligence service
using JanusGraph
Real time fraud detection
across 6M transactions/day
Uber scale, mission critical
chat & messaging app
Network security threat
detection
Power ~50M X1 DVRs with
billions of reqs/day
Precision healthcare via
Edison AI
Inventory hub for retail
operations
Property listings and
updates
Unified ML feature store
across the business
Cryptocurrency exchange
app
Geography-based
recommendations
Distributed storage for
distributed ledger tech
Global operations: Avon,
Body Shop + more
Predictable performance for
on-sale surges
GPS-based exercise
tracking
Agenda
■ Introduction to ScyllaDB
■ Consistency vs Availability
■ Problem statement: Schema and Topology Consistency
■ Raft in ScyllaDB
■ Schema and Topology Consistency in ScyllaDB 5.x
■ Next steps
■ Q&A
NoSQL – By Data Model (in order of increasing complexity)
Key / Value: Redis, Aerospike, RocksDB
Document store: MongoDB, Couchbase
Wide column store: Scylla, Apache Cassandra, HBase, DynamoDB
Graph: Neo4j, JanusGraph
NoSQL – By Availability vs Consistency
Pick two: Consistency, Availability, Partition Tolerance
PACELC: Latency vs Consistency
Data - Tunable, Eventual Consistency
Active/Active, replicated, auto-sharded
[Diagram: applications in multiple data centers read and write the same replicated, auto-sharded cluster, choosing a consistency level per request, e.g. CL=LOCAL_QUORUM or CL=ONE.]
Metadata Consistency - Gossip Protocol
Scylla Architecture
The Problem with Metadata Eventual Consistency
What is a Database Schema?
Replicating Schema Changes
[Diagram: a CREATE KEYSPACE clicks WITH { replication … } statement propagates node by node; during propagation the nodes hold different schema versions (7, 7, 6, 6, 5).]
Consistency Model of Schema Changes
[Diagram: starting from a table t(id, first, last) on nodes A and B, one node's schema evolves over time to (id, first, last, email, phone) while the other's evolves to (id, first, last, phone); a split brain leaves the nodes with divergent schemas for the same table.]
(In)consistency of Schema Changes
cqlsh:test> create table t (a int primary key);
----------------------------------------------- split ------------------------------------------
cqlsh:test> alter table t rename a to d;
Warning: schema version mismatch detected
cqlsh:test> insert into t (d) values (1);
Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance.
cqlsh:test> insert into t (a) values (1);
Unknown identifier a
Eventual Consistency of
Topology Changes
What is Topology?
Topology is defined as all of the following:
the set of nodes in the cluster,
the location of those nodes in DCs and racks,
and the assignment of ownership of data to nodes
Token Metadata
+ Members, data partitioning and distribution
+ Where does each key live in the cluster?
Token Partitioning
+ token = hash(partition key)
+ token ring: space of all tokens, set of all partition keys
+ token range: set of partition keys
[Diagram: a token ring with individual tokens marked and one token range highlighted.]
Token Metadata
[Diagram: a token ring whose ranges are owned by nodes A, B, and C.]
Token metadata:
+ Each node has a set of tokens assigned during bootstrap
(vnodes)
+ Tokens combined determine primary owning replicas for key
ranges
Token Metadata
[Diagram: token metadata (the ring owned by A, B, and C) combined with the replication strategy from CREATE KEYSPACE … WITH { replication = … } yields replication metadata: each token range maps to a replica set such as {A, C}, {C, B}, {B, A}, {C, A}, {A, B}, {B, C}.]
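To make the mapping concrete, here is a small sketch (not ScyllaDB's code: a generic hash stands in for the murmur3 partitioner, and the ring layout and replication factor are made up) of how a partition key hashes to a token and how walking the ring yields the replica set.

import bisect
import hashlib

# Illustrative token ring: sorted (token, owner) pairs; vnodes omitted.
RING = [(100, "A"), (200, "C"), (300, "B"), (400, "C"), (500, "A"), (600, "B")]
TOKENS = [t for t, _ in RING]

def token(partition_key: str) -> int:
    # Stand-in hash; ScyllaDB uses a murmur3-based partitioner.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 700

def replicas(partition_key: str, rf: int = 2) -> list[str]:
    # Find the first token >= the key's token, then walk the ring clockwise
    # collecting distinct owners until the replication factor is satisfied.
    start = bisect.bisect_left(TOKENS, token(partition_key)) % len(RING)
    owners: list[str] = []
    i = start
    while len(owners) < rf:
        node = RING[i % len(RING)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners

print(replicas("user:42"))  # e.g. ['C', 'A'], depending on the hash

Every coordinator runs this lookup locally, which is exactly why all coordinators must agree on the token metadata.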
Eventually (In)consistent Topology
+ To ensure data consistency, all coordinators need to agree on
topology
+ Eventually consistent propagation -> stale topology
Eventually (In)consistent Topology
[Diagram: nodes A, B, and C each hold identical token metadata: a ring of ranges owned by A, B, and C.]
Eventually (In)consistent Topology
[Diagram: the whole cluster goes down.]
Eventually (In)consistent Topology
[Diagram: the cluster comes back up, except node C.]
Eventually (In)consistent Topology
[Diagram: in gossip, the token metadata on nodes A and B now lists only A's and B's tokens; node C's tokens are missing from the local view.]
Eventually (In)consistent Topology
[Diagram: the full token metadata still covers A, B, and C, while the local view in gossip covers only A and B.]
Eventually (In)consistent Topology
[Diagram: node D starts bootstrapping and learns the stale local view from gossip, which contains only A's and B's tokens.]
Eventually (In)consistent Topology
[Diagram: node D completes bootstrap with a local view that omits node C, while the true ring still includes C.]
Eventually (In)consistent Topology
[Diagram: the resulting token metadata views side by side: the true ring with A, B, and C versus local gossip views that contain only A and B.]
+ Different token metadata -> different replica sets
+ Different nodes use different quorums -> inconsistent reads
+ Writes go to the wrong replica set temporarily
+ etc.
Eventually (In)consistent Topology
“Cannot” happen:
“Before adding the new node,
check the node’s status in the cluster using nodetool status
command.
You cannot add new nodes to the cluster if any of the nodes are
down.” [1]
[1] https://docs.scylladb.com/operating-scylla/procedures/cluster-management/add-node-to-cluster/
Strongly Consistent Topology
The plan:
+ Make the database responsible for consistency under all conditions
Why:
+ Gives a reliable safety net for admins
+ Reduces stress
+ Increases confidence
+ Simplifies procedures
Strong needs for strong consistency
+ Reliable, concurrent topology changes
+ Linearizable cluster-wide schema
+ Strongly consistent, partitioned storage
What is Raft?
Raft Intro
Raft is a protocol for state machine replication.
What does that mean?
+ A majority of nodes share the same state
+ State transitions happen in the same order on all nodes
Cluster topology is part of the state
How Raft Achieves Consistency
[Diagram: nodes A, B, and C each run a state machine.]
How Raft Achieves Consistency
[Diagram: each node adds a log of commands (x←1, y←2, z←3) that feeds its state machine.]
How Raft Achieves Consistency
[Diagram: a consensus module on each node keeps the logs identical, so all state machines apply the same commands in the same order.]
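A minimal sketch of the log-plus-state-machine idea from the diagrams above (simplified and assumed: real Raft adds terms, elections, persistence, and the consensus module that keeps the logs identical): once entries are committed, every node applies them in log order, so all state machines converge.

from dataclasses import dataclass, field

@dataclass
class Replica:
    log: list[tuple[str, int]] = field(default_factory=list)  # committed entries, e.g. ("x", 1)
    state: dict[str, int] = field(default_factory=dict)       # the replicated state machine
    applied: int = 0                                           # index of the last applied entry

    def apply_committed(self) -> None:
        # Apply entries strictly in log order; every replica does the same.
        while self.applied < len(self.log):
            key, value = self.log[self.applied]
            self.state[key] = value
            self.applied += 1

entries = [("x", 1), ("y", 2), ("z", 3)]
nodes = [Replica(log=list(entries)) for _ in range(3)]  # nodes A, B, C
for n in nodes:
    n.apply_committed()
assert all(n.state == {"x": 1, "y": 2, "z": 3} for n in nodes)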
Leader Based Replication
[Diagram: the leader, already at version 7, replicates a CREATE KEYSPACE clicks WITH { replication … } entry to followers still at version 6.]
Detecting a leader failure
[Diagram: the leader fails; followers B, C, D, and E stop receiving pings.]
● The leader regularly pings followers
● Followers become candidates when they don't receive pings
● 1 ping = 1/10 of a second
● Missing 10 pings = 1 second = election timeout
● A bigger timeout = decreased liveness
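The follower side of this can be sketched as follows (the numbers come from the slide; the structure is an illustration, not ScyllaDB's implementation):

import time

PING_INTERVAL = 0.1       # 1 ping = 1/10 of a second
ELECTION_TIMEOUT = 1.0    # missing 10 pings = 1 second

class Follower:
    def __init__(self) -> None:
        self.last_ping = time.monotonic()
        self.role = "follower"

    def on_ping(self) -> None:
        self.last_ping = time.monotonic()  # the leader is alive: reset the timer

    def tick(self) -> None:
        # Called periodically: if no ping arrived within the election timeout,
        # assume the leader failed and start campaigning.
        if self.role == "follower" and time.monotonic() - self.last_ping >= ELECTION_TIMEOUT:
            self.role = "candidate"

A larger timeout tolerates slower networks but delays the election, which is the liveness trade-off mentioned above.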
Raft Leadership Changes
[Diagram: over time, an election starts, S1 becomes a candidate, more candidates appear, and S1 is elected leader.]
Randomizing the election timeout
[Diagram: after a leader failure, nodes E, D, B, and C become candidates at randomized moments spread between the election timeout and the election threshold.]
Why split votes happen
[Diagram: in a larger cluster, many nodes (B, C, D, E, G, H, I, K, L, M, P) become candidates within the same window after the leader fails, so votes are split.]
Use Gaussian timeout distribution ?
[Diagram: the same scenario with election timeouts drawn from a Gaussian distribution; candidates still cluster together.]
Uniform, scaled threshold
[Diagram: election timeouts drawn uniformly, with the election threshold scaled by size(cluster); after the leader failure, candidates are spread out across the window.]
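One way to read this slide as code (an assumption about the intent, not ScyllaDB's exact formula): each node adds a uniformly random delay drawn from a window that scales with the cluster size, so even in a very large cluster the earliest candidates are spread apart and split votes stay unlikely.

import random

BASE_ELECTION_TIMEOUT = 1.0  # seconds, from the failure-detection slide
PING_INTERVAL = 0.1

def election_delay(cluster_size: int) -> float:
    # The randomization window ("election threshold") grows with the cluster
    # size; the expected gap between the first and second candidate stays
    # roughly constant, so collisions remain rare as the cluster grows.
    threshold = PING_INTERVAL * cluster_size
    return BASE_ELECTION_TIMEOUT + random.uniform(0.0, threshold)

print([round(election_delay(1000), 1) for _ in range(3)])  # spread over a wide window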
Ping traffic in Multi-Raft
[Diagram: nodes A-E, each leading many Raft groups.]
● Each node is a leader of many groups
● Each leader has to ping its followers
● G² pings
Solution: shared failure detector
+ Each node pings other nodes
+ Groups share ping information
+ N² pings
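A sketch of the shared failure detector (structure assumed): each node keeps one liveness table that every Raft group on that node consults, instead of each group leader pinging its own followers.

import time

class SharedFailureDetector:
    """One instance per node; all Raft groups on the node consult it."""

    def __init__(self, timeout: float = 1.0) -> None:
        self.timeout = timeout
        self.last_heard: dict[str, float] = {}

    def on_ping(self, node_id: str) -> None:
        # Node-to-node pings update a single table, so ping traffic is O(N²)
        # for N nodes, independent of the number of Raft groups.
        self.last_heard[node_id] = time.monotonic()

    def is_alive(self, node_id: str) -> bool:
        last = self.last_heard.get(node_id)
        return last is not None and time.monotonic() - last < self.timeout

fd = SharedFailureDetector()
fd.on_ping("B")
print(fd.is_alive("B"), fd.is_alive("C"))  # True False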
Problem: removing a leader
[Diagram: nodes A-E; the removed leader is still reported alive.]
● A removed leader is seen as alive if the failure detector is shared
● Vanilla Raft doesn't ping nodes outside the configuration, so it is not affected
● Nodes do not become candidates
Solution: search for a leader
● The follower forwards requests to the leader if it is known
● If there is a request but the leader is not known, the follower sends AppendEntriesReply to all nodes to find the leader
Raft Configuration Changes
[Diagram: configuration changes (add node D, del node A) are entries in the replicated log alongside normal commands (x←1, y←2, z←3), ordered in time.]
Scylla Raft: only use Joint configuration
[Diagram: every configuration change appears in the replicated log as a pair of entries, "Begin add A" followed later by "End add A", interleaved with normal commands (x←1, y←2, z←3).]
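The commit rule during a joint configuration can be illustrated like this (a sketch of standard Raft joint consensus, which the slide says ScyllaDB uses for every change; not ScyllaDB's code): between "Begin add A" and "End add A", an entry commits only with a majority in both the old and the new member set.

def has_majority(voters: set[str], acks: set[str]) -> bool:
    return 2 * len(acks & voters) > len(voters)

def committed_in_joint(old_cfg: set[str], new_cfg: set[str], acks: set[str]) -> bool:
    # During the joint phase an entry needs a majority of BOTH configurations.
    return has_majority(old_cfg, acks) and has_majority(new_cfg, acks)

old = {"B", "C", "D"}
new = {"A", "B", "C", "D"}                            # Begin add A
print(committed_in_joint(old, new, {"A", "B", "C"}))  # True: 2/3 of old and 3/4 of new
print(committed_in_joint(old, new, {"A", "C"}))       # False: no majority of the old config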
Non-voting Members
[Diagram: ADD NODE B adds node B as a non-voting member alongside node A.]
Pre-voting
Sticky leadership rule removed
Summary
Scylla Raft implements a number of important extensions
+ Resilience against asymmetric network failures with pre-voting
+ Increased liveness for very large clusters (1000+ nodes)
+ Efficient multi-raft: every node can replicate many state machines
+ Read and write support on all cluster nodes (barriers and forwarding)
+ Non-voting members
Scylla Raft removes some Raft features as redundant:
+ Simple configuration changes
+ Sticky leadership
Setting up a Fresh Cluster
[Diagram: five nodes (1-5), each configured with seeds {2, 3}, start up; over time they discover each other and converge on a single cluster containing nodes 1-5.]
Setting up a Fresh Cluster
On a fresh start, a ScyllaDB node:
+ Generates and persists a unique random Server ID (UUID)
+ Contacts all known peers. Strictly after:
+ contacting all peers in the seeds: list
+ exchanging all known Server IDs
+ AND not finding an existing cluster
+ AND if this Server ID is lexicographically the smallest
+ Creates a new Raft Group ID and a new cluster
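A sketch of the final decision (simplified; names are illustrative, and the real procedure also persists its state and exchanges peer lists transitively): after contacting every peer and finding no existing cluster, only the node with the lexicographically smallest Server ID creates the new group.

import uuid

def should_create_cluster(my_id: uuid.UUID,
                          peer_ids: set[uuid.UUID],
                          existing_cluster_found: bool) -> bool:
    # Create a new cluster only if no cluster was discovered AND this node's
    # Server ID is the smallest of all Server IDs exchanged so far.
    if existing_cluster_found:
        return False
    return str(my_id) == min(str(i) for i in peer_ids | {my_id})

my_id = uuid.uuid4()
peers = {uuid.uuid4(), uuid.uuid4()}
if should_create_cluster(my_id, peers, existing_cluster_found=False):
    group0_id = uuid.uuid4()  # new Raft Group ID for the fresh cluster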
Topology Changes on Raft
system.token_metadata
+ Have a Raft group which includes all cluster members (raft_group0)
+ Token metadata is the state machine replicated by Raft
+ Changes to token metadata are Raft commands
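Put differently (an illustrative sketch; the command shapes and names are assumptions, not ScyllaDB's internals): token metadata becomes a state machine whose commands travel through the raft_group0 log, so every member applies topology changes in the same order.

from dataclasses import dataclass, field

@dataclass
class TokenMetadata:
    tokens: dict[str, set[int]] = field(default_factory=dict)  # node id -> owned tokens

    def apply(self, cmd: dict) -> None:
        # Commands arrive through the raft_group0 log in the same order on
        # every member, so all nodes converge on the same topology.
        if cmd["op"] == "add_node":
            self.tokens[cmd["node"]] = set(cmd["tokens"])
        elif cmd["op"] == "remove_node":
            self.tokens.pop(cmd["node"], None)

tm = TokenMetadata()
tm.apply({"op": "add_node", "node": "D", "tokens": [150, 450]})
tm.apply({"op": "remove_node", "node": "A"})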
Schema changes on Raft
To execute a DDL statement, the server:
+ Takes a Raft read barrier
+ Reads the latest schema and validates the CQL
+ Builds a Raft command and signs it with the old and new schema IDs
+ Once the command is committed, it is applied only if the old schema ID still matches
+ Retries if the commit or apply failed
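The same steps as a sketch of the retry loop (everything here is a stand-in: the fake group, the integer schema IDs, and the command shape are illustrative, not ScyllaDB's implementation):

class Group0Stub:
    """Stand-in for the Raft group; just enough behaviour for the sketch."""

    def __init__(self) -> None:
        self.schema_id = 0

    def read_barrier(self) -> None:
        pass  # wait until the local state machine has caught up with the log

    def commit(self, cmd: dict) -> bool:
        # The committed command is applied only if the schema ID it was built
        # against is still current; otherwise it is a no-op and the caller retries.
        if cmd["old_id"] != self.schema_id:
            return False
        self.schema_id = cmd["new_id"]
        return True

def execute_ddl(group0: Group0Stub, statement: str) -> None:
    while True:
        group0.read_barrier()                 # 1. take a Raft read barrier
        old_id = group0.schema_id             # 2. read the latest schema, validate the CQL
        cmd = {"old_id": old_id, "new_id": old_id + 1, "ddl": statement}
        if group0.commit(cmd):                # 3-4. commit; applied only if old_id still matches
            return
        # 5. retry: another DDL won the race or the commit failed

execute_ddl(Group0Stub(), "ALTER TABLE t ADD b int")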
The Balance Between
Consistency and Availability
Availability of DML
[Diagram: servers S1-S3 share a Raft log of schema changes (CREATE TABLE t, ADD COLUMN b, CREATE INDEX t_i1); DML statements (INSERT INTO t, SET b = 2, SELECT b) execute against the locally known schema, with a schema fetch when a newer version is needed.]
Replacing Gossip with Raft
+ Raft eagerly replicates to every node
+ Like RF=ALL tables with auto-repair
+ Request coordinators still use the local view of topology
+ No extra coordination when executing user requests
+ Topology changes use linearizable access for learning and modification
+ No need for sleep(30s)
+ Faster topology changes
Solved Issues
+ Concurrent DDL is now safe
+ Safe topology changes enable elasticity
+ still under --experimental-features-raft
+ Enabled if all nodes are 5.0
Split Brain Problem
[Diagram: applications connected to a cluster that has split into two partitions.]
Introduced Issues
Raft prefers CONSISTENCY over AVAILABILITY. What does that mean?
+ 2-data-center setups become more fragile
+ Prefer an odd number of DCs to avoid split brain
+ Import sstables into a new cluster after a permanent loss of majority
+ A 5.0 cluster with Raft can't downgrade to 4.x
Steps to Stronger Consistency in ScyllaDB
+ Tests, tests and more tests
+ Schema consistency - Experimental in 5.0
+ Topology consistency - Coming in 5.x
+ Tablets consistency - Coming in 5.x
Questions?
Poll
How much data do you have under
management in your own
transactional database?
Thank you
for joining us today.
@scylladb scylladb/
slack.scylladb.com
@scylladb company/scylladb/
scylladb/