Making Schema Changes Safe with Raft
Konstantin Osipov
ScyllaDB Engineering
Konstantin Osipov
■ Worked on lightweight transactions in Scylla
■ Crazy about distributed system testing
■ Lives in Moscow, father of two
Director of Engineering, ScyllaDB
What is a database schema?
Scylla schema features
You can:
CREATE or DROP:
● KEYSPACE, TABLE
● VIEW, INDEX
CHANGE OPTIONS:
● of replication, tables, views, indexes
You can’t:
MODIFY:
● RENAME keyspace, table, column
● CHANGE column type
CONSTRAINT:
● UNIQUE, CHECK, FOREIGN KEY
Consistency Model of Schema Changes
[Figure: timeline of one table's schema as seen by Node A and Node B. The table starts as (id, first, last) with the row (1, John, Doe). Over time the variants (id, first, last, email), (id, first, last, email, phone), and (id, first, last, phone) appear, with a second row for Jenny Smith. The nodes end in a split brain.]
(In)consistency of Schema Changes
cqlsh:test> create table t (a int primary key);
----------------------------------------------- split ------------------------------------------
cqlsh:test> alter table t rename a to d;
Warning: schema version mismatch detected
cqlsh:test> insert into t (d) values (1);
Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance.
cqlsh:test> insert into t (a) values (1);
Unknown identifier a
Schema changes need strong consistency
In Cassandra: CASSANDRA-10250, CASSANDRA-10699, CASSANDRA-14957, …
In Scylla: #2921, #4648, #6455, #7426, #8968, #9774, …
Scylla Raft
Raft intro
Raft is a protocol for state machine replication.
What does it mean?
■ The majority of nodes have the same state
■ State transitions happen in the same order on all nodes
■ Cluster topology is part of the state
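A minimal sketch of what state machine replication means, not Scylla's implementation: if every node applies the same log entries in the same order to a deterministic state machine, all nodes reach the same state. The `apply_log` helper and the dictionary-based state are illustrative assumptions.

```python
# Illustrative only: a toy deterministic state machine driven by a replicated log.
from typing import Dict, List, Tuple

LogEntry = Tuple[str, int]  # (variable name, value to assign), e.g. ("x", 1)


def apply_log(log: List[LogEntry]) -> Dict[str, int]:
    """Apply entries strictly in log order and return the resulting state."""
    state: Dict[str, int] = {}
    for name, value in log:
        state[name] = value
    return state


# The same log, replicated to three nodes, yields the same state on each.
replicated_log: List[LogEntry] = [("x", 1), ("y", 2), ("z", 3)]
states = {node: apply_log(replicated_log) for node in ("A", "B", "C")}

# Order matters: conflicting writes applied in a different order diverge.
in_order = apply_log([("x", 1), ("x", 5)])
reordered = apply_log([("x", 5), ("x", 1)])
```

The last two lines show why agreeing on the order, and not only on the set of commands, is the core of the problem.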
How Raft achieves consistency
[Figure: Node A, Node B, and Node C each run a consensus module, a state machine, and a log holding the same entries: x←1, y←2, z←3.]
Raft leadership changes
[Figure panels: election starts; S1 is a candidate; more candidates; S1 is elected leader.]
Raft configuration changes
[Figure: replicated log over time, with configuration changes interleaved with ordinary commands: x←1, add node D, y←2, z←3, del node A.]
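A rough sketch of the idea that membership changes travel through the same replicated log as ordinary commands, so every node sees them at the same point in the sequence. The entry encoding, the `membership_history` helper, and the exact ordering of entries are assumptions made for illustration; real Raft configuration changes involve additional safety steps that are not modeled here.

```python
# Illustrative only: configuration changes as entries in the replicated log.
from typing import FrozenSet, List, Set, Tuple

# ("set", "x=1") is an ordinary command; ("add", "D") / ("del", "A") change membership.
Entry = Tuple[str, str]


def membership_history(initial: Set[str], log: List[Entry]) -> List[FrozenSet[str]]:
    """Replay the log and record the member set after every configuration change."""
    members = set(initial)
    history = [frozenset(members)]
    for kind, arg in log:
        if kind == "add":
            members.add(arg)
        elif kind == "del":
            members.discard(arg)
        else:
            continue  # ordinary commands do not affect membership
        history.append(frozenset(members))
    return history


log: List[Entry] = [("set", "x=1"), ("add", "D"), ("set", "y=2"),
                    ("set", "z=3"), ("del", "A")]
history = membership_history({"A", "B", "C"}, log)
```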
How Scylla Raft is special
Scylla Raft implements a number of important extensions:
■ Increased liveness for very large clusters (1000+ nodes)
■ Resilience against asymmetric network failures
■ Read and write support on all cluster nodes
■ Efficient multi-Raft: every node can replicate many state machines
Setting up a fresh cluster
On a fresh start, a Scylla node:
■ Generates and persists a unique random Server ID (UUID)
■ Contacts all known peers
■ Creates a new Raft Group ID and a new cluster, but strictly after:
• contacting all peers in the seeds: list
• exchanging all known Server IDs
• AND not finding an existing cluster
• AND only if this Server ID is lexicographically the smallest
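The bootstrap rule above could be sketched as follows. This is a simplified model under stated assumptions, not Scylla's code: `should_create_cluster` is a hypothetical helper, peer discovery is assumed to have completed already, and "lexicographically the smallest" is modeled by comparing the string form of the UUIDs.

```python
# Illustrative only: decide whether this node should create the new cluster.
import uuid
from typing import Iterable, Optional


def should_create_cluster(my_id: uuid.UUID,
                          peer_ids: Iterable[uuid.UUID],
                          existing_group_id: Optional[uuid.UUID]) -> bool:
    """Return True only if no cluster exists yet and my_id is the smallest known ID."""
    if existing_group_id is not None:
        return False  # a cluster already exists: join it instead of creating one
    all_ids = set(peer_ids) | {my_id}
    return str(my_id) == min(str(i) for i in all_ids)


# Fixed IDs keep the example deterministic; a real node would use uuid.uuid4().
ids = [uuid.UUID(int=n) for n in (1, 2, 3)]
smallest_creates = should_create_cluster(ids[0], ids[1:], None)
other_waits = should_create_cluster(ids[1], [ids[0], ids[2]], None)
joins_existing = should_create_cluster(ids[0], ids[1:], uuid.UUID(int=99))
```

Because every node evaluates the same rule over the same set of exchanged IDs, at most one of them decides to create the cluster.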
[Figure: five nodes, numbered 1 to 5, each labeled with the peer list "2, 3"; a later panel shows the list "2, 3, 1, 4, 5".]
Schema changes on Raft
How do schema changes on Raft work?
To execute a DDL statement, the server:
■ Takes a Raft read barrier
■ Reads the latest schema and validates the CQL
■ Builds a Raft command and signs it with the old and new schema IDs
■ Once the command is committed, applies it only if the old schema ID still matches the current one
■ Retries if the commit or apply fails
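The steps above amount to an optimistic compare-and-set on the schema ID. The sketch below models that idea in a single process under loud assumptions: `SchemaCommand`, `SchemaStateMachine`, and `execute_ddl` are hypothetical names, "signing" is modeled as simply carrying both IDs, and the read barrier, CQL validation, and Raft commit are replaced by direct calls.

```python
# Illustrative only: apply a schema command only if it was built on the current schema.
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SchemaCommand:
    old_schema_id: uuid.UUID   # schema version the command was built against
    new_schema_id: uuid.UUID   # schema version after the change
    change: Dict[str, str]     # table name -> definition, e.g. {"t": "a int primary key"}


@dataclass
class SchemaStateMachine:
    schema_id: uuid.UUID = field(default_factory=uuid.uuid4)
    tables: Dict[str, str] = field(default_factory=dict)

    def apply(self, cmd: SchemaCommand) -> bool:
        """Apply a committed command; reject it if it is based on a stale schema."""
        if cmd.old_schema_id != self.schema_id:
            return False
        self.tables.update(cmd.change)
        self.schema_id = cmd.new_schema_id
        return True


def execute_ddl(sm: SchemaStateMachine, change: Dict[str, str], max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        # Stand-in for "read barrier + read latest schema + build command".
        cmd = SchemaCommand(sm.schema_id, uuid.uuid4(), change)
        # Stand-in for "commit through Raft, then apply".
        if sm.apply(cmd):
            return True
    return False


sm = SchemaStateMachine()
stale = SchemaCommand(sm.schema_id, uuid.uuid4(), {"t": "a int primary key"})
first_ok = execute_ddl(sm, {"u": "b int primary key"})   # wins the race
stale_applied = sm.apply(stale)                           # built on the old schema ID
retried_ok = execute_ddl(sm, {"t": "a int primary key"})  # rebuilt on the latest schema
```

The stale command is rejected rather than merged, which is what makes two concurrent DDL statements safe in this model: the loser rebuilds its command against the latest schema and tries again.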
Availability of DML
[Figure: servers S1, S2, and S3 and a Raft log holding CREATE TABLE t, ADD COLUMN b, CREATE INDEX t_i1; the legend marks schema fetches.]
Solved issues
■ Concurrent DDL is now safe
■ Still experimental, under --experimental-features=raft
■ Enabled once all nodes run 5.0
Introduced issues
Raft prefers CONSISTENCY over AVAILABILITY. What does it mean?
■ Two-datacenter setups become more fragile
■ Prefer an odd number of DCs to avoid split brain
■ After a permanent loss of majority, import SSTables into a new cluster
■ A 5.0 cluster with Raft enabled can’t be downgraded to 4.x
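The fragility of two-datacenter setups follows from majority arithmetic. The sketch below assumes every node is a voting member of a single Raft group and that a whole datacenter becomes unreachable at once; `majority` and `survives_dc_loss` are hypothetical helpers.

```python
# Illustrative only: does the cluster keep a Raft majority after losing one DC?
from typing import Dict


def majority(n: int) -> int:
    """Smallest number of voters that forms a majority of n."""
    return n // 2 + 1


def survives_dc_loss(nodes_per_dc: Dict[str, int], lost_dc: str) -> bool:
    total = sum(nodes_per_dc.values())
    remaining = total - nodes_per_dc[lost_dc]
    return remaining >= majority(total)


two_dcs = {"dc1": 3, "dc2": 3}
three_dcs = {"dc1": 2, "dc2": 2, "dc3": 2}

# With two equal DCs, losing either one leaves 3 of 6 voters: no majority.
two_dc_ok = all(survives_dc_loss(two_dcs, dc) for dc in two_dcs)
# With three equal DCs, losing any one leaves 4 of 6 voters: still a majority.
three_dc_ok = all(survives_dc_loss(three_dcs, dc) for dc in three_dcs)
```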
Next steps
■ Testing, testing, testing
■ Topology changes on Raft
■ Tablets
Thank you!
Stay in touch
Konstantin Osipov
@kostja_osipov
kostja@scylladb.com

More Related Content

What's hot

RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
MIJIN AN
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Igor Anishchenko
 
Implementing Highly Performant Distributed Aggregates
Implementing Highly Performant Distributed AggregatesImplementing Highly Performant Distributed Aggregates
Implementing Highly Performant Distributed Aggregates
ScyllaDB
 
Play with FILE Structure - Yet Another Binary Exploit Technique
Play with FILE Structure - Yet Another Binary Exploit TechniquePlay with FILE Structure - Yet Another Binary Exploit Technique
Play with FILE Structure - Yet Another Binary Exploit Technique
Angel Boy
 
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
confluent
 
Galera explained 3
Galera explained 3Galera explained 3
Galera explained 3
Marco Tusa
 
Best practices for MySQL High Availability
Best practices for MySQL High AvailabilityBest practices for MySQL High Availability
Best practices for MySQL High Availability
Colin Charles
 
My sql failover test using orchestrator
My sql failover test  using orchestratorMy sql failover test  using orchestrator
My sql failover test using orchestrator
YoungHeon (Roy) Kim
 
Query Optimizer – MySQL vs. PostgreSQL
Query Optimizer – MySQL vs. PostgreSQLQuery Optimizer – MySQL vs. PostgreSQL
Query Optimizer – MySQL vs. PostgreSQL
Christian Antognini
 
A Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache CassandraA Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache Cassandra
DataStax Academy
 
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and BeyondScylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
ScyllaDB
 
Maria db 이중화구성_고민하기
Maria db 이중화구성_고민하기Maria db 이중화구성_고민하기
Maria db 이중화구성_고민하기
NeoClova
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


Cloudera, Inc.
 
MySQL/MariaDB Proxy Software Test
MySQL/MariaDB Proxy Software TestMySQL/MariaDB Proxy Software Test
MySQL/MariaDB Proxy Software Test
I Goo Lee
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
The Hive
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
Yoshinori Matsunobu
 
MariaDB MaxScale monitor 매뉴얼
MariaDB MaxScale monitor 매뉴얼MariaDB MaxScale monitor 매뉴얼
MariaDB MaxScale monitor 매뉴얼
NeoClova
 
Galera Cluster Best Practices for DBA's and DevOps Part 1
Galera Cluster Best Practices for DBA's and DevOps Part 1Galera Cluster Best Practices for DBA's and DevOps Part 1
Galera Cluster Best Practices for DBA's and DevOps Part 1
Codership Oy - Creators of Galera Cluster
 
Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka
confluent
 
Kernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uringKernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uring
Anne Nicolas
 

What's hot (20)

RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
Implementing Highly Performant Distributed Aggregates
Implementing Highly Performant Distributed AggregatesImplementing Highly Performant Distributed Aggregates
Implementing Highly Performant Distributed Aggregates
 
Play with FILE Structure - Yet Another Binary Exploit Technique
Play with FILE Structure - Yet Another Binary Exploit TechniquePlay with FILE Structure - Yet Another Binary Exploit Technique
Play with FILE Structure - Yet Another Binary Exploit Technique
 
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
Everything You Always Wanted to Know About Kafka’s Rebalance Protocol but Wer...
 
Galera explained 3
Galera explained 3Galera explained 3
Galera explained 3
 
Best practices for MySQL High Availability
Best practices for MySQL High AvailabilityBest practices for MySQL High Availability
Best practices for MySQL High Availability
 
My sql failover test using orchestrator
My sql failover test  using orchestratorMy sql failover test  using orchestrator
My sql failover test using orchestrator
 
Query Optimizer – MySQL vs. PostgreSQL
Query Optimizer – MySQL vs. PostgreSQLQuery Optimizer – MySQL vs. PostgreSQL
Query Optimizer – MySQL vs. PostgreSQL
 
A Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache CassandraA Deep Dive Into Understanding Apache Cassandra
A Deep Dive Into Understanding Apache Cassandra
 
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and BeyondScylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
Scylla Summit 2022: The Future of Consensus in ScyllaDB 5.0 and Beyond
 
Maria db 이중화구성_고민하기
Maria db 이중화구성_고민하기Maria db 이중화구성_고민하기
Maria db 이중화구성_고민하기
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
MySQL/MariaDB Proxy Software Test
MySQL/MariaDB Proxy Software TestMySQL/MariaDB Proxy Software Test
MySQL/MariaDB Proxy Software Test
 
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of FacebookTech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
MariaDB MaxScale monitor 매뉴얼
MariaDB MaxScale monitor 매뉴얼MariaDB MaxScale monitor 매뉴얼
MariaDB MaxScale monitor 매뉴얼
 
Galera Cluster Best Practices for DBA's and DevOps Part 1
Galera Cluster Best Practices for DBA's and DevOps Part 1Galera Cluster Best Practices for DBA's and DevOps Part 1
Galera Cluster Best Practices for DBA's and DevOps Part 1
 
Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka Real-time Data Streaming from Oracle to Apache Kafka
Real-time Data Streaming from Oracle to Apache Kafka
 
Kernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uringKernel Recipes 2019 - Faster IO through io_uring
Kernel Recipes 2019 - Faster IO through io_uring
 

Similar to Scylla Summit 2022: Making Schema Changes Safe with Raft

Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
ScyllaDB
 
Deep Dive into Cassandra
Deep Dive into CassandraDeep Dive into Cassandra
Deep Dive into Cassandra
Brent Theisen
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
Christian Johannsen
 
Intro to Cassandra
Intro to CassandraIntro to Cassandra
Intro to Cassandra
DataStax Academy
 
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLModeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQL
ScyllaDB
 
Cassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + DynamoCassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + Dynamo
jbellis
 
How to leave the ORM at home and write SQL
How to leave the ORM at home and write SQLHow to leave the ORM at home and write SQL
How to leave the ORM at home and write SQL
MariaDB plc
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
Stu Hood
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
Aaron Ploetz
 
Cassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache CassandraCassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache Cassandra
DataStax Academy
 
Introduction to Cassandra - Denver
Introduction to Cassandra - DenverIntroduction to Cassandra - Denver
Introduction to Cassandra - Denver
Jon Haddad
 
Scaling web applications with cassandra presentation
Scaling web applications with cassandra presentationScaling web applications with cassandra presentation
Scaling web applications with cassandra presentationMurat Çakal
 
Use Your MySQL Knowledge to Become an Instant Cassandra Guru
Use Your MySQL Knowledge to Become an Instant Cassandra GuruUse Your MySQL Knowledge to Become an Instant Cassandra Guru
Use Your MySQL Knowledge to Become an Instant Cassandra Guru
Tim Callaghan
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandrashimi_k
 
On Rails with Apache Cassandra
On Rails with Apache CassandraOn Rails with Apache Cassandra
On Rails with Apache Cassandra
Stu Hood
 
Performance Tuning Best Practices
Performance Tuning Best PracticesPerformance Tuning Best Practices
Performance Tuning Best Practiceswebhostingguy
 
Oss4b - pxc introduction
Oss4b   - pxc introductionOss4b   - pxc introduction
Oss4b - pxc introduction
Frederic Descamps
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka Meetup
Cliff Gilmore
 
NewSQL Database Overview
NewSQL Database OverviewNewSQL Database Overview
NewSQL Database OverviewSteve Min
 

Similar to Scylla Summit 2022: Making Schema Changes Safe with Raft (20)

Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
 
Deep Dive into Cassandra
Deep Dive into CassandraDeep Dive into Cassandra
Deep Dive into Cassandra
 
Cassandra
CassandraCassandra
Cassandra
 
Apache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek BerlinApache Cassandra at the Geek2Geek Berlin
Apache Cassandra at the Geek2Geek Berlin
 
Intro to Cassandra
Intro to CassandraIntro to Cassandra
Intro to Cassandra
 
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLModeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQL
 
Cassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + DynamoCassandra: Open Source Bigtable + Dynamo
Cassandra: Open Source Bigtable + Dynamo
 
How to leave the ORM at home and write SQL
How to leave the ORM at home and write SQLHow to leave the ORM at home and write SQL
How to leave the ORM at home and write SQL
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
 
Cassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache CassandraCassandra Day Denver 2014: Introduction to Apache Cassandra
Cassandra Day Denver 2014: Introduction to Apache Cassandra
 
Introduction to Cassandra - Denver
Introduction to Cassandra - DenverIntroduction to Cassandra - Denver
Introduction to Cassandra - Denver
 
Scaling web applications with cassandra presentation
Scaling web applications with cassandra presentationScaling web applications with cassandra presentation
Scaling web applications with cassandra presentation
 
Use Your MySQL Knowledge to Become an Instant Cassandra Guru
Use Your MySQL Knowledge to Become an Instant Cassandra GuruUse Your MySQL Knowledge to Become an Instant Cassandra Guru
Use Your MySQL Knowledge to Become an Instant Cassandra Guru
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
On Rails with Apache Cassandra
On Rails with Apache CassandraOn Rails with Apache Cassandra
On Rails with Apache Cassandra
 
Performance Tuning Best Practices
Performance Tuning Best PracticesPerformance Tuning Best Practices
Performance Tuning Best Practices
 
Oss4b - pxc introduction
Oss4b   - pxc introductionOss4b   - pxc introduction
Oss4b - pxc introduction
 
Chicago Kafka Meetup
Chicago Kafka MeetupChicago Kafka Meetup
Chicago Kafka Meetup
 
NewSQL Database Overview
NewSQL Database OverviewNewSQL Database Overview
NewSQL Database Overview
 

More from ScyllaDB

Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
ScyllaDB
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
ScyllaDB
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
ScyllaDB
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
ScyllaDB
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
ScyllaDB
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
ScyllaDB
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
ScyllaDB
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
ScyllaDB
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
ScyllaDB
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
ScyllaDB
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
ScyllaDB
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
ScyllaDB
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
ScyllaDB
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
ScyllaDB
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
ScyllaDB
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
ScyllaDB
 

More from ScyllaDB (20)

Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
What Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQLWhat Developers Need to Unlearn for High Performance NoSQL
What Developers Need to Unlearn for High Performance NoSQL
 
Low Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & PitfallsLow Latency at Extreme Scale: Proven Practices & Pitfalls
Low Latency at Extreme Scale: Proven Practices & Pitfalls
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDBBeyond Linear Scaling: A New Path for Performance with ScyllaDB
Beyond Linear Scaling: A New Path for Performance with ScyllaDB
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
Database Performance at Scale Masterclass: Workload Characteristics by Felipe...
 
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear ScalabilityPowering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability
 
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx7 Reasons Not to Put an External Cache in Front of Your Database.pptx
7 Reasons Not to Put an External Cache in Front of Your Database.pptx
 
Getting the most out of ScyllaDB
Getting the most out of ScyllaDBGetting the most out of ScyllaDB
Getting the most out of ScyllaDB
 
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a MigrationNoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
NoSQL Database Migration Masterclass - Session 2: The Anatomy of a Migration
 
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration LogisticsNoSQL Database Migration Masterclass - Session 3: Migration Logistics
NoSQL Database Migration Masterclass - Session 3: Migration Logistics
 
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and ChallengesNoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
NoSQL Data Migration Masterclass - Session 1 Migration Strategies and Challenges
 

Recently uploaded

UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
Scylla Summit 2022: Making Schema Changes Safe with Raft

  • 1. Making Schema Changes Safe with Raft Konstantin Osipov ScyllaDB Engineering
  • 2. Konstantin Osipov ■ Worked on lightweight transactions in Scylla ■ Crazy about distributed system testing ■ Lives in Moscow, father of two Director of Engineering, ScyllaDB
  • 4. Scylla schema features You can: CREATE or DROP ● KEYSPACE, TABLE, ● VIEW, INDEX, CHANGE OPTIONS: ● of replication, table, view, index options You can’t: MODIFY ● RENAME keyspace, table, column ● CHANGE column type CONSTRAINT: ● UNIQUE, CHECK, FOREIGN
  • 5. Consistency Model of Schema Changes id first last 1 John Doe Time Node A: Node B: id first last email 1 John Doe 2 Jenny Smith j@... id first last email phone 1 John Doe 2 Jenny Smith j@... (867) id first last phone 1 John Doe 2 Jenny Smith (867) Split brain
  • 6. (In)consistency of Schema Changes cqlsh:test> create table t (a int primary key); ----------------------------------------------- split ------------------------------------------ cqlsh:test> alter table t rename a to d; Warning: schema version mismatch detected cqlsh:test> insert into t (d) values (1); Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. cqlsh:test> insert into t (a) values (1); Unknown identifier a
  • 7. Schema changes need strong consistency In Cassandra… In Scylla… CASSANDRA-10250, CASSANDRA-10699, CASSANDRA-14957, … #2921, #4648, #6455, #7426, #8968, #9774 …
  • 9. Raft intro Raft is a protocol for state machine replication. What does it mean? ■ The majority of nodes have the same state ■ State transition happens in the same order on all nodes ■ Cluster topology is part of the state
  • 10. How Raft achieves consistency Consensus module State machine Log x←1 y←2 z←3 Consensus module State machine Log x←1 y←2 z←3 Consensus module State machine Log x←1 y←2 z←3 Node A Node B Node C
  • 11. Raft leadership changes Election starts: S1 is a candidate: More candidates: S1 is elected leader:
  • 12. Raft configuration changes x←1 add node D y←2 z←3 del node A Time Replicated log
  • 13. How Scylla Raft is special Scylla Raft implements a number of important extensions: ■ Increased liveness for very large clusters (1000+ nodes) ■ Resilience against asymmetric network failures ■ Read and write support on all cluster nodes ■ Efficient multi-raft: every node can replicate many state machines
  • 14. Setting up a fresh cluster On a fresh start, Scylla node: ■ Generates and persists unique random Server ID (UUID) ■ Contacts all known peers. Strictly after: • contacting all peers in seeds: list • exchanging all known Server IDs • AND not finding an existing cluster • AND if this Server ID is lexicographically the smallest ■ Creates a new Raft Group ID and a new cluster
  • 15. 2, 3 2, 3 2, 3 2, 3 2, 3 Setting up a fresh cluster 1 2 3 4 5 1 2 3 4 5 2, 3, 1, 4, 5 1 2 3 4 5
  • 17. How Scylla changes on Raft work? To execute a DDL statement, the server: ■ Takes Raft read barrier ■ Reads the latest schema and validates CQL ■ Builds Raft command and signs it with old and new schema id ■ Once command is committed, it’s applied only if old schema id is the same ■ Retries if commit or apply failed
  • 18. Availability of DML S1 S2 S3 CREATE TABLE t ADD COLUMN b CREATE INDEX t_i1 Raft log: - schema fetch
  • 19. Solved issues ■ Concurrent DDL is now safe ■ still under --experimental-features-raft ■ Enabled if all nodes are 5.0
  • 20. Introduced issues Raft prefers CONSISTENCY over AVAILABILITY. What does it mean? ■ 2-data center set ups become more fragile ■ Prefer odd number of DCs to avoid split brain ■ Import sstables into a new cluster if permanent loss of majority ■ 5.0 cluster with Raft can’t downgrade to 4.x
  • 21. Next steps ■ Testing, testing, testing ■ Topology changes on Raft ■ Tablets
  • 22. Thank you! Stay in touch Konstantin Osipov @kostja_osipov kostja@scylladb.com

Editor's Notes

  1. Hi, I’m Kostja, and today I’m going to talk about schema changes on Raft.
  2. The easiest way to think about database schema comes from the world of relations, where it’s a collection of headers of relational tables. Each column has a column name and a data type, meaning all cells in this column conform to the constraints of this type. The need for a schema comes from the desire to save on storage, specifically to avoid storing the same name and type information in each cell, as well as to make relational algebra possible. Join conditions are based on cell equality, something difficult to define for cells of different types. Just like relational databases, Scylla requires column types and enforces type constraints. It also supports complex types, such as sets, maps, lists and user defined types. This makes its data representation more compact compared to document databases, but does require the clients to think through their schemas when designing an application. A less commonly considered part of database schema is replication and storage properties, indexes, views, and access rights. Finally, there is the power of data definition verbs to change the existing schema. It could be possible to add or drop columns, change column types, and add unique, check and foreign key constraints. Replicating schema statements across nodes has to use its own path, as it impacts all nodes in the cluster and requires coordinated error handling - you can’t let an operation fail on one node while it succeeds on another. Some data definition operations require a complete scan or even a rebuild of data in the table. E.g. building a unique index must check that the table has no duplicates in the given column. Changing a column type may require converting each cell from one physical representation to another. Scylla currently avoids rewriting data in schema change operations. Instead, it transforms data on the fly, to make it conform with the client schema.
For example, if the current schema doesn’t contain a column but the actual row still stores it, the column is removed from the row before it’s returned to the user. The advantage of this approach is that existing schema change statements are lightweight and less error prone. Some complex data transformations, however, have to be done on the client, which then has to concern itself with the consistency of the operation. In a distributed environment each node can have a slightly different version of the schema. To be able to return consistent results in this environment, Scylla signs data retrieval operations passed between nodes with the schema version. The receiving node must make sure that the returned data is in the format required by the client. If it is a newer format that this node is not yet aware of, it will request the schema information from the sender.
  3. Let’s recap what you can and can’t do in Scylla. You can: create tables, views, UDTs, add or drop columns, truncate data, define roles and access rights. You can’t: change column types (you can do compatible changes, like text to blob), have unique secondary keys, triggers or constraints, or rename a column. Some of the unsupported operations can’t be implemented in an eventually consistent environment. It would be difficult to support unique secondary keys unless the definition of uniqueness is the same on all nodes - a duplicate may slip through the cracks when a node is being added and data is being moved from one node to another. Being a multi-consistency-model database, it’s not impossible that Scylla will eventually get these advanced schema features - so we’d better begin building a foundation for them now. Questions to the audience: How many people would like to have unique secondary keys? How many people use materialized views? How many people are unhappy with materialized views in Scylla? Why?
  4. To see why Scylla supports some schema changes and not others, let’s consider how schema changes may fail and how Scylla recovers from such a failure. One obvious kind of failure is a down node. Schema changes are allowed to proceed even if all but one node in the cluster are down. What happens then? When the nodes come back up, they’ll learn through gossip that the cluster has received a schema update and fetch it from the node that has it. But what if the node or nodes are not down but partitioned away? It’s possible that two subsets of the cluster operate using different versions of the schema. Eventually, when the partition heals, nodes learn about the changes, and as they run compaction their data is converted to the most recent format.
  5. A more complex scenario is when two concurrent changes conflict. E.g. imagine one part of the cluster adds a column to a table, and another part adds a different column with the same definition. The schema with the most recent timestamp is going to win, but if the column received any updates corresponding to the “shadowed” definition of the schema, they will be lost after schema reconciliation. There are many other ways in which eventual schema consistency may backfire. You may not see the table you just created: “schema agreement” happens via gossip, so it can easily take seconds, and if the node receiving a write is not the same node which received the schema change, it does not learn about it immediately. Clients may get incorrect errors, since schema changes use the local node’s view of the schema, which might not be fully up to date. There is no prevention against duplicate attempts to create any object, not just columns: e.g. keyspace, table, view, index. Both concurrent attempts may succeed, with the object which has a newer creation timestamp shadowing the older one with the same name, along with its entire contents. Not all divergences in schema can be merged with eventual consistency rules. Dropping a user defined type when the schema relies on it renders the table unusable: the contents of the table become inaccessible. Concurrent changes to the same object can be lost, like some added columns in Cassandra issue 10250. Interaction of schema changes and topology changes produces another bouquet of gotchas. Changing the keyspace replication factor does not take immediate effect and, if done during a topology change, can lead to data loss. A failure during a topology change can violate LWT linearizability. Writes may be lost during topology changes. To sum up, the only (dubious) advantage of the algorithm is total liveness - the cluster is able to make progress with data definition verbs even in the presence of a majority failure. 
The client is responsible for making schema changes always through the same node, thus manually enforcing linearizability. Schema reconciliation algorithms are not a feature, as their specific behaviour is not documented. Rather, it’s a best effort to patch up for otherwise undefined behaviour.
  6. The problem has been acknowledged in the Cassandra and Scylla communities as far back as 2015, with duplicate reports constantly piling up. Strongly consistent features of Scylla such as LWT and the upcoming tablets aren’t really consistent unless DDL and topology changes are consistent as well.
  7. Welcome Raft, the base algorithm used for strong consistency in Scylla 5.0. Let’s talk about it in more detail.
  8. Raft is often called a protocol for state machine replication. A database is a kind of state machine, and replicating a database means having the same copy of data on every node. For schema, the replicated state is keyspace, table and view definitions. By means of Raft we can make sure each cluster node not only has the same copy of the data, but also applies all state changes - that is, data definition commands - in the same order. Moreover, if nodes restart, join or leave, the order must stay the same. System liveness must be preserved as long as the majority of the cluster is up. Handling of node failures, joining and leaving as part of the protocol for data replication was the primary reason Scylla settled on using Raft for schema consistency. Let’s discuss it in more detail.
  9. For a deep dive into Raft, I recommend “Raft study” - a video lecture by John Ousterhout, as well as Raft PhD - key chapters are 1 through 4. If you’re looking into writing your own implementation, having studied many, I encourage you to look into Scylla Raft - in my opinion, it is highly isolated, commented and carefully tested. So what’s the idea of a replicated state machine? Suppose you had a program or an application that you wanted to make reliable. One way to do that is to execute that program on a collection of machines and ensure they execute it in exactly the same way. So a state machine is just a program or an application that takes inputs and produces outputs. A replicated log can help to make sure that these state machines execute exactly the same commands. Here’s how it works. A client of the system that wants to execute a command passes it to one of these machines. That command, let’s call it X, then gets recorded in the log of the local machine, and then, in addition, the command is passed to the other machines and recorded in their logs as well. Once the command has been safely replicated in the logs, it can be passed to the state machines for execution. And when one of the state machines has finished executing the command, the result can be returned to the client program. And you can see that as long as the logs on the state machines are identical, and the state machines execute the commands in the same order, we know they are going to produce the same results. So it’s the job of the consensus module to ensure that the command is replicated and then pass it to the state machine for execution. The system makes progress as long as any majority of the servers are up and can communicate with each other (2 out of 3, 3 out of 5).
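The replicated-log idea in the note above can be illustrated with a toy model. This is a minimal sketch, not Scylla's implementation: the `Server` class and `replicate` function are hypothetical names, there is no leader election or log repair, and the "leader" simply refuses a command when it cannot reach a majority.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """One replica: a log of commands plus a key-value state machine."""
    name: str
    up: bool = True
    log: list = field(default_factory=list)
    state: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries already applied

    def apply_committed(self, commit_index):
        # Apply committed entries strictly in log order.
        while self.applied < commit_index:
            key, value = self.log[self.applied]
            self.state[key] = value
            self.applied += 1


def replicate(servers, command, commit_index):
    """Record a command on all reachable servers, then apply it.

    Simplification: the command is rejected up front unless a majority
    of servers is reachable, so logs never hold uncommitted entries.
    Returns the (possibly advanced) commit index.
    """
    reachable = [s for s in servers if s.up]
    if len(reachable) * 2 <= len(servers):
        return commit_index  # no majority: no progress
    for s in reachable:
        s.log.append(command)
    commit_index += 1
    for s in reachable:
        s.apply_committed(commit_index)
    return commit_index
```

With three servers and one of them down, commands still commit (2 out of 3); once a second server goes down, `replicate` returns the commit index unchanged.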
  10. In Raft, the servers are not equal at any given point in time. Clients communicate with the leader, and the leader communicates with other servers to replicate commands. This decomposes the consensus problem into two parts: normal operation, when there is a running leader, and what you do when a leader crashes and a new leader needs to be elected. Being an otherwise asymmetric, leader-based algorithm, Raft falls back to symmetry for leader election. Any follower that doesn’t get updates from the leader for the duration of an election timeout is able to become a candidate and request others to vote for it. The candidate which gets a majority of votes declares itself the leader and begins replicating logs to all other members of the cluster. Split votes, i.e. situations when no candidate gets a majority of votes to become a leader, are possible. In November 2020 Cloudflare recorded an outage of a few hours due to a prolonged failure to elect a leader in the presence of an asymmetric network failure. Packets were routed from one part of the network to the other, but not back. Nodes would repeatedly time out and request votes, and this would upset the existing leader. Newer versions of Raft implement a special extension, called pre-voting, mandating fresh candidates to dry-run an election before starting a real one and disrupting the existing leader. Nodes which get pings from the current leader vote negatively during pre-vote. If pre-voting does not collect a majority of votes, the candidate doesn’t start the real election. Scylla has pre-voting implemented and always on. It turns out that it allows us to simplify other parts of Raft, specifically to drop Raft rules related to sticky leadership. This was made possible thanks to our decision to implement Raft from scratch on top of Seastar, and not adopt an existing implementation.
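The pre-vote rule described above can be sketched as follows. This is an illustration under assumptions, not Scylla's code: the function names and the use of plain timestamps are hypothetical, and a real pre-vote also involves terms and log comparison, which are left out here.

```python
def grant_pre_vote(now, last_leader_contact, election_timeout):
    """A voter refuses the pre-vote while it still hears from a leader."""
    return now - last_leader_contact >= election_timeout


def may_start_election(now, voters_last_contact, election_timeout):
    """Decide whether a candidate may proceed to a real election.

    voters_last_contact lists, for every *other* node, the time it last
    heard from the leader. The candidate counts itself as one granted
    vote and needs a majority of the whole cluster.
    """
    granted = 1 + sum(
        grant_pre_vote(now, t, election_timeout) for t in voters_last_contact
    )
    cluster_size = 1 + len(voters_last_contact)
    return granted * 2 > cluster_size
```

In the asymmetric-failure case, the cut-off node stops hearing the leader while the other nodes still do, so its pre-vote fails and the working leader is left undisturbed.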
  11. Another important reason Raft is so valuable for Scylla is that cluster configuration, or topology changes, are part of Raft core. In order to add or remove a node in Raft, the client applies a special command to the log, and once it’s replicated to the current majority, a new majority is formed. Scylla uses an extended, two-step configuration change procedure, allowing it to add or remove more than one node at a time, or to add and remove nodes in a single configuration change. This allows us to replace a node without the risk of rendering the cluster unusable if the replace fails. Another advanced feature of Scylla is being able to add non-voting members to the cluster. A non-voting member acts as a normal Raft node, but it can neither vote nor get elected. In Scylla, new nodes join the cluster as non-voting members, thus a join failure doesn’t impact Raft quorum (the rules for determining majority and thus making progress). A node that failed to join can be easily removed even if some nodes in the cluster are down. Once a node has completed advertising tokens and transferring ranges, it becomes a full voting member of the ring.
  12. There are other ways in which Scylla Raft is special. Given that Scylla clusters can be quite large, we paid special attention to making sure elections are swift. Scylla randomizes each node’s election timeout (the interval after which the node starts an election if it doesn’t hear from the current leader), and spreads it proportionally to the cluster size. Even in a 1000 node cluster each node will start an election roughly in its own time, allowing it to request votes from followers without interference from other candidates. Thanks to this, the typical time to elect a new leader is one to three seconds. We coded support for Raft read barriers and automatic forwarding of commands to the leader. This made Scylla Raft more symmetric, allowing followers to execute major parts of schema changes, offloading the leader, which only needs to log the mutations - changes of system table rows. Finally, we greatly reduced the cost of failure detection, allowing multiple Raft instances (we call these Raft groups) on a node to share a single failure detector. We plan to run a mini-Raft cluster of its own (a replica set) for each tablet, so we reduced the failure detection overhead a single replica set incurs as much as we could. The Scylla Raft implementation is isolated from disk and network, which allowed us to test it extensively without having to create large clusters. A minute-long test runs hundreds of thousands of configuration events alongside injected network splits, node failures and packet drops.
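The idea of spreading election timeouts in proportion to cluster size can be sketched as below. The note does not give the actual formula, so everything here is an assumption for illustration: the function name, the `base` and `spread_per_node` parameters, and their default values are all made up.

```python
import random


def election_timeout(cluster_size, base=1.0, spread_per_node=0.01, rng=random):
    """Pick a randomized election timeout in seconds.

    Hypothetical formula: a fixed base plus a random offset drawn from a
    window that grows linearly with the number of nodes, so that in a
    large cluster candidates are less likely to time out together.
    """
    window = spread_per_node * cluster_size
    return base + rng.uniform(0.0, window)
```

A wider window for a larger cluster lowers the chance that two nodes pick nearly the same timeout and split the vote.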
  13. Raft addresses many issues, but provides only basic initial cluster setup. In Scylla it’s long been a rule that nodes need to be added to the cluster one at a time, and the operator has to wait for the joining node to advertise its token and complete streaming before joining the next node. Apart from slowing down the overall join procedure and limiting the elasticity of the cluster, this procedure is also inherently unsafe: there was no safety net for operator error; for example, starting two new nodes at once could lead to some ranges being replicated incorrectly. Joining relied on gossip to propagate tokens, which added constant unpleasant delays to the procedure. With the transition to Raft, we wanted to address this problem as well. For it, we introduced a new protocol for cluster assembly, which we called “cluster discovery”. The idea of the protocol is as follows: if a joining node is able to contact any cluster node which already has an existing Raft group, it will use that group’s configuration machinery to join. But if we start several fresh nodes, there is no cluster yet. In that case, each new node continuously discovers its peers until it can build a complete map of the cluster, and as soon as it is able to do so, it may start a new Raft group.
  14. Consider a fresh ring of 5 nodes which haven’t formed a cluster yet. Imagine each node in the ring has information about initial contact points in the seeds section of its configuration file. The node contacts the initial peers and finds out there are other nodes in the cluster. It continues doing so until no new members appear and all existing members have responded. Then a new Raft configuration can be formed. Distributed protocols are not considered safe unless proved correct. To validate the correctness of the new Scylla discovery protocol we created a TLA+ specification and ran it to completion for all reasonable cluster sizes. In 5.0, where we must support pre-Raft nodes, the discovery protocol lives side by side with gossip. In the future, we’ll be able to switch and make Scylla cluster boot take subsecond time.
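The discovery round described above can be modelled as a small simulation. This is a sketch under assumptions, not the protocol Scylla verified with TLA+: `run_discovery` and `group_creator` are hypothetical names, every node is assumed reachable, no existing cluster is assumed to be found, and short strings stand in for UUID server IDs.

```python
def run_discovery(seeds):
    """Exchange peer lists until every node knows every other node.

    seeds maps each server ID to the IDs listed in its seeds section.
    Contacting a peer merges both sides' knowledge, so the contacted
    peer also learns about the node that contacted it.
    """
    known = {node: {node} | set(peers) for node, peers in seeds.items()}
    changed = True
    while changed:
        changed = False
        for node in seeds:
            for peer in list(known[node]):
                if peer == node:
                    continue
                merged = known[node] | known[peer]
                if merged != known[node] or merged != known[peer]:
                    known[node] = set(merged)
                    known[peer] = set(merged)
                    changed = True
    return known


def group_creator(known_ids):
    """The node with the lexicographically smallest ID creates the group."""
    return min(known_ids)
```

With five nodes that all list only `n2` and `n3` as seeds, every node ends up knowing all five IDs, and only `n1` is entitled to create the new Raft group.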
  15. With such a strong foundation, implementing schema changes on Raft was a breeze.
  16. As you perhaps know, Scylla stores all schema definitions in system tables. Each node has a copy, and the node which received the mutation propagates it to all other nodes. So the biggest change in 5.0 is that this propagation of the change is no longer eventual but happens through Raft. All cluster nodes become part of a Raft group we internally call “group 0”, since it’s the first group ever. The group has a single leader which is actively pushing all changes to all members. Any member willing to make a change forwards it to the leader, which commits it on the majority before materializing it as a new schema version. If a node is disconnected from the leader, or disconnected from the majority, it can no longer make a schema change. For a connected node, the steps are as follows. Before a node executes the command, it reads the latest schema, issuing a Raft read barrier. Imagine you want to drop a table. You need to make sure the table exists, and if it doesn’t, return an error to the client. Similar validations happen for all CQL commands, and as a result a change to the system schema is built, which is recorded in the Raft log. But what if two nodes try to make changes at the same time? Both of them may be able to record their changes in the log. Or, if a connection to the leader is flickering, we may end up in a state of uncertainty, when the command has executed, but the caller doesn’t know the outcome. To protect against double execution or execution in the wrong order, each command is signed with the old and new schema id. When the command is applied, the current state of the schema must match the old id, and the new id is recorded in the schema version history, so that newer commands may fail. In cases of uncertainty the change id is used to validate that the command has indeed been applied to the state machine. If a race is detected, application of the command turns into a no-op and the entire procedure is restarted.
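The old-id/new-id check described above behaves like a compare-and-set on the schema version. The sketch below illustrates that rule only; `SchemaCommand` and `SchemaStateMachine` are hypothetical names, schema ids are plain strings, and the Raft log, read barrier and retry loop are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaCommand:
    old_id: str      # schema id the command was built against
    new_id: str      # schema id the command produces
    statement: str   # the DDL text, kept only for illustration


class SchemaStateMachine:
    def __init__(self, initial_id="v0"):
        self.schema_id = initial_id
        self.history = [initial_id]  # every schema id ever applied
        self.statements = []

    def apply(self, cmd):
        """Apply a committed command only if it was built on the current schema."""
        if cmd.old_id != self.schema_id:
            return False  # lost a race: no-op, the caller rebuilds and retries
        self.statements.append(cmd.statement)
        self.schema_id = cmd.new_id
        self.history.append(cmd.new_id)
        return True

    def was_applied(self, cmd):
        """Resolve uncertainty: did this command make it into the state machine?"""
        return cmd.new_id in self.history
```

If two nodes build commands against the same `old_id` and both reach the log, only the first one changes the schema. The second becomes a no-op and has to be rebuilt against the new schema id.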
  17. Did the switch to Raft have any impact on CRUD performance or availability? The way insert, update, or delete works with the schema in 5.0 is similar to 4.0. The coordinator signs the mutation with its schema version. If a mutation with an older version arrives at a node with a newer schema, it is automatically converted to the newer version. If a mutation with a newer version arrives at a node with an older schema, that node will fetch the schema from the coordinator. Schema fetch doesn’t need a leader to be around and can happen from any peer. This preserves DML availability guarantees during Raft leader changes or a network partition. The Raft leader is constantly pushing schema updates to all nodes, so cases of outdated schema should be much rarer. Conflicts between DDL and topology are still resolved eventually. Specifically, changes to the keyspace replication factor still take place without actual data replication, and when adding a node we still use gossip to wait for “schema agreement”.
  18. Let’s recap. Starting from Scylla 5.0, concurrent DDL is safe. Anomalies such as spurious errors, shadowed keyspaces, tables or columns are impossible. Schema propagation happens much faster, making it easier to write day to day applications. The feature still requires an experimental switch, and once it is enabled, there is no downgrade path. We’re actively working on weeding out all the bugs and turning off the experimental flag, which we hope to do later this year. Heterogeneous clusters continue to work, but will use old gossip-style communication to propagate schema. Starting from the next major release, Raft schema management will become the default, making the problems we discuss today a strict legacy.
  19. Not all changes introduced by Raft are rosy. There are cases when Raft’s preference for consistency over availability may impact production deployments. Raft preserves liveness as long as you have a majority of nodes. A split brain in a two DC setup is one notable case when the majority can be permanently lost. Scylla 5.0 with Raft will not admit any DDL statements in case of split brain, while DML will continue to work. We’re looking into introducing a nodetool command to promote such an isolated cluster into a new one; however, if the network split is temporary, this would be the wrong answer to the problem. Question: how much of an impact is this to you?
  20. Use of Raft doesn’t stop with schema changes. I welcome you to attend a talk by our distinguished engineer, Tomasz Grabiec, to learn more about our future plans.
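The way a replica handles a mutation signed with a schema version can be sketched like this. It is a simplified model, not Scylla's code: the `Replica` class is hypothetical, a schema is reduced to a list of column names, "conversion" only drops columns missing from the current schema, and an unknown version is assumed to be newer.

```python
class Replica:
    def __init__(self, known, current):
        self.known = dict(known)  # schema version -> list of column names
        self.current = current    # version this replica currently uses

    def receive(self, mutation, version, coordinator_schemas):
        """Accept a mutation signed with the coordinator's schema version."""
        if version not in self.known:
            # Unknown (assumed newer) version: fetch the schema from the
            # sender. No Raft leader is involved in this step.
            self.known[version] = coordinator_schemas[version]
            self.current = version
        columns = self.known[self.current]
        # Convert to the current format: keep only columns the schema has.
        return {c: v for c, v in mutation.items() if c in columns}
```

A mutation written against an older schema that still had an `email` column loses that column on arrival. A mutation carrying an unknown version triggers a schema fetch from the coordinator before it is applied.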
  21. Thanks! Thanks for attending, this was a session about schema changes on Raft in Scylla 5.0.