Lightweight transactions
at lightning speed
Konstantin Osipov, Software Team Lead
Presenter
Konstantin Osipov, Software Team Lead
Kostja is a well-known expert in the DBMS world who has spent most
of his career developing open-source DBMSs, including Tarantool
and MySQL. At ScyllaDB his focus is transaction support and
synchronous replication.
Quick takeaways
■ Download:
https://hub.docker.com/r/scylladb/scylla-nightly/tags
■ Use --experimental and follow a short tutorial:
https://github.com/scylladb/scylla/wiki/lwt
■ General availability is planned for the 3.2 milestone release
How this talk is structured
■ Lightweight transactions: syntax, semantics, benchmarks, metrics
■ Design & Implementation overview
■ Caveats
■ Future work
LWT at a glance
Pre-LWT: write fast path
CQL avoids slow reads
> UPDATE employees SET join_date = '2018-05-19' WHERE
firstname = 'John' AND lastname = 'Doe';
> SELECT * FROM employees ...;
firstname | lastname | join_date
-----------+----------+------------
John | Doe | 2018-05-19
CQL conditional statement
> UPDATE employees SET join_date = '2018-05-19' WHERE
firstname = 'John' AND lastname = 'Doe'
IF join_date != null;
[applied]
-----------
False
What statements can be conditional?
Any INSERT, UPDATE or DELETE can have an IF clause:
> UPDATE employees SET join_date = … IF EXISTS;
> INSERT INTO bookings (id, item, client, quantity) VALUES
(…) IF NOT EXISTS;
> UPDATE inventory SET state = 'Used' WHERE itemid = ?
IF state = 'Unused' AND check = 'Passed';
> DELETE FROM tasks WHERE project_id = ? AND task_id = ?
IF task['state'] IN ('Complete', 'Abandoned');
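The compare-and-set behaviour of an IF clause can be illustrated with a small in-memory model. This is only a sketch: the `conditional_update` helper, the dict-based table, and the item id 7 are hypothetical, and the model is single-threaded, so it shows what a conditional statement returns, not how Scylla makes the check and the write atomic across replicas.

```python
from typing import Any, Callable, Dict, Optional, Tuple

Row = Dict[str, Any]
Table = Dict[Tuple[Any, ...], Row]


def conditional_update(
    table: Table,
    key: Tuple[Any, ...],
    assignments: Row,
    condition: Callable[[Optional[Row]], bool],
) -> Tuple[bool, Optional[Row]]:
    """Apply `assignments` to the row at `key` only if `condition` holds.

    Returns ([applied], the row as it was before the statement), loosely
    mirroring the result set of a conditional statement.
    """
    current = table.get(key)
    before = dict(current) if current is not None else None
    if not condition(current):
        return False, before
    table.setdefault(key, {}).update(assignments)
    return True, before


# Toy model of:
#   UPDATE inventory SET state = 'Used' WHERE itemid = 7
#   IF state = 'Unused' AND check = 'Passed';
inventory: Table = {(7,): {"state": "Unused", "check": "Passed"}}


def unused_and_passed(row: Optional[Row]) -> bool:
    return (
        row is not None
        and row.get("state") == "Unused"
        and row.get("check") == "Passed"
    )


first = conditional_update(inventory, (7,), {"state": "Used"}, unused_and_passed)
second = conditional_update(inventory, (7,), {"state": "Used"}, unused_and_passed)
print(first[0], second[0])  # True False
```

The second call is rejected because the first one already changed `state`, which is the property that makes conditional statements usable for one-time transitions.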
Conditional batches
> BEGIN BATCH
> UPDATE tasks SET n_abandoned = 0 WHERE project_id = 1
> IF n_abandoned > 0
> DELETE FROM tasks WHERE project_id = 1
> AND state = 'Abandoned'
> APPLY BATCH;
 [applied] | project_id | state     | task_id | n_abandoned
-----------+------------+-----------+---------+-------------
      True |          1 | Abandoned |     693 |           2
Consistency considerations
■ New consistency command:
SERIAL CONSISTENCY [SERIAL|LOCAL_SERIAL]
■ The regular (eventual) CONSISTENCY setting is still used, for the final commit (learn) step
■ Consistency settings can be combined to reduce LWT latency
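To make the difference between the two serial levels concrete, the sketch below computes how many replicas must answer under each one. It assumes the usual majority rule (floor(n/2) + 1): SERIAL needs a majority of all replicas, LOCAL_SERIAL a majority of the replicas in the coordinator's datacenter. The datacenter names are placeholders; the replication factors (3, and 2+2) are the ones used in this deck's benchmark setups.

```python
from typing import Dict


def quorum(replicas: int) -> int:
    """Majority of `replicas`: floor(n / 2) + 1."""
    return replicas // 2 + 1


def serial_quorum(rf_per_dc: Dict[str, int]) -> int:
    """SERIAL: majority of all replicas across every datacenter."""
    return quorum(sum(rf_per_dc.values()))


def local_serial_quorum(rf_per_dc: Dict[str, int], local_dc: str) -> int:
    """LOCAL_SERIAL: majority of the replicas in the coordinator's datacenter."""
    return quorum(rf_per_dc[local_dc])


# Placeholder datacenter names; replication factors taken from the deck.
single_region = {"us-west-1": 3}
multi_region = {"us-west-1": 2, "us-west-2": 2}

print(serial_quorum(single_region))                    # 2 (of 3)
print(serial_quorum(multi_region))                     # 3 (of 4)
print(local_serial_quorum(multi_region, "us-west-1"))  # 2 (of 2)
```

With a 2+2 layout, SERIAL always has to wait for at least one replica in the remote region, while LOCAL_SERIAL stays within the local one, at the cost of serializing only within that datacenter.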
IF is the new WHERE?
Feature                                      WHERE   IF
Relation expressions >, <, >=, <=, ==, !=    Yes     Yes
IN condition                                 Yes     Yes
Collection element subscription, a['key']    Yes     Yes
UDT member subscription, a.key               Yes     No
Uses secondary index for search              Yes     No
TOKEN(), LIKE, UDF                           Yes     No
What you CAN’T DO
■ Use counter data type
■ Access multiple partitions
■ Supply custom TIMESTAMP
■ Use UNLOGGED
Differences with Cassandra
Difference                             Workaround
Per-core partitioning                  Use shard-aware driver for optimal performance
Scylla always provides a result set    No need
No Thrift support                      Don’t use Thrift.
Hints are not used                     No need
Performance
Setup: single region
Amazon EC2, availability zone US-West-1
■ RTT min/avg/max = 0.149/0.181/0.259 ms
■ 3 nodes, i3.2xlarge
● 8 vcores, Intel Xeon E5-2686 v4 2.3 GHz, 64 GB RAM, 1.9 TB NVMe SSD
■ Replication strategy: Simple
■ Replication factor: 3
■ Integer key and value
■ Go client t3.2xlarge
■ 1-100 connections
Uncontended write - bandwidth
Uncontended write - latency
Contended SERIAL write - bandwidth
Contended SERIAL write - latency
Setup: multiple regions
Amazon EC2, regions US-West-1 (2 nodes) and US-West-2 (2 nodes)
■ RTT min/avg/max = 20.74/20.77/20.81 ms
■ 4 nodes, i3.2xlarge
● 8 vcores, Intel Xeon E5-2686 v4 2.3 GHz, 64 GB RAM, 1.9 TB NVMe SSD
■ Replication strategy: NetworkTopologyStrategy
■ Replication factor: 2+2
■ Integer key and value
■ Go client t3.2xlarge
■ 1-100 connections
Uncontended write - bandwidth
Uncontended write - latency
Metrics
■ CQL counters:
scylla_cql_{inserts|updates|deletes|batches}
Label: conditional={yes|no}
■ storage proxy metrics:
scylla_storage_proxy_coordinator_{read|write}_
{latency|timeouts|unavailable|contention|unfinished_commit|condition_not_met...}
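As a small illustration, the CQL counter pattern above can be expanded into concrete selectors. The sketch assumes the metrics are queried with Prometheus-style label selectors; only the names and label values listed on the slide are used, and the storage proxy pattern is left out because its list is elided.

```python
from itertools import product
from typing import List

STATEMENT_KINDS = ["inserts", "updates", "deletes", "batches"]
CONDITIONAL = ["yes", "no"]


def cql_counter_selectors() -> List[str]:
    """Expand scylla_cql_{inserts|updates|deletes|batches} with conditional={yes|no}."""
    return [
        f'scylla_cql_{kind}{{conditional="{flag}"}}'
        for kind, flag in product(STATEMENT_KINDS, CONDITIONAL)
    ]


for selector in cql_counter_selectors():
    print(selector)
```

Comparing the `conditional="yes"` and `conditional="no"` series for the same statement kind shows what share of the write load goes through the LWT path.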
Metrics
Under the hood
Foundations of architecture
Consistent hashing and vnodes
Shard-to-shard replication mesh
Replication strategy produces C(n, k) distinct replication groups.
16 nodes, replication factor 3: C(16, 3) = 560
Explodes for larger numbers: C(2560, 3) ≅ 2,796,202,666
Data within a node is further partitioned across shards:
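The replication-group counts quoted on this slide can be checked with the binomial coefficient from the Python standard library (`math.comb`, Python 3.8+). The helper name `replication_groups` is only illustrative.

```python
from math import comb


def replication_groups(nodes: int, replication_factor: int) -> int:
    """Number of distinct replica sets: C(nodes, replication_factor)."""
    return comb(nodes, replication_factor)


print(replication_groups(16, 3))    # 560
print(replication_groups(2560, 3))  # 2792926720
```

The exact value for 2560 is 2,792,926,720; the approximate figure on the slide matches the estimate n³/6 = 2,796,202,666, which is within about 0.1% of it.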
Introducing Paxos
R1    R2    R3
1. Can I propose a value?
2. Accept new value
3. Learn decision
Decision made
Introducing Paxos
R1    R2    R3
1. Can I propose a value?
2. Check condition
3. Accept new value
4. Learn decision
Decision made
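The four steps on this slide can be walked through with a toy, failure-free model. It is a sketch under several assumptions, not Scylla's implementation: ballots are plain integers chosen by the caller, every replica is reachable, and recovery of a previously accepted but unlearned value is omitted. The `Replica` class and `conditional_write` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Replica:
    promised: int = 0                               # highest ballot promised so far
    accepted: Optional[Tuple[int, str]] = None      # accepted but not yet learned
    committed: Tuple[int, Optional[str]] = (0, None)  # last learned (ballot, value)

    def prepare(self, ballot: int) -> bool:
        """Step 1: 'Can I propose a value?'"""
        if ballot > self.promised:
            self.promised = ballot
            return True
        return False

    def read(self) -> Tuple[int, Optional[str]]:
        """Step 2: return the current value so the condition can be checked."""
        return self.committed

    def accept(self, ballot: int, value: str) -> bool:
        """Step 3: 'Accept new value' unless a higher ballot was promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

    def learn(self, ballot: int, value: str) -> None:
        """Step 4: 'Learn decision' and make the value visible to reads."""
        self.committed = (ballot, value)
        self.accepted = None


def has_quorum(votes: List[bool]) -> bool:
    return sum(votes) > len(votes) // 2


def conditional_write(
    replicas: List[Replica],
    ballot: int,
    condition: Callable[[Optional[str]], bool],
    new_value: str,
) -> bool:
    """Run one conditional write; return the [applied] flag."""
    if not has_quorum([r.prepare(ballot) for r in replicas]):
        return False  # a higher ballot is already in progress
    # Use the most recently committed value among the answers.
    current = max((r.read() for r in replicas), key=lambda c: c[0])[1]
    if not condition(current):
        return False  # condition not met, nothing is written
    if not has_quorum([r.accept(ballot, new_value) for r in replicas]):
        return False
    for r in replicas:
        r.learn(ballot, new_value)
    return True


replicas = [Replica() for _ in range(3)]
results = [
    conditional_write(replicas, 1, lambda v: v is None, "Unused"),
    conditional_write(replicas, 2, lambda v: v == "Unused", "Used"),
    conditional_write(replicas, 3, lambda v: v == "Unused", "Used"),
    conditional_write(replicas, 2, lambda v: True, "Stale"),
]
print(results)  # [True, True, False, False]
```

The third call fails on the condition check, and the fourth is rejected in the first step because its ballot is lower than one the replicas have already promised.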
Caveats
Caveats
Issue                            Remedy
4 round trips are very costly    Optimize propose and read rounds
Contention/starvation            Implement Paxos leases
Uncertainty on timeout           Improved diagnostics
system.paxos state               Account for it in capacity planning
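The cost of the four round trips can be estimated from the average RTTs reported in the benchmark setups (0.181 ms within one region, 20.77 ms between regions). The sketch below computes only a network floor; it assumes the rounds run strictly one after another and that each one waits for a replica a full RTT away, and it ignores disk and CPU time.

```python
ROUND_TRIPS = 4  # propose request, condition check, accept, learn


def network_floor_ms(rtt_ms: float, round_trips: int = ROUND_TRIPS) -> float:
    """Lower bound on LWT latency from network round trips alone."""
    return round_trips * rtt_ms


print(f"single region: {network_floor_ms(0.181):.3f} ms")  # 0.724 ms
print(f"multi region:  {network_floor_ms(20.77):.2f} ms")  # 83.08 ms
```

In the two-region setup this floor is already above 80 ms per statement, which is why reducing the number of rounds is listed as the first remedy.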
Future work
Scylla RAFT
■ new replication strategy
■ tablet partitioning scheme
■ requested explicitly in CREATE TABLE
■ no client-side timestamps
■ provides isolation for ALL queries
Stay in touch
Konstantin Osipov
kostja@scylladb.com
kostja_osipov

 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Recently uploaded (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

Lightweight Transactions at Lightning Speed

  • 1. Lightweight transactions at lightning speed Konstantin Osipov, Software Team lead
  • 2. Presenter Konstantin Osipov, Software Team Lead Kostja is a well-known expert in the DBMS world, spending most of his career developing open-source DBMS including Tarantool and MySQL. At ScyllaDB his focus is transaction support and synchronous replication.
  • 3. Quick takeaways ■ Download: https://hub.docker.com/r/scylladb/scylla-nightly/tags ■ Use --experimental and follow a short tutorial: https://github.com/scylladb/scylla/wiki/lwt ■ General availability is planned with 3.2 milestone release
  • 4. How this talk is structured ■ Lightweight transactions: syntax, semantics, benchmarks, metrics ■ Design & Implementation overview ■ Caveats ■ Future work
  • 5. LWT at a glance
  • 7. CQL avoids slow reads > UPDATE employees SET join_date = '2018-05-19' WHERE firstname = 'John' AND lastname = 'Doe'; > SELECT * FROM employees ...; firstname | lastname | join_date -----------+----------+------------ John | Doe | 2018-05-19
  • 8. CQL conditional statement > UPDATE employees SET join_date = '2018-05-19' WHERE firstname = 'John' AND lastname = 'Doe' IF join_date != null; [applied] ----------- False
  • 9. What statements can be conditional? Any INSERT, UPDATE or DELETE can have an IF clause: > UPDATE employees SET join_date = … IF EXISTS; > INSERT INTO bookings (id, item, client, quantity) VALUES (…) IF NOT EXISTS; > UPDATE inventory SET state = 'Used' WHERE itemid = ? IF state = 'Unused' AND check = 'Passed'; > DELETE FROM tasks WHERE project_id = ? AND task_id = ? IF task['state'] IN ('Complete', 'Abandoned');
  • 11. Conditional batches > BEGIN BATCH > UPDATE tasks SET n_abandoned = 0 WHERE project_id = 1 > IF n_abandoned > 0 > DELETE FROM tasks WHERE project_id = 1 > AND state = 'Abandoned' > APPLY BATCH; [applied]| project_id | state | task_id | n_abandoned ----------+------------+-----------+---------+------------- True | 1 | Abandoned | 693 | 2
  • 12. Consistency considerations ■ New consistency command: SERIAL CONSISTENCY [SERIAL|LOCAL_SERIAL] ■ Eventual CONSISTENCY is still used ■ Consistency settings can be combined to reduce LWT latency
  • 13. IF is the new WHERE? WHERE IF Relation expressions >, <, >=, <=, ==, != Yes Yes IN condition Yes Yes Collection element subscription, a[‘key’] Yes Yes UDT member subscription, a.key Yes No Uses secondary index for search Yes No TOKEN(), LIKE, UDF Yes No
  • 14. What you CAN’T DO ■ Use counter data type ⛔ ■ Access multiple partitions ⛔ ■ Supply custom TIMESTAMP ⛔ ■ Use UNLOGGED ⛔
  • 15. Differences with Cassandra Difference Workaround Per-core partitioning Use shard-aware driver for optimal performance Scylla always provides a result set No need No Thrift support Don’t use Thrift. Hints are not used No need
  • 17. Setup: single region Amazon EC2, availability zone US-West-1 ■ Rtt time min/avg/max = 0.149/0.181/0.259 ms ■ 3 nodes I3.2xlarge ● 8 vcores, Intel Xeon E5 2686 v4 2.3GHz, 64GB RAM, 1.9T NVMeSSD ■ Replication strategy: Simple ■ Replication factor: 3 ■ Integer key and value ■ Go client t3.2xlarge ■ 1-100 connections
  • 18. Uncontended write - bandwidth
  • 20. Contended SERIAL write - bandwidth
  • 22. Setup: multiple regions Amazon EC2, zones US-West-1 (2 nodes) and US-West-2 (2 nodes) ■ Rtt time min/avg/max = 20.74/20.77/20.81 ms ■ 4 nodes I3.2xlarge ● 8 vcores, Intel Xeon E5 2686 v4 2.3GHz, 64GB RAM, 1.9T NVMeSSD ■ Replication strategy: NetworkTopologyStrategy ■ Replication factor: 2+2 ■ Integer key and value ■ Go client t3.2xlarge ■ 1-100 connections
  • 23. Uncontended write - bandwidth
  • 25. Metrics ■ CQL counters: scylla_cql_{inserts|updates|deletes|batches} Label: conditional={yes|no} ■ storage proxy metrics: scylla_storage_proxy_coordinator_{read|write}_{latency|timeouts|unavailable|contention|unfinished_commit|condition_not_met...}
  • 30. Shard-to-shard replication mesh Replication strategy produces C(n, k) distinct replication groups. 16 nodes, replication factor 3: C(16, 3) = 560 Explodes for larger numbers: C(2560, 3) ≅ 2,796,202,666 Data within a node is further partitioned across shards:
  • 31. Introducing Paxos R1 Can I propose a value? R 2 R 3 Accept new value Learn decision Decision made
  • 32. Introducing Paxos R1 Can I propose a value? Check condition R 2 R 3 Accept new value Learn decision Decision made
  • 34. Caveats Issue Remedy 4 round trips are very costly Optimize propose and read rounds Contention/starvation Implement Paxos leases Uncertainty on timeout Improved diagnostics System.paxos state Account in capacity planning
  • 36. Scylla RAFT ■ new replication strategy ■ tablet partitioning scheme ■ requested explicitly in CREATE TABLE ■ no client-side timestamps ■ provides isolation for ALL queries
  • 37. Stay in touch Konstantin Osipov kostja@scylladb.com kostja_osipov
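
The replication-group counts on slide 30 can be checked with a few lines of Python. This is only an arithmetic sketch: the function name `replication_groups` is made up for illustration, and it assumes that any k-subset of the n nodes can form a group, as the C(n, k) formula on the slide implies.

```python
import math


def replication_groups(nodes: int, replication_factor: int) -> int:
    """Number of distinct replica sets, C(n, k) (hypothetical helper)."""
    return math.comb(nodes, replication_factor)


# 16 nodes with replication factor 3, as on the slide.
print(replication_groups(16, 3))

# 2560 shards treated as independent "nodes" (the slide's larger example).
print(replication_groups(2560, 3))

# The cubic estimate n^3 / 6, for comparison with the exact value above.
print(2560 ** 3 // 6)
```

The exact value of C(2560, 3) is 2,792,926,720. The figure quoted on the slide, 2,796,202,666, matches the estimate n^3 / 6, which is consistent with the slide's use of the approximation sign.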

Editor's Notes

  1. Hi, my name is Konstantin Osipov, and I am working on lightweight transaction support in Scylla. I've been involved with databases for nearly two decades, most notably MySQL, where I worked on prepared statements, stored procedures, foreign key constraints, metadata locking, and Tarantool in-memory database where I served ~9 years as a leading engineer and CTO.
  2. This talk is about lightweight transactions support in Scylla, and since this is a very wished for feature many of you have the most burning questions like "is it there?" and "how can I get it?" - which I'll answer first. It is there, in Scylla trunk and you can download it at https://hub.docker.com/r/scylladb/scylla-nightly/tags. It is going to be included into the upcoming 3.2 release - which is planned later this year. The implementation is nearly fully compatible with Cassandra, so those of you who are familiar with Cassandra, perhaps now have sufficient information to skip this talk and get a coffee and/or a cigarette instead. Enjoy.
  3. Those of you who are interested in the secrets of internal works of LWT, how to best use it, benchmarks, caveats, and future work, please stay on. And I am here to learn too - about your LWT usage patterns, wishes, and pet peeves. I will structure the talk as follows. * We'll start by looking at LWT feature: the syntax, semantics, strengths and weaknesses * We will continue with presenting a few benchmarks and discussing how to optimally use the feature, including what metrics we provide to monitor its usage * We'll look at Scylla architecture background and possible approaches to LWT implementation * Then we'll study the implementation, which is based on an infamously difficult yet very elegant and minimal distributed algorithm (also known as distributed consensus protocol), called Paxos * We'll end by discussing the state of Paxos implementation in Scylla: why it is marked with --experimental, what we plan to do before we remove the mark, and what we plan to do after
  4. If you're familiar with Scylla data modification language, you know that a modification statement never reports back whether it actually changes any rows. This is a property ensuing from two design choices in Scylla: - using log-structured merge trees for storage, which is significantly more efficient for heavy write work than for reads. You can safely assume that a cold read is 10-100x more expensive than a write even when using an SSD device. - accepting client-supplied timestamps for "transaction" identifiers: even if Scylla performed a read of the existing value before applying a change, the end result may well change because a similar transaction on the same key is allowed to proceed on a different node without any coordination, or even a later transaction may supply an earlier timestamp and thus retroactively change the history.
  5. As an example, one that commonly trips up SQL users adopting CQL, the following UPDATE statement always succeeds: UPDATE employees SET join_date = '2010-04-28' WHERE firstname = 'John' AND lastname = 'Doe' - you'd better know what you're doing, because if John Doe was not employed before this statement, he sure will be employed after it. Well, guess what, this is not always what you need. Sometimes you just need a scalable and reliable database which can provide the classical transactional consistency model for at least some of your updates - if John Doe is not employed, he should not be hired by an update.
  6. (Since the WHERE clause is taken), a new IF clause is added to convey this intent. Now the statement does what it is supposed to and will *not* accidentally hire our friend John Doe. But what else can you do with LWT?
  7. The IF clause is available for all existing data modification statements: INSERT, UPDATE and DELETE. If you just wish to check that a certain row exists or doesn't exist, you can write IF EXISTS or IF NOT EXISTS: INSERT INTO bookings (id, item, client, quantity) VALUES (...) IF NOT EXISTS; or you can provide a collection of predicates on different row cells: UPDATE inventory SET state = 'Used' WHERE itemid = ? IF state = 'Unused' AND check = 'Passed'; - all such changes will be consistent and durable. You can also query individual cells or collection elements, and use IN and relation operators such as <, >, >=, <=, ==, !=. A popular design pattern with lightweight transactions is having a registry for critical information, AKA process or state metadata, for example a task-worker assignment table, alongside an eventually consistent table with the actual data: INSERT INTO tasks (task_id, task) VALUES (1002, { ... }); INSERT INTO tasks_assigned (task_id, worker_id) VALUES (1001, 'west-1') IF NOT EXISTS; -- Only take the task if it is not taken UPDATE tasks_assigned SET worker_id = 'west-2' WHERE task_id = 1001 IF worker_id = 'west-1'; -- Atomically change failed worker of a task
  9. In addition to a single statement, it is possible to combine multiple conditional statements into a batch. A batch can have non-conditional statements as well, but all statements of such a batch may span only a single partition. This is useful when it is desired to update multiple rows in a partition or atomically erase all or a range of rows in it. If any statement in a batch has conditions, the entire batch is considered "conditional": it is applied atomically if and only if *all* conditions of all statements in the batch evaluate to TRUE. LWT batches are very similar to multi-statement transactions in relational databases, since they provide multiple-row read consistency, durability and isolation. Yes, with atomic batches in Scylla, clients don't see partial changes, as the entire partition mutation is applied as all or nothing. The only difference from real transactions is that the batch logic cannot "branch", i.e. there is only one ELSE branch and it is "do nothing".
  10. If you wish to avoid an extra learn round, set CONSISTENCY to ANY and SERIAL CONSISTENCY to SERIAL. If you wish to have transactional semantics within the current DC and asynchronously apply the mutation to the remote DC, you can use LOCAL_SERIAL consistency and QUORUM eventual consistency.
  11. One may think that the IF clause is a new WHERE - and this is true to a large extent: both accept expressions and are applied to the searched row. Unlike the WHERE clause, IF conditions never use a secondary index - the rows are fetched before a condition is evaluated. An IF condition applies only to a fully qualified row, i.e. you still must specify the partition key and in many cases the clustering key, either in the WHERE clause, if we deal with DELETE or UPDATE, or in the SET or VALUES clause, for UPDATE and INSERT. If your restrictions yield multiple rows, your IF condition cannot be ambiguous, i.e. it cannot evaluate to TRUE for one row and to FALSE for another. In practice this means that for statements restricting only the partition key and not the clustering key, or the partition key and multiple clustering keys (pk = ? AND ck IN (?, ?, ?)), only conditions on static cells are accepted. A current limitation which we plan to lift is that not all predicates are available in conditions: LIKE, TOKEN and user-defined functions are not available. Finally, beware of null semantics for collection values. A null for a frozen collection is a stored value, i.e. it is distinct from an absent value and is treated accordingly in relations. For a non-frozen collection, != null or == null returns the same result for null values and absent data. There is no reason for this other than Cassandra compatibility.
  12. You can't: use LWT with tables with counters; use LWT with statements which span multiple partitions, batches or not; use user-supplied timestamps: guaranteeing consistency requires that the timestamp is assigned by the transaction coordinator; use conditional and non-conditional statements with the same data and expect conditional statements to be consistent. You can actually use non-conditional statements on some cells of a row, and conditional on the others - but in practice this is hardly useful, since eventually you'll have to work with the entire row, such as insert or delete it, and it will conflict. Better to split such an object into a "transactional" and an "eventually consistent" part and store them in two different tables. Other limitations are more minor: while a non-LWT batch can be UNLOGGED, a conditional batch cannot; IF conditions must be a perfect conjunct (... AND ... AND ...); while UPDATE is actually an insert, UPDATE IF NOT EXISTS is not allowed, since it doesn't make any sense when read as English and not as CQL.
  13. Scylla is making an effort to be compatible with Cassandra, down to the level of limitations of the implementation. How is it different? Unlike Cassandra, we use per-core data partitioning, so the RPC that is done to perform a transaction talks directly to the right core on a peer replica, avoiding the concurrency overhead. That is, of course, true only if a shard-aware driver is used - otherwise we add an extra hop to the right core at the coordinator node. Just like the first implementation of LWT in Cassandra, we do not store hints for lightweight transaction writes. Cassandra later added hints support, while we do not have plans for it, since the hints seem to be redundant. Unlike Cassandra, Scylla doesn't have LWT support in the Thrift protocol and doesn't plan to add it. Conditional statements return a result set, and unlike Cassandra, Scylla returns result set metadata to the client at prepare if a statement has conditions. While the columns of the result set are the same as in Cassandra, Scylla always returns the old version of the row, to not confuse the driver, while Cassandra returns the old row only if the statement is not applied. Let's illustrate this: (go back to the batch statement example)
  14. Remember that an i3.2xlarge is considered a small node for Scylla.
  15. A new label {conditional="yes"|"no"} provides separate accounting of statements with and without conditions. A batch is accounted as conditional if it has at least one statement with conditions. All statements of a batch are accounted to cql_statements_in_batches and to cql_inserts, cql_deletes, cql_updates with the label {conditional="yes"|"no"} depending on whether the batch is conditional or not. Serial reads are exported under scylla_storage_proxy_coordinator_cas_read_*; conditional writes are exported under scylla_storage_proxy_coordinator_cas_write_*. The metrics are: latency – latency histogram; timeouts – number of timeout errors; unavailable – number of failed attempts to form a Paxos quorum; unfinished_commit – number of Paxos rounds finished by the next request; condition_not_met – number of CAS failures due to a failed IF condition (only for writes); contention – histogram showing how many requests were retried internally due to contention. What to look out for: timeouts, growing latency, contention, unfinished commits, condition not met - all indicate that there is something wrong with your app and you are most likely doing something wrong.
  16. This screenshot is taken from our Grafana monitoring when running the benchmark. We plan to add these metrics to our standard dashboards: https://github.com/scylladb/scylla-monitoring/issues/775
  17. Let's take a look at the Scylla implementation - understanding the internal workings of the code helps identify the limits of applying this feature in your projects. It will also let us reason about the next steps for strong consistency in Scylla. Scylla is a shared-nothing system with no central authority or repository of knowledge. Each node owns a fraction of data called a token range, and all nodes form a mesh to deliver the database service.
  18. To avoid uneven distribution of data, the consistent hash ring contains not cnodes, but vnodes - virtual node identifiers, each node owning multiple vnodes. One important way in which Scylla is different from Cassandra is its partitioning scheme, in which each token range owned by a node is sub-partitioned into hundreds of sub-ranges, to ensure every CPU core solely owns its own subset of data. This allows for very little coordination between the cores on a single node - just as there is very little coordination between the nodes in the entire cluster.
  19. So Scylla adds an extra slicing layer, to split vnodes, into per-shard chunks called cnodes. For each token range of a cnode, its peers, or secondary replicas are selected as a product of hash function, thus each transaction ultimately involves a unique set of peers. This approach works very well for building a scalable, fault-tolerant system that minimizes hot spots and reduces impact of a single node failure. Yet it creates tens if not hundreds of thousands of replication "groups" - *distinct* sets of peers participating in a given transaction. The distributed system theory offers two broad sets of algorithms for peer coordination: with a designated leader, which may change once in a while to provide high availability, and leader-less, or, in fact, selecting a leader independently for every transaction. Some have already recognized that I am speaking in very broad terms about Raft vs Paxos family of algorithms. Thanks to the Scylla approach to data partitioning, using a leader-based algorithm would require adding group replication state for every distinct replication group, which means a lot of additional runtime state to maintain, and a lot of implementation complexity to manage, especially when the number of nodes or number of cores on a node changes, and many replication groups are re-formed. A leaderless algorithm trades the need to maintain extra state with an extra negotiation round to select a leader for each transaction. Since this approach allowed us to shorten the time to market we settled on it first, somewhat reassured that Cassandra uses the same technique. So what is Paxos and how does it work?
  20. Paxos was invented as an algorithm for achieving consensus on a single value over unreliable communication channels. Many parts of the algorithm are left to implementers, so it can be tailored to solving the problem of database replication. In Scylla, the algorithm participants are replicas responsible for a given partition key. When a client suggests a change to the key (any modification statement can be represented as a partition mutation), a coordinator node acting on the client's behalf ensures that the majority of replicas holding the key accept the change. Any node in the cluster can be a coordinator for some change. This is done in two steps: first, the majority of replicas responsible for the key make a promise to the coordinator to accept the change, if the coordinator decides to make it. This step is necessary to make sure that no two concurrent coordinators "split" the history, when some replicas accept changes from one coordinator, and others from another. Essentially it temporarily locks out other changes and allows them to happen one at a time. After the coordinator receives a majority of promises, it suggests a change. If the change is accepted by the majority, the algorithm has made progress. Please note that this illustration assumes a shard-aware driver and the first replica both acting as a coordinator and implicitly sending and acknowledging all messages successfully.
  21. In addition to the two steps mandated by the protocol, Scylla has to retrieve the old row to check conditions. Once a proposal is accepted, and the coordinator knows it has been accepted (it got responses from a majority), another query is performed to make sure the change is applied to the base table on each replica. Overall, this makes up to 4 rounds, excluding retries and repairs. The algorithm uses a system table, called system.paxos, to store its state. The table is replica-local, i.e. it is not partitioned but contains its own data on each replica. The table primary key is a blob, capable of storing a partition key of any user table. This ensures that any Paxos round can find a designated unique slot in the system table to store its state. Once a round is over, the state can be cleared or overwritten - the table has a TTL attached to it, to ensure old rounds expire. While a node acting as a coordinator is leading the effort in achieving resolution, other nodes are free to do the same and may even hijack the efforts of their peers. In particular, all coordinators share responsibility for carrying out an unfinished round when they encounter it. This makes Paxos resilient against failures such as machine crashes and network outages. This, however, leads to contention under load, since it can be difficult to distinguish a round which has an active coordinator pushing it to completion from a round that was abandoned because the coordinator that started it had failed.
  22. One can already guess from my brief sketch of the implementation that achieving consensus using Paxos is a saga and many things may break on the way. Let's try to look critically at Scylla implementation and summarize its current flaws so that users can be aware of it: the main issue, of course, is that the protocol is very expensive: it's 4 times more expensive than a usual write in terms of network latency, and a hundred times in terms of I/O, since it incurs a read of the old row. By any measures the network latency dominates I/O costs, but these costs should not be discounted either: fetching whole pages of an LSM tree can saturate I/O bandwidth way before network bandwidth limit is reached. Some of the of the protocol RPCs could be collapsed, and work in Scylla has begun to this end. the second largest issue to note is high contention overhead when multiple coordinators attempt to work on the same key. The contention is innate in the liveness property of Paxos - when two coordinators have a row over concurrently changing the same key they need to guess if the other coordinator is alive or not, which may be difficult and costly. So they just back off and wait for a random interval when encounter contention, which introduces exponentially growing delays as the key becomes hotter and hotter. It should be noted that the research has advanced enough to provide industry grade solutions for the problem (Paxos leases), and our team is also looking into applying it. there are circumstances in which the client can not reliably know whether a value is applied or not. In one infamous Cassandra bug a user complains that he gets a timeout exception from a query which is actually successful and the timeout is returned before it is expired! Let's consider a case when a coordinator attempts to perform a change but another coordinator hijacks its partially completed change. Has the change been applied? 
Maybe, but the coordinator has no time to find out - it has to return a "timeout" state to the client, which in turn has to figure out the outcome itself. In other words, it is possible to have a write operation report a failure to the client, but still actually persist the write to a replica. While the situation is also quite possible with any other kind of update, it is aggravated by the contention and prolonged nature of Paxos - so any user of LWT has to take it seriously. The Cassandra 4.0 release delivers better diagnostics of this "unknown" state so that the client can make a more educated decision as to how to proceed (and we plan to do the same). Finally, Paxos table state is an extra temporary state that DBAs must take into account when sizing their deployments. It can store up to 3 hours of in-progress transactions on each node, which could be quite hefty under high load. We intend to address or otherwise mitigate these issues before the feature becomes generally available. Meanwhile, please bear in mind the high costs of lightweight transactions when designing your applications and use them sparingly, i.e. avoid using them for all your data. As already mentioned, a good design pattern is to use LWT for the control plane of your application, while the data plane continues to be eventually consistent.
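The back-off behaviour under contention mentioned in note 22 can be sketched as a retry loop with randomized, exponentially growing waits. This is a minimal illustration, not Scylla's code: the note only says that coordinators wait for a random interval, so the doubling ceiling, the `base_ms` and `cap_ms` values, and the function names `backoff_delays` and `retry_with_backoff` are all assumptions made for the example.

```python
import random
import time
from typing import Callable, List, Optional


def backoff_delays(attempts: int, base_ms: float = 10.0, cap_ms: float = 1000.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Random delays in milliseconds whose upper bound doubles per attempt (assumed policy)."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_ms, base_ms * 2 ** attempt)  # 10, 20, 40, ... capped at cap_ms
        delays.append(rng.uniform(0, ceiling))         # random wait so peers desynchronize
    return delays


def retry_with_backoff(op: Callable[[], Optional[bool]], max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep,
                       rng: Optional[random.Random] = None) -> Optional[bool]:
    """Call op() until it returns something other than None (None means contention)."""
    delays = backoff_delays(max_attempts, rng=rng)
    for attempt, delay_ms in enumerate(delays):
        result = op()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(delay_ms / 1000.0)  # back off before the next attempt
    return None  # outcome unknown to this caller: still contended after all attempts
```

The growing ceiling is what makes hot keys expensive: every extra collision raises the expected wait before the next attempt, so latency climbs quickly as more coordinators compete for the same key.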
  23. As you may have sensed, I'm not actually very happy with many of these issues, and I somewhat regret that we had to inherit some of them from Cassandra to preserve compatibility. The good news is that Scylla is not just a Cassandra clone - CQL is the first front-end to its fantastic massively-parallel database technology, the DynamoDB-compatible API is the second, and others are quite likely to appear. We plan to continue our efforts in introducing leader-based synchronous replication to Scylla, which is now a prevalent trend in the industry. To do it right, Scylla will need to change its data partitioning scheme to ensure there is more data locality, and also bring down the number of replication groups in the cluster from tens of thousands to hundreds (we still need to keep the number of groups somewhat high to ensure the workload is handled evenly). To avoid making our existing users perform painful migrations, we will begin by using a new partitioning and data replication scheme for new tables created with these options enabled. For such tables we will always mandate a server-assigned timestamp for the transaction identifier. One advantage of this approach is that it will make all CQL statements, not just conditional statements, strongly consistent. Ensuring isolation will not require a read of the old row or multiple network round trips, so it will come at a much lower cost. This is not an official commitment but the current state of mind of some key people on the engineering team.