The Hows and Whys of a Distributed SQL Database
Alex Robinson / Member of the Technical Staff
alex@cockroachlabs.com / @alexwritescode
Agenda
• Why a distributed SQL database?
• How does a distributed SQL database work?
○ Data distribution
○ Data replication
○ Transactions
Why a distributed SQL database?
A brief history of databases
1960s: The first databases
• Records stored with pointers to each other in a hierarchy or network
• Queries had to process one tuple at a time, traversing the structure
• No independence between logical and physical schemas
○ To optimize for different queries, you had to dump and reload all your data
• Required a lot of programmer effort to use
1970s: SQL / RDBMS
• Relational model was designed in contrast to prior hierarchical DBs
○ Expose simple data structures and high-level language
○ Queries are independent of the underlying physical storage
• Very developer friendly
• Designed to run on a single machine
1980s-1990s: Continued growth of SQL / RDBMS
• SQL databases matured, took over the industry
• Object-oriented databases were tried, but didn’t gain much traction
○ Didn’t support complex queries well
○ No standard API across systems
Early 2000s: Custom Sharding
• Rapidly growing scale necessitated new solutions
• Companies added custom middleware to connect single-node DBs
• No cross-shard transactions, operational nightmares
[Diagram: Applications, Middleware, MySQL]
2004+: NoSQL
• Further growth and operational pain necessitated new systems
• Big change in priorities - scale and availability above all else
• Sacrificed a lot to get there
○ No relational model (custom APIs instead of SQL)
○ No (or very limited) transactions and indexes
○ Manual joins
2010s: “NewSQL” / Distributed SQL
Why another type of database?
Database Limitations
Existing database solutions place an undue burden on application developers and operators:
● Scale (SQL)
● Fault tolerance (SQL)
● Limited transactions (NoSQL)
● Limited indexes (NoSQL)
● Consistency issues (NoSQL)
Strong Consistency and Transactions
“We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions”
-Google’s Spanner paper
2010s: “NewSQL” / Distributed SQL
● Distributed databases that provide full SQL semantics
● Attempt to combine the best of both worlds
○ Fully-featured SQL
○ Horizontally scalable and highly available
● Entirely new systems, tough to tack onto existing databases
2010s: “NewSQL” / Distributed SQL
● Burst onto the scene with Google’s Spanner system (published in 2012)
● CockroachDB started open source development in 2014
So we know why, but how?
And what’s different from SQL and NoSQL databases?
What we’re going to cover
1. How data is distributed across machines
2. How remote copies of data are replicated and kept in sync
3. How transactions work
Plus how all of this differs from conventional SQL and NoSQL systems
Disclaimer
Data Distribution
Data Distribution in SQL
• Option 1: All data on one server
○ With optional secondary replicas/backups that also store all the data
• Option 2: Manually shard data across separate database instances
○ All data for each shard is still just on one server (plus any secondary replicas)
○ All data distribution choices are being made by the human DB admin
Data Distribution in NoSQL / NewSQL
• Core assumption: entire dataset won’t fit on one machine
• Given that, there are two key questions to answer:
○ How do you divide it up?
○ How do you locate any particular piece of data?
• Two primary approaches: Hashing or Order-Preserving
Option one: Hashing
● Pick one or more hash functions, divide up data by hashing each key
● Deterministically map hashed keys to servers
● Pro: easy to locate data by key
● Con: awkward, inefficient range scans
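A minimal sketch of the hashing approach, assuming three hypothetical servers (`node-1` to `node-3`) and SHA-256 as the hash function; neither choice comes from the talk. It shows why a lookup by key is cheap while a range scan has to ask every server.

```python
import hashlib

SERVERS = ["node-1", "node-2", "node-3"]  # hypothetical server names


def server_for(key: str, servers=SERVERS) -> str:
    """Deterministically map a key to a server by hashing it."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[bucket]


def scan(first: str, last: str, stores: dict) -> list:
    """Range scan: adjacent keys hash to unrelated servers, so ask them all."""
    hits = []
    for server in stores:
        hits.extend(k for k in stores[server] if first <= k <= last)
    return sorted(hits)


fruits = ["apricot", "banana", "blueberry", "cherry",
          "grape", "lemon", "lime", "mango"]
stores = {s: [] for s in SERVERS}
for fruit in fruits:
    stores[server_for(fruit)].append(fruit)
```

`server_for("cherry")` touches no index at all, whereas `scan("cherry", "mango", stores)` visits every server and merges the results.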
Option two: Order-Preserving
● Divide sorted key space up into ranges of approximately equal size
● Distribute the resulting key ranges across the servers
● Pros: efficient scans, easy splitting
● Con: requires additional indexing
Option two: Order-Preserving
Each shard (“range”) contains a contiguous segment of the key space
[Diagram: three ranges and one key not yet placed]
Ø-lem: apricot, banana, blueberry, cherry, grape
lem-pea: lime, mango, melon, orange
pea-∞: peach, pear, pineapple, raspberry, strawberry
Key to place: lemon
Option two: Order-Preserving
We need an indexing structure to locate a range
[Diagram: range index pointing at three ranges, one key not yet placed]
range index: Ø-lem | lem-pea | pea-∞
Ø-lem: apricot, banana, blueberry, cherry, grape
lem-pea: lime, mango, melon, orange
pea-∞: peach, pear, pineapple, raspberry, strawberry
Key to place: lemon
Option two: Order-Preserving
Scans (fruits >= “cherry” AND <= “mango”) are efficient
[Diagram: range index pointing at three ranges]
range index: Ø-lem | lem-pea | pea-∞
Ø-lem: apricot, banana, blueberry, cherry, grape
lem-pea: lemon, lime, mango, melon, orange
pea-∞: peach, pear, pineapple, raspberry, strawberry
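The range index can be sketched as a sorted list of range start keys searched with binary search. This is an illustration only: the list layout and the function names `locate` and `ranges_for_scan` are assumptions, and the empty string stands in for Ø.

```python
from bisect import bisect_right

# Start key of each range, in sorted order; "" stands in for Ø (the minimum key).
RANGE_STARTS = ["", "lem", "pea"]
RANGE_NAMES = ["Ø-lem", "lem-pea", "pea-∞"]


def locate(key: str) -> str:
    """Return the range whose [start, next start) interval contains key."""
    return RANGE_NAMES[bisect_right(RANGE_STARTS, key) - 1]


def ranges_for_scan(first: str, last: str) -> list:
    """Return only the ranges that overlap the scan [first, last]."""
    lo = bisect_right(RANGE_STARTS, first) - 1
    hi = bisect_right(RANGE_STARTS, last) - 1
    return RANGE_NAMES[lo:hi + 1]
```

Here `locate("lemon")` resolves to `lem-pea`, and the scan from “cherry” to “mango” only needs `Ø-lem` and `lem-pea`; `pea-∞` is never contacted.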
Option two: Order-Preserving
[Diagram: range index pointing at four ranges after a split]
range index: Ø-lem | lem-pea | pea-str | str-∞
Ø-lem: apricot, banana, blueberry, cherry, grape
lem-pea: lemon, lime, mango, melon, orange
pea-str: peach, pear, pineapple, raspberry
str-∞: strawberry, tamarillo, tamarind
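Splitting a range can be sketched as dividing its sorted keys at a split key and adding one entry to the range index. The in-memory `starts` list and `data` dict below are assumptions made for illustration, not how any particular database stores ranges.

```python
from bisect import insort


def split_range(starts: list, data: dict, start: str, split_key: str) -> None:
    """Split the range beginning at `start` in two at `split_key`, in place."""
    keys = data[start]
    data[start] = [k for k in keys if k < split_key]
    data[split_key] = [k for k in keys if k >= split_key]
    insort(starts, split_key)  # the range index gains one entry


# "" stands in for Ø (the minimum key).
starts = ["", "lem", "pea"]
data = {
    "": ["apricot", "banana", "blueberry", "cherry", "grape"],
    "lem": ["lemon", "lime", "mango", "melon", "orange"],
    "pea": ["peach", "pear", "pineapple", "raspberry",
            "strawberry", "tamarillo", "tamarind"],
}
split_range(starts, data, "pea", "str")
```

Only the range being split is touched; the other ranges and their index entries stay as they were.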
Data Distribution: Placement
Each range is replicated to three or more nodes
[Diagram: replicas of Range 1, Range 2, and Range 3 spread across Node 1, Node 2, and Node 3]
Data Distribution: Rebalancing
[Diagram sequence: replicas of Range 1, Range 2, and Range 3 on Nodes 1-4]
1. Adding a new (empty) node
2. A new replica is allocated, data is copied.
3. The new replica is made live, replacing another.
4. The old (inactive) replica is deleted.
5. Process continues until nodes are balanced.
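The rebalancing loop above can be sketched as a greedy procedure that moves one replica at a time from the fullest node to the emptiest. The balance criterion (replica counts within one of each other) and the `rebalance` function are assumptions; a real allocator would weigh more than replica counts.

```python
def rebalance(nodes: dict) -> list:
    """Move replicas from the fullest node to the emptiest until balanced.

    nodes maps a node name to the set of range ids it holds a replica of.
    Returns the list of (range_id, source, target) moves performed.
    """
    moves = []
    while True:
        fullest = max(nodes, key=lambda n: len(nodes[n]))
        emptiest = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[fullest]) - len(nodes[emptiest]) <= 1:
            return moves
        # Never place two replicas of the same range on one node.
        range_id = sorted(nodes[fullest] - nodes[emptiest])[0]
        nodes[emptiest].add(range_id)    # allocate the new replica, copy data
        nodes[fullest].remove(range_id)  # then delete the old replica
        moves.append((range_id, fullest, emptiest))


nodes = {"node-1": {1, 2, 3}, "node-2": {1, 2, 3},
         "node-3": {1, 2, 3}, "node-4": set()}
moves = rebalance(nodes)
```

Each move adds the new replica before removing the old one, so a range never drops below its replica count while data is being shuffled.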
Data Distribution: Recovery
[Diagram sequence: one of four nodes fails (marked X)]
1. Losing a node causes recovery of its replicas.
2. A new replica gets created on an existing node to replace the lost replica.
Data Replication
Keeping Remote Copies in Sync
Keeping Copies in Sync in SQL Databases
• Option one: cold backups
○ No real expectation of backups being fully up-to-date
• Option two: primary-secondary replication
○ All writes (and all reads that care about consistency) go to the “Primary” instance
○ All writes get pushed from the Primary to any Secondary instances
Primary/Secondary Replication
[Diagram: Put “cherry” reaches the Primary and is pushed to the Secondary]
Primary: apricot, banana, blueberry, cherry, grape
Secondary: apricot, banana, blueberry, cherry, grape
In theory, replicas contain identical copies of data: Voila!
Primary/Secondary Replication
• In practice, you have to choose
○ Asynchronous replication
■ Secondaries lag behind primary
■ Failover is likely to lose recent writes
○ Synchronous replication
■ All writes get delayed waiting for secondary acknowledgment
■ What do you do when your secondary fails?
• Failover is really hard to get right
○ How do clients know which instance is the primary at any given time?
Keeping Copies in Sync in NoSQL Databases
• Most NoSQL systems are eventually consistent
• Very wide space of solutions, many different possible behaviors
○ Last write wins
○ Last write wins with vector clocks
○ Leader elections
○ Quorum reads and writes
○ Conflict-free replicated data types (CRDTs)
Keeping Copies in Sync in NewSQL Databases
• Use a distributed consensus protocol from the CS literature
○ Available protocols: Paxos, Raft, Zab, Viewstamped Replication, …
• At a high level:
○ Replicate data to N (often N=3 or N=5) nodes
○ A write commits once a quorum of replicas has written the data
Distributed Consensus: Raft made simple
[Diagram sequence: a Leader and two Followers, each starting with apricot, banana, blueberry, grape]
1. Put “cherry” arrives at the Leader, which writes it; neither Follower has it yet.
2. The Leader sends Put “cherry” on to the Followers.
3. One Follower writes “cherry”. Write committed when 2 out of 3 nodes have written data.
4. The Followers reply to the Leader with Ack.
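The quorum rule in this sequence can be sketched as counting acknowledgments. This is a deliberately simplified illustration, not Raft itself: it leaves out terms, log indices, and leader election, and the `Replica` class and `replicate` function are hypothetical names.

```python
class Replica:
    def __init__(self, name: str, up: bool = True):
        self.name, self.up, self.log = name, up, []

    def append(self, entry: str) -> bool:
        """Write the entry and acknowledge, unless this replica is down."""
        if self.up:
            self.log.append(entry)
        return self.up


def replicate(leader: Replica, followers: list, entry: str) -> bool:
    """Return True once a majority of all replicas hold the entry."""
    acks = 1 if leader.append(entry) else 0
    acks += sum(f.append(entry) for f in followers)
    quorum = (len(followers) + 1) // 2 + 1  # majority, leader included
    return acks >= quorum


leader = Replica("leader")
followers = [Replica("follower-1"), Replica("follower-2", up=False)]
committed = replicate(leader, followers, "cherry")
```

With three replicas the quorum is two, so the write commits even though `follower-2` is down; with both followers down it would not.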
Distributed Consensus: Raft made simple
What happens during failover?
Distributed Consensus: Raft made simple
[Diagram sequence: failover with a Leader and two Followers, each starting with apricot, banana, blueberry, grape]
1. Put “cherry” is written on the Leader only; neither Follower has it.
2. After failover, the node acting as Leader does not hold “cherry”; one Follower still does.
3. Read “cherry” goes to the Leader. On failover, only data written to a quorum is considered present.
4. The read returns: Key not found
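The failover rule can be sketched by treating a key as present only if a majority of replicas hold it. Real Raft reaches this outcome through its election and log-matching rules rather than by counting copies at read time, so the functions below are an illustrative assumption, not the protocol.

```python
def committed_keys(replica_logs: list) -> set:
    """Keys held by a majority of replicas; anything else may be lost."""
    quorum = len(replica_logs) // 2 + 1
    counts = {}
    for log in replica_logs:
        for key in set(log):
            counts[key] = counts.get(key, 0) + 1
    return {key for key, n in counts.items() if n >= quorum}


def read(key: str, replica_logs: list) -> str:
    """Return the key if it reached a quorum, else report it missing."""
    return key if key in committed_keys(replica_logs) else "Key not found"


base = ["apricot", "banana", "blueberry", "grape"]
# Only the old leader wrote "cherry" before it failed over.
logs = [base + ["cherry"], list(base), list(base)]
result = read("cherry", logs)
```

Because “cherry” never reached a second replica, it was never committed, and losing it on failover does not break any promise made to the client.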
Consensus in NewSQL Databases
● Run one consensus group per range of data
● Many practical complications: membership changes, range splits, upgrades, scaling the number of ranges
Transactions
How can they be made to work in a distributed system?
What is an ACID Transaction?
• Atomic - entire transaction either fully commits or fully aborts
• Consistent - moves from one valid DB state to another
• Isolated - concurrent transactions don’t interfere with each other
○ “Serializable” is needed for true isolation
• Durable - committed transactions won’t disappear
Transactions in SQL Databases
• Atomicity is bootstrapped off a lower-level atomic primitive: log writes
○ All mutations applied as part of a transaction get tagged with a transaction ID
○ “Commit” log entry marks the transaction as committed, making mutations visible
• Isolation is typically provided in one of two ways
○ Read/Write Locks
○ Multi-Version Concurrency Control (MVCC)
Transactions in SQL Databases: Read/Write Locks
• Pretty simple at a high level:
○ When you read a row, take a read lock on it
○ When you write a row, take a write lock on it
○ When you commit, release all locks
• Requires deadlock detection
Transactions in SQL Databases: MVCC
• Store a timestamp with each row
• When updating a row, don’t overwrite the old value
• When reading, return the most recent value from before the read
○ Allows access to historical data
○ Allows long-running read-only transactions to not block new writes
• Need to detect conflicts and abort conflicting transactions
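A minimal MVCC sketch, assuming integer timestamps and a single in-memory store. The conflict rule used here (reject a write older than the newest existing version) is one simple choice made for illustration; the `MVCCStore` class is hypothetical.

```python
from bisect import bisect_right


class MVCCStore:
    def __init__(self):
        # key -> list of (timestamp, value), kept sorted by timestamp
        self.versions = {}

    def put(self, key, value, ts: int) -> None:
        """Add a new version instead of overwriting the old one."""
        history = self.versions.setdefault(key, [])
        if history and history[-1][0] >= ts:
            raise ValueError("write conflict: a newer version already exists")
        history.append((ts, value))

    def get(self, key, ts: int):
        """Return the most recent value written at or before ts, or None."""
        history = self.versions.get(key, [])
        i = bisect_right([t for t, _ in history], ts)
        return history[i - 1][1] if i else None


store = MVCCStore()
store.put("cherry", "red", ts=10)
store.put("cherry", "dark red", ts=20)
```

A reader at timestamp 15 keeps seeing “red” no matter how many later versions are written, which is what lets long-running read-only transactions avoid blocking new writes.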
Transactions in NoSQL Databases
• Many NoSQL systems don’t offer transactions at all
• This is one of the biggest tradeoffs that was made to improve scalability
• Those that do typically limit transactions to a single key
○ Single-key transactions don’t require coordinating across shards
○ Can be implemented as just supporting a compare-and-swap operation
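A single-key transaction built on compare-and-swap might look like the sketch below. The in-memory `KVStore` and the `increment` helper are hypothetical; a lock stands in for whatever atomicity the real storage node provides.

```python
import threading


class KVStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        return self._data.get(key)

    def compare_and_swap(self, key, expected, new) -> bool:
        """Atomically set key to new only if its current value is expected."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True


def increment(store: KVStore, key) -> int:
    """Single-key read-modify-write built from a CAS retry loop."""
    while True:
        current = store.get(key)
        updated = (current or 0) + 1
        if store.compare_and_swap(key, current, updated):
            return updated


store = KVStore()
first = increment(store, "visits")
second = increment(store, "visits")
```

Everything happens on the one node that owns the key, which is why no cross-shard coordination is needed.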
Transactions in NewSQL Databases
• Support traditional ACID semantics
• Atomicity is bootstrapped off distributed consensus
○ A “Commit” write to consensus system marks the transaction as committed
• Isolation can be handled similarly to single-node SQL databases
○ But the added latency and distributed state make things tougher
Distributed Transactions (CockroachDB)
[Diagram sequence: two ranges (apricot, cherry, grape and lemon, mango, orange) plus the Txn-1 Record]
1. Begin Txn 1
2. Put “cherry”: intent “cherry (txn 1)” is written; Txn-1 Record has Status: PENDING
3. Put “mango”: intent “mango (txn 1)” is written
4. Commit Txn 1: Txn-1 Record changes to Status: COMMITTED
5. Clean up intents: “cherry” and “mango” lose their (txn 1) tags
6. Remove Txn 1: the Txn-1 Record is deleted
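The six steps can be sketched with a transaction record and tagged values. This is a single-process illustration under stated assumptions: the `Cluster` class is hypothetical, there is no MVCC, and a read that meets a pending intent simply returns nothing.

```python
class Cluster:
    def __init__(self):
        self.data = {}         # key -> (value, txn_id); a txn_id marks an intent
        self.txn_records = {}  # txn_id -> "PENDING" | "COMMITTED" | "ABORTED"

    def begin(self, txn_id) -> None:
        self.txn_records[txn_id] = "PENDING"

    def put(self, txn_id, key, value) -> None:
        self.data[key] = (value, txn_id)  # write an intent

    def commit(self, txn_id) -> None:
        # One write to the record makes every intent visible at once.
        self.txn_records[txn_id] = "COMMITTED"

    def read(self, key):
        """Return a value if it is plain or its transaction has committed."""
        if key not in self.data:
            return None
        value, txn_id = self.data[key]
        if txn_id is None or self.txn_records.get(txn_id) == "COMMITTED":
            return value
        return None

    def cleanup(self, txn_id) -> None:
        """Resolve a committed transaction's intents, then drop its record."""
        for key, (value, owner) in list(self.data.items()):
            if owner == txn_id:
                self.data[key] = (value, None)
        del self.txn_records[txn_id]


cluster = Cluster()
cluster.begin(1)
cluster.put(1, "cherry", "c")
cluster.put(1, "mango", "m")
before_commit = cluster.read("cherry")
cluster.commit(1)
after_commit = (cluster.read("cherry"), cluster.read("mango"))
cluster.cleanup(1)
```

The commit is a single write to one record, so both keys become visible together even though they live in different ranges.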
Distributed Transactions
● That’s the happy case
● What about conflicting transactions?
Distributed Transactions (write conflict)
[Diagram sequence: one range (apricot, cherry, grape) plus the Txn-1 and Txn-2 Records]
Transaction 1:
1. Begin Txn 1
2. Put “cherry”: intent “cherry (txn 1)” is written; Txn-1 Record has Status: PENDING
Transaction 2:
1. Begin Txn 2: Txn-2 Record has Status: PENDING
2. Put “cherry”
a. Check Txn 1 record
b. Push Txn 1: Txn-1 Record changes to Status: ABORTED
c. Update intent: the intent becomes “cherry (txn 2)”
3. Commit Txn 2: Txn-2 Record changes to Status: COMMITTED
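The conflict path can be sketched as a `put` that inspects the existing intent before writing. In this simplified version the push always aborts the other transaction, matching the sequence above; how a real system decides which side yields is not covered here, and the function and data layout are hypothetical.

```python
def put(data: dict, records: dict, txn_id: int, key: str, value: str) -> None:
    """Write an intent for txn_id, pushing a conflicting PENDING transaction."""
    existing = data.get(key)
    if existing is not None:
        _, owner = existing
        if owner is not None and owner != txn_id:
            # a. Check the other transaction's record.
            if records[owner] == "PENDING":
                # b. Push it; in this sketch a push always aborts.
                records[owner] = "ABORTED"
    # c. Update the intent so it now belongs to txn_id.
    data[key] = (value, txn_id)


records = {1: "PENDING"}
data = {"cherry": ("from txn 1", 1)}  # Txn 1 already wrote its intent

records[2] = "PENDING"                  # 1. Begin Txn 2
put(data, records, 2, "cherry", "from txn 2")  # 2. Put "cherry"
records[2] = "COMMITTED"                # 3. Commit Txn 2
```

Because the intent points back at its transaction record, Txn 2 can resolve the conflict by changing that one record rather than by contacting Txn 1's client.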
Distributed Transactions
● Transaction atomicity is bootstrapped on top of Raft atomicity
● This description omitted MVCC and other types of conflicts
○ More details on the Cockroach Labs blog [1] or Spanner paper [2]
[1] https://www.cockroachlabs.com/blog/serializable-lockless-distributed-isolation-cockroachdb/
[2] https://research.google.com/archive/spanner.html
Summary
● Databases are best understood in the context in which they were built
● How does a NewSQL database work in comparison to previous DBs?
○ Distribute data to support large data sets and fault-tolerance
○ Replicate consistently with consensus protocols like Raft
○ Distributed transactions using consensus as the foundation
Thank You!
CockroachLabs.com/blog
github.com/cockroachdb/cockroach
alex@cockroachlabs.com / @alexwritescode

Talk: The Hows and Whys of a Distributed SQL Database - Strange Loop 2017

(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 

The Hows and Whys of a Distributed SQL Database - Strange Loop 2017

  • 1. The Hows and Whys of a Distributed SQL Database Alex Robinson / Member of the Technical Staff alex@cockroachlabs.com / @alexwritescode
  • 2. Agenda • Why a distributed SQL database? • How does a distributed SQL database work? ○ Data distribution ○ Data replication ○ Transactions
  • 3. Why a distributed SQL database? A brief history of databases
  • 4. 1960s: The first databases • Records stored with pointers to each other in a hierarchy or network • Queries had to process one tuple at a time, traversing the structure • No independence between logical and physical schemas ○ To optimize for different queries, you had to dump and reload all your data • Required a lot of programmer effort to use
  • 5. 1970s: SQL / RDBMS • Relational model was designed in contrast to prior hierarchical DBs ○ Expose simple data structures and high-level language ○ Queries are independent of the underlying physical storage • Very developer friendly • Designed to run on a single machine
  • 6. 1980s-1990s: Continued growth of SQL / RDBMS • SQL databases matured, took over the industry • Object-oriented databases tried, but didn’t gain much traction ○ Didn’t support complex queries well ○ No standard API across systems
  • 7. Early 2000s: Custom Sharding • Rapidly growing scale necessitated new solutions • Companies added custom middleware to connect single-node DBs • No cross-shard transactions, operational nightmares MySQL Middleware Applications
  • 8. 2004+: NoSQL • Further growth and operational pain necessitated new systems • Big change in priorities - scale and availability above all else • Sacrificed a lot to get there ○ No relational model (custom APIs instead of SQL) ○ No (or very limited) transactions and indexes ○ Manual joins
  • 9. 2010s: “NewSQL” / Distributed SQL
  • 10. Why another type of database?
  • 11. Database Limitations Existing database solutions place an undue burden on application developers and operators: ● Scale (sql) ● Fault tolerance (sql) ● Limited transactions (nosql) ● Limited indexes (nosql) ● Consistency issues (nosql)
  • 12. Strong Consistency and Transactions “We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions” -Google’s Spanner paper
  • 13. 2010s: “NewSQL” / Distributed SQL ● Distributed databases that provide full SQL semantics ● Attempt to combine the best of both worlds ○ Fully-featured SQL ○ Horizontally scalable and highly available ● Entirely new systems, tough to tack onto existing databases
  • 14. 2010s: “NewSQL” / Distributed SQL ● Burst onto the scene with Google’s Spanner system (published in 2012) ● CockroachDB started open source development in 2014
  • 15. So we know why, but how? And what’s different from SQL and NoSQL databases?
  • 16. What we’re going to cover 1. How data is distributed across machines 2. How remote copies of data are replicated and kept in sync 3. How transactions work Plus how all of this differs from conventional SQL and NoSQL systems
  • 19. Data Distribution in SQL • Option 1: All data on one server ○ With optional secondary replicas/backups that also store all the data • Option 2: Manually shard data across separate database instances ○ All data for each shard is still just on one server (plus any secondary replicas) ○ All data distribution choices are being made by the human DB admin
  • 20. Data Distribution in NoSQL / NewSQL • Core assumption: entire dataset won’t fit on one machine • Given that, there are two key questions to answer: ○ How do you divide it up? ○ How do you locate any particular piece of data? • Two primary approaches: Hashing or Order-Preserving
  • 21. Option one: Hashing ● Pick one or more hash functions, divide up data by hashing each key ● Deterministically map hashed keys to servers ● Pros: easy to locate data by key ● Con: awkward, inefficient range scans
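A minimal sketch of hash-based placement, to make the trade-off on this slide concrete. The server names, the choice of SHA-256, and the helper names `server_for_key` and `range_scan` are assumptions for illustration, not anything taken from the talk.

```python
import hashlib

# Hypothetical server names, used only for this illustration.
SERVERS = ["node-1", "node-2", "node-3"]


def server_for_key(key, servers=SERVERS):
    """Deterministically map a key to a server by hashing the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[bucket]


def range_scan(start, end, stores):
    """Return all keys in [start, end).

    Hashing scatters neighbouring keys across servers, so the scan has to
    ask every server and merge the results.
    """
    hits = []
    for server_keys in stores.values():
        hits.extend(k for k in server_keys if start <= k < end)
    return sorted(hits)


# Place a few keys, then scan a range of them.
stores = {server: [] for server in SERVERS}
for fruit in ["apricot", "banana", "cherry", "grape", "lemon"]:
    stores[server_for_key(fruit)].append(fruit)

print(server_for_key("cherry"))
print(range_scan("b", "h", stores))
```

Looking up one key touches exactly one server, while the scan touches all of them, which is the inefficiency the slide calls out.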
  • 22. Option two: Order-Preserving ● Divide sorted key space up into ranges of approximately equal size ● Distribute the resulting key ranges across the servers ● Pros: efficient scans, easy splitting ● Con: requires additional indexing
  • 23. Option two: Order-Preserving. Each shard (“range”) contains a contiguous segment of the key space. [Diagram: range Ø-lem holds apricot, banana, blueberry, cherry, grape; range lem-pea holds lime, mango, melon, orange; range pea-∞ holds peach, pear, pineapple, raspberry, strawberry; the key “lemon” is shown separately.]
  • 24. Option two: Order-Preserving. We need an indexing structure to locate a range. [Diagram: the same three ranges, plus a range index listing Ø-lem, lem-pea, pea-∞, with the key “lemon” shown next to it.]
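The range index can be sketched as a sorted list of range start keys searched with binary search. This is only an illustration using the three ranges from the slide; the names `RANGE_STARTS`, `RANGE_NAMES` and `locate_range` are hypothetical, and the empty string stands in for the minimum key Ø.

```python
import bisect

# Start key of each range, in sorted order. "" plays the role of Ø.
RANGE_STARTS = ["", "lem", "pea"]
RANGE_NAMES = ["Ø-lem", "lem-pea", "pea-∞"]


def locate_range(key):
    """Find the range whose [start, next start) interval contains key."""
    # bisect_right counts the start keys that are <= key, so the owning
    # range is the one just before that position.
    idx = bisect.bisect_right(RANGE_STARTS, key) - 1
    return RANGE_NAMES[idx]


for fruit in ["cherry", "lemon", "peach"]:
    print(fruit, "->", locate_range(fruit))
```

Because the ranges are contiguous and sorted, a scan over neighbouring keys touches only the ranges that overlap the scanned interval.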
  • 27. Data Distribution: Placement. [Diagram: replicas of Ranges 1-3 spread across Nodes 1-3.] Each range is replicated to three or more nodes.
  • 28. Data Distribution: Rebalancing. [Diagram: Nodes 1-3 hold the replicas of Ranges 1-3; Node 4 is empty.] Adding a new (empty) node.
  • 29. Data Distribution: Rebalancing. [Diagram: Nodes 1-4, with one additional replica of Range 3.] A new replica is allocated, data is copied.
  • 30. Data Distribution: Rebalancing. [Diagram: same contents as the previous slide.] The new replica is made live, replacing another.
  • 31. Data Distribution: Rebalancing. [Diagram: Nodes 1-4, without the extra Range 3 replica.] The old (inactive) replica is deleted.
  • 32. Data Distribution: Rebalancing. [Diagram: replicas of Ranges 1-3 spread across Nodes 1-4.] Process continues until nodes are balanced.
  • 33. Data Distribution: Recovery. [Diagram: the same four nodes, one marked with an X.] Losing a node causes recovery of its replicas.
  • 34. Data Distribution: Recovery. [Diagram: one node marked with an X; additional Range 1 and Range 3 replicas are shown.] A new replica gets created on an existing node to replace the lost replica.
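The rebalancing loop on these slides can be sketched as repeatedly moving one replica from the fullest node to the emptiest one. This is a toy model under strong assumptions: replica count is the only signal, copying is instantaneous, and the function name `rebalance` and the node names are made up for the example.

```python
def rebalance(placement):
    """Even out replica counts across nodes.

    placement maps node name -> set of range IDs stored on that node.
    Returns the list of (range_id, from_node, to_node) moves performed.
    """
    moves = []
    while True:
        fullest = max(placement, key=lambda n: len(placement[n]))
        emptiest = min(placement, key=lambda n: len(placement[n]))
        if len(placement[fullest]) - len(placement[emptiest]) <= 1:
            return moves  # nodes are balanced
        # Only move a range the target does not already hold, so a node
        # never ends up with two replicas of the same range.
        range_id = min(placement[fullest] - placement[emptiest])
        placement[emptiest].add(range_id)    # allocate new replica, copy data
        placement[fullest].remove(range_id)  # delete the old replica
        moves.append((range_id, fullest, emptiest))


cluster = {
    "node-1": {1, 2, 3},
    "node-2": {1, 2, 3},
    "node-3": {1, 2, 3},
    "node-4": set(),  # the new, empty node
}
print(rebalance(cluster))
print(cluster)
```

Adding the new replica before removing the old one mirrors the order shown on the slides, so a range never drops below its replication factor during a move.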
  • 36. Keeping Copies in Sync in SQL Databases • Option one: cold backups ○ No real expectation of backups being fully up-to-date • Option two: primary-secondary replication ○ All writes (and all reads that care about consistency) go to the “Primary” instance ○ All writes get pushed from the Primary to any Secondary instances
  • 37. Primary/Secondary Replication. In theory, replicas contain identical copies of data. [Diagram: Put “cherry” is sent to the Primary and on to the Secondary; both hold apricot, banana, blueberry, cherry, grape.] Voila!
  • 38. Primary/Secondary Replication • In practice, you have to choose ○ Asynchronous replication ■ Secondaries lag behind primary ■ Failover is likely to lose recent writes ○ Synchronous replication ■ All writes get delayed waiting for secondary acknowledgment ■ What do you do when your secondary fails? • Failover is really hard to get right ○ How do clients know which instance is the primary at any given time?
  • 39. Keeping Copies in Sync in NoSQL Databases • Most NoSQL systems are eventually consistent • Very wide space of solutions, many different possible behaviors ○ Last write wins ○ Last write wins with vector clocks ○ Leader elections ○ Quorum reads and writes ○ Conflict-free replicated data types (CRDTs)
  • 40. Keeping Copies in Sync in NewSQL Databases • Use a distributed consensus protocol from the CS literature ○ Available protocols: Paxos, Raft, Zab, Viewstamped Replication, … • At a high level: ○ Replicate data to N (often N=3 or N=5) nodes ○ Commit happens when a quorum have written the data
  • 43. Distributed Consensus: Raft made simple. Write committed when 2 out of 3 nodes have written data. [Diagram: Put “cherry” reaches the Leader and is passed on to the Followers. Leader: apricot, banana, blueberry, cherry, grape. One Follower: apricot, banana, blueberry, cherry, grape. Other Follower: apricot, banana, blueberry, grape.]
  • 44. Distributed Consensus: Raft made simple. [Diagram: same contents as the previous slide, with two Ack arrows.]
  • 45. Distributed Consensus: Raft made simple What happens during failover?
  • 46. Distributed Consensus: Raft made simple. [Diagram: Put “cherry” reaches the Leader. Leader: apricot, banana, blueberry, cherry, grape. Both Followers: apricot, banana, blueberry, grape.]
  • 47. Distributed Consensus: Raft made simple. [Diagram: same contents as the previous slide.]
  • 48. Distributed Consensus: Raft made simple. [Diagram: Read “cherry” reaches the Leader. Leader: apricot, banana, blueberry, grape. One Follower: apricot, banana, blueberry, cherry, grape. Other Follower: apricot, banana, blueberry, grape.]
  • 49. Distributed Consensus: Raft made simple. On failover, only data written to a quorum is considered present. [Diagram: same contents as the previous slide.]
  • 50. Distributed Consensus: Raft made simple. [Diagram: same contents as the previous slide, with the read returning “Key not found”.]
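The quorum rule from these slides can be sketched in a few lines. This is a toy model of the commit condition only, not Raft: there are no terms, log indexes, or elections, and the names `Replica`, `quorum_size` and `put` are hypothetical.

```python
def quorum_size(n_replicas):
    """Smallest majority of n_replicas (2 of 3, 3 of 5, ...)."""
    return n_replicas // 2 + 1


class Replica:
    def __init__(self, name, keys):
        self.name = name
        self.keys = set(keys)
        self.reachable = True


def put(leader, followers, key):
    """Write on the leader, replicate, and report whether a quorum wrote it."""
    leader.keys.add(key)
    acks = 1  # the leader's own write counts towards the quorum
    for follower in followers:
        if follower.reachable:
            follower.keys.add(key)
            acks += 1
    return acks >= quorum_size(len(followers) + 1)


fruit = ["apricot", "banana", "blueberry", "grape"]

# Happy path: one follower is unreachable, but 2 of 3 still write the key.
leader, f1, f2 = Replica("n1", fruit), Replica("n2", fruit), Replica("n3", fruit)
f2.reachable = False
print(put(leader, [f1, f2], "cherry"))

# Failover path: the leader writes locally but reaches no follower, so the
# write never commits. If a follower later takes over, it has no "cherry".
old_leader, g1, g2 = Replica("n1", fruit), Replica("n2", fruit), Replica("n3", fruit)
g1.reachable = g2.reachable = False
print(put(old_leader, [g1, g2], "cherry"))
print("cherry" in g1.keys)
```

The second scenario corresponds to the “Key not found” slide: a value that only ever reached one of three replicas was never committed, so losing it on failover does not violate any promise made to the client.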
  • 51. Consensus in NewSQL Databases ● Run one consensus group per range of data ● Many practical complications: member change, range splits, upgrades, scaling number of ranges
  • 52. Transactions How can they be made to work in a distributed system?
  • 53. What is an ACID Transaction? • Atomic - entire transaction either fully commits or fully aborts • Consistent - moves from one valid DB state to another • Isolated - concurrent transactions don’t interfere with each other ○ “Serializable” is needed for true isolation • Durable - committed transactions won’t disappear
  • 54. Transactions in SQL Databases • Atomicity is bootstrapped off a lower-level atomic primitive: log writes ○ All mutations applied as part of a transaction get tagged with a transaction ID ○ “Commit” log entry marks the transaction as committed, making mutations visible • Isolation is typically provided in one of two ways ○ Read/Write Locks ○ Multi-Version Concurrency Control (MVCC)
  • 55. Transactions in SQL Databases: Read/Write Locks • Pretty simple at a high level: ○ When you read a row, take a read lock on it ○ When you write a row, take a write lock on it ○ When you commit, release all locks • Requires deadlock detection
  • 56. Transactions in SQL Databases: MVCC • Store a timestamp with each row • When updating a row, don’t overwrite the old value • When reading, return the most recent value from before the read ○ Allows access to historical data ○ Allows long-running read-only transactions to not block new writes • Need to detect conflicts and abort conflicting transactions
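A minimal MVCC sketch of the read rule above, assuming integer timestamps and treating a version written exactly at the read timestamp as visible. The class name `MVCCStore` is hypothetical, and conflict detection is deliberately left out.

```python
class MVCCStore:
    """Keeps every version of every key instead of overwriting in place."""

    def __init__(self):
        self._versions = {}  # key -> list of (timestamp, value)

    def put(self, key, value, ts):
        # Never overwrite: append a new version tagged with its timestamp.
        self._versions.setdefault(key, []).append((ts, value))

    def get(self, key, ts):
        """Return the newest value written at or before ts, or None."""
        visible = [(t, v) for t, v in self._versions.get(key, []) if t <= ts]
        if not visible:
            return None
        return max(visible, key=lambda tv: tv[0])[1]


store = MVCCStore()
store.put("cherry", "red", ts=10)
store.put("cherry", "dark red", ts=20)

print(store.get("cherry", ts=15))  # a reader at ts=15 still sees the old value
print(store.get("cherry", ts=25))
```

A long-running reader pinned at timestamp 15 keeps seeing a stable snapshot while newer writes land, which is why it does not need to block them.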
  • 57. Transactions in NoSQL Databases • Many NoSQL systems don’t offer transactions at all • This is one of the biggest tradeoffs that was made to improve scalability • Those that do typically limit transactions to a single key ○ Single-key transactions don’t require coordinating across shards ○ Can be implemented as just supporting a compare-and-swap operation
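A single-key “transaction” built on compare-and-swap might look like the sketch below. It is an in-process stand-in for a key-value store, with a lock playing the role of the store's per-key atomicity; `KVStore`, `compare_and_swap` and `increment` are illustrative names, not any particular system's API.

```python
import threading


class KVStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        """Set key to new only if its current value equals expected."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True


def increment(store, key):
    """Read-modify-write on one key, retried until the swap succeeds."""
    while True:
        current = store.get(key)
        new = (current or 0) + 1
        if store.compare_and_swap(key, current, new):
            return new


store = KVStore()
print(increment(store, "counter"))
print(increment(store, "counter"))
```

Everything happens on one key, so no coordination across shards is needed; the same retry loop cannot be stretched to cover two keys atomically.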
  • 58. Transactions in NewSQL Databases • Support traditional ACID semantics • Atomicity is bootstrapped off distributed consensus ○ A “Commit” write to consensus system marks the transaction as committed • Isolation can be handled similarly to single-node SQL databases ○ But the added latency and distributed state makes things tougher
  • 59. Distributed Transactions (CockroachDB). Steps: 1. Begin Txn 1; 2. Put “cherry”. [Diagram: range holding apricot, cherry (txn 1), grape. Txn-1 Record: Status: PENDING.]
  • 60. Distributed Transactions (CockroachDB). Steps: 1. Begin Txn 1; 2. Put “cherry”; 3. Put “mango”. [Diagram: ranges holding apricot, cherry (txn 1), grape and lemon, mango (txn 1), orange. Txn-1 Record: Status: PENDING.]
  • 61. Distributed Transactions (CockroachDB). Steps: 1. Begin Txn 1; 2. Put “cherry”; 3. Put “mango”; 4. Commit Txn 1. [Diagram: ranges holding apricot, cherry (txn 1), grape and lemon, mango (txn 1), orange. Txn-1 Record: Status: COMMITTED.]
  • 62. Distributed Transactions (CockroachDB). Steps: 1. Begin Txn 1; 2. Put “cherry”; 3. Put “mango”; 4. Commit Txn 1; 5. Clean up intents. [Diagram: ranges holding apricot, cherry, grape and lemon, mango, orange. Txn-1 Record: Status: COMMITTED.]
  • 63. Distributed Transactions (CockroachDB). Steps: 1. Begin Txn 1; 2. Put “cherry”; 3. Put “mango”; 4. Commit Txn 1; 5. Clean up intents; 6. Remove Txn 1. [Diagram: ranges holding apricot, cherry, grape and lemon, mango, orange; no transaction record remains.]
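The six steps on these slides can be sketched with write intents and a transaction record. This is a single-process toy, not CockroachDB's implementation: the `Cluster` class and its methods are hypothetical, keys carry made-up values, and a reader that meets a pending intent simply sees the previously committed value.

```python
PENDING, COMMITTED, ABORTED = "PENDING", "COMMITTED", "ABORTED"


class Cluster:
    def __init__(self):
        self.data = {}         # key -> committed value
        self.intents = {}      # key -> (txn_id, provisional value)
        self.txn_records = {}  # txn_id -> status

    def begin(self, txn_id):
        self.txn_records[txn_id] = PENDING

    def put(self, txn_id, key, value):
        # Writes are provisional: the intent is tagged with its transaction.
        self.intents[key] = (txn_id, value)

    def commit(self, txn_id):
        # One write to the record flips every intent of the txn to visible.
        self.txn_records[txn_id] = COMMITTED

    def read(self, key):
        intent = self.intents.get(key)
        if intent and self.txn_records.get(intent[0]) == COMMITTED:
            return intent[1]
        return self.data.get(key)

    def cleanup(self, txn_id):
        """Resolve the intents of a finished txn, then drop its record."""
        status = self.txn_records[txn_id]
        if status == PENDING:
            raise ValueError("cannot clean up a pending transaction")
        for key, (owner, value) in list(self.intents.items()):
            if owner == txn_id:
                if status == COMMITTED:
                    self.data[key] = value
                del self.intents[key]
        del self.txn_records[txn_id]


db = Cluster()
db.begin(1)
db.put(1, "cherry", "red")
db.put(1, "mango", "yellow")
print(db.read("cherry"))  # intent is still pending
db.commit(1)
print(db.read("cherry"), db.read("mango"))
db.cleanup(1)
print(db.data, db.intents, db.txn_records)
```

Both keys become visible at the same instant because visibility hangs off the single status field in the transaction record, even though the intents live in different ranges.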
  • 64. Distributed Transactions ● That’s the happy case ● What about conflicting transactions?
  • 65. Distributed Transactions (write conflict). Txn 1 steps: 1. Begin Txn 1; 2. Put “cherry”. [Diagram: range holding apricot, cherry (txn 1), grape. Txn-1 Record: Status: PENDING.]
  • 66. Distributed Transactions (write conflict). Txn 1 steps: 1. Begin Txn 1; 2. Put “cherry”. Txn 2 steps: 1. Begin Txn 2; 2. Put “cherry”. [Diagram: range holding apricot, cherry (txn 1), grape. Txn-1 Record: Status: PENDING. Txn-2 Record: Status: PENDING.]
  • 67. Distributed Transactions (write conflict). Txn 1 steps: 1. Begin Txn 1; 2. Put “cherry”. Txn 2 steps: 1. Begin Txn 2; 2. Put “cherry”; a. Check Txn 1 record. [Diagram: range holding apricot, cherry (txn 1), grape. Txn-1 Record: Status: PENDING. Txn-2 Record: Status: PENDING.]
  • 68. Distributed Transactions (write conflict). Txn 1 steps: 1. Begin Txn 1; 2. Put “cherry”. Txn 2 steps: 1. Begin Txn 2; 2. Put “cherry”; a. Check Txn 1 record; b. Push Txn 1. [Diagram: range holding apricot, cherry (txn 1), grape. Txn-1 Record: Status: ABORTED. Txn-2 Record: Status: PENDING.]
  • 69. Distributed Transactions (write conflict). Txn 1 steps: 1. Begin Txn 1; 2. Put “cherry”. Txn 2 steps: 1. Begin Txn 2; 2. Put “cherry”; a. Check Txn 1 record; b. Push Txn 1; c. Update intent. [Diagram: range holding apricot, cherry (txn 2), grape. Txn-1 Record: Status: ABORTED. Txn-2 Record: Status: PENDING.]
  • 70. Distributed Transactions (write conflict). Txn 1 steps: 1. Begin Txn 1; 2. Put “cherry”. Txn 2 steps: 1. Begin Txn 2; 2. Put “cherry”; a. Check Txn 1 record; b. Push Txn 1; c. Update intent; 3. Commit Txn 2. [Diagram: range holding apricot, cherry (txn 2), grape. Txn-1 Record: Status: ABORTED. Txn-2 Record: Status: COMMITTED.]
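The write-conflict walk-through can be sketched the same way. One loud assumption: in this toy, a push always aborts the transaction that owns the conflicting intent, as happens on the slides; how a real system decides which transaction wins is not covered here. `IntentTable` and its methods are hypothetical names, and values are omitted to keep the focus on ownership.

```python
PENDING, COMMITTED, ABORTED = "PENDING", "COMMITTED", "ABORTED"


class IntentTable:
    def __init__(self):
        self.intents = {}      # key -> txn_id that owns the intent
        self.txn_records = {}  # txn_id -> status

    def begin(self, txn_id):
        self.txn_records[txn_id] = PENDING

    def put(self, txn_id, key):
        owner = self.intents.get(key)
        if owner is not None and owner != txn_id:
            # a. Check the other transaction's record.
            if self.txn_records[owner] == PENDING:
                # b. Push it. Assumption: the push always aborts the owner.
                self.txn_records[owner] = ABORTED
            # An intent owned by a COMMITTED txn would first be resolved
            # into a committed value; values are not modelled here.
        # c. Update the intent so that it now belongs to this transaction.
        self.intents[key] = txn_id

    def commit(self, txn_id):
        """Commit only if nobody aborted this txn in the meantime."""
        if self.txn_records[txn_id] != PENDING:
            return False
        self.txn_records[txn_id] = COMMITTED
        return True


db = IntentTable()
db.begin(1)
db.put(1, "cherry")
db.begin(2)
db.put(2, "cherry")   # conflicts with Txn 1's intent and pushes Txn 1
print(db.txn_records)
print(db.commit(2))
print(db.commit(1))   # Txn 1 discovers it was aborted
```

The pushed transaction learns its fate when it next consults its own record, so its client would have to retry it.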
  • 71. Distributed Transactions ● Transaction atomicity is bootstrapped on top of Raft atomicity ● This description omitted MVCC and other types of conflicts ○ More details on the Cockroach Labs blog [1] or Spanner paper [2] [1] https://www.cockroachlabs.com/blog/serializable-lockless-distributed-isolation-cockroachdb/ [2] https://research.google.com/archive/spanner.html
  • 73. Summary ● Databases are best understood in the context in which they were built ● How does a NewSQL database work in comparison to previous DBs? ○ Distribute data to support large data sets and fault-tolerance ○ Replicate consistently with consensus protocols like Raft ○ Distributed transactions using consensus as the foundation