Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database
1. Google Spanner – a Synchronously-Replicated, Globally-Distributed, Multi-Version Database
22.01.2013 Maciej Jozwiak
Presented by:
Maciej Jozwiak
2. Agenda
• Problem description
• Overview of available solutions
• Globally-distributed database
• Architecture
• How is data replicated?
• Data model
• TrueTime API
• Transactions
• Summary
3. Problem – Need for Scalable MySQL
• Google's advertising backend
  – Based on MySQL
    • Relations
    • Query language
  – Manually sharded
    • Resharding is very costly
  – Global distribution
SHARDING:
Sharding is another name for "horizontal partitioning" of a database. Rows of a database table are held separately, each forming a partition that can be located on a separate database server or in a separate physical location.
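The idea in the sidebar can be sketched as a toy hash-sharded key-value store (a sketch only: the function names are made up, and Google's MySQL backend was sharded manually by range rather than automatically by hash):

```python
# Toy hash-based sharding: each row is routed to a shard by hashing its
# key, so rows spread across servers and every key has one stable home.

def shard_for(key, num_shards):
    # Stable routing: the same key always lands on the same shard.
    return hash(key) % num_shards

shards = [dict() for _ in range(4)]   # four pretend database servers

def put(key, value):
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    return shards[shard_for(key, len(shards))].get(key)

put("customer:42", {"budget": 1000})
print(get("customer:42"))  # -> {'budget': 1000}
# Changing the shard count changes the routing of existing keys, so all
# data must be rehashed and moved -- this is why resharding is so costly.
```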
4. Overview of Available Solutions
Google Megastore:
• Replicated ACID transactions
• Schematized semi-relational tables
• Synchronous replication support across data-centers
• Weakness: performance
Google Bigtable:
• Performance, scalability, throughput
• Weaknesses: lack of a query language; only eventually-consistent replication support across data-centers
7. Solution: Google Spanner
Bridging the gap between Megastore and Bigtable:
• Removes the need to manually partition data
• Synchronous replication and automatic failover
• Strong transactional semantics
• SQL-based query language
• Semi-relational, schematized tables
8. Globally-Distributed Database
Future scale:
• one million to 10 million servers
• 100s to 1000s of locations around the world
• 10^13 directories
• 10^18 bytes of storage
Cross-datacenter replicated data management provides:
• high availability
• minimized latency of data reads and writes
• replication configuration dynamically controlled at a fine grain by applications
9. Spanner Deployment – Universe
• Universe master: status and interactive debugging
• Placement driver: moves data across zones automatically
10. How Is Data Replicated?
Paxos: a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.
[Figure: Spanserver software stack]
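A single round of single-decree Paxos can be sketched as follows (a simplified illustration of the consensus idea only, not Spanner's actual Paxos implementation, which is long-lived, leader-based, and pipelined):

```python
# Minimal single-decree Paxos sketch: a proposer must first collect
# promises from a majority of acceptors (phase 1), then ask a majority
# to accept a value (phase 2). Once a value is chosen, later proposers
# are forced to re-propose that same value.

class Acceptor:
    def __init__(self):
        self.promised = -1     # highest proposal number promised
        self.accepted = None   # (number, value) of last accepted proposal

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    """Run one proposal round; return the chosen value, or None."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(n) for a in acceptors]
    granted = [prev for ok, prev in promises if ok]
    if len(granted) < majority:
        return None
    # If some acceptor already accepted a value, we must propose the one
    # with the highest proposal number instead of our own.
    prior = [p for p in granted if p is not None]
    if prior:
        value = max(prior)[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="replica-A"))  # prints replica-A
# A later proposer with a higher number learns the already-chosen value:
print(propose(acceptors, n=2, value="replica-B"))  # still prints replica-A
```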
11. Replication Configuration
• Replication configurations for data can be dynamically controlled at a fine grain by applications
• Applications can specify constraints to control:
  – which datacenters contain which data
  – how far data is from its users (to control read latency)
  – how far replicas are from each other (to control write latency)
  – how many replicas are maintained (to control durability, availability, and read performance)
• Example: 5 replicas in North America, 2 replicas in Europe
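The constraints above could be modelled roughly like this (a hypothetical sketch: Spanner actually exposes placement through a menu of named options, and the class and field names here are invented):

```python
# Hypothetical placement policy an application might declare, plus the
# majority-quorum arithmetic that links replica count to availability.

from dataclasses import dataclass, field

@dataclass
class ReplicaPolicy:
    datacenters: list = field(default_factory=list)  # which DCs hold data
    num_replicas: int = 3        # durability / availability / read perf
    leader_region: str = ""      # keeping the leader near writers cuts
                                 # write latency

def quorum(num_replicas: int) -> int:
    # Paxos needs a majority of replicas to commit a write.
    return num_replicas // 2 + 1

policy = ReplicaPolicy(
    datacenters=["us-east", "us-west", "us-central", "eu-west", "eu-east"],
    num_replicas=5,
    leader_region="us-east",
)
print(quorum(policy.num_replicas))  # -> 3: tolerates two replica failures
```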
12. Hierarchical Data Model
• Universe (Spanner deployment)
  – Databases
    • Tables
      – Rows and columns
      – Must have an ordered set of one or more primary-key columns
      – The primary key uniquely identifies each row
    • Hierarchies of tables
      – Tables must be partitioned by the client into one or more hierarchies of tables (INTERLEAVE IN)
      – The table at the top of a hierarchy is a directory table
17. Storing Photo Metadata
Albums(2,1) – the row from the Albums table for user_id 2, album_id 1.
Interleaving is important because it allows clients to describe the locality relationships that are necessary for good performance in a sharded, distributed database.
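The locality benefit of interleaving can be sketched with a toy key encoding for the Users/Albums example above (the tuple encoding is illustrative, not Spanner's actual on-disk format): each child row's key is prefixed by its parent's key, so all of a user's albums sort next to that user and land in the same shard.

```python
# Interleaved keys: a child row (Albums) shares its parent's (Users)
# key prefix, so parent and children are contiguous in key order.

def user_key(user_id):
    return ("Users", user_id)

def album_key(user_id, album_id):
    # Child key = parent key + child table + child primary key.
    return ("Users", user_id, "Albums", album_id)

rows = [
    user_key(1),
    album_key(2, 1),
    user_key(2),
    album_key(1, 2),
    album_key(1, 1),
]
for key in sorted(rows):
    print(key)
# Albums(1,1) and Albums(1,2) sort directly under Users(1):
# ('Users', 1)
# ('Users', 1, 'Albums', 1)
# ('Users', 1, 'Albums', 2)
# ('Users', 2)
# ('Users', 2, 'Albums', 1)
```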
19. Is Synchronizing Time at the Global Scale Possible?
Distributed systems dogma:
• synchronizing time within and between datacenters is extremely hard and uncertain
• serialization of requests is impossible at global scale
21. Is Synchronizing Time at the Global Scale Possible?
Idea: accept uncertainty, keep it small, and quantify it (using GPS and atomic clocks).
22. TrueTime API
Idea: accept uncertainty, keep it small, and quantify it (using GPS and atomic clocks). TrueTime is a novel API distributing a globally synchronized "proper time".

Method        Returns
TT.now()      TTinterval: [earliest, latest]
TT.after(t)   true if t has definitely passed
TT.before(t)  true if t has definitely not arrived

The TTinterval is guaranteed to contain the absolute time at which TT.now() was invoked.
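A minimal sketch of the three methods, simulating the uncertainty with a fixed epsilon (real TrueTime's epsilon varies between clock synchronizations; the 7 ms value is a typical bound, not a constant):

```python
# TrueTime sketch: now() returns an interval [earliest, latest] that is
# guaranteed to contain the true absolute time; after()/before() answer
# only when the interval makes the answer certain.

import time
from collections import namedtuple

TTinterval = namedtuple("TTinterval", ["earliest", "latest"])

class TrueTime:
    def __init__(self, epsilon=0.007):   # simulated uncertainty, ~7 ms
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return TTinterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True only if t has definitely passed.
        return self.now().earliest > t

    def before(self, t):
        # True only if t has definitely not arrived.
        return self.now().latest < t

tt = TrueTime()
iv = tt.now()
print(tt.after(iv.earliest - 1.0))  # True: one second ago has definitely passed
print(tt.before(iv.earliest))       # False: iv.earliest may already have passed
```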
23. How Is TrueTime Implemented?
• a set of time master machines per datacenter
  – the majority of masters have GPS receivers with dedicated antennas
  – the remaining masters (referred to as Armageddon masters) are equipped with atomic clocks
• a timeslave daemon per machine
24. Time Reference Vulnerabilities
Two forms of time reference mean two failure modes, uncorrelated with each other:
• GPS:
  – antenna and receiver failures
  – local radio interference
  – correlated failures (e.g. spoofing)
  – GPS system outages
• Atomic clocks:
  – can drift significantly due to frequency error
25. How Does the Daemon Work?
The daemon polls a variety of masters and reaches a consensus about the correct timestamp:
• masters chosen from nearby datacenters
• masters from more distant datacenters
• Armageddon masters
The daemon's poll interval is 30 seconds. Between synchronizations, the daemon advertises a slowly increasing time uncertainty (ε).
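The slowly increasing uncertainty is a sawtooth: ε resets after each poll and then grows with the assumed worst-case clock drift until the next poll. The sketch below assumes the 200 µs/s drift bound from the Spanner paper; the base uncertainty value is an illustrative assumption.

```python
# Advertised uncertainty between synchronizations: last known bound plus
# worst-case local clock drift since the last poll of the time masters.

DRIFT_BOUND = 0.0002    # assumed worst-case drift: 200 microseconds/second
POLL_INTERVAL = 30.0    # the daemon polls time masters every 30 seconds
BASE_EPSILON = 0.001    # ~1 ms right after a synchronization (assumed)

def advertised_epsilon(seconds_since_sync):
    return BASE_EPSILON + DRIFT_BOUND * seconds_since_sync

for t in (0.0, 15.0, POLL_INTERVAL):
    print(f"{t:5.1f}s after sync: eps = {advertised_epsilon(t) * 1000:.1f} ms")
# eps grows from 1.0 ms to 7.0 ms over one 30 s poll interval, then resets
```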
26. Transactions in Spanner
• Spanner assigns globally meaningful commit timestamps to distributed transactions:
  – if A happens-before B, then timestamp(A) < timestamp(B)
  – A happens-before B if A's effects become visible before B begins, in real time
    • "visible" means acknowledged to the client or applied at some replica
    • "begins" means the first request arrived at a Spanner server
• Distributed transactions use two-phase commit
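The timestamp guarantee above rests on the "commit wait" rule: pick the commit timestamp s = TT.now().latest, then delay acknowledging the commit until TT.after(s) holds, so no client can see the write before absolute time s has definitely passed. A sketch with a simulated TrueTime (the fixed epsilon is an assumption):

```python
# Commit wait: by the time the commit is acknowledged, its timestamp s
# is guaranteed to be in the past at every node, which makes timestamp
# order match real-time (happens-before) order.

import time

EPSILON = 0.005  # simulated clock uncertainty, ~5 ms

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)   # (earliest, latest)

def tt_after(s):
    return tt_now()[0] > s              # s has definitely passed

def commit(apply_writes):
    s = tt_now()[1]          # commit timestamp: latest possible "now"
    apply_writes()
    while not tt_after(s):   # commit wait, typically ~2 * EPSILON
        time.sleep(EPSILON / 5)
    return s                 # only now is the commit acknowledged

s = commit(lambda: None)
print(tt_after(s))  # True: the acked timestamp is definitely in the past
```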
27. What About Performance?
Two-phase commit can raise availability and performance issues. The authors' position:
"We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions."
28. Summary
• Externally consistent global write-transactions with synchronous replication.
• Schematized, semi-relational data model.
• SQL-like query interface.
• Auto-sharding, auto-rebalancing, automatic failure response.
• Exposes control of data replication and placement to the user/application.