Megastore and Spanner
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology
(Tehran Polytechnic)
Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 1 / 54
Motivation
Storage requirements of today’s interactive online applications.
• Scalability (a billion internet users)
• Rapid development
• Responsiveness (low latency)
• Durability and consistency (never lose data)
• Fault tolerant (no unplanned/planned downtime)
• Easy operations (minimize confusion, support is expensive)
These requirements are in conflict.
Motivation
Relational DBMS, e.g., MySQL, MS SQL, Oracle RDB
• Rich set of features
• Difficult to scale to the massive volume of reads and writes.
NoSQL, e.g., BigTable, Dynamo, Cassandra
• Highly Scalable
• Limited API
NewSQL Databases
NoSQL scalability + RDBMS ACID
E.g., Megastore and Spanner
Megastore
Megastore
Started in 2006 for app development at Google.
Google’s wide-area replicated data store.
Adds (limited) transactions to wide-area replicated data stores.
GMail, Google+, Android Market, Google App Engine, ...
Megastore
Megastore layered on:
• GFS (Distributed file system)
• Bigtable (NoSQL scalable data store per datacenter)
BigTable is cluster-level structured storage, while Megastore is a geo-scale structured database.
[http://cse708.blogspot.jp/2011/03/megastore-providing-scalable-highly.html]
Entity Group (1/2)
The data is partitioned into a collection of entity groups (EG).
Entity Group (2/2)
Entity Group Replication (1/2)
Each entity group is independently and synchronously replicated over a wide area.
Megastore’s replication system provides a single consistent view of
the data stored in its underlying replicas.
Entity Group Replication (2/2)
Synchronous replication: a low-latency implementation of paxos.
Basic paxos not used: poor match for high-latency links.
• Writes require at least two inter-replica round-trips to achieve
consensus: prepare round, accept round
• Reads require one inter-replica round-trip: prepare round
Megastore uses a modified version of paxos: fast read, fast write
Entity Group Transaction (1/3)
Within each EG: full ACID semantics
Transaction management using Write Ahead Logging (WAL).
BigTable feature: the ability to store multiple values for the same row/column with different timestamps.
Multiversion Concurrency Control (MVCC) in EGs.
Entity Group Transaction (2/3)
Read consistency
• Current: waits for uncommitted writes, then reads the last
committed value.
• Snapshot: doesn’t wait, and reads the last committed values.
• Inconsistent reads: ignores the state of log and reads the last values
directly (data may be stale).
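The three read levels can be pictured with a toy multiversion replica. This is a loose illustration; the `EntityGroupReplica` class and its fields are invented, not Megastore's API, and the blocking in a real current read is replaced by an assertion:

```python
class EntityGroupReplica:
    """Loose illustration of Megastore's read levels (not its real API)."""

    def __init__(self):
        self.versions = {}     # key -> [(timestamp, value)] applied at this replica
        self.committed_ts = 0  # highest commit timestamp in the replicated log
        self.applied_ts = 0    # highest timestamp this replica has applied

    def apply(self, key, value, ts):
        # A committed log entry is applied to the local multiversion store.
        self.versions.setdefault(key, []).append((ts, value))
        self.applied_ts = ts

    def read(self, key, level="current"):
        if level == "current":
            # A real replica would block here until all committed writes apply.
            assert self.applied_ts >= self.committed_ts, "replica must catch up"
        # "snapshot" returns the last applied committed value without waiting;
        # "inconsistent" does the same but may be served by a lagging replica,
        # so the data may be stale.
        history = self.versions.get(key, [])
        return history[-1][1] if history else None
```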
Entity Group Transaction (3/3)
Write consistency
• Determines the next available log position.
• Assigns the mutations in the WAL a timestamp higher than any previous one.
• Employs paxos to settle the resource contention.
• Based on optimistic concurrency: in case of multiple writers to the
same log position, only one will win, and the rest will notice the
victorious write, abort, and retry their operations.
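The optimistic write path can be sketched as writers racing for the next log position. The `WriteAheadLog` class and `commit` helper below are invented stand-ins, with a dict standing in for a paxos-decided slot per position:

```python
class WriteAheadLog:
    """Toy replicated log: each position is won by exactly one writer."""

    def __init__(self):
        self.entries = {}  # position -> mutation (slot decided by consensus)

    def next_position(self):
        return len(self.entries)

    def try_append(self, position, mutation):
        # Stand-in for a paxos round: the first proposer wins the slot.
        if position in self.entries:
            return False  # another writer already won this position
        self.entries[position] = mutation
        return True

def commit(log, mutation, max_retries=10):
    for _ in range(max_retries):
        pos = log.next_position()
        if log.try_append(pos, mutation):
            return pos  # this writer won the position
        # Lost the race: notice the victorious write, abort, retry.
    raise RuntimeError("too much contention")
```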
Across Entity Group Transaction (1/3)
Across entity groups: limited consistency guarantees
Two methods:
• Asynchronous messaging (queue)
• Two-Phase-Commit (2PC)
Across Entity Group Transaction (2/3)
Queues
Provide transactional messaging between EGs.
Each message is either:
• Synchronous: has a single sending and receiving entity group.
• Asynchronous: has different sending and receiving entity groups.
Useful to perform operations that affect many EGs.
Across Entity Group Transaction (3/3)
Two-Phase Commit
Atomicity is satisfied, but at the cost of high latency.
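The trade-off is visible in a minimal two-phase-commit sketch (the `Participant` class is invented; real 2PC also logs durably at each step, which is where the latency comes from):

```python
class Participant:
    """Toy 2PC participant (illustrative only)."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.state = "idle"

    def prepare(self, mutation):
        # Vote yes only if this participant can durably hold the mutation.
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def abort(self):
        self.state = "aborted"

    def commit(self):
        self.state = "committed"

def two_phase_commit(participants, mutation):
    # Phase 1: every participant must vote to prepare.
    if not all(p.prepare(mutation) for p in participants):
        for p in participants:
            p.abort()
        return False
    # Phase 2: once all voted yes, everyone commits: atomicity holds.
    for p in participants:
        p.commit()
    return True
```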
Spanner
Limitations of Existing Systems
BigTable
• Scalability
• High throughput
• High performance
• Transactional scope limited to single row
• Eventually-consistent replication support across data-centers
Limitations of Existing Systems
Megastore
• Replicated ACID transactions
• Schematized semi-relational tables
• Synchronous replication support across data-centers
• Performance (poor write throughput)
• Lack of query language
Spanner
Bridging the gap between Megastore and Bigtable.
SQL transactions + high throughput
Spanner
Global scale database with strict transactional guarantees.
Global scale
• Across datacenters
• Scale up to millions of nodes, hundreds of datacenters, trillions of
database rows
Strict transactional guarantees
• General transactions (even inter-row)
• Reliable even during wide-area natural disasters
Spanner Implementation
Spanner Organization (1/2)
Universe: Spanner deployment
Zones: analogous to a deployment of BigTable servers (the unit of physical isolation)
• One zonemaster: assigns data to spanservers
• Location proxies: used by clients to locate the spanservers assigned to serve their data
• Thousands of spanservers: serve data to clients
Spanner Organization (2/2)
The universe master: a console that displays status information
about all the zones.
The placement driver: handles automated movement of data across
zones.
Spanserver Software Stack (1/4)
Each spanserver is responsible for 100-1000 instances of a data structure called a tablet (similar to a BigTable tablet).
Tablet mapping: (key: string, timestamp: int64) → string
Data and logs stored on Colossus (successor of GFS).
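The tablet mapping can be pictured as a multiversion dictionary. This `Tablet` class is a hypothetical sketch that returns the newest value at or below a requested timestamp:

```python
import bisect

class Tablet:
    """Toy multiversion map: (key, timestamp) -> value (illustrative)."""

    def __init__(self):
        self.cells = {}  # key -> ([timestamps], [values]), timestamps sorted

    def put(self, key, ts, value):
        tss, vals = self.cells.setdefault(key, ([], []))
        i = bisect.bisect_left(tss, ts)
        tss.insert(i, ts)
        vals.insert(i, value)

    def get(self, key, ts):
        # Newest value with timestamp <= ts, or None if nothing that old.
        tss, vals = self.cells.get(key, ([], []))
        i = bisect.bisect_right(tss, ts)
        return vals[i - 1] if i else None
```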
Spanserver Software Stack (2/4)
A single paxos state machine on top of each tablet: consistent replication
Paxos group: all machines involved in an instance of paxos.
Paxos implementation supports long-lived leaders with time-based
leader leases.
Spanserver Software Stack (3/4)
Writes must initiate the paxos protocol at the leader.
Reads access state directly from the underlying tablet at any replica
that is sufficiently up-to-date.
Spanserver Software Stack (4/4)
Transaction manager: to support distributed transactions
• At every replica that is a leader.
Transactions Involving Only One Paxos Group
This is the case for most transactions.
A long-lived paxos leader.
• The transaction manager: participant leader
• The other replicas in the group: participant slaves
A lock table for concurrency control.
• Multiple concurrent transactions.
• Maintained by paxos leader.
• Maps ranges of keys to lock states.
• Two-phase locking.
• Wound-wait for deadlock avoidance: a young transaction dies if an older transaction needs a resource held by the young transaction.
Single-group transactions can bypass the transaction manager.
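Wound-wait fits in a few lines: a transaction's priority is its start timestamp (smaller = older = higher priority). The `LockTable` here is an invented toy, not Spanner's lock table:

```python
def wound_wait(requester_ts, holder_ts):
    """Action taken by a transaction requesting a lock someone else holds."""
    if requester_ts < holder_ts:
        return "wound holder"  # older requester aborts the younger holder
    return "wait"              # younger requester waits for the older holder

class LockTable:
    """Toy lock table keyed by single keys (real Spanner maps key ranges)."""

    def __init__(self):
        self.locks = {}  # key -> start timestamp of the current holder

    def acquire(self, key, ts):
        holder = self.locks.get(key)
        if holder is None or ts < holder:  # free, or we wound the younger holder
            self.locks[key] = ts
            return "granted"
        return "wait"
```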
Transactions Involving Multiple Paxos Groups
One of the participant groups is chosen as the coordinator.
• The participant leader of that group will be referred to as the
coordinator leader.
• The slaves of that group as coordinator slaves.
The groups’ leaders coordinate to perform two-phase commit.
The state of each transaction manager is stored in the underlying
paxos group (and therefore is replicated).
Data Model and Directories
Data Model
An application creates one or more databases in a universe.
Each database can contain an unlimited number of schematized
tables.
Table
• Rows and columns
• Must have an ordered set of one or more primary-key columns
• Primary key uniquely identifies each row
Hierarchies of tables
• Tables must be partitioned by client into one or more hierarchies of
tables
• The table at the top of a hierarchy: the directory table
Directory (1/2)
Set of contiguous keys that share a common prefix.
All data in a directory has the same replication configuration.
The smallest unit whose geographic replication properties can be
specified by an application.
A Paxos group may contain multiple directories.
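Grouping rows into directories by a shared key prefix can be sketched as follows; the path-style keys and the split on "/" are assumptions made only for illustration:

```python
from itertools import groupby

def directories(keys):
    """Group contiguous row keys by their shared prefix (toy model)."""
    prefix = lambda k: k.split("/")[0]
    return {p: list(g) for p, g in groupby(sorted(keys), key=prefix)}
```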
Directory (2/2)
Spanner might move a directory:
• To shed load from a paxos group.
• To put directories that are frequently accessed together into the
same group.
• To move a directory into a group that is closer to its accessors.
Example
True Time and Consistency
Key Innovation
Spanner knows what time it is.
Time Synchronization (1/2)
Is synchronizing time at the global scale possible?
Synchronizing time within and between datacenters is extremely
hard and uncertain.
Serialization of requests is impossible at global scale.
Time Synchronization (2/2)
Idea: accept uncertainty, keep it small, and quantify it (using GPS and atomic clocks).
True Time API
TT.now() returns a TTinterval that is guaranteed to contain the absolute time at which TT.now() was invoked.
How Is TrueTime Implemented? (1/2)
How Is TrueTime Implemented? (2/2)
A daemon on each machine polls a variety of time masters: GPS masters chosen from nearby datacenters, GPS masters from farther datacenters, and Armageddon (atomic-clock) masters.
The daemon reaches a consensus about the correct timestamp.
External Consistency (1/2)
Jerry unfriends Tom, then writes a controversial comment.
If the comment is ordered before the unfriending in the serial order, Jerry will be in trouble!
External Consistency (2/2)
External consistency: formally, if the commit of T1 precedes the initiation of a new transaction T2 in wall-clock (physical) time, then the commit of T1 should also precede the commit of T2 in the serial ordering.
Snapshot Reads
Read in the past without locking.
Client can specify a timestamp for the read, or an upper bound on the timestamp.
Each replica tracks a value called safe time tsafe, which is the maximum timestamp at which a replica is up-to-date.
Replica can satisfy read at any t ≤ tsafe.
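The safe-time rule can be mirrored in a small sketch: tsafe is also capped just below any prepared-but-uncommitted distributed transaction, since a read above a pending prepare might miss its eventual commit. The `Replica` class and its fields are invented for illustration:

```python
class Replica:
    """Toy model of the safe-time rule (illustrative, simplified)."""

    def __init__(self):
        self.t_paxos = 0    # highest applied paxos write timestamp
        self.prepared = []  # prepare timestamps of pending 2PC transactions

    def t_safe(self):
        # Cap below the earliest prepared-but-uncommitted transaction.
        bound = min(self.prepared) - 1 if self.prepared else float("inf")
        return min(self.t_paxos, bound)

    def can_serve_read(self, t):
        return t <= self.t_safe()  # serve any read at t <= tsafe
```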
Read-only Transactions
Assign timestamp sread and do snapshot read at sread.
sread = TT.now().latest()
It guarantees external consistency.
Read-Write Transactions (1/3)
Leader must only assign timestamps within the interval of its leader
lease.
Timestamps must be assigned in monotonically increasing order.
If transaction T1 commits before T2 starts, T2’s commit timestamp
must be greater than T1’s commit timestamp.
Read-Write Transactions (2/3)
Clients buffer writes.
The client chooses a coordinator group that initiates two-phase commit.
A non-coordinator participant leader chooses a prepare timestamp, logs a prepare record through paxos, and notifies the coordinator.
Read-Write Transactions (3/3)
The coordinator assigns a commit timestamp si no less than all
prepare timestamps and TT.now().latest().
The coordinator ensures that clients cannot see any data committed
by Ti until TT.after(si) is true. This is done by commit wait (wait
until absolute time passes si to commit).
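Commit wait can be sketched on top of a TrueTime-style interface (`tt_now` and `EPSILON` are invented stand-ins): pick s no less than all prepare timestamps and the current latest bound, then block until s has definitely passed:

```python
import time

EPSILON = 0.007  # assumed clock-uncertainty bound in seconds (illustrative)

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)  # (earliest, latest)

def commit(prepare_timestamps):
    # Commit timestamp s: no less than all prepare timestamps and latest.
    s = max(prepare_timestamps + [tt_now()[1]])
    while tt_now()[0] <= s:  # commit wait: until TT.after(s) is true
        time.sleep(0.001)
    return s  # s is now definitely in the past at every clock
```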
Summary
Summary
Megastore
Entity Groups (EG)
Within an EG: ACID transactions, using paxos
Across EGs: using queues and two-phase commit
Summary
Spanner
Replica consistency: using paxos protocol
Concurrency control: using two-phase locking
Transaction coordination: using two-phase commit
Timestamps for transactions and data items
Questions?

MegaStore and Spanner

  • 1.
    Megastore and Spanner AmirH. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 1 / 54
  • 2.
    Motivation Storage requirements oftoday’s interactive online applications. • Scalability (a billion internet users) • Rapid development • Responsiveness (low latency) • Durability and consistency (never lose data) • Fault tolerant (no unplanned/planned downtime) • Easy operations (minimize confusion, support is expensive) Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 2 / 54
  • 3.
    Motivation Storage requirements oftoday’s interactive online applications. • Scalability (a billion internet users) • Rapid development • Responsiveness (low latency) • Durability and consistency (never lose data) • Fault tolerant (no unplanned/planned downtime) • Easy operations (minimize confusion, support is expensive) These requirements are in conflict. Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 2 / 54
  • 4.
    Motivation Relational DBMS, e.g.,MySQL, MS SQL, Oracle RDB • Rich set of features • Difficult to scale to the massive amount of reads and writes. Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 3 / 54
  • 5.
    Motivation Relational DBMS, e.g.,MySQL, MS SQL, Oracle RDB • Rich set of features • Difficult to scale to the massive amount of reads and writes. NoSQL, e.g., BigTable, Dynamo, Cassandra • Highly Scalable • Limited API Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 3 / 54
  • 6.
    NewSQL Databases NoSQL scalability+ RDBMS ACID E.g., Megastore and Spanner Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 4 / 54
  • 7.
    Megastore Amir H. Payberah(Tehran Polytechnic) Megastore and Spanner 1393/7/28 5 / 54
  • 8.
    Megastore Started in 2006for app development at Google. Google’s wide-area replicated data store. Adds (limited) transactions to wide-area replicated data stores. GMail, Google+, Android Market, Google App Engine, ... Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 6 / 54
  • 9.
    Megastore Megastore layered on: •GFS (Distributed file system) • Bigtable (NoSQL scalable data store per datacenter) [http://cse708.blogspot.jp/2011/03/megastore-providing-scalable-highly.html] Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 7 / 54
  • 10.
    Megastore Megastore layered on: •GFS (Distributed file system) • Bigtable (NoSQL scalable data store per datacenter) BigTable is cluster-level structured storage, while Megastore is geo- scale structured database. [http://cse708.blogspot.jp/2011/03/megastore-providing-scalable-highly.html] Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 7 / 54
  • 11.
    Entity Group (1/2) Thedata is partitioned into a collection of entity groups (EG). Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 8 / 54
  • 12.
    Entity Group (2/2) AmirH. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 9 / 54
  • 13.
    Entity Group Replication(1/2) Each entity group independently and synchronously replicated over a wide area. Megastore’s replication system provides a single consistent view of the data stored in its underlying replicas. Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 10 / 54
  • 14.
    Entity Group Replication(2/2) Synchronous replication: a low-latency implementation of paxos. Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 11 / 54
  • 15.
    Entity Group Replication(2/2) Synchronous replication: a low-latency implementation of paxos. Basic paxos not used: poor match for high-latency links. Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 11 / 54
  • 16.
    Entity Group Replication(2/2) Synchronous replication: a low-latency implementation of paxos. Basic paxos not used: poor match for high-latency links. • Writes require at least two inter-replica round-trips to achieve consensus: prepare round, accept round • Reads require one inter-replica round-trip: prepare round Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 11 / 54
  • 17.
    Entity Group Replication(2/2) Synchronous replication: a low-latency implementation of paxos. Basic paxos not used: poor match for high-latency links. • Writes require at least two inter-replica round-trips to achieve consensus: prepare round, accept round • Reads require one inter-replica round-trip: prepare round Megastore uses a modified version of paxos: fast read, fast write Amir H. Payberah (Tehran Polytechnic) Megastore and Spanner 1393/7/28 11 / 54
  • 18.
Entity Group Transaction (1/3)
Within each EG: full ACID semantics.
Transaction management using Write-Ahead Logging (WAL).
BigTable feature: the ability to store multiple values for the same row/column with different timestamps.
This enables Multiversion Concurrency Control (MVCC) within EGs.
Entity Group Transaction (2/3)
Read consistency:
• Current: waits for uncommitted writes to settle, then reads the last committed value.
• Snapshot: does not wait; reads the last committed value.
• Inconsistent: ignores the state of the log and reads the latest values directly (data may be stale).
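The three read modes can be sketched over a tiny multi-version cell. This is a minimal sketch under assumed names (`committed_ts`, `pending`); the real implementation reads the WAL state in BigTable.

```python
# Sketch of Megastore's three read modes over a multi-version cell.
# `versions` holds (timestamp, value); everything at or below
# `committed_ts` is committed; `pending` marks an in-flight write.

class Cell:
    def __init__(self):
        self.versions = []       # (timestamp, value), kept sorted
        self.committed_ts = 0
        self.pending = False

    def write(self, ts, value):
        self.versions.append((ts, value))
        self.versions.sort()

    def _last_at(self, bound):
        vals = [v for t, v in self.versions if t <= bound]
        return vals[-1] if vals else None

    def read_current(self, wait=lambda: None):
        while self.pending:      # wait for uncommitted writes to settle
            wait()
        return self._last_at(self.committed_ts)

    def read_snapshot(self):     # last committed value, no waiting
        return self._last_at(self.committed_ts)

    def read_inconsistent(self): # newest value, committed or not
        return self.versions[-1][1] if self.versions else None

c = Cell()
c.write(1, "a"); c.committed_ts = 1
c.pending = True
c.write(2, "b")                  # in-flight, not yet committed
print(c.read_snapshot())         # -> a (ignores the in-flight write)
print(c.read_inconsistent())     # -> b (may expose uncommitted data)
```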
Entity Group Transaction (3/3)
Write consistency:
• Determine the next available log position.
• Assign the WAL mutations a timestamp higher than any previous one.
• Employ Paxos to settle contention for the log position.
• Based on optimistic concurrency: when multiple writers race for the same log position, only one wins; the rest notice the victorious write, abort, and retry their operations.
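The optimistic write path can be sketched as writers racing for the next free log position. A plain dict stands in for Paxos here, and all names are illustrative; the point is the win-or-retry structure.

```python
# Optimistic-concurrency sketch of the Megastore write path: each
# writer proposes its mutation for the next free log position; if
# another writer claimed it first, the loser retries at the new tail.

log = {}                          # position -> mutation

def next_position():
    return max(log, default=-1) + 1

def try_commit(mutation, attempts=10):
    for _ in range(attempts):
        pos = next_position()
        # setdefault claims `pos` only if it is still free.
        if log.setdefault(pos, mutation) is mutation:
            return pos            # we won this log position
        # someone else won: notice the victorious write, retry
    raise RuntimeError("too much contention, giving up")

print(try_commit("tx-A"))   # -> 0
print(try_commit("tx-B"))   # -> 1
```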
Across Entity Group Transactions (1/3)
Across entity groups: only limited consistency guarantees.
Two methods:
• Asynchronous messaging (queues)
• Two-Phase Commit (2PC)
Across Entity Group Transactions (2/3)
Queues provide transactional messaging between EGs.
Each message is either:
• Synchronous: the sending and receiving entity group are the same.
• Asynchronous: the sending and receiving entity groups differ.
Useful for operations that affect many EGs.
Across Entity Group Transactions (3/3)
Two-Phase Commit:
• Atomicity is satisfied.
• High latency.
Spanner
Limitations of Existing Systems: BigTable
Strengths:
• Scalability
• High throughput
• High performance
Limitations:
• Transactional scope limited to a single row
• Only eventually-consistent replication across datacenters
Limitations of Existing Systems: Megastore
Strengths:
• Replicated ACID transactions
• Schematized semi-relational tables
• Synchronous replication across datacenters
Limitations:
• Poor write throughput
• Lack of a query language
Spanner
Bridges the gap between Megastore and BigTable: SQL transactions + high throughput.
Spanner
A global-scale database with strict transactional guarantees.
Global scale:
• Spans datacenters.
• Scales up to millions of nodes, hundreds of datacenters, and trillions of database rows.
Strict transactional guarantees:
• General transactions (even inter-row).
• Reliable even during wide-area natural disasters.
Spanner Implementation
Spanner Organization (1/2)
Universe: a Spanner deployment.
Zones: analogous to deployments of BigTable servers (the unit of physical isolation). Each zone has:
• One zonemaster: assigns data to spanservers.
• Location proxies: used by clients to locate the spanservers assigned to serve their data.
• Thousands of spanservers: serve data to clients.
Spanner Organization (2/2)
The universe master: a console that displays status information about all the zones.
The placement driver: handles automated movement of data across zones.
Spanserver Software Stack (1/4)
Each spanserver is responsible for 100 to 1000 instances of a data structure called a tablet (similar to a BigTable tablet).
A tablet implements the mapping: (key: string, timestamp: int64) → string
Data and logs are stored on Colossus (the successor of GFS).
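The tablet mapping above can be sketched as a multi-version key-value map. A plain dict stands in for the on-disk format on Colossus; this is illustrative, not Spanner's actual storage layout.

```python
# Sketch of the spanserver tablet mapping
#   (key: string, timestamp: int64) -> string
# as an in-memory multi-version map.

class Tablet:
    def __init__(self):
        self.data = {}            # key -> list of (timestamp, value)

    def put(self, key, ts, value):
        self.data.setdefault(key, []).append((ts, value))
        self.data[key].sort()     # keep versions ordered by timestamp

    def get(self, key, ts):
        """Newest value for `key` written at or before `ts`."""
        candidates = [v for t, v in self.data.get(key, []) if t <= ts]
        return candidates[-1] if candidates else None

t = Tablet()
t.put("user/42", 10, "v1")
t.put("user/42", 20, "v2")
print(t.get("user/42", 15))   # -> v1
print(t.get("user/42", 25))   # -> v2
```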
Spanserver Software Stack (2/4)
A single Paxos state machine on top of each tablet provides consistent replication.
Paxos group: all the machines involved in an instance of Paxos.
The Paxos implementation supports long-lived leaders with time-based leader leases.
Spanserver Software Stack (3/4)
Writes must initiate the Paxos protocol at the leader.
Reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.
Spanserver Software Stack (4/4)
Transaction manager: supports distributed transactions.
• Runs at every replica that is a leader.
Transactions Involving Only One Paxos Group
This is the case for most transactions.
A long-lived Paxos leader:
• The transaction manager at that replica: the participant leader.
• The other replicas in the group: participant slaves.
A lock table for concurrency control:
• Allows multiple concurrent transactions.
• Maintained by the Paxos leader.
• Maps ranges of keys to lock states.
• Uses two-phase locking.
• Uses wound-wait for deadlock avoidance: a young transaction dies if an older transaction needs a resource held by the young transaction.
Transactions involving only a single Paxos group can bypass the transaction manager.
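The wound-wait rule can be sketched as a single decision function over transaction start timestamps (lower timestamp = older). This is illustrative only; Spanner's lock table maps key ranges to lock states, and an aborted transaction retries with its original timestamp so it eventually becomes the oldest.

```python
# Wound-wait sketch: when a requester wants a lock held by another
# transaction, compare their start timestamps (lower = older).
#  - older requester "wounds" (aborts) the younger holder;
#  - younger requester simply waits for the older holder.
# Either way, a cycle of waiters can never form, so no deadlock.

def resolve(requester_ts, holder_ts):
    """Return the action the lock requester takes under wound-wait."""
    if requester_ts < holder_ts:   # requester is older
        return "wound holder"      # younger holder aborts and retries
    return "wait"                  # younger requester waits its turn

print(resolve(requester_ts=5, holder_ts=9))   # -> wound holder
print(resolve(requester_ts=9, holder_ts=5))   # -> wait
```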
Transactions Involving Multiple Paxos Groups
One of the participant groups is chosen as the coordinator:
• The participant leader of that group is referred to as the coordinator leader.
• The slaves of that group: coordinator slaves.
The groups' leaders coordinate to perform two-phase commit.
The state of each transaction manager is stored in the underlying Paxos group (and is therefore replicated).
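The coordination above can be sketched as plain two-phase commit over group objects. Logging to a list stands in for the Paxos write that replicates the transaction manager's state; timestamps, leases, and failure handling are omitted, and all names are illustrative.

```python
# Two-phase commit across Paxos groups, sketched with plain objects.
# Each group durably logs its decision (stand-in for a Paxos write);
# the coordinator commits only if every participant prepared.

class Group:
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.log = name, healthy, []

    def prepare(self, txid):
        if self.healthy:
            self.log.append(("prepare", txid))   # replicated via Paxos
            return True
        return False

    def finish(self, txid, decision):
        self.log.append((decision, txid))        # replicated via Paxos

def two_phase_commit(txid, coordinator, participants):
    everyone = [coordinator] + participants
    # Phase 1: every participant leader logs a prepare record.
    decision = "commit" if all(g.prepare(txid) for g in everyone) else "abort"
    # Phase 2: the coordinator drives the outcome to every group.
    for g in everyone:
        g.finish(txid, decision)
    return decision

groups = [Group("g1"), Group("g2")]
print(two_phase_commit("tx1", Group("coord"), groups))   # -> commit
```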
Data Model and Directories
Data Model
An application creates one or more databases in a universe.
Each database can contain an unlimited number of schematized tables.
Tables:
• Rows and columns.
• Must have an ordered set of one or more primary-key columns.
• The primary key uniquely identifies each row.
Hierarchies of tables:
• Clients must partition tables into one or more hierarchies of tables.
• The table at the top of a hierarchy: the directory table.
Directory (1/2)
A set of contiguous keys that share a common prefix.
All data in a directory has the same replication configuration.
The smallest unit whose geographic replication properties can be specified by an application.
A Paxos group may contain multiple directories.
Directory (2/2)
Spanner might move a directory:
• To shed load from a Paxos group.
• To put directories that are frequently accessed together into the same group.
• To move a directory into a group that is closer to its accessors.
Example
True Time and Consistency
Key Innovation
Spanner knows what time it is.
Time Synchronization (1/2)
Is synchronizing time at a global scale possible?
Synchronizing time within and between datacenters is extremely hard and uncertain.
Serializing requests is impossible at a global scale.
Time Synchronization (2/2)
Idea: accept uncertainty, but keep it small and quantify it (using GPS and atomic clocks).
True Time API
TT.now() returns a TTinterval that is guaranteed to contain the absolute time at which TT.now() was invoked.
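The API can be sketched as follows. The fixed `epsilon` is an assumed stand-in for the real, dynamically tracked clock uncertainty; in Spanner the bound typically stays in the low milliseconds.

```python
# Sketch of the TrueTime API: TT.now() returns an interval
# [earliest, latest] guaranteed to contain the absolute time of the
# call; TT.after(t) and TT.before(t) are derived from that interval.

import time
from collections import namedtuple

TTinterval = namedtuple("TTinterval", ["earliest", "latest"])

class TT:
    epsilon = 0.004                     # assumed uncertainty bound (4 ms)

    @staticmethod
    def now():
        t = time.time()                 # local clock reading
        return TTinterval(t - TT.epsilon, t + TT.epsilon)

    @staticmethod
    def after(t):
        """True iff absolute time has definitely passed t."""
        return TT.now().earliest > t

    @staticmethod
    def before(t):
        """True iff absolute time has definitely not reached t."""
        return TT.now().latest < t
```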
How TrueTime Is Implemented (1/2)
How TrueTime Is Implemented (2/2)
A time daemon polls a variety of time masters:
• Masters chosen from nearby datacenters.
• Masters from farther datacenters.
• Armageddon masters (equipped with atomic clocks).
The daemon reaches a consensus among the polled masters about the correct timestamp.
External Consistency (1/2)
Jerry unfriends Tom before writing a controversial comment.
If the comment is serialized before the unfriend operation, Jerry will be in trouble!
External Consistency (2/2)
External consistency, formally: if the commit of transaction T1 precedes the initiation of a new transaction T2 in wall-clock (physical) time, then the commit of T1 must also precede the commit of T2 in the serial ordering.
Snapshot Reads
Read in the past without locking.
A client can specify a timestamp for the read, or an upper bound on the timestamp.
Each replica tracks a value called safe time, tsafe: the maximum timestamp at which the replica is up-to-date.
A replica can satisfy a read at any t ≤ tsafe.
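The safe-time check can be sketched as follows. Deriving `t_safe` directly from applied writes is an illustrative simplification; in Spanner it is computed from Paxos and transaction-manager state.

```python
# Snapshot-read sketch: a replica answers a read at timestamp t only
# if t <= t_safe, the maximum timestamp at which it is up-to-date.

class Replica:
    def __init__(self):
        self.t_safe = 0
        self.store = {}                 # key -> [(ts, value), ...]

    def apply(self, ts, key, value):
        self.store.setdefault(key, []).append((ts, value))
        self.t_safe = max(self.t_safe, ts)

    def snapshot_read(self, key, ts):
        if ts > self.t_safe:
            raise RuntimeError("replica not up-to-date at this timestamp")
        versions = [v for t, v in self.store.get(key, []) if t <= ts]
        return versions[-1] if versions else None

r = Replica()
r.apply(10, "x", "a")
r.apply(20, "x", "b")
print(r.snapshot_read("x", 15))   # -> a
```

A read at t > tsafe is not wrong, just premature: the client can retry at another replica that is further along, or wait for this one to catch up.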
Read-Only Transactions
Assign a timestamp sread and do a snapshot read at sread:
sread = TT.now().latest
This guarantees external consistency.
Read-Write Transactions (1/3)
The leader must only assign timestamps within the interval of its leader lease.
Timestamps must be assigned in monotonically increasing order.
If transaction T1 commits before T2 starts, T2's commit timestamp must be greater than T1's commit timestamp.
Read-Write Transactions (2/3)
Clients buffer writes.
The client chooses a coordinator group that initiates two-phase commit.
Each non-coordinator participant leader chooses a prepare timestamp, logs a prepare record through Paxos, and notifies the coordinator.
Read-Write Transactions (3/3)
The coordinator assigns a commit timestamp si no less than all prepare timestamps and no less than TT.now().latest.
The coordinator ensures that clients cannot see any data committed by Ti until TT.after(si) is true. This is done by commit wait: waiting until absolute time has passed si before committing.
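Commit wait can be sketched as follows, again with a fixed uncertainty `EPSILON` standing in for the real TrueTime bound; the function names are illustrative.

```python
# Commit-wait sketch: after choosing commit timestamp s, the
# coordinator blocks until TT.after(s) holds, so no client can observe
# the transaction's data before s has definitely passed in absolute
# time.

import time

EPSILON = 0.004                         # assumed clock uncertainty (4 ms)

def tt_after(s):
    """True iff absolute time has definitely passed s."""
    return time.time() - EPSILON > s

def commit_wait(prepare_timestamps):
    # s must be >= every prepare timestamp and >= TT.now().latest.
    s = max(prepare_timestamps + [time.time() + EPSILON])
    while not tt_after(s):              # the commit wait itself
        time.sleep(EPSILON / 4)
    return s                            # now safe to ack and release locks

s = commit_wait([time.time()])          # waits roughly 2 * EPSILON
```

The wait is short (on the order of twice the uncertainty bound), which is why keeping the TrueTime uncertainty small matters so much for write latency.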
Summary
Summary: Megastore
• Entity Groups (EGs)
• Within an EG: Paxos, with ACID semantics
• Across EGs: queues and two-phase commit
Summary: Spanner
• Replica consistency: the Paxos protocol
• Concurrency control: two-phase locking
• Transaction coordination: two-phase commit
• Timestamps for transactions and data items
Questions?