Storage for Google’s ad data: F1, a rewrite of Google’s advertising backend, uses Spanner in place of a sharded MySQL database
Designed to scale up to millions of machines across hundreds of data centers
Write transactions use strict 2PL; each transaction is assigned a timestamp, and each version is automatically timestamped with its commit time – timestamps reflect serialisation order
Data replicated across continents
Automatically reshards and migrates data across machines to balance load and respond to failures
As a simple database
Bucketing structures called directories – the unit of data placement, and the unit in which data moves between Paxos groups
NoSQL is out, NewSQL is in: the SQL-like query language was motivated by the popularity of Dremel, an interactive analysis tool
SQL with extensions to support protocol-buffer-valued fields
Associations expressed through table hierarchies – Directory Table (Users), Interleaved Table (Albums)
Not purely relational – each table must have a primary key, because tables are implemented on top of a key-value store; clients partition the database into one or more hierarchies of tables
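The table hierarchy above can be pictured as a sorted key space in which each Albums row sits directly after the Users row it belongs to. The following is a minimal sketch under stated assumptions: the row contents, the `interleave` and `directory` helpers, and the use of Python tuples as primary keys are all hypothetical and only illustrate the layout, not Spanner's actual storage format.

```python
# Hypothetical rows: (table name, primary key, value).
# An Albums key starts with the key of its parent Users row.
rows = [
    ("Albums", (2, 1), {"name": "Work"}),
    ("Users",  (1,),   {"email": "a@example.com"}),
    ("Albums", (1, 2), {"name": "Pets"}),
    ("Users",  (2,),   {"email": "b@example.com"}),
    ("Albums", (1, 1), {"name": "Holiday"}),
]

def interleave(rows):
    """Order rows by primary key so child rows follow their parent row."""
    return sorted(rows, key=lambda row: row[1])

def directory(rows, user_id):
    """Return the rows sharing one directory-table key prefix.

    In this sketch a directory is simply the Users row plus all
    Albums rows whose key begins with the same user id.
    """
    return [row for row in interleave(rows) if row[1][0] == user_id]

for table, key, value in interleave(rows):
    print(table, key, value)
```

Because a directory is a contiguous key range in this layout, moving it to another group amounts to moving one slice of the sorted rows, which is the intuition behind treating the directory as the unit of data placement.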
Lock-free distributed read transactions
Externally consistent reads and writes, and globally consistent reads across the database at a timestamp, made possible by assigning globally meaningful commit timestamps to transactions
Correctness and performance
Enabling technology
Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance). Data can also be dynamically and transparently moved between datacenters by the system to balance resource usage across datacenters.
Universe – a Spanner deployment (Test/Playground, Development/Production, Production-only)
Zone – rough analog of a deployment of Bigtable servers – the unit of physical isolation
A zone has one zonemaster and between one hundred and several thousand spanservers – the zonemaster assigns data to spanservers, which in turn serve data to clients
Location proxy – used by clients to locate the spanservers assigned to serve their data
Universe master – console for interactive debugging; placement driver – automated movement of data across zones (both are singletons)
Software stack:
Each spanserver is responsible for 100 to 1000 instances of a data structure called a tablet
Tablet state is stored in B-tree-like files and a write-ahead log, on a distributed filesystem called Colossus
Google’s cluster-management software provides an implementation of the TrueTime API – time is represented as an interval with bounded uncertainty
TrueTime uses two forms of time reference because they have different failure modes. GPS reference-source vulnerabilities include antenna and receiver failures, local radio interference, correlated failures (e.g., design faults such as incorrect leap-second handling and spoofing), and GPS system outages. Atomic clocks can fail in ways uncorrelated to GPS and each other, and over long periods of time can drift significantly due to frequency error.
All masters’ time references are regularly compared against each other. Each master also cross-checks the rate at which its reference advances time against its own local clock, and evicts itself if there is substantial divergence.
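The rate cross-check a master performs can be sketched as a comparison of elapsed time on the two clocks. This is only an illustration: the function name `should_evict` and the 200 microseconds-per-second threshold are assumptions, since the notes do not say what counts as "substantial divergence".

```python
def should_evict(reference_elapsed_s, local_elapsed_s, max_divergence_us_per_s=200.0):
    """Return True if the reference clock and the local clock disagree too much.

    reference_elapsed_s: time advanced by the master's GPS or atomic reference.
    local_elapsed_s: time advanced by the master's own local clock over the
        same period.
    max_divergence_us_per_s: hypothetical threshold, in microseconds of
        disagreement per second of local time.
    """
    if local_elapsed_s <= 0:
        raise ValueError("local_elapsed_s must be positive")
    divergence_us_per_s = abs(reference_elapsed_s - local_elapsed_s) / local_elapsed_s * 1e6
    return divergence_us_per_s > max_divergence_us_per_s

# A reference that gained 100 ms over 30 s of local time is far outside the bound.
print(should_evict(30.1, 30.0))
```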
TrueTime’s guarantees are used to implement features such as externally consistent transactions, lock-free read-only transactions, and non-blocking reads in the past.
These features enable, for example, the guarantee that a whole-database audit read at a timestamp t will see exactly the effects of every transaction that has committed as of t.
The Spanner implementation supports read-write transactions, read-only transactions (predeclared snapshot-isolation transactions), and snapshot reads.
A snapshot read is a read in the past that executes without locking.
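A snapshot read can be pictured as a lookup in a multi-version store: every write keeps its commit timestamp, and a read at timestamp t returns the newest version committed at or before t, without taking any lock. The sketch below is a single-process toy under that assumption; the class `MultiVersionStore` and its method names are hypothetical and not Spanner's API.

```python
import bisect

class MultiVersionStore:
    """Toy multi-version key-value store keyed by commit timestamp."""

    def __init__(self):
        # key -> (sorted list of commit timestamps, values in the same order)
        self._versions = {}

    def write(self, key, commit_ts, value):
        """Record a new version of key, stamped with its commit timestamp."""
        timestamps, values = self._versions.setdefault(key, ([], []))
        i = bisect.bisect_right(timestamps, commit_ts)
        timestamps.insert(i, commit_ts)
        values.insert(i, value)

    def snapshot_read(self, key, ts):
        """Return the newest value committed at or before ts, or None."""
        timestamps, values = self._versions.get(key, ([], []))
        i = bisect.bisect_right(timestamps, ts)
        return values[i - 1] if i > 0 else None

store = MultiVersionStore()
store.write("balance", 10, 100)
store.write("balance", 20, 250)
print(store.snapshot_read("balance", 15))
```

Reading every key at the same timestamp t is what gives the whole-database audit read described above: later writes create new versions and never disturb the versions visible at t.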
Strict two-phase locking for write transactions
Assign timestamp while locks are held
Spanner also enforces the following external consistency invariant: if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1.
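One way to see how a timestamp chosen under lock can respect the invariant above is the commit-wait rule from the Spanner paper: pick a timestamp no smaller than the latest possible current time, then hold the locks until that timestamp is guaranteed to be in the past. The sketch below is a single-process toy. The `TrueTime` class here is a stand-in built on the local monotonic clock with a fixed, assumed uncertainty `epsilon`, and `Coordinator` is a hypothetical name; neither is Google's implementation.

```python
import threading
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    """Toy TrueTime: local monotonic clock plus a fixed uncertainty (assumption)."""

    def __init__(self, epsilon=0.005):
        self.epsilon = epsilon

    def now(self):
        t = time.monotonic()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        """True once t has definitely passed."""
        return self.now().earliest > t

class Coordinator:
    def __init__(self, tt):
        self.tt = tt
        self.lock = threading.Lock()  # stands in for the locks held under strict 2PL
        self.log = []                 # committed (timestamp, writes) pairs

    def commit(self, writes):
        with self.lock:
            s = self.tt.now().latest           # timestamp assigned while locks are held
            while not self.tt.after(s):        # commit wait: until s is surely in the past
                time.sleep(self.tt.epsilon / 10)
            self.log.append((s, writes))       # writes become visible at timestamp s
        return s

coordinator = Coordinator(TrueTime(epsilon=0.002))
s1 = coordinator.commit("T1 writes")
s2 = coordinator.commit("T2 writes")  # starts after T1 has committed
print(s1 < s2)
```

The cost of the rule is visible in the loop: each commit waits roughly twice the uncertainty, which is why keeping the uncertainty small matters for write latency.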
“Global wall-clock time” with bounded uncertainty
Between synchronizations, a daemon advertises a slowly increasing time uncertainty ε.
ε is derived from conservatively applied worst-case local clock drift; it also depends on time-master uncertainty and communication delay to the time masters.
S is the time of invocation of the event.
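The slowly increasing uncertainty can be sketched as a sawtooth: a base term for time-master uncertainty and communication delay, plus an assumed drift rate multiplied by the time since the last synchronization. The default figures below (about 1 ms base, 200 microseconds per second of applied drift, a 30-second poll interval) are the ones reported in the Spanner paper, not values stated in these notes; the function name `epsilon_ms` is hypothetical.

```python
def epsilon_ms(seconds_since_sync, base_ms=1.0, drift_us_per_s=200.0):
    """Advertised uncertainty in milliseconds, a given time after the last sync.

    base_ms: uncertainty right after synchronizing (master uncertainty plus
        communication delay); assumed value.
    drift_us_per_s: conservatively applied worst-case local clock drift;
        assumed value.
    """
    if seconds_since_sync < 0:
        raise ValueError("seconds_since_sync must be non-negative")
    return base_ms + drift_us_per_s * seconds_since_sync / 1000.0

# Uncertainty grows linearly between polls and resets at each synchronization.
poll_interval_s = 30
for t in (0, 10, 20, 30):
    print(t, epsilon_ms(t % poll_interval_s if t < poll_interval_s else poll_interval_s))
```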
The Paxos family of protocols includes a spectrum of trade-offs between the number of processors, the number of message delays before learning the agreed value, the activity level of individual participants, the number of messages sent, and the types of failures tolerated. Although no deterministic fault-tolerant consensus protocol can guarantee progress in an asynchronous network (a result proved in a paper by Fischer, Lynch and Paterson), Paxos guarantees safety (freedom from inconsistency), and the conditions that could prevent it from making progress are difficult to provoke.
Paxos is normally used in situations requiring durability (for example, to replicate a file or a database), in which the amount of durable state could be large. The protocol attempts to make progress even during periods when some bounded number of replicas are unresponsive. A reconfiguration mechanism is also available, and can be used to drop a permanently failed replica or to add new replicas to the group.
Client: The Client issues a request to the distributed system and waits for a response, for instance a write request on a file in a distributed file server.
Acceptor (Voter): The Acceptors act as the fault-tolerant "memory" of the protocol. Acceptors are collected into groups called Quorums. Any message sent to an Acceptor must be sent to a Quorum of Acceptors. Any message received from an Acceptor is ignored unless a copy is received from each Acceptor in a Quorum.
Proposer: A Proposer advocates a client request, attempting to convince the Acceptors to agree on it, and acting as a coordinator to move the protocol forward when conflicts occur.
Learner: Learners act as the replication factor for the protocol. Once a Client request has been agreed on by the Acceptors, the Learner may take action (i.e., execute the request and send a response to the client). To improve availability of processing, additional Learners can be added.
Leader: Paxos requires a distinguished Proposer (called the leader) to make progress. Many processes may believe they are leaders, but the protocol only guarantees progress if one of them is eventually chosen. If two processes believe they are leaders, they may stall the protocol by continuously proposing conflicting updates. However, the safety properties are still preserved in that case.
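The roles above can be made concrete with a minimal single-decree Paxos round. This is a sketch under simplifying assumptions: everything runs in one process with direct method calls instead of messages, no failures are simulated, learners are omitted, and the names `Acceptor` and `propose` are illustrative. It is basic Paxos, not the long-lived-leader variant Spanner runs.

```python
class Acceptor:
    """Remembers the highest promise made and the last proposal accepted."""

    def __init__(self):
        self.promised = -1     # highest proposal number promised so far
        self.accepted = None   # (number, value) of the last accepted proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    """Run one proposer round; return the chosen value, or None without a quorum."""
    quorum = len(acceptors) // 2 + 1
    replies = [acceptor.prepare(n) for acceptor in acceptors]
    granted = [previous for ok, previous in replies if ok]
    if len(granted) < quorum:
        return None
    # Safety rule: adopt the value of the highest-numbered proposal already accepted.
    prior = [p for p in granted if p is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(acceptor.accept(n, value) for acceptor in acceptors)
    return value if acks >= quorum else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "a"))  # first proposer gets its value chosen
print(propose(acceptors, 2, "b"))  # a later proposer must adopt the chosen value
```

The second call illustrates the safety property described above: a competing proposer can slow things down, but it cannot replace a value that a quorum has already accepted.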
Spanner is a creation so large, some have trouble wrapping their heads around it. But the end result is easily explained: With Spanner, Google can offer a web service to a worldwide audience, but still ensure that something happening on the service in one part of the world doesn’t contradict what’s happening in another.
INTRODUCTION
• Built and deployed at Google
• Scalable
• Multi-version
• Globally distributed
• Synchronously replicated
OVERVIEW
• General-purpose transactions (ACID)
• Directory placement
• SQL query language
• Schematized tables, semi-relational data model
SPECIAL FEATURES
• Lock-free distributed read transactions
• External consistency of distributed transactions
• Integration of concurrency control, replication, and 2PC
• Interval-based global time – TrueTime – GPS and atomic clock powered
• More control to applications
GLOBAL CONSISTENCY
‘As a distributed-systems developer, you’re taught from, I want to say, childhood not to trust time. What we did is find a way that we could trust time, and understand what it meant to trust time.’
‘We wanted something that we were confident in. It’s a time reference that’s owned by Google.’
– Andrew Fikes
IMPLEMENTATION
• Set of time master machines per data center
• A time slave daemon per machine
• Most masters have GPS; Armageddon masters have atomic clocks
GLOBAL CONSISTENCY
• Global wall-clock time == external consistency
• Commit order respects global wall-time order
• Timestamp order respects global wall-time order
• Given that timestamp order == commit order
PAXOS PROTOCOL
• Used in situations requiring durability (replicating a file or database)
• Makes progress even during periods of partial unresponsiveness
• Roles: Client, Acceptor (Voter), Proposer, Learner, Leader