“Google is living a few years in the future and sending us messages.”
Terms
ACID: Atomicity, Consistency, Isolation, Durability
Time synchronization
Global consistency
Paxos
Example: Social Network
[Figure: user posts and friend lists for a social network, sharded and replicated x1000 across datacenters in the US, Brazil, Russia, and Spain (San Francisco, Seattle, Arizona, Sao Paulo, Santiago, Buenos Aires, Moscow, Berlin, Krakow, London, Paris, Madrid, Lisbon).]
Why Consistency Matters
Generate a page of friends’ recent posts
– Requires a consistent view of the friend list and those friends’ posts
Single Machine
[Figure: on a single machine, “Generate my page” reads the friend list and friends’ posts (Friend1 through Friend1000) while writes are blocked.]
Multiple Machines
[Figure: with user posts and friend lists spread across multiple machines, “Generate my page” must block writes on every machine involved.]
Multiple Datacenters
[Figure: with the same data sharded x1000 and replicated across datacenters in the US, Spain, Russia, and Brazil, “Generate my page” needs a consistent view across all of them.]
Google’s Invention
SQL vs NoSQL
Bigtable
Bigtable Data Model
Why not Bigtable?
Hard to use for complex, evolving schemas
No strong consistency across wide-area replicas
Megastore
Semi-relational data model
Synchronous replication
Why not Megastore?
Poor write throughput
Applications
Megastore: Gmail, App Market, Picasa, Google Finance
Bigtable: Google Earth, web index
Why Spanner
"Spanner is impressive work on one of the hardest distributed systems problems." — Andy Gross, Basho
Database tech that can span the planet
Why Spanner?
Globally distributed
Externally consistent
Multi-version database
Semi-relational data model
ACID
Scalable
Features
SQL-like query language
Non-blocking reads in the past
Atomic schema changes
Snapshot reads
Customizable replication configurations
Data Model

Logical data layout

Album
user_id | album_id | name
1       | 1        | Picnic
1       | 2        | Birthday
2       | 1        | Rag
3       | 1        | Eid

Photo
user_id | album_id | photo_id | name
1       | 1        | 1        | pic1
1       | 1        | 2        | pic2
1       | 1        | 3        | pic3
1       | 2        | 1        | pic1
Physical data layout (Photo rows interleaved under their parent Album row):

1 1     Picnic
1 1 1   pic1
1 1 2   pic2
1 1 3   pic3
1 2     Birthday
1 2 1   pic1
1 2 2   pic2
1 2 3   pic3
1 2 4   pic4
1 2 5   pic5
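The interleaving falls straight out of key ordering. A minimal Python sketch (illustrative, not Spanner's storage engine), assuming composite keys sort lexicographically:

```python
# Album keys are prefixes of their Photo keys, and a lexicographic sort
# places a prefix directly before its extensions, so each album row lands
# immediately above its own photos, as in the layout above.

rows = {
    (1, 1): ("Album", "Picnic"),
    (1, 2): ("Album", "Birthday"),
    (1, 1, 1): ("Photo", "pic1"),
    (1, 1, 2): ("Photo", "pic2"),
    (1, 1, 3): ("Photo", "pic3"),
    (1, 2, 1): ("Photo", "pic1"),
}

for key in sorted(rows):       # tuple comparison is lexicographic
    kind, name = rows[key]
    print(key, kind, name)
# Prints (1, 1) Album Picnic, then its photos, then (1, 2) Album Birthday, ...
```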
Spanner Server Organization

Tablet
[Figure: a spanserver manages multiple tablets (Tablet 1, Tablet 2, Tablet 3).]
Spanserver
[Figure: replicas a, b, and c of a tablet each run a Paxos state machine on top of the tablet and together form a Paxos group. The leader replica additionally holds a lock table and a transaction manager.]
Spanner Organization
Paxos Leader Lease
Default lease length: 10 seconds
A would-be leader sends requests for timed lease votes
A quorum of lease votes ensures leadership
The leader may request lease extensions
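A minimal sketch of the quorum logic in Python (the function and transport names are assumptions for illustration, not Spanner's code):

```python
import time

LEASE_SECONDS = 10  # default lease length from the slide

def request_leader_lease(replicas, send_lease_vote_request):
    """Ask each replica for a timed lease vote; a quorum of unexpired
    votes makes the requester the leader until the lease expires."""
    granted_at = time.monotonic()
    votes = sum(1 for r in replicas
                if send_lease_vote_request(r, LEASE_SECONDS))
    if votes > len(replicas) // 2:          # quorum of lease votes
        return granted_at + LEASE_SECONDS   # expiry; extendable by re-requesting
    return None                             # no quorum: not the leader
```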
Directory
[Figure: a tablet containing directories Dir 1, Dir 2, and Dir 3. A directory is a set of contiguous keys sharing a common prefix, and the unit of data movement between Paxos groups.]
Concurrency Control
Version Management
Transactions that write use strict 2PL
– Each transaction T is assigned a timestamp s
– Data written by T is timestamped with s
[Figure: example versions written at different timestamps:]

Time          <8     8    15
My friends    [X]    []
My posts                  [P]
X’s friends   [me]   []
TrueTime
“Global wall-clock time” with bounded uncertainty
[Figure: TT.now() returns an interval [earliest, latest] of width 2ε that is guaranteed to contain the true current time.]
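A minimal Python sketch of the interface (TT.now/after/before match the published API; the fixed ε and use of the system clock are illustrative stand-ins for the real uncertainty tracking):

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # true time is guaranteed to be >= earliest
    latest: float    # true time is guaranteed to be <= latest

EPSILON = 0.004  # assumed instantaneous uncertainty bound, in seconds

def tt_now() -> TTInterval:
    """TT.now(): an interval of width 2*epsilon containing true time."""
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def tt_after(t: float) -> bool:
    """TT.after(t): true if t has definitely passed."""
    return tt_now().earliest > t

def tt_before(t: float) -> bool:
    """TT.before(t): true if t has definitely not arrived."""
    return tt_now().latest < t
```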
TrueTime Architecture
[Figure: Datacenter 1 through Datacenter n each run several GPS timemasters plus an atomic-clock timemaster; a client polls multiple timemasters and computes a reference interval [earliest, latest] = now ± ε.]
TrueTime Implementation
now = reference now + local-clock offset
ε = reference ε + worst-case local-clock drift
[Figure: ε sawtooths over time (0, 30, 60, 90 sec): starting from the reference uncertainty after each synchronization, it grows at 200 μs/sec, adding up to +6 ms over the 30-second polling interval.]
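In Python, the growth of ε between synchronizations (a minimal sketch; the function name is illustrative, the constants come from the figure):

```python
DRIFT_RATE = 200e-6  # worst-case local clock drift: 200 us per second

def epsilon(reference_epsilon: float, seconds_since_sync: float) -> float:
    """epsilon = reference epsilon + worst-case accumulated local drift."""
    return reference_epsilon + DRIFT_RATE * seconds_since_sync

# Example: with a 1 ms reference uncertainty, 30 s after the last sync
print(epsilon(0.001, 30.0))  # 0.007 s: drift alone adds the +6 ms sawtooth
```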
What If a Clock Goes Rogue?
A clock outside its TrueTime bounds could lead to timestamp assignments that violate external consistency
Empirically unlikely, based on one year of data
– Bad CPUs are 6 times more likely than bad clocks
Transaction Details

Snapshot Read
A read in the past that executes without locking
Served by any sufficiently up-to-date replica
Once a timestamp is chosen, commit is inevitable, so clients can avoid buffering results
Read-Only Transaction
Executes as a snapshot read at a system-chosen timestamp
Must contain only reads
Read-Write Transaction

Timestamps with a global clock:
Strict two-phase locking for write transactions
The timestamp is assigned while locks are held
[Figure: transaction T acquires locks, picks s = now(), releases locks.]
Timestamps and TrueTime
[Figure: T acquires locks, picks s = TT.now().latest, then commit-waits until TT.now().earliest > s before releasing locks; the commit wait averages about 2ε.]
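As a minimal Python sketch, reusing tt_now/tt_after from the TrueTime sketch above (apply_writes is an illustrative placeholder, not Spanner's code):

```python
import time

def commit(apply_writes):
    s = tt_now().latest           # pick s while holding locks
    apply_writes(s)               # data written is timestamped with s
    while not tt_after(s):        # commit wait: until s has definitely passed
        time.sleep(0.0001)        # lasts about 2 * epsilon on average
    return s                      # now safe to release locks and acknowledge
```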
Commit Wait and Two-Phase Commit
[Figure: coordinator TC and participants TP1 and TP2 each acquire locks. Each participant computes a prepare timestamp s, starts logging, and reports “prepared” when logging is done. TC computes the overall s, commit-waits, notifies the participants of s, commits, and all release locks.]
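A minimal sketch of the coordinator side (participant objects and method names are illustrative assumptions; tt_now/tt_after come from the TrueTime sketch above):

```python
def coordinate(participants):
    # Phase 1: each participant logs a prepare record and returns its
    # prepare timestamp.
    prepare_timestamps = [p.prepare() for p in participants]
    # The overall s must be no less than any prepare timestamp and no less
    # than the coordinator's TT.now().latest.
    s = max(prepare_timestamps + [tt_now().latest])
    while not tt_after(s):        # commit wait before exposing the commit
        pass
    for p in participants:
        p.commit(s)               # notify participants of s
    return s
```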
Example
[Figure: coordinator TC removes X from my friend list and participant TP removes me from X’s friend list; with sC = 6 and sP = 8, the overall commit timestamp is s = 8. A later transaction T2 writes a risky post P at s = 15. Resulting versions:]

Time          <8     8    15
My friends    [X]    []
My posts                  [P]
X’s friends   [me]   []
Assign Timestamp (R/O)
Simple: s_read = TT.now().latest
Can read at a replica if s_read <= t_safe, where
t_safe = min(t_safe^Paxos, t_safe^TM)
t_safe^TM = min_i(s_i^{prepare,g}) − 1, over transactions prepared (but not yet committed) at group g
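A minimal Python sketch of this safe-time check (names are illustrative):

```python
def t_safe(t_safe_paxos, prepared_timestamps):
    """t_safe = min(t_safe_Paxos, t_safe_TM). With no prepared-but-
    uncommitted transactions, t_safe_TM places no bound."""
    t_safe_tm = min(prepared_timestamps, default=float("inf")) - 1
    return min(t_safe_paxos, t_safe_tm)

def can_serve_read(s_read, t_safe_paxos, prepared_timestamps):
    return s_read <= t_safe(t_safe_paxos, prepared_timestamps)
```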
Assign Timestamp (Paxos)
One Paxos group involved:
– use LastTS(), the timestamp of the last committed write at that group
More than one Paxos group involved:
– use TT.now().latest
• which may wait for the safe time to advance
Atomic Schema Change
Assign a timestamp t in the future
Do the schema change in the background (the prepare phase)
Requests with timestamps after t block until the change is applied
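A minimal sketch of the gating rule (illustrative names, not Spanner's code):

```python
def may_proceed(request_ts, schema_change_ts, change_applied):
    """Reads/writes synchronize with a registered schema-change timestamp t."""
    if request_ts < schema_change_ts:
        return True            # precedes t: proceed under the old schema
    return change_applied      # at/after t: block until the change lands
```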
Refinements
Reduce wait time for reads:
– Fine-grained mapping from key ranges to t_safe^TM
– Fine-grained mapping from key ranges to LastTS()
Future Work
Improving TrueTime
– Lower ε < 1 ms
Building out database features
– Finish implementing basic features
– Efficiently support rich query patterns
Q/A
The zone master seems to be a single point of failure.
What is the difference between a Bigtable tablet and a Spanner tablet?
Q/A
Why does the timeslave daemon poll both nearby and faraway GPS masters for time synchronization?
What happens if a server fails in the midst of processing a read-only request?
What’s in the Literature
External consistency/linearizability
Distributed databases
Concurrency control
Replication
Time (NTP, Marzullo)
Conclusions
Reify clock uncertainty in time APIs
– Known unknowns are better than unknown unknowns
– Rethink algorithms to make use of uncertainty
Stronger semantics are achievable
– Greater scale != weaker semantics
Thanks
To the Spanner developer team
To Sebastian Kanthak, Wilson Hsieh, and others
To you for listening!
Editor’s Notes
• #39 (What If a Clock Goes Rogue?): Bad hosts are evicted. Timemasters check themselves against other timemasters, and clients check themselves against timemasters.