Spanner: Google’s 
Globally-Distributed Database 
Wilson Hsieh 
representing a host of authors 
OSDI 2012
What is Spanner? 
• Distributed multiversion database 
• General-purpose transactions (ACID) 
• SQL query language 
• Schematized tables 
• Semi-relational data model 
• Running in production 
• Storage for Google’s ad data 
• Replaced a sharded MySQL database 
Example: Social Network 
[Figure: user posts and friend lists partitioned across four regions (US, Brazil, Russia, Spain), each marked x1000. Cities shown: San Francisco, Seattle, Arizona; Sao Paulo, Santiago, Buenos Aires; Moscow, Berlin, Krakow; London, Paris, Berlin, Madrid, Lisbon.]
Overview 
• Feature: Lock-free distributed read transactions 
• Property: External consistency of distributed transactions
– First system at global scale
• Implementation: Integration of concurrency control, replication, and 2PC
– Correctness and performance 
• Enabling technology: TrueTime 
– Interval-based global time 
Read Transactions 
• Generate a page of friends’ recent posts 
– Consistent view of friend list and their posts 
Why consistency matters 
1. Remove untrustworthy person X as friend 
2. Post P: “My government is repressive…” 
Single Machine 
[Figure: a single machine holds user posts and friend lists. "Generate my page" reads Friend1 post, Friend2 post, …, Friend999 post, Friend1000 post, with the label "Block writes".]
Multiple Machines 
[Figure: user posts and friend lists are split across multiple machines. "Generate my page" reads Friend1 post, Friend2 post, …, Friend999 post, Friend1000 post, with the label "Block writes".]
Multiple Datacenters 
[Figure: user posts and friend lists are spread across datacenters in the US, Spain, Brazil, and Russia, each marked x1000. "Generate my page" reads Friend1 post, Friend2 post, …, Friend999 post, Friend1000 post from those datacenters.]
Version Management 
• Transactions that write use strict 2PL 
– Each transaction T is assigned a timestamp s 
– Data written by T is timestamped with s 
Time          <8      8       15
My friends    [X]     []
My posts                      [P]
X’s friends   [me]    []
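The version table on this slide can be read as a multiversion store in which a read at timestamp t sees, for each key, the latest version written at or before t. The following is a minimal sketch under that reading, not Spanner's storage layer. The class `MultiVersionStore`, the key names, and the use of timestamp 0 to stand in for "before 8" are all assumptions made for illustration.

```python
import bisect


class MultiVersionStore:
    """Hypothetical in-memory store that keeps every timestamped version."""

    def __init__(self):
        # key -> (sorted list of timestamps, list of values in the same order)
        self._versions = {}

    def write(self, key, timestamp, value):
        timestamps, values = self._versions.setdefault(key, ([], []))
        i = bisect.bisect_right(timestamps, timestamp)
        timestamps.insert(i, timestamp)
        values.insert(i, value)

    def read_at(self, key, timestamp):
        """Return the latest value written at or before `timestamp`."""
        timestamps, values = self._versions.get(key, ([], []))
        i = bisect.bisect_right(timestamps, timestamp)
        return values[i - 1] if i > 0 else None


store = MultiVersionStore()
store.write("my_friends", 0, ["X"])    # 0 stands in for "before 8"
store.write("x_friends", 0, ["me"])
store.write("my_friends", 8, [])       # X removed from my friend list
store.write("x_friends", 8, [])        # I am removed from X's friend list
store.write("my_posts", 15, ["P"])     # the risky post


def snapshot(timestamp):
    """Read all three keys at one timestamp, without taking locks."""
    keys = ("my_friends", "my_posts", "x_friends")
    return {key: store.read_at(key, timestamp) for key in keys}
```

In this sketch, a snapshot at timestamp 15 sees post P together with the already-empty friend lists, so no snapshot can show X as a friend alongside P.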
Synchronizing Snapshots 
Global wall-clock time 
== 
External Consistency: 
Commit order respects global wall-time order 
== 
Timestamp order respects global wall-time order 
given 
timestamp order == commit order 
Timestamps, Global Clock 
• Strict two-phase locking for write transactions 
• Assign timestamp while locks are held 
[Timeline for T: acquired locks, pick s = now() while the locks are held, release locks.]
Timestamp Invariants 
• Timestamp order == commit order
[Diagram: timelines for T1 and T2]
• Timestamp order respects global wall-time order
[Diagram: timelines for T3 and T4]
TrueTime 
• “Global wall-clock time” with bounded uncertainty
[Diagram: on the time axis, TT.now() returns an interval from earliest to latest, of width 2*ε.]
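The interval idea on this slide can be sketched as a clock whose now() returns bounds rather than a single instant. This is a toy stand-in, not the TrueTime API itself. The class names `TTInterval` and `TrueTime`, the fixed `epsilon_s` value, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy model: a local clock reading plus or minus a fixed epsilon."""

    def __init__(self, epsilon_s=0.004, clock=time.time):
        self.epsilon_s = epsilon_s   # assumed constant here; the real bound varies
        self._clock = clock

    def now(self):
        t = self._clock()
        return TTInterval(earliest=t - self.epsilon_s, latest=t + self.epsilon_s)


# A fixed clock makes the example deterministic.
tt = TrueTime(epsilon_s=0.004, clock=lambda: 100.0)
interval = tt.now()
```

The returned interval has width 2*ε, matching the diagram, and callers reason about `earliest` and `latest` instead of trusting one timestamp.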
Timestamps and TrueTime 
[Timeline for T: acquired locks; pick s = TT.now().latest; commit wait until TT.now().earliest > s; release locks. Two spans on the timeline are each labeled “average ε”.]
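The rule on this slide (pick s = TT.now().latest, then wait until TT.now().earliest > s before releasing locks) can be sketched with a simulated clock. This is an illustration under simplifying assumptions, not Spanner's implementation: `FakeClock`, the integer-millisecond time unit, the constant ε, and the `commit` helper are all hypothetical.

```python
class FakeClock:
    """Simulated true time in integer milliseconds, so nothing really sleeps."""

    def __init__(self, start_ms):
        self.t = start_ms

    def advance(self, dt_ms):
        self.t += dt_ms


class TrueTime:
    """Toy TT: returns (earliest, latest) around the simulated clock."""

    def __init__(self, clock, epsilon_ms):
        self.clock = clock
        self.epsilon_ms = epsilon_ms   # assumed constant for this sketch

    def now(self):
        return (self.clock.t - self.epsilon_ms, self.clock.t + self.epsilon_ms)


def commit(tt, clock, tick_ms=1):
    """Pick s while locks are held, then wait out the uncertainty."""
    # Locks are assumed to be held at this point.
    _, latest = tt.now()
    s = latest
    waited_ms = 0
    while tt.now()[0] <= s:        # commit wait: until earliest > s
        clock.advance(tick_ms)
        waited_ms += tick_ms
    # Locks would be released here, after s is guaranteed to be in the past.
    return s, waited_ms


clock = FakeClock(start_ms=10_000)
tt = TrueTime(clock, epsilon_ms=4)
s, waited_ms = commit(tt, clock)
```

With ε fixed at 4 ms in this toy model, the wait comes out just over 2ε, which lines up with the two “average ε” spans on the slide.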
Commit Wait and Replication 
[Timeline for T. Lock events: acquired locks, pick s, commit wait done, release locks. Replication events: start consensus, achieve consensus, notify slaves.]
Commit Wait and 2-Phase Commit 
[Timelines for coordinator TC and participants TP1 and TP2; each acquires locks and later releases them.
Participants: compute s for each, start logging, done logging, prepared, send s.
Coordinator: compute overall s, commit wait done, committed, notify participants of s.]
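The "Compute overall s" step can be sketched as follows. The slide does not say how the overall timestamp is chosen, so taking the maximum of the individually computed timestamps is an assumption here. It is consistent with the deck's Example slide, where sC=6 and sP=8 yield s=8. The function name `choose_commit_timestamp` is hypothetical.

```python
def choose_commit_timestamp(coordinator_s, participant_s):
    """Pick an overall s that is no smaller than any individually computed s.

    Assumption: the maximum is used. Any value at least as large as every
    input would also keep the overall timestamp from going backwards.
    """
    return max([coordinator_s, *participant_s])


# Values taken from the deck's Example slide: sC=6, sP=8.
overall_s = choose_commit_timestamp(coordinator_s=6, participant_s=[8])
```

The coordinator would then perform commit wait on the overall s before notifying participants, as the timeline labels indicate.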
Example 
TC: Remove X from my friend list (sC=6, commits at s=8)
TP: Remove myself from X’s friend list (sP=8, commits at s=8)
T2: Risky post P (s=15)

Time          <8      8       15
My friends    [X]     []
My posts                      [P]
X’s friends   [me]    []
What Have We Covered? 
• Lock-free read transactions across datacenters 
• External consistency 
• Timestamp assignment 
• TrueTime 
– Uncertainty in time can be waited out 
What Haven’t We Covered? 
• How to read at the present time 
• Atomic schema changes 
– Mostly non-blocking 
– Commit in the future 
• Non-blocking reads in the past 
– At any sufficiently up-to-date replica 
TrueTime Architecture 
[Diagram: GPS timemasters and an atomic-clock timemaster spread across Datacenter 1, Datacenter 2, …, Datacenter n, plus a client.]
Compute reference [earliest, latest] = now ± ε
TrueTime implementation 
now = reference now + local-clock offset
ε = reference ε + worst-case local-clock drift
Worst-case local-clock drift: 200 μs/sec
[Plot: ε over time, with ticks at 0sec, 30sec, 60sec, and 90sec; ε rises by +6ms above the reference uncertainty.]
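The arithmetic behind the +6ms label can be checked directly: 200 μs/sec of worst-case drift accumulated over 30 sec is 6 ms. The sketch below evaluates the slide's formula for ε. The 30-second resynchronization interval is inferred from the plot's axis ticks, and the 1.0 ms reference ε is a placeholder value, so both are assumptions, as are the function names.

```python
DRIFT_RATE_US_PER_S = 200   # worst-case local-clock drift, from the slide


def epsilon_ms(reference_epsilon_ms, seconds_since_sync):
    """epsilon = reference epsilon + worst-case local-clock drift."""
    drift_ms = DRIFT_RATE_US_PER_S * seconds_since_sync / 1000.0
    return reference_epsilon_ms + drift_ms


def sawtooth_epsilon_ms(reference_epsilon_ms, sync_interval_s, t_s):
    """Epsilon at time t_s, assuming a resync every sync_interval_s seconds."""
    return epsilon_ms(reference_epsilon_ms, t_s % sync_interval_s)


# Placeholder reference epsilon of 1.0 ms and an assumed 30 s sync interval.
samples = [sawtooth_epsilon_ms(1.0, 30, t) for t in (0, 15, 30, 45, 60)]
```

Under these assumptions ε climbs linearly after each synchronization and drops back to the reference value at the next one, which gives the sawtooth shape suggested by the plot.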
What If a Clock Goes Rogue? 
• Timestamp assignment would violate external consistency
• Empirically unlikely based on 1 year of data 
– Bad CPUs 6 times more likely than bad clocks 
Network-Induced Uncertainty 
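[Plots: Epsilon (ms) at the 90th, 99th, and 99.9th percentiles. One panel covers Mar 29 to Apr 1 by date, with the y-axis running from 2 to 10 ms. The other covers April 13 from 6AM to 12PM, with the y-axis running from 1 to 6 ms.]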
What’s in the Literature 
• External consistency/linearizability 
• Distributed databases 
• Concurrency control 
• Replication 
• Time (NTP, Marzullo) 
Future Work 
• Improving TrueTime 
– Lower ε to < 1 ms
• Building out database features 
– Finish implementing basic features 
– Efficiently support rich query patterns 
Conclusions 
• Reify clock uncertainty in time APIs 
– Known unknowns are better than unknown unknowns
– Rethink algorithms to make use of uncertainty 
• Stronger semantics are achievable 
– Greater scale != weaker semantics 
Thanks 
• To the Spanner team and customers 
• To our shepherd and reviewers 
• To lots of Googlers for feedback 
• To you for listening! 
• Questions? 
