Spanner: Google’s Globally-Distributed Database
Wilson Hsieh 
representing a host of authors 
OSDI 2012
What is Spanner? 
• Distributed multiversion database 
• General-purpose transactions (ACID) 
• SQL query language 
• Schematized tables 
• Semi-relational data model 
• Running in production 
• Storage for Google’s ad data 
• Replaced a sharded MySQL database 
Example: Social Network
[Figure: user posts and friend lists spread across four regions (US, Brazil, Russia, Spain), each marked x1000. Locations shown: San Francisco, Seattle, Arizona, Sao Paulo, Santiago, Buenos Aires, Moscow, Berlin, Krakow, London, Paris, Berlin, Madrid, Lisbon.]
Overview 
• Feature: Lock-free distributed read transactions 
• Property: External consistency of distributed transactions
– First system at global scale
• Implementation: Integration of concurrency control, replication, and 2PC
– Correctness and performance
• Enabling technology: TrueTime
– Interval-based global time
Read Transactions 
• Generate a page of friends’ recent posts 
– Consistent view of friend list and their posts 
Why consistency matters 
1. Remove untrustworthy person X as friend 
2. Post P: “My government is repressive…” 
Single Machine
[Figure: a single machine stores user posts and friend lists. “Generate my page” reads Friend1 post, Friend2 post, …, Friend999 post, Friend1000 post, with a “Block writes” label.]
Multiple Machines
[Figure: two machines each store user posts and friend lists. “Generate my page” reads Friend1 post, Friend2 post, …, Friend999 post, Friend1000 post across them, with a “Block writes” label.]
Multiple Datacenters
[Figure: four datacenters (US, Spain, Brazil, Russia), each marked x1000, store user posts and friend lists. “Generate my page” reads Friend1 post, Friend2 post, …, Friend999 post, Friend1000 post across them.]
Version Management 
• Transactions that write use strict 2PL 
– Each transaction T is assigned a timestamp s 
– Data written by T is timestamped with s 
Time          <8     8     15
My friends    [X]    []
My posts                   [P]
X’s friends   [me]   []
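The versioning scheme above can be sketched as a small multiversion store. This is only an illustration, not Spanner's storage layer: the `VersionedCell` class, its method names, and the choice to model the slide's “<8” column as timestamp 0 are all assumptions. A read at timestamp t returns, for each item, the newest version written at or before t, so every read at a single timestamp sees a consistent snapshot without taking locks.

```python
from bisect import bisect_right


class VersionedCell:
    """Hypothetical multiversion cell: versions are kept sorted by timestamp."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def write(self, ts, value):
        # Simplifying assumption: writes arrive in increasing timestamp order.
        assert not self._timestamps or ts > self._timestamps[-1]
        self._timestamps.append(ts)
        self._values.append(value)

    def read_at(self, ts):
        # Newest version with timestamp <= ts, or None if nothing was written yet.
        i = bisect_right(self._timestamps, ts)
        return self._values[i - 1] if i else None


my_friends = VersionedCell()
x_friends = VersionedCell()
my_posts = VersionedCell()

# Values taken from the slide; "<8" is modeled as timestamp 0 (an assumption).
my_friends.write(0, ["X"])
x_friends.write(0, ["me"])
my_friends.write(8, [])
x_friends.write(8, [])
my_posts.write(15, ["P"])


def snapshot(ts):
    """Read all three items at the same timestamp."""
    return {
        "my_friends": my_friends.read_at(ts),
        "x_friends": x_friends.read_at(ts),
        "my_posts": my_posts.read_at(ts),
    }
```

With these values, no snapshot timestamp shows X as a friend together with post P: a read at 7 sees X but no post, and a read at 15 sees the post but an empty friend list.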
Synchronizing Snapshots 
Global wall-clock time 
== 
External Consistency: Commit order respects global wall-time order
==
Timestamp order respects global wall-time order,
given timestamp order == commit order
Timestamps, Global Clock 
• Strict two-phase locking for write transactions 
• Assign timestamp while locks are held 
[Figure: timeline of transaction T, from “Acquired locks” to “Release locks”; T picks s = now() while the locks are held.]
Timestamp Invariants 
• Timestamp order == commit order 
[Figure: timelines of T1 and T2]
• Timestamp order respects global wall-time order
[Figure: timelines of T3 and T4]
TrueTime 
• “Global wall-clock time” with bounded uncertainty
[Figure: time axis. TT.now() returns an interval from earliest to latest, of width 2*ε.]
Timestamps and TrueTime 
[Figure: timeline of transaction T, from “Acquired locks” to “Release locks”. T picks s = TT.now().latest, then performs commit wait: it waits until TT.now().earliest > s. The spans before and after s are each labeled “average ε”.]
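The rule on this slide (pick s = TT.now().latest, then wait until TT.now().earliest > s) can be sketched as follows. This is a minimal illustration under stated assumptions: `FakeTrueTime` is a hypothetical stand-in that reports the local monotonic clock plus or minus a fixed ε, and `TTInterval` and `pick_timestamp_and_wait` are names invented here. The real TrueTime API and its varying ε are not reproduced.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class FakeTrueTime:
    """Hypothetical stand-in for TrueTime: local clock +/- a fixed epsilon."""

    def __init__(self, epsilon_s, clock=time.monotonic):
        self.epsilon_s = epsilon_s
        self._clock = clock

    def now(self):
        t = self._clock()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)


def pick_timestamp_and_wait(tt, sleep=time.sleep):
    """Pick s and perform commit wait; the caller is assumed to hold the locks."""
    s = tt.now().latest
    # Commit wait: once earliest > s, s is guaranteed to be in the past,
    # so releasing the locks cannot expose s before it has really passed.
    while tt.now().earliest <= s:
        sleep(tt.epsilon_s / 4)
    return s


tt = FakeTrueTime(epsilon_s=0.005)
s = pick_timestamp_and_wait(tt)
```

With a fixed ε the wait lasts about 2ε, which matches the two “average ε” spans in the figure.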
Commit Wait and Replication 
[Figure: timeline of transaction T with the labels “Acquired locks”, “Pick s”, “Start consensus”, “Achieve consensus”, “Commit wait done”, “Notify slaves”, “Release locks”.]
Commit Wait and 2-Phase Commit 
[Figure: timelines of coordinator TC and participants TP1 and TP2, each running from “Acquired locks” to “Release locks”. Labels: “Compute s for each”, “Start logging”, “Done logging”, “Prepared”, “Send s”, “Compute overall s”, “Commit wait done”, “Committed”, “Notify participants of s”.]
Example 
[Figure: transaction timelines TC, TP, and T2. TC: “Remove X from my friend list”, sC=6. TP: “Remove myself from X’s friend list”, sP=8. Both commit at s=8. T2: “Risky post P”, s=15.]
Time          <8     8     15
My friends    [X]    []
My posts                   [P]
X’s friends   [me]   []
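The slides do not spell out how the overall s is computed from the per-group timestamps. A minimal sketch, assuming the coordinator simply takes the maximum, reproduces the numbers in this example (sC=6, sP=8, s=8). The function name and the dictionary shape are hypothetical.

```python
def overall_commit_timestamp(group_timestamps):
    """Return a commit timestamp no smaller than any per-group timestamp.

    Assumption: the overall timestamp is the maximum of the per-group values.
    """
    if not group_timestamps:
        raise ValueError("need at least one per-group timestamp")
    return max(group_timestamps.values())


# Per-group timestamps from the example slide.
s = overall_commit_timestamp({"TC": 6, "TP": 8})
```

Because both groups commit at the same s, a snapshot read at any timestamp sees either both friend-list removals or neither, and the later post P (s=15) is ordered after them.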
What Have We Covered? 
• Lock-free read transactions across datacenters 
• External consistency 
• Timestamp assignment 
• TrueTime 
– Uncertainty in time can be waited out 
What Haven’t We Covered? 
• How to read at the present time 
• Atomic schema changes 
– Mostly non-blocking 
– Commit in the future 
• Non-blocking reads in the past 
– At any sufficiently up-to-date replica 
TrueTime Architecture 
[Figure: Datacenter 1, Datacenter 2, …, Datacenter n each contain GPS timemasters; one atomic-clock timemaster is also shown. A client computes reference [earliest, latest] = now ± ε.]
TrueTime implementation 
now = reference now + local-clock offset 
ε = reference ε + worst-case local-clock drift 
[Figure: ε plotted against time (0sec, 30sec, 60sec, 90sec). Labels: “reference uncertainty”, “+6ms”, and a worst-case local-clock drift of 200 μs/sec.]
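The two formulas on this slide can be worked through numerically. The sketch below uses the slide's 200 μs/sec drift rate; the 30-second gap between synchronizations is read off the figure's axis ticks, and the 1 ms reference ε is a made-up value for illustration. The function names are hypothetical.

```python
DRIFT_RATE = 200e-6  # worst-case local-clock drift from the slide: 200 microseconds per second


def epsilon_ms(reference_epsilon_ms, seconds_since_sync, drift_rate=DRIFT_RATE):
    """epsilon = reference epsilon + worst-case local-clock drift since the last sync."""
    drift_ms = drift_rate * seconds_since_sync * 1000.0
    return reference_epsilon_ms + drift_ms


def tt_interval(reference_now_s, local_clock_offset_s, eps_ms):
    """now = reference now + local-clock offset; the interval is now +/- epsilon."""
    now = reference_now_s + local_clock_offset_s
    eps_s = eps_ms / 1000.0
    return (now - eps_s, now + eps_s)


# 30 seconds of drift at 200 microseconds per second adds 6 ms, the "+6ms" in the figure.
added_ms = epsilon_ms(0.0, 30)

# With a hypothetical 1 ms reference epsilon, epsilon grows from 1 ms to 7 ms between syncs.
eps_just_synced = epsilon_ms(1.0, 0)
eps_before_next_sync = epsilon_ms(1.0, 30)
```

This is why ε grows steadily between synchronizations and drops back to the reference uncertainty each time the clock is resynchronized.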
What If a Clock Goes Rogue? 
• Timestamp assignment would violate external consistency
• Empirically unlikely based on 1 year of data
– Bad CPUs 6 times more likely than bad clocks
Network-Induced Uncertainty 
[Figure: two plots of Epsilon (ms) at the 90th, 99th, and 99.9th percentiles. One covers Mar 29 to Apr 1 (y-axis 2 to 10 ms); the other covers 6AM to 12PM on April 13 (y-axis 1 to 6 ms).]
What’s in the Literature 
• External consistency/linearizability 
• Distributed databases 
• Concurrency control 
• Replication 
• Time (NTP, Marzullo) 
Future Work 
• Improving TrueTime 
– Lower ε < 1 ms 
• Building out database features 
– Finish implementing basic features 
– Efficiently support rich query patterns 
Conclusions 
• Reify clock uncertainty in time APIs 
– Known unknowns are better than unknown unknowns
– Rethink algorithms to make use of uncertainty 
• Stronger semantics are achievable 
– Greater scale != weaker semantics 
Thanks 
• To the Spanner team and customers 
• To our shepherd and reviewers 
• To lots of Googlers for feedback 
• To you for listening! 
• Questions? 
