
Google Spanner : our understanding of concepts and implications



Our understanding of Google Spanner and its implications. Spanner is a global-scale distributed database from Google.

Published in: Technology


  1. Google Spanner: our understanding of concepts and implications
     Harisankar H
     DOS lab weekly seminar, 8/Dec/2012
     "Google Spanner: our understanding of concepts and implications" by Harisankar H is licensed under a Creative Commons Attribution 3.0 Unported License.
  2. Outline
     • Spanner
       – User perspective
         • User = application programmer/administrator
       – System architecture
       – Implications
  3. Spanner: user perspective
     • Global-scale database with strict transactional guarantees
       – Global scale
         • Designed to work across datacenters in different continents
         • Claim: “designed to scale up to millions of nodes, hundreds of datacenters, trillions of database rows”
       – Strict transactional guarantees
         • Supports general transactions (even inter-row)
         • Stronger properties than serializability*
       – Replaced the MySQL cluster storing Google's critical ad-related data
         • Reliable even during wide-area natural disasters
       – Supports hierarchical schema of tables
         • Semi-relational
       – Supports SQL-like query and definition language
       – User-defined locality and availability
     * means: explained in later slides
  4. Need for Spanner
     • Limitations of existing systems
       – BigTable (could apply to NoSQL systems in general)
         • Needed complex, evolving schemas
         • Only eventual consistency across data centers
           – Needed wide-area replication with strong consistency
         • Transactional scope limited to single row
           – Needed general cross-row transactions
       – Megastore (relational db-like system)
         • Low performance
           – Layered on top of BigTable
             » High communication costs
           – Less efficient replica consistency algorithms*
         • Better transactional guarantees in Spanner*
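  5. Spanner: transactional guarantee
     • External consistency
       – Stricter than serializability
       – E.g., [Figure: transactions T3, T1 and T2 drawn on a physical-time axis]
         • Serial ordering shown: T1 T3 T2
         • Constraint: T2 after T1
         • Orderings compared: T1 T2 T3; T2 T3 T1; T2 T1 T3 (the last two place T2 before T1, so they violate the constraint)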
  6. External consistency: motivation
     • Facebook-like example from OSDI talk
       – Jerry unfriends Tom to write a controversial comment
       – In physical time:
         • T1: Jerry unfriends Tom
         • T2: Jerry posts comment
         • T3: Tom views Jerry’s profile
       – Serial order shown on the slide:
         • T2: Jerry posts comment
         • T3: Tom views Jerry’s profile
         • T1: Jerry unfriends Tom
       – If serial order is as above, Jerry will be in trouble!
     • Formally, “If commit of T1 preceded the initiation of a new transaction T2 in wall-clock (physical) time, then commit of T1 should precede commit of T2 in the serial ordering also.”
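
The formal rule above can be checked mechanically. The following is a minimal sketch, not Spanner's implementation: the `Txn` record, the `is_externally_consistent` helper, and the wall-clock timings are all hypothetical, and the timings assume the three transactions run strictly one after another.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Txn:
    name: str
    start: float   # wall-clock time at which the transaction was initiated
    commit: float  # wall-clock time at which the transaction committed


def is_externally_consistent(serial_order, txns):
    """True if no transaction is ordered before one that committed before it started."""
    position = {name: i for i, name in enumerate(serial_order)}
    for a in txns:
        for b in txns:
            # a committed before b was initiated, so a must precede b.
            if a.commit < b.start and position[a.name] > position[b.name]:
                return False
    return True


# Hypothetical timings: unfriend, then comment, then profile view.
txns = [Txn("T1", 0, 1), Txn("T2", 2, 3), Txn("T3", 4, 5)]

# Every serial order of the three transactions that the rule accepts.
allowed = [
    order
    for order in permutations(["T1", "T2", "T3"])
    if is_externally_consistent(order, txns)
]
```

Under these assumed timings the order T2, T3, T1 from the slide is rejected, and only T1, T2, T3 remains in `allowed`.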
  7. Spanner: transactional guarantee
     • Additional (weaker) transaction modes for performance
       – Read-only transaction supporting snapshot isolation
         • Snapshot isolation
           – Transactions read a consistent snapshot of the database
           – Values written should not have conflicting updates after the snapshot was read
           – E.g., R1(X) R1(Y) R2(X) R2(Y) W2(Y) W1(X) is allowed
           – Weaker than serializability, but more efficient (lock-free)
           – Spanner does not allow writes for these transactions
             » Probably, that is how they preserve isolation
       – Snapshot read
         • Read of a consistent state of the database in the past
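
To see why the example schedule is allowed under snapshot isolation, here is a toy sketch of the general technique (first-committer-wins on overlapping write sets). It is not how Spanner works internally: as noted above, Spanner's read-only transactions do not write at all. The `SnapshotStore` class and its logical commit counter are assumptions made for illustration.

```python
class SnapshotStore:
    """Toy multi-version store with first-committer-wins conflict checks."""

    def __init__(self, initial):
        self.clock = 0  # logical commit counter (hypothetical, not TrueTime)
        # key -> list of (commit_ts, value), oldest first
        self.versions = {key: [(0, value)] for key, value in initial.items()}

    def begin(self):
        # A transaction remembers its snapshot timestamp and buffers its writes.
        return {"snapshot_ts": self.clock, "writes": {}}

    def read(self, txn, key):
        if key in txn["writes"]:
            return txn["writes"][key]
        # Newest version committed at or before the snapshot.
        for commit_ts, value in reversed(self.versions[key]):
            if commit_ts <= txn["snapshot_ts"]:
                return value
        raise KeyError(key)

    def write(self, txn, key, value):
        txn["writes"][key] = value

    def commit(self, txn):
        # Abort if any written key was updated after the snapshot was taken.
        for key in txn["writes"]:
            history = self.versions.get(key, [])
            if history and history[-1][0] > txn["snapshot_ts"]:
                return False
        self.clock += 1
        for key, value in txn["writes"].items():
            self.versions.setdefault(key, []).append((self.clock, value))
        return True


# The schedule from the slide: R1(X) R1(Y) R2(X) R2(Y) W2(Y) W1(X)
store = SnapshotStore({"X": 1, "Y": 1})
t1, t2 = store.begin(), store.begin()
store.read(t1, "X"); store.read(t1, "Y")
store.read(t2, "X"); store.read(t2, "Y")
store.write(t2, "Y", 2)
t2_ok = store.commit(t2)
store.write(t1, "X", 2)
t1_ok = store.commit(t1)
```

Both commits succeed because the two write sets ({Y} and {X}) do not overlap, even though each transaction read the item the other one wrote. A serializable system would have to order one transaction before the other, which is why snapshot isolation is the weaker guarantee.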
  8. Hierarchical data model
     – Universes (Spanner deployment)
       • Databases (collection of tables)
         – Tables with schemas
           » Ordered rows, columns
           » One or more primary-key columns
             • Rows named by their primary keys
         – Hierarchies of tables
           » Directory tables (top of table hierarchy)
             • Directories
             • Each row in the directory table (with key K), along with the rows in descendant tables whose keys start with K, forms a directory
     Fig: (a). Figures (a),(b) from Spanner, OSDI 2012 paper
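
The directory rule can be sketched as a grouping by key prefix. This is only an illustration of the rule as stated on the slide: the `build_directories` helper, the `Users` and `Albums` table names, and the sample rows are hypothetical, and keys are modelled as plain Python tuples.

```python
def build_directories(root_name, root_rows, child_tables):
    """Group rows into directories keyed by the directory-table key K.

    root_rows:    {key_tuple: row} for the directory table
    child_tables: {table_name: {key_tuple: row}} for descendant tables
    """
    directories = {}
    for key in sorted(root_rows):
        directories[key] = [(root_name, key, root_rows[key])]

    # Assumes every directory-table key has the same number of columns.
    prefix_len = len(next(iter(root_rows))) if root_rows else 0

    for table_name, rows in child_tables.items():
        for key in sorted(rows):
            prefix = key[:prefix_len]  # descendant keys start with K
            if prefix not in directories:
                raise ValueError(f"row {key} in {table_name} has no parent row")
            directories[prefix].append((table_name, key, rows[key]))
    return directories


# Hypothetical data: Users(uid) is the directory table, Albums(uid, aid) a descendant.
users = {(1,): "Tom", (2,): "Jerry"}
albums = {(1, 10): "holiday", (2, 20): "cats", (2, 21): "mice"}

dirs = build_directories("Users", users, {"Albums": albums})
```

Here `dirs[(2,)]` holds Jerry's `Users` row followed by both of his `Albums` rows, which is the unit that would be placed and replicated together.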
  9. User perspective: database configuration
     • Database placement and reliability
       – Administrator:
         • Create options which specify number of replicas and placement
           – E.g., option (a): North America: 5 replicas, Europe: 3 replicas
             option (b): Latin America: 3 replicas
             …
       – Application
         • Directory is the smallest unit for which these properties can be specified
         • Tag each directory or database with these options
           – E.g., TomDir1: option (b)
             JerryDir3: option (a)
             …
     Next: System architecture
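
The split between administrator-defined options and application-applied tags can be pictured as two small lookup tables. The snippet below is only a sketch using the slide's own example values; the dictionary layout and the helper names are assumptions and do not reflect Spanner's actual configuration syntax.

```python
# Administrator: named options giving replica counts per region (values from the slide).
PLACEMENT_OPTIONS = {
    "option_a": {"North America": 5, "Europe": 3},
    "option_b": {"Latin America": 3},
}

# Application: each directory is tagged with one of the options.
DIRECTORY_TAGS = {
    "TomDir1": "option_b",
    "JerryDir3": "option_a",
}


def replicas_for(directory):
    """Return the region -> replica-count mapping that applies to a directory."""
    return PLACEMENT_OPTIONS[DIRECTORY_TAGS[directory]]


def total_replicas(directory):
    """Total number of replicas kept for a directory across all regions."""
    return sum(replicas_for(directory).values())
```

With these values, `total_replicas("JerryDir3")` is 8 (5 in North America plus 3 in Europe), while `TomDir1` keeps 3 replicas in Latin America.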
  10. Spanner architecture: basics
     • Replica consistency
       – Using Paxos protocol
         • Different Paxos groups for different sets of directories
           – Can be across data centers
     • Concurrency control
       – Using two-phase locking
         • Chosen over optimistic methods because of long-lived transactions (order of minutes)
     • Transaction coordination
       – Two-phase commit
         • Two-phase commit on top of Paxos ensures availability
     • Timestamps for transactions and data items
       – To support snapshot isolation and snapshot reads
       – Multiple timestamped versions of data items maintained
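
Keeping multiple timestamped versions is what makes a read "in the past" possible. The following is a minimal sketch of that idea for a single data item, assuming plain numeric timestamps; the `MultiVersionItem` class is hypothetical and ignores replication, locking, and garbage collection of old versions.

```python
import bisect


class MultiVersionItem:
    """One data item stored as a list of versions ordered by timestamp."""

    def __init__(self):
        self._timestamps = []  # sorted commit timestamps
        self._values = []      # value written at the matching timestamp

    def write(self, timestamp, value):
        # Insert the new version so that timestamps stay sorted.
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def read_at(self, timestamp):
        """Snapshot read: newest version written at or before `timestamp`."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        if i == 0:
            return None  # the item did not exist yet at that time
        return self._values[i - 1]


item = MultiVersionItem()
item.write(10, "a")
item.write(20, "b")
```

A reader that asks for timestamp 15 sees `"a"` no matter how many later versions are written, so it never needs to block writers.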
  11. Spanner components
     [Figure: component diagram]
     • Universe master (status + interactive debugging)
     • Placement driver (move data across zones automatically)
     • Network
     • TrueTime service*
     • Zone 1 (physical location), Zone 2 (physical location), … each containing:
       – Zone master (assign data)
       – Location proxies (locate data)
       – Spanservers (data)
  12. Zones, directories and Paxos groups
     Fig: (b). Figures (a),(b) from Spanner, OSDI 2012 paper
  13. Replication-related components
     • Tablet: unit of storage
       – Bag of directories
       – Abstraction on top of underlying DFS Colossus
     • Single Paxos state machine (replica) per tablet
     • Replicas of each tablet form a Paxos group
     • Leader elected among a Paxos group
     [Figure: a Paxos group holding directories, with a Paxos leader and tablet replicas at DC1,n2 …, DC2,n8 …]
  14. Transaction-related components
     [Figure: transaction T5 spanning two Paxos groups]
     • Paxos group (Participant)
       – Participant leader: the group's Paxos leader
       – Participant slaves: the other tablet replicas
     • Paxos group (Coordinator)
       – Coordinator leader (2PC + 2PL): the group's Paxos leader
       – Coordinator slaves: the other tablet replicas (e.g., DC1,n2 …, DC2,n8 …)
  15. Next:
     • Serializability is ensured by the components already explained
     • External consistency is implemented with the help of the TrueTime service
       – The TrueTime service is also used for leader election using timed leases
  16. TrueTime + transaction implementation [by Aditya]
  17. Implications of Spanner [REMOVED]
  18. Thank you
     • Image credits
       – Figures (a),(b) from Spanner, OSDI 2012 paper