Google Spanner

Presentation for Advanced topics in Distributed Computing class @ KTH.


  1. Spanner: Google's Globally-Distributed Database (Vaidas Brundza, EMDC)
  2. What is Spanner?
     o Globally distributed multi-version database
       - General-purpose transactions (ACID)
       - SQL-like query language
       - Schematized semi-relational tables
     o Currently running in production
       - Storage for Google's F1 advertising backend data
       - Replaced a sharded MySQL database
  3-4. Overview
     o Lock-free distributed read transactions
     o Global external consistency of distributed transactions
       - Same as linearizability: if a transaction T1 commits before another
         transaction T2 starts, then T1's commit timestamp is smaller than
         T2's (see the sketch below)
     o Used technologies: concurrency control, replication, 2PC and 2PL
     o The key technology: TrueTime service
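
A minimal Python sketch of what this external-consistency rule requires of
commit timestamps; the transaction record and its fields are illustrative,
not Spanner's actual data structures:

    from dataclasses import dataclass

    @dataclass
    class CommittedTxn:
        start: float      # real time at which the transaction started
        commit: float     # real time at which it committed
        timestamp: float  # commit timestamp assigned to it

    def externally_consistent(t1: CommittedTxn, t2: CommittedTxn) -> bool:
        """If t1 commits (in real time) before t2 starts, then t1's
        commit timestamp must be smaller than t2's."""
        if t1.commit < t2.start:
            return t1.timestamp < t2.timestamp
        return True  # overlapping transactions are not constrained by this rule

    # Example: t1 finished before t2 began, so its timestamp must be smaller.
    t1 = CommittedTxn(start=0.0, commit=1.0, timestamp=1.0)
    t2 = CommittedTxn(start=2.0, commit=3.0, timestamp=3.0)
    assert externally_consistent(t1, t2)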
  5-7. Spanner server organization
     o A Spanner deployment is called a universe
     o It has two singletons: the universe master and the placement driver
       (see the sketch below)
     o Can have up to several thousand spanservers
     o Organized as a set of zones
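
A rough model of the deployment structure listed above; all names and
hostnames here are purely illustrative:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Zone:
        name: str
        spanservers: List[str] = field(default_factory=list)  # hostnames (made up)

    @dataclass
    class Universe:
        universe_master: str            # singleton
        placement_driver: str           # singleton; moves data across zones
        zones: List[Zone] = field(default_factory=list)

    universe = Universe(
        universe_master="master-01",
        placement_driver="pd-01",
        zones=[Zone("us-east", ["ss-1", "ss-2"]), Zone("eu-west", ["ss-3"])],
    )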
  8-11. Serving data from multiple datacenters
     (Animated figure: datacenters in the US, Spain, Sweden and Russia each
     serve their local data; a client reads the complete data set across all
     of them, with one step showing writes being blocked.)
  12-14. Transaction example
     (Figure: a coordinator Tc and two participants TP1, TP2; each acquires
     locks, and a timestamp ts is computed for each.)
  15-17. TrueTime API
     o Provides an absolute time, denoted as "global wall-clock time"
     o Has a bounded uncertainty ɛ, which varies between 1 and 7 ms over
       each poll interval
     o Values are derived from the worst-case local-clock drift scenario
     o Magic number: 200 μs/s
     (Figure: TT.now() returns an interval [earliest, latest] of width 2ɛ.)

     Method        Returns
     TT.now()      TTinterval: [earliest, latest]
     TT.after(t)   true if t has definitely passed
     TT.before(t)  true if t has definitely not arrived

     (A Python sketch of this interface follows.)
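
A minimal sketch of the TrueTime interface; the clock source and the fixed ɛ
below are placeholders, since the real service derives its uncertainty from
GPS and atomic-clock time masters:

    import time
    from dataclasses import dataclass

    @dataclass
    class TTInterval:
        earliest: float  # absolute time, in seconds
        latest: float

    class TrueTime:
        def __init__(self, epsilon_s: float = 0.004):
            # ɛ varies between roughly 1 and 7 ms over a poll interval;
            # 4 ms here is only a placeholder midpoint.
            self.epsilon_s = epsilon_s

        def now(self) -> TTInterval:
            t = time.time()  # placeholder for the real reference clock
            return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

        def after(self, t: float) -> bool:
            """True if t has definitely passed."""
            return self.now().earliest > t

        def before(self, t: float) -> bool:
            """True if t has definitely not arrived."""
            return self.now().latest < t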
  18-19. TrueTime architecture
     (Figure: each datacenter (1, 2, ..., n) runs GPS time masters and
     atomic-clock time masters, which clients poll.)
     o now = reference now + local-clock offset
     o ɛ = reference ɛ + worst-case local-clock drift
       (see the sketch below)
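
A small sketch of that bookkeeping, using the 200 μs/s drift figure from the
previous slide; the concrete numbers in the example are made up:

    # Combine the reference time from a time master with the worst-case
    # drift of the local clock since the last poll.
    DRIFT_RATE = 200e-6  # worst-case local-clock drift: 200 μs per second

    def local_truetime(reference_now, reference_epsilon,
                       local_clock_offset, seconds_since_poll):
        """now = reference now + local-clock offset
           ɛ   = reference ɛ + worst-case local-clock drift"""
        now = reference_now + local_clock_offset
        epsilon = reference_epsilon + DRIFT_RATE * seconds_since_poll
        return now, epsilon

    # Example: 1 ms reference uncertainty, polled 10 s ago -> ɛ grows to 3 ms.
    now, eps = local_truetime(reference_now=1_000_000.0,
                              reference_epsilon=0.001,
                              local_clock_offset=0.002,
                              seconds_since_poll=10.0)
    assert abs(eps - 0.003) < 1e-9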
  20-21. Transaction example, continued
     (Figure: same scenario as slides 12-14; after the timestamps are
     computed, logging starts.)
  22-23. Spanserver software stack
     o A tablet implements the mapping (key, timestamp) -> string
       (see the sketch below)
     o The Paxos state machine provides replication support
     o Writes initiate the Paxos protocol at the leader
     o Focuses on long-lived transactions
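
A sketch of that multi-version mapping; the in-memory dictionary is only
illustrative (a real tablet keeps its state in B-tree-like files and a
write-ahead log on a distributed file system):

    from typing import Dict, Optional, Tuple

    class Tablet:
        """(key, timestamp) -> string, with reads at a chosen timestamp."""
        def __init__(self):
            self._data: Dict[Tuple[str, int], str] = {}

        def write(self, key: str, timestamp: int, value: str) -> None:
            self._data[(key, timestamp)] = value

        def read(self, key: str, timestamp: int) -> Optional[str]:
            """Return the latest version of `key` no newer than `timestamp`."""
            versions = [ts for (k, ts) in self._data if k == key and ts <= timestamp]
            return self._data[(key, max(versions))] if versions else None

    t = Tablet()
    t.write("row1", 10, "v1")
    t.write("row1", 20, "v2")
    assert t.read("row1", 15) == "v1"
    assert t.read("row1", 25) == "v2"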
  24-31. Transaction example, continued
     (Figure, built up step by step: the coordinator Tc and participants
     TP1, TP2 have acquired locks and computed timestamps; each starts and
     finishes logging; the participants report "prepared + ts"; Tc computes
     the overall ts, waits until commit wait is done, commits, and sends the
     overall ts to the participants; all of them then release their locks.
     A sketch of this flow follows.)
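
A compact sketch of that commit flow, assuming the TrueTime class from the
earlier sketch; Paxos logging and failure handling are elided, and the node
class is an illustrative stub:

    import time

    class Node:
        """Illustrative participant/coordinator stub."""
        def acquire_locks_and_prepare(self, tt) -> float:
            # A real node would acquire locks, log a prepare record through
            # Paxos, and report its prepare timestamp.
            return tt.now().latest

        def commit(self, s: float) -> None:
            # A real node would log the commit at timestamp s, apply it,
            # and then release its locks.
            pass

    def run_two_phase_commit(tt, coordinator: Node, participants) -> float:
        # 1. Coordinator and participants acquire locks and prepare,
        #    each producing a prepare timestamp.
        prepare_ts = [p.acquire_locks_and_prepare(tt) for p in participants]
        prepare_ts.append(coordinator.acquire_locks_and_prepare(tt))

        # 2. Coordinator picks the overall commit timestamp s: no smaller
        #    than any prepare timestamp or TT.now().latest.
        s = max(prepare_ts + [tt.now().latest])

        # 3. Commit wait: do not expose the commit until s has definitely
        #    passed, which preserves external consistency.
        while not tt.after(s):
            time.sleep(0.001)

        # 4. Commit everywhere and release locks.
        coordinator.commit(s)
        for p in participants:
            p.commit(s)
        return s

    # Usage: s = run_two_phase_commit(TrueTime(), Node(), [Node(), Node()])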
  32. Additional, uncovered bits
     o Supports atomic schema changes
     o Non-blocking snapshot reads in the past
     o How to read at the present time
     o Paxos protocol restriction: does not support in-Paxos configuration
       changes
  33. Evaluation: TrueTime uncertainty
     (Figure: distribution of TrueTime ɛ values, sampled right after the
     time-slave daemon polls the time masters.)
  34-36. Evaluation: F1 case study

     Distribution of directory-fragment counts:

     # fragments   # directories
     1             >100M
     2-4           341
     5-9           5336
     10-14         232
     15-99         34
     100-500       7

     Perceived operation latencies (over a 24-hour period):

     operation            mean latency (ms)   std dev (ms)   count
     all reads            8.7                 376.4          21.5B
     single-site commit   72.3                112.8          31.2M
     multi-site commit    103.0               52.2           32.1M
  37-38. Evaluation: microbenchmarks

     Latency (ms), by number of replicas (1D = one replica, commit wait
     disabled):

     replicas   write       read-only txn   snapshot read
     1D         9.4±0.6     -               -
     1          14.4±1.0    1.4±0.1         1.3±0.1
     3          13.9±0.6    1.3±0.1         1.2±0.1
     5          14.4±0.4    1.4±0.05        1.3±0.04

     Throughput (Kops/sec), by number of replicas:

     replicas   write      read-only txn   snapshot read
     1D         4.0±0.3    -               -
     1          4.1±0.05   10.9±0.4        13.5±0.1
     3          2.2±0.5    13.8±3.2        38.5±0.3
     5          2.8±0.3    25.3±5.2        50.0±1.1

     Two-phase commit scalability:

     participants   mean latency (ms)   99th percentile (ms)
     1              17.0±1.4            75.0±34.9
     2              24.5±2.5            87.6±35.9
     5              31.5±6.2            104.5±52.2
     10             30.0±3.7            95.6±25.4
     25             35.5±5.6            100.4±42.7
     50             42.7±4.1            93.7±22.9
     100            71.4±7.6            131.2±17.6
     200            150.5±11.0          320.3±35.1
  39. Conclusion
     o The first system to provide a globally distributed, externally
       consistent multi-version database
     o Relies on a novel time API (TrueTime)
     o Introduces improvements over previous services
