NewSQL System Introduction
JiaLin Dai
NewSQL
• Scale like NoSQL
• Provide SQL features
– Complete transaction support
– Full SQL query support
Spanner
• Globally distributed database at Google
• Survives datacenter disasters
• Consistency
– Externally consistent reads / writes
– Globally consistent reads at a timestamp
Spanner cluster
• zonemaster: assigns data to spanservers
• spanserver: serves data to clients
• location proxy: locates the spanservers that hold a client's data
• universe master: console that displays status information
• placement driver: handles automated movement of data across zones
Spanner server
• Colossus: successor to GFS (Google's distributed file system)
• Tablet: stores data for a range of keys
• Paxos: consensus protocol that keeps replicas in sync
Data model
• SQL-like schema
– Every row must have a primary key
• Each row is multi-versioned (sketched below)
– Versions are identified by timestamp
– Old versions are garbage collected
• Protocol Buffers support
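A minimal sketch of the multi-versioned row model, assuming plain integer timestamps; the types and method names are illustrative, not Spanner's API:

  package main

  import "fmt"

  // version is one timestamped value of a row.
  type version struct {
      ts    int64
      value string
  }

  type row struct{ versions []version } // newest version last

  // write appends a new version at timestamp ts.
  func (r *row) write(ts int64, v string) {
      r.versions = append(r.versions, version{ts, v})
  }

  // readAt returns the newest version at or before ts.
  func (r *row) readAt(ts int64) (string, bool) {
      for i := len(r.versions) - 1; i >= 0; i-- {
          if r.versions[i].ts <= ts {
              return r.versions[i].value, true
          }
      }
      return "", false
  }

  // gc drops versions older than the retention horizon.
  func (r *row) gc(horizon int64) {
      kept := r.versions[:0]
      for _, v := range r.versions {
          if v.ts >= horizon {
              kept = append(kept, v)
          }
      }
      r.versions = kept
  }

  func main() {
      r := &row{}
      r.write(10, "a")
      r.write(20, "b")
      v, _ := r.readAt(15)
      fmt.Println(v) // "a": the newest version at or before timestamp 15
      r.gc(20)       // the old version at ts 10 is garbage collected
  }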
Interleaved tables
• Rows that share a key prefix are grouped into one directory (sketched below)
• Data in the same directory is co-located
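A sketch of why interleaving co-locates data: child rows encode the parent key as a prefix, so an ordinary key sort keeps a customer next to its orders (the key encoding is invented):

  package main

  import (
      "fmt"
      "sort"
  )

  func main() {
      // Interleaved encoding: the parent key comes first, so a
      // customer row and its order rows sort next to each other
      // and land in the same directory / shard.
      keys := []string{
          "customer/2",
          "customer/1/order/7",
          "customer/1",
          "customer/2/order/9",
      }
      sort.Strings(keys)
      for _, k := range keys {
          fmt.Println(k)
      }
      // customer/1, customer/1/order/7, customer/2, customer/2/order/9
  }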
Tablet
Log-structured merge (LSM)
• Minor compaction
– Converts the memtable into one SSTable file
• Merge compaction
– Merges all SSTable files into one (both sketched below)
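A toy sketch of the two compaction kinds, treating the memtable and SSTable files as in-memory maps rather than sorted on-disk files:

  package main

  import "fmt"

  type table map[string]string // stand-in for a sorted SSTable file

  // minorCompact freezes the memtable into a new SSTable.
  func minorCompact(memtable table, sstables []table) []table {
      frozen := table{}
      for k, v := range memtable {
          frozen[k] = v
      }
      return append(sstables, frozen)
  }

  // mergeCompact folds all SSTables into one; newer tables win.
  func mergeCompact(sstables []table) table {
      merged := table{}
      for _, t := range sstables { // oldest first
          for k, v := range t {
              merged[k] = v
          }
      }
      return merged
  }

  func main() {
      mem := table{"k1": "v2"}
      ssts := []table{{"k1": "v1", "k2": "x"}}
      ssts = minorCompact(mem, ssts)
      fmt.Println(mergeCompact(ssts)) // map[k1:v2 k2:x]
  }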
Bloom filter
• A read may need to access multiple SSTable files
• An in-memory Bloom filter avoids unnecessary reads (sketched below)
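A minimal Bloom filter sketch with two derived hash functions; real filters size the bit array and hash count to the SSTable:

  package main

  import (
      "fmt"
      "hash/fnv"
  )

  type bloom struct{ bits [1024]bool }

  func hashes(key string) (uint32, uint32) {
      h := fnv.New32a()
      h.Write([]byte(key))
      h1 := h.Sum32()
      h2 := h1>>16 | h1<<16 // cheap second hash derived from the first
      return h1 % 1024, h2 % 1024
  }

  func (b *bloom) add(key string) {
      i, j := hashes(key)
      b.bits[i], b.bits[j] = true, true
  }

  // mightContain never returns false negatives; false positives are
  // possible, so a "true" still requires reading the SSTable.
  func (b *bloom) mightContain(key string) bool {
      i, j := hashes(key)
      return b.bits[i] && b.bits[j]
  }

  func main() {
      var b bloom
      b.add("row42")
      fmt.Println(b.mightContain("row42"))  // true: read the file
      fmt.Println(b.mightContain("row999")) // almost surely false: skip it
  }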
Partition Attributes Across (PAX)
• Keeps a whole row inside one page
– I/O friendly
• Keeps values of the same column close to each other
– CPU cache friendly (see the sketch below)
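A sketch of the PAX idea: rows share one page, but inside the page each attribute lives in its own mini-page, so a single-column scan walks contiguous memory:

  package main

  import "fmt"

  // page keeps whole rows together (I/O friendly) while grouping
  // each attribute into its own mini-page (cache-friendly scans).
  type page struct {
      ids   []int64  // mini-page for column "id"
      names []string // mini-page for column "name"
  }

  func (p *page) insert(id int64, name string) {
      p.ids = append(p.ids, id)
      p.names = append(p.names, name)
  }

  func main() {
      p := &page{}
      p.insert(1, "ann")
      p.insert(2, "bob")
      var sum int64
      for _, id := range p.ids { // scans one contiguous slice
          sum += id
      }
      fmt.Println(sum) // 3
  }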
Consensus protocol
Replicated state machines
• Replicate change logs
• A change is committed once a majority of replicas hold it (sketched below)
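A sketch of the commit rule, with replication, leader election, and retries omitted:

  package main

  import "fmt"

  // committed reports whether acks form a majority of the replica set.
  func committed(acks, replicas int) bool {
      return acks > replicas/2
  }

  func main() {
      fmt.Println(committed(2, 5)) // false: 2 of 5 is not a majority
      fmt.Println(committed(3, 5)) // true: the log entry is durable
  }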
Transactions
Types of transactions
• Read-write transactions
– Read locks are held at the replica leader
– The client buffers all writes
– At the end of the transaction, commit under two-phase locking (2PL); see the sketch after this list
• Snapshot transactions
– Read data at a timestamp in the past
– Any sufficiently up-to-date replica can serve the read
– Lock-free
• Read-only transactions
– Spanner chooses the read timestamp
– Otherwise the same as a snapshot transaction
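A sketch of the read-write flow under the rules above; the client API, lock handling, and commit are stubs, not Spanner's interface:

  package main

  import "fmt"

  type txn struct {
      readLocks []string          // held at the replica leader
      writes    map[string]string // buffered at the client until commit
  }

  func begin() *txn { return &txn{writes: map[string]string{}} }

  func (t *txn) read(key string) string {
      t.readLocks = append(t.readLocks, key) // 2PL growing phase
      return "value-of-" + key               // stub: fetch from the leader
  }

  func (t *txn) write(key, val string) { t.writes[key] = val } // buffered

  func (t *txn) commit() {
      // At commit time the buffered writes are shipped and the
      // transaction commits under 2PL (stubbed here).
      fmt.Printf("commit %d buffered writes, release %d read locks\n",
          len(t.writes), len(t.readLocks))
  }

  func main() {
      t := begin()
      v := t.read("balance")
      t.write("balance", v+"+100")
      t.commit()
  }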
Two-phase locking (2PL)
• Growing phase: acquire locks, never release
• Shrinking phase: release locks, never acquire
Timestamp
TrueTime API
• Explicitly exposes time uncertainty (sketched below)
• Time masters are implemented with GPS receivers and atomic clocks
• A time daemon runs on every machine
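A sketch of the interface, where epsilon is a made-up uncertainty bound standing in for what the time masters and daemons actually derive:

  package main

  import (
      "fmt"
      "time"
  )

  const epsilon = 5 * time.Millisecond // assumed clock uncertainty bound

  type ttInterval struct{ earliest, latest time.Time }

  // now returns an interval guaranteed to contain the true time.
  func now() ttInterval {
      t := time.Now()
      return ttInterval{t.Add(-epsilon), t.Add(epsilon)}
  }

  // after reports whether t has definitely passed on every clock.
  func after(t time.Time) bool { return t.Before(now().earliest) }

  func main() {
      iv := now()
      fmt.Println(iv.latest.Sub(iv.earliest)) // 10ms of explicit uncertainty
      fmt.Println(after(iv.latest))           // false until uncertainty elapses
  }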
Timestamp for RW transactions
• Paxos writes
– Monotonically increasing timestamps are assigned to writes
• Participant: prepare timestamp
– Greater than those of all previous transactions
• Coordinator: commit message time
– TT.now().latest
• Coordinator: commit timestamp
– No less than any prepare timestamp
– Greater than those of all previous transactions
– No less than the commit message time
• Coordinator: commit wait (sketched below)
– Wait until TT.after(commit timestamp) before applying
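A sketch of choosing the commit timestamp and commit-waiting, reusing the same invented epsilon bound as the TrueTime sketch above:

  package main

  import (
      "fmt"
      "time"
  )

  const epsilon = 5 * time.Millisecond // assumed uncertainty bound

  func ttNowLatest() time.Time   { return time.Now().Add(epsilon) }
  func ttAfter(t time.Time) bool { return t.Before(time.Now().Add(-epsilon)) }

  // chooseCommitTS picks s >= every prepare timestamp, >= the commit
  // message arrival time (TT.now().latest), and > prior transactions.
  func chooseCommitTS(prepares []time.Time, prevTxn time.Time) time.Time {
      s := ttNowLatest() // commit message time
      for _, p := range prepares {
          if p.After(s) {
              s = p
          }
      }
      if !s.After(prevTxn) {
          s = prevTxn.Add(time.Nanosecond)
      }
      return s
  }

  func main() {
      s := chooseCommitTS(nil, time.Now())
      for !ttAfter(s) { // commit wait: nobody sees the data before s has passed
          time.Sleep(time.Millisecond)
      }
      fmt.Println("commit applied at", s)
  }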
Up to date?
• Safe time of a replica: the min of
– the Paxos safe time
– the transaction manager safe time
• Paxos safe time
– Timestamp of the highest applied Paxos write
• Transaction manager safe time
– Minimum prepare timestamp among prepared but not yet committed transactions (sketched below)
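A sketch of the safe-time computation with integer timestamps; the replica may serve a snapshot read at t only if t is at or below this value:

  package main

  import "fmt"

  // safeTime is the max timestamp at which this replica is up to date:
  // the min of the Paxos safe time (highest applied Paxos write) and
  // the transaction manager safe time (bounded by the smallest prepare
  // timestamp among prepared-but-uncommitted transactions).
  func safeTime(lastPaxosWrite int64, preparedTS []int64) int64 {
      safe := lastPaxosWrite
      for _, p := range preparedTS {
          if p-1 < safe { // a prepared txn may still commit at p, so stay below it
              safe = p - 1
          }
      }
      return safe
  }

  func main() {
      fmt.Println(safeTime(100, []int64{90})) // 89: cannot serve reads past 89
      fmt.Println(safeTime(100, nil))         // 100: no pending prepares
  }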
Schema change
• Plan the schema change at a future timestamp
• All shards perform the schema change at that time
• Read / write transactions coordinate with the schema change based on timestamps (sketched below)
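A sketch of the timestamp-based coordination, assuming integer timestamps and a single pending change; the versioning scheme is illustrative:

  package main

  import "fmt"

  type schemaChange struct {
      at      int64  // future timestamp chosen in advance
      version string // schema that takes effect at `at`
  }

  // schemaFor returns the schema a transaction at ts must use; no
  // global synchronization is needed, only comparable timestamps.
  func schemaFor(ts int64, old string, change schemaChange) string {
      if ts >= change.at {
          return change.version
      }
      return old
  }

  func main() {
      change := schemaChange{at: 1000, version: "v2"}
      fmt.Println(schemaFor(999, "v1", change))  // v1: before the change
      fmt.Println(schemaFor(1000, "v1", change)) // v2: at or after the change
  }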
Distributed query
Query compile
• Build a relational operator tree
• Optimize the tree using equivalence rewrites
– Push operators down into the shards (see the example below)
Example
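A sketch of the pushdown rewrite above: a filter over a cross-shard union becomes a per-shard filter, so each shard evaluates the predicate locally (the operator tree is heavily simplified):

  package main

  import "fmt"

  // A tiny relational operator tree.
  type op interface{ describe() string }

  type scan struct{ shard string }
  type filter struct {
      pred string
      in   op
  }
  type union struct{ inputs []op }

  func (s scan) describe() string   { return "scan(" + s.shard + ")" }
  func (f filter) describe() string { return "filter[" + f.pred + "](" + f.in.describe() + ")" }
  func (u union) describe() string {
      out := "union("
      for i, in := range u.inputs {
          if i > 0 {
              out += ", "
          }
          out += in.describe()
      }
      return out + ")"
  }

  // pushDown rewrites filter(union(...)) into union(filter(...) per shard).
  func pushDown(o op) op {
      if f, ok := o.(filter); ok {
          if u, ok := f.in.(union); ok {
              var pushed []op
              for _, in := range u.inputs {
                  pushed = append(pushed, filter{f.pred, in})
              }
              return union{pushed}
          }
      }
      return o
  }

  func main() {
      plan := filter{"id > 10", union{[]op{scan{"shard1"}, scan{"shard2"}}}}
      fmt.Println(plan.describe())
      fmt.Println(pushDown(plan).describe())
      // filter[id > 10](union(scan(shard1), scan(shard2)))
      // union(filter[id > 10](scan(shard1)), filter[id > 10](scan(shard2)))
  }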
Distributed join
• Use the sharding-key filter to extract sharding-key ranges from the input
• Merge the sharding-key ranges
• Compute the affected shards
• Construct minimal batches for these shards (sketched below)
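A sketch of the range-merge and shard-mapping steps, assuming integer sharding keys and fixed-width shards of 100 keys; extracting ranges from a real predicate is elided:

  package main

  import (
      "fmt"
      "sort"
  )

  type keyRange struct{ lo, hi int } // inclusive sharding-key range

  // mergeRanges sorts and coalesces overlapping or adjacent ranges.
  func mergeRanges(rs []keyRange) []keyRange {
      sort.Slice(rs, func(i, j int) bool { return rs[i].lo < rs[j].lo })
      var out []keyRange
      for _, r := range rs {
          if n := len(out); n > 0 && r.lo <= out[n-1].hi+1 {
              if r.hi > out[n-1].hi {
                  out[n-1].hi = r.hi
              }
              continue
          }
          out = append(out, r)
      }
      return out
  }

  // shardsFor maps a merged range onto shards of width 100 (assumed).
  func shardsFor(r keyRange) []int {
      var ids []int
      for s := r.lo / 100; s <= r.hi/100; s++ {
          ids = append(ids, s)
      }
      return ids
  }

  func main() {
      ranges := mergeRanges([]keyRange{{5, 20}, {15, 30}, {250, 260}})
      fmt.Println(ranges) // [{5 30} {250 260}]
      for _, r := range ranges {
          fmt.Println(r, "->", shardsFor(r)) // one batch per affected shard
      }
  }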
Run query
• Single-consumer API
• Parallel-consumer API
• Query auto-restart (sketched below)
– Any machine can fail
– A restart token accompanies every query result
– The token captures the distributed state of the query plan
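A sketch of restart tokens, assuming results stream in key order so the token only needs the last key delivered; real tokens capture the full distributed plan state:

  package main

  import "fmt"

  type token struct{ lastKey int } // opaque to the client in practice

  // fetch returns the next page after the token plus an updated token,
  // so the client can resume on any server if this one fails.
  func fetch(tok token, pageSize int) ([]int, token) {
      var rows []int
      for k := tok.lastKey + 1; len(rows) < pageSize; k++ {
          rows = append(rows, k) // stub: read the next row by key
      }
      return rows, token{rows[len(rows)-1]}
  }

  func main() {
      tok := token{}
      rows, tok := fetch(tok, 3)
      fmt.Println(rows, tok) // [1 2 3] {3}
      // Machine fails here; retry anywhere with the same token.
      rows, tok = fetch(tok, 3)
      fmt.Println(rows, tok) // [4 5 6] {6}
  }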
Other systems
Other NewSQL systems
• TiDB
• CockroachDB
Thank You!
