How to build TiDB
PingCAP
About me
● Infrastructure engineer / CEO of PingCAP
● Working on open source projects: TiDB/TiKV
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv
Email: liuqi@pingcap.com
Let’s say we want to build a NewSQL Database
● From the beginning
● What’s wrong with the existing DBs?
○ Relational databases
○ NoSQL
We have a key-value store (RocksDB)
● Good start, RocksDB is fast and stable.
○ Atomic batch write
○ Snapshot
● However… It’s a local embedded kv store.
○ Can’t tolerate machine failures
○ Scalability depends on the capacity of the disk
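The two RocksDB features the next steps build on can be captured in a tiny interface. A minimal sketch, assuming a hypothetical Go interface rather than the real RocksDB bindings:

// A hypothetical local KV interface: just the two properties the slides rely on,
// atomic batch writes and consistent point-in-time snapshots.
package kv

// Batch collects writes that must be applied atomically.
type Batch interface {
	Put(key, value []byte)
	Delete(key []byte)
}

// Snapshot is a consistent, read-only view of the store at one point in time.
type Snapshot interface {
	Get(key []byte) ([]byte, error)
	Close()
}

// LocalStore is what an embedded engine such as RocksDB gives us on one machine.
type LocalStore interface {
	Write(b Batch) error   // all-or-nothing apply of the batch
	GetSnapshot() Snapshot // point-in-time view for reads
}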
Let’s fix Fault Tolerance
● Use Raft to replicate data
○ Key features of Raft
■ Strong leader: the leader does most of the work and issues all log updates
■ Leader election
■ Membership changes
● Implementation:
○ Ported from etcd
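A hedged sketch of what "use Raft to replicate data" means for a single write, with a hypothetical RaftNode interface standing in for the implementation ported from etcd (a real Raft node applies entries from the committed log on every replica; this collapses that loop for brevity):

// Hypothetical sketch, not the TiKV or etcd API: a write becomes a log entry,
// is replicated to a quorum, and only then applied to the local RocksDB.
package raftkv

import (
	"context"
	"encoding/json"
)

type kvEntry struct {
	Key   []byte
	Value []byte
}

// RaftNode is a stand-in for a Raft implementation such as the one ported from etcd.
type RaftNode interface {
	// Propose blocks until the entry is committed by a quorum (or ctx is cancelled).
	Propose(ctx context.Context, entry []byte) error
}

type Store struct {
	raft  RaftNode
	apply func(entry []byte) error // applies a committed entry to the local RocksDB
}

// Put is replicated before it is applied, so losing one machine loses no data.
func (s *Store) Put(ctx context.Context, key, value []byte) error {
	entry, err := json.Marshal(kvEntry{Key: key, Value: value})
	if err != nil {
		return err
	}
	if err := s.raft.Propose(ctx, entry); err != nil {
		return err
	}
	return s.apply(entry)
}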
Let’s fix Fault Tolerance
[Diagram: Machine 1, Machine 2, and Machine 3 each run RocksDB, kept consistent by Raft replication between them]
That’s cool
● Basically, we have a lite version of etcd or ZooKeeper.
○ It does not support the watch command and some other features
● Let’s make it better.
How about Scalability?
● What if we SPLIT data into many regions?
○ We get many Raft groups.
○ Region = a contiguous range of keys
● Hash partitioning or Range partitioning
○ Redis: Hash partitioning
○ HBase: Range partitioning
That’s Cool, but...
● But what if we want to scan data?
○ How to support API: scan(startKey, endKey, limit)
● So, we need a globally ordered map
○ Can’t use hash partitioning
○ Use range partitioning
■ Region 1 -> [a - d]
■ Region 2 -> [e - h]
■ …
■ Region n -> [w - z]
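A minimal sketch of why range partitioning makes scan(startKey, endKey, limit) straightforward, using hypothetical Region/router types rather than the real TiKV client code: Regions are sorted by key range, so a scan simply walks them in order.

// Hypothetical router sketch: locating a Region by key and scanning Regions in order.
package router

import (
	"bytes"
	"sort"
)

// Region covers a contiguous, left-closed/right-open key range.
type Region struct {
	ID       uint64
	StartKey []byte // inclusive
	EndKey   []byte // exclusive; empty means "+infinity"
}

// locate finds the Region containing key. It assumes regions are sorted by StartKey
// and cover the whole key space, as they do under range partitioning.
func locate(regions []Region, key []byte) Region {
	i := sort.Search(len(regions), func(i int) bool {
		end := regions[i].EndKey
		return len(end) == 0 || bytes.Compare(key, end) < 0
	})
	return regions[i]
}

// scan visits Regions in key order and stops at limit; this ordered walk is exactly
// what hash partitioning cannot provide.
func scan(regions []Region, startKey, endKey []byte, limit int,
	readRegion func(r Region, from, to []byte, limit int) [][]byte) [][]byte {
	var rows [][]byte
	for _, r := range regions {
		if len(r.EndKey) != 0 && bytes.Compare(r.EndKey, startKey) <= 0 {
			continue // Region ends before the scan range starts
		}
		if bytes.Compare(r.StartKey, endKey) >= 0 {
			break // Region starts after the scan range ends
		}
		rows = append(rows, readRegion(r, startKey, endKey, limit-len(rows))...)
		if len(rows) >= limit {
			rows = rows[:limit]
			break
		}
	}
	return rows
}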
How to scale? (1/2)
● That’s simple
● Just Split && Move Region 1
[Diagram: Region 1 splits into Region 1 and Region 2]
How to scale? (2/2)
● Raft comes to the rescue again
○ Using Raft membership changes, in 2 steps:
■ Add a new replica
■ Destroy the old replica
Scale-out (initial state)
[Diagram: Regions 1, 2, and 3 are each replicated across Node A, Node B, Node C, and Node D; the leader of Region 1 (Region 1*) is on Node A]
Scale-out (add new node)
[Diagram: Node E joins the cluster. 1) Transfer leadership of Region 1 from Node A to Node B]
Scale-out (balancing)
[Diagram: 2) Add a replica of Region 1 on Node E]
Scale-out (balancing)
[Diagram: 3) Remove the replica of Region 1 from Node A]
Now we have a distributed key-value store
● We want to keep replicas in different datacenters
○ For HA: any node might crash, even a whole data center
○ And to balance the workload
● So, we need the Placement Driver (PD) to act as the cluster manager, responsible for:
○ Replication constraint
○ Data movement
Placement Driver
● The concept comes from Spanner
● Provides a god's-eye view of the whole cluster
● Stores the metadata
○ Clients cache placement information.
● Maintains the replication constraint
○ 3 replicas, by default
● Data movement
○ For balancing the workload
● It’s a cluster too, of course.
○ Thanks to Raft.
[Diagram: three Placement Driver instances forming their own Raft group]
Placement Driver
● Rebalance without moving data.
○ Raft: Leadership transfer extension
● Moving data is a slow operation.
● We need fast rebalance.
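A hedged sketch of the scheduling choice described above, with hypothetical operator types rather than the real PD code: transfer Raft leadership when only load needs to move (cheap, no data copied), and fall back to moving a replica only when placement must change (slow, the Region is copied).

// Hypothetical PD-style scheduler sketch.
package scheduler

type Operator interface{ Desc() string }

type TransferLeader struct{ RegionID, FromStore, ToStore uint64 }
type MoveReplica struct{ RegionID, FromStore, ToStore uint64 }

func (t TransferLeader) Desc() string { return "transfer-leader" }
func (m MoveReplica) Desc() string    { return "move-replica" }

// balance picks the cheapest operator that fixes the imbalance.
func balance(leaderSkewed, replicaSkewed bool, regionID, from, to uint64) Operator {
	if leaderSkewed {
		return TransferLeader{regionID, from, to} // fast: only Raft leadership moves
	}
	if replicaSkewed {
		return MoveReplica{regionID, from, to} // slow: snapshot + log replication
	}
	return nil // nothing to do
}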
TiKV: The whole picture
[Diagram: a Client talks over RPC to TiKV Node1-Node4 (Store1-Store4); each node holds replicas of Regions 1-3, each Region's replicas form a Raft group, and the Placement Driver manages the cluster]
That’s Cool, but hold on...
● It could be cooler if we have:
○ MVCC
○ ACID Transaction
■ Transaction model: Google Percolator (2PC)
MVCC (Multi-Version Concurrency Control)
● Each transaction sees a snapshot of the database as of the moment it begins; any
changes made by the transaction are not visible to other transactions until it
commits.
● Data is tagged with versions
○ Key_version: value
● Lock-free snapshot reads
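A minimal sketch of the "Key_version: value" idea and a lock-free snapshot read, using a toy in-memory layout rather than TiKV's real encoding; it assumes versions are written in increasing timestamp order.

// Toy MVCC sketch: every write is stored under (key, version), and a snapshot read
// at readTS returns the newest version <= readTS without taking any locks.
package mvcc

import "fmt"

type versioned struct {
	ts    uint64
	value []byte
}

type store map[string][]versioned // versions kept newest-first per key

// put prepends, so the slice stays newest-first as long as ts is monotonically increasing.
func (s store) put(key string, ts uint64, value []byte) {
	s[key] = append([]versioned{{ts, value}}, s[key]...)
}

// snapshotGet is lock-free: it scans for the newest version visible at readTS.
func (s store) snapshotGet(key string, readTS uint64) ([]byte, error) {
	for _, v := range s[key] {
		if v.ts <= readTS {
			return v.value, nil
		}
	}
	return nil, fmt.Errorf("no visible version of %q at ts %d", key, readTS)
}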
Transaction API style (Go code)
txn := store.Begin()                      // start a transaction
txn.Set([]byte("key1"), []byte("value1")) // writes are buffered in the transaction
txn.Set([]byte("key2"), []byte("value2"))
err := txn.Commit()                       // commit the transaction
if err != nil {
    txn.Rollback()
}
I want to write code like this.
Transaction Model
● Inspired by Google Percolator
● 3 column families
○ cf:lock: an uncommitted transaction is writing this cell; contains the
location/pointer of the primary lock
○ cf:write: stores the commit timestamp of the data
○ cf:data: stores the data itself
Transaction Model
Key | Bal: Data | Bal: Lock | Bal: Write
Bob | 6:        | 6:        | 6: data @ 5
    | 5: $10    | 5:        | 5:
Joe | 6:        | 6:        | 6: data @ 5
    | 5: $2     | 5:        | 5:
Bob wants to transfer $7 to Joe
Transaction Model
Key | Bal: Data | Bal: Lock       | Bal: Write
Bob | 7: $3     | 7: I am Primary | 7:
    | 6:        | 6:              | 6: data @ 5
    | 5: $10    | 5:              | 5:
Joe | 6:        | 6:              | 6: data @ 5
    | 5: $2     | 5:              | 5:
Prewrite: Bob's new balance is written at version 7 and his cell takes the primary lock.
Transaction Model
Key | Bal: Data | Bal: Lock          | Bal: Write
Bob | 7: $3     | 7: I am Primary    | 7:
    | 6:        | 6:                 | 6: data @ 5
    | 5: $10    | 5:                 | 5:
Joe | 7: $9     | 7: Primary@Bob.bal | 7:
    | 6:        | 6:                 | 6: data @ 5
    | 5: $2     | 5:                 | 5:
Prewrite: Joe's new balance is written at version 7 with a secondary lock pointing to the primary lock on Bob.
Transaction Model (commit point)
Key | Bal: Data | Bal: Lock       | Bal: Write
Bob | 8:        | 8:              | 8: data @ 7
    | 7: $3     | 7: I am Primary | 7:
    | 6:        | 6:              | 6: data @ 5
    | 5: $10    | 5:              | 5:
Joe | 8:        | 8:              | 8: data @ 7
    | 7: $9     | 7: Primary@Bob  | 7:
    | 6:        | 6:              | 6: data @ 5
    | 5: $2     | 5:              | 5:
Commit point: commit timestamp 8 is recorded in the write column, pointing at the data written at version 7.
Transaction Model
Key | Bal: Data | Bal: Lock      | Bal: Write
Bob | 8:        | 8:             | 8: data @ 7
    | 7: $3     | 7:             | 7:
    | 6:        | 6:             | 6: data @ 5
    | 5: $10    | 5:             | 5:
Joe | 8:        | 8:             | 8: data @ 7
    | 7: $9     | 7: Primary@Bob | 7:
    | 6:        | 6:             | 6: data @ 5
    | 5: $2     | 5:             | 5:
The primary lock on Bob is released; Joe's secondary lock still points to it.
Transaction Model
Key | Bal: Data | Bal: Lock | Bal: Write
Bob | 8:        | 8:        | 8: data @ 7
    | 7: $3     | 7:        | 7:
    | 6:        | 6:        | 6: data @ 5
    | 5: $10    | 5:        | 5:
Joe | 8:        | 8:        | 8: data @ 7
    | 7: $9     | 7:        | 7:
    | 6:        | 6:        | 6: data @ 5
    | 5: $2     | 5:        | 5:
Joe's secondary lock is cleaned up as well; the transaction is fully committed.
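The Bob/Joe walkthrough above boils down to two phases, prewrite and commit. A hedged sketch of those two steps over the three column families (a simplified illustration, not TiKV's real code; conflict checks against cf:write are omitted):

// Percolator-style 2PC sketch over one cell's three column families.
package percolator

import "fmt"

type cell struct {
	data  map[uint64]string // cf:data  -> value written at start_ts
	lock  map[uint64]string // cf:lock  -> "primary" or a pointer to the primary lock
	write map[uint64]uint64 // cf:write -> commit_ts maps to the start_ts of the data
}

func newCell() *cell {
	return &cell{map[uint64]string{}, map[uint64]string{}, map[uint64]uint64{}}
}

// prewrite stages the new value and takes a lock; the first cell locked is the primary.
func prewrite(c *cell, startTS uint64, value, lockRef string) error {
	if len(c.lock) != 0 {
		return fmt.Errorf("write conflict: cell is already locked")
	}
	c.data[startTS] = value
	c.lock[startTS] = lockRef
	return nil
}

// commit releases the lock and records commit_ts -> start_ts in cf:write.
// Committing the primary cell is the atomic commit point of the whole transaction.
func commit(c *cell, startTS, commitTS uint64) {
	delete(c.lock, startTS)
	c.write[commitTS] = startTS
}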
TiKV: Architecture overview (Logical)
Layers, top to bottom: Transaction → MVCC → RaftKV → Local KV Storage (RocksDB)
● Highly layered
● Using Raft for consistency and scalability
● No distributed file system
○ For better performance and lower latency
TiKV: Highly layered (API angle)
Transaction → txn_get(key, txn_start_ts)
MVCC → MVCC_get(key, ver)
RaftKV → raft_get(key)
Local KV Storage (RocksDB) → get(key)
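A hedged sketch of that call chain, with placeholder bodies that only show how each layer narrows the one beneath it; the names mirror the slide and the version encoding is a toy, not TiKV's real one.

// Layered read path sketch: each layer is a thin translation onto the layer below.
package layers

import "encoding/binary"

// Bottom layer: a plain lookup in the local RocksDB instance (body elided).
func get(key []byte) ([]byte, error) { return nil, nil }

// RaftKV: make sure this replica may serve the read (e.g. it is the leader), then go local.
func raftGet(key []byte) ([]byte, error) { return get(key) }

// MVCC: a versioned read is just a read of a version-encoded key.
func mvccGet(key []byte, ver uint64) ([]byte, error) { return raftGet(encodeVersioned(key, ver)) }

// Transaction: a read inside a transaction is an MVCC read at the transaction's start_ts.
func txnGet(key []byte, txnStartTS uint64) ([]byte, error) { return mvccGet(key, txnStartTS) }

// encodeVersioned appends the inverted version so newer versions sort first (toy encoding).
func encodeVersioned(key []byte, ver uint64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, ^ver)
	return append(append([]byte{}, key...), buf...)
}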
That’s really really Cool
● We have A Distributed Key-Value Database with
○ Geo-Replication / Auto Rebalance
○ ACID Transaction support
○ Horizontal Scalability
What if we support SQL?
● SQL is simple and very productive
● We want to write code like this:
SELECT COUNT(*) FROM user
WHERE age > 20 and age < 30;
And this...
BEGIN;
INSERT INTO person VALUES('tom', 25);
INSERT INTO person VALUES('jerry', 30);
COMMIT;
First of all, map table data to key value store
● What happens behind:
CREATE TABLE user (
id INT PRIMARY KEY,
name TEXT,
email TEXT
);
Mapping table data to kv store
Key Value
user/1 dongxu | huang@pingcap.com
user/2 tom | tom@pingcap.com
... ...
INSERT INTO user VALUES (1, "dongxu", "huang@pingcap.com");
INSERT INTO user VALUES (2, "tom", "tom@pingcap.com");
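A minimal sketch of the row-to-KV mapping in the table above, using the slide's simplified "user/<id>" key format; TiDB's real row encoding is binary and more involved.

// Toy row codec matching the slide's example table.
package rowcodec

import "fmt"

type user struct {
	ID    int64
	Name  string
	Email string
}

// encodeRow turns one row into the key-value pair that gets written to the KV store.
func encodeRow(u user) (key, value string) {
	key = fmt.Sprintf("user/%d", u.ID)       // e.g. "user/1"
	value = fmt.Sprintf("%s | %s", u.Name, u.Email) // e.g. "dongxu | huang@pingcap.com"
	return key, value
}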
Secondary index is necessary
● Global index
○ All indexes in TiDB are transactional and fully consistent
○ Stored as separate key-value pairs in TiKV
● Keyed by a concatenation of the index prefix and primary key in TiKV
○ For example: table := {id, name}, where id is the primary key. To index the name
column, given a row r := (1, 'tom'), we can store another kv pair like:
■ name_index/tom_1 => nil
■ name_index/tom_2 => nil
○ For a unique index (see the sketch below)
■ name_index/tom => 1
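A hedged sketch of these two index encodings, again in the slide's simplified key format; real TiDB index keys are binary-encoded.

// Toy index codec matching the slide's examples.
package indexcodec

import "fmt"

// nonUniqueIndexKey appends the primary key to the index key, so duplicate values
// ("tom") still produce distinct KV entries; the value can be empty.
func nonUniqueIndexKey(indexName, columnValue string, rowID int64) string {
	return fmt.Sprintf("%s_index/%s_%d", indexName, columnValue, rowID) // name_index/tom_1
}

// uniqueIndexEntry uses the indexed value alone as the key and stores the row's
// primary key as the value, so a point lookup on the index finds the row directly.
func uniqueIndexEntry(indexName, columnValue string, rowID int64) (key, value string) {
	return fmt.Sprintf("%s_index/%s", indexName, columnValue), fmt.Sprintf("%d", rowID)
}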
Indexes alone are not enough...
● Can we push down filters?
○ select count(*) from person
where age > 20 and age < 30
● It should be much faster, maybe 100x
○ Fewer RPC round trips
○ Less data transferred
Predicate pushdown
[Diagram: the TiDB Server pushes the filter "age > 20 and age < 30" down to TiKV Node1, Node2, and Node3, which hold Regions 1, 2, and 5. TiDB knows that Regions 1 / 2 / 5 store the data of the person table.]
But TiKV doesn’t know the schema
● A key-value database doesn't have any information about tables and rows
● The coprocessor comes to help (see the sketch below):
○ The concept comes from HBase
○ Inject your own logic into data nodes
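A hedged sketch of the coprocessor idea for the count query above: each TiKV node filters and counts the rows in its own Regions, and TiDB only sums the partial counts. The types and call shape are illustrative, not the real coprocessor protocol.

// Pushdown sketch: per-Region filtering and counting, with only small results returned.
package coprocessor

type person struct {
	Name string
	Age  int
}

// regionCount runs on the TiKV node that owns the Region: scan locally, return one number.
func regionCount(rows []person, minAge, maxAge int) int64 {
	var n int64
	for _, r := range rows {
		if r.Age > minAge && r.Age < maxAge {
			n++
		}
	}
	return n
}

// countPushedDown runs on TiDB: fan out to the Regions, add up the partial results.
func countPushedDown(regions [][]person, minAge, maxAge int) int64 {
	var total int64
	for _, rows := range regions {
		total += regionCount(rows, minAge, maxAge) // one small result per Region instead of raw rows
	}
	return total
}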
What about drivers for every language?
● We have to build drivers for Java, Python, PHP, C/C++, Rust, Go…
● It needs lots of time and code.
○ Trust me, you don’t want to do that.
OR...
● We just build a protocol layer that is compatible with MySQL. Then we have
all the MySQL drivers.
○ All the tools
○ All the ORMs
○ All the applications
● That’s what TiDB does.
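For example, a stock Go MySQL driver can talk to TiDB with no TiDB-specific code. This assumes a TiDB server listening on 127.0.0.1:4000 with user root, no password, and a test database containing the user table from earlier.

// Connect to TiDB with the ordinary MySQL driver and database/sql.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // plain MySQL driver, nothing TiDB-specific
)

func main() {
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var count int
	err = db.QueryRow("SELECT COUNT(*) FROM user WHERE age > ? AND age < ?", 20, 30).Scan(&count)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("count:", count)
}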
Schema change in distributed RDBMS?
● A must-have feature!
● But you don’t want to lock the whole table while changing schema.
○ A distributed database usually stores tons of data spanning multiple machines
● We need a non-blocking schema change algorithm (sketched below)
● Thanks to F1 again
○ Similar to "Online, Asynchronous Schema Change in F1" (Google, VLDB 2013)
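A hedged sketch of the state machine behind that algorithm (after the F1 paper): a new index steps through intermediate states so that servers are never more than one state apart and the table stays readable and writable throughout. The names here are illustrative, not TiDB's actual DDL code.

// Online schema change states for adding an index, in the F1 style.
package ddl

type schemaState int

const (
	stateAbsent     schemaState = iota // index does not exist yet
	stateDeleteOnly                    // deletes maintain the index; reads/writes don't see it
	stateWriteOnly                     // writes maintain the index; reads still don't use it
	statePublic                        // backfill of existing rows finished; index fully usable
)

// nextState advances one step at a time; the backfill of existing rows happens
// between write-only and public, while the table keeps serving traffic.
func nextState(s schemaState) schemaState {
	if s < statePublic {
		return s + 1
	}
	return statePublic
}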
Architecture (The whole picture)
[Diagram: Applications and MySQL clients (e.g. JDBC) speak the MySQL protocol to TiDB, which plays the role of F1; TiDB talks over RPC to TiKV, which plays the role of Spanner]
Testing
● Testing a distributed system is really hard
Embed testing into your design
● Design for testing
● Get tests from community
○ Lots of tests in MySQL drivers/connectors
○ Lots of ORMs
○ Lots of applications (record and replay)
And more
● Fault injection
○ Hardware
■ disk error
■ network card
■ cpu
■ clock
○ Software
■ file system
■ network & protocol
And more
● Simulate everything
○ Network example :
https://github.com/pingcap/tikv/pull/916/commits/3cf0f7248b32c3c523927eed5ebf82aabea481ec
Distributed testing
● Jepsen
● Namazu
○ ZooKeeper:
■ Found ZOOKEEPER-2212, ZOOKEEPER-2080 (race): (blog article)
○ Etcd:
■ Found etcdctl bug #3517 (timing specification), fixed in #3530. The fix also provided a hint for #3611
■ Reproduced flaky tests {#4006, #4039}
○ YARN:
■ Found YARN-4301 (fault tolerance), reproduced flaky tests {1978, 4168, 4543, 4548, 4556}
More to come
Distributed query plan - WIP
Change history (binlog) - WIP
Run TiDB on top of Kubernetes
Thanks
Q&A
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv
