2. About me and PingCAP
● Shen Li, VP of Engineering @ PingCAP
● A startup based in Beijing, China
● Series B funded, with $15 million raised
● TiDB: 400+ PoCs, 30+ adoptions
● We are setting up an office in the Bay Area, so we are hiring :)
3. Agenda
● Motivations
● The goals of TiDB
● The core components of TiDB
● The tools around TiDB
● Spark on TiKV
● Future plans
4. Why we built a new relational database
● Your RDBMS is becoming the performance bottleneck of your backend service
● The amount of data stored in the database is overwhelming
● You want to run complex queries on a sharded cluster
○ e.g. a simple JOIN or GROUP BY
● Your application needs ACID transactions on a sharded cluster
5. TiDB Project - Goal
● SQL is necessary
● Transparent sharding and data movement/balancing
● 100% OLTP + 80% OLAP
○ Transaction + Complex query
● 24/7 availability, even in case of datacenter outages
○ Thanks to the Raft consensus algorithm
● Compatible with MySQL, in most cases
● Open source, of course.
7. Storage stack 1/3
● TiKV is the underlying storage layer
● Physically, data is stored in RocksDB
● We built a Raft layer on top of RocksDB
○ What is Raft?
● Written in Rust!
[Diagram: TiKV stack, top to bottom — API (gRPC) / Transaction / MVCC / Raft (gRPC) / RocksDB]
● Raw KV API: https://github.com/pingcap/tidb/blob/master/cmd/benchraw/main.go
● Transactional KV API: https://github.com/pingcap/tidb/blob/master/cmd/benchkv/main.go
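The split between the two endpoints matters: the raw API skips the transaction and MVCC layers entirely. A hypothetical Go sketch of the two surfaces (illustrative names, not the real client API, whose concrete form lives behind gRPC in the benchmark tools linked above):

package tikvapi

// RawKV is a thin wrapper over the sorted Key-Value map: no transactions,
// no MVCC, just byte-string keys and values.
type RawKV interface {
    Get(key []byte) ([]byte, error)
    Put(key, value []byte) error
    Delete(key []byte) error
    Scan(startKey, endKey []byte, limit int) (keys, values [][]byte, err error)
}

// TxnKV layers snapshot reads and atomic multi-key commits on top of the
// same storage, using MVCC versions issued by a timestamp oracle.
type TxnKV interface {
    Begin() (Txn, error)
}

// Txn buffers writes locally and commits them with two-phase commit.
type Txn interface {
    Get(key []byte) ([]byte, error)
    Set(key, value []byte) error
    Commit() error
    Rollback() error
}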
8. Storage stack 2/3
Logical view of TiKV
● Key-Value storage
● A giant Key-Value map, sorted in byte order
● Split into Regions
● Region metadata: [start_key, end_key)
[Diagram: the TiKV key space (-∞, +∞) as a sorted map, cut into Regions of [start_key, end_key), roughly 256MB each]
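Region metadata is all a client needs to route a request. A small Go sketch (illustrative types) of locating the Region that owns a key, given the sorted Region list:

package region

import (
    "bytes"
    "sort"
)

// Region covers the half-open key range [StartKey, EndKey).
type Region struct {
    ID       uint64
    StartKey []byte // inclusive
    EndKey   []byte // exclusive; empty means +infinity
}

// locate returns the Region owning key, assuming the slice is sorted by
// StartKey and the Regions jointly cover the whole key space.
func locate(regions []Region, key []byte) Region {
    // Find the first Region whose EndKey lies strictly beyond key.
    i := sort.Search(len(regions), func(i int) bool {
        return len(regions[i].EndKey) == 0 ||
            bytes.Compare(regions[i].EndKey, key) > 0
    })
    return regions[i]
}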
9. Storage stack 3/3
● Data is organized by Regions
● Region: a set of contiguous Key-Value pairs
[Diagram: four RocksDB instances, each storing several Regions (Region 1: [a-e], Region 2: [f-j], Region 3: [k-o], Region 4: [p-t], Region 5: [u-z]); the replicas of each Region across instances form a Raft group. Per-node stack: RPC (gRPC) / Transaction / MVCC / Raft / RocksDB]
10. Dynamic Multi-Raft
● What’s Dynamic Multi-Raft?
○ Dynamic split / merge
● Safe split / merge
[Diagram: Region 1: [a-e] splits into Region 1.1: [a-c] and Region 1.2: [d-e]]
11. Safe Split: 1/4
[Diagram: Region 1: [a-e] replicated on TiKV1 (Leader), TiKV2 (Follower) and TiKV3 (Follower), connected by raft — one Raft group]
12. Safe Split: 2/4
[Diagram: the Leader on TiKV1 splits Region 1 into Region 1.1: [a-c] and Region 1.2: [d-e]; the Followers on TiKV2 and TiKV3 still hold Region 1: [a-e]]
13. Safe Split: 3/4
[Diagram: the split log is replicated by Raft from TiKV1 to the Followers on TiKV2 and TiKV3, which still hold Region 1: [a-e]]
14. Safe Split: 4/4
[Diagram: all three TiKVs now hold Region 1.1: [a-c] and Region 1.2: [d-e]; TiKV1 leads both new Raft groups, with TiKV2 and TiKV3 as Followers]
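The reason the split is safe: it is replicated as an ordinary Raft log entry, so every replica applies it at the same log index and derives identical child Regions. A Go sketch under that assumption (illustrative types; which half keeps the parent's ID is an implementation detail — here the right half does):

package split

// Region covers the half-open key range [StartKey, EndKey).
type Region struct {
    ID               uint64
    StartKey, EndKey []byte
}

// SplitLog is the command carried in the Raft log.
type SplitLog struct {
    SplitKey    []byte // where the parent range is cut
    NewRegionID uint64 // allocated ahead of time by PD
}

// apply runs on every replica, leader and followers alike, when the
// split entry commits; no data moves, only range metadata changes.
func apply(parent Region, l SplitLog) (left, right Region) {
    left = Region{ID: l.NewRegionID, StartKey: parent.StartKey, EndKey: l.SplitKey}
    right = Region{ID: parent.ID, StartKey: l.SplitKey, EndKey: parent.EndKey}
    return
}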
15. Scale-out (initial state)
[Diagram: Nodes A-D; Regions 1-3 each have three replicas spread over the nodes, with Node A holding the Region 1 leader (marked *)]
16. Scale-out (add new node)
1) Transfer leadership of Region 1 from Node A to Node B
[Diagram: a new, empty Node E joins; the Region 1 leader (*) is now on Node B]
17. Scale-out (balancing)
2) Add a Region 1 replica on Node E
[Diagram: a new replica of Region 1 is created on Node E]
18. Scale-out (balancing)
3) Remove the Region 1 replica from Node A
[Diagram: the Region 1 replica on Node A is removed; every Region is back to three replicas and the cluster is balanced]
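The order of the three steps is what keeps the group available throughout. A Go sketch of the same sequence as scheduling commands (illustrative shapes, not PD's actual API):

package balance

// Cmd is an illustrative scheduling command.
type Cmd struct {
    Op     string // "transfer-leader", "add-replica" or "remove-replica"
    Region string
    From   string
    To     string
}

// moveReplica evacuates one replica of a Region from src to dst without
// losing availability: leadership moves away first, and the new replica
// is added (catching up through Raft) before the old one is removed, so
// the Raft group never drops below quorum.
func moveReplica(region, src, dst, newLeader string) []Cmd {
    return []Cmd{
        {Op: "transfer-leader", Region: region, From: src, To: newLeader}, // step 1
        {Op: "add-replica", Region: region, To: dst},                      // step 2
        {Op: "remove-replica", Region: region, From: src},                 // step 3
    }
}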
19. ACID Transaction
● Based on Google Percolator
● ‘Almost’ decentralized 2-phase commit
○ Timestamp Allocator
● Optimistic transaction model
● Default isolation level: Snapshot Isolation
● We also support the Read Committed (RC) isolation level
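A much-simplified Go sketch of Percolator-style optimistic 2PC, assuming an MVCC store with per-key locks and a timestamp oracle (PD's role in TiDB); all types here are illustrative:

package txn

// Oracle hands out monotonically increasing timestamps.
type Oracle interface {
    GetTimestamp() uint64
}

// Store is a hypothetical MVCC store with Percolator-style locks.
type Store interface {
    // Prewrite writes value at startTS and locks key, pointing at the
    // primary key; it fails on a write conflict or an existing lock.
    Prewrite(key, value, primary []byte, startTS uint64) error
    // Commit replaces the lock with a commit record at commitTS.
    Commit(key []byte, startTS, commitTS uint64) error
}

// commit2PC commits a write set optimistically: conflicts surface at
// prewrite time instead of being prevented by locks taken up front.
func commit2PC(o Oracle, s Store, mutations map[string][]byte, primary string) error {
    startTS := o.GetTimestamp()

    // Phase 1: prewrite every key; the primary key's lock later decides
    // the fate of the whole transaction.
    for k, v := range mutations {
        if err := s.Prewrite([]byte(k), v, []byte(primary), startTS); err != nil {
            return err // conflict: the caller retries the transaction
        }
    }

    commitTS := o.GetTimestamp()

    // Phase 2: committing the primary makes the transaction durable;
    // the secondaries can be committed asynchronously afterwards.
    if err := s.Commit([]byte(primary), startTS, commitTS); err != nil {
        return err
    }
    for k := range mutations {
        if k != primary {
            _ = s.Commit([]byte(k), startTS, commitTS)
        }
    }
    return nil
}

Only the two timestamp requests touch a central service, which is why the slide calls the protocol ‘almost’ decentralized.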
20. Something we haven't mentioned
Now we have a distributed, transactional, auto-scalable Key-Value store. But a few pieces are still missing:
● Timestamp allocator
● Metadata storage
● Balancing decisions
Here comes the Placement Driver (PD for short)
21. Placement Driver
The brain of the TiKV cluster
● Timestamp allocator
● Metadata storage
● Replica scheduling
[Diagram: three PD nodes replicated via Raft, built on embedded etcd]
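A Go sketch of what a timestamp allocator in this spirit can look like: hybrid (physical, logical) timestamps with a persisted high-water mark so issued timestamps stay monotonic across restarts. The exact scheme below is an assumption, not PD's implementation; persistence is stubbed out where PD would write to etcd.

package tso

import (
    "sync"
    "time"
)

// TSO hands out strictly increasing hybrid timestamps: physical
// milliseconds in the high bits, a logical counter in the low 18 bits.
type TSO struct {
    mu       sync.Mutex
    physical int64 // last physical ms handed out
    logical  int64 // counter within one ms
    saved    int64 // highest physical ms persisted so far
}

// persist is a stub standing in for a durable write (etcd in PD's case).
func (t *TSO) persist(upToMs int64) { t.saved = upToMs }

// Next never repeats or goes backwards, even across a restart, because
// no timestamp beyond the persisted high-water mark is ever issued.
func (t *TSO) Next() uint64 {
    t.mu.Lock()
    defer t.mu.Unlock()

    now := time.Now().UnixMilli()
    if now > t.physical {
        t.physical, t.logical = now, 0
    } else {
        t.logical++ // clock hasn't moved: bump the logical counter
    }
    if t.physical >= t.saved {
        // Persist a window ahead so a crash cannot reissue a timestamp.
        t.persist(t.physical + 3000)
    }
    return uint64(t.physical)<<18 | uint64(t.logical)
}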
22. Scheduling Strategy
[Diagram: TiKV nodes holding Regions A-C send heartbeats with cluster info to PD; PD combines the cluster info, admin config, and its scheduling strategy to issue scheduling commands that move Regions between nodes]
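A toy Go version of the loop in the diagram, balancing on Region count only (real PD weighs sizes, leaders, labels and more; all names are illustrative):

package sched

// Heartbeat is the per-node report sent to PD.
type Heartbeat struct {
    Node        string
    RegionCount int
}

// Command is the scheduling command PD answers with.
type Command struct {
    Op       string // e.g. "move-region"
    FromNode string
    ToNode   string
}

// balance suggests moving one Region from the fullest node to the
// emptiest whenever the spread exceeds a threshold.
func balance(hbs []Heartbeat, threshold int) *Command {
    if len(hbs) < 2 {
        return nil
    }
    busiest, idlest := hbs[0], hbs[0]
    for _, hb := range hbs[1:] {
        if hb.RegionCount > busiest.RegionCount {
            busiest = hb
        }
        if hb.RegionCount < idlest.RegionCount {
            idlest = hb
        }
    }
    if busiest.RegionCount-idlest.RegionCount <= threshold {
        return nil // balanced enough: no command this round
    }
    return &Command{Op: "move-region", FromNode: busiest.Node, ToNode: idlest.Node}
}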
23. The SQL Layer
● Mapping the relational model to the Key-Value model
● Full-featured SQL layer
● Cost-based optimizer (CBO)
● Distributed execution engine
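The mapping works by giving every row and every index entry its own key in the sorted map, so a table's rows and an index's entries are adjacent and range-scannable. A Go sketch of the key shapes (human-readable strings here; the real encoding is binary and memcomparable):

package codec

import "fmt"

// Row key:   t{tableID}_r{rowID}         -> encoded column values
func rowKey(tableID, rowID int64) string {
    return fmt.Sprintf("t%d_r%d", tableID, rowID)
}

// Index key: t{tableID}_i{indexID}_{val} -> rowID
// An indexed lookup scans an index-key range, collects rowIDs, then
// fetches rows by rowKey — exactly the two-step plan on the next slides.
func indexKey(tableID, indexID int64, value string) string {
    return fmt.Sprintf("t%d_i%d_%s", tableID, indexID, value)
}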
26. What happens behind a query
CREATE TABLE t (c1 INT, c2 TEXT, KEY idx_c1(c1));
SELECT COUNT(c1) FROM t WHERE c1 > 10 AND c2 = 'seattle';
27. Query Plan
SELECT COUNT(c1) FROM t WHERE c1 > 10 AND c2 = 'seattle';
Physical Plan on TiKV (index scan):
● Read Index idx_c1: (10, +∞) → RowID
● Read Row Data by RowID → Row
● Filter: c2 = 'seattle' → Row
● Partial Aggregate: COUNT(c1)
Physical Plan on TiDB:
● DistSQL Scan: gather the partial COUNT(c1) results from each TiKV node
● Final Aggregate: SUM(COUNT(c1))
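The payoff of pushing the partial aggregate down: each TiKV node returns one integer per Region instead of the matching rows. A minimal Go sketch of the two stages:

package agg

// partialCount runs on each TiKV node, next to the data: count the
// non-NULL c1 values in the rows that survived the filter.
func partialCount(c1Values [][]byte) int64 {
    var n int64
    for _, v := range c1Values {
        if v != nil { // COUNT(c1) skips NULLs
            n++
        }
    }
    return n
}

// finalCount runs on TiDB over the DistSQL scan results: the final
// aggregate sums a handful of integers instead of pulling every row
// over the wire.
func finalCount(partials []int64) int64 {
    var total int64
    for _, p := range partials {
        total += p
    }
    return total
}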
28. What happens behind a query
CREATE TABLE t1 (id INT, email TEXT, KEY idx_id (id));
CREATE TABLE t2 (id INT, email TEXT, KEY idx_id (id));
SELECT * FROM t1 JOIN t2 WHERE t1.id = t2.id;
33. Syncer
● Synchronize data from MySQL in real-time
● Hook up as a MySQL replica
[Diagram: MySQL (master) streams its binlog to Syncer, which connects as a fake slave; Syncer applies a rule filter, keeps a save point on disk, and replicates to a downstream MySQL or, via additional Syncers, to one or more TiDB clusters]
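A Go sketch of that replication loop, with all types hypothetical: act as a slave, filter by rule, apply downstream, advance the save point.

package syncer

// Event is a row-change event decoded from the binlog.
type Event struct {
    Schema, Table string
    SQL           string // the statement to replay downstream
    Pos           int64  // binlog position after this event
}

// Downstream is anything that can execute SQL: MySQL or a TiDB cluster.
type Downstream interface {
    Exec(sql string) error
}

// replicate filters events by rule, replays them downstream, and only
// then advances the on-disk save point, so a restart resumes from the
// last applied position (at-least-once delivery).
func replicate(events <-chan Event, allow func(schema, table string) bool,
    dst Downstream, savePos func(int64) error) error {
    for ev := range events {
        if allow(ev.Schema, ev.Table) { // rule filter
            if err := dst.Exec(ev.SQL); err != nil {
                return err
            }
        }
        if err := savePos(ev.Pos); err != nil {
            return err
        }
    }
    return nil
}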
34. TiDB-Binlog
● Subscribe to the incremental data from TiDB
● Output Protobuf-formatted data or MySQL Binlog format (WIP)
[Diagram: each TiDB Server runs a Pumper; the Pumpers feed Cistern, which sorts the binlog and outputs either Protobuf data for 3rd-party applications and another TiDB cluster, or MySQL Binlog for MySQL]
35. MyDumper / Loader
● Backup/restore in parallel
● Works for TiDB too
● Actually, we don’t have our own data migration tool for now
36. Spark on TiKV
● TiSpark = Spark SQL on TiKV
○ Spark SQL directly on top of a distributed database storage engine
○ Two extension points in Spark SQL internals: Extra Optimizer Rules and Extra Strategies
○ Hijack Spark SQL's logical plan and inject our own physical executor
● Hybrid Transactional/Analytical Processing (HTAP) rocks
○ Provides strong OLAP capability together with TiDB
37. Spark on TiKV
[Diagram: TiSpark architecture — the application submits jobs to the Spark Driver; the Driver fetches metadata from the TiDB cluster and TSO/data locations from the PD cluster (3 PD nodes); Spark Workers read the TiKV cluster (storage) directly through the DistSQL API; Syncer feeds the TiDB cluster]
38. Spark on TiKV
● The TiKV connector is better than a JDBC connector
● Index support
● Complex calculation pushdown
● CBO
○ Picking the right access path
○ Join reordering
● Priority & isolation level
39. Future plans
● Shift from Pre-GA to GA
● Better optimizer (statistics & CBO)
● Smarter scheduling mechanism
● Document store for TiDB
○ MySQL 5.7.12+ X-Plugin
● Integrate TiDB with Kubernetes