Scale the Relational Database with
NewSQL
Shen Li @ PingCAP
About me and PingCAP
● Shen Li, VP of Engineering @ PingCAP
● A startup based in Beijing, China
● Series B funding of $15 million
● TiDB: 400+ PoCs, 30+ adoptions
● We are setting up an office in the Bay Area. So we are hiring :)
Agenda
● Motivations
● The goals of TiDB
● The core components of TiDB
● The tools around TiDB
● Spark on TiKV
● Future plans
Why we built a new relational database
● The RDBMS is becoming the performance bottleneck of your backend service
● The amount of data stored in the database is overwhelming
● You want to run complex queries on a sharded cluster
○ e.g. a simple JOIN or GROUP BY
● Your application needs ACID transactions on a sharded cluster
TiDB Project - Goal
● SQL is necessary
● Transparent sharding and data movement/balancing
● 100% OLTP + 80% OLAP
○ Transaction + Complex query
● 24/7 availability, even in case of datacenter outages
○ Thanks to the Raft consensus algorithm
● Compatible with MySQL, in most cases
● Open source, of course.
Architecture
[Diagram: a stateless SQL layer of TiDB servers on top of a distributed storage layer of TiKV nodes replicated with Raft; the Placement Driver (PD) handles the control flow (balance / failover, metadata / timestamp requests); all components talk over gRPC]
Storage stack 1/3
● TiKV is the underlying storage layer
● Physically, data is stored in RocksDB
● We build a Raft layer on top of RocksDB
○ What is Raft?
● Written in Rust!
[Diagram: TiKV's layered stack: API (gRPC) → Transaction → MVCC → Raft (gRPC) → RocksDB]
● Raw KV API: https://github.com/pingcap/tidb/blob/master/cmd/benchraw/main.go
● Transactional KV API: https://github.com/pingcap/tidb/blob/master/cmd/benchkv/main.go
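For a feel of the difference between the two APIs, here is a rough Go sketch; the interface names and method signatures below are illustrative assumptions, not the real client API (see the benchraw / benchkv examples linked above for actual usage).

package tikvapi

// Illustrative only: these interfaces approximate the shape of TiKV's two
// APIs; the names and signatures are assumptions, not the real client API.

// RawKV bypasses the transaction and MVCC layers: plain key-value operations.
type RawKV interface {
    Put(key, value []byte) error
    Get(key []byte) ([]byte, error)
    Delete(key []byte) error
}

// Txn is a single optimistic transaction: reads see a consistent snapshot at
// the transaction's start timestamp, writes are buffered until Commit.
type Txn interface {
    Get(key []byte) ([]byte, error)
    Set(key, value []byte) error
    Commit() error // runs 2-phase commit across the Regions it touched
    Rollback() error
}

// TxnKV is the transactional API: Begin fetches a start timestamp from PD.
type TxnKV interface {
    Begin() (Txn, error)
}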
Storage Stack 2/3
Logical view of TiKV
● Key-Value storage
● Giant sorted (in byte-order) Key-Value map
● Split into regions
● Metadata: [start_key, end_key)
[Diagram: the TiKV key space is a single sorted map over (-∞, +∞), split into Regions; each Region covers a [start_key, end_key) range and is roughly 256 MB]
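To make the [start_key, end_key) metadata concrete, here is a minimal Go sketch of locating the Region that owns a key; it only illustrates the layout above and is not TiKV's or PD's actual code.

package regionmeta

import (
    "bytes"
    "sort"
)

// Region describes one [StartKey, EndKey) slice of the sorted key space.
type Region struct {
    ID       uint64
    StartKey []byte // inclusive
    EndKey   []byte // exclusive; empty means +∞
}

// Locate returns the Region whose range contains key, assuming regions is
// sorted by StartKey and covers the key space without gaps.
func Locate(regions []Region, key []byte) *Region {
    i := sort.Search(len(regions), func(i int) bool {
        return bytes.Compare(regions[i].StartKey, key) > 0
    })
    if i == 0 {
        return nil // key sorts before the first Region
    }
    r := &regions[i-1]
    if len(r.EndKey) == 0 || bytes.Compare(key, r.EndKey) < 0 {
        return r
    }
    return nil
}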
Storage stack 3/3
● Data is organized by Regions
● Region: a contiguous range of Key-Value pairs
[Diagram: Regions 1-5 (Region 1:[a-e], Region 2:[f-j], Region 3:[k-o], Region 4:[p-t], Region 5:[u-z]) are spread across four RocksDB instances, three replicas each; the replicas of a Region form a Raft group]
[Diagram: per-node stack: RPC (gRPC) → Transaction → MVCC → Raft → RocksDB]
Dynamic Multi-Raft
● What's Dynamic Multi-Raft?
○ Dynamic split / merge
● Safe split / merge
[Diagram: Region 1:[a-e] splits into Region 1.1:[a-c] and Region 1.2:[d-e]]
Safe Split: 1/4
[Diagram: TiKV1, TiKV2, and TiKV3 each hold a replica of Region 1:[a-e]; the leader on TiKV1 and the two followers form one Raft group]
Safe Split: 2/4
[Diagram: the leader on TiKV1 splits its replica into Region 1.1:[a-c] and Region 1.2:[d-e]; the followers on TiKV2 and TiKV3 still hold Region 1:[a-e]]
Safe Split: 3/4
[Diagram: the split log is replicated from the leader to the followers through Raft, so TiKV2 and TiKV3 will apply the same split]
Safe Split: 4/4
[Diagram: after the split log is applied, TiKV1, TiKV2, and TiKV3 all hold Region 1.1:[a-c] and Region 1.2:[d-e]; each new Region runs its own Raft group, with the leaders on TiKV1]
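As a rough illustration of what the replicated split command does (assumed shapes, not TiKV's real data structures): every replica applies the same split at the same Raft log position, so they all end up with identical Region boundaries.

package regionsplit

// Region is one [StartKey, EndKey) key range, as in the earlier sketch.
type Region struct {
    ID       uint64
    StartKey []byte
    EndKey   []byte
}

// Split divides a Region into two at splitKey: [start, splitKey) and
// [splitKey, end). In TiKV the split is proposed as a command through Raft,
// so the leader and its followers all apply it deterministically.
func Split(r Region, splitKey []byte, newID uint64) (left, right Region) {
    left = Region{ID: r.ID, StartKey: r.StartKey, EndKey: splitKey}
    right = Region{ID: newID, StartKey: splitKey, EndKey: r.EndKey}
    return left, right
}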
Scale-out (initial state)
[Diagram: four nodes (A-D); Regions 1, 2, and 3 each have three replicas spread across them, with the Region 1 leader (marked *) on Node A]
Scale-out (add new node)
1) Transfer leadership of Region 1 from Node A to Node B
[Diagram: an empty Node E joins the cluster; after the transfer, the Region 1 leader (*) is on Node B]
Scale-out (balancing)
2) Add a replica of Region 1 to Node E
[Diagram: Node E now holds a replica of Region 1 as well]
Scale-out (balancing)
3) Remove the replica of Region 1 from Node A
[Diagram: Node A no longer holds Region 1; the data is now balanced across Nodes A-E]
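The three scale-out steps above can be read as a short sequence of scheduling operations issued by PD. A hedged sketch in Go; the operation names are assumptions, not PD's actual operator types.

package scaleout

// Step is one scheduling operation sent from PD to a TiKV node.
type Step struct {
    Op     string // "transfer-leader", "add-replica" or "remove-replica"
    Region uint64
    From   string
    To     string
}

// The scale-out shown on the slides above, expressed as steps.
var scaleOutRegion1 = []Step{
    {Op: "transfer-leader", Region: 1, From: "Node A", To: "Node B"},
    {Op: "add-replica", Region: 1, To: "Node E"},
    {Op: "remove-replica", Region: 1, From: "Node A"},
}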
ACID Transaction
● Based on Google Percolator
● ‘Almost’ decentralized 2-phase commit
○ Timestamp Allocator
● Optimistic transaction model
● Default isolation level: Snapshot Isolation
● We also support Read Committed (RC) isolation
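A minimal Go sketch of the Percolator-style commit flow, assuming a timestamp service and a KV store with prewrite/commit operations; the interfaces are illustrative, and retries, lock resolution, and asynchronous commit of secondary keys are left out.

package percolator

// TSOClient hands out monotonically increasing timestamps (PD's job in TiDB).
type TSOClient interface {
    GetTS() (uint64, error)
}

// Store exposes the two phases of Percolator's commit protocol.
// (Illustrative interface, not TiKV's real API.)
type Store interface {
    Prewrite(key, value, primary []byte, startTS uint64) error
    Commit(key []byte, startTS, commitTS uint64) error
}

type mutation struct{ key, value []byte }

// commitTxn prewrites every key (the first one acts as the primary lock),
// then fetches a commit timestamp and commits the primary key. Once the
// primary commit succeeds the transaction is durable; secondary keys can be
// committed lazily afterwards.
func commitTxn(tso TSOClient, store Store, startTS uint64, muts []mutation) error {
    if len(muts) == 0 {
        return nil
    }
    primary := muts[0].key

    // Phase 1: prewrite. Any write conflict or lock aborts the transaction
    // (optimistic model: conflicts are detected at commit time).
    for _, m := range muts {
        if err := store.Prewrite(m.key, m.value, primary, startTS); err != nil {
            return err
        }
    }

    // Phase 2: commit the primary key at a fresh commit timestamp.
    commitTS, err := tso.GetTS()
    if err != nil {
        return err
    }
    return store.Commit(primary, startTS, commitTS)
}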
Something we haven't mentioned
Now we have a distributed, transactional, auto-scalable key-value store. But something still has to provide:
● Timestamp allocator
● Metadata storage
● Balancing decisions
Here comes the Placement Driver (PD for short)
Placement Driver
The brain of the TiKV cluster
● Timestamp allocator
● Metadata storage
● Replica scheduling
[Diagram: three PD nodes replicated with Raft; etcd is embedded in PD]
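A simplified Go sketch of the timestamp allocator: PD's TSO combines a physical clock with a logical counter so timestamps stay strictly increasing even within one millisecond. This is a single-node simplification; the bit layout is an assumption, and the real allocator also persists and leases time windows through the embedded etcd.

package tso

import (
    "sync"
    "time"
)

// TSO is a simplified timestamp oracle: physical milliseconds in the high
// bits, a logical counter in the low bits.
type TSO struct {
    mu       sync.Mutex
    physical int64 // milliseconds since the Unix epoch
    logical  uint64
}

const logicalBits = 18 // assumed layout; 262144 timestamps per millisecond

// Next returns a strictly increasing 64-bit timestamp.
func (t *TSO) Next() uint64 {
    t.mu.Lock()
    defer t.mu.Unlock()
    now := time.Now().UnixNano() / int64(time.Millisecond)
    if now > t.physical {
        t.physical, t.logical = now, 0
    } else {
        t.logical++ // clock hasn't advanced: bump the logical part
    }
    return uint64(t.physical)<<logicalBits | t.logical
}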
Scheduling Strategy
[Diagram: TiKV nodes send heartbeats carrying cluster info to PD; PD combines the cluster info, its scheduling strategy, and the admin's config, then sends scheduling commands back that move Region replicas (A, B, C) between nodes]
The SQL Layer
● Mapping relational model to Key-Value model
● Full-featured SQL layer
● Cost-based optimizer (CBO)
● Distributed execution engine
SQL to Key-Value
● Row
    Key:   TableID + RowID
    Value: Row Value
● Index
    Key:   TableID + IndexID + Index-Column-Values
    Value: RowID

CREATE TABLE `t` (`id` int, `age` int, KEY `age_idx` (`age`));
INSERT INTO `t` VALUES (100, 35);

Encoded keys in TiKV:
    K1: tid + rowid        → value: the row (100, 35)
    K2: tid + idxid + 35   → value: the RowID, pointing at K1
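A rough Go sketch of the key layout above. The real TiDB codec uses marker prefixes and memcomparable encoding for every column type; here everything is simplified to big-endian uint64s, which preserves ordering for unsigned values, and the table/index IDs are made up for the example.

package keycodec

import (
    "encoding/binary"
    "fmt"
)

// appendUint64 appends v in big-endian form so that byte order matches
// numeric order (TiKV sorts keys by bytes).
func appendUint64(buf []byte, v uint64) []byte {
    var b [8]byte
    binary.BigEndian.PutUint64(b[:], v)
    return append(buf, b[:]...)
}

// RowKey builds TableID + RowID.
func RowKey(tableID, rowID uint64) []byte {
    return appendUint64(appendUint64([]byte("t"), tableID), rowID)
}

// IndexKey builds TableID + IndexID + indexed column value.
func IndexKey(tableID, indexID, colValue uint64) []byte {
    return appendUint64(appendUint64(appendUint64([]byte("t"), tableID), indexID), colValue)
}

// Example: INSERT INTO t VALUES (100, 35) with index age_idx on `age`.
func Example() {
    k1 := RowKey(1, 100)     // value stored at K1: the encoded row (100, 35)
    k2 := IndexKey(1, 1, 35) // value stored at K2: the RowID 100, pointing at K1
    fmt.Printf("K1=%x\nK2=%x\n", k1, k2)
}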
SQL Layer Overview
What happens behind a query
CREATE TABLE t (c1 INT, c2 TEXT, KEY idx_c1(c1));
SELECT COUNT(c1) FROM t WHERE c1 > 10 AND c2 = 'seattle';
Query Plan
SELECT COUNT(c1) FROM t WHERE c1 > 10 AND c2 = 'seattle';
Physical plan on TiKV (index scan):
    Read Index idx_c1: (10, +∞) → read row data by RowID → Filter c2 = 'seattle' → Partial Aggregate COUNT(c1)
Physical plan on TiDB:
    DistSQL Scan collects the partial COUNT(c1) results from each TiKV node → Final Aggregate SUM(COUNT(c1))
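A toy Go sketch of why the final aggregate is SUM(COUNT(c1)) rather than another COUNT: each TiKV coprocessor returns a partial count for its Regions, and TiDB adds the partial counts up. Filtering and the RPC layer are omitted.

package main

import "fmt"

// partialCount plays the role of one TiKV coprocessor: it counts the
// non-NULL c1 values of the rows in its Regions that pass the filters.
func partialCount(c1Values []int64) int64 {
    return int64(len(c1Values))
}

func main() {
    // Partial results returned by three TiKV nodes via DistSQL.
    partials := []int64{
        partialCount([]int64{11, 12}), // TiKV 1
        partialCount([]int64{42}),     // TiKV 2
        partialCount(nil),             // TiKV 3: no matching rows
    }

    // Final aggregate on TiDB: SUM over the partial COUNTs.
    var total int64
    for _, c := range partials {
        total += c
    }
    fmt.Println("COUNT(c1) =", total) // 3
}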
What happens behind a query
CREATE TABLE t1(id INT, email TEXT,KEY idx_id(id));
CREATE TABLE t2(id INT, email TEXT, KEY idx_id(id));
SELECT * FROM t1 JOIN t2 WHERE t1.id = t2.id;
[Diagram: Hash Join operator]
Supported Join Operators
● Hash Join
● Sort-Merge Join
● Index Lookup Join
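A minimal Go sketch of a hash join for the query above: build a hash table on one side's join key, then probe it with the other side. TiDB's real executors are vectorized and chosen by cost; this only shows the basic shape of the operator.

package main

import "fmt"

type row struct {
    id    int
    email string
}

// hashJoin joins build and probe on id. The build side (ideally the smaller
// table) is hashed once; every probe row then looks up its matches.
func hashJoin(build, probe []row) [][2]row {
    ht := make(map[int][]row, len(build))
    for _, b := range build {
        ht[b.id] = append(ht[b.id], b)
    }
    var out [][2]row
    for _, p := range probe {
        for _, b := range ht[p.id] {
            out = append(out, [2]row{b, p})
        }
    }
    return out
}

func main() {
    t1 := []row{{1, "a@t1.com"}, {2, "b@t1.com"}}
    t2 := []row{{2, "c@t2.com"}, {3, "d@t2.com"}}
    fmt.Println(hashJoin(t1, t2)) // only id = 2 matches
}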
Cost-Based Optimizer
● Predicate Pushdown
● Column Pruning
● Eager Aggregate
● Convert Subquery to Join
● Statistics framework
● CBO Framework
○ Index Selection
○ Join Operator Selection
○ Stream operators vs. Hash operators
Tools matter
● Syncer
● TiDB-Binlog
● Mydumper / Loader
Syncer
● Synchronize data from MySQL in real-time
● Hook up as a MySQL replica
[Diagram: Syncer connects to the MySQL master as a fake slave, pulls the binlog, applies rule filters, keeps a save point on disk, and replicates into one or more TiDB clusters or another MySQL]
TiDB-Binlog
● Subscribe to the incremental data from TiDB
● Output Protobuf-formatted data or MySQL Binlog format (WIP)
[Diagram: each TiDB server runs a Pumper; Cistern collects and sorts the binlogs, then delivers Protobuf data to 3rd-party applications, MySQL Binlog to MySQL, or feeds another TiDB cluster]
MyDumper / Loader
● Backup/restore in parallel
● Works for TiDB too
● We don't have our own data migration tool yet
Spark on TiKV
● TiSpark = Spark SQL on TiKV
○ Spark SQL runs directly on top of a distributed database storage engine
○ Uses two extension points of Spark SQL internals: Extra Optimizer Rules and Extra Strategies
○ Hijacks the Spark SQL logical plan and injects our own physical executor
● Hybrid Transactional/Analytical Processing (HTAP) rocks
○ Provides strong OLAP capability together with TiDB
Spark on TiKV
[Diagram: the application submits jobs to the Spark driver; TiSpark workers in the Spark cluster read TiKV data through the DistSQL API and fetch TSO / data-location metadata from the PD cluster; the TiDB cluster shares the same TiKV storage, and Syncer can feed data into it]
Spark on TiKV
● The TiKV Connector is better than the JDBC connector
● Index support
● Complex Calculation Pushdown
● CBO
○ Pick the right Access Path
○ Join Reorder
● Priority & Isolation Level
Future plans
● Shift from Pre-GA to GA
● Better optimizer (Statistics & CBO)
● Smarter scheduling mechanism
● Document store for TiDB
○ MySQL 5.7.12+ X-Plugin
● Integrate TiDB with Kubernetes
Thanks
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv
Contact me:
shenli@pingcap.com
