TiDB for Big Data
shenli@PingCAP
About me
● Shen Li (申砾)
● Tech Lead of TiDB, VP of Engineering
● Netease / 360 / PingCAP
● Infrastructure software engineer
What is Big Data?
Big Data is a term for data sets that are so large or complex that traditional
data processing application software is inadequate to deal with them.
---- From Wikipedia
Big Data Landscape
OLTP and OLAP
What is TiDB
● SQL is necessary
● Scale is easy
● Compatible with MySQL in most cases
● OLTP + OLAP = HTAP (Hybrid Transactional/Analytical Processing)
○ Transaction + Complex query
● 24/7 availability, even in case of datacenter outages
○ Thanks to Raft consensus algorithm
● Open source, of course.
Architecture
[Architecture diagram: a stateless SQL layer of TiDB servers on top of a distributed storage layer of TiKV servers replicated via Raft; the Placement Driver (PD) owns the control flow (balance / failover, metadata / timestamp requests); all components talk over gRPC]
Storage stack 1/2
● TiKV is the underlying storage layer
● Physically, data is stored in RocksDB
● We build a Raft layer on top of RocksDB
○ What is Raft?
● Written in Rust!
[Diagram: the TiKV stack, top to bottom: API (gRPC) → Transaction → MVCC → Raft (gRPC) → RocksDB]
Raw KV API: https://github.com/pingcap/tidb/blob/master/cmd/benchraw/main.go
Transactional KV API: https://github.com/pingcap/tidb/blob/master/cmd/benchkv/main.go
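The split between those two entry points is the key design point: the raw API writes a key straight through, while the transactional API buffers writes and commits them atomically. A minimal in-memory Go sketch of that distinction (all types hypothetical; the linked benchraw/benchkv tools exercise the real clients):

```go
package main

import (
	"fmt"
	"sync"
)

// store stands in for RocksDB: the raw KV layer writes straight to it.
type store struct {
	mu sync.Mutex
	m  map[string]string
}

func (s *store) rawPut(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

// txn stands in for the transactional layer: writes are buffered
// locally and applied atomically at commit. The real layer adds MVCC
// versions and two-phase commit across Regions.
type txn struct {
	s      *store
	buffer map[string]string
}

func (s *store) begin() *txn {
	return &txn{s: s, buffer: map[string]string{}}
}

func (t *txn) set(k, v string) { t.buffer[k] = v }

func (t *txn) commit() {
	t.s.mu.Lock()
	defer t.s.mu.Unlock()
	for k, v := range t.buffer { // all-or-nothing
		t.s.m[k] = v
	}
}

func main() {
	s := &store{m: map[string]string{}}
	s.rawPut("a", "1") // raw API: immediate single-key write

	tx := s.begin() // transactional API: buffered, atomic multi-key commit
	tx.set("b", "2")
	tx.set("c", "3")
	tx.commit()
	fmt.Println(s.m) // map[a:1 b:2 c:3]
}
```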
Storage stack 2/2
● Data is organized by Regions
● Region: a set of contiguous key-value pairs
[Diagram: four RocksDB instances, each holding several Regions such as Region 1:[a-e], Region 2:[f-j], Region 3:[k-o], Region 4:[p-t], Region 5:[u-z]; the replicas of each Region, spread across instances, form one Raft group]
[Diagram: the per-instance stack: RPC (gRPC) → Transaction → MVCC → Raft → RocksDB]
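Because a Region owns one contiguous range, routing a key to its Region is just a sorted-range lookup. A toy Go sketch of that lookup (the types and the fixed region table are illustrative; real routing is served by PD and cached by clients):

```go
package main

import (
	"fmt"
	"sort"
)

// Region owns the contiguous key range [Start, End).
type Region struct {
	ID         int
	Start, End string
}

// Sorted by Start key, mirroring the routing info PD hands out.
var regions = []Region{
	{1, "a", "f"}, {2, "f", "k"}, {3, "k", "p"}, {4, "p", "u"}, {5, "u", "\xff"},
}

// locate binary-searches for the Region whose range contains key.
// Assumes key falls inside the table's total range.
func locate(key string) Region {
	i := sort.Search(len(regions), func(i int) bool { return regions[i].End > key })
	return regions[i]
}

func main() {
	fmt.Println(locate("golang")) // {2 f k}: 'g' falls in [f, k)
}
```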
Dynamic Multi-Raft
● What’s Dynamic Multi-Raft?
○ Dynamic split / merge
● Safe split / merge
Region 1:[a-e] → split → Region 1.1:[a-c] + Region 1.2:[d-e]
Safe Split: 1/4
[Diagram: Region 1:[a-e] replicated on TiKV1 (Leader), TiKV2, and TiKV3 (Followers), forming one Raft group]
Safe Split: 2/4
[Diagram: the Leader on TiKV1 splits locally into Region 1.1:[a-c] and Region 1.2:[d-e]; TiKV2 and TiKV3 still hold Region 1:[a-e]]
Safe Split: 3/4
[Diagram: the split log is replicated through Raft from the Leader on TiKV1 to the Followers on TiKV2 and TiKV3]
Safe Split: 4/4
[Diagram: every peer has applied the split; Region 1.1:[a-c] and Region 1.2:[d-e] now each form their own Raft group, with both Leaders on TiKV1]
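The point of the walkthrough: a split is not a local operation but a command replicated through the Region's Raft log, so every peer applies the identical split at the same log position. A schematic Go sketch of applying such an entry (all types hypothetical):

```go
package main

import "fmt"

// region is one peer's record of a key range it serves.
type region struct {
	id         string
	start, end string
}

// splitCmd models the split log: it is committed through Raft, so every
// peer sees it at the same log position and applies the same split.
type splitCmd struct {
	regionID, splitKey, newID string
}

// peer is one TiKV replica's view of its regions.
type peer struct {
	name    string
	regions map[string]*region
}

// apply executes a committed split entry deterministically.
func (p *peer) apply(c splitCmd) {
	old := p.regions[c.regionID]
	p.regions[c.newID] = &region{id: c.newID, start: c.splitKey, end: old.end}
	old.end = c.splitKey // the old region keeps [start, splitKey)
}

func main() {
	cmd := splitCmd{regionID: "1", splitKey: "d", newID: "1.2"}
	for _, name := range []string{"TiKV1", "TiKV2", "TiKV3"} {
		p := &peer{name: name, regions: map[string]*region{
			"1": {id: "1", start: "a", end: "f"},
		}}
		p.apply(cmd) // same entry => identical result on every replica
		fmt.Println(p.name, *p.regions["1"], *p.regions["1.2"])
	}
}
```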
Scale-out (initial state)
[Diagram: Regions 1-3, three replicas each, spread across Nodes A-D; Region 1's Leader (*) is on Node A]
Scale-out (add new node)
1) Transfer leadership of Region 1 from Node A to Node B
[Diagram: Node E joins the cluster; Region 1's Leader (*) is now on Node B]
Scale-out (balancing)
2) Add a replica of Region 1 on Node E
3) Remove the replica of Region 1 from Node A
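PD performs this scale-out as a sequence of individually safe steps, so the Raft group keeps a quorum throughout. A condensed Go sketch of the three steps above (hypothetical types, ignoring the Raft machinery underneath):

```go
package main

import "fmt"

// placement maps node name -> whether that replica is the Raft leader.
type placement map[string]bool

func transferLeader(p placement, from, to string) { p[from], p[to] = false, true }
func addReplica(p placement, node string)         { p[node] = false }
func removeReplica(p placement, node string)      { delete(p, node) }

func main() {
	// Region 1 starts with three replicas; the leader (*) is on Node A.
	region1 := placement{"Node A": true, "Node B": false, "Node D": false}

	transferLeader(region1, "Node A", "Node B") // 1) move leadership off Node A
	addReplica(region1, "Node E")               // 2) add a replica on the new node
	removeReplica(region1, "Node A")            // 3) drop the old replica
	fmt.Println(region1)                        // a quorum exists at every step
}
```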
ACID Transaction
● Based on Google Percolator
● ‘Almost’ decentralized 2-phase commit
○ Timestamp Allocator
● Optimistic transaction model
● Default isolation level: Repeatable Read
● External consistency: Snapshot Isolation + Lock
○ SELECT … FOR UPDATE
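Roughly, the only centralized piece is PD's timestamp allocator; commit itself is client-driven two-phase commit in the Percolator style: prewrite locks every key through a designated primary, and committing the primary decides the transaction. A schematic in-memory Go sketch, not TiKV's actual code path:

```go
package main

import (
	"errors"
	"fmt"
)

var ts int // stand-in for PD, the timestamp allocator

func getTS() int { ts++; return ts }

// keyState is one key's MVCC record, heavily simplified.
type keyState struct {
	lockedBy string // primary key of the locking txn; "" if unlocked
	value    string
	commitTS int
}

var store = map[string]*keyState{}

// prewrite: phase one. Lock every key, pointing each lock at the
// primary; any existing lock means a conflict, so abort and retry.
func prewrite(primary string, writes map[string]string) error {
	for k := range writes {
		if s, ok := store[k]; ok && s.lockedBy != "" {
			return errors.New("write conflict on " + k)
		}
	}
	for k, v := range writes {
		store[k] = &keyState{lockedBy: primary, value: v}
	}
	return nil
}

// commit: phase two. In Percolator, committing the primary key alone
// decides the transaction; here all locks are cleared in one pass.
func commit(writes map[string]string, commitTS int) {
	for k := range writes {
		store[k].lockedBy = ""
		store[k].commitTS = commitTS
	}
}

func main() {
	startTS := getTS()
	writes := map[string]string{"a": "1", "b": "2"}
	if err := prewrite("a", writes); err != nil {
		fmt.Println("abort:", err)
		return
	}
	commit(writes, getTS())
	fmt.Printf("txn [start=%d] committed: a=%+v\n", startTS, *store["a"])
}
```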
Distributed SQL
● Full-featured SQL layer
● Predicate pushdown
● Distributed join
● Distributed cost-based optimizer (Distributed CBO)
TiDB SQL Layer overview
What happens behind a query
CREATE TABLE t (c1 INT, c2 TEXT, KEY idx_c1(c1));
SELECT COUNT(c1) FROM t WHERE c1 > 10 AND c2 = 'golang';
Query Plan
Physical Plan on TiKV (index scan):
Read Index idx_c1: (10, +∞) → Read Row Data by RowID → Filter: c2 = 'golang' → Partial Aggregate: COUNT(c1)
Physical Plan on TiDB:
DistSQL Scan (collect partial results from each TiKV) → Final Aggregate: SUM(COUNT(c1))
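The pattern worth noting is aggregate decomposition: COUNT(c1) is pushed down as a per-Region partial count, and TiDB's final aggregate is a SUM over those partials. A toy Go sketch of the two levels (data invented for illustration):

```go
package main

import "fmt"

type row struct {
	c1 int
	c2 string
}

// partialCount is what one TiKV coprocessor computes for its Region:
// COUNT(c1) over rows matching c1 > 10 AND c2 = 'golang'.
func partialCount(rows []row) int {
	n := 0
	for _, r := range rows {
		if r.c1 > 10 && r.c2 == "golang" {
			n++
		}
	}
	return n
}

func main() {
	regions := [][]row{
		{{12, "golang"}, {5, "golang"}}, // Region on TiKV 1
		{{30, "rust"}, {42, "golang"}},  // Region on TiKV 2
	}
	// TiDB's final aggregate: SUM over the partial COUNTs.
	total := 0
	for _, rows := range regions {
		total += partialCount(rows)
	}
	fmt.Println("COUNT(c1) =", total) // 2
}
```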
What happens behind a query
CREATE TABLE `left` (id INT, email TEXT, KEY idx_id(id));
CREATE TABLE `right` (id INT, email TEXT, KEY idx_id(id));
SELECT * FROM `left` JOIN `right` WHERE `left`.id = `right`.id;
Distributed Join (HashJoin)
Supported Distributed Join Types
● Hash Join
● Sort-merge Join
● Index-lookup Join
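As a reminder of the core step each worker runs: a hash join builds a hash table on one side's join key and probes it with the other. A minimal single-node Go sketch (distribution and partitioning omitted):

```go
package main

import "fmt"

type rec struct {
	id    int
	email string
}

// hashJoin builds a hash table on one side's join key, then probes it
// with the other side; each match produces one joined output row.
func hashJoin(build, probe []rec) [][2]rec {
	ht := make(map[int][]rec, len(build))
	for _, b := range build {
		ht[b.id] = append(ht[b.id], b)
	}
	var out [][2]rec
	for _, p := range probe {
		for _, b := range ht[p.id] {
			out = append(out, [2]rec{b, p})
		}
	}
	return out
}

func main() {
	left := []rec{{1, "a@x.com"}, {2, "b@x.com"}}
	right := []rec{{2, "c@y.com"}, {3, "d@y.com"}}
	fmt.Println(hashJoin(left, right)) // the single match on id = 2
}
```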
Hybrid Transactional/Analytical Processing
TiDB with the Big Data Ecosystem
Syncer
● Synchronize data from MySQL in real time
● Hook up as a MySQL replica
[Diagram: MySQL (master) → binlog → Syncer (acts as a fake slave; applies a rule filter and persists a savepoint to disk) → MySQL or one or more TiDB clusters, optionally fanned out through further Syncers]
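The rule filter's job is to decide, per binlog event, whether that schema and table should be replicated. A simplified Go sketch of such matching (the rule shapes are assumptions in the spirit of MySQL's replicate-do-db / replicate-do-table, not Syncer's actual config):

```go
package main

import "fmt"

// rules is a simplified stand-in for Syncer's replication filter.
type rules struct {
	doDBs    map[string]bool
	doTables map[string]bool // keyed as "db.table"
}

// allow reports whether a binlog event for db.table should be replicated.
func (r rules) allow(db, table string) bool {
	return r.doDBs[db] || r.doTables[db+"."+table]
}

func main() {
	r := rules{
		doDBs:    map[string]bool{"orders": true},
		doTables: map[string]bool{"users.accounts": true},
	}
	fmt.Println(r.allow("orders", "items"))   // true: whole database replicated
	fmt.Println(r.allow("users", "accounts")) // true: explicit table rule
	fmt.Println(r.allow("users", "sessions")) // false: filtered out
}
```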
TiDB-Binlog
● Subscribe to the incremental data from TiDB
● Output Protobuf-formatted data, or MySQL binlog format (WIP)
[Diagram: each TiDB Server streams its binlog to a Pumper; Cistern collects the streams and sorts them, then emits Protobuf to 3rd-party applications or MySQL binlog to MySQL / another TiDB cluster]
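The Sorter's task is to merge the per-server Pumper streams into a single stream ordered by commit timestamp, which is a k-way merge. A toy Go sketch (the entry shape is hypothetical):

```go
package main

import (
	"container/heap"
	"fmt"
)

// entry is one binlog item from a Pumper stream.
type entry struct {
	commitTS int
	src      int // index of the Pumper stream it came from
}

type minHeap []entry

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].commitTS < h[j].commitTS }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(entry)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

func main() {
	// Each Pumper delivers its own stream already ordered by commit TS.
	streams := [][]int{{1, 4, 9}, {2, 3, 8}, {5, 6, 7}}
	h := &minHeap{}
	next := make([]int, len(streams)) // next unread index per stream
	for i, s := range streams {
		heap.Push(h, entry{s[0], i})
		next[i] = 1
	}
	for h.Len() > 0 {
		e := heap.Pop(h).(entry)
		fmt.Print(e.commitTS, " ") // 1 2 3 4 5 6 7 8 9
		if next[e.src] < len(streams[e.src]) {
			heap.Push(h, entry{streams[e.src][next[e.src]], e.src})
			next[e.src]++
		}
	}
	fmt.Println()
}
```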
TiSpark
TiDB + SparkSQL = TiSpark
[Diagram: the Spark Master and each Spark Executor embed a TiKV Connector; the connectors fetch metadata from PD and read data through the TiKV cluster's storage and coprocessor layer, alongside the existing TiDB servers]
TiSpark
● TiKV Connector is better than a JDBC connector
● Index support
● Complex Calculation Pushdown
● CBO
○ Pick the right access path
○ Join Reorder
● Priority & Isolation Level
Too Abstract? Let’s get concrete.
[Diagram: Spark hands the SQL plan to TiKV; the Coprocessor executes the pushdown plan]
SQL: SELECT SUM(score) FROM t1 WHERE school = 'engineering' GROUP BY class;
Pushdown Plan: Sum(score), Group by class, Table: t1, Filter: school = 'engineering'
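Concretely, each Region's coprocessor evaluates the filter and the grouped partial aggregate locally, and the caller only merges group maps. A toy Go sketch of that pushdown (invented data; mirrors the partial/final aggregation shown earlier):

```go
package main

import "fmt"

type row struct {
	school, class string
	score         int
}

// coprocess runs the pushed-down plan on one Region's rows:
// Filter(school = 'engineering') -> partial SUM(score) GROUP BY class.
func coprocess(rows []row) map[string]int {
	partial := map[string]int{}
	for _, r := range rows {
		if r.school == "engineering" {
			partial[r.class] += r.score
		}
	}
	return partial
}

func main() {
	region1 := []row{{"engineering", "cs", 90}, {"arts", "cs", 70}}
	region2 := []row{{"engineering", "cs", 80}, {"engineering", "ee", 60}}

	// The caller (TiSpark here, TiDB in the earlier example) only
	// merges the per-Region group maps.
	final := map[string]int{}
	for _, partial := range []map[string]int{coprocess(region1), coprocess(region2)} {
		for class, sum := range partial {
			final[class] += sum
		}
	}
	fmt.Println(final) // map[cs:170 ee:60]
}
```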
Use Case
Use Case         | MySQL                            | Spark                | TiDB      | TiSpark
Large aggregates | Slow, or impossible beyond scale | Well supported       | Supported | Well supported
Large joins      | Slow, or impossible beyond scale | Well supported       | Supported | Well supported
Point query      | Fast                             | Very slow on HDFS    | Fast      | Fast
Modification     | Supported                        | Not possible on HDFS | Supported | Supported
Benefit
● Analytical / Transactional support all on one platform
○ No need for ETL
○ Real-time query with Spark
○ Possibility of getting rid of Hadoop
● Embrace the Spark ecosystem
○ Support for complex transformations and analytics with Scala, Python, and R
○ Machine Learning Libraries
○ Spark Streaming
Current Status
● Phase 1: (will be released with GA)
○ Aggregates pushdown
○ Type System
○ Filter Pushdown and Access Path selection
● Phase 2: (EOY)
○ Join Reorder
○ Write
Future work of TiDB
Roadmap
● TiSpark: Integrate TiKV with SparkSQL
● Better optimizer (statistics && CBO)
● JSON type and document store for TiDB
○ MySQL 5.7.12+ X-Plugin
● Integrate with Kubernetes
○ Operator by CoreOS
Thanks
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv
Contact me:
shenli@pingcap.com
