© 2023 All Rights Reserved
YugabyteDB
Advanced level unlocked
Gwenn Etourneau
Principal, Solution Architect
Agenda
● Quick reminder
● Under the hood
○ Tablet Splitting
■ Manual splitting
■ Pre-splitting
■ Automatic splitting
○ Replication
■ Raft
■ Read - Write path
■ Transaction Read-Write path
About Me
https://github.com/shinji62
https://twitter.com/the_shinji62
Woven by Toyota
Pivotal (acquired by VMware)
Rakuten
IBM …
Etourneau Gwenn
Principal Solution Architect
Quick reminder
Component
Layered Architecture
Extensible query layer (supports multiple APIs)
○ YSQL: a fully PostgreSQL-compatible relational API
○ YCQL: a Cassandra-compatible semi-relational API
DocDB storage layer (distributed, transactional document store with sync and async replication support)
○ Transactional
○ Resilient and scalable
○ Document storage
Client workloads
○ Microservice requiring relational integrity
○ Microservice requiring massive scale
○ Microservice requiring geo-distribution of data
Extend to Distributed SQL
Under the hood
Table sharding
Every Table's Data Is Automatically Sharded
● YugabyteDB splits user tables into multiple shards, called tablets, using either a hash- or range-based strategy.
○ The primary key of each row uniquely determines the tablet in which the row is stored.
○ By default, there are 8 tablets per node, distributed evenly across the nodes.
SHARDING = AUTOMATIC DISTRIBUTION OF TABLES
https://docs.yugabyte.com/preview/explore/linear-scalability/sharding-data/
https://www.yugabyte.com/blog/distributed-sql-tips-tricks-tablet-splitting-high-availability-sharding/
YugabyteDB supports data resharding by splitting tablets, using the following 3 mechanisms:
● Presplitting tablets
○ Any table created in DocDB can be split into the desired number of tablets at creation time.
● Manual tablet splitting
○ You can split the tablets of a running cluster manually at runtime.
● Automatic tablet splitting
○ The database splits the tablets of a running cluster automatically, according to a configured policy.
1. Presplitting tablets
● At creation time, presplit a table into the desired number of tablets
○ YSQL tables: both range-sharded and hash-sharded tables are supported
○ YCQL tables: hash-sharded tables are supported
● Hash-sharded tables
○ At most 65,536 (64K) tablets (shards) per table
○ The hash space is a 2-byte range, from 0x0000 to 0xFFFF
CREATE TABLE customers (
customer_id bpchar NOT NULL,
cname character varying(40),
contact_name character varying(30),
contact_title character varying(30),
PRIMARY KEY (customer_id HASH)
) SPLIT INTO 16 TABLETS;
● For example, for a table with 16 tablets, the overall hash space [0x0000, 0xFFFF] is divided into 16 subranges, one for each tablet: [0x0000, 0x1000), [0x1000, 0x2000), …, [0xF000, 0xFFFF]
● Read/write operations are processed by converting the primary key into an internal key and its hash value, and then determining which tablet the operation should be routed to
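As a rough illustration of the routing described above, the following Python sketch divides the 2-byte hash space into equal subranges and maps a key to a tablet index. The helper names (`tablet_ranges`, `toy_hash`, `tablet_for`) and the CRC32-based toy hash are assumptions made for this example; the slides do not show YugabyteDB's actual hash function or internal key encoding.

```python
import zlib

HASH_SPACE = 0x10000  # 2-byte hash space: 0x0000 .. 0xFFFF


def tablet_ranges(num_tablets):
    """Split the hash space into num_tablets contiguous half-open ranges."""
    bounds = [i * HASH_SPACE // num_tablets for i in range(num_tablets + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_tablets)]


def toy_hash(key):
    """Stand-in 16-bit hash (assumption); NOT the hash YugabyteDB uses."""
    return zlib.crc32(key.encode("utf-8")) & 0xFFFF


def tablet_for(key, ranges):
    """Return the index of the tablet whose hash range contains the key."""
    h = toy_hash(key)
    for index, (start, end) in enumerate(ranges):
        if start <= h < end:
            return index
    raise AssertionError("hash value outside the hash space")
```

With `ranges = tablet_ranges(16)`, the first range is `(0x0000, 0x1000)` and the last is `(0xF000, 0x10000)`; the exclusive upper bound 0x10000 corresponds to the inclusive 0xFFFF on the slide.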
1. Presplitting tablets
● For range-sharded tables, you can predefine the split points.
CREATE TABLE customers (
customer_id bpchar NOT NULL,
company_name character varying(40),
PRIMARY KEY (customer_id ASC))
SPLIT AT VALUES ((1000), (2000), (3000), ... );
1. Presplitting tablets - Maximum number of tablets
● The maximum number of tablets per table is the number of YB-TServers multiplied by the max_create_tablets_per_ts setting (default 50).
○ For example, with 4 nodes, at most 200 tablets can be created per table.
○ If you try to create more tablets than this maximum, an error is returned:
message="Invalid Table Definition. Error creating table YOUR-TABLE on the master: The
requested number of tablets (XXXX) is over the permitted maximum (200)
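The limit above is simple arithmetic, sketched here in Python. The function names are hypothetical and the check only mimics the wording of the error shown on the slide; it is not the code path the master actually runs.

```python
def max_tablets_per_table(num_tservers, max_create_tablets_per_ts=50):
    """Upper bound on tablets for one table: tservers x per-tserver limit."""
    return num_tservers * max_create_tablets_per_ts


def check_tablet_request(requested, num_tservers, max_create_tablets_per_ts=50):
    """Reject a presplit request that exceeds the permitted maximum."""
    limit = max_tablets_per_table(num_tservers, max_create_tablets_per_ts)
    if requested > limit:
        raise ValueError(
            f"The requested number of tablets ({requested}) "
            f"is over the permitted maximum ({limit})"
        )
    return requested
```

For the 4-node example on the slide, `max_tablets_per_table(4)` gives 200.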
2. Manual tablet splitting
● Recommended v2.14.x
● By using `SPLIT INTO X TABLETS` when creating a table, you can specify the number of tablets for the table. The example below creates only 1 tablet for the table:
CREATE TABLE t (k VARCHAR, v TEXT, PRIMARY KEY (k)) SPLIT INTO 1 TABLETS;
INSERT INTO t(k, v) SELECT i::text, left(md5(random()::text), 4) FROM generate_series(1, 100000) s(i);
SELECT count(*) FROM t;
● You can also use the yb-admin command split_tablet to split a tablet at runtime, which changes the number of tablets:
yb-admin --master_addresses 127.0.0.{1..4}:7100 split_tablet cdcc15981d29480498e5bacd4fc6b277
3. Automatic tablet splitting
● Reshards data automatically and transparently while the cluster is online, once a specified size threshold has been reached
● To enable automatic tablet splitting:
○ Set the yb-master --enable_automatic_tablet_splitting flag and specify the associated flags to configure when tablets should split
○ Newly created tables have 1 shard per node by default
3. Automatic tablet splitting - 3 phases
● Low phase
○ Applies while each node has fewer than tablet_split_low_phase_shard_count_per_node shards (8 by default).
○ Splits tablets larger than tablet_split_low_phase_size_threshold_bytes (512 MB by default).
● High phase
○ Applies while each node has fewer than tablet_split_high_phase_shard_count_per_node shards (24 by default).
○ Splits tablets larger than tablet_split_high_phase_size_threshold_bytes (10 GB by default).
● Final phase
○ Applies once the per-node shard count exceeds the high phase count (determined by tablet_split_high_phase_shard_count_per_node, 24 by default).
○ Splits tablets larger than tablet_force_split_threshold_bytes (100 GB by default).
● Recommended v2.14.9+
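The three phases can be read as a single decision rule: the more shards a node already holds, the larger a tablet must grow before it is split. The Python sketch below uses the flag names and defaults from the slide as dictionary keys; the function names are hypothetical, and the handling of the exact boundary (a node with exactly 24 shards falls into the final phase) is an assumption, since the slide does not spell it out.

```python
MB = 1024 ** 2
GB = 1024 ** 3

# Defaults as quoted on the slide.
DEFAULT_FLAGS = {
    "tablet_split_low_phase_shard_count_per_node": 8,
    "tablet_split_low_phase_size_threshold_bytes": 512 * MB,
    "tablet_split_high_phase_shard_count_per_node": 24,
    "tablet_split_high_phase_size_threshold_bytes": 10 * GB,
    "tablet_force_split_threshold_bytes": 100 * GB,
}


def split_threshold_bytes(shards_per_node, flags=DEFAULT_FLAGS):
    """Pick the size threshold that applies for the current shard count."""
    if shards_per_node < flags["tablet_split_low_phase_shard_count_per_node"]:
        return flags["tablet_split_low_phase_size_threshold_bytes"]
    if shards_per_node < flags["tablet_split_high_phase_shard_count_per_node"]:
        return flags["tablet_split_high_phase_size_threshold_bytes"]
    return flags["tablet_force_split_threshold_bytes"]


def should_split(tablet_size_bytes, shards_per_node, flags=DEFAULT_FLAGS):
    """A tablet is split when it is larger than the active phase threshold."""
    return tablet_size_bytes > split_threshold_bytes(shards_per_node, flags)
```

For instance, a 600 MB tablet is split on a node with 4 shards (low phase) but left alone on a node with 10 shards (high phase).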
3. Automatic tablet splitting - Others
● Post-split compactions
○ When a tablet is split, the two resulting tablets each need a full compaction to remove unnecessary data and free disk space.
○ This may increase CPU overhead, but you can control this behavior with gflags.
Hash vs Range
Hash
Pros
● Recommended for most workloads
● Best for massive workloads
● Best for distributing data across nodes
Cons
● Range queries are inefficient, for example WHERE k > v1 AND k < v2
Range
Pros
● Efficient for range queries, for example WHERE k > v1 AND k < v2
Cons
● Warming issue: everything starts on a single node/tablet (presplitting is needed)
● May lead to hotspots: many primary keys within the same tablet
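The trade-off can be made concrete with a small Python sketch that counts how many tablets a batch of sequential keys touches under each strategy. Everything here is illustrative: integer keys, the split points 1000/2000/3000 borrowed from the presplitting example, four hash tablets, and a CRC32-based toy hash that is not YugabyteDB's real hash function.

```python
import bisect
import zlib

RANGE_SPLIT_POINTS = [1000, 2000, 3000]  # each split point starts a new tablet
NUM_HASH_TABLETS = 4


def range_tablet(key):
    """Index of the range-sharded tablet that holds an integer key."""
    return bisect.bisect_right(RANGE_SPLIT_POINTS, key)


def hash_tablet(key):
    """Index of the hash-sharded tablet, using a toy 16-bit hash (assumption)."""
    h = zlib.crc32(str(key).encode("utf-8")) & 0xFFFF
    return h * NUM_HASH_TABLETS // 0x10000


def tablets_touched(keys, locate):
    """Set of tablet indexes that a collection of keys maps to."""
    return {locate(k) for k in keys}
```

Sequential inserts of keys 1 to 999 all land in range tablet 0 (the hotspot case), while the same keys are spread over several hash tablets. Conversely, a range query over keys 1500 to 1599 needs only range tablet 1, whereas with hash sharding the matching rows may sit in any tablet.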
Under the hood
Replication
Replication factor 3
Node#1      Node#2      Node#3
Tablet #1   Tablet #1   Tablet #1
Tablet #2   Tablet #2   Tablet #2
Tablet #3   Tablet #3   Tablet #3
Replication is done at the tablet (shard) level
Replication Factor = 3
Tablet #1
● Tablet Peer 1 on Node X
● Tablet Peer 2 on Node Y
● Tablet Peer 3 on Node Z
Replication uses a consensus algorithm
● Uses the Raft algorithm
● First, a tablet leader (the Raft leader) is elected
Reads in Raft consensus
● Reads are handled by the leader**
** Reads can be served by a follower if yb_read_from_followers is set to true
Writes in Raft consensus
● Writes are processed by the leader:
○ Send the write to all peers
○ Wait for a majority to acknowledge
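The "wait for a majority" rule for writes can be sketched in a few lines of Python. This is a toy model with hypothetical function names: real Raft also tracks terms, log indexes, and retries, none of which appear here.

```python
def majority(replication_factor):
    """Smallest number of peers that forms a majority."""
    return replication_factor // 2 + 1


def write_is_committed(peer_acks):
    """peer_acks holds one boolean per tablet peer, the leader included.

    The write counts as committed once a majority of peers acknowledged it.
    """
    acks = sum(1 for acked in peer_acks if acked)
    return acks >= majority(len(peer_acks))
```

With replication factor 3, two acknowledgements (for example the leader plus one follower) are enough, so a write still succeeds while one peer is unavailable.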
Leader Lease
To avoid inconsistencies during a network partition, and to be sure that reads return the latest data, the leader holds a lease ("I want to be the leader for 3 sec"), which guarantees that at most one leader is serving data at any time.
Once the old leader's lease has expired and the new leader holds the lease, the old leader can no longer respond to clients.
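The lease idea can be modelled with a single shared record of who holds the lease and until when. The Python sketch below is a deliberately centralised toy with hypothetical names; in a real cluster each peer tracks leases itself and has to account for clock skew, which this example ignores.

```python
class LeaderLease:
    """Toy model of a time-bounded leader lease (3 seconds, as on the slide)."""

    def __init__(self, duration_s=3.0):
        self.duration_s = duration_s
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node, now):
        """Grant or renew the lease unless another node's lease is still valid."""
        if self.holder not in (None, node) and now < self.expires_at:
            return False
        self.holder = node
        self.expires_at = now + self.duration_s
        return True

    def can_serve(self, node, now):
        """Only the current holder may serve data, and only before expiry."""
        return self.holder == node and now < self.expires_at
```

If node A takes the lease at t=0, node B cannot acquire it at t=1, and A stops serving at t=3 whether or not it knows about B. At no point do both nodes answer clients.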
Under the hood
IO Path
Read path
Standard Read Request
Cluster layout
● YB-tserver 1: Tablet1-Leader, Tablet2-Follower, Tablet3-Follower
● YB-tserver 2: Tablet1-Follower, Tablet2-Leader, Tablet3-Follower
● YB-tserver 3: Tablet1-Follower, Tablet2-Follower, Tablet3-Leader
● YB-master 1: Master-Follower
● YB-master 2: Master-Follower
● YB-master 3: Master-Leader
Steps
1. Read request for tablet 3
2. Get tablet leader locations
3. Redirect to the current tablet 3 leader
4. Respond to client
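The read flow amounts to "look up the tablet leader, then forward the request to it". A minimal Python sketch follows, assuming a static leader map that mirrors the layout on the slide; the names `TABLET_LEADERS` and `route_read` are hypothetical, and leader changes and caching are left out.

```python
# Leader placement taken from the slide; the identifiers are illustrative.
TABLET_LEADERS = {
    "tablet1": "yb-tserver-1",
    "tablet2": "yb-tserver-2",
    "tablet3": "yb-tserver-3",
}


def route_read(tablet, contacted_tserver, leaders=TABLET_LEADERS):
    """Return the tservers a read visits before the client gets a response."""
    leader = leaders[tablet]          # step 2: get tablet leader location
    hops = [contacted_tserver]        # step 1: request arrives at some tserver
    if leader != contacted_tserver:
        hops.append(leader)           # step 3: redirect to the current leader
    return hops                       # step 4: the last hop responds
```

A read for tablet 3 that first reaches yb-tserver-1 takes one extra hop to yb-tserver-3, while a read that already arrives at yb-tserver-3 is served locally.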
Write path
Standard Write Request
Cluster layout
● YB-tserver 1: Tablet1-Leader, Tablet2-Follower, Tablet3-Follower
● YB-tserver 2: Tablet1-Follower, Tablet2-Leader, Tablet3-Follower
● YB-tserver 3: Tablet1-Follower, Tablet2-Follower, Tablet3-Leader
● YB-master 1: Master-Follower
● YB-master 2: Master-Follower
● YB-master 3: Master-Leader
Steps
1. Update request for tablet 3
2. Get tablet leader locations
3. Redirect to the current tablet 3 leader
4. Sync update to follower replicas using Raft
5. Wait for one replica to commit to its own Raft log, then ack the client
Distributed Transactions
● App clients: connect through the distributed SQL API
● Admin clients: cluster administration
● YB-Master service (yb-master1, yb-master2, yb-master3, each with a syscatalog): manages shard metadata and coordinates config changes
● YB-TServer service (yb-tserver1, yb-tserver2, yb-tserver3, yb-tserver4, …): stores and serves app data in/from tablets (aka shards)
○ Each tserver runs a distributed transaction manager
○ Each tserver hosts three of the four tablets (tablet1 to tablet4), so every tablet has three replicas
● Raft group leader: serves writes and strong reads
● Raft group follower: serves timeline-consistent reads and is ready for leader election
● node1, node2, node3, node4, …: scale to as many nodes as needed
Transaction Write path
Participants (spread across YB Tablet Servers 1 to 4)
● Transaction Manager
● Txn Status Tablet (leader), with two Txn Status Tablet followers
● Tablet containing k1 (leader), with followers; provisional record: k1=v1 (txn=txn_id)
● Tablet containing k2 (leader), with followers; provisional record: k2=v2 (txn=txn_id)
Steps
1. Client's request: set k1=v1, k2=v2
2. Create status record
3. Write provisional records
4. Commit txn
5. Ack client
6. Async. apply provisional records (convert to permanent)
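The six steps of the transaction write path can be traced with plain dictionaries standing in for tablets. This Python sketch is a single-process toy with hypothetical names and status strings; it leaves out Raft replication of each step, conflict detection, and failure handling.

```python
def new_tablet():
    """A tablet modelled as provisional and permanent key-value maps."""
    return {"provisional": {}, "permanent": {}}


def write_transaction(txn_id, writes, status, tablet_for_key):
    """Steps 2 to 5: status record, provisional records, commit, ack."""
    status[txn_id] = "PENDING"                       # step 2: create status record
    for key, value in writes.items():                # step 3: write provisional records
        tablet_for_key[key]["provisional"][key] = (value, txn_id)
    status[txn_id] = "COMMITTED"                     # step 4: commit txn
    return "ack"                                     # step 5: ack client


def apply_provisional(txn_id, status, tablets):
    """Step 6: convert a committed transaction's provisional records to permanent."""
    if status.get(txn_id) != "COMMITTED":
        return
    for tablet in tablets:
        for key, (value, owner) in list(tablet["provisional"].items()):
            if owner == txn_id:
                tablet["permanent"][key] = value
                del tablet["provisional"][key]
```

Note that the client is acknowledged after step 4, before the provisional records are applied; `apply_provisional` is a separate call here to reflect that the slide marks step 6 as asynchronous.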
Transaction read path
Participants (spread across YB Tablet Servers 1 to 4)
● Transaction Manager
● Tx status tablet (leader), with two Txn Status Tablet followers; txn_id: committed @ t=100
● Tablet containing k1 (leader), with followers; provisional record: k1=v1 (txn=txn_id)
● Tablet containing k2 (leader), with followers; provisional record: k2=v2 (txn=txn_id)
Steps
1. Client's request: read k1, k2
2. Read k1 and k2 at hybrid time ht_read
3. Request status of txn txn_id
4. Return k1=v1 and k2=v2
5. Respond to client
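The read steps above hinge on one check: a provisional record is returned only if its transaction turns out to be committed. The Python sketch below assumes the usual MVCC visibility rule (commit time not later than the read time `ht_read`); the slide only shows "committed @ t=100", so that rule, the function name, and the data layout are assumptions for illustration.

```python
def read_key(key, ht_read, tablet, txn_commit_time):
    """Read one key at hybrid time ht_read.

    tablet: {"provisional": {key: (value, txn_id)}, "permanent": {key: value}}
    txn_commit_time: {txn_id: commit time, or None while not committed}
    """
    if key in tablet["provisional"]:
        value, txn_id = tablet["provisional"][key]
        commit_time = txn_commit_time.get(txn_id)    # step 3: request txn status
        if commit_time is not None and commit_time <= ht_read:
            return value                             # step 4: return the value
    # Otherwise fall back to already applied data, if any.
    return tablet["permanent"].get(key)
```

With the transaction committed at t=100, a read at `ht_read=150` sees `k1=v1`, while a read at `ht_read=50` does not.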
Thank You
Join us on Slack:
www.yugabyte.com/slack
Star us on GitHub:
github.com/yugabyte/yugabyte-db
YugabyteDB Tablet Splitting and Replication

  • 1. © 2023 All Rights Reserved YugabyteDB Advanced level unlocked 1 Gwenn Etourneau Principal, Solution Architect
  • 2. © 2023 All Rights Reserved ● Quick reminder ● Under the hood ○ Tablet Splitting ■ Manual splitting ■ Pre-splitting ■ Automatic splitting ○ Replication ■ Raft ■ Read - Write path ■ Transaction Read-Write path 2 Agenda
  • 3. © 2023 All Rights Reserved 3 About Me https://github.com/shinji62 https://twitter.com/the_shinji62 Woven by Toyota Pivotal (ac. By VMware) Rakuten IBM … Etourneau Gwenn Principal Solution Architect
  • 4. © 2023 All Rights Reserved Quick reminder 4
  • 5. © 2023 All Rights Reserved Component 5
  • 6. © 2023 All Rights Reserved Layered Architecture DocDB Storage Layer Distributed, transactional document store with sync and async replication support YSQL A fully PostgreSQL compatible relational API YCQL Cassandra compatible semi-relational API Extensible Query Layer Extensible query layer to support multiple API’s Microservice requiring relational integrity Microservice requiring massive scale Microservice requiring geo-distribution of data Extensible query layer ○ YSQL: PostgreSQL-based ○ YCQL: Cassandra-based Transactional storage layer ○ Transactional ○ Resilient and scalable ○ Document storage 6
  • 7. © 2023 All Rights Reserved Extend to Distributed SQL 7
  • 8. © 2023 All Rights Reserved Under the hood Table sharding 8
  • 9. Every Table's Data is Automatically Sharded
      ● YugabyteDB splits user tables into multiple shards, called tablets, using either a hash- or range-based strategy.
        ○ The primary key of each row uniquely determines the tablet in which the row is located
        ○ By default, 8 tablets per node, distributed evenly across the nodes
  • 10. Every Table's Data is Automatically Sharded
      ● SHARDING = AUTOMATIC DISTRIBUTION OF TABLES (diagram of a table split into tablets)
      ● https://docs.yugabyte.com/preview/explore/linear-scalability/sharding-data/
      ● https://www.yugabyte.com/blog/distributed-sql-tips-tricks-tablet-splitting-high-availability-sharding/
  • 11. Every Table's Data is Automatically Sharded
      ● YugabyteDB allows data resharding by splitting tablets using the following 3 mechanisms:
      ● Presplitting tablets
        ○ All tables created in DocDB can be split into the desired number of tablets at creation time.
      ● Manual tablet splitting
        ○ You can split the tablets in a running cluster manually at runtime.
      ● Automatic tablet splitting
        ○ The database automatically splits the tablets in a running cluster according to a configured policy.
  • 12. 1. Presplitting tablets
      ● At creation time, presplit a table into the desired number of tablets
        ○ YSQL tables: both range sharding and hash sharding are supported
        ○ YCQL tables: only hash sharding is supported
      ● Hash-sharded tables
        ○ Maximum of 65536 (64K) tablets per table
        ○ The hash space is a 2-byte range from 0x0000 to 0xFFFF
          CREATE TABLE customers (
            customer_id bpchar NOT NULL,
            cname character varying(40),
            contact_name character varying(30),
            contact_title character varying(30),
            PRIMARY KEY (customer_id HASH)
          ) SPLIT INTO 16 TABLETS;
      ● For example, for a table with 16 tablets, the overall hash space [0x0000, 0xFFFF] is divided into 16 subranges, one for each tablet: [0x0000, 0x1000), [0x1000, 0x2000), … , [0xF000, 0xFFFF]
      ● Read/write operations are processed by converting the primary key into an internal key and its hash value, then determining the tablet to which the operation should be routed
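The subrange arithmetic on the presplitting slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `tablet_ranges` and `tablet_for_hash` are hypothetical, the split is assumed to be even (as in the 16-tablet example), and the 16-bit hash value is taken as given, since the slides do not describe how YugabyteDB computes it.

```python
HASH_SPACE = 0x10000  # size of the 2-byte hash space, 0x0000 through 0xFFFF


def tablet_ranges(num_tablets):
    """Evenly divide the hash space into [start, end) subranges (end exclusive)."""
    if not 1 <= num_tablets <= HASH_SPACE:
        raise ValueError("a hash-sharded table can have between 1 and 65536 tablets")
    bounds = [i * HASH_SPACE // num_tablets for i in range(num_tablets + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def tablet_for_hash(hash_value, ranges):
    """Return the index of the subrange (tablet) that contains hash_value."""
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value < end:
            return index
    raise ValueError("hash value is outside the 2-byte hash space")


ranges = tablet_ranges(16)
print([f"[0x{s:04X}, 0x{e:04X})" for s, e in ranges[:2]])
print(tablet_for_hash(0x1ABC, ranges))  # 1
```

The last subrange ends at the exclusive bound 0x10000, which is the same as the inclusive bound 0xFFFF shown on the slide.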
  • 13. 1. Presplitting tablets
  • 14. 1. Presplitting tablets
      ● For range-sharded tables, you can predefine the split points.
          CREATE TABLE customers (
            customer_id bpchar NOT NULL,
            company_name character varying(40),
            PRIMARY KEY (customer_id ASC))
          SPLIT AT VALUES ((1000), (2000), (3000), ... );
  • 15. 1. Presplitting tablets - Maximum number of tablets
      ● The maximum number of tablets per table is the number of TServers multiplied by the max_create_tablets_per_ts setting (default 50).
        ○ For example, with 4 nodes, at most 200 tablets per table can be created.
        ○ If you try to create more than the maximum number of tablets, an error is returned:
          message="Invalid Table Definition. Error creating table YOUR-TABLE on the master: The requested number of tablets (XXXX) is over the permitted maximum (200)"
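The limit on this slide is a simple product, sketched below. The helper names are hypothetical and the check only mirrors the wording of the error message quoted on the slide; it is not YugabyteDB's validation code.

```python
def max_tablets_per_table(num_tservers, max_create_tablets_per_ts=50):
    """Upper bound on tablets per table: number of TServers times the per-TServer limit."""
    return num_tservers * max_create_tablets_per_ts


def check_requested_tablets(requested, num_tservers, max_create_tablets_per_ts=50):
    """Raise if the requested tablet count exceeds the bound (illustrative check only)."""
    limit = max_tablets_per_table(num_tservers, max_create_tablets_per_ts)
    if requested > limit:
        raise ValueError(
            f"The requested number of tablets ({requested}) "
            f"is over the permitted maximum ({limit})"
        )
    return requested


print(max_tablets_per_table(4))  # 200
```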
  • 16. 2. Manual tablet splitting
      ● Recommended: v2.14.x
      ● With the `SPLIT INTO X TABLETS` clause at table creation, you can specify the number of tablets for the table. The example below creates a table with a single tablet:
          CREATE TABLE t (k VARCHAR, v TEXT, PRIMARY KEY (k)) SPLIT INTO 1 TABLETS;
          INSERT INTO t(k, v) SELECT i::text, left(md5(random()::text), 4) FROM generate_series(1, 100000) s(i);
          SELECT count(*) FROM t;
      ● You can then use the yb-admin command split_tablet to split a tablet and change the number of tablets:
          yb-admin --master_addresses 127.0.0.{1..4}:7100 split_tablet cdcc15981d29480498e5bacd4fc6b277
  • 17. 3. Automatic tablet splitting
      ● Data is resharded automatically and transparently, while the cluster is online, once a specified size threshold has been reached
      ● To enable automatic tablet splitting:
        ○ Set the yb-master flag --enable_automatic_tablet_splitting and the associated flags that configure when tablets should split
        ○ Newly created tables have 1 shard per node by default
  • 18. 3. Automatic tablet splitting - 3 Phases
      ● Recommended: v2.14.9+
      ● Low phase
        ○ Each node has fewer than tablet_split_low_phase_shard_count_per_node shards (8 by default).
        ○ Splits tablets larger than tablet_split_low_phase_size_threshold_bytes (512 MB by default).
      ● High phase
        ○ Each node has fewer than tablet_split_high_phase_shard_count_per_node shards (24 by default).
        ○ Splits tablets larger than tablet_split_high_phase_size_threshold_bytes (10 GB by default).
      ● Final phase
        ○ The shard count per node exceeds the high phase count (tablet_split_high_phase_shard_count_per_node, 24 by default).
        ○ Splits tablets larger than tablet_force_split_threshold_bytes (100 GB by default).
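The three phases amount to choosing a size threshold from the current shard count per node. The sketch below encodes only what the slide states, using the flag names and defaults listed there. The function names are hypothetical, and treating 1 MB as 1024² bytes is an assumption.

```python
MB = 1024 ** 2  # assumption: binary megabytes
GB = 1024 ** 3

# Flag names and default values as listed on the slide.
DEFAULT_FLAGS = {
    "tablet_split_low_phase_shard_count_per_node": 8,
    "tablet_split_low_phase_size_threshold_bytes": 512 * MB,
    "tablet_split_high_phase_shard_count_per_node": 24,
    "tablet_split_high_phase_size_threshold_bytes": 10 * GB,
    "tablet_force_split_threshold_bytes": 100 * GB,
}


def split_threshold_bytes(shards_per_node, flags=DEFAULT_FLAGS):
    """Pick the size threshold for the phase implied by the shard count per node."""
    if shards_per_node < flags["tablet_split_low_phase_shard_count_per_node"]:
        return flags["tablet_split_low_phase_size_threshold_bytes"]   # low phase
    if shards_per_node < flags["tablet_split_high_phase_shard_count_per_node"]:
        return flags["tablet_split_high_phase_size_threshold_bytes"]  # high phase
    return flags["tablet_force_split_threshold_bytes"]                # final phase


def should_split(tablet_size_bytes, shards_per_node, flags=DEFAULT_FLAGS):
    """A tablet is a split candidate when it is larger than the phase threshold."""
    return tablet_size_bytes > split_threshold_bytes(shards_per_node, flags)


print(should_split(600 * MB, shards_per_node=3))   # True: low phase, above 512 MB
print(should_split(600 * MB, shards_per_node=10))  # False: high phase, below 10 GB
```

The effect is that tables split aggressively while they have few tablets, and the threshold rises as the tablet count per node grows.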
  • 19. 3. Automatic tablet splitting - Others
      ● Post-split compactions
        ○ When a tablet is split, the two resulting tablets need a full compaction to remove unnecessary data and free disk space.
        ○ This may increase CPU overhead, but you can control this behavior with some gflags
  • 20. Hash vs Range
      ● Hash
        ○ Pros
          ■ Recommended for most workloads
          ■ Best for massive workloads
          ■ Best for data distribution across nodes
        ○ Cons
          ■ Range queries are inefficient, for example where k>v1 and k<v2
      ● Range
        ○ Pros
          ■ Efficient for range queries, for example where k>v1 and k<v2
        ○ Cons
          ■ Warm-up issue: everything starts on a single node / tablet (presplitting is needed)
          ■ May lead to hotspots, with many primary keys within the same tablet
  • 21. Under the hood: Replication
  • 22. Replication factor 3 (diagram: Node #1, Node #2 and Node #3 each hold a replica of Tablet #1, Tablet #2 and Tablet #3)
  • 23. Replication is done at the tablet (shard) level. With replication factor = 3, Tablet #1 has three peers: Tablet Peer 1 on Node X, Tablet Peer 2 on Node Y, and Tablet Peer 3 on Node Z.
  • 24. Replication uses a consensus algorithm, Raft. The first step is to elect a tablet leader.
  • 25. Reads in Raft Consensus: reads are handled by the leader. Reads can be served from a follower if the gflag yb_read_from_followers is true.
  • 26. Writes in Raft Consensus: writes are processed by the leader, which sends the write to all peers and waits for a majority to ack.
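"Wait for a majority to ack" reduces to a small quorum calculation, sketched here. The function names are hypothetical, and counting the leader's own log write as one of the acks is an assumption about how the tally is kept.

```python
def majority(replication_factor):
    """Smallest number of peers that forms a majority of the Raft group."""
    return replication_factor // 2 + 1


def write_is_committed(acks, replication_factor):
    """A write is committed once a majority of peers (leader included) have acked it."""
    return acks >= majority(replication_factor)


print(majority(3))               # 2
print(write_is_committed(2, 3))  # True: leader plus one follower
print(write_is_committed(1, 3))  # False: the leader alone is not a majority
```

With replication factor 3, a write therefore survives the loss of any single peer.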
  • 27. Leader Lease: to avoid inconsistencies during a network partition and to be sure that reads return the latest data, the leader holds a lease ("I want to be the leader for 3 sec"), which guarantees that at most one leader is serving data at any time. Once the old leader's lease has expired and a new leader holds the lease, the old leader can no longer respond to clients.
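A rough sketch of the lease rule follows, under strong simplifying assumptions: all nodes are assumed to share one clock, which a real system cannot rely on, and the names `LeaderLease`, `can_serve` and `can_acquire` are hypothetical rather than YugabyteDB internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderLease:
    holder: str        # node that currently holds the lease
    expires_at: float  # expiry time on an assumed shared clock, in seconds


def can_serve(node, lease, now):
    """A node may serve data only while it holds an unexpired lease."""
    return lease is not None and lease.holder == node and now < lease.expires_at


def can_acquire(current_lease: Optional[LeaderLease], now):
    """A new leader may take the lease only after the previous one has expired."""
    return current_lease is None or now >= current_lease.expires_at


lease = LeaderLease(holder="node-A", expires_at=3.0)  # "leader for 3 sec"
print(can_serve("node-A", lease, now=1.0))  # True
print(can_acquire(lease, now=1.0))          # False: the old lease is still valid
print(can_serve("node-A", lease, now=3.5))  # False: lease expired, stop serving
```

Because a new lease can start only after the old one ends, the two checks never both succeed for different nodes at the same instant.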
  • 28. Under the hood: IO Path
  • 29. Read path
  • 30. Standard Read Request
      ● Cluster layout in the diagram
        ○ YB-tserver 1: Tablet1-Leader, Tablet2-Follower, Tablet3-Follower
        ○ YB-tserver 2: Tablet1-Follower, Tablet2-Leader, Tablet3-Follower
        ○ YB-tserver 3: Tablet1-Follower, Tablet2-Follower, Tablet3-Leader
        ○ YB-master 1: Master-Follower; YB-master 2: Master-Follower; YB-master 3: Master-Leader
      ● Steps
        ○ 1: Read request for tablet 3
        ○ 2: Get tablet leader locations
        ○ 3: Redirect to the current tablet 3 leader
        ○ 4: Respond to client
  • 31. Write path
  • 32. Standard Write Request
      ● Cluster layout in the diagram
        ○ YB-tserver 1: Tablet1-Leader, Tablet2-Follower, Tablet3-Follower
        ○ YB-tserver 2: Tablet1-Follower, Tablet2-Leader, Tablet3-Follower
        ○ YB-tserver 3: Tablet1-Follower, Tablet2-Follower, Tablet3-Leader
        ○ YB-master 1: Master-Follower; YB-master 2: Master-Follower; YB-master 3: Master-Leader
      ● Steps
        ○ 1: Update request for tablet 3
        ○ 2: Get tablet leader locations
        ○ 3: Redirect to the current tablet 3 leader
        ○ 4: Sync the update to follower replicas using Raft
        ○ 5: Wait for one replica to commit to its own Raft log, then ack the client
  • 33. Distributed Transactions
  • 34. Distributed Transactions (architecture diagram)
      ● App clients connect through the distributed SQL API; admin clients handle cluster administration
      ● YB-Master service (yb-master1, yb-master2, yb-master3, each with a syscatalog): manages shard metadata and coordinates config changes
      ● YB-TServer service (yb-tserver1, yb-tserver2, yb-tserver3, yb-tserver4, …): stores and serves app data in/from tablets (aka shards); each yb-tserver runs a distributed transaction manager
        ○ tablet1 to tablet4 are each replicated on three of the four yb-tservers
      ● Raft group leader: serves writes and strong reads
      ● Raft group follower: serves timeline-consistent reads and is ready for leader election
      ● Scale to as many nodes as needed
  • 35. Transaction Write path
      ● Components in the diagram (spread across YB Tablet Servers 1 to 4)
        ○ Transaction manager
        ○ Txn status tablet (leader) with two followers
        ○ Tablet containing k1 (leader) and tablet containing k2 (leader), each with tablet followers
        ○ Provisional records: k1=v1 (txn=txn_id), k2=v2 (txn=txn_id)
      ● Steps
        ○ 1: Client's request: set k1=v1, k2=v2
        ○ 2: Create status record
        ○ 3: Write provisional records
        ○ 4: Commit txn
        ○ 5: Ack client
        ○ 6: Asynchronously apply provisional records (convert to permanent)
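The six numbered steps of the transaction write path can be traced with a toy in-memory model. Everything here is illustrative: plain dictionaries stand in for the status tablet and the data tablets, the "pending" state name is an assumption, and no replication, conflict handling or failure handling is modeled.

```python
def run_transaction(writes, txn_id, commit_time):
    """Trace the numbered steps for one transaction using plain dictionaries."""
    steps = ["1: client request received by the transaction manager"]

    # 2: create the status record ("pending" is an assumed state name).
    status_tablet = {txn_id: {"state": "pending"}}
    steps.append("2: status record created")

    # 3: write one provisional record per key, tagged with the transaction id.
    provisional = {key: {"value": value, "txn": txn_id} for key, value in writes.items()}
    steps.append("3: provisional records written")

    # 4: commit by updating the status record.
    status_tablet[txn_id] = {"state": "committed", "commit_time": commit_time}
    steps.append("4: transaction committed")

    # 5: the client can be acknowledged as soon as the commit is recorded.
    steps.append("5: client acknowledged")

    # 6: later, provisional records are converted to permanent ones.
    permanent = {key: record["value"] for key, record in provisional.items()}
    provisional.clear()
    steps.append("6: provisional records applied")

    return status_tablet, permanent, steps


status, permanent, steps = run_transaction({"k1": "v1", "k2": "v2"}, "txn_id", 100)
print(permanent)  # {'k1': 'v1', 'k2': 'v2'}
```

The ordering is the point of the sketch: the client ack (step 5) comes before the provisional records are applied (step 6), which is why the slide marks step 6 as asynchronous.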
  • 36. Transaction read path
      ● Components in the diagram (spread across YB Tablet Servers 1 to 4)
        ○ Transaction manager
        ○ Txn status tablet (leader), holding "txn_id: committed @ t=100", with two followers
        ○ Tablet containing k1 (leader) and tablet containing k2 (leader), each with tablet followers
        ○ Provisional records: k1=v1 (txn=txn_id), k2=v2 (txn=txn_id)
      ● Steps
        ○ 1: Client's request: read k1, k2
        ○ 2: Read k1 and k2 at hybrid time ht_read
        ○ 3: Each tablet requests the status of txn txn_id
        ○ 4: Return k1=v1 and k2=v2
        ○ 5: Respond to client
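The status lookup in step 3 can be sketched as a visibility check on a provisional record. The rule used here, that a record is visible only if its transaction is committed with a commit time no later than ht_read, is an assumption made for illustration and is not stated on the slide. `read_key` is a hypothetical helper, and returning None stands in for "fall back to an older version".

```python
def read_key(key, provisional_records, txn_status, ht_read):
    """Return a provisional value only if its transaction committed at or before ht_read."""
    record = provisional_records.get(key)
    if record is None:
        return None
    # Step 3: ask the status tablet about the transaction that wrote this record.
    status = txn_status.get(record["txn"], {})
    if status.get("state") == "committed" and status["commit_time"] <= ht_read:
        return record["value"]
    return None  # not visible at this read time


provisional_records = {
    "k1": {"value": "v1", "txn": "txn_id"},
    "k2": {"value": "v2", "txn": "txn_id"},
}
txn_status = {"txn_id": {"state": "committed", "commit_time": 100}}  # committed @ t=100

print(read_key("k1", provisional_records, txn_status, ht_read=120))  # v1
print(read_key("k1", provisional_records, txn_status, ht_read=90))   # None
```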
  • 37. Thank You. Join us on Slack: www.yugabyte.com/slack. Star us on GitHub: github.com/yugabyte/yugabyte-db