© 2023 All Rights Reserved
YugabyteDB
Advanced level unlocked
Gwenn Etourneau
Principal Solution Architect
Agenda
● Quick reminder
● Under the hood
○ Tablet Splitting
■ Manual splitting
■ Pre-splitting
■ Automatic splitting
○ Replication
■ Raft
■ Read - Write path
■ Transaction Read-Write path
About Me
Gwenn Etourneau
Principal Solution Architect
Woven by Toyota
Pivotal (acquired by VMware)
Rakuten
IBM …
https://github.com/shinji62
https://twitter.com/the_shinji62
Quick reminder
Components
Layered Architecture
Extensible Query Layer
○ Extensible query layer to support multiple APIs
○ YSQL: a fully PostgreSQL-compatible relational API
○ YCQL: a Cassandra-compatible semi-relational API
○ Serves microservices requiring relational integrity, massive scale, or geo-distribution of data
DocDB Storage Layer
○ Distributed, transactional document store with sync and async replication support
○ Transactional
○ Resilient and scalable
○ Document storage
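As a quick way to see the two query-layer APIs side by side, here is a minimal sketch of connecting to each on a local cluster (the host and the default ports 5433/9042 are assumptions):
# YSQL: the PostgreSQL-compatible API (default port 5433)
ysqlsh -h 127.0.0.1 -p 5433 -c 'SELECT version();'
# YCQL: the Cassandra-compatible API (default port 9042)
ycqlsh 127.0.0.1 9042 -e 'DESCRIBE KEYSPACES;'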
Extend to Distributed SQL
Under the hood
Table sharding
Every Table's Data Is Automatically Sharded
● YugabyteDB splits user tables into multiple shards, called tablets, using either a hash- or range-based strategy.
○ The primary key of each row uniquely identifies the tablet the row lives in.
○ By default, 8 tablets are created per node, distributed evenly across the nodes.
Every Table's Data Is Automatically Sharded
SHARDING = AUTOMATIC DISTRIBUTION OF TABLES
https://docs.yugabyte.com/preview/explore/linear-scalability/sharding-data/
https://www.yugabyte.com/blog/distributed-sql-tips-tricks-tablet-splitting-high-availability-sharding/
Every Table's Data Is Automatically Sharded
● YugabyteDB allows data resharding by splitting tablets using the following 3 mechanisms:
● Presplitting tablets
○ All tables created in DocDB can be split into the desired number of tablets at creation time.
● Manual tablet splitting
○ The tablets of a running cluster can be split manually at runtime by you.
● Automatic tablet splitting
○ The tablets of a running cluster are split automatically by the database according to a policy.
1. Presplitting tablets
● At creation time, presplit a table into the desired number of tablets
○ YSQL tables - supports both range-sharded and hash-sharded tables
○ YCQL tables - supports hash-sharded tables only
● Hash-sharded tables
○ Max 65,536 (64K) tablets per table
○ 2-byte hash range from 0x0000 to 0xFFFF
CREATE TABLE customers (
customer_id bpchar NOT NULL,
cname character varying(40),
contact_name character varying(30),
contact_title character varying(30),
PRIMARY KEY (customer_id HASH)
) SPLIT INTO 16 TABLETS;
● e.g. for a table with 16 tablets, the overall hash space [0x0000, 0xFFFF] is divided into 16 subranges, one for each tablet: [0x0000, 0x1000), [0x1000, 0x2000), …, [0xF000, 0xFFFF]
● Read/write operations are processed by converting the primary key into an internal key and its hash value, and determining which tablet the operation should be routed to
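To check how many tablets a table actually received, one option is the yb_table_properties function; a sketch, assuming this YSQL helper is available in your version:
-- Returns the tablet count for the customers table created above
SELECT num_tablets, num_hash_key_columns
FROM yb_table_properties('customers'::regclass);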
1. Presplitting tablets
● With range sharding, you can predefine the split points.
CREATE TABLE customers (
customer_id bpchar NOT NULL,
company_name character varying(40),
PRIMARY KEY (customer_id ASC))
SPLIT AT VALUES ((1000), (2000), (3000), ... );
1. Presplitting tablets - Maximum number of tablets
● The maximum number of tablets is based on the number of TServers and the max_create_tablets_per_ts setting (default 50).
○ For example, with 4 nodes only 200 tablets per table can be created (4 × 50).
○ If you try to create more than the maximum number of tablets, an error is returned:
message="Invalid Table Definition. Error creating table YOUR-TABLE on the master: The requested number of tablets (XXXX) is over the permitted maximum (200)
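If a table genuinely needs more tablets, the limit can be raised; a sketch, assuming max_create_tablets_per_ts is passed to yb-master (verify where the flag lives in your version):
# Assumption: raising the per-TServer limit from 50 to 100
yb-master ... --max_create_tablets_per_ts=100
# With 4 TServers this would allow up to 400 tablets per table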
2. Manual tablet splitting
● Recommended from v2.14.x
● By using `SPLIT INTO X TABLETS` when creating a table, you can specify the number of tablets for the table. The example below creates only 1 tablet:
CREATE TABLE t (k VARCHAR, v TEXT, PRIMARY KEY (k)) SPLIT INTO 1 TABLETS;
INSERT INTO t(k, v) SELECT i::text, left(md5(random()::text), 4) FROM generate_series(1, 100000) s(i);
SELECT count(*) FROM t;
● You can then use the yb-admin command split_tablet to split a tablet manually:
yb-admin --master_addresses 127.0.0.{1..4}:7100 split_tablet cdcc15981d29480498e5bacd4fc6b277
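The tablet ID passed to split_tablet comes from the cluster itself; a sketch of one way to look it up, assuming the default ysql.yugabyte keyspace:
# List the tablet UUID(s) of table t, then split one of them
yb-admin --master_addresses 127.0.0.{1..4}:7100 list_tablets ysql.yugabyte t
yb-admin --master_addresses 127.0.0.{1..4}:7100 split_tablet <tablet-uuid-from-output>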
3. Automatic tablet splitting
● Reshards data automatically, online and transparently, when a specified size threshold has been reached
● To enable automatic tablet splitting:
○ Set the yb-master --enable_automatic_tablet_splitting flag and specify the associated flags to configure when tablets should split (see the flag sketch after the phase list below)
○ Newly created tables then have 1 shard per node by default
3. Automatic tablet splitting - 3 Phases
● Recommended from v2.14.9+
● Low phase
○ While each node has fewer than tablet_split_low_phase_shard_count_per_node shards (8 by default),
○ splits tablets larger than tablet_split_low_phase_size_threshold_bytes (512 MB by default).
● High phase
○ While each node has fewer than tablet_split_high_phase_shard_count_per_node shards (24 by default),
○ splits tablets larger than tablet_split_high_phase_size_threshold_bytes (10 GB by default).
● Final phase
○ Once the high-phase count (tablet_split_high_phase_shard_count_per_node, 24 by default) is exceeded,
○ splits tablets larger than tablet_force_split_threshold_bytes (100 GB by default).
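The phases map directly onto yb-master flags; a sketch with the defaults written out in bytes (512 MB, 10 GB, 100 GB):
yb-master ... \
  --enable_automatic_tablet_splitting=true \
  --tablet_split_low_phase_shard_count_per_node=8 \
  --tablet_split_low_phase_size_threshold_bytes=536870912 \
  --tablet_split_high_phase_shard_count_per_node=24 \
  --tablet_split_high_phase_size_threshold_bytes=10737418240 \
  --tablet_force_split_threshold_bytes=107374182400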
3. Automatic tablet splitting - Others
● Post-split compactions
○ When a tablet is split, the two new tablets need a full compaction to remove unnecessary data and free disk space.
○ This can increase CPU overhead, but you can control the behavior with some gflags.
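As an illustration only, the flag names below are my assumption of the yb-tserver knobs that throttle post-split compactions; verify them against the docs for your release:
# Assumed flag names - check your version's yb-tserver reference
yb-tserver ... \
  --post_split_trigger_compaction_pool_max_threads=1 \
  --post_split_trigger_compaction_pool_max_queue_size=16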
Hash vs Range
Hash
● Pros
○ Recommended for most workloads
○ Best for massive workloads
○ Best for data distribution across nodes
● Cons
○ Range queries are inefficient, for example WHERE k > v1 AND k < v2
Range
● Pros
○ Efficient for range queries, for example WHERE k > v1 AND k < v2
● Cons
○ Warm-up issue, as everything starts on a single node/tablet (needs presplitting)
○ May lead to hotspots, with many PKs within the same tablet
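To make the trade-off concrete, a minimal sketch of the two primary-key forms (table and column names are hypothetical):
-- Hash sharding: rows scattered by hash(id); even distribution, poor range scans
CREATE TABLE events_hash (id bigint, payload text, PRIMARY KEY (id HASH));
-- Range sharding: rows stored in id order; WHERE id > x AND id < y touches few tablets
CREATE TABLE events_range (id bigint, payload text, PRIMARY KEY (id ASC));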
Under the hood
Replication
Every Table's Data Is Automatically Sharded
Replication factor 3
● With a replication factor of 3, each tablet (Tablet #1, #2, #3) has one copy on each of Node#1, Node#2, and Node#3.
Replication is done at the tablet (shard) level
● With Replication Factor = 3, Tablet #1 has three tablet peers: Tablet Peer 1 on Node X, Tablet Peer 2 on Node Y, and Tablet Peer 3 on Node Z.
Replication uses a Consensus algorithm
● YugabyteDB uses the Raft algorithm: each tablet's peers first elect a tablet (Raft) leader.
Reads in Raft Consensus
● Reads are handled by the Raft leader.**
** Reads can be served from a follower if the gflag yb_read_from_followers is true.
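A sketch of follower reads from YSQL, under the assumption that they must run in a read-only transaction (parameter names as in the YugabyteDB docs):
SET yb_read_from_followers = true;         -- allow serving reads from follower replicas
SET default_transaction_read_only = true;  -- follower reads require read-only transactions
SELECT v FROM t WHERE k = '42';            -- may now be answered by a nearby follower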
Writes in Raft Consensus
● Writes are processed by the Raft leader:
○ Send the write to all peers
○ Wait for a majority to ack
Leader Lease
● To avoid inconsistencies during a network partition and be sure to read the latest data, the leader holds a lease ("I want to be the leader for 3 sec"), guaranteeing that at most one leader is serving data.
● The old leader's lease expires before the new leader holds one, so the old leader can no longer respond to clients.
Under the hood
IO Path
Read path
Standard Read Request
Cluster layout: YB-tserver 1 (Tablet1-Leader, Tablet2-Follower, Tablet3-Follower), YB-tserver 2 (Tablet1-Follower, Tablet2-Leader, Tablet3-Follower), YB-tserver 3 (Tablet1-Follower, Tablet2-Follower, Tablet3-Leader); YB-master 1 and 2 are Master-Followers, YB-master 3 is the Master-Leader.
1. A read request for tablet 3 arrives (at YB-tserver 1).
2. Get the tablet leader locations (from the YB-master leader).
3. Redirect to the current tablet 3 leader (YB-tserver 3).
4. Respond to the client.
Write path
Standard Write Request
Cluster layout: same as the read path — YB-tserver 1 (Tablet1-Leader), YB-tserver 2 (Tablet2-Leader), YB-tserver 3 (Tablet3-Leader), each also holding followers of the other tablets; YB-master 3 is the Master-Leader.
1. An update request for tablet 3 arrives.
2. Get the tablet leader locations (from the YB-master leader).
3. Redirect to the current tablet 3 leader.
4. Synchronously replicate the update to the follower replicas using Raft.
5. Wait for one replica to commit it to its own Raft log (a majority with RF = 3), then ack the client.
Distributed Transactions
Distributed Transactions
● YB-Master Service (yb-master1, yb-master2, yb-master3, each with a syscatalog): manages shard metadata & coordinates config changes; serves admin clients for cluster administration.
● YB-TServer Service (yb-tserver1 … yb-tserver4, node1 … node4, …): stores & serves app data in/from tablets (aka shards); each tserver also runs a Distributed Txn Mgr and serves app clients through the distributed SQL API.
● Each tablet is a Raft group: the leader serves writes & strong reads; followers serve timeline-consistent reads & stand ready for leader election.
● Scale to as many nodes as needed.
Transaction Write path
Layout: the tablets containing k1 and k2 have their leaders on different tablet servers (each with followers elsewhere); a Txn Status Tablet (one leader, two followers) tracks the transaction.
1. The client's request (set k1=v1, k2=v2) reaches the Transaction Manager on a tablet server.
2. Create a status record in the Txn Status Tablet (leader).
3. Write provisional records k1=v1 (txn=txn_id) and k2=v2 (txn=txn_id) to the leaders of the tablets containing k1 and k2.
4. Commit the txn (in the status tablet).
5. Ack the client.
6. Asynchronously apply the provisional records (convert them to permanent).
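For reference, the path above is what an ordinary multi-key transaction triggers; a sketch against a hypothetical kv table:
-- Hypothetical table: kv(k text PRIMARY KEY, v text)
BEGIN;
UPDATE kv SET v = 'v1' WHERE k = 'k1';  -- provisional record on k1's tablet
UPDATE kv SET v = 'v2' WHERE k = 'k2';  -- provisional record on k2's tablet
COMMIT;                                 -- status record flips to committed, then async apply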
Transaction read path
Layout: same as the write path, with the Tx status tablet (leader) now recording "txn_id: committed @ t=100" and the provisional records k1=v1 and k2=v2 (txn=txn_id) still present on their tablets.
1. The client's request (read k1, k2) reaches the Transaction Manager.
2. Read k1 and k2 at hybrid time ht_read from the leaders of the tablets containing them.
3. On seeing the provisional records, each tablet requests the status of txn txn_id from the Tx status tablet.
4. The tablets return k1=v1 and k2=v2.
5. Respond to the client.
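And the matching read side on the same hypothetical kv table; both rows are read at one consistent hybrid time, as in step 2:
-- Served at a single hybrid timestamp (ht_read)
SELECT k, v FROM kv WHERE k IN ('k1', 'k2');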
Thank You
Join us on Slack: www.yugabyte.com/slack
Star us on GitHub: github.com/yugabyte/yugabyte-db