1© 2018 All rights reserved.
Distributed PostgreSQL
with YugaByte DB
Karthik Ranganathan
PostgresConf Silicon Valley
Oct 16, 2018
CHECK OUT THIS REPO:
github.com/YugaByte/yb-sql-workshop
About Us
Founders
Kannan Muthukkaruppan, CEO
Nutanix ♦ Facebook ♦ Oracle
IIT-Madras, University of California-Berkeley
Karthik Ranganathan, CTO
Nutanix ♦ Facebook ♦ Microsoft
IIT-Madras, University of Texas-Austin
Mikhail Bautin, Software Architect
ClearStory Data ♦ Facebook ♦ D.E.Shaw
Nizhny Novgorod State University, Stony Brook
• Founded Feb 2016
• Apache HBase committers and early engineers on Apache Cassandra
• Built Facebook’s NoSQL platform powered by Apache HBase
• Scaled the platform to serve many mission-critical use cases
o Facebook Messages (Messenger)
o Operational Data Store (Time series Data)
• Reassembled the same Facebook team at YugaByte along with engineers from Oracle, Google, Nutanix and LinkedIn
WORKSHOP AGENDA
• What is YugaByte DB? Why Another DB?
• Exercise 1: BI Tools on YugaByte PostgreSQL
• Exercise 2: Distributed PostgreSQL Architecture
• Exercise 3: Sharding and Scale Out in Action
• Exercise 4: Fault Tolerance in Action
WHAT IS
YUGABYTE DB?
A transactional, planet-scale database
for building high-performance cloud services.
NoSQL + SQL Cloud Native
WHY ANOTHER DB?
Typical Stack Today
Fragile infra with several moving parts
[Diagram labels: Application Tier (Stateless Microservices); Datacenter 1; Datacenter 2; SQL Master; SQL Slave]
• SQL for OLTP data
o Manual sharding (cost: dev team)
o Manual replication, manual failover (cost: ops team)
• NoSQL for other data
o App aware of data silo (cost: dev team)
• Cache for low latency
o App does caching (cost: dev team)
• Data inconsistency/loss
o Fragile infra, hours of debugging (cost: dev + ops team)
Does AWS change this?
Still complex: it’s the same architecture
[Diagram labels: Application Tier (Stateless Microservices); Datacenter 1; Datacenter 2; SQL Master; SQL Slave; ElastiCache; Aurora; DynamoDB]
System-of-Record DBs for Global Apps
[Comparison figure: the original graphic is not recoverable from this transcript. Surviving labels: "Not Portable", "Open Source", and "High Performance, Transactional, Planet-Scale".]
Design Principles
• TRANSACTIONAL
o Single Shard & Distributed ACID Txns
o Document-Based, Strongly Consistent Storage
• HIGH PERFORMANCE
o Low Latency, Tunable Reads
o High Throughput
• PLANET-SCALE
o Auto Sharding & Rebalancing
o Global Data Distribution
• CLOUD NATIVE
o Built For The Container Era
o Self-Healing, Fault-Tolerant
• OPEN SOURCE
o Apache 2.0
o Popular APIs Extended: Apache Cassandra, Redis and PostgreSQL (BETA)
EXERCISE #1
BUSINESS INTELLIGENCE
EXERCISE #2
DISTRIBUTED POSTGRES:
ARCHITECTURE
ARCHITECTURE
Overview
YugaByte DB Process Overview
• Universe = cluster of nodes
• Two sets of processes: YB-Master & YB-TServer
• Example universe
4 nodes
rf=3
Sharding data
• User table split into tablets
Each key maps to exactly one tablet
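The key-to-tablet mapping can be sketched as follows. This is a minimal illustration, assuming hash sharding over a 16-bit hash space split into equal contiguous ranges. The hash function (MD5 here), the value of `NUM_TABLETS`, and the name `tablet_for_key` are assumptions made for the example and are not taken from the slides.

```python
import hashlib

HASH_SPACE = 0x10000  # assumed 16-bit hash space: 0x0000..0xFFFF
NUM_TABLETS = 4       # hypothetical number of tablets for one table


def tablet_for_key(key: str, num_tablets: int = NUM_TABLETS) -> int:
    """Map a primary key to exactly one tablet index."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:2], "big")  # first 2 bytes -> 0..65535
    range_size = HASH_SPACE // num_tablets
    # The last tablet absorbs any remainder of the hash space.
    return min(hash_value // range_size, num_tablets - 1)


if __name__ == "__main__":
    for k in ("user-1", "user-2", "user-3"):
        print(k, "-> tablet", tablet_for_key(k))
```

The same key always lands on the same tablet, so a lookup only needs to contact the tablet that owns the key's hash range.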
Tablets and replication
• Tablet = set of tablet-peers in a Raft group
• Number of tablet-peers in a tablet = replication factor (RF)
Tolerate 1 failure : RF=3
Tolerate 2 failures: RF=5
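The relationship between RF and tolerated failures follows from Raft needing a majority of peers to make progress. A small sketch of the arithmetic (the function names are illustrative, not part of any YugaByte DB API):

```python
def majority(replication_factor: int) -> int:
    """Smallest number of tablet-peers that forms a majority."""
    if replication_factor < 1:
        raise ValueError("replication factor must be positive")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Peers that can fail while a majority is still available."""
    return replication_factor - majority(replication_factor)


if __name__ == "__main__":
    for rf in (3, 5):
        print(f"RF={rf}: majority={majority(rf)}, tolerates={tolerated_failures(rf)}")
```

An even RF adds a peer without adding fault tolerance (RF=4 still tolerates only one failure), which is why odd values are the usual choice.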
YB-TServer
• Process that does IO
• Hosts tablets for tables
• Hosts transaction manager
• Auto memory sizing
Block cache
Memstores
YB-Master
• Not in critical path
• System metadata store
Keyspaces, tables, tablets
Users/roles, permissions
• Admin operations
Create/alter/drop of tables
Backups
Load balancing (leader and data balancing)
Enforces data placement policy
HANDLING DDL STATEMENTS
DDL Statements in PostgreSQL
DDL Postmaster
(Authentication, authorization)
Rewriter
Planner/Optimizer
Executor
DISK
Create Table Data File
Update System Tables
DDL Statements in YugaByte DB PostgreSQL
DDL Postmaster
(Authentication, authorization)
Rewriter
Planner/Optimizer
Executor
Create sharded, replicated table as data source
Store Table Metadata in YB-Master (in the works)
YugaByte
master3
…
YugaByte
master2
YugaByte
master1
YugaByte Query Layer (YQL)
• Stateless, runs in each YB-TServer process
• GA goal: distributed, stateless PostgreSQL layer
• Current beta uses a single stateless PostgreSQL layer
HANDLING DML QUERIES
DML Queries in PostgreSQL
QUERY Postmaster
(Authentication, authorization)
Rewriter
Planner/Optimizer
Executor
WAL Writer BG Writer…
DISK
FDW
Local Table Code Path
EXTERNAL
DATABASE
DML Queries in YugaByte DB PostgreSQL
DML Postmaster
(Authentication, authorization)
Rewriter
Planner/Optimizer
Executor
FDW
YugaByte DB Code Path
YB Gateway
EXTERNAL
DATABASE
YugaByte
node3
YugaByte
node4
…
YugaByte
node2
YugaByte
node1
Using FDW as a
Table Storage API
ARCHITECTURE
Data Persistence
Data Persistence in DocDB
• DocDB is YugaByte DB’s LSM storage engine
• Persistent key-to-document store
• Extends and enhances RocksDB
• Designed to support high data-densities per node
DocDB: Key-to-Document Store
• Document key
CQL/SQL/Redis primary key
• Document value
a CQL or SQL row
Redis data structure
• Fine-grained reads and writes
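One way to picture a key-to-document store with fine-grained access is to flatten each row into per-column entries. The sketch below assumes a simple tuple layout `(doc_key, column, hybrid_time) -> value`; the real DocDB encoding is a binary format that this transcript does not preserve, so the layout and the name `encode_row` are illustrative assumptions only.

```python
from typing import Any, Dict, List, Tuple

# ((doc_key, column, hybrid_time), value); an assumed, simplified layout
Entry = Tuple[Tuple[str, str, int], Any]


def encode_row(doc_key: str, row: Dict[str, Any], hybrid_time: int) -> List[Entry]:
    """Flatten one row into per-column entries keyed by the document key."""
    return [((doc_key, column, hybrid_time), value)
            for column, value in sorted(row.items())]


if __name__ == "__main__":
    # Full-row insert: one entry per column.
    print(encode_row("k1", {"name": "Ada", "city": "London"}, 100))
    # Fine-grained update: touching one column writes one entry.
    print(encode_row("k1", {"city": "Paris"}, 200))
```

Because every column is its own entry, updating or reading a single attribute does not require rewriting or fetching the whole document.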
DocDB Data Format
[Figures not recoverable from this transcript: Example Insert; Encoding]
Some of the RocksDB enhancements
• WAL and MVCC enhancements
o Removed RocksDB WAL, re-uses Raft log
o MVCC at a higher layer
o Coordinate RocksDB memstore flushing and Raft log garbage collection
• File format changes
o Sharded (multi-level) indexes and Bloom filters
• Splitting data blocks & metadata into separate files for tiering support
• Separate queues for large and small compactions
More Enhancements to RocksDB
• Data model aware Bloom filters
• Per-SSTable key range metadata to optimize range queries
• Server-global block caches & memstore limits
• Scan-resistant block cache (single-touch and multi-touch)
ARCHITECTURE
Data Replication
Raft Replication for Consistency
How Raft Replication Works
Raft Related Enhancements
• Leader Leases
• Multiple Raft groups (1 per tablet)
• Leader Balancing
• Group Commits
• Observer Nodes / Read Replicas
ARCHITECTURE
Transactions
Single Shard Transactions
Example: INSERT INTO table (k, v) VALUES ('k1', 'v1')
Step 1: Lock Manager (in memory, on leader only): acquire a lock on x
Step 2: DocDB / RocksDB: read current value of x
Step 3: Raft log: submit a Raft operation for replication: Insert (k1, v1) at hybrid_time 100
Step 4: Raft Consensus Protocol: replicate to majority of tablet peers (tablet followers)
Step 5: Apply to RocksDB and release lock (k1, v1 @ht=100)
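The five steps of the single-shard write path (lock, read, submit to the Raft log, replicate to a majority, apply and unlock) can be sketched in a few lines. This is a toy model under stated assumptions: `Follower` and `TabletLeader` are hypothetical classes, a plain dictionary stands in for DocDB / RocksDB, and replication is a synchronous method call rather than a real Raft implementation.

```python
import threading
from typing import Dict, List, Tuple

LogEntry = Tuple[str, str, int]  # (key, value, hybrid_time)


class Follower:
    """Hypothetical tablet follower that acknowledges appended log entries."""

    def __init__(self, healthy: bool = True) -> None:
        self.healthy = healthy
        self.log: List[LogEntry] = []

    def append(self, entry: LogEntry) -> bool:
        if self.healthy:
            self.log.append(entry)
        return self.healthy


class TabletLeader:
    def __init__(self, followers: List[Follower]) -> None:
        self.followers = followers
        self.locks: Dict[str, threading.Lock] = {}   # in-memory lock manager
        self.store: Dict[str, Tuple[str, int]] = {}  # stand-in for DocDB / RocksDB
        self.raft_log: List[LogEntry] = []

    def insert(self, key: str, value: str, hybrid_time: int) -> bool:
        lock = self.locks.setdefault(key, threading.Lock())
        with lock:                                    # step 1: acquire lock
            _current = self.store.get(key)            # step 2: read current value
            entry = (key, value, hybrid_time)
            self.raft_log.append(entry)               # step 3: submit Raft operation
            acks = 1 + sum(f.append(entry) for f in self.followers)  # step 4
            if acks <= (1 + len(self.followers)) // 2:
                return False                          # no majority: not applied
            self.store[key] = (value, hybrid_time)    # step 5: apply
            return True                               # lock released on exit


if __name__ == "__main__":
    leader = TabletLeader([Follower(), Follower(healthy=False)])
    print(leader.insert("k1", "v1", 100), leader.store)
```

With RF=3 the write still succeeds when one follower is down, because the leader plus one follower form a majority. The sketch leaves a failed entry in the leader's log; a real Raft implementation handles that case through its log-repair rules.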
MVCC for Lockless Reads
• Achieved through HybridTime (HT)
Monotonically increasing timestamp
• Allows reads at a particular HT without locking
• Multiple versions may exist temporarily
Reclaim older values during compactions
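A minimal sketch of the MVCC idea: every write is tagged with a hybrid time, reads pick the newest version at or below their read time without taking locks, and compaction discards superseded versions. The class `VersionedStore` and its method names are illustrative assumptions; timestamps are plain integers rather than real HybridTime values.

```python
from typing import Dict, List, Optional, Tuple


class VersionedStore:
    """Keeps every version of a key tagged with its hybrid time (HT)."""

    def __init__(self) -> None:
        self.versions: Dict[str, List[Tuple[int, str]]] = {}

    def write(self, key: str, value: str, ht: int) -> None:
        self.versions.setdefault(key, []).append((ht, value))
        self.versions[key].sort()  # keep versions ordered by HT

    def read(self, key: str, read_ht: int) -> Optional[str]:
        """Return the newest value with HT <= read_ht; takes no locks."""
        visible = [v for ht, v in self.versions.get(key, []) if ht <= read_ht]
        return visible[-1] if visible else None

    def compact(self, history_cutoff_ht: int) -> None:
        """Keep the newest version at or below the cutoff plus all newer ones."""
        for key, vs in self.versions.items():
            old = [x for x in vs if x[0] <= history_cutoff_ht]
            new = [x for x in vs if x[0] > history_cutoff_ht]
            self.versions[key] = old[-1:] + new


if __name__ == "__main__":
    store = VersionedStore()
    store.write("k1", "v1", 100)
    store.write("k1", "v2", 200)
    print(store.read("k1", 150), store.read("k1", 250))
```

After `compact(200)` only the version written at HT 200 remains, so reads below the cutoff are no longer answerable; that is the trade-off behind reclaiming older values.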
Single Shard Transactions
• Each tablet maintains a “safe time” for reads
o Highest timestamp such that the view as of that timestamp is fixed
o In the common case it is just before the hybrid time of the next uncommitted record in the tablet
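The common case of the safe-time rule can be written as a one-line computation. The sketch assumes integer hybrid times; the behaviour when nothing is pending (falling back to the current hybrid time) is an assumption added for the example and is not stated on the slide.

```python
from typing import List


def safe_time(pending_record_hts: List[int], current_ht: int) -> int:
    """Highest hybrid time at which the tablet's view can no longer change."""
    if pending_record_hts:
        # Just before the earliest record that is not yet committed.
        return min(pending_record_hts) - 1
    # Assumption for this sketch: with nothing pending, use the current time.
    return current_ht


if __name__ == "__main__":
    print(safe_time([120, 105], current_ht=200))
    print(safe_time([], current_ht=200))
```

A read at or below the safe time sees a fixed snapshot, because no record with a lower hybrid time can still be committed into the tablet.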
Distributed Transactions
• Fully decentralized architecture
• Every tablet server can act as a Transaction Manager
• A distributed Transaction Status table
Tracks state of active transactions
• Transactions can have 3 states:
pending, committed, aborted
Distributed Transactions – Write Path
Distributed Transactions – Write Path Step 1: Client request
Distributed Transactions – Write Path Step 2: Create status record
Distributed Transactions – Write Path Step 3: Write provisional records
Distributed Transactions – Write Path Step 4: Atomic commit
Distributed Transactions – Write Path Step 5: Respond to client
Distributed Transactions – Write Path Step 6: Apply provisional records
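The six write-path steps can be condensed into a single-process sketch. Everything here is a simplification under stated assumptions: the three dictionaries stand in for the transaction status table, the provisional records, and the committed data that real tablets would hold; `run_transaction` is a hypothetical name; and a conflicting transaction simply aborts itself instead of applying the priority rule the deck describes later.

```python
import uuid
from typing import Dict, Tuple

status_table: Dict[str, str] = {}             # txn_id -> pending/committed/aborted
provisional: Dict[str, Tuple[str, str]] = {}  # key -> (txn_id, value)
committed: Dict[str, str] = {}                # key -> value


def run_transaction(writes: Dict[str, str]) -> str:
    txn_id = str(uuid.uuid4())                # step 1: client request arrives
    status_table[txn_id] = "pending"          # step 2: create status record
    for key, value in writes.items():         # step 3: write provisional records
        owner = provisional.get(key)
        if owner is not None and owner[0] != txn_id:
            # Write-write conflict: abort and drop our provisional records.
            status_table[txn_id] = "aborted"
            for k in [k for k, (t, _) in provisional.items() if t == txn_id]:
                del provisional[k]
            return "aborted"
        provisional[key] = (txn_id, value)
    status_table[txn_id] = "committed"        # step 4: atomic commit (one update)
    # Step 5: the client could be answered at this point.
    for key in writes:                        # step 6: apply provisional records
        committed[key] = provisional.pop(key)[1]
    return "committed"


if __name__ == "__main__":
    print(run_transaction({"a": "1", "b": "2"}), committed)
```

The commit point is the single status-record update in step 4, which is what makes a multi-key write atomic; applying the provisional records afterwards is cleanup that can happen after the client has its answer.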
Isolation Levels
• Currently Snapshot Isolation is supported
o Write-write conflicts detected when writing provisional records
• Serializable isolation (roadmap)
o Reads in RW txns also need provisional records
• Read-only transactions are always lock-free
Clock Skew and Read Restarts
• Need to ensure the read timestamp is high enough
o Committed records the client might have seen must be visible
• Optimistically use current Hybrid Time, re-read if necessary
o Reads are restarted if a record with a higher timestamp that the client could have seen is encountered
o Read restart happens at most once per tablet
o Relying on bounded clock skew (NTP, AWS Time Sync)
• Only affects multi-row reads of frequently updated records
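The read-restart rule can be sketched for a single key. The sketch assumes integer hybrid times and a fixed skew bound `MAX_CLOCK_SKEW`; both the constant and the function name `read_with_restart` are illustrative. A record whose timestamp falls in the window between the chosen read time and read time plus maximum skew might have been committed before the read began on another node's clock, so the read moves its timestamp up and tries again.

```python
from typing import List, Optional, Tuple

MAX_CLOCK_SKEW = 50  # assumed bound, in arbitrary hybrid-time units


def read_with_restart(versions: List[Tuple[int, str]],
                      read_ht: int) -> Tuple[Optional[str], int, int]:
    """Return (value, final_read_ht, restarts) for one key's version list."""
    global_limit = read_ht + MAX_CLOCK_SKEW  # fixed when the read starts
    restarts = 0
    while True:
        ambiguous = [ht for ht, _ in versions if read_ht < ht <= global_limit]
        if not ambiguous:
            break
        read_ht = max(ambiguous)  # re-read at the higher timestamp
        restarts += 1
    visible = [(ht, v) for ht, v in versions if ht <= read_ht]
    value = max(visible)[1] if visible else None
    return value, read_ht, restarts


if __name__ == "__main__":
    history = [(100, "a"), (130, "b"), (400, "c")]
    print(read_with_restart(history, 110))
    print(read_with_restart(history, 200))
```

Because the upper limit is fixed at the start, one restart is enough to clear the ambiguous window in this model, which mirrors the "at most once per tablet" bound on the slide.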
Distributed Transactions – Read Path
Distributed Transactions – Read Path Step 1: Client request; pick ht_read
Distributed Transactions – Read Path Step 2: Read from tablet servers
Distributed Transactions – Read Path Step 3: Resolve txn status
Distributed Transactions – Read Path Step 4: Respond to YQL Engine
Distributed Transactions – Read Path Step 5: Respond to client
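The status-resolution step of the read path (step 3) can be sketched as a small decision function. It assumes the status table maps a transaction ID to a `(state, commit_ht)` pair and that an unknown transaction is treated as aborted; both choices, and the name `resolve_read`, are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# txn_id -> (state, commit_ht); commit_ht is None unless the state is "committed"
StatusTable = Dict[str, Tuple[str, Optional[int]]]


def resolve_read(committed_value: Optional[str],
                 provisional: Optional[Tuple[str, str]],
                 status_table: StatusTable,
                 ht_read: int) -> Optional[str]:
    """Pick the value visible at ht_read for one key."""
    if provisional is None:
        return committed_value
    txn_id, value = provisional
    state, commit_ht = status_table.get(txn_id, ("aborted", None))
    if state == "committed" and commit_ht is not None and commit_ht <= ht_read:
        return value           # the writer committed before our read time
    return committed_value     # pending, aborted, or committed too late


if __name__ == "__main__":
    table: StatusTable = {"t1": ("committed", 100), "t2": ("pending", None)}
    print(resolve_read("old", ("t1", "new"), table, 150))
    print(resolve_read("old", ("t2", "new"), table, 150))
```

A provisional record becomes visible only once its transaction is committed with a commit time at or below `ht_read`; otherwise the reader falls back to the last committed value.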
Distributed Transactions – Conflicts & Retries
• Every transaction is assigned a random priority
• In a conflict, the higher-priority transaction wins
o The restarted transaction gets a new random priority
o Probability of success quickly increases with retries
• Restarting a transaction is the same as starting a new one
• A read-write transaction can be subject to read-restart
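The retry behaviour can be sketched with a loop that draws a fresh random priority on every attempt. The helper names `run_with_retries` and `beats`, and the use of a float in [0, 1) as the priority, are assumptions for illustration rather than YugaByte DB's actual representation.

```python
import random
from typing import Callable, Optional


def run_with_retries(attempt: Callable[[float], bool],
                     max_attempts: int = 10,
                     rng: Optional[random.Random] = None) -> int:
    """Call attempt(priority) until it succeeds; return the attempts used."""
    rng = rng or random.Random()
    for n in range(1, max_attempts + 1):
        priority = rng.random()  # each (re)start draws a fresh random priority
        if attempt(priority):
            return n
    raise RuntimeError("transaction did not succeed within max_attempts")


def beats(rival_priority: float) -> Callable[[float], bool]:
    """Hypothetical conflict rule: the higher-priority transaction wins."""
    return lambda priority: priority > rival_priority


if __name__ == "__main__":
    # Against a rival with priority 0.5, each attempt wins about half the time.
    print("attempts needed:", run_with_retries(beats(0.5), max_attempts=100))
```

If each attempt wins with probability p, the chance of still failing after n attempts is (1 - p)^n, which is why the probability of success rises quickly with retries.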
EXERCISE #3 and #4
SHARDING AND SCALE OUT
FAULT TOLERANCE
Questions?
Try it at
docs.yugabyte.com/latest/quick-start

Enterprise Software Development with No Code Solutions.pptxEnterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptx
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
 

How YugaByte DB Implements Distributed PostgreSQL

  • 1. Distributed PostgreSQL with YugaByte DB. Karthik Ranganathan, PostgresConf Silicon Valley, Oct 16, 2018. © 2018 All rights reserved.
  • 2. Check out this repo: github.com/YugaByte/yb-sql-workshop
  • 3. About Us. Founders: Kannan Muthukkaruppan, CEO (Nutanix ♦ Facebook ♦ Oracle; IIT-Madras, University of California-Berkeley); Karthik Ranganathan, CTO (Nutanix ♦ Facebook ♦ Microsoft; IIT-Madras, University of Texas-Austin); Mikhail Bautin, Software Architect (ClearStory Data ♦ Facebook ♦ D.E.Shaw; Nizhny Novgorod State University, Stony Brook). Founded Feb 2016. Apache HBase committers and early engineers on Apache Cassandra. Built Facebook’s NoSQL platform powered by Apache HBase. Scaled the platform to serve many mission-critical use cases: Facebook Messages (Messenger) and Operational Data Store (time series data). Reassembled the same Facebook team at YugaByte along with engineers from Oracle, Google, Nutanix and LinkedIn.
  • 4. Workshop Agenda: What is YugaByte DB? Why another DB?; Exercise 1: BI Tools on YugaByte PostgreSQL; Exercise 2: Distributed PostgreSQL Architecture; Exercise 3: Sharding and Scale Out in Action; Exercise 4: Fault Tolerance in Action.
  • 5. What is YugaByte DB?
  • 6. A transactional, planet-scale database for building high-performance cloud services.
  • 7. NoSQL + SQL; Cloud Native.
  • 8. Why another DB?
  • 9. Typical Stack Today: fragile infra with several moving parts. SQL for OLTP data: manual sharding (cost: dev team); manual replication and manual failover (cost: ops team). NoSQL for other data: app aware of data silo (cost: dev team). Cache for low latency: app does caching (cost: dev team). Data inconsistency/loss, fragile infra, hours of debugging (cost: dev + ops team). [Diagram labels: Datacenter 1, Datacenter 2, SQL Master, SQL Slave, Application Tier (Stateless Microservices).]
  • 10. Does AWS change this? Still complex: it’s the same architecture. [Diagram labels: Datacenter 1, Datacenter 2, SQL Master, SQL Slave, ElastiCache, Aurora, DynamoDB, Application Tier (Stateless Microservices).]
  • 11. System-of-Record DBs for Global Apps. [Comparison figure; labels: Not Portable, Open Source; High Performance, Transactional, Planet-Scale.]
  • 12. Design Principles. Transactional: single-shard and distributed ACID txns; document-based, strongly consistent storage. High performance: low latency, tunable reads; high throughput. Planet-scale: auto sharding and rebalancing; global data distribution. Cloud native: built for the container era; self-healing, fault-tolerant. Open source: Apache 2.0; popular APIs extended: Apache Cassandra, Redis and PostgreSQL (beta).
  • 13. Exercise #1: Business Intelligence
  • 14. Exercise #2: Distributed Postgres: Architecture
  • 15. Architecture: Overview
  • 16. YugaByte DB Process Overview: universe = cluster of nodes; two sets of processes: YB-Master and YB-TServer; example universe: 4 nodes, RF=3.
  • 17. Sharding data: user table split into tablets.
  • 18. One tablet for every key.
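The slides do not say how a key is mapped to its tablet. As a minimal sketch, assuming hash-based sharding over a fixed 16-bit hash space split evenly across tablets (the hash function, `HASH_SPACE`, and both function names are illustrative assumptions, not YugaByte DB's API), the "one tablet for every key" idea could look like this:

```python
import hashlib

# Assumption for illustration: keys are hashed into a 16-bit space and the
# space is divided into equal contiguous ranges, one per tablet.
HASH_SPACE = 0x10000


def tablet_for_hash(hash_value: int, num_tablets: int) -> int:
    """Map a hash in [0, HASH_SPACE) to a tablet index in [0, num_tablets)."""
    if num_tablets < 1:
        raise ValueError("num_tablets must be at least 1")
    return hash_value * num_tablets // HASH_SPACE


def tablet_for_key(key: str, num_tablets: int) -> int:
    """Hash the primary key and return the single tablet that owns it."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:2], "big")  # first 16 bits of the digest
    return tablet_for_hash(hash_value, num_tablets)
```

Because the mapping is a pure function of the key, every node that knows the number of tablets routes a given key to the same tablet.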
  • 19. Tablets and replication: tablet = set of tablet-peers in a Raft group; number of tablet-peers in a tablet = replication factor (RF). RF=3 tolerates 1 failure; RF=5 tolerates 2 failures.
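The RF figures follow from Raft's majority rule: a group stays available while a majority of its peers is up. A small worked sketch (function names are illustrative):

```python
def majority(rf: int) -> int:
    """Smallest number of tablet-peers that forms a Raft majority."""
    return rf // 2 + 1


def tolerated_failures(rf: int) -> int:
    """Peers that can fail while a majority remains available."""
    return rf - majority(rf)


for rf in (3, 5):
    print(f"RF={rf}: majority={majority(rf)}, tolerates {tolerated_failures(rf)} failure(s)")
```

An even RF adds a peer without adding tolerance (RF=4 still tolerates only 1 failure), which is why odd replication factors are the usual choice.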
  • 20. YB-TServer: process that does IO; hosts tablets for tables; hosts transaction manager; auto memory sizing (block cache, memstores).
  • 21. YB-Master: not in critical path. System metadata store: keyspaces, tables, tablets; users/roles, permissions. Admin operations: create/alter/drop of tables; backups; load balancing (leader and data balancing); enforces data placement policy.
  • 22. Handling DDL Statements
  • 23. DDL Statements in PostgreSQL: DDL → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor → disk. Create table data file; update system tables.
  • 24. DDL Statements in YugaByte DB PostgreSQL: DDL → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor. Create sharded, replicated table as data source; store table metadata in YB-Master (in the works). [Diagram labels: YugaByte master1, YugaByte master2, YugaByte master3, …]
  • 25. YugaByte Query Layer (YQL): stateless, runs in each YB-TServer process. GA goal: distributed stateless PostgreSQL layer. Current beta uses a single stateless PostgreSQL layer.
  • 26. Handling DML Queries
  • 27. DML Queries in PostgreSQL: query → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor. Local table code path: WAL writer, BG writer, … → disk. FDW code path: external database.
  • 28. DML Queries in YugaByte DB PostgreSQL: DML → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor → FDW (YugaByte DB code path) → YB Gateway → external database (YugaByte node1, YugaByte node2, YugaByte node3, YugaByte node4, …). Using FDW as a table storage API.
  • 29. Architecture: Data Persistence
  • 30. Data Persistence in DocDB: DocDB is YugaByte DB’s LSM storage engine; persistent key-to-document store; extends and enhances RocksDB; designed to support high data densities per node.
  • 31. DocDB: Key-to-Document Store. Document key: CQL/SQL/Redis primary key. Document value: a CQL or SQL row, or a Redis data structure. Fine-grained reads and writes.
  • 32. DocDB Data Format: example insert and its encoding.
  • 33. Some of the RocksDB enhancements. WAL and MVCC enhancements: removed RocksDB WAL, re-uses Raft log; MVCC at a higher layer; coordinate RocksDB memstore flushing and Raft log garbage collection. File format changes: sharded (multi-level) indexes and Bloom filters. Splitting data blocks and metadata into separate files for tiering support. Separate queues for large and small compactions.
  • 34. More Enhancements to RocksDB: data model aware Bloom filters; per-SSTable key range metadata to optimize range queries; server-global block caches and memstore limits; scan-resistant block cache (single-touch and multi-touch).
  • 35. Architecture: Data Replication
  • 36. Raft Replication for Consistency
  • 37. How Raft Replication Works
  • 38. How Raft Replication Works
  • 39. How Raft Replication Works
  • 40. How Raft Replication Works
  • 41. Raft Related Enhancements: leader leases; multiple Raft groups (1 per tablet); leader balancing; group commits; observer nodes / read replicas.
  • 42. Architecture: Transactions
  • 43. Single Shard Transactions (Raft consensus protocol). Example: INSERT INTO table (k, v) VALUES ('k1', 'v1'). Step 1: Lock Manager (in memory, on leader only) acquires a lock on the key. Step 2: read the current value of the key from DocDB / RocksDB. Step 3: submit a Raft operation for replication: Insert (k1, v1) at hybrid_time 100. Step 4: append to the Raft log and replicate to a majority of tablet peers (tablet followers). Step 5: apply to RocksDB and release the lock; the stored record is k1,v1 @ht=100.
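The five steps can be sketched as a toy, single-process model. Everything here (`Follower`, `TabletLeader`, the duplicate-key check, the in-memory dict standing in for RocksDB) is a hypothetical illustration of the flow on the slide, not YugaByte DB code:

```python
import threading


class Follower:
    """Toy tablet follower that acknowledges log entries while it is up."""

    def __init__(self, up=True):
        self.up = up
        self.log = []

    def append(self, entry):
        if not self.up:
            return False
        self.log.append(entry)
        return True


class TabletLeader:
    """Toy tablet leader following the five steps of the single-shard write."""

    def __init__(self, followers):
        self.followers = followers
        self.log = []           # leader's copy of the Raft log
        self.store = {}         # key -> list of (hybrid_time, value); stands in for RocksDB
        self.locks = {}         # in-memory lock manager, leader only
        self.hybrid_time = 99   # so the first write lands at hybrid_time 100

    def insert(self, key, value):
        lock = self.locks.setdefault(key, threading.Lock())
        with lock:                                    # step 1: acquire lock on the key
            if self.store.get(key):                   # step 2: read current value
                raise KeyError(f"duplicate key {key!r}")  # assumed INSERT semantics
            self.hybrid_time += 1
            entry = (self.hybrid_time, key, value)    # step 3: Raft operation
            self.log.append(entry)
            acks = 1 + sum(f.append(entry) for f in self.followers)  # step 4: replicate
            quorum = (len(self.followers) + 1) // 2 + 1
            if acks < quorum:
                raise RuntimeError("no majority; a real Raft leader would keep retrying")
            self.store.setdefault(key, []).append((entry[0], value))  # step 5: apply
            return entry[0]                           # lock released on leaving the block


leader = TabletLeader([Follower(), Follower(up=False)])
print(leader.insert("k1", "v1"))  # 100: the leader plus one follower form a majority of 3
```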
  • 44. MVCC for Lockless Reads: achieved through HybridTime (HT), a monotonically increasing timestamp; allows reads at a particular HT without locking; multiple versions may exist temporarily, and older values are reclaimed during compactions.
  • 45. Single Shard Transactions: each tablet maintains a “safe time” for reads. It is the highest timestamp such that the view as of that timestamp is fixed. In the common case it is just before the hybrid time of the next uncommitted record in the tablet.
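A minimal sketch of a lock-free read at a chosen hybrid time, with older versions reclaimed by a compaction step. `VersionedStore` and its `cutoff` parameter are hypothetical names; safe-time tracking is not modeled here:

```python
class VersionedStore:
    """Toy multi-version store: each key keeps (hybrid_time, value) versions."""

    def __init__(self):
        self.versions = {}  # key -> list of (hybrid_time, value), sorted by hybrid_time

    def write(self, key, value, ht):
        history = self.versions.setdefault(key, [])
        history.append((ht, value))
        history.sort(key=lambda version: version[0])

    def read(self, key, ht_read):
        """Return the newest value written at or before ht_read, without locking."""
        latest = None
        for ht, value in self.versions.get(key, []):
            if ht > ht_read:
                break
            latest = value
        return latest

    def compact(self, cutoff):
        """Drop versions that no read at or after `cutoff` can observe."""
        for key, history in self.versions.items():
            old = [v for v in history if v[0] <= cutoff]
            new = [v for v in history if v[0] > cutoff]
            self.versions[key] = old[-1:] + new  # keep only the newest old version


store = VersionedStore()
store.write("k1", "v1", 100)
store.write("k1", "v2", 200)
print(store.read("k1", 150))  # v1: the write at 200 is invisible to a read at 150
```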
  • 46. Distributed Transactions: fully decentralized architecture; every tablet server can act as a Transaction Manager; a distributed Transaction Status table tracks the state of active transactions; transactions can have 3 states: pending, committed, aborted.
  • 47. Distributed Transactions: Write Path
  • 48. Distributed Transactions: Write Path. Step 1: Client request
  • 49. Distributed Transactions: Write Path. Step 2: Create status record
  • 50. Distributed Transactions: Write Path. Step 2: Create status record
  • 51. Distributed Transactions: Write Path. Step 3: Write provisional records
  • 52. Distributed Transactions: Write Path. Step 4: Atomic commit
  • 53. Distributed Transactions: Write Path. Step 5: Respond to client
  • 54. Distributed Transactions: Write Path. Step 6: Apply provisional records
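The write path above can be sketched as a toy in-memory model. The `Cluster` class, its dictionaries, and the "later writer aborts" conflict rule are simplifying assumptions for illustration; the slides describe priority-based conflict resolution, and the real system replicates every one of these records through Raft:

```python
import uuid


class Cluster:
    """Toy model: a transaction status table plus per-tablet provisional records."""

    def __init__(self, tablet_names):
        self.status = {}  # txn_id -> "pending" | "committed" | "aborted"
        self.tablets = {name: {"committed": {}, "provisional": {}} for name in tablet_names}

    def begin(self):
        txn_id = str(uuid.uuid4())
        self.status[txn_id] = "pending"          # step 2: create status record
        return txn_id

    def _owned(self, txn_id):
        return [(tablet, key, value)
                for tablet in self.tablets.values()
                for key, (owner, value) in tablet["provisional"].items()
                if owner == txn_id]

    def write(self, txn_id, tablet_name, key, value):
        if self.status[txn_id] != "pending":
            return False
        provisional = self.tablets[tablet_name]["provisional"]
        owner, _ = provisional.get(key, (None, None))
        if owner is not None and owner != txn_id:
            self.abort(txn_id)                   # write-write conflict detected here
            return False
        provisional[key] = (txn_id, value)       # step 3: write provisional record
        return True

    def commit(self, txn_id):
        if self.status[txn_id] != "pending":
            return False
        self.status[txn_id] = "committed"        # step 4: one status update commits everything
        # step 5: the client can be answered at this point
        for tablet, key, value in self._owned(txn_id):   # step 6: apply provisional records
            tablet["committed"][key] = value
            del tablet["provisional"][key]
        return True

    def abort(self, txn_id):
        self.status[txn_id] = "aborted"
        for tablet, key, _ in self._owned(txn_id):
            del tablet["provisional"][key]
```

The commit is atomic because it is a single update to one status record; the provisional records on the individual tablets only become regular records afterwards.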
  • 55. Isolation Levels: Snapshot Isolation is currently supported; write-write conflicts are detected when writing provisional records. Serializable isolation is on the roadmap; reads in read-write txns also need provisional records. Read-only transactions are always lock-free.
  • 56. Clock Skew and Read Restarts: the read timestamp must be high enough that committed records the client might have seen are visible. Reads optimistically use the current Hybrid Time and re-read if necessary: a read is restarted if it encounters a record with a higher timestamp that the client could have seen. A read restart happens at most once per tablet. This relies on bounded clock skew (NTP, AWS Time Sync). Only multi-row reads of frequently updated records are affected.
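One way to picture the restart rule, as a sketch under stated assumptions: a record whose commit time falls in the window (ht_read, ht_read + max_clock_skew] may have been committed before the read began on another node's clock, so the read is retried at that record's timestamp. The function name, the tuple layout, and the exact window definition are illustrative, not taken from the slides:

```python
def read_with_restart(versions, ht_read, max_clock_skew):
    """Read one key from one tablet.

    versions: list of (commit_hybrid_time, value), sorted ascending.
    Returns (value, effective_read_time, restarted).
    """
    global_limit = ht_read + max_clock_skew   # fixed when the read starts
    ambiguous = [ht for ht, _ in versions if ht_read < ht <= global_limit]
    restarted = bool(ambiguous)
    if restarted:
        # Restart once at the highest ambiguous timestamp; nothing in the
        # window can lie above it, so a second restart is never needed.
        ht_read = max(ambiguous)
    visible = [value for ht, value in versions if ht <= ht_read]
    return (visible[-1] if visible else None), ht_read, restarted


history = [(100, "v1"), (205, "v2")]
print(read_with_restart(history, 200, 10))  # ('v2', 205, True)
print(read_with_restart(history, 200, 3))   # ('v1', 200, False)
```

Because the upper limit is fixed when the read starts, the restarted read cannot find another ambiguous record on the same tablet, which matches the "at most once per tablet" bound.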
  • 57. Distributed Transactions: Read Path
  • 58. Distributed Transactions: Read Path. Step 1: Client request; pick ht_read
  • 59. Distributed Transactions: Read Path. Step 2: Read from tablet servers
  • 60. Distributed Transactions: Read Path. Step 3: Resolve txn status
  • 61. Distributed Transactions: Read Path. Step 4: Respond to YQL Engine
  • 62. Distributed Transactions: Read Path. Step 5: Respond to client
  • 63. Distributed Transactions: Conflicts & Retries. Every transaction is assigned a random priority. In a conflict, the higher-priority transaction wins; the restarted transaction gets a new random priority, and the probability of success quickly increases with retries. Restarting a transaction is the same as starting a new one. A read-write transaction can be subject to read-restart.
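A small sketch of priority-based conflict resolution with retries. The function names, the uniform priority distribution, and the assumption that the competing transaction keeps a fixed priority are all illustrative:

```python
import random


def winner(txn_a, txn_b):
    """Each txn is a (name, priority) pair; the higher priority wins the conflict."""
    return txn_a if txn_a[1] > txn_b[1] else txn_b


def attempts_until_win(incumbent_priority, rng, max_attempts=20):
    """Retry with a fresh random priority until it beats the incumbent."""
    for attempt in range(1, max_attempts + 1):
        priority = rng.random()       # new random priority on every restart
        if priority > incumbent_priority:
            return attempt
    return None                       # gave up after max_attempts


def success_within(attempts, p_lose):
    """Chance of winning at least once if each attempt loses with probability p_lose."""
    return 1.0 - p_lose ** attempts


print(success_within(1, 0.5), success_within(3, 0.5))  # 0.5 0.875
```

With an even chance of losing each conflict, three attempts already succeed 87.5% of the time, which illustrates why the success probability rises quickly with retries.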
  • 64. Exercises #3 and #4: Sharding and Scale Out; Fault Tolerance
  • 65. Questions? Try it at docs.yugabyte.com/latest/quick-start