Distributed PostgreSQL with YugaByte DB
Karthik Ranganathan
PostgresConf Silicon Valley
Oct 16, 2018
CHECK OUT THIS REPO:
github.com/YugaByte/yb-sql-workshop
About Us

Founders
Kannan Muthukkaruppan, CEO
Nutanix ♦ Facebook ♦ Oracle
IIT-Madras, University of California-Berkeley

Karthik Ranganathan, CTO
Nutanix ♦ Facebook ♦ Microsoft
IIT-Madras, University of Texas-Austin

Mikhail Bautin, Software Architect
ClearStory Data ♦ Facebook ♦ D.E. Shaw
Nizhny Novgorod State University, Stony Brook

• Founded Feb 2016
• Apache HBase committers and early engineers on Apache Cassandra
• Built Facebook’s NoSQL platform powered by Apache HBase
• Scaled the platform to serve many mission-critical use cases
  o Facebook Messages (Messenger)
  o Operational Data Store (time series data)
• Reassembled the same Facebook team at YugaByte, along with engineers from Oracle, Google, Nutanix and LinkedIn
WORKSHOP AGENDA
• What is YugaByte DB? Why Another DB?
• Exercise 1: BI Tools on YugaByte PostgreSQL
• Exercise 2: Distributed PostgreSQL Architecture
• Exercise 3: Sharding and Scale Out in Action
• Exercise 4: Fault Tolerance in Action
WHAT IS YUGABYTE DB?
A transactional, planet-scale database for building high-performance cloud services.
NoSQL + SQL, Cloud Native
WHY ANOTHER DB?
Typical Stack Today
Fragile infra with several moving parts
(Diagram: Application Tier of stateless microservices over Datacenter 1 and Datacenter 2, with an SQL Master and an SQL Slave)
• SQL for OLTP data
  o Manual sharding (cost: dev team)
  o Manual replication and manual failover (cost: ops team)
• NoSQL for other data
  o App aware of data silo (cost: dev team)
• Cache for low latency
  o App does caching (cost: dev team)
• Data inconsistency/loss, fragile infra, hours of debugging (cost: dev + ops team)
Does AWS change this?
(Diagram: Application Tier of stateless microservices over Datacenter 1 and Datacenter 2, with SQL Master/Slave, Elasticache, Aurora and DynamoDB)
Still complex: it’s the same architecture
System-of-Record DBs for Global Apps
(Diagram comparing existing high-performance, transactional, planet-scale databases, each labeled either Not Portable or Open Source)
Design Principles
• TRANSACTIONAL
  o Single Shard & Distributed ACID Txns
  o Document-Based, Strongly Consistent Storage
• HIGH PERFORMANCE
  o Low Latency, Tunable Reads
  o High Throughput
• PLANET-SCALE
  o Auto Sharding & Rebalancing
  o Global Data Distribution
• CLOUD NATIVE
  o Built For The Container Era
  o Self-Healing, Fault-Tolerant
• OPEN SOURCE
  o Apache 2.0
  o Popular APIs Extended: Apache Cassandra, Redis and PostgreSQL (BETA)
EXERCISE #1
BUSINESS INTELLIGENCE
EXERCISE #2
DISTRIBUTED POSTGRES:
ARCHITECTURE
ARCHITECTURE
Overview
YugaByte DB Process Overview
• Universe = cluster of nodes
• Two sets of processes: YB-Master & YB-TServer
• Example universe: 4 nodes, RF=3
Sharding data
• User table split into tablets
Each key maps to exactly one tablet
Tablets and replication
• Tablet = set of tablet-peers in a Raft group
• Number of tablet-peers per tablet = replication factor (RF)
  o Tolerate 1 failure: RF=3
  o Tolerate 2 failures: RF=5
YB-TServer
• Process that does IO
• Hosts tablets for tables
• Hosts the transaction manager
• Auto memory sizing
  o Block cache
  o Memstores
YB-Master
• Not in the critical path
• System metadata store
  o Keyspaces, tables, tablets
  o Users/roles, permissions
• Admin operations
  o Create/alter/drop of tables
  o Backups
  o Load balancing (leader and data balancing)
  o Enforces data placement policy
HANDLING DDL STATEMENTS
DDL Statements in PostgreSQL
DDL → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor
Executor → DISK: create table data file, update system tables
DDL Statements in YugaByte DB PostgreSQL
DDL → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor
Executor → creates a sharded, replicated table as the data source
Table metadata stored in YB-Master (master1, master2, master3, …) (in the works)
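To make the flow concrete, here is a minimal sketch (the table and columns are hypothetical, not from the workshop material) of ordinary PostgreSQL DDL issued against the YugaByte DB PostgreSQL layer; behind the scenes it becomes a sharded, replicated DocDB table rather than a local heap file:

-- Hypothetical example: plain PostgreSQL DDL through the YugaByte DB layer.
-- The executor creates a sharded, replicated table in DocDB as the data source.
CREATE TABLE users (
    user_id    BIGINT PRIMARY KEY,  -- rows are distributed across tablets by primary key
    email      TEXT,
    created_at TIMESTAMP
);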
YugaByte Query Layer (YQL)
• Stateless, runs in each YB-TServer process
• GA goal: a distributed, stateless PostgreSQL layer
• Current beta uses a single stateless PostgreSQL layer
HANDLING DML QUERIES
DML Queries in PostgreSQL
Query → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor
Local table code path: Executor → WAL Writer, BG Writer, … → DISK
Foreign table code path: Executor → FDW → EXTERNAL DATABASE
DML Queries in YugaByte DB PostgreSQL
DML → Postmaster (authentication, authorization) → Rewriter → Planner/Optimizer → Executor
YugaByte DB code path: Executor → FDW (used as a table storage API) → YB Gateway → YugaByte DB nodes (node1, node2, node3, node4, …)
The FDW path can also still reach an EXTERNAL DATABASE, as in stock PostgreSQL
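For illustration, here is a minimal sketch of ordinary DML (table and values carried over from the earlier hypothetical example, not from the deck) that the executor routes through the FDW-based YugaByte DB code path to the YB Gateway rather than the local table code path:

-- Hypothetical DML against the users table created above.
-- In YugaByte DB PostgreSQL this flows Executor → FDW → YB Gateway → YugaByte DB nodes,
-- not through the local WAL writer / data file path.
INSERT INTO users (user_id, email, created_at)
VALUES (1, 'alice@example.com', now());

SELECT email
FROM users
WHERE user_id = 1;  -- single-key read served by the tablet that owns this key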
ARCHITECTURE
Data Persistence
Data Persistence in DocDB
• DocDB is YugaByte DB’s LSM storage engine
• Persistent key-to-document store
• Extends and enhances RocksDB
• Designed to support high data densities per node
DocDB: Key-to-Document Store
• Document key
  o CQL/SQL/Redis primary key
• Document value
  o A CQL or SQL row
  o A Redis data structure
• Fine-grained reads and writes
DocDB Data Format
(Diagram: an example insert and its key-value encoding)
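Since the encoding diagram itself is not reproduced here, the following is only a schematic sketch of the idea (table, values and byte layout are assumptions): the primary key becomes the document key, and each non-key column becomes a separate sub-key entry versioned with a hybrid timestamp:

-- Hypothetical insert and a schematic (not byte-accurate) view of its DocDB encoding.
INSERT INTO msgs (user_id, msg_id, msg) VALUES ('user1', 10, 'hello');

-- Document key = primary key; one entry per column, versioned at hybrid time T1:
--   (hash('user1'), 'user1', 10), liveness_column, T1  ->  [null]
--   (hash('user1'), 'user1', 10), column('msg'),   T1  ->  'hello'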
Some of the RocksDB enhancements
• WAL and MVCC enhancements
  o Removed the RocksDB WAL; reuses the Raft log
  o MVCC at a higher layer
  o Coordinate RocksDB memstore flushing and Raft log garbage collection
• File format changes
  o Sharded (multi-level) indexes and Bloom filters
• Splitting data blocks & metadata into separate files for tiering support
• Separate queues for large and small compactions
More Enhancements to RocksDB
• Data model aware Bloom filters
• Per-SSTable key range metadata to optimize range queries
• Server-global block caches & memstore limits
• Scan-resistant block cache (single-touch and multi-touch)
ARCHITECTURE
Data Replication
Raft Replication for Consistency

How Raft Replication Works
(Diagram sequence)
Raft Related Enhancements
• Leader Leases
• Multiple Raft groups (1 per tablet)
• Leader Balancing
• Group Commits
• Observer Nodes / Read Replicas
ARCHITECTURE
Transactions
Single Shard Transactions
Example: INSERT INTO table (k, v) VALUES (‘k1’, ‘v1’)
1. Acquire a lock on the key in the Lock Manager (in memory, on the leader only)
2. Read the current value of the key from DocDB / RocksDB
3. Submit a Raft operation for replication: insert (k1, v1) at hybrid_time 100
4. Replicate the Raft log entry to a majority of tablet peers (tablet followers)
5. Apply to RocksDB (k1, v1 @ ht=100) and release the lock
MVCC for Lockless Reads
• Achieved through HybridTime (HT)
  o Monotonically increasing timestamp
• Allows reads at a particular HT without locking
• Multiple versions may exist temporarily
  o Older values are reclaimed during compactions
Single Shard Transactions
• Each tablet maintains a “safe time” for reads
  o Highest timestamp such that the view as of that timestamp is fixed
  o In the common case, it is just before the hybrid time of the next uncommitted record in the tablet
Distributed Transactions
• Fully decentralized architecture
• Every tablet server can act as a Transaction Manager
• A distributed Transaction Status table
  o Tracks the state of active transactions
• Transactions can have 3 states: pending, committed, aborted
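As a minimal illustration (the accounts table and values are hypothetical, not from the deck), a multi-row transaction like the one below may touch keys owned by different tablets, so it takes the distributed transaction path, using a status record and provisional records as shown in the write-path steps that follow:

-- Hypothetical distributed transaction: the two rows may live on different tablets,
-- so the updates are written as provisional records and committed atomically
-- via the transaction status table.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;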
Distributed Transactions – Write Path
Step 1: Client request
Step 2: Create status record
Step 3: Write provisional records
Step 4: Atomic commit
Step 5: Respond to client
Step 6: Apply provisional records
Isolation Levels
• Currently Snapshot Isolation is supported
o Write-write conflicts detected when writing provisional records
• Serializable isolation (roadmap)
o Reads in RW txns also need provisional records
• Read-only transactions are always lock-free
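In PostgreSQL terms, snapshot isolation corresponds to the REPEATABLE READ level, so a snapshot-isolated transaction can be requested as sketched below (whether the current beta exposes exactly this mapping is an assumption; the accounts table is hypothetical):

-- Sketch: requesting snapshot isolation with standard PostgreSQL syntax.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- write-write conflicts are detected when the provisional records are written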
Clock Skew and Read Restarts
• Need to ensure the read timestamp is high enough
  o Committed records the client might have seen must be visible
• Optimistically use the current Hybrid Time, re-read if necessary
  o Reads are restarted if a record with a higher timestamp than the client could have seen is encountered
  o Read restart happens at most once per tablet
  o Relies on bounded clock skew (NTP, AWS Time Sync)
• Only affects multi-row reads of frequently updated records
Distributed Transactions – Read Path
Step 1: Client request; pick ht_read
Step 2: Read from tablet servers
Step 3: Resolve txn status
Step 4: Respond to YQL Engine
Step 5: Respond to client
Distributed Transactions – Conflicts & Retries
• Every transaction is assigned a random priority
• In a conflict, the higher-priority transaction wins
o The restarted transaction gets a new random priority
o Probability of success quickly increases with retries
• Restarting a transaction is the same as starting a new one
• A read-write transaction can be subject to read-restart
EXERCISE #3 and #4
SHARDING AND SCALE OUT
FAULT TOLERANCE
Questions?
Try it at
docs.yugabyte.com/latest/quick-start
