SlideShare a Scribd company logo
Multi-master replication for
Postgres
K. Knizhnik, C. Pan, S. Kelvich
Design objectives
Implementation/internals
Tests
Configuration
Roadmap
Contents
2
Design objectives
3
We want:
Fault-tolerance in easy way
OLTP-style load
Compatibility with standalone postgres
Possibility to reuse as metadata storage for sharded cluster
Design objectives
4
Replication:
Identical replicated data on all nodes
Possibility to have local tables
Writes allowed to any node
=> Easy to use
=> We need to take care about proper isolation
Design objectives
5
Transaction manager. We want:
Avoid single point of failure.
+: Spanner, Cockroach, Clock-SI
—: Pg-XL, ...
Avoid network communication for Read-Only transactions
+: HANA, Spanner, Cockroach, Clock-SI
—: Pg-XL, ...
Design objectives
6
Fault tolerance.
Paxos. Distributed consensus, low level.
Raft. Complete state-machine replication solution with failure
detector on timeouts and autorecovery. But all writes are
proxied to one node.
2PC. Blocks in case of node and coordinator failure. Postgres
already support 2pc.
3PC-like. Extra message between "P"and "C". 3PC, Paxos
commit, E3PC.
Design objectives
7
Summary.
No performance penalty for reads.
Tx can be issued to any node.
No special actions required in case of failure.
Design objectives
8
github.com/postgrespro/postgres_cluster
Patched version of Postgres 9.6
Transaction Manager API + Deadlock detection API.
Logical decoding of 2PC transactions.
Mmts extension.
Transaction Manager implementation (Clock-SI)
Logical replication protocol/client
Hooks on transaction commit and transforms it into 2PC.
Bunch of bgworkers.
Implementation
9
Mmts uses logical replication/decoding.
In-core support and extension by 2ndQuadrant.
Very flexible:
Can skip tables
Replication between different versions
Logical messages
Implementation
10
BE – backend, WS – Walsender, Arb – Arbiter, WR – Walreceiver
Implementation
11
Transaction Manager.
Clock-SI algorithm (MS research)
Make use of CSN instead of running lists. (we track xid-csn
correspondence in extension, but there is ongoing work to have
CSN in-core by Heikki and Alexander)
Implementation
12
DDL replication.
Statement-based.
Happily, postgres support 2PC for almost all DDL (alter enum
already fixed in -master)
CREATE TABLE AS, CREATE MATVIEW, etc – tricky, mixes
DDL and DML.
Temp tables are tricky – shouldn’t be replicted.
Depends on environment (search_path, auth, etc.)
Implementation
13
Postgres compatibility.
almost FULLY compatible with pg.
162 of 166 regressions tests pass as is.
1 test is using prepared statement inside CREATE TABLE AS
(CTA).
3 tests are using CTA(CTA(TEMP TABLE)).
Some obvious way to abuse statement based replication, e.g.
write function that create table with name based on current
timestamp.
Also sequences can add pain.
Implementation
14
Automatic recovery: normal work
Implementation
15
Automatic recovery: network split
Implementation
16
Automatic recovery: recovery process
Implementation
17
Automatic recovery: normal work again
Implementation
18
Not that hard:
Install mmts extension
Postgres:
max_prepared_transactions
wal_level = logical
max_worker_processes, max_replication_slots,
max_wal_senders
shared_preload_libraries = ’multimaster’
Multimaster extension:
multimaster.node_id = ...
multimaster.conn_strings = ’...’
Configuration
19
We want:
Test cluster liveness against network problems, restarts,
timeshifts, etc.
Sound like Jepsen. But unfortunately it uses ssh on precreated
vm’s/servers. That’s okay for single test, but painful for CI.
No sane way of testing network split with processes, i.e.
postgres TAP test framework is not helpful with that.
Testing
20
So we are using python unittest with docker.
3-5 containers is _way_ faster to start than vm’s.
takes 10 seconds to compile mmts extension, init and start
cluster.
failure injection via docker.exec (iptables, shift time, etc).
compatible with Travis-CI.
Testing
21
Testing itself: attach clients to each node of cluster and start
abusing nodes.
Client: bank-like test case. Transfer money between accounts
with concurrent total balance calculation.
Testing
22
Failures injected:
Node stop-start
Node kill-start
Node in network partition
Edge network split (a.k.a. majority rings)
Shift time
Change clock speed on nodes with libfaketime *
* – not yet implemented.
Testing
23
Performance.
Read-only tx speed is the same as in standalone postgres.
Commit takes more time (two net roundtrips).
Logical decoding slows down big transactions – but that
should be fixed, patch on commitfest.
Testing
24
Release a public beta
Try to commit twophase decoding patch to pg
Try to commit transation manager patch to pg
Raise discussion about replication/decoding of catalog content
Roadmap
25

More Related Content

What's hot

Training Slides: Basics 104: Simple Tungsten Clustering Deployments
Training Slides: Basics 104: Simple Tungsten Clustering DeploymentsTraining Slides: Basics 104: Simple Tungsten Clustering Deployments
Training Slides: Basics 104: Simple Tungsten Clustering Deployments
Continuent
 
Dw tpain - Gordon Klok
Dw tpain - Gordon KlokDw tpain - Gordon Klok
Dw tpain - Gordon Klok
Devopsdays
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
ScyllaDB
 
Streaming huge databases using logical decoding
Streaming huge databases using logical decodingStreaming huge databases using logical decoding
Streaming huge databases using logical decoding
Alexander Shulgin
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems
confluent
 
Object Compaction in Cloud for High Yield
Object Compaction in Cloud for High YieldObject Compaction in Cloud for High Yield
Object Compaction in Cloud for High Yield
ScyllaDB
 
Webinar Slides: Migrating to Galera Cluster
Webinar Slides: Migrating to Galera ClusterWebinar Slides: Migrating to Galera Cluster
Webinar Slides: Migrating to Galera Cluster
Severalnines
 
Percona XtraDB 集群安装与配置
Percona XtraDB 集群安装与配置Percona XtraDB 集群安装与配置
Percona XtraDB 集群安装与配置
YUCHENG HU
 
On The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL ClusterOn The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL Cluster
Srihari Sriraman
 
Demystifying postgres logical replication percona live sc
Demystifying postgres logical replication percona live scDemystifying postgres logical replication percona live sc
Demystifying postgres logical replication percona live sc
Emanuel Calvo
 
Building Spark as Service in Cloud
Building Spark as Service in CloudBuilding Spark as Service in Cloud
Building Spark as Service in Cloud
InMobi Technology
 
Whoops! I Rewrote It in Rust
Whoops! I Rewrote It in RustWhoops! I Rewrote It in Rust
Whoops! I Rewrote It in Rust
ScyllaDB
 
Streaming Replication Made Easy in v9.3
Streaming Replication Made Easy in v9.3Streaming Replication Made Easy in v9.3
Streaming Replication Made Easy in v9.3
Sameer Kumar
 
Continuous Go Profiling & Observability
Continuous Go Profiling & ObservabilityContinuous Go Profiling & Observability
Continuous Go Profiling & Observability
ScyllaDB
 
Data Structures for High Resolution, Real-time Telemetry at Scale
Data Structures for High Resolution, Real-time Telemetry at ScaleData Structures for High Resolution, Real-time Telemetry at Scale
Data Structures for High Resolution, Real-time Telemetry at Scale
ScyllaDB
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Altinity Ltd
 
HBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBase
HBaseCon
 
Tips and Tricks for Operating Apache Kafka
Tips and Tricks for Operating Apache KafkaTips and Tricks for Operating Apache Kafka
Tips and Tricks for Operating Apache Kafka
All Things Open
 
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
Tier1 App
 
Shenandoah GC: Java Without The Garbage Collection Hiccups (Christine Flood)
Shenandoah GC: Java Without The Garbage Collection Hiccups (Christine Flood)Shenandoah GC: Java Without The Garbage Collection Hiccups (Christine Flood)
Shenandoah GC: Java Without The Garbage Collection Hiccups (Christine Flood)
Red Hat Developers
 

What's hot (20)

Training Slides: Basics 104: Simple Tungsten Clustering Deployments
Training Slides: Basics 104: Simple Tungsten Clustering DeploymentsTraining Slides: Basics 104: Simple Tungsten Clustering Deployments
Training Slides: Basics 104: Simple Tungsten Clustering Deployments
 
Dw tpain - Gordon Klok
Dw tpain - Gordon KlokDw tpain - Gordon Klok
Dw tpain - Gordon Klok
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
Streaming huge databases using logical decoding
Streaming huge databases using logical decodingStreaming huge databases using logical decoding
Streaming huge databases using logical decoding
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems
 
Object Compaction in Cloud for High Yield
Object Compaction in Cloud for High YieldObject Compaction in Cloud for High Yield
Object Compaction in Cloud for High Yield
 
Webinar Slides: Migrating to Galera Cluster
Webinar Slides: Migrating to Galera ClusterWebinar Slides: Migrating to Galera Cluster
Webinar Slides: Migrating to Galera Cluster
 
Percona XtraDB 集群安装与配置
Percona XtraDB 集群安装与配置Percona XtraDB 集群安装与配置
Percona XtraDB 集群安装与配置
 
On The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL ClusterOn The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL Cluster
 
Demystifying postgres logical replication percona live sc
Demystifying postgres logical replication percona live scDemystifying postgres logical replication percona live sc
Demystifying postgres logical replication percona live sc
 
Building Spark as Service in Cloud
Building Spark as Service in CloudBuilding Spark as Service in Cloud
Building Spark as Service in Cloud
 
Whoops! I Rewrote It in Rust
Whoops! I Rewrote It in RustWhoops! I Rewrote It in Rust
Whoops! I Rewrote It in Rust
 
Streaming Replication Made Easy in v9.3
Streaming Replication Made Easy in v9.3Streaming Replication Made Easy in v9.3
Streaming Replication Made Easy in v9.3
 
Continuous Go Profiling & Observability
Continuous Go Profiling & ObservabilityContinuous Go Profiling & Observability
Continuous Go Profiling & Observability
 
Data Structures for High Resolution, Real-time Telemetry at Scale
Data Structures for High Resolution, Real-time Telemetry at ScaleData Structures for High Resolution, Real-time Telemetry at Scale
Data Structures for High Resolution, Real-time Telemetry at Scale
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
 
HBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBase
 
Tips and Tricks for Operating Apache Kafka
Tips and Tricks for Operating Apache KafkaTips and Tricks for Operating Apache Kafka
Tips and Tricks for Operating Apache Kafka
 
Am I reading GC logs Correctly?
Am I reading GC logs Correctly?Am I reading GC logs Correctly?
Am I reading GC logs Correctly?
 
Shenandoah GC: Java Without The Garbage Collection Hiccups (Christine Flood)
Shenandoah GC: Java Without The Garbage Collection Hiccups (Christine Flood)Shenandoah GC: Java Without The Garbage Collection Hiccups (Christine Flood)
Shenandoah GC: Java Without The Garbage Collection Hiccups (Christine Flood)
 

Viewers also liked

Managing thousands of databases
Managing thousands of databasesManaging thousands of databases
Managing thousands of databases
Emre Hasegeli
 
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
Masao Fujii
 
Enterprise PostgreSQL - EDB's answer to conventional Databases
Enterprise PostgreSQL - EDB's answer to conventional DatabasesEnterprise PostgreSQL - EDB's answer to conventional Databases
Enterprise PostgreSQL - EDB's answer to conventional Databases
Ashnikbiz
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practice
Alexey Lesovsky
 
Teaching PostgreSQL to new people
Teaching PostgreSQL to new peopleTeaching PostgreSQL to new people
Teaching PostgreSQL to new people
Tomek Borek
 
Flexible Indexing with Postgres
Flexible Indexing with PostgresFlexible Indexing with Postgres
Flexible Indexing with Postgres
EDB
 
Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0
EDB
 
Postgres-XC as a Key Value Store Compared To MongoDB
Postgres-XC as a Key Value Store Compared To MongoDBPostgres-XC as a Key Value Store Compared To MongoDB
Postgres-XC as a Key Value Store Compared To MongoDB
Mason Sharp
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer Works
EDB
 
Keepalived & HA-Proxy as an alternative to commercial loadbalancer - August 2014
Keepalived & HA-Proxy as an alternative to commercial loadbalancer - August 2014Keepalived & HA-Proxy as an alternative to commercial loadbalancer - August 2014
Keepalived & HA-Proxy as an alternative to commercial loadbalancer - August 2014
inovex GmbH
 
Postgres-XC: Symmetric PostgreSQL Cluster
Postgres-XC: Symmetric PostgreSQL ClusterPostgres-XC: Symmetric PostgreSQL Cluster
Postgres-XC: Symmetric PostgreSQL Cluster
Pavan Deolasee
 
Developing PostgreSQL Performance, Simon Riggs
Developing PostgreSQL Performance, Simon RiggsDeveloping PostgreSQL Performance, Simon Riggs
Developing PostgreSQL Performance, Simon Riggs
Fuenteovejuna
 
X-DB Replication Server and MMR
X-DB Replication Server and MMRX-DB Replication Server and MMR
X-DB Replication Server and MMR
Ashnikbiz
 
Gbroccolo pgconfeu2016 pgnfs
Gbroccolo pgconfeu2016 pgnfsGbroccolo pgconfeu2016 pgnfs
Gbroccolo pgconfeu2016 pgnfs
Giuseppe Broccolo
 
PostgresOpen 2013 A Comparison of PostgreSQL Encryption Options
PostgresOpen 2013 A Comparison of PostgreSQL Encryption OptionsPostgresOpen 2013 A Comparison of PostgreSQL Encryption Options
PostgresOpen 2013 A Comparison of PostgreSQL Encryption Options
Faisal Akber
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
PivotalOpenSourceHub
 
Как сделать высоконагруженный сервис, не зная количество нагрузки / Олег Обле...
Как сделать высоконагруженный сервис, не зная количество нагрузки / Олег Обле...Как сделать высоконагруженный сервис, не зная количество нагрузки / Олег Обле...
Как сделать высоконагруженный сервис, не зная количество нагрузки / Олег Обле...
Ontico
 
PostgreSQL replication from setup to advanced features.
 PostgreSQL replication from setup to advanced features. PostgreSQL replication from setup to advanced features.
PostgreSQL replication from setup to advanced features.
Pivorak MeetUp
 
Logical replication with pglogical
Logical replication with pglogicalLogical replication with pglogical
Logical replication with pglogical
Umair Shahid
 

Viewers also liked (20)

Managing thousands of databases
Managing thousands of databasesManaging thousands of databases
Managing thousands of databases
 
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
Streaming Replication (Keynote @ PostgreSQL Conference 2009 Japan)
 
Enterprise PostgreSQL - EDB's answer to conventional Databases
Enterprise PostgreSQL - EDB's answer to conventional DatabasesEnterprise PostgreSQL - EDB's answer to conventional Databases
Enterprise PostgreSQL - EDB's answer to conventional Databases
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practice
 
Teaching PostgreSQL to new people
Teaching PostgreSQL to new peopleTeaching PostgreSQL to new people
Teaching PostgreSQL to new people
 
Flexible Indexing with Postgres
Flexible Indexing with PostgresFlexible Indexing with Postgres
Flexible Indexing with Postgres
 
Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0
 
Postgres-XC as a Key Value Store Compared To MongoDB
Postgres-XC as a Key Value Store Compared To MongoDBPostgres-XC as a Key Value Store Compared To MongoDB
Postgres-XC as a Key Value Store Compared To MongoDB
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer Works
 
1
11
1
 
Keepalived & HA-Proxy as an alternative to commercial loadbalancer - August 2014
Keepalived & HA-Proxy as an alternative to commercial loadbalancer - August 2014Keepalived & HA-Proxy as an alternative to commercial loadbalancer - August 2014
Keepalived & HA-Proxy as an alternative to commercial loadbalancer - August 2014
 
Postgres-XC: Symmetric PostgreSQL Cluster
Postgres-XC: Symmetric PostgreSQL ClusterPostgres-XC: Symmetric PostgreSQL Cluster
Postgres-XC: Symmetric PostgreSQL Cluster
 
Developing PostgreSQL Performance, Simon Riggs
Developing PostgreSQL Performance, Simon RiggsDeveloping PostgreSQL Performance, Simon Riggs
Developing PostgreSQL Performance, Simon Riggs
 
X-DB Replication Server and MMR
X-DB Replication Server and MMRX-DB Replication Server and MMR
X-DB Replication Server and MMR
 
Gbroccolo pgconfeu2016 pgnfs
Gbroccolo pgconfeu2016 pgnfsGbroccolo pgconfeu2016 pgnfs
Gbroccolo pgconfeu2016 pgnfs
 
PostgresOpen 2013 A Comparison of PostgreSQL Encryption Options
PostgresOpen 2013 A Comparison of PostgreSQL Encryption OptionsPostgresOpen 2013 A Comparison of PostgreSQL Encryption Options
PostgresOpen 2013 A Comparison of PostgreSQL Encryption Options
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
Как сделать высоконагруженный сервис, не зная количество нагрузки / Олег Обле...
Как сделать высоконагруженный сервис, не зная количество нагрузки / Олег Обле...Как сделать высоконагруженный сервис, не зная количество нагрузки / Олег Обле...
Как сделать высоконагруженный сервис, не зная количество нагрузки / Олег Обле...
 
PostgreSQL replication from setup to advanced features.
 PostgreSQL replication from setup to advanced features. PostgreSQL replication from setup to advanced features.
PostgreSQL replication from setup to advanced features.
 
Logical replication with pglogical
Logical replication with pglogicalLogical replication with pglogical
Logical replication with pglogical
 

Similar to Multimaster

Introduction to Galera Cluster
Introduction to Galera ClusterIntroduction to Galera Cluster
Introduction to Galera Cluster
Codership Oy - Creators of Galera Cluster
 
10 Multicore 07
10 Multicore 0710 Multicore 07
10 Multicore 07
timcrack
 
Introduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerIntroduction to LAVA Workload Scheduler
Introduction to LAVA Workload Scheduler
Nopparat Nopkuat
 
Deep Dive on Amazon EC2 Instances (March 2017)
Deep Dive on Amazon EC2 Instances (March 2017)Deep Dive on Amazon EC2 Instances (March 2017)
Deep Dive on Amazon EC2 Instances (March 2017)
Julien SIMON
 
Modern processors
Modern processorsModern processors
Modern processors
gowrivageesan87
 
Kubeinvaders & Chaos Engineering practices for Kubernetes-1.pdf
Kubeinvaders & Chaos Engineering practices for Kubernetes-1.pdfKubeinvaders & Chaos Engineering practices for Kubernetes-1.pdf
Kubeinvaders & Chaos Engineering practices for Kubernetes-1.pdf
Eugenio Marzo
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
Jen Aman
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
Hajime Tazaki
 
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayQuantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Phil Estes
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
Coburn Watson
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instances
Amazon Web Services
 
weblogic perfomence tuning
weblogic perfomence tuningweblogic perfomence tuning
weblogic perfomence tuning
prathap kumar
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
Evan Chan
 
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
DynamicInfraDays
 
Container Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixContainer Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, Netflix
Docker, Inc.
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
Zvi Avraham
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Spark Summit
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
Brendan Gregg
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
Andriy Berestovskyy
 
An introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methodsAn introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methods
Ajith Narayanan
 

Similar to Multimaster (20)

Introduction to Galera Cluster
Introduction to Galera ClusterIntroduction to Galera Cluster
Introduction to Galera Cluster
 
10 Multicore 07
10 Multicore 0710 Multicore 07
10 Multicore 07
 
Introduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerIntroduction to LAVA Workload Scheduler
Introduction to LAVA Workload Scheduler
 
Deep Dive on Amazon EC2 Instances (March 2017)
Deep Dive on Amazon EC2 Instances (March 2017)Deep Dive on Amazon EC2 Instances (March 2017)
Deep Dive on Amazon EC2 Instances (March 2017)
 
Modern processors
Modern processorsModern processors
Modern processors
 
Kubeinvaders & Chaos Engineering practices for Kubernetes-1.pdf
Kubeinvaders & Chaos Engineering practices for Kubernetes-1.pdfKubeinvaders & Chaos Engineering practices for Kubernetes-1.pdf
Kubeinvaders & Chaos Engineering practices for Kubernetes-1.pdf
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container DayQuantifying Container Runtime Performance: OSCON 2017 Open Container Day
Quantifying Container Runtime Performance: OSCON 2017 Open Container Day
 
CPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instances
 
weblogic perfomence tuning
weblogic perfomence tuningweblogic perfomence tuning
weblogic perfomence tuning
 
Porting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to RustPorting a Streaming Pipeline from Scala to Rust
Porting a Streaming Pipeline from Scala to Rust
 
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
ContainerDays Boston 2015: "CoreOS: Building the Layers of the Scalable Clust...
 
Container Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, NetflixContainer Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, Netflix
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)Network Programming: Data Plane Development Kit (DPDK)
Network Programming: Data Plane Development Kit (DPDK)
 
An introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methodsAn introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methods
 

Recently uploaded

Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
Enterprise Knowledge
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
ScyllaDB
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 

Recently uploaded (20)

Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
Demystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through StorytellingDemystifying Knowledge Management through Storytelling
Demystifying Knowledge Management through Storytelling
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
A Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's ArchitectureA Deep Dive into ScyllaDB's Architecture
A Deep Dive into ScyllaDB's Architecture
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 

Multimaster

  • 1. Multi-master replication for Postgres K. Knizhnik, C. Pan, S. Kelvich
  • 4. We want: Fault-tolerance in easy way OLTP-style load Compatibility with standalone postgres Possibility to reuse as metadata storage for sharded cluster Design objectives 4
  • 5. Replication: Identical replicated data on all nodes Possibility to have local tables Writes allowed to any node => Easy to use => We need to take care about proper isolation Design objectives 5
  • 6. Transaction manager. We want: Avoid single point of failure. +: Spanner, Cockroach, Clock-SI —: Pg-XL, ... Avoid network communication for Read-Only transactions +: HANA, Spanner, Cockroach, Clock-SI —: Pg-XL, ... Design objectives 6
  • 7. Fault tolerance. Paxos. Distributed consensus, low level. Raft. Complete state-machine replication solution with failure detector on timeouts and autorecovery. But all writes are proxied to one node. 2PC. Blocks in case of node and coordinator failure. Postgres already support 2pc. 3PC-like. Extra message between "P"and "C". 3PC, Paxos commit, E3PC. Design objectives 7
  • 8. Summary. No performance penalty for reads. Tx can be issued to any node. No special actions required in case of failure. Design objectives 8
  • 9. github.com/postgrespro/postgres_cluster Patched version of Postgres 9.6 Transaction Manager API + Deadlock detection API. Logical decoding of 2PC transactions. Mmts extension. Transaction Manager implementation (Clock-SI) Logical replication protocol/client Hooks on transaction commit and transforms it into 2PC. Bunch of bgworkers. Implementation 9
  • 10. Mmts uses logical replication/decoding. In-core support and extension by 2ndQuadrant. Very flexible: Can skip tables Replication between different versions Logical messages Implementation 10
  • 11. BE – backend, WS – Walsender, Arb – Arbiter, WR – Walreceiver Implementation 11
  • 12. Transaction Manager. Clock-SI algorithm (MS research) Make use of CSN instead of running lists. (we track xid-csn correspondence in extension, but there is ongoing work to have CSN in-core by Heikki and Alexander) Implementation 12
  • 13. DDL replication. Statement-based. Happily, postgres support 2PC for almost all DDL (alter enum already fixed in -master) CREATE TABLE AS, CREATE MATVIEW, etc – tricky, mixes DDL and DML. Temp tables are tricky – shouldn’t be replicted. Depends on environment (search_path, auth, etc.) Implementation 13
  • 14. Postgres compatibility. almost FULLY compatible with pg. 162 of 166 regressions tests pass as is. 1 test is using prepared statement inside CREATE TABLE AS (CTA). 3 tests are using CTA(CTA(TEMP TABLE)). Some obvious way to abuse statement based replication, e.g. write function that create table with name based on current timestamp. Also sequences can add pain. Implementation 14
  • 15. Automatic recovery: normal work Implementation 15
  • 16. Automatic recovery: network split Implementation 16
  • 17. Automatic recovery: recovery process Implementation 17
  • 18. Automatic recovery: normal work again Implementation 18
  • 19. Not that hard: Install mmts extension Postgres: max_prepared_transactions wal_level = logical max_worker_processes, max_replication_slots, max_wal_senders shared_preload_libraries = ’multimaster’ Multimaster extension: multimaster.node_id = ... multimaster.conn_strings = ’...’ Configuration 19
  • 20. We want: Test cluster liveness against network problems, restarts, timeshifts, etc. Sound like Jepsen. But unfortunately it uses ssh on precreated vm’s/servers. That’s okay for single test, but painful for CI. No sane way of testing network split with processes, i.e. postgres TAP test framework is not helpful with that. Testing 20
  • 21. So we are using python unittest with docker. 3-5 containers is _way_ faster to start than vm’s. takes 10 seconds to compile mmts extension, init and start cluster. failure injection via docker.exec (iptables, shift time, etc). compatible with Travis-CI. Testing 21
  • 22. Testing itself: attach clients to each node of cluster and start abusing nodes. Client: bank-like test case. Transfer money between accounts with concurrent total balance calculation. Testing 22
  • 23. Failures injected: Node stop-start Node kill-start Node in network partition Edge network split (a.k.a. majority rings) Shift time Change clock speed on nodes with libfaketime * * – not yet implemented. Testing 23
  • 24. Performance. Read-only tx speed is the same as in standalone postgres. Commit takes more time (two net roundtrips). Logical decoding slows down big transactions – but that should be fixed, patch on commitfest. Testing 24
  • 25. Release a public beta Try to commit twophase decoding patch to pg Try to commit transation manager patch to pg Raise discussion about replication/decoding of catalog content Roadmap 25