SlideShare a Scribd company logo
1 of 48
Download to read offline
HA PostgreSQL with Patroni
Oleksii Kliukin, Zalando SE
@alexeyklyukin
FOSDEM PGDay 2016
January 29th, 2016, Brussels
What happens if the master is down?
● Built-in streaming replication is great!
● Only one writable node (primary, master)
● Multiple read-only standbys (replicas)
● Manual failover
pg_ctl promote -D /home/postgres/data
Re-joining the former master
Before 9.3:
rm -rf /home/postgres/data && pg_basebackup …
Before 9.5
git clone -b PGREWIND1_0_0_PG9_4 --depth 1 https://github.
com/vmware/pg_rewind.git  && cd pg_rewind && apt-get source
postgresql-9.4 -y && USE_PGXS=1 make top_srcdir=$(find . -name
"postgresql*" -type d) install;
pg_rewind in 9.5 and above
● pg_rewind available in contrib (apt-get install postgresql-contrib-9.5)
● wal_log_hints = ‘on’ or enable data checksums
● rewind your former master to be able to follow the current one:
pg_rewind -D /home/postgres/data --source-server=’
host=localhost port=5433 sslmode=prefer’
● requires superuser access
No fixed address
● Pgbouncer
● Pgpool
● HAProxy
● Floating IP/DNS
MASTER REPLICA
FORMER
MASTER
WAL storage
connection
router
CLIENTS
Streaming replication
pg_rewind
archive
com
m
and
restore
com
m
and
How much downtime can you tolerate?
Automatic failover
master
replica
master
replica
promote
replica
master
Network issues
master
replica
master
replica
promote
master
master
?
What about an arbiter?
replica
master
arbiter
ping
ping
master
master
arbiter
vote
master
replica
Do we need a distributed consensus?
Master election
The consensus problem requires agreement among a number of processes
(or agents) for a single data value.
● leader (master) value defines the current master
● no leader - which node takes the master key
● leader is present - should be the same for all nodes
● leader has disappeared - should be the same for all nodes
● etcd from CoreOS
● distributed key-value storage
● directory-tree like
● implements RAFT
● talks REST
● key expiration with TTL and test and set operations
3-rd party to enforce a consensus
RAFT
● Distributed consensus algorithm (like Paxos)
● Achieves consensus by directing all changes to the leader
● Only commit the change if it’s acknowledged by the majority of nodes
● 2 stages
○ leader election
○ log replication
● Implemented in etcd, consul.
http://thesecretlivesofdata.com/raft/
Patroni
● Manages a single PostgreSQL node
● Commonly runs on the same host as PostgreSQL
● Talks to etcd
● Promotes/demotes the managed node depending on the leader key
PostgreSQL master election
set leader lock
set leader lock set leader lock
● every node tries to set the leader lock (key)
● the leader lock can only be set when it’s not present
● once the leader lock is set - no one else can obtain it
PostgreSQL master election
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevExist=false value="postgresql0"
ttl=30
HTTP/1.1 201 Created
...
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 2045
X-Raft-Index: 13006
X-Raft-Term: 2
{
"action": "create",
"node": {
"createdIndex": 2045,
"expiration": "2016-01-28T13:38:19.717822356Z",
"key": "/service/fosdem/leader",
"modifiedIndex": 2045,
"ttl": 30,
"value": "postgresql0"
}
}
ELECTED
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevExist=false value="postgresql1"
ttl=30
HTTP/1.1 412 Precondition Failed
...
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 2047
{
"cause": "/service/fosdem/leader",
"errorCode": 105,
"index": 2047,
"message": "Key already exists"
}
Only one leader at a time
PostgreSQL master election
I’m the member
I’m
the leader with the lock
I’m
the member
Streaming replication
How do you know the leader is alive?
● leader updates its key periodically (by default every 10 seconds)
● only the leader is allowed to update the key (via compare and swap)
● if the key is not updated in 30 seconds - it expires (via TTL)
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevValue="bar" value="bar"
HTTP/1.1 412 Precondition Failed
Content-Length: 89
Content-Type: application/json
Date: Thu, 28 Jan 2016 13:45:27 GMT
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 2090
{
"cause": "[bar != postgresql0]",
"errorCode": 101,
"index": 2090,
"message": "Compare failed"
}
Only the leader can update the lock
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevValue="postgresql0" value="
postgresql0" ttl=30
{
"action": "compareAndSwap",
"node": {
"createdIndex": 2052,
"expiration": "2016-01-28T13:47:05.38531821Z",
"key": "/service/fosdem/leader",
"modifiedIndex": 2119,
"ttl": 30,
"value": "postgresql0"
},
"prevNode": {
"createdIndex": 2052,
"expiration": "2016-01-28T13:47:05.226784451Z",
"key": "/service/fosdem/leader",
"modifiedIndex": 2116,
"ttl": 22,
"value": "postgresql0"
}
}
How do you know where to connect?
$ etcdctl ls --recursive /service/fosdem
/service/fosdem/members
/service/fosdem/members/postgresql0
/service/fosdem/members/postgresql1
/service/fosdem/initialize
/service/fosdem/leader
/service/fosdem/optime
/service/fosdem/optime/leader
$ http http://127.0.0.1:2379/v2/keys/service/fosdem/members/postgresql0
HTTP/1.1 200 OK
...
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 3114
X-Raft-Index: 20102
X-Raft-Term: 2
{
"action": "get",
"node": {
"createdIndex": 3111,
"expiration": "2016-01-28T14:28:25.221011955Z",
"key": "/service/fosdem/members/postgresql0",
"modifiedIndex": 3111,
"ttl": 22,
"value": "{"conn_url":"postgres://replicator:rep-pass@127.0.0.1:5432/postgres","
api_url":"http://127.0.0.1:8008/patroni","tags":{"nofailover":false,"noloadbalance":false,
"clonefrom":false},"state":"running","role":"master","xlog_location":234881568}"
}
}
Avoiding the split brain
Worst case scenario
Streaming replication in 140 characters
Patroni configuration parameters
● YAML file with sections
● general parameters
○ ttl: time to leave for the leader and member keys
○ loop_wait: minimum time one iteration of the eventloop takes
○ scope: name of the cluster to run
○ auth: ‘username:password’ string for the REST API
● postgresql section
○ name - name of the postgresql member (should be unique)
○ listen - address:port to listen to (or multiple, i.e. 127.0.0.1,127.0.0.2:5432)
○ connect_address: address:port to advertise to other members (only one, i.e. 127.0.0.5:5432)
○ data_dir: PGDATA (can be initially not empty)
○ maximum_lag_on_failover: do not failover if slave is more than this number of bytes behind
○ use_slots: whether to use replication slots (9.4 and above)
postgresql subsections
● initdb: section to specify initdb options (i.e. encoding, default auth mode)
● pg_rewind: section with username/password for the user used by pg_rewind
● pg_hba: entries to be added to pg_hba.conf
● replication: replication user, password, and network (for pg_hba.conf)
● superuser: username/password for the superuser account (to be created)
● admin: username/password for the user with createdb/createrole permissions
● create_replica_methods: list of methods to image replicas from the master:
● recovery.conf: parameters put into the recovery.conf (primary_conninfo is
written automatically)
● parameters: postgresql.conf parameters (i.e. wal_log_hints or shared_buffers)
tags (patroni configuration)
tags modify behavior of the node they are applied to
● nofailover: the node should not participate in elections or ever become the
master
● noloadbalance: the node should be excluded from the load balancer (TODO)
● clonefrom: this node should be bootstrapped from (TODO)
● replicatefrom: this node should do streaming replication from (pull request)
REST API
● command and control interface
● GET /master and /replica endpoints for the load balancer
● GET /patroni in order to get system information
● POST /restart in order to restart the node
● POST /reinitialize in order to remove the data directory and reinitialize from
the master
● POST /failover with leader and optional member names in order to do a
controlled failover
● patronictl to do it in a more user-friendly way
REST API (master)
$ http http://127.0.0.1:8008/master
HTTP/1.0 200 OK
...
Server: BaseHTTP/0.3 Python/2.7.10
{
"postmaster_start_time": "2016-01-27 23:23:21.873 CET",
"role": "master",
"state": "running",
"tags": {
"clonefrom": false,
"nofailover": false,
"noloadbalance": false
},
"xlog": {
"location": 301990984
}
}
REST API (replica)
http http://127.0.0.1:8009/master
HTTP/1.0 503 Service Unavailable
...
Server: BaseHTTP/0.3 Python/2.7.10
{
"postmaster_start_time": "2016-01-27 23:23:24.367 CET",
"role": "replica",
"state": "running",
"tags": {
"clonefrom": false,
"nofailover": false,
"noloadbalance": false
},
"xlog": {
"paused": false,
"received_location": 301990984,
"replayed_location": 301990984
}
Configuring HA Proxy for Patroni
global
maxconn 100
defaults
log global
mode tcp
retries 2
timeout client 30m
timeout connect 4s
timeout server 30m
timeout check 5s
frontend ft_postgresql
bind *:5000
default_backend bk_db
backend bk_db
option httpchk
server postgresql_127.0.0.1_5432 127.0.0.1:5432 maxconn 100 check port 8008
server postgresql_127.0.0.1_5433 127.0.0.1:5433 maxconn 100 check port 8009
Implementation details
Separate nodes for etcd and patroni
Multi-threading to avoid blocking the
event loop
Use synchronous_standby_names=’*’ for
synchronous replication
Use etcd/Zookeeper watches to speed
up the failover
Callbacks
Call monitoring code or do some application-specific actions (i.e. change
pgbouncer configuration)
User-defined scripts set in the configuration file.
● on start
● on stop
● on restart
● on change role
pg_rewind support
● remove recovery.conf if present
● run a checkpoint on a promoted master (due to the fast promote)
● remove archive status to avoid losing archived segments to be removed
● start in a single-user mode with archive_command set to false
● stop to produce a clean shutdown
● only if checksums or enabled or wal_log_hints are set (via pg_controldata)
● Many installations already have Zookeeper running
● No TTL
● Session-specific (ephemeral) keys
● No dynamic nodes (use Exhibitor)
Zookeeper support
Spilo: Patroni on AWS
Up next
● scheduled failovers
● full support for cascading replication
● consul joins etcd and zookeeper
● manage BDR nodes
Thank you!
Feedback: @alexeyklyukin
alexk@hintbits.com
Links
github.com/zalando/patroni
spilo.readthedocs.org
coreos.com/etcd/docs/latest/getting-started-with-etcd.html
raft.github.io

More Related Content

What's hot

What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...ScaleGrid.io
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLJim Mlodgenski
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephSage Weil
 
MariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB plc
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PGConf APAC
 
ProxySQL for MySQL
ProxySQL for MySQLProxySQL for MySQL
ProxySQL for MySQLMydbops
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions Yugabyte
 
MySQL Advanced Administrator 2021 - 네오클로바
MySQL Advanced Administrator 2021 - 네오클로바MySQL Advanced Administrator 2021 - 네오클로바
MySQL Advanced Administrator 2021 - 네오클로바NeoClova
 
Maria db 이중화구성_고민하기
Maria db 이중화구성_고민하기Maria db 이중화구성_고민하기
Maria db 이중화구성_고민하기NeoClova
 
[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기NHN FORWARD
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InSage Weil
 
Parallel Replication in MySQL and MariaDB
Parallel Replication in MySQL and MariaDBParallel Replication in MySQL and MariaDB
Parallel Replication in MySQL and MariaDBMydbops
 
Running MariaDB in multiple data centers
Running MariaDB in multiple data centersRunning MariaDB in multiple data centers
Running MariaDB in multiple data centersMariaDB plc
 
PostgreSQL Deep Internal
PostgreSQL Deep InternalPostgreSQL Deep Internal
PostgreSQL Deep InternalEXEM
 
Tuning Autovacuum in Postgresql
Tuning Autovacuum in PostgresqlTuning Autovacuum in Postgresql
Tuning Autovacuum in PostgresqlMydbops
 
Wars of MySQL Cluster ( InnoDB Cluster VS Galera )
Wars of MySQL Cluster ( InnoDB Cluster VS Galera ) Wars of MySQL Cluster ( InnoDB Cluster VS Galera )
Wars of MySQL Cluster ( InnoDB Cluster VS Galera ) Mydbops
 

What's hot (20)

What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
What’s the Best PostgreSQL High Availability Framework? PAF vs. repmgr vs. Pa...
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
 
Using galera replication to create geo distributed clusters on the wan
Using galera replication to create geo distributed clusters on the wanUsing galera replication to create geo distributed clusters on the wan
Using galera replication to create geo distributed clusters on the wan
 
BlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for CephBlueStore: a new, faster storage backend for Ceph
BlueStore: a new, faster storage backend for Ceph
 
MariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & OptimizationMariaDB Server Performance Tuning & Optimization
MariaDB Server Performance Tuning & Optimization
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
 
PostgreSQL and RAM usage
PostgreSQL and RAM usagePostgreSQL and RAM usage
PostgreSQL and RAM usage
 
Backup and-recovery2
Backup and-recovery2Backup and-recovery2
Backup and-recovery2
 
ProxySQL for MySQL
ProxySQL for MySQLProxySQL for MySQL
ProxySQL for MySQL
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
 
MySQL Advanced Administrator 2021 - 네오클로바
MySQL Advanced Administrator 2021 - 네오클로바MySQL Advanced Administrator 2021 - 네오클로바
MySQL Advanced Administrator 2021 - 네오클로바
 
Maria db 이중화구성_고민하기
Maria db 이중화구성_고민하기Maria db 이중화구성_고민하기
Maria db 이중화구성_고민하기
 
[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기[2018] MySQL 이중화 진화기
[2018] MySQL 이중화 진화기
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year In
 
Parallel Replication in MySQL and MariaDB
Parallel Replication in MySQL and MariaDBParallel Replication in MySQL and MariaDB
Parallel Replication in MySQL and MariaDB
 
Running MariaDB in multiple data centers
Running MariaDB in multiple data centersRunning MariaDB in multiple data centers
Running MariaDB in multiple data centers
 
PostgreSQL Deep Internal
PostgreSQL Deep InternalPostgreSQL Deep Internal
PostgreSQL Deep Internal
 
Tuning Autovacuum in Postgresql
Tuning Autovacuum in PostgresqlTuning Autovacuum in Postgresql
Tuning Autovacuum in Postgresql
 
Wars of MySQL Cluster ( InnoDB Cluster VS Galera )
Wars of MySQL Cluster ( InnoDB Cluster VS Galera ) Wars of MySQL Cluster ( InnoDB Cluster VS Galera )
Wars of MySQL Cluster ( InnoDB Cluster VS Galera )
 
PostgreSQL replication
PostgreSQL replicationPostgreSQL replication
PostgreSQL replication
 

Similar to High Availability PostgreSQL with Zalando Patroni

PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsCommand Prompt., Inc
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practiceAlexey Lesovsky
 
OpenGurukul : Database : PostgreSQL
OpenGurukul : Database : PostgreSQLOpenGurukul : Database : PostgreSQL
OpenGurukul : Database : PostgreSQLOpen Gurukul
 
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRestPGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRestPGDay.Amsterdam
 
Php 5.6 From the Inside Out
Php 5.6 From the Inside OutPhp 5.6 From the Inside Out
Php 5.6 From the Inside OutFerenc Kovács
 
The Essential postgresql.conf
The Essential postgresql.confThe Essential postgresql.conf
The Essential postgresql.confRobert Treat
 
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...Puppet
 
Shall we play a game
Shall we play a gameShall we play a game
Shall we play a gamejackpot201
 
We shall play a game....
We shall play a game....We shall play a game....
We shall play a game....Sadia Textile
 
Building tungsten-clusters-with-postgre sql-hot-standby-and-streaming-replica...
Building tungsten-clusters-with-postgre sql-hot-standby-and-streaming-replica...Building tungsten-clusters-with-postgre sql-hot-standby-and-streaming-replica...
Building tungsten-clusters-with-postgre sql-hot-standby-and-streaming-replica...Command Prompt., Inc
 
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki LinnakangasPG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangaspgdayrussia
 
Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2PgTraining
 
Bruce Momjian - Inside PostgreSQL Shared Memory @ Postgres Open
Bruce Momjian - Inside PostgreSQL Shared Memory @ Postgres OpenBruce Momjian - Inside PostgreSQL Shared Memory @ Postgres Open
Bruce Momjian - Inside PostgreSQL Shared Memory @ Postgres OpenPostgresOpen
 
Introduction to Apache Mesos
Introduction to Apache MesosIntroduction to Apache Mesos
Introduction to Apache MesosJoe Stein
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in productionParis Data Engineers !
 
Skydive, real-time network analyzer, container integration
Skydive, real-time network analyzer, container integrationSkydive, real-time network analyzer, container integration
Skydive, real-time network analyzer, container integrationSylvain Afchain
 

Similar to High Availability PostgreSQL with Zalando Patroni (20)

PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System Administrators
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practice
 
OpenGurukul : Database : PostgreSQL
OpenGurukul : Database : PostgreSQLOpenGurukul : Database : PostgreSQL
OpenGurukul : Database : PostgreSQL
 
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRestPGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
 
The Accidental DBA
The Accidental DBAThe Accidental DBA
The Accidental DBA
 
Php 5.6 From the Inside Out
Php 5.6 From the Inside OutPhp 5.6 From the Inside Out
Php 5.6 From the Inside Out
 
The Essential postgresql.conf
The Essential postgresql.confThe Essential postgresql.conf
The Essential postgresql.conf
 
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
 
0507 057 01 98 * Adana Klima Servisleri
0507 057 01 98 * Adana Klima Servisleri0507 057 01 98 * Adana Klima Servisleri
0507 057 01 98 * Adana Klima Servisleri
 
Shall we play a game
Shall we play a gameShall we play a game
Shall we play a game
 
Shall we play a game?
Shall we play a game?Shall we play a game?
Shall we play a game?
 
We shall play a game....
We shall play a game....We shall play a game....
We shall play a game....
 
Building tungsten-clusters-with-postgre sql-hot-standby-and-streaming-replica...
Building tungsten-clusters-with-postgre sql-hot-standby-and-streaming-replica...Building tungsten-clusters-with-postgre sql-hot-standby-and-streaming-replica...
Building tungsten-clusters-with-postgre sql-hot-standby-and-streaming-replica...
 
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki LinnakangasPG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
 
Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2Oracle to Postgres Migration - part 2
Oracle to Postgres Migration - part 2
 
Bruce Momjian - Inside PostgreSQL Shared Memory @ Postgres Open
Bruce Momjian - Inside PostgreSQL Shared Memory @ Postgres OpenBruce Momjian - Inside PostgreSQL Shared Memory @ Postgres Open
Bruce Momjian - Inside PostgreSQL Shared Memory @ Postgres Open
 
Introduction to Apache Mesos
Introduction to Apache MesosIntroduction to Apache Mesos
Introduction to Apache Mesos
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production
 
Osol Pgsql
Osol PgsqlOsol Pgsql
Osol Pgsql
 
Skydive, real-time network analyzer, container integration
Skydive, real-time network analyzer, container integrationSkydive, real-time network analyzer, container integration
Skydive, real-time network analyzer, container integration
 

More from Zalando Technology

Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Zalando Technology
 
How We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards ScalabilityHow We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards ScalabilityZalando Technology
 
Powering Radical Agility with Docker
Powering Radical Agility with Docker Powering Radical Agility with Docker
Powering Radical Agility with Docker Zalando Technology
 
Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices  Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices Zalando Technology
 
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland KuhnReactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland KuhnZalando Technology
 
Zalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three MonthsZalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three MonthsZalando Technology
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkZalando Technology
 
Building a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & SlickBuilding a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & SlickZalando Technology
 
Radical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and MicroservicesRadical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and MicroservicesZalando Technology
 
Order Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community DayOrder Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community DayZalando Technology
 
ZMON: Monitoring Zalando's Engineering Platform
ZMON: Monitoring Zalando's Engineering PlatformZMON: Monitoring Zalando's Engineering Platform
ZMON: Monitoring Zalando's Engineering PlatformZalando Technology
 
Mobile Testing Challenges at Zalando Tech
Mobile Testing Challenges at Zalando TechMobile Testing Challenges at Zalando Tech
Mobile Testing Challenges at Zalando TechZalando Technology
 
Auto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando TeamAuto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando TeamZalando Technology
 
Radical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the CloudRadical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the CloudZalando Technology
 

More from Zalando Technology (14)

Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
 
How We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards ScalabilityHow We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards Scalability
 
Powering Radical Agility with Docker
Powering Radical Agility with Docker Powering Radical Agility with Docker
Powering Radical Agility with Docker
 
Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices  Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices
 
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland KuhnReactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
 
Zalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three MonthsZalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three Months
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
 
Building a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & SlickBuilding a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & Slick
 
Radical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and MicroservicesRadical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and Microservices
 
Order Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community DayOrder Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community Day
 
ZMON: Monitoring Zalando's Engineering Platform
ZMON: Monitoring Zalando's Engineering PlatformZMON: Monitoring Zalando's Engineering Platform
ZMON: Monitoring Zalando's Engineering Platform
 
Mobile Testing Challenges at Zalando Tech
Mobile Testing Challenges at Zalando TechMobile Testing Challenges at Zalando Tech
Mobile Testing Challenges at Zalando Tech
 
Auto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando TeamAuto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando Team
 
Radical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the CloudRadical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the Cloud
 

Recently uploaded

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 

High Availability PostgreSQL with Zalando Patroni

  • 1. HA PostgreSQL with Patroni Oleksii Kliukin, Zalando SE @alexeyklyukin FOSDEM PGDay 2016 January 29th, 2016, Brussels
  • 2. What happens if the master is down? ● Built-in streaming replication is great! ● Only one writable node (primary, master) ● Multiple read-only standbys (replicas) ● Manual failover pg_ctl promote -D /home/postgres/data
  • 3. Re-joining the former master Before 9.3: rm -rf /home/postgres/data && pg_basebackup … Before 9.5 git clone -b PGREWIND1_0_0_PG9_4 --depth 1 https://github. com/vmware/pg_rewind.git && cd pg_rewind && apt-get source postgresql-9.4 -y && USE_PGXS=1 make top_srcdir=$(find . -name "postgresql*" -type d) install;
  • 4. pg_rewind in 9.5 and above ● pg_rewind available in contrib (apt-get install postgresql-contrib-9.5) ● wal_log_hints = ‘on’ or enable data checksums ● rewind your former master to be able to follow the current one: pg_rewind -D /home/postgres/data --source-server=’ host=localhost port=5433 sslmode=prefer’ ● requires superuser access
  • 5. No fixed address ● Pgbouncer ● Pgpool ● HAProxy ● Floating IP/DNS
  • 6. MASTER REPLICA FORMER MASTER WAL storage connection router CLIENTS Streaming replication pg_rewind archive com m and restore com m and
  • 7. How much downtime can you tolerate?
  • 10. What about an arbiter? replica master arbiter ping ping master master arbiter vote master replica
  • 11. Do we need a distributed consensus? Master election
  • 12. The consensus problem requires agreement among a number of processes (or agents) for a single data value. ● leader (master) value defines the current master ● no leader - which node takes the master key ● leader is present - should be the same for all nodes ● leader has disappeared - should be the same for all nodes
  • 13. ● etcd from CoreOS ● distributed key-value storage ● directory-tree like ● implements RAFT ● talks REST ● key expiration with TTL and test and set operations 3-rd party to enforce a consensus
  • 14. RAFT ● Distributed consensus algorithm (like Paxos) ● Achieves consensus by directing all changes to the leader ● Only commit the change if it’s acknowledged by the majority of nodes ● 2 stages ○ leader election ○ log replication ● Implemented in etcd, consul. http://thesecretlivesofdata.com/raft/
  • 15. Patroni ● Manages a single PostgreSQL node ● Commonly runs on the same host as PostgreSQL ● Talks to etcd ● Promotes/demotes the managed node depending on the leader key
  • 16. PostgreSQL master election set leader lock set leader lock set leader lock
  • 17. ● every node tries to set the leader lock (key) ● the leader lock can only be set when it’s not present ● once the leader lock is set - no one else can obtain it PostgreSQL master election
  • 18. http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevExist=false value="postgresql0" ttl=30 HTTP/1.1 201 Created ... X-Etcd-Cluster-Id: 7e27652122e8b2ae X-Etcd-Index: 2045 X-Raft-Index: 13006 X-Raft-Term: 2 { "action": "create", "node": { "createdIndex": 2045, "expiration": "2016-01-28T13:38:19.717822356Z", "key": "/service/fosdem/leader", "modifiedIndex": 2045, "ttl": 30, "value": "postgresql0" } } ELECTED
  • 19. http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevExist=false value="postgresql1" ttl=30 HTTP/1.1 412 Precondition Failed ... X-Etcd-Cluster-Id: 7e27652122e8b2ae X-Etcd-Index: 2047 { "cause": "/service/fosdem/leader", "errorCode": 105, "index": 2047, "message": "Key already exists" } Only one leader at a time
  • 20. PostgreSQL master election I’m the member I’m the leader with the lock I’m the member Streaming replication
  • 21. How do you know the leader is alive? ● leader updates its key periodically (by default every 10 seconds) ● only the leader is allowed to update the key (via compare and swap) ● if the key is not updated in 30 seconds - it expires (via TTL)
  • 22. http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevValue="bar" value="bar" HTTP/1.1 412 Precondition Failed Content-Length: 89 Content-Type: application/json Date: Thu, 28 Jan 2016 13:45:27 GMT X-Etcd-Cluster-Id: 7e27652122e8b2ae X-Etcd-Index: 2090 { "cause": "[bar != postgresql0]", "errorCode": 101, "index": 2090, "message": "Compare failed" } Only the leader can update the lock
  • 23. http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevValue="postgresql0" value=" postgresql0" ttl=30 { "action": "compareAndSwap", "node": { "createdIndex": 2052, "expiration": "2016-01-28T13:47:05.38531821Z", "key": "/service/fosdem/leader", "modifiedIndex": 2119, "ttl": 30, "value": "postgresql0" }, "prevNode": { "createdIndex": 2052, "expiration": "2016-01-28T13:47:05.226784451Z", "key": "/service/fosdem/leader", "modifiedIndex": 2116, "ttl": 22, "value": "postgresql0" } }
  • 24. How do you know where to connect? $ etcdctl ls --recursive /service/fosdem /service/fosdem/members /service/fosdem/members/postgresql0 /service/fosdem/members/postgresql1 /service/fosdem/initialize /service/fosdem/leader /service/fosdem/optime /service/fosdem/optime/leader
  • 25. $ http http://127.0.0.1:2379/v2/keys/service/fosdem/members/postgresql0 HTTP/1.1 200 OK ... X-Etcd-Cluster-Id: 7e27652122e8b2ae X-Etcd-Index: 3114 X-Raft-Index: 20102 X-Raft-Term: 2 { "action": "get", "node": { "createdIndex": 3111, "expiration": "2016-01-28T14:28:25.221011955Z", "key": "/service/fosdem/members/postgresql0", "modifiedIndex": 3111, "ttl": 22, "value": "{"conn_url":"postgres://replicator:rep-pass@127.0.0.1:5432/postgres"," api_url":"http://127.0.0.1:8008/patroni","tags":{"nofailover":false,"noloadbalance":false, "clonefrom":false},"state":"running","role":"master","xlog_location":234881568}" } }
  • 28. Streaming replication in 140 characters
  • 29. Patroni configuration parameters ● YAML file with sections ● general parameters ○ ttl: time to leave for the leader and member keys ○ loop_wait: minimum time one iteration of the eventloop takes ○ scope: name of the cluster to run ○ auth: ‘username:password’ string for the REST API ● postgresql section ○ name - name of the postgresql member (should be unique) ○ listen - address:port to listen to (or multiple, i.e. 127.0.0.1,127.0.0.2:5432) ○ connect_address: address:port to advertise to other members (only one, i.e. 127.0.0.5:5432) ○ data_dir: PGDATA (can be initially not empty) ○ maximum_lag_on_failover: do not failover if slave is more than this number of bytes behind ○ use_slots: whether to use replication slots (9.4 and above)
  • 30. postgresql subsections ● initdb: section to specify initdb options (i.e. encoding, default auth mode) ● pg_rewind: section with username/password for the user used by pg_rewind ● pg_hba: entries to be added to pg_hba.conf ● replication: replication user, password, and network (for pg_hba.conf) ● superuser: username/password for the superuser account (to be created) ● admin: username/password for the user with createdb/createrole permissions ● create_replica_methods: list of methods to image replicas from the master: ● recovery.conf: parameters put into the recovery.conf (primary_conninfo is written automatically) ● parameters: postgresql.conf parameters (i.e. wal_log_hints or shared_buffers)
  • 31. tags (patroni configuration) tags modify behavior of the node they are applied to ● nofailover: the node should not participate in elections or ever become the master ● noloadbalance: the node should be excluded from the load balancer (TODO) ● clonefrom: this node should be bootstrapped from (TODO) ● replicatefrom: this node should do streaming replication from (pull request)
  • 32. REST API ● command and control interface ● GET /master and /replica endpoints for the load balancer ● GET /patroni in order to get system information ● POST /restart in order to restart the node ● POST /reinitialize in order to remove the data directory and reinitialize from the master ● POST /failover with leader and optional member names in order to do a controlled failover ● patronictl to do it in a more user-friendly way
  • 33. REST API (master) $ http http://127.0.0.1:8008/master HTTP/1.0 200 OK ... Server: BaseHTTP/0.3 Python/2.7.10 { "postmaster_start_time": "2016-01-27 23:23:21.873 CET", "role": "master", "state": "running", "tags": { "clonefrom": false, "nofailover": false, "noloadbalance": false }, "xlog": { "location": 301990984 } }
  • 34. REST API (replica) http http://127.0.0.1:8009/master HTTP/1.0 503 Service Unavailable ... Server: BaseHTTP/0.3 Python/2.7.10 { "postmaster_start_time": "2016-01-27 23:23:24.367 CET", "role": "replica", "state": "running", "tags": { "clonefrom": false, "nofailover": false, "noloadbalance": false }, "xlog": { "paused": false, "received_location": 301990984, "replayed_location": 301990984 }
  • 35. Configuring HA Proxy for Patroni global maxconn 100 defaults log global mode tcp retries 2 timeout client 30m timeout connect 4s timeout server 30m timeout check 5s frontend ft_postgresql bind *:5000 default_backend bk_db backend bk_db option httpchk server postgresql_127.0.0.1_5432 127.0.0.1:5432 maxconn 100 check port 8008 server postgresql_127.0.0.1_5433 127.0.0.1:5433 maxconn 100 check port 8009
  • 37. Separate nodes for etcd and patroni
  • 38. Multi-threading to avoid blocking the event loop
  • 40. Use etcd/Zookeeper watches to speed up the failover
  • 41. Callbacks Call monitoring code or do some application-specific actions (i.e. change pgbouncer configuration) User-defined scripts set in the configuration file. ● on start ● on stop ● on restart ● on change role
  • 42. pg_rewind support ● remove recovery.conf if present ● run a checkpoint on a promoted master (due to the fast promote) ● remove archive status to avoid losing archived segments to be removed ● start in a single-user mode with archive_command set to false ● stop to produce a clean shutdown ● only if checksums or enabled or wal_log_hints are set (via pg_controldata)
  • 43. ● Many installations already have Zookeeper running ● No TTL ● Session-specific (ephemeral) keys ● No dynamic nodes (use Exhibitor) Zookeeper support
  • 45.
  • 46. Up next ● scheduled failovers ● full support for cascading replication ● consul joins etcd and zookeeper ● manage BDR nodes