PostgreSQL Lifecycle Automation
This Talk
• Transform the way databases are being operated.
• Look at PostgreSQL and see what automating its lifecycle means.
• Select tooling to work with.
• Draft an architectural outline.
• Show use case examples for handling the PostgreSQL lifecycle in a fully
automated way.
Introduction
–The anynines mission statement
“Automate the entire
lifecycle at production
grade of a growing number
of data services across
platforms, infrastructures
at scale.”
Terms
• Infrastructure
• Platform
• (Data) Service
• (Data) Service Instance
• Service Binding
• Cluster = Streaming Replication PostgreSQL Cluster
Change
“Database Administration
is Subject to Change.”
• Application development platforms. CF, k8s, …
• Microservices
• More Apps
• More Data Service Types
• More Data Service Instances
Change
Striving for…
Scalability
Instant On-demand Self-Service
Production-Readiness
The mantra is…
Automate!
Automate!
Automate!
PostgreSQL
Vision
Automate the entire lifecycle of
PostgreSQL to easily operate
thousands of production grade
DBs across …
public and on-premise clouds
and integrate well with
multiple platforms
Tackle the
Challenge
PostgreSQL
Lifecycle
• Provision database VM
• Install database software
• Configure database software
• Set up replication & cluster management
• Configure Monitoring
• Configure availability monitoring
• Configure alerting
• Configure collection of logs
• Configure collection of metrics
• Create database
• Create database users
• Adapt configuration of individual
database servers
• e.g. enable extension
• Adapt configuration of individual
databases
• Set up backup procedure
• Perform backups regularly
• Perform on-demand backups
• Perform schema migrations
• Query data
• Determine performance bottleneck
• Perform scale-out of database server
• Recover from master db failure
• Recover from standby failure
• Recover from AZ failure
• Recover from network partitioning (split
brain)
• Recover from PostgreSQL process failure
• Perform upgrade of operating system
• Restore data from backup
• Apply patch-level upgrade
• Apply major version upgrade
• Perform data migration
• Destroy data
• Destroy database server
PostgreSQL Lifecycle
PostgreSQL is
not easy to
automate.
PostgreSQL
Lifecycle
Automation
Finding the
Automation
Strategy
Is there a single strategy
that works at production
grade across data services
at scale?
On-Demand Provisioning
of
Dedicated Database
Servers
Easy Deployment
$> cf create-service postgresql single-small single-postgres-1
single-postgres-1
Postgresql
VM#1
Service Instance #1
Service Broker
PostgreSQL Automation
Easy Deployment
$> cf create-service postgresql cluster-small postgres-cluster-2
single-postgres-1
Postgresql
VM#1
Service Instance #1
Service Broker
PostgreSQL Automation
postgres-cluster-2
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
Service Instance #2
[Diagram: Data Services next to the Application Runtime, which is shown as a grid of 100 App instances]
Data Service Automation
Lifecycle
vs.
Data Service Instance Lifecycle
Data Service
Automation
Lifecycle
Test
a9s
Release
Build
Upstream
Release
Platform
Environment #1
Platform
Environment #2
Platform
Environment #n
Ship Automation Releases
into Platform Environments
Update Data Service Instances
a9s PostgreSQL
Release Repository
Open Source
PostgreSQL
Building new PostgreSQL automation releases
PostgreSQL CI/CD-Pipeline
Data Service
Instance
Lifecycle
Create
Modify
Update
Backup
Recovery
Terminate
PostgreSQL Lifecycle - 1st Iteration
Principles - Full Lifecycle Management
Iteratively increase the
automation depth.
Create
Modify
Update
Backup
Recovery
Terminate
Enabling/disabling data
service plugins
Scale-out Scale down
Add/remove a log/metric endpoint
Major version update Misc. config changes
Greenfield / Clone
Single / clustered
pre-provisioned / on-demand
Trigger manual backup
Scheduled backup
Restore backup
Hard reset of service
instance
Destroy service
instance
Minor version update
Patchlevel version update
Cluster Failure detection
Cluster failover
Self-healing by
resurrection
PostgreSQL Lifecycle - nth Iteration
Automation
Architecture
a9s Deployer
Templates Deployments
Bosh || k8s
a9s Service Broker
my-3node-postgres-cluster-2
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
my-single-postgres-1
Postgresql
VM#1
Middleware Adapter
Open Service Broker API
a9s PostgreSQL SPI
Service Instance
Service Instance
my-3node-postgres-cluster-3
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
Service Instance
…
Cloud Controller
CF Client
create service
create service
create deployment from template xy with attributes {…}
deploy release abc & deployment manifest xyz
Execute deployments
create
service specific
credentials
create binding
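The step "create deployment from template xy with attributes {…}" can be pictured as rendering a manifest from a plan. This is a minimal sketch under stated assumptions: the plan catalog `PLANS`, the function `render_deployment` and the simplified manifest are hypothetical illustrations, not the a9s Deployer's actual interface.

```python
from string import Template

# Hypothetical plan catalog: attribute values per service plan (assumed for illustration).
PLANS = {
    "single-small": {"instances": 1, "vm_type": "small", "disk_gb": 10},
    "cluster-small": {"instances": 3, "vm_type": "small", "disk_gb": 10},
}

# Heavily simplified deployment template; a real BOSH manifest has many more fields.
MANIFEST_TEMPLATE = Template(
    "name: $name\n"
    "instance_groups:\n"
    "- name: postgresql\n"
    "  instances: $instances\n"
    "  vm_type: $vm_type\n"
    "  persistent_disk: $disk_mb\n"
)


def render_deployment(instance_name: str, plan: str) -> str:
    """Render a deployment manifest for one service instance from a plan."""
    attrs = PLANS[plan]
    return MANIFEST_TEMPLATE.substitute(
        name=instance_name,
        instances=attrs["instances"],
        vm_type=attrs["vm_type"],
        disk_mb=attrs["disk_gb"] * 1024,  # disk size expressed in MB
    )


if __name__ == "__main__":
    print(render_deployment("postgres-cluster-2", "cluster-small"))
```

Keeping the plan attributes separate from the template means that a new plan size only needs a new catalog entry.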
Operation | HTTP Verb | Action
Service Catalog | GET /v2/catalog | Deliver metadata about the data service.
Create Service Instance | PUT /v2/service_instances/:id | Provision VMs, then install and configure the data service on them. The resulting VM or cluster represents a service instance.
Create Service Binding | PUT /v2/service_instances/:instance_id/service_bindings/:id | Create a data service user and return credentials representing a service binding.
Delete Service Binding | DELETE /v2/service_instances/:instance_id/service_bindings/:id | Remove credentials associated with the service binding.
Delete Service Instance | DELETE /v2/service_instances/:id | Destroy the VMs and data associated with the service instance.
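The five broker operations listed above can be sketched as a small route table. This is an illustrative sketch only: the `dispatch` helper and the regular expressions are assumptions made for this example, and a real broker would sit behind a web framework and return HTTP responses.

```python
import re

# Open Service Broker API v2 routes, in the order they are listed above.
ROUTES = [
    ("GET", r"^/v2/catalog$", "catalog"),
    ("PUT", r"^/v2/service_instances/(?P<instance_id>[^/]+)$", "provision"),
    ("PUT",
     r"^/v2/service_instances/(?P<instance_id>[^/]+)/service_bindings/(?P<binding_id>[^/]+)$",
     "bind"),
    ("DELETE",
     r"^/v2/service_instances/(?P<instance_id>[^/]+)/service_bindings/(?P<binding_id>[^/]+)$",
     "unbind"),
    ("DELETE", r"^/v2/service_instances/(?P<instance_id>[^/]+)$", "deprovision"),
]


def dispatch(method: str, path: str):
    """Return (operation, path parameters) for a request, or None if nothing matches."""
    for verb, pattern, operation in ROUTES:
        match = re.match(pattern, path)
        if verb == method and match:
            return operation, match.groupdict()
    return None


if __name__ == "__main__":
    print(dispatch("PUT", "/v2/service_instances/1234"))
```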
How to
Automate?
Selecting the
Automation
Technology
Automation
Technology
Requirements
Lifecycle
Awareness
Repeatability
Scalability
Infrastructure
Agnosticism
Automation
Technology
Candidates
BOSH
Kubernetes
VMs
Containers
VMs
Containers
Strong isolation.
Bad
Neighborhood
Protection.
Faster dev
cycles.
Less overhead.
Faster instance
startup.
Weak isolation.
BOSH
Infrastructure
Independent
“BOSH lets you
orchestrate the lifecycle of
large-scale deployments of
stateful distributed
systems
to infrastructure.”
Automate once,
deploy everywhere.
VMware
BOSH
BOSH CLI
$> bosh target http://bosh-on.vmware.com
$> bosh deploy
[Diagram: six virtual machines, each running Some Service / App and a BOSH Agent]
OpenStack
BOSH
AWS
BOSH
BOSH CLI
VMware AWS OpenStack
$> bosh target http://bosh-on.aws.com
$> bosh deploy
[Diagram: six virtual machines, each running Some Service / App and a BOSH Agent]
BOSH BOSH BOSH
Dealing with State
Where to store state?
Store state on a remotely
attached block device =
persistent disk.
🔑
Infrastructure as a Service (IaaS), e.g. OpenStack
VIRTUAL DATACENTER
Router
STORAGE
[Diagram: three Storage Nodes, each backed by several HDDs]
Storage Volume
Operating
System
VIRTUAL MACHINE
Infrastructure API
The data lifecycle has been
decoupled from the VM
lifecycle
⇒ The VM becomes
disposable.
🔑
Ephemeral VM,
persistent disk.
🔑
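The idea of a disposable VM with a persistent disk can be modelled in a few lines. The classes and the `recreate_vm` function below are hypothetical stand-ins for what the infrastructure and BOSH do. They only illustrate that the data survives when the VM is replaced, for example by a larger one.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_vm_ids = count(1)  # stands in for IaaS-assigned VM identifiers


@dataclass
class PersistentDisk:
    size_gb: int
    data: dict = field(default_factory=dict)  # stands in for the database files


@dataclass
class VM:
    ram_gb: int
    vcpus: int
    disk: Optional[PersistentDisk]
    vm_id: int = field(default_factory=lambda: next(_vm_ids))


def recreate_vm(old: VM, ram_gb: int, vcpus: int) -> VM:
    """Replace a VM with a new one and reattach the same persistent disk."""
    disk = old.disk   # detach: the disk outlives the VM
    old.disk = None   # the old VM holds no state and can be thrown away
    return VM(ram_gb=ram_gb, vcpus=vcpus, disk=disk)


if __name__ == "__main__":
    small = VM(ram_gb=4, vcpus=1, disk=PersistentDisk(10, {"table": "rows"}))
    large = recreate_vm(small, ram_gb=8, vcpus=2)
    print(large.disk.data)
```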
Scalable
Horizontal
Scaling
[Diagram: 28 virtual machines, each running Some Service and a BOSH Agent]
BOSH Deployments are
Predictable
BOSH Deployments are
Repeatable
PostgreSQL
Automation.
High Availability &
Cluster
Management
Replication
• Applies to clustered Data Service Instances
• Asynchronous streaming replication
• Three (3) Nodes:
• One (1) Master Node
• Two (2) Standby Nodes
• Replication slots configured to avoid early recycling of master WAL segments
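As a rough illustration of the setup above, the following sketch generates the PostgreSQL settings and SQL statements such a three-node cluster would need. The helper names and the one spare WAL sender are assumptions made for this example, and `wal_level = replica` assumes PostgreSQL 9.6 or later. It is not the configuration shipped by the automation described in this talk.

```python
import re


def slot_name(node: str) -> str:
    """Derive a valid slot name (lowercase letters, digits, underscores) from a node name."""
    return re.sub(r"[^a-z0-9_]", "_", node.lower())


def primary_settings(standby_count: int) -> dict:
    """postgresql.conf settings on the master for asynchronous streaming replication."""
    return {
        "wal_level": "replica",
        "max_wal_senders": standby_count + 1,    # one spare sender (assumed headroom)
        "max_replication_slots": standby_count,  # one slot per standby
        "synchronous_standby_names": "",         # empty means asynchronous replication
    }


def standby_settings(node: str) -> dict:
    """Settings on a standby so that it streams through its own replication slot."""
    return {"primary_slot_name": slot_name(node), "hot_standby": "on"}


def slot_statements(standbys) -> list:
    """SQL to run on the master: one physical replication slot per standby."""
    return [
        f"SELECT pg_create_physical_replication_slot('{slot_name(node)}');"
        for node in standbys
    ]


if __name__ == "__main__":
    for statement in slot_statements(["Postgresql-VM#2", "Postgresql-VM#3"]):
        print(statement)
```

A slot makes the master retain WAL segments until the standby has received them, so an unreachable standby should be monitored because its slot can cause WAL to pile up on the master.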
High Availability & Replication
Repmgr
• Extends PostgreSQL streaming replication
• Manages replication
• Performs failure detection
• Assists during failover by triggering leader election
• Facilitates monitoring of the replication health & performance
Repmgr
Custom
Automation
• Detect network partitioning & split brain situations
• Periodically checks the repmgr database and verifies replication status and
cluster status
• Fires alarm if
• master is not followed by a majority
• standby is following the wrong master
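The two alarm conditions can be written down as a small check. This is a sketch under assumptions: `check_cluster` and its input format are invented for illustration, and "majority" is read here as the master plus its followers forming more than half of the cluster. The actual automation reads this information from the repmgr database.

```python
def check_cluster(master: str, following: dict, cluster_size: int) -> list:
    """Return alarm messages for a cluster.

    `following` maps each reachable standby to the node it replicates from.
    """
    alarms = []
    followers = [node for node, upstream in following.items() if upstream == master]

    # Assumed reading of "majority": master plus followers exceed half the cluster.
    if 2 * (len(followers) + 1) <= cluster_size:
        alarms.append(f"master {master} is not followed by a majority")

    for node, upstream in sorted(following.items()):
        if upstream != master:
            alarms.append(f"standby {node} is following the wrong master {upstream}")
    return alarms


if __name__ == "__main__":
    # vm3 replicates from vm2 instead of the master vm1.
    print(check_cluster("vm1", {"vm2": "vm1", "vm3": "vm2"}, cluster_size=3))
```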
Custom PostgreSQL Automation
Exemplary
Failure Scenarios
Failing Standby
[Diagram, four frames: my-3node-postgres-cluster-1 with Postgresql VM#1, VM#2 and VM#3, 👑 marking the master. Frames 1 and 2 show all three VMs, frame 3 shows only VM#1 and VM#3, frame 4 shows all three VMs again.]
Failing Master
[Diagram, four frames: my-3node-postgres-cluster-1 with Postgresql VM#1, VM#2 and VM#3, 👑 marking the master. Frames 1 and 2 show all three VMs, frame 3 shows only VM#1 and VM#3, frame 4 shows all three VMs again.]
Vertical Scale-Out
single-postgres-1
Postgresql
VM#1
Service Instance #1
Scalability: Vertical Scale-Out
$> cf update-service single-postgres-1 -p single-large
Single-node PostgreSQL service instance turned into a large PostgreSQL server.
single-postgres-1
Postgresql
VM#1
What has happened
during the vertical scale-out?
4GB RAM
1 vCPU
10GB persistent disk
BOSH Agent
VIRTUAL MACHINE
4 GB RAM, 1 vCPU
10 GB Persistent Disk
Data
PostgreSQL
BOSH Agent
PostgreSQL
VIRTUAL MACHINE
8 GB RAM, 2 vCPUs
10 GB Persistent Disk
Data
BOSH Agent
VIRTUAL MACHINE
4 GB RAM, 1 vCPU
PostgreSQL
10 GB Persistent Disk
Data
20 GB Persistent Disk
Data
OS Update
PostgreSQL
Upgrade
Add DB User and
Grant Access
Scalability: DB User & DB Access
$> cf bind-service my-app single-postgres-1
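Behind a bind request, the broker creates a database user and hands its credentials to the application. The sketch below shows one way this could look. The function name, the `bind_` user prefix and the shape of the credentials are assumptions for illustration, not the a9s broker's actual behaviour.

```python
import re
import secrets


def create_binding(database: str, binding_id: str):
    """Return (credentials, SQL statements) for a new service binding."""
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", database):
        raise ValueError(f"unexpected database name: {database!r}")

    # Derive a safe role name from the binding id (hypothetical naming scheme).
    username = "bind_" + re.sub(r"[^a-z0-9_]", "_", binding_id.lower())
    # token_urlsafe only emits letters, digits, '-' and '_', so no quoting issues arise.
    password = secrets.token_urlsafe(24)

    statements = [
        f"CREATE ROLE \"{username}\" LOGIN PASSWORD '{password}';",
        f'GRANT ALL PRIVILEGES ON DATABASE "{database}" TO "{username}";',
    ]
    credentials = {"username": username, "password": password, "name": database}
    return credentials, statements


if __name__ == "__main__":
    creds, sql = create_binding("appdb", "3f2a-01")
    print(creds["username"])
    print(sql[1])
```

Deleting the binding would reverse these steps by dropping the role, which is why each binding gets its own user.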
Backup &
Restore
Backup & Restore for
Application Developers
Backup & Restore of
the user’s service
instances
Backup & Restore for
Platform Operators
• Disaster Recovery Plan for Data Services
• Backup & Restore for platform operators
• Backup & Restore of
• All service instances
• Individual service instances
Backup & Restore for Platform
Operators
Backup Framework
my-3node-postgres-cluster-1
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
my-3node-postgres-cluster-2
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
my-3node-postgres-cluster-3
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
my-3node-mongo-cluster-4
MongoDB
VM#1
MongoDB
VM#2
MongoDB
VM#3
my-3node-postgres-cluster-6
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
my-3node-postgres-cluster-5
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
my-3node-redis-solo-7
Redis
VM#1
BOSH Agent
VIRTUAL MACHINE
4 GB RAM, 1 vCPU
10 GB Persistent Disk
Data
PostgreSQL
Replication Manager
Log Agent
my-3node-postgres-cluster-1
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
Backup Agent
Data Filter Chain
a9s PostgreSQL
Service Instance #32
Data Stream Reader
Postgresql VM#2
Data Filter
Data Stream Writer
Data Filter
Backup Agent
Object Store,
e.g. AWS S3
Database
Object Store,
e.g. AWS S3
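The reader, filter and writer stages of the backup agent map naturally onto chained generators. The following is a sketch under assumptions: all names are invented for this example, the filters shown are compression and a checksum tap, and an in-memory buffer stands in for the object store. An encryption filter would slot into the same list but is left out because the Python standard library has no cipher.

```python
import hashlib
import io
import zlib


def read_stream(source, chunk_size=4096):
    """Data stream reader: yield the backup data chunk by chunk."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk


def compress(chunks):
    """Data filter: compress the stream incrementally."""
    compressor = zlib.compressobj()
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:
            yield out
    yield compressor.flush()


class Sha256Tap:
    """Data filter: pass chunks through unchanged while computing a checksum."""

    def __init__(self):
        self.digest = hashlib.sha256()

    def __call__(self, chunks):
        for chunk in chunks:
            self.digest.update(chunk)
            yield chunk


def run_chain(source, filters, sink):
    """Wire reader -> filters -> writer and stream the data through."""
    stream = read_stream(source)
    for data_filter in filters:
        stream = data_filter(stream)
    for chunk in stream:
        sink.write(chunk)  # data stream writer, e.g. an object store upload


if __name__ == "__main__":
    dump = io.BytesIO(b"pg_dump output " * 1000)
    store = io.BytesIO()
    tap = Sha256Tap()
    run_chain(dump, [compress, tap], store)
    print(len(store.getvalue()), tap.digest.hexdigest())
```

Because each stage only holds one chunk at a time, a backup larger than the VM's memory can still be streamed to the object store.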
my-3node-postgres-cluster-1
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
my-3node-postgres-cluster-2
Postgresql
VM#1
Postgresql
VM#2
Postgresql
VM#3
Backup requested
a9s Backup Manager
a9s Backup API
⏰
Amazon S3
Backup scheduled
Tell backup agent to
perform backup
Store encrypted backup to storage
Backup🔐
Conclusion
• Dedicated service instances are mandatory.
• On-demand provisioning is essential.
• Choosing the right automation technology is key.
Full PostgreSQL
lifecycle automation is
feasible…
… and it is
already
happening!
Questions?
@anynines
@fischerjulian
Thank You!

Automating the Entire PostgreSQL Lifecycle

Editor's Notes

  • #2 Introduction to data service automation. Focus on PostgreSQL. Focus on the PostgreSQL lifecycle. How the lifecycle of PostgreSQL can be automated.
  • #4 Why? Change > Adoption needed.
  • #8 Why? Change > Adoption needed.
  • #13 More database types and more database instances are to be operated than ever before. They are needed at short notice.
  • #15 BOSH on-demand dedicated on-demand dedicated clustered, bosh self-healing, back up restore
  • #22 Challenge: complexity of the lifecycle.
  • #24 VM lifecycle: VM provisioning, VM self-healing, VM scaling. Software management: package management, dependencies, etc. Process management: starting and monitoring of processes. PostgreSQL lifecycle: configuration, databases, backup, restore, users, permissions. Cluster lifecycle: replication, failure detection, failover, recovery from standard failure scenarios (master failure, standby failure, AZ failure, network partitioning).
  • #25 Things to worry about when automating.
  • #26 Replication but no cluster management, ugly upgrade paths, … assumed presence of human DBA in design.
  • #35 Solo, clustered Different vertical scales.
  • #39 Then Focus on Data Service Instance Lifecycle for the remainder of the talk!
  • #50 VMs, Software, Processes, Self-Healing
  • #51 Two deployments are absolutely identical. Convergence vs. declarative automation.
  • #52 Automate once, deploy thousands of times. The infrastructure is the limit.
  • #53 Portability
  • #58 But k8s is conceptually similar.
  • #60 Large-scale: several hundred VMs. Distributed systems: such as CF, with dozens of components. Deployment to infrastructure: prefer the resulting platform for anything that does not need to run directly on infrastructures.
  • #76 Limited time. Exemplary lifecycle aspect.
  • #78 Replication slots: WAL files with data that have not been sent to the standby are kept until the standby nodes have received them
  • #85 Similar to app instances, backing services should be self-healing.
  • #86 In this case a cluster node fails.
  • #87 A clustered service detects such a failure and performs an automatic failover. The service instance remains available; hiccups are possible and depend on the backing service type. This failover is much faster than rebuilding VMs and even containers. However, it is more costly, as additional VMs and additional storage are required to store data replicas.
  • #88 Automatic recovery from degraded mode. Data is held in remote storage. Instance could also be recovered in case the remote storage fails.
  • #90 Similar to app instances, backing services should be self-healing.
  • #91 In this case a cluster node fails.
  • #92 A clustered service detects such a failure and performs an automatic failover. The service instance remains available; hiccups are possible and depend on the backing service type. This failover is much faster than rebuilding VMs and even containers. However, it is more costly, as additional VMs and additional storage are required to store data replicas.
  • #93 Automatic recovery from degraded mode. Data is held in remote storage. Instance could also be recovered in case the remote storage fails.
  • #94 Limited time. Exemplary lifecycle aspect.
  • #97 Things change. Apps and services grow and need to be scaled out.
  • #101 Limited time. Exemplary lifecycle aspect.
  • #102 Limited time. Exemplary lifecycle aspect.
  • #103 Limited time. Exemplary lifecycle aspect.
  • #104 Limited time. Exemplary lifecycle aspect.
  • #106 Limited time. Exemplary lifecycle aspect.
  • #114 back plugins, filters, output plugins