linux.conf.au-HAminiconf-pgsql91-20120116

Presentation slides for the HA Miniconf at linux.conf.au 2012
http://linux.conf.au/wiki/index.php/Miniconfs/HighAvailabilityAndDistributedStorage

Presentation Transcript

    • LCA 2012 HA Miniconf
      Building a non-shared storage HA cluster with Pacemaker and PostgreSQL 9.1
      2012/01/16
      Keisuke MORI, NTT DATA Intellilink Corporation / Linux-HA Japan Project
      http://linux-ha.sourceforge.jp/
      Copyright(c) 2012 Linux-HA Japan Project
    • Introduction PostgreSQL database now supports Streaming Replication (SR)  2010.09 Release 9.0 - "Asynchronous" replication supported.  2011.09 Release 9.1 - "Synchronous" replication supported.  NTT contributed the feature to the PostgreSQL community. Integration with a HA cluster software is necessary to accomplish automatic fail-over We have developed an enhancement version of “pgsql” resource agent (RA) for the integration with Pacemaker. + Copyright(c) 2011 Linux-HA Japan Project 2
    • Existing HA Configuration for PostgreSQL
      [Diagram: a two-node cluster (Node #1 active, Node #2 standby) with the database on
      shared storage. Read/write queries go to the active PostgreSQL; the standby instance
      is not running and is started only when a failure occurs. The pgsql RA manages
      PostgreSQL via start / stop / monitor under Pacemaker on each node.]
    • Advanced HA Configuration with PostgreSQL SR
      [Diagram: a two-node cluster with no shared storage. PostgreSQL runs on both nodes:
      the Primary (PRI) sends WAL records to the Hot-Standby (HS) over Streaming
      Replication; the HS is running and can also answer read queries. New enhancement:
      the pgsql RA manages the PRI/HS state in PostgreSQL as a Pacemaker Master/Slave
      resource via start / stop / monitor / promote / demote.]
    • Benefits of Streaming Replication
      - Removes a Single Point Of Failure (SPOF): shared storage could be a SPOF.
      - Reduces cost: shared storage is very expensive!
      - Faster fail-over / shorter downtime, by eliminating the crash recovery time of the
        database. Crash recovery time is the most dominant factor of the downtime,
        particularly for a large database.
      - Load balancing for read-only queries.
    • Comparison of Replication Technology
      Compared: PostgreSQL 9.1 SR (sync), DRBD (Protocol C), Shared Storage, Slony-I.
      - High Availability usage: Slony-I is N.A. (async only).
      - Fail-over time: DRBD and Shared Storage require mount/umount plus crash recovery;
        SR avoids crash recovery.
      - Read scalability: possible with SR and Slony-I, but cluster-aware applications are
        required; N.A. for DRBD and Shared Storage.
      - Non-DB usage: N.A. for SR and Slony-I.
      - Throughput performance: SR (sync) approx. 90%-99% (*1); DRBD (Protocol C) approx.
        70%-80% (*2); Shared Storage approx. 100%.
      (*1) Varies with the workload.  (*2) Assumes DB usage; varies with the workload.
    • Key Features of the new pgsql RA
      - Manages the Primary/Hot-Standby status in PostgreSQL; works as a Master/Slave
        resource in Pacemaker.
      - Data protection: prevents PostgreSQL from starting when the stored data is
        considered unreliable or inconsistent. The RA creates a "lock file" and
        intentionally leaves it across reboots to indicate that the data on the node is
        likely unreliable. Lock file: /var/lib/pgsql/PGSQL.lock
      - Displays the PostgreSQL status (running status, data status) on crm_mon, which
        makes operation easier.
      - Determines which node has the latest data when both nodes have started at the same
        time (but depending on this is not recommended; keep the operation simple).
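      The idea behind the lock file can be sketched roughly as follows (illustration only,
      not the RA's actual code):

      # Illustration of the data-protection check (not the RA's real implementation)
      LOCK=/var/lib/pgsql/PGSQL.lock
      if [ -f "$LOCK" ]; then
          # The node went down while it was (or may have been) Primary:
          # refuse to start until an operator has resynced the data and removed the lock.
          echo "PGSQL.lock exists: data may be inconsistent, resync from the Primary first" >&2
          exit 1
      fi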
    • Sample Resource Configuration
      [Diagram: Node #1 runs pgsql (Master) with the IPaddr2 resources vip-master and
      vip-rep grouped on it; Node #2 runs pgsql (Slave) with the optional IPaddr2 resource
      vip-slave. Streaming Replication flows from the Master to the Slave; vip-rep is the
      virtual IP used for the replication connection.]
    • Sample CRM configuration for pgsql RA
      ms msPostgresql postgresql \
          meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
      primitive postgresql ocf:heartbeat:pgsql \
          params pgctl="/usr/pgsql-9.1/bin/pg_ctl" \
                 psql="/usr/pgsql-9.1/bin/psql" \
                 pgctldata="/usr/pgsql-9.1/bin/pg_controldata" \
                 pgdata="/var/lib/pgsql/9.1/data/" \
                 start_opt="-p 5432" \
                 rep_mode="sync" \
                 node_list="devnode1 devnode2" \
                 restore_command="cp /var/lib/pgsql/9.1/data/pg_archive/%f %p" \
                 master_ip="192.168.122.103" \
                 stop_on_demote="yes" \
          op start   timeout="60s" interval="0s" on-fail="restart" \
          op monitor timeout="60s" interval="7s" on-fail="restart" \
          op monitor timeout="60s" interval="2s" on-fail="restart" role="Master" \
          op promote timeout="60s" interval="0s" on-fail="restart" \
          op demote  timeout="60s" interval="0s" on-fail="block" \
          op stop    timeout="60s" interval="0s" on-fail="block" \
          op notify  timeout="60s" interval="0s"
      Notes from the slide:
      - rep_mode: "sync" enables the SR support.
      - master_ip: the virtual IP for the replication.
      - stop_on_demote: "yes" is recommended; see the discussion later.
    • Sample CRM configuration for VIPs
      group master-group vip-master vip-rep
      primitive vip-master ocf:heartbeat:IPaddr2 \
          params ip="192.168.100.101" nic="eth0" cidr_netmask="24" \
          op start   timeout="60s" interval="0s"  on-fail="restart" \
          op monitor timeout="60s" interval="10s" on-fail="restart" \
          op stop    timeout="60s" interval="0s"  on-fail="block"
      primitive vip-rep ocf:heartbeat:IPaddr2 \
          params ip="192.168.122.103" nic="eth3" cidr_netmask="24" \
          (ditto)
      primitive vip-slave ocf:heartbeat:IPaddr2 \
          params ip="192.168.100.102" nic="eth0" cidr_netmask="24" \
          meta resource-stickiness="1" \
          (ditto)
      colocation rsc_colocation-2 inf: master-group msPostgresql:Master
      order rsc_order-2 0: msPostgresql:promote master-group:start symmetrical=false
      order rsc_order-3 0: msPostgresql:demote  master-group:stop  symmetrical=false
      location rsc_location-1 vip-slave \
          rule 200:  pgsql-status eq "HS:sync" \
          rule 100:  pgsql-status eq "PRI" \
          rule -inf: not_defined pgsql-status \
          rule -inf: pgsql-status ne "HS:sync" and pgsql-status ne "PRI"
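      Both configuration fragments can be loaded into a running cluster with the crm shell,
      for example (a sketch; the file name pgsql-sr.crm is only an illustration):

      # Load the resource definitions above into the CIB with crmsh
      crm configure load update pgsql-sr.crm
      # Review the result
      crm configure show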
    • Newly introduced parameters for the pgsql RA
      (R = required when streaming replication is enabled)
      - rep_mode (R): Replication mode: none (default) / async / sync.
      - node_list (R): All node names, separated by spaces.
      - restore_command (R): restore_command for recovery.conf.
      - master_ip (R): The Master's floating IP address to be connected from the hot
        standby; used for "primary_conninfo" in recovery.conf.
      - repuser: User used to connect to the master server; used for "primary_conninfo"
        in recovery.conf. Default: postgres
      - stop_on_demote: Whether or not to stop PostgreSQL instead of restarting it on
        demote, to speed up failover (yes / no (default)).
      - primary_conninfo_opt: primary_conninfo options for recovery.conf other than host,
        port, user and application_name.
      - tmpdir: Path to the temporary directory. Default: /var/lib/pgsql
      - pgctldata: Path to the pg_controldata command. Default: /usr/bin/pg_controldata
      - xlog_check_count: Number of xlog checks on monitor before promote. Default: 3
      - crm_attr_timeout: Timeout of the crm_attribute "forever" update command. Default: 5
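      These parameters feed into the recovery.conf that the RA writes on a Hot-Standby.
      Conceptually it contains something like the following (a sketch only; the exact
      contents are generated by the RA, and application_name=devnode2 is an assumption
      based on the sample node names):

      # recovery.conf on a Hot-Standby (illustrative, values from the sample configuration)
      standby_mode = 'on'
      primary_conninfo = 'host=192.168.122.103 port=5432 user=postgres application_name=devnode2'
      restore_command = 'cp /var/lib/pgsql/9.1/data/pg_archive/%f %p'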
    • Sample PostgreSQL configuration
      postgresql.conf (excerpt): only the part related to streaming replication; see the
      PostgreSQL manual for details.
          listen_addresses = '*'
          wal_level = hot_standby
          synchronous_commit = on
          archive_mode = on
          archive_command = '/bin/cp %p /var/lib/pgsql/9.1/data/pg_archive/%f'
          max_wal_senders = 5
          wal_keep_segments = 32
          hot_standby = on
          include '/var/lib/pgsql/rep_mode.conf'
          restart_after_crash = off
          replication_timeout = 5000          # milliseconds
          wal_receiver_status_interval = 2    # seconds
      rep_mode.conf: the pgsql RA creates this file to control the PostgreSQL replication mode.
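      In sync mode, the generated rep_mode.conf essentially switches synchronous
      replication on for the attached standby. Its contents might look something like this
      (an assumption about what the RA generates, shown purely for illustration):

      # rep_mode.conf as generated on the Primary (illustrative assumption)
      synchronous_standby_names = 'devnode2'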
    • Sample crm_mon output
      Online: [ devnode1 devnode2 ]

      vip-slave        (ocf::heartbeat:IPaddr2):  Started devnode2
      Resource Group: master-group
          vip-master   (ocf::heartbeat:IPaddr2):  Started devnode1
          vip-rep      (ocf::heartbeat:IPaddr2):  Started devnode1
      Master/Slave Set: msPostgresql
          Masters: [ devnode1 ]
          Slaves:  [ devnode2 ]
      Clone Set: clnPingCheck
          Started: [ devnode1 devnode2 ]

      Node Attributes:
      * Node devnode1:
          + default_ping_set        : 100
          + master-postgresql:0     : 1000
          + pgsql-data-status       : LATEST
          + pgsql-master-baseline   : 16:000000002B000EC0
          + pgsql-status            : PRI
      * Node devnode2:
          + default_ping_set        : 100
          + master-postgresql:1     : 100
          + pgsql-data-status       : STREAMING|SYNC
          + pgsql-status            : HS:sync
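      Output like the above can be obtained with, for example:

      # One-shot cluster status, including node attributes (-A) and inactive resources (-r)
      crm_mon -A -r -1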
    • Status attributes
      pgsql-status: running status of PostgreSQL on each node
      - STOP: Not running.
      - HS:alone: Running as Hot-Standby, not connected to the Primary.
      - HS:connected: Running as Hot-Standby, connected to the Primary (transient state).
      - HS:async: Running as Hot-Standby in asynchronous replication mode.
      - HS:potential: Running as Hot-Standby (only appears with 3 or more nodes).
      - HS:sync: Running as Hot-Standby in synchronous replication mode.
      - PRI: Running as Primary.
      pgsql-data-status: data status on each node
      - DISCONNECTED: The data is out of date. Must not become Primary.
      - STREAMING|ASYNC: The data is replicating from the Primary but may not be up to
        date. Must not become Primary.
      - STREAMING|SYNC: The data is replicating from the Primary and is up to date.
        Ready to become Primary.
      - LATEST: The data is up to date and the node is now Primary.
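      The attributes can also be queried per node with crm_attribute, e.g. (a sketch:
      which lifetime each attribute is stored under is an assumption here):

      # Query the persistent data status of a node (assumed lifetime "forever")
      crm_attribute -l forever -N devnode1 -n pgsql-data-status -G
      # Query the current running status (assumed transient lifetime "reboot")
      crm_attribute -l reboot  -N devnode1 -n pgsql-status -G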
    • Best Practice of the Operation Procedure
      General recommendations:
      - Invoke the cluster nodes one by one: PRI first, HS second.
      - Always copy the database from the PRI node before starting the HS node, to make
        sure the data is consistent.
      Initial invocation (a consolidated shell sketch follows):
      (0) Initialize the database on #1, or determine manually which node has the latest data.
      (1) Invoke Pacemaker and the pgsql resource on #1 (PRI).
      (2) Copy the database from #1 (PRI) to #2:
          pg_basebackup -h $MASTER_IP -U postgres -D $PGDATA --xlog
          (any other backup/restore method should also work, e.g. rsync)
      (3) Invoke Pacemaker on #2 (HS).
      (4) Make sure the replication is working on #2 (HS): wait until crm_mon shows
          "pgsql-status : HS:sync" for #2.
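      Put together, the initial invocation might look like this on the Heartbeat-based
      stack used here (a sketch; the service name and the paths are assumptions matching
      the sample configuration):

      ## On node #1 (becomes PRI)
      service heartbeat start            # starts Pacemaker and, via the RA, PostgreSQL

      ## On node #2 (becomes HS)
      MASTER_IP=192.168.122.103          # vip-rep from the sample configuration
      PGDATA=/var/lib/pgsql/9.1/data
      su - postgres -c "pg_basebackup -h $MASTER_IP -U postgres -D $PGDATA --xlog"
      service heartbeat start
      # then wait until crm_mon shows:  pgsql-status : HS:sync  for node #2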
    • Best Practice of the Operation Procedure
      Recovery from a failure (a shell sketch follows):
      (0) A fail-over or switch-over occurred; #2 is now PRI.
      (1) Stop Pacemaker on the failed node #1.
      (2) Repair whatever is broken, if needed.
      (3) Copy the database from the PRI (#2) to #1.
      (4) Clear the "lock" file created by the RA:
          rm /var/lib/pgsql/PGSQL.lock
          This lets the RA know that we have made sure the data is consistent.
      (5) Invoke Pacemaker on #1 (HS).
      (6) Make sure the replication is working on #1 (HS): wait until crm_mon shows
          "pgsql-status : HS:sync" for #1.
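      On the failed node these steps roughly translate to (again a sketch, under the same
      assumptions as the previous example):

      ## On the failed node #1, after #2 has taken over as PRI
      service heartbeat stop
      # ... repair hardware / OS / disks as needed ...
      su - postgres -c "pg_basebackup -h 192.168.122.103 -U postgres -D /var/lib/pgsql/9.1/data --xlog"
      rm /var/lib/pgsql/PGSQL.lock       # declare the freshly copied data consistent
      service heartbeat start
      # wait until crm_mon shows:  pgsql-status : HS:sync  for node #1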
    • Implementation Challenges
      The state transition models are different between Pacemaker and PostgreSQL:
      - PostgreSQL cannot "demote": it cannot transition from the Primary state to the
        Hot-Standby state; the Primary state is only allowed to transition to the Stop state.
      - Difference of concepts:
        Pacemaker: "Master" is the additional state.
        PostgreSQL: "Slave (Hot-Standby)" is the additional state.
      Current solution: the "stop_on_demote" parameter.
    • Status Mapping between Pacemaker and PostgreSQL
      [State transition diagram mapping Pacemaker states (Stopped / Slave / Master) to
      pgsql states (STOP / HS:alone / HS:(others) / HS:sync / PRI). Highlights: "start"
      runs "pg_ctl start" with a recovery.conf and brings the node to a Hot-Standby state;
      HS:sync is the state ready to fail over; "promote" runs "pg_ctl promote" to reach
      PRI; on "demote", with stop_on_demote=yes the RA simply runs "pg_ctl stop", while
      with stop_on_demote=no it stops PostgreSQL and starts it again with a recovery.conf
      in the post-demote notify (a transient, unmatched state); a direct PRI to Hot-Standby
      transition does not exist inside PostgreSQL.]
    • The “stop_on_demote” parameter
      yes: obey the PostgreSQL transition model (recommended)
      - Always stops the PostgreSQL instance on demote.
      - Simplifies the operation.
      - A monitor may fail if it runs between the demote and the stop operation.
      no: obey the Pacemaker transition model
      - Stops PostgreSQL once and invokes it again on demote.
      - Takes a very long time to complete the fail-over.
      - Requires the archive files to be kept up to date via scp or a shared storage
        (in order to build a non-shared storage cluster!).
      - Rather complicated configuration and operation.
    • Proposal for a future enhancement of Pacemaker
      Add new state transition paths, by either:
      (1) Extending the return code semantics of operations (see the sketch below):
          - when "start" returns OCF_RUNNING_MASTER: Stopped -> Master state
          - when "demote" returns OCF_NOT_RUNNING: Master -> Stopped state
          - the RA decides the next transition state.
      (2) Changing the operation semantics:
          - "promote" may be invoked in the Stopped state
          - the resource is supposed to move to the Stopped state after "demote" has succeeded
          - Pacemaker decides the semantics via a configuration parameter
          - this method does NOT work for the pgsql RA: it cannot decide which node should
            be promoted before it is started.
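      As an illustration of proposal (1), an RA could end its demote action like this
      (a sketch of the proposed semantics, not current Pacemaker behaviour; the function
      name is hypothetical):

      # Inside a hypothetical OCF RA, under proposal (1)
      pgsql_demote() {
          pg_ctl -D "$OCF_RESKEY_pgdata" -m fast stop
          # Report the resource as stopped so that Pacemaker can take the proposed
          # Master -> Stopped transition instead of Master -> Slave.
          return $OCF_NOT_RUNNING
      }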
    • Other Issues
      (1) The crm_attribute command rarely hangs
          - when it is invoked at the moment the DC node is absent due to a node failure.
          - Workaround: a wrapper function that enforces the timeout (see the sketch after
            this slide).
          - Details will be filed to the bugzilla soon.
      (2) Cannot obtain the Master's uname in monitor
          - OCF_RESKEY_CRM_meta_notify_master_uname is not set in monitor.
          - The pgsql RA needs to know the uname of the Master in the monitor operation to
            manage the current status of PostgreSQL.
          - Workaround: parse the crm_mon output.
          - Filed to bugzilla as LF#2607 (the Linux Foundation site is still down though...).
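      A wrapper of the kind described in (1) could look roughly like this (a sketch only;
      the function name is hypothetical and the 5-second limit simply mirrors the
      crm_attr_timeout default):

      # Hypothetical wrapper: give up if crm_attribute does not return in time
      crm_attribute_with_timeout() {
          timeout 5 crm_attribute "$@"
          rc=$?
          if [ $rc -eq 124 ]; then      # 124 = command killed by timeout(1)
              echo "crm_attribute timed out (DC may be absent)" >&2
          fi
          return $rc
      }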
    • Conclusion TODO  merging to the upstream of the resource-agents package.  code clean-up and refactoring Development code  https://github.com/t-matsuo/resource-agents/blob/pgsql91/heartbeat/pgsql  Tested version:  postgresql-9.1.1 (or later)  pacemaker-1.0.11 and heartbeat-3.0.5 (should be independent from cluster stack / versions)  The key developer: Takatoshi MATSUO Documents / Sample configuration  https://github.com/t-matsuo/resource-agents/wiki/Resource-Agent-for- PostgreSQL-9.1-streaming-replication Discussions  pacemaker or linux-ha-dev Mailing Lists Any comments and improvements are welcome! Copyright(c) 2011 Linux-HA Japan Project 22