PostgreSQL High Availability Based on Linux-HA

A PostgreSQL High-Availability System Based on Linux-HA

DBA/李思亮

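Our problems
• The current mysql-HA solution is workable and effective.
• The mysql-HA detection code takes too long: it currently needs about 5 minutes to confirm a failure.
• Detection is one-way and not controllable: a network break between master and slave can trigger a switchover.
• The monitoring code and the switchover code are separate, which introduces a multi-layer architecture and adds intermediate steps.
• High availability of LVS itself is not guaranteed (the split-brain problem).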

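Introduction
• linux-ha.org aims to provide a complete set of solutions for high availability on Linux systems.

• The project started in 1999.
• It began as heartbeat. At version 2.1.4 its functionality was split into three parts: heartbeat, cluster-glue, and resource-agents. At the same time the CRM (cluster resource manager) was separated out and developed as an independent open-source project, which is what we now call pacemaker.
• Today heartbeat is only a messaging layer, and the community no longer adds new features to it. This suggests that heartbeat is nearing the end of its life cycle.

• Another project, corosync, plays the same role as heartbeat (the messaging layer) but integrates more functionality.

• It started in 2002 at openais.org, was named corosync in 2008, and released version 1.0.0 in 2009.
• The current version is 1.4.1.

• The pacemaker module originated in heartbeat; its main function is resource management.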

Our solution
COROSYNC + PACEMAKER

COROSYNC
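• Provides secure, reliable, ordered message delivery.
• Determines cluster membership.
• Determines the cluster quorum (the number of votes required).
– Uses authkey to establish the relationship between cluster nodes and to authenticate them
– Uses a single-ring ordering and membership protocol
– Normally sends messages over UDP; broadcast can be used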

Pacemaker
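• Main code contributors at present: redhat, ibm, NTT
• Provides a distributed cluster messaging framework
• The CIB is an XML-based data store that holds the resource configuration and the runtime status of resources
• Integrates a policy decision system, the Policy Engine (PE), to enforce the dependencies between resources
• Resource Agents: resource scripts. A resource script can be any executable code; it is generally required to respond to three actions: start, stop, and monitor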

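Installation
• yum install pacemaker corosync heartbeat
• yum install resource-agents fence-agents
• Configure passwordless authentication between all nodes.
• Configure time synchronization between all nodes.
• Installation changes for linux-HA after redhat 6.0:
– http://bbs.pconline.cn/topic-2329.html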

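Configuration
• The host name used in each node's configuration must equal the output of `uname -n`
• The host name in each node's hosts file must equal the output of `uname -n`
• vi /etc/corosync/corosync.conf
• vi /etc/corosync/service.d/pcmk
    service {
        name: pacemaker
        ver: 1
    }
• corosync-keygen
• Copy all of these files to every node.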

Startup
• /etc/init.d/corosync start
• /etc/init.d/pacemaker start
• Use crm_mon to monitor the cluster status.

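Resources
• A resource can be any executable program.
• A resource must support 3 basic operations:
– start: start the resource
– stop: stop the resource
– monitor: check the resource status
• Examples: http, database, filesystem, disk, network, ip, drbd, ...
• The community currently provides about 70 resource scripts.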
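
To make the start / stop / monitor contract concrete, here is a minimal sketch of a resource script in Python. It is an illustration only, not one of the community agents: the marker file and all function names are assumptions made for this example. The numeric exit codes follow the OCF convention (0 for success, 7 for "not running", 3 for an unimplemented action).

```python
import os
import tempfile

# Exit codes from the OCF resource agent convention.
OCF_SUCCESS = 0
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7

# Hypothetical marker file standing in for the state of a real service.
STATE_FILE = os.path.join(tempfile.gettempdir(), "dummy_resource.state")


def start():
    """Start the resource; starting an already running resource also succeeds."""
    with open(STATE_FILE, "w") as fh:
        fh.write("running\n")
    return OCF_SUCCESS


def stop():
    """Stop the resource; stopping an already stopped resource also succeeds."""
    if os.path.exists(STATE_FILE):
        os.remove(STATE_FILE)
    return OCF_SUCCESS


def monitor():
    """Report whether the resource is currently running."""
    return OCF_SUCCESS if os.path.exists(STATE_FILE) else OCF_NOT_RUNNING


ACTIONS = {"start": start, "stop": stop, "monitor": monitor}


def main(argv):
    """Dispatch on the action name passed as the first command-line argument."""
    if len(argv) < 2 or argv[1] not in ACTIONS:
        return OCF_ERR_UNIMPLEMENTED
    return ACTIONS[argv[1]]()

# A script used as a real agent would finish with: sys.exit(main(sys.argv))
```

Real OCF agents also answer a few more actions (for example meta-data), so this sketch shows only the three operations listed on this slide.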

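stonith (fence) devices
• Shoot The Other Node In The Head
– Consider the following scenario:
– In a master-slave (M-S) setup, how do we make sure the VIP is never up on both nodes at the same time, or that the VIP has been correctly removed from the M node before it is brought up on the S node?
• A fence device makes sure the resource has really been stopped on the other node.
• The status it returns can be trusted.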

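Main stonith (fence) devices
• crm ra list stonith
• The most valuable ones for us:
– fence_vmware: remote control of virtual machines
– fence_ipmilan: remote power management of physical hosts
• redhat sponsors most of the pacemaker developers and merged the fence system of its RHCS into resource-agents; as a result, stonith devices are now called fence devices.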

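The resource manager: pacemaker
• An XML-based configuration database and state container
• crm_mon monitors resource status and is event-driven.
• crm is the resource configuration management tool
– crm configure show
– crm resource show
– crm help ...

• cibadmin is the management tool for the CIB, the resource configuration store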

A sample configuration from our setup
node node95 \
    attributes standby="off"
node node96 \
    attributes standby="off"
primitive ClusterIp ocf:heartbeat:IPaddr2 \
    params ip="192.168.11.101" cidr_netmask="32" \
    op monitor interval="30s" failure-timeout="60s" \
    op start interval="0" timeout="30s" \
    op stop interval="0" timeout="30s" \
    meta target-role="Started" is-managed="true"
primitive fence_vm95 stonith:fence_vmware \
    params ipaddr="192.168.10.197" login="user" passwd="passwd" \
        vmware_datacenter="GZ-Offices" vmware_type="esx" action="reboot" \
        port="dba-test-Cos6.2.64-11.95" pcmk_reboot_action="reboot" \
        pcmk_host_list="node95 node96" \
    op monitor interval="20" timeout="60s" failure-timeout="60s" fail-count="100" \
    meta target-role="Started"
primitive fence_vm96 stonith:fence_vmware \
    params ipaddr="192.168.10.197" login="user" passwd="passwd" \
        vmware_datacenter="GZ-Offices" vmware_type="esx" action="reboot" \
        port="dba-test-Cos6.2.64-11.96" pcmk_reboot_action="reboot" \
        pcmk_host_list="node95 node96" \
    op monitor interval="20" timeout="60s" failure-timeout="60s" fail-count="100" \
    meta target-role="Started"
primitive ping ocf:pacemaker:ping \
    params host_list="192.168.11.95 192.168.11.96 192.168.10.1 192.168.10.254 10.10.200.10 10.10.200.20" multiplier="100" \
    op monitor interval="10s" timeout="60s" failure-timeout="30s"
primitive postgres_res ocf:heartbeat:pgsql \
    params pgctl="/usr/local/pgsql/bin/pg_ctl" psql="/usr/local/pgsql/bin/psql" start_opt="" \
        pgdata="/usr/local/pgsql/data" config="/usr/local/pgsql/data/postgresql.conf" \
        pgdba="postgres" pgdb="postgres" \
    op monitor interval="10s" timeout="30s" failure-timeout="60s" migration-threshold="10000" \
    meta target-role="Started" is-managed="true"
clone clone-ping ping \
    meta interleave="true" target-role="Started"
location loc_ClusterIp ClusterIp 50: node96
location loc_fence_vm95 fence_vm95 -inf: node95
location loc_fence_vm96 fence_vm96 -inf: node96
location loc_postgres_res postgres_res 50: node96
colocation Pg-with-ClusterIp inf: ClusterIp postgres_res
order Pg-before-ClusterIp inf: postgres_res ClusterIp
property $id="cib-bootstrap-options" \
    dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="true" \
    last-lrm-refresh="1350372421" \
    no-quorum-policy="ignore" \
    cluster-delay="30s"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"
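
The configuration combines a location preference of 50 for node96 with resource-stickiness="100". The following Python sketch shows, under a deliberately simplified scoring model, why a resource that has failed over to node95 does not move back on its own once node96 recovers. The function names are hypothetical, and the model ignores colocation, the ping clone, and every other input that the real Policy Engine takes into account.

```python
def node_score(node, location_scores, stickiness, current_node):
    """Simplified score: the location preference, plus stickiness on the
    node where the resource is currently running."""
    score = location_scores.get(node, 0)
    if node == current_node:
        score += stickiness
    return score


def choose_node(nodes, location_scores, stickiness, current_node=None):
    """Pick the node with the highest simplified score."""
    return max(
        nodes,
        key=lambda n: node_score(n, location_scores, stickiness, current_node),
    )


nodes = ["node95", "node96"]
location = {"node96": 50}  # location loc_postgres_res postgres_res 50: node96

# Initial placement: nothing is running yet, so the location preference wins.
print(choose_node(nodes, location, stickiness=100))

# After a failover the resource runs on node95; stickiness 100 outweighs 50.
print(choose_node(nodes, location, stickiness=100, current_node="node95"))
```

With a stickiness lower than the location score, the same call would send the resource back to node96, which is the automatic failback that the setting is meant to avoid.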

Modifying the database startup script

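Two assumptions for PG database switchover
• 1. Both the master and the slave database are running. If the master fails, the master/slave switchover happens the way we expect. The slave's pg is then promoted to master, the original master is broken, and manual intervention is needed to restore master/slave replication.
• Our databases support only a single master/slave switchover.

• 2. The slave is cold-started, meaning the slave was not running when the failover happened. A switchover still takes place, but the slave is opened read-only so the replication relationship is not broken. Manual intervention is then needed to promote the slave to master.

It is unlikely that the slave would need to be stopped through the cluster software, but we handle that case as well.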

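• Considerations for case 2:
The DBA has manually intervened on the slave host. For example, the slave was shut down for 2 days of maintenance, and a master/slave switchover is triggered during that time. We do not want the slave to be promoted to master here, because a lot of data could be lost: the slave's log replay has not caught up with the master. In this situation the slave is simply cold-started and replays the logs it has not yet applied until it is consistent with the master. At that point the DBA can manually start the master/slave switchover.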
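
The two cases can be summarized as a small decision function. This is only a sketch of the logic described above, not the actual modified startup script; the function name and the returned field names are hypothetical.

```python
def plan_standby_start(standby_was_running):
    """Decide how the slave comes up when the cluster fails over to it.

    standby_was_running: True if PostgreSQL on the slave was already up and
    replaying logs when the master failed (case 1); False if the slave has
    to be cold-started (case 2).
    """
    if standby_was_running:
        # Case 1: the slave is up to date, so promote it automatically.
        return {"mode": "promote_to_master", "dba_promotes_manually": False}
    # Case 2: start read-only, finish log replay, and wait for the DBA.
    return {"mode": "read_only_recovery", "dba_promotes_manually": True}


print(plan_standby_start(True))
print(plan_standby_start(False))
```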

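The VIP ARP table update problem
• When the VIP is brought up on a cluster node, it is announced by means of ARP spoofing.
– Running ifconfig -a on the node does not show the VIP (it does appear in the output of ip addr show).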

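The resource fail-count problem
• When a resource fails, it is first restarted on its current node. If the restart succeeds, fail-count is incremented by 1, which leaves one fewer failure before migration-threshold is reached. If the restart fails, the resource is moved to another node.
• When fail-count reaches its limit (at most 1000000), the resource is moved as well. The resource will not come back to that node until its fail-count is cleaned up. The limit is configurable.
• Command to clean up fail-count:
– crm resource cleanup <res> <node>
– crm resource cleanup fence_vm95 node96
• There is also the failure-timeout setting to consider.
• Details:
– http://bbs.pconline.cn/topic-2367.html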
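
The bookkeeping described on this slide can be sketched as a small Python class. It is a simplified model for illustration, not Pacemaker code: the class and method names are hypothetical, and treating a failed start as an immediate jump to the maximum value is an assumption that matches the "restart fails, so the resource moves" behaviour described above.

```python
INFINITY = 1000000  # the maximum value mentioned on this slide


class FailureTracker:
    """Per-node failure bookkeeping for one resource (simplified model)."""

    def __init__(self, migration_threshold=INFINITY):
        self.migration_threshold = migration_threshold
        self.fail_count = 0

    @property
    def banned(self):
        """True once the resource may no longer run on this node."""
        return self.fail_count >= self.migration_threshold

    def record_recovered_failure(self):
        """A failure that a local restart recovered from: fail-count += 1."""
        self.fail_count = min(self.fail_count + 1, INFINITY)
        return self.banned

    def record_failed_restart(self):
        """A restart that did not succeed: move the resource away at once."""
        self.fail_count = INFINITY
        return self.banned

    def cleanup(self):
        """Model of `crm resource cleanup <res> <node>`: reset the counter."""
        self.fail_count = 0


tracker = FailureTracker(migration_threshold=3)
for _ in range(3):
    tracker.record_recovered_failure()
print(tracker.fail_count, tracker.banned)

tracker.cleanup()
print(tracker.fail_count, tracker.banned)
```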

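Where the fence device comes into play
• If a resource fails to start, it is moved straight to the standby node; the fence device is not triggered.
• If a resource fails to stop, the fence device takes effect. By default it reboots the host; this is configurable.
• What happens if no fence device is enabled?
– If a resource fails to stop, the cluster keeps retrying the stop and the resource is never moved to the standby node. It ends up in an endless loop, and the resource can no longer provide service.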
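
The rules on this slide amount to a small decision table, sketched below in Python. The function name and the returned labels are hypothetical; the sketch only restates the behaviour described above and is not taken from Pacemaker itself.

```python
def recovery_action(failed_operation, fence_enabled):
    """Return what the cluster does after a failed start or stop,
    following the rules listed on this slide."""
    if failed_operation == "start":
        # A failed start never involves the fence device.
        return "move_to_standby"
    if failed_operation == "stop":
        if fence_enabled:
            # The node is fenced (rebooted by default); the resource can
            # then be started safely on the standby node.
            return "fence_node"
        # Without fencing the stop is retried and no failover happens.
        return "retry_stop_no_failover"
    raise ValueError("unsupported operation: %r" % (failed_operation,))


for op in ("start", "stop"):
    for fence in (True, False):
        print(op, fence, recovery_action(op, fence))
```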

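The split-brain problem
• When does split brain occur?
– It occurs when the heartbeat connections between cluster nodes are broken.
• What is the impact?
– Every node acts on its own and assumes the others have failed, so resources start up on each node and the fence devices keep rebooting each other's hosts. Shared resources fail: a shared disk runs into write locks, and a VIP that is up on several nodes at once becomes unreachable. Database resources end up with inconsistent or even corrupted data.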

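How to prevent split brain
• Split brain is fairly unlikely in a production environment.
– With 3 or more nodes it is almost impossible.
• Enable a fence device.
• Add a third-party voting node.
• Add multiple heartbeat links.
• Our solution:
– Enable a fence device + multiple heartbeat links.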

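Problems with two-node clusters
• The quorum problem
– Electing the resource manager's central node requires votes from more than sum(nodes)/2 of the nodes in the cluster.
– We solve this by setting the cluster property no-quorum-policy="ignore".
• The split-brain problem
– Enable a fence device.
– A two-node cluster must have a fence device enabled.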
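
The majority rule explains why two nodes are a special case: a single surviving node holds 1 of 2 votes, which is not more than half. The Python sketch below works through that arithmetic. The function names are hypothetical, and the handling of no-quorum-policy is reduced to the single "ignore" value used in our configuration.

```python
def has_quorum(votes_present, expected_votes):
    """Majority rule: strictly more than half of the expected votes."""
    return votes_present > expected_votes / 2


def keeps_managing_resources(votes_present, expected_votes, no_quorum_policy):
    """Simplified view: a partition keeps managing resources if it has
    quorum, or if the cluster is told to ignore the loss of quorum."""
    if has_quorum(votes_present, expected_votes):
        return True
    return no_quorum_policy == "ignore"


# Two-node cluster, one node down: 1 of 2 votes is not a majority.
print(has_quorum(1, 2))
# Three-node cluster, one node down: 2 of 3 votes is still a majority.
print(has_quorum(2, 3))
# With no-quorum-policy="ignore" the surviving node carries on anyway.
print(keeps_managing_resources(1, 2, "ignore"))
```

Because "ignore" lets both halves of a partitioned two-node cluster carry on, it only makes sense together with a fence device, as the slide states.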
