This document summarizes a presentation about SUSE Linux Enterprise High Availability Cluster Multi-Device. It discusses the main features of SUSE HA including policy driven clusters, cluster aware filesystems, and continuous data replication. It then describes the HA storage stack architecture and various options for doing HA storage including DRBD, clustered LVM2, and Cluster-MD. Cluster-MD is presented as a software-based RAID storage that provides redundancy at the device level across multiple nodes. Performance comparisons show Cluster-MD outperforming clustered LVM mirroring. Extensions to Cluster-MD are discussed including expanding the size of a Cluster-MD device.
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
1. SUSE Linux Enterprise High Availability
Cluster Multi-Device
Antoine Giniès
Project Manager / Release Manager
SUSE / aginies@suse.com
Expert Days Paris
Feb 2018
3. 3
SUSE Enterprise Server HA
Main Features
• Policy-Driven Cluster
• Cluster-Aware FS
• Continuous Data Replication
• Setup and Installation bootstrap
• Simple
7. 7
Doing HA storage
Two main solutions
• Cluster nodes have local storage, and each write request is sent over the
network (minimum of 2 nodes)
OR
• Redundant storage separate from the cluster nodes
– SAN (Fibre Channel, iSCSI, etc.)
– A single point of failure (SPOF)!
8. 8
High Availability – DRBD
• Distributed Replicated Block Device
• Master/slave resources are managed by the
Pacemaker + Corosync software stack
• The SLE HA stack manages service
ordering, dependencies, and failover
• Mirror of 2 block devices (RAID 1)
• Active-Passive
[Diagram: Host1 (virtual IP, Apache/ext4, DRBD Master) and Host2 (virtual IP, Apache/ext4, DRBD Slave), each on its own kernel, managed by Pacemaker + Corosync; on failure, the service fails over to Host2]
9. 9
Data Replication – DRBD
DRBD can be thought of as a networked RAID 1.
DRBD allows you to create a mirror of two block devices that are
located at two different sites across the network.
It mirrors data in real time, so replication occurs continuously,
and it works well for long-distance replication.
> Since SLE 12 HA SP2 → DRBD 9
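As a sketch, a two-node DRBD resource definition could look like the following (the resource name, hostnames, disk device, and addresses are all hypothetical):

```
# /etc/drbd.d/r0.res -- minimal two-node mirror (hypothetical values)
resource r0 {
    device    /dev/drbd0;
    disk      /dev/vdb;
    meta-disk internal;
    on host1 {
        address 192.168.1.1:7789;
    }
    on host2 {
        address 192.168.1.2:7789;
    }
}
```

Pacemaker then manages the master/slave (promotable) role of this resource, as shown on the slide above.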
10. 10
Clustered LVM2 (cLVM2)
• Allows multiple nodes to use LVM2 on a
shared disk
• cLVM2 coordinates LVM2 metadata
• Coordinates access to the shared data
• Multiple nodes accessing data in different
dedicated VGs is safe
• Active-Active
[Diagram: Host1 and Host2 each run clvmd coordinating LVM2 metadata, on top of Pacemaker + Corosync + DLM, accessing Shared LUN 1 and Shared LUN 2]
11. 11
Data Replication – cLVM/cmirrord
● There are different types of LV in cLVM: striped, mirrored, etc.
● cLVM extends LVM to support transparent management
of volume groups across the whole cluster
● With cLVM, we can also create mirrored LVs to achieve data replication;
cmirrord is used to track the mirror log info across the cluster
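As an illustration, a cluster-wide mirrored LV could be created along these lines (the VG/LV names, devices, and size are hypothetical; clvmd and DLM must already be running):

```
# Create a clustered VG on shared storage (hypothetical device paths)
vgcreate --clustered y vg_shared /dev/mapper/lun1 /dev/mapper/lun2
# Create a mirrored LV; cmirrord keeps the mirror log consistent cluster-wide
lvcreate --type mirror -m 1 -L 10G -n lv_mirror vg_shared
```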
12. 12
Problem
• All nodes have local storage, and each write request must be sent over the
network (minimum of 2 nodes)
• Complexity of making different nodes work together with multiple storage servers
• Poor performance of cLVM2/cmirrord
Solution?
Cluster MD
13. 13
Cluster-MD
Cluster Multi-device
• Software-based RAID storage
• Redundancy at the device level
• NOT a cluster FS!
• Ensures data between mirrors is consistent
• Better performance (vs. cLVM mirroring)
• RAID 1 (redundancy)
• Devices can be replaced at runtime
• On top of 2 SAN storage arrays → no more SPOF
• Possible to have more than 2 SANs
[Diagram: two hosts running Cluster-MD with per-node bitmaps and clvmd/lvmlockd, on top of Pacemaker + Corosync + DLM, mirroring across SAN1 and SAN2 (Shared LUN 1 to Shared LUN 4)]
14. 14
Data Replication – Cluster-MD
Internals:
– Cluster MD keeps a write-intent bitmap for each cluster node
– During "normal" I/O access, we assume the clustered filesystem ensures that only one
node writes to any given block at a time
– With each node having its own bitmap, no locking is needed
and the array does not have to be resynced during normal operation
– Cluster MD only touches the bitmaps when
resync/recovery etc. happens
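On a running array, the per-node bitmap slots can be inspected with mdadm (the device name follows the demo later in this deck; adapt it to your setup):

```
# Examine the write-intent bitmap stored on a member device;
# a clustered bitmap has one slot per cluster node
mdadm --examine-bitmap /dev/vdd
```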
15. 15
DRBD VS Cluster-MD
• DRBD
– SAN storage
– 2 nodes only (+1 backup)
– Regular FS possible
– Primary/Primary with a cluster-aware FS
– RAID 0 (striping) / RAID 1 (mirroring)
• Cluster-MD
– RAID 1 (mirroring)
– SAN storage
– > 2 nodes
– Cluster-aware FS
17. 17
Active/Active FS
• All nodes write to the same FS at the same time
• Cluster-aware FS (OCFS2 / GFS2)
• Each node can write to any block
• A locking service is mandatory (DLM)
• RAID 1:
– No need for extra coordination
– One possible issue: 2 nodes writing the same block at the same time
! cLVM and Cluster-MD never do locking themselves !
18. 18
Cluster-MD VS CLVM (in details)
• Cluster-MD
– While resyncing a device, reads come from a single device (until the resync finishes)
– Resync technical details:
• A bitmap of possibly-out-of-sync regions is stored
• A bit is set before writing and cleared at the end
– Updating the bitmap costs far less than resyncing the whole array (faster recovery)
– The bitmap is stored on all the devices in the array
– But there is a separate bitmap for each cluster node
– Setting/clearing a bit is a single-node operation: a simple write to all disks
• CLVM DM-RAID1
– Resync technical details:
• A dirty-region log managed by dm-log-userspace (mark or clear a region)
• This is a user-space daemon
– Log updates are replicated around the cluster through messages
– An acknowledgment returns to the original node, which then informs the kernel module
19. 19
Cluster-MD VS CLVM
• Cluster-MD:
– Writes to a block on each storage device
– Waits for confirmation
• DM-RAID1
– Sends a message in user space to all nodes
– An acknowledgment returns to the original node
– The info is then passed to the kernel module
25. 25
Cluster-MD Demo
• 3 virtual machines (ha1, ha2, ha3)
• SLE 12 SP3 HA, ready for use
• Attached disks:
– 1 system disk
– 1 SBD
– 3 × 1 GB
– 3 × 2 GB
• Cluster-MD Deployment
26. 26
Cluster-MD setup (step by step)
• Install the cluster-md-kmp and mdadm packages on all nodes
• Shared storage: fake shared storage between nodes (vdd-vdi)
• Create a new CIB and switch to it (in the crm shell)
cib new cluster_md_demo
• CRM: DLM resource
primitive dlm ocf:pacemaker:controld op monitor interval='60' timeout='60'
group base-group dlm
clone base-clone base-group meta interleave=true target-role=Started
• Create the RAID1
mdadm --create /dev/md0 --bitmap=clustered --raid-devices=2 --level=mirror --spare-devices=1 /dev/vdd /dev/vde /dev/vdf
• Create /etc/mdadm.conf (using the UUID)
DEVICE /dev/vdd /dev/vde /dev/vdf /dev/vdg /dev/vdh /dev/vdi
ARRAY /dev/md0 metadata=1.2 spares=1 name=SLE12SP3ha3:0 UUID=c846e466:b7e15a4e:9ff54149:96b0dfb1
• Sync /etc/mdadm.conf to all nodes
• CRM: RAIDER primitive
primitive raider Raid1 params raidconf="/etc/mdadm.conf" raiddev="/dev/md0" force_clones=true
op monitor timeout=20s interval=10 op start timeout=20s interval=0 op stop timeout=20s interval=0
• mkfs.ocfs2 --cluster-stack pcmk -L 'VMtesting' --cluster-name hacluster /dev/md0
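After these steps, the array state can be checked on each node (the exact output depends on your devices):

```
# Verify the array is assembled with both mirrors active and one spare
cat /proc/mdstat
mdadm --detail /dev/md0
```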
27. 27
Cluster-MD Extend Demo: from 1 GB to 2 GB
• vdd vde vdf = 1 GB | vdg vdh vdi = 2 GB
• Add more backend devices (vdg vdh vdi)
– mdadm --manage /dev/md0 --add DEV
• Declare the 1 GB spare as failed, then remove it
– mdadm --manage /dev/md0 --fail /dev/vdX
– mdadm --manage /dev/md0 --remove /dev/vdX
• Still 2 active devices of 1 GB, and 3 spares of 2 GB
• Fail & remove 1 active → resync between the remaining 1 GB active and one of the 2 GB devices
• Once the sync is done, fail the last 1 GB active device; a 2 GB spare will replace it
• Remove the last failed 1 GB device
• Grow the size of /dev/md0
– mdadm --grow /dev/md0 --size=max
• Resize the FS (tunefs.ocfs2 or gfs2_grow)
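Assuming the layout from the earlier create step (actives vdd/vde and spare vdf at 1 GB, new 2 GB devices vdg/vdh/vdi), the whole replacement sequence amounts to something like:

```
# Add the 2 GB devices as spares
mdadm --manage /dev/md0 --add /dev/vdg /dev/vdh /dev/vdi
# Retire the 1 GB spare
mdadm --manage /dev/md0 --fail /dev/vdf
mdadm --manage /dev/md0 --remove /dev/vdf
# Replace each 1 GB active device in turn, letting the resync finish
mdadm --manage /dev/md0 --fail /dev/vdd
mdadm --manage /dev/md0 --remove /dev/vdd
# ...wait for the resync (watch /proc/mdstat), then do the same for /dev/vde
# Grow the array to the new device size, then resize the filesystem
mdadm --grow /dev/md0 --size=max
tunefs.ocfs2 -S /dev/md0
```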