This document summarizes a presentation about SUSE Linux Enterprise High Availability Cluster Multi-Device. It discusses the main features of SUSE HA including policy driven clusters, cluster aware filesystems, and continuous data replication. It then describes the HA storage stack architecture and various options for doing HA storage including DRBD, clustered LVM2, and Cluster-MD. Cluster-MD is presented as a software-based RAID storage that provides redundancy at the device level across multiple nodes. Performance comparisons show Cluster-MD outperforming clustered LVM mirroring. Extensions to Cluster-MD are discussed including expanding the size of a Cluster-MD device.
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
1. SUSE Linux Enterprise High Availability
Cluster Multi-Device
Antoine Giniès
Project Manager / Release Manager
SUSE / aginies@suse.com
Expert Days Paris
Feb 2018
3. 3
SUSE Enterprise Server HA
Main Features
• Policy-Driven Cluster
• Cluster-Aware FS
• Continuous Data Replication
• Setup and Installation bootstrap
• Simple
7. 7
Doing HA storage
Two main solutions
• Cluster nodes have local storage, and each write request is sent over the
network (minimum of 2 nodes)
OR
• Redundant storage separate from the cluster nodes
– SAN (Fibre Channel, iSCSI, etc.)
– A single point of failure (SPOF)!
8. 8
High Availability – DRBD
• Distributed Replicated Block Device
• Master/slave resources are managed by the
Pacemaker + Corosync software stack
• The SLE HA stack manages service
ordering, dependencies, and failover
• Mirror of 2 block devices (RAID 1)
• Active-Passive
[Diagram: Host1 (virtual IP, Apache/ext4, DRBD Master) and Host2 (virtual IP, Apache/ext4, DRBD Slave), each on its own kernel, managed by Pacemaker + Corosync; on failure, the service fails over to Host2]
9. 9
Data Replication – DRBD
DRBD can be thought of as a networked RAID 1.
DRBD allows you to create a mirror of two block devices that are
located at two different sites across the network.
It mirrors data in real time, so replication occurs continuously,
and it works well for long-distance replication.
> Since SLE 12 HA SP2 → DRBD 9
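As a sketch, a two-node DRBD resource definition could look like the following (the resource name, hostnames, disk device, and addresses are all hypothetical):

```
# /etc/drbd.d/r0.res -- minimal two-node mirror (hypothetical values)
resource r0 {
    device    /dev/drbd0;
    disk      /dev/vdb;
    meta-disk internal;
    on host1 {
        address 192.168.1.1:7789;
    }
    on host2 {
        address 192.168.1.2:7789;
    }
}
```

Pacemaker then manages the master/slave (promotable) role of this resource, as shown on the slide above.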
10. 10
Clustered LVM2 (cLVM2)
• Allows multiple nodes to use LVM2 on a
shared disk
• cLVM2 coordinates LVM2 metadata
• Coordinates access to the shared data
• Multiple nodes accessing data in different
dedicated VGs is safe
• Active-Active
[Diagram: Host1 and Host2 each run clvmd coordinating LVM2 metadata, on top of Pacemaker + Corosync + DLM, accessing Shared LUN 1 and Shared LUN 2]
11. 11
Data Replication – cLVM/cmirrord
● There are different types of LV in cLVM: striped, mirrored, etc.
● cLVM extends LVM to support transparent management
of volume groups across the whole cluster
● With cLVM, we can also create mirrored LVs to achieve data replication;
cmirrord is used to track the mirror log info across the cluster
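As an illustration, a cluster-wide mirrored LV could be created along these lines (the VG/LV names, devices, and size are hypothetical; clvmd and DLM must already be running):

```
# Create a clustered VG on shared storage (hypothetical device paths)
vgcreate --clustered y vg_shared /dev/mapper/lun1 /dev/mapper/lun2
# Create a mirrored LV; cmirrord keeps the mirror log consistent cluster-wide
lvcreate --type mirror -m 1 -L 10G -n lv_mirror vg_shared
```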
12. 12
Problem
• All nodes have local storage, and each write request must be sent over the
network (minimum of 2 nodes)
• Complexity of making different nodes work together with multiple storage servers
• Poor performance of cLVM2/cmirrord
Solution?
Cluster MD
13. 13
Cluster-MD
Cluster Multi-device
• Software-based RAID storage
• Redundancy at the device level
• NOT a cluster FS!
• Ensures data between mirrors is consistent
• Better performance (vs. cLVM mirroring)
• RAID 1 (redundancy)
• Devices can be replaced at runtime
• On top of 2 SAN storage arrays → no more SPOF
• Possible to have more than 2 SANs
[Diagram: two hosts running Cluster-MD with per-node bitmaps and clvmd/lvmlockd, on top of Pacemaker + Corosync + DLM, mirroring across SAN1 and SAN2 (Shared LUN 1 to Shared LUN 4)]
14. 14
Data Replication – Cluster-MD
Internals:
– Cluster MD keeps a write-intent bitmap for each cluster node
– During "normal" I/O access, we assume the clustered filesystem ensures that only one
node writes to any given block at a time
– With each node having its own bitmap, no locking is needed
and the array does not have to be resynced during normal operation
– Cluster MD only touches the bitmaps when
resync/recovery etc. happens
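On a running array, the per-node bitmap slots can be inspected with mdadm (the device name follows the demo later in this deck; adapt it to your setup):

```
# Examine the write-intent bitmap stored on a member device;
# a clustered bitmap has one slot per cluster node
mdadm --examine-bitmap /dev/vdd
```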
15. 15
DRBD VS Cluster-MD
• DRBD
– SAN storage
– 2 nodes only (+1 backup)
– Regular FS possible
– Primary/Primary with a cluster-aware FS
– RAID 0 (striping) / RAID 1 (mirroring)
• Cluster-MD
– RAID 1 (mirroring)
– SAN storage
– > 2 nodes
– Cluster-aware FS
17. 17
Active/Active FS
• All nodes write to the same FS at the same time
• Cluster-aware FS (OCFS2 / GFS2)
• Each node can write to any block
• A locking service is mandatory (DLM)
• RAID 1:
– No need for extra coordination
– One possible issue: 2 nodes writing the same block at the same time
! cLVM and Cluster-MD never do locking themselves !
18. 18
Cluster-MD VS CLVM (in details)
• Cluster-MD
– While resyncing a device, reads come from a single device (until the resync finishes)
– Resync technical details:
• A bitmap of possibly-out-of-sync regions is stored
• A bit is set before writing and cleared at the end
– Updating the bitmap costs far less than resyncing the whole array (faster recovery)
– The bitmap is stored on all the devices in the array
– But there is a separate bitmap for each cluster node
– Setting/clearing a bit is a single-node operation: a simple write to all disks
• CLVM DM-RAID1
– Resync technical details:
• A dirty-region log managed by dm-log-userspace (mark or clear a region)
• This is a user-space daemon
– Log updates are replicated around the cluster through messages
– An acknowledgment returns to the original node, which then informs the kernel module
19. 19
Cluster-MD VS CLVM
• Cluster-MD:
– Writes to a block on each storage device
– Waits for confirmation
• DM-RAID1
– Sends a message in user space to all nodes
– An acknowledgment returns to the original node
– The info is then passed to the kernel module
25. 25
Cluster-MD Demo
• 3 virtual machines (ha1, ha2, ha3)
• SLE 12 SP3 HA, ready for use
• Attached disks:
– 1 system disk
– 1 SBD
– 3 × 1 GB
– 3 × 2 GB
• Cluster-MD Deployment
26. 26
Cluster-MD setup (step by step)
• Install the cluster-md-kmp and mdadm packages on all nodes
• Shared storage: fake shared storage between nodes (vdd-vdi)
• Create a new CIB and switch to it (in the crm shell)
cib new cluster_md_demo
• CRM: DLM resource
primitive dlm ocf:pacemaker:controld op monitor interval='60' timeout='60'
group base-group dlm
clone base-clone base-group meta interleave=true target-role=Started
• Create the RAID1
mdadm --create /dev/md0 --bitmap=clustered --raid-devices=2 --level=mirror --spare-devices=1 /dev/vdd /dev/vde /dev/vdf
• Create /etc/mdadm.conf (using the UUID)
DEVICE /dev/vdd /dev/vde /dev/vdf /dev/vdg /dev/vdh /dev/vdi
ARRAY /dev/md0 metadata=1.2 spares=1 name=SLE12SP3ha3:0 UUID=c846e466:b7e15a4e:9ff54149:96b0dfb1
• Sync /etc/mdadm.conf to all nodes
• CRM: RAIDER primitive
primitive raider Raid1 params raidconf="/etc/mdadm.conf" raiddev="/dev/md0" force_clones=true
op monitor timeout=20s interval=10 op start timeout=20s interval=0 op stop timeout=20s interval=0
• mkfs.ocfs2 --cluster-stack pcmk -L 'VMtesting' --cluster-name hacluster /dev/md0
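After these steps, the array state can be checked on each node (the exact output depends on your devices):

```
# Verify the array is assembled with both mirrors active and one spare
cat /proc/mdstat
mdadm --detail /dev/md0
```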
27. 27
Cluster-MD Extend Demo: from 1 GB to 2 GB
• vdd vde vdf = 1 GB | vdg vdh vdi = 2 GB
• Add more backend devices (vdg vdh vdi)
– mdadm --manage /dev/md0 --add DEV
• Declare the 1 GB spare as failed, then remove it
– mdadm --manage /dev/md0 --fail /dev/vdX
– mdadm --manage /dev/md0 --remove /dev/vdX
• Still 2 active devices of 1 GB, and 3 spares of 2 GB
• Fail & remove 1 active → resync between the remaining 1 GB active and one of the 2 GB devices
• Once the sync is done, fail the last 1 GB active device; a 2 GB spare will replace it
• Remove the last failed 1 GB device
• Grow the size of /dev/md0
– mdadm --grow /dev/md0 --size=max
• Resize the FS (tunefs.ocfs2 or gfs2_grow)
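Assuming the layout from the earlier create step (actives vdd/vde and spare vdf at 1 GB, new 2 GB devices vdg/vdh/vdi), the whole replacement sequence amounts to something like:

```
# Add the 2 GB devices as spares
mdadm --manage /dev/md0 --add /dev/vdg /dev/vdh /dev/vdi
# Retire the 1 GB spare
mdadm --manage /dev/md0 --fail /dev/vdf
mdadm --manage /dev/md0 --remove /dev/vdf
# Replace each 1 GB active device in turn, letting the resync finish
mdadm --manage /dev/md0 --fail /dev/vdd
mdadm --manage /dev/md0 --remove /dev/vdd
# ...wait for the resync (watch /proc/mdstat), then do the same for /dev/vde
# Grow the array to the new device size, then resize the filesystem
mdadm --grow /dev/md0 --size=max
tunefs.ocfs2 -S /dev/md0
```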