SlideShare a Scribd company logo
High Availability Storage
SUSE's Building Blocks
Roger Zhou
Sr. Eng Manager
zzhou@suse.com
Guoqing Jiang
HA Specialist
gqjiang@suse.com
2
SLE-HA Introduction
The SUSE Linux Enterprise High Availability Extension is powered by Pacemaker & Corosync
to eliminate SPOF for critical data, applications, and services, to implement HA clusters.
TUT88811 - Practical High Availability
Web Site A
Web Site B
Web Server 1
Web Site C
Web Site D
Web Site E
Web Site F
Web Server 2 Web Server 3
Shared StorageFibre Channel Switch
Web Site A Web Site B
Web Server 1
Web Site C
Web Site D
Web Site E
Web Site F
Web Server 2 Web Server 3
Shared StorageFibre Channel Switch
3
Storage
Software Elements
– Block Device: hdX, sdX, vdX => vgX, dmX, mdX
– Filesystem: ext3/4, xfs, btrfs => ocfs2, gfs2
Enterprise Requirements ( must-have )
– High Available (Active-Passive, Active-Active)
– Data Protection (Data Replication)
4
Small Business
5
High Availability Block Device
- active/passive
6
High Availability – DRBD
• DRBD – Data Replication Block Device
• A special master/slave resources is
managed by pacemaker, corosync
software stack
• SLE HA stack manages the service
ordering, dependency, and failover
• Active-Passive approach
Host1 Host2
Virtual IP
Apache/ext4
DRBD Master
Kernel
Virtual IP
Apache/ext4
DRBD Slave
Pacemaker + Corosync
Kernel
Failover
7
High Availability LVM
• HA-LVM resources managed by
pacemaker, corosync software stack
• lvmconf --enable-halvm
• Active-Passive approach
Host1 Host2
Virtual IP
LVM
DRBD Master
Kernel
Virtual IP
LVM
DRBD Slave
Pacemaker + Corosync
Kernel
Failover
8
High Availability iSCSI Server
• iSCSI provides the block devices
over TCP/IP
• Active-Passive approach
Host1 Host2
Virtual IP
ISCSILogicalUnit
iSCSITarget
Kernel
Virtual IP
iSCSILogicalUnit
iSCSITarget
Pacemaker + Corosync
Kernel
Failover
DRBD
Master/LVM
DRBD
Slave/LVM
9
High Availability Filesystem
10
High Availability NFS (HA-NAS)
• NFS server on top of ext3/ext4
• Same applies to:
– xfs
– cifs samba
– etc.
• Active-Passive approach
Host1 Host2
Virtual IP
NFS Exports
Filesystem(ext4)
Kernel
Virtual IP
NFS Exports
Filesystem(ext4)
Pacemaker + Corosync
Kernel
Failover
DRBD
Master/LVM
DRBD
Slave/LVM
11
High Availability NFS (HA-NAS) with Shared Disk
• Active-Passive approach
NOTE: multipathing is “must-have”
for Enterprise
Host1 Host2
Virtual IP
NFS Exports
Kernel
Virtual IP
NFS Exports
Pacemaker + Corosync
Kernel
Failover
Shared
Disk
Filesystem(ext4) Filesystem(ext4)
12
Motivation for Your Storage Growth
Scalability —► extendable storage
Easy Management —► cluster aware components
Cost Effective —► shared resources
Performance —► active – active
13
Concurrent Sharing – you need a lock
14
Clustered LVM2 (cLVM2)
• cLVM2 enables multiple nodes to use
LVM2 on the shared disk
• cLVM2 coordinates LVM2 metadata,
does not coordinate access to the
shared data
• That said, multiple nodes access data
of different dedicated VG are safe
• lvmconf --enable-cluster
• Active-Active approach
Host1 Host2
Pacemaker + Corosync +DLM
LVM2
Metadata
clvmd
Shared LUN 2
Shared LUN 1
LVM2
Metadata
clvmd
15
Clustered FS – OCFS2 on shared disk
• OCFS2 resources is managed by
using pacemaker, corosync, dlm
software stack
• Multiple host can modify the same
file with performance penalty
• DLM lock protects the data access
• Active-Active approach
Host1 Host2
Clustered File System
Kernel
Pacemaker + Corosync + DLM
Kernel
OCFS2 OCFS2
Shared
Disk
16
OCFS2 + cLVM2 (both volumes and data are safe)
• cLVM provides clustered VG. Metadata
is protected
• OCFS2 protect the data access inside
the volume
• Safe to enlarge the filesystem size
• Active-Active approach
Host1 Host2
Clustered File System
Kernel
Pacemaker + Corosync + DLM
Kernel
cLVM2 cLVM2
Shared
Disk
OCFS2 OCFS2
17
Is Data Safe Enough?
18
Data Replication
Data Replication in general
– File Replication (Not the focus of this talk.)
– Block Level Replication
• A great new feature “Clustered MD RAID 1” now is available in in SLE12 HA SP2
– Object Storage Replication
• TUT89016 - SUSE Enterprise Storage: Disaster Recovery with Block Mirroring
Requirements in HA context
– Multiple nodes
– Active-active
19
Data Replication – DRBD
DRBD can be thought as a networked RAID1.
DRBD allows you to create a mirror of two block devices that
are located at two different sites across the network.
It mirrors data in real-time, so its replication occurs
continuously, and works well for long distance replication.
SLE 12 HA SP2 now supports DRBD 9.
20
Data Replication – clvm/cmirrord
● There are different types of LV in CLVM: Snapshot, Striped and
Mirrored etc.
● CLVM has been extended from LVM to support transparent
management of volume groups across the whole cluster.
● For CLVM, we can also created mirrored lv to achieve data
replication, and cmirrord is used to tracks mirror log info
in a cluster.
21
Data Replication – clustered md/raid1
The cluster multi-device (Cluster MD) is a software based RAID
storage solution for a cluster.
The biggest motivation for Cluster MD is that CLVM/cmirrord has
severe performance issue and people are not happy about it.
Cluster MD provides the redundancy of RAID1
mirroring to the cluster. SUSE considers to
support other RAID levels with further evaluation.
22
Data Replication – clustered md/raid1 (cont.)
Internals:
– Cluster MD keeps write-intent-bitmap for each cluster node.
– During "normal" I/O access, we assume the clustered filesystem ensures that only one
node writes to any given block at a time.
– With each node have it's own bitmap, there would be no locking
and no need to keep sync array during normal operation.
– Cluster MD would only handle the bitmaps when
resync/recovery etc happened.
23
Data Replication – clustered md/raid1 (cont.)
It coordinates RAID1 metadata, not coordinate access to the
shared data, and it’s performance is close to native RAID1.
CLVM must send a message to user space, that must be
passed to all nodes. Acknowledgments must return to the
originating node, then to the kernel module.
Some related links:
https://lwn.net/Articles/674085/
http://www.spinics.net/lists/raid/msg47863.html
24
Data Replication – Comparison
Active/Active mode Suitable for Geo Shared Storage
DRBD
Supported
(limited to two nodes)
Yes
No,
storage is dedicated
to each node.
CLVM
(cmirrord)
Supported, the node number
is limited by pacemaker and
corosync
No Yes
Clustered
md/raid1
Supported, the node number
is limited by pacemaker and
corosync
No Yes
25
Data Replication – Performance Comparison
FIO test with sync engine
Read Write
4k
16k
0 500 1000 1500 2000 2500 3000 3500
RawDisk
NativeRaid
Clustermd
Cmirror
Average iops
Blocksize
4k
16k
0 1000 2000 3000 4000 5000 6000 7000 8000
RawDisk
NativeRaid
Clustermd
Cmirror
Average iops
Blocksize
26
Data Replication – Performance Comparison
FIO test with libaio engine
Read Write
4k
16k
0 5000 10000 15000 20000 25000 30000 35000 40000
RawDisk
NativeRaid
Clustermd
Cmirror
Average iops
Blocksize
4k
16k
0 5000 10000 15000 20000 25000
RawDisk
NativeRaid
Clustermd
Cmirror
Average iopsBlocksize
27
Recap
28
Recap: HA Storage Building Blocks
Block-level
HA DRBD
HA LVM2
HA iSCSI
cLVM2
Clustered MD RAID1
Filesystem-level
HA NFS / CIFS
HA EXT3/EXT4/XFS
OCFS2 (GFS2)
29
Question & Answer
High Availability Storage (susecon2016)

More Related Content

What's hot

Comparison between OCFS2 and GFS2
Comparison between OCFS2 and GFS2Comparison between OCFS2 and GFS2
Comparison between OCFS2 and GFS2
Gang He
 
Five common customer use cases for Virtual SAN - VMworld US / 2015
Five common customer use cases for Virtual SAN - VMworld US / 2015Five common customer use cases for Virtual SAN - VMworld US / 2015
Five common customer use cases for Virtual SAN - VMworld US / 2015
Duncan Epping
 
Ovs dpdk hwoffload way to full offload
Ovs dpdk hwoffload way to full offloadOvs dpdk hwoffload way to full offload
Ovs dpdk hwoffload way to full offload
Kevin Traynor
 
Open vSwitch 패킷 처리 구조
Open vSwitch 패킷 처리 구조Open vSwitch 패킷 처리 구조
Open vSwitch 패킷 처리 구조
Seung-Hoon Baek
 
ZFS
ZFSZFS
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-netReceive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
Yan Vugenfirer
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Rongze Zhu
 
Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0
Ceph Community
 
Using CloudStack With Clustered LVM
Using CloudStack With Clustered LVMUsing CloudStack With Clustered LVM
Using CloudStack With Clustered LVM
Marcus L Sorensen
 
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
ShapeBlue
 
QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?
Pradeep Kumar
 
Virtualized network with openvswitch
Virtualized network with openvswitchVirtualized network with openvswitch
Virtualized network with openvswitchSim Janghoon
 
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA ArchitectureCeph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Danielle Womboldt
 
Red Hat Global File System (GFS)
Red Hat Global File System (GFS)Red Hat Global File System (GFS)
Red Hat Global File System (GFS)Schubert Zhang
 
Red hat enterprise linux 7 (rhel 7)
Red hat enterprise linux 7 (rhel 7)Red hat enterprise linux 7 (rhel 7)
Red hat enterprise linux 7 (rhel 7)
Ramola Dhande
 
DPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet ProcessingDPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet Processing
Michelle Holley
 
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Nelson Calero
 
Ceph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud worldCeph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud world
Sage Weil
 
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
xKinAnx
 
Automated CloudStack Deployment
Automated CloudStack DeploymentAutomated CloudStack Deployment
Automated CloudStack Deployment
ShapeBlue
 

What's hot (20)

Comparison between OCFS2 and GFS2
Comparison between OCFS2 and GFS2Comparison between OCFS2 and GFS2
Comparison between OCFS2 and GFS2
 
Five common customer use cases for Virtual SAN - VMworld US / 2015
Five common customer use cases for Virtual SAN - VMworld US / 2015Five common customer use cases for Virtual SAN - VMworld US / 2015
Five common customer use cases for Virtual SAN - VMworld US / 2015
 
Ovs dpdk hwoffload way to full offload
Ovs dpdk hwoffload way to full offloadOvs dpdk hwoffload way to full offload
Ovs dpdk hwoffload way to full offload
 
Open vSwitch 패킷 처리 구조
Open vSwitch 패킷 처리 구조Open vSwitch 패킷 처리 구조
Open vSwitch 패킷 처리 구조
 
ZFS
ZFSZFS
ZFS
 
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-netReceive side scaling (RSS) with eBPF in QEMU and virtio-net
Receive side scaling (RSS) with eBPF in QEMU and virtio-net
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
 
Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0
 
Using CloudStack With Clustered LVM
Using CloudStack With Clustered LVMUsing CloudStack With Clustered LVM
Using CloudStack With Clustered LVM
 
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
Deploying CloudStack and Ceph with flexible VXLAN and BGP networking
 
QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?QEMU Disk IO Which performs Better: Native or threads?
QEMU Disk IO Which performs Better: Native or threads?
 
Virtualized network with openvswitch
Virtualized network with openvswitchVirtualized network with openvswitch
Virtualized network with openvswitch
 
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA ArchitectureCeph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture
 
Red Hat Global File System (GFS)
Red Hat Global File System (GFS)Red Hat Global File System (GFS)
Red Hat Global File System (GFS)
 
Red hat enterprise linux 7 (rhel 7)
Red hat enterprise linux 7 (rhel 7)Red hat enterprise linux 7 (rhel 7)
Red hat enterprise linux 7 (rhel 7)
 
DPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet ProcessingDPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet Processing
 
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...
 
Ceph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud worldCeph data services in a multi- and hybrid cloud world
Ceph data services in a multi- and hybrid cloud world
 
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
Ibm spectrum scale fundamentals workshop for americas part 4 spectrum scale_r...
 
Automated CloudStack Deployment
Automated CloudStack DeploymentAutomated CloudStack Deployment
Automated CloudStack Deployment
 

Similar to High Availability Storage (susecon2016)

Linux High Availability Overview - openSUSE.Asia Summit 2015
Linux High Availability Overview - openSUSE.Asia Summit 2015 Linux High Availability Overview - openSUSE.Asia Summit 2015
Linux High Availability Overview - openSUSE.Asia Summit 2015
Roger Zhou 周志强
 
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-DeviceSUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
SUSE
 
How To Build A Scalable Storage System with OSS at TLUG Meeting 2008/09/13
How To Build A Scalable Storage System with OSS at TLUG Meeting 2008/09/13How To Build A Scalable Storage System with OSS at TLUG Meeting 2008/09/13
How To Build A Scalable Storage System with OSS at TLUG Meeting 2008/09/13Gosuke Miyashita
 
Gpfs introandsetup
Gpfs introandsetupGpfs introandsetup
Gpfs introandsetup
asihan
 
Distributed replicated block device
Distributed replicated block deviceDistributed replicated block device
Distributed replicated block device
Chanaka Lasantha
 
What's new in Jewel and Beyond
What's new in Jewel and BeyondWhat's new in Jewel and Beyond
What's new in Jewel and Beyond
Sage Weil
 
RAC - The Savior of DBA
RAC - The Savior of DBARAC - The Savior of DBA
RAC - The Savior of DBA
Nikhil Kumar
 
GlusterFS Update and OpenStack Integration
GlusterFS Update and OpenStack IntegrationGlusterFS Update and OpenStack Integration
GlusterFS Update and OpenStack Integration
Etsuji Nakai
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Ceph Community
 
Big Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and MesosBig Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and Mesos
Heiko Loewe
 
Glusterfs for sysadmins-justin_clift
Glusterfs for sysadmins-justin_cliftGlusterfs for sysadmins-justin_clift
Glusterfs for sysadmins-justin_clift
Gluster.org
 
Building High Availability Clusters with SUSE Linux Enterprise High Availabil...
Building High Availability Clusters with SUSE Linux Enterprise High Availabil...Building High Availability Clusters with SUSE Linux Enterprise High Availabil...
Building High Availability Clusters with SUSE Linux Enterprise High Availabil...
Novell
 
Performance characterization in large distributed file system with gluster fs
Performance characterization in large distributed file system with gluster fsPerformance characterization in large distributed file system with gluster fs
Performance characterization in large distributed file system with gluster fs
Neependra Khare
 
What CloudStackers Need To Know About LINSTOR/DRBD
What CloudStackers Need To Know About LINSTOR/DRBDWhat CloudStackers Need To Know About LINSTOR/DRBD
What CloudStackers Need To Know About LINSTOR/DRBD
ShapeBlue
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
shrey mehrotra
 
OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBITOpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
OpenNebula Project
 
First steps on CentOs7
First steps on CentOs7First steps on CentOs7
First steps on CentOs7
Marc Cortinas Val
 
Lightweight Virtualization in Linux
Lightweight Virtualization in LinuxLightweight Virtualization in Linux
Lightweight Virtualization in Linux
Sadegh Dorri N.
 
ZFS and MySQL on Linux, the Sweet Spots
ZFS and MySQL on Linux, the Sweet SpotsZFS and MySQL on Linux, the Sweet Spots
ZFS and MySQL on Linux, the Sweet Spots
Jervin Real
 

Similar to High Availability Storage (susecon2016) (20)

Linux High Availability Overview - openSUSE.Asia Summit 2015
Linux High Availability Overview - openSUSE.Asia Summit 2015 Linux High Availability Overview - openSUSE.Asia Summit 2015
Linux High Availability Overview - openSUSE.Asia Summit 2015
 
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-DeviceSUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
 
How To Build A Scalable Storage System with OSS at TLUG Meeting 2008/09/13
How To Build A Scalable Storage System with OSS at TLUG Meeting 2008/09/13How To Build A Scalable Storage System with OSS at TLUG Meeting 2008/09/13
How To Build A Scalable Storage System with OSS at TLUG Meeting 2008/09/13
 
Gpfs introandsetup
Gpfs introandsetupGpfs introandsetup
Gpfs introandsetup
 
Distributed replicated block device
Distributed replicated block deviceDistributed replicated block device
Distributed replicated block device
 
DAS RAID NAS SAN
DAS RAID NAS SANDAS RAID NAS SAN
DAS RAID NAS SAN
 
What's new in Jewel and Beyond
What's new in Jewel and BeyondWhat's new in Jewel and Beyond
What's new in Jewel and Beyond
 
RAC - The Savior of DBA
RAC - The Savior of DBARAC - The Savior of DBA
RAC - The Savior of DBA
 
GlusterFS Update and OpenStack Integration
GlusterFS Update and OpenStack IntegrationGlusterFS Update and OpenStack Integration
GlusterFS Update and OpenStack Integration
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
 
Big Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and MesosBig Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and Mesos
 
Glusterfs for sysadmins-justin_clift
Glusterfs for sysadmins-justin_cliftGlusterfs for sysadmins-justin_clift
Glusterfs for sysadmins-justin_clift
 
Building High Availability Clusters with SUSE Linux Enterprise High Availabil...
Building High Availability Clusters with SUSE Linux Enterprise High Availabil...Building High Availability Clusters with SUSE Linux Enterprise High Availabil...
Building High Availability Clusters with SUSE Linux Enterprise High Availabil...
 
Performance characterization in large distributed file system with gluster fs
Performance characterization in large distributed file system with gluster fsPerformance characterization in large distributed file system with gluster fs
Performance characterization in large distributed file system with gluster fs
 
What CloudStackers Need To Know About LINSTOR/DRBD
What CloudStackers Need To Know About LINSTOR/DRBDWhat CloudStackers Need To Know About LINSTOR/DRBD
What CloudStackers Need To Know About LINSTOR/DRBD
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBITOpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
 
First steps on CentOs7
First steps on CentOs7First steps on CentOs7
First steps on CentOs7
 
Lightweight Virtualization in Linux
Lightweight Virtualization in LinuxLightweight Virtualization in Linux
Lightweight Virtualization in Linux
 
ZFS and MySQL on Linux, the Sweet Spots
ZFS and MySQL on Linux, the Sweet SpotsZFS and MySQL on Linux, the Sweet Spots
ZFS and MySQL on Linux, the Sweet Spots
 

Recently uploaded

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 

Recently uploaded (20)

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 

High Availability Storage (susecon2016)

  • 1. High Availability Storage SUSE's Building Blocks Roger Zhou Sr. Eng Manager zzhou@suse.com Guoqing Jiang HA Specialist gqjiang@suse.com
  • 2. 2 SLE-HA Introduction The SUSE Linux Enterprise High Availability Extension is powered by Pacemaker & Corosync to eliminate SPOF for critical data, applications, and services, to implement HA clusters. TUT88811 - Practical High Availability Web Site A Web Site B Web Server 1 Web Site C Web Site D Web Site E Web Site F Web Server 2 Web Server 3 Shared StorageFibre Channel Switch Web Site A Web Site B Web Server 1 Web Site C Web Site D Web Site E Web Site F Web Server 2 Web Server 3 Shared StorageFibre Channel Switch
  • 3. 3 Storage Software Elements – Block Device: hdX, sdX, vdX => vgX, dmX, mdX – Filesystem: ext3/4, xfs, btrfs => ocfs2, gfs2 Enterprise Requirements ( must-have ) – High Available (Active-Passive, Active-Active) – Data Protection (Data Replication)
  • 5. 5 High Availability Block Device - active/passive
  • 6. 6 High Availability – DRBD • DRBD – Data Replication Block Device • A special master/slave resources is managed by pacemaker, corosync software stack • SLE HA stack manages the service ordering, dependency, and failover • Active-Passive approach Host1 Host2 Virtual IP Apache/ext4 DRBD Master Kernel Virtual IP Apache/ext4 DRBD Slave Pacemaker + Corosync Kernel Failover
  • 7. 7 High Availability LVM • HA-LVM resources managed by pacemaker, corosync software stack • lvmconf --enable-halvm • Active-Passive approach Host1 Host2 Virtual IP LVM DRBD Master Kernel Virtual IP LVM DRBD Slave Pacemaker + Corosync Kernel Failover
  • 8. 8 High Availability iSCSI Server • iSCSI provides the block devices over TCP/IP • Active-Passive approach Host1 Host2 Virtual IP ISCSILogicalUnit iSCSITarget Kernel Virtual IP iSCSILogicalUnit iSCSITarget Pacemaker + Corosync Kernel Failover DRBD Master/LVM DRBD Slave/LVM
  • 10. 10 High Availability NFS (HA-NAS) • NFS server on top of ext3/ext4 • Same applies to: – xfs – cifs samba – etc. • Active-Passive approach Host1 Host2 Virtual IP NFS Exports Filesystem(ext4) Kernel Virtual IP NFS Exports Filesystem(ext4) Pacemaker + Corosync Kernel Failover DRBD Master/LVM DRBD Slave/LVM
  • 11. 11 High Availability NFS (HA-NAS) with Shared Disk • Active-Passive approach NOTE: multipathing is “must-have” for Enterprise Host1 Host2 Virtual IP NFS Exports Kernel Virtual IP NFS Exports Pacemaker + Corosync Kernel Failover Shared Disk Filesystem(ext4) Filesystem(ext4)
  • 12. 12 Motivation for Your Storage Growth Scalability —► extendable storage Easy Management —► cluster aware components Cost Effective —► shared resources Performance —► active – active
  • 13. 13 Concurrent Sharing – you need a lock
  • 14. 14 Clustered LVM2 (cLVM2) • cLVM2 enables multiple nodes to use LVM2 on the shared disk • cLVM2 coordinates LVM2 metadata, does not coordinate access to the shared data • That said, multiple nodes access data of different dedicated VG are safe • lvmconf --enable-cluster • Active-Active approach Host1 Host2 Pacemaker + Corosync +DLM LVM2 Metadata clvmd Shared LUN 2 Shared LUN 1 LVM2 Metadata clvmd
  • 15. 15 Clustered FS – OCFS2 on shared disk • OCFS2 resources is managed by using pacemaker, corosync, dlm software stack • Multiple host can modify the same file with performance penalty • DLM lock protects the data access • Active-Active approach Host1 Host2 Clustered File System Kernel Pacemaker + Corosync + DLM Kernel OCFS2 OCFS2 Shared Disk
  • 16. 16 OCFS2 + cLVM2 (both volumes and data are safe) • cLVM provides clustered VG. Metadata is protected • OCFS2 protect the data access inside the volume • Safe to enlarge the filesystem size • Active-Active approach Host1 Host2 Clustered File System Kernel Pacemaker + Corosync + DLM Kernel cLVM2 cLVM2 Shared Disk OCFS2 OCFS2
  • 17. 17 Is Data Safe Enough?
  • 18. 18 Data Replication Data Replication in general – File Replication (Not the focus of this talk.) – Block Level Replication • A great new feature “Clustered MD RAID 1” now is available in in SLE12 HA SP2 – Object Storage Replication • TUT89016 - SUSE Enterprise Storage: Disaster Recovery with Block Mirroring Requirements in HA context – Multiple nodes – Active-active
  • 19. 19 Data Replication – DRBD DRBD can be thought as a networked RAID1. DRBD allows you to create a mirror of two block devices that are located at two different sites across the network. It mirrors data in real-time, so its replication occurs continuously, and works well for long distance replication. SLE 12 HA SP2 now supports DRBD 9.
  • 20. 20 Data Replication – clvm/cmirrord ● There are different types of LV in CLVM: Snapshot, Striped and Mirrored etc. ● CLVM has been extended from LVM to support transparent management of volume groups across the whole cluster. ● For CLVM, we can also created mirrored lv to achieve data replication, and cmirrord is used to tracks mirror log info in a cluster.
  • 21. 21 Data Replication – clustered md/raid1 The cluster multi-device (Cluster MD) is a software based RAID storage solution for a cluster. The biggest motivation for Cluster MD is that CLVM/cmirrord has severe performance issue and people are not happy about it. Cluster MD provides the redundancy of RAID1 mirroring to the cluster. SUSE considers to support other RAID levels with further evaluation.
  • 22. 22 Data Replication – clustered md/raid1 (cont.) Internals: – Cluster MD keeps write-intent-bitmap for each cluster node. – During "normal" I/O access, we assume the clustered filesystem ensures that only one node writes to any given block at a time. – With each node have it's own bitmap, there would be no locking and no need to keep sync array during normal operation. – Cluster MD would only handle the bitmaps when resync/recovery etc happened.
  • 23. 23 Data Replication – clustered md/raid1 (cont.) It coordinates RAID1 metadata, not coordinate access to the shared data, and it’s performance is close to native RAID1. CLVM must send a message to user space, that must be passed to all nodes. Acknowledgments must return to the originating node, then to the kernel module. Some related links: https://lwn.net/Articles/674085/ http://www.spinics.net/lists/raid/msg47863.html
  • 24. 24 Data Replication – Comparison Active/Active mode Suitable for Geo Shared Storage DRBD Supported (limited to two nodes) Yes No, storage is dedicated to each node. CLVM (cmirrord) Supported, the node number is limited by pacemaker and corosync No Yes Clustered md/raid1 Supported, the node number is limited by pacemaker and corosync No Yes
  • 25. 25 Data Replication – Performance Comparison FIO test with sync engine Read Write 4k 16k 0 500 1000 1500 2000 2500 3000 3500 RawDisk NativeRaid Clustermd Cmirror Average iops Blocksize 4k 16k 0 1000 2000 3000 4000 5000 6000 7000 8000 RawDisk NativeRaid Clustermd Cmirror Average iops Blocksize
  • 26. 26 Data Replication – Performance Comparison FIO test with libaio engine Read Write 4k 16k 0 5000 10000 15000 20000 25000 30000 35000 40000 RawDisk NativeRaid Clustermd Cmirror Average iops Blocksize 4k 16k 0 5000 10000 15000 20000 25000 RawDisk NativeRaid Clustermd Cmirror Average iopsBlocksize
  • 28. 28 Recap: HA Storage Building Blocks Block-level HA DRBD HA LVM2 HA iSCSI cLVM2 Clustered MD RAID1 Filesystem-level HA NFS / CIFS HA EXT3/EXT4/XFS OCFS2 (GFS2)