Extending Linux Mission-Critical Workloads to Active-Active NVMe-oF Storage
- Recent progress from SUSE engineers
周志强 (Roger)
Senior R&D Manager, SUSE
zzhou@suse.com
2018 OpenInfra Days China
2
Let's talk about the keywords
● High Availability (<32 nodes)
● Stretched Cluster
● SDS: Ceph
● High Performance (Cloud) Storage Infrastructure
● SAN
● Act-Act
● DR
● NVMe-oF
3
NVMe-oF in Linux in brief
● The Linux storage stack catches up with hardware evolution
➢ Transport: 100M/1G → 10G/40G/100G network
➢ Media: HDD → SSD flash
➢ S/W stack: SCSI protocol → NVMe protocol
➢ NVMe-oF storage array: very high IOPS, very low latency
● Linux MD RAID1 new I/O barrier, 70% NVMe speed
➢ Contributed by Coly Li, Neil Brown, Hannes Reinecke, Guoqing Jiang, etc.
➢ 2017: SLE12SP2 Maintenance Update
● NVMe-oF products
➢ 2017: SLE12SP3 supports NVMe-oF with NetApp, Emulex, Mellanox
➢ 2018-05: Broadcom, NetApp and SUSE announce production availability
4
NVMe-oF in Data Centers
5
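[Diagram: a data center with Host1 to Host4, each running VM1 and VM2, attached to NVMe-oF storage]
● Expectation: FTT >= 2 (Failures To Tolerate: the number of failures that can be tolerated or recovered from)
● Expectation: RTO/RPO close to 0
● Expectation: data protection / disaster recovery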
6
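Success Stories: Stretched Cluster / Host-Based Mirroring
- Banking & Automakers
[Diagram: Site A (Host1, Host2) and Site B (Host3, Host4), each host running VM1 and VM2, on active-passive shared storage]
● Supports heterogeneous storage
● Removes lock-in to vendor-specific solutions and specific storage
● Removes lock-in to vendor-proprietary disaster recovery (DR) replication tools
● Removes lock-in to storage-vendor mirroring and replication tools
● Integrates seamlessly with Linux
** hundreds of clusters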
7
Success Stories: The Challenges Inside
- Banking & Automakers
What happens during failover?
On node-1:
1. Unmount the filesystem
2. Deactivate LVM
3. Remove the RAID1
On node-2:
4. Assemble the RAID1
5. Activate LVM
6. Mount the filesystem
Imagine hundreds of RAID1 devices: the RTO can be very long!
[Diagram: active node-1 and passive node-2, each stacking Applications/FS on vg / lv on MD RAID1. Node-1 mirrors /dev/mapper/mpath1 and mpath2’; node-2 mirrors /dev/mapper/mpath2 and mpath1’. Each site's storage is a SAN LUN or an NVMe-oF namespace (NS). Links are numbered 1 to 5; heartbeat and lock messaging run between the nodes; failover goes from node-1 to node-2.]
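The six failover steps can be sketched as a command plan, which makes it easier to see why the RTO grows with the number of mirrors. This is a minimal sketch under stated assumptions: the mapping of each step to `umount`, `vgchange -an`, `mdadm --stop`, `mdadm --assemble`, `vgchange -ay` and `mount` is an interpretation of the slide, all device, volume group and mount point names are hypothetical, and the fixed cost per step is an invented placeholder rather than a measurement. The code only builds the command strings; it does not run them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mirror:
    """One host-based mirror stack: MD RAID1 -> VG/LV -> filesystem (names are hypothetical)."""
    md_dev: str       # MD device, e.g. /dev/md0
    legs: tuple       # the two multipath devices backing the mirror
    vg: str           # volume group on top of the MD device
    lv_path: str      # logical volume that carries the filesystem
    mountpoint: str


def failover_plan(mirrors):
    """Return (commands for node-1, commands for node-2) in the order of steps 1 to 6."""
    node1, node2 = [], []
    for m in mirrors:
        node1.append(f"umount {m.mountpoint}")        # 1. unmount the filesystem
        node1.append(f"vgchange -an {m.vg}")          # 2. deactivate LVM
        node1.append(f"mdadm --stop {m.md_dev}")      # 3. remove the RAID1
    for m in mirrors:
        legs = " ".join(m.legs)
        node2.append(f"mdadm --assemble {m.md_dev} {legs}")  # 4. assemble the RAID1
        node2.append(f"vgchange -ay {m.vg}")                 # 5. activate LVM
        node2.append(f"mount {m.lv_path} {m.mountpoint}")    # 6. mount the filesystem
    return node1, node2


def estimated_rto(mirrors, seconds_per_step=2.0):
    """Assume strictly serial execution and a constant (made-up) cost per step."""
    node1, node2 = failover_plan(mirrors)
    return (len(node1) + len(node2)) * seconds_per_step


# 200 hypothetical mirrors, two multipath legs each.
mirrors = [
    Mirror(
        md_dev=f"/dev/md{i}",
        legs=(f"/dev/mapper/mpath{2 * i + 1}", f"/dev/mapper/mpath{2 * i + 2}"),
        vg=f"vg{i}",
        lv_path=f"/dev/vg{i}/lv0",
        mountpoint=f"/srv/data{i}",
    )
    for i in range(200)
]

node1_cmds, node2_cmds = failover_plan(mirrors)
print(len(node1_cmds) + len(node2_cmds), "commands,", estimated_rto(mirrors), "seconds")
```

Every mirror adds six serial steps, so the plan grows linearly with the number of RAID1 devices, which is the RTO concern the slide raises.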
8
Improve the cluster to Active-Active (this talk)
Linux MD RAID becomes cluster-aware (2016, Guoqing Jiang, Neil Brown)
● Assemble MD RAID1 on both datacenters
● Activate the shared LV on both datacenters
● Mount OCFS2 on both datacenters
[Diagram: active node-1 and active node-2, each stacking Applications/FS on shared vg0 / lv on cluster md0. Node-1 mirrors /dev/mapper/mpath1 and mpath2’; node-2 mirrors /dev/mapper/mpath2 and mpath1’. Each site's storage is a SAN LUN or an NVMe-oF namespace (NS), with no native SAN syncing between the two. Links are numbered 1 to 5; heartbeat and lock messaging run between the nodes.]
9
Cluster RAID1 performance is nearly the same as native
FIO test with the sync engine
[Charts: average IOPS for read and for write at 4k and 16k block sizes, comparing RawDisk, NativeRaid, Clustermd and Cmirror]
10
Failures in Stretched Cluster
- Ethernet / Cluster Communication
11
Keep stretching: the Ethernet perspective
FTT = 1
● Heartbeat
➢ Network bonding (L2)
➢ Redundant rings (L3)
● Distributed lock messaging
➢ SCTP (2018, Gang He, Michal Kubecek)
[Diagram: Host1 and Host2, each running two VMs, connected through two routers. Heartbeat uses bonded links and redundant rings (rrp) over UDP; lock messaging uses IP with SCTP.]
12
A mature Linux HA stack to deal with split brain
FTT = 2
● Pacemaker
● Corosync
● STONITH
[Diagram: Host1 and Host2, each running two VMs, connected through two routers. Heartbeat uses bonded links and redundant rings (rrp) over UDP; lock messaging uses IP with SCTP.]
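The slide names the components but not the decision rule. As an illustration only, the sketch below assumes a simple majority vote with one vote per node, which is a common way for a cluster to decide which side of a split may keep running. The function names and host names are hypothetical, and this is not Pacemaker or Corosync code.

```python
def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """Majority rule: a partition needs strictly more than half of all votes."""
    return 2 * partition_votes > total_votes


def resolve_split(partitions):
    """Given the node groups left after a network split, return
    (nodes allowed to keep running, nodes that must stop)."""
    total = sum(len(part) for part in partitions)
    keep, must_stop = [], []
    for part in partitions:
        target = keep if has_quorum(len(part), total) else must_stop
        target.extend(part)
    return sorted(keep), sorted(must_stop)


# A 3/1 split: the larger side keeps running.
print(resolve_split([["host1", "host2", "host3"], ["host4"]]))

# A 2/2 split: neither side holds a majority.
print(resolve_split([["host1", "host2"], ["host3", "host4"]]))
```

In the even split no partition reaches a majority, so a rule like this cannot pick a winner by itself. Fencing (STONITH) is what ensures that the side that must stop has really stopped before the other side touches shared data.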
13
Failures in Stretched Cluster
- SAN storage (e.g. NVMe-oF)
14
Failure 1: SAN storage (NVMe-oF) loses power
● Node-2 RAID1 marks mpath2 as a FAULTY device.
● Node-1 RAID1 marks mpath2’ as a FAULTY device.
● Both sites keep working via node-1’s SAN storage.
[Diagram: active node-1 and active node-2, each stacking Applications/FS on shared vg0 / lv on cluster md0. Node-1 mirrors /dev/mapper/mpath1 and mpath2’; node-2 mirrors /dev/mapper/mpath2 and mpath1’. Each site's storage is a SAN LUN or an NVMe-oF namespace (NS). Links are numbered 1 to 5; heartbeat and lock messaging run between the nodes; the affected points are marked.]
15
Keep stretching
● Failures of the storage links in between (the feud between Lanxiang excavators and fiber-optic cables)
Original image: baike.baidu
16
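Failure: SAN Partitioned
Byzantine failures (Wikipedia): a component may appear inconsistent to failure-detection systems, presenting different symptoms to different observers. From one viewpoint it is working normally; from another it has failed.
[Diagram: active node-1 and active node-2, each stacking Applications/FS on shared vg0 / lv on cluster md0. Node-1 mirrors /dev/mapper/mpath1 and mpath2’; node-2 mirrors /dev/mapper/mpath2 and mpath1’. Each site's storage is a SAN LUN or an NVMe-oF namespace (NS). Links are numbered 1 to 5; heartbeat and lock messaging run between the nodes.]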
17
Failure 2: one storage link failed
a) Assume Link② fails. Node-1 RAID1 marks mpath2’ as FAULTY.
b) Cluster RAID1 propagates the FAULTY device role of mpath2’ through the superblock (*), so mpath2 on node-2 becomes FAULTY too.
c) In other words, Cluster RAID1 propagates the FAULTY disk. The end result is just like a whole SAN failure.
(*) That is, the MD RAID superblock is what propagates the FAULTY device role across the cluster.
[Diagram: active node-1 and active node-2, each stacking Applications/FS on shared vg0 / lv on cluster md0. Node-1 mirrors /dev/mapper/mpath1 and mpath2’; node-2 mirrors /dev/mapper/mpath2 and mpath1’. Each site's storage is a SAN LUN or an NVMe-oF namespace (NS). Links are numbered 1 to 5; callouts a, b and c mark where each step takes effect.]
18
Failure 3: SAN Partitioned: both links failed
a) Assume Link② is the first failure detected by the cluster.
• FAULTY is propagated, and
• the result is just like a whole SAN failure.
b) Sequentially (*), the cluster deals with the Link③ failure.
• MD RAID1 on node-2 loses all devices.
• Cluster MD on node-2 is disabled. dmesg reports: “[ 79.942305] md: md0 stopped”.
• The RAID resource agent (RA) will fail.
c) Services fail over to node-1.
• Only one site keeps running.
(*) The distributed lock comes into play here.
[Diagram: active node-1 and active node-2, each stacking Applications/FS on shared vg0 / lv on cluster md0. Node-1 mirrors /dev/mapper/mpath1 and mpath2’; node-2 mirrors /dev/mapper/mpath2 and mpath1’. Each site's storage is a SAN LUN or an NVMe-oF namespace (NS). Links are numbered 1 to 5; callouts a and b mark where each step takes effect.]
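The failure sequence above can be modelled with a small state machine. This is a toy sketch, not the kernel's md-cluster logic: the class and method names are hypothetical, the cluster-wide `faulty` set stands in for the state shared through the superblock, and the rule that the last healthy leg is never marked FAULTY (the node that loses it stops its array instead) is an assumption inferred from the slide.

```python
class ClusterRaid1:
    """Toy model of a two-leg cluster RAID1 shared by several nodes."""

    def __init__(self, nodes, legs):
        self.legs = set(legs)
        self.faulty = set()                         # cluster-wide FAULTY legs (superblock stand-in)
        self.reachable = {n: set(legs) for n in nodes}
        self.running = {n: True for n in nodes}     # is the array still active on this node?

    def healthy_legs(self):
        return self.legs - self.faulty

    def link_failed(self, node, leg):
        """Handle one link failure; events are processed strictly one at a time."""
        self.reachable[node].discard(leg)
        if leg in self.faulty or not self.running[node]:
            return                                  # already handled cluster-wide
        if len(self.healthy_legs()) > 1:
            self.faulty.add(leg)                    # propagate FAULTY to every node
        else:
            self.running[node] = False              # node lost its last device: array stops

    def active_nodes(self):
        return sorted(n for n, up in self.running.items() if up)


md0 = ClusterRaid1(nodes=["node-1", "node-2"], legs=["mpath1", "mpath2"])

md0.link_failed("node-1", "mpath2")   # step a: Link 2 fails, mpath2 goes FAULTY everywhere
print(sorted(md0.faulty), md0.active_nodes())

md0.link_failed("node-2", "mpath1")   # step b: Link 3 fails, node-2 has no device left
print(sorted(md0.faulty), md0.active_nodes())
```

Processing the two events one after the other mirrors the role the slide gives to the distributed lock: whichever failure is handled first decides which leg is dropped, and therefore which site keeps running.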
19
Failure 4: SAN switch broken
Same as Failure 3: SAN Partitioned
[Diagram: active node-1 and active node-2, each stacking Applications/FS on shared vg0 / lv on cluster md0. Node-1 mirrors /dev/mapper/mpath1 and mpath2’; node-2 mirrors /dev/mapper/mpath2 and mpath1’. Each site's storage is a SAN LUN or an NVMe-oF namespace (NS). Links are numbered 1 to 5; callouts a and b mark the same steps as in Failure 3.]
21
Now you have Act-Act NVMe-oF in a stretched cluster!
22
NVMe-oF in OpenStack
23
[Diagram: two NVMe-oF storage arrays, one serving Host3 and Host4 and one serving Host1 and Host2, each host running VM1 and VM2, with active-active shared LVM]
● Aug 2018, Rocky release
➢ Nova: Adding NVMEoF libvirt driver for supporting NVMEoF initiator CLI
commit a833bcd05f811325f40cb3c8cce7f94c93cd6b6e
Author: Rawan Herzallah <rawanh@mellanox.com>
Date: Tue Jul 11 20:18:07 2017 +0300
➢ Cinder: Adding NVMET target for NVMeOF
commit d2b3e1011e238ce1c29157e0614a0416a30448a8
Merge: f6cad8178 8d7e131c5
Author: Zuul <zuul@review.openstack.org>
Date: Wed May 9 22:01:16 2018 +0000
24
Let’s play with it!
25
Challenges ahead
● Cluster RAID10
● Cluster RAID5
● Preferred site when a stretched SAN is partitioned
You are welcome to join us in open source!
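SUSE prize draw: how it works
How to take part:
① Scan the QR code on the left and follow the official SUSE WeChat account;
② Send “抽奖” (prize draw) to the official SUSE WeChat account;
③ Fill in a few details, then spin the lucky wheel to draw a prize;
④ Show the winning page at the SUSE booth to collect your prize.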

More Related Content

What's hot

LF_OVS_17_Open vSwitch Offload: Conntrack and the Upstream Kernel
LF_OVS_17_Open vSwitch Offload: Conntrack and the Upstream KernelLF_OVS_17_Open vSwitch Offload: Conntrack and the Upstream Kernel
LF_OVS_17_Open vSwitch Offload: Conntrack and the Upstream Kernel
LF_OpenvSwitch
 
DRBD + OpenStack (Openstack Live Prague 2016)
DRBD + OpenStack (Openstack Live Prague 2016)DRBD + OpenStack (Openstack Live Prague 2016)
DRBD + OpenStack (Openstack Live Prague 2016)
Jaroslav Jacjuk
 
LF_OVS_17_Enabling Hardware Offload of OVS Control & Data plane using LiquidIO
LF_OVS_17_Enabling Hardware Offload of OVS Control & Data plane using LiquidIOLF_OVS_17_Enabling Hardware Offload of OVS Control & Data plane using LiquidIO
LF_OVS_17_Enabling Hardware Offload of OVS Control & Data plane using LiquidIO
LF_OpenvSwitch
 
OpenStack networking
OpenStack networkingOpenStack networking
OpenStack networkingSim Janghoon
 
LF_OVS_17_OVS-DPDK Installation and Gotchas
LF_OVS_17_OVS-DPDK Installation and GotchasLF_OVS_17_OVS-DPDK Installation and Gotchas
LF_OVS_17_OVS-DPDK Installation and Gotchas
LF_OpenvSwitch
 
Open vSwitch Introduction
Open vSwitch IntroductionOpen vSwitch Introduction
Open vSwitch Introduction
HungWei Chiu
 
The Basic Introduction of Open vSwitch
The Basic Introduction of Open vSwitchThe Basic Introduction of Open vSwitch
The Basic Introduction of Open vSwitch
Te-Yen Liu
 
kdump: usage and_internals
kdump: usage and_internalskdump: usage and_internals
kdump: usage and_internals
LinuxCon ContainerCon CloudOpen China
 
SecurityPI - Hardening your IoT endpoints in Home.
SecurityPI - Hardening your IoT endpoints in Home. SecurityPI - Hardening your IoT endpoints in Home.
SecurityPI - Hardening your IoT endpoints in Home.
LinuxCon ContainerCon CloudOpen China
 
Openv switchの使い方とか
Openv switchの使い方とかOpenv switchの使い方とか
Openv switchの使い方とか
kotto_hihihi
 
Switchdev - No More SDK
Switchdev - No More SDKSwitchdev - No More SDK
Switchdev - No More SDK
Kernel TLV
 
XPDS14: Efficient Interdomain Transmission of Performance Data - John Else, C...
XPDS14: Efficient Interdomain Transmission of Performance Data - John Else, C...XPDS14: Efficient Interdomain Transmission of Performance Data - John Else, C...
XPDS14: Efficient Interdomain Transmission of Performance Data - John Else, C...
The Linux Foundation
 
Linux Kernel Development
Linux Kernel DevelopmentLinux Kernel Development
Linux Kernel Development
LinuxCon ContainerCon CloudOpen China
 
How Networking works with Data Science
How Networking works with Data Science How Networking works with Data Science
How Networking works with Data Science
HungWei Chiu
 
LF_OVS_17_Ingress Scheduling
LF_OVS_17_Ingress SchedulingLF_OVS_17_Ingress Scheduling
LF_OVS_17_Ingress Scheduling
LF_OpenvSwitch
 
Kubecon shanghai rook deployed nfs clusters over ceph-fs (translator copy)
Kubecon shanghai  rook deployed nfs clusters over ceph-fs (translator copy)Kubecon shanghai  rook deployed nfs clusters over ceph-fs (translator copy)
Kubecon shanghai rook deployed nfs clusters over ceph-fs (translator copy)
Hien Nguyen Van
 
Building a network emulator with Docker and Open vSwitch
Building a network emulator with Docker and Open vSwitchBuilding a network emulator with Docker and Open vSwitch
Building a network emulator with Docker and Open vSwitch
Goran Cetusic
 
Linux networking is Awesome!
Linux networking is Awesome!Linux networking is Awesome!
Linux networking is Awesome!
Cumulus Networks
 
Open vSwitch Offload: Conntrack and the Upstream Kernel
Open vSwitch Offload: Conntrack and the Upstream KernelOpen vSwitch Offload: Conntrack and the Upstream Kernel
Open vSwitch Offload: Conntrack and the Upstream Kernel
Netronome
 
DLM knowledge-sharing
DLM knowledge-sharingDLM knowledge-sharing
DLM knowledge-sharing
Eric Ren
 

What's hot (20)

LF_OVS_17_Open vSwitch Offload: Conntrack and the Upstream Kernel
LF_OVS_17_Open vSwitch Offload: Conntrack and the Upstream KernelLF_OVS_17_Open vSwitch Offload: Conntrack and the Upstream Kernel
LF_OVS_17_Open vSwitch Offload: Conntrack and the Upstream Kernel
 
DRBD + OpenStack (Openstack Live Prague 2016)
DRBD + OpenStack (Openstack Live Prague 2016)DRBD + OpenStack (Openstack Live Prague 2016)
DRBD + OpenStack (Openstack Live Prague 2016)
 
LF_OVS_17_Enabling Hardware Offload of OVS Control & Data plane using LiquidIO
LF_OVS_17_Enabling Hardware Offload of OVS Control & Data plane using LiquidIOLF_OVS_17_Enabling Hardware Offload of OVS Control & Data plane using LiquidIO
LF_OVS_17_Enabling Hardware Offload of OVS Control & Data plane using LiquidIO
 
OpenStack networking
OpenStack networkingOpenStack networking
OpenStack networking
 
LF_OVS_17_OVS-DPDK Installation and Gotchas
LF_OVS_17_OVS-DPDK Installation and GotchasLF_OVS_17_OVS-DPDK Installation and Gotchas
LF_OVS_17_OVS-DPDK Installation and Gotchas
 
Open vSwitch Introduction
Open vSwitch IntroductionOpen vSwitch Introduction
Open vSwitch Introduction
 
The Basic Introduction of Open vSwitch
The Basic Introduction of Open vSwitchThe Basic Introduction of Open vSwitch
The Basic Introduction of Open vSwitch
 
kdump: usage and_internals
kdump: usage and_internalskdump: usage and_internals
kdump: usage and_internals
 
SecurityPI - Hardening your IoT endpoints in Home.
SecurityPI - Hardening your IoT endpoints in Home. SecurityPI - Hardening your IoT endpoints in Home.
SecurityPI - Hardening your IoT endpoints in Home.
 
Openv switchの使い方とか
Openv switchの使い方とかOpenv switchの使い方とか
Openv switchの使い方とか
 
Switchdev - No More SDK
Switchdev - No More SDKSwitchdev - No More SDK
Switchdev - No More SDK
 
XPDS14: Efficient Interdomain Transmission of Performance Data - John Else, C...
XPDS14: Efficient Interdomain Transmission of Performance Data - John Else, C...XPDS14: Efficient Interdomain Transmission of Performance Data - John Else, C...
XPDS14: Efficient Interdomain Transmission of Performance Data - John Else, C...
 
Linux Kernel Development
Linux Kernel DevelopmentLinux Kernel Development
Linux Kernel Development
 
How Networking works with Data Science
How Networking works with Data Science How Networking works with Data Science
How Networking works with Data Science
 
LF_OVS_17_Ingress Scheduling
LF_OVS_17_Ingress SchedulingLF_OVS_17_Ingress Scheduling
LF_OVS_17_Ingress Scheduling
 
Kubecon shanghai rook deployed nfs clusters over ceph-fs (translator copy)
Kubecon shanghai  rook deployed nfs clusters over ceph-fs (translator copy)Kubecon shanghai  rook deployed nfs clusters over ceph-fs (translator copy)
Kubecon shanghai rook deployed nfs clusters over ceph-fs (translator copy)
 
Building a network emulator with Docker and Open vSwitch
Building a network emulator with Docker and Open vSwitchBuilding a network emulator with Docker and Open vSwitch
Building a network emulator with Docker and Open vSwitch
 
Linux networking is Awesome!
Linux networking is Awesome!Linux networking is Awesome!
Linux networking is Awesome!
 
Open vSwitch Offload: Conntrack and the Upstream Kernel
Open vSwitch Offload: Conntrack and the Upstream KernelOpen vSwitch Offload: Conntrack and the Upstream Kernel
Open vSwitch Offload: Conntrack and the Upstream Kernel
 
DLM knowledge-sharing
DLM knowledge-sharingDLM knowledge-sharing
DLM knowledge-sharing
 

Similar to 延伸Linux关键业务到双活高速NVMe-oF存储-OpenInfraDays-China2018

SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-DeviceSUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
SUSE
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Sage Weil
 
Disaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFDisaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoF
ShapeBlue
 
Ceph Day New York 2014: Ceph, a physical perspective
Ceph Day New York 2014: Ceph, a physical perspective Ceph Day New York 2014: Ceph, a physical perspective
Ceph Day New York 2014: Ceph, a physical perspective
Ceph Community
 
High Availability Storage (susecon2016)
High Availability Storage (susecon2016)High Availability Storage (susecon2016)
High Availability Storage (susecon2016)
Roger Zhou 周志强
 
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Виталий Стародубцев
 
Mirantis, Openstack, Ubuntu, and it's Performance on Commodity Hardware
Mirantis, Openstack, Ubuntu, and it's Performance on Commodity HardwareMirantis, Openstack, Ubuntu, and it's Performance on Commodity Hardware
Mirantis, Openstack, Ubuntu, and it's Performance on Commodity Hardware
Ryan Aydelott
 
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
OpenStack Korea Community
 
Ceph Day Netherlands - Ceph @ BIT
Ceph Day Netherlands - Ceph @ BIT Ceph Day Netherlands - Ceph @ BIT
Ceph Day Netherlands - Ceph @ BIT
Ceph Community
 
VMworld 2017 - Top 10 things to know about vSAN
VMworld 2017 - Top 10 things to know about vSANVMworld 2017 - Top 10 things to know about vSAN
VMworld 2017 - Top 10 things to know about vSAN
Duncan Epping
 
Stabilizing Ceph
Stabilizing CephStabilizing Ceph
Stabilizing Ceph
Ceph Community
 
OpenStack Networks the Web-Scale Way - Scott Laffer, Cumulus Networks
OpenStack Networks the Web-Scale Way - Scott Laffer, Cumulus NetworksOpenStack Networks the Web-Scale Way - Scott Laffer, Cumulus Networks
OpenStack Networks the Web-Scale Way - Scott Laffer, Cumulus Networks
OpenStack
 
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up
Pushing Packets - How do the ML2 Mechanism Drivers Stack UpPushing Packets - How do the ML2 Mechanism Drivers Stack Up
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up
James Denton
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
RedWireServices
 
Glusterfs for sysadmins-justin_clift
Glusterfs for sysadmins-justin_cliftGlusterfs for sysadmins-justin_clift
Glusterfs for sysadmins-justin_clift
Gluster.org
 
HKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM serversHKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM servers
Linaro
 
Practice and challenges from building IaaS
Practice and challenges from building IaaSPractice and challenges from building IaaS
Practice and challenges from building IaaS
Shawn Zhu
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Ceph Community
 
Ha nsf notes
Ha nsf notesHa nsf notes
Ha nsf notes
Krunal Shah
 

Similar to 延伸Linux关键业务到双活高速NVMe-oF存储-OpenInfraDays-China2018 (20)

SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-DeviceSUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
SUSE Expert Days Paris 2018 - SUSE HA Cluster Multi-Device
 
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
 
Disaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFDisaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoF
 
Ceph Day New York 2014: Ceph, a physical perspective
Ceph Day New York 2014: Ceph, a physical perspective Ceph Day New York 2014: Ceph, a physical perspective
Ceph Day New York 2014: Ceph, a physical perspective
 
High Availability Storage (susecon2016)
High Availability Storage (susecon2016)High Availability Storage (susecon2016)
High Availability Storage (susecon2016)
 
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
 
Mirantis, Openstack, Ubuntu, and it's Performance on Commodity Hardware
Mirantis, Openstack, Ubuntu, and it's Performance on Commodity HardwareMirantis, Openstack, Ubuntu, and it's Performance on Commodity Hardware
Mirantis, Openstack, Ubuntu, and it's Performance on Commodity Hardware
 
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
 
Ceph Day Netherlands - Ceph @ BIT
Ceph Day Netherlands - Ceph @ BIT Ceph Day Netherlands - Ceph @ BIT
Ceph Day Netherlands - Ceph @ BIT
 
VMworld 2017 - Top 10 things to know about vSAN
VMworld 2017 - Top 10 things to know about vSANVMworld 2017 - Top 10 things to know about vSAN
VMworld 2017 - Top 10 things to know about vSAN
 
Stabilizing Ceph
Stabilizing CephStabilizing Ceph
Stabilizing Ceph
 
OpenStack Networks the Web-Scale Way - Scott Laffer, Cumulus Networks
OpenStack Networks the Web-Scale Way - Scott Laffer, Cumulus NetworksOpenStack Networks the Web-Scale Way - Scott Laffer, Cumulus Networks
OpenStack Networks the Web-Scale Way - Scott Laffer, Cumulus Networks
 
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up
Pushing Packets - How do the ML2 Mechanism Drivers Stack UpPushing Packets - How do the ML2 Mechanism Drivers Stack Up
Pushing Packets - How do the ML2 Mechanism Drivers Stack Up
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
 
Glusterfs for sysadmins-justin_clift
Glusterfs for sysadmins-justin_cliftGlusterfs for sysadmins-justin_clift
Glusterfs for sysadmins-justin_clift
 
HKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM serversHKG15-401: Ceph and Software Defined Storage on ARM servers
HKG15-401: Ceph and Software Defined Storage on ARM servers
 
Nic bonding
Nic bonding Nic bonding
Nic bonding
 
Practice and challenges from building IaaS
Practice and challenges from building IaaSPractice and challenges from building IaaS
Practice and challenges from building IaaS
 
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong ZhuBuild a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
Build a High Available NFS Cluster Based on CephFS - Shangzhong Zhu
 
Ha nsf notes
Ha nsf notesHa nsf notes
Ha nsf notes
 

Recently uploaded

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 

Recently uploaded (20)

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Search and Society: Reimagining Information Access for Radical Futures

Extending Linux Mission-Critical Workloads to Active-Active High-Speed NVMe-oF Storage - OpenInfra Days China 2018

  • 1. Extending Linux mission-critical workloads to active-active NVMe-oF storage - recent progress from SUSE R&D engineers. 周志强 (Roger), Senior R&D Manager, SUSE, zzhou@suse.com. 2018 OpenInfra Days China
  • 3. NVMe-oF in Linux in brief (2017-2018)
      ● The Linux storage stack is catching up with hardware evolution
          ➢ Transport: 100M/1G → 10G/40G/100G networks
          ➢ Media: HDD → SSD flash
          ➢ Software stack: SCSI protocol → NVMe protocol
          ➢ NVMe-oF storage arrays: very high IOPS, very low latency
      ● Linux MD RAID1 new I/O barrier, 70% NVMe speed
          ➢ Contributed by Coly Li, Neil Brown, Hannes Reinecke, Guoqing Jiang, etc.
          ➢ 2017, SLE12SP2 Maintenance Update
      ● NVMe-oF products
          ➢ 2017, SLE12SP3 supports NVMe-oF with NetApp, Emulex, Mellanox
          ➢ 2018-05, Broadcom, NetApp and SUSE Announce Production Availability
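  • 5. Data Center
      [Diagram: Host1 to Host4, each running VM1 and VM2, attached to NVMe-oF storage]
      ● Goal: FTT >= 2 (Failures To Tolerate: the number of failures that can be tolerated or recovered from)
      ● Goal: RTO/RPO close to 0
      ● Goal: data protection / disaster recovery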
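  • 6. Success Stories: Stretched / Host-Based Mirroring - Banking & Automakers
      [Diagram: Site A (Host1, Host2) and Site B (Host3, Host4), each host running VM1 and VM2; active-passive shared storage]
      ● Supports heterogeneous storage
      ● Frees you from vendor-specific solutions and specific storage
      ● Frees you from vendors' proprietary disaster recovery (DR) replication tools
      ● Frees you from storage-vendor-based mirroring and replication tools
      ● Integrates seamlessly with Linux
      ** Hundreds of clusters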
  • 7. Success Stories: the challenges inside - Banking & Automakers
      What happens during "Failover"?
      From node-1:
          1. Unmount the filesystem
          2. Deactivate LVM
          3. Remove RAID1
      To node-2:
          4. Assemble RAID1
          5. Activate LVM
          6. Mount the filesystem
      Imagine hundreds of RAID1 devices: RTO can be very long!
      [Diagram: active node-1 and passive node-2, each with the stack Applications/FS → vg / lv → MD RAID1 → /dev/mapper/mpath1 and /dev/mapper/mpath2 (mpath1' and mpath2' on the other node); the backing storage is a LUN on a SAN or a namespace (NS) on NVMe-oF; links numbered 1 to 5; heartbeating and lock messaging between the nodes]
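The six failover steps above can be sketched as an ordered command plan. This is only an illustration, not the logic of the actual cluster resource agents: the `Stack` fields, device names, and the per-step timing passed to `estimated_rto` are hypothetical, and the commands are built as argument lists without being executed. The point is that the step count grows linearly with the number of RAID1 stacks, which is why RTO gets long.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Stack:
    """One MD RAID1 + LVM + filesystem stack (all names are hypothetical)."""
    md: str                  # e.g. "/dev/md0"
    legs: Tuple[str, ...]    # multipath devices that form the mirror
    vg: str                  # volume group on top of the md device
    lv: str                  # logical volume holding the filesystem
    mountpoint: str


def release_plan(s: Stack) -> List[List[str]]:
    """Steps 1-3, run on the node giving up the stack (node-1)."""
    return [
        ["umount", s.mountpoint],      # 1. unmount the filesystem
        ["vgchange", "-an", s.vg],     # 2. deactivate LVM
        ["mdadm", "--stop", s.md],     # 3. remove RAID1
    ]


def takeover_plan(s: Stack) -> List[List[str]]:
    """Steps 4-6, run on the node taking over (node-2)."""
    return [
        ["mdadm", "--assemble", s.md, *s.legs],          # 4. assemble RAID1
        ["vgchange", "-ay", s.vg],                       # 5. activate LVM
        ["mount", f"/dev/{s.vg}/{s.lv}", s.mountpoint],  # 6. mount the filesystem
    ]


def estimated_rto(stacks: Sequence[Stack], seconds_per_step: float) -> float:
    """Serial worst case: every step of every stack runs one after another."""
    steps = sum(len(release_plan(s)) + len(takeover_plan(s)) for s in stacks)
    return steps * seconds_per_step
```

With 200 such stacks and an assumed 2 seconds per step, `estimated_rto` gives 2400 seconds, which illustrates the scale of the problem even though real timings will differ.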
  • 8. Improve the cluster to Active-Active (this talk)
      Linux MD RAID made cluster-aware (2016, Guoqing Jiang, Neil Brown)
      ● Assemble MD RAID1 on both datacenters
      ● Activate the shared LV on both datacenters
      ● Mount OCFS2 on both datacenters
      [Diagram: active node-1 and active node-2, each with the stack Applications/FS → shared vg0 / lv → cluster md0 → /dev/mapper/mpath1 and /dev/mapper/mpath2 (mpath1' and mpath2' on the other node); the backing storage is a LUN on a SAN or a namespace (NS) on NVMe-oF at each site; no native SAN syncing; links numbered 1 to 5; heartbeating and lock messaging between the nodes]
  • 9. Cluster RAID1 performance is nearly the same as native
      FIO test with the sync engine
      [Charts: average IOPS for read and write at 4k and 16k block sizes, comparing RawDisk, NativeRaid, Clustermd, and Cmirror]
  • 10. Failures in a Stretched Cluster - Ethernet / Cluster Communication
  • 11. Keep stretching - Ethernet perspective
      FTT = 1
      ● Heartbeating
          ➢ Network bonding (L2)
          ➢ Redundant rings (L3)
      ● Distributed lock messaging
          ➢ SCTP (2018, Gang He, Michal Kubecek)
      [Diagram: Host1 and Host2, each running VMs, connected through two routers; one panel labeled bond / rrp / UDP, the other labeled ip / SCTP]
  • 12. Mature Linux HA stack to deal with SPLIT BRAIN
      FTT = 2
      ● Pacemaker
      ● Corosync
      ● STONITH
      [Diagram: Host1 and Host2, each running VMs, connected through two routers; one panel labeled bond / rrp / UDP, the other labeled ip / SCTP]
  • 13. Failures in a Stretched Cluster - SAN Storage (e.g. NVMe-oF)
  • 14. Failure 1: SAN storage (NVMe-oF) loses power
      ● Node-2 RAID1 marks mpath2 as a FAULTY device.
      ● Node-1 RAID1 marks mpath2' as a FAULTY device.
      ● Both sites keep working via node-1's SAN storage.
      [Diagram: the two active nodes with cluster md0 and shared vg0 / lv over both storage arrays, annotated with the failure]
  • 15. Keep stretching
      ● Storage link failures in between (the feud between Lanxiang excavators and fiber-optic cables)
      Original image: baike.baidu
  • 16. Failure: SAN Partitioned
      Byzantine failures (Wikipedia): a component can present itself inconsistently to failure-detection systems, showing different symptoms to different observers. From one viewpoint it is working normally; from another it has failed.
      [Diagram: the two active nodes with cluster md0 and shared vg0 / lv over both storage arrays; links numbered 1 to 5; heartbeating and lock messaging between the nodes]
  • 17. Failure 2: one storage link failed
      a) Assume link ② failed. Node-1 RAID1 marks mpath2' as FAULTY.
      b) Cluster RAID1 propagates the FAULTY device role of mpath2' through the superblock (*), so mpath2 on node-2 becomes FAULTY too.
      c) In other words, cluster RAID1 propagates the FAULTY disk, and the end result is just like a whole-SAN failure.
      (*) The MD RAID superblock is what propagates the FAULTY device role across the cluster.
      [Diagram: the two active nodes with cluster md0 and shared vg0 / lv over both storage arrays, annotated with steps a, b, c]
  • 18. Failure 3: SAN partitioned: both links failed
      a) Assume link ② is the first failure detected by the cluster.
          • FAULTY is propagated, and
          • the result is just like a whole-SAN failure.
      b) Sequentially (*), the cluster then deals with the link ③ failure.
          • MD RAID1 on node-2 loses all devices.
          • Cluster MD on node-2 is disabled. dmesg reports: "[ 79.942305] md: md0 stopped".
          • The RAID resource agent (RA) fails.
      c) Services fail over to node-1.
          • Only one site keeps running.
      (*) The distributed lock comes into play here.
      [Diagram: the two active nodes with cluster md0 and shared vg0 / lv over both storage arrays, annotated with steps a and b]
  • 19. Failure 4: SAN switch broken
      Same as Failure 3: SAN partitioned.
      [Diagram: the two active nodes with cluster md0 and shared vg0 / lv over both storage arrays, annotated with steps a and b]
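The failure scenarios above can be replayed with a small toy model. This is a sketch under assumptions inferred from the slides, not md-cluster code: the leg names `san1` and `san2`, the mapping of link ② to (node1, san2) and link ③ to (node2, san1), and the single rule "a node that still has a usable leg marks the lost leg FAULTY for the whole cluster, otherwise its array stops" are all simplifications made for illustration. Events are handled one at a time, mirroring the serialization the slides attribute to the distributed lock.

```python
class ClusterRaid1:
    """Toy model of a two-site cluster RAID1 (illustrative only)."""

    def __init__(self, nodes, legs):
        self.nodes = set(nodes)
        self.legs = set(legs)
        self.faulty = set()        # legs marked FAULTY cluster-wide
        self.broken_links = set()  # (node, leg) pairs that lost their path
        self.stopped = set()       # nodes whose md array has stopped

    def usable(self, node):
        """Legs this node can still do I/O to."""
        return {
            leg for leg in self.legs
            if leg not in self.faulty and (node, leg) not in self.broken_links
        }

    def fail_link(self, node, leg):
        """One node loses its path to one mirror leg."""
        self.broken_links.add((node, leg))
        if node in self.stopped or leg in self.faulty:
            return  # nothing new to report
        if self.usable(node):
            # The node still has a healthy leg, so it marks the lost one
            # FAULTY; in this model that state is shared by every node.
            self.faulty.add(leg)
        else:
            # No leg left on this node: its array stops, other nodes go on.
            self.stopped.add(node)

    def fail_storage(self, leg):
        """A whole storage array goes away: every node loses its path."""
        for node in sorted(self.nodes):
            self.fail_link(node, leg)

    def active_nodes(self):
        return self.nodes - self.stopped
```

Replaying Failure 3 in this model means calling `fail_link("node1", "san2")` and then `fail_link("node2", "san1")`: after the first call `san2` is FAULTY everywhere, and after the second call node2 has no usable leg and stops, leaving only node1 active.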
  • 20. Now you have active-active NVMe-oF in a stretched cluster!
  • 22. Active-Active Shared LVM
      [Diagram: Host1 to Host4, each running VM1 and VM2, attached to two NVMe-oF storage arrays]
      ● Aug 2018, Rocky release
          ➢ Nova: Adding NVMEoF libvirt driver for supporting NVMEoF initiator CLI
              commit a833bcd05f811325f40cb3c8cce7f94c93cd6b6e
              Author: Rawan Herzallah <rawanh@mellanox.com>
              Date: Tue Jul 11 20:18:07 2017 +0300
          ➢ Cinder: Adding NVMET target for NVMeOF
              commit d2b3e1011e238ce1c29157e0614a0416a30448a8
              Merge: f6cad8178 8d7e131c5
              Author: Zuul <zuul@review.openstack.org>
              Date: Wed May 9 22:01:16 2018 +0000
  • 24. Challenges ahead
      ● Cluster RAID10
      ● Cluster RAID5
      ● Preferred site when the stretched SAN is partitioned
      You are welcome to join us in open source!
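  • 25. SUSE prize draw and rules
      How to take part:
      ① Scan the QR code on the left and follow SUSE's official WeChat account;
      ② Send "抽奖" (prize draw) to SUSE's official WeChat account;
      ③ Fill in a few details, then spin the lucky wheel to draw a prize;
      ④ Show the winning page at the SUSE booth to collect your prize.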