Resilient and Fast Persistent Container Storage
Leveraging Linux’s Storage Functionalities
Philipp Reisner, CEO LINBIT
LINBIT - the company behind it
COMPANY OVERVIEW
• Developer of DRBD
• 100% founder owned
• Offices in Europe and US
• Team of 30 highly experienced Linux experts
• Partner in Japan
Linux Storage Gems
LVM, RAID, SSD cache tiers, deduplication, targets & initiators
Linux's LVM
[Diagram: a Volume Group aggregating physical volumes, with logical volumes and a snapshot allocated from it]
Linux's LVM
• based on device mapper
• original objects
• PVs, VGs, LVs, snapshots
• LVs can scatter over PVs in multiple segments
• thin LVs
• thin pools are LVs themselves
• thin LVs live in thinpools
• multiple snapshots became efficient!
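As a minimal sketch of that workflow (the VG vg0 and all LV names are assumptions, not from the slides):
  # a thin pool is itself an LV inside the VG
  lvcreate --type thin-pool -L 100G -n pool0 vg0
  # a thin LV may promise more space than the pool currently holds
  lvcreate -V 500G --thinpool vg0/pool0 -n data0
  # thin snapshots are metadata-only, hence cheap to keep many of
  lvcreate -s -n data0_snap vg0/data0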
Linux's LVM
[Diagram: a VG with PVs; a thin pool LV containing thin LVs and a thin snapshot]
Linux's RAID
• original MD code
• mdadm command
• RAID levels: 0, 1, 4, 5, 6, 10
• Now available in LVM as well
• device mapper interface for MD code
• do not call it ‘dmraid’; that is software for hardware fake-raid
• lvcreate --type raid6 --size 100G VG_name
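For a simple mirror, a hedged sketch of both paths (device names and the VG are placeholders; the diagram below shows the same RAID1 idea):
  # classic MD, managed with mdadm
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  # the same through LVM's RAID support
  lvcreate --type raid1 -m 1 -L 100G -n mirror0 VG_name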
[Diagram: RAID1 — blocks A1..A4 mirrored identically onto two devices]
SSD cache for HDD
• dm-cache
• device mapper module
• accessible via LVM tools
• bcache
• generic Linux block device
• slightly ahead in the performance game
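A sketch of the dm-cache path through the LVM tools; the VG, LV names and the fast device are assumptions:
  # build a cache pool on the SSD/NVMe device
  lvcreate --type cache-pool -L 20G -n fastcache vg0 /dev/nvme0n1
  # attach it to the existing slow data LV
  lvconvert --type cache --cachepool vg0/fastcache vg0/data0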
Linux’s DeDupe
• Virtual Data Optimizer (VDO) since RHEL 7.5
• Red Hat acquired Permabit and is releasing VDO under the GPL
• Linux upstreaming is in preparation
• in-line data deduplication
• kernel part is a device mapper module
• indexing service runs in user-space
• async or synchronous writeback
• Recommended to be used below LVM
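A minimal sketch with the vdo manager that ships with RHEL 7.5; the device and names are assumptions:
  # create a deduplicating, compressing volume on top of a block device
  vdo create --name=vdo0 --device=/dev/sdd --vdoLogicalSize=10T
  # per the recommendation above, put LVM on top of the VDO device
  pvcreate /dev/mapper/vdo0
  vgcreate vg_dedup /dev/mapper/vdo0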
Linux’s targets & initiators
• Open-iSCSI initiator
• IET, STGT, SCST
• mostly historical
• LIO
• iSCSI, iSER, SRP, FC, FCoE
• SCSI pass-through, block IO, file IO, user-space IO
• NVMe-oF
• target & initiator
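A hedged sketch of exporting a block device through LIO and attaching it with Open-iSCSI; the IQN, portal address and backing device are placeholders:
  # target side (LIO, driven by targetcli)
  targetcli /backstores/block create name=disk0 dev=/dev/vg0/data0
  targetcli /iscsi create iqn.2018-11.com.example:disk0
  targetcli /iscsi/iqn.2018-11.com.example:disk0/tpg1/luns create /backstores/block/disk0
  # (an ACL for the initiator's IQN is still needed on the target)
  # initiator side (Open-iSCSI)
  iscsiadm -m discovery -t sendtargets -p 192.0.2.10
  iscsiadm -m node -T iqn.2018-11.com.example:disk0 -p 192.0.2.10 --login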
[Diagram: the initiator sends IO requests to the target; the target answers with data/completions]
ZFS on Linux
• Ubuntu eco-system only
• has its own
• logical volume manager (zvols)
• thin provisioning
• RAID (RAIDz)
• caching for SSDs (ZIL, SLOG)
• and a file system!
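For contrast, the same moves expressed in ZFS; pool layout and names are assumptions:
  # a pool with RAIDZ redundancy
  zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd
  # a sparse (thin-provisioned) zvol and a snapshot of it
  zfs create -s -V 100G tank/vol0
  zfs snapshot tank/vol0@before_upgrade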
Put in simplest form
DRBD – think of it as ...
[Diagram: DRBD as a network RAID1 — an initiator-like node mirrors blocks A1..A4 to a target-like node via IO requests and data/completions]
DRBD Roles: Primary & Secondary
[Diagram: writes on the Primary are replicated to the Secondary]
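A small sketch of switching roles by hand with drbdadm (the resource name r0 is an assumption):
  drbdadm status r0        # show local role and peer state
  drbdadm secondary r0     # demote on the current Primary
  drbdadm primary r0       # promote on the other node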
DRBD – multiple Volumes
• consistency group
[Diagram: several volumes replicated together from Primary to Secondary as one consistency group]
DRBD – up to 32 replicas
• each may be synchronous or async
[Diagram: one Primary replicating to two Secondaries]
DRBD – Diskless nodes
• intentional diskless (no change tracking bitmap)
• disks can fail
[Diagram: a diskless Primary with the data held on two Secondaries]
DRBD - more about
• a node knows the version of the data it exposes
• automatic partial resync after connection outage
• checksum-based verify & resync
• split brain detection & resolution policies
• fencing
• quorum
• multiple resources per node possible (1000s)
• dual Primary for live migration of VMs only!
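To make the pieces above concrete, a minimal two-node resource sketch; hostnames, IP addresses and the backing LV are placeholders, not taken from the talk:
  # /etc/drbd.d/r0.res
  resource r0 {
      net { protocol C; }          # synchronous replication (protocol A would be asynchronous)
      device    /dev/drbd0;
      disk      /dev/vg0/data0;
      meta-disk internal;
      on alpha { address 10.0.0.1:7789; }
      on bravo { address 10.0.0.2:7789; }
  }
  # on both nodes:
  drbdadm create-md r0
  drbdadm up r0
  # on the node that should start out as Primary:
  drbdadm primary --force r0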
DRBD Roadmap
• performance optimizations (2018)
• meta-data on PMEM/NVDIMMs
• zero copy receive on diskless (RDMA-transport)
• no context switch send (RDMA & TCP transport)
• Eurostars grant: DRBD4Cloud
• erasure coding (2019)
The combination is more than the sum of its parts
LINSTOR - goals
• storage built from generic (x86) nodes
• for SDS consumers (K8s, OpenStack, OpenNebula)
• building on existing Linux storage components
• multiple tenants possible
• deployment architectures
• distinct storage nodes
• hyperconverged with hypervisors / container hosts
LVM, thin LVM or ZFS for volume management (Stratis later)
Open Source, GPL
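A hedged sketch of the workflow with the linstor client; node names, addresses, pool name and sizes are assumptions:
  # register the satellites with the controller
  linstor node create alpha 10.0.0.1
  linstor node create bravo 10.0.0.2
  # back each node with an LVM volume group as a storage pool
  linstor storage-pool create lvm alpha pool_hdd vg0
  linstor storage-pool create lvm bravo pool_hdd vg0
  # define a resource with one 20G volume and let LINSTOR place two replicas
  linstor resource-definition create res0
  linstor volume-definition create res0 20G
  linstor resource create res0 --auto-place 2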
LINSTOR
[Diagram: LINSTOR orchestrating DRBD replication between dedicated storage nodes, serving volumes to VMs on separate hypervisor hosts]
LINSTOR w. failed Hypervisor
[Diagram: same topology with one hypervisor failed; the storage nodes and DRBD replicas are unaffected]
LINSTOR w. failed storage node
[Diagram: same topology with one storage node failed; the remaining DRBD replica keeps the data available]
LINSTOR - Hyperconverged
[Diagram: hyperconverged layout — every node runs both hypervisor and storage, with VMs alongside their replicas]
LINSTOR - VM migrated
[Diagram: a VM migrated to another hyperconverged node, away from its local replica]
LINSTOR - add local storage
[Diagram: a new local replica added on the VM's new node]
LINSTOR - remove 3rd copy
[Diagram: the now redundant third replica removed again]
LINSTOR Architecture
LINSTOR Roadmap
• Swordfish API
• volume management
• access via NVMe-oF
• inventory sync from Redfish/Swordfish
• support for multiple sites & DRBD-Proxy (Dec 2018)
• north bound drivers
• Kubernetes, OpenStack, OpenNebula, Proxmox, XenServer
Case study - Intel
LINBIT working together with Intel
LINSTOR is a storage orchestration technology that brings storage from generic Linux
servers and SNIA Swordfish enabled targets to containerized workloads as persistent
storage. LINBIT is working with Intel to develop a Data Management Platform that
includes a storage backend based on LINBIT’s software. LINBIT adds support for the
SNIA Swordfish API and NVMe-oF to LINSTOR.
Intel® Rack Scale Design (Intel® RSD)
is an industry-wide architecture for disaggregated,
composable infrastructure that fundamentally changes the
way a data center is built, managed, and expanded over time.
Thank you
https://www.linbit.com
