3. • Low performance of Ceph’s storage service
• Ceph’s original architecture was designed for slow storage devices
(millisecond-level latency)
• Increasingly fast devices are appearing in both network and storage
• Network: 10G/25G/40G/100G (low performance → high performance)
• Storage: HDD → SATA SSD → NVMe SSD → NVDIMM (high latency → low latency)
• Challenge: Ceph’s software design and implementation is now the bottleneck
• With these fast devices attached, the software must be refreshed to approach
the limits of the hardware.
Background – Performance Driven
4. SAN (Storage Area Network) vs. Ceph Cluster
• SAN (scale-up): application servers attach over separate, dedicated networks;
data is stored in proprietary storage hardware; optimized to run only a specific
workload; capacity and performance grow by scaling up a single system.
• Ceph Cluster (scale-out): application servers attach over a standard Ethernet
network; data is distributed across multiple nodes or clusters; a flexible design
supports multiple workloads; capacity and performance grow by adding nodes.
[Charts: "Hardware vs. Software Latency" (drive read latency vs. software overhead) and
"Media vs. Network + Software Latency" (drive read latency vs. ~200 µs network latency),
each shown as a percentage of total latency for 7200 RPM and 15000 RPM HDDs, SATA NAND,
enterprise NAND, Optane SSDs, and 3D XPoint DIMMs: the faster the media, the larger the
share of latency taken by software and network overhead.]
Background – New Hardware Needs a New Balance
5. FileStore and its problems in Ceph
• FileStore
• PG = collection = directory
• Object = file
• Advantages:
• Most operations map simply onto the POSIX interface
• Disadvantages:
• Hard to extend with advanced features such as compression and checksums
• Where POSIX falls short:
• Transaction atomicity → double write through a journal (increases latency);
see the sketch after this list
• Enumeration → directory tree built from hash-value prefixes (needs
substantial computing power)
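To make the double-write cost concrete, here is a minimal illustrative sketch of how a POSIX-based store emulates an atomic transaction by journaling: the payload is written and synced twice, once to a journal and once to its final location. This is not FileStore's actual code; the file names and sizes are hypothetical.

```c
/* Minimal sketch of journal-then-apply ("double write") on top of POSIX.
 * Illustrative only -- FileStore's real journaling is more involved. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int atomic_object_write(const char *journal_path, const char *object_path,
                               const void *buf, size_t len, off_t obj_off)
{
    /* 1st write: append the payload to the journal and make it durable. */
    int jfd = open(journal_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (jfd < 0 || write(jfd, buf, len) != (ssize_t)len || fsync(jfd) != 0) {
        if (jfd >= 0) close(jfd);
        return -1;
    }
    close(jfd);

    /* 2nd write: apply the same payload to the object file. If we crash before
     * this completes, replaying the journal restores consistency. */
    int ofd = open(object_path, O_WRONLY | O_CREAT, 0644);
    if (ofd < 0 || pwrite(ofd, buf, len, obj_off) != (ssize_t)len || fsync(ofd) != 0) {
        if (ofd >= 0) close(ofd);
        return -1;
    }
    close(ofd);
    return 0;   /* every byte hit the disk twice, plus two fsyncs */
}

int main(void)
{
    char payload[4096];
    memset(payload, 0xab, sizeof(payload));
    if (atomic_object_write("journal.bin", "object.bin", payload, sizeof(payload), 0) != 0) {
        perror("atomic_object_write");
        return EXIT_FAILURE;
    }
    puts("transaction applied (written twice)");
    return EXIT_SUCCESS;
}
```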
6. Potential Solutions
• Invent a new ObjectStore/FileStore design and implementation along the
following lines:
API change
• Synchronous APIs → asynchronous APIs (POSIX → non-POSIX)
• Benefit: performance gained by keeping many requests in flight instead of
completing one at a time.
I/O stack optimization
• Replace kernel I/O stacks with user-space stacks (e.g., network I/O, storage I/O)
• Benefit: no context switches, no data copies between kernel and user space,
locked architecture → lock-free architecture
SPDK (Storage Performance Development Kit, https://www.spdk.io/) provides a set of
libraries that address these issues: asynchronous, polled mode, zero-copy.
A sketch of this asynchronous, polled style is shown below.
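As a hedged illustration of that asynchronous, polled-mode style, here is a minimal read path written against the SPDK user-space NVMe driver API. It is a sketch, not code from Ceph or from the SPDK examples, and exact option structures and helper names can differ between SPDK releases.

```c
/* Sketch: asynchronous, polled-mode 4 KB read with the SPDK user-space NVMe driver.
 * Error handling trimmed; API details may vary across SPDK versions. */
#include <stdbool.h>
#include <stdio.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *g_ctrlr;
static struct spdk_nvme_ns *g_ns;
static bool g_done;

static bool probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                     struct spdk_nvme_ctrlr_opts *opts)
{
    return true;                       /* attach to every NVMe controller found */
}

static void attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                      struct spdk_nvme_ctrlr *ctrlr,
                      const struct spdk_nvme_ctrlr_opts *opts)
{
    g_ctrlr = ctrlr;
    g_ns = spdk_nvme_ctrlr_get_ns(ctrlr, 1);    /* use namespace 1 */
}

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    g_done = true;                     /* completion delivered by our own polling */
}

int main(int argc, char **argv)
{
    struct spdk_env_opts opts;
    spdk_env_opts_init(&opts);
    opts.name = "async_read_sketch";
    if (spdk_env_init(&opts) < 0)
        return 1;

    /* Enumerate PCIe NVMe devices entirely in user space -- no kernel block layer. */
    if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0 || g_ns == NULL)
        return 1;

    struct spdk_nvme_qpair *qpair = spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);   /* DMA-able, pinned buffer */

    /* Submit the read and return immediately: the API is asynchronous. */
    spdk_nvme_ns_cmd_read(g_ns, qpair, buf,
                          0 /* start LBA */,
                          4096 / spdk_nvme_ns_get_sector_size(g_ns),
                          read_done, NULL, 0);

    /* Poll for completions instead of sleeping on an interrupt. Many requests
     * could be in flight here; only one is submitted for brevity. */
    while (!g_done)
        spdk_nvme_qpair_process_completions(qpair, 0);

    printf("read completed via polling\n");
    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
    spdk_nvme_detach(g_ctrlr);
    return 0;
}
```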
8. Storage Performance Development Kit
Scalable and Efficient
• Millions of IOPS per core
• Linear scaling with more cores
• iSCSI and NVMe over Fabrics targets
IA-Optimized Storage Reference Architecture
• Lockless, polled-mode drivers and protocol libraries
• Designed for 3D XPoint® media latencies
• BSD-licensed drivers via github.com/spdk
User-Space & Polled-Mode, End-to-End
• No kernel/interrupt context-switching overhead
• Drops latencies from microsecond to nanosecond
9. Built on Intel® Data Plane Development Kit (DPDK)
• Software infrastructure that accelerates packet input/output to the Intel CPU
User-space Network Services (UNS)
• TCP/IP stack implemented as a polling, lock-light library, bypassing kernel
bottlenecks and enabling scalability
User-space NVMe, Intel® Xeon®/Intel® Atom™ processor DMA, and Linux* AIO drivers
• Optimize back-end driver performance and prevent kernel bottlenecks from forming
at the back end of the I/O chain
Reference Software with Example Application
• A customer-relevant example application leveraging Intel® Storage Acceleration
Libraries (ISA-L) is included; support is provided on a best-effort basis
*Other names and brands may be claimed as the property of others.
10. DPDK Framework
[Diagram: DPDK component overview]
• Core libraries (user space): EAL, MBUF, MEMPOOL, RING, TIMER
• Kernel interfaces: KNI, IGB_UIO, VFIO, UIO_PCI_GENERIC
• ETHDEV API and PMDs (native & virtual): E1000, IGB, IXGBE, I40E, FM10K, MLX4, MLX5,
BNX2X, CXGBE, ENIC, ENA, NFP, SZEDATA2, MPIPE, VMXNET3, VIRTIO, XENVIRT, PCAP, RING,
NULL, AF_PKT, BONDING
• Extensions: POWER, VHOST, IVSHMEM, KNI, JOBSTAT, REORDER, DISTRIB, IP FRAG
• Classification: HASH, LPM, ACL
• QoS: SCHED, METER
• Packet framework: PIPELINE, PORT, TABLE
• Crypto accelerators: QAT, AESNI MB, AESNI GCM, SNOW 3G, NULL (future: TBD)
• ISA-L; network functions (cloud, enterprise, comms) build on these libraries
A minimal polling-loop sketch using the EAL, a mempool of mbufs, and an ethdev PMD
follows.
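To show how the pieces above fit together, here is a minimal, hypothetical receive loop: EAL initialization, one mbuf mempool, one RX queue on an ethdev PMD, then a busy-poll with rte_eth_rx_burst(). The port number, ring sizes, and pool sizing are arbitrary choices for the sketch; a real application would add RSS, multiple queues, and actual packet processing.

```c
/* Sketch: minimal DPDK receive loop -- EAL init, one mempool, one RX queue,
 * then busy-poll the NIC with rte_eth_rx_burst(). Illustrative sizing only. */
#include <stdint.h>
#include <stdio.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0) {          /* EAL: hugepages, cores, PCI scan */
        fprintf(stderr, "EAL init failed\n");
        return 1;
    }

    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "mbuf_pool", 8191, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL) {
        fprintf(stderr, "mbuf pool creation failed\n");
        return 1;
    }

    uint16_t port = 0;                            /* first DPDK-bound port */
    struct rte_eth_conf conf = {0};
    rte_eth_dev_configure(port, 1 /* rx queues */, 1 /* tx queues */, &conf);
    rte_eth_rx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL, pool);
    rte_eth_tx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL);
    rte_eth_dev_start(port);

    /* Poll the PMD: no interrupts, no kernel involvement; packets arrive in
     * user-space mbufs. A real application would parse/forward them here. */
    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(bufs[i]);            /* drop: demo only */
    }
    return 0;
}
```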
12. SPDK Architecture Overview
• Extends Data Plane Development Kit concepts into an end-to-end storage context
• Optimized, user-space lockless polling in the NIC driver, TCP/IP stack, iSCSI target,
and NVMe driver
• iSCSI and NVMe over Fabrics targets integrated
• Exposes the performance potential of current and next-generation storage media
• As media latencies move from µsec to nsec, storage software architectures must keep up
• Permissive open-source license for user-space media drivers: NVMe & CBDMA drivers are
on github.com
• Media drivers support both Linux* and FreeBSD*
• NVMf application and protocol library: provisioning, fabric interface processing,
memory allocation, fabric connection handling, RDMA data transfer, discovery,
subsystems, logical controller, capsule processing, management interface with the
NVMe driver library
[Diagram: SPDK data path. NIC → DPDK NIC driver → TCP/IP stack (UNS) → iSCSI target →
block device abstraction → NVMe driver → NVMe hardware, built on the DPDK libraries,
with a user-space memory driver and CBDMA driver path to DDR; an NVMf target runs over
RDMA verbs and an RNIC via Linux* OFED. Components are marked as customer SW, enhanced
SW, existing SW, or Linux* kernel; read/write traffic arrives from the cloud.]
*Other names and brands may be claimed as the property of others.
13. SPDK Packaging and Contents
Licensed package includes:
• Media drivers: I/OAT DMA (CBDMA) and NVMe
• Protocols: iSCSI and NVMe over Fabrics (NVMf)
• Optimized libraries: DPDK and UNS TCP/IP stack
• User-space support code (written in C): POSIX compliant
• Demo/usage, unit tests (functional correctness), basic performance
• API manuals (may include links or copies of key papers)
• Release.txt (release notes, version, etc.)
Source agreement:
• BSD-licensed code distributed via https://github.com/spdk
• Licensed version (including UNS and other components in development) is available
under a non-commercial restricted license and full software license agreement
• All code is provided as reference software with a best-effort support model
15. 4KB Random Read Performance: Partition Variants on 4 NVMe SSD Drives
Single-Core Intel® Xeon® Processor
[Chart: IOPS (thousands, 0–2000) for 1, 2, 4, 8, and 16 partitions, comparing the kernel
NVMe driver with the SPDK NVMe driver.]
SPDK NVMe driver delivers up to 6x performance improvement vs. the kernel NVMe driver
with a single-core Intel® Xeon® processor.
Disclaimer: Software and workloads used in performance tests may have been optimized for
performance only on Intel microprocessors. Performance tests, such as SYSmark and
MobileMark, are measured using specific computer systems, components, software, operations
and functions. Any change to any of those factors may cause the results to vary. You should
consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other
products. For more information go to http://www.intel.com/performance.
16. 4KB Random Read Performance: 1 to 4 NVMe SSD Drives
Single-Core Intel® Xeon® Processor
[Chart: IOPS (thousands, 0–2000) for 1, 2, and 4 NVMe drives, comparing the kernel NVMe
driver with the SPDK NVMe driver.]
SPDK NVMe driver scales linearly in performance from 1 to 4 NVMe drives
with a single-core Intel® Xeon® processor.
18. Intel® Xeon® Processor v3 – 4KB iSCSI Random Read: SPDK vs. LIO
[Charts: "IOPS vs. Latency" (bars: IOPS in thousands, 0–600, higher is better; lines:
latency in msec, 0–8, lower is better) and "IOPS vs. Core Utilization" (bars as before;
lines: number of cores utilized, 0–10, lower is better), across queue depths 1–32, for
SPDK 2-core, SPDK 4-core, LIO 2-core, and LIO unlimited-core configurations.]
SPDK can provide similar IOPS and latency characteristics as LIO
while utilizing up to 8 fewer cores.
19. Intel® Xeon® Processor v3 – 4KB iSCSI Random Write: SPDK vs. LIO
[Charts: "IOPS vs. Latency" (bars: IOPS in thousands, 0–500, higher is better; lines:
latency in msec, 0–10, lower is better) and "IOPS vs. Core Utilization" (bars as before;
lines: number of cores utilized, 0–10, lower is better), across queue depths 1–32, for
SPDK 2-core, SPDK 4-core, LIO 2-core, and LIO unlimited-core configurations.]
SPDK can provide similar IOPS and latency characteristics as LIO
while utilizing up to 2 fewer cores.
20. Intel® Xeon® Processor E5-2620 v2 – iSCSI Read/Write: 4 KB Data (NVM Express Backend)
[Charts: "Performance" (IO/s in thousands, 0–600) comparing LIO on 6 cores, SPDK on 2
cores, and SPDK on 1 core, and "Performance/Core" (IO/s in thousands, 0–350) comparing LIO
and SPDK, for 4 KB random workloads at 100% read, 70% read / 30% write, and 100% write.]
Up to 650% increase in max performance per core.+
+ Source: Intel internal measurements as of 22 August 2014; see backup slides 10–13 for
configuration details. The standard performance-test disclaimer and Intel compiler
optimization notice (revision #20110804) apply. For more information go to
http://www.intel.com/performance.
22. SPDK NVMf Performance Approaches Local NVMe
Efficiency and Scalable Performance
• NVMe: >2M IOPS per Xeon-D core
• NVMf: 1.2M IOPS per Xeon-D core
Optimized for Intel® Architecture and NVMe
• Latency and jitter reduced
• Leaves CPU cycles for storage application and value
• 4x the efficiency of the kernel NVMe driver
[Chart: "Single Core Performance Comparison", IOPS in millions (higher is better) for 1x,
2x, and 4x NVMe drives; series: kernel NVMe driver, SPDK NVMe driver, SPDK NVMf target;
data labels of 0.3, 1.8, and 1.2 million IOPS appear. Intel® Xeon® processor D, Intel
P3700 800GB SSDs, FIO 2.2.9, direct=1, iodepth=128 per LUN.]
Configuration details:
• SPDK NVMf target: Intel® Xeon® processor D-1567, 1 socket, 12 cores / 12 threads per
socket, 32 GB memory, Mellanox ConnectX-4 EN dual-port 25 Gbps NIC, MTU 1500, Fedora 23,
Linux kernel 4.4.3-300.fc23.x86_64, NVMf target software: Intel SPDK
• Storage: 4x Intel SSD DC P3700 series, 800 GB
• NVMf initiator: Dell PowerEdge 730xd, Intel® Xeon® processor E5-2699 v3 (45M cache,
2.30 GHz), 2 sockets populated, 18 cores / 18 threads per socket, 132 GB memory, Mellanox
ConnectX-4 EN dual-port 25 Gbps NIC, MTU 1500, RHEL 7.2, Linux kernel 4.5.0-rc3
• Tested by Intel, 3/22/2016
23. NVMf I/O Latency Model, 4KB 100% Random Read
[Diagram: latency breakdown between the NVMf client (FIO, libaio, block I/O, direct, 1
worker per device, queue depth 1 per device) and the SPDK NVMf target, SPDK NVMe library,
and NVMe controller. Measured spans: nvmf_read I/O start → complete is ~93 µs end to end;
spdk_lib_read_start → spdk_lib_read_complete is ~85.6–85.7 µs; segments of ~6.5–6.7 µs
are spent on the fabric and in target software.]
• 93 µs round-trip time measured from the NVMf client
• Of the 93 µs, ~80 µs is spent in the NVMe controller
• 12–13 µs measured time over the fabric
• SPDK NVMf target adds ~3% to the fabric overhead
24. What can SPDK do to improve Ceph?
• Accelerate the backend I/O in the Ceph OSD (object storage daemon)
• Key solution: replace the kernel drivers with the user-space NVMe driver provided by
SPDK to accelerate I/O on NVMe SSDs (an illustrative configuration sketch follows this
list).
• Accelerate client I/O performance on the Ceph cluster
• Key solution: use the accelerated iSCSI application and user-space NVMe driver in SPDK
to build a caching solution in front of Ceph clusters.
• Accelerate network (TCP/IP) performance on Ceph’s internal network
• Key solution: replace the existing kernel-provided network stack on each OSD node with
DPDK plus a user-space TCP/IP stack (e.g., LIBUNS, Seastar, mTCP).
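As a hedged illustration of the first point, BlueStore-based Ceph builds with SPDK support can hand an NVMe SSD to the SPDK user-space driver through ceph.conf rather than using a kernel block device. The fragment below is illustrative only: the exact option syntax and the device-identifier format (serial number vs. PCIe transport address) differ across Ceph releases, and the serial number shown is a made-up placeholder.

```ini
# Illustrative ceph.conf fragment; option syntax varies across Ceph releases.
[osd]
osd objectstore = bluestore
# Point BlueStore at the NVMe SSD via the SPDK user-space driver instead of the
# kernel driver. "55cd2e404bd73932" is a hypothetical device serial number.
bluestore_block_path = spdk:55cd2e404bd73932
```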
26. BlueStore: consume the raw block device
• Key/value database (RocksDB) for metadata
• Data written directly to the block device
• Write-through cache
• Pluggable block allocator (policy)
• Adaptive driver policy
• Kernel and user-space drivers coexist
A minimal sketch of the data/metadata split is shown below.
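To make the data/metadata split concrete, here is a minimal, hypothetical sketch of the idea (not BlueStore code): the payload goes straight to a raw block device with pwrite(), and only the small extent record goes through the key/value store, using RocksDB's C API. The device path, RocksDB path, key format, and offset are invented for the example, and writing to a real device node would of course destroy its contents.

```c
/* Sketch of BlueStore's idea: data -> raw block device, metadata -> RocksDB.
 * Hypothetical paths and keys; real BlueStore adds allocation, caching, WAL, etc. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <rocksdb/c.h>

int main(void)
{
    /* 1. Write 4 KB of object data directly to the raw device (no filesystem).
     *    DANGER: the device path is a placeholder; this would overwrite real data. */
    char data[4096];
    memset(data, 0x5a, sizeof(data));
    off_t extent_off = 1 << 20;                    /* offset chosen by an allocator */
    int fd = open("/dev/nvme0n1", O_WRONLY);
    if (fd < 0 || pwrite(fd, data, sizeof(data), extent_off) != (ssize_t)sizeof(data))
        return 1;
    fsync(fd);
    close(fd);

    /* 2. Record the object -> extent mapping in RocksDB (the metadata store). */
    char *err = NULL;
    rocksdb_options_t *opts = rocksdb_options_create();
    rocksdb_options_set_create_if_missing(opts, 1);
    rocksdb_t *db = rocksdb_open(opts, "/var/lib/demo-kv", &err);
    if (err != NULL)
        return 1;

    char key[64], val[64];
    snprintf(key, sizeof(key), "object/foo/extent0");
    snprintf(val, sizeof(val), "off=%lld len=4096", (long long)extent_off);

    rocksdb_writeoptions_t *wopts = rocksdb_writeoptions_create();
    rocksdb_writeoptions_set_sync(wopts, 1);       /* metadata commit is durable */
    rocksdb_put(db, wopts, key, strlen(key), val, strlen(val), &err);

    rocksdb_writeoptions_destroy(wopts);
    rocksdb_options_destroy(opts);
    rocksdb_close(db);
    return err == NULL ? 0 : 1;
}
```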
27. From Sage's 06/21 talk
Performance Status -- Sequential Write (HDD)
28. From Sage's 06/21 talk
Performance Status -- Random Write (HDD)
29. • Done
• fully functional IO path with checksums and compression
• fsck
• bitmap-based allocator and freelist
• Current efforts
• optimize metadata encoding efficiency
• performance tuning
• ZetaScale key/value db as RocksDB alternative
• bounds on compressed blob occlusion
• Coming soon
• per-pool properties that map to compression, checksum, and IO hints
• more performance optimization
• native SMR support (high-density HDDs)
• leverage SPDK (bypass the kernel for NVMe devices)
From Sage's 06/21 talk
BlueStore Status
32. • Non-local connections
• NIC RX and the application run on different cores
• Global TCP control block management
• Socket API overhead (connection setup)
Other Kernel Bottlenecks
42. • Ceph has performance issues with the emerging fast network and storage devices.
• The storage system needs to be refactored to keep up with the hardware.
• Ceph is expected to move toward a shared-nothing implementation.
• We mainly introduce SPDK and BlueStore to address the current issues in Ceph.
• SPDK: libraries (e.g., the user-space NVMe driver) that can be used for performance
acceleration.
• BlueStore: a new store that implements a lock-free, asynchronous, high-performance
storage service.
• Many details remain to be worked out (coming soon).
Summary
44. Overview: ObjectStore and Data Model
• ObjectStore
• Abstract interface for storing local data
• Decouples data and metadata
• Implementations: EBOFS, FileStore
• EBOFS highlights
• A user-space, extent-based object file system
• Deprecated in favor of FileStore on btrfs in 2009
• Object: a "file"
• Data (file-like byte stream)
• Attributes (small key/value pairs)
• Omap (unbounded key/value pairs)
• Collection: a "directory"
• A placement group shard (slice of the RADOS pool)
• Sharded by a 32-bit hash value
• All writes are transactions
• Atomic + consistent + durable
• Isolation is provided by the OSD
A hypothetical sketch of this data model is shown below.
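The following is a small, hypothetical C sketch of the data model described above; the names are invented for illustration (the real interface is Ceph's ObjectStore::Transaction in C++). An object carries a byte stream plus attribute and omap key/value sets, a collection corresponds to one placement-group shard, and every mutation is expressed as a transaction of operations applied atomically.

```c
/* Hypothetical sketch of the ObjectStore data model; not Ceph's actual types. */
#include <stddef.h>
#include <stdint.h>

struct kv_pair {                 /* small attribute or omap entry */
    const char *key;
    const void *value;
    size_t      value_len;
};

struct object {                  /* "file": byte stream + attrs + omap */
    const char     *name;
    uint8_t        *data;        /* file-like byte stream */
    size_t          data_len;
    struct kv_pair *attrs;       /* small key/value attributes */
    size_t          n_attrs;
    struct kv_pair *omap;        /* unbounded key/value pairs */
    size_t          n_omap;
};

struct collection {              /* "directory": one placement-group shard */
    uint32_t pg_hash_prefix;     /* objects are sharded by a 32-bit hash value */
};

enum op_type { OP_WRITE, OP_SETATTR, OP_OMAP_SET, OP_REMOVE };

struct txn_op {
    enum op_type    type;
    struct object  *obj;
    uint64_t        offset, length;   /* for OP_WRITE */
    struct kv_pair  kv;               /* for OP_SETATTR / OP_OMAP_SET */
};

struct transaction {             /* applied atomically, consistently, durably;  */
    struct txn_op *ops;          /* isolation between transactions is provided  */
    size_t         n_ops;        /* by the OSD, not by the store itself         */
};
```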
Editor's Notes
Encryption is for maintaining data confidentiality and requires the use of a key (kept secret) in order to return to plaintext.
Hashing is for validating the integrity of content by detecting all modification thereof via obvious changes to the hash output.
CRCs are good for detecting random errors in transmission but provide little protection against an intentional attack on your data.
RAID: provides the parity function
EC: provides encode/decode
Summary of DPDK as background (this is NOT a description of SPDK):
If you look at a storage system, you have a wire that goes into the box. That wire has a NIC driver. The NIC driver comes from DPDK.
It is a very tailored driver that runs in user space and does what is called “polling”.
DPDK does not do an interrupt
DPDK is very fast compared to traditional operating systems
DPDK runs in user space. Not kernel space. That is relevant because kernel space is really close to the hardware and that is where the drivers run. You have extra privileges and you can mess up your system because you have all those privileges.
User space is a little more protected. You don’t have as many privileges.
When you transition from user space to kernel space, there is a context switch (just like in an interrupt as described earlier)
For every I/O operation you do, there are many more user space to kernel transitions than there are interrupts.
You are doing the transition from user to kernel hundreds of thousands of times per second also. This is very painful in terms of consuming CPU resources when you look at this as a whole.
A lot of what DPDK does is to get rid of those painful context switches.
DPDK is also software that does not need “locks”. In a general purpose operating system, it has to be set up to handle anything at any time for any application that comes in over the wire. That means that there is a lot of synchronization that has to occur between cores and the threads.
That synchronization is also very expensive.
You get rid of that problem by creating software that is essentially lockless.
This allows you to better scale your software – every core you add can increase your performance linearly. Adding a second core could double your performance, tripling your cores could triple your performance.
SPDK takes these concepts and applies them to storage with iSCSI and NVMe.
If you take all these things together, you end up with a specific instance of a storage system.
Looking at the data flow.
Bits come in from the left, go through the NIC, the DPDK NIC driver is running in user space and is polling. It takes the bits and passes it to a TCP/IP stack. In this case, user-mode network services (UNS).
The TCP/IP stack takes those bits and does a lot of work to make sure those bits were intended for you and properly formed.
That goes into an iSCSI target. That speaks block-based language of SCSI. The user storage app could look at that SCSI request and decide how to service it. It could be a SCSI read request, a SCSI write request, or a SCSI inquiry request trying to find out what the device is capable of, etc.
The storage app is where our customers come in. They have the opportunity to provide their value add differentiation services like in-line de-duplication, erasure code, compression, hashing, etc.
After the user storage app completes its operation, it is going to want to persist the data somewhere. In this case, we use NVMe. NVMe is the Intel prescribed way of doing PCIe-based SSDs. NVMe is an open standard so we can create a driver that can be open sourced.
From there, the data gets stored.
WKB is currently under development. It will run on Linux or FreeBSD. Targeting release of WKB in 2015.
WKB is also a good system-level vehicle for demonstrating ISA-L. This allows us to provide real-world performance numbers for ISA-L. Previously, all we could show is cycles per byte for the algorithm.
Not the easiest charts to read.
For the left chart, the lines are latency, the bars are throughput. The key take away is that even at reasonably high queue depths (16) the SPDK stack delivers sub-millisecond latency at 500K IOPS using just two cores.
On the right chart, the bars are the same, but the lines are looking at the number of cores consumed – this just highlights the differences in efficiency between similar levels of performance. The dark blue bar (indicating LIO + kernel TCP/IP) tracks along with the SPDK performance, but the line shows how many more cores are required to keep up. Even at a queue depth of 1, SPDK is twice as efficient as the kernel stack, and that advantage grows as queue depth (and thus overall throughput) increases.
Latency is measured from the initiator side. This latency measurement denotes “avg. latency” metric computed by FIO which is the sum of submission + completion latencies. All the latency measurements are in milliseconds (msecs). LIO 2 core and SPDK 2 core data were carried out by restricting the system to only 2 cores.
# echo 0 > /sys/devices/system/cpu/cpu{}/online
This is where it would be good to have actual data to update this foil. Our demo runs at the performance of the right-most bar, with 4 NVMe devices.
Configuration details:
SPDK NVMf target
• Hardware: Intel Xeon-D processor D-1567
• No. of sockets: 1
• No. of cores: 12 cores / 12 threads per socket
• Memory: 32G
• NIC: Mellanox ConnectX-4 EN Adapter (25Gbps), dual port
• MTU: 1500
• OS: Fedora 23
• Linux kernel: 4.4.3-300.fc23.x86_64
• NVMf target: Intel SPDK
Storage array
• Storage drives: 4x Intel SSD DC P3700 series, 800GB
NVMf initiator
• Initiator: Dell PowerEdge 730xd
• Processor: Intel Xeon processor E5-2699 v3 (45M cache, 2.30 GHz)
• No. of sockets: 2 sockets populated
• No. of cores: 18 cores / 18 threads per socket
• Memory: 132G
• NIC: Mellanox ConnectX-4 EN Adapter (25Gbps), dual port
• MTU: 1500
• OS: RHEL 7.2
• Linux kernel: 4.5.0-rc3