3. • Low performance of Ceph’s storage service
• Ceph’s original architecture was designed for slow storage devices
(millisecond-level latency)
• Increasingly fast devices are appearing in both network and storage
• Network: 10G/25G/40G/100G (low performance → high performance)
• Storage: HDD → SATA SSD → NVMe SSD → NVDIMM (high latency → low latency)
• Challenge: Ceph’s software design and implementation is now the bottleneck
• With these fast devices attached, the software must be refreshed to approach
the limits of the hardware.
Background – Performance Driven
4. SAN (Storage Area Network) vs. Ceph Cluster
• SAN (scale-up): application servers attach over separate, dedicated networks;
data is stored in proprietary storage hardware; optimized to run only a specific
workload; capacity and performance grow by scaling up a single system.
• Ceph Cluster (scale-out): application servers attach over a standard Ethernet
network; data is distributed across multiple nodes or clusters; a flexible design
supports multiple workloads; capacity and performance grow by adding nodes.
[Charts: "Hardware vs. Software Latency" (drive read latency vs. software overhead) and
"Media vs. Network + Software Latency" (drive read latency vs. ~200 µs network latency),
each shown as a percentage of total latency for 7200 RPM and 15000 RPM HDDs, SATA NAND,
enterprise NAND, Optane SSDs, and 3D XPoint DIMMs: the faster the media, the larger the
share of latency taken by software and network overhead.]
Background – New Hardware Needs a New Balance
5. FileStore and its problems in Ceph
• FileStore
• PG = collection = directory
• Object = file
• Advantages:
• Most operations map simply onto the POSIX interface
• Disadvantages:
• Hard to extend with advanced features such as compression and checksums
• Where POSIX falls short:
• Transaction atomicity → double write through a journal (increases latency);
see the sketch after this list
• Enumeration → directory tree built from hash-value prefixes (needs
substantial computing power)
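To make the double-write cost concrete, here is a minimal illustrative sketch of how a POSIX-based store emulates an atomic transaction by journaling: the payload is written and synced twice, once to a journal and once to its final location. This is not FileStore's actual code; the file names and sizes are hypothetical.

```c
/* Minimal sketch of journal-then-apply ("double write") on top of POSIX.
 * Illustrative only -- FileStore's real journaling is more involved. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int atomic_object_write(const char *journal_path, const char *object_path,
                               const void *buf, size_t len, off_t obj_off)
{
    /* 1st write: append the payload to the journal and make it durable. */
    int jfd = open(journal_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (jfd < 0 || write(jfd, buf, len) != (ssize_t)len || fsync(jfd) != 0) {
        if (jfd >= 0) close(jfd);
        return -1;
    }
    close(jfd);

    /* 2nd write: apply the same payload to the object file. If we crash before
     * this completes, replaying the journal restores consistency. */
    int ofd = open(object_path, O_WRONLY | O_CREAT, 0644);
    if (ofd < 0 || pwrite(ofd, buf, len, obj_off) != (ssize_t)len || fsync(ofd) != 0) {
        if (ofd >= 0) close(ofd);
        return -1;
    }
    close(ofd);
    return 0;   /* every byte hit the disk twice, plus two fsyncs */
}

int main(void)
{
    char payload[4096];
    memset(payload, 0xab, sizeof(payload));
    if (atomic_object_write("journal.bin", "object.bin", payload, sizeof(payload), 0) != 0) {
        perror("atomic_object_write");
        return EXIT_FAILURE;
    }
    puts("transaction applied (written twice)");
    return EXIT_SUCCESS;
}
```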
6. Potential Solutions
• Invent a new ObjectStore/FileStore design and implementation along the
following lines:
API change
• Synchronous APIs → asynchronous APIs (POSIX → non-POSIX)
• Benefit: performance gained by keeping many requests in flight instead of
completing one at a time.
I/O stack optimization
• Replace kernel I/O stacks with user-space stacks (e.g., network I/O, storage I/O)
• Benefit: no context switches, no data copies between kernel and user space,
locked architecture → lock-free architecture
SPDK (Storage Performance Development Kit, https://www.spdk.io/) provides a set of
libraries that address these issues: asynchronous, polled mode, zero-copy.
A sketch of this asynchronous, polled style is shown below.
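As a hedged illustration of that asynchronous, polled-mode style, here is a minimal read path written against the SPDK user-space NVMe driver API. It is a sketch, not code from Ceph or from the SPDK examples, and exact option structures and helper names can differ between SPDK releases.

```c
/* Sketch: asynchronous, polled-mode 4 KB read with the SPDK user-space NVMe driver.
 * Error handling trimmed; API details may vary across SPDK versions. */
#include <stdbool.h>
#include <stdio.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *g_ctrlr;
static struct spdk_nvme_ns *g_ns;
static bool g_done;

static bool probe_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                     struct spdk_nvme_ctrlr_opts *opts)
{
    return true;                       /* attach to every NVMe controller found */
}

static void attach_cb(void *ctx, const struct spdk_nvme_transport_id *trid,
                      struct spdk_nvme_ctrlr *ctrlr,
                      const struct spdk_nvme_ctrlr_opts *opts)
{
    g_ctrlr = ctrlr;
    g_ns = spdk_nvme_ctrlr_get_ns(ctrlr, 1);    /* use namespace 1 */
}

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    g_done = true;                     /* completion delivered by our own polling */
}

int main(int argc, char **argv)
{
    struct spdk_env_opts opts;
    spdk_env_opts_init(&opts);
    opts.name = "async_read_sketch";
    if (spdk_env_init(&opts) < 0)
        return 1;

    /* Enumerate PCIe NVMe devices entirely in user space -- no kernel block layer. */
    if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0 || g_ns == NULL)
        return 1;

    struct spdk_nvme_qpair *qpair = spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);
    void *buf = spdk_dma_zmalloc(4096, 4096, NULL);   /* DMA-able, pinned buffer */

    /* Submit the read and return immediately: the API is asynchronous. */
    spdk_nvme_ns_cmd_read(g_ns, qpair, buf,
                          0 /* start LBA */,
                          4096 / spdk_nvme_ns_get_sector_size(g_ns),
                          read_done, NULL, 0);

    /* Poll for completions instead of sleeping on an interrupt. Many requests
     * could be in flight here; only one is submitted for brevity. */
    while (!g_done)
        spdk_nvme_qpair_process_completions(qpair, 0);

    printf("read completed via polling\n");
    spdk_dma_free(buf);
    spdk_nvme_ctrlr_free_io_qpair(qpair);
    spdk_nvme_detach(g_ctrlr);
    return 0;
}
```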
8. Storage Performance Development Kit
Scalable and Efficient
• Millions of IOPS per core
• Linear scaling with more cores
• iSCSI and NVMe over Fabrics targets
IA-Optimized Storage Reference Architecture
• Lockless, polled-mode drivers and protocol libraries
• Designed for 3D XPoint® media latencies
• BSD-licensed drivers via github.com/spdk
User-Space & Polled-Mode, End-to-End
• No kernel/interrupt context-switching overhead
• Drops latencies from microsecond to nanosecond
9. Built on Intel® Data Plane Development Kit (DPDK)
• Software infrastructure that accelerates packet input/output to the Intel CPU
User-space Network Services (UNS)
• TCP/IP stack implemented as a polling, lock-light library, bypassing kernel
bottlenecks and enabling scalability
User-space NVMe, Intel® Xeon®/Intel® Atom™ processor DMA, and Linux* AIO drivers
• Optimize back-end driver performance and prevent kernel bottlenecks from forming
at the back end of the I/O chain
Reference Software with Example Application
• A customer-relevant example application leveraging Intel® Storage Acceleration
Libraries (ISA-L) is included; support is provided on a best-effort basis
*Other names and brands may be claimed as the property of others.
10. DPDK Framework
[Diagram: DPDK component overview]
• Core libraries (user space): EAL, MBUF, MEMPOOL, RING, TIMER
• Kernel interfaces: KNI, IGB_UIO, VFIO, UIO_PCI_GENERIC
• ETHDEV API and PMDs (native & virtual): E1000, IGB, IXGBE, I40E, FM10K, MLX4, MLX5,
BNX2X, CXGBE, ENIC, ENA, NFP, SZEDATA2, MPIPE, VMXNET3, VIRTIO, XENVIRT, PCAP, RING,
NULL, AF_PKT, BONDING
• Extensions: POWER, VHOST, IVSHMEM, KNI, JOBSTAT, REORDER, DISTRIB, IP FRAG
• Classification: HASH, LPM, ACL
• QoS: SCHED, METER
• Packet framework: PIPELINE, PORT, TABLE
• Crypto accelerators: QAT, AESNI MB, AESNI GCM, SNOW 3G, NULL (future: TBD)
• ISA-L; network functions (cloud, enterprise, comms) build on these libraries
A minimal polling-loop sketch using the EAL, a mempool of mbufs, and an ethdev PMD
follows.
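To show how the pieces above fit together, here is a minimal, hypothetical receive loop: EAL initialization, one mbuf mempool, one RX queue on an ethdev PMD, then a busy-poll with rte_eth_rx_burst(). The port number, ring sizes, and pool sizing are arbitrary choices for the sketch; a real application would add RSS, multiple queues, and actual packet processing.

```c
/* Sketch: minimal DPDK receive loop -- EAL init, one mempool, one RX queue,
 * then busy-poll the NIC with rte_eth_rx_burst(). Illustrative sizing only. */
#include <stdint.h>
#include <stdio.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0) {          /* EAL: hugepages, cores, PCI scan */
        fprintf(stderr, "EAL init failed\n");
        return 1;
    }

    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "mbuf_pool", 8191, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL) {
        fprintf(stderr, "mbuf pool creation failed\n");
        return 1;
    }

    uint16_t port = 0;                            /* first DPDK-bound port */
    struct rte_eth_conf conf = {0};
    rte_eth_dev_configure(port, 1 /* rx queues */, 1 /* tx queues */, &conf);
    rte_eth_rx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL, pool);
    rte_eth_tx_queue_setup(port, 0, 1024, rte_eth_dev_socket_id(port), NULL);
    rte_eth_dev_start(port);

    /* Poll the PMD: no interrupts, no kernel involvement; packets arrive in
     * user-space mbufs. A real application would parse/forward them here. */
    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++)
            rte_pktmbuf_free(bufs[i]);            /* drop: demo only */
    }
    return 0;
}
```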
12. SPDK Architecture Overview
• Extends Data Plane Development Kit concepts into an end-to-end storage context
• Optimized, user-space lockless polling in the NIC driver, TCP/IP stack, iSCSI target,
and NVMe driver
• iSCSI and NVMe over Fabrics targets integrated
• Exposes the performance potential of current and next-generation storage media
• As media latencies move from µsec to nsec, storage software architectures must keep up
• Permissive open-source license for user-space media drivers: NVMe & CBDMA drivers are
on github.com
• Media drivers support both Linux* and FreeBSD*
• NVMf application and protocol library: provisioning, fabric interface processing,
memory allocation, fabric connection handling, RDMA data transfer, discovery,
subsystems, logical controller, capsule processing, management interface with the
NVMe driver library
[Diagram: SPDK data path. NIC → DPDK NIC driver → TCP/IP stack (UNS) → iSCSI target →
block device abstraction → NVMe driver → NVMe hardware, built on the DPDK libraries,
with a user-space memory driver and CBDMA driver path to DDR; an NVMf target runs over
RDMA verbs and an RNIC via Linux* OFED. Components are marked as customer SW, enhanced
SW, existing SW, or Linux* kernel; read/write traffic arrives from the cloud.]
*Other names and brands may be claimed as the property of others.
13. SPDK Packaging and Contents
Licensed package includes:
• Media drivers: I/OAT DMA (CBDMA) and NVMe
• Protocols: iSCSI and NVMe over Fabrics (NVMf)
• Optimized libraries: DPDK and UNS TCP/IP stack
• User-space support code (written in C): POSIX compliant
• Demo/usage, unit tests (functional correctness), basic performance
• API manuals (may include links or copies of key papers)
• Release.txt (release notes, version, etc.)
Source agreement:
• BSD-licensed code distributed via https://github.com/spdk
• Licensed version (including UNS and other components in development) is available
under a non-commercial restricted license and full software license agreement
• All code is provided as reference software with a best-effort support model
15. 4KB Random Read Performance: Partition Variants on 4 NVMe SSD Drives
Single-Core Intel® Xeon® Processor
[Chart: IOPS (thousands, 0–2000) for 1, 2, 4, 8, and 16 partitions, comparing the kernel
NVMe driver with the SPDK NVMe driver.]
SPDK NVMe driver delivers up to 6x performance improvement vs. the kernel NVMe driver
with a single-core Intel® Xeon® processor.
Disclaimer: Software and workloads used in performance tests may have been optimized for
performance only on Intel microprocessors. Performance tests, such as SYSmark and
MobileMark, are measured using specific computer systems, components, software, operations
and functions. Any change to any of those factors may cause the results to vary. You should
consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other
products. For more information go to http://www.intel.com/performance.
16. 4KB Random Read Performance: 1 to 4 NVMe SSD Drives
Single-Core Intel® Xeon® Processor
[Chart: IOPS (thousands, 0–2000) for 1, 2, and 4 NVMe drives, comparing the kernel NVMe
driver with the SPDK NVMe driver.]
SPDK NVMe driver scales linearly in performance from 1 to 4 NVMe drives
with a single-core Intel® Xeon® processor.
18. Intel® Xeon® Processor v3 – 4KB iSCSI Random Read: SPDK vs. LIO
[Charts: "IOPS vs. Latency" (bars: IOPS in thousands, 0–600, higher is better; lines:
latency in msec, 0–8, lower is better) and "IOPS vs. Core Utilization" (bars as before;
lines: number of cores utilized, 0–10, lower is better), across queue depths 1–32, for
SPDK 2-core, SPDK 4-core, LIO 2-core, and LIO unlimited-core configurations.]
SPDK can provide similar IOPS and latency characteristics as LIO
while utilizing up to 8 fewer cores.
19. Intel® Xeon® Processor v3 – 4KB iSCSI Random Write: SPDK vs. LIO
[Charts: "IOPS vs. Latency" (bars: IOPS in thousands, 0–500, higher is better; lines:
latency in msec, 0–10, lower is better) and "IOPS vs. Core Utilization" (bars as before;
lines: number of cores utilized, 0–10, lower is better), across queue depths 1–32, for
SPDK 2-core, SPDK 4-core, LIO 2-core, and LIO unlimited-core configurations.]
SPDK can provide similar IOPS and latency characteristics as LIO
while utilizing up to 2 fewer cores.
20. Intel® Xeon® Processor E5-2620 v2 – iSCSI Read/Write: 4 KB Data (NVM Express Backend)
[Charts: "Performance" (IO/s in thousands, 0–600) comparing LIO on 6 cores, SPDK on 2
cores, and SPDK on 1 core, and "Performance/Core" (IO/s in thousands, 0–350) comparing LIO
and SPDK, for 4 KB random workloads at 100% read, 70% read / 30% write, and 100% write.]
Up to 650% increase in max performance per core.+
+ Source: Intel internal measurements as of 22 August 2014; see backup slides 10–13 for
configuration details. The standard performance-test disclaimer and Intel compiler
optimization notice (revision #20110804) apply. For more information go to
http://www.intel.com/performance.
22. SPDK NVMf Performance Approaches Local NVMe
Efficiency and Scalable Performance
• NVMe: >2M IOPS per Xeon-D core
• NVMf: 1.2M IOPS per Xeon-D core
Optimized for Intel® Architecture and NVMe
• Latency and jitter reduced
• Leaves CPU cycles for storage application and value
• 4x the efficiency of the kernel NVMe driver
[Chart: "Single Core Performance Comparison", IOPS in millions (higher is better) for 1x,
2x, and 4x NVMe drives; series: kernel NVMe driver, SPDK NVMe driver, SPDK NVMf target;
data labels of 0.3, 1.8, and 1.2 million IOPS appear. Intel® Xeon® processor D, Intel
P3700 800GB SSDs, FIO 2.2.9, direct=1, iodepth=128 per LUN.]
Configuration details:
• SPDK NVMf target: Intel® Xeon® processor D-1567, 1 socket, 12 cores / 12 threads per
socket, 32 GB memory, Mellanox ConnectX-4 EN dual-port 25 Gbps NIC, MTU 1500, Fedora 23,
Linux kernel 4.4.3-300.fc23.x86_64, NVMf target software: Intel SPDK
• Storage: 4x Intel SSD DC P3700 series, 800 GB
• NVMf initiator: Dell PowerEdge 730xd, Intel® Xeon® processor E5-2699 v3 (45M cache,
2.30 GHz), 2 sockets populated, 18 cores / 18 threads per socket, 132 GB memory, Mellanox
ConnectX-4 EN dual-port 25 Gbps NIC, MTU 1500, RHEL 7.2, Linux kernel 4.5.0-rc3
• Tested by Intel, 3/22/2016
23. NVMf I/O Latency Model, 4KB 100% Random Read
[Diagram: latency breakdown between the NVMf client (FIO, libaio, block I/O, direct, 1
worker per device, queue depth 1 per device) and the SPDK NVMf target, SPDK NVMe library,
and NVMe controller. Measured spans: nvmf_read I/O start → complete is ~93 µs end to end;
spdk_lib_read_start → spdk_lib_read_complete is ~85.6–85.7 µs; segments of ~6.5–6.7 µs
are spent on the fabric and in target software.]
• 93 µs round-trip time measured from the NVMf client
• Of the 93 µs, ~80 µs is spent in the NVMe controller
• 12–13 µs measured time over the fabric
• SPDK NVMf target adds ~3% to the fabric overhead
24. What can SPDK do to improve Ceph?
• Accelerate the backend I/O in the Ceph OSD (object storage daemon)
• Key solution: replace the kernel drivers with the user-space NVMe driver provided by
SPDK to accelerate I/O on NVMe SSDs (an illustrative configuration sketch follows this
list).
• Accelerate client I/O performance on the Ceph cluster
• Key solution: use the accelerated iSCSI application and user-space NVMe driver in SPDK
to build a caching solution in front of Ceph clusters.
• Accelerate network (TCP/IP) performance on Ceph’s internal network
• Key solution: replace the existing kernel-provided network stack on each OSD node with
DPDK plus a user-space TCP/IP stack (e.g., LIBUNS, Seastar, mTCP).
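As a hedged illustration of the first point, BlueStore-based Ceph builds with SPDK support can hand an NVMe SSD to the SPDK user-space driver through ceph.conf rather than using a kernel block device. The fragment below is illustrative only: the exact option syntax and the device-identifier format (serial number vs. PCIe transport address) differ across Ceph releases, and the serial number shown is a made-up placeholder.

```ini
# Illustrative ceph.conf fragment; option syntax varies across Ceph releases.
[osd]
osd objectstore = bluestore
# Point BlueStore at the NVMe SSD via the SPDK user-space driver instead of the
# kernel driver. "55cd2e404bd73932" is a hypothetical device serial number.
bluestore_block_path = spdk:55cd2e404bd73932
```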
26. BlueStore: consume the raw block device
• Key/value database (RocksDB) for metadata
• Data written directly to the block device
• Write-through cache
• Pluggable block allocator (policy)
• Adaptive driver policy
• Kernel and user-space drivers coexist
A minimal sketch of the data/metadata split is shown below.
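To make the data/metadata split concrete, here is a minimal, hypothetical sketch of the idea (not BlueStore code): the payload goes straight to a raw block device with pwrite(), and only the small extent record goes through the key/value store, using RocksDB's C API. The device path, RocksDB path, key format, and offset are invented for the example, and writing to a real device node would of course destroy its contents.

```c
/* Sketch of BlueStore's idea: data -> raw block device, metadata -> RocksDB.
 * Hypothetical paths and keys; real BlueStore adds allocation, caching, WAL, etc. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <rocksdb/c.h>

int main(void)
{
    /* 1. Write 4 KB of object data directly to the raw device (no filesystem).
     *    DANGER: the device path is a placeholder; this would overwrite real data. */
    char data[4096];
    memset(data, 0x5a, sizeof(data));
    off_t extent_off = 1 << 20;                    /* offset chosen by an allocator */
    int fd = open("/dev/nvme0n1", O_WRONLY);
    if (fd < 0 || pwrite(fd, data, sizeof(data), extent_off) != (ssize_t)sizeof(data))
        return 1;
    fsync(fd);
    close(fd);

    /* 2. Record the object -> extent mapping in RocksDB (the metadata store). */
    char *err = NULL;
    rocksdb_options_t *opts = rocksdb_options_create();
    rocksdb_options_set_create_if_missing(opts, 1);
    rocksdb_t *db = rocksdb_open(opts, "/var/lib/demo-kv", &err);
    if (err != NULL)
        return 1;

    char key[64], val[64];
    snprintf(key, sizeof(key), "object/foo/extent0");
    snprintf(val, sizeof(val), "off=%lld len=4096", (long long)extent_off);

    rocksdb_writeoptions_t *wopts = rocksdb_writeoptions_create();
    rocksdb_writeoptions_set_sync(wopts, 1);       /* metadata commit is durable */
    rocksdb_put(db, wopts, key, strlen(key), val, strlen(val), &err);

    rocksdb_writeoptions_destroy(wopts);
    rocksdb_options_destroy(opts);
    rocksdb_close(db);
    return err == NULL ? 0 : 1;
}
```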
27. From Sage's 06/21 talk
Performance Status -- Sequential Write (HDD)
28. From Sage's 06/21 talk
Performance Status -- Random Write (HDD)
29. • Done
• fully functional IO path with checksums and compression
• fsck
• bitmap-based allocator and freelist
• Current efforts
• optimize metadata encoding efficiency
• performance tuning
• ZetaScale key/value db as RocksDB alternative
• bounds on compressed blob occlusion
• Coming soon
• per-pool properties that map to compression, checksum, and IO hints
• more performance optimization
• native SMR support (high-density HDDs)
• leverage SPDK (bypass the kernel for NVMe devices)
From Sage's 06/21 talk
BlueStore Status
32. • Non-local connections
• NIC RX and the application run on different cores
• Global TCP control block management
• Socket API overhead (connection setup)
Other Kernel Bottlenecks
42. • Ceph has performance issues with the emerging fast network and storage devices.
• The storage system needs to be refactored to keep up with the hardware.
• Ceph is expected to move toward a shared-nothing implementation.
• We mainly introduce SPDK and BlueStore to address the current issues in Ceph.
• SPDK: libraries (e.g., the user-space NVMe driver) that can be used for performance
acceleration.
• BlueStore: a new store that implements a lock-free, asynchronous, high-performance
storage service.
• Many details remain to be worked out (coming soon).
Summary
44. Overview: ObjectStore and Data Model
• ObjectStore
• Abstract interface for storing local data
• Decouples data and metadata
• Implementations: EBOFS, FileStore
• EBOFS highlights
• A user-space, extent-based object file system
• Deprecated in favor of FileStore on btrfs in 2009
• Object: a "file"
• Data (file-like byte stream)
• Attributes (small key/value pairs)
• Omap (unbounded key/value pairs)
• Collection: a "directory"
• A placement group shard (slice of the RADOS pool)
• Sharded by a 32-bit hash value
• All writes are transactions
• Atomic + consistent + durable
• Isolation is provided by the OSD
A hypothetical sketch of this data model is shown below.
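The following is a small, hypothetical C sketch of the data model described above; the names are invented for illustration (the real interface is Ceph's ObjectStore::Transaction in C++). An object carries a byte stream plus attribute and omap key/value sets, a collection corresponds to one placement-group shard, and every mutation is expressed as a transaction of operations applied atomically.

```c
/* Hypothetical sketch of the ObjectStore data model; not Ceph's actual types. */
#include <stddef.h>
#include <stdint.h>

struct kv_pair {                 /* small attribute or omap entry */
    const char *key;
    const void *value;
    size_t      value_len;
};

struct object {                  /* "file": byte stream + attrs + omap */
    const char     *name;
    uint8_t        *data;        /* file-like byte stream */
    size_t          data_len;
    struct kv_pair *attrs;       /* small key/value attributes */
    size_t          n_attrs;
    struct kv_pair *omap;        /* unbounded key/value pairs */
    size_t          n_omap;
};

struct collection {              /* "directory": one placement-group shard */
    uint32_t pg_hash_prefix;     /* objects are sharded by a 32-bit hash value */
};

enum op_type { OP_WRITE, OP_SETATTR, OP_OMAP_SET, OP_REMOVE };

struct txn_op {
    enum op_type    type;
    struct object  *obj;
    uint64_t        offset, length;   /* for OP_WRITE */
    struct kv_pair  kv;               /* for OP_SETATTR / OP_OMAP_SET */
};

struct transaction {             /* applied atomically, consistently, durably;  */
    struct txn_op *ops;          /* isolation between transactions is provided  */
    size_t         n_ops;        /* by the OSD, not by the store itself         */
};
```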
Editor's Notes
Encryption is for maintaining data confidentiality and requires the use of a key (kept secret) in order to return to plaintext.
Hashing is for validating the integrity of content by detecting all modification thereof via obvious changes to the hash output.
CRCs are good for detecting random errors in transmission but provide little protection against an intentional attack on your data.
RAID: provides the parity function
EC: provides encode/decode
Summary of DPDK as background (this is NOT a description of SPDK):
If you look at a storage system, you have a wire that goes into the box. That wire has a NIC driver. The NIC driver comes from DPDK.
It is a very tailored driver that runs in user space and does what is called “polling”.
DPDK does not do an interrupt
DPDK is very fast compared to traditional operating systems
DPDK runs in user space. Not kernel space. That is relevant because kernel space is really close to the hardware and that is where the drivers run. You have extra privileges and you can mess up your system because you have all those privileges.
User space is a little more protected. You don’t have as many privileges.
When you transition from user space to kernel space, there is a context switch (just like in an interrupt as described earlier)
For every I/O operation you do, there are many more user space to kernel transitions than there are interrupts.
You are doing the transition from user to kernel hundreds of thousands of times per second also. This is very painful in terms of consuming CPU resources when you look at this as a whole.
A lot of what DPDK does is to get rid of those painful context switches.
DPDK is also software that does not need “locks”. In a general purpose operating system, it has to be set up to handle anything at any time for any application that comes in over the wire. That means that there is a lot of synchronization that has to occur between cores and the threads.
That synchronization is also very expensive.
You get rid of that problem by creating software that is essentially lockless.
This allows you to better scale your software – every core you add can increase your performance linearly. Adding a second core could double your performance, tripling your cores could triple your performance.
SPDK takes these concepts and applies them to storage with iSCSI and NVMe.
If you take all these things together, you end up with a specific instance of a storage system.
Looking at the data flow.
Bits come in from the left, go through the NIC, the DPDK NIC driver is running in user space and is polling. It takes the bits and passes it to a TCP/IP stack. In this case, user-mode network services (UNS).
The TCP/IP stack takes those bits and does a lot of work to make sure those bits were intended for you and properly formed.
That goes into an iSCSI target. That speaks block-based language of SCSI. The user storage app could look at that SCSI request and decide how to service it. It could be a SCSI read request, a SCSI write request, or a SCSI inquiry request trying to find out what the device is capable of, etc.
The storage app is where our customers come in. They have the opportunity to provide their value add differentiation services like in-line de-duplication, erasure code, compression, hashing, etc.
After the user storage app completes its operation, it is going to want to persist the data somewhere. In this case, we use NVMe. NVMe is the Intel prescribed way of doing PCIe-based SSDs. NVMe is an open standard so we can create a driver that can be open sourced.
From there, the data gets stored.
WKB is currently under development. It will run on Linux or FreeBSD. Targeting release of WKB in 2015.
WKB is also a good system-level vehicle for demonstrating ISA-L. This allows us to provide real-world performance numbers for ISA-L. Previously, all we could show is cycles per byte for the algorithm.
Not the easiest charts to read.
For the left chart, the lines are latency, the bars are throughput. The key take away is that even at reasonably high queue depths (16) the SPDK stack delivers sub-millisecond latency at 500K IOPS using just two cores.
On the right chart, the bars are the same, but the lines are looking at the number of cores consumed – this just highlights the differences in efficiency between similar levels of performance. The dark blue bar (indicating LIO + kernel TCP/IP) tracks along with the SPDK performance, but the line shows how many more cores are required to keep up. Even at a queue depth of 1, SPDK is twice as efficient as the kernel stack, and that advantage grows as queue depth (and thus overall throughput) increases.
Latency is measured from the initiator side. This latency measurement denotes “avg. latency” metric computed by FIO which is the sum of submission + completion latencies. All the latency measurements are in milliseconds (msecs). LIO 2 core and SPDK 2 core data were carried out by restricting the system to only 2 cores.
# echo 0 > /sys/devices/system/cpu/cpu{}/online
This is where it would be good to have actual data to update this foil. Our demo runs at the performance of the right-most bar, with 4 NVMe devices.
Configuration details:
SPDK NVMf target
• Hardware: Intel Xeon-D processor D-1567
• No. of sockets: 1
• No. of cores: 12 cores / 12 threads per socket
• Memory: 32G
• NIC: Mellanox ConnectX-4 EN Adapter (25Gbps), dual port
• MTU: 1500
• OS: Fedora 23
• Linux kernel: 4.4.3-300.fc23.x86_64
• NVMf target: Intel SPDK
Storage array
• Storage drives: 4x Intel SSD DC P3700 series, 800GB
NVMf initiator
• Initiator: Dell PowerEdge 730xd
• Processor: Intel Xeon processor E5-2699 v3 (45M cache, 2.30 GHz)
• No. of sockets: 2 sockets populated
• No. of cores: 18 cores / 18 threads per socket
• Memory: 132G
• NIC: Mellanox ConnectX-4 EN Adapter (25Gbps), dual port
• MTU: 1500
• OS: RHEL 7.2
• Linux kernel: 4.5.0-rc3