Ceph Day San Francisco – March 12, 2015
Deploying Flash Storage For Ceph
Slide 2: Leading Supplier of End-to-End Interconnect Solutions
[Diagram: Mellanox end-to-end portfolio – ICs, adapter cards, switches/gateways, cables/modules, and host/fabric software. Virtual Protocol Interconnect on the server/compute side (56G InfiniBand, 10/40/56GbE) and on the switch/gateway side (56G IB & FCoIB, 10/40/56GbE & FCoE), with storage front/back-end and Metro/WAN connectivity. Caption: Comprehensive End-to-End InfiniBand and Ethernet Portfolio.]
Slide 3: Solid State Storage Technology Evolution – Lower Latency
Advanced Networking and Protocol Offloads Required to Match Storage Media Performance
[Chart: access time in microseconds (log scale, 0.1 to 1000) by storage media technology – hard drives, NAND flash (SSD), and next-generation NVM – alongside the share of networked-storage latency (50% to 100%) contributed by the storage protocol (software) and network hardware/software versus the storage media itself.]
Slide 4: Scale-Out Architecture Requires A Fast Network
 Scale-out grows capacity and performance in parallel
 Requires fast network for replication, sharing, and metadata (file)
• Throughput requires bandwidth
• IOPS requires low latency
• Efficiency requires CPU offload
 Proven in HPC, storage appliances, cloud, and now… Ceph
Interconnect Capabilities Determine Scale Out Performance
Slide 5: Ceph and Networks
 High performance networks enable maximum cluster availability
• Clients, OSDs, monitors, and metadata servers communicate over multiple network layers
• Real-time requirements for heartbeat, replication, recovery and re-balancing
 Cluster (“backend”) network performance dictates cluster’s performance and scalability
• “Network load between Ceph OSD Daemons easily dwarfs the network load between Ceph Clients and the Ceph Storage Cluster” (Ceph Documentation)
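As a concrete illustration of the split described above, the sketch below (not part of the original deck) prints the two ceph.conf keys that separate client ("public") traffic from OSD replication and heartbeat ("cluster") traffic; the subnet addresses are placeholder assumptions and would need to match the actual front-end and back-end fabrics.

```python
# Minimal sketch: emit the ceph.conf [global] keys that split client ("public")
# traffic from OSD replication/recovery/heartbeat ("cluster") traffic.
# The subnets below are placeholders, not values from the deck.
import configparser
import sys

conf = configparser.ConfigParser()
conf["global"] = {
    # Client <-> Ceph traffic (front-end fabric, assumed subnet)
    "public_network": "192.168.10.0/24",
    # OSD <-> OSD traffic (back-end 40GbE fabric, assumed subnet)
    "cluster_network": "192.168.40.0/24",
}
conf.write(sys.stdout)  # in a real deployment this fragment lives in /etc/ceph/ceph.conf
```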
Slide 6: How Customers Deploy Ceph with Mellanox Interconnect
 Building Scalable, High-Performance Storage Solutions
• Cluster network @ 40Gb Ethernet
• Clients @ 10G/40Gb Ethernet
 High performance at Low Cost
• Allows more capacity per OSD
 Flash Deployment Options
• All HDD (no flash)
• Flash for OSD Journals
• 100% Flash in OSDs
8.5PB System Currently Being Deployed
Slide 7: Ceph Deployment Using 10GbE and 40GbE
 Cluster (Private) Network @ 40/56GbE
• Smooth HA, unblocked heartbeats, efficient data balancing
 Throughput Clients @ 40/56GbE
• Guarantees line rate for high ingress/egress clients
 IOPs Clients @ 10GbE or 40/56GbE
• 100K+ IOPs/Client @4K blocks
20x Higher Throughput, 4x Higher IOPs with 40Gb Ethernet Clients!
(http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf)
Throughput testing: fio benchmark, 8MB blocks, 20GB file, 128 parallel jobs, RBD kernel driver, Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2
IOPS testing: fio benchmark, 4KB blocks, 20GB file, 128 parallel jobs, RBD kernel driver, Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2
[Diagram: cluster network at 40GbE connecting the Ceph nodes (monitors, OSDs, MDS) and the admin node; public network at 10GbE/40GbE connecting the client nodes to the Ceph nodes.]
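The footnotes above list the fio parameters behind the throughput and IOPS results. A rough reconstruction of those runs against a kernel-mapped RBD device is sketched below; the device path, runtime, and I/O engine are assumptions, so the original test invocations may have differed.

```python
# Sketch of the fio workloads described in the footnotes: 8MB-block throughput
# and 4KB-block IOPS tests, 20GB per job, 128 parallel jobs, against a block
# device exposed by the RBD kernel driver. /dev/rbd0 and the runtime are assumptions.
import subprocess

RBD_DEVICE = "/dev/rbd0"  # assumed: created beforehand with `rbd map <pool>/<image>`

def run_fio(name, block_size):
    cmd = [
        "fio",
        "--name", name,
        "--filename", RBD_DEVICE,
        "--rw", "randread",        # the charts report random reads
        "--bs", block_size,        # "8m" for throughput, "4k" for IOPS
        "--size", "20g",           # 20GB region per job
        "--numjobs", "128",        # 128 parallel jobs
        "--ioengine", "libaio",
        "--direct", "1",
        "--runtime", "60", "--time_based",
        "--group_reporting",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_fio("ceph-throughput", "8m")  # reports MB/s
    run_fio("ceph-iops", "4k")        # reports IOPS
```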
Slide 8: Ceph Throughput Using 40Gb and 56Gb Ethernet
[Chart: throughput in MB/s (0 to 6000) for Ceph 64KB random read and 256KB random read, comparing 40Gb TCP and 56Gb TCP. Test setup: one OSD, one client, 8 threads.]
Slide 9: Optimizing Ceph for Flash (by SanDisk)
Slide 10: Ceph Flash Optimization
Highlights Compared to Stock Ceph
• Read performance up to 8x better
• Write performance up to 2x better with tuning
Optimizations
• All-flash storage for OSDs
• Enhanced parallelism and lock optimization
• Optimization for reads from flash
• Improvements to Ceph messenger
Test Configuration
• InfiniFlash Storage with IFOS 1.0 EAP3
• Up to 4 RBDs
• 2 Ceph OSD nodes, connected to InfiniFlash
• 40GbE NICs from Mellanox
SanDisk InfiniFlash
Slide 11: 8K Random - 2 RBD/Client with File System
[Charts: IOPS (0 to 300,000) and latency in ms (0 to 120) for 2 LUNs per client across 4 clients, comparing stock Ceph and IFOS 1.0 at queue depths 1, 2, 4, 8, 16, and 32 and read percentages 0, 25, 50, 75, and 100.]
Slide 12: Performance: 64K Random - 2 RBD/Client with File System
[Charts: IOPS (0 to 160,000) and latency in ms (0 to 180) for 2 LUNs per client across 4 clients, comparing stock Ceph and IFOS 1.0 at queue depths 1, 2, 4, 8, 16, and 32 and read percentages 0, 25, 50, 75, and 100.]
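The two preceding slides sweep queue depth (1 to 32) and read percentage (0 to 100) at fixed block sizes. The sketch below shows one way such a matrix can be driven with fio's iodepth and rwmixread options; it is not SanDisk's actual harness, and the RBD device path is an assumption.

```python
# Sketch of the benchmark matrix behind the charts: for each block size,
# sweep read percentage (0/25/50/75/100) and queue depth (1..32).
# Not the original SanDisk harness; /dev/rbd0 is an assumed kernel RBD device.
import subprocess

RBD_DEVICE = "/dev/rbd0"
BLOCK_SIZES = ["8k", "64k"]
READ_PERCENTS = [0, 25, 50, 75, 100]
QUEUE_DEPTHS = [1, 2, 4, 8, 16, 32]

for bs in BLOCK_SIZES:
    for read_pct in READ_PERCENTS:
        for qd in QUEUE_DEPTHS:
            subprocess.run([
                "fio",
                "--name", f"rand-{bs}-r{read_pct}-qd{qd}",
                "--filename", RBD_DEVICE,
                "--rw", "randrw",
                "--rwmixread", str(read_pct),  # percentage of IOs that are reads
                "--bs", bs,
                "--iodepth", str(qd),
                "--ioengine", "libaio",
                "--direct", "1",
                "--runtime", "30", "--time_based",
                "--group_reporting",
            ], check=True)
```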
Slide 13: Adding RDMA to Ceph (XioMessenger)
Slide 14: I/O Offload Frees Up CPU for Application Processing
[Diagram: without RDMA, roughly 53% CPU efficiency with about 47% CPU overhead/idle; with RDMA and offload, roughly 88% CPU efficiency with about 12% CPU overhead/idle (user space vs. system space).]
Slide 15: Adding RDMA to Ceph
 RDMA Beta to be Included in Hammer
• Mellanox, Red Hat, CohortFS, and Community collaboration
• Full RDMA expected in Infernalis
 Refactoring of Ceph messaging layer
• New RDMA messenger layer called XioMessenger
• New class hierarchy allowing multiple transports (simple one is TCP)
• Async design that leverages Accelio
• Reduced locks; Reduced number of threads
 XioMessenger built on top of Accelio (RDMA abstraction layer)
• Integrated into all Ceph user-space components: daemons and clients
• Used for both the “public network” and the “cluster network”
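To make the refactoring above concrete, here is a conceptual sketch of a transport-pluggable messenger hierarchy. The class and method names are invented for illustration; Ceph's real SimpleMessenger and XioMessenger are C++ classes with a different interface.

```python
# Conceptual sketch of a messenger class hierarchy that lets TCP and RDMA
# transports plug into one interface, in the spirit of the refactoring above.
# Names are illustrative only, not Ceph's actual classes.
from abc import ABC, abstractmethod

class Messenger(ABC):
    """Common interface the rest of the daemon codes against."""

    @abstractmethod
    def send(self, destination: str, payload: bytes) -> None: ...

class TcpMessenger(Messenger):
    def send(self, destination, payload):
        # Kernel TCP path: data is copied through the socket layer.
        print(f"TCP send of {len(payload)} bytes to {destination}")

class RdmaMessenger(Messenger):
    def send(self, destination, payload):
        # RDMA path: in the real XioMessenger, Accelio posts the buffer to the
        # NIC so the transfer bypasses the kernel (zero-copy, fewer locks/threads).
        print(f"RDMA send of {len(payload)} bytes to {destination}")

def make_messenger(transport):
    # Selecting the transport at startup, analogous to configuring which
    # messenger implementation a daemon or client uses.
    return {"tcp": TcpMessenger, "rdma": RdmaMessenger}[transport]()

if __name__ == "__main__":
    make_messenger("rdma").send("osd.0", b"\x00" * 4096)
```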
Slide 16: Accelio, High-Performance Reliable Messaging and RPC Library
 Open source!
• https://github.com/accelio/accelio/ && www.accelio.org
 Faster RDMA integration into applications
 Asynchronous
 Maximize message and CPU parallelism
• Enable >10GB/s from single node
• Enable <10usec latency under load
 In Giant and Hammer
• http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger
Slide 17: Ceph Large Block Throughput with 40Gb and 56Gb Ethernet
[Chart: throughput in MB/s (0 to 6000) for 64KB and 256KB random reads, comparing 40Gb TCP, 56Gb TCP, and 56Gb RDMA. Core usage annotations: TCP used 8 cores in the OSD and 8 in the client; RDMA used 6 cores in the OSD and 5 in the client.]
Slide 18: Ceph 4KB Read K-IOPS: 40Gb TCP vs. 40Gb RDMA
[Chart: thousands of 4KB read IOPS (0 to 450) for 2, 4, and 8 OSDs with 4 clients, comparing 40Gb TCP and 40Gb RDMA, annotated with the CPU cores used at the OSD and client for each bar (detailed figures in the table on slide 28).]
Slide 19: Ceph-Powered Solutions – Deployment Examples
Slide 20: Ceph for Large-Scale Storage – Fujitsu Eternus CD10000
 Hyperscale Storage
• 4 to 224 nodes
• Up to 56 PB raw capacity
 Runs Ceph with Enhancements
• 3 different storage node types
• Object, block, and file storage
 Mellanox InfiniBand Cluster Network
• 40Gb InfiniBand cluster network
• 10Gb Ethernet front end network
Slide 21: Media & Entertainment Storage – StorageFoundry Nautilus
 Turnkey Object Storage
• Built on Ceph
• Pre-configured for rapid deployment
• Mellanox 10/40GbE networking
 High-Capacity Configuration
• 6-8TB Helium-filled drives
• Up to 2PB in 18U
 High-Performance Configuration
• Single client read 2.2 GB/s
• SSD caching + Hard Drives
• Supports Ethernet, IB, FC, FCoE front-end ports
 More information: www.storagefoundry.net
Slide 22: SanDisk InfiniFlash
 Flash Storage System
• Announced March 3, 2015
• InfiniFlash OS uses Ceph
• 512 TB (raw) in one 3U enclosure
• Tested with 40GbE networking
 High Throughput
• Up to 7GB/s
• Up to 1M IOPS with two nodes
 More information:
• http://bigdataflash.sandisk.com/infiniflash
Slide 23: More Ceph Solutions
 Cloud – OnyxCCS ElectraStack
• Turnkey IaaS
• Multi-tenant computing system
• 5x faster Node/Data restoration
• https://www.onyxccs.com/products/8-series
 Flextronics CloudLabs
• OpenStack on CloudX design
• 2 SSDs + 20 HDDs per node
• Mix of 1GbE/40GbE networking
• http://www.flextronics.com/
 ISS Storage Supercore
• Healthcare solution
• 82,000 IOPS on 512B reads
• 74,000 IOPS on 4KB reads
• 1.1GB/s on 256KB reads
• http://www.iss-integration.com/supercore.html
 Scalable Informatics Unison
• High availability cluster
• 60 HDD in 4U
• Tier 1 performance at archive cost
• https://scalableinformatics.com/unison.html
Slide 24: Summary
 Ceph scalability and performance benefit from high performance networks
 Ceph being optimized for flash storage
 End-to-end 40/56 Gb/s transport accelerates Ceph today
• 100Gb/s testing has begun!
• Available in various Ceph solutions and appliances
 RDMA is next to optimize flash performance—beta in Hammer
Slide 25: Thank You
Slide 26: Ceph Performance Summary – 40Gb/s vs. 56Gb/s
| Network speed (frame size) | RDMA verbs-level ib_send_bw (MB/s) | 64K random read (MB/s) | 256K random read (MB/s) |
|---|---|---|---|
| 40 Gb/s (MTU = 1500) | 4350 | 4300 | 4350 |
| 56 Gb/s (MTU = 1500) | 5710 | 5050 (RDMA), 4300 (TCP) | 5450 |
| 56 Gb/s (MTU = 2500) | 6051 | 5355 (RDMA), 4350 (TCP) | 5450 |
| 56 Gb/s (MTU = 4500) | 6070 | 5500 (RDMA), 4460 (TCP) | 5500 |
Slide 27: Ceph Performance Summary with 1 OSD
| Single OSD | 1 client, 1 stream, Simple | 1 client, 1 stream, XIO | 1 client, 8 streams, Simple | 1 client, 8 streams, XIO | 2 clients, 16 streams, Simple | 2 clients, 16 streams, XIO |
|---|---|---|---|---|---|---|
| KIOPS read | 20 (4 cores at OSD) | 21 (5 cores at OSD) | 105 (20 cores at OSD) | 125 (15 cores at OSD) | 125 (25 cores at OSD) | 155 (19 cores at OSD) |
| KIOPS write | 8.5 (7 cores at OSD) | 9.1 (6 cores at OSD) | NA | NA | NA | NA |
| BW read (MB/s) | 1140 (3 cores at OSD) | 3140 (4 cores at OSD) | 4300 (8 cores at OSD) | 4300 (6 cores at OSD) | 4300 (11 cores at OSD) | 4300 (8 cores at OSD) |
| BW write (MB/s) | 450 (3 cores at OSD) | 520 (3 cores at OSD) | NA | NA | NA | NA |

BW tested with 256K IOs; IOPS tested with 4K IOs. Single OSD.
Slide 28: Ceph Performance Summary with 2-8 OSDs
| 2-8 OSDs (cores used at OSD / at client) | 2 OSDs, 4 clients, Simple | 2 OSDs, 4 clients, XIO | 4 OSDs, 4 clients, Simple | 4 OSDs, 4 clients, XIO | 8 OSDs, 4 clients, Simple | 8 OSDs, 4 clients, XIO |
|---|---|---|---|---|---|---|
| KIOPS read | 230 (38 / 30) | 332 (34 / 4) | 265 (38 / 24) | 423 (38 / 24) | 247 (36 / 32) | 365 (34 / 27) |
| KIOPS write | 25 (10 / 2.5) | 26 (10 / 2) | 41 (20 / 4) | 45 (18 / 3) | 65 (32 / 6) | 66 (32 / 4) |
| BW read (MB/s) | 4300 (7 / 7) | 4300 (6 / 5) | 4300 (9 / 8) | 4300 (6 / 5) | 4300 (8 / 8) | 4300 (6 / 5) |
| BW write (MB/s) | 1475 (6 / 1.5) | 1464 (6 / 1.5) | 2490 (11 / 2) | 2495 (11 / 2) | 3730 (22 / 3) | 3700 (20 / 3) |

Each cell shows the result followed by the CPU cores used at the OSD and at the client. BW tested with 256K IOs; IOPS tested with 4K IOs.
Editor's Notes
  1. 2
  2. Default MTU = 1500, which is not optimal for RoCE because of header overhead (RoCE's MTU is 1K). We will rerun the test with jumbo frames (2K and 4K). Setup: single-node cluster, single OSD, one fio_rbd client with numjobs=8 running a random read test. mlx4_core was loaded with log_num_mgm_entry_size=-7 for the ConnectX-3 HCA (instead of the default -10). A sketch of these host-side settings appears after these notes.
  3. Notes: two-node benchmark, no storage replication/clustering, etc. Workloads are random read/write; BW is measured with 256K IOs and IOPS with 4K IOs. Data and header CRCs are off for both XioMessenger and SimpleMessenger. Configurations: 1 fio_rbd client, 1 stream (1 OSD, 1 pool/image, numjobs=1, iodepth=64/32 for random write/read respectively); 1 fio_rbd client, 8 streams (1 OSD, 1 pool/image, numjobs=8, iodepth=64/32); 2 fio_rbd clients, 16 streams (1 OSD, 2 pools/images, each client with numjobs=8, iodepth=64/32 for random write/read). NA: we are having problems running fio_rbd random writes with numjobs > 2 and will update with multiple clients at numjobs=1. The random write tests will be run differently: instead of a single fio_rbd client with numjobs=8, 8 separate fio_rbd clients will target 8 different pools/images on the same OSD with numjobs=1 each, which should give slightly better results than 1 client with numjobs=8.
  4. Notes: two-node benchmark, no storage replication/clustering, etc. Workloads are random read/write; BW is measured with 256K IOs and IOPS with 4K IOs. Data and header CRCs are off for both XioMessenger and SimpleMessenger. Configuration: 2/4/8 OSD instances, 4 clients on 4 separate pools/images (numjobs=8 for random read, numjobs=1 for random write).
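As a companion to note 2, the sketch below applies the two host-side settings it mentions: a jumbo MTU on the RoCE-facing port and the mlx4_core log_num_mgm_entry_size=-7 module parameter. The interface name is an assumption, reloading the driver drops connectivity on a live system, and persistent deployments would normally set the module option in /etc/modprobe.d/ instead.

```python
# Sketch of the host-side tuning from note 2: set a jumbo MTU on the 40/56GbE
# port used for RoCE, and reload mlx4_core with log_num_mgm_entry_size=-7 for
# the ConnectX-3 HCA. "eth0" is an assumed interface name; MTU values of 2500
# and 4500 are the ones the summary table tested.
import subprocess

INTERFACE = "eth0"   # assumed 40/56GbE port name
JUMBO_MTU = "4500"   # the deck also tested 2500; the default is 1500

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Jumbo frames on the Ethernet port carrying RoCE traffic.
    run(["ip", "link", "set", "dev", INTERFACE, "mtu", JUMBO_MTU])
    # Remove the dependent mlx4 drivers, then reload the core driver with the
    # non-default multicast group table size. (Persistent setups would put
    # "options mlx4_core log_num_mgm_entry_size=-7" in /etc/modprobe.d/ instead.)
    run(["modprobe", "-r", "mlx4_ib", "mlx4_en", "mlx4_core"])
    run(["modprobe", "mlx4_core", "log_num_mgm_entry_size=-7"])
```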