Ceph Day San Francisco – March 12, 2015
Deploying Flash Storage For Ceph
Slide 2: Leading Supplier of End-to-End Interconnect Solutions
[Diagram: Mellanox end-to-end portfolio – ICs, adapter cards, switches/gateways, cables/modules, and host/fabric software. Virtual Protocol Interconnect on the server/compute side (56G InfiniBand, 10/40/56GbE) and on the switch/gateway side (56G IB & FCoIB, 10/40/56GbE & FCoE), with storage front/back-end and Metro/WAN connectivity. Caption: Comprehensive End-to-End InfiniBand and Ethernet Portfolio.]
Slide 3: Solid State Storage Technology Evolution – Lower Latency
Advanced Networking and Protocol Offloads Required to Match Storage Media Performance
[Chart: access time in microseconds (log scale, 0.1 to 1000) by storage media technology – hard drives, NAND flash (SSD), and next-generation NVM – alongside the share of networked-storage latency (50% to 100%) contributed by the storage protocol (software) and network hardware/software versus the storage media itself.]
Slide 4: Scale-Out Architecture Requires A Fast Network
 Scale-out grows capacity and performance in parallel
 Requires fast network for replication, sharing, and metadata (file)
• Throughput requires bandwidth
• IOPS requires low latency
• Efficiency requires CPU offload
 Proven in HPC, storage appliances, cloud, and now… Ceph
Interconnect Capabilities Determine Scale Out Performance
Slide 5: Ceph and Networks
 High performance networks enable maximum cluster availability
• Clients, OSDs, monitors, and metadata servers communicate over multiple network layers
• Real-time requirements for heartbeat, replication, recovery and re-balancing
 Cluster (“backend”) network performance dictates cluster’s performance and scalability
• “Network load between Ceph OSD Daemons easily dwarfs the network load between Ceph Clients and the Ceph Storage Cluster” (Ceph Documentation)
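As a concrete illustration of the split described above, the sketch below (not part of the original deck) prints the two ceph.conf keys that separate client ("public") traffic from OSD replication and heartbeat ("cluster") traffic; the subnet addresses are placeholder assumptions and would need to match the actual front-end and back-end fabrics.

```python
# Minimal sketch: emit the ceph.conf [global] keys that split client ("public")
# traffic from OSD replication/recovery/heartbeat ("cluster") traffic.
# The subnets below are placeholders, not values from the deck.
import configparser
import sys

conf = configparser.ConfigParser()
conf["global"] = {
    # Client <-> Ceph traffic (front-end fabric, assumed subnet)
    "public_network": "192.168.10.0/24",
    # OSD <-> OSD traffic (back-end 40GbE fabric, assumed subnet)
    "cluster_network": "192.168.40.0/24",
}
conf.write(sys.stdout)  # in a real deployment this fragment lives in /etc/ceph/ceph.conf
```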
Slide 6: How Customers Deploy Ceph with Mellanox Interconnect
 Building Scalable, High-Performance Storage Solutions
• Cluster network @ 40Gb Ethernet
• Clients @ 10G/40Gb Ethernet
 High performance at Low Cost
• Allows more capacity per OSD
 Flash Deployment Options
• All HDD (no flash)
• Flash for OSD Journals
• 100% Flash in OSDs
8.5PB System Currently Being Deployed
Slide 7: Ceph Deployment Using 10GbE and 40GbE
 Cluster (Private) Network @ 40/56GbE
• Smooth HA, unblocked heartbeats, efficient data balancing
 Throughput Clients @ 40/56GbE
• Guarantees line rate for high ingress/egress clients
 IOPs Clients @ 10GbE or 40/56GbE
• 100K+ IOPs/Client @4K blocks
20x Higher Throughput, 4x Higher IOPs with 40Gb Ethernet Clients!
(http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf)
Throughput testing: fio benchmark, 8MB blocks, 20GB file, 128 parallel jobs, RBD kernel driver, Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2
IOPS testing: fio benchmark, 4KB blocks, 20GB file, 128 parallel jobs, RBD kernel driver, Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2
[Diagram: cluster network at 40GbE connecting the Ceph nodes (monitors, OSDs, MDS) and the admin node; public network at 10GbE/40GbE connecting the client nodes to the Ceph nodes.]
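The footnotes above list the fio parameters behind the throughput and IOPS results. A rough reconstruction of those runs against a kernel-mapped RBD device is sketched below; the device path, runtime, and I/O engine are assumptions, so the original test invocations may have differed.

```python
# Sketch of the fio workloads described in the footnotes: 8MB-block throughput
# and 4KB-block IOPS tests, 20GB per job, 128 parallel jobs, against a block
# device exposed by the RBD kernel driver. /dev/rbd0 and the runtime are assumptions.
import subprocess

RBD_DEVICE = "/dev/rbd0"  # assumed: created beforehand with `rbd map <pool>/<image>`

def run_fio(name, block_size):
    cmd = [
        "fio",
        "--name", name,
        "--filename", RBD_DEVICE,
        "--rw", "randread",        # the charts report random reads
        "--bs", block_size,        # "8m" for throughput, "4k" for IOPS
        "--size", "20g",           # 20GB region per job
        "--numjobs", "128",        # 128 parallel jobs
        "--ioengine", "libaio",
        "--direct", "1",
        "--runtime", "60", "--time_based",
        "--group_reporting",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_fio("ceph-throughput", "8m")  # reports MB/s
    run_fio("ceph-iops", "4k")        # reports IOPS
```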
Slide 8: Ceph Throughput Using 40Gb and 56Gb Ethernet
[Chart: throughput in MB/s (0 to 6000) for Ceph 64KB random read and 256KB random read, comparing 40Gb TCP and 56Gb TCP. Test setup: one OSD, one client, 8 threads.]
Slide 9: Optimizing Ceph for Flash (by SanDisk)
Slide 10: Ceph Flash Optimization
Highlights Compared to Stock Ceph
• Read performance up to 8x better
• Write performance up to 2x better with tuning
Optimizations
• All-flash storage for OSDs
• Enhanced parallelism and lock optimization
• Optimization for reads from flash
• Improvements to Ceph messenger
Test Configuration
• InfiniFlash Storage with IFOS 1.0 EAP3
• Up to 4 RBDs
• 2 Ceph OSD nodes, connected to InfiniFlash
• 40GbE NICs from Mellanox
SanDisk InfiniFlash
Slide 11: 8K Random - 2 RBD/Client with File System
[Charts: IOPS (0 to 300,000) and latency in ms (0 to 120) for 2 LUNs per client across 4 clients, comparing stock Ceph and IFOS 1.0 at queue depths 1, 2, 4, 8, 16, and 32 and read percentages 0, 25, 50, 75, and 100.]
Slide 12: Performance: 64K Random - 2 RBD/Client with File System
[Charts: IOPS (0 to 160,000) and latency in ms (0 to 180) for 2 LUNs per client across 4 clients, comparing stock Ceph and IFOS 1.0 at queue depths 1, 2, 4, 8, 16, and 32 and read percentages 0, 25, 50, 75, and 100.]
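The two preceding slides sweep queue depth (1 to 32) and read percentage (0 to 100) at fixed block sizes. The sketch below shows one way such a matrix can be driven with fio's iodepth and rwmixread options; it is not SanDisk's actual harness, and the RBD device path is an assumption.

```python
# Sketch of the benchmark matrix behind the charts: for each block size,
# sweep read percentage (0/25/50/75/100) and queue depth (1..32).
# Not the original SanDisk harness; /dev/rbd0 is an assumed kernel RBD device.
import subprocess

RBD_DEVICE = "/dev/rbd0"
BLOCK_SIZES = ["8k", "64k"]
READ_PERCENTS = [0, 25, 50, 75, 100]
QUEUE_DEPTHS = [1, 2, 4, 8, 16, 32]

for bs in BLOCK_SIZES:
    for read_pct in READ_PERCENTS:
        for qd in QUEUE_DEPTHS:
            subprocess.run([
                "fio",
                "--name", f"rand-{bs}-r{read_pct}-qd{qd}",
                "--filename", RBD_DEVICE,
                "--rw", "randrw",
                "--rwmixread", str(read_pct),  # percentage of IOs that are reads
                "--bs", bs,
                "--iodepth", str(qd),
                "--ioengine", "libaio",
                "--direct", "1",
                "--runtime", "30", "--time_based",
                "--group_reporting",
            ], check=True)
```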
Slide 13: Adding RDMA to Ceph (XioMessenger)
Slide 14: I/O Offload Frees Up CPU for Application Processing
[Diagram: without RDMA, roughly 53% CPU efficiency with about 47% CPU overhead/idle; with RDMA and offload, roughly 88% CPU efficiency with about 12% CPU overhead/idle (user space vs. system space).]
Slide 15: Adding RDMA to Ceph
 RDMA Beta to be Included in Hammer
• Mellanox, Red Hat, CohortFS, and Community collaboration
• Full RDMA expected in Infernalis
 Refactoring of Ceph messaging layer
• New RDMA messenger layer called XioMessenger
• New class hierarchy allowing multiple transports (simple one is TCP)
• Async design that leverages Accelio
• Reduced locks; Reduced number of threads
 XioMessenger built on top of Accelio (RDMA abstraction layer)
• Integrated into all Ceph user-space components: daemons and clients
• Used for both the “public network” and the “cluster network”
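To make the refactoring above concrete, here is a conceptual sketch of a transport-pluggable messenger hierarchy. The class and method names are invented for illustration; Ceph's real SimpleMessenger and XioMessenger are C++ classes with a different interface.

```python
# Conceptual sketch of a messenger class hierarchy that lets TCP and RDMA
# transports plug into one interface, in the spirit of the refactoring above.
# Names are illustrative only, not Ceph's actual classes.
from abc import ABC, abstractmethod

class Messenger(ABC):
    """Common interface the rest of the daemon codes against."""

    @abstractmethod
    def send(self, destination: str, payload: bytes) -> None: ...

class TcpMessenger(Messenger):
    def send(self, destination, payload):
        # Kernel TCP path: data is copied through the socket layer.
        print(f"TCP send of {len(payload)} bytes to {destination}")

class RdmaMessenger(Messenger):
    def send(self, destination, payload):
        # RDMA path: in the real XioMessenger, Accelio posts the buffer to the
        # NIC so the transfer bypasses the kernel (zero-copy, fewer locks/threads).
        print(f"RDMA send of {len(payload)} bytes to {destination}")

def make_messenger(transport):
    # Selecting the transport at startup, analogous to configuring which
    # messenger implementation a daemon or client uses.
    return {"tcp": TcpMessenger, "rdma": RdmaMessenger}[transport]()

if __name__ == "__main__":
    make_messenger("rdma").send("osd.0", b"\x00" * 4096)
```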
Slide 16: Accelio, High-Performance Reliable Messaging and RPC Library
 Open source!
• https://github.com/accelio/accelio/ && www.accelio.org
 Faster RDMA integration into applications
 Asynchronous
 Maximize message and CPU parallelism
• Enable >10GB/s from single node
• Enable <10usec latency under load
 In Giant and Hammer
• http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger
Slide 17: Ceph Large Block Throughput with 40Gb and 56Gb Ethernet
[Chart: throughput in MB/s (0 to 6000) for 64KB and 256KB random reads, comparing 40Gb TCP, 56Gb TCP, and 56Gb RDMA. Core usage annotations: TCP used 8 cores in the OSD and 8 in the client; RDMA used 6 cores in the OSD and 5 in the client.]
Slide 18: Ceph 4KB Read K-IOPS: 40Gb TCP vs. 40Gb RDMA
[Chart: thousands of 4KB read IOPS (0 to 450) for 2, 4, and 8 OSDs with 4 clients, comparing 40Gb TCP and 40Gb RDMA, annotated with the CPU cores used at the OSD and client for each bar (detailed figures in the table on slide 28).]
Slide 19: Ceph-Powered Solutions – Deployment Examples
Slide 20: Ceph for Large-Scale Storage – Fujitsu Eternus CD10000
 Hyperscale Storage
• 4 to 224 nodes
• Up to 56 PB raw capacity
 Runs Ceph with Enhancements
• 3 different storage node types
• Object, block, and file storage
 Mellanox InfiniBand Cluster Network
• 40Gb InfiniBand cluster network
• 10Gb Ethernet front end network
Slide 21: Media & Entertainment Storage – StorageFoundry Nautilus
 Turnkey Object Storage
• Built on Ceph
• Pre-configured for rapid deployment
• Mellanox 10/40GbE networking
 High-Capacity Configuration
• 6-8TB Helium-filled drives
• Up to 2PB in 18U
 High-Performance Configuration
• Single client read 2.2 GB/s
• SSD caching + Hard Drives
• Supports Ethernet, IB, FC, FCoE front-end ports
 More information: www.storagefoundry.net
Slide 22: SanDisk InfiniFlash
 Flash Storage System
• Announced March 3, 2015
• InfiniFlash OS uses Ceph
• 512 TB (raw) in one 3U enclosure
• Tested with 40GbE networking
 High Throughput
• Up to 7GB/s
• Up to 1M IOPS with two nodes
 More information:
• http://bigdataflash.sandisk.com/infiniflash
Slide 23: More Ceph Solutions
 Cloud – OnyxCCS ElectraStack
• Turnkey IaaS
• Multi-tenant computing system
• 5x faster Node/Data restoration
• https://www.onyxccs.com/products/8-series
 Flextronics CloudLabs
• OpenStack on CloudX design
• 2 SSDs + 20 HDDs per node
• Mix of 1GbE/40GbE networking
• http://www.flextronics.com/
 ISS Storage Supercore
• Healthcare solution
• 82,000 IOPS on 512B reads
• 74,000 IOPS on 4KB reads
• 1.1GB/s on 256KB reads
• http://www.iss-integration.com/supercore.html
 Scalable Informatics Unison
• High availability cluster
• 60 HDD in 4U
• Tier 1 performance at archive cost
• https://scalableinformatics.com/unison.html
Slide 24: Summary
 Ceph scalability and performance benefit from high performance networks
 Ceph being optimized for flash storage
 End-to-end 40/56 Gb/s transport accelerates Ceph today
• 100Gb/s testing has begun!
• Available in various Ceph solutions and appliances
 RDMA is next to optimize flash performance—beta in Hammer
Slide 25: Thank You
Slide 26: Ceph Performance Summary – 40Gb/s vs. 56Gb/s
| Network speed (frame size) | RDMA verbs-level ib_send_bw (MB/s) | 64K random read (MB/s) | 256K random read (MB/s) |
|---|---|---|---|
| 40 Gb/s (MTU = 1500) | 4350 | 4300 | 4350 |
| 56 Gb/s (MTU = 1500) | 5710 | 5050 (RDMA), 4300 (TCP) | 5450 |
| 56 Gb/s (MTU = 2500) | 6051 | 5355 (RDMA), 4350 (TCP) | 5450 |
| 56 Gb/s (MTU = 4500) | 6070 | 5500 (RDMA), 4460 (TCP) | 5500 |
Slide 27: Ceph Performance Summary with 1 OSD
| Single OSD | 1 client, 1 stream, Simple | 1 client, 1 stream, XIO | 1 client, 8 streams, Simple | 1 client, 8 streams, XIO | 2 clients, 16 streams, Simple | 2 clients, 16 streams, XIO |
|---|---|---|---|---|---|---|
| KIOPS read | 20 (4 cores at OSD) | 21 (5 cores at OSD) | 105 (20 cores at OSD) | 125 (15 cores at OSD) | 125 (25 cores at OSD) | 155 (19 cores at OSD) |
| KIOPS write | 8.5 (7 cores at OSD) | 9.1 (6 cores at OSD) | NA | NA | NA | NA |
| BW read (MB/s) | 1140 (3 cores at OSD) | 3140 (4 cores at OSD) | 4300 (8 cores at OSD) | 4300 (6 cores at OSD) | 4300 (11 cores at OSD) | 4300 (8 cores at OSD) |
| BW write (MB/s) | 450 (3 cores at OSD) | 520 (3 cores at OSD) | NA | NA | NA | NA |

BW tested with 256K IOs; IOPS tested with 4K IOs. Single OSD.
Slide 28: Ceph Performance Summary with 2-8 OSDs
| 2-8 OSDs (cores used at OSD / at client) | 2 OSDs, 4 clients, Simple | 2 OSDs, 4 clients, XIO | 4 OSDs, 4 clients, Simple | 4 OSDs, 4 clients, XIO | 8 OSDs, 4 clients, Simple | 8 OSDs, 4 clients, XIO |
|---|---|---|---|---|---|---|
| KIOPS read | 230 (38 / 30) | 332 (34 / 4) | 265 (38 / 24) | 423 (38 / 24) | 247 (36 / 32) | 365 (34 / 27) |
| KIOPS write | 25 (10 / 2.5) | 26 (10 / 2) | 41 (20 / 4) | 45 (18 / 3) | 65 (32 / 6) | 66 (32 / 4) |
| BW read (MB/s) | 4300 (7 / 7) | 4300 (6 / 5) | 4300 (9 / 8) | 4300 (6 / 5) | 4300 (8 / 8) | 4300 (6 / 5) |
| BW write (MB/s) | 1475 (6 / 1.5) | 1464 (6 / 1.5) | 2490 (11 / 2) | 2495 (11 / 2) | 3730 (22 / 3) | 3700 (20 / 3) |

Each cell shows the result followed by the CPU cores used at the OSD and at the client. BW tested with 256K IOs; IOPS tested with 4K IOs.
Editor's Notes
  1. 2
  2. Default MTU = 1500, which is not optimal for RoCE because of header overhead (RoCE's MTU is 1K). We will rerun the test with jumbo frames (2K and 4K). Setup: single-node cluster, single OSD, one fio_rbd client with numjobs=8 running a random read test. mlx4_core was loaded with log_num_mgm_entry_size=-7 for the ConnectX-3 HCA (instead of the default -10). A sketch of these host-side settings appears after these notes.
  3. Notes: two-node benchmark, no storage replication/clustering, etc. Workloads are random read/write; BW is measured with 256K IOs and IOPS with 4K IOs. Data and header CRCs are off for both XioMessenger and SimpleMessenger. Configurations: 1 fio_rbd client, 1 stream (1 OSD, 1 pool/image, numjobs=1, iodepth=64/32 for random write/read respectively); 1 fio_rbd client, 8 streams (1 OSD, 1 pool/image, numjobs=8, iodepth=64/32); 2 fio_rbd clients, 16 streams (1 OSD, 2 pools/images, each client with numjobs=8, iodepth=64/32 for random write/read). NA: we are having problems running fio_rbd random writes with numjobs > 2 and will update with multiple clients at numjobs=1. The random write tests will be run differently: instead of a single fio_rbd client with numjobs=8, 8 separate fio_rbd clients will target 8 different pools/images on the same OSD with numjobs=1 each, which should give slightly better results than 1 client with numjobs=8.
  4. Notes: two-node benchmark, no storage replication/clustering, etc. Workloads are random read/write; BW is measured with 256K IOs and IOPS with 4K IOs. Data and header CRCs are off for both XioMessenger and SimpleMessenger. Configuration: 2/4/8 OSD instances, 4 clients on 4 separate pools/images (numjobs=8 for random read, numjobs=1 for random write).
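As a companion to note 2, the sketch below applies the two host-side settings it mentions: a jumbo MTU on the RoCE-facing port and the mlx4_core log_num_mgm_entry_size=-7 module parameter. The interface name is an assumption, reloading the driver drops connectivity on a live system, and persistent deployments would normally set the module option in /etc/modprobe.d/ instead.

```python
# Sketch of the host-side tuning from note 2: set a jumbo MTU on the 40/56GbE
# port used for RoCE, and reload mlx4_core with log_num_mgm_entry_size=-7 for
# the ConnectX-3 HCA. "eth0" is an assumed interface name; MTU values of 2500
# and 4500 are the ones the summary table tested.
import subprocess

INTERFACE = "eth0"   # assumed 40/56GbE port name
JUMBO_MTU = "4500"   # the deck also tested 2500; the default is 1500

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Jumbo frames on the Ethernet port carrying RoCE traffic.
    run(["ip", "link", "set", "dev", INTERFACE, "mtu", JUMBO_MTU])
    # Remove the dependent mlx4 drivers, then reload the core driver with the
    # non-default multicast group table size. (Persistent setups would put
    # "options mlx4_core log_num_mgm_entry_size=-7" in /etc/modprobe.d/ instead.)
    run(["modprobe", "-r", "mlx4_ib", "mlx4_en", "mlx4_core"])
    run(["modprobe", "mlx4_core", "log_num_mgm_entry_size=-7"])
```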