Accelerating Cassandra Workloads on Ceph with All-Flash PCIe SSDs
Reddy Chagam – Principal Engineer, Storage Architect
Stephen L Blinick – Senior Cloud Storage Performance Engineer
Acknowledgments: Warren Wang, Anton Thaker (WalMart)
Orlando Moreno, Vishal Verma (Intel)
Intel technologies’ features and benefits depend on system configuration and may require
enabled hardware, software or service activation. Performance varies depending on system
configuration. No computer system can be absolutely secure. Check with your system
manufacturer or retailer or learn more at http://intel.com.
Software and workloads used in performance tests may have been optimized for performance
only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and functions.
Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
§ Configurations: Ceph v0.94.3 Hammer release, CentOS 7.1, 3.10-229 kernel, linked with JEMalloc 3.6, CBT used for testing and data acquisition. OSD system config: Intel Xeon E5-2699 v3 2x @ 2.30 GHz, 72 cores w/ HT, 96GB, cache 46080KB, 128GB DDR4. Each system with 4x P3700 800GB NVMe, partitioned into 4 OSDs each, 16 OSDs total per node. FIO client systems: Intel Xeon E5-2699 v3 2x @ 2.30 GHz, 72 cores w/ HT, 96GB, cache 46080KB, 128GB DDR4. Single 10GbE network for client and replication data transfer. FIO 2.2.8 with the librbd engine. Tests run by the Intel DCG Storage Group in an Intel lab. Ceph configuration and CBT YAML file provided in backup slides.
§ For more information go to http://www.intel.com/performance.
Intel, Intel Inside and the Intel logo are trademarks of Intel Corporation in the United States and
other countries. *Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation.
Agenda
• The transition to flash and the impact of NVMe
• NVMe technology with Ceph
• Cassandra & Ceph – a case for storage convergence
• The all-NVMe high-density Ceph Cluster
• Raw performance measurements and observations
• Examining performance of a Cassandra DB like workload
Evolution of Non-Volatile Memory Storage Devices
[Chart: evolution from HDDs (~ms 4K read latency, sub-100 MB/s) to SATA/SAS SSDs (100s of µs, ~100s of MB/s, 10s of K IOPS, <10 drive writes/day) to PCIe NVMe SSDs (10s of µs, GB/s, 100s of K IOPS, >10 drive writes/day), with PCI Express® (PCIe) / NVM Express™ (NVMe) NVM SSDs leading toward 3D XPoint™ NVM SSDs and 3D XPoint DIMMs.]
NVM plays a key role in delivering performance for latency-sensitive workloads.
Ceph Workloads
[Chart: Ceph block and object workloads plotted by storage performance (IOPS, throughput) versus storage capacity (PB): Boot Volumes, Remote Disks, VDI, App Storage, Databases, Test & Dev, BigData, HPC, Enterprise Dropbox, CDN, Mobile Content Depot, Cloud DVR, Backup/Archive. An "NVM Focus" region marks the performance-oriented end of the spectrum.]
Ceph - NVM Usages
[Diagram: NVM usage points across a Ceph deployment over 10GbE. On the client side, a virtual machine (application in a guest VM, Qemu/Virtio, RBD driver, RADOS in the hypervisor) or a bare-metal host (application over user-space RBD/RADOS) uses NVM for client caching with write-through. On the RADOS node, the OSD uses NVM for journaling, as a read cache, and for OSD data under Filestore on a local file system. An example RBD client-cache setting is sketched below.]
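The client-side caching path shown above can be approximated with librbd's built-in cache. A minimal sketch of the relevant ceph.conf client settings follows; setting the dirty limit to 0 forces write-through behavior (the section placement and values are illustrative, not the configuration used in these tests):

[client]
rbd cache = true            # enable librbd client-side caching
rbd cache size = 268435456  # 256 MB cache per client (example value)
rbd cache max dirty = 0     # 0 dirty bytes => cache operates in write-through mode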
Cassandra – What and Why?
[Diagram: a Cassandra ring of token partitions (p1, p2, p3, ...) distributed across nodes, with a client connecting to the ring.]
• Cassandra is a column-oriented NoSQL DB with a CQL interface (a minimal CQL sketch follows below)
 Each row has a unique key, which is used for partitioning
 No relations
 A row can have multiple columns – rows need not have the same number of columns
• Open source, distributed, decentralized, highly available, linearly scalable, multi-DC, ...
• Used for analytics, real-time insights, fraud detection, IoT/sensor data, messaging, etc.
Use cases: http://www.planetcassandra.org/apachecassandra-use-cases/
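As a concrete illustration of the key-partitioned, flexible-column model described above, here is a minimal sketch of a CQL table created through cqlsh (the keyspace, table, and column names are made up for this example):

cqlsh -e "
  CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
  CREATE TABLE IF NOT EXISTS demo.sensor_readings (
    sensor_id uuid,          -- partition key: determines placement on the ring
    reading_time timestamp,  -- clustering column: orders rows within a partition
    value double,
    PRIMARY KEY (sensor_id, reading_time)
  );"

Rows sharing a sensor_id land in the same partition, while different sensors hash to different positions on the ring.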
• Ceph is a popular open source unified storage platform
• Many large-scale Ceph deployments are in production
• End customers prefer converged infrastructure that supports multiple workloads (e.g. analytics) to achieve CapEx and OpEx savings
• Several customers are asking for the Cassandra workload on Ceph
Ceph and Cassandra Integration
[Diagram: three virtual machines, each running Cassandra as a guest application over Qemu/Virtio with RBD/RADOS in the hypervisor, connected across an IP fabric to a Ceph storage cluster of SSD-backed OSD nodes and monitors (MON). An RBD volume-creation sketch follows the deployment considerations below.]
Deployment Considerations
• Bootable Ceph volumes
(OS & Cassandra data)
• Cassandra RBD data
volumes
• Data protection
(Cassandra or Ceph)
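Provisioning the Cassandra data volumes called out above comes down to creating a replicated RADOS pool and one RBD image per Cassandra node, then attaching each image to its VM (via Qemu/libvirt or the kernel RBD client). A minimal sketch, with illustrative pool, image, and size values rather than the ones used in this study:

# Replicated pool for Cassandra data volumes (PG count sized for the cluster)
ceph osd pool create cassandra-data 4096 4096
ceph osd pool set cassandra-data size 2          # 2x replication, as in these tests

# One data volume per Cassandra node (size in MB for Hammer-era rbd)
rbd create --pool cassandra-data --size 204800 cass-node1-data

# Quick check from any client with access to the pool
rbd info --pool cassandra-data cass-node1-data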
Hardware Environment Overview
[Diagram: a CBT / Zabbix monitoring node and six FIO RBD client nodes connected over the Ceph network (192.168.142.0/24, 10 Gbps) to the Ceph storage cluster.]
• OSD system config: Intel Xeon E5-2699 v3 2x @ 2.30 GHz, 72 cores w/ HT, 96GB, cache 46080KB, 128GB DDR4
• Each system with 4x P3700 800GB NVMe, partitioned into 4 OSDs each, 16 OSDs total per node
• FIO client systems: Intel Xeon E5-2699 v3 2x @ 2.30 GHz, 72 cores w/ HT, 96GB, cache 46080KB, 128GB DDR4
• Ceph v0.94.3 Hammer release, CentOS 7.1, 3.10-229 kernel, linked with JEMalloc 3.6
• CBT used for testing and data acquisition
• Single 10GbE network for client and replication data transfer, replication factor 2
[Diagram: six FIO RBD client systems hosted on FatTwin chassis (4x dual-socket Xeon E5 v3 each) and five SuperMicro 1028U OSD nodes. Each OSD node has dual Intel Xeon E5 v3 18-core CPUs and four Intel P3700 NVMe PCIe flash drives partitioned into 16 Ceph OSDs (CephOSD1 ... CephOSD16), with easily serviceable front-mounted NVMe drives. A standalone fio example follows below.]
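CBT drove FIO 2.2.8 with the librbd engine from the client nodes. For reference, a single CBT data point can be roughly reproduced by hand with a standalone fio invocation along these lines (the pool, image, and job names here are illustrative, not the exact CBT-generated ones):

fio --name=4k-randread \
    --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testvol0 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
    --direct=1 --norandommap --time_based --runtime=300 --ramp_time=600 \
    --log_avg_msec=250

The iodepth, runtime, ramp time, and averaging interval mirror the CBT YAML in the backup slides; sweeping --iodepth from 4 to 128 reproduces the queue-depth scaling curves.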
Multi-partitioning flash devices
• High performance NVMe devices are capable of high parallelism at low latency
• DC P3700 800GB raw performance: 460K read IOPS and 90K write IOPS at QD=128
• By using multiple OSD partitions, Ceph performance scales linearly (a partitioning sketch follows below)
• Reduces lock contention within a single OSD process
• Lower latency at all queue depths, with the biggest impact on random reads
• Introduces the concept of multiple OSDs on the same physical device
• Conceptually similar CRUSH map data placement rules as managing disks in an enclosure
• High resiliency of "Data Center" class NVMe devices
• At least 10 drive writes per day
• Power loss protection, full data path protection, device-level telemetry
[Diagram: one NVMe device (NVMe1) hosting four OSDs (CephOSD1 ... CephOSD4).]
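As a rough sketch of how one NVMe device might be split into four OSD partitions on a Hammer-era system (the device name, partition boundaries, and use of ceph-disk are illustrative; production deployments would script this per device):

# Create a GPT label and four roughly equal partitions on the NVMe device
parted -s /dev/nvme0n1 mklabel gpt \
    mkpart osd1 1% 25% \
    mkpart osd2 25% 50% \
    mkpart osd3 50% 75% \
    mkpart osd4 75% 100%

# Prepare and activate one Filestore OSD per partition (XFS, as in this setup)
for p in 1 2 3 4; do
    ceph-disk prepare --fs-type xfs /dev/nvme0n1p${p}
    ceph-disk activate /dev/nvme0n1p${p}
done

With the default CRUSH rule placing replicas on different hosts (step chooseleaf firstn 0 type host), co-locating several OSDs on one physical device does not put both copies of an object on the same drive.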
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or
software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark
parameters.
Partitioning multiple OSDs per NVMe
• Multiple OSDs per NVMe result in higher performance, lower latency, and better CPU utilization
[Chart: average latency (ms) vs. IOPS for 4K random reads, comparing 1, 2, and 4 OSDs per NVMe device. 5 nodes, 20/40/80 OSDs, Intel DC P3700, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc.]
[Chart: single-node CPU utilization (%) for 4K random reads at QD=32 with 1, 2, and 4 OSDs per NVMe (4/8/16 OSDs per node). Intel DC P3700, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc.]
4K Random Read & Write Performance Summary
First Ceph cluster to break 1 Million 4K random IOPS
Workload pattern                                   Max IOPS
4K 100% random reads (2TB dataset)                 1.35 million
4K 100% random reads (4.8TB dataset)               1.15 million
4K 100% random writes (4.8TB dataset)              200K
4K 70%/30% read/write OLTP mix (4.8TB dataset)     452K
4K Random Read & Write Performance and Latency
First Ceph cluster to break 1 million 4K random IOPS, ~1ms response time
[Chart: IO-depth scaling, average latency (ms) vs. IOPS for 100% 4K random read, 100% 4K random write, 70/30% 4K random OLTP mix, and 100% 4K random read on a 2TB dataset. 5 nodes, 60 OSDs, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc.]
• 1M 100% 4K random read IOPS @ ~1.1ms
• 400K 70/30% (OLTP) 4K random IOPS @ ~3ms
• 171K 100% 4K random write IOPS @ 6ms
• 1.35M 4K random read IOPS with a 2TB hot dataset
Sequential performance (512KB)
• With a single 10GbE interface per node, both reads and writes reach line rate; the OSD node's single network interface is the bottleneck.
• Higher throughput would be possible through bonding or 40GbE connectivity.
512KB sequential bandwidth (5 nodes, 80 OSDs, DC P3700, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc):
100% Write: 3,214 MB/s    100% Read: 5,888 MB/s    70/30% R/W Mix: 5,631 MB/s
Cassandra-like workload
242K IOPS at < 2ms latency
• Based on a typical customer Cassandra workload profile
• 50% reads and 50% writes, predominantly 8K reads and 12K writes, FIO queue depth = 8 (an illustrative fio sketch follows below)
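This mixed profile can be approximated with fio's per-direction block-size split. The sketch below assumes a simple 8K-read / 12K-write split against an RBD image (the pool, image, and exact size distribution are illustrative; the measured workload also contained a tail of other I/O sizes):

fio --name=cassandra-like \
    --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testvol0 \
    --rw=randrw --rwmixread=50 \
    --bssplit=8k/100,12k/100 \
    --iodepth=8 --numjobs=1 --direct=1 \
    --time_based --runtime=300 --norandommap

Here --bssplit takes read and write distributions separated by a comma, so reads are issued at 8K and writes at 12K, while --rwmixread=50 keeps the 50/50 mix.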
[Charts: IOPS and average latency (ms) for the Cassandra-like 50/50 read/write mix; 5 nodes, 80 OSDs, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc. I/O-size breakdown: reads are predominantly 8K (~78%), writes predominantly 12K (~92%), with the remainder spread over other sizes.]
Summary & Conclusions
• Flash technology, including NVMe, enables new performance capabilities in small footprints
• Ceph and Cassandra provide a compelling case for feature-rich converged storage that can support latency-sensitive analytics workloads
• Using the latest standard high-volume servers and Ceph, you can now build an open, high-density, scalable, high-performance cluster that can handle a low-latency mixed workload
• Ceph performance improvements over recent releases are significant, and today over 1 million random IOPS is achievable in 5U with ~1ms latency
• Next steps:
• Address small-block write performance, currently limited by the Filestore backend
• Improve long-tail latency for transactional workloads
Thank you!
Configuration Detail – ceph.conf
Section Perf. Tuning Parameter Default Tuned
[global]
Authentication
auth_client_required cephx none
auth_cluster_required cephx none
auth_service_required cephx none
Debug logging
debug_lockdep 0/1 0/0
debug_context 0/1 0/0
debug_crush 1/1 0/0
debug_buffer 0/1 0/0
debug_timer 0/1 0/0
debug_filer 0/1 0/0
debug_objector 0/1 0/0
debug_rados 0/5 0/0
debug_rbd 0/5 0/0
debug_ms 0/5 0/0
debug_monc 0/5 0/0
debug_tp 0/5 0/0
debug_auth 1/5 0/0
debug_finisher 1/5 0/0
debug_heartbeatmap 1/5 0/0
debug_perfcounter 1/5 0/0
debug_rgw 1/5 0/0
debug_asok 1/5 0/0
debug_throttle 1/1 0/0
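The debug levels above are normally set in ceph.conf before the daemons start, but they can also be changed on a running cluster. A minimal sketch using the admin injectargs mechanism (the daemon target and the specific option are just examples):

# Silence messenger debug logging on all OSDs without a restart
ceph tell osd.* injectargs '--debug-ms 0/0'

# Verify the running value on one daemon (run on that OSD's host)
ceph daemon osd.0 config get debug_ms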
Configuration Detail – ceph.conf (continued)
Section Perf. Tuning Parameter Default Tuned
[global]
CBT specific
mon_pg_warn_max_object_skew 10 10000
mon_pg_warn_min_per_osd 0 0
mon_pg_warn_max_per_osd 32768 32768
osd_pg_bits 8 8
osd_pgp_bits 8 8
RBD cache rbd_cache true true
Other
mon_compact_on_trim true false
log_to_syslog false false
log_file /var/log/ceph/$name.log /var/log/ceph/$name.log
perf true true
mutex_perf_counter false true
throttler_perf_counter true false
[mon] CBT specific
mon_data /var/lib/ceph/mon/ceph-0 /home/bmpa/tmp_cbt/ceph/mon.$id
mon_max_pool_pg_num 65536 166496
mon_osd_max_split_count 32 10000
[osd]
Filestore parameters
filestore_wbthrottle_enable true false
filestore_queue_max_bytes 104857600 1048576000
filestore_queue_committing_max_bytes 104857600 1048576000
filestore_queue_max_ops 50 5000
filestore_queue_committing_max_ops 500 5000
filestore_max_sync_interval 5 10
filestore_fd_cache_size 128 64
filestore_fd_cache_shards 16 32
filestore_op_threads 2 6
Mount parameters
osd_mount_options_xfs rw,noatime,inode64,logbsize=256k,delaylog
osd_mkfs_options_xfs -f -i size=2048
Journal parameters
journal_max_write_entries 100 1000
journal_queue_max_ops 300 3000
journal_max_write_bytes 10485760 1048576000
journal_queue_max_bytes 33554432 1048576000
Op tracker osd_enable_op_tracker true false
OSD client
osd_client_message_size_cap 524288000 0
osd_client_message_cap 100 0
Objecter
objecter_inflight_ops 1024 102400
objecter_inflight_op_bytes 104857600 1048576000
Throttles ms_dispatch_throttle_bytes 104857600 1048576000
OSD number of threads
osd_op_threads 2 32
osd_op_num_shards 5 5
osd_op_num_threads_per_shard 2 2
Configuration Detail - CBT YAML File
cluster:
user: "bmpa"
head: "ft01"
clients: ["ft01", "ft02", "ft03", "ft04", "ft05", "ft06"]
osds: ["hswNode01", "hswNode02", "hswNode03", "hswNode04", "hswNode05"]
mons:
ft02:
a: "192.168.142.202:6789"
osds_per_node: 8
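# osds_per_node: 8 presumably reflects the 2-OSDs-per-NVMe layout (4 NVMe x 2), consistent with the 2partition conf_file below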
fs: xfs
mkfs_opts: '-f -i size=2048 -n size=64k'
mount_opts: '-o inode64,noatime,logbsize=256k'
conf_file: '/home/bmpa/cbt/ceph_nvme_2partition_5node_hsw.conf'
use_existing: False
rebuild_every_test: False
clusterid: "ceph"
iterations: 1
tmp_dir: "/home/bmpa/tmp_cbt"
pool_profiles:
2rep:
pg_size: 4096
pgp_size: 4096
replication: 2
Configuration Detail - CBT YAML File (Continued)
benchmarks:
librbdfio:
time: 300
ramp: 600
vol_size: 81920
mode: ['randrw']
rwmixread: [0, 70, 100]
op_size: [4096]
procs_per_volume: [1]
volumes_per_client: [10]
use_existing_volumes: False
iodepth: [4, 8, 16, 32, 64, 96, 128]
osd_ra: [128]
norandommap: True
cmd_path: '/usr/bin/fio'
pool_profile: '2rep'
log_avg_msec: 250
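With the cluster and benchmark sections above saved to a single YAML file, a CBT run is typically launched from the CBT checkout with something along these lines (the paths, archive directory, and YAML file name are illustrative, and the exact CLI may differ between CBT versions):

# From the ceph/cbt checkout on the head node
./cbt.py --archive=/home/bmpa/cbt_results ceph_nvme_2partition_5node_hsw.yaml

CBT then builds the cluster (unless use_existing is True), sweeps the librbdfio parameters, and stores the fio output under the archive directory.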
Storage Node Diagram
Two CPU Sockets: Socket 0 and Socket 1
 Socket 0
• 2 NVMes
• Intel X540-AT2 (10Gbps)
• 64GB: 8x 8GB 2133 DIMMs
 Socket 1
• 2 NVMes
• 64GB: 8x 8GB 2133 DIMMs
Explore additional
optimizations using
cgroups, IRQ affinity
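One way to explore the cgroup and IRQ-affinity optimizations mentioned above is to pin each socket's OSD processes and its NVMe/NIC interrupts to that socket's cores. A rough sketch, assuming socket-0 cores are 0-17 and 36-53 on this dual 18-core system (the cgroup name, IRQ number, and core ranges are illustrative):

# Cpuset cgroup for the OSDs backed by socket-0 NVMe devices (libcgroup tools)
cgcreate -g cpuset:ceph-osd-s0
cgset -r cpuset.cpus=0-17,36-53 ceph-osd-s0
cgset -r cpuset.mems=0 ceph-osd-s0
cgclassify -g cpuset:ceph-osd-s0 $(pidof ceph-osd)   # in practice, only the socket-0 OSD PIDs

# Steer a device interrupt (IRQ 120 here is a placeholder) to socket-0 cores
echo 0-17,36-53 > /proc/irq/120/smp_affinity_list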
High Performance Ceph Node Hardware Building Blocks
• Generally available server designs built for high density and high performance
• High-density 1U standard high-volume server
• Dual-socket 3rd-generation Xeon E5 (E5-2699 v3)
• 10 front-removable 2.5" form-factor drive slots with SFF-8639 connectors
• Multiple 10Gb network ports, additional slots for 40Gb networking
• Intel DC P3700 NVMe drives are available in the 2.5" drive form factor
• Allowing easier servicing in a datacenter environment