Accelerating Cassandra Workloads on Ceph with All-Flash PCIe SSDs
Reddy Chagam – Principal Engineer, Storage Architect
Stephen L Blinick – Senior Cloud Storage Performance Engineer
Acknowledgments: Warren Wang, Anton Thaker (WalMart)
Orlando Moreno, Vishal Verma (Intel)
Intel technologies’ features and benefits depend on system configuration and may require
enabled hardware, software or service activation. Performance varies depending on system
configuration. No computer system can be absolutely secure. Check with your system
manufacturer or retailer or learn more at http://intel.com.
Software and workloads used in performance tests may have been optimized for performance
only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and functions.
Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
§ Configurations: Ceph v0.94.3 Hammer release, CentOS 7.1, 3.10-229 kernel, linked with JEMalloc 3.6, CBT used for testing and data acquisition. OSD system config: Intel Xeon E5-2699 v3 2x @ 2.30 GHz, 72 cores w/ HT, 96GB, cache 46080KB, 128GB DDR4. Each system with 4x P3700 800GB NVMe, partitioned into 4 OSDs each, 16 OSDs total per node. FIO client systems: Intel Xeon E5-2699 v3 2x @ 2.30 GHz, 72 cores w/ HT, 96GB, cache 46080KB, 128GB DDR4. Single 10GbE network for client and replication data transfer. FIO 2.2.8 with the librbd engine. Tests run by the Intel DCG Storage Group in an Intel lab. Ceph configuration and CBT YAML file provided in backup slides.
§ For more information go to http://www.intel.com/performance.
Intel, Intel Inside and the Intel logo are trademarks of Intel Corporation in the United States and
other countries. *Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation.
Agenda
• The transition to flash and the impact of NVMe
• NVMe technology with Ceph
• Cassandra & Ceph – a case for storage convergence
• The all-NVMe high-density Ceph Cluster
• Raw performance measurements and observations
• Examining performance of a Cassandra DB like workload
Evolution of Non-Volatile Memory Storage Devices
[Chart: evolution from HDDs (~ms 4K read latency, sub-100 MB/s) to SATA/SAS SSDs (100s of µs, ~100s of MB/s, 10s of K IOPS, <10 drive writes/day) to PCIe NVMe SSDs (10s of µs, GB/s, 100s of K IOPS, >10 drive writes/day), with PCI Express® (PCIe) / NVM Express™ (NVMe) NVM SSDs leading toward 3D XPoint™ NVM SSDs and 3D XPoint DIMMs.]
NVM plays a key role in delivering performance for latency-sensitive workloads.
Ceph Workloads
[Chart: Ceph block and object workloads plotted by storage performance (IOPS, throughput) versus storage capacity (PB): Boot Volumes, Remote Disks, VDI, App Storage, Databases, Test & Dev, BigData, HPC, Enterprise Dropbox, CDN, Mobile Content Depot, Cloud DVR, Backup/Archive. An "NVM Focus" region marks the performance-oriented end of the spectrum.]
Ceph - NVM Usages
[Diagram: NVM usage points across a Ceph deployment over 10GbE. On the client side, a virtual machine (application in a guest VM, Qemu/Virtio, RBD driver, RADOS in the hypervisor) or a bare-metal host (application over user-space RBD/RADOS) uses NVM for client caching with write-through. On the RADOS node, the OSD uses NVM for journaling, as a read cache, and for OSD data under Filestore on a local file system. An example RBD client-cache setting is sketched below.]
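The client-side caching path shown above can be approximated with librbd's built-in cache. A minimal sketch of the relevant ceph.conf client settings follows; setting the dirty limit to 0 forces write-through behavior (the section placement and values are illustrative, not the configuration used in these tests):

[client]
rbd cache = true            # enable librbd client-side caching
rbd cache size = 268435456  # 256 MB cache per client (example value)
rbd cache max dirty = 0     # 0 dirty bytes => cache operates in write-through mode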
Cassandra – What and Why?
[Diagram: a Cassandra ring of token partitions (p1, p2, p3, ...) distributed across nodes, with a client connecting to the ring.]
• Cassandra is a column-oriented NoSQL DB with a CQL interface (a minimal CQL sketch follows below)
 Each row has a unique key, which is used for partitioning
 No relations
 A row can have multiple columns – rows need not have the same number of columns
• Open source, distributed, decentralized, highly available, linearly scalable, multi-DC, ...
• Used for analytics, real-time insights, fraud detection, IoT/sensor data, messaging, etc.
Use cases: http://www.planetcassandra.org/apachecassandra-use-cases/
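As a concrete illustration of the key-partitioned, flexible-column model described above, here is a minimal sketch of a CQL table created through cqlsh (the keyspace, table, and column names are made up for this example):

cqlsh -e "
  CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
  CREATE TABLE IF NOT EXISTS demo.sensor_readings (
    sensor_id uuid,          -- partition key: determines placement on the ring
    reading_time timestamp,  -- clustering column: orders rows within a partition
    value double,
    PRIMARY KEY (sensor_id, reading_time)
  );"

Rows sharing a sensor_id land in the same partition, while different sensors hash to different positions on the ring.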
• Ceph is a popular open source unified storage platform
• Many large-scale Ceph deployments are in production
• End customers prefer converged infrastructure that supports multiple workloads (e.g. analytics) to achieve CapEx and OpEx savings
• Several customers are asking for the Cassandra workload on Ceph
Ceph and Cassandra Integration
[Diagram: three virtual machines, each running Cassandra as a guest application over Qemu/Virtio with RBD/RADOS in the hypervisor, connected across an IP fabric to a Ceph storage cluster of SSD-backed OSD nodes and monitors (MON). An RBD volume-creation sketch follows the deployment considerations below.]
Deployment Considerations
• Bootable Ceph volumes
(OS & Cassandra data)
• Cassandra RBD data
volumes
• Data protection
(Cassandra or Ceph)
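Provisioning the Cassandra data volumes called out above comes down to creating a replicated RADOS pool and one RBD image per Cassandra node, then attaching each image to its VM (via Qemu/libvirt or the kernel RBD client). A minimal sketch, with illustrative pool, image, and size values rather than the ones used in this study:

# Replicated pool for Cassandra data volumes (PG count sized for the cluster)
ceph osd pool create cassandra-data 4096 4096
ceph osd pool set cassandra-data size 2          # 2x replication, as in these tests

# One data volume per Cassandra node (size in MB for Hammer-era rbd)
rbd create --pool cassandra-data --size 204800 cass-node1-data

# Quick check from any client with access to the pool
rbd info --pool cassandra-data cass-node1-data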
Hardware Environment Overview
[Diagram: a CBT / Zabbix monitoring node and six FIO RBD client nodes connected over the Ceph network (192.168.142.0/24, 10 Gbps) to the Ceph storage cluster.]
• OSD system config: Intel Xeon E5-2699 v3 2x @ 2.30 GHz, 72 cores w/ HT, 96GB, cache 46080KB, 128GB DDR4
• Each system with 4x P3700 800GB NVMe, partitioned into 4 OSDs each, 16 OSDs total per node
• FIO client systems: Intel Xeon E5-2699 v3 2x @ 2.30 GHz, 72 cores w/ HT, 96GB, cache 46080KB, 128GB DDR4
• Ceph v0.94.3 Hammer release, CentOS 7.1, 3.10-229 kernel, linked with JEMalloc 3.6
• CBT used for testing and data acquisition
• Single 10GbE network for client and replication data transfer, replication factor 2
[Diagram: six FIO RBD client systems hosted on FatTwin chassis (4x dual-socket Xeon E5 v3 each) and five SuperMicro 1028U OSD nodes. Each OSD node has dual Intel Xeon E5 v3 18-core CPUs and four Intel P3700 NVMe PCIe flash drives partitioned into 16 Ceph OSDs (CephOSD1 ... CephOSD16), with easily serviceable front-mounted NVMe drives. A standalone fio example follows below.]
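CBT drove FIO 2.2.8 with the librbd engine from the client nodes. For reference, a single CBT data point can be roughly reproduced by hand with a standalone fio invocation along these lines (the pool, image, and job names here are illustrative, not the exact CBT-generated ones):

fio --name=4k-randread \
    --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testvol0 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
    --direct=1 --norandommap --time_based --runtime=300 --ramp_time=600 \
    --log_avg_msec=250

The iodepth, runtime, ramp time, and averaging interval mirror the CBT YAML in the backup slides; sweeping --iodepth from 4 to 128 reproduces the queue-depth scaling curves.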
Multi-partitioning flash devices
• High performance NVMe devices are capable of high parallelism at low latency
• DC P3700 800GB raw performance: 460K read IOPS and 90K write IOPS at QD=128
• By using multiple OSD partitions, Ceph performance scales linearly (a partitioning sketch follows below)
• Reduces lock contention within a single OSD process
• Lower latency at all queue depths, with the biggest impact on random reads
• Introduces the concept of multiple OSDs on the same physical device
• Conceptually similar CRUSH map data placement rules as managing disks in an enclosure
• High resiliency of "Data Center" class NVMe devices
• At least 10 drive writes per day
• Power loss protection, full data path protection, device-level telemetry
[Diagram: one NVMe device (NVMe1) hosting four OSDs (CephOSD1 ... CephOSD4).]
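As a rough sketch of how one NVMe device might be split into four OSD partitions on a Hammer-era system (the device name, partition boundaries, and use of ceph-disk are illustrative; production deployments would script this per device):

# Create a GPT label and four roughly equal partitions on the NVMe device
parted -s /dev/nvme0n1 mklabel gpt \
    mkpart osd1 1% 25% \
    mkpart osd2 25% 50% \
    mkpart osd3 50% 75% \
    mkpart osd4 75% 100%

# Prepare and activate one Filestore OSD per partition (XFS, as in this setup)
for p in 1 2 3 4; do
    ceph-disk prepare --fs-type xfs /dev/nvme0n1p${p}
    ceph-disk activate /dev/nvme0n1p${p}
done

With the default CRUSH rule placing replicas on different hosts (step chooseleaf firstn 0 type host), co-locating several OSDs on one physical device does not put both copies of an object on the same drive.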
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or
software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark
parameters.
Partitioning multiple OSDs per NVMe
• Multiple OSDs per NVMe result in higher performance, lower latency, and better CPU utilization
[Chart: average latency (ms) vs. IOPS for 4K random reads, comparing 1, 2, and 4 OSDs per NVMe device. 5 nodes, 20/40/80 OSDs, Intel DC P3700, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc.]
[Chart: single-node CPU utilization (%) for 4K random reads at QD=32 with 1, 2, and 4 OSDs per NVMe (4/8/16 OSDs per node). Intel DC P3700, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc.]
4K Random Read & Write Performance Summary
First Ceph cluster to break 1 Million 4K random IOPS
Workload pattern                                   Max IOPS
4K 100% random reads (2TB dataset)                 1.35 million
4K 100% random reads (4.8TB dataset)               1.15 million
4K 100% random writes (4.8TB dataset)              200K
4K 70%/30% read/write OLTP mix (4.8TB dataset)     452K
4K Random Read & Write Performance and Latency
First Ceph cluster to break 1 million 4K random IOPS, ~1ms response time
[Chart: IO-depth scaling, average latency (ms) vs. IOPS for 100% 4K random read, 100% 4K random write, 70/30% 4K random OLTP mix, and 100% 4K random read on a 2TB dataset. 5 nodes, 60 OSDs, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc.]
• 1M 100% 4K random read IOPS @ ~1.1ms
• 400K 70/30% (OLTP) 4K random IOPS @ ~3ms
• 171K 100% 4K random write IOPS @ 6ms
• 1.35M 4K random read IOPS with a 2TB hot dataset
Sequential performance (512KB)
• With a single 10GbE interface per node, both reads and writes reach line rate; the OSD node's single network interface is the bottleneck.
• Higher throughput would be possible through bonding or 40GbE connectivity.
512KB sequential bandwidth (5 nodes, 80 OSDs, DC P3700, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc):
100% Write: 3,214 MB/s    100% Read: 5,888 MB/s    70/30% R/W Mix: 5,631 MB/s
Cassandra-like workload
242K IOPS at < 2ms latency
• Based on a typical customer Cassandra workload profile
• 50% reads and 50% writes, predominantly 8K reads and 12K writes, FIO queue depth = 8 (an illustrative fio sketch follows below)
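This mixed profile can be approximated with fio's per-direction block-size split. The sketch below assumes a simple 8K-read / 12K-write split against an RBD image (the pool, image, and exact size distribution are illustrative; the measured workload also contained a tail of other I/O sizes):

fio --name=cassandra-like \
    --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testvol0 \
    --rw=randrw --rwmixread=50 \
    --bssplit=8k/100,12k/100 \
    --iodepth=8 --numjobs=1 --direct=1 \
    --time_based --runtime=300 --norandommap

Here --bssplit takes read and write distributions separated by a comma, so reads are issued at 8K and writes at 12K, while --rwmixread=50 keeps the 50/50 mix.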
[Charts: IOPS and average latency (ms) for the Cassandra-like 50/50 read/write mix; 5 nodes, 80 OSDs, dual-socket Xeon E5-2699 v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc. I/O-size breakdown: reads are predominantly 8K (~78%), writes predominantly 12K (~92%), with the remainder spread over other sizes.]
Summary & Conclusions
• Flash technology, including NVMe, enables new performance capabilities in small footprints
• Ceph and Cassandra provide a compelling case for feature-rich converged storage that can support latency-sensitive analytics workloads
• Using the latest standard high-volume servers and Ceph, you can now build an open, high-density, scalable, high-performance cluster that can handle a low-latency mixed workload
• Ceph performance improvements over recent releases are significant, and today over 1 million random IOPS is achievable in 5U with ~1ms latency
• Next steps:
• Address small-block write performance, currently limited by the Filestore backend
• Improve long-tail latency for transactional workloads
Thank you!
Configuration Detail – ceph.conf
Section Perf. Tuning Parameter Default Tuned
[global]
Authentication
auth_client_required cephx none
auth_cluster_required cephx none
auth_service_required cephx none
Debug logging
debug_lockdep 0/1 0/0
debug_context 0/1 0/0
debug_crush 1/1 0/0
debug_buffer 0/1 0/0
debug_timer 0/1 0/0
debug_filer 0/1 0/0
debug_objector 0/1 0/0
debug_rados 0/5 0/0
debug_rbd 0/5 0/0
debug_ms 0/5 0/0
debug_monc 0/5 0/0
debug_tp 0/5 0/0
debug_auth 1/5 0/0
debug_finisher 1/5 0/0
debug_heartbeatmap 1/5 0/0
debug_perfcounter 1/5 0/0
debug_rgw 1/5 0/0
debug_asok 1/5 0/0
debug_throttle 1/1 0/0
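The debug levels above are normally set in ceph.conf before the daemons start, but they can also be changed on a running cluster. A minimal sketch using the admin injectargs mechanism (the daemon target and the specific option are just examples):

# Silence messenger debug logging on all OSDs without a restart
ceph tell osd.* injectargs '--debug-ms 0/0'

# Verify the running value on one daemon (run on that OSD's host)
ceph daemon osd.0 config get debug_ms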
Configuration Detail – ceph.conf (continued)
Section Perf. Tuning Parameter Default Tuned
[global]
CBT specific
mon_pg_warn_max_object_skew 10 10000
mon_pg_warn_min_per_osd 0 0
mon_pg_warn_max_per_osd 32768 32768
osd_pg_bits 8 8
osd_pgp_bits 8 8
RBD cache rbd_cache true true
Other
mon_compact_on_trim true false
log_to_syslog false false
log_file /var/log/ceph/$name.log /var/log/ceph/$name.log
perf true true
mutex_perf_counter false true
throttler_perf_counter true false
[mon] CBT specific
mon_data /var/lib/ceph/mon/ceph-0 /home/bmpa/tmp_cbt/ceph/mon.$id
mon_max_pool_pg_num 65536 166496
mon_osd_max_split_count 32 10000
[osd]
Filestore parameters
filestore_wbthrottle_enable true false
filestore_queue_max_bytes 104857600 1048576000
filestore_queue_committing_max_bytes 104857600 1048576000
filestore_queue_max_ops 50 5000
filestore_queue_committing_max_ops 500 5000
filestore_max_sync_interval 5 10
filestore_fd_cache_size 128 64
filestore_fd_cache_shards 16 32
filestore_op_threads 2 6
Mount parameters
osd_mount_options_xfs rw,noatime,inode64,logbsize=256k,delaylog
osd_mkfs_options_xfs -f -i size=2048
Journal parameters
journal_max_write_entries 100 1000
journal_queue_max_ops 300 3000
journal_max_write_bytes 10485760 1048576000
journal_queue_max_bytes 33554432 1048576000
Op tracker osd_enable_op_tracker true false
OSD client
osd_client_message_size_cap 524288000 0
osd_client_message_cap 100 0
Objecter
objecter_inflight_ops 1024 102400
objecter_inflight_op_bytes 104857600 1048576000
Throttles ms_dispatch_throttle_bytes 104857600 1048576000
OSD number of threads
osd_op_threads 2 32
osd_op_num_shards 5 5
osd_op_num_threads_per_shard 2 2
Configuration Detail - CBT YAML File
cluster:
user: "bmpa"
head: "ft01"
clients: ["ft01", "ft02", "ft03", "ft04", "ft05", "ft06"]
osds: ["hswNode01", "hswNode02", "hswNode03", "hswNode04", "hswNode05"]
mons:
ft02:
a: "192.168.142.202:6789"
osds_per_node: 8
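# osds_per_node: 8 presumably reflects the 2-OSDs-per-NVMe layout (4 NVMe x 2), consistent with the 2partition conf_file below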
fs: xfs
mkfs_opts: '-f -i size=2048 -n size=64k'
mount_opts: '-o inode64,noatime,logbsize=256k'
conf_file: '/home/bmpa/cbt/ceph_nvme_2partition_5node_hsw.conf'
use_existing: False
rebuild_every_test: False
clusterid: "ceph"
iterations: 1
tmp_dir: "/home/bmpa/tmp_cbt"
pool_profiles:
2rep:
pg_size: 4096
pgp_size: 4096
replication: 2
Configuration Detail - CBT YAML File (Continued)
benchmarks:
librbdfio:
time: 300
ramp: 600
vol_size: 81920
mode: ['randrw']
rwmixread: [0, 70, 100]
op_size: [4096]
procs_per_volume: [1]
volumes_per_client: [10]
use_existing_volumes: False
iodepth: [4, 8, 16, 32, 64, 96, 128]
osd_ra: [128]
norandommap: True
cmd_path: '/usr/bin/fio'
pool_profile: '2rep'
log_avg_msec: 250
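With the cluster and benchmark sections above saved to a single YAML file, a CBT run is typically launched from the CBT checkout with something along these lines (the paths, archive directory, and YAML file name are illustrative, and the exact CLI may differ between CBT versions):

# From the ceph/cbt checkout on the head node
./cbt.py --archive=/home/bmpa/cbt_results ceph_nvme_2partition_5node_hsw.yaml

CBT then builds the cluster (unless use_existing is True), sweeps the librbdfio parameters, and stores the fio output under the archive directory.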
Storage Node Diagram
Two CPU Sockets: Socket 0 and Socket 1
 Socket 0
• 2 NVMes
• Intel X540-AT2 (10Gbps)
• 64GB: 8x 8GB 2133 DIMMs
 Socket 1
• 2 NVMes
• 64GB: 8x 8GB 2133 DIMMs
Explore additional
optimizations using
cgroups, IRQ affinity
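One way to explore the cgroup and IRQ-affinity optimizations mentioned above is to pin each socket's OSD processes and its NVMe/NIC interrupts to that socket's cores. A rough sketch, assuming socket-0 cores are 0-17 and 36-53 on this dual 18-core system (the cgroup name, IRQ number, and core ranges are illustrative):

# Cpuset cgroup for the OSDs backed by socket-0 NVMe devices (libcgroup tools)
cgcreate -g cpuset:ceph-osd-s0
cgset -r cpuset.cpus=0-17,36-53 ceph-osd-s0
cgset -r cpuset.mems=0 ceph-osd-s0
cgclassify -g cpuset:ceph-osd-s0 $(pidof ceph-osd)   # in practice, only the socket-0 OSD PIDs

# Steer a device interrupt (IRQ 120 here is a placeholder) to socket-0 cores
echo 0-17,36-53 > /proc/irq/120/smp_affinity_list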
High Performance Ceph Node Hardware Building Blocks
• Generally available server designs built for high density and high performance
• High-density 1U standard high-volume server
• Dual-socket 3rd-generation Xeon E5 (E5-2699 v3)
• 10 front-removable 2.5" form-factor drive slots with SFF-8639 connectors
• Multiple 10Gb network ports, additional slots for 40Gb networking
• Intel DC P3700 NVMe drives are available in the 2.5" drive form factor
• Allowing easier servicing in a datacenter environment