Best practices & Performance Tuning
OpenStack Cloud Storage with Ceph
OpenStack Summit Barcelona
25th Oct 2016 @17:05 - 17:45
Room: 118-119
Swami Reddy
RJIL
OpenStack & Ceph Dev
Pandiyan M
RJIL
OpenStack Dev
Who are we?
Agenda
• Ceph - Quick Overview
• OpenStack Ceph Integration
• OpenStack - Recommendations
• Ceph - Recommendations
• Q & A
• References
Cloud Environment Details
Cloud environment with 200 nodes for general-purpose use cases.
~2500 VMs - 40 TB RAM and 5120 cores - on 4 PB of raw storage.
• Average boot volume sizes
o Linux VMs - 20 GB
o Windows VMs – 100 GB
• Average data Volume sizes: 200 GB
Compute (~160 nodes)
• CPU : 2 * 16 cores @ 2.60 GHz
• RAM : 256 GB
• HDD : 3.6 TB (OS Drive)
• NICs : 2 * 10 Gbps, 2 * 1 Gbps
• Overprovision: CPU - 1:8
RAM - 1:1
Storage (~44 nodes)
• CPU : 2 * 12 cores @ 2.50 GHz
• RAM : 128 GB
• HDD : 2 * 1 TB (OS Drive)
• OSD : 22 * 3.6 TB
• SSD : 2 * 800 GB (Intel S3700)
• NICs : 2 * 10 Gbps , 2 * 1 Gbps
• Replication: 3
Ceph - Quick Overview
Ceph Overview
Design Goals
• Every component must scale
• No single point of failure
• Open source
• Runs on commodity hardware
• Everything must self-manage
Key Benefits
• Multi-node striping and redundancy
• COW cloning of images to volumes
• Live migration of Ceph-backed VMs
OpenStack - Ceph Integration
OpenStack - Ceph Integration
[Diagram: OpenStack services (Cinder, Glance, Nova via the QEMU/KVM hypervisor, and Swift clients) access the Ceph storage cluster (RADOS) through RBD and RGW]
OpenStack - Ceph Integration
OpenStack Block storage - RBD flow:
• libvirt
• QEMU
• librbd
• librados
• OSDs and MONs
OpenStack Object storage - RGW flow:
• S3/SWIFT APIs
• RGW
• librados
• OSDs and MONs
[Diagram: OpenStack -> libvirt (configures) -> QEMU -> librbd -> librados -> RADOS]
[Diagram: S3-compatible API / Swift-compatible API -> radosgw -> librados -> RADOS]
OpenStack - Recommendations
Glance Recommendations
• What is Glance ?
• Configuration settings: /etc/glance/glance-api.conf
• Use Ceph RBD as the Glance storage backend
• When booting from volumes:
• Disable the local image cache (cached images can fill the controller's disk)
• Exposing the image URL saves time, since image download and copy are NOT required
default_store=rbd
flavor = keystone (changed from flavor = keystone+cachemanagement to disable the local image cache)
show_image_direct_url = True
show_multiple_locations = True
# glance --os-image-api-version 2 image-show 64b71b88-f243-4470-8918-d3531f461a26
+------------------+-----------------------------------------------------------------+
| Property | Value |
+------------------+-----------------------------------------------------------------+
| checksum | 24bc1b62a77389c083ac7812a08333f2 |
| container_format | bare |
| created_at | 2016-04-19T05:56:46Z |
| description | Image Updated on 18th April 2016 |
| direct_url | rbd://8a0021e6-3788-4cb3-8ada- |
| | 1f6a7b0d8d15/images/64b71b88-f243-4470-8918-d3531f461a26/snap |
| disk_format | raw |
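A minimal sketch of the corresponding RBD store settings in glance-api.conf; the 'images' pool, 'glance' cephx user and chunk size are deployment-specific assumptions, not values from this deck:
[glance_store]
stores = rbd
default_store = rbd
rbd_store_pool = images              # Ceph pool holding the images
rbd_store_user = glance              # cephx user Glance authenticates as
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_chunk_size = 8             # image striping chunk size in MB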
Glance Recommendations
Image Format: Use ONLY RAW Images
With QCOW2 images:
• Convert the qcow2 image to RAW (see the example after the table below)
• Get the image UUID
With RAW images (no conversion; saves time):
• Get the image UUID
Image Size (GB)   Format   VM Boot Time (Approx.)
50 (Windows)      QCOW2    ~45 minutes
50 (Windows)      RAW      ~1 minute
6 (Linux)         QCOW2    ~2 minutes
6 (Linux)         RAW      ~1 minute
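A sketch of the one-time conversion and upload step for a qcow2 image; the file and image names are hypothetical, and the qemu-img/glance options shown are the standard ones:
# convert qcow2 to RAW (done once, before upload)
# qemu-img convert -f qcow2 -O raw myimage.qcow2 myimage.raw
# upload the RAW image to Glance
# glance image-create --name "myimage-raw" --disk-format raw --container-format bare --file myimage.raw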
Cinder Recommendations
• What is Cinder ?
• Configuration settings: /etc/cinder/cinder.conf
Enable Ceph as the volume backend
• Cinder Backup
Ceph supports incremental backups
enabled_backends = ceph
backup_driver = cinder.backup.drivers.ceph
backup_ceph_conf=/etc/ceph/ceph.conf
backup_ceph_user = cinder
backup_ceph_chunk_size = 134217728
backup_ceph_pool = backups
backup_ceph_stripe_unit = 0
backup_ceph_stripe_count = 0
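A minimal sketch of the matching RBD volume backend section that goes with the enabled_backends setting above; the 'volumes' pool, 'cinder' cephx user and libvirt secret UUID are deployment-specific assumptions:
[ceph]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_secret_uuid = <libvirt-secret-uuid>
rbd_flatten_volume_from_snapshot = false
rbd_max_clone_depth = 5
With the Ceph backup driver enabled, an incremental backup can then be requested from the CLI, e.g.:
# cinder backup-create --incremental --name vol1-backup <volume-id>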
Nova Recommendations
• What is Nova ?
• Configuration settings: /etc/nova/nova.conf
• Use librbd/librados (instead of krbd).
[libvirt]
# enable discard support (be careful of perf)
hw_disk_discard = unmap
# disable password injection
inject_password = false
# disable key injection
inject_key = false
# disable partition injection
inject_partition = -2
# make QEMU aware so caching works
disk_cachemodes = "network=writeback"
live_migration_flag="VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,
VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST"
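A minimal sketch of the RBD settings that usually accompany the above in the same [libvirt] section; the 'vms' pool for ephemeral disks and the reuse of the Cinder cephx user/secret are deployment-specific assumptions:
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <libvirt-secret-uuid>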
Ceph - Recommendations
Performance Decision Factors
• What is the required storage (usable vs. raw)?
• How many IOPS?
• Aggregated
• Per VM (min/max)
• Optimization for?
• Performance
• Cost
Ceph Cluster Optimization Criteria
Cluster Optimization Criteria | Properties | Sample Use Cases

IOPS - Optimized
• Properties:
o Lowest cost per IOPS
o Highest IOPS
o Meets minimum fault domain recommendation
• Sample use cases:
o Typically block storage
o 3x replication

Throughput - Optimized
• Properties:
o Lowest cost per given unit of throughput
o Highest throughput
o Highest throughput per BTU
o Highest throughput per watt
o Meets minimum fault domain recommendation
• Sample use cases:
o Block or object storage
o 3x replication for higher read throughput

Capacity - Optimized
• Properties:
o Lowest cost per TB
o Lowest BTU per TB
o Lowest watt per TB
o Meets minimum fault domain recommendation
• Sample use cases:
o Typically object storage
o Erasure coding common for maximizing usable capacity
OSD Considerations
• RAM
o ~1 GB of RAM per 1 TB of OSD space
• CPU
o 0.5 CPU cores / 1 GHz of a core per OSD (2 cores for SSD drives)
• Ceph-mons
o 1 ceph-mon node per 15-20 OSD nodes
• Network
o Ensure the sum of the total throughput of your OSD disks does not exceed the network bandwidth
• Thread count
o Hosts with a high number of OSDs (e.g., > 20) may spawn a lot of threads during recovery and rebalancing
[Diagram: one storage host running multiple OSD daemons (OSD.1 - OSD.6); a worked sizing example follows]
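As a rough sanity check, applying these rules to the storage nodes from the environment slide (22 * 3.6 TB OSDs per node); the ~150 MB/s per-HDD figure is an assumption, not a measured value:
RAM: 22 OSDs * 3.6 TB * 1 GB/TB ~= 80 GB (the node has 128 GB)
CPU: 22 OSDs * 0.5 cores ~= 11 cores (the node has 2 * 12 cores)
Network: 22 HDDs * ~150 MB/s ~= 3.3 GB/s ~= 26 Gbps, more than a single 10 Gbps link - hence the bonded 2 * 10 Gbps NICs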
Ceph OSD Journal
• Run operating systems, OSD data and OSD journals on separate drives to maximize overall throughput.
• On-disk journals can halve write throughput.
• Use SSD journals for high write-throughput workloads (see the provisioning sketch after the table below).
• Performance comparison with/without SSD journal, using rados bench
o 100% write operations with 4 MB object size (default):
- On-disk journal: 45 MB/s
- SSD journal: 80 MB/s
• Note: The above results were obtained with a 1:11 SSD:OSD ratio
• A ratio of 1 SSD per 4-6 OSDs is recommended for better results
Op Type            Without SSD journal   With SSD journal
Write (MB/s)       45                    80
Seq Read (MB/s)    73                    140
Rand Read (MB/s)   55                    655
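A sketch of how an OSD with a separate SSD journal was typically created with the ceph-disk tooling of that (pre-BlueStore) era; the device names are hypothetical, and ceph-disk carves the journal partition out of the SSD itself:
# /dev/sdd is the data HDD, /dev/sdb is the shared journal SSD
# ceph-disk prepare /dev/sdd /dev/sdb
# ceph-disk activate /dev/sdd1   (udev usually triggers activation automatically)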
OS Considerations
• Kernel: Latest stable release
• BIOS : Enable HT (Hyper-Threading) and VT (Virtualization Technology).
• Kernel PID max:
• Read ahead: Set on all block devices
• Swappiness:
• Disable NUMA balancing: pass the numa_balancing=disable parameter on the kernel command line.
• The same behaviour can also be controlled at runtime via the kernel.numa_balancing sysctl:
• CPU Tuning: Set the "performance" governor so the CPU always runs at 100% frequency.
• I/O Scheduler:
# echo "4194303" > /proc/sys/kernel/pid_max
# echo "8192" > /sys/block/sda/queue/read_ahead_kb
# echo "vm.swappiness = 0" | tee -a /etc/sysctl.conf
# echo 0 > /proc/sys/kernel/numa_balancing
SATA/SAS Drives: # echo "deadline" > /sys/block/sd[x]/queue/scheduler
SSD Drives : # echo "noop" > /sys/block/sd[x]/queue/scheduler
# echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Ceph Deployment Network
Ceph Deployment Network
• Each host should have at least two 1 Gbps network interface controllers (NICs).
• Use 10G Ethernet
• Always use jumbo frames
• High-bandwidth connectivity between TOR switches and spine routers, for example 40 Gbps to 100 Gbps
• Hardware should have a Baseboard Management Controller (BMC)
• Note: Running three networks (public, cluster, BMC) in HA mode may seem like overkill, but each traffic path is a potential capacity or performance bottleneck
[Diagram: NIC-1 on the public network, NIC-2 on the cluster network]
# ifconfig ethx mtu 9000
# echo "MTU=9000" | tee -a /etc/sysconfig/network-scripts/ifcfg-ethx
Ceph Deployment Network
• NIC bonding: in balance-alb mode both NICs are used to send and receive traffic (a sample bond configuration follows the results below)
• Test results with 2 x 10G NICs:
• Active-passive bond mode:
Traffic between 2 nodes:
Case #1: node-1 to node-2 => BW 4.80 Gb/s
Case #2: node-1 to node-2 => BW 4.62 Gb/s
• Limited to the speed of one 10GigE NIC
• Balance-alb bond mode:
• Case #1: node-1 to node-2 => BW 8.18 Gb/s
• Case #2: node-1 to node-2 => BW 8.37 Gb/s
• Uses the speed of both 10GigE NICs
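A sketch of a balance-alb bond on a RHEL/CentOS-style system; the interface names, addressing and network-scripts layout are assumptions, not values from this deck:
# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BOOTPROTO=none
ONBOOT=yes
MTU=9000
BONDING_OPTS="mode=balance-alb miimon=100"
# each slave interface (e.g. ifcfg-eth0, ifcfg-eth1) then carries MASTER=bond0 and SLAVE=yes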
Ceph Failure Domains
• A failure domain is any failure that prevents access to one or more OSDs.
There are added costs to isolating every potential failure domain; a CRUSH rule sketch follows the list below.
Failure domains:
• osd
• host
• chassis
• rack
• row
• pdu
• pod
• room
• datacenter
• region
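A sketch of pinning a pool's failure domain to rack with the CRUSH tooling of that era; the default CRUSH root, the 'volumes' pool and the rule-id placeholder are assumptions:
# create a replicated rule that separates replicas across racks
# ceph osd crush rule create-simple replicated_rack default rack
# point an existing pool at the new rule (pre-Luminous option name)
# ceph osd pool set volumes crush_ruleset <rule-id>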
Ceph Ops Recommendations
Scrub and deep-scrub operations are very I/O intensive and can affect cluster performance.
o Disable scrub and deep scrub
o After setting noscrub and nodeep-scrub, ceph health reports a HEALTH_WARN state
o Enable scrub and deep scrub
o Configure scrub and deep scrub
#ceph osd set noscrub
set noscrub
#ceph osd set nodeep-scrub
set nodeep-scrub
#ceph health
HEALTH_WARN noscrub, nodeep-scrub flag(s) set
# ceph osd unset noscrub
unset noscrub
# ceph osd unset nodeep-scrub
unset nodeep-scrub
osd_scrub_begin_hour = 0 # begin scrubbing at this hour
osd_scrub_end_hour = 24 # do not start scrubs after this hour
osd_scrub_load_threshold = 0.05 # scrub only when system load is below this value
osd_scrub_min_interval = 86400 # not more often than once a day
osd_scrub_max_interval = 604800 # not less often than once a week
osd_deep_scrub_interval = 604800 # deep-scrub once a week
Ceph Ops Recommendations
• Decreasing the performance impact of recovery and backfilling
• Settings for recovery and backfilling (see the sketch below):
Note: Decreasing these values slows down recovery/backfill and prolongs the recovery process; increasing them speeds up recovery/backfill but reduces client performance, and vice versa.
'osd max backfills' - maximum backfills allowed to/from an OSD [default 10]
'osd recovery max active' - recovery requests per OSD at one time [default 15]
'osd recovery threads' - the number of threads for recovering data [default 1]
'osd recovery op priority' - priority for recovery ops [default 10]
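A sketch of throttling recovery on a running cluster so that client I/O takes priority; the values are illustrative, not tested recommendations from this deck:
# apply at runtime to all OSDs
# ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# make it persistent in ceph.conf
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1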
Ceph Performance Measurement Guidelines
For best measurement results, follow these rules while testing:
• Change one option at a time.
• Check what is actually changing.
• Choose the right performance test for the changed option (see the rados bench sketch below).
• Re-test the changes - at least ten times.
• Run tests for hours, not seconds.
• Trace for any errors.
• Look at the results critically.
• Always estimate the expected results and look at the standard deviation to eliminate spikes and false tests.
Tuning:
• Ceph clusters can be parametrized after deployment to better fit the requirements of the workload.
• Some configuration options can affect data redundancy and have significant implications for the stability and safety of data.
• Tuning should be performed on a test environment before issuing any command or configuration change on production.
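A sketch of a basic rados bench sequence against a dedicated test pool; the pool name 'testpool' and the 10-minute duration are illustrative choices:
# 10-minute write test; keep the objects so they can be read back
# rados bench -p testpool 600 write --no-cleanup
# sequential and random read tests against the objects written above
# rados bench -p testpool 600 seq
# rados bench -p testpool 600 rand
# remove the benchmark objects afterwards
# rados -p testpool cleanup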
Any questions?
Thank You
Swami Reddy | swami.reddy@ril.com | swamireddy @ irc
Satish | Satish.venkatsubramaniam@ril.com | satish @ irc
Pandiyan M | Pandiyan.muthuraman@ril.com | maestropandy @ irc
Reference Links
• Ceph documentation
• Previous OpenStack Summit presentations
• Tech Talk Ceph
• A few blogs on Ceph
• https://www.sebastien-han.fr/blog/categories/ceph/
• https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
Appendix
Ceph H/W Best Practices
OSD host:
• CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
• RAM: 1 GB per 1 TB of OSD storage
MDS host:
• CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
• RAM: 1 GB per daemon
MON host:
• CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
• RAM: 1 GB per daemon
HDD, SSD, Controllers
• Ceph best practice is to run operating systems, OSD data and OSD journals on separate drives.
Hard Disk Drives (HDD)
• Minimum hard disk drive size of 1 terabyte.
• ~1 GB of RAM per 1 TB of storage space.
NOTE: It is NOT a good idea to run:
1. multiple OSDs on a single disk.
2. an OSD and a monitor or metadata server on the same disk.
Solid State Drives (SSD)
• Use SSDs to improve performance.
Controllers
• Disk controllers also have a significant impact on write throughput.
Ceph OSD Journal - Results
• [Chart: write operations with and without SSD journal]
• [Chart: sequential read operations with and without SSD journal]
• [Chart: read operations with and without SSD journal]
Editor's Notes
  1. Hello, good evening all. Today we will be talking about "Best practices and performance tuning for OpenStack + Ceph". So, how many of you are using Ceph? How many of you plan to use Ceph in the near future?
  2. I am Swami, working with RJIL. I have been working on OpenStack and Ceph projects for the last 3 years. My key responsibilities include managing multiple Ceph storage clusters for OpenStack clouds. I have 15+ years of experience with open-source projects such as Linux and the GNU GCC tools. Now I would like to introduce my colleague Mr Pandiyan, who is an ATC in OpenStack and one of the active members of the India OpenStack community.
  3. Moving to the agenda: the rest of the talk covers best practices and performance recommendations for OpenStack with Ceph. First a quick overview of Ceph, then the OpenStack integration with Ceph, and then the recommended settings for OpenStack and Ceph. After that we can take questions.
  4. A typical general-purpose cloud environment with 200 nodes spread across DCs with compute, block and object storage use cases. We have around 2.5K VMs with a compute capacity of 40 TB RAM and 5120 CPU cores, and a raw storage capacity of 4 PB. On average we use 20 GB and 100 GB boot volumes for Linux and Windows respectively. Additionally we use 200 GB data volumes on average. The table below shows the compute node and storage node details.
  6. In this section I will do a quick recap of Ceph.
  7. Ceph is a distributed storage system designed to provide excellent performance, reliability and scalability, and it delivers object, block and file storage in one unified system. Ceph block devices (RBD) are thin-provisioned, resizable, and store data striped over multiple OSDs in a Ceph cluster, and they leverage RADOS capabilities such as snapshotting, replication and consistency. Ceph's RADOS Block Devices (RBD) interact with OSDs using kernel modules or the librbd library. Ceph Object Storage uses the Ceph Object Gateway daemon (radosgw), which is a FastCGI module for interacting with a Ceph storage cluster; it provides interfaces compatible with the OpenStack Swift and Amazon S3 APIs.
  8. In this section I will be talking about the OpenStack-Ceph integration: how OpenStack components interact with Ceph components.
  9. Cinder, the OpenStack Block Storage service, provides persistent block storage resources and is backed by Ceph RBD; basically, Cinder is used to create volumes in RBD. Glance is the OpenStack Image service, used to store images and maintain a catalog of available images, and it is also backed by Ceph RBD. Nova is the OpenStack Compute service; you can use Nova to host and manage cloud computing systems, and Nova attaches/detaches volumes. Object Storage is a robust, highly scalable and fault-tolerant storage platform for unstructured data such as objects, and it is backed by Ceph RGW.
  10. Basically, Cinder is used to create volumes in RBD. Nova is the OpenStack Compute service; you can use Nova to host and manage cloud computing systems, and Nova attaches/detaches volumes. Object Storage is backed by Ceph RGW. The Ceph Object Gateway integrates with Keystone, the OpenStack identity service; this sets up the gateway to accept Keystone as the users' authority. A user that Keystone authorizes to access the gateway is also automatically created on the Ceph Object Gateway (if it did not exist beforehand). A token that Keystone validates is considered valid by the gateway. A Ceph Object Gateway user is mapped to a Keystone tenant. A Keystone user can have different roles assigned on more than a single tenant. When the Ceph Object Gateway gets the ticket, it looks at the tenant and the user roles assigned to that ticket, and accepts or rejects the request according to the "rgw keystone accepted roles" configurable.
  11. In this section I will cover a few OpenStack component recommendations, even though more recommendations come from Ceph in the next section.
  12. During boot from volume, images are downloaded to the controller and cached in the "cached" location by default. If we spawn multiple VMs from large images, all of those images get cached and eventually consume the controller's disk space, which can cause the controller to stop its operations.
  13. Ceph internally stores images in RAW format, so it is optimal for Glance to use RAW images (instead of qcow2).
  14. As we discussed, Cinder is the OpenStack Block Storage service. In this section there are not many recommendations, except: use the cinder-backup service with the Ceph backend to get the incremental backup functionality supported by Ceph. Here are the default cinder-backup configurations.
  15. As you already know, Nova is the OpenStack Compute service. For Nova there are no Ceph-specific recommendations I can give at the moment. It is good to use krbd instead of librbd to get the page cache functionality supported by the kernel.
  16. In this section we will discuss Ceph-specific recommendations in detail.
  17. Performance always depends on the use case, so we need to answer these questions. What are the storage needs (raw vs. usable storage)? For example, if more usable storage is needed, a lower replication factor, etc. What are the IOPS needs? What do we optimize for: cost or performance? It is always a big challenge to achieve the best performance at low cost, so one usually has to be traded off against the other.
  18. In this slide I will talk about a few optimization profiles and their criteria. The table is self-explanatory, so for the sake of time I won't discuss it further here; please go through the table.
  19. Now we will talk about the Ceph OSD. The OSD is the object storage daemon for Ceph storage; it is responsible for storing objects on a local file system and providing access to them over the network. This slide shows the minimum OSD requirements for CPU and RAM. Ceph-mon is the cluster monitor daemon; a Ceph monitor always refers to the local copy of the monmap when discovering other monitors in the cluster, to maintain consistency. Check all OSD disks' throughput against the network throughput: the aggregate OSD throughput should not exceed the network throughput. Using more OSDs per server may hit thread-count limits, because OSDs need more threads during rebalance, recovery and other activities.
  20. Ceph OSDs use a journal for two reasons: speed and consistency. Speed: the journal enables the Ceph OSD daemon to commit small writes quickly. Consistency: Ceph OSD daemons require a filesystem interface that guarantees atomic compound operations. Every few seconds, between "filestore max sync interval" and "filestore min sync interval", the Ceph OSD daemon stops writes and synchronizes the journal with the filesystem, allowing Ceph OSD daemons to trim operations from the journal and reuse the space. There are two types of OSD journal: on-disk and SSD. In general the on-disk journal shows lower performance compared with an SSD journal. Here are the quick results from our environment: for 100% write operations we saw 40 MB/s using the on-disk journal and 85 MB/s using the SSD journal, with a 1:11 SSD:OSD ratio. For better performance it is recommended to use a 1:[4-6] SSD:OSD ratio.
  21. Here are a few operating system recommendations. Please refer to the slide (it is self-explanatory).
  22. Now we will discuss the Ceph network considerations. Here is the standard Ceph network diagram taken from the official Ceph docs. It is always recommended to use separate public and cluster networks. Ceph internally does a lot of activities like scrub, deep-scrub, recovery, etc., which should not impact the public/user network.
  23. To support separate networks, each Ceph node should have at least 2 NICs: one for the public network and the other for the cluster network. It is recommended to use jumbo frames across the network.
  24. In our cloud environment we have done NIC bonding in balance-alb mode, which showed better performance. Here are the results:
  25. Now we will talk about Ceph failure domain selection. By definition, a failure domain is any failure that prevents access to one or more OSDs. Ceph maps objects (i.e. PGs) to OSDs across failure domains. Here is the list of failure domains; it is recommended to use chassis or rack for a durable cluster. A failure could be a stopped daemon on a host, a hard disk failure, an OS crash, a malfunctioning NIC, a failed power supply, a network outage, a power outage, etc.
  26. To verify the integrity of data, Ceph uses a mechanism called scrubbing. Ceph ensures data integrity by scrubbing placement groups. Light scrubbing (daily) checks the object size and attributes. Deep scrubbing (weekly) reads the data and uses checksums to ensure data integrity. Scrubbing is important for maintaining data integrity, but it can reduce performance. We can adjust the settings shown on this slide to increase or decrease scrubbing operations.
  27. Here we talk about recovery and backfill considerations, for when an OSD goes into a recovery state for various reasons. To maintain operational performance, Ceph performs recovery with limits on the number of recovery requests, threads and object chunk sizes, which allows Ceph to perform well in a degraded state.