
Ceph barcelona-v-1.2


Best Practices & Performance Tuning - OpenStack Cloud Storage with Ceph. In this presentation, we discuss best practices and performance tuning for OpenStack cloud storage with Ceph to achieve high availability, durability, reliability and scalability at any point in time. We also cover best practices for failure domains, recovery, rebalancing, backfilling, scrubbing, deep-scrubbing and day-to-day operations.


  1. 1. Best Practices & Performance Tuning: OpenStack Cloud Storage with Ceph
     OpenStack Summit Barcelona, 25th Oct 2016, 17:05 - 17:45, Room: 118-119
  2. 2. Who are we?
     • Swami Reddy - RJIL, OpenStack & Ceph Dev
     • Pandiyan M - RJIL, OpenStack Dev
  3. 3. Agenda
     • Ceph - Quick Overview
     • OpenStack - Ceph Integration
     • OpenStack - Recommendations
     • Ceph - Recommendations
     • Q & A
     • References
  4. 4. Cloud Environment Details
     Cloud environment with 200 nodes for general-purpose use cases: ~2500 VMs - 40 TB RAM and 5120 cores - on 4 PB storage.
     • Average boot volume sizes:
       o Linux VMs - 20 GB
       o Windows VMs - 100 GB
     • Average data volume size: 200 GB
     Compute (~160 nodes)
     • CPU : 2 * 16 cores @ 2.60 GHz
     • RAM : 256 GB
     • HDD : 3.6 TB (OS drive)
     • NICs : 2 * 10 Gbps, 2 * 1 Gbps
     • Overprovisioning: CPU - 1:8, RAM - 1:1
     Storage (~44 nodes)
     • CPU : 2 * 12 cores @ 2.50 GHz
     • RAM : 128 GB
     • HDD : 2 * 1 TB (OS drive)
     • OSD : 22 * 3.6 TB
     • SSD : 2 * 800 GB (Intel S3700)
     • NICs : 2 * 10 Gbps, 2 * 1 Gbps
     • Replication: 3
  6. 6. Ceph - Quick Overview
  7. 7. Ceph Overview
     Design Goals
     • Every component must scale
     • No single point of failure
     • Open source
     • Runs on commodity hardware
     • Everything must self-manage
     Key Benefits
     • Multi-node striping and redundancy
     • COW cloning of images to volumes
     • Live migration of Ceph-backed VMs
  8. 8. OpenStack - Ceph Integration
  9. 9. OpenStack - Ceph Integration
     (diagram: Cinder, Glance, Nova and Swift/RGW on top of the QEMU/KVM hypervisor and RBD, all backed by the Ceph storage cluster (RADOS))
  10. 10. OpenStack - Ceph Integration
     OpenStack block storage - RBD flow:
     • libvirt
     • QEMU
     • librbd
     • librados
     • OSDs and MONs
     OpenStack object storage - RGW flow:
     • S3/Swift APIs
     • RGW (radosgw)
     • librados
     • OSDs and MONs
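     As a hedged sketch of the plumbing behind this integration (pool names, PG counts and client names follow the Ceph documentation examples, not necessarily this cloud), the pools and cephx users are typically created like this:
       # ceph osd pool create volumes 128
       # ceph osd pool create images 128
       # ceph osd pool create vms 128
       # ceph auth get-or-create client.glance mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=images'
       # ceph auth get-or-create client.cinder mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=vms, allow rx pool=images'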
  11. 11. OpenStack - Recommendations
  12. 12. Glance Recommendations
     • What is Glance?
     • Configuration settings: /etc/glance/glance-api.conf
     • Use Ceph RBD as the Glance store:
       default_store = rbd
     • During boot from volume:
       • Disable local cache (change flavor = keystone+cachemanagement to flavor = keystone)
       • Exposing the image URL saves time, as the image does not need to be downloaded and copied:
         show_image_direct_url = True
         show_multiple_locations = True
     # glance --os-image-api-version 2 image-show 64b71b88-f243-4470-8918-d3531f461a26
     +------------------+-----------------------------------------------------------------+
     | Property         | Value                                                           |
     +------------------+-----------------------------------------------------------------+
     | checksum         | 24bc1b62a77389c083ac7812a08333f2                                |
     | container_format | bare                                                            |
     | created_at       | 2016-04-19T05:56:46Z                                            |
     | description      | Image Updated on 18th April 2016                                |
     | direct_url       | rbd://8a0021e6-3788-4cb3-8ada-                                  |
     |                  | 1f6a7b0d8d15/images/64b71b88-f243-4470-8918-d3531f461a26/snap   |
     | disk_format      | raw                                                             |
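     A minimal sketch of the RBD store settings in /etc/glance/glance-api.conf that usually accompany default_store = rbd; the pool and user names (images, glance) are assumptions, adjust to your deployment:
       [glance_store]
       stores = rbd
       default_store = rbd
       rbd_store_pool = images
       rbd_store_user = glance
       rbd_store_ceph_conf = /etc/ceph/ceph.conf
       rbd_store_chunk_size = 8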
  13. 13. Glance Recommendations - Image Format
     Use ONLY RAW images.
     With QCOW2 images:
     • Convert qcow2 to RAW image
     • Get the image UUID
     With RAW images (no conversion; saves time):
     • Get the image UUID
     Boot time comparison:
     Image Size (GB)   Format   VM Boot Time (Approx.)
     50 (Windows)      QCOW2    ~ 45 minutes
     50 (Windows)      RAW      ~ 1 minute
     6 (Linux)         QCOW2    ~ 2 minutes
     6 (Linux)         RAW      ~ 1 minute
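     A minimal sketch of the conversion and upload steps described above; file and image names are placeholders:
       # convert the QCOW2 image to RAW, then upload it to Glance and note the UUID
       $ qemu-img convert -f qcow2 -O raw ubuntu-16.04.qcow2 ubuntu-16.04.raw
       $ openstack image create --disk-format raw --container-format bare --file ubuntu-16.04.raw ubuntu-16.04-raw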
  14. 14. Cinder Recommendations
     • What is Cinder?
     • Configuration settings: /etc/cinder/cinder.conf
     • Enable Ceph as backend:
       enabled_backends = ceph
     • Cinder backup - Ceph supports incremental backups:
       backup_driver = cinder.backup.drivers.ceph
       backup_ceph_conf = /etc/ceph/ceph.conf
       backup_ceph_user = cinder
       backup_ceph_chunk_size = 134217728
       backup_ceph_pool = backups
       backup_ceph_stripe_unit = 0
       backup_ceph_stripe_count = 0
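     A hedged sketch of the [ceph] backend section referenced by enabled_backends = ceph; pool, user and backend names are assumptions, and the secret UUID must match the libvirt secret on the compute nodes:
       [ceph]
       volume_driver = cinder.volume.drivers.rbd.RBDDriver
       volume_backend_name = ceph
       rbd_pool = volumes
       rbd_ceph_conf = /etc/ceph/ceph.conf
       rbd_user = cinder
       rbd_secret_uuid = <libvirt-secret-uuid>
       rbd_flatten_volume_from_snapshot = false
       rbd_max_clone_depth = 5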
  15. 15. Nova Recommendations
     • What is Nova?
     • Configuration settings: /etc/nova/nova.conf
     • Use librbd/librados (instead of krbd).
     [libvirt]
     # enable discard support (be careful of perf)
     hw_disk_discard = unmap
     # disable password injection
     inject_password = false
     # disable key injection
     inject_key = false
     # disable partition injection
     inject_partition = -2
     # make QEMU aware so caching works
     disk_cachemodes = "network=writeback"
     live_migration_flag = "VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST"
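     To place Nova ephemeral disks on Ceph via librbd, the [libvirt] section also needs the RBD image settings; a sketch, with pool and user names as assumptions:
       [libvirt]
       images_type = rbd
       images_rbd_pool = vms
       images_rbd_ceph_conf = /etc/ceph/ceph.conf
       rbd_user = cinder
       rbd_secret_uuid = <libvirt-secret-uuid>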
  16. 16. Ceph - Recommendations
  17. 17. Performance Decision Factors
     • What is the required storage (usable/raw)?
     • How many IOPS?
       • Aggregated
       • Per VM (min/max)
     • Optimize for?
       • Performance
       • Cost
  18. 18. Ceph Cluster Optimization Criteria
     IOPS-Optimized
     • Lowest cost per IOPS
     • Highest IOPS
     • Meets minimum fault domain recommendation
     • Sample use cases: typically block storage; 3x replication
     Throughput-Optimized
     • Lowest cost per given unit of throughput
     • Highest throughput
     • Highest throughput per BTU
     • Highest throughput per watt
     • Meets minimum fault domain recommendation
     • Sample use cases: block or object storage; 3x replication for higher read throughput
     Capacity-Optimized
     • Lowest cost per TB
     • Lowest BTU per TB
     • Lowest watt per TB
     • Meets minimum fault domain recommendation
     • Sample use cases: typically object storage; erasure coding common for maximizing usable capacity
  19. 19. OSD Considerations
     • RAM
       o 1 GB of RAM per 1 TB of OSD space
     • CPU
       o 0.5 CPU cores / 1 GHz of a core per OSD (2 cores for SSD drives)
     • Ceph-mons
       o 1 ceph-mon node per 15-20 OSD nodes
     • Network
       o Ensure the sum of the total throughput of your OSD hard disks does not exceed the network bandwidth
     • Thread count
       o Hosts with high numbers of OSDs (e.g., > 20) may spawn a lot of threads during recovery and rebalancing
     (diagram: a single host running OSD.1 - OSD.6)
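     A back-of-the-envelope check of these rules against the storage nodes from slide 4 (22 OSDs of 3.6 TB each); the calculation is illustrative only, not from the deck:
       OSDS=22; OSD_TB=3.6
       echo "RAM : $(echo "$OSDS * $OSD_TB" | bc) GB needed (1 GB per TB) vs 128 GB installed"
       echo "CPU : $(echo "$OSDS * 0.5" | bc) cores needed (0.5 core per OSD) vs 2 * 12 cores installed"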
  20. 20. Ceph OSD Journal
     • Run operating systems, OSD data and OSD journals on separate drives to maximize overall throughput.
     • On-disk journals can halve write throughput.
     • Use SSD journals for high write throughput workloads.
     • Performance comparison with/without SSD journal using rados bench:
       o 100% write operation with 4 MB object size (default):
          On-disk journal: 45 MB/s
          SSD journal: 80 MB/s
     • Note: the above results are with a 1:11 SSD:OSD ratio.
     • Recommended to use 1 SSD per 4 - 6 OSDs for better results.
     Op Type            No SSD   SSD
     Write (MB/s)       45       80
     Seq Read (MB/s)    73       140
     Rand Read (MB/s)   55       655
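     A sketch of the rados bench runs behind numbers like these (pool name and PG count are examples; --no-cleanup keeps the written objects so the read tests have data to read):
       # ceph osd pool create testbench 128 128
       # rados bench -p testbench 60 write --no-cleanup
       # rados bench -p testbench 60 seq
       # rados bench -p testbench 60 rand
       # rados -p testbench cleanup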
  21. 21. OS Considerations
     • Kernel: latest stable release.
     • BIOS: enable HT (Hyper-Threading) and VT (Virtualization Technology).
     • Kernel PID max:
       # echo "4194303" > /proc/sys/kernel/pid_max
     • Read ahead: set on all block devices:
       # echo "8192" > /sys/block/sda/queue/read_ahead_kb
     • Swappiness:
       # echo "vm.swappiness = 0" | tee -a /etc/sysctl.conf
     • Disable NUMA balancing: pass numa_balancing=disable on the kernel command line, or control it via the kernel.numa_balancing sysctl:
       # echo 0 > /proc/sys/kernel/numa_balancing
     • CPU tuning: set the "performance" governor so the CPU always runs at 100% frequency:
       # echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
     • I/O scheduler:
       SATA/SAS drives: # echo "deadline" > /sys/block/sd[x]/queue/scheduler
       SSD drives:      # echo "noop" > /sys/block/sd[x]/queue/scheduler
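     The echo commands above do not survive a reboot; one hedged way to persist them (file names and udev match rules are examples, adjust for your distro):
       # cat > /etc/sysctl.d/90-ceph-tuning.conf <<'EOF'
       kernel.pid_max = 4194303
       vm.swappiness = 0
       kernel.numa_balancing = 0
       EOF
       # sysctl --system
       # cat > /etc/udev/rules.d/60-io-scheduler.rules <<'EOF'
       # rotational disks get deadline, SSDs get noop
       ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="deadline"
       ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="noop"
       EOF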
  22. 22. Ceph Deployment Network
  23. 23. Ceph Deployment Network
     • Each host should have at least two 1 Gbps network interface controllers (NICs).
     • Use 10G Ethernet.
     • Always use jumbo frames:
       # ifconfig ethx mtu 9000
       # echo "MTU=9000" | tee -a /etc/sysconfig/network-scripts/ifcfg-ethx
     • Use high-bandwidth connectivity between TOR switches and spine routers, e.g. 40 Gbps to 100 Gbps.
     • Hardware should have a Baseboard Management Controller (BMC).
     • Note: running three networks in HA mode may seem like overkill.
     (diagram: public network and cluster network on NIC-1 and NIC-2)
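     To verify that jumbo frames work end to end (9000-byte MTU minus 28 bytes of IP/ICMP headers), a quick check; the peer address is a placeholder:
       # ping -M do -s 8972 -c 3 <peer-ip>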
  24. 24. Ceph Deployment Network - NIC Bonding
     • In balance-alb bond mode, both NICs are used to send and receive traffic.
     • Test results with 2 x 10G NICs:
       • Active-passive bond mode, traffic between 2 nodes:
         Case#1: node-1 to node-2 => BW 4.80 Gb/s
         Case#2: node-1 to node-2 => BW 4.62 Gb/s
         Speed of one 10Gig NIC.
       • Balance-alb bond mode:
         Case#1: node-1 to node-2 => BW 8.18 Gb/s
         Case#2: node-1 to node-2 => BW 8.37 Gb/s
         Speed of two 10Gig NICs.
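     A hedged sketch of a balance-alb bond on a RHEL/CentOS-style system, in the same ifcfg style used on the previous slide; device names and addresses are examples:
       # /etc/sysconfig/network-scripts/ifcfg-bond0
       DEVICE=bond0
       TYPE=Bond
       BONDING_MASTER=yes
       BONDING_OPTS="mode=balance-alb miimon=100"
       MTU=9000
       IPADDR=10.0.0.11
       PREFIX=24
       BOOTPROTO=none
       ONBOOT=yes
       # /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1)
       DEVICE=eth0
       MASTER=bond0
       SLAVE=yes
       ONBOOT=yes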
  25. 25. Ceph Failure Domains
     • A failure domain is any failure that prevents access to one or more OSDs.
     • Weigh the added cost of isolating every potential failure domain.
     • Failure domains:
       • osd
       • host
       • chassis
       • rack
       • row
       • pdu
       • pod
       • room
       • datacenter
       • region
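     The failure domain is selected through the CRUSH rule a pool uses; a sketch of spreading replicas across racks (rule and pool names are examples; the pool option is crush_ruleset on pre-Luminous releases and crush_rule on later ones):
       # ceph osd crush rule create-simple replicated_rack default rack
       # ceph osd crush rule dump replicated_rack
       # ceph osd pool set volumes crush_ruleset 1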
  26. 26. Ceph Ops Recommendations
     Scrub and deep-scrub operations are very IO-consuming and can affect cluster performance.
     o Disable scrub and deep scrub:
       # ceph osd set noscrub
       set noscrub
       # ceph osd set nodeep-scrub
       set nodeep-scrub
     o After setting noscrub and nodeep-scrub, cluster health goes to WARN state:
       # ceph health
       HEALTH_WARN noscrub, nodeep-scrub flag(s) set
     o Enable scrub and deep scrub:
       # ceph osd unset noscrub
       unset noscrub
       # ceph osd unset nodeep-scrub
       unset nodeep-scrub
     o Configure scrub and deep scrub:
       osd_scrub_begin_hour = 0          # begin at this hour
       osd_scrub_end_hour = 24           # start last scrub at this hour
       osd_scrub_load_threshold = 0.05   # scrub only below this load
       osd_scrub_min_interval = 86400    # not more often than 1 day
       osd_scrub_max_interval = 604800   # not less often than 1 week
       osd_deep_scrub_interval = 604800  # scrub deeply once a week
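     Scrub settings can also be changed at runtime without restarting OSDs; a sketch (the threshold value is an example, and the daemon command must be run on the host carrying osd.0):
       # ceph tell osd.* injectargs '--osd_scrub_load_threshold 0.5'
       # ceph daemon osd.0 config get osd_scrub_load_threshold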
  27. 27. Ceph Ops Recommendations
     • Decreasing recovery and backfilling performance impact.
     • Settings for recovery and backfilling:
       'osd max backfills'        - maximum backfills allowed to/from an OSD [default 10]
       'osd recovery max active'  - recovery requests per OSD at one time [default 15]
       'osd recovery threads'     - the number of threads for recovering data [default 1]
       'osd recovery op priority' - priority for recovery ops [default 10]
     Note: decreasing these values slows down and prolongs recovery/backfill but reduces the impact on clients; increasing them speeds up recovery/backfill at the cost of client performance, and vice versa. (See the sketch below.)
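     A minimal sketch of throttling recovery during a rebalance and restoring the values afterwards (the restore values are the defaults listed above):
       # slow recovery/backfill down while clients are busy
       # ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
       # restore the defaults once client load drops
       # ceph tell osd.* injectargs '--osd-max-backfills 10 --osd-recovery-max-active 15 --osd-recovery-op-priority 10'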
  28. 28. Ceph Performance Measurement Guidelines
     For best measurement results, follow these rules while testing:
     • Change one option at a time.
     • Check what is changing.
     • Choose the right performance test for the changed option.
     • Re-test each change at least ten times.
     • Run tests for hours, not seconds.
     • Trace for any errors.
     • Look at results critically.
     • Always try to estimate the expected results and look at the standard deviation to eliminate spikes and false tests.
     Tuning:
     • Ceph clusters can be parametrized after deployment to better fit the requirements of the workload.
     • Some configuration options can affect data redundancy and have significant implications for the stability and safety of data.
     • Tuning should be performed on a test environment prior to issuing any command or configuration change on production.
  29. 29. Any questions?
  30. 30. Thank You
     Swami Reddy | swami.reddy@ril.com | swamireddy @ irc
     Satish | Satish.venkatsubramaniam@ril.com | satish @ irc
     Pandiyan M | Pandiyan.muthuraman@ril.com | maestropandy @ irc
  31. 31. Reference Links
     • Ceph documentation
     • Previous OpenStack summit presentations
     • Ceph Tech Talk
     • A few blogs on Ceph:
       • https://www.sebastien-han.fr/blog/categories/ceph/
       • https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
  32. 32. Appendix
  33. 33. Ceph H/W Best Practices
     OSD HOST: 1 x 64-bit core / 1 x 32-bit dual core / 1 x i386 dual core; RAM: 1 GB per 1 TB of OSD storage
     MDS HOST: 1 x 64-bit core / 1 x 32-bit dual core / 1 x i386 dual core; RAM: 1 GB per daemon
     MON HOST: 1 x 64-bit core / 1 x 32-bit dual core / 1 x i386 dual core; RAM: 1 GB per daemon
  34. 34. HDD, SSD, Controllers
     • Ceph best practice is to run operating systems, OSD data and OSD journals on separate drives.
     Hard Disk Drives (HDD)
     • Minimum hard disk drive size of 1 terabyte.
     • ~1 GB of RAM per 1 TB of storage space.
     • NOTE: it is NOT a good idea to run:
       1. multiple OSDs on a single disk.
       2. an OSD and a monitor or metadata server on a single disk.
     Solid State Drives (SSD)
     • Use SSDs to improve performance.
     Controllers
     • Disk controllers also have a significant impact on write throughput.
  35. 35. Ceph OSD Journal - Results Write operations
  36. 36. Ceph OSD Journal - Results Seq Read operations
  37. 37. Ceph OSD Journal - Results Read operations
