Best practices & Performance Tuning
OpenStack Cloud Storage with Ceph
OpenStack Summit Barcelona
25th Oct 2016 @17:05 - 17:45
Room: 118-119
Swami Reddy
RJIL
OpenStack & Ceph Dev
Pandiyan M
RJIL
OpenStack Dev
Who are we?
Agenda
• Ceph - Quick Overview
• OpenStack Ceph Integration
• OpenStack - Recommendations
• Ceph - Recommendations
• Q & A
• References
Cloud Environment Details
Cloud environment with 200 nodes for general-purpose use cases.
~2500 VMs - 40 TB RAM and 5120 cores - on 4 PB of raw storage.
• Average boot volume sizes
o Linux VMs - 20 GB
o Windows VMs – 100 GB
• Average data Volume sizes: 200 GB
Compute (~160 nodes)
• CPU : 2 * 16 cores @ 2.60 GHz
• RAM : 256 GB
• HDD : 3.6 TB (OS Drive)
• NICs : 2 * 10 Gbps, 2 * 1 Gbps
• Overprovision: CPU - 1:8
RAM - 1:1
Storage (~44 nodes)
• CPU : 2 * 12 cores @ 2.50 GHz
• RAM : 128 GB
• HDD : 2 * 1 TB (OS Drive)
• OSD : 22 * 3.6 TB
• SSD : 2 * 800 GB (Intel S3700)
• NICs : 2 * 10 Gbps , 2 * 1 Gbps
• Replication: 3
Ceph - Quick Overview
Ceph Overview
Design Goals
• Every component must scale
• No single point of failure
• Open source
• Runs on commodity hardware
• Everything must self-manage
Key Benefits
• Multi-node striping and redundancy
• COW cloning of images to volumes
• Live migration of Ceph-backed VMs
OpenStack - Ceph Integration
OpenStack - Ceph Integration
[Diagram: OpenStack services (Cinder, Glance, Nova via the QEMU/KVM hypervisor, and Swift clients) access the Ceph storage cluster (RADOS) through RBD and RGW]
OpenStack - Ceph Integration
OpenStack Block storage - RBD flow:
• libvirt
• QEMU
• librbd
• librados
• OSDs and MONs
OpenStack Object storage - RGW flow:
• S3/SWIFT APIs
• RGW
• librados
• OSDs and MONs
[Diagram: OpenStack -> libvirt (configures) -> QEMU -> librbd -> librados -> RADOS]
[Diagram: S3-compatible API / Swift-compatible API -> radosgw -> librados -> RADOS]
OpenStack - Recommendations
Glance Recommendations
• What is Glance ?
• Configuration settings: /etc/glance/glance-api.conf
• Use Ceph RBD as the Glance storage backend
• When booting from volumes:
• Disable the local image cache (cached images can fill the controller's disk)
• Exposing the image URL saves time, since image download and copy are NOT required
default_store=rbd
flavor = keystone (changed from flavor = keystone+cachemanagement to disable the local image cache)
show_image_direct_url = True
show_multiple_locations = True
# glance --os-image-api-version 2 image-show 64b71b88-f243-4470-8918-d3531f461a26
+------------------+-----------------------------------------------------------------+
| Property | Value |
+------------------+-----------------------------------------------------------------+
| checksum | 24bc1b62a77389c083ac7812a08333f2 |
| container_format | bare |
| created_at | 2016-04-19T05:56:46Z |
| description | Image Updated on 18th April 2016 |
| direct_url | rbd://8a0021e6-3788-4cb3-8ada- |
| | 1f6a7b0d8d15/images/64b71b88-f243-4470-8918-d3531f461a26/snap |
| disk_format | raw |
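A minimal sketch of the corresponding RBD store settings in glance-api.conf; the 'images' pool, 'glance' cephx user and chunk size are deployment-specific assumptions, not values from this deck:
[glance_store]
stores = rbd
default_store = rbd
rbd_store_pool = images              # Ceph pool holding the images
rbd_store_user = glance              # cephx user Glance authenticates as
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_chunk_size = 8             # image striping chunk size in MB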
Glance Recommendations
Image Format: Use ONLY RAW Images
With QCOW2 images:
• Convert the qcow2 image to RAW (see the example after the table below)
• Get the image UUID
With RAW images (no conversion; saves time):
• Get the image UUID
Image Size (GB)   Format   VM Boot Time (Approx.)
50 (Windows)      QCOW2    ~45 minutes
50 (Windows)      RAW      ~1 minute
6 (Linux)         QCOW2    ~2 minutes
6 (Linux)         RAW      ~1 minute
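A sketch of the one-time conversion and upload step for a qcow2 image; the file and image names are hypothetical, and the qemu-img/glance options shown are the standard ones:
# convert qcow2 to RAW (done once, before upload)
# qemu-img convert -f qcow2 -O raw myimage.qcow2 myimage.raw
# upload the RAW image to Glance
# glance image-create --name "myimage-raw" --disk-format raw --container-format bare --file myimage.raw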
Cinder Recommendations
• What is Cinder ?
• Configuration settings: /etc/cinder/cinder.conf
Enable Ceph as the volume backend
• Cinder Backup
Ceph supports incremental backups
enabled_backends = ceph
backup_driver = cinder.backup.drivers.ceph
backup_ceph_conf=/etc/ceph/ceph.conf
backup_ceph_user = cinder
backup_ceph_chunk_size = 134217728
backup_ceph_pool = backups
backup_ceph_stripe_unit = 0
backup_ceph_stripe_count = 0
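A minimal sketch of the matching RBD volume backend section that goes with the enabled_backends setting above; the 'volumes' pool, 'cinder' cephx user and libvirt secret UUID are deployment-specific assumptions:
[ceph]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_secret_uuid = <libvirt-secret-uuid>
rbd_flatten_volume_from_snapshot = false
rbd_max_clone_depth = 5
With the Ceph backup driver enabled, an incremental backup can then be requested from the CLI, e.g.:
# cinder backup-create --incremental --name vol1-backup <volume-id>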
Nova Recommendations
• What is Nova ?
• Configuration settings: /etc/nova/nova.conf
• Use librbd/librados (instead of krbd).
[libvirt]
# enable discard support (be careful of perf)
hw_disk_discard = unmap
# disable password injection
inject_password = false
# disable key injection
inject_key = false
# disable partition injection
inject_partition = -2
# make QEMU aware so caching works
disk_cachemodes = "network=writeback"
live_migration_flag="VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,
VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST"
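A minimal sketch of the RBD settings that usually accompany the above in the same [libvirt] section; the 'vms' pool for ephemeral disks and the reuse of the Cinder cephx user/secret are deployment-specific assumptions:
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <libvirt-secret-uuid>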
Ceph - Recommendations
Performance Decision Factors
• What is the required storage (usable vs. raw)?
• How many IOPS?
• Aggregated
• Per VM (min/max)
• Optimization for?
• Performance
• Cost
Ceph Cluster Optimization Criteria
Cluster Optimization Criteria | Properties | Sample Use Cases

IOPS - Optimized
• Properties:
o Lowest cost per IOPS
o Highest IOPS
o Meets minimum fault domain recommendation
• Sample use cases:
o Typically block storage
o 3x replication

Throughput - Optimized
• Properties:
o Lowest cost per given unit of throughput
o Highest throughput
o Highest throughput per BTU
o Highest throughput per watt
o Meets minimum fault domain recommendation
• Sample use cases:
o Block or object storage
o 3x replication for higher read throughput

Capacity - Optimized
• Properties:
o Lowest cost per TB
o Lowest BTU per TB
o Lowest watt per TB
o Meets minimum fault domain recommendation
• Sample use cases:
o Typically object storage
o Erasure coding common for maximizing usable capacity
OSD Considerations
• RAM
o ~1 GB of RAM per 1 TB of OSD space
• CPU
o 0.5 CPU cores / 1 GHz of a core per OSD (2 cores for SSD drives)
• Ceph-mons
o 1 ceph-mon node per 15-20 OSD nodes
• Network
o Ensure the sum of the total throughput of your OSD disks does not exceed the network bandwidth
• Thread count
o Hosts with a high number of OSDs (e.g., > 20) may spawn a lot of threads during recovery and rebalancing
[Diagram: one storage host running multiple OSD daemons (OSD.1 - OSD.6); a worked sizing example follows]
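As a rough sanity check, applying these rules to the storage nodes from the environment slide (22 * 3.6 TB OSDs per node); the ~150 MB/s per-HDD figure is an assumption, not a measured value:
RAM: 22 OSDs * 3.6 TB * 1 GB/TB ~= 80 GB (the node has 128 GB)
CPU: 22 OSDs * 0.5 cores ~= 11 cores (the node has 2 * 12 cores)
Network: 22 HDDs * ~150 MB/s ~= 3.3 GB/s ~= 26 Gbps, more than a single 10 Gbps link - hence the bonded 2 * 10 Gbps NICs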
Ceph OSD Journal
• Run operating systems, OSD data and OSD journals on separate drives to maximize overall throughput.
• On-disk journals can halve write throughput.
• Use SSD journals for high write-throughput workloads (see the provisioning sketch after the table below).
• Performance comparison with/without SSD journal, using rados bench
o 100% write operations with 4 MB object size (default):
- On-disk journal: 45 MB/s
- SSD journal: 80 MB/s
• Note: The above results were obtained with a 1:11 SSD:OSD ratio
• A ratio of 1 SSD per 4-6 OSDs is recommended for better results
Op Type            Without SSD journal   With SSD journal
Write (MB/s)       45                    80
Seq Read (MB/s)    73                    140
Rand Read (MB/s)   55                    655
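A sketch of how an OSD with a separate SSD journal was typically created with the ceph-disk tooling of that (pre-BlueStore) era; the device names are hypothetical, and ceph-disk carves the journal partition out of the SSD itself:
# /dev/sdd is the data HDD, /dev/sdb is the shared journal SSD
# ceph-disk prepare /dev/sdd /dev/sdb
# ceph-disk activate /dev/sdd1   (udev usually triggers activation automatically)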
OS Considerations
• Kernel: Latest stable release
• BIOS : Enable HT (Hyper-Threading) and VT (Virtualization Technology).
• Kernel PID max:
• Read ahead: Set on all block devices
• Swappiness:
• Disable NUMA balancing: pass the numa_balancing=disable parameter on the kernel command line.
• The same behaviour can also be controlled at runtime via the kernel.numa_balancing sysctl:
• CPU Tuning: Set the "performance" governor so the CPU always runs at 100% frequency.
• I/O Scheduler:
# echo "4194303" > /proc/sys/kernel/pid_max
# echo "8192" > /sys/block/sda/queue/read_ahead_kb
# echo "vm.swappiness = 0" | tee -a /etc/sysctl.conf
# echo 0 > /proc/sys/kernel/numa_balancing
SATA/SAS Drives: # echo "deadline" > /sys/block/sd[x]/queue/scheduler
SSD Drives : # echo "noop" > /sys/block/sd[x]/queue/scheduler
# echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Ceph Deployment Network
Ceph Deployment Network
• Each host should have at least two 1 Gbps network interface controllers (NICs).
• Use 10G Ethernet
• Always use jumbo frames
• High-bandwidth connectivity between TOR switches and spine routers, for example 40 Gbps to 100 Gbps
• Hardware should have a Baseboard Management Controller (BMC)
• Note: Running three networks (public, cluster, BMC) in HA mode may seem like overkill, but each traffic path is a potential capacity or performance bottleneck
[Diagram: NIC-1 on the public network, NIC-2 on the cluster network]
# ifconfig ethx mtu 9000
# echo "MTU=9000" | tee -a /etc/sysconfig/network-scripts/ifcfg-ethx
Ceph Deployment Network
• NIC bonding: in balance-alb mode both NICs are used to send and receive traffic (a sample bond configuration follows the results below)
• Test results with 2 x 10G NICs:
• Active-passive bond mode:
Traffic between 2 nodes:
Case #1: node-1 to node-2 => BW 4.80 Gb/s
Case #2: node-1 to node-2 => BW 4.62 Gb/s
• Limited to the speed of one 10GigE NIC
• Balance-alb bond mode:
• Case #1: node-1 to node-2 => BW 8.18 Gb/s
• Case #2: node-1 to node-2 => BW 8.37 Gb/s
• Uses the speed of both 10GigE NICs
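A sketch of a balance-alb bond on a RHEL/CentOS-style system; the interface names, addressing and network-scripts layout are assumptions, not values from this deck:
# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BOOTPROTO=none
ONBOOT=yes
MTU=9000
BONDING_OPTS="mode=balance-alb miimon=100"
# each slave interface (e.g. ifcfg-eth0, ifcfg-eth1) then carries MASTER=bond0 and SLAVE=yes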
Ceph Failure Domains
• A failure domain is any failure that prevents access to one or more OSDs.
There are added costs to isolating every potential failure domain; a CRUSH rule sketch follows the list below.
Failure domains:
• osd
• host
• chassis
• rack
• row
• pdu
• pod
• room
• datacenter
• region
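A sketch of pinning a pool's failure domain to rack with the CRUSH tooling of that era; the default CRUSH root, the 'volumes' pool and the rule-id placeholder are assumptions:
# create a replicated rule that separates replicas across racks
# ceph osd crush rule create-simple replicated_rack default rack
# point an existing pool at the new rule (pre-Luminous option name)
# ceph osd pool set volumes crush_ruleset <rule-id>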
Ceph Ops Recommendations
Scrub and deep-scrub operations are very I/O intensive and can affect cluster performance.
o Disable scrub and deep scrub
o After setting noscrub and nodeep-scrub, ceph health reports a HEALTH_WARN state
o Enable scrub and deep scrub
o Configure scrub and deep scrub
#ceph osd set noscrub
set noscrub
#ceph osd set nodeep-scrub
set nodeep-scrub
#ceph health
HEALTH_WARN noscrub, nodeep-scrub flag(s) set
# ceph osd unset noscrub
unset noscrub
# ceph osd unset nodeep-scrub
unset nodeep-scrub
osd_scrub_begin_hour = 0 # begin scrubbing at this hour
osd_scrub_end_hour = 24 # do not start scrubs after this hour
osd_scrub_load_threshold = 0.05 # scrub only when system load is below this value
osd_scrub_min_interval = 86400 # not more often than once a day
osd_scrub_max_interval = 604800 # not less often than once a week
osd_deep_scrub_interval = 604800 # deep-scrub once a week
Ceph Ops Recommendations
• Decreasing the performance impact of recovery and backfilling
• Settings for recovery and backfilling (see the sketch below):
Note: Decreasing these values slows down recovery/backfill and prolongs the recovery process; increasing them speeds up recovery/backfill but reduces client performance, and vice versa.
'osd max backfills' - maximum backfills allowed to/from an OSD [default 10]
'osd recovery max active' - recovery requests per OSD at one time [default 15]
'osd recovery threads' - the number of threads for recovering data [default 1]
'osd recovery op priority' - priority for recovery ops [default 10]
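A sketch of throttling recovery on a running cluster so that client I/O takes priority; the values are illustrative, not tested recommendations from this deck:
# apply at runtime to all OSDs
# ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# make it persistent in ceph.conf
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1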
Ceph Performance Measurement Guidelines
For best measurement results, follow these rules while testing:
• Change one option at a time.
• Check what is actually changing.
• Choose the right performance test for the changed option (see the rados bench sketch below).
• Re-test the changes - at least ten times.
• Run tests for hours, not seconds.
• Trace for any errors.
• Look at the results critically.
• Always estimate the expected results and look at the standard deviation to eliminate spikes and false tests.
Tuning:
• Ceph clusters can be parametrized after deployment to better fit the requirements of the workload.
• Some configuration options can affect data redundancy and have significant implications for the stability and safety of data.
• Tuning should be performed on a test environment before issuing any command or configuration change on production.
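A sketch of a basic rados bench sequence against a dedicated test pool; the pool name 'testpool' and the 10-minute duration are illustrative choices:
# 10-minute write test; keep the objects so they can be read back
# rados bench -p testpool 600 write --no-cleanup
# sequential and random read tests against the objects written above
# rados bench -p testpool 600 seq
# rados bench -p testpool 600 rand
# remove the benchmark objects afterwards
# rados -p testpool cleanup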
Any questions?
Thank You
Swami Reddy | swami.reddy@ril.com | swamireddy @ irc
Satish | Satish.venkatsubramaniam@ril.com | satish @ irc
Pandiyan M | Pandiyan.muthuraman@ril.com | maestropandy @ irc
Reference Links
• Ceph documentation
• Previous OpenStack Summit presentations
• Tech Talk Ceph
• A few blogs on Ceph
• https://www.sebastien-han.fr/blog/categories/ceph/
• https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf
Appendix
Ceph H/W Best Practices
OSD host:
• CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
• RAM: 1 GB per 1 TB of OSD storage
MDS host:
• CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
• RAM: 1 GB per daemon
MON host:
• CPU: 1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core
• RAM: 1 GB per daemon
HDD, SSD, Controllers
• Ceph best practice is to run operating systems, OSD data and OSD journals on separate drives.
Hard Disk Drives (HDD)
• Minimum hard disk drive size of 1 terabyte.
• ~1 GB of RAM per 1 TB of storage space.
NOTE: It is NOT a good idea to run:
1. multiple OSDs on a single disk.
2. an OSD and a monitor or metadata server on the same disk.
Solid State Drives (SSD)
• Use SSDs to improve performance.
Controllers
• Disk controllers also have a significant impact on write throughput.
Ceph OSD Journal - Results
• [Chart: write operations with and without SSD journal]
• [Chart: sequential read operations with and without SSD journal]
• [Chart: read operations with and without SSD journal]
Editor's Notes
  1. Hello, good evening all. Today we will be talking about "Best practices and performance tuning for OpenStack + Ceph". So, how many of you are using Ceph? How many of you plan to use Ceph in the near future?
  2. I am Swami, working with RJIL. I have been working on OpenStack and Ceph projects for the last 3 years. My key responsibilities include managing multiple Ceph storage clusters for OpenStack clouds. I have 15+ years of experience with open-source projects such as Linux and the GNU GCC tools. Now I would like to introduce my colleague Mr Pandiyan, who is an ATC in OpenStack and one of the active members of the India OpenStack community.
  3. Moving to the agenda: the rest of the talk covers best practices and performance recommendations for OpenStack with Ceph. First a quick overview of Ceph, then the OpenStack integration with Ceph, and then the recommended settings for OpenStack and Ceph. After that we can take questions.
  4. A typical general-purpose cloud environment with 200 nodes spread across DCs with compute, block and object storage use cases. We have around 2.5K VMs with a compute capacity of 40 TB RAM and 5120 CPU cores, and a raw storage capacity of 4 PB. On average we use 20 GB and 100 GB boot volumes for Linux and Windows respectively. Additionally we use 200 GB data volumes on average. The table below shows the compute node and storage node details.
  6. In this section I will do a quick recap of Ceph.
  7. Ceph is a distributed storage system designed to provide excellent performance, reliability and scalability, and it delivers object, block and file storage in one unified system. Ceph block devices (RBD) are thin-provisioned, resizable, and store data striped over multiple OSDs in a Ceph cluster, and they leverage RADOS capabilities such as snapshotting, replication and consistency. Ceph's RADOS Block Devices (RBD) interact with OSDs using kernel modules or the librbd library. Ceph Object Storage uses the Ceph Object Gateway daemon (radosgw), which is a FastCGI module for interacting with a Ceph storage cluster; it provides interfaces compatible with the OpenStack Swift and Amazon S3 APIs.
  8. In this section I will be talking about the OpenStack-Ceph integration: how OpenStack components interact with Ceph components.
  9. Cinder, the OpenStack Block Storage service, provides persistent block storage resources and is backed by Ceph RBD; basically, Cinder is used to create volumes in RBD. Glance is the OpenStack Image service, used to store images and maintain a catalog of available images, and it is also backed by Ceph RBD. Nova is the OpenStack Compute service; you can use Nova to host and manage cloud computing systems, and Nova attaches/detaches volumes. Object Storage is a robust, highly scalable and fault-tolerant storage platform for unstructured data such as objects, and it is backed by Ceph RGW.
  10. Basically, Cinder is used to create volumes in RBD. Nova is the OpenStack Compute service; you can use Nova to host and manage cloud computing systems, and Nova attaches/detaches volumes. Object Storage is backed by Ceph RGW. The Ceph Object Gateway integrates with Keystone, the OpenStack identity service; this sets up the gateway to accept Keystone as the users' authority. A user that Keystone authorizes to access the gateway is also automatically created on the Ceph Object Gateway (if it did not exist beforehand). A token that Keystone validates is considered valid by the gateway. A Ceph Object Gateway user is mapped to a Keystone tenant. A Keystone user can have different roles assigned on more than a single tenant. When the Ceph Object Gateway gets the ticket, it looks at the tenant and the user roles assigned to that ticket, and accepts or rejects the request according to the "rgw keystone accepted roles" configurable.
  11. In this section I will cover a few OpenStack component recommendations, even though more recommendations come from Ceph in the next section.
  12. During boot from volume, images are downloaded to the controller and cached in the "cached" location by default. If we spawn multiple VMs from large images, all of those images get cached and eventually consume the controller's disk space, which can cause the controller to stop its operations.
  13. Ceph internally stores images in RAW format, so it is optimal for Glance to use RAW images (instead of qcow2).
  14. As we discussed, Cinder is the OpenStack Block Storage service. In this section there are not many recommendations, except: use the cinder-backup service with the Ceph backend to get the incremental backup functionality supported by Ceph. Here are the default cinder-backup configurations.
  15. As you already know, Nova is the OpenStack Compute service. For Nova there are no Ceph-specific recommendations I can give at the moment. It is good to use krbd instead of librbd to get the page cache functionality supported by the kernel.
  16. In this section we will discuss Ceph-specific recommendations in detail.
  17. Performance always depends on the use case, so we need to answer these questions. What are the storage needs (raw vs. usable storage)? For example, if more usable storage is needed, a lower replication factor, etc. What are the IOPS needs? What do we optimize for: cost or performance? It is always a big challenge to achieve the best performance at low cost, so one usually has to be traded off against the other.
  18. In this slide I will talk about a few optimization profiles and their criteria. The table is self-explanatory, so for the sake of time I won't discuss it further here; please go through the table.
  19. Now we will talk about the Ceph OSD. The OSD is the object storage daemon for Ceph storage; it is responsible for storing objects on a local file system and providing access to them over the network. This slide shows the minimum OSD requirements for CPU and RAM. Ceph-mon is the cluster monitor daemon; a Ceph monitor always refers to the local copy of the monmap when discovering other monitors in the cluster, to maintain consistency. Check all OSD disks' throughput against the network throughput: the aggregate OSD throughput should not exceed the network throughput. Using more OSDs per server may hit thread-count limits, because OSDs need more threads during rebalance, recovery and other activities.
  20. Ceph OSDs use a journal for two reasons: speed and consistency. Speed: the journal enables the Ceph OSD daemon to commit small writes quickly. Consistency: Ceph OSD daemons require a filesystem interface that guarantees atomic compound operations. Every few seconds, between "filestore max sync interval" and "filestore min sync interval", the Ceph OSD daemon stops writes and synchronizes the journal with the filesystem, allowing Ceph OSD daemons to trim operations from the journal and reuse the space. There are two types of OSD journal: on-disk and SSD. In general the on-disk journal shows lower performance compared with an SSD journal. Here are the quick results from our environment: for 100% write operations we saw 40 MB/s using the on-disk journal and 85 MB/s using the SSD journal, with a 1:11 SSD:OSD ratio. For better performance it is recommended to use a 1:[4-6] SSD:OSD ratio.
  21. Here are a few operating system recommendations. Please refer to the slide (it is self-explanatory).
  22. Now we will discuss the Ceph network considerations. Here is the standard Ceph network diagram taken from the official Ceph docs. It is always recommended to use separate public and cluster networks. Ceph internally does a lot of activities like scrub, deep-scrub, recovery, etc., which should not impact the public/user network.
  23. To support separate networks, each Ceph node should have at least 2 NICs: one for the public network and the other for the cluster network. It is recommended to use jumbo frames across the network.
  24. In our cloud environment we have done NIC bonding in balance-alb mode, which showed better performance. Here are the results:
  25. Now we will talk about Ceph failure domain selection. By definition, a failure domain is any failure that prevents access to one or more OSDs. Ceph maps objects (i.e. PGs) to OSDs across failure domains. Here is the list of failure domains; it is recommended to use chassis or rack for a durable cluster. A failure could be a stopped daemon on a host, a hard disk failure, an OS crash, a malfunctioning NIC, a failed power supply, a network outage, a power outage, etc.
  26. To verify the integrity of data, Ceph uses a mechanism called scrubbing. Ceph ensures data integrity by scrubbing placement groups. Light scrubbing (daily) checks the object size and attributes. Deep scrubbing (weekly) reads the data and uses checksums to ensure data integrity. Scrubbing is important for maintaining data integrity, but it can reduce performance. We can adjust the settings shown on this slide to increase or decrease scrubbing operations.
  27. Here we talk about recovery and backfill considerations, for when an OSD goes into a recovery state for various reasons. To maintain operational performance, Ceph performs recovery with limits on the number of recovery requests, threads and object chunk sizes, which allows Ceph to perform well in a degraded state.