5. What is Ceph?
• Ceph is a free & open distributed storage
platform that provides unified object, block,
and file storage.
6. Object Storage
• Native RADOS objects:
– hash-based placement
– synchronous replication
– radosgw provides S3 and Swift compatible
REST access to RGW objects (striped over
multiple RADOS objects).
7. Block storage
• RBD block devices are thinly provisioned
over RADOS objects and can be accessed
by QEMU via the librbd library.
8. File storage
• CephFS metadata servers (MDS) provide
a POSIX-compliant file system backed by
RADOS.
– …still experimental?
9. How does Ceph fit with OpenStack?
• RBD drivers for OpenStack tell libvirt to
configure the QEMU interface to librbd
• Ceph benefits:
– Multi-node striping and redundancy for block
storage (Cinder volumes, Nova ephemeral
drives, Glance images)
– Copy-on-write cloning of images to volumes
– Unified pool of storage nodes for all types of
data (objects, block devices, files)
– Live migration of Ceph-backed VMs
11. Questions
• How much net storage, in what tiers?
• How many IOPS?
– Aggregated
– Per VM (average)
– Per VM (peak)
• What to optimize for?
– Cost
– Performance
12. Rules of thumb and magic numbers #1
• Ceph-OSD sizing:
– Disks
• 8-10 SAS HDDs per 1 x 10G NIC
• ~12 SATA HDDs per 1 x 10G NIC
• 1 x SSD for write journal per 4-6 OSD drives
– RAM:
• 1 GB of RAM per 1 TB of OSD storage space
– CPU:
• 0.5 CPU cores (or 1 GHz of a core) per OSD disk (1-2 CPU cores for SSD drives)
• Ceph-mon sizing:
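The OSD rules of thumb above are simple enough to apply mechanically. The following is a minimal sketch, not a sizing tool: the function name and the choice of the conservative end of each range (8 SAS drives per 10G NIC) are assumptions made for illustration.

```python
import math

def osd_node_requirements(osd_drives, drive_tb, drive_type="SAS"):
    """Apply the OSD rules of thumb to one node (illustrative helper)."""
    # 8-10 SAS HDDs or ~12 SATA HDDs per 10G NIC; assume the conservative end for SAS.
    drives_per_nic = 8 if drive_type == "SAS" else 12
    return {
        "nics_10g": math.ceil(osd_drives / drives_per_nic),
        # 1 journal SSD per 4-6 OSD drives, returned as a (min, max) range.
        "journal_ssds": (math.ceil(osd_drives / 6), math.ceil(osd_drives / 4)),
        # 1 GB of RAM per 1 TB of OSD storage space.
        "ram_gb": math.ceil(osd_drives * drive_tb),
        # 0.5 CPU cores per OSD disk (HDD case only).
        "cpu_cores": math.ceil(osd_drives * 0.5),
    }

print(osd_node_requirements(28, 4, drive_type="SATA"))
```

For a node with 28 x 4 TB SATA drives this gives 3 x 10G NICs, 5 to 7 journal SSDs, 112 GB of RAM and 14 CPU cores as minimums under these rules.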
13. Rules of thumb and magic numbers #2
• For a typical “unpredictable” workload we usually assume:
– 70/30 read/write IOPS proportion
– ~4-8KB random read pattern
• To estimate Ceph IOPS efficiency we usually
take*:
– 4-8K Random Read - 0.88
– 4-8K Random Write - 0.64
* Based on benchmark data and semi-empirical evidence
15. Sizing example #1 – Ceph-mon
• Ceph monitors
– Sample spec:
• HP DL360 Gen9
• CPU: 2620v3 (6 cores)
• RAM: 64 GB
• HDD: 1 x 1 TB SATA
• NIC: 1 x 1 Gig NIC, 1 x 10 Gig
– How many?
• 1 x ceph-mon node per 15-20 OSD nodes
16. Sizing example #2 – Ceph-OSD
• Lower density, performance optimized
– Sample spec:
• HP DL380 Gen9
• CPU: 2620v3 (6 cores)
• RAM: 64 GB
• HDD (OS drives): 2 x SATA 500 GB
• HDD (OSD drives): 20 x SAS 1.8 TB (10k RPM, 2.5 inch – HGST C10K1800)
• SSD (write journals): 4 x SSD 128 GB Intel S3700
• NIC: 1 x 1 Gig NIC, 4 x 10 Gig (2 bonds – 1 for ceph-public and 1 for ceph-replication)
– 1000 / (20 * 1.8 * 0.85) * 3 ≈ 98 servers to serve 1 petabyte net
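The server-count arithmetic can be written as a small function. This is a sketch under stated assumptions: the slide does not label its constants, so reading 0.85 as a usable-fill ratio and 3 as the replication factor is an interpretation, and the function and parameter names are hypothetical.

```python
def servers_for_net_capacity(net_tb, drives_per_server, drive_tb,
                             fill_ratio=0.85, replicas=3):
    """Estimate OSD servers needed for a given net capacity (unrounded)."""
    # Usable capacity of one server: drive count * drive size * assumed fill ratio.
    usable_tb_per_server = drives_per_server * drive_tb * fill_ratio
    # Every net terabyte is stored `replicas` times (assumed meaning of the "* 3").
    return net_tb * replicas / usable_tb_per_server

# 1 petabyte net, taken as 1000 TB as on the slide.
print(round(servers_for_net_capacity(1000, 20, 1.8)))  # performance-optimized spec
print(round(servers_for_net_capacity(1000, 28, 4)))    # cost-optimized spec used later
```

The slides round to the nearest server (98 and 32 respectively); rounding up with `math.ceil` instead would be the stricter choice when the capacity target is a hard minimum.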
17. Expected performance Level
• Conservative drive rating: 250 IOPS
• Cluster read rating:
– 250*20*98*0.88 = 431,200 IOPS
– Approx. 800 IOPS of capacity is available per VM
• Cluster write rating:
– 250*20*98*0.64 = 313,600 IOPS
– Approx. 600 IOPS of capacity is available per VM
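The cluster ratings above are the product of per-drive IOPS, drives per server, server count, and the efficiency factor from the rules of thumb (0.88 for random reads, 0.64 for random writes). A minimal sketch, with a hypothetical function name:

```python
def cluster_iops(drive_iops, drives_per_server, servers, efficiency):
    """Cluster IOPS rating: raw drive IOPS scaled by an efficiency factor."""
    return drive_iops * drives_per_server * servers * efficiency

READ_EFFICIENCY = 0.88   # 4-8K random read, from the rules of thumb
WRITE_EFFICIENCY = 0.64  # 4-8K random write, from the rules of thumb

print(round(cluster_iops(250, 20, 98, READ_EFFICIENCY)))
print(round(cluster_iops(250, 20, 98, WRITE_EFFICIENCY)))
```

The per-VM figures depend on the number of VMs the cluster is expected to serve, which the slide does not state, so they are not computed here.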
18. Sizing example #3 – Ceph-OSD
• Higher density, cost optimized
– Sample spec:
• SuperMicro 6048R
• CPU: 2 x 2630v3 (16 cores)
• RAM: 192 GB
• HDD (OS drives): 2 x SATA 500 GB
• HDD (OSD drives): 28 x SATA 4 TB (7.2k RPM, 3.5 inch – HGST 7K4000)
• SSD (write journals): 6 x SSD 128 GB Intel S3700
• NIC: 1 x 1 Gig NIC, 4 x 10 Gig (2 bonds – 1 for ceph-public and 1 for ceph-replication)
– 1000 / (28 * 4 * 0.85) * 3 ≈ 32 servers to serve 1 petabyte net
19. Expected performance Level
• Conservative drive rating: 150 IOPS
• Cluster read rating:
– 150*32*28*0.88 = 118,272 IOPS
– Approx. 200 IOPS of capacity is available per VM
• Cluster write rating:
– 150*32*28*0.64 = 86,016 IOPS
– Approx. 150 IOPS of capacity is available per VM
20. What about SSDs?
• In Firefly (0.80.x) an OSD is limited to ~5k
IOPS, so full-SSD Ceph cluster performance
disappoints
• In Giant (0.87.x) ~30k IOPS per OSD should be
reachable, but this has yet to be verified
22. How we deploy (with Fuel)
Select storage options ⇒ Assign roles to nodes ⇒ Allocate
disks:
23. What we now know #1
• RBD is not suitable for IOPS-heavy workloads:
– Realistic expectations:
• 100 IOPS per VM (OSD with HDD disks)
• 10K IOPS per VM (OSD with SSD disks, pre-Giant Ceph)
• 10-40 ms latency expected and accepted
• High CPU load on OSD nodes at high IOPS (raising CPU requirements)
• Object storage for tenants and applications has limitations
– Swift API gaps in radosgw:
• bulk operations
• custom account metadata
• static website
• expiring objects
• object versioning
24. What we now know #2
• NIC bonding
– Use Linux bonds, not OVS bonds. OVS underperforms and eats CPU
• Ceph and OVS
– DON’T. Bridges for Ceph networks should be connected directly to
the NIC bond, not via the OVS integration bridge
• Nova evacuate doesn’t work with Ceph RBD-backed VMs
– Fixed in Kilo
• Ceph network co-location (with Management, public, etc.)
– DON’T. Use dedicated NICs/NIC pairs for Ceph-public and Ceph-internal (replication)
• Tested ceph.conf
– Enable RBD cache and others: https://bugs.launchpad.net/fuel/+bug/1374969
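As an illustration of what enabling the RBD cache looks like in ceph.conf, the sketch below renders a `[client]` section with Python's `configparser`. The tested settings from the linked bug are not reproduced here: `rbd cache` and `rbd cache writethrough until flush` are standard Ceph client options, but the selection, the values, and the helper name are assumptions.

```python
import configparser
import io

# Assumed example options; not the tested set from the linked bug report.
ASSUMED_CLIENT_OPTIONS = {
    "rbd cache": "true",
    "rbd cache writethrough until flush": "true",
}

def render_client_section(options):
    """Render a ceph.conf-style [client] section as text (illustrative helper)."""
    parser = configparser.ConfigParser()
    parser["client"] = options
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()

print(render_client_section(ASSUMED_CLIENT_OPTIONS))
```

The rendered text would be merged into ceph.conf on the compute nodes, since the cache lives in the librbd client rather than on the OSDs.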