Your 1st Ceph cluster

Rules of thumb
and magic numbers

Agenda
• Ceph overview
• Hardware planning
• Lessons learned

Who are we?
• Udo Seidel
• Greg Elkinbard
• Dmitriy Novakovskiy

What is Ceph?
• Ceph is a free & open distributed storage
platform that provides unified object, block,
and file storage.

Object Storage
• Native RADOS objects:
– hash-based placement
– synchronous replication
– radosgw provides S3 and Swift compatible
REST access to RGW objects (striped over
multiple RADOS objects).

Block storage
• RBD block devices are thinly provisioned
over RADOS objects and can be accessed
by QEMU via the librbd library.

File storage
• CephFS metadata servers (MDS) provide
a POSIX-compliant file system backed by
RADOS.
– …still experimental?

How does Ceph fit with OpenStack?
• RBD drivers for OpenStack tell libvirt to
configure the QEMU interface to librbd
• Ceph benefits:
– Multi-node striping and redundancy for block
storage (Cinder volumes, Nova ephemeral
drives, Glance images)
– Copy-on-write cloning of images to volumes
– Unified pool of storage nodes for all types of
data (objects, block devices, files)
– Live migration of Ceph-backed VMs

Questions
• How much net storage, in what tiers?
• How many IOPS?
– Aggregated
– Per VM (average)
– Per VM (peak)
• What to optimize for?
– Cost
– Performance

Rules of thumb and magic numbers
#1
• Ceph-OSD sizing:
– Disks
• 8-10 SAS HDDs per 1 x 10G NIC
• ~12 SATA HDDs per 1 x 10G NIC
• 1 x SSD for write journal per 4-6 OSD drives
– RAM:
• 1 GB of RAM per 1 TB of OSD storage space
– CPU:
• 0.5 CPU cores (or 1 GHz of one core) per OSD disk; 1-2 CPU cores per OSD for SSD drives
• Ceph-mon sizing:
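The Ceph-OSD ratios above can be turned into a quick per-node check. This is a minimal sketch, not a sizing tool: the function name and the returned keys are made up for illustration, the CPU rule covers HDD-backed OSDs only, and picking 10 drives per NIC for SAS (the upper end of the 8-10 range) is an assumption.

```python
import math

def osd_node_requirements(osd_drives, drive_tb, sata=True):
    """Minimum per-node resources implied by the rule-of-thumb ratios (HDD OSDs)."""
    # 1 GB of RAM per 1 TB of OSD storage space
    ram_gb = osd_drives * drive_tb
    # 0.5 CPU core per HDD-backed OSD disk
    cpu_cores = osd_drives * 0.5
    # 1 journal SSD per 4-6 OSD drives, returned as a (fewest, most) range
    journal_ssds = (math.ceil(osd_drives / 6), math.ceil(osd_drives / 4))
    # ~12 SATA or 8-10 SAS HDDs per 10G NIC; 10 for SAS is an assumed pick
    drives_per_nic = 12 if sata else 10
    nics_10g = math.ceil(osd_drives / drives_per_nic)
    return {
        "ram_gb": ram_gb,
        "cpu_cores": cpu_cores,
        "journal_ssds": journal_ssds,
        "nics_10g": nics_10g,
    }

# Example: a node with 28 x 4 TB SATA OSD drives
print(osd_node_requirements(osd_drives=28, drive_tb=4))
```

For 28 x 4 TB SATA drives this gives 112 GB of RAM, 14 cores, 5 to 7 journal SSDs and three 10G NICs as lower bounds.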

Rules of thumb and magic numbers
#2
• For a typical “unpredictable” workload we usually
assume:
– 70/30 read/write IOPS proportion
– ~4-8KB random read pattern
• To estimate Ceph IOPS efficiency we usually
take*:
– 4-8K Random Read - 0.88
– 4-8K Random Write - 0.64
* Based on benchmark data and semi-empirical evidence
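The efficiency factors above scale raw drive IOPS down to what the cluster is expected to deliver. A minimal sketch follows; the blended figure (a plain 70/30 weighted average of the read and write results) is an assumption added for illustration and does not appear on the slides.

```python
READ_EFFICIENCY = 0.88   # 4-8K random read factor from the slide
WRITE_EFFICIENCY = 0.64  # 4-8K random write factor from the slide

def effective_iops(raw_iops, read_fraction=0.70):
    """Return (read, write, blended) IOPS expected from raw drive IOPS."""
    read = raw_iops * READ_EFFICIENCY
    write = raw_iops * WRITE_EFFICIENCY
    # Assumption: weight the two by the 70/30 read/write proportion
    blended = read_fraction * read + (1 - read_fraction) * write
    return read, write, blended

read, write, blended = effective_iops(1000)
print(f"read={read:.0f} write={write:.0f} blended={blended:.0f}")
```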

Sizing task #1
• 1 Petabyte usable (net) storage
• 500 VMs
• 100 IOPS per VM
• 50000 IOPS aggregated

Sizing example #1 – Ceph-mon
• Ceph monitors
– Sample spec:
• HP DL360 Gen9
• CPU: 2620v3 (6 cores)
• RAM: 64 GB
• HDD: 1 x 1 TB SATA
• NIC: 1 x 1 Gig NIC, 1 x 10 Gig
– How many?
• 1 x ceph-mon node per 15-20 OSD nodes
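The 15-20 ratio gives a range of monitor counts rather than a single number. A small sketch under stated assumptions: `mon_node_range` applies the slide's ratio as-is, while `quorum_friendly` adds a floor of three and rounds up to an odd count, which is common Ceph practice for monitor quorum but is not stated on the slide. Both function names are hypothetical.

```python
import math

def mon_node_range(osd_nodes):
    """(fewest, most) ceph-mon nodes for 1 mon per 15-20 OSD nodes."""
    return math.ceil(osd_nodes / 20), math.ceil(osd_nodes / 15)

def quorum_friendly(count, minimum=3):
    """Assumption, not from the slide: at least 3 monitors, odd count."""
    count = max(count, minimum)
    return count if count % 2 == 1 else count + 1

for osd_nodes in (32, 98):
    low, high = mon_node_range(osd_nodes)
    print(osd_nodes, (low, high), (quorum_friendly(low), quorum_friendly(high)))
```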

Sizing example #2 – Ceph-OSD
• Lower density, performance optimized
– Sample spec:
• HP DL380 Gen9
• CPU: 2620v3 (6 cores)
• RAM: 64 GB
• HDD (OS drives): 2 x SATA 500 GB
• HDD (OSD drives): 20 x SAS 1.8 TB (10k RPM, 2.5 inch – HGST C10K1800)
• SSD (write journals): 4 x SSD 128 GB Intel S3700
• NIC: 1 x 1 Gig NIC, 4 x 10 Gig (2 bonds – 1 for ceph-public and 1 for ceph-replication)
– 1000 / (20 * 1.8 * 0.85) * 3 ≈ 98 servers to serve 1 petabyte net
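The server count can be reproduced in a few lines. This sketch reads 0.85 as a target fill factor and 3 as the replica count; the slide does not label either number, so both parameter names are assumptions.

```python
import math

def osd_servers_needed(net_tb, drives_per_node, drive_tb,
                       fill_factor=0.85, replicas=3):
    """Unrounded number of OSD servers needed for net_tb of usable storage."""
    # Capacity one node contributes once the fill factor is applied
    usable_per_node_tb = drives_per_node * drive_tb * fill_factor
    return net_tb / usable_per_node_tb * replicas

exact = osd_servers_needed(net_tb=1000, drives_per_node=20, drive_tb=1.8)
print(round(exact, 2), round(exact), math.ceil(exact))
```

The unrounded result is about 98.04, so the slide's 98 is rounded to the nearest server; rounding up to guarantee the full petabyte gives 99.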

Expected performance Level
• Conservative drive rating: 250 IOPS
• Cluster read rating:
– 250*20*98*0.88 = 431,200 IOPS
– Approx. 800 IOPS per VM capacity is available
• Cluster write rating:
– 250*20*98*0.64 = 313,600 IOPS
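The ratings are drive IOPS x drives per node x nodes x efficiency factor. A minimal sketch to recompute them; it uses 0.88 for reads and the 0.64 random-write factor from rule #2 for writes, which is the factor that yields 313,600. The function name is hypothetical.

```python
def cluster_iops(drive_iops, drives_per_node, nodes, efficiency):
    """Cluster IOPS rating, rounded to a whole number."""
    return round(drive_iops * drives_per_node * nodes * efficiency)

read_rating = cluster_iops(250, 20, 98, 0.88)
write_rating = cluster_iops(250, 20, 98, 0.64)
# Read capacity per VM for the 500 VMs of sizing task #1
per_vm_read = read_rating / 500

print(read_rating, write_rating, per_vm_read)
```

The per-VM read figure comes out at about 862, so "approx. 800" is a slightly conservative rounding.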

Sizing example #3 – Ceph-OSD
• Higher density, cost optimized
– Sample spec:
• SuperMicro 6048R
• CPU: 2 x 2630v3 (16 cores)
• RAM: 192 GB
• HDD (OS drives): 2 x SATA 500 GB
• HDD (OSD drives): 28 x SATA 4 TB (7.2k RPM, 3.5 inch – HGST 7K4000)
• SSD (write journals): 6 x SSD 128 GB Intel S3700
• NIC: 1 x 1 Gig NIC, 4 x 10 Gig (2 bonds – 1 for ceph-public and 1 for ceph-replication)
– 1000 / (28 * 4 * 0.85) * 3 ≈ 32 servers to serve 1 petabyte net

Expected performance Level
• Conservative drive rating: 150 IOPS
• Cluster read rating:
– 150*32*28*0.88 = 118,272 IOPS
• Cluster write rating:
– 150*32*28*0.64 = 86,016 IOPS
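To see whether the cost-optimized cluster still meets sizing task #1, the ratings can be compared with the demand. This is a sketch under one assumption: the 50,000 aggregate IOPS are split 70/30 into reads and writes, as in rule #2. The `headroom` helper is hypothetical.

```python
def headroom(target_total_iops, read_fraction, read_rating, write_rating):
    """Return (read, write) ratios of cluster rating to demand."""
    read_demand = target_total_iops * read_fraction
    write_demand = target_total_iops * (1 - read_fraction)
    return read_rating / read_demand, write_rating / write_demand

target = 500 * 100  # 500 VMs x 100 IOPS per VM
read_x, write_x = headroom(target, 0.70, read_rating=118_272, write_rating=86_016)
print(f"read headroom: {read_x:.1f}x, write headroom: {write_x:.1f}x")
```

Any ratio above 1.0 means the rating exceeds the demand; here both are well above it.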

What about SSDs?
• In Firefly (0.80.x) an OSD is limited to ~5k
IOPS, so full-SSD Ceph cluster performance
disappoints
• In Giant (0.87.x) ~30k IOPS per OSD should be
reachable, but this has yet to be verified
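The effect of the per-OSD ceiling is easy to illustrate: an OSD delivers the lower of the device rating and the OSD limit. In this sketch the 50,000 IOPS SSD rating is a hypothetical figure chosen only for illustration, and the Giant ceiling is the unverified ~30k quoted above.

```python
def ssd_osd_iops(device_iops, osd_cap_iops):
    """IOPS one SSD-backed OSD can deliver under a per-OSD ceiling."""
    return min(device_iops, osd_cap_iops)

HYPOTHETICAL_SSD_IOPS = 50_000  # assumption, not a figure from the slides

for release, cap in (("Firefly", 5_000), ("Giant", 30_000)):
    usable = ssd_osd_iops(HYPOTHETICAL_SSD_IOPS, cap)
    share = usable / HYPOTHETICAL_SSD_IOPS
    print(f"{release}: {usable} IOPS per OSD ({share:.0%} of the device)")
```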

How we deploy (with Fuel)
Select storage options ⇒ Assign roles to nodes ⇒ Allocate disks

What we now know #1
• RBD is not suitable for IOPS-heavy workloads:
– Realistic expectations:
• 100 IOPS per VM (OSD with HDD disks)
• 10K IOPS per VM (OSD with SSD disks, pre-Giant Ceph)
• 10-40 ms latency expected and accepted
• high CPU load on OSD nodes at high IOPS (bumping up CPU requirements)
• Object storage for tenants and applications has limitations
– Swift API gaps in radosgw:
• bulk operations
• custom account metadata
• static website
• expiring objects
• object versioning

What we now know #2
• NIC bonding
– Use Linux bonds, not OVS bonds. OVS underperforms and eats CPU
• Ceph and OVS
– DON’T. Bridges for Ceph networks should be connected directly to
the NIC bond, not via the OVS integration bridge
• Nova evacuate doesn’t work with Ceph RBD-backed VMs
– Fixed in Kilo
• Ceph network co-location (with Management, public etc)
– DON’T. Use dedicated NICs/NIC pairs for Ceph-public and Ceph-internal (replication)
• Tested ceph.conf
– Enable RBD cache and other options:
https://bugs.launchpad.net/fuel/+bug/1374969
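As an illustration of what enabling the RBD cache looks like in ceph.conf, the sketch below generates a `[client]` section. The two options shown are standard Ceph client settings, but they are an assumption here: they are not copied from the linked bug, and the "others" it refers to are not reproduced.

```python
import configparser
import io

conf = configparser.ConfigParser()
# Assumed example settings, not the tested configuration from the bug report
conf["client"] = {
    "rbd cache": "true",
    "rbd cache writethrough until flush": "true",
}

buf = io.StringIO()
conf.write(buf)
fragment = buf.getvalue()
print(fragment)
```

The printed fragment can be merged into the `[client]` section of an existing ceph.conf on the compute nodes.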
