
Your 1st Ceph cluster

Rules of thumb and magic numbers. By Udo Seidel, Greg Elkinbard and Dmitriy Novakovskiy.


  1. Your 1st Ceph cluster: rules of thumb and magic numbers
  2. Agenda • Ceph overview • Hardware planning • Lessons learned
  3. Who are we? • Udo Seidel • Greg Elkinbard • Dmitriy Novakovskiy
  4. Ceph overview
  5. What is Ceph? • Ceph is a free and open distributed storage platform that provides unified object, block, and file storage.
  6. Object storage • Native RADOS objects: – hash-based placement – synchronous replication – radosgw provides S3- and Swift-compatible REST access to RGW objects (striped over multiple RADOS objects).
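For illustration, native object access through the python-rados binding might look like the minimal sketch below; the conffile path and the pool name 'rbd' are assumptions, and the object name is made up:

```python
import rados

# Connect to the cluster using an assumed default config path.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')   # I/O context on an existing pool
    try:
        # write_full is a synchronous write of the whole object; placement
        # is computed from the object name hash (CRUSH), not tracked centrally.
        ioctx.write_full('hello_object', b'hello RADOS')
        print(ioctx.read('hello_object'))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```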
  7. Block storage • RBD block devices are thinly provisioned over RADOS objects and can be accessed by QEMU via the librbd library.
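A similar sketch for block storage with the python-rbd binding; the pool name 'rbd' and image name 'vm-disk-1' are assumptions. The image consumes almost no space until data is written:

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Create a 10 GiB thin-provisioned image; the backing RADOS objects are
# allocated only as data is written.
rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)

image = rbd.Image(ioctx, 'vm-disk-1')
image.write(b'\x00' * 512, 0)   # write 512 bytes at offset 0
print(image.size())             # 10737418240
image.close()

ioctx.close()
cluster.shutdown()
```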
  8. File storage • CephFS metadata servers (MDS) provide a POSIX-compliant file system backed by RADOS. – …still experimental?
  9. How does Ceph fit with OpenStack? • RBD drivers for OpenStack tell libvirt to configure the QEMU interface to librbd • Ceph benefits: – Multi-node striping and redundancy for block storage (Cinder volumes, Nova ephemeral drives, Glance images) – Copy-on-write cloning of images to volumes – Unified pool of storage nodes for all types of data (objects, block devices, files) – Live migration of Ceph-backed VMs
  10. Hardware planning
  11. Questions • How much net storage, in what tiers? • How many IOPS? – Aggregated – Per VM (average) – Per VM (peak) • What to optimize for? – Cost – Performance
  12. Rules of thumb and magic numbers #1 • Ceph-OSD sizing: – Disks • 8-10 SAS HDDs per 1 x 10G NIC • ~12 SATA HDDs per 1 x 10G NIC • 1 x SSD for write journal per 4-6 OSD drives – RAM: • 1 GB of RAM per 1 TB of OSD storage space – CPU: • 0.5 CPU cores / 1 GHz per OSD disk (1-2 CPU cores for SSD drives) • Ceph-mon sizing: – 1 x ceph-mon node per 15-20 OSD nodes (see sizing example #1)
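These ratios turn mechanically into a per-node checklist. A small sketch under the slide's assumptions; the function name and the midpoint defaults are mine (10 HDDs per 10G NIC and 5 OSDs per journal SSD split the quoted ranges):

```python
import math

def osd_node_budget(hdd_count, tb_per_hdd,
                    hdds_per_10g_nic=10,      # slide quotes 8-12 per 10G NIC
                    osds_per_journal_ssd=5):  # slide quotes 1 SSD per 4-6 OSDs
    """Resource checklist for one OSD node, per the rules of thumb."""
    return {
        '10g_nics':     math.ceil(hdd_count / hdds_per_10g_nic),
        'journal_ssds': math.ceil(hdd_count / osds_per_journal_ssd),
        'ram_gb':       hdd_count * tb_per_hdd,      # 1 GB RAM per TB of OSD space
        'cpu_cores':    math.ceil(hdd_count * 0.5),  # 0.5 core per OSD disk
    }

# The 20-disk node from sizing example #2 below:
print(osd_node_budget(20, 1.8))
# {'10g_nics': 2, 'journal_ssds': 4, 'ram_gb': 36.0, 'cpu_cores': 10}
```

The sample spec on slide 16 matches the NIC and journal counts and over-provisions RAM (64 GB against the 36 GB minimum), which reads as a deliberate margin.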
  13. Rules of thumb and magic numbers #2 • For a typical “unpredictable” workload we usually assume: – 70/30 read/write IOPS proportion – ~4-8 KB random I/O pattern • To estimate Ceph IOPS efficiency we usually take*: – 4-8K random read: 0.88 – 4-8K random write: 0.64 * Based on benchmark data and semi-empirical evidence
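The two derating factors blend into one effective figure once the 70/30 mix is applied; a sketch (the function is mine, the constants are from the slide):

```python
READ_EFF, WRITE_EFF = 0.88, 0.64    # 4-8K random read / write efficiency

def effective_iops(raw_iops, read_fraction=0.70):
    """Blend raw drive IOPS using the slide's efficiency factors and mix."""
    reads  = raw_iops * read_fraction       * READ_EFF
    writes = raw_iops * (1 - read_fraction) * WRITE_EFF
    return reads + writes

print(effective_iops(100000))   # ~80,800 blended IOPS from a 100k raw budget
```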
  14. Sizing task #1 • 1 petabyte usable (net) storage • 500 VMs • 100 IOPS per VM • 50,000 IOPS aggregated
  15. Sizing example #1 – Ceph-mon • Ceph monitors – Sample spec: • HP DL360 Gen9 • CPU: 2620v3 (6 cores) • RAM: 64 GB • HDD: 1 x 1 TB SATA • NIC: 1 x 1 GbE, 1 x 10 GbE – How many? • 1 x ceph-mon node per 15-20 OSD nodes
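Counting monitors from that rule, for the 98-node cluster sized on the next slide; the odd-count note is standard Ceph quorum practice rather than something from the slides:

```python
import math

osd_nodes = 98                    # from sizing example #2 below
print(math.ceil(osd_nodes / 20))  # 5 mons at the loose end of the rule
print(math.ceil(osd_nodes / 15))  # 7 mons at the tight end
# Pick an odd number in that range (mons form a Paxos quorum), e.g. 5.
```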
  16. Sizing example #2 – Ceph-OSD • Lower density, performance optimized – Sample spec: • HP DL380 Gen9 • CPU: 2620v3 (6 cores) • RAM: 64 GB • HDD (OS drives): 2 x SATA 500 GB • HDD (OSD drives): 20 x SAS 1.8 TB (10k RPM, 2.5 inch – HGST C10K1800) • SSD (write journals): 4 x SSD 128 GB Intel S3700 • NIC: 1 x 1 GbE, 4 x 10 GbE (2 bonds – 1 for ceph-public and 1 for ceph-replication) – 1000 / (20 * 1.8 * 0.85) * 3 ≈ 98 servers to serve 1 petabyte net
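The capacity formula generalizes; a sketch (the function name is mine, the 0.85 formatted-to-usable factor and 3x replication come from the slides):

```python
def servers_for_net_tb(net_tb, disks_per_node, tb_per_disk,
                       usable=0.85, replicas=3):
    """Nodes needed for a given net capacity at the given replication."""
    usable_tb_per_node = disks_per_node * tb_per_disk * usable
    return round(net_tb / usable_tb_per_node * replicas)

print(servers_for_net_tb(1000, 20, 1.8))  # 98, matching this slide
print(servers_for_net_tb(1000, 28, 4.0))  # 32, the dense config on slide 18
```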
  17. Expected performance level • Conservative drive rating: 250 IOPS • Cluster read rating: – 250*20*98*0.88 = 431,200 IOPS – Approx 800 IOPS of capacity available per VM • Cluster write rating: – 250*20*98*0.64 = 313,600 IOPS – Approx 600 IOPS of capacity available per VM
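Re-deriving those ratings (drive rating and counts from slide 16, derating factors from slide 13); the per-VM comments assume the 500 VMs of the sizing task:

```python
raw = 250 * 20 * 98        # drive rating x drives/node x nodes = 490,000 raw IOPS
print(round(raw * 0.88))   # 431200 read IOPS  (~860 per VM across 500 VMs)
print(round(raw * 0.64))   # 313600 write IOPS (~630 per VM)
```

The slide quotes ~800 and ~600 per VM, presumably rounding down for extra margin.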
  18. Sizing example #3 – Ceph-OSD • Higher density, cost optimized – Sample spec: • SuperMicro 6048R • CPU: 2 x 2630v3 (16 cores) • RAM: 192 GB • HDD (OS drives): 2 x SATA 500 GB • HDD (OSD drives): 28 x SATA 4 TB (7.2k RPM, 3.5 inch – HGST 7K4000) • SSD (write journals): 6 x SSD 128 GB Intel S3700 • NIC: 1 x 1 GbE, 4 x 10 GbE (2 bonds – 1 for ceph-public and 1 for ceph-replication) – 1000 / (28 * 4 * 0.85) * 3 ≈ 32 servers to serve 1 petabyte net
  19. Expected performance level • Conservative drive rating: 150 IOPS • Cluster read rating: – 150*32*28*0.88 = 118,272 IOPS – Approx 200 IOPS of capacity available per VM • Cluster write rating: – 150*32*28*0.64 = 86,016 IOPS – Approx 150 IOPS of capacity available per VM
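The same arithmetic for the dense configuration:

```python
raw = 150 * 28 * 32        # 134,400 raw IOPS across the 32-node cluster
print(round(raw * 0.88))   # 118272 read IOPS  (~236 per VM over 500 VMs)
print(round(raw * 0.64))   # 86016 write IOPS  (~172 per VM)
```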
  20. What about SSDs? • In Firefly (0.8x) an OSD is limited to ~5k IOPS, so all-SSD Ceph cluster performance disappoints • In Giant (0.9x) ~30k IOPS per OSD should be reachable, but this has yet to be verified
  21. Lessons learned
  22. How we deploy (with Fuel): select storage options ⇒ assign roles to nodes ⇒ allocate disks
  23. What we now know #1 • RBD is not suitable for IOPS-heavy workloads: – Realistic expectations: • 100 IOPS per VM (OSDs on HDDs) • 10K IOPS per VM (OSDs on SSDs, pre-Giant Ceph) • 10-40 ms latency expected and accepted • High CPU load on OSD nodes at high IOPS (bumping up CPU requirements) • Object storage for tenants and applications has limitations – Swift API gaps in radosgw: • bulk operations • custom account metadata • static website hosting • expiring objects • object versioning
  24. What we now know #2 • NIC bonding – Use Linux bonds, not OVS bonds. OVS underperforms and eats CPU • Ceph and OVS – DON’T. Bridges for Ceph networks should be connected directly to the NIC bond, not via the OVS integration bridge • Nova evacuate doesn’t work with Ceph RBD-backed VMs – Fixed in Kilo • Ceph network co-location (with management, public, etc.) – DON’T. Use dedicated NICs/NIC pairs for ceph-public and ceph-internal (replication) • Tested ceph.conf – Enables RBD cache and other settings: https://bugs.launchpad.net/fuel/+bug/1374969
  25. Questions?
  26. Thank you!
