
Nick Fisk - low latency Ceph

Nick Fisk - low latency Ceph. Presentation from the CloudStack / Ceph day, Thursday 19 April, London.


  1. Life at 700us - Nick Fisk
  2. Who Am I
     • Nick Fisk
     • Ceph user since 2012
     • Author of Mastering Ceph
     • Technical manager at SysGroup, a Managed Service Provider
     • Uses Ceph for providing tier-2 services to customers (backups, standby replicas) with Veeam
     • Ceph RBD to ESXi via NFS
  3. What is Latency?
     • What the user feels when they click the button
     • Buffered IO is probably not affected, though
     • A traditional 10G iSCSI storage array will service a 4KB IO in around 300us
     • Local SAS SSD: ~20us
     • NVMe: ~2us (a quick way to measure these yourself is sketched after this slide)
     • Software-defined storage will always have higher latency, due to replication across nodes and a fatter software stack
     • Latency heavily affects single-threaded operations that can't run in parallel
       • e.g. SQL transaction logs
       • or, in the case of Ceph, PG contention
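     A minimal sketch (not from the slides) of checking numbers like these on your own hardware: a QD=1 fio run against a local device. The device path is an example, and a read test is used so it is non-destructive on a live disk.

         # Single-threaded 4KB random-read latency on a local device
         fio --name=lat-test --filename=/dev/nvme0n1 --ioengine=libaio \
             --direct=1 --rw=randread --bs=4k --iodepth=1 --numjobs=1 \
             --time_based --runtime=30 --group_reporting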
  4. PG Contention
     • PGs serialise the distributed workload in Ceph
     • Each operation takes a lock on its PG, which can lead to contention
     • Multiple requests to a single object will hit the same PG
     • Or, if you are unlucky, two hot objects may share the same PG
     • Latency defines how fast a PG can process an operation; the second operation has to wait
     • If you dump slow ops from the OSD admin socket and see a lot of delay in "waiting for PG", you are likely hitting PG contention (a sketch of this check follows this slide)
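     A minimal sketch of that admin-socket check (not from the slides), run on the OSD host; osd.0 is a placeholder id, and the exact event text varies between Ceph releases.

         # Ops currently in flight on one OSD, counting those stuck waiting on a PG
         ceph daemon osd.0 dump_ops_in_flight | grep -ci "waiting for pg"

         # Recently completed slow ops, with per-event timestamps
         ceph daemon osd.0 dump_historic_ops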
  5. The theory behind minimising latency
     • Ceph is software
     • Each step of the Ceph "software" will run through faster with faster CPUs (GHz)
     • Generally, CPUs with more cores = lower GHz
     • High CPU GHz = $$$?
     • Try to avoid dual-socket systems; they add latency and can introduce complications on high disk-count boxes (thread counts, thread pinning, interrupts)
     • Every write has to go to the journal, so make the journal as fast as reasonably possible
       • Bluestore: only small IOs
       • Blessing or a curse?
     • 10G networking is a must
     • So... fewer, faster cores + NVMe journal = Ceph latency nirvana
     • Let's come up with a hardware design that takes this into account...
  6. Bluestore - deferred writes
     • For spinning disks:
       • IO < 64K: write to the WAL, ACK, then asynchronously commit to disk later
       • IO > 64K: synchronous commit to disk
     • This is great from a double-write perspective; the WAL doesn't need to be stupidly fast or have massive write endurance
     • But an NVMe will service a 128KB write a lot faster than a 7.2k disk
     • You may need to tune the cutover for your use case (a config sketch follows this slide)
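     The cutover is controlled by the bluestore_prefer_deferred_size options. A minimal sketch of raising it for HDD OSDs; the 128KB value is purely illustrative, so benchmark before settling on a number.

         # ceph.conf on the OSD nodes (restart OSDs to apply)
         [osd]
         # Writes at or below this size go to the WAL first and are deferred
         # to the data device; 131072 (128KB) is an example, not a recommendation
         bluestore_prefer_deferred_size_hdd = 131072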
  7. Ceph CPU Frequency Scaling
     • Ever wondered how Ceph performs at different clock speeds?
     • Using the manual CPU governor on an unlocked desktop CPU, ran fio QD=1 on an RBD at different clock speeds (a sketch of the setup follows this slide)

     CPU MHz   4KB Write IOPS   Min Latency (us)   Avg Latency (us)
     1600      797              886                1250
     2000      815              746                1222
     2400      1161             630                857
     2800      1227             549                812
     3300      1320             482                755
     4300      1548             437                644
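     A minimal sketch of how a sweep like this can be reproduced, assuming the userspace governor is available on the test box and using a placeholder RBD image "test" in pool "rbd".

         # Pin all cores to a fixed frequency (example: 2.4GHz)
         cpupower frequency-set --governor userspace
         cpupower frequency-set --freq 2400MHz

         # Single-threaded 4KB write latency against an RBD image
         fio --name=rbd-lat --ioengine=rbd --clientname=admin --pool=rbd \
             --rbdname=test --rw=write --bs=4k --iodepth=1 --numjobs=1 \
             --time_based --runtime=60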
  8. Networking Latency
     • Sample ping test with a 4KB payload over 1G and 10G networks (a sketch follows this slide)
     • 25Gb networking is interesting for potentially reducing latency further
     • Even so, networking latency makes up a large part of the overall latency, due to Ceph replication between nodes
       • Client -> Primary OSD -> Replica OSDs
     • If using an NFS/iSCSI gateway/proxy, an extra network hop is added again
     • RDMA will be the game changer!!
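     A minimal sketch of that kind of ping test; the hostname is a placeholder.

         # 1000 pings with a 4KB payload, summary output only
         ping -c 1000 -s 4096 -q osd-node1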
  9. The Hardware
     • 1U server
     • Xeon E3, 4 x 3.5GHz (3.9GHz Turbo)
     • 10GBase-T onboard
     • 8 x SAS onboard
     • 8 x SATA onboard
     • 64GB RAM
     • 12 x 8TB He8's (not pictured)
     • Intel P3700 400GB for journal + OS
     • 96TB node = ~£5k (Brexit!!)
     • 160W idle
     • 180W average Ceph load
     • 220W with disks + CPU maxed out
  10. How much CPU does Ceph require?
      • Please don't take this as a "HW requirements" guide
      • Use it to make informed decisions, instead of "1 core per OSD"
      • If latency is important, work out the total required GHz and find the CPU with the highest GHz per core that meets that total, e.g. 3.5GHz x 4 cores = 14GHz
      [Chart: MHz per Ceph IO against IO size (4KB to 4MB), log scale, showing MHz per IO and MHz per MB/s]
  11. Initial Results
      • I was wrong!!!! - 4KB average latency 2.4ms

      write: io=115268KB, bw=1670.1KB/s, iops=417, runt= 68986msec
        slat (usec): min=2, max=414, avg= 4.41, stdev= 3.81
        clat (usec): min=966, max=27116, avg=2386.84, stdev=571.57
         lat (usec): min=970, max=27120, avg=2391.25, stdev=571.69
        clat percentiles (usec):
         |  1.00th=[ 1480],  5.00th=[ 1688], 10.00th=[ 1912], 20.00th=[ 2128],
         | 30.00th=[ 2192], 40.00th=[ 2288], 50.00th=[ 2352], 60.00th=[ 2448],
         | 70.00th=[ 2576], 80.00th=[ 2704], 90.00th=[ 2832], 95.00th=[ 2960],
         | 99.00th=[ 3312], 99.50th=[ 3536], 99.90th=[ 6112], 99.95th=[ 9536],
         | 99.99th=[22400]
  12. But Hang On, what's this?

      Real Current Frequency 900.47 MHz [100.11 x 8.99] (Max of below)
      Core [core-id]  :Actual Freq (Mult.)  C0%   Halt(C1)%  C3%   C6%   Temp  VCore
      Core 1 [0]:      900.38 (8.99x)       10.4  44.2       3.47  49.7  27    0.7406
      Core 2 [1]:      900.16 (8.99x)       8.46  66.7       1.18  29.9  27    0.7404
      Core 3 [2]:      900.47 (8.99x)       10.5  73.8       1     22.5  27    0.7404
      Core 4 [3]:      900.12 (8.99x)       8.03  58.6       1     38.3  27    0.7404

      • Cores are spending a lot of their time in C6 and below
      • And only running at 900MHz (ways to observe this are sketched after this slide)
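     Two ways to observe per-core frequency and C-state residency on your own nodes (a sketch; package names vary by distribution).

         # Per-core frequency and idle-state residency
         cpupower monitor

         # More detail, shipped with the kernel tools
         turbostat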
  13. Intel C-state Wake-Up Latency (us)

      POLL      0
      C1-SKL    2
      C1E-SKL   10
      C3-SKL    70
      C6-SKL    85
      C7s-SKL   124
      C8-SKL    200

      From the previous slide, a large proportion of threads could be waiting for up to 200us for the CPU to wake up before being serviced!!!
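     A minimal sketch of keeping cores out of deep C-states and at full frequency. This trades power for latency, and the mechanism shown (cpupower plus kernel parameters) is one option rather than the method used in the talk.

         # Keep the frequency governor at maximum
         cpupower frequency-set --governor performance

         # Disable idle states with a wake-up latency above 10us (until reboot)
         cpupower idle-set --disable-by-latency 10

         # Or make it persistent via kernel boot parameters (GRUB):
         #   intel_idle.max_cstate=1 processor.max_cstate=1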
  14. 4KB Seq Write - Replica x3
      • That's more like it

      write: io=105900KB, bw=5715.7KB/s, iops=1428, runt= 18528msec
        slat (usec): min=2, max=106, avg= 3.50, stdev= 1.31
        clat (usec): min=491, max=32099, avg=694.16, stdev=491.91
         lat (usec): min=494, max=32102, avg=697.66, stdev=492.04
        clat percentiles (usec):
         |  1.00th=[  540],  5.00th=[  572], 10.00th=[  588], 20.00th=[  604],
         | 30.00th=[  620], 40.00th=[  636], 50.00th=[  652], 60.00th=[  668],
         | 70.00th=[  692], 80.00th=[  716], 90.00th=[  764], 95.00th=[  820],
         | 99.00th=[ 1448], 99.50th=[ 2320], 99.90th=[ 7584], 99.95th=[11712],
         | 99.99th=[24448]
  15. Questions?
