Emerging Storage Solutions (EMS), SanDisk Confidential
Setup
4 OSDs, one per SSD (4TB)
4 pools, 4 rbd images (one per pool)
1 physical client box. Total 4 fio_rbd clients, each with 8 (num_jobs) * 32 = 256 QD
Block size = 4K, 100% RR
Working set ~4 TB
Code base is latest ceph master
Server has 40 cores and 64 GB RAM
Shards : thread_per_shard = 25:1
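The workload above can be sketched as an fio job file using the rbd ioengine; the pool/image names, runtime, and auth settings below are illustrative assumptions, not taken from the original runs:

```ini
; Hypothetical fio_rbd job file matching the setup above.
; Pool/image names, clientname, and runtime are illustrative.
[global]
ioengine=rbd
clientname=admin
pool=testpool1        ; one of the 4 pools
rbdname=testimage1    ; one rbd image per pool
rw=randread           ; 100% random read (RR)
bs=4k
direct=1
numjobs=8             ; 8 jobs x iodepth 32 = 256 QD per client
iodepth=32
time_based=1
runtime=300

[rbd_rr_4k]
```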
Result
| Transport | IOPS  | BW        | Reads served from disk (%) | User CPU (%) | Sys CPU (%) | Idle CPU (%) |
|-----------|-------|-----------|----------------------------|--------------|-------------|--------------|
| TCP       | ~50K  | ~200 MB/s | ~99                        | ~15          | ~12         | ~55          |
| RDMA      | ~130K | ~520 MB/s | ~99                        | ~40          | ~19         | ~11          |
Summary:
• Significant performance gain: ~50K → ~130K IOPS (~2.6X)
• TCP iops/core = 2777, XIO iops/core = 3651
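The iops/core figures quoted throughout appear to be derived from the measured idle percentages; a minimal sketch of that arithmetic, assuming iops/core is computed against busy (non-idle) cores:

```python
# Back-of-the-envelope iops/core from the measured CPU idle numbers.
# Assumes iops/core = IOPS / (total cores * busy fraction).
def iops_per_core(iops, total_cores, idle_pct):
    busy_cores = total_cores * (1 - idle_pct / 100)
    return int(iops / busy_cores)

# Numbers from the 4-OSD run above (40-core server):
print(iops_per_core(50_000, 40, 55))   # TCP -> 2777
print(iops_per_core(130_000, 40, 11))  # XIO -> 3651
```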
Setup
16 OSDs, one per SSD (4TB)
4 pools, 4 rbd images (one per pool)
1 physical client box. Total 4 fio_rbd clients, each with 8 (num_jobs) * 32 = 256 QD
Block size = 4K, 100% RR
Working set ~4 TB
Code base is latest ceph master
Server has 40 cores and 64 GB RAM
Shards : thread_per_shard = 25:1, 10:1
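The shard ratios tested correspond to the OSD sharded work-queue settings; a hypothetical ceph.conf fragment for the 25:1 and 10:1 cases (option names assumed from ceph master at the time):

```ini
; Hypothetical ceph.conf fragment for the shard experiments.
[osd]
osd_op_num_shards = 25            ; 25:1 case
osd_op_num_threads_per_shard = 1
; 10:1 alternative:
; osd_op_num_shards = 10
; osd_op_num_threads_per_shard = 1
```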
Result
| Transport | IOPS  | BW        | Disk read (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|-------|-----------|---------------|---------------------------|--------------------------|----------------------|
| TCP       | ~118K | ~470 MB/s | ~99           | ~3                        | ~26                      | ~16                  |
| RDMA      | ~120K | ~480 MB/s | ~99           | ~7                        | ~25                      | ~28                  |
Summary:
• TCP is catching up; TCP iops/core = 3041, XIO iops/core = 3225 in cluster nodes
• More memory consumed by XIO
Setup
16 OSDs, one per SSD (4TB)
2 hosts, 8 OSDs each
4 pools, 4 rbd images (one per pool)
1 physical client box. Total 4 fio_rbd clients, each with 8 (num_jobs) * 32 = 256 QD
Block size = 4K, 100% RR
Working set ~6 TB
Code base is latest ceph master
Server has 40 cores and 64 GB RAM
Shards : thread_per_shard = 25:1, 10:1
Result
| Transport | IOPS  | BW        | Disk read (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|-------|-----------|---------------|---------------------------|--------------------------|----------------------|
| TCP       | ~175K | ~700 MB/s | ~99           | ~8                        | ~18                      | ~16                  |
| RDMA      | ~238K | ~952 MB/s | ~99           | ~14                       | ~20                      | ~28                  |
Summary:
• ~36% performance gain
• TCP iops/core = 4755, XIO iops/core = 6918 in cluster nodes
• RDMA uses over 10 percentage points more memory per cluster node (~28% vs ~16%)
Setup
32 OSDs, one per SSD (4TB)
2 hosts, 16 OSDs each
4 pools, 4 rbd images (one per pool)
1 physical client box. Total 4 fio_rbd clients, each with 8 (num_jobs) * 32 = 256 QD
Block size = 4K, 100% RR
Working set ~6 TB
Code base is latest ceph master
Server has 40 cores and 64 GB RAM
Shards : thread_per_shard = 25:1, 10:1,15:1,5:2
Result
| Transport | IOPS  | BW        | Disk read (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|-------|-----------|---------------|---------------------------|--------------------------|----------------------|
| TCP       | ~214K | ~775 MB/s | ~99           | ~9                        | ~12                      | ~16                  |
| RDMA      | ~230K | ~870 MB/s | ~99           | ~12                       | ~18                      | ~28                  |
Summary:
• TCP is catching up again; not much of a gain
• TCP iops/core = 2939, XIO iops/core = 3267 in cluster nodes
• More memory usage per cluster node with RDMA
Testing with a more powerful setup
8 OSDs, one per SSD (4TB)
4 pools, 4 rbd images (one per pool)
1 physical client box. Total 4 fio_rbd clients, each with 8 (num_jobs) * 32 = 256 QD
Block size = 4K, 100% RR
Working set ~4 TB
Code base is latest ceph master
Server has 56 cores (Xeon E5-2697 v3 @ 2.60 GHz) and 64 GB RAM
Shards : thread_per_shard = 25:1
Result
| Transport | IOPS  | BW        | Reads served from disk (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|-------|-----------|----------------------------|---------------------------|--------------------------|----------------------|
| TCP       | ~148K | ~505 MB/s | ~99                        | ~15                       | ~68                      | ~11                  |
| RDMA      | ~166K | ~665 MB/s | ~99                        | ~18                       | ~73                      | ~19                  |
Summary:
• ~12% performance gain
• TCP iops/core = 3109, XIO iops/core = 3616 in cluster nodes.
• For client node, TCP iops/core = 8258, XIO iops/core = 10978
• RDMA uses over 8 percentage points more memory per cluster node (~19% vs ~11%)
Result (no disk hit)
| Transport | IOPS  | BW         | Reads served from disk (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|-------|------------|----------------------------|---------------------------|--------------------------|----------------------|
| TCP       | ~265K | ~1037 MB/s | ~0                         | ~35                       | ~40                      | ~11                  |
| RDMA      | ~276K | ~1084 MB/s | ~0                         | ~60                       | ~63                      | ~19                  |
Summary:
• Not much difference throughput-wise
• But a significant difference in efficiency: TCP iops/core = 7280, XIO iops/core = 12,321 in cluster nodes
• RDMA uses over 8 percentage points more memory per cluster node
Bumping up OSDs on the same setup
16 OSDs, one per SSD (4TB)
4 pools, 4 rbd images (one per pool)
1 physical client box. Total 4 fio_rbd clients, each with 8 (num_jobs) * 32 = 256 QD
Block size = 4K, 100% RR
Working set ~4 TB
Code base is latest ceph master
Server has 56 cores (Xeon E5-2697 v3 @ 2.60 GHz) and 64 GB RAM
Shards : thread_per_shard = 10:1, 4:2, 25:1
Some experimentation with the xio_portal_threads option
Result
| Transport | IOPS  | BW        | Reads served from disk (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|-------|-----------|----------------------------|---------------------------|--------------------------|----------------------|
| TCP       | ~142K | ~505 MB/s | ~99                        | ~18                       | ~68                      | ~18                  |
| RDMA      | ~166K | ~665 MB/s | ~99                        | ~18                       | ~73                      | ~38                  |
Summary:
• TCP iops/core = 3092, XIO iops/core = 3614 in cluster nodes
• TCP iops/core = 7924, XIO iops/core = 10978 in client nodes
• More than 2X memory usage by RDMA
• Not much scaling between 8 and 16 OSDs for either TCP or RDMA, even though nothing is saturated at this point!
Result (no disk hit)
| Transport | IOPS                                                  | BW         | Reads served from disk (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|-------------------------------------------------------|------------|----------------------------|---------------------------|--------------------------|----------------------|
| TCP       | ~268K                                                 | ~1049 MB/s | ~0                         | ~37                       | ~37                      | ~17                  |
| RDMA      | ~400K (OSD-side portal threads = 2, client-side = 8)  | ~1600 MB/s | ~0                         | ~40                       | ~42                      | ~40                  |
Summary:
• Suspecting some lock contention in the OSD layer, started experimenting with XIO portal threads
• With a lower number of portal threads (2) on the OSD node, the no-disk-hit performance jumped to 400K IOPS!
• Increasing XIO portal threads in the OSD layer decreases performance in this case
• Tried some shard options, but TCP remains almost the same as in the 8-OSD case; this seems to be a limit
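The portal-thread tuning above would correspond to something like the following ceph.conf fragment (assuming the xio_portal_threads option from the XIO messenger work in master; the values mirror the 2 OSD-side / 8 client-side combination that reached 400K):

```ini
; Hypothetical fragment for the xio portal-thread tuning.
; On the OSD nodes:
[osd]
xio_portal_threads = 2    ; fewer portal threads helped here
; On the client box:
[client]
xio_portal_threads = 8
```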
Checking scale-out behavior
32 OSDs, one per SSD (4TB)
2 nodes with 16 OSDs each
4 pools, 4 rbd images (one per pool)
1 physical client box. Total 4 fio_rbd clients, each with 8 (num_jobs) * 32 = 256 QD
Block size = 4K, 100% RR
Working set ~4 TB
Code base is latest ceph master
Server has 56 cores (Xeon E5-2697 v3 @ 2.60 GHz) and 64 GB RAM
Shards : thread_per_shard = 10:1, 4:2, 25:1
Little bit experiment with xio_portal_thread features
Result (no disk hit)
| Transport | IOPS  | BW         | Reads served from disk (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|-------|------------|----------------------------|---------------------------|--------------------------|----------------------|
| TCP       | ~323K | ~1263 MB/s | ~0                         | ~40                       | ~12                      | ~18.7                |
| RDMA      | ~343K | ~1339 MB/s | ~0                         | ~55                       | ~30                      | ~37.5                |
Summary:
• TCP is scaling, but XIO is not!
• In fact, XIO gives less throughput than the 16-OSD setup!
• TCP iops/core = 4806, XIO iops/core = 6805 in cluster nodes
• TCP iops/core = 6565, XIO iops/core=8750, even more significant in the client nodes
• XIO mem usage per node is again ~2X
Result (with disk hit)
| Transport | IOPS  | BW         | Reads served from disk (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|-------|------------|----------------------------|---------------------------|--------------------------|----------------------|
| TCP       | ~249K | ~973 MB/s  | ~99                        | ~22                       | ~18                      | ~15.5                |
| RDMA      | ~258K | ~1006 MB/s | ~99                        | ~24                       | ~40                      | ~38                  |
Summary:
• TCP/XIO similar throughput
• TCP iops/core = 5422, XIO iops/core = 7678 in client nodes; significant gain with XIO on the client side
• XIO mem usage per node is again more than 2X
Trying out bigger block sizes
32 OSDs, one per SSD (4TB)
2 nodes with 16 OSDs each
4 pools, 4 rbd images (one per pool)
1 physical client box. Total 1 fio_rbd clients, each with 8 (num_jobs) * 32 = 256 QD
Could not run 4 clients in parallel with XIO
Block size = 16K/64K, 100% RR
Working set ~4 TB
Code base is latest ceph master
Server has 56 cores (Xeon E5-2697 v3 @ 2.60 GHz) and 64 GB RAM
Shards : thread_per_shard = 10:1, 4:2, 25:1
Little bit experiment with xio_portal_thread features
Result (32 OSDs, 16K, 1 client)
| Transport | IOPS          | BW         | Reads served from disk (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|---------------|------------|----------------------------|---------------------------|--------------------------|----------------------|
| TCP       | ~150K         | ~2354 MB/s | ~99                        | ~35                       | ~48                      | ~15.5                |
| RDMA      | ~152K (spiky) | ~2355 MB/s | ~99                        | ~40                       | ~60                      | ~38                  |
Summary:
• TCP/XIO similar throughput
• XIO is very spiky
• Couldn’t run more than 1 client (8 num_jobs) with XIO.
• But the CPU gain is visible
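As a sanity check, the reported bandwidth lines up with IOPS × block size; a quick sketch:

```python
# Bandwidth should be roughly IOPS * block size.
def bw_mib_per_s(iops, bs_kib):
    return iops * bs_kib / 1024  # MiB/s

# 16K run above: ~150K IOPS -> ~2344 MiB/s, close to the ~2354M reported.
print(bw_mib_per_s(150_000, 16))
```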
Result (32 OSDs, 64K, 1 client)
| Transport | IOPS         | BW         | Reads served from disk (%) | Cluster node CPU (% idle) | Client node CPU (% idle) | Cluster node mem (%) |
|-----------|--------------|------------|----------------------------|---------------------------|--------------------------|----------------------|
| TCP       | ~53K         | ~3312 MB/s | ~99                        | ~57                       | ~74                      | ~15.5                |
| RDMA      | ~55K (spiky) | ~3625 MB/s | ~99                        | ~57                       | ~82                      | ~39                  |
Summary:
• TCP/XIO similar throughput
• XIO is very spiky
• Couldn’t run more than 1 client (8 num_jobs) with XIO.
• But the CPU gain is visible, especially on the client side
Summary
Highlights:
– Definite improvement on iops/core
– Single client is much more efficient with XIO messenger
– A lower number of OSDs can deliver high throughput
– If we can fix the internal XIO messenger contention, it has potential to outperform TCP in a big way
Lowlights:
– TCP is catching up fast as OSD count increases
– TCP also appears to scale out better than XIO
– XIO's present state is *unstable*: occasional crashes and peering problems
– Connection startup time is much higher for XIO
– XIO connections take time to stabilize to a steady throughput
– Memory requirement is considerably higher