Performance Optimization for All Flash based on aarch64
Huawei 2021/06/11
Kunpeng920 ARM-Based CPU
Industry’s Most Powerful ARM-Based CPU
 Core count: 32/48/64 cores
 Frequency: 2.6/3.0 GHz
 Memory controllers: 8 DDR4 controllers
 Interfaces: PCIe 4.0, CCIX, 100G RoCE, SAS/SATA
 Process: 7 nm

Ceph solution based on Kunpeng920
Ceph technical architecture based on Kunpeng chips: TaiShan hardware, openEuler OS, and acceleration libraries (compression, crypto).
Optimizes performance through software and hardware collaboration
Three optimization areas: disk I/O, CPU, and network.
Improves CPU usage
• Turn CPU prefetching on or off
• Optimize the number of concurrent threads
• NUMA optimization
• Kernel 4K/64K page size
I/O performance optimization
Optimizes NIC performance
• Interrupt core binding
• MTU adjustment
• TCP parameter adjustment
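The MTU and TCP tunings above are typically applied via sysctl and ip. The fragment below is an illustrative sketch only; the interface name and every value are assumptions to be tuned per deployment, not values taken from these slides.

```conf
# /etc/sysctl.d/99-ceph-net.conf -- illustrative TCP tuning sketch;
# all values are assumptions, tune per NIC speed and workload.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# MTU adjustment (jumbo frames) is applied per interface, e.g.:
#   ip link set dev eth0 mtu 9000
```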
Table of Contents
 Data access across NUMA
 Multi-port NIC deployment
 DDR multi-channel deployment
 Messenger throttle too low
 Long queue waiting time in the OSD
 Use 64K PageSize
 Optimized crc32c in RocksDB
① Data access across NUMA
[Diagram: left, the data path crosses NUMA nodes between NIC, NVMe (PCIe), CPU cores, and memory; right, the data flow is completed within a single NUMA node.]
OSD NUMA-aware Deployment
 OSDs are evenly deployed across NUMA nodes to avoid cross-NUMA scheduling and memory-access overheads.
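A minimal sketch of the even, round-robin placement, assuming 12 OSDs and 4 NUMA nodes (counts are illustrative); the commented numactl line shows how each OSD would be pinned. Newer Ceph releases also expose an osd_numa_node option for the same purpose.

```shell
#!/bin/sh
# Sketch: spread OSDs evenly across NUMA nodes round-robin.
# OSDS/NODES are illustrative assumptions, not from the slides.
OSDS=12
NODES=4
osd=0
while [ "$osd" -lt "$OSDS" ]; do
  node=$((osd % NODES))
  echo "osd.$osd -> NUMA node $node"
  # Pin CPU and memory to that node when launching the daemon, e.g.:
  #   numactl --cpunodebind="$node" --membind="$node" ceph-osd -i "$osd"
  osd=$((osd + 1))
done
```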
[Charts: with NUMA-aware deployment, IOPS is 4-11% higher and latency 3-10% lower across 4K/8K rand write, rand read, and 70/30 rand RW.]
② Multi-port NIC deployment
 NIC interrupts and OSD processes are bound to the same NUMA node.
 NICs receive packets and OSDs call recvmsg in the same NUMA node.
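The interrupt binding can be sketched as follows. The script only prints the intended mapping (one NIC per node and 24 CPUs per node are assumptions), while the commented line shows the actual smp_affinity_list write, which needs root and real IRQ numbers from /proc/interrupts.

```shell
#!/bin/sh
# Sketch: map each NIC's interrupts to the CPUs of its local NUMA node.
# "24 CPUs per node" and the NIC names are illustrative assumptions.
CPUS_PER_NODE=24
node=0
for nic in NIC0 NIC1 NIC2 NIC3; do
  first=$((node * CPUS_PER_NODE))
  last=$((first + CPUS_PER_NODE - 1))
  echo "$nic -> NUMA$node, bind IRQs to cpus $first-$last"
  # Real binding (root; IRQ numbers from /proc/interrupts):
  #   echo "$first-$last" > /proc/irq/<irq>/smp_affinity_list
  node=$((node + 1))
done
```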
[Charts: with multi-port NIC deployment, IOPS is 3-7% higher and latency 0-6% lower across 4K/8K rand write, rand read, and 70/30 rand RW.]
[Diagram: before, NIC0-NIC3 all sit on NUMA0 while OSDs run on NUMA0-NUMA3, so most OSDs receive packets across NUMA; after, one NIC is attached to each NUMA node and serves only that node's OSDs.]
③ DDR multi-channel connection

[Chart: DDR 16-channel vs. 12-channel | 4KB block size | IOPS higher: Rand Write +7%, Rand Read +11%]

 With 16 DDR channels, 4K random-write IOPS is 7% higher (and random-read IOPS 11% higher) than with 12 DDR channels.
④ Messenger throttler
 Gradually increase osd_client_message_cap. At a value of 1000, IOPS increases by 3.7%.
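In ceph.conf the change looks like the fragment below; osd_client_message_cap is an existing OSD option, and 1000 is simply the best value found in this sweep.

```ini
[osd]
# Messenger throttle: max in-flight client messages per OSD.
# 1000 gave the peak +3.7% 4K rand-write IOPS in the sweep above.
osd_client_message_cap = 1000
```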
[Chart: 4K rand write IOPS gain vs. osd_client_message_cap from 0 to 1600; the gain ranges from 0.0% to 3.7%, peaking at 3.7% around a cap of 1000.]
⑤ Long queue waiting time in the OSD
 The average length of the OPQueue is less than 10 and its queuing time is short, while the kv_queue and kv_committing queues are longer than 30 and their queuing time is long.
[Diagram: OSD write pipeline. Messenger workers enqueue ops into OPQueue; OSD threads dequeue ops, prepare data, and issue aio (bstore_aio) to the data disk; metadata enters kv_queue, is swapped into kv_committing by bstore_kv_sync (WAL+DB flush to disk), then passes through kv_committing_to_finalize to bstore_kv_final, which completes onCommits via the ContextQueue back to the Messenger. Annotated latencies: op_before_dequeue_op_lat 16%, state_kv_queued_lat 26%, state_kv_commiting_lat 27%.]
[Chart: 4K write latency breakdown: send 0%, recv 1%, dispatch, op_before_dequeue_op_lat 17%, op_w_prepare_latency 3%, state_aio_wait_lat 12%, state_kv_queued_lat 27%, kv_sync_lat 7%, kv_final_lat 3%, wait in queues 28%. state_kv_commiting_lat = kv_sync_lat + kv_final_lat + wait in queues = 38%.]
CPU partition affinity
[Charts: with CPU partition affinity, IOPS is up to 21% higher and latency up to 18% lower across 4K/8K rand write, rand read, and 70/30 rand RW.]
 CPU partitioning separates the msgr-worker/tp_osd_tp threads from the bstore threads so that the two groups are scheduled fairly.
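The partitioning can be sketched as below. The script only computes the two CPU sets (a 24-CPU node with a 16/8 split is an assumption); applying them would use taskset -cp per thread, with thread names readable from /proc/PID/task/TID/comm.

```shell
#!/bin/sh
# Sketch: split one NUMA node's CPUs into a partition for network threads
# (msgr-worker, tp_osd_tp) and one for store threads (bstore_kv_sync, etc.).
# The 24-CPU node and the 16/8 split are illustrative assumptions.
NODE_CPUS=24
NET_CPUS=16
net_set="0-$((NET_CPUS - 1))"
store_set="$NET_CPUS-$((NODE_CPUS - 1))"
echo "msgr-worker/tp_osd_tp -> cpus $net_set"
echo "bstore_* threads      -> cpus $store_set"
# Apply per thread, e.g.: taskset -cp "$store_set" <tid>
```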
⑥Use 64K PageSize
 Use CEPH_PAGE_SHIFT for compatibility with various page sizes
[Chart: 64K/4K page-size IOPS ratio | 4KB block size: Rand Write 1.12, Rand Read 1.17, Rand Read/Write 1.15]

 Compared with 4K pages, 64K pages reduce TLB misses and improve performance by 10%+.
Tips:
 Use small-page alignment to reduce memory waste in bufferlist::reserve.
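The kernel page size the code must adapt to can be checked at runtime, which is why sizes should be derived from CEPH_PAGE_SHIFT/getpagesize() rather than a hard-coded 4096:

```shell
# Print the kernel page size: 4096 on a 4K kernel, 65536 on an aarch64 64K kernel.
getconf PAGESIZE
```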
64K PageSize Issues
 Write amplification issue
When bluefs_buffered_io is true, metadata is written with buffered I/O and sync_file_range flushes data to disk page by page in the kernel. The write-amplification factor is 2.46 with 4K pages but 5.46 with 64K pages.
When bluefs_buffered_io is false, metadata is written with direct I/O and sync_file_range is not called; the write-amplification factor is 2.29. Because excessive writes shorten disk lifespan, set bluefs_buffered_io to false.
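The resulting setting, as a ceph.conf fragment (bluefs_buffered_io is an existing BlueStore option):

```ini
[osd]
# Direct I/O for BlueFS metadata avoids the page-granular buffered
# writeback that inflates write amplification on 64K-page kernels
# (2.46x/5.46x buffered vs. 2.29x direct, per the measurements above).
bluefs_buffered_io = false
```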
 Tcmalloc and kernel page size issue
When the tcmalloc page size is smaller than the kernel page size, memory usage keeps growing until it approaches osd_memory_target and performance deteriorates significantly. Ensure that the tcmalloc page size is not smaller than the kernel page size.
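With gperftools, the tcmalloc page size is chosen at build time via a configure switch (value in KB). Treat the exact flag as an assumption and confirm it with ./configure --help on your gperftools version.

```shell
# Build tcmalloc (gperftools) with 64 KB internal pages so it is not
# smaller than a 64K kernel page; flag name per gperftools configure.
./configure --with-tcmalloc-pagesize=64
make -j && sudo make install
```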
⑦ Optimized crc32c in RocksDB
RocksDB's crc32c_arm64 is supported since Ceph Pacific. Backporting this feature to earlier versions improves performance by about 3%.
Questions?
THANK YOU
