Performance Optimization for All Flash based on aarch64
Huawei 2021/06/11
 Core count: 32/48/64 cores
 Frequency: 2.6/3.0 GHz
 Memory: 8 DDR4 controllers
 Interface: PCIe 4.0, CCIX, 100G RoCE, SAS/SATA
 Process: 7 nm
Kunpeng920 ARM-Based CPU
Industry’s Most Powerful ARM-Based CPU
Ceph solution based on Kunpeng920
Ceph technical architecture based on Kunpeng chips
TaiShan hardware | OS: openEuler, … | Acceleration libraries: compression, crypto
Optimizes performance through software and hardware collaboration across three dimensions: CPU, disk I/O, and network.
CPU: improve CPU usage
• Turn CPU prefetching on or off
• Optimize the number of concurrent threads
• NUMA optimization
• Kernel 4K/64K page size
Disk IO: I/O performance optimization
Network: optimize NIC performance (see the tuning sketch below)
• Interrupt core binding
• MTU adjustment
• TCP parameter adjustment
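As a rough illustration of the MTU and TCP knobs above, the commands below are a minimal sketch; the interface name and the values are placeholders, not figures measured in this deck:

  # enable jumbo frames on the cluster/public network interface (interface name is illustrative)
  ip link set dev eth0 mtu 9000
  # raise socket buffer ceilings and TCP autotuning ranges (values are illustrative)
  sysctl -w net.core.rmem_max=67108864
  sysctl -w net.core.wmem_max=67108864
  sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"

Interrupt core binding is sketched in the multi-port NIC section below.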
Table of Contents
① Data access across NUMA
② Multi-port NIC deployment
③ DDR multi-channel deployment
④ Messenger throttler
⑤ Long queue waiting time in the OSD
⑥ Use 64K PageSize
⑦ Optimized crc32c in RocksDB
① Data access across NUMA
[Diagram: two deployments compared. Left: data flows across NUMA nodes (NIC and NVMe reached through different nodes). Right: the data flow is completed within a NUMA node, with NVMe drives reached over local PCIe.]
OSD NUMA-aware Deployment
 OSDs are deployed evenly across NUMA nodes to avoid cross-NUMA scheduling and memory-access overheads (configuration sketch below).
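One way to realize this, sketched below as a ceph.conf fragment: recent Ceph releases expose osd_numa_node (and osd_numa_auto_affinity for automatic detection); these option names and the node assignments are assumptions to check against your release, not something stated in this deck.

  # pin each OSD to the NUMA node of its NVMe drive (node numbers are illustrative)
  [osd.0]
  osd_numa_node = 0
  [osd.1]
  osd_numa_node = 1
  # alternatively, let Ceph derive affinity from the locality of the network/storage devices
  [osd]
  osd_numa_auto_affinity = true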
[Charts: NUMA-aware deployment vs. baseline for Rand Write / Rand Read / 70/30 Rand RW at 4K and 8K block sizes. IOPS is 4%–11% higher and latency 3%–10% lower.]
② Multi-port NIC deployment
 NIC interrupts and OSD processes are bound to the same NUMA node.
 NICs receive packets and the OSD's recvmsg runs on the same NUMA node (see the IRQ-binding sketch below).
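A minimal sketch of the binding, assuming a NIC whose queue interrupts are named after the interface and a NUMA node spanning cores 0-23; the interface name, IRQ matching, and core ranges are illustrative:

  # find the NIC's local NUMA node and that node's CPU list
  cat /sys/class/net/eth0/device/numa_node
  cat /sys/devices/system/node/node0/cpulist
  # stop irqbalance first so it does not overwrite the affinity
  systemctl stop irqbalance
  # steer each of the NIC's queue interrupts to cores of the local node
  for irq in $(grep eth0 /proc/interrupts | cut -d: -f1 | tr -d ' '); do
      echo 0-23 > /proc/irq/$irq/smp_affinity_list
  done
  # start the OSDs that use this NIC on the same node
  numactl --cpunodebind=0 --membind=0 ceph-osd -f -i 0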
[Charts: multi-port NIC deployment vs. single-port for Rand Write / Rand Read / 70/30 Rand RW at 4K and 8K block sizes. IOPS is 3%–7% higher and latency 0%–6% lower.]
[Diagram: NIC placement compared. In one layout all four NICs (NIC0–NIC3) sit on NUMA0 while OSDs run on NUMA0–NUMA3; in the multi-port layout each NUMA node gets its own NIC, so NIC interrupts and the local OSDs stay on the same node.]
③ DDR Multi-channel connection
[Chart: 16 vs. 12 DDR channels, 4KB block size. IOPS is 7% higher for Rand Write and 11% higher for Rand Read.]
 With 16 DDR channels, random-write IOPS is 7% higher than with 12 channels (random read: 11% higher).
④ Messenger throttler
 Gradually increase osd_client_message_cap; at a value of 1000, IOPS increases by 3.7% (configuration sketch below).
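A minimal ceph.conf sketch for raising the messenger throttle; 1000 is simply the value that peaked in this test, not a universal recommendation:

  [osd]
  # allow more in-flight client messages per OSD before the messenger throttles
  osd_client_message_cap = 1000
  # the byte-based throttle, osd_client_message_size_cap, can be tuned alongside it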
[Chart: 4K Rand Write IOPS gain as osd_client_message_cap is swept from 0 to 1600; the gain peaks at 3.7% around a cap of 1000.]
⑤ Long queue waiting time in the OSD
 The average length of the OPQueue is less than 10 and its queuing time is short; the kv_queue and kv_committing queues are longer than 30 and their queuing time is long (see the perf dump example below).
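The queue lengths and latency shares below come from the OSD's performance counters; they can be inspected on a running OSD through the admin socket, for example (the OSD id is illustrative):

  # dump all perf counters, including state_kv_queued_lat and state_kv_commiting_lat
  ceph daemon osd.0 perf dump
  # recent slow ops with their per-stage timestamps
  ceph daemon osd.0 dump_historic_ops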
[Diagram: OSD write path. Messenger (msgr-worker threads) enqueues OPs into the OPQueue; OSD threads dequeue them and feed BlueStore's kv_queue and bstore_aio (data disk); bstore_kv_sync swaps kv_queue into kv_committing and flushes the WAL+DB disk; kv_committing_to_finalize is drained by bstore_kv_final, which queues onCommits back to the OSD threads via the ContextQueue. Labeled latencies: op_before_dequeue_op_lat 16%, state_kv_queued_lat 26%, state_kv_commiting_lat 27%.]
[Pie chart: latency breakdown. send 0%, recv 1%, dispatch/op_before_dequeue_op_lat 17%, op_w_prepare_latency 3%, state_aio_wait_lat 12%, state_kv_queued_lat 27%, kv_sync_lat 7%, kv_final_lat 3%, wait in queues 28%; state_kv_commiting_lat = kv_sync_lat + kv_final_lat + wait in queues = 38%.]
CPU partition affinity
[Charts: CPU partition affinity vs. baseline for Rand Write / Rand Read / 70/30 Rand RW at 4K and 8K block sizes. IOPS gains range from 0% to 21%; latency reductions range from 0% to 18%.]
 CPU partitioning separates the msgr-worker/tp_osd_tp threads from the bstore threads so that the two groups are scheduled fairly (pinning sketch below).
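One way to apply such a partition on a running OSD, sketched below with taskset; the core ranges and the single-OSD assumption are illustrative, while the thread-name patterns follow the msgr-worker/tp_osd_tp/bstore names used above:

  OSD_PID=$(pgrep -xo ceph-osd)        # first ceph-osd PID; repeat per OSD
  # pin messenger and op worker threads to one core set (ranges are illustrative)
  for tid in $(ps -T -o tid= -o comm= -p "$OSD_PID" | awk '/msgr-worker|tp_osd_tp/ {print $1}'); do
      taskset -pc 0-15 "$tid"
  done
  # pin BlueStore threads (bstore_kv_sync, bstore_kv_final, bstore_aio, etc.) to another set
  for tid in $(ps -T -o tid= -o comm= -p "$OSD_PID" | awk '/bstore/ {print $1}'); do
      taskset -pc 16-23 "$tid"
  done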
⑥ Use 64K PageSize
 Use CEPH_PAGE_SHIFT for compatibility with various page sizes.
[Chart: IOPS ratio of 64K to 4K kernel page size at 4KB block size: RandWrite 1.12, RandRead 1.17, RandReadWrite 1.15.]
 Compared with 4K pages, 64K pages reduce TLB misses and improve performance by 10%+.
Tips:
 Use small-page alignment to reduce memory waste in bufferlist::reserve (see the sketch below).
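A minimal sketch of what page-size-aware code looks like, in the spirit of CEPH_PAGE_SHIFT (deriving the page size at run time instead of hard-coding 4K); the function names are illustrative, not Ceph's actual implementation:

  // page_align.cc -- round buffer lengths to the kernel page size at run time
  #include <unistd.h>
  #include <cstddef>
  #include <cstdio>

  static const size_t kPageSize = sysconf(_SC_PAGESIZE);   // 4096 or 65536 on aarch64

  // Round len up to a whole number of kernel pages (what CEPH_PAGE_SIZE-based code does).
  static size_t round_up_to_page(size_t len) {
    return (len + kPageSize - 1) & ~(kPageSize - 1);
  }

  // With 64K pages, rounding every small buffer up to kPageSize wastes memory, hence the
  // tip to keep small allocations 4K-aligned instead of page-aligned.
  static size_t round_up_4k(size_t len) {
    return (len + 4095) & ~size_t(4095);
  }

  int main() {
    std::printf("page=%zu  6000 -> %zu (page) vs %zu (4K)\n",
                kPageSize, round_up_to_page(6000), round_up_4k(6000));
  }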
64K PageSize Issues
 Write Amplification Issue
When bluefs_buffered_io is set to true, metadata is written with buffered I/O and sync_file_range is called, so the kernel writes data back to disk in whole pages. The write amplification factor is 2.46 with 4K pages and 5.46 with 64K pages.
When bluefs_buffered_io is set to false, metadata is written with direct I/O and sync_file_range is not called; the amplification factor is 2.29. Excessive writes shorten disk life, so set bluefs_buffered_io to false (configuration sketch below).
 Tcmalloc and kernel page size issue
When the tcmalloc page size is smaller than the kernel page size, memory usage keeps increasing until it approaches osd_memory_target and performance deteriorates significantly. Ensure that the tcmalloc page size is greater than the kernel page size.
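A minimal ceph.conf sketch for the BlueFS setting; matching the tcmalloc page size to the kernel page size is a gperftools build-time change and is not shown here:

  [osd]
  # write BlueFS metadata with direct I/O so the kernel does not write back whole 64K pages
  bluefs_buffered_io = false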
⑦ Optimized crc32c in RocksDB
RocksDB's crc32c_arm64 is supported in Ceph since Pacific. Backporting this feature to an earlier release improves performance by about 3% (see the sketch below).
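For illustration, the sketch below shows the kind of hardware CRC32C loop that RocksDB's crc32c_arm64 path relies on, using the ARMv8 CRC32 intrinsics; it is a simplified stand-in, not the backported code itself (build with -march=armv8-a+crc):

  // crc32c_hw.cc -- CRC32C (Castagnoli) using aarch64 CRC instructions
  #include <arm_acle.h>
  #include <cstdint>
  #include <cstring>
  #include <cstddef>

  uint32_t crc32c_hw(uint32_t crc, const uint8_t* data, size_t len) {
    crc = ~crc;                               // standard CRC32C pre-inversion
    while (len >= 8) {                        // one CRC32CX instruction per 8 bytes
      uint64_t chunk;
      std::memcpy(&chunk, data, 8);
      crc = __crc32cd(crc, chunk);
      data += 8;
      len -= 8;
    }
    while (len--) {                           // byte tail
      crc = __crc32cb(crc, *data++);
    }
    return ~crc;                              // post-inversion
  }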
Questions?
THANK YOU