Ilsoo Byun (ilsoo.byun@sk.com)
Network-IT Convergence R&D Center
SDS Tech. Lab
CEPH Optimization on SSD
Introduction
3
Introduction to the NIC R&D Center
• COSMOS (Composable, Open, Scalable, Mobile-Oriented System)
• Innovating Telco infrastructure
• Building an open, all-IT infrastructure
Software Defined Storage
4
Traditional Storage vs. Emerging SDS
Architecture: Traditional Storage: proprietary H/W, proprietary S/W only | Emerging SDS: commodity H/W, open-source S/W available
Benefit: Traditional Storage: turnkey solution | Emerging SDS: low cost, flexible management, high scalability
Considerations: Traditional Storage: high cost, inflexible management, limited scalability, vendor lock-in | Emerging SDS: tuning & development effort needed
+
SSD
Why does using SSD make a difference?
• SSD
• Interface: same physical interface as HDDs (SATA/SAS), or the same logical interface as HDDs (NVMe)
• Media characteristics: lower latency, higher parallelism
• Why was SSD so successful?
• Interface compatibility, and
• In mobile systems: higher reliability thanks to no moving parts
• In enterprise systems: higher performance (1 SSD can …)
5
SSD Optimization Example
• Linux IO Scheduler
• No-op, CFQ, Deadline
[Chart: Linux IO Scheduler: IOPS and Latency (us) for noop, CFQ, and deadline]
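As a side note, the scheduler can be switched per device at runtime through sysfs. A minimal C++ sketch (the device name sda is just an example, and root privileges are needed):

// Switch the IO scheduler of /dev/sda to "noop" via sysfs
// (equivalent to: echo noop > /sys/block/sda/queue/scheduler)
#include <fstream>
#include <iostream>

int main() {
    std::ofstream sched("/sys/block/sda/queue/scheduler");
    if (!sched) {
        std::cerr << "cannot open scheduler file (need root?)\n";
        return 1;
    }
    sched << "noop" << std::endl;   // or "cfq" / "deadline"
    return 0;
}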
6
Performance Optimization
Storage (HDD, SSD, …) · Network (1g, 10g, …) · Computing (CPU, Memory, …)
Theory -> Measurements -> Analysis -> Optimization
• H/W Configuration
• Parameter Optimization
• Source Modification
7
• Good design will reduce performance problems:
• Identify the areas that will yield the largest performance boosts over the widest variety of situations, and focus attention on those areas
• Consider peak load, not average load: performance issues arise when loads are high
Performance Tuning Principles
Source: http://mscom.co.il/downloads/sqlpresentations/12060.pdf
8
Target Workload
Our target: the IOPS-sensitive workload
• IOPS-sensitive: block-based (iSCSI, RBD), SSD, block device / DB, random IO
• Throughput-sensitive: file-based (NFS, CIFS), SSD or HDD, contents sharing, sequential IO
• Capacity-sensitive: object-based (S3), HDD, archiving / backup, sequential IO
9
CEPH Architecture
10
[Diagram: clients access RADOS through LIBRADOS; RBD, RGW, and CEPH FS are built on top; the RADOS cluster consists of MONs, an MDS, and OSDs, with data placement handled by CRUSH]
11
A file is striped and replicated into objects; HASH: pgid = hash(oid) & mask assigns each object to a placement group (e.g. Placement Group #1, #2); CRUSH then maps each placement group onto OSDs (OSD #1 .. OSD #5) using:
1. pgid
2. replication factor
3. crush map
4. placement rules
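As a rough illustration of the hash-and-mask step only (Ceph uses its own stable string hash, rjenkins, rather than std::hash, and CRUSH itself is omitted here):

// Illustrative only: map an object name to a placement group id
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

int main() {
    const uint32_t pg_num = 128;          // PGs in the pool (power of two here)
    const uint32_t mask   = pg_num - 1;   // the "mask" from the slide

    std::string oid = "rbd_data.1234.00000000000000a7";
    uint32_t pgid = std::hash<std::string>{}(oid) & mask;   // pgid = hash(oid) & mask

    std::cout << "object " << oid << " -> pg " << pgid << std::endl;
    // CRUSH then maps (pgid, replication factor, crush map, placement rules)
    // to an ordered set of OSDs; that part is not modeled here.
    return 0;
}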
RADOS Scalability
Reference H/W
• 10G network (cluster / public)
• Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz x 2
• 256GB Memory
• 480GB SATA SSD x 6
• CentOS 7
• Ceph (Master-0923)
12
OSD & ObjectStore
[Diagram: an IO request from KRBD (/dev/rbd0) reaches an OSD, which peers with the other OSDs and submits a Transaction to its ObjectStore (implementations: FileStore, BlueStore, MemStore), backed by the SSD]
14
ObjectStore Transaction
• All or Nothing
• Ordering
– Strong Consistency
15
All or Nothing: how atomicity of a transaction is supported
FileStore: 1. write to the journal -> 2. ack -> 3. store to the filesystem (sync interval = 5 ms) -> 4. applied
BlueStore: 1. write data to storage -> 2. metadata update in RocksDB -> 3. ack & applied
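A toy, self-contained model of the two commit flows above (not Ceph code; it only prints the ordering):

#include <iostream>
#include <string>

// FileStore: write-ahead journal. The op is durable (and acked) once it is
// in the journal; it is applied to the filesystem later, batched on a sync interval.
void filestore_submit(const std::string& op) {
    std::cout << "1. append " << op << " to journal\n";
    std::cout << "2. ack client\n";
    std::cout << "3. store to filesystem (batched, ~5 ms sync interval)\n";
    std::cout << "4. mark applied\n";
}

// BlueStore: write the data blocks first, then commit the metadata for that
// write as one RocksDB transaction; the ack is sent only after the commit.
void bluestore_submit(const std::string& op) {
    std::cout << "1. write data blocks of " << op << " to the block device\n";
    std::cout << "2. commit metadata in a RocksDB transaction\n";
    std::cout << "3. ack client (write is both durable and applied)\n";
}

int main() {
    filestore_submit("write A");
    bluestore_submit("write A");
    return 0;
}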
16
• Theorem: you can have at most two of these properties for any shared-data system:
• Consistency
• Availability
• Tolerance to network Partition
CAP Theorem
17
Consistency Implementation
• Strong Consistency
• w(X) = A
[Diagram: after w(X) = A, every replica holds A]
18
Consistency Implementation
• Weak Consistency
• w(X) = A
• r(X) = ?
[Diagram: after w(X) = A, reads on the updated replicas return r(X) = A, while a read on a not-yet-updated replica may return r(X) = ?, i.e. a stale value]
19
CEPH Consistency Model
[Diagram: objects are hashed to placement groups (PGs); CRUSH maps each PG to its acting set of OSDs]
20
CEPH Consistency Model
[Diagram: the same mapping, hashing -> PG -> CRUSH -> acting set, with IO to the PG blocked where needed: causal consistency!]
21
IBM Cleversafe dsNet
• Because it adopts eventual consistency, a filesystem layered on top of it can misbehave
• IBM describes these as temporary, non-critical errors
• Orphaned data on writes
• Read errors on deletes
[Diagram: under eventual consistency, replicas may temporarily diverge (A/B) after the ack; under strong consistency, the ack comes only after all replicas agree]
Why is the consistency model important?
22
Example of a Strongly Consistent Application
• Ceph Tech Talk (2016/1/28)
– PostgreSQL on Ceph under Mesos/Aurora with Docker
• Read: databases don't have an IO depth of 64; it's 1.
• Write: databases want each and every transaction to be acknowledged by the storage layer
– a full round-trip down to the storage layer
23
Metadata Processing
BlueStore Design
[Diagram: data goes straight to a BlockDevice through the Allocator; metadata goes to RocksDB, which runs on BlueFS via BlueRocksEnv; RocksDB content can be split across WAL, DB, and SLOW block devices]
25
Distributed Metadata
• Traditional: clients query a central metadata server in front of the storage nodes, which becomes a bottleneck
• Ceph: every storage node carries its own metadata; clients locate data by computing CRUSH instead of querying a server
26
Onode (the in-memory object handle):
• atomic_int nref; ghobject_t oid; string key; Collection *c
• bluestore_onode_t onode: uint64_t nid; uint64_t size; map<string, bufferptr> attrs; vector<shard_info> extent_map_shards (each shard_info: uint32_t offset, bytes, extents)
• ExtentMap extent_map: an intrusive::set of Extent entries, each pointing to a Blob, plus a vector<Shard> of {bool loaded, bool dirty}
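The same structure condensed into a self-contained C++ sketch (stand-in types for ghobject_t, bufferptr, and Collection; the real BlueStore headers contain many more fields):

#include <atomic>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Stand-ins for Ceph types, only so the sketch compiles on its own.
using ghobject_t = std::string;        // real type: ghobject_t
using bufferptr  = std::string;        // real type: ceph::bufferptr
struct Collection;                     // owning collection (opaque here)

struct shard_info {                    // one entry of extent_map_shards
    uint32_t offset;
    uint32_t bytes;
    uint32_t extents;
};

struct bluestore_onode_t {             // the persistent (encoded) part
    uint64_t nid = 0;
    uint64_t size = 0;
    std::map<std::string, bufferptr> attrs;
    std::vector<shard_info> extent_map_shards;
};

struct ExtentMap {                     // in reality an intrusive::set<Extent>,
    struct Shard { bool loaded; bool dirty; };   // each Extent pointing to a Blob
    std::vector<Shard> shards;
};

struct Onode {                         // the in-memory object handle
    std::atomic<int> nref{0};          // reference count
    ghobject_t oid;
    std::string key;
    Collection* c = nullptr;
    bluestore_onode_t onode;
    ExtentMap extent_map;
};

int main() { Onode o; o.onode.size = 4096; return 0; }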
27
BlueStore Write Transaction
A 4KB write to /dev/rbd0 becomes one transaction on the OSD:
• Transaction::OP_TOUCH
• Transaction::OP_SETALLOCHINT
• Transaction::OP_WRITE -> data -> block
• Transaction::OP_SETATTRS -> metadata -> block.db
• Transaction::OP_OMAP_SETKEYS -> metadata -> block.db
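A self-contained toy model of that transaction; the types are illustrative stand-ins, not the real ObjectStore::Transaction API:

#include <iostream>
#include <vector>

// The op codes a 4 KB RBD write turns into (names taken from the slide).
enum class Op { OP_TOUCH, OP_SETALLOCHINT, OP_WRITE, OP_SETATTRS, OP_OMAP_SETKEYS };

struct Transaction {                   // toy stand-in for ObjectStore::Transaction
    std::vector<Op> ops;
    void add(Op op) { ops.push_back(op); }
};

int main() {
    // write 4 KB to /dev/rbd0 -> one BlueStore transaction on the OSD:
    Transaction t;
    t.add(Op::OP_TOUCH);          // ensure the object exists
    t.add(Op::OP_SETALLOCHINT);   // expected object / write sizes
    t.add(Op::OP_WRITE);          // 4 KB of data  -> "block" device
    t.add(Op::OP_SETATTRS);       // object attrs  -> metadata (block.db)
    t.add(Op::OP_OMAP_SETKEYS);   // omap keys     -> metadata (block.db)

    std::cout << "transaction with " << t.ops.size()
              << " ops, applied atomically by BlueStore\n";
    return 0;
}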
28
FileStore Write Transaction
The same 4KB write to /dev/rbd0 becomes the same set of ops, but the whole transaction is first written to the FileStore journal on the SSD:
• Transaction::OP_TOUCH
• Transaction::OP_SETALLOCHINT
• Transaction::OP_WRITE
• Transaction::OP_SETATTRS
• Transaction::OP_OMAP_SETKEYS
29
BlueStore Transaction Processing
[Diagram: OP_TOUCH, OP_SETALLOCHINT, OP_SETATTRS, and OP_OMAP_SETKEYS are turned into RocksDB transactions (metadata); OP_WRITE is prepared through a WriteContext and the Allocator (data); the metadata update is then committed via RocksDB]
bluestore_allocator = bitmap | stupid
30
Metadata Update Overhead
(write latency vs. metadata-update latency)
• HDD: 3,000us write, 20us metadata update -> 0.7%
• SSD: 180us write, 20us metadata update -> 11%?
• SSD: 180us write, 40us↑ metadata update -> 22%↑
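The percentages are simply metadata-update time divided by write time:
HDD: 20us / 3,000us ≈ 0.7%
SSD: 20us / 180us ≈ 11%
SSD with a heavier metadata path: 40us / 180us ≈ 22%
So metadata handling that is negligible on HDD becomes a first-order cost on SSD.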
31
Metadata Tuning
• Encode, only if the data is modified (Blob::encode)
• Compression
• ex. 00000123 -> 5123
• Sharding
[Diagram: a write dirties only the touched shard of the Onode metadata]
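A minimal sketch of the first idea above, encoding a metadata shard only when it was actually modified (illustrative, not the real Blob::encode):

#include <iostream>
#include <string>
#include <vector>

// Toy metadata shard: only shards marked dirty are re-encoded on commit.
struct Shard {
    std::string data;
    bool dirty = false;
};

std::string encode(const Shard& s) { return "<encoded:" + s.data + ">"; }

int main() {
    std::vector<Shard> shards{{"a"}, {"b"}, {"c"}};
    shards[1].data = "b2";           // a 4 KB write touches only shard 1
    shards[1].dirty = true;

    for (auto& s : shards) {
        if (!s.dirty) continue;      // skip clean shards: no encode, no write
        std::cout << "re-encode " << encode(s) << "\n";
        s.dirty = false;
    }
    return 0;
}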
32
Pre-allocation vs No pre-allocation
[Chart: Metadata Overhead (randwrite, 4KB, 100MB): metadata size (KB) vs. shard size (4, 64, 128, 256, 4096; one marked as the default), comparing no pre-alloc vs. pre-alloc]
33
Sending the Ack
IO Path: Request -> op_wq -> write to SSD, flush -> RocksDB sync transaction -> kv_committing -> finisher queue -> pipe out_q -> Ack
(BlueStore commit: 1. write, 2. metadata update, 3. ack & applied)
35
ShardedOpWQ
[Diagram: incoming ops are spread across shards; each shard has its own worker threads feeding the ObjectStore]
• osd_op_num_shards
• osd_op_threads
• {# of Shards} x {# of Shard Workers} = total threads
• osd_op_queue = prio | wpq
36
{PGID} % {osd_op_num_shards} = target shard
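As a sketch, the thread count and shard selection boil down to the following (the per-shard worker count is kept as a plain variable, since the slide only calls it "# of Shard Worker"):

#include <cstdint>
#include <iostream>

int main() {
    // Example values only; these are set via ceph.conf in a real cluster.
    const uint32_t osd_op_num_shards = 5;   // number of shards in ShardedOpWQ
    const uint32_t workers_per_shard = 2;   // "# of Shard Worker" per shard

    // {# of Shards} x {# of Shard Workers} = total worker threads
    const uint32_t total_threads = osd_op_num_shards * workers_per_shard;

    // All ops of one PG land in the same shard, which preserves per-PG ordering:
    const uint64_t pgid = 42;                          // placement group id
    const uint64_t target_shard = pgid % osd_op_num_shards;

    std::cout << "total threads: " << total_threads
              << ", PG " << pgid << " -> shard " << target_shard << "\n";
    return 0;
}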
IO Path Bottleneck
[Diagram: Request -> op_wq -> write to SSD, flush -> RocksDB sync transaction -> kv_committing -> finisher queue -> pipe out_q -> Ack]
37
IO Path Optimized
[Diagram: the same path, but the finisher queue is now per shard: Request -> op_wq -> write to SSD, flush -> RocksDB sync transaction -> kv_committing -> per-shard finisher queue -> pipe out_q -> Ack]
bstore_shard_finishers = true
38
[Chart: Performance Improvement: IOPS and Latency (ms), no shard finishers vs. shard finishers (50%↑)]
• 09/23, commit 4619bc09703d429abd41554b693294236c192268: bstore_shard_finishers (Boolean)
39
Next Target
[Diagram: the same optimized path; the next target is the RocksDB sync transaction still on the ack path: Request -> op_wq -> write to SSD, flush -> RocksDB sync transaction -> kv_committing -> shard finisher queue -> pipe out_q -> Ack]
40
RocksDB Sync Transaction Overhead
[Diagram: each request is written to the Memtable and appended to the Logfile (transaction log); Memtables are later flushed to SSTFiles; log commits can be batch-committed]
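For reference, this is the behaviour the standard RocksDB C++ API exposes: a synced write of a batch is durable in the transaction log before it returns, which is exactly the latency that sits on the ack path (the path and keys below are made up):

#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bluestore-kv-demo", &db);
    assert(s.ok());

    // Several metadata updates batched into one atomic commit...
    rocksdb::WriteBatch batch;
    batch.Put("onode:obj1", "...");
    batch.Put("omap:obj1:key", "...");

    // ...and synced to the transaction log before returning, which is what
    // makes the commit durable and what puts fsync latency on the IO path.
    rocksdb::WriteOptions wopts;
    wopts.sync = true;
    s = db->Write(wopts, &batch);
    assert(s.ok());

    delete db;
    return 0;
}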
41
Etc.
CPU Usage
• 4KB Random Write
• 155 K IOPS
• Low!
• Unbalanced!
(image source: http://www.xblafans.com/wp-content/uploads//2011/09/WHY-NO-WORK.jpg)
43
HT & RSS
• Effect of HT (Hyper-Threading): distortion of CPU usage
• RSS (Receive Side Scaling)
[Diagram: a dual-socket server (CPU#1, CPU#2) with DIMMs per socket and a QPI link between sockets; the NIC is attached to CPU#1 via PCIe]
source: https://software.intel.com/en-us/articles/introduction-to-hyper-threading-technology
44
Interrupt Coalescing
• rx_usecs = 10 vs. rx_usecs = 0
45
Optimization Result
• 4KB Random Write
• 164 K IOPS (10%↑)
46
Conclusions
• SSDs show different performance characteristics from HDDs
• S/W optimization on SSD requires in-depth knowledge of the system
• We shared our experience from optimizing Ceph:
• Ceph Consistency Model
• Metadata Processing
• Shard Finishers
• OS Optimizations
47
Q&A
Thank You


Editor's Notes

  • #2 Hello, I'm Ilsoo Byun, and today I'll be talking about Ceph performance optimization on SSDs. I work in SK Telecom's SDS Tech. Lab on Ceph performance optimization, in particular on BlueStore, Ceph's new storage component. In this talk I'll share the experience we gained from optimizing Ceph, and I've organized the content so that it is useful to anyone interested in Ceph's performance or its internals. As far as I know, a Ceph member came to Deview in 2013 and presented Ceph itself, so I'll cover material that goes a bit deeper than what is generally known.
  • #4 We are often asked why SK Telecom works on Ceph. The NIC R&D Center, which I belong to, researches technologies that can innovate Telco infrastructure under the banner of COSMOS. The key word there is "open": unlike previous approaches, we try to be open all the way from the hardware up to the software. So we are working to adopt commodity hardware, to make the most of open source, and to contribute back to the community. Since our lab researches storage, SDS naturally became our topic.
  • #5 Draw the audience's attention to SSDs at the end.
  • #6 The next slide shows the software problems that arise when using SSDs.
  • #18 Ceph is a CP system.
  • #44 The imbalance can be seen in two ways: imbalance between the upper and lower processors, and imbalance between the odd and even processors.