Ilsoo Byun (ilsoo.byun@sk.com)
Network-IT Convergence R&D Center
SDS Tech. Lab
CEPH Optimization on SSD
Introduction
3
Introduction to the NIC R&D Center
• COSMOS (Composable, Open, Scalable, Mobile-Oriented System)
• Innovating Telco infrastructure
• Building an open, all-IT infrastructure
Software Defined Storage
4
Traditional Storage vs. Emerging SDS
Architecture: Traditional Storage: proprietary H/W, proprietary S/W only | Emerging SDS: commodity H/W, open-source S/W available
Benefit: Traditional Storage: turnkey solution | Emerging SDS: low cost, flexible management, high scalability
Considerations: Traditional Storage: high cost, inflexible management, limited scalability, vendor lock-in | Emerging SDS: tuning & development effort needed
+
SSD
Why does using SSD make a difference?
• SSD
• Interface: same physical interface as HDDs (SATA/SAS), or the same logical interface as HDDs (NVMe)
• Media characteristics: lower latency, higher parallelism
• Why was SSD so successful?
• Interface compatibility, and
• In mobile systems: higher reliability thanks to no moving parts
• In enterprise systems: higher performance (1 SSD can …)
5
SSD Optimization Example
• Linux IO Scheduler
• No-op, CFQ, Deadline
[Chart: Linux IO Scheduler: IOPS and Latency (us) for noop, CFQ, and deadline]
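As a side note, the scheduler can be switched per device at runtime through sysfs. A minimal C++ sketch (the device name sda is just an example, and root privileges are needed):

// Switch the IO scheduler of /dev/sda to "noop" via sysfs
// (equivalent to: echo noop > /sys/block/sda/queue/scheduler)
#include <fstream>
#include <iostream>

int main() {
    std::ofstream sched("/sys/block/sda/queue/scheduler");
    if (!sched) {
        std::cerr << "cannot open scheduler file (need root?)\n";
        return 1;
    }
    sched << "noop" << std::endl;   // or "cfq" / "deadline"
    return 0;
}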
6
Performance Optimization
Storage (HDD, SSD, …) · Network (1g, 10g, …) · Computing (CPU, Memory, …)
Theory -> Measurements -> Analysis -> Optimization
• H/W Configuration
• Parameter Optimization
• Source Modification
7
• Good design will reduce performance problems:
• Identify the areas that will yield the largest performance boosts over the widest variety of situations, and focus attention on those areas
• Consider peak load, not average load: performance issues arise when loads are high
Performance Tuning Principles
Source: http://mscom.co.il/downloads/sqlpresentations/12060.pdf
8
Target Workload
Our target: the IOPS-sensitive workload
• IOPS-sensitive: block-based (iSCSI, RBD), SSD, block device / DB, random IO
• Throughput-sensitive: file-based (NFS, CIFS), SSD or HDD, contents sharing, sequential IO
• Capacity-sensitive: object-based (S3), HDD, archiving / backup, sequential IO
9
CEPH Architecture
10
[Diagram: clients access RADOS through LIBRADOS; RBD, RGW, and CEPH FS are built on top; the RADOS cluster consists of MONs, an MDS, and OSDs, with data placement handled by CRUSH]
11
A file is striped and replicated into objects; HASH: pgid = hash(oid) & mask assigns each object to a placement group (e.g. Placement Group #1, #2); CRUSH then maps each placement group onto OSDs (OSD #1 .. OSD #5) using:
1. pgid
2. replication factor
3. crush map
4. placement rules
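As a rough illustration of the hash-and-mask step only (Ceph uses its own stable string hash, rjenkins, rather than std::hash, and CRUSH itself is omitted here):

// Illustrative only: map an object name to a placement group id
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

int main() {
    const uint32_t pg_num = 128;          // PGs in the pool (power of two here)
    const uint32_t mask   = pg_num - 1;   // the "mask" from the slide

    std::string oid = "rbd_data.1234.00000000000000a7";
    uint32_t pgid = std::hash<std::string>{}(oid) & mask;   // pgid = hash(oid) & mask

    std::cout << "object " << oid << " -> pg " << pgid << std::endl;
    // CRUSH then maps (pgid, replication factor, crush map, placement rules)
    // to an ordered set of OSDs; that part is not modeled here.
    return 0;
}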
RADOS Scalability
Reference H/W
• 10G network (cluster / public)
• Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz x 2
• 256GB Memory
• 480GB SATA SSD x 6
• CentOS 7
• Ceph (Master-0923)
12
OSD & ObjectStore
[Diagram: an IO request from KRBD (/dev/rbd0) reaches an OSD, which peers with the other OSDs and submits a Transaction to its ObjectStore (implementations: FileStore, BlueStore, MemStore), backed by the SSD]
14
ObjectStore Transaction
• All or Nothing
• Ordering
– Strong Consistency
15
All or Nothing: how atomicity of a transaction is supported
FileStore: 1. write to the journal -> 2. ack -> 3. store to the filesystem (sync interval = 5 ms) -> 4. applied
BlueStore: 1. write data to storage -> 2. metadata update in RocksDB -> 3. ack & applied
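A toy, self-contained model of the two commit flows above (not Ceph code; it only prints the ordering):

#include <iostream>
#include <string>

// FileStore: write-ahead journal. The op is durable (and acked) once it is
// in the journal; it is applied to the filesystem later, batched on a sync interval.
void filestore_submit(const std::string& op) {
    std::cout << "1. append " << op << " to journal\n";
    std::cout << "2. ack client\n";
    std::cout << "3. store to filesystem (batched, ~5 ms sync interval)\n";
    std::cout << "4. mark applied\n";
}

// BlueStore: write the data blocks first, then commit the metadata for that
// write as one RocksDB transaction; the ack is sent only after the commit.
void bluestore_submit(const std::string& op) {
    std::cout << "1. write data blocks of " << op << " to the block device\n";
    std::cout << "2. commit metadata in a RocksDB transaction\n";
    std::cout << "3. ack client (write is both durable and applied)\n";
}

int main() {
    filestore_submit("write A");
    bluestore_submit("write A");
    return 0;
}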
16
• Theorem: you can have at most two of these properties for any shared-data system:
• Consistency
• Availability
• Tolerance to network Partition
CAP Theorem
17
Consistency Implementation
• Strong Consistency
• w(X) = A
[Diagram: after w(X) = A, every replica holds A]
18
Consistency Implementation
• Weak Consistency
• w(X) = A
• r(X) = ?
[Diagram: after w(X) = A, reads on the updated replicas return r(X) = A, while a read on a not-yet-updated replica may return r(X) = ?, i.e. a stale value]
19
CEPH Consistency Model
[Diagram: objects are hashed to placement groups (PGs); CRUSH maps each PG to its acting set of OSDs]
20
CEPH Consistency Model
[Diagram: the same mapping, hashing -> PG -> CRUSH -> acting set, with IO to the PG blocked where needed: causal consistency!]
21
IBM Cleversafe dsNet
• Because it adopts eventual consistency, a filesystem layered on top of it can misbehave
• IBM describes these as temporary, non-critical errors
• Orphaned data on writes
• Read errors on deletes
[Diagram: under eventual consistency, replicas may temporarily diverge (A/B) after the ack; under strong consistency, the ack comes only after all replicas agree]
Why is the consistency model important?
22
Example of a Strongly Consistent Application
• Ceph Tech Talk (2016/1/28)
– PostgreSQL on Ceph under Mesos/Aurora with Docker
• Read: databases don't have an IO depth of 64; it's 1.
• Write: databases want each and every transaction to be acknowledged by the storage layer
– a full round-trip down to the storage layer
23
Metadata Processing
BlueStore Design
[Diagram: data goes straight to a BlockDevice through the Allocator; metadata goes to RocksDB, which runs on BlueFS via BlueRocksEnv; RocksDB content can be split across WAL, DB, and SLOW block devices]
25
Distributed Metadata
• Traditional: clients query a central metadata server in front of the storage nodes, which becomes a bottleneck
• Ceph: every storage node carries its own metadata; clients locate data by computing CRUSH instead of querying a server
26
Onode (the in-memory object handle):
• atomic_int nref; ghobject_t oid; string key; Collection *c
• bluestore_onode_t onode: uint64_t nid; uint64_t size; map<string, bufferptr> attrs; vector<shard_info> extent_map_shards (each shard_info: uint32_t offset, bytes, extents)
• ExtentMap extent_map: an intrusive::set of Extent entries, each pointing to a Blob, plus a vector<Shard> of {bool loaded, bool dirty}
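The same structure condensed into a self-contained C++ sketch (stand-in types for ghobject_t, bufferptr, and Collection; the real BlueStore headers contain many more fields):

#include <atomic>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Stand-ins for Ceph types, only so the sketch compiles on its own.
using ghobject_t = std::string;        // real type: ghobject_t
using bufferptr  = std::string;        // real type: ceph::bufferptr
struct Collection;                     // owning collection (opaque here)

struct shard_info {                    // one entry of extent_map_shards
    uint32_t offset;
    uint32_t bytes;
    uint32_t extents;
};

struct bluestore_onode_t {             // the persistent (encoded) part
    uint64_t nid = 0;
    uint64_t size = 0;
    std::map<std::string, bufferptr> attrs;
    std::vector<shard_info> extent_map_shards;
};

struct ExtentMap {                     // in reality an intrusive::set<Extent>,
    struct Shard { bool loaded; bool dirty; };   // each Extent pointing to a Blob
    std::vector<Shard> shards;
};

struct Onode {                         // the in-memory object handle
    std::atomic<int> nref{0};          // reference count
    ghobject_t oid;
    std::string key;
    Collection* c = nullptr;
    bluestore_onode_t onode;
    ExtentMap extent_map;
};

int main() { Onode o; o.onode.size = 4096; return 0; }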
27
BlueStore Write Transaction
A 4KB write to /dev/rbd0 becomes one transaction on the OSD:
• Transaction::OP_TOUCH
• Transaction::OP_SETALLOCHINT
• Transaction::OP_WRITE -> data -> block
• Transaction::OP_SETATTRS -> metadata -> block.db
• Transaction::OP_OMAP_SETKEYS -> metadata -> block.db
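A self-contained toy model of that transaction; the types are illustrative stand-ins, not the real ObjectStore::Transaction API:

#include <iostream>
#include <vector>

// The op codes a 4 KB RBD write turns into (names taken from the slide).
enum class Op { OP_TOUCH, OP_SETALLOCHINT, OP_WRITE, OP_SETATTRS, OP_OMAP_SETKEYS };

struct Transaction {                   // toy stand-in for ObjectStore::Transaction
    std::vector<Op> ops;
    void add(Op op) { ops.push_back(op); }
};

int main() {
    // write 4 KB to /dev/rbd0 -> one BlueStore transaction on the OSD:
    Transaction t;
    t.add(Op::OP_TOUCH);          // ensure the object exists
    t.add(Op::OP_SETALLOCHINT);   // expected object / write sizes
    t.add(Op::OP_WRITE);          // 4 KB of data  -> "block" device
    t.add(Op::OP_SETATTRS);       // object attrs  -> metadata (block.db)
    t.add(Op::OP_OMAP_SETKEYS);   // omap keys     -> metadata (block.db)

    std::cout << "transaction with " << t.ops.size()
              << " ops, applied atomically by BlueStore\n";
    return 0;
}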
28
FileStore Write Transaction
The same 4KB write to /dev/rbd0 becomes the same set of ops, but the whole transaction is first written to the FileStore journal on the SSD:
• Transaction::OP_TOUCH
• Transaction::OP_SETALLOCHINT
• Transaction::OP_WRITE
• Transaction::OP_SETATTRS
• Transaction::OP_OMAP_SETKEYS
29
BlueStore Transaction Processing
[Diagram: OP_TOUCH, OP_SETALLOCHINT, OP_SETATTRS, and OP_OMAP_SETKEYS are turned into RocksDB transactions (metadata); OP_WRITE is prepared through a WriteContext and the Allocator (data); the metadata update is then committed via RocksDB]
bluestore_allocator = bitmap | stupid
30
Metadata Update Overhead
(write latency vs. metadata-update latency)
• HDD: 3,000us write, 20us metadata update -> 0.7%
• SSD: 180us write, 20us metadata update -> 11%?
• SSD: 180us write, 40us↑ metadata update -> 22%↑
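The percentages are simply metadata-update time divided by write time:
HDD: 20us / 3,000us ≈ 0.7%
SSD: 20us / 180us ≈ 11%
SSD with a heavier metadata path: 40us / 180us ≈ 22%
So metadata handling that is negligible on HDD becomes a first-order cost on SSD.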
31
Metadata Tuning
• Encode, only if the data is modified (Blob::encode)
• Compression
• ex. 00000123 -> 5123
• Sharding
[Diagram: a write dirties only the touched shard of the Onode metadata]
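A minimal sketch of the first idea above, encoding a metadata shard only when it was actually modified (illustrative, not the real Blob::encode):

#include <iostream>
#include <string>
#include <vector>

// Toy metadata shard: only shards marked dirty are re-encoded on commit.
struct Shard {
    std::string data;
    bool dirty = false;
};

std::string encode(const Shard& s) { return "<encoded:" + s.data + ">"; }

int main() {
    std::vector<Shard> shards{{"a"}, {"b"}, {"c"}};
    shards[1].data = "b2";           // a 4 KB write touches only shard 1
    shards[1].dirty = true;

    for (auto& s : shards) {
        if (!s.dirty) continue;      // skip clean shards: no encode, no write
        std::cout << "re-encode " << encode(s) << "\n";
        s.dirty = false;
    }
    return 0;
}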
32
Pre-allocation vs No pre-allocation
[Chart: Metadata Overhead (randwrite, 4KB, 100MB): metadata size (KB) vs. shard size (4, 64, 128, 256, 4096; one marked as the default), comparing no pre-alloc vs. pre-alloc]
33
Sending the Ack
IO Path: Request -> op_wq -> write to SSD, flush -> RocksDB sync transaction -> kv_committing -> finisher queue -> pipe out_q -> Ack
(BlueStore commit: 1. write, 2. metadata update, 3. ack & applied)
35
ShardedOpWQ
[Diagram: incoming ops are spread across shards; each shard has its own worker threads feeding the ObjectStore]
• osd_op_num_shards
• osd_op_threads
• {# of Shards} x {# of Shard Workers} = total threads
• osd_op_queue = prio | wpq
36
{PGID} % {osd_op_num_shards} = target shard
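As a sketch, the thread count and shard selection boil down to the following (the per-shard worker count is kept as a plain variable, since the slide only calls it "# of Shard Worker"):

#include <cstdint>
#include <iostream>

int main() {
    // Example values only; these are set via ceph.conf in a real cluster.
    const uint32_t osd_op_num_shards = 5;   // number of shards in ShardedOpWQ
    const uint32_t workers_per_shard = 2;   // "# of Shard Worker" per shard

    // {# of Shards} x {# of Shard Workers} = total worker threads
    const uint32_t total_threads = osd_op_num_shards * workers_per_shard;

    // All ops of one PG land in the same shard, which preserves per-PG ordering:
    const uint64_t pgid = 42;                          // placement group id
    const uint64_t target_shard = pgid % osd_op_num_shards;

    std::cout << "total threads: " << total_threads
              << ", PG " << pgid << " -> shard " << target_shard << "\n";
    return 0;
}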
IO Path Bottleneck
[Diagram: Request -> op_wq -> write to SSD, flush -> RocksDB sync transaction -> kv_committing -> finisher queue -> pipe out_q -> Ack]
37
IO Path Optimized
[Diagram: the same path, but the finisher queue is now per shard: Request -> op_wq -> write to SSD, flush -> RocksDB sync transaction -> kv_committing -> per-shard finisher queue -> pipe out_q -> Ack]
bstore_shard_finishers = true
38
[Chart: Performance Improvement: IOPS and Latency (ms), no shard finishers vs. shard finishers (50%↑)]
• 09/23, commit 4619bc09703d429abd41554b693294236c192268: bstore_shard_finishers (Boolean)
39
Next Target
[Diagram: the same optimized path; the next target is the RocksDB sync transaction still on the ack path: Request -> op_wq -> write to SSD, flush -> RocksDB sync transaction -> kv_committing -> shard finisher queue -> pipe out_q -> Ack]
40
RocksDB Sync Transaction Overhead
[Diagram: each request is written to the Memtable and appended to the Logfile (transaction log); Memtables are later flushed to SSTFiles; log commits can be batch-committed]
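For reference, this is the behaviour the standard RocksDB C++ API exposes: a synced write of a batch is durable in the transaction log before it returns, which is exactly the latency that sits on the ack path (the path and keys below are made up):

#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/bluestore-kv-demo", &db);
    assert(s.ok());

    // Several metadata updates batched into one atomic commit...
    rocksdb::WriteBatch batch;
    batch.Put("onode:obj1", "...");
    batch.Put("omap:obj1:key", "...");

    // ...and synced to the transaction log before returning, which is what
    // makes the commit durable and what puts fsync latency on the IO path.
    rocksdb::WriteOptions wopts;
    wopts.sync = true;
    s = db->Write(wopts, &batch);
    assert(s.ok());

    delete db;
    return 0;
}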
41
Etc.
CPU Usage
• 4KB Random Write
• 155 K IOPS
• Low!
• Unbalanced!
(image source: http://www.xblafans.com/wp-content/uploads//2011/09/WHY-NO-WORK.jpg)
43
HT & RSS
• Effect of HT (Hyper-Threading): distortion of CPU usage
• RSS (Receive Side Scaling)
[Diagram: a dual-socket server (CPU#1, CPU#2) with DIMMs per socket and a QPI link between sockets; the NIC is attached to CPU#1 via PCIe]
source: https://software.intel.com/en-us/articles/introduction-to-hyper-threading-technology
44
Interrupt Coalescing
• rx_usecs = 10 vs. rx_usecs = 0
45
Optimization Result
• 4KB Random Write
• 164 K IOPS (10%↑)
46
Conclusions
• SSDs show different performance characteristics from HDDs
• S/W optimization on SSD requires in-depth knowledge of the system
• We shared our experience from optimizing Ceph:
• Ceph Consistency Model
• Metadata Processing
• Shard Finishers
• OS Optimizations
47
Q&A
Thank You


Editor's Notes

  • #2 Hello, I'm Ilsoo Byun, and today I'll be talking about Ceph performance optimization on SSDs. I work in SK Telecom's SDS Tech. Lab on Ceph performance optimization, in particular on BlueStore, Ceph's new storage component. In this talk I'll share the experience we gained from optimizing Ceph, and I've organized the content so that it is useful to anyone interested in Ceph's performance or its internals. As far as I know, a Ceph member came to Deview in 2013 and presented Ceph itself, so I'll cover material that goes a bit deeper than what is generally known.
  • #4 We are often asked why SK Telecom works on Ceph. The NIC R&D Center, which I belong to, researches technologies that can innovate Telco infrastructure under the banner of COSMOS. The key word there is "open": unlike previous approaches, we try to be open all the way from the hardware up to the software. So we are working to adopt commodity hardware, to make the most of open source, and to contribute back to the community. Since our lab researches storage, SDS naturally became our topic.
  • #5 Draw the audience's attention to SSDs at the end.
  • #6 The next slide shows the software problems that arise when using SSDs.
  • #18 Ceph is a CP system.
  • #44 The imbalance can be seen in two ways: imbalance between the upper and lower processors, and imbalance between the odd and even processors.