Ceph Day Seoul - The Anatomy of Ceph I/O
1. The Anatomy of Ceph I/O
Sang-Hoon Kim, Dong-Yun Lee, Sanghoon Han, Kisik Jeong, and Jin-Soo Kim
Computer Systems Laboratory
Sungkyunkwan University
August 27, 2016
2. This talk will …
• Focus on analyzing the write traffic generated by Ceph
– To understand the I/O characteristics of Ceph
• Consider the case of using RADOS Block Device (RBD)
Figure Courtesy: Sage Weil, Red Hat
3. Why do writes matter?
• Writes dominate the I/O traffic in production workloads
– Performance-critical reads are extensively cached
– Written data must be kept persistent
– Writes account for 70-80% of I/O traffic
• Each write request may be accompanied by hidden I/Os
– The original write can be amplified manyfold
– Hidden I/Os influence I/O performance
– They must be considered when estimating system performance
• Writes wear out flash-based storage devices
– Worsen their reliability and increase TCO
4. How are writes amplified?
[Figure: the write path from an application through RBD to the OSDs, each backed by a filesystem on its storage device. A single data write is accompanied by Ceph metadata, the Ceph journal, filesystem metadata, and the filesystem journal: metadata describes the data, while the journals make updates atomic and consistent.]
5. Write amplification
• Metadata describes data
– Ceph metadata: object ID, offset, length, omap, …
– Filesystem metadata: inode, block allocation bitmap, inode bitmap, ...
• Journal / log makes data and metadata updates atomic and consistent
– Ceph journal
– Filesystem journal
• Write Amplification Factor (WAF)
– Metric quantifying how much the system amplifies writes
– WAF = (total amount written to storage) / (amount of the original writes)
– The higher the WAF, the more the writes are amplified (a minimal sketch follows)
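To make the metric concrete, here is a minimal sketch of the WAF computation in Python. The category names mirror the breakdown above, but the byte counts are purely illustrative, not measurements from the talk.

```python
# Minimal sketch of the WAF computation defined above. The traffic
# categories mirror the slide's breakdown; the byte counts are purely
# illustrative, not measured values.

def waf(original_bytes, traffic):
    """WAF = total bytes written to storage / bytes originally written."""
    return sum(traffic.values()) / original_bytes

# Hypothetical breakdown for a single 4 KiB application write:
breakdown = {
    "data":          3 * 4096,  # x3 replication
    "ceph_metadata": 2 * 4096,
    "ceph_journal":  4 * 4096,
    "fs_metadata":   1 * 4096,
    "fs_journal":    2 * 4096,
}
print(waf(4096, breakdown))  # -> 12.0
```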
6. Our interests are …
• Analyze WAF under various configurations and workloads
• Filesystems for FileStore
– xfs vs. ext4
• Backends for KeyValueStore
– LevelDB vs. RocksDB
• BlueStore
7. Evaluation environment
OSD Servers (x4)
Model: DELL R730
Processor: Intel® Xeon® CPU E5-2640 v3
Memory: 32 GB
OS: Ubuntu Server 14.04.4 LTS
Storage: HP PM1633 960 GB x4 (OSDs), Intel® 730 Series 480 GB x1, Intel® 750 Series 400 GB x4
Admin Server / Client (x1)
Model: DELL R730XD
Memory: 128 GB
Switch (x1)
Model: HPE 1620-48G (JG914A)
Network: 1 Gbps Ethernet (storage network)
8. Evaluation methodology
• Configure a 64 GiB RBD over the OSD servers
– 4 OSDs per OSD server, 4 OSD servers (16 OSDs in total)
• A client mounts the RBD and generates workloads using fio
• Meanwhile, collect block-level I/O traces at OSD servers using ftrace
• Analyze the I/O traces offline, attributing each write to its source (a classification sketch follows)
– Filesystem metadata: I/O requests tagged with REQ_META
– Filesystem journaling: identified by the LBA range of the journal area
– Ceph metadata: obtained by subtracting the original data size from the object writes
– Ceph journaling: isolated by placing the Ceph journal on a separate device via ceph-deploy
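As a rough illustration of the offline analysis, the sketch below buckets trace records by source. The record format, device names, and journal LBA range are assumptions for illustration; the talk parses ftrace block events, whose exact fields differ.

```python
# A minimal sketch of the offline trace classification, assuming each
# block-trace record has already been parsed into (device, lba, sectors,
# flags). The record format, device names, and JOURNAL_LBAS are
# assumptions; the talk works from ftrace block events.
from collections import Counter

REQ_META = "M"                      # metadata flag carried in the trace
JOURNAL_LBAS = range(0, 2_097_152)  # assumed LBA range of the fs journal

def classify(dev, lba, sectors, flags):
    if dev == "ceph-journal":       # Ceph journal placed on its own device
        return "ceph_journal"
    if REQ_META in flags:           # requests tagged with REQ_META
        return "fs_metadata"
    if lba in JOURNAL_LBAS:         # falls inside the journal's LBA range
        return "fs_journal"
    return "data+ceph_metadata"     # split later by subtracting data size

traffic = Counter()
trace = [("sda", 2_500_000, 8, "W"), ("sda", 2_500_100, 8, "WM"),
         ("ceph-journal", 0, 64, "W")]
for dev, lba, sectors, flags in trace:
    traffic[classify(dev, lba, sectors, flags)] += sectors * 512
print(dict(traffic))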
9. Evaluation workloads
• RBD allocates an object for each 4 MiB LBA chunk
• Microbenchmark (a reproduction sketch follows)
– 1st write: creates an object
– 2nd write: appends data to the object
– Overwrite: overwrites existing data
– sync and wait between writes
• Long-term workload
– Generate 4 KiB uniform random writes equivalent to 90% of the RBD capacity
[Figure: placement of the 1st write, 2nd write, and overwrite within an object, from offset 0 to its size.]
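A minimal sketch of the microbenchmark's three steps is shown below. The device path /dev/rbd0 and the direct use of os.pwrite are assumptions; the talk drives the same pattern with fio.

```python
# A minimal sketch reproducing the three microbenchmark steps against a
# mapped RBD image. /dev/rbd0 and plain O_SYNC writes are assumptions;
# the talk generates this workload with fio.
import os

BLK = 4096                                   # 4 KiB request size
fd = os.open("/dev/rbd0", os.O_WRONLY | os.O_SYNC)
buf = b"\0" * BLK

os.pwrite(fd, buf, 0)      # 1st write: creates the backing object
os.fsync(fd)               # sync and wait between writes
os.pwrite(fd, buf, BLK)    # 2nd write: appends data to the object
os.fsync(fd)
os.pwrite(fd, buf, 0)      # overwrite: hits already-written data
os.fsync(fd)
os.close(fd)
```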
11. Filesystems for FileStore
• ext4
– Not recommended due to its limited XATTR length
• Obsolete as of Jewel
– However, it is the most popular filesystem
– We used ordered-mode journaling (data=ordered)
• xfs
– Officially supported in Jewel
– Optimized for scalable, parallel I/O
– Performs logical journaling
[Figure: FileStore write path; an application issues I/O to RBD, and each OSD stores objects through a filesystem on its storage device.]
12. Microbenchmark - 4 KiB
[Figure: WAF breakdown (data, Ceph metadata, Ceph journal, filesystem metadata, filesystem journal) for the 1st write, 2nd write, and overwrite on ext4 and xfs. Traffic at the Ceph layer is identical on both filesystems (data x3, Ceph metadata x6, Ceph journal x9); total WAF reaches x96 on ext4 (filesystem metadata 33, filesystem journal 45) but only x81 on xfs (filesystem metadata 48, filesystem journal 15), since xfs performs journaling more effectively.]
18. Backends for KeyValueStore
• Use key-value stores on the xfs filesystem
– One <key, value> pair for each 4 KiB of data
• LevelDB
– Based on the Log-Structured Merge (LSM) tree (a toy sketch of its write path follows)
• RocksDB
– A fork of LevelDB
– Optimized for server workloads and fast storage
• Unable to separate KeyValueStore metadata from its journal
– Doing so would require filesystem-level semantics
[Figure: KeyValueStore write path; each OSD stores data in a key-value store (KVStore) layered on a filesystem.]
25. BlueStore
• An alternative to FileStore and KeyValueStore, introduced in Jewel
– Provides better transaction and object enumeration mechanisms
• Writes data to raw block devices (see the sketch below)
– Manages the space using extents of 64 KiB and larger
– Small writes are journaled in the RocksDB WAL
• Keeps metadata in RocksDB
Figure Courtesy: Sébastien Han
https://www.sebastien-han.fr
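A conceptual sketch of the write-path decision just described follows. The threshold constant and function names are hypothetical, chosen for illustration; they are not BlueStore internals or tunables.

```python
# A conceptual sketch of the BlueStore write-path decision described
# above. MIN_ALLOC and all names are illustrative assumptions, not
# BlueStore's actual API or configuration.

MIN_ALLOC = 64 * 1024            # extent allocation unit (64 KiB)

def submit_write(offset, length):
    aligned = offset % MIN_ALLOC == 0 and length % MIN_ALLOC == 0
    if length < MIN_ALLOC or not aligned:
        # Small/unaligned writes are journaled in the RocksDB WAL first
        # and applied to the raw device later -> extra journal traffic.
        return "defer via RocksDB WAL"
    # Extent-aligned writes go straight to a newly allocated extent on
    # the raw block device; only metadata lands in RocksDB.
    return "direct to raw device extent"

print(submit_write(0, 4096))       # 4 KiB overwrite -> journaled in WAL
print(submit_write(0, 64 * 1024))  # full extent    -> written directly
```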
26. Microbenchmark - 4 KiB
[Figure: WAF breakdown (data, RocksDB, RocksDB WAL) on BlueStore: roughly x12 for the 1st write, x10 for the 2nd write, and x19 for the overwrite. Data contributes x3 in every case; direct RocksDB traffic is ~0 because RocksDB writes are deferred by the WAL, which contributes 9, 7, and 16 respectively. A small overwrite requires data journaling, which amplifies the journaling traffic.]
27. WAF on various request sizes
[Figure: WAF of the 1st write, 2nd write, and overwrite across request sizes: large writes converge to x3 due to replication without data journaling, while small overwrites pay another x2 from data journaling.]
29. To sum up …
• Writes are amplified by x3 to x96 in Ceph
• FileStore is not the best choice for RBD
– Journal-on-journal happens
– Logical journaling in xfs greatly reduces WAF
• Using KeyValueStore is not good either
– Very high compaction overhead
• BlueStore seems promising
– No journal-on-journal, no data compaction
– However, it is bad for small overwrites
WAF by store type:
Store type                Small writes (4 KiB)   Large writes (1 MiB)   Long-term
FileStore / ext4          96                     6.4                    35.7
FileStore / xfs           81                     6.5                    16.2
KeyValueStore / LevelDB   55.3                   3.3                    68.3
KeyValueStore / RocksDB   23.3                   3.1                    88.4
BlueStore                 11                     3.0                    22.7