Ceph Day Seoul - The Anatomy of Ceph I/O
1. The Anatomy of Ceph I/O
Sang-Hoon Kim, Dong-Yun Lee, Sanghoon Han, Kisik Jeong, and Jin-Soo Kim
Computer Systems Laboratory
Sungkyunkwan University
August 27, 2016
2. This talk will …
• Focus on analyzing the write traffic generated by Ceph
– To understand the I/O characteristics of Ceph
• Consider the case of using RADOS Block Device (RBD)
Figure Courtesy: Sage Weil, Red Hat
3. Why do writes matter?
• Writes dominate the I/O traffic in production workloads
– Performance-critical reads are extensively cached
– Written data must be kept persistent
– Writes account for 70-80% of I/O traffic
• Each write request may be accompanied by hidden I/Os
– The original write can be amplified manyfold
– Hidden I/Os influence I/O performance
– They must be considered when estimating system performance
• Writes wear out flash-based storage devices
– Worsen their reliability and increase TCO
4. How are writes amplified?
[Figure: the write path from an application through RBD to the OSDs, each backed by a filesystem on its storage device. A single data write is accompanied by Ceph metadata, the Ceph journal, filesystem metadata, and the filesystem journal: metadata describes the data, while the journals make updates atomic and consistent.]
5. Write amplification
• Metadata describes data
– Ceph metadata: object ID, offset, length, omap, …
– Filesystem metadata: inode, block allocation bitmap, inode bitmap, ...
• Journal / log makes data and metadata updates atomic and consistent
– Ceph journal
– Filesystem journal
• Write Amplification Factor (WAF)
– Metric quantifying how much the system amplifies writes
– WAF = (total amount written to storage) / (amount of the original writes)
– The higher the WAF, the more the writes are amplified (a minimal sketch follows)
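To make the metric concrete, here is a minimal sketch of the WAF computation in Python. The category names mirror the breakdown above, but the byte counts are purely illustrative, not measurements from the talk.

```python
# Minimal sketch of the WAF computation defined above. The traffic
# categories mirror the slide's breakdown; the byte counts are purely
# illustrative, not measured values.

def waf(original_bytes, traffic):
    """WAF = total bytes written to storage / bytes originally written."""
    return sum(traffic.values()) / original_bytes

# Hypothetical breakdown for a single 4 KiB application write:
breakdown = {
    "data":          3 * 4096,  # x3 replication
    "ceph_metadata": 2 * 4096,
    "ceph_journal":  4 * 4096,
    "fs_metadata":   1 * 4096,
    "fs_journal":    2 * 4096,
}
print(waf(4096, breakdown))  # -> 12.0
```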
6. Our interests are …
• Analyze WAF under various configurations and workloads
• Filesystems for FileStore
– xfs vs. ext4
• Backends for KeyValueStore
– LevelDB vs. RocksDB
• BlueStore
7. Evaluation environment
OSD Servers (x4)
Model: DELL R730
Processor: Intel® Xeon® CPU E5-2640 v3
Memory: 32 GB
OS: Ubuntu Server 14.04.4 LTS
Storage: HP PM1633 960 GB x4 (OSDs), Intel® 730 Series 480 GB x1, Intel® 750 Series 400 GB x4
Admin Server / Client (x1)
Model: DELL R730XD
Memory: 128 GB
Switch (x1)
Model: HPE 1620-48G (JG914A)
Network: 1 Gbps Ethernet (storage network)
8. Evaluation methodology
• Configure a 64 GiB RBD over the OSD servers
– 4 OSDs per OSD server, 4 OSD servers (16 OSDs in total)
• A client mounts the RBD and generates workloads using fio
• Meanwhile, collect block-level I/O traces at OSD servers using ftrace
• Analyze the I/O traces offline, attributing each write to its source (a classification sketch follows)
– Filesystem metadata: I/O requests tagged with REQ_META
– Filesystem journaling: identified by the LBA range of the journal area
– Ceph metadata: obtained by subtracting the original data size from the object writes
– Ceph journaling: isolated by placing the Ceph journal on a separate device via ceph-deploy
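As a rough illustration of the offline analysis, the sketch below buckets trace records by source. The record format, device names, and journal LBA range are assumptions for illustration; the talk parses ftrace block events, whose exact fields differ.

```python
# A minimal sketch of the offline trace classification, assuming each
# block-trace record has already been parsed into (device, lba, sectors,
# flags). The record format, device names, and JOURNAL_LBAS are
# assumptions; the talk works from ftrace block events.
from collections import Counter

REQ_META = "M"                      # metadata flag carried in the trace
JOURNAL_LBAS = range(0, 2_097_152)  # assumed LBA range of the fs journal

def classify(dev, lba, sectors, flags):
    if dev == "ceph-journal":       # Ceph journal placed on its own device
        return "ceph_journal"
    if REQ_META in flags:           # requests tagged with REQ_META
        return "fs_metadata"
    if lba in JOURNAL_LBAS:         # falls inside the journal's LBA range
        return "fs_journal"
    return "data+ceph_metadata"     # split later by subtracting data size

traffic = Counter()
trace = [("sda", 2_500_000, 8, "W"), ("sda", 2_500_100, 8, "WM"),
         ("ceph-journal", 0, 64, "W")]
for dev, lba, sectors, flags in trace:
    traffic[classify(dev, lba, sectors, flags)] += sectors * 512
print(dict(traffic))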
9. Evaluation workloads
• RBD allocates an object for each 4 MiB LBA chunk
• Microbenchmark (a reproduction sketch follows)
– 1st write: creates an object
– 2nd write: appends data to the object
– Overwrite: overwrites existing data
– sync and wait between writes
• Long-term workload
– Generate 4 KiB uniform random writes equivalent to 90% of the RBD capacity
[Figure: placement of the 1st write, 2nd write, and overwrite within an object, from offset 0 to its size.]
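A minimal sketch of the microbenchmark's three steps is shown below. The device path /dev/rbd0 and the direct use of os.pwrite are assumptions; the talk drives the same pattern with fio.

```python
# A minimal sketch reproducing the three microbenchmark steps against a
# mapped RBD image. /dev/rbd0 and plain O_SYNC writes are assumptions;
# the talk generates this workload with fio.
import os

BLK = 4096                                   # 4 KiB request size
fd = os.open("/dev/rbd0", os.O_WRONLY | os.O_SYNC)
buf = b"\0" * BLK

os.pwrite(fd, buf, 0)      # 1st write: creates the backing object
os.fsync(fd)               # sync and wait between writes
os.pwrite(fd, buf, BLK)    # 2nd write: appends data to the object
os.fsync(fd)
os.pwrite(fd, buf, 0)      # overwrite: hits already-written data
os.fsync(fd)
os.close(fd)
```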
11. Filesystems for FileStore
• ext4
– Not recommended due to its limited XATTR length
• Obsolete as of Jewel
– However, it is the most popular filesystem
– We used ordered-mode journaling (data=ordered)
• xfs
– Officially supported in Jewel
– Optimized for scalable, parallel I/O
– Performs logical journaling
[Figure: FileStore write path; an application issues I/O to RBD, and each OSD stores objects through a filesystem on its storage device.]
12. Microbenchmark - 4 KiB
[Figure: WAF breakdown (data, Ceph metadata, Ceph journal, filesystem metadata, filesystem journal) for the 1st write, 2nd write, and overwrite on ext4 and xfs. Traffic at the Ceph layer is identical on both filesystems (data x3, Ceph metadata x6, Ceph journal x9); total WAF reaches x96 on ext4 (filesystem metadata 33, filesystem journal 45) but only x81 on xfs (filesystem metadata 48, filesystem journal 15), since xfs performs journaling more effectively.]
18. Backends for KeyValueStore
• Use key-value stores on the xfs filesystem
– One <key, value> pair for each 4 KiB of data
• LevelDB
– Based on the Log-Structured Merge (LSM) tree (a toy sketch of its write path follows)
• RocksDB
– A fork of LevelDB
– Optimized for server workloads and fast storage
• Unable to separate KeyValueStore metadata from its journal
– Doing so would require filesystem-level semantics
[Figure: KeyValueStore write path; each OSD stores data in a key-value store (KVStore) layered on a filesystem.]
25. BlueStore
• An alternative to FileStore and KeyValueStore, introduced in Jewel
– Provides better transaction and object enumeration mechanisms
• Writes data to raw block devices (see the sketch below)
– Manages the space using extents of 64 KiB and larger
– Small writes are journaled in the RocksDB WAL
• Keeps metadata in RocksDB
Figure Courtesy: Sébastien Han
https://www.sebastien-han.fr
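A conceptual sketch of the write-path decision just described follows. The threshold constant and function names are hypothetical, chosen for illustration; they are not BlueStore internals or tunables.

```python
# A conceptual sketch of the BlueStore write-path decision described
# above. MIN_ALLOC and all names are illustrative assumptions, not
# BlueStore's actual API or configuration.

MIN_ALLOC = 64 * 1024            # extent allocation unit (64 KiB)

def submit_write(offset, length):
    aligned = offset % MIN_ALLOC == 0 and length % MIN_ALLOC == 0
    if length < MIN_ALLOC or not aligned:
        # Small/unaligned writes are journaled in the RocksDB WAL first
        # and applied to the raw device later -> extra journal traffic.
        return "defer via RocksDB WAL"
    # Extent-aligned writes go straight to a newly allocated extent on
    # the raw block device; only metadata lands in RocksDB.
    return "direct to raw device extent"

print(submit_write(0, 4096))       # 4 KiB overwrite -> journaled in WAL
print(submit_write(0, 64 * 1024))  # full extent    -> written directly
```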
26. Microbenchmark - 4 KiB
[Figure: WAF breakdown (data, RocksDB, RocksDB WAL) on BlueStore: roughly x12 for the 1st write, x10 for the 2nd write, and x19 for the overwrite. Data contributes x3 in every case; direct RocksDB traffic is ~0 because RocksDB writes are deferred by the WAL, which contributes 9, 7, and 16 respectively. A small overwrite requires data journaling, which amplifies the journaling traffic.]
27. WAF on various request sizes
[Figure: WAF of the 1st write, 2nd write, and overwrite across request sizes: large writes converge to x3 due to replication without data journaling, while small overwrites pay another x2 from data journaling.]
29. To sum up …
• Writes are amplified by x3 to x96 in Ceph
• FileStore is not the best choice for RBD
– Journal-on-journal happens
– Logical journaling in xfs greatly reduces WAF
• Using KeyValueStore is not good either
– Very high compaction overhead
• BlueStore seems promising
– No journal-on-journal, no data compaction
– However, it is bad for small overwrites
WAF by store type:
Store type                Small writes (4 KiB)   Large writes (1 MiB)   Long-term
FileStore / ext4          96                     6.4                    35.7
FileStore / xfs           81                     6.5                    16.2
KeyValueStore / LevelDB   55.3                   3.3                    68.3
KeyValueStore / RocksDB   23.3                   3.1                    88.4
BlueStore                 11                     3.0                    22.7