The Anatomy of Ceph I/O
Sang-Hoon Kim, Dong-Yun Lee, Sanghoon Han, Kisik Jeong, and Jin-Soo Kim
Computer Systems Laboratory
Sungkyunkwan University
August 27, 2016
This talk will …
• Focus on analyzing the write traffic generated by Ceph
– To understand the I/O characteristics of Ceph
• Consider the case for using RADOS Block Device (RBD)
Ceph Day Seoul (August 27, 2016) - Sang-Hoon Kim (sanghoon@csl.skku.edu) 2
Figure Courtesy: Sage Weil, RedHat
Why do writes matter?
• Writes dominate the I/O traffic in production workloads
– Performance-critical reads are extensively cached
– Written data must be kept persistent
– Account for 70-80% of I/O traffic
• Each write request may accompany hidden I/Os
– Can be amplified manyfold
– Influence I/O performance
– Must be considered for estimating system performance
• Writes wear out flash-based storage devices
– Worsen their reliability, increase TCO
How are writes amplified?

[Figure: a write issued by the application travels through RBD to the OSDs, each of which stores data through a local filesystem. Along the way, Ceph metadata and filesystem metadata describe the data, while the Ceph journal and the filesystem journal make updates atomic and consistent — all adding writes on top of the original data.]
Write amplification
• Metadata describes data
– Ceph metadata: object ID, offset, length, omap, …
– Filesystem metadata: inode, block allocation bitmap, inode bitmap, ...
• Journal / log makes data and metadata updates atomic and consistent
– Ceph journal
– Filesystem journal
• Write Amplification Factor (WAF)
– Metric for measuring the amplified amount of writes
– Total amount of writes / amount of the original writes
– The higher the WAF, the more the system amplifies writes
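The definition above can be sketched as a one-liner; the byte counts in the example are hypothetical, not taken from the measurements in this talk:

```python
def waf(total_bytes_written: int, original_bytes: int) -> float:
    """Write Amplification Factor: total device writes / original application writes."""
    if original_bytes == 0:
        raise ValueError("original write amount must be non-zero")
    return total_bytes_written / original_bytes

# Hypothetical example: a 4 KiB application write that produces 96 x 4 KiB
# of device-level traffic (data + metadata + journals) has a WAF of 96.
print(waf(96 * 4096, 4096))  # 96.0
```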
Our interests are …
• Analyze WAF under various configurations and workloads
• Filesystems for FileStore
– xfs vs. ext4
• Backends for KeyValueStore
– LevelDB vs. RocksDB
• BlueStore
Evaluation environment

OSD Servers (x4)
– Model: DELL R730
– Processor: Intel® Xeon® CPU E5-2640 v3
– Memory: 32 GB
– OS: Ubuntu Server 14.04.4 LTS
– Storage: HP PM1633 960 GB x4 (OSDs), Intel® 730 Series 480 GB x1, Intel® 750 Series 400 GB x4

Admin Server / Client (x1)
– Model: DELL R730XD
– Memory: 128 GB

Switch (x1)
– Model: HPE 1620-48G (JG914A)
– Network: 1 Gbps Ethernet
Evaluation methodology
• Configure a 64GiB RBD over the OSD servers
– 4 OSDs per OSD server, 4 OSD servers
• A client mounts the RBD and generates workloads using fio
• Meanwhile, collect block-level I/O traces at OSD servers using ftrace
• Analyze the I/O traces offline
– Filesystem metadata: I/O requests tagged with REQ_META
– Filesystem journal: identified by the LBA range of the journal area
– Ceph metadata: obtained by subtracting the original data size from the total
– Ceph journal: isolated by placing the journal on a separate device via ceph-deploy
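The offline classification above can be sketched as follows. The trace record format here is a hypothetical pre-parsed tuple, not actual ftrace output, and the journal LBA range is an assumed constant:

```python
# Classify a block-level write into the categories used in this analysis.
# Assumes records have already been parsed into (lba, flags); real ftrace
# block events need parsing first -- this only shows the decision logic.

JOURNAL_START, JOURNAL_END = 1_000_000, 1_100_000  # assumed journal LBA range

def classify(lba: int, flags: set) -> str:
    if "REQ_META" in flags:
        return "fs_metadata"           # filesystem metadata writes are tagged
    if JOURNAL_START <= lba < JOURNAL_END:
        return "fs_journal"            # falls inside the journal's LBA range
    return "data"                      # everything else counts as data

trace = [(42, {"REQ_META"}), (1_000_050, set()), (5000, set())]
print([classify(lba, flags) for lba, flags in trace])
# ['fs_metadata', 'fs_journal', 'data']
```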
Evaluation workloads
• RBD allocates an object for each 4 MiB LBA chunk
• Microbenchmark
– 1st write: Creating an object
– 2nd write: Appending data to the object
– Overwrite: Overwriting existing data
– sync and wait between writes
• Long-term workload
– Generate 4 KiB uniform random writes equivalent to 90% of RBD capacity
[Figure: write layout within an object — the 1st write starts at offset 0, the 2nd write appends after it, and the overwrite hits already-written data.]
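As a back-of-envelope check on the long-term workload above, the number of 4 KiB writes needed to cover 90% of the 64 GiB RBD works out to roughly 15 million:

```python
RBD_BYTES = 64 * 2**30      # 64 GiB RBD capacity
WRITE_BYTES = 4 * 2**10     # 4 KiB random writes

# 90% of capacity, expressed as a count of 4 KiB write requests
n_writes = int(RBD_BYTES * 0.9 // WRITE_BYTES)
print(n_writes)  # 15099494 -> about 15.1 million writes
```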
Filesystems for FileStore
Hammer v0.94.7
Filesystems for FileStore
• ext4
– Not recommended due to its limited XATTR length
• Deprecated in Jewel
– Nevertheless, it is the most popular filesystem
– Evaluated with ordered-mode journaling
• xfs
– Officially supported in Jewel
– Optimized for scalable and parallel I/O
– Performs logical journaling
[Figure: FileStore architecture — the application issues writes to RBD, and each OSD stores objects through a local filesystem (ext4 or xfs) on its storage device.]
Microbenchmark - 4 KiB

[Chart: per-category WAF (Data / Ceph Metadata / Ceph Journal / Filesystem Metadata / Filesystem Journal) for 4 KiB writes. ext4: 1st write x96 (3 / 6 / 9 / 33 / 45), 2nd write x75 (3 / 8 / 9 / 21 / 34), overwrite x61 (3 / 7 / 9 / 15 / 27). xfs: 1st write x81 (3 / 6 / 9 / 48 / 15), 2nd write x61 (3 / 8 / 9 / 34 / 7), overwrite x54 (3 / 6 / 9 / 24 / 12). The Data, Ceph Metadata, and Ceph Journal components are identical at the Ceph layer; xfs performs journaling more effectively than ext4.]
WAF on various request sizes

[Chart: WAF vs. request size for 1st write, 2nd write, and overwrite on ext4 and xfs. The curves are identical at the Ceph layer, but ext4 amplifies writes more than xfs. As the request size grows, WAFs converge to 6: x3 by replication and x2 by Ceph journaling.]
Long-term workload

[Chart: WAF and write-category fractions over time under the long-term 4 KiB random-write workload. Average WAF reaches x35.7 on ext4 and x16.2 on xfs; filesystem metadata updates increase as the storage fills up.]
Backends for KeyValueStore
Hammer v0.94.7
Backends for KeyValueStore
• Use key-value stores on the xfs filesystem
– <key, value> pair for each 4 KiB
• LevelDB
– Based on the Log-structured Merge (LSM) Tree
• RocksDB
– Fork of LevelDB
– Optimized for server workloads and fast storage
• Unable to separate KeyValueStore metadata from its journal
– Doing so would require filesystem-level semantics
[Figure: KeyValueStore architecture — the application issues writes to RBD, and each OSD stores data in a key-value store (LevelDB or RocksDB) running on top of a local filesystem.]
Log-Structured Merge (LSM) Tree

[Figure: incoming writes are buffered in an in-memory MemTable, then sorted and dumped to immutable on-disk Sorted String Tables (SSTables).]
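The sort-and-dump flow in the figure can be illustrated with a toy LSM store. This is a didactic sketch only: it omits the write-ahead log, compaction, and tombstones that LevelDB/RocksDB actually implement:

```python
# Minimal LSM-tree sketch: writes buffer in a MemTable; when it fills,
# entries are sorted and dumped as an immutable SSTable.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []  # list of sorted (key, value) lists, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # "Sort, dump": persist the MemTable as a Sorted String Table
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            for k, v in table:
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(5):
    db.put(f"k{i}", i)
print(len(db.sstables), db.get("k2"))  # 1 2
```

Note how a read may have to search every SSTable; compaction exists precisely to bound that cost, at the price of the rewrite traffic measured in the long-term results below.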
Microbenchmark - 4 KiB

[Chart: per-category WAF (Data / KeyValueStore / Filesystem Metadata / Filesystem Journal) for 4 KiB writes. LevelDB: x55.3 for 1st write, 2nd write, and overwrite alike (0 / 8 / 42 / 5.25). RocksDB: x23.25, x22.25, and x21.25 respectively (0 / 7-9 / 12 / 2.25). The Data component is zero because writes are first cached in the MemTable.]
WAF on various request sizes

[Chart: WAF vs. request size for LevelDB and RocksDB. Spikes mark MemTable flushes; annotations also note RocksDB's aggressive compaction. RocksDB outperforms LevelDB for object creation and small writes.]
Long-term workload

[Chart: WAF and write-category fractions (Data / KeyValueStore / Compaction / Filesystem metadata / Filesystem journal) over time. Average WAF reaches x68.3 for LevelDB and x88.4 for RocksDB — a huge amount of write amplification, caused by compaction and by the additional metadata kept for each 4 KB. The LSM tree shapes I/O into sequential writes, which exploits the extent-based allocation in xfs.]
BlueStore
Jewel v10.2.2
BlueStore
• An alternative to FileStore and KeyValueStore, introduced in Jewel
– Provides better transaction and object-enumeration mechanisms
• Writes data to raw block devices
– Manages space using extents of 64 KiB and larger
– Small writes are journaled in the RocksDB WAL
• Keeps metadata in RocksDB
Figure Courtesy: Sébastien Han
https://www.sebastien-han.fr
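The small-write/large-write split described above can be sketched as a dispatch rule. The 64 KiB threshold and the function name are assumptions for illustration (roughly corresponding to BlueStore's allocation granularity), not BlueStore's actual code or API:

```python
# Hypothetical sketch of BlueStore's write-path split, per the slide:
# small writes are deferred through the RocksDB WAL, large writes go
# straight to the raw block device.
MIN_ALLOC_SIZE = 64 * 1024  # assumed extent granularity

def write_path(length: int) -> str:
    if length < MIN_ALLOC_SIZE:
        return "RocksDB WAL (deferred, journaled)"
    return "raw block device (direct)"

print(write_path(4 * 1024))     # a 4 KiB write goes through the WAL
print(write_path(1024 * 1024))  # a 1 MiB write goes straight to the device
```

This split explains the microbenchmark below: small overwrites pay for data journaling in the WAL, while large writes avoid journal-on-journal entirely.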
Microbenchmark - 4 KiB

[Chart: per-category WAF (Data / RocksDB / RocksDB WAL) for 4 KiB writes on BlueStore. 1st write x12 (3 / 0 / 9), 2nd write x10 (3 / 0 / 7), overwrite x19 (3 / 0 / 16). RocksDB writes are deferred by the WAL; a small overwrite requires data journaling, which amplifies the journaling traffic.]
WAF on various request sizes

[Chart: WAF vs. request size on BlueStore for 1st write, 2nd write, and overwrite. Large writes converge to x3 from replication alone, with no data journaling; small overwrites pay an extra x2 for data journaling.]
Long-term workload

[Chart: WAF and write-category fractions (Data / Raw Data / RocksDB / RocksDB WAL) over time on BlueStore; average WAF is x22.7. MemTable flushes and checkpointing generate RocksDB writes, and zeroing the tail extent incurs extra writes.]
To sum up …
• Writes are amplified by x3 to x96 in Ceph
• FileStore is not the best choice for RBD
– Journal-on-journal happens
– Logical journaling in xfs greatly reduces WAF
• Using KeyValueStore is not good either
– Very high compaction overhead
• BlueStore seems promising
– No journal-on-journal and no compaction overhead
– However, it is still bad for small overwrites
Store type               Small writes (4 KiB)   Large writes (1 MiB)   Long-term
FileStore / ext4         96                     6.4                    35.7
FileStore / xfs          81                     6.5                    16.2
KeyValueStore / LevelDB  55.3                   3.3                    68.3
KeyValueStore / RocksDB  23.3                   3.1                    88.4
BlueStore                11                     3.0                    22.7
Thank You!
Sang-Hoon Kim
Post-Doc
Computer Systems Laboratory
Sungkyunkwan University
✉️ sanghoon@csl.skku.edu
This research is funded by Samsung Electronics, Co.

 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Ceph Day Seoul - The Anatomy of Ceph I/O

Our interests are …
• Analyze WAF in various configurations and workloads
• Filesystems for FileStore
– xfs vs. ext4
• Backends for KeyValueStore
– LevelDB vs. RocksDB
• BlueStore

Evaluation environment
• OSD Servers (x4)
– Model: DELL R730
– Processor: Intel® Xeon® CPU E5-2640 v3
– Memory: 32 GB
– OS: Ubuntu Server 14.04.4 LTS
– Storage: HP PM1633 960 GB x4 (OSDs), Intel® 730 Series 480 GB x1, Intel® 750 Series 400 GB x4
• Admin Server / Client (x1)
– Model: DELL R730XD
– Memory: 128 GB
• Switch (x1)
– Model: HPE 1620-48G (JG914A)
– Network: 1 Gbps Ethernet

Evaluation methodology
• Configure a 64 GiB RBD over the OSD servers
– 4 OSDs per OSD server, 4 OSD servers
• A client mounts the RBD and generates workloads using fio
• Meanwhile, collect block-level I/O traces at the OSD servers using ftrace
• Analyze the I/O traces offline
– Filesystem metadata: I/O requests tagged with REQ_META
– Filesystem journaling: identified by the journal's LBA range
– Ceph metadata: obtained by subtracting the original data size
– Ceph journaling: measured by placing the journal on a separate device via ceph-deploy

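The offline classification above can be sketched as a small reduction pass over the collected traces. This is a minimal sketch under assumed inputs: the record layout (sector, bytes, flags) and the journal LBA range are hypothetical stand-ins for what the actual ftrace block events provide.

```python
# Minimal sketch of the offline WAF analysis. The trace record format and
# the journal LBA range below are illustrative, not the real ftrace output.

JOURNAL_LBA_START, JOURNAL_LBA_END = 1_000_000, 2_000_000  # assumed journal range

def classify(record):
    """Attribute one block-level write to a traffic category."""
    if "META" in record["flags"]:                 # REQ_META-tagged requests
        return "fs_metadata"
    if JOURNAL_LBA_START <= record["sector"] < JOURNAL_LBA_END:
        return "fs_journal"                       # falls inside the journal LBA range
    return "data"                                 # everything else hits the data area

def waf(records, workload_bytes):
    """Total bytes written at the device, divided by bytes the client wrote."""
    totals = {}
    for r in records:
        cat = classify(r)
        totals[cat] = totals.get(cat, 0) + r["bytes"]
    return sum(totals.values()) / workload_bytes, totals
```

Ceph-level categories (Ceph metadata, Ceph journal) would be separated the same way, by device and by subtracting the original data size as described above.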
Evaluation workloads
• RBD allocates an object for each 4 MiB LBA chunk
• Microbenchmark
– 1st write: creating an object
– 2nd write: appending data to the object
– Overwrite: overwriting existing data
– sync and wait between writes
• Long-term workload
– Generate 4 KiB uniform random writes equivalent to 90% of RBD capacity

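The 4 MiB striping above amounts to simple integer arithmetic: a byte offset in the block device selects one object and an offset within it. A sketch (the function name is illustrative; RADOS object naming is not modeled):

```python
OBJECT_SIZE = 4 * 1024 * 1024  # RBD maps each 4 MiB LBA chunk to one object

def lba_to_object(byte_offset):
    # Which object a write lands in, and the offset within that object
    return byte_offset // OBJECT_SIZE, byte_offset % OBJECT_SIZE
```

Under this mapping, the microbenchmark's 1st write at offset 0 creates object 0, the 2nd write at 4 KiB appends to the same object, and the overwrite hits data that already exists in it.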
Filesystems for FileStore
• ext4
– Not recommended due to the limited length of XATTRs
• Obsolete in Jewel
– However, it is the most popular filesystem
– Used ordered-mode journaling
• xfs
– Officially supported in Jewel
– Optimized for scalable and parallel I/O
– Performs logical journaling

Microbenchmark - 4 KiB

                 ext4                       xfs
               1st   2nd   Overwrite   1st   2nd   Overwrite
Data             3     3     3           3     3     3
Ceph metadata    6     8     7           6     8     6
Ceph journal     9     9     9           9     9     9
FS metadata     33    21    15          48    34    24
FS journal      45    34    27          15     7    12
WAF            x96   x75   x61         x81   x61   x54

• Writes are identical at the Ceph layer across the two filesystems
• xfs performs journaling more effectively

WAF on various request sizes
(line charts of WAF vs. request size for ext4 and xfs, each with 1st write, 2nd write, and overwrite curves)
• Writes are identical at the Ceph layer
• ext4 amplifies writes more than xfs
• WAFs converge to 6 as the request size grows
– x3 by replication
– x2 by Ceph journaling

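The convergence to 6 follows from a simple model: with 3-way replication, each replica writes the data twice (once to the Ceph journal, once to the data area) plus a roughly size-independent metadata overhead. A sketch of that model, where the overhead constant is purely illustrative, not a measured value:

```python
def waf_model(request_bytes, overhead_bytes=64 * 1024, replicas=3):
    # Each replica writes the data twice (journal + data area), plus a
    # roughly constant metadata/journal overhead per request. As the
    # request grows, the overhead term vanishes and WAF -> replicas * 2.
    return replicas * (2 * request_bytes + overhead_bytes) / request_bytes
```

For small requests the constant overhead dominates, which is why the 4 KiB cases reach WAFs in the tens; for multi-megabyte requests the model approaches 6, matching the charts.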
Long-term workload
(time-series WAF breakdown while filling 90% of the RBD with 4 KiB random writes)
• ext4: x35.7 / xfs: x16.2
• Filesystem metadata updates increase as the storage fills up

Backends for KeyValueStore
• Use key-value stores on top of the xfs filesystem
– One <key, value> pair for each 4 KiB
• LevelDB
– Based on the Log-Structured Merge (LSM) tree
• RocksDB
– A fork of LevelDB
– Optimized for server workloads and fast storage
• Unable to separate KeyValueStore metadata and journal traffic
– Requires filesystem-level semantics

Log-Structured Merge (LSM) Tree
(figure: incoming writes are buffered, then sorted and dumped to Sorted String Tables (SSTables))

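The sort-and-dump step in the figure can be illustrated with a toy LSM store: writes accumulate in an in-memory MemTable, and when it fills, its entries are sorted and dumped as an immutable run. This is a schematic sketch, not LevelDB's actual data structures or file format:

```python
class ToyLSM:
    """Toy LSM tree: a MemTable plus a list of sorted, immutable runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}      # in-memory buffer; absorbs overwrites for free
        self.sstables = []      # flushed runs, each sorted by key
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # "Sort, dump": emit a Sorted String Table and clear the buffer
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # newest data lives in the MemTable
            return self.memtable[key]
        for run in reversed(self.sstables): # then newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None
```

Compaction, the step that later merges overlapping runs and drives much of the amplification measured below, is deliberately omitted here.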
Microbenchmark - 4 KiB

                 LevelDB                     RocksDB
               1st    2nd    Overwrite    1st     2nd     Overwrite
Data             0      0      0            0       0       0
KeyValueStore    8      8      8            9       8       7
FS metadata     42     42     42           12      12      12
FS journal    5.25   5.25   5.25         2.25    2.25    2.25
WAF          x55.3  x55.3  x55.3       x23.25  x22.25  x21.25

• Data is cached in the MemTable, so it does not appear as separate data writes

WAF on various request sizes
(line charts of WAF vs. request size for LevelDB and RocksDB, each with 1st write, 2nd write, and overwrite curves)
• Spikes appear where the MemTable is flushed
• RocksDB outperforms LevelDB for object creation and small writes
• RocksDB compacts aggressively

Long-term workload
(time-series WAF breakdown while filling 90% of the RBD with 4 KiB random writes)
• LevelDB: x68.3 / RocksDB: x88.4
• Huge amount of write amplification
– Compaction
– Additional metadata for each 4 KB entry
• The LSM tree exploits xfs's extent-based allocation by shaping I/O into sequential writes

BlueStore
• An alternative to FileStore and KeyValueStore, introduced in Jewel
– Provides better transaction and object-enumeration mechanisms
• Writes data on raw block devices
– Manages the space in 64 KiB extents
– Small writes are journaled in the RocksDB WAL
• Keeps metadata in RocksDB
Figure Courtesy: Sébastien Han https://www.sebastien-han.fr

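The small-write handling above can be sketched as a dispatch rule: an overwrite smaller than the allocation unit cannot safely replace part of an extent in place, so its data is first journaled through the RocksDB WAL; larger or newly allocated writes go straight to the raw device, with only their metadata passing through RocksDB. The 64 KiB threshold follows the slide; the function and return-value names are illustrative, not BlueStore's actual code paths.

```python
MIN_ALLOC = 64 * 1024   # extent / allocation unit, per the slide

def write_path(size, overwrite):
    # Small overwrites are journaled: the data goes into the RocksDB WAL
    # first and is applied to the device afterwards, so it is written twice.
    if overwrite and size < MIN_ALLOC:
        return "wal_then_apply"      # ~x2 data amplification on top of replication
    # New allocations and large writes hit the raw device directly;
    # only metadata updates flow through RocksDB.
    return "direct_to_device"
```

This is exactly the shape of the results below: x3 (replication only) for new and large writes, roughly x2 more for small overwrites.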
Microbenchmark - 4 KiB

              1st   2nd   Overwrite
Data            3     3     3
RocksDB         0     0     0
RocksDB WAL     9     7    16
WAF           x12   x10   x19

• RocksDB writes are deferred by the WAL
• A small overwrite requires data journaling, amplifying the journaling traffic

WAF on various request sizes
(line chart of WAF vs. request size with 1st write, 2nd write, and overwrite curves)
• x3 by replication, without data journaling
• x2 more by data journaling for small overwrites

Long-term workload
(time-series breakdown of data, raw data, RocksDB, and RocksDB WAL writes)
• Overall WAF: x22.7
• MemTable flushes and checkpointing generate RocksDB writes
• Zeroing the tail extent incurs extra writes

To sum up …
• Writes are amplified by x3 to x96 in Ceph
• FileStore is not the best choice for RBD
– Journal-on-journal happens
– Logical journaling in xfs greatly reduces WAF
• Using KeyValueStore is not good either
– Very high compaction overhead
• BlueStore seems to be promising
– No journal-on-journal and no data compaction
– However, bad for small overwrites

Store type                 Small writes (4 KiB)   Large writes (1 MiB)   Long-term
FileStore / ext4                 96                      6.4                35.7
FileStore / xfs                  81                      6.5                16.2
KeyValueStore / LevelDB          55.3                    3.3                68.3
KeyValueStore / RocksDB          23.3                    3.1                88.4
BlueStore                        11                      3.0                22.7

Thank You!
Sang-Hoon Kim
Post-Doc, Computer Systems Laboratory, Sungkyunkwan University
✉️ sanghoon@csl.skku.edu
This research is funded by Samsung Electronics, Co.

Editor's Notes

  1. Why, among all I/O, do we analyze write I/O in particular? First, because at datacenter scale, writes occur far more often than reads.
  2. Now, through what process do writes get amplified?
  3. Hammer (v0.94.7) vs. Jewel (v10.2.2)
  4. Hammer (v0.94.7) vs. Jewel (v10.2.2)
  5. Data file; omap file (metadata, LevelDB; the per-OSD variation arises from how much is written to this file); commit file
  6. Hammer (v0.94.7) vs. Jewel (v10.2.2)
  7. Ceph Hammer writes three files in total: the data file; the omap file (metadata, LevelDB; the per-OSD variation arises from how much is written to this file); and the commit file (Ceph journaling)
  8. WAF converges to almost 6: x3 from replication, and each replica's Ceph journaling journals the data as well
  10. WAF converges to almost 6: x3 from replication, and each replica's Ceph journaling journals the data as well
  11. WAF converges to almost 6: x3 from replication, and each replica's Ceph journaling journals the data as well