CEPH QoS: How to support QoS in a distributed storage system
Taewoong Kim
SW-Defined Storage Lab., SK Telecom
SK Telecom and Ceph
[Slide graphic: 5G, UHD/4K, and flash devices demand high performance, low latency, and SLAs; Ceph offers a scalable, available, reliable open platform with a unified interface. Combining the two: all-flash Ceph!]
• Contributions: QoS, deduplication, etc.
• Storage solution for:
  - Private cloud for developers
  - Virtual desktop infrastructure
1
Why QoS?
□ Shared storage: virtualization and consolidation
• Many tenants, and even more VMs
✓ Competition among clients for the shared resource
✓ Various workloads and requirements
• Hard to guarantee a storage SLA
✓ Performance is not deterministic
✓ The situation changes constantly, depending on neighbors
□ Software-defined storage such as Ceph provides many features, but also needs more background operations
• Background operations
✓ Replication
✓ Recovery
✓ Scrubbing
• These compete with foreground (client) operations
□ QoS schedules requests according to the administrator's pre-configured policies
[Figure: a Ceph cluster (OSD #1 ... OSD #N) serving many hypervisors and VMs through LIBRADOS; the clients' requests compete with background operations inside the cluster.]
2
QoS Support on Ceph
[Figure: inside an OSD, the Messenger dispatches operations to per-shard work queues (Shard #1 ... Shard #N, each with its own threads) in front of the STORE. PriorityQueue implementations: Weighted and Prioritized, plus mClockOpClass, mClockClient, and mClockPool (WIP by SKT).]
• Source: Sage Weil's "Ceph Project Update", OpenStack Summit 2017, Boston
3
What is mClock?
A. Gulati, A. Merchant, and P. J. Varman. "mClock: Handling throughput variability for hypervisor IO scheduling." In OSDI, 2010.
□ Motivation
• Little existing research supports QoS (reservation + proportional share) for storage
✓ Existing schedulers support only simple proportional share
✓ Or they target other hardware resources (CPU, memory)
□ Key idea
• Controls the number of I/O requests to be serviced using time tags
• Uses multiple real-time clocks and time tags for reservation-, limit-, and weight (proportion)-based I/O scheduling
• Selects the clock dynamically, depending on each clock's progress
• Can be extended to distributed storage: dmClock
✓ Clients track the progress of each storage server and piggyback feedback on the I/O requests they send to each server
Scheduling loop (simplified; see the sketch below):
if (smallest reservation tag < current time)   // constraint-based phase
    schedule the request with the smallest eligible reservation tag
else                                           // weight-based phase; reservations are met
    schedule the request with the smallest eligible shares tag of some VM k,
    and subtract 1/r_k from the reservation tags of VM k
A VM is eligible if (limit tag < current time)
4
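To make the tag mechanics concrete, below is a minimal single-server sketch of mClock-style tag assignment and selection, assuming per-client reservation/weight/limit rates (in IOPS, all > 0). The names (MClockQueue, ClientInfo, etc.) are invented for illustration and are not the ceph/dmclock library API; idle-client handling, limit enforcement details, and the distributed delta/rho feedback are omitted.

```cpp
#include <algorithm>
#include <deque>
#include <map>
#include <optional>

// Illustrative mClock-style scheduler (names are hypothetical, not the dmclock API).
struct ClientInfo { double resv; double wgt; double lim; };   // per-client r, w, l (IOPS)

struct Request {
    int client;
    double r_tag, w_tag, l_tag;   // reservation, weight (shares), limit tags
};

class MClockQueue {
    std::map<int, ClientInfo> info_;
    std::map<int, double> prev_r_, prev_w_, prev_l_;   // previous tags per client
    std::deque<Request> q_;
public:
    void set_client(int c, ClientInfo ci) { info_[c] = ci; }

    // Tag assignment: each tag advances by 1/rate from the previous tag,
    // but never lags behind the current (real) time.
    void enqueue(int c, double now) {
        const ClientInfo& ci = info_.at(c);
        Request req{c, 0, 0, 0};
        req.r_tag = std::max(now, prev_r_.count(c) ? prev_r_[c] + 1.0 / ci.resv : now);
        req.w_tag = std::max(now, prev_w_.count(c) ? prev_w_[c] + 1.0 / ci.wgt  : now);
        req.l_tag = std::max(now, prev_l_.count(c) ? prev_l_[c] + 1.0 / ci.lim  : now);
        prev_r_[c] = req.r_tag; prev_w_[c] = req.w_tag; prev_l_[c] = req.l_tag;
        q_.push_back(req);
    }

    // Selection: constraint-based phase first (a reservation tag is already due),
    // otherwise weight-based phase among eligible requests (limit tag <= now).
    std::optional<Request> dequeue(double now) {
        auto best = q_.end();
        for (auto it = q_.begin(); it != q_.end(); ++it)          // 1. constraint-based
            if (it->r_tag <= now && (best == q_.end() || it->r_tag < best->r_tag))
                best = it;
        bool constraint_based = (best != q_.end());
        if (!constraint_based) {
            for (auto it = q_.begin(); it != q_.end(); ++it)      // 2. weight-based
                if (it->l_tag <= now && (best == q_.end() || it->w_tag < best->w_tag))
                    best = it;
        }
        if (best == q_.end()) return std::nullopt;                // nothing eligible yet
        Request r = *best;
        q_.erase(best);
        if (!constraint_based) {
            // Compensate: client k's reservation was not consumed by this dispatch,
            // so pull its outstanding reservation tags back by 1/r_k.
            double adj = 1.0 / info_.at(r.client).resv;
            for (auto& pending : q_)
                if (pending.client == r.client) pending.r_tag -= adj;
            prev_r_[r.client] -= adj;
        }
        return r;
    }
};
```

The real dmclock library keeps the requests in heaps rather than scanning linearly (hence the heap-management fix listed in the contributions below); the scans here are only for clarity.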
QoS on SKT: Contributions
□ Development and stabilization of the QoS algorithm (https://github.com/ceph/dmclock)
• Improved QoS instability under high load
• Fixed a QoS error caused by a heap-management bug
• Fixed the tag-adjustment algorithm that calibrates proportional-share tags against real time
• Enabled changing QoS parameters at run time
• Improved the client QoS service-status monitoring and reporting algorithm
• Added an anticipation mechanism to solve the deceptive-idleness problem
□ QoS simulator stabilization and usability improvements (https://github.com/ceph/dmclock)
• File-based simulator settings
• High-performance setup; fixed a simulation error-reporting error
• Fixed a server-node selection error
□ Ceph integration work
• Delivery of mClock distribution parameters (delta, rho and phase) + enabling the client QoS tracker (https://github.com/ceph/ceph/pull/16369)
• osd: use the dmclock library client_info_f function dynamically (https://github.com/ceph/ceph/pull/17063)
• Pool-based dmClock queue (WIP) (https://github.com/ceph/ceph/pull/19340)
• Anticipation timeout configuration (https://github.com/ceph/ceph/pull/18827)
5
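The "delta, rho and phase" parameters mentioned above are the dmClock feedback a client piggybacks on each request: roughly, how many of its requests completed overall (delta) and how many completed in the reservation/constraint-based phase (rho) since its previous request to the same server. Below is a minimal client-side sketch of that bookkeeping; the class and field names are illustrative and do not reflect the actual dmclock ServiceTracker API.

```cpp
#include <cstdint>
#include <map>

// Illustrative dmClock client-side feedback tracker (names are hypothetical).
enum class Phase { reservation, priority };            // which clock served the request

struct ReqParams { uint32_t delta; uint32_t rho; };    // piggybacked on each request

class FeedbackTracker {
    // Totals of completed requests, by phase.
    uint64_t total_done_ = 0;
    uint64_t resv_done_  = 0;
    // Snapshot of the totals when the last request was sent to each server.
    struct Snapshot { uint64_t total; uint64_t resv; };
    std::map<int, Snapshot> last_sent_;

public:
    // Called when a response arrives; the server reports which phase served it.
    void on_response(Phase p) {
        ++total_done_;
        if (p == Phase::reservation) ++resv_done_;
    }

    // Called just before sending a request to server `s`; returns the
    // (delta, rho) counted since the previous request to that server.
    ReqParams on_send(int s) {
        Snapshot& prev = last_sent_[s];                // zero-initialized on first use
        ReqParams rp{
            static_cast<uint32_t>(total_done_ - prev.total),
            static_cast<uint32_t>(resv_done_  - prev.resv)
        };
        prev = {total_done_, resv_done_};
        return rp;
    }
};
```

In the dmClock algorithm, the receiving server then advances the request's weight tag by delta/w and its reservation tag by rho/r, so a client's tags reflect the service it has already received elsewhere.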
Currently Implemented Ceph QoS Pool Units
1. Each pool's QoS properties (MIN, MAX, Weight) are stored in the OSD Map.
   CLI example: ceph osd pool set [pool ID] resv 4000.0 (IOPS)
2. The distribution status is monitored by the "dmClock QoS Monitor", and the status info is embedded in RADOS requests.
3. Since each OSD has the latest OSD Map, it knows the QoS settings of the pool a received request belongs to, and applies QoS using the pool's QoS properties plus the QoS status info.
[Figure: each user (LIBRBD → dmClock QoS Monitor → LIBRADOS → CRUSH algorithm) sends I/O together with QoS status info (δ for the target OSD) to pools #0, #1, ...; every OSD (#1 ... #n) runs a dmClock QoS scheduler in front of its STORE and holds the OSD Map with each pool's properties {MIN, MAX, Weight}.]
6
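As a rough illustration of step 3, the sketch below shows how an OSD could resolve dmClock parameters for an incoming op from the pool's QoS properties in its OSD map. All types and names here are invented for illustration; the actual pool-based queue is the WIP pull request referenced above.

```cpp
#include <cstdint>
#include <map>

// Illustrative lookup of pool-level dmClock parameters on the OSD side.
struct PoolQoS { double resv; double wgt; double lim; };   // MIN, Weight, MAX (IOPS)

struct OSDMapLite {
    std::map<int64_t, PoolQoS> pool_qos;                   // pool id -> QoS properties
};

struct OpRef { int64_t pool_id; uint32_t delta; uint32_t rho; };  // from the client tracker

// dmClock-style client info derived from the pool the op belongs to.
struct ClientInfo { double reservation; double weight; double limit; };

ClientInfo qos_for_op(const OSDMapLite& osdmap, const OpRef& op) {
    auto it = osdmap.pool_qos.find(op.pool_id);
    if (it == osdmap.pool_qos.end())
        // Pool has no QoS spec: fall back to best effort (no reservation, no limit).
        return ClientInfo{0.0, 1.0, 0.0};
    const PoolQoS& q = it->second;
    return ClientInfo{q.resv, q.wgt, q.lim};
}

// The OSD would then enqueue the op into its dmClock queue keyed by pool id,
// passing (delta, rho) so the tags account for service the client already
// received from other OSDs, e.g. (hypothetical call):
//   queue.add_request(op, /*client=*/op.pool_id, qos_for_op(osdmap, op),
//                     op.delta, op.rho);
```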
Outstanding IO Based Throttling
[Figure: ops from the client and the replica OSD enter a Ceph device and flow Messenger → OSD → dmClock → PG → PGBackend → ObjectStore → SSD or HDD. An OIO Monitor brackets the PG-to-disk range with oio_throttle_get() / oio_throttle_put(), measuring the current load by counting outstanding I/Os; when OIOs are enough it calls suspend(), and when OIOs aren't enough it calls resume() on the dmClock queue.]
□ Throttler for QoS
✓ Dispatch just enough I/O requests to keep the system fully utilized.
✓ The I/Os held back in the dmClock queue can then be used for more accurate scheduling later.
□ OIO Monitor
✓ Measures the average throughput.
✓ Measures the average outstanding I/Os (OIOs from PG to disk).
✓ Tracks the maximum throughput.
□ Suspending dmClock scheduling
✓ Only proportional scheduling is throttled
– A null operation is returned
✓ Reservation scheduling continues, because it is the minimum requirement.
7
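A minimal sketch of the outstanding-IO throttle described above: oio_throttle_get()/oio_throttle_put() count in-flight I/Os between the PG and the disk, and the dmClock queue's proportional phase is suspended once enough I/Os are outstanding to saturate the device. The fixed threshold and the suspend()/resume() hooks are simplifying assumptions; the real monitor also tracks average and maximum throughput to decide the target.

```cpp
#include <atomic>
#include <functional>

// Illustrative outstanding-IO (OIO) throttle. Function names follow the slide
// (oio_throttle_get/put, suspend/resume); the internals are simplified.
class OioThrottle {
    std::atomic<int>  outstanding_{0};
    std::atomic<bool> suspended_{false};
    int target_;                         // OIOs needed to keep the device busy
    std::function<void()> suspend_;      // stop weight-based (proportional) dequeueing
    std::function<void()> resume_;       // allow weight-based dequeueing again

public:
    OioThrottle(int target, std::function<void()> suspend, std::function<void()> resume)
        : target_(target), suspend_(std::move(suspend)), resume_(std::move(resume)) {}

    // Called when an I/O is dispatched from the PG toward the disk.
    void oio_throttle_get() {
        if (++outstanding_ >= target_ && !suspended_.exchange(true))
            suspend_();      // enough OIOs: hold further proportional dequeues
    }

    // Called when that I/O completes.
    void oio_throttle_put() {
        if (--outstanding_ < target_ && suspended_.exchange(false))
            resume_();       // device may go idle: let proportional dequeues flow again
    }
};
```

Only the weight-based phase is gated this way; reservation-phase dequeues keep flowing, matching the note above that reservation scheduling continues as the minimum guarantee.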
Queue Depth Problem
✓ In one OSD there is one dmClock queue per shard (1-to-1 with the number of shards).
✓ As operations are spread across more dmClock queues, the average queue depth of each queue shrinks.
✓ Too few requests in a queue mean no rearrangement and no competition, so weight-based differentiation disappears (as the numbers below show).
✓ Recommendation: set the shard count to a small number when using the dmClock queue.
[Figure: with a single shard, all clients' ops stack up in one dmClock queue; with n shards, each queue sees only a short run of ops, so the stacking of ops is shortened.]
Per-client results (client weight in parentheses):
Shards: 1  → Cli 0 (W:1): 2952, Cli 1 (W:1): 3060, Cli 2 (W:3): 9326, Cli 3 (W:3): 9487, Cli 4 (W:5): 15102
Shards: 10 → Cli 0 (W:1): 15230, Cli 1 (W:1): 15468, Cli 2 (W:3): 16368, Cli 3 (W:3): 14801, Cli 4 (W:5): 15753
8
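Why more shards shorten each queue: an op's placement group determines its shard, so with many shards each dmClock queue sees only a fraction of the ops and often holds just zero or one at a time, leaving nothing to reorder. The toy sketch below mirrors, roughly, how a sharded work queue spreads ops; the names are illustrative.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Toy illustration: ops are distributed to per-shard dmClock queues by PG,
// so per-queue depth shrinks as the shard count grows.
struct Op { uint64_t pg; int client; };

std::vector<std::vector<Op>> distribute(const std::vector<Op>& ops, unsigned num_shards) {
    std::vector<std::vector<Op>> shards(num_shards);
    for (const Op& op : ops)
        shards[std::hash<uint64_t>{}(op.pg) % num_shards].push_back(op);
    return shards;
}

// With 64 queued ops and 1 shard, one dmClock queue holds all 64 and can
// reorder them by weight; with 10 shards each queue holds ~6 on average,
// so there is little left to rearrange.
```

This is why the slide recommends a small shard count when the dmClock queue is enabled: the scheduler can only differentiate clients whose requests actually queue up together.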
OIO Based Throttler Test
□ Test environment
✓ Node hardware: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 256GB memory, 480GB SATA SSD x 8, CentOS 7
✓ Network: 10G cluster and public networks
✓ 5 client nodes
– Each client node uses a single pool exclusively, so 5 pools in total are used for this test.
✓ 4 OSD nodes
– 8 OSD daemons per node
– BlueStore
– Single-shard and 10-shard-thread configurations
✓ Test code
– https://github.com/ceph/ceph/pull/17450
✓ fio options
– ioengine=rbd, 64 QD, 4 jobs, 4KB random write
9
Plan and so on…
□ Plan
✓ Weight requests by size or type.
✓ Improve the dmClock algorithm.
✓ Extend QoS to individual RBD images.
✓ Add more metrics (throughput, latency, etc.).
✓ Test dmClock QoS and the OIO throttler in various environments.
10
Q & A
11