Ceph Storage on SSD for Container
JANGSEON RYU
NAVER
Agenda
• What is a Container?
• Persistent Storage
• What is Ceph Storage?
• Write Process in Ceph
• Performance Test
What is a Container?
What is a Container?
• VM vs Container
• OS Level Virtualization
• Performance
• Run Everywhere – Portable
• Scalable – lightweight
• Environment Consistency
• Easy to manage Image
• Easy to deploy
Diagram: Docker Container vs. Virtual Machine architecture
Reference : https://www.docker.com/what-container
What is a Docker Container?
• VM vs Container
• OS Level Virtualization
• Performance
• Run Everywhere – Portable
• Scalable – lightweight
• Environment Consistency
• Easy to manage Image
• Easy to deploy
Diagram: namespaces provide isolation; cgroups provide resource limiting (CPU / MEM / DISK / NET)
Reference : https://www.slideshare.net/PhilEstes/docker-london-container-security
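To make the namespace/cgroup point concrete, here is a minimal, illustrative sketch (image name and limit values are assumptions, not from the talk) of starting a container with cgroup-based resource limits:

# cgroups cap the container's CPU and memory; namespaces give it an isolated
# view of processes, network, mounts, etc.
docker run -d --name web --cpus="1.0" --memory="512m" nginx:latest

# Inspect the limits Docker applied (host-side):
docker inspect --format '{{.HostConfig.NanoCpus}} {{.HostConfig.Memory}}' web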
What is a Docker Container?
• VM vs Container
• OS Level Virtualization
• Performance
• Run Everywhere – Portable
• Scalable – lightweight
• Environment Consistency
• Easy to manage Image
• Easy to deploy
Reference : https://www.theregister.co.uk/2014/08/18/docker_kicks_kvms_butt_in_ibm_tests/
What is a Docker Container?
• VM vs Container
• OS Level Virtualization
• Performance
• Run Everywhere – Portable
• Scalable – lightweight
• Environment Consistency
• Easy to manage Image
• Easy to deploy
Reference : https://www.ibm.com/blogs/bluemix/2015/08/c-ports-docker-containers-across-multiple-clouds-datacenters/
PM / VM / Cloud
What is a Docker Container?
• VM vs Container
• OS Level Virtualization
• Performance
• Run Everywhere – Portable
• Scalable – lightweight
• Environment Consistency
• Easy to manage Image
• Easy to deploy
Diagram: the original image (App A + Libs) is changed/updated into a new image (App A' + Libs), which is then deployed.
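The image-update flow in the diagram can be sketched with ordinary Docker commands (registry, image, and container names below are illustrative):

# Build the changed image A' and publish it; only modified layers are re-uploaded.
docker build -t registry.example.com/app-a:v2 .
docker push registry.example.com/app-a:v2

# On each host, pull the new tag and replace the running container.
docker pull registry.example.com/app-a:v2
docker rm -f app-a
docker run -d --name app-a registry.example.com/app-a:v2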
What is a Docker Container?
• VM vs Container
• OS Level Virtualization
• Performance
• Run Everywhere – Portable
• Scalable – lightweight
• Environment Consistency
• Easy to manage Image
• Easy to deploy
Reference : http://www.devopsschool.com/slides/docker/docker-web-development/index.html#/17
What is a Docker Container?
• VM vs Container
• OS Level Virtualization
• Performance
• Run Everywhere – Portable
• Scalable – lightweight
• Environment Consistency
• Easy to manage Image
• Easy to deploy
Reference : https://www.slideshare.net/insideHPC/docker-for-hpc-in-a-nutshell
Persistent Storage
Stateless vs Stateful Container
Stateless: easy to scale out
• Nothing to write on disk
• Web (front-end)
• Easy to scale in/out
• Container is ephemeral
• If deleted, its data is lost
Stateful: hard to scale out
• Needs storage to write to
• Database
• Logs
• CI config / repo data
• Secret keys
Ephemeral vs Persistent
Ephemeral Storage
• Data is lost
• Local storage
• Stateless container
Persistent Storage
• Data is preserved
• Network storage
• Stateful container
Reference : https://www.infoworld.com/article/3106416/cloud-computing/containerizing-stateful-applications.html
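As a minimal sketch of the stateful case, a named volume keeps data beyond a container's lifetime (volume/image names and the password are illustrative):

docker volume create dbdata
docker run -d --name db -e MYSQL_ROOT_PASSWORD=secret -v dbdata:/var/lib/mysql mysql:5.7

# Removing and recreating the container does not lose the database files:
docker rm -f db
docker run -d --name db -e MYSQL_ROOT_PASSWORD=secret -v dbdata:/var/lib/mysql mysql:5.7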
The Need for Persistent Storage
Our mission is to provide a "Persistent Storage Service"
while maintaining the "Agility" & "Automation" of Docker containers.
Container in NAVER
Architecture diagram: a Docker Swarm cluster (macvlan networking, IPVS/VRRP load balancing, BGP/ECMP routing to the internet, Consul-backed dynamic DNS) uses a Ceph cluster (MONs + OSDs) for block storage. A volume plugin on each host gets an auth token from Keystone, requests volume management from Cinder, and attaches the volume to the container as /dev/rbd0.
DEVIEW 2016 : https://www.slideshare.net/deview/221-docker-orchestration
OpenStack 2017 : https://www.slideshare.net/JangseonRyu/on-demandblockstoragefordocker-77895159
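From the user's side, such a plugin is consumed through the standard Docker volume API. A hedged sketch, assuming a hypothetical driver name ("rbd-volume") and illustrative options, not NAVER's actual plugin:

# The plugin authenticates with Keystone, asks Cinder for a volume, and maps it as /dev/rbd0.
docker volume create --driver rbd-volume --name pgdata -o size=20GB
docker run -d --name pg -e POSTGRES_PASSWORD=secret \
  -v pgdata:/var/lib/postgresql/data postgres:9.6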
What is Ceph Storage?
What is Ceph Storage?
• Open Source Distributed Storage Solution
• Massively Scalable / Efficient Scale-Out
• Unified Storage
• Runs on commodity hardware
• Integrations into Linux Kernel
/ QEMU/KVM Driver / OpenStack
• Self-managing / Self-healing
• Peer-to-Peer Storage Nodes
• RESTful API
• No metadata bottleneck (no lookup)
• CRUSH algorithm determines data placement
• Replicated / Erasure Coding
• Architecture
No vendor lock-in
What is Ceph Storage?
• Open Source Distributed Storage Solution
• Massively Scalable / Efficient Scale-Out
• Unified Storage
• Runs on commodity hardware
• Integrations into Linux Kernel
/ QEMU/KVM Driver / OpenStack
• Self-managing / Self-healing
• Peer-to-Peer Storage Nodes
• RESTful API
• No metadata bottleneck (no lookup)
• CRUSH algorithm determines data placement
• Replicated / Erasure Coding
• Architecture
Up to 16 exabytes
What is Ceph Storage?
• Open Source Distributed Storage Solution
• Massively Scalable / Efficient Scale-Out
• Unified Storage
• Runs on commodity hardware
• Integrations into Linux Kernel
/ QEMU/KVM Driver / OpenStack
• Self-managing / Self-healing
• Peer-to-Peer Storage Nodes
• RESTful API
• No metadata bottleneck (no lookup)
• CRUSH algorithm determines data placement
• Replicated / Erasure Coding
• Architecture
Reference : https://en.wikipedia.org/wiki/Ceph_(software)
What is Ceph Storage?
• Open Source Distributed Storage Solution
• Massively Scalable / Efficient Scale-Out
• Unified Storage
• Runs on commodity hardware
• Integrations into Linux Kernel
/ QEMU/KVM Driver / OpenStack
• Self-managing / Self-healing
• Peer-to-Peer Storage Nodes
• RESTful API
• No metadata bottleneck (no lookup)
• CRUSH algorithm determines data placement
• Replicated / Erasure Coding
• Architecture
Low TCO
What is Ceph Storage?
• Open Source Distributed Storage Solution
• Massively Scalable / Efficient Scale-Out
• Unified Storage
• Runs on commodity hardware
• Integrations into Linux Kernel
/ QEMU/KVM Driver / OpenStack
• Self-managing / Self-healing
• Peer-to-Peer Storage Nodes
• RESTful API
• No metadata bottleneck (no lookup)
• CRUSH algorithm determines data placement
• Replicated / Erasure Coding
• Architecture
KVM integrated
with RBD
Supported in kernel 2.6 and above
What is Ceph Storage?
• Open Source Distributed Storage Solution
• Massively Scalable / Efficient Scale-Out
• Unified Storage
• Runs on commodity hardware
• Integrations into Linux Kernel
/ QEMU/KVM Driver / OpenStack
• Self-managing / Self-healing
• Peer-to-Peer Storage Nodes
• RESTful API
• No metadata bottleneck (no lookup)
• CRUSH algorithm determines data placement
• Replicated / Erasure Coding
• Architecture
What is Ceph Storage?
• Open Source Distributed Storage Solution
• Massively Scalable / Efficient Scale-Out
• Unified Storage
• Runs on commodity hardware
• Integrations into Linux Kernel
/ QEMU/KVM Driver / OpenStack
• Self-managing / Self-healing
• Peer-to-Peer Storage Nodes
• RESTful API
• No metadata bottleneck (no lookup)
• CRUSH algorithm determines data placement
• Replicated / Erasure Coding
• Architecture
Diagram, Traditional vs Ceph:
Traditional: the client first queries a central metadata server (a bottleneck), then writes to storage.
Ceph: the client computes placement with CRUSH and writes directly to the storage nodes; metadata is distributed across the storage nodes.
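The "no lookup" property can be seen directly: placement is computed on demand from the CRUSH map rather than fetched from a metadata server. A small sketch with illustrative pool/object names:

# Ask where CRUSH places an object; no metadata server is queried.
ceph osd map rbd my-object
# The output shows the placement group and the acting set of OSDs chosen by CRUSH.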
What is Ceph Storage?
• Open Source Distributed Storage Solution
• Massively Scalable / Efficient Scale-Out
• Unified Storage
• Runs on commodity hardware
• Integrations into Linux Kernel
/ QEMU/KVM Driver / OpenStack
• Self-managing / Self-healing
• Peer-to-Peer Storage Nodes
• RESTful API
• No metadata bottleneck (no lookup)
• CRUSH algorithm determines data placement
• Replicated / Erasure Coding
• Architecture
Replicated Pool (object stored as 3 full copies)
• Very high durability
• 200 % overhead
• Quick recovery
Erasure Coded Pool (object split into chunks, e.g. 4 data + 2 coding: 1 2 3 4 X Y)
• Cost-effective
• 50 % overhead
• Expensive recovery
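A minimal sketch of creating both pool types (pool names, PG counts, and the EC profile values are illustrative):

# Replicated pool: each object is stored as N full copies (osd_pool_default_size, typically 3).
ceph osd pool create rep-pool 128 128 replicated

# Erasure-coded pool: objects are split into k data + m coding chunks (here 4 + 2).
ceph osd erasure-code-profile set ec-4-2 k=4 m=2
ceph osd pool create ec-pool 128 128 erasure ec-4-2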
What is Ceph Storage?
• Open Source Distributed Storage Solution
• Massively Scalable / Efficient Scale-Out
• Unified Storage
• Runs on commodity hardware
• Integrations into Linux Kernel
/ QEMU/KVM Driver / OpenStack
• Self-managing / Self-healing
• Peer-to-Peer Storage Nodes
• RESTful API
• No metadata bottleneck (no lookup)
• CRUSH algorithm determines data placement
• Replicated / Erasure Coding
• Architecture
Diagram: clients and monitors reach the OSD nodes over the public network; replication between OSDs (primary → secondary → tertiary) runs over the cluster network.
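The two networks in the diagram are plain ceph.conf settings; a hedged sketch with illustrative subnets:

cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
# client / monitor <-> OSD traffic
public network = 10.10.0.0/24
# OSD <-> OSD replication and recovery traffic
cluster network = 10.10.1.0/24
EOF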
Write Process in Ceph
◼ Strong Consistency
◼ CAP Theorem: CP system (Consistency / Partition tolerance)
Diagram: client → primary OSD → secondary & tertiary OSDs (write steps ① – ④).
Write Flow
① Client writes data to the primary OSD.
② Primary OSD sends the data to the replica OSDs and writes it to its local disk.
③ Replica OSDs write the data to their local disks and signal completion to the primary.
④ Primary OSD signals completion to the client.
◼ All data disks are SATA.
◼ 7.2k rpm SATA: ~75–100 IOPS
① Client writes data to the primary OSD.
② Primary OSD sends the data to the replica OSDs and writes it to its local disk.
③' Secondary OSD writes the data to its local disk and sends an ack to the primary.
③'' Tertiary OSD is slow writing the data to its local disk, so its ack to the primary is delayed.
④ The ack to the client is delayed.
Diagram: with all-SATA data disks, high disk usage on any replica OSD delays its ack, producing slow requests.
Journal on SSD
Diagram: each OSD (primary, secondary, tertiary) writes to its journal first, then to its FileStore (steps ① – ④ as before).
Each OSD: journal on SSD, written with O_DIRECT + O_DSYNC; FileStore on SATA (XFS), written with buffered I/O.
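On a FileStore OSD the journal is simply a file or symlink under the OSD's data directory; "journal on SSD" means pointing it at an SSD partition. A minimal sketch (OSD id, paths, and size are illustrative):

# Where the journal of osd.2 currently points (often a symlink to an SSD partition):
ls -l /var/lib/ceph/osd/ceph-2/journal

# Journal location and size can also be set in ceph.conf:
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
osd journal = /var/lib/ceph/osd/$cluster-$id/journal
# journal size in MB
osd journal size = 10240
EOF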
Performance Test
Test Environment
Network: 10G switch; separate public and cluster networks
Clients: KRBD (/dev/rbd0, mkfs / mount), fio
Ceph: FileStore, Luminous (12.2.0)
Servers: Intel® Xeon CPU L5640 2.27 GHz, 16 GB memory, 480 GB SAS 10K x 5, 480 GB SAS SSD x 1, CentOS 7
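A hedged sketch of the client-side test flow described above (image name, mount point, and fio parameters are illustrative; the talk does not give the exact fio job):

rbd create bench --size 102400          # 100 GB RBD image in the default pool
rbd map bench                           # appears as /dev/rbd0 via KRBD
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt/bench

fio --name=randwrite --filename=/mnt/bench/testfile --size=10G \
    --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 \
    --runtime=300 --time_based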
Case #1. SAS disks only (no SSD for journal)
Expected result, 4K random write:
• SAS: 10,000 rpm ≈ 600 IOPS
• Per node: 3 SAS OSDs x 600 = 1,800 IOPS
• Total: 3 nodes x 1,800 = 5,400 IOPS
• With 3 replicas: 5,400 / 3 = 1,800 IOPS
Measured: 2,094 IOPS, latency 34.36
Case #2. SSD for Journal
Result, 4K random write:
Expected:
• SSD: ~15,000 IOPS (each node: 3 SAS OSDs at ~600 IOPS + 1 SSD journal)
• Total: 3 nodes x 15,000 = 45,000 IOPS
• With 3 replicas: 45,000 / 3 = 15,000 IOPS
Measured: 2,094 → 2,273 IOPS, latency 31.65
Analysis
# ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show | grep queue_max
…
"filestore_queue_max_bytes": "104857600",
"filestore_queue_max_ops": "50",    ← default value
…
# ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok perf dump | grep -A 15 throttle-filestore_ops
"throttle-filestore_ops": {
    "val": 50,    ← limitation
    "max": 50,
…
Diagram: client block I/O is queued in the FileStore queue (max 50 ops) before being flushed to the SAS disk.
Case #3. Tuning FileStore (queue max)
Tuning:
filestore_queue_max_ops = 500000
filestore_queue_max_bytes = 42949672960
journal_max_write_bytes = 42949672960
journal_max_write_entries = 5000000
Result: 2,094 → 2,899 IOPS (+38%), latency 174.76
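A hedged sketch of applying these values to running OSDs without a restart (the values are the ones above; injectargs changes are not persistent, so mirror them in ceph.conf as well):

ceph tell osd.* injectargs '--filestore_queue_max_ops 500000 --filestore_queue_max_bytes 42949672960'
ceph tell osd.* injectargs '--journal_max_write_entries 5000000 --journal_max_write_bytes 42949672960'

# Verify on one OSD:
ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show | grep queue_max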
Analysis
# ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok config show | grep wbthrottle_xfs
"filestore_wbthrottle_xfs_bytes_hard_limit": "419430400",
"filestore_wbthrottle_xfs_bytes_start_flusher": "41943040",
"filestore_wbthrottle_xfs_inodes_hard_limit": "5000",
"filestore_wbthrottle_xfs_inodes_start_flusher": "500",
"filestore_wbthrottle_xfs_ios_hard_limit": "5000",
"filestore_wbthrottle_xfs_ios_start_flusher": "500",
# ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok perf dump | grep -A6 WBThrottle
"WBThrottle": {
    "bytes_dirtied": 21049344,
    "bytes_wb": 197390336,
    "ios_dirtied": 5017,    ← limitation
    "ios_wb": 40920,
    "inodes_dirtied": 1195,
    "inodes_wb": 20602
Diagram: even with the FileStore queue raised to 500,000 ops, WBThrottle limits the number of dirty I/Os that may accumulate before being flushed to the SAS disk.
Case #4. Tuning WBThrottle
Tuning:
"filestore_wbthrottle_enable": "false",
or
"filestore_wbthrottle_xfs_bytes_hard_limit": "4194304000",
"filestore_wbthrottle_xfs_bytes_start_flusher": "419430400",
"filestore_wbthrottle_xfs_inodes_hard_limit": "500000",
"filestore_wbthrottle_xfs_inodes_start_flusher": "5000",
"filestore_wbthrottle_xfs_ios_hard_limit": "500000",
"filestore_wbthrottle_xfs_ios_start_flusher": "5000",
Result: 2,094 → 14,264 IOPS (x7), latency 40.65
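A hedged sketch of persisting the simpler variant (disabling WBThrottle); the section name and restart unit assume a standard systemd deployment:

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
filestore wbthrottle enable = false
EOF

# Restart OSDs one at a time so the cluster stays healthy:
systemctl restart ceph-osd@2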
Analysis
After ~35 seconds, performance (IOPS) drops to 100–200 IOPS…
Diagram: journal writes go to the SSD with O_DIRECT + O_DSYNC, but FileStore writes are buffered I/O that accumulates in the page cache before being flushed to the SAS disk.
vm.dirty_background_ratio: 10%
vm.dirty_ratio: 20% → 50%
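The kernel knobs involved are ordinary sysctls; a minimal sketch (the 50 % value mirrors the slide; whether it is safe depends on memory size and workload):

sysctl vm.dirty_background_ratio vm.dirty_ratio    # inspect current values
sysctl -w vm.dirty_ratio=50                        # let more dirty page cache accumulate before writers block

# Persist across reboots:
echo 'vm.dirty_ratio = 50' >> /etc/sysctl.d/90-ceph.conf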
Problems of Throttle
Diagram: with throttling enabled, the client sees ~2,000 IOPS even with an SSD journal; with throttling disabled, the client sees ~15,000 IOPS but the SAS FileStore becomes the bottleneck.
Throttle enabled:
• Slow performance
• No benefit from the SSD
Throttle disabled:
• Fast performance
• Dangerous: page cache usage grows very high
• Can crash the Ceph storage
Dynamic Throttle
Diagram: with a dynamic throttle, the client can burst at SSD speed (~15,000 IOPS) and is then gradually throttled down (toward ~500 IOPS) as the FileStore queue fills.
• Burst (~60 secs)
• Throttle from 80 % queue usage
Related options:
filestore_queue_max_ops
filestore_queue_max_bytes
filestore_expected_throughput_ops
filestore_expected_throughput_bytes
filestore_queue_low_threshhold
filestore_queue_high_threshhold
filestore_queue_high_delay_multiple
filestore_queue_max_delay_multiple
Reference : http://blog.wjin.org/posts/ceph-dynamic-throttle.html
Delay calculation (pseudocode):
r = current_op / max_ops
high_delay_per_count = high_multiple / expected_throughput_ops
max_delay_per_count = max_multiple / expected_throughput_ops
s0 = high_delay_per_count / (high_threshhold - low_threshhold)
s1 = (max_delay_per_count - high_delay_per_count) / (1 - high_threshhold)
if r < low_threshhold:
    delay = 0
elif r < high_threshhold:
    delay = (r - low_threshhold) * s0
else:
    delay = high_delay_per_count + (r - high_threshhold) * s1
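A hedged sketch of enabling the dynamic throttle on the FileStore queue (threshold and multiple values are illustrative, not the presenter's production settings; the delay multiples default to 0, which leaves the dynamic throttle disabled):

ceph tell osd.* injectargs '--filestore_expected_throughput_ops 15000'
ceph tell osd.* injectargs '--filestore_queue_low_threshhold 0.6 --filestore_queue_high_threshhold 0.8'
ceph tell osd.* injectargs '--filestore_queue_high_delay_multiple 2 --filestore_queue_max_delay_multiple 10'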
Conclusion
◼ Large performance improvement with an SSD journal plus tuning: 2,094 → 14,264 IOPS (x7)
◼ Throttling is necessary for stable storage operation: use the Dynamic Throttle
◼ Both OS tuning (page cache, I/O scheduler) and Ceph configuration tuning are needed
Q&A
