Build a High-Performance and Highly Durable Block Storage Service Based on Ceph
Rongze & Haomai
[ rongze | haomai ]@unitedstack.com
CONTENTS
1 Block Storage Service
2 High Performance
3 High Durability
4 Operation Experience
THE FIRST PART
01 Block Storage Service
Block Storage Service Highlights
• SLA: 6000 IOPS, 170 MB/s, 95% of requests under 2 ms
• 3 copies, strong consistency, 99.99999999% durability
• All management operations complete in seconds
• Real-time snapshots
• Performance volume type and capacity volume type
Software used 
            Essex     Folsom    Havana            Icehouse/Juno     Juno (now)
Ceph        0.42      0.67.2    based on 0.67.5   based on 0.67.5   based on 0.80.7
CentOS      6.4       6.5       6.5               6.6
Qemu        0.12      0.12      based on 1.2      based on 1.2      2.0
Kernel      2.6.32    3.12.21   3.12.21           ?
Deployment Architecture 
minimum deployment
12 OSD nodes and 3 monitor nodes
[Deployment diagram: 12 Compute/Storage nodes connected to two 40 Gb switches]
Scale-out 
the minimum scale deployment: 12 OSD nodes
[Deployment diagram: 12 Compute/Storage nodes connected to two 40 Gb switches]

scale-out deployment: 144 OSD nodes
[Deployment diagram: the 12-node building block repeated, with each group of Compute/Storage nodes attached to its own 40 Gb switches]
OpenStack
Boot Storm (traditional architecture)
[Diagram: Nova VMs boot 20 GB images fetched over HTTP from Glance/Cinder; backends shown include LVM, SAN, Ceph, LocalFS, Swift, GlusterFS, NFS]
Copying a 20 GB image to every booting VM:
• 1 Gb network: 20 GB / 100 MB/s = 200 s ≈ 3 mins
• 10 Gb network: 20 GB / 1000 MB/s = 20 s
Nova, Glance and Cinder use the same Ceph pool
• All actions complete in seconds
• No boot storm
[Diagram: Nova VMs, Cinder and Glance all backed by one Ceph cluster]
QoS
• Nova
• Libvirt
• Qemu (throttle)
Two Volume Types (see the sketch below)
• Cinder multi-backend
• Ceph SSD Pool
• Ceph SATA Pool
Shared Volume
• Read Only
• Multi-attach
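As a rough illustration of how the two volume types and front-end QoS above can be wired together, here is a hedged Cinder sketch. The IOPS and bandwidth numbers mirror the SLA slide; the backend, pool and QoS-spec names (ceph-ssd, ssd, performance-qos, etc.) are assumptions, not taken from the deck.

```bash
# cinder.conf is assumed to define two RBD backends, e.g.:
#   enabled_backends = ceph-ssd,ceph-sata
#   [ceph-ssd]
#   volume_driver = cinder.volume.drivers.rbd.RBDDriver
#   rbd_pool = ssd
#   volume_backend_name = ceph-ssd
#   [ceph-sata]  ... rbd_pool = sata, volume_backend_name = ceph-sata

# One volume type per Ceph pool.
cinder type-create performance
cinder type-key performance set volume_backend_name=ceph-ssd
cinder type-create capacity
cinder type-key capacity set volume_backend_name=ceph-sata

# Front-end QoS, enforced by Qemu throttling via Nova/libvirt.
cinder qos-create performance-qos consumer=front-end \
    read_iops_sec=6000 write_iops_sec=6000 \
    read_bytes_sec=178257920 write_bytes_sec=178257920   # 170 MB/s
cinder qos-associate <performance-qos-id> <performance-type-id>
```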
THE SECOND PART 
02 High 
Performance
OS configuration
• CPU:
• Get the CPUs out of power-save mode:
• echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null
• Cgroup:
• Bind Ceph-OSD processes to fixed cores (1-2 cores per OSD)
• Memory:
• Turn off NUMA (if supported) in /etc/grub.conf
• Set vm.swappiness = 0
• Block:
• echo deadline > /sys/block/sd[x]/queue/scheduler
• FileSystem:
• Mount with "noatime,nobarrier"
(a consolidated sketch follows this list)
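A consolidated shell sketch of the host tuning above; it only restates the slide's settings, while the device names, the core-pinning step and the remount example are illustrative assumptions.

```bash
#!/bin/bash
# CPU: take the cores out of power-save mode.
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null

# Memory: never swap OSD processes.
sysctl -w vm.swappiness=0
# (NUMA is disabled via the kernel command line in /etc/grub.conf, not shown here.)

# Block: deadline scheduler on the OSD data disks (sdb..sdz is illustrative).
for sched in /sys/block/sd[b-z]/queue/scheduler; do
    echo deadline > "$sched"
done

# Cgroup: bind each ceph-osd to 1-2 fixed cores, e.g. with a cpuset cgroup or
# taskset; omitted here because the core layout is host-specific.

# FileSystem: mount OSD data partitions with noatime,nobarrier (path is illustrative).
# mount -o remount,noatime,nobarrier /var/lib/ceph/osd/ceph-0
```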
Qemu
• Throttle: smooth IO limit algorithm (backport)
• RBD enhancements: discard and flush enhancements (backport)
• Burst:
• Virt-scsi: multi-queue support
Ceph IO Stack
[Diagram: data flow from Qemu through the network to an OSD]
• Qemu side: VCPU thread, Qemu main thread, RadosClient-finisher, DispatchThreader, Pipe::Reader, Pipe::Writer
• OSD side: Pipe::Reader, Pipe::Writer, DispatchThreader, OSD::OpWQ, FileJournal::Writer, FileJournal-finisher, FileStore::OpWQ, FileStore::SyncThread, FileStore-op_finisher, FileStore-ondisk_finisher
• Data layout: RBD image → RADOS objects → object files
Ceph Optimization
Rule 1: Keep FD
• Facts:
• FileStore bottleneck: remarkable performance degradation when the FD cache misses
• A 480 GB SSD holds 122880 4 MB objects, or 30720 16 MB objects, in theory
• Action:
• Increase FDCache/OMapHeader caches to be large enough to hold all objects
• Increase object size to 16 MB instead of the 4 MB default
• Improve the default OS fd limits
• Configuration (see the sketch below):
• "filestore_omap_header_cache_size"
• "filestore_fd_cache_size"
• "rbd_chunk_size" (OpenStack Cinder)
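A hedged ceph.conf sketch for this rule; the option names are the ones listed above, while the numeric values and the "max open files" line are illustrative assumptions sized for a few tens of thousands of 16 MB objects per OSD. The 16 MB object size itself is set on the OpenStack side via the slide's "rbd_chunk_size" option.

```bash
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
# Raise the fd limit applied by the init script (illustrative value).
max open files = 131072

[osd]
# Large enough to keep an entry for every object on the OSD (illustrative values).
filestore_fd_cache_size = 32768
filestore_omap_header_cache_size = 32768
EOF
```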
Rule 2: Sparse Read/Write
• Facts:
• Often only a few KB of each object is actually used by RBD
• Clone/recovery copies the full object, which is harmful to performance and capacity
• Action:
• Use sparse read/write
• Problem:
• XFS and other local filesystems have existing bugs around fiemap
• Configuration:
• "filestore_fiemap = true"
Rule 3: Drop default limits
• Facts:
• The default throttle values are tuned for HDD backends
• Action:
• Change all throttle-related configuration values (see the sketch below)
• Configuration:
• "filestore_wbthrottle_*"
• "filestore_queue_*"
• "journal_queue_*"
• "…" more related configs (recovery, scrub)
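An illustrative sketch of Rule 3; the filestore/journal queue and writeback-throttle option names below are standard FileStore knobs matching the wildcards above, but every value is an assumption to be tuned for the SSD backend.

```bash
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
# Queue throttles (defaults are sized for HDDs; these larger values are illustrative).
filestore_queue_max_ops = 5000
filestore_queue_max_bytes = 1048576000
journal_queue_max_ops = 5000
journal_queue_max_bytes = 1048576000
# Writeback throttle, XFS variants (illustrative).
filestore_wbthrottle_xfs_ios_start_flusher = 10000
filestore_wbthrottle_xfs_ios_hard_limit = 10000
EOF
```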
Rule 4: Use RBD cache
• Facts:
• RBD cache gives a remarkable performance improvement for sequential read/write
• Action:
• Enable RBD cache (see the sketch below)
• Configuration:
• "rbd_cache = true"
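A client-side ceph.conf sketch for Rule 4; "rbd_cache = true" comes from the slide, the sizing options below are assumptions (the defaults may already be fine).

```bash
cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
rbd_cache = true
# Illustrative sizing; tune or omit.
rbd_cache_size = 67108864
rbd_cache_max_dirty = 50331648
EOF
```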
Rule 5: Speed Cache
• Facts:
• The default cache container implementation isn't suitable for large cache capacities
• Temporary Action:
• Change the cache container to "RandomCache" (out of master branch)
• Applies to FDCache, OMap header cache, ObjectCacher
• Next:
• RandomCache isn't suitable for generic situations
• Implement an effective ARC to replace RandomCache
Rule 6: Keep Thread Running
• Facts:
• Ineffective thread wakeups
• Action:
• Keep the OSD op queue worker threads running
• Configuration:
• Still in pull request (https://github.com/ceph/ceph/pull/2727)
• "osd_op_worker_wake_time"
Rule 7: Async Messenger (experimental)
• Facts:
• Each client connection needs two threads on the OSD side
• Painful context-switch latency
• Action:
• Use the Async Messenger (see the sketch below)
• Configuration:
• "ms_type = async"
• "ms_async_op_threads = 5"
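The two settings above, as an experimental ceph.conf sketch (the values are exactly those listed on the slide):

```bash
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
# Experimental async messenger (values from the slide).
ms_type = async
ms_async_op_threads = 5
EOF
```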
Result: IOPS 
Based on Ceph 0.67.5 Single OSD
Result: Latency 
• 4K random write for 1TB rbd image: 1.05 ms per IO 
• 4K random read for 1TB rbd image: 0.5 ms per IO 
• 1.5x latency performance improvement 
• Outstanding large dataset performance
THE THIRD PART
03 High Durability
Ceph Reliability Model 
• https://wiki.ceph.com/Development/Reliability_model 
• 《CRUSH: Controlled, Scalable, Decentralized Placement of 
Replicated Data》 
• 《Copysets: Reducing the Frequency of Data Loss in Cloud Storage》 
• Ceph CRUSH code 
Durability Formula 
• P = func(N, R, S, AFR) 
• P = Pr * M / C(R, N)
Where do we need to optimize?
• DataPlacement decides Durability 
• CRUSH-MAP decides DataPlacement 
• CRUSH-MAP decides Durability
What do we need to optimize?
• Durability depends on OSD recovery time
• Durability depends on the number of copy sets in the Ceph pool

A copy set is a possible OSD set for a PG. Data loss in Ceph is the loss of any PG, which is actually the loss of some copy set. If the replication number is 3 and we lose 3 OSDs, the probability of data loss depends on the number of copy sets, because those 3 OSDs may not form a copy set.
• The shorter the recovery time, the higher the durability
• The fewer copy sets, the higher the durability
How 
to optimize it?
• The CRUSH-MAP setting decides OSD recovery time
• The CRUSH-MAP setting decides the number of copy sets
Default CRUSH-MAP setting
[CRUSH map diagram: root → 3 racks (rack-01 .. rack-03) → 24 servers (server-01 .. server-24), 24 OSDs per rack]
3 racks, 24 nodes, 72 OSDs
if R = 3, M = 24 * 24 * 24 = 13824
default crush setting
N = 72, S = 3    R = 1    R = 2      R = 3
C(R, N)          72       2556       119280
M                72       1728       13824
Pr               0.99     2.1e-4     4.6e-8
P                0.99     1.4e-4     5.4e-9
Nines            -        3          8
Reduce recovery time
With the default CRUSH-MAP setting, if one OSD goes out, only the two other OSDs on the same host (e.g. server-08) can take part in data recovery, so recovery time is too long. We need more OSDs participating in recovery to reduce the recovery time.
Use an osd-domain bucket instead of the host bucket to reduce recovery time.
[CRUSH map diagram: 3 racks (rack-01 .. rack-03), 24 servers (server-01 .. server-24), regrouped into 6 osd-domains of 12 OSDs each]
if R = 3, M = 24 * 24 * 24 = 13824
new crush map
N = 72, S = 12   R = 1    R = 2      R = 3
C(R, N)          72       2556       119280
M                72       1728       13824
Pr               0.99     7.8e-5     6.7e-9
P                0.99     5.4e-5     7.7e-10
Nines            -        4          9
Reduce the number of copy sets
Use a replica-domain bucket instead of the rack bucket.
[CRUSH map diagram: a failure-domain containing 2 replica-domains; each replica-domain contains 3 osd-domains of 12 OSDs; 3 racks, 24 servers, 72 OSDs in total]
if R = 3, M = (12 * 12 * 12) * 2 = 3456
new crush map
N = 72, S = 12   R = 1    R = 2      R = 3
C(R, N)          72       2556       119280
M                72       864        3456
Pr               0.99     7.8e-5     6.7e-9
P                0.99     2.7e-5     1.9e-10
Nines            0        4          ≈ 10
THE FOURTH PART
04 Operation Experience
deploy
• eNovance: puppet-ceph
• Stackforge: puppet-ceph
• UnitedStack: puppet-ceph
• shorter deploy time
• supports all Ceph options
• supports multiple disk types
• wwn-id instead of disk label
• hieradata
Operation goal: Availability 
• reduce unnecessary data migration 
• reduce slow requests
upgrade ceph 
1. noout: ceph osd set noout 
2. mark down: ceph osd down x 
3. restart: service ceph restart osd.x
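A minimal rolling-restart sketch of the three steps above; the way local OSD IDs are listed and the sleep are illustrative.

```bash
#!/bin/bash
# 1. Prevent rebalancing while OSDs are restarted.
ceph osd set noout

# 2./3. Mark each local OSD down, then restart it with the upgraded packages.
for id in $(ls /var/lib/ceph/osd | sed 's/.*-//'); do
    ceph osd down "$id"
    service ceph restart osd."$id"
    sleep 10   # let the OSD rejoin before touching the next one (illustrative)
done

# Re-enable rebalancing once the cluster is healthy again.
ceph osd unset noout
```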
reboot host 
1. migrate vm 
2. mark down osd 
3. reboot host
expand ceph capacity 
1. setting crushmap 
2. setting recovery options 
3. trigger data migration 
4. observe data recovery rate 
5. observe slow requests
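A rough sketch of the expansion steps above; osd_max_backfills, osd_recovery_max_active and osd_recovery_op_priority are real recovery knobs, but the values, OSD id, weight and bucket names are illustrative assumptions.

```bash
# 2. Throttle recovery before triggering data migration (illustrative values).
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'

# 1./3. Adjust the CRUSH map: adding the new OSDs under their osd-domain
#       triggers the data migration (names and weight are illustrative).
ceph osd crush add osd.72 1.09 osd-domain=osd-domain-07

# 4./5. Watch the recovery rate and any slow requests.
ceph -w
ceph health detail
```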
replace disk
• be careful (see the sketch below)
• ensure the replica-domain's weight stays unchanged, otherwise data (PGs) will migrate to another replica-domain
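A hedged sketch of the replacement flow: the key point from the slide is that the new OSD re-enters the CRUSH map with the same weight in the same osd-domain, so the replica-domain's total weight stays unchanged. The OSD id, weight and bucket name are illustrative.

```bash
# Note the failed OSD's current crush weight (osd.12 is illustrative).
ceph osd tree | grep osd.12

# Remove the failed OSD.
ceph osd out 12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12

# After recreating the OSD on the new disk, add it back with the SAME weight
# under the SAME osd-domain so the replica-domain's weight is unchanged.
ceph osd crush add osd.12 1.09 osd-domain=osd-domain-02
```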
monitoring 
• diamond: add new collector, ceph perf dump, ceph status 
• graphite: store data 
• grafana: display 
• alert: zabbix + the ceph health command
throttle model
add new collector in diamond 
redefine metric name in graphite 
[process].[what].[component].[attr]
Throttle components:
osd_client_messenger, osd_dispatch_client, osd_dispatch_cluster, osd_pg, osd_pg_client_w, osd_pg_client_r, osd_pg_client_rw, osd_pg_cluster_w, filestore_op_queue, filestore_journal_queue, filestore_journal, filestore_wb, filestore_leveldb, filestore_commit

Attributes per component (see the sketch below):
max_bytes, max_ops, ops, bytes, op/s, in_b/s, out_b/s, lat
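A hedged illustration of where these metrics come from: the diamond collector reads each daemon's perf counters over the admin socket ("ceph perf dump", as noted above) and republishes them under the [process].[what].[component].[attr] scheme. The socket path is the default one; the example graphite path is an assumption.

```bash
# Dump an OSD's perf counters via its admin socket (default path; adjust per host).
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump > perf.json

# Inspect the throttle counter groups that end up in graphite as names like
#   osd.0.filestore_journal_queue.max_bytes   (exact path is an assumption)
python -m json.tool perf.json | head -n 40
```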
Accidents 
• SSD GC 
• network failure 
• Ceph bug 
• XFS bug 
• SSD corruption 
• PG inconsistent 
• recovery data filled network bandwidth
@ UnitedStack 
THANK YOU 
FOR WATCHING 
2014/11/05
https://www.ustack.com/jobs/
About UnitedStack
UnitedStack - The Leading OpenStack Cloud Service 
Solution Provider in China
VM deployment in seconds 
WYSIWYG network topology 
High performance cloud storage 
Billing by seconds
Public Cloud 
devel 
Beijing1 
Guangdong1 
Managed Cloud 
Cloud Storage 
Mobile Social
Financial 
U Center 
Unified Cloud Service Platform 
Unified Ops 
Unified SLA 
IDC 
test 
…… ……
The details of the durability formula
• DataPlacement decides Durability 
• CRUSH-MAP decides DataPlacement 
• CRUSH-MAP decides Durability
Default crush setting
[CRUSH map diagram: root → 3 racks (rack-01 .. rack-03) → 24 servers (server-01 .. server-24)]
3 racks, 24 nodes, 72 OSDs
How to compute Durability?
Ceph Reliability Model
• https://wiki.ceph.com/Development/Reliability_model 
• 《CRUSH: Controlled, Scalable, Decentralized 
Placement of Replicated Data》 
• 《Copysets: Reducing the Frequency of Data Loss in 
Cloud Storage》 
• Ceph CRUSH code
Durability Formula
P = func(N, R, S, AFR) 
• P: the probability of losing all copies
• N: the number of OSDs in the Ceph pool
• R: the number of copies (replicas)
• S: the number of OSDs in a bucket (it decides the recovery time)
• AFR: disk annualized failure rate
Failure events are considered to be Poisson
• Failure rates are characterized in units of failures per billion hours (FITs), so all periodicities are represented in FITs and all times in hours:
fit = failures in time = 1/MTTF ≈ 1/MTBF = AFR / (24 * 365)
• Event probabilities: if λ is the failure rate, the probability of n failure events during time t is
Pn(λ, t) = (λt)^n * e^(-λt) / n!
The probability of data loss 
• OSD set: a copy set that some PG resides on
• data loss: the loss of any OSD set
• ignore Non-Recoverable Errors: assume NREs never happen, which might be true on scrubbed OSDs
Non-Recoverable Errors 
NREs are read errors that cannot be 
corrected by retries or ECC. 
• media noise 
• high-fly 
• off-track writes
The probability of R OSDs loss 
1. The probability of an initial OSD loss incident.
2. Having suffered this loss, the probability of losing R-1 more OSDs during the recovery time.
3. Multiply the two. The result is Pr.
The probability of copy set loss
1. M = the number of copy sets in the Ceph pool
2. The number of ways to choose any R OSDs out of N is C(R, N)
3. The probability of copy set loss is Pr * M / C(R, N)
P = Pr * M / C(R, N) 
If R = 3, one PG maps to (osd.x, osd.y, osd.z), and (osd.x, osd.y, osd.z) is a copy set.
All copy sets are in line with the rules of the CRUSH map.
M = the number of copy sets in the Ceph pool
Pr = the probability of losing R OSDs
C(R, N) = the number of ways to choose any R OSDs out of N
P = the probability of losing any copy set (data loss)
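Restating the slide's formula in symbols (nothing new here beyond notation; the example value of M is the one computed for the default CRUSH map later in the deck):

```latex
P = P_r \cdot \frac{M}{C(R,N)}, \qquad C(R,N) = \binom{N}{R},
\qquad M_{\text{default}} = 24 \times 24 \times 24 = 13824 \quad (R = 3)
```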
default crush setting
[CRUSH map diagram: root → 3 racks → 24 servers, 24 OSDs per rack]
if R = 3, M = 24 * 24 * 24 = 13824
default crush setting
AFR = 0.04 
One Disk Recovery Rate = 100 MB/s 
Mark Out Time = 10 mins
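A small worked step (not on the slide) converting that AFR into the hourly failure rate used by the Poisson model above:

```latex
\lambda = \frac{\mathrm{AFR}}{24 \times 365} = \frac{0.04}{8760} \approx 4.6 \times 10^{-6} \ \text{failures per disk-hour}
```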
N = 72, S = 3    R = 1    R = 2      R = 3
C(R, N)          72       2556       119280
M                72       1728       13824
Pr               0.99     2.1e-4     4.6e-8
P                0.99     1.4e-4     5.4e-9
Nines            -        3          8
Trade-off
trade off between durability and availability

new crush   N    R    S     Nines   Recovery time
Ceph        72   3    3     11      31 mins
Ceph        72   3    6     10      13 mins
Ceph        72   3    12    10      6 mins
Ceph        72   3    24    9       3 mins
Shorter recovery time 
Minimize the impact of SLA

Final crush map
old map: root → rack → host → osd
new map: failure-domain → replica-domain → osd-domain → osd