Yuming Ma, Architect
Cisco Cloud Services
Ceph Day, Portland Oregon, May 25th, 2016
Stabilizing Petabyte Ceph Cluster in OpenStack Cloud
Highlights
1. What are we doing with Ceph?
2. What did we start with?
3. We need a bigger boat
4. Getting better and sleeping through the night
5. Lessons learned
Cisco Cloud Services provides an OpenStack platform to Cisco SaaS applications and tenants through a worldwide deployment of datacenters.
Background
SaaS Cases
• Collaboration
• IoT
• Security
• Analytics
• “Unknown Projects”
Swift
• Database (Trove) Backups
• Static Content
• Cold/Offline data for Hadoop
Cinder
• Generic/Magnetic Volumes
• Low Performance
• Boot Volumes for all VM flavors except those with Ephemeral (local) storage
• Glance Image store
• Generic Cinder Volume
• Swift Object store
• In production since March 2014
• 13 clusters in production in two years
• Each cluster is 1800TB raw over 45 nodes and 450 OSDs.
How Do We Use Ceph?
[Diagram: Cisco UCS Ceph high-performance platform serving Generic Volume and Provisioned-IOPS block storage through the Cinder API, and object storage through the Swift API]
• Nice consistent growth…
• Your users will not warn you before:
  • “going live”
  • Migrating out of S3
  • Backing up a Hadoop HDFS
• Stability problems started after 50% used
Growth: It will happen, just not sure when
CCS Ceph 1.0
[Architecture diagram: three racks of 15x UCS C240 OSD nodes each. Per node: LSI 9271 HBA fronting 10 HDDs, each with an XFS data partition and a collocated journal partition backing OSD1–OSD10; OS on a RAID1 mirror (disks 11–12); 2x10Gb private (cluster) network and 2x10Gb public network. OpenStack Keystone, Swift, Cinder, Glance and Nova APIs reach the cluster through the RADOS Gateway and the Ceph block device (RBD) via libvirt/kvm, all on top of the Ceph libRADOS API; monitors are spread across the racks.]
OSD: 45 x UCS C240 M3
• 2x E5-2690 v2, 40 HT/core
• 64GB RAM
• 2x10Gbps for public
• 2x10Gbps for cluster
• 3X replication
• LSI 9271 HBA
• 10 x 4TB HDD, 7200 RPM
• 10GB journal partition from HDD
• RHEL 7, kernel 3.10.0-229.1.2.el7.x86_64
NOVA: UCS C220
• Ceph 0.94.1
• RHEL 7, kernel 3.10.0-229.4.2.el7.x86_64
MON/RGW: UCS C220 M3
• 2x E5-2680 v2, 40 HT/core
• 64GB RAM
• 2x10Gbps for public
• 4x3TB HDD, 7200 RPM
• RHEL 7, kernel 3.10.0-229.4.2.el7.x86_64
Started with Cuttlefish/Dumpling
• Get to MVP and keep costs down.
• High capacity, hence C240 M3 LFF for 4TB HDDs
• Tradeoff was that the C240 M3 LFF could not also accommodate SSDs
• So journals were collocated on the OSD data HDDs
• Monitors were on HDD-based systems as well
Initial Design Considerations
Major Stability Problems: Monitors
Problem: MON election storm impacting client IO
Impact: Monmap changes due to a flaky NIC or chatty messaging between MON and client caused an unstable quorum and an election storm between MON hosts.
Result: blocked and slowed client IO requests

Problem: LevelDB inflation
Impact: LevelDB size grows to XXGB over time, preventing the MON daemon from serving OSD requests.
Result: blocked IO and slow requests

Problem: DDoS due to chatty client message attack
Impact: Slow responses from MON to clients (due to LevelDB or an election storm) cause a message flood from clients.
Result: failed client operations, e.g. volume creation, RBD connection
Major Stability Problems: Cluster
Problem: Backfill & recovery impacting client IO
Impact: Osdmap changes due to loss of a disk result in PG peering and backfilling.
Result: clients receive blocked and slow IO

Problem: Unbalanced data distribution
Impact: Data on OSDs isn’t evenly distributed; the cluster may be 50% full, but some OSDs are at 90%.
Result: backfill isn’t always able to complete

Problem: Slow disk impacting client IO
Impact: A single slow (sick, not dead) OSD can severely impact many clients until it’s ejected from the cluster.
Result: clients have slow or blocked IO
Stability Improvement Strategy
• Client IO throttling*: rate limit IOPS at the Nova host to 250 IOPS per volume
• Backfill and recovery throttling: reduced IO consumption by backfill and recovery processes so they yield to client IO
• Retrofit with NVMe (PCIe) journals: increased overall IOPS of the cluster
• Upgrade to 1.2.3/1.3.2: overall stability and hardened MONs preventing election storms
• LevelDB on SSD (replaced entire MON node): faster cluster map queries
• Re-weight by utilization: balanced data distribution
*Client is the RBD client, not the tenant
• Limit/cap IO consumption at the qemu layer:
  • iops (IOPS, read and write): 250
  • bps (bytes per second, read and write): 100 MB/s
• Predictable and controlled IOPS capacity
• NO min/guaranteed IOPS -> future Ceph feature
• NO burst map -> qemu feature:
  • iops_max 500
  • bps_max 120 MB/s
Client IO throttling
[Charts: client IOPS swing of ~100% without throttling vs. ~12% with throttling]
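One way to apply these limits in an OpenStack deployment is through Cinder front-end QoS specs, which Nova applies on the hypervisor as libvirt/qemu iotune throttles. A minimal sketch; the QoS spec name and the volume-type placeholder are illustrative, not taken from the deck:

# Create a front-end QoS spec: 250 IOPS and 100 MB/s (104857600 bytes/s) per volume
cinder qos-create rbd-throttle consumer=front-end total_iops_sec=250 total_bytes_sec=104857600
# Attach it to the volume type used for Ceph-backed volumes
cinder qos-associate <qos-spec-id> <volume-type-id>

New volumes of that type then carry the limits down to the <iotune> block in the libvirt domain XML on the Nova host, which is where the throttling actually takes effect.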
• Problem
  • Blocked IO during peering
  • Slow requests during backfill
  • Both could cause client IO stalls and vCPU soft lockups
• Solution
  • Throttle backfill and recovery:
osd recovery max active = 3 (default: 15)
osd recovery op priority = 3 (default: 10)
osd max backfills = 1 (default: 10)
Backfill and Recovery Throttling
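These settings go in the [osd] section of ceph.conf for future restarts, and can also be pushed to a running cluster without downtime. A sketch using the standard injectargs mechanism, run from a node with the admin keyring:

# Apply the throttles to every running OSD daemon
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3 --osd-recovery-op-priority 3'
# Spot-check one OSD (run on the host that carries osd.0)
ceph daemon osd.0 config get osd_max_backfills

Note that injectargs changes are not persistent; ceph.conf still needs the same values for the next daemon restart.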
[Test topology diagram: three racks of OSD nodes (6 / 5 / 6 nodes, 10 OSDs each) driven by Nova hosts nova1–nova10, each running up to 20 VMs]
NVMe Journaling: Performance Testing Setup
Journal partitions start at 4MB (s1024), 10GB each, with a 4MB offset in between.
[OSD host disk layout: 10 HDDs behind the LSI 9271 HBA, each configured as a single-disk RAID0 with an XFS data partition for OSD1–OSD10; journals moved to 10 partitions on the NVMe device (~300GB left free); OS on a RAID1 mirror (disks 11–12)]
OSD: C240 M3
• 2x E5-2690 v2, 40 HT/core
• 64GB RAM
• 2x10Gbps for public
• 2x10Gbps for cluster
• 3X replication
• Intel P3700 400GB NVMe
• LSI 9271 HBA
• 10x4TB HDD, 7200 RPM
Nova: C220
• 2x E5-2680 v2, 40 HT/core
• 380GB RAM
• 2x10Gbps for public
• 3.10.0-229.4.2.el7.x86_64
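A sketch of carving those journal partitions on the NVMe device; the device name and GPT partition names are illustrative, and the offsets follow the 10GB-plus-4MB-gap layout described above:

# Label the NVMe device and create the first two of ten 10GB journal partitions
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart journal-1 4MiB 10244MiB
parted -s /dev/nvme0n1 mkpart journal-2 10248MiB 20488MiB
# ... repeat through journal-10, leaving the remaining ~300GB unpartitioned
# Point an OSD at its new journal and rebuild it (OSD stopped, old journal flushed first)
ln -sf /dev/nvme0n1p1 /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal

Each OSD is stopped, flushed (ceph-osd -i N --flush-journal), re-pointed, and restarted; doing this one failure domain at a time keeps the cluster available, and in production the symlink would normally reference a stable by-partuuid path rather than the raw device node.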
NVMe Journaling: Performance Tuning
OSD host iostat:
• Both the NVMe and the HDDs show low disk %util most of the time, with spikes every ~45s.
• Both the NVMe and the HDDs have very low queue size (iodepth), while the frontend VM pushes a queue depth of 16 to FIO.
• CPU %used is reasonable, converging at <30%, but iowait is low, which corresponds to the low disk activity.
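A sketch of how these host-side numbers can be gathered while a guest drives load with fio; the fio job parameters and guest device name are assumptions for illustration, not the exact ones used in the test:

# On the OSD host: extended per-device utilization, queue size and wait stats every second
iostat -xmt 1
# In the guest VM: 4k random writes at queue depth 16 against the attached RBD volume
fio --name=randwrite --filename=/dev/vdb --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=16 --runtime=300 --time_based --group_reporting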
NVMe Journaling: Performance Tuning
Tuning Directions: increase disk %util:
• Disk thread: 4, 16, 32
• Filestore max sync interval: 0.1, 0.2, 0.5, 1, 5, 10, 20
• These two tunings showed no impact:
filestore_wbthrottle_xfs_ios_start_flusher: default 500 vs 10
filestore_wbthrottle_xfs_inodes_start_flusher: default 500 vs 10
• Final Config:
osd_journal_size = 10240 (default :
journal_max_write_entries = 1000 (default: 100)
journal_max_write_bytes = 1048576000 (default: 10485760)
journal_queue_max_bytes = 1048576000 (default: 10485760)
filestore_queue_max_bytes = 1048576000 (default: 10485760)
filestore_queue_committing_max_bytes = 1048576000 (default: 10485760)
filestore_wbthrottle_xfs_bytes_start_flusher = 4194304 (default: 10485760)
NVMe Performance Tuning
Linear tuning of filestore_wbthrottle_xfs_bytes_start_flusher, filestore_wbthrottle_xfs_inodes_start_flusher and filestore_wbthrottle_xfs_ios_start_flusher
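One way to confirm that the final journal and filestore settings are live on a running OSD is through its admin socket; a sketch, assuming the default socket location and that the values were rolled out via ceph.conf plus an OSD restart:

# Dump the relevant settings from osd.0's running configuration (run on its host)
ceph daemon osd.0 config show | grep -E 'journal_max_write|journal_queue_max|filestore_queue|wbthrottle_xfs'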
NVMe Stability Improvement Analysis
MTTR:
• One disk (70% of 3TB) failure
  • Colo: 11 hrs, 7 mins, 6 secs
  • NVMe: 1 hr, 35 mins, 20 secs
• One host (70% of 30TB) failure
  • Colo: 19 hrs, 2 mins, 2 secs
  • NVMe: 16 hrs, 46 mins, 3 secs

Impact to client IOPS:
• Disk failure (70% of 3TB)
  • Colo: 232.991 vs 210.08 (drop: 9.83%)
  • NVMe: 231.66 vs 194.13 (drop: 16.20%)
• Host failure (70% of 30TB)
  • NVMe: 231.66 vs 211.36 (drop: 8.76%)
Backfill and recovery config:
osd recovery max active = 3 (default: 15)
osd max backfills = 1 (default: 10)
osd recovery op priority = 3 (default: 10)
Server impact:
• Shorter recovery time
Client impact:
• <10% impact (tested without IO throttling; should be less with throttling)
LevelDB:
• Key-value store for cluster metadata, e.g. osdmap, pgmap, monmap, clientID, authID, etc.
• Not in the data path
• Still impactful to IO operations: IO can be blocked by the DB query
• Larger size means longer query time, hence longer IO wait -> slow requests
• Solution:
  • LevelDB on SSD to increase disk IO rate
  • Upgrade to Hammer to reduce DB size
MON Level DB Issues
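A sketch of how the monitor store growth can be watched and, on Hammer-era releases, compacted on demand; paths shown are the package defaults:

# Size of each monitor's LevelDB store
du -sh /var/lib/ceph/mon/*/store.db
# Ask a monitor to compact its store (the mon ID is usually the short hostname)
ceph tell mon.$(hostname -s) compact

Setting mon_compact_on_start = true in the [mon] section is another commonly used option, at the cost of slower monitor startup.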
MON Level DB on SSD
New BOM:
• UCS C220 M4 with 120GB SSD
[Charts: write wait time with LevelDB on HDD vs. write wait time with LevelDB on SSD]
• Problem
  • Election storm & LevelDB inflation
• Solutions
  • Upgrade to 1.2.3 to fix the election storm
  • Upgrade to 1.3.2 to fix LevelDB inflation
  • Configuration change
MON Cluster Hardening
[mon]
mon_lease = 20 (default: 5)
mon_lease_renew_interval = 12 (default: 3)
mon_lease_ack_timeout = 40 (default: 10)
mon_accept_timeout = 40 (default: 10)
[client]
mon_client_hunt_interval = 40 (default: 3)
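These tunables live in ceph.conf and take effect after the monitors (and, for the [client] setting, the RBD clients) are restarted one at a time. A sketch of verifying the active values on a monitor host, assuming the default admin socket path:

# Confirm the lease/timeout values on the local monitor
ceph daemon mon.$(hostname -s) config show | grep -E 'mon_lease|mon_accept_timeout|mon_client_hunt'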
• Problem
  • High skew of %used across disks, preventing data intake even when cluster capacity allows it
• Impact:
  • Unbalanced PG distribution impacts performance
  • Rebalancing is impactful as well
• Solution:
  • Upgrade to Hammer 1.3.2+patch
  • Re-weight by utilization: >10% delta
Data Distribution and Balance
[Chart: us-internal-1 disk % used per OSD]
Cluster: 67.9% full
OSDs:
• Min: 47.2%
• Max: 83.5%
• Mean: 69.6%
• Stddev: 6.5
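A sketch of the re-weight step itself; the threshold of 110 corresponds to the “>10% above the cluster mean” delta mentioned above, and the exact syntax should be checked against the running release (Hammer here):

# Show per-OSD utilization (available from Hammer onward)
ceph osd df
# Reduce the reweight value of OSDs more than 10% above the mean utilization
ceph osd reweight-by-utilization 110

Re-weighting triggers data movement, so it is typically run with the backfill and recovery throttles from the earlier slides already in place.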
• Problem
  • RBD image data is distributed across all disks, so a single slow disk can impact critical data IO
• Solution: proactively detect slow disks
Proactive Detection of Slow Disks
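A minimal sketch of one cluster-side way to flag slow-but-alive OSDs, using the per-OSD commit/apply latency counters; the 100 ms threshold is an assumption, not a value from the deck:

# List OSDs whose filestore commit or apply latency exceeds 100 ms
ceph osd perf | awk 'NR > 1 && ($2 > 100 || $3 > 100) {print "slow OSD:", $1, $2"ms", $3"ms"}'

Pairing this with host-side SMART data (smartctl) helps separate genuinely sick disks from OSDs that are merely busy.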
• Set Clear Stability Goals
• You can plan for everything except how tenants will use it
• Monitor Everything, but not “everything”
• Turn down logging… It is possible to send 900k logs in 30 minutes
• Look for issues in services that consume storage
• Had 50TB of “deleted volumes” that weren’t
Lessons Learned
• DevOps
• It’s not just technology, it’s how your team operates as a team
• Share knowledge
• Manage your backlog and manage your management
• Consistent performance and stability modeling
• Rigorous testing
• Determine requirements and architect to them
• Balance performance, cost and time
• Automate builds and rebuilds
• Shortcuts create Technical Debt
Last Lesson…
Yuming Ma: yumima@cisco.com
Thank You
Seth Mason: setmason@cisco.com