Yuming Ma, Architect
StaaS, Cisco Cloud Foundation
Seattle WA, 10/18/2016
Stabilizing Petabyte Ceph
Cluster in OpenStack Cloud
Highlights
1. What are we doing with Ceph?
2. What did we start with?
3. We need a bigger boat
4. Getting better and sleeping through the night
5. Lessons learned
Cisco Cloud Services provides an OpenStack platform to Cisco SaaS
applications and tenants through a worldwide deployment of
datacenters.
Background
SaaS Cases
• Collaboration
• IoT
• Security
• Analytics
• “Unknown Projects”
Swift
• Database (Trove) Backups
• Static Content
• Cold/Offline data for Hadoop
Cinder
• Generic/Magnetic Volumes
• Low Performance
• Boot Volumes for all VM flavors except those with Ephemeral (local) storage
• Glance Image store
• Generic Cinder Volume
• RGW for Swift Object store
• In production since March 2014
• 13 clusters in production in two years
• Each cluster is 1800TB raw over 45 nodes and 450 OSDs.
How Do We Use Ceph?
[Diagram: Ceph on Cisco UCS serving block storage (generic and provisioned-IOPS volumes, plus a high-perf platform) through the Cinder API and object storage through the Swift API]
• Get to MVP and keep costs down.
• High capacity, hence C240 M3 LFF for 4TB HDDs
• Tradeoff was that the C240 M3 LFF could not also accommodate SSDs
• So the journal was collocated on the OSD data HDD (see the sketch below)
• Monitors were on HDD based systems as well
Initial Design Considerations
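A collocated journal is what ceph-disk produced by default in that era when no separate journal device was given; a minimal sketch, with a hypothetical device name:

  ceph-disk prepare --cluster ceph /dev/sdb   # creates a data partition and a journal partition on the same HDD
  ceph-disk activate /dev/sdb1                # mounts the data partition and starts the OSD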
CCS Ceph 1.0
[Architecture diagram: three racks of 15 x C240 OSD nodes (45 total); on each node OSD1–OSD10 run on XFS data partitions behind an LSI 9271 HBA, with a journal partition collocated on each HDD and the OS on a RAID1 mirror of disks 11/12; monitors spread across the racks; 2x10Gb public and 2x10Gb private (cluster) networks; OpenStack Keystone/Swift/Cinder/Glance/Nova APIs reach Ceph through the RADOS Gateway and the Ceph block device (RBD) via libvirt/KVM, all over the libRADOS API]
OSD: 45 x UCS C240 M3
• 2xE5 2690 V2, 40 HT/core
• 64GB RAM
• 2x10Gbs for public
• 2x10Gbs for cluster
• 3X replication
• LSI 9271 HBA
• 10 x 4TB HDD, 7200 RPM
• 10GB journal partition from HDD
• RHEL 3.10.0-229.1.2.el7.x86_64
NOVA: UCS C220
• Ceph 0.94.1
• RHEL 3.10.0-229.4.2.el7.x86_64
MON/RGW: UCS C220 M3
• 2xE5 2680 V2, 40 HT/core
• 64GB RAM
• 2x10Gbs for public
• 4x3TB HDD, 7200 RPM
• RHEL 3.10.0-229.4.2.el7.x86_64
Started with Cuttlefish/Dumpling
• Nice consistent growth…
• Your users will not warn you before:
  • “going live”
  • Migrating out of S3
  • Backing up a Hadoop HDFS
• Stability problems emerge after 50% used
Growth: It will happen, just not sure when
Major Stability Problems: Monitors
Problem: MON election storm impacting client IO
Impact: monmap changes due to a flaky NIC or chatty messaging between MONs and clients caused an unstable quorum and an election storm between MON hosts. Result: blocked and slowed client IO requests.
Problem: LevelDB inflation
Impact: the LevelDB store grows to XXGB over time, preventing the MON daemon from serving OSD requests (a quick size check is sketched below). Result: blocked IO and slow requests.
Problem: DDoS due to chatty client message attack
Impact: slow responses from the MONs (due to LevelDB or an election storm) trigger a message flood from clients. Result: failed client operations, e.g. volume creation, RBD connection.
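Not shown in the deck, but a simple way to watch for LevelDB inflation is to track the size of the monitor store on each MON host; a minimal check, assuming the default data path:

  du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db   # LevelDB store for this monitor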
Major Stability Problems: Cluster
Problem: Backfill & recovery impacting client IO
Impact: osdmap changes due to loss of a disk result in PG peering and backfilling. Result: clients receive blocked and slow IO.
Problem: Unbalanced data distribution
Impact: data on OSDs isn't evenly distributed; the cluster may be 50% full, but some OSDs are at 90%. Result: backfill isn't always able to complete.
Problem: Slow disk impacting client IO
Impact: a single slow (sick, not dead) OSD can severely impact many clients until it's ejected from the cluster (see the commands below). Result: clients have slow or blocked IO.
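Blocked requests and slow OSDs usually surface in the standard health and per-OSD latency views; two hedged starting points:

  ceph health detail    # lists slow/blocked requests and the OSDs involved
  ceph osd perf         # per-OSD commit/apply latency; a persistent outlier is a candidate "sick" disk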
Stability Improvement Strategy
Strategy / Improvement:
• Client IO throttling*: rate limit IOPS at the Nova host to 250 IOPS per volume
• Backfill and recovery throttling: reduced IO consumption by backfill and recovery processes to yield to client IO
• Retrofit with NVMe (PCIe) journals: increased overall IOPS of the cluster
• Upgrade to 1.2.3/1.3.2: overall stability and hardened MONs preventing election storms
• LevelDB on SSD (replaced entire MON node): faster cluster map queries
• Re-weight by utilization: balance data distribution
*Client here means the RBD client, not the tenant
• Limit max/cap IO consumption at the qemu layer (a hedged Cinder QoS sketch follows below):
  • iops (IOPS, read and write): 250
  • bps (bytes per second, read and write): 100 MB/s
• Predictable and controlled IOPS capacity
• NO min/guaranteed IOPS -> future Ceph feature
• NO burst map -> qemu feature:
  • iops_max 500
  • bps_max 120 MB/s
Client IO throttling
[Charts: per-volume IOPS swing ~100% vs. swing ~12%]
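The deck doesn't show how the 250 IOPS / 100 MB/s caps were wired in; one common way to apply front-end limits like these is a Cinder QoS spec, which Nova hands down to the qemu iotune layer. A minimal sketch with placeholder IDs:

  cinder qos-create rbd-throttle consumer=front-end total_iops_sec=250 total_bytes_sec=104857600   # 250 IOPS, 100 MB/s per volume
  cinder qos-associate <qos-spec-id> <volume-type-id>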
• Problem
• Blocked IO during peering
• Slow requests during backfill
• Both could cause client IO stall and vCPU soft lockup
• Solution
• Throttling backfill and recovery
osd recovery max active = 3 (default: 15)
osd recovery op priority = 3 (default: 10)
osd max backfills = 1 (default: 10)
Backfill and Recovery Throttling
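These settings live in ceph.conf under [osd]; on a running Hammer-era cluster they can also be injected without restarting the OSDs. A minimal sketch:

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3 --osd-recovery-op-priority 3'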
• Goal: 2X IOPS capacity gain
• Tuning: filestore_wbthrottle_xfs_bytes_start_flusher = 4194304 (default: 10485760)
Retrofit Ceph Journal from HDD to NVMe
[Diagrams: before, each HDD behind the LSI 9271 HBA holds an XFS data partition plus a journal partition for OSD1–OSD10, with the OS on a RAID1 mirror of disks 11/12; after, each HDD becomes a single-disk RAID0 data device and the ten journal partitions move to the NVMe card, partitions starting at 4MB (s1024), 10GB each with a 4MB offset in between and ~300GB left free, OS still on the RAID1 mirror]
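The deck doesn't include the partitioning commands; a hedged sketch of the layout described above (4 MiB start, ten 10 GiB journals with 4 MiB gaps), assuming the card appears as /dev/nvme0n1:

  parted -s /dev/nvme0n1 mklabel gpt
  start=4
  for i in $(seq 1 10); do
    end=$((start + 10240))
    parted -s /dev/nvme0n1 mkpart journal-$i ${start}MiB ${end}MiB
    start=$((end + 4))   # leave a 4 MiB gap before the next journal
  done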
NVMe Stability Improvement Analysis
MTTR:
• One disk (70% of 3TB) failure: Colo 11 hrs, 7 mins, 6 secs; NVMe 1 hr, 35 mins, 20 secs
• One host (70% of 30TB) failure: Colo 19 hrs, 2 mins, 2 secs; NVMe 16 hrs, 46 mins, 3 secs
Impact to client IOPS:
• Disk failure (70% of 3TB): Colo 232.991 vs 210.08 (drop: 9.83%); NVMe 231.66 vs 194.13 (drop: 16.20%)
• Host failure (70% of 30TB): NVMe 231.66 vs 211.36 (drop: 8.76%)
Backfill and recovery config:
osd recovery max active = 3 (default: 15)
osd max backfills = 1 (default: 10)
osd recovery op priority = 3 (default: 10)
Server impact:
• Shorter recovery time
Client impact:
• <10% impact (tested without IO throttling; impact should be less with IO throttling)
LevelDB:
• Key-value store for cluster metadata, e.g. osdmap, pgmap, monmap, clientID, authID, etc.
• Not in the data path
• Still impactful to IO operations: IO can be blocked behind a DB query
• Larger size means longer query time, hence longer IO wait -> slow requests
• Solution:
  • LevelDB on SSD to increase disk IO rate
  • Upgrade to Hammer to reduce DB size
MON Level DB Issues
Retrofit MON Level DB from HDD to SSD
New BOM:
• UCS C220 M4 with 120GB SSD
[Charts: write wait time with LevelDB on HDD vs. write wait time with LevelDB on SSD]
• Problem
• Election Storm & LevelDB inflation
• Solutions
• Upgrade to 1.2.3 to fix election storm
• Upgrade to 1.3.2 to fix levelDB inflation
• Configuration change
Hardening MON Cluster with Hammer and Tuning
[mon]
mon_lease = 20 (default = 5)
mon_lease_renew_interval = 12 (default 3)
mon_lease_ack_timeout = 40 (default 10)
mon_accept_timeout = 40 (default 10)
[client]
mon_client_hunt_interval = 40 (default 3)
• Problem
  • High skew in disk %used, preventing data intake even when overall cluster capacity allows it
• Impact:
  • Unbalanced PG distribution impacts performance
  • Rebalancing is impactful as well
• Solution:
  • Upgrade to Hammer 1.3.2+patch
  • Re-weight by utilization: >10% delta (see the sketches below)
Data Distribution and Balance
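To quantify the skew, Hammer and later can report per-OSD utilization along with the min/max/stddev summary shown on this slide; a hedged check:

  ceph osd df    # per-OSD %USE plus a MIN/MAX VAR and STDDEV summary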
[Chart: us-internal-1 disk % used, per OSD]
Cluster: 67.9% full
OSDs:
• Min: 47.2%
• Max: 83.5%
• Mean: 69.6%
• Stddev: 6.5
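The re-weighting itself is a built-in command in Hammer; a minimal sketch, where 110 corresponds to the ">10% delta" threshold above:

  ceph osd reweight-by-utilization 110   # lowers the reweight of OSDs more than 10% above average utilization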
• Migrate OS from Ubuntu to RHEL
• Retrofit Journal from HDD to SSD
• Retrofit MON levelDB from HDD to SSD
• Expand cluster from 3 racks to 4/5 racks
• Continuously upgrade Ceph version
• Challenge is at the client side: need to restart nova instances to reload librbd and librados
Zero Down Time Ops
Storage Cluster Monitoring and Analytics
• Three types of data: events, metrics, logs
• Data collected from each node
• Data pushed to monitoring portals
• In-flight analytics for run-time RCA
• Predictive analytics for proactive alerts, e.g. ProphetStor disk failure prediction
• Plugin to synthesize data into cluster-level metrics and status
• Problem
  • RBD image data is distributed across all disks, so a single disk failure can impact critical data IO
• Solution:
  • Proactively detect future disk failure
• DiskProphet Solution
  • Disk near-failure likelihood prediction
  • Disk life-expectancy prediction
  • Actions to optimize Ceph
Proactive Detection of Disk Failure
[Chart: IOPS over time under normal workload, after one OSD fails with Ceph rebalancing, and with one OSD failure predicted and no-impact recovery by DiskProphet]
[Diagram: DiskProphet pipeline: structured data (DB, CSV, agent) and unstructured data (TXT) feed through ETL into an AI core module (fuzzy logic, machine learning, predictive analytics, deep learning); a disk failure prediction module built on the AI core produces disk near-failure likelihood alerts, disk life-expectancy predictions, and prescriptions for proactive actions, exposed via REST API and dashboard]
• Set Clear Stability Goals: zero downtime operation
• You can plan for everything except how tenants will use it
• Look for issues in services that consume storage
• Had 50TB of “deleted volumes” that weren’t supposed to be left alone
• DevOps
• It’s not just technology, it’s how your team operates as a team
• Consistent performance and stability modeling
• Automate rigorous testing
• Automate builds and rebuilds
• Balance performance, cost and time
• Shortcuts create Technical Debt
Lessons Learned
Yuming Ma: yumima@cisco.com
Thank You
[Diagram: test topology: three racks of OSD nodes (6 + 5 + 6 nodes, each running osd1–osd10) serving ten Nova hosts (nova1–nova10), each with ~20 VMs]
NVMe Journaling: Performance Testing Setup
[Diagram: same NVMe journal layout as above: partitions start at 4MB (s1024), 10GB each with a 4MB offset in between; ten single-disk RAID0 data devices behind the LSI 9271 HBA run OSD1–OSD10 on XFS, ~300GB free on the NVMe, OS on a RAID1 mirror]
OSD: C240 M3
• 2xE5 2690 V2, 40 HT/core
• 64GB RAM
• 2x10Gbs for public
• 2x10Gbs for cluster
• 3X replication
• Intel P3700 400GB NVMe
• LSI 9271 HBA
• 10x4TB, 7200 RPM
Nova C220
• 2xE5 2680 V2, 40 HT/core
• 380GB RAM
• 2x10Gbs for public
• 3.10.0-229.4.2.el7.x86_64
NVMe Journaling: Performance Tuning
OSD host iostat:
• Both NVMe and HDD disk %util are low most of the time, with spikes every ~45s
• Both NVMe and HDD have a very low queue size (iodepth), while the frontend VM pushes 16 queue depth to FIO (example job below)
• CPU %used is reasonable, converging at <30%, and iowait is low, which corresponds to the low disk activity
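The deck doesn't include the fio job itself; a hedged sketch of the kind of load described (16 queue depth pushed from a VM against an attached RBD-backed volume, device name hypothetical):

  fio --name=rbd-vol-test --filename=/dev/vdb --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=16 --runtime=300 --time_based --group_reporting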
NVMe Journaling: Performance Tuning
Tuning Directions: increase disk %util:
• Disk thread: 4, 16, 32
• Filestore max sync interval: (0.1, 0.2, 0.5, 1, 5, 10, 20)
• These two tunings showed no impact:
filestore_wbthrottle_xfs_ios_start_flusher: default 500 vs 10
filestore_wbthrottle_xfs_inodes_start_flusher: default 500 vs 10
• Final Config:
osd_journal_size = 10240 (default :
journal_max_write_entries = 1000 (default: 100)
journal_max_write_bytes = 1048576000 (default: 10485760)
journal_queue_max_bytes = 1048576000 (default: 10485760)
filestore_queue_max_bytes = 1048576000 (default: 10485760)
filestore_queue_committing_max_bytes = 1048576000 (default: 10485760)
filestore_wbthrottle_xfs_bytes_start_flusher = 4194304 (default: 10485760)
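Collected into ceph.conf form, the final journal/filestore tuning above might look like the sketch below (values copied from the list; the truncated journal-size default is omitted):

  [osd]
  osd_journal_size = 10240
  journal_max_write_entries = 1000
  journal_max_write_bytes = 1048576000
  journal_queue_max_bytes = 1048576000
  filestore_queue_max_bytes = 1048576000
  filestore_queue_committing_max_bytes = 1048576000
  filestore_wbthrottle_xfs_bytes_start_flusher = 4194304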
NVMe Performance Tuning
Linear tuning of filestore_wbthrottle_xfs_bytes_start_flusher, filestore_wbthrottle_xfs_inodes_start_flusher, and filestore_wbthrottle_xfs_ios_start_flusher