10 ways to break your Ceph cluster
10 ways to break your Ceph cluster - April 2018
Who am I?
• Wido den Hollander (1986)
• Owner and founder of 42on.com, a Ceph training and consultancy company
• Co-owner and CTO @ PCextreme B.V. (Dutch hosting company)
• Developed the Ceph (RBD) integration for libvirt storage drivers and Apache CloudStack
• Wrote PHP and Java bindings for librados
42on.com:
My company focused on Ceph, providing:
• Consultancy
• Training
Breaking your Ceph cluster
Over the past years I've seen many Ceph clusters go down.
Some of these clusters lost data :-(, all of it due to human error.
I'll talk you through 10 actual cases I've seen where people brought down their Ceph cluster and some even lost data.
These cases are in no particular order; I've just picked 10.
1: Wrong CRUSH failure domain
This Ceph cluster uses 3x replication and is spread out over 4 racks. rack was/is the intended failure domain.
On a weekend the power failed in one rack and the whole cluster stopped: Placement Groups became inactive.
I was called and logged in. After searching for a while I found that CRUSH was configured improperly.
Although the racks and hosts were properly mapped in the CRUSHMap, this was not the case for the ruleset.
Always test your cluster to verify that failures are handled as intended.
The downtime was over 2 hours, as it took some time to get power restored to the rack.
The fix in this case was to change the CRUSH ruleset and wait for recovery to finish.
1: Wrong CRUSH failure domain
root default
        rack rack1
                host server1
                host server2
        rack rack2
                host server3
                host server4

rule replicated_ruleset {
        ruleset 0
        type replicated
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
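The fix was a ruleset that chooses its leaves per rack instead of per host. A minimal sketch of the corrected rule (the exact rule on that cluster may have looked slightly different):

rule replicated_ruleset {
        ruleset 0
        type replicated
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}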
2: Decommissioning a host
The cluster in this case was running with 2x replication (size = 2, min_size = 1) and some hardware needed to be replaced.
The administrator decided that a node needed replacement and shut it down.
Ceph's recovery kicked in after a few minutes while I/O continued.
After a few hours a disk failed in one of the machines, causing multiple PGs to go to the incomplete state.
This disk held the only remaining copy of various Placement Groups, so by losing that disk the data was lost.
After this happened I was called and asked to assist. We started the old machine and using PG recovery we were able to get part of the data back.
The cluster was running CephFS and both metadata and data were affected. After a few days of debugging we were able to mount CephFS again in read-only state.
The result is that roughly 170TB of data on the CephFS cluster was (partially) affected.
The Ceph cluster had to be abandoned and rebuilt from scratch.
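A safer sequence for taking a node out of service, as a sketch; the pool name and OSD IDs below are placeholders:

# Run 3x replication so a single disk never holds the last copy:
ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2
# Mark the node's OSDs out and let Ceph drain them while all copies still exist:
ceph osd out osd.10 osd.11
# Only power the host off once recovery has finished and the cluster is HEALTH_OK again:
ceph -s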
3: Removing 'log' files in MON's data directory
If a cluster is in HEALTH_WARN state the data directory of the MONs starts to grow, as the Monitors keep a long(er) history of OSDMaps.
This caused the Monitors, all three of them, to run out of disk space and stop working.
The administrator did a quick search on sst files and thought they were binary logs, like the ones a MySQL database writes.
He removed the files and started the Monitors again, only to find out they wouldn't start anymore due to corruption in their LevelDB database.
The result is that this cluster was lost, as at that time (the beginning of Hammer) there was no way to rebuild the Monitor database.
Always make sure Monitors have enough disk space and never manually remove files from their data directory!
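Two guard rails would have helped here, sketched below; the option names exist, and the thresholds shown are, as far as I recall, the upstream defaults:

# ceph.conf: get warned well before a Monitor's data disk fills up
[mon]
mon data avail warn = 30   # HEALTH_WARN when less than 30% of the MON store disk is free
mon data avail crit = 5    # critical when less than 5% is free

# Compact the Monitor store through Ceph instead of deleting files by hand:
ceph tell mon.<id> compact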
4: Removing the wrong pool
The administrator of this Ceph cluster was confident that the rbd pool of the cluster was not being used by anything.
He forgot to confirm with ceph df whether there was any data in the pool, so he went ahead and removed it.
After he removed the pool he started to see issues on his iSCSI gateway. It turned out that there were active RBD images in that pool which were re-exported over iSCSI.
12TB of data was lost as there were no backups of these images.
Always set the nodelete flag on a pool and set the mon_allow_pool_delete setting to false! (The default in Luminous.)
Although these settings might not have helped in this case, these additional safeguards might prevent an admin from removing a pool by accident.
Double, no, triple check before removing a pool! And always ask somebody else to take a look before you remove it.
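Both safeguards are a single command each, sketched below with <pool> as a placeholder:

# Refuse deletion of this specific pool until the flag is explicitly cleared again:
ceph osd pool set <pool> nodelete true
# Have the Monitors reject pool deletion cluster-wide (the default since Luminous):
ceph tell mon.* injectargs '--mon-allow-pool-delete=false'
# And before ever removing a pool, check whether it really is empty:
ceph df
rbd ls -p <pool>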
5: Setting the noout flag for a long time
Due to performance problems with scrubbing turned on, the noscrub flag was set, causing this cluster to be in HEALTH_WARN all the time.
During maintenance the noout flag was set, and after completing the maintenance the flag was not removed.
Over the course of a few weeks disk 1, 2 and finally number 3 failed. Replication (size) was set to 3 for all pools, but min_size was set to 1.
I was called when Placement Groups became inactive, only to find out that 3 disks had failed and data was lost.
Eventually we were able to get back most of the data using some XFS filesystem recovery and by reverting some PG history, but there may be silent data corruption throughout the cluster.
Always aim for a cluster running HEALTH_OK and take a look at the cluster if it's in HEALTH_WARN for a longer period.
In addition, make sure that min_size is set to >1. It's a safety measure for your data.
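The habit that prevents this, as a sketch: set noout only for the duration of the maintenance window, and keep min_size above 1 on replicated pools (<pool> is a placeholder):

ceph osd set noout                      # right before the maintenance window
# ... perform the maintenance ...
ceph osd unset noout                    # immediately afterwards, never weeks later
# min_size 2 stops I/O instead of silently writing to a single remaining copy:
ceph osd pool set <pool> min_size 2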
6: Mounting XFS with nobarrier option
For performance reasons this SSD-only cluster was mounted with nobarrier.
/dev/sdh on /var/lib/ceph/osd/ceph-181 type xfs (rw,nobarrier)
Write barriers are there for a good reason:
A write barrier is a kernel mechanism used to ensure that file system metadata is correctly
written and ordered on persistent storage, even when storage devices with volatile write
caches lose power.
Although all servers were equipped with redundant power supplies, a ground failure caused a power outage on circuits A and B in the datacenter.
This power outage resulted in all OSD hosts going down at the same time, which led to many corrupted XFS filesystems and OSD data stores.
We were not able to recover this Ceph cluster. Roughly 100TB of data was lost.
Never mount your XFS filesystem with nobarrier!
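Checking for and fixing this is straightforward. A sketch, reusing the device and mount point from the example above:

# Find any filesystem that is currently mounted without barriers:
mount | grep nobarrier
# /etc/fstab: rely on the defaults (barriers enabled), never add nobarrier:
/dev/sdh  /var/lib/ceph/osd/ceph-181  xfs  defaults  0  0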
7: Enabling Writeback on HBA without BBU
This case is similar to the previous one. Instead of disabling write barriers in Linux, the cache mode of the HBA was set to Writeback without a Battery Backup Unit present.
A power outage caused some machines to go down, resulting in corrupted XFS filesystems and OSD data stores on those hosts.
Luckily this happened within one failure domain (rack) of the Ceph cluster and no data was lost.
However, never turn on Writeback caching in your HBA without a Battery Backup Unit present. It's just dangerous!
8: Creating too many Placement Groups
I assisted this customer with building their Ceph cluster for running behind OpenStack.
The size of the cluster resulted in the volumes pool having 8192 Placement Groups.
As time progressed they created multiple pools on the cluster without consulting me: in total 10 additional pools, all with 8192 Placement Groups (~70k extra PGs).
A few months later a power outage caused the whole cluster to restart.
The OSD hosts were lacking the CPU and memory to work their way through peering and recovery of so many Placement Groups, causing a flapping OSD situation.
I wasn't called until the day after it happened, which resulted in over 24 hours of flapping OSDs and thousands of new OSDMaps.
Eventually we recovered the cluster after babysitting it for 5 days and adding additional memory and CPUs to the cluster.
Be cautious when creating Placement Groups. It can hurt you when the cluster needs to re-peer all Placement Groups!
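A rough sizing rule of thumb, as a sketch (the OSD count and replica factor below are made-up numbers, not this customer's cluster): aim for roughly (number of OSDs x 100) / replica count Placement Groups in total across all pools, rounded to a power of two, not 8192 per pool.

# Example: 200 OSDs, 3x replication -> 200 * 100 / 3 ≈ 6667 -> 8192 PGs in TOTAL.
# Check what a pool already carries before adding another large one:
ceph osd pool get <pool> pg_num
# And keep an eye on how many PGs each OSD has to peer (PGS column):
ceph osd df

Luminous also refuses to create pools that would push the per-OSD PG count over a configured limit, and later releases add a pg_autoscaler that does this math for you.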
9: Using 2x replication
This one is not tied to one specific situation; I've just seen too many cases where data was either corrupted or lost on clusters running with 2x replication.
A single disk failure with 2x replication can already lead to loss or corruption of data.
Imagine a host taken down for maintenance: a portion of the data now relies on a single disk. If that disk fails, the data is lost.
I've seen these cases happen just too many times! Do not even consider using 2x replication if you value your data!
10: Underestimating Monitors
Monitors are often badly underestimated by a lot of people. The word 'monitor' might confuse them into thinking that these daemons only serve a monitoring purpose, like Zabbix or Nagios.
This results in running them on unreliable and cheap hardware, causing all kinds of problems.
I've seen people run them on SD cards in Dell servers and then wear through the SD card quickly due to the Monitor's writes to its LevelDB/RocksDB database.
Use reliable hardware for your Monitors! Yes, they are pretty lightweight daemons and usually don't consume many resources, but they are a vital part of your Ceph cluster.
I always recommend dedicated hardware for Monitors and datacenter-grade / write-intensive SSDs for their data stores.
A 200GB SSD is vastly more than the Monitor will ever use, but you never want your Monitor to run out of disk space and potentially face data corruption.
11: Updating Cephx keys with the wrong permissions
All good things go to eleven, right?
In this case an admin updated the cephx key for an OpenStack deployment and made a typo in the permissions.
By accident he revoked the w (write) permission for that user on the volumes pool.
This caused Ceph (librados) to start returning errors to librbd, which passed these errors on to the Virtual Machines.
A single typo caused over 2,000 instances to go down with their filesystems in read-only mode.
caps osd = "allow rx pool=volumes, allow rwx pool=volumes-ssd"
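The intended caps and a careful way to apply them, as a sketch; client.cinder is a hypothetical client name, not necessarily the one used in this deployment:

# Dump the current key and caps first, so you can roll back a typo:
ceph auth get client.cinder
# Apply the corrected caps (note rwx on both pools):
ceph auth caps client.cinder \
    mon 'allow r' \
    osd 'allow rwx pool=volumes, allow rwx pool=volumes-ssd'
# Verify the result before walking away:
ceph auth get client.cinder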
Thank you!
Thanks for listening!
Questions?
Find me:
• E-Mail: wido@42on.com
• Company: https://42on.com/
• Blog: https://widodh.nl/
• Github: https://github.com/wido
• Twitter: @widodh
• Presentations: https://github.com/wido/presentations