42on Ceph Month 2021
A quick update on '10 ways to break your Ceph cluster', originally by Wido den Hollander.
https://youtu.be/-FOYXz3Bz3Q
https://www.slideshare.net/ShapeBlue/widoden-hollander-10-ways-to-break-your-ceph-cluster
Five more ways to break your Ceph cluster
Break your Ceph cluster in these five ways
o Completing an update too soon
o Not completing an update
o Under- or over-estimating your automation tool
o Running with min_size=1
o Running multiple rgws with the same id
o Blindly trusting the PG autoscaler
Under- or over-estimating
your automation tool
o Due to a missing variable in a script that was part of the automation tooling, the number of monitors was set from 3 to 0. The very thorough tool nicely cleaned up the mons, including the directory structures of all three monitors.
o A similar case was found while using cephadm. While we did not find the root cause, it was clearly NOT cephadm's mistake. All monitors were scrapped due to the “mon means monitoring” misunderstanding. New, clean monitors were deployed, but that does not work: without the original monitor store they know nothing about the cluster. Two ways out:
1. Recreated the monitor db by scraping the OSDs (a sketch of that procedure follows below).
2. Found the original mon directories hiding on the filesystem.
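A rough sketch of the "recover the monitor store from the OSDs" route, following the upstream documented procedure; the store path, OSD paths and keyring location are illustrative, not the exact ones from this case:

  ms=/root/mon-store
  mkdir $ms
  # on every OSD host, with the OSDs stopped, pull the cluster maps out of the OSDs
  for osd in /var/lib/ceph/osd/ceph-*; do
      ceph-objectstore-tool --data-path $osd --op update-mon-db --mon-store-path $ms
  done
  # rebuild a monitor store from the collected maps, supplying a keyring with the mon. and admin keys
  ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring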
Impact example case:
- availability: high.
- durability: low.
Running with min_size=1
o We still recommend that you run with at least
'size=3' in all cases if you value your data.
o We revised our earlier views a little bit though. Never, ever, in any case, run production with min_size=1.
o In a good case you'll see recovery_unfound; in a bad case you will see 'unknown' PGs.
o A better statement: make sure that you can only write data if a redundant copy of the object can also be written (see the pool settings sketch below).
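In practice that means keeping min_size at 2 for a size=3 replicated pool. A minimal sketch, assuming a replicated pool named 'mypool' (the pool name is only an example):

  ceph osd pool set mypool size 3
  ceph osd pool set mypool min_size 2
  ceph osd pool get mypool min_size   # verify

With min_size=2 a write only succeeds when at least two replicas can be written, so a single surviving copy is never the only one taking new data.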
Impact example case:
- availability: high.
- durability: high.
Not completing an update
At least 4 cases:
o Customer upgraded to Nautilus and enabled msgr v2. They didn't update the required osd version (the finishing steps are sketched below).
o This is a common mistake with the Nautilus upgrade.
o Sometimes the cluster survives into Octopus.
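A sketch of the steps that finish a Nautilus upgrade (standard upstream upgrade steps; double-check the release notes for the exact versions involved):

  ceph versions                            # confirm every daemon actually runs the new release
  ceph osd require-osd-release nautilus    # raise the minimum required OSD release
  ceph mon enable-msgr2                    # only once all mons and daemons are on Nautilus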
Impact example case:
- availability: high.
- durability: low.
Not completing an update
  cluster:
    health: HEALTH_WARN
            Reduced data availability: 512 pgs inactive, 143 pgs peering, 29 pgs stale
            3 slow requests are blocked > 32 sec
            3 slow ops, oldest one blocked for 864 sec, daemons [osd.24,osd.34] have slow ops.
            1/6 mons down, quorum mon-02,mon-03,mon-05,mon-06,mon-08

  data:
    pools:   4 pools, 1664 pgs
    objects: 2.68M objects, 10 TiB
    usage:   20 TiB used, 150 TiB / 170 TiB avail
    pgs:     22.055% pgs unknown
             8.714% pgs not active
             1152 active+clean
             367  unknown
             58   peering
             58   remapped+peering
             27   stale+peering
             2    stale

  io:
    client:  24 KiB/s rd, 24 MiB/s wr, 0 op/s rd, 797 op/s wr
Completing an update
too soon
o Example 14.2.19 -> 14.2.20.
o Setting auth_allow_insecure_global_id_reclaim to false before upgrading all clients and daemons.
o This makes the not-yet-upgraded clients unable to connect (the intended order is sketched below).
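A sketch of the intended order (this is the CVE-2021-20288 mitigation shipped in 14.2.20; the health check name is shown for illustration):

  # 1. upgrade all daemons and clients to a patched release first
  ceph versions
  ceph health detail   # AUTH_INSECURE_GLOBAL_ID_RECLAIM warnings list clients still on the old behaviour
  # 2. only then disallow the insecure reclaim
  ceph config set mon auth_allow_insecure_global_id_reclaim false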
Impact example case:
- availability: medium.
- durability: low.
Running multiple rgws
with the same id behind
a load balancer
o 9 Ceph Object Gateways (rgws) installed, reusing 3 names.
o Ceph only sees 3 rgws, based on those 3 names.
o The rgws keep switching over which 3 are active
in the service map.
o The load balancer in front of them kept trying to do new uploads.
o Result: bad performance and millions of failed multipart uploads. The cluster filled up faster than expected; new hardware was ordered and installed.
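A minimal sketch of what unique ids look like in a plain ceph.conf style deployment (section names and ports are examples): every gateway process gets its own client name, even when they all sit behind one load balancer.

  [client.rgw.gw-host1-a]
      rgw_frontends = beast port=8080
  [client.rgw.gw-host1-b]
      rgw_frontends = beast port=8081

With 9 distinct names the service map ('ceph -s', 'ceph service dump') reports 9 active rgw daemons instead of 3 fighting over the same identity.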
Impact example case:
- availability: medium.
- durability: medium.
Bonus:
Blindly trusting
the PG autoscaler
o Installed a reasonably large cluster for rgw with only hdd, no ssd for bluefs_db_dev <- mistake nr. 1
o Only tested with a very small dataset; pools were created with the default number of pgs (32).
o They then started to ingest a large amount of data. The autoscaler kept splitting pgs and never caught up; the cluster stayed at ~5% misplaced for a very long time.
o Performance was poor and customers were unhappy.
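A sketch of the alternative: size the data pool up front instead of letting the autoscaler chase the ingest. The pool name 'default.rgw.buckets.data' is the usual rgw default and is assumed here, not taken from the original case.

  ceph osd pool autoscale-status                                       # review what the autoscaler intends to do
  ceph osd pool set default.rgw.buckets.data pg_autoscale_mode warn    # or 'off' to take manual control
  ceph osd pool set default.rgw.buckets.data pg_num 1024               # pre-split before the bulk ingest starts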
Impact example case:
- availability: medium.
- durability: low.
So, which of the original 10 no longer break your
Ceph cluster?
Thank you!
