Mastering Ceph Operations:
Upmap and the Mgr Balancer
Dan van der Ster | CERN IT | Ceph Day Berlin | 12 November 2018
Already >250PB physics data
Ceph @ CERN Update
5
CERN Ceph Clusters                    Size    Version
OpenStack Cinder/Glance Production    5.5PB   luminous
Satellite data centre (1000km away)   1.6PB   luminous
Hyperconverged KVM+Ceph               16TB    luminous
CephFS (HPC+Manila) Production        0.8PB   luminous
Client Scale Testing                  0.4PB   luminous
Hyperconverged HPC+Ceph               0.4PB   luminous
CASTOR/XRootD Production              4.4PB   luminous
CERN Tape Archive                     0.8TB   luminous
S3+SWIFT Production (4+2 EC)          2.3PB   luminous
6
Stable growth in RBD, S3, CephFS
CephFS Scale Testing
7
• Two activities testing performance:
• HPC storage with CephFS, BoF at SuperComputing in
Dallas right now! (Pablo Llopis, CERN IT)
• Scale testing ceph-csi with k8s (10000 cephfs clients!)
RBD Tuning
8
Rackspace / CERN Openlab Collaboration
Performance assessment tools
• ceph-osd benchmarking suite
• rbd top for identifying active clients
Studied performance impacts:
• various hdd configurations
• Flash for block.db, wal, dm-*
• hyperconverged configurations
Target real use-cases at CERN
• database applications
• monitoring and data analysis
Upmap and the Mgr Balancer
9
Background: 2-Step Placement
1. RANDOM: Map an object to a PG uniformly at random.
2. CRUSH: Map each PG to a set of OSDs using CRUSH.
Do we really need PGs?
• Why don’t we map objects to OSDs directly with CRUSH?
• If we did that, all OSDs would be coupled (peered) with all others.
• Any failure of three OSDs would then lead to data loss.
• Consider a 1000-OSD cluster:
• There are ~1000^3 possible 3-OSD combinations, but only #PGs of them are relevant for data loss.
• …
Do we really need CRUSH?
• Why don’t we just distribute the PG mappings directly?
• CRUSH provides a language for describing data placement
rules according to your infrastructure.
• The “failure-domain” part of CRUSH is always perfect: e.g. it will
never put two copies on the same host (unless you tell it to…)
• The uniformity part of CRUSH is imperfect: uneven osd utilizations
are a fact of life. Perfection requires an impractical number of PGs.
Do we really need CRUSH?
• Why don’t we just distribute the PG mappings directly?
• We can! Now in luminous with upmap!
2.1-Step Placement: RANDOM → CRUSH → UPMAP
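An upmap entry is simply a per-PG exception stored in the OSDMap. As an illustration only (the PG and OSD ids below are hypothetical), moving one replica of PG 1.7 from osd.67 to osd.12, and later removing that exception, looks like:
ceph osd pg-upmap-items 1.7 67 12
ceph osd rm-pg-upmap-items 1.7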
Data Imbalance
• User Question:
• ceph df says my cluster is 50% full overall but the
cephfs_data pool is 73% full. What’s wrong?
• A pool is full once its first OSD is full.
• ceph df reports the used space on the most full OSD.
• This difference shows that the OSDs are imbalanced.
19
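To see such an imbalance directly, compare the pool-level and OSD-level views (standard commands; output omitted):
ceph df
ceph osd df tree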
Data Balancing
• Luminous added a data balancer.
• The general idea is to slowly move PGs from the most full to the least full
OSDs.
• There are two modes to accomplish this:
• crush-compat: internally tweaks the apparent OSD sizes, e.g. to make underfull disks look bigger
• Compatible with legacy clients.
• upmap: moves PGs precisely where we want them (without breaking failure-domain rules)
• Compatible with luminous+ clients.
20
Turning on the balancer (luminous)
• My advice: Do whatever you can to use the upmap balancer:
ceph osd set-require-min-compat-client luminous
ceph mgr module ls
ceph mgr module enable balancer
ceph config-key set mgr/balancer/begin_time 0830
ceph config-key set mgr/balancer/end_time 1800
ceph config-key set mgr/balancer/max_misplaced 0.005
ceph config-key set mgr/balancer/upmap_max_iterations 2
ceph balancer mode upmap
ceph balancer on
21
Luminous limitation:
If the balancer config doesn’t take,
restart the active ceph-mgr.
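Once enabled, you can check the module and preview its scoring with the standard balancer commands:
ceph balancer status
ceph balancer eval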
Success!
• It really works
• On our largest clusters, we have recovered hundreds of
terabytes of capacity using the upmap balancer.
22
But wait, there’s more…
23
Ceph Operations at Scale
• Cluster is nearly full (need to add capacity)
• Cluster has old tunables (need to set to optimal)
• Cluster has legacy osd reweights (need to reweight all to 1.0)
• Cluster data placement changes (need to change crush ruleset)
• Operations like those above often involve:
• large amounts of data movement
• backfilling that lasts several days or weeks
• unpredictable impact on users (and the cluster)
• and no easy rollback!
24
25
(Diagram: "WE ARE HERE", a leap of faith, "WE WANT TO BE HERE")
A brief interlude…
• “remapped” PGs are fully replicated (normally “clean”), but
CRUSH wants to move them to a new set of OSDs
• 4815 active+remapped+backfill_wait
• “norebalance” is a cluster-wide flag that tells Ceph *not* to make
progress on any remapped PGs
• ceph osd set norebalance
26
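Both the PG state and the flag can be inspected and cleared with standard commands:
ceph pg ls remapped
ceph osd unset norebalance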
Adding capacity
• Step 1: set the norebalance flag
• Step 2: ceph-volume lvm create…
• Step 3: 4518 active+remapped+backfill_wait
• Step 4: ???
• Step 5: HEALTH_OK
27
Adding capacity (diagrams)
What if we could “upmap” those remapped PGs back to where the data is now?
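Manually, the answer for a single remapped PG looks like this (the PG and OSD ids are hypothetical; the external tools mentioned at the end automate this pairing for every remapped PG):
ceph pg ls remapped
Suppose it shows PG 1.7 with UP [4,23,67] but ACTING [4,23,12]:
ceph osd pg-upmap-items 1.7 67 12
Now up equals acting again, the PG returns to active+clean with zero data movement, and the cluster reports HEALTH_OK.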
Adding capacity with upmap (diagrams)
Adding capacity with upmap
• But you may be wondering:
• We’ve added new disks to the cluster, but they have zero PGs, so
what’s the point?
• The good news is that the upmap balancer will automatically
notice the new OSDs are underfull, and will gradually move
PGs to them.
• In fact, the balancer simply *removes* the upmap entries we created
to keep PGs off of the new OSDs.
32
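The upmap exception table itself is visible in the OSDMap, so you can watch those entries disappear as the balancer works (the grep is just a convenience):
ceph osd dump | grep pg_upmap_items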
Changing tunables
• Step 1: set the norebalance flag
• Step 2: ceph osd crush tunables optimal
• Step 3: 4815162/20935486 objects misplaced
• Step 4: ???
• Step 5: HEALTH_OK
33
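Step 4 is the same upmap trick as before. A hedged sketch of the whole sequence (not CERN's exact tooling):
ceph osd set norebalance
ceph osd crush tunables optimal
ceph osd pg-upmap-items <pgid> <up-osd> <acting-osd>   (repeated for each remapped PG)
ceph osd unset norebalance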
Changing tunables (diagrams)
Such an intensive backfilling would not be transparent to the users…
And remember, the balancer will slowly move PGs around to where they need to be.
Other use-cases: Legacy OSD Weights
• Remove legacy OSD reweights:
• ceph osd reweight-by-utilization is a legacy feature for
balancing OSDs with a [0,1] reweight factor.
• When we set the reweights back to 1.0, many PGs will become
active+remapped
• We can upmap them back to where they are.
• Bonus: Since those reweights helped balance the OSDs, this
acts as a shortcut to find the right set of upmaps to balance a
cluster.
39
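A sketch of that reset for a single OSD (osd id 10 and the one-OSD-at-a-time approach are illustrative, not CERN's exact procedure):
ceph osd set norebalance
ceph osd reweight 10 1.0
ceph osd pg-upmap-items <pgid> <up-osd> <acting-osd>   (for each PG that became remapped)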
Other use-cases: Placement Changes
• We had a pool with the first 2 replicas in Room A and 3rd
replica in Room B.
• For capacity reasons, we needed to move all three
replicas into Room A.
40
Other use-cases: Placement Changes
• What did we do?
1. Create the new crush ruleset
2. Set the norebalance flag
3. Set the pool’s crush rule to be the new one…
• This puts *every* PG in “remapped” state.
4. Use upmap to map those PGs back to where they are (3rd replica in
the wrong room)
5. Gradually remove those upmap entries to slowly move all PGs fully
into Room A.
6. HEALTH_OK
41
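Steps 2 and 3 in command form (the pool name "volumes" and rule name "replicated_room_a" are hypothetical; the rule itself comes from step 1, e.g. via ceph osd crush rule create-replicated):
ceph osd set norebalance
ceph osd pool set volumes crush_rule replicated_room_a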
We moved several hundred TBs of data
without users noticing, and could pause
with HEALTH_OK at any time.
No leap of faith required
43
(Diagram: with UPMAP, "WE ARE HERE" reaches "WE WANT TO BE HERE" without the leap of faith)
What’s next for “upmap remapped”
• We find this capability to be super useful!
• Wrote external tools to manage the upmap entries.
• It can be tricky!
• After some iteration with upstream we could share…
• Possibly contribute as a core feature?
• Maybe you can think of other use-cases?
44
Thanks!