Mastering Ceph Operations:
Upmap and the Mgr Balancer
Dan van der Ster | CERN IT | Ceph Day Berlin | 12 November 2018
Already >250PB physics data
Ceph @ CERN Update
5
CERN Ceph Clusters                    Size    Version
OpenStack Cinder/Glance Production    5.5PB   luminous
Satellite data centre (1000km away)   1.6PB   luminous
Hyperconverged KVM+Ceph               16TB    luminous
CephFS (HPC+Manila) Production        0.8PB   luminous
Client Scale Testing                  0.4PB   luminous
Hyperconverged HPC+Ceph               0.4PB   luminous
CASTOR/XRootD Production              4.4PB   luminous
CERN Tape Archive                     0.8TB   luminous
S3+SWIFT Production (4+2 EC)          2.3PB   luminous
6
Stable growth in RBD, S3, CephFS
CephFS Scale Testing
7
• Two activities testing performance:
• HPC storage with CephFS, BoF at SuperComputing in
Dallas right now! (Pablo Llopis, CERN IT)
• Scale testing ceph-csi with k8s (10000 cephfs clients!)
RBD Tuning
8
Rackspace / CERN Openlab Collaboration
Performance assessment tools
• ceph-osd benchmarking suite
• rbd top for identifying active clients
Studied performance impacts:
• various hdd configurations
• Flash for block.db, wal, dm-*
• hyperconverged configurations
Target real use-cases at CERN
• database applications
• monitoring and data analysis
Upmap and the Mgr Balancer
9
Background: 2-Step Placement
1. RANDOM: Map an object to a PG uniformly at random.
2. CRUSH: Map each PG to a set of OSDs using CRUSH.
Do we really need PGs?
• Why don’t we map objects to OSDs directly with CRUSH?
• If we did that, all OSDs would be coupled (peered) with all others.
• Any failure of three OSDs would then lead to data loss.
• Consider a 1000-OSD cluster:
• There are ~1000^3 possible 3-OSD combinations, but only #PGs of them are relevant for data loss.
• …
Do we really need CRUSH?
• Why don’t we just distribute the PG mappings directly?
• CRUSH provides a language for describing data placement
rules according to your infrastructure.
• The “failure-domain” part of CRUSH is always perfect: e.g. it will
never put two copies on the same host (unless you tell it to…)
• The uniformity part of CRUSH is imperfect: uneven osd utilizations
are a fact of life. Perfection requires an impractical number of PGs.
Do we really need CRUSH?
• Why don’t we just distribute the PG mappings directly?
• We can! Now in luminous with upmap!
2.1-Step Placement: RANDOM → CRUSH → UPMAP
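An upmap entry is simply a per-PG exception stored in the OSDMap. As an illustration only (the PG and OSD ids below are hypothetical), moving one replica of PG 1.7 from osd.67 to osd.12, and later removing that exception, looks like:
ceph osd pg-upmap-items 1.7 67 12
ceph osd rm-pg-upmap-items 1.7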
Data Imbalance
• User Question:
• ceph df says my cluster is 50% full overall but the
cephfs_data pool is 73% full. What’s wrong?
• A pool is full once its first OSD is full.
• ceph df reports the used space on the most full OSD.
• This difference shows that the OSDs are imbalanced.
19
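To see such an imbalance directly, compare the pool-level and OSD-level views (standard commands; output omitted):
ceph df
ceph osd df tree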
Data Balancing
• Luminous added a data balancer.
• The general idea is to slowly move PGs from the most full to the least full
OSDs.
• There are two modes to accomplish this:
• crush-compat: internally tweaks the apparent OSD sizes, e.g. to make underfull disks look bigger
• Compatible with legacy clients.
• upmap: moves PGs precisely where we want them (without breaking failure-domain rules)
• Compatible with luminous+ clients.
20
Turning on the balancer (luminous)
• My advice: Do whatever you can to use the upmap balancer:
ceph osd set-require-min-compat-client luminous
ceph mgr module ls
ceph mgr module enable balancer
ceph config-key set mgr/balancer/begin_time 0830
ceph config-key set mgr/balancer/end_time 1800
ceph config-key set mgr/balancer/max_misplaced 0.005
ceph config-key set mgr/balancer/upmap_max_iterations 2
ceph balancer mode upmap
ceph balancer on
21
Luminous limitation:
If the balancer config doesn’t take,
restart the active ceph-mgr.
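Once enabled, you can check the module and preview its scoring with the standard balancer commands:
ceph balancer status
ceph balancer eval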
Success!
• It really works
• On our largest clusters, we have recovered hundreds of
terabytes of capacity using the upmap balancer.
22
But wait, there’s more…
23
Ceph Operations at Scale
• Cluster is nearly full (need to add capacity)
• Cluster has old tunables (need to set to optimal)
• Cluster has legacy osd reweights (need to reweight all to 1.0)
• Cluster data placement changes (need to change crush ruleset)
• Operations like those above often involve:
• large amounts of data movement
• backfilling that lasts several days or weeks
• unpredictable impact on users (and the cluster)
• and no easy rollback!
24
25
(Diagram: "WE ARE HERE", a leap of faith, "WE WANT TO BE HERE")
A brief interlude…
• “remapped” PGs are fully replicated (normally “clean”), but
CRUSH wants to move them to a new set of OSDs
• 4815 active+remapped+backfill_wait
• “norebalance” is a cluster-wide flag that tells Ceph *not* to make
progress on any remapped PGs
• ceph osd set norebalance
26
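Both the PG state and the flag can be inspected and cleared with standard commands:
ceph pg ls remapped
ceph osd unset norebalance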
Adding capacity
• Step 1: set the norebalance flag
• Step 2: ceph-volume lvm create…
• Step 3: 4518 active+remapped+backfill_wait
• Step 4: ???
• Step 5: HEALTH_OK
27
Adding capacity (diagrams)
What if we could “upmap” those remapped PGs back to where the data is now?
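Manually, the answer for a single remapped PG looks like this (the PG and OSD ids are hypothetical; the external tools mentioned at the end automate this pairing for every remapped PG):
ceph pg ls remapped
Suppose it shows PG 1.7 with UP [4,23,67] but ACTING [4,23,12]:
ceph osd pg-upmap-items 1.7 67 12
Now up equals acting again, the PG returns to active+clean with zero data movement, and the cluster reports HEALTH_OK.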
Adding capacity with upmap (diagrams)
Adding capacity with upmap
• But you may be wondering:
• We’ve added new disks to the cluster, but they have zero PGs, so
what’s the point?
• The good news is that the upmap balancer will automatically
notice the new OSDs are underfull, and will gradually move
PGs to them.
• In fact, the balancer simply *removes* the upmap entries we created
to keep PGs off of the new OSDs.
32
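The upmap exception table itself is visible in the OSDMap, so you can watch those entries disappear as the balancer works (the grep is just a convenience):
ceph osd dump | grep pg_upmap_items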
Changing tunables
• Step 1: set the norebalance flag
• Step 2: ceph osd crush tunables optimal
• Step 3: 4815162/20935486 objects misplaced
• Step 4: ???
• Step 5: HEALTH_OK
33
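Step 4 is the same upmap trick as before. A hedged sketch of the whole sequence (not CERN's exact tooling):
ceph osd set norebalance
ceph osd crush tunables optimal
ceph osd pg-upmap-items <pgid> <up-osd> <acting-osd>   (repeated for each remapped PG)
ceph osd unset norebalance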
Changing tunables (diagrams)
Such an intensive backfilling would not be transparent to the users…
And remember, the balancer will slowly move PGs around to where they need to be.
Other use-cases: Legacy OSD Weights
• Remove legacy OSD reweights:
• ceph osd reweight-by-utilization is a legacy feature for
balancing OSDs with a [0,1] reweight factor.
• When we set the reweights back to 1.0, many PGs will become
active+remapped
• We can upmap them back to where they are.
• Bonus: Since those reweights helped balance the OSDs, this
acts as a shortcut to find the right set of upmaps to balance a
cluster.
39
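A sketch of that reset for a single OSD (osd id 10 and the one-OSD-at-a-time approach are illustrative, not CERN's exact procedure):
ceph osd set norebalance
ceph osd reweight 10 1.0
ceph osd pg-upmap-items <pgid> <up-osd> <acting-osd>   (for each PG that became remapped)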
Other use-cases: Placement Changes
• We had a pool with the first 2 replicas in Room A and 3rd
replica in Room B.
• For capacity reasons, we needed to move all three
replicas into Room A.
40
Other use-cases: Placement Changes
• What did we do?
1. Create the new crush ruleset
2. Set the norebalance flag
3. Set the pool’s crush rule to be the new one…
• This puts *every* PG in “remapped” state.
4. Use upmap to map those PGs back to where they are (3rd replica in
the wrong room)
5. Gradually remove those upmap entries to slowly move all PGs fully
into Room A.
6. HEALTH_OK
41
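Steps 2 and 3 in command form (the pool name "volumes" and rule name "replicated_room_a" are hypothetical; the rule itself comes from step 1, e.g. via ceph osd crush rule create-replicated):
ceph osd set norebalance
ceph osd pool set volumes crush_rule replicated_room_a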
We moved several hundred TBs of data
without users noticing, and could pause
with HEALTH_OK at any time.
No leap of faith required
43
(Diagram: with UPMAP, "WE ARE HERE" reaches "WE WANT TO BE HERE" without the leap of faith)
What’s next for “upmap remapped”
• We find this capability to be super useful!
• Wrote external tools to manage the upmap entries.
• It can be tricky!
• After some iteration with upstream we could share…
• Possibly contribute as a core feature?
• Maybe you can think of other use-cases?
44
Thanks!