Mastering Ceph Operations:
Upmap and the Mgr Balancer
Dan van der Ster | CERN IT | Ceph Day Berlin | 12 November 2018
Already >250PB physics data
Ceph @ CERN Update
CERN Ceph Clusters                      Size    Version
OpenStack Cinder/Glance Production      5.5PB   luminous
Satellite data centre (1000km away)     1.6PB   luminous
Hyperconverged KVM+Ceph                 16TB    luminous
CephFS (HPC+Manila) Production          0.8PB   luminous
Client Scale Testing                    0.4PB   luminous
Hyperconverged HPC+Ceph                 0.4PB   luminous
CASTOR/XRootD Production                4.4PB   luminous
CERN Tape Archive                       0.8TB   luminous
S3+SWIFT Production (4+2 EC)            2.3PB   luminous
Stable growth in RBD, S3, CephFS
CephFS Scale Testing
• Two activities testing performance:
  • HPC storage with CephFS, BoF at Supercomputing in Dallas right now! (Pablo Llopis, CERN IT)
  • Scale testing ceph-csi with k8s (10000 CephFS clients!)
RBD Tuning
Rackspace / CERN Openlab Collaboration
Performance assessment tools:
• ceph-osd benchmarking suite
• rbd top for identifying active clients
Studied performance impacts:
• various HDD configurations
• flash for block.db, WAL, dm-*
• hyperconverged configurations
Target real use-cases at CERN:
• database applications
• monitoring and data analysis
Upmap and the Mgr Balancer
Background: 2-Step Placement
1. RANDOM: map an object to a PG uniformly at random.
2. CRUSH: map each PG to a set of OSDs using CRUSH.
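Both steps can be checked for any given object with a stock command; the pool and object names below are only placeholders.

  # Print the object's PG (the hash step) and that PG's up/acting OSD
  # sets (the CRUSH step). Pool and object names are examples.
  ceph osd map cephfs_data some-object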
Do we really need PGs?
• Why don’t we map objects to OSDs directly with CRUSH?
• If we did that, all OSDs would be coupled (peered) with all others.
• Any simultaneous failure of 3 OSDs would lead to data loss.
• Consider a 1000-OSD cluster:
  • 1000^3 possible combinations, but only #PGs of them are relevant for data loss.
  • …
Do we really need CRUSH?
• Why don’t we just distribute the PG mappings directly?
• CRUSH provides a language for describing data placement rules according to your infrastructure.
• The “failure-domain” part of CRUSH is always perfect: e.g. it will never put two copies on the same host (unless you tell it to…).
• The uniformity part of CRUSH is imperfect: uneven OSD utilizations are a fact of life. Perfection requires an impractical number of PGs.
Do we really need CRUSH?
• Why don’t we just distribute the PG mappings directly?
• We can! Now in luminous with upmap!
2.1-Step Placement
RANDOM → CRUSH → UPMAP: upmap adds a table of explicit per-PG exceptions on top of the CRUSH mapping.
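The exception table is edited with a pair of OSD commands; the PG id and OSD ids below are placeholders, not values from the talk.

  # Add an exception for PG 1.7: the replica CRUSH assigns to osd.4 is
  # placed on osd.12 instead (ids are hypothetical).
  ceph osd pg-upmap-items 1.7 4 12

  # The exceptions are stored in the OSDMap as pg_upmap_items entries.
  ceph osd dump | grep upmap

  # Drop the exception; the PG returns to its pure CRUSH placement.
  ceph osd rm-pg-upmap-items 1.7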
Data Imbalance
• User question:
  • ceph df says my cluster is 50% full overall but the cephfs_data pool is 73% full. What’s wrong?
• A pool is full once its first OSD is full.
• ceph df reports the used space on the most full OSD.
• This difference shows that the OSDs are imbalanced.
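To see the spread directly, two stock commands are enough:

  # Per-OSD utilization; the MIN/MAX/STDDEV summary at the bottom shows
  # how wide the spread is.
  ceph osd df tree

  # Per-pool view; MAX AVAIL for a pool is limited by its fullest OSD.
  ceph df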
Data Balancing
• Luminous added a data balancer.
• The general idea is to slowly move PGs from the most full to the least full OSDs.
• Two ways to accomplish this, two modes:
  • crush-compat: internally tweak the OSD sizes, e.g. to make under-full disks look bigger.
    • Compatible with legacy clients.
  • upmap: move PGs precisely where we want them (without breaking failure-domain rules).
    • Compatible with luminous+ clients.
Turning on the balancer (luminous)
• My advice: Do whatever you can to use the upmap balancer:
ceph osd set-require-min-compat-client luminous
ceph mgr module ls
ceph mgr module enable balancer
ceph config-key set mgr/balancer/begin_time 0830
ceph config-key set mgr/balancer/end_time 1800
ceph config-key set mgr/balancer/max_misplaced 0.005
ceph config-key set mgr/balancer/upmap_max_iterations 2
ceph balancer mode upmap
ceph balancer on
Luminous limitation: if the balancer config doesn’t take, restart the active ceph-mgr.
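A few read-only checks (standard commands) to confirm the settings took effect:

  # Make sure no pre-luminous clients are connected before requiring luminous.
  ceph features

  # Module state: on/off, mode, and whether a plan is in progress.
  ceph balancer status

  # Current distribution score (lower is better).
  ceph balancer eval

  # The config-keys set above should be visible here.
  ceph config-key dump | grep mgr/balancer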
Success!
• It really works!
• On our largest clusters, we have recovered hundreds of terabytes of capacity using the upmap balancer.
But wait, there’s more…
Ceph Operations at Scale
• Cluster is nearly full (need to add capacity)
• Cluster has old tunables (need to set to optimal)
• Cluster has legacy osd reweights (need to reweight all to 1.0)
• Cluster data placement changes (need to change crush ruleset)
• Operations like those above often involve:
• large amounts of data movement
• lasting several days or weeks
• unpredictable impact on users (and the cluster)
• and no easy rollback!
(Diagram: “WE ARE HERE” … leap of faith … “WE WANT TO BE HERE”)
A brief interlude…
• “remapped” PGs are fully replicated (normally “clean”), but CRUSH wants to move them to a new set of OSDs:
  • 4815 active+remapped+backfill_wait
• “norebalance” is a cluster flag that tells Ceph *not* to make progress on any remapped PGs:
  • ceph osd set norebalance
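For reference, the corresponding commands (nothing cluster-specific here):

  # List the PGs that CRUSH wants to move.
  ceph pg ls remapped

  # Pause data movement for remapped (but fully replicated) PGs...
  ceph osd set norebalance

  # ...and resume it later.
  ceph osd unset norebalance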
Adding capacity
• Step 1: set the norebalance flag
• Step 2: ceph-volume lvm create…
• Step 3: 4518 active+remapped+backfill_wait
• Step 4: ???
• Step 5: HEALTH_OK
What if we could “upmap” those remapped PGs back to where the data is now?
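As a rough sketch of the idea (not the CERN tooling mentioned later, which handles more edge cases): assuming jq is available and that ceph pg ls remapped -f json returns PG records with pgid, up and acting fields (a plain array on luminous; newer releases wrap the array in a pg_stats key), we can pin every remapped PG onto its current acting set.

  # Sketch only: pin every remapped PG back onto its current acting set.
  ceph osd set norebalance

  ceph pg ls remapped -f json |
    jq -r '(.pg_stats? // .)[]
           | .pgid + " "
             + ([.up, .acting] | transpose
                | map(select(.[0] != .[1]) | "\(.[0]) \(.[1])")
                | join(" "))' |
  while read -r pgid pairs; do
    # Each pair means: the replica CRUSH now sends to the first OSD stays
    # on the second OSD, where the data already lives.
    [ -n "$pairs" ] && ceph osd pg-upmap-items "$pgid" $pairs
  done

  # Once `ceph pg ls remapped` comes back (nearly) empty, it is safe to:
  ceph osd unset norebalance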
Adding capacity with upmap
• But you may be wondering:
  • We’ve added new disks to the cluster, but they have zero PGs, so what’s the point?
• The good news is that the upmap balancer will automatically notice the new OSDs are under-full, and will gradually move PGs to them.
  • In fact, the balancer simply *removes* the upmap entries we created to keep PGs off of the new OSDs.
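You can watch that happen with standard commands; only the PG id below is a placeholder.

  # Count the remaining pinning entries as the balancer removes them.
  ceph osd dump | grep -c pg_upmap_items

  # Or drop one entry by hand to let that PG backfill to the new OSDs now.
  ceph osd rm-pg-upmap-items 1.7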
Changing tunables
• Step 1: set the norebalance flag
• Step 2: ceph osd crush tunables optimal
• Step 3: 4815162/20935486 objects misplaced
• Step 4: ???
• Step 5: HEALTH_OK
Such an intensive backfilling would not be transparent to the users…
And remember, the balancer will slowly move PGs around to where they need to be.
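As a hedged outline, the whole tunables change then looks like this; the pinning step is the same sketch shown above for adding capacity.

  ceph osd set norebalance
  ceph osd crush tunables optimal     # a large fraction of PGs become remapped

  # Pin each remapped PG back onto its current acting set with
  # `ceph osd pg-upmap-items` (see the adding-capacity sketch), then:
  ceph osd unset norebalance

  # From here the mgr balancer gradually removes the pinning entries,
  # within the begin_time / end_time / max_misplaced limits configured earlier.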
Other use-cases: Legacy OSD Weights
• Remove legacy OSD reweights:
  • ceph osd reweight-by-utilization is a legacy feature for balancing OSDs with a [0,1] reweight factor.
  • When we set the reweights back to 1.0, many PGs will become active+remapped.
  • We can upmap them back to where they are.
• Bonus: since those reweights helped balance the OSDs, this acts as a shortcut to find the right set of upmaps to balance a cluster.
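A minimal sketch of the reset itself, assuming norebalance is set first and the resulting remapped PGs are then pinned as in the earlier sketch:

  ceph osd set norebalance

  # Return every legacy [0,1] reweight to 1.0.
  for id in $(ceph osd ls); do
    ceph osd reweight "$id" 1.0
  done

  # Then pin the remapped PGs back onto their acting sets (earlier sketch)
  # and unset norebalance; the balancer does the rest.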
Other use-cases: Placement Changes
• We had a pool with the first 2 replicas in Room A and the 3rd replica in Room B.
• For capacity reasons, we needed to move all three replicas into Room A.
Other use-cases: Placement Changes
• What did we do?
  1. Create the new crush ruleset
  2. Set the norebalance flag
  3. Set the pool’s crush rule to be the new one…
     • This puts *every* PG in the “remapped” state.
  4. Use upmap to map those PGs back to where they are (3rd replica in the wrong room)
  5. Gradually remove those upmap entries to slowly move all PGs fully into Room A.
  6. HEALTH_OK
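In command form, with hypothetical pool, rule, and bucket names (the real rule definition is cluster-specific):

  # New replicated rule keeping all copies under the Room A bucket
  # (rule name, root bucket and failure domain are hypothetical).
  ceph osd crush rule create-replicated room-a-only roomA host

  ceph osd set norebalance
  ceph osd pool set mypool crush_rule room-a-only   # every PG becomes remapped

  # Pin the remapped PGs back onto their current OSDs (earlier sketch),
  # unset norebalance, then remove the pins gradually to drain Room B.
  ceph osd unset norebalance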
We moved several hundred TBs of data without users noticing, and could pause with HEALTH_OK at any time.
No leap of faith required
(Diagram: “WE ARE HERE” → UPMAP → “WE WANT TO BE HERE”, with no leap of faith.)
What’s next for “upmap remapped”
• We find this capability to be super useful!
• Wrote external tools to manage the upmap entries.
• It can be tricky!
• After some iteration with upstream we could share…
• Possibly contribute as a core feature?
• Maybe you can think of other use-cases?
Thanks!
