OCS Recovery PoC
Senior Technical Account Manager
Jooho Lee
This document explains how to recover OCS (OpenShift Container Storage) after you lose one of the OCS nodes.
Note that this document is a PoC only, so the procedure is not supported by Red Hat.
Test Environment
Test Scenario
Step by step
  Remove one of OCS nodes(worker-0)
    Stop the node (worker-0)
    Detach/Remove ocs volumes
    Delete the node
  Create a new server(worker-5)
    Find ocs pod on worker-0 then scale down mon-0/crashcollector
    Create a node (worker-5)
    Add volume to the new node via OpenStack console
    Apply infra label
  Recover Local Volume
    Delete pv/pvc that was attached to the old node
    Update localvolume
  Recover OCS
    Apply storage label
    Create a pvc for rook-ceph-mon-c
    Deploy mon-c
    Deploy rook toolbox to remove the old osd
    Delete deployment OSD-0/rook-ceph-crashcollector
    Verify OSD status
Appendix A. Why does a new server use a different hostname?
Appendix B. rook-toolbox.yaml
Reference
Test Environment:
- OpenStack 14
- OpenShift 4.3.28
  - 3 Master nodes
  - 3 Infra nodes
  - 2 Worker nodes
- OpenShift Container Storage 4.3.0
- Local Volume 4.5.0
  - 4 filesystem (worker-0/1/2/3)
    - /dev/vdb
  - 4 block (worker-0/1/2/3)
    - /dev/vdc
Test Scenario:
- Remove one of the OCS nodes (worker-0)
  - Shut down the OCS node (worker-0)
  - Remove the worker-0 VM (instance)
  - Remove the volumes of the OCS node (worker-0) ⇐ different thing
  - Remove worker-0 from OpenShift
- Create a new server (worker-5)
  - Use a different server name and hostname because of a known issue (see Appendix A)
  - Apply the infra MCP
- Recover Local Volume
  - Remove the deleted node from the LocalVolume objects
  - Add the new node to the LocalVolume objects
- Recover OCS
  - Remove the PV/PVC/OSD/crashcollector objects that are related to the deleted node.
  - Wait for the operator to add new objects for the new OCS node.
Step by step:
Remove one of OCS nodes(worker-0)
This step explains how I permanently removed one of the OCS nodes.
Stop the node (worker-0)
Detach/Remove ocs volumes
Delete the node:
1. From the load balancer (haproxy)
   a. Remove worker-0 from the OpenShift ingress endpoints.
   b. Strictly speaking, worker-5 should be added only after it has been created, but for testing purposes I add it now.
2. From DNS
   a. The worker-0 record in the upstream DNS is no longer needed, so, as with the load balancer, I remove worker-0 and add worker-5.
3. From OpenStack
   a. Delete the instance.
4. From OpenShift
oc delete node worker-0.telus.tamlab.brq.redhat.com
Create a new server(worker-5)
Before you create the new VM (instance), complete the following step first. Otherwise, you will hit this error [1]:
I0715 03:56:11.808818 450992 update.go:92] error when evicting pod "rook-ceph-mon-a-bcfc499c5-bm4lz" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
Find ocs pod on worker-0 then scale down mon-0/crashcollector
# Check which mon and osd pods were running on the deleted node
oc get pod -n openshift-storage -o wide | grep worker-0
# Scale down the mon/osd/crashcollector deployments found above
oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
oc scale deployment \
  --selector=app=rook-ceph-crashcollector,node_name=worker-0.telus.tamlab.brq.redhat.com \
  --replicas=0 -n openshift-storage
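The lookup of pods on the removed node can also be sketched as a small filter instead of a plain grep. This is only an illustration: the function name `pods_on_node` is hypothetical, and it assumes the input comes from `oc get pod -o wide --no-headers`, where the node name appears as its own whitespace-separated field.

```shell
#!/usr/bin/env bash
# Sketch: print the names of pods scheduled on a given node.
# Reads `oc get pod -o wide --no-headers` output on stdin and prints the
# first field (pod name) of every row that contains the node name as an
# exact field, which avoids partial matches such as worker-0 vs worker-01.
pods_on_node() {
  node=$1
  awk -v node="$node" '{
    for (i = 2; i <= NF; i++)
      if ($i == node) { print $1; break }
  }'
}

# Example (not run here):
#   oc get pod -n openshift-storage -o wide --no-headers \
#     | pods_on_node worker-0.telus.tamlab.brq.redhat.com
```

The pod names printed this way point at the deployments (mon, osd, crashcollector) that need to be scaled down.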
Create a node (worker-5)
- Update haproxy/DNS
  - This was already done in the "Delete the node" step.
- Create the new node
- Approve the CSRs
oc get csr -o json | jq -r '.items[] | select(.status == {}) | .metadata.name' | xargs oc adm certificate approve
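A new node normally needs more than one round of approvals (a client CSR first, then a serving CSR), so the approval may have to be repeated. The following is a minimal sketch without the `jq` dependency; the function name `approve_pending_csrs` is hypothetical, and the go-template simply prints every CSR that has no status yet.

```shell
#!/usr/bin/env bash
# Sketch: approve every CSR that is still pending (empty status).
approve_pending_csrs() {
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' |
    while read -r name; do
      # Skip blank lines that the template may emit.
      if [ -n "$name" ]; then
        oc adm certificate approve "$name"
      fi
    done
}

# Example (not run here): call it, wait for the node to request its
# serving certificate, then call it again.
#   approve_pending_csrs; sleep 30; approve_pending_csrs
```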
Add volume to the new node via OpenStack console
Apply infra label
oc label node worker-5.telus.tamlab.brq.redhat.com node-role.kubernetes.io/infra=""
oc label node worker-5.telus.tamlab.brq.redhat.com node-role.kubernetes.io/worker-
(Tip) If the new worker node does not come up with the infra MCP, check the machine-config-daemon logs:
oc get pod -o wide -n openshift-machine-config-operator | grep worker-5
oc logs machine-config-daemon-XXX -c machine-config-daemon -n openshift-machine-config-operator
Recover Local Volume
The LocalVolume objects need to be updated because worker-0 was deleted and worker-5 was added.
Delete pv/pvc that was attached to the old node
Before you update the LocalVolume objects, delete the PVs/PVCs that were bound to worker-0.
# Back up the PVC definitions, then delete the PVCs and PVs
oc get pvc rook-ceph-mon-c -o yaml -n openshift-storage > mon-c.yaml
oc get pvc ocs-deviceset-0-0-494jh -o yaml -n openshift-storage > ocs-deviceset-0.yaml
oc delete pvc rook-ceph-mon-d rook-ceph-mon-c ocs-deviceset-0-0-494jh -n openshift-storage
oc delete pv local-pv-12cc2ec4 local-pv-74c2a064 local-pv-85537348 local-pv-addebda5
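To work out which local PVs belong to the removed node, one option is to read the node affinity that every local PV carries. The sketch below is an assumption-laden illustration: `pvs_on_node` is a hypothetical helper, and it assumes each PV pins exactly one hostname in `spec.nodeAffinity`.

```shell
#!/usr/bin/env bash
# Sketch: list "PV-name claim-name" for every PV pinned to a given node.
pvs_on_node() {
  node=$1
  oc get pv --no-headers \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values[*],CLAIM:.spec.claimRef.name' |
    awk -v node="$node" '$2 == node { print $1, $3 }'
}

# Example (not run here):
#   pvs_on_node worker-0.telus.tamlab.brq.redhat.com
```

The second column shows the claim bound to each PV, which helps to match PVs to the PVCs that must be deleted with them.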
Update localvolume
Remove the worker-0 node and add the worker-5 node. After the edit, the node selector looks like this:
oc edit localvolume local-file -n local-storage
oc edit localvolume local-block -n local-storage
...
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - worker-1.telus.tamlab.brq.redhat.com
        - worker-2.telus.tamlab.brq.redhat.com
        - worker-3.telus.tamlab.brq.redhat.com
        - worker-5.telus.tamlab.brq.redhat.com
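As an alternative to editing the objects interactively, the hostname swap could be scripted. This is a rough sketch, not the method used in this PoC: `swap_node` is a hypothetical helper that rewrites a YAML list entry holding the old hostname, and it assumes the hostname sits alone on a `- value` line.

```shell
#!/usr/bin/env bash
# Sketch: replace a "- <old-hostname>" YAML list entry with the new hostname.
# Reads YAML on stdin and writes the modified YAML to stdout.
swap_node() {
  old=$1
  new=$2
  # Escape the dots so the old hostname is matched literally.
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
  sed "s/^\([[:space:]]*-[[:space:]]*\)${old_re}[[:space:]]*\$/\1${new}/"
}

# Example (not run here); review the output before piping it to `oc replace`:
#   oc get localvolume local-block -n local-storage -o yaml \
#     | swap_node worker-0.telus.tamlab.brq.redhat.com worker-5.telus.tamlab.brq.redhat.com \
#     | oc replace -f -
```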
Recover OCS
Apply storage label
oc label nodes worker-5.telus.tamlab.brq.redhat.com cluster.ocs.openshift.io/openshift-storage=''
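A quick way to confirm that the label landed on the right node is to list the nodes that carry it. A minimal sketch follows; the helper names `storage_nodes` and `node_is_storage` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check which nodes carry the OCS storage label.
storage_nodes() {
  # Selecting on the bare key matches any node that has the label at all.
  oc get node -l cluster.ocs.openshift.io/openshift-storage -o name
}

node_is_storage() {
  # `-o name` prints entries as node/<name>; match the whole line literally.
  storage_nodes | grep -Fqx "node/$1"
}

# Example (not run here):
#   node_is_storage worker-5.telus.tamlab.brq.redhat.com && echo "labelled"
```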
Create a pvc for rook-ceph-mon-c
Once the new LocalVolume PVs have been created, you can create the PVC for mon-c.
oc create -f mon-c.yaml
Deploy mon-c
oc scale deployment rook-ceph-mon-c --replicas=1 -n openshift-storage
(Tip) If a rook-ceph-mon-d-canary deployment has been created, you can delete it because rook-ceph-mon-c was not lost.
oc delete deploy rook-ceph-mon-d-canary -n openshift-storage
oc delete pvc rook-ceph-mon-d -n openshift-storage
Deploy rook toolbox to remove the old osd
Use the manual script below to remove the problematic OSD.
# Deploy the rook toolbox to use the ceph command
oc create -f rook-toolbox.yaml   # see Appendix B
oc rsh rook-toolbox-XXXX
...
ceph status
...
osd.0 down
…
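The ID of the failed OSD can also be read from `ceph osd tree` rather than by eye. The filter below is a sketch under the assumption that each OSD row shows the name (`osd.N`) immediately followed by its status; `down_osds` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: print the numeric ID of every OSD that `ceph osd tree` reports as down.
# Reads the `ceph osd tree` output on stdin.
down_osds() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") {
        sub(/^osd\./, "", $i)   # strip the "osd." prefix, keep the ID
        print $i
      }
  }'
}

# Example (run inside the toolbox pod, not run here):
#   ceph osd tree | down_osds
```

The printed ID is the value to use for FAILED_OSD_ID in the cleanup script.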
# Manual script to remove the problematic OSD
cat osd_clean_job.sh
~~~
#!/bin/bash
FAILED_OSD_ID=0  # Update this ID to match the failed OSD
# Look up the CRUSH host that holds the failed OSD
HOST_TO_REMOVE=$(ceph osd find osd.${FAILED_OSD_ID} | grep "host" | tail -n 1 | awk '{print $2}' | cut -d'"' -f 2)
# Read the status column (up/down) of the matching `ceph osd tree` row
osd_status=$(ceph osd tree | grep "osd.${FAILED_OSD_ID} " | awk '{print $5}')
if [[ "$osd_status" == "up" ]]; then
  echo "OSD ${FAILED_OSD_ID} is up and running."
  echo "Please check if you entered correct ID of failed osd!"
else
  echo "OSD ${FAILED_OSD_ID} is down. Proceeding to mark out and purge"
  ceph osd out osd.${FAILED_OSD_ID}
  ceph osd purge osd.${FAILED_OSD_ID} --force --yes-i-really-mean-it
  echo "Attempting to remove the parent host. Errors can be ignored if there are other OSDs on the same host"
  ceph osd crush rm "$HOST_TO_REMOVE"
fi
~~~
./osd_clean_job.sh
Delete deployment OSD-0/rook-ceph-crashcollector
The step above removes the OSD from the Ceph cluster. Now delete the old OSD deployment and crashcollector, then restart the operator pod; the operator will create the PVC and deployment for a new OSD automatically.
oc delete deployment rook-ceph-osd-0 -n openshift-storage
oc delete deployment \
  --selector=app=rook-ceph-crashcollector,node_name=worker-0.telus.tamlab.brq.redhat.com \
  -n openshift-storage
oc get -n openshift-storage pod -l app=rook-ceph-operator
oc delete -n openshift-storage pod rook-ceph-operator-XXXX
All steps are now done; the only thing left to do is wait.
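The waiting can be scripted as a simple poll. The sketch below is an assumption-based illustration: `wait_until` and `osds_running` are hypothetical helpers, and it assumes the OSD pods carry the `app=rook-ceph-osd` label and that three running OSD pods mean the cluster is back to full strength.

```shell
#!/usr/bin/env bash
# Sketch: poll until a condition holds or the retry budget is used up.
# Usage: wait_until <tries> <delay-seconds> <command> [args...]
wait_until() {
  tries=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Succeeds when exactly <want> OSD pods are in the Running state.
osds_running() {
  want=$1
  got=$(oc get pod -n openshift-storage -l app=rook-ceph-osd --no-headers 2>/dev/null |
    awk '$3 == "Running" { n++ } END { print n + 0 }')
  [ "$got" -eq "$want" ]
}

# Example (not run here): check every 30 seconds for up to 20 minutes.
#   wait_until 40 30 osds_running 3 && echo "all OSD pods are running"
```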
Verify OSD status
# Inside rook-toolbox
oc rsh rook-toolbox-XXXX
sh-4.2$ ceph status
  cluster:
    id:     d579609e-7440-4432-b6b4-b79173bf7a93
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 47m)
    mgr: a(active, since 17m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 13m), 3 in (since 25m)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  task status:

  data:
    pools:   10 pools, 192 pgs
    objects: 437 objects, 332 MiB
    usage:   3.7 GiB used, 83 GiB / 87 GiB avail
    pgs:     192 active+clean

  io:
    client:   5.9 KiB/s rd, 6.0 KiB/s wr, 7 op/s rd, 3 op/s wr
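The `osd:` line of this output is the one to watch: recovery is complete when the total, up and in counts agree. A small sketch of that check, with the hypothetical name `all_osds_up_in`, reading `ceph status` text on stdin:

```shell
#!/usr/bin/env bash
# Sketch: print "<total> <up> <in>" from the osd line of `ceph status`
# and succeed only when all three numbers are equal.
all_osds_up_in() {
  awk '$1 == "osd:" {
    for (i = 2; i < NF; i++) {
      if ($(i + 1) == "osds:")  n_total = $i
      if ($(i + 1) ~ /^up,?$/)  n_up = $i
      if ($(i + 1) ~ /^in,?$/)  n_in = $i
    }
    found = 1
  }
  END {
    print n_total, n_up, n_in
    exit !(found && n_total == n_up && n_total == n_in)
  }'
}

# Example (run inside the toolbox pod, not run here):
#   ceph status | all_osds_up_in && echo "all OSDs are up and in"
```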
Appendix A. Why does a new server use a different hostname?
8.3. OpenShift Container Storage deployed using local storage devices
IMPORTANT
While replacing a node, the hostname of the new Openshift Container Storage node should not be the
same as the hostname of any decommissioned Openshift Container Storage node due to a known issue.
As a workaround, we recommend to use a new hostname for adding the replaced node back into the
cluster.
Appendix B. rook-toolbox.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-tools
  labels:
    app: rook-ceph-tools
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rook-ceph-tools
  template:
    metadata:
      labels:
        app: rook-ceph-tools
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: rook-ceph-tools
        image: rook/ceph:v1.3.7
        command: ["/tini"]
        args: ["-g", "--", "/usr/local/bin/toolbox.sh"]
        imagePullPolicy: IfNotPresent
        env:
        - name: ROOK_ADMIN_SECRET
          valueFrom:
            secretKeyRef:
              name: rook-ceph-mon
              key: admin-secret
        volumeMounts:
        - mountPath: /etc/ceph
          name: ceph-config
        - name: mon-endpoint-volume
          mountPath: /etc/rook
      volumes:
      - name: mon-endpoint-volume
        configMap:
          name: rook-ceph-mon-endpoints
          items:
          - key: data
            path: mon-endpoints
      - name: ceph-config
        emptyDir: {}
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 5
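Because the pod created by this Deployment gets a generated name suffix, it can be convenient to look the pod up by its `app=rook-ceph-tools` label. The sketch below makes two assumptions that are not stated in the manifest: the toolbox was created in the `openshift-storage` namespace, and the helper names and the `TOOLBOX_NS` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: find the toolbox pod by label and run a ceph command in it.
# TOOLBOX_NS is a hypothetical override; the default namespace is an assumption.
toolbox_pod() {
  oc get pod -n "${TOOLBOX_NS:-openshift-storage}" -l app=rook-ceph-tools \
    -o jsonpath='{.items[0].metadata.name}'
}

ceph_in_toolbox() {
  pod=$(toolbox_pod) || return 1
  oc exec -n "${TOOLBOX_NS:-openshift-storage}" "$pod" -- ceph "$@"
}

# Example (not run here):
#   ceph_in_toolbox status
#   ceph_in_toolbox osd tree
```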
Reference
- https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.3/html-single/deploying_openshift_container_storage/index
- https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.3/html-single/managing_openshift_container_storage/index#replacing-storage-nodes-for-openshift-container-storage_rhocs
- https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.4/html-single/managing_openshift_container_storage/index#openshift_container_storage_deployed_using_local_storage_devices

OpenShift Container Storage on OCP4 Recovery PoC

  • 1. OCS Recovery PoC Senior Technical Account Manager Jooho Lee
  • 2. This document explains how to recover OCS when you lost one of OCS nodes. Absolutely, this doc is for PoC so it is not supported by Red Hat.
  • 3. Test Environment: 4 Test Scenario: 4 Step by step: 5 Remove one of OCS nodes(worker-0) 5 Stop the node (worker-0) 5 Detach/Remove ocs volumes 5 Delete the node: 5 Create a new server(worker-5) 6 Find ocs pod on worker-0 then scale down mon-0/crashcollector 6 Create a node (worker-5) 6 Add volume to the new node via OpenStack console 6 Apply infra label 6 Recover Local Volume 7 Delete pv/pvc that was attached to the old node 7 Update localvolume 7 Recover OCS 8 Apply storage label 8 Create a pvc for rook-ceph-mon-c 8 Deploy mon-c 8 Deploy rook toolbox to remove the old osd 8 Delete deployment OSD-0/rook-ceph-crashcollector 9 Verify OSD status 10 Appendix A. Why does a new server use a different hostname? 11 Appendix B. rook-toolbox.yaml 11 Reference 12
  • 4. Test Environment: - OpenStack 14 - OpenShift 4.3.28 - 3 Master nodes - 3 Infra nodes - 2 Worker nodes - OpenShift Container Storage 4.3.0 - Local Volume 4.5.0 - 4 filesystem(worker-0/1/2/3) - /dev/vdb - 4 block(worker-0/1/2/3) - /dev/vdc Test Scenario: - Remove one of OCS nodes(worker-0) - Shutdown the OCS node (worker-0) - remove worker-0 vm(instance) - remove volumes for OCS node(worker-0) ​ ⇐ different thing - remove worker-0 from openshift - Create a new server(worker-5) - Use other server name and hostname because of ​this known issue - Apply infra MCP - Recover Local Volume - Remove the deleted node from LocalVolume object - Add a new node to LocalVolume object - Recover OCS - Remove PV/PVC/OSD/Crashcollector that are related with the deleted node. - Wait for the operator to add new objects for the new OCS node.
  • 5. Step by step: Remove one of OCS nodes(worker-0) This step explains how I remove one of OCS nodes permanently. Stop the node (worker-0) Detach/Remove ocs volumes Delete the node: 1. from load balancer(haproxy) a. for ingress endpoint of openshift, worker-0 should be removed b. actually, worker-5 has to be added after it is created but for testing purposes, I just add it now. 2. from dns a. for upstream DNS, worker-0 record is not need anymore so like load balancer, I remove worker-0 but add worker-5 3. from openstack a. delete the instance 4. from openshift oc delete node worker-0.telus.tamlab.brq.redhat.com
  • 6. Create a new server(worker-5) Before you create a new vm(instance), you have to do the following first. If not, you hit this error[1] I0715 03:56:11.808818 450992 update.go:92] error when evicting pod "rook-ceph-mon-a-bcfc499c5-bm4lz" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. Find ocs pod on worker-0 then scale down mon-0/crashcollector # Check which mon and osd pods were running on the deleted node oc get pod -o wide|grep worker-0 #Scale down the mon/osd pod that found above oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=worker-0.telus.tamlab.brq.redhat.com --replicas=0 -n openshift-storage Create a node (worker-5) - update haproxy/dns - This is already done in the ​Delete the node step - create a new node - approve csr oc get csr -o json | jq -r '.items[] | select(.status == {}) | .metadata.name' | xargs oc adm certificate approve Add volume to the new node via OpenStack console Apply infra label oc label node worker-5.telus.tamlab.brq.redhat.com node-role.kubernetes.io/infra="" oc label node worker-5.telus.tamlab.brq.redhat.com node-role.kubernetes.io/worker-
  • 7. (Tip)If a new worker node is not up with infra mcp, check machine-config-daemon oc get pod -o wide -n openshift-machine-config-operator |grep worker-0 oc logs ​machine-config-daemon-XXX​ -c machine-config-daemon -n openshift-machine-config-operator Recover Local Volume The localvolume object needed to be updated because worker-1 was deleted and worker-5 added for localvolume. Delete pv/pvc that was attached to the old node Before you update localvolume, you need to delete pv/pvc that was related with worker-1 # Backup and delete oc get pvc rook-ceph-mon-c -o yaml -n openshift-storage > mon-c.yaml oc get pvc ocs-deviceset-0-0-494jh -o yaml -n openshift-storage > ocs-deviceset-0.yaml oc delete pvc rook-ceph-mon-d rook-ceph-mon-c ocs-deviceset-0-0-494jh oc delete pv local-pv-12cc2ec4 local-pv-74c2a064 local-pv-85537348 local-pv-addebda5 Update localvolume Remove worker-1 node and add worker-5 node oc edit localvolume local-file -n local-storage oc edit localvolume local-block -n local-storage ... nodeSelector: nodeSelectorTerms: - matchExpressions: - key: kubernetes.io/hostname operator: In values: - worker-1.telus.tamlab.brq.redhat.com - worker-2.telus.tamlab.brq.redhat.com - worker-3.telus.tamlab.brq.redhat.com - worker-5.telus.tamlab.brq.redhat.com
  • 8. Recover OCS

Apply storage label

    oc label nodes worker-5.telus.tamlab.brq.redhat.com cluster.ocs.openshift.io/openshift-storage=''

Create a pvc for rook-ceph-mon-c

New localvolume PVs have been created, so you can recreate the pvc for mon-c from the backup.

    oc create -f mon-c.yaml

Deploy mon-c

    oc scale deployment rook-ceph-mon-c --replicas=1 -n openshift-storage

(Tip) If a rook-ceph-mon-d-canary deployment is created, you can delete it because rook-ceph-mon-c has not been lost.

    oc delete deploy rook-ceph-mon-d-canary -n openshift-storage
    oc delete pvc rook-ceph-mon-d -n openshift-storage

Deploy rook toolbox to remove the old osd

The manual script to remove the problematic osd is from here.

    # Deploy rook toolbox to use ceph command
    oc create -f rook-toolbox.yaml    (see Appendix B)
    oc rsh rook-ceph-tools-XXXX
    ...
    ceph status
    ...
    osd.0 down
    …
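Before removing anything, it helps to confirm which OSD IDs Ceph actually reports as down. The sketch below is an illustration only: down_osds is a hypothetical helper, and it assumes the usual ceph osd tree layout in which the status column (up or down) directly follows the osd.N name.

```shell
# Hypothetical helper: read "ceph osd tree" output on stdin and print the
# numeric ID of every OSD whose status column says "down".
down_osds() {
  awk '
    {
      for (i = 1; i < NF; i++) {
        if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") {
          id = $i
          sub(/^osd\./, "", id)   # keep only the number
          print id
        }
      }
    }'
}

# Possible usage inside the toolbox shell:
#   ceph osd tree | down_osds
```

An ID printed here is the value to use for the failed OSD in the cleanup step that follows.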
  • 9. # Manual script to remove problematic osd.

cat osd_clean_job.sh
~~~
#!/bin/bash
# Needs the ceph command, so run it from the rook toolbox shell.
FAILED_OSD_ID=0   # This id should be updated depending on situation

# Look up the CRUSH host bucket that holds the failed OSD.
HOST_TO_REMOVE=$(ceph osd find osd.${FAILED_OSD_ID} | grep "host" | tail -n 1 | awk '{print $2}' | cut -d'"' -f 2)
# Column 5 of "ceph osd tree" is the OSD status (up/down).
osd_status=$(ceph osd tree | grep "osd.${FAILED_OSD_ID} " | awk '{print $5}')

if [[ "$osd_status" == "up" ]]; then
  echo "OSD ${FAILED_OSD_ID} is up and running."
  echo "Please check if you entered correct ID of failed osd!"
else
  echo "OSD ${FAILED_OSD_ID} is down. Proceeding to mark out and purge"
  ceph osd out "osd.${FAILED_OSD_ID}"
  ceph osd purge "osd.${FAILED_OSD_ID}" --force --yes-i-really-mean-it
  echo "Attempting to remove the parent host. Errors can be ignored if there are other OSDs on the same host"
  ceph osd crush rm "$HOST_TO_REMOVE"
fi
~~~

./osd_clean_job.sh

Delete deployment OSD-0/rook-ceph-crashcollector

The step above removes the OSD from the Ceph cluster. Once the old deployments are deleted and the operator is restarted, the operator creates the pvc/deployment for a new OSD automatically.

    oc delete deployment rook-ceph-osd-0 -n openshift-storage
    oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=worker-0.telus.tamlab.brq.redhat.com -n openshift-storage

    # Restart the operator pod
    oc get -n openshift-storage pod -l app=rook-ceph-operator
    oc delete -n openshift-storage pod rook-ceph-operator-XXXX

All steps are done. Now wait for the operator to create the new objects.
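The waiting can be scripted with a small polling loop. This is a generic sketch, not something the original procedure prescribes: wait_for is a hypothetical helper that retries an arbitrary command until it succeeds or a timeout (in whole seconds) expires. The label app=rook-ceph-osd in the usage comment is an assumption about how the OSD pods are labelled.

```shell
# Hypothetical helper: wait_for TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it exits 0.
# Returns 1 once at least TIMEOUT seconds of sleeping have accumulated.
# TIMEOUT and INTERVAL must be whole numbers.
wait_for() {
  wf_timeout="$1"
  wf_interval="$2"
  shift 2
  wf_elapsed=0
  until "$@"; do
    if [ "$wf_elapsed" -ge "$wf_timeout" ]; then
      return 1
    fi
    sleep "$wf_interval"
    wf_elapsed=$((wf_elapsed + wf_interval))
  done
}

# Possible usage (requires cluster access; the label is an assumption):
#   wait_for 600 10 sh -c \
#     'oc get pod -n openshift-storage -l app=rook-ceph-osd | grep -q Running'
```

A bounded loop like this fails visibly when the operator never creates the new OSD, instead of leaving the terminal waiting indefinitely.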
  • 10. Verify OSD status

    # Inside rook-toolbox
    oc rsh rook-ceph-tools-XXXX
    sh-4.2$ ceph status
      cluster:
        id:     d579609e-7440-4432-b6b4-b79173bf7a93
        health: HEALTH_OK

      services:
        mon: 3 daemons, quorum a,b,c (age 47m)
        mgr: a(active, since 17m)
        mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
        osd: 3 osds: 3 up (since 13m), 3 in (since 25m)
        rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

      task status:

      data:
        pools:   10 pools, 192 pgs
        objects: 437 objects, 332 MiB
        usage:   3.7 GiB used, 83 GiB / 87 GiB avail
        pgs:     192 active+clean

      io:
        client:   5.9 KiB/s rd, 6.0 KiB/s wr, 7 op/s rd, 3 op/s wr
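The two things to look for in that output are HEALTH_OK and an osd line where every OSD is both up and in. As a rough sketch, the check can be automated by parsing the text. The helper name ceph_status_ok is hypothetical, and the parsing assumes the plain-text ceph status layout shown above, with each field on its own line.

```shell
# Hypothetical helper: read "ceph status" output on stdin and exit 0 only if
# health is HEALTH_OK and the number of OSDs up and in equals the OSD total.
ceph_status_ok() {
  awk '
    $1 == "health:" { health = $2 }
    $1 == "osd:" {
      total = $2
      for (i = 3; i <= NF; i++) {
        word = $i
        sub(/,$/, "", word)               # "up," and "in," may carry a comma
        if (word == "up") up = $(i - 1)
        if (word == "in") nin = $(i - 1)
      }
    }
    END {
      ok = (health == "HEALTH_OK" && total > 0 && up == total && nin == total)
      exit !ok
    }'
}

# Possible usage inside the toolbox shell:
#   ceph status | ceph_status_ok && echo "cluster recovered"
```

Checking the OSD counts as well as the health string catches the case where the cluster reports healthy before the replacement OSD has joined.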
  • 11. Appendix A. Why does a new server use a different hostname?

8.3. OpenShift Container Storage deployed using local storage devices

IMPORTANT
While replacing a node, the hostname of the new Openshift Container Storage node should not be the same as the hostname of any decommissioned Openshift Container Storage node due to a known issue. As a workaround, we recommend to use a new hostname for adding the replaced node back into the cluster.

Appendix B. rook-toolbox.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: rook-ceph-tools
      labels:
        app: rook-ceph-tools
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: rook-ceph-tools
      template:
        metadata:
          labels:
            app: rook-ceph-tools
        spec:
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - name: rook-ceph-tools
            image: rook/ceph:v1.3.7
            command: ["/tini"]
            args: ["-g", "--", "/usr/local/bin/toolbox.sh"]
            imagePullPolicy: IfNotPresent
            env:
              - name: ROOK_ADMIN_SECRET
                valueFrom:
                  secretKeyRef:
                    name: rook-ceph-mon
                    key: admin-secret
            volumeMounts:
              - mountPath: /etc/ceph
                name: ceph-config
              - name: mon-endpoint-volume
                mountPath: /etc/rook
          volumes:
            - name: mon-endpoint-volume
              configMap:
                name: rook-ceph-mon-endpoints
                items:
                - key: data
                  path: mon-endpoints
            - name: ceph-config
              emptyDir: {}
          tolerations:
            - key: "node.kubernetes.io/unreachable"
              operator: "Exists"
              effect: "NoExecute"
              tolerationSeconds: 5

Reference
- https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.3/html-single/deploying_openshift_container_storage/index
- https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.3/html-single/managing_openshift_container_storage/index#replacing-storage-nodes-for-openshift-container-storage_rhocs
- https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.4/html-single/managing_openshift_container_storage/index#openshift_container_storage_deployed_using_local_storage_devices