How to debug OCFS2 hang problem
- L3 bug handling experience sharing
Gang He <ghe@suse.com>
Apr 26th, 2019
Understand the problem
Problem description
The customer has set up a new two-node SLES 11 SP4
cluster and is running application tests on it.
The file system periodically hangs and
processes enter the "D" (uninterruptible sleep) state.
All processes stuck in the "D" state were in the ocfs2_cluster_lock code, for example:
[<ffffffffa066f800>] __ocfs2_cluster_lock+0x3b0/0xa60 [ocfs2]
[<ffffffffa0677528>] ocfs2_inode_lock_full_nested+0x178/0x510 [ocfs2]
[<ffffffffa06ec791>] ocfs2_get_acl+0x61/0x120 [ocfs2]
[<ffffffffa06ec95a>] ocfs2_acl_chmod+0x6a/0xe0 [ocfs2]
[<ffffffffa0681121>] ocfs2_setattr+0x671/0xab0 [ocfs2]
[<ffffffff8117de8e>] notify_change+0x17e/0x2d0
[<ffffffff8116136c>] sys_fchmodat+0xdc/0x150
[<ffffffff8147c187>] sysenter_dispatch+0x7/0x32
[<ffffffffffffffff>] 0xffffffffffffffff
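Stacks like the one above can be gathered on a live node before any core dump is taken. The following is a minimal sketch, assuming a kernel that exposes /proc/<pid>/stack and a root shell; the helper names list_d_state and dump_d_stacks are made up for illustration and do not come from the slides.

```shell
#!/bin/sh
# list_d_state: read "PID STAT COMMAND" lines on stdin and print the PIDs
# whose state field starts with D (uninterruptible sleep).
list_d_state() {
    awk '$2 ~ /^D/ { print $1 }'
}

# dump_d_stacks: print the kernel stack of every task currently in D state.
# Reading /proc/<pid>/stack normally requires root; unreadable entries are noted.
dump_d_stacks() {
    ps -eo pid=,stat=,comm= | list_d_state | while read -r pid; do
        echo "== PID $pid =="
        cat "/proc/$pid/stack" 2>/dev/null || echo "(stack not readable)"
    done
}
```

Running dump_d_stacks as root on each node while the hang is in progress shows whether the blocked tasks share the same call path.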
Interact with the customer
• Mail communication
Make sure the ocfs2 cluster setup is correct.
Understand the customer application scenarios.
Provide tentative suggestions/patches.
• Remote session with the customer
Reproduce the bug.
Find the OCFS2-related hung processes.
Collect the related data.
Collect data from the customer site
• supportconfig/hb_report
SLES HA cluster related data.
• dlm_tool
DLM lock related dump.
• o2image
OCFS2 file system meta-data image.
• echo "c" > /proc/sysrq-trigger
Linux core dump file.
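The collection steps above can be scripted so that nothing is forgotten during a remote session. The sketch below only prints the commands unless DRY_RUN=0 is set. The lockspace name, device path, output directory and hb_report start time are placeholders (assumptions), and the exact options should be checked against the man pages of the installed tool versions.

```shell
#!/bin/sh
# Placeholder values (hypothetical): adjust to the customer's cluster before use.
LOCKSPACE=${LOCKSPACE:-EXAMPLE_LOCKSPACE}    # real names are listed by "dlm_tool ls"
DEVICE=${DEVICE:-/dev/sdX1}                  # the OCFS2 block device
OUTDIR=${OUTDIR:-/tmp/ocfs2-hang}            # where the collected data goes
HB_FROM=${HB_FROM:-"2019/04/01 00:00"}       # start time for the cluster report

# run: print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

collect_data() {
    run mkdir -p "$OUTDIR"
    run supportconfig                                   # general SLES support data
    run hb_report -f "$HB_FROM" "$OUTDIR/hb_report"     # HA cluster report
    run dlm_tool ls                                     # list DLM lockspaces
    run dlm_tool lockdebug "$LOCKSPACE"                 # DLM lock dump of one lockspace
    run o2image "$DEVICE" "$OUTDIR/ocfs2-meta.o2i"      # OCFS2 metadata image
    # The core dump is triggered last and by hand, because it panics the node:
    #   echo "c" > /proc/sysrq-trigger
}

collect_data
```

Keeping the panic trigger out of the script is a deliberate choice: it should only be issued once the other data is safely copied off the node.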
Generate core dump in HA cluster
• Why is no Linux core dump left after triggering a panic?
Because the fencing mechanism resets the machine while
kdump is still writing the dump.
• Solutions
1) Use the stonith:fence_kdump resource agent;
refer to the SLE HA guide for more
details.
2) Disable the hardware watchdog and use the soft watchdog;
see the detailed steps on the next page.
Use soft watchdog temporarily
• Disable hardware watchdog
Edit /etc/modprobe.conf and add two lines that prevent
the related kernel modules from loading. (Note: this step
depends on your machine's hardware watchdog
configuration.)
blacklist iTCO_wdt
blacklist iTCO_vendor_support
• Enable soft watchdog
Edit /etc/init.d/boot.local and add one line that loads
the soft watchdog kernel module at boot.
modprobe softdog
• Reboot the machine for the changes to take effect
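The two file edits described above can be applied idempotently with a small helper, so that repeating the procedure does not duplicate lines. This is only a sketch: the helper names append_once and switch_to_softdog are invented for illustration, and the iTCO module names apply only to machines whose hardware watchdog uses that driver.

```shell
#!/bin/sh
# append_once FILE LINE: append LINE to FILE unless an identical line already exists.
append_once() {
    file=$1; line=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# switch_to_softdog MODPROBE_CONF BOOT_LOCAL: apply the two edits from the slide.
switch_to_softdog() {
    append_once "$1" "blacklist iTCO_wdt"
    append_once "$1" "blacklist iTCO_vendor_support"
    append_once "$2" "modprobe softdog"
}

# Real use (as root), followed by a reboot:
#   switch_to_softdog /etc/modprobe.conf /etc/init.d/boot.local
```

Passing the file paths as arguments makes it possible to try the helper on scratch copies first.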
Analyze the problem
Prepare crash analysis environment
• Crash-setup
This tool can help you quickly set up a crash analysis environment on the L3 server based on
the vmcore file, but access from the Beijing site is very slow, and the HA-related
KMP debuginfo/debugsource rpms are missing.
• By yourself
Install the related debuginfo/debugsource rpms
kernel-default-3.0.101-108.68.1
kernel-default-devel-3.0.101-108.68.1
kernel-default-base-3.0.101-108.68.1
kernel-default-debugsource-3.0.101-108.68.1
kernel-default-debuginfo-3.0.101-108.68.1
ocfs2-kmp-default-1.6_3.0.101_63-0.23.40
ocfs2-debugsource-1.6-3.0.101_63-0.23.40
ocfs2-debuginfo-1.6-3.0.101_63-0.23.40
Basic crash analysis skills
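A first pass over a vmcore typically uses a handful of standard crash commands. The sketch below writes them to a command file that crash can read with its -i option and prints the resulting invocation instead of running it. The file paths are placeholders (assumptions), and the command list is a generic starting point, not necessarily the procedure used in this case.

```shell
#!/bin/sh
# Placeholder paths (hypothetical): point these at the debuginfo vmlinux
# and at the customer's vmcore.
VMLINUX=${VMLINUX:-/path/to/vmlinux.debug}
VMCORE=${VMCORE:-/path/to/vmcore}
CMDFILE=${CMDFILE:-/tmp/crash-ocfs2.cmd}

# write_crash_cmds FILE: write a basic triage command list for crash.
#   sys            system and panic summary
#   log            kernel message buffer
#   mod -s ocfs2   load debug symbols for the ocfs2 module
#   ps             task list with states
#   foreach UN bt  backtrace of every uninterruptible (D state) task
write_crash_cmds() {
    cat > "$1" <<'EOF'
sys
log
mod -s ocfs2
ps
foreach UN bt
EOF
}

write_crash_cmds "$CMDFILE"
echo "crash -i $CMDFILE $VMLINUX $VMCORE"
```

Loading the ocfs2 module symbols before taking backtraces matters here, because the frames of interest are inside the module rather than the core kernel.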
Verify the problematic directories/files
Analyze the hung processes - I
Analyze the hung processes - II
Check DLM lock dump
From the DLM lock dumps of the two nodes, we can see that
node04 (the master of this DLM lock resource) has granted a
PR Meta lock on inode 14797221 (0xe1c9a5) to
one process.
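To locate that lock resource in a dump, it helps to convert the decimal inode number into the form used in OCFS2 lock names. The sketch below assumes the usual dlmglue naming scheme (one type character, six pad zeros, the block number as 16 hex digits, then an 8-digit generation that is not known here) and that the inode number equals the inode's block number. Both are assumptions to verify against the dump itself; the function names are illustrative.

```shell
#!/bin/sh
# inode_to_hex INODE: print the inode number in hex, as quoted on the slide.
inode_to_hex() {
    printf '0x%x\n' "$1"
}

# meta_lock_prefix INODE: print the assumed lock-name prefix of the inode's
# Meta lock, without the trailing generation digits.
meta_lock_prefix() {
    printf 'M000000%016x\n' "$1"
}

inode_to_hex 14797221
meta_lock_prefix 14797221
```

The prefix can then be used as a search string in the saved lock dumps of both nodes, for example with grep.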
Analyze the hung processes - III
Analyze the hung processes - IV
Analyze the hung processes - V
Analyze the hung processes - VI
Root cause
The root cause is process 31017, which had acquired
the inode (14797222) DLM EX lock in ocfs2_setattr()
and then tried to acquire the inode DLM PR lock again in
ocfs2_get_acl(). This recursive locking led
to a deadlock, which in turn blocked the related processes
across the cluster.
The fix consists of the following upstream commits:
commit 439a36b8ef38657f765b80b775e2885338d72451
Author: Eric Ren <zren@suse.com>
Date: Wed Feb 22 15:40:41 2017 -0800
ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock
commit b891fa5024a95c77e0d6fd6655cb74af6fb77f46
Author: Eric Ren <zren@suse.com>
Date: Wed Feb 22 15:40:44 2017 -0800
ocfs2: fix deadlock issue when taking inode lock at vfs entry points
commit 8818efaaacb78c60a9d90c5705b6c99b75d7d442
Author: Eric Ren <zren@suse.com>
Date: Fri Jun 23 15:08:55 2017 -0700
ocfs2: fix deadlock caused by recursive locking in xattr
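When triaging a similar hang, one quick check is whether a given upstream kernel tree already contains these three commits. The following is a minimal sketch, assuming a local clone of the mainline tree; the function name check_fixes is illustrative. It is not meaningful for distribution kernel branches in which the fixes are backported under different commit IDs.

```shell
#!/bin/sh
# The three fix commits listed above.
fix_commits="439a36b8ef38657f765b80b775e2885338d72451
b891fa5024a95c77e0d6fd6655cb74af6fb77f46
8818efaaacb78c60a9d90c5705b6c99b75d7d442"

# check_fixes TREE: for each fix commit, report whether it is an ancestor of
# HEAD in the git tree at TREE. Any git error is reported as "missing".
check_fixes() {
    tree=$1
    for c in $fix_commits; do
        if git -C "$tree" merge-base --is-ancestor "$c" HEAD 2>/dev/null; then
            echo "present $c"
        else
            echo "missing $c"
        fi
    done
}

# Example (the path is a placeholder):
#   check_fixes /path/to/linux
```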
Solve the problem
The fix process
• Find kernel patches (from upstream, or write them yourself).
• Test the patches against the customer's version.
Pass the ocfs2 test suites.
• Create the fix branch.
e.g. origin/users/ghe/SLE12-SP4/bsc1128902
• L3 creates the corresponding PTF rpm.
• The customer verifies the PTF rpm.
• Submit the patches upstream if they are new.
• Add the patches to SUSE kernel-source.
• Close the bug in SUSE Bugzilla.
SUSE kernel source maintenance
• Kernel-source
url: user@kerncvs.suse.de:/home/git/kernel-source.git
Linux tarball plus lots of patches
• Kernel
url: git://kerncvs.suse.de/kernel.git
SUSE Linux kernel source (patches applied)
• Code branches for various SLES versions.
origin/SLE12-SP4
origin/SLE15-SP1
origin/SLE15-SP1-UPDATE
...
• Changes propagate automatically among branches.
http://kerncvs.suse.de/
Automatically propagate among branches
Add patch to SUSE kernel-source
• Format the patch from Linus's git tree
cd /torvalds
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git format-patch -1 commit-id
• Add three keywords to the patch, e.g.
Patch-mainline: v4.11-rc1
Git-commit: b891fa5024a95c77e0d6fd6655cb74af6fb77f46
References: bsc#1086695
Note: the patch must include at least one SUSE related e-mail address.
• Set LINUX_GIT environment variable
This variable points to your local Linus git directory, e.g. LINUX_GIT=/torvalds/linux
• Push the patch to SUSE kernel-source, e.g.
git checkout -b users/ghe/SLE12-SP2/for-next origin/SLE12-SP2
./scripts/git_sort/series_insert.py patches.fixes/ocfs2-try-to-reuse-extent-block-in-dealloc-without-m.patch
git add patches.fixes/ocfs2-try-to-reuse-extent-block-in-dealloc-without-m.patch
./scripts/log
git push -v ssh://ghe@kerncvs.suse.de/srv/git/kernel-source.git users/ghe/SLE12-SP2/for-next
• Reference
https://pes.suse.de/L3/Kernel_git_repositories/
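The three keywords and the e-mail requirement described above can be checked before pushing. The sketch below is an illustration only: check_patch_tags is a made-up helper, not part of the kernel-source scripts, and it assumes that a "SUSE related e-mail address" means an address at suse.com or suse.de.

```shell
#!/bin/sh
# check_patch_tags FILE: report which required header items are missing.
# Returns 0 when everything is present, 1 otherwise.
check_patch_tags() {
    patch=$1; rc=0
    for tag in 'Patch-mainline:' 'Git-commit:' 'References:'; do
        grep -q "^$tag" "$patch" || { echo "missing $tag"; rc=1; }
    done
    # Assumption: a SUSE-related address ends in @suse.com or @suse.de.
    grep -Eq '@suse\.(com|de)' "$patch" || { echo "missing SUSE e-mail address"; rc=1; }
    return $rc
}

# Example (the path is a placeholder):
#   check_patch_tags patches.fixes/example.patch
```

Running the check before series_insert.py catches a missing tag while the patch is still easy to amend.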
How to debug ocfs2 hang problem

  • 1. How to debug OCFS2 hang problem - L3 bug handling experience sharing Gang He <ghe@suse.com> Apr 26th, 2019
  • 3. 3 Problem description The customer has setup a new SLES11sp4 2 node cluster and is running some application tests on it, they see the file system periodically hangs up and processes get into a "D" state. All processes stuck in "D" state were in the ocfs2_cluster_lock code. for example, [<ffffffffa066f800>] __ocfs2_cluster_lock+0x3b0/0xa60 [ocfs2] [<ffffffffa0677528>] ocfs2_inode_lock_full_nested+0x178/0x510 [ocfs2] [<ffffffffa06ec791>] ocfs2_get_acl+0x61/0x120 [ocfs2] [<ffffffffa06ec95a>] ocfs2_acl_chmod+0x6a/0xe0 [ocfs2] [<ffffffffa0681121>] ocfs2_setattr+0x671/0xab0 [ocfs2] [<ffffffff8117de8e>] notify_change+0x17e/0x2d0 [<ffffffff8116136c>] sys_fchmodat+0xdc/0x150 [<ffffffff8147c187>] sysenter_dispatch+0x7/0x32 [<ffffffffffffffff>] 0xffffffffffffffff
  • 4. 4 Interact with the customer • Mail communication Make sure the ocfs2 cluster setup is correct. Understand the customer application scenarios. Provide tentative suggestions/patches. • Remote session with the customer Reproduce bug. Find ocfs2 related hung processes. Collect the related data.
  • 5. 5 Collect data from the customer site • supportconfig/hb_report SLES HA cluster related data. • dlm_tool DLM lock related dump. • o2image OCFS2 file system meta-data image. • echo "c" > /proc/sysrq-trigger Linux core dump file.
  • 6. Generate core dump in HA cluster
    • Why is no Linux core dump left after triggering a panic?
      Because the fence mechanism resets the machine while Kdump is still running.
    • Solutions
      1) Use the stonith:fence_kdump resource agent; please refer to the SLE-HA-guide document for more details.
      2) Disable the hardware watchdog and use the soft watchdog; see the detailed steps on the next page.
  • 7. Use soft watchdog temporarily
    • Disable hardware watchdog
      Edit the /etc/modprobe.conf file and add two lines to disable loading of the related kernel modules. (Note: this step depends on your machine's hardware watchdog configuration.)
        blacklist iTCO_wdt
        blacklist iTCO_vendor_support
    • Enable soft watchdog
      Edit the /etc/init.d/boot.local file and add one line to load the soft watchdog kernel module at boot.
        modprobe softdog
    • Reboot the machine for the changes to take effect.
  • 9. Prepare crash analysis environment
    • Crash-setup
      This tool can help you quickly set up a crash analysis environment on the L3 server according to the vmcore file, but access from the Beijing site is very slow, and the HA-related KMP debuginfo/debugsource rpms are missing.
    • By yourself
      Install the related debuginfo/debugsource rpms:
        kernel-default-3.0.101-108.68.1
        kernel-default-devel-3.0.101-108.68.1
        kernel-default-base-3.0.101-108.68.1
        kernel-default-debugsource-3.0.101-108.68.1
        kernel-default-debuginfo-3.0.101-108.68.1
        ocfs2-kmp-default-1.6_3.0.101_63-0.23.40
        ocfs2-debugsource-1.6-3.0.101_63-0.23.40
        ocfs2-debuginfo-1.6-3.0.101_63-0.23.40
  • 11. Verify the problematic directories/files
  • 12. Analyze the hung processes - I
  • 13. Analyze the hung processes - II
  • 14. Check DLM lock dump
    From the DLM lock dumps of the two nodes, we can see that node04 (the master of this DLM lock resource) has granted a PR Meta lock on inode 14797221 (0xe1c9a5) to one process.
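DLM lock dumps show block numbers in hexadecimal, while inode numbers are usually quoted in decimal, so a small conversion helper is handy when matching the two. The sketch below is an illustration only. It assumes the OCFS2 lock resource name is laid out as 1 type character, 6 pad characters, 16 hex digits of block number and 8 hex digits of generation; please verify this against the dlmglue code of your kernel version. Both function names are hypothetical.

```shell
# Convert a hexadecimal block number to decimal.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

# Extract the inode block number (decimal) from an OCFS2 lock resource name.
# Assumed layout: type char (1) + pad (6) + block number (16 hex) + generation (8 hex),
# so the block number occupies characters 8 to 23.
lockname_to_inode() {
    hex_to_dec "$(printf '%s' "$1" | cut -c8-23)"
}
```

For example, `hex_to_dec e1c9a5` prints 14797221, which matches the inode number quoted in the lock dump above.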
  • 15. Analyze the hung processes - III
  • 16. Analyze the hung processes - IV
  • 17. Analyze the hung processes - V
  • 18. Analyze the hung processes - VI
  • 19. Root cause
    The root cause is process 31017, which had acquired the DLM EX lock on the inode (14797222) in ocfs2_setattr() and then tried to acquire the DLM PR lock on the same inode again in ocfs2_get_acl(). This recursive locking led to a deadlock, which in turn blocked the related processes across the cluster.
    The fix patches are as below:
      commit 439a36b8ef38657f765b80b775e2885338d72451
      Author: Eric Ren <zren@suse.com>
      Date: Wed Feb 22 15:40:41 2017 -0800
          ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock
      commit b891fa5024a95c77e0d6fd6655cb74af6fb77f46
      Author: Eric Ren <zren@suse.com>
      Date: Wed Feb 22 15:40:44 2017 -0800
          ocfs2: fix deadlock issue when taking inode lock at vfs entry points
      commit 8818efaaacb78c60a9d90c5705b6c99b75d7d442
      Author: Eric Ren <zren@suse.com>
      Date: Fri Jun 23 15:08:55 2017 -0700
          ocfs2: fix deadlock caused by recursive locking in xattr
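The recursive pattern described above can be spotted in a saved stack trace with a simple text check. The following is a heuristic sketch, not a general deadlock detector: it only recognizes the specific call chain from this case (ocfs2_setattr() further up the stack, ocfs2_get_acl() below it, and the task waiting in __ocfs2_cluster_lock()). The function name `is_recursive_acl_lock` is made up for illustration.

```shell
# Read one kernel stack trace on stdin (innermost frame first, as in
# /proc/<pid>/stack). Exit 0 if the task waits in __ocfs2_cluster_lock()
# via ocfs2_get_acl() while ocfs2_setattr() is an outer caller.
is_recursive_acl_lock() {
    awk '
        /__ocfs2_cluster_lock/ { lock = NR }
        /ocfs2_get_acl/        { acl = NR }
        /ocfs2_setattr/        { setattr = NR }
        END { exit !(lock && acl && setattr && lock < acl && acl < setattr) }
    '
}
```

A typical use would be `cat /proc/31017/stack | is_recursive_acl_lock && echo "recursive inode lock suspected"`, where 31017 is the PID from this case.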
  • 21. The fix process
    • Find kernel patches (from upstream or write them yourself).
    • Test the patches based on the customer version.
      Pass the ocfs2 test suites.
    • Create the fix branch.
      e.g. origin/users/ghe/SLE12-SP4/bsc1128902
    • L3 creates the corresponding PTF rpm.
    • The customer verifies the PTF rpm.
    • Submit the patches to upstream if they are new.
    • Add the patches to SUSE kernel-source.
    • Close the bug in SUSE bugzilla.
  • 22. SUSE kernel source maintenance
    • Kernel-source url: user@kerncvs.suse.de:/home/git/kernel-source.git
      Linux tarball plus lots of patches
    • Kernel url: git://kerncvs.suse.de/kernel.git
      SUSE Linux kernel source (patches applied)
    • Code branches for various SLES versions.
      origin/SLE12-SP4
      origin/SLE15-SP1
      origin/SLE15-SP1-UPDATE
      ...
    • Automatically propagate among branches.
      http://kerncvs.suse.de/
  • 24. Add patch to SUSE kernel-source
    • Format the patch from the Linus git tree
        cd /torvalds
        git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
        cd linux
        git format-patch commit-id -1
    • Add three keywords to the patch, e.g.
        Patch-mainline: v4.11-rc1
        Git-commit: b891fa5024a95c77e0d6fd6655cb74af6fb77f46
        References: bsc#1086695
      Note: the patch must include at least one SUSE-related e-mail address.
    • Set the LINUX_GIT environment variable
      This variable points to your local Linus git directory, e.g.
        LINUX_GIT=/torvalds/linux
    • Push the patch to SUSE kernel-source, e.g.
        git checkout -b users/ghe/SLE12-SP2/for-next origin/SLE12-SP2
        ./scripts/git_sort/series_insert.py patches.fixes/ocfs2-try-to-reuse-extent-block-in-dealloc-without-m.patch
        git add patches.fixes/ocfs2-try-to-reuse-extent-block-in-dealloc-without-m.patch
        ./scripts/log
        git push -v ssh://ghe@kerncvs.suse.de/srv/git/kernel-source.git users/ghe/SLE12-SP2/for-next
    • Reference
      https://pes.suse.de/L3/Kernel_git_repositories/
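Adding the three keywords by hand is easy to get wrong, so the step could be scripted roughly as follows. This is a minimal sketch under one assumption: that the keywords belong in the mail-style header block of the `git format-patch` output, i.e. directly before the first blank line. The function name `add_suse_tags` and its argument order are hypothetical; it writes the tagged patch to standard output and leaves the input file untouched.

```shell
# Usage: add_suse_tags <patch-file> <patch-mainline> <git-commit> <references>
# Prints the patch with the three keyword lines inserted before the
# first blank line (assumed to be the end of the header block).
add_suse_tags() {
    awk -v ml="$2" -v gc="$3" -v refs="$4" '
        !done && /^$/ {
            print "Patch-mainline: " ml
            print "Git-commit: " gc
            print "References: " refs
            done = 1
        }
        { print }
    ' "$1"
}
```

For example, `add_suse_tags 0001-fix.patch v4.11-rc1 b891fa5024a95c77e0d6fd6655cb74af6fb77f46 'bsc#1086695' > patches.fixes/fix.patch`, where both file names are placeholders. The mainline version for a commit can usually be looked up with `git describe --contains <commit-id>` in the Linus tree.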