SlideShare a Scribd company logo
Practicing Linux Crash/Panic
Issue on Production and Cloud
Server
2019 Shanghai Open Source Summit
China
Gavin Guo
Technical Lead - Sustaining Engineering
gavin.guo@canonical.com
Wechat
Migrating KSM page causes the
VM lock up as the KSM page
merging list is too large
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1680513
Enterprise Case Study
Case Description
After numad is enabled and there are several
VMs running on the same host machine, the
softlockup messages can be observed inside the
VMs' dmesg.
CPU: 3 PID: 22468 Comm: kworker/u32:2 Not tainted 4.4.0-47-generic
#68-Ubuntu
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
Workqueue: writeback wb_workfn (flush-252:0)
[<ffffffff81104388>] smp_call_function_many+0x1f8/0x260
[<ffffffff810727d5>] native_flush_tlb_others+0x65/0x150
[<ffffffff81072b35>] flush_tlb_page+0x55/0x90
This one seems a known issue. The bug is proactively handled by Linus when Dave Jones[3]
issued the bug which happened on the bare metal machine. Tinoco[2] also found the bug in
the nested KVM environment which happened when the IPI is sent out in the VCPU and it
seems the problem coming from the LAPIC simulation of VMX. Chris Arges also involved in
the debugging process and the debugging patch was given out by the Ingo Molnar, then Chris
added some hacks to print out the debugging information. Unfortunately, after a long
investigation, the root cause is still unknown.
Investigation on the VM side
[1]. smp/call: Detect stuck CSD locks https://patchwork.kernel.org/patch/6153801/
[2]. smp_call_function_single lockups https://lkml.org/lkml/2015/2/11/247
[3]. frequent lockups in 3.18rc4 https://lkml.org/lkml/2014/11/14/656
I've prepared a hotfix kernel which would resend the IPI and print out
the information when the softlockup happens. Unfortunately, the
hotfix kernel doesn't print out the error message. That means my
original thoughts are incorrect!
The hotfix kernel source:
http://kernel.ubuntu.com/git/gavinguo/ubuntu-xenial.git/log/?h=sf000
103690-csd-lock-debug
Investigation on the VM side
As I cannot find the clue inside the VMs, then
try to investigate the host side.
Host Machine - Hung task Backtrace
# ksmd
crash> bt 615
PID: 615 TASK: ffff881fa174a940 CPU: 15 COMMAND: "ksmd"
#0 [ffff881fa1087cc0] __schedule at ffffffff818207ee
#1 [ffff881fa1087d10] schedule at ffffffff81820ee5
#2 [ffff881fa1087d28] rwsem_down_read_failed at ffffffff81823d60
#3 [ffff881fa1087d98] call_rwsem_down_read_failed at ffffffff813f8324
#4 [ffff881fa1087df8] ksm_scan_thread at ffffffff811e613d
#5 [ffff881fa1087ec8] kthread at ffffffff810a0528
#6 [ffff881fa1087f50] ret_from_fork at ffffffff8182538f
Host Machine - Hung task Backtrace
# khugepaged
crash> bt 616
PID: 616 TASK: ffff881fa1749b80 CPU: 11 COMMAND: "khugepaged"
#0 [ffff881fa108bc60] __schedule at ffffffff818207ee
#1 [ffff881fa108bcb0] schedule at ffffffff81820ee5
#2 [ffff881fa108bcc8] rwsem_down_write_failed at ffffffff81823b32
#3 [ffff881fa108bd50] call_rwsem_down_write_failed at ffffffff813f8353
#4 [ffff881fa108bda8] khugepaged at ffffffff811f58ef
#5 [ffff881fa108bec8] kthread at ffffffff810a0528
#6 [ffff881fa108bf50] ret_from_fork at ffffffff8182538f
Host Machine - Hung task Backtrace
# qemu-system-x86
crash> bt 12555
PID: 12555 TASK: ffff885fa1af6040 CPU: 55 COMMAND: "qemu-system-x86"
#0 [ffff885f9a043a50] __schedule at ffffffff818207ee
#1 [ffff885f9a043aa0] schedule at ffffffff81820ee5
#2 [ffff885f9a043ab8] rwsem_down_read_failed at ffffffff81823d60
#3 [ffff885f9a043b28] call_rwsem_down_read_failed at ffffffff813f8324
#4 [ffff885f9a043b88] kvm_host_page_size at ffffffffc02cfbae [kvm]
#5 [ffff885f9a043ba8] mapping_level at ffffffffc02ead1f [kvm]
#6 [ffff885f9a043bd8] tdp_page_fault at ffffffffc02f0b8a [kvm]
#7 [ffff885f9a043c50] kvm_mmu_page_fault at ffffffffc02ea794 [kvm]
#8 [ffff885f9a043c80] handle_ept_violation at ffffffffc01acda3 [kvm_intel]
#9 [ffff885f9a043cb8] vmx_handle_exit at ffffffffc01afdab [kvm_intel]
#10 [ffff885f9a043d48] vcpu_enter_guest at ffffffffc02e026d [kvm]
#11 [ffff885f9a043dc0] kvm_arch_vcpu_ioctl_run at ffffffffc02e698f [kvm]
#12 [ffff885f9a043e08] kvm_vcpu_ioctl at ffffffffc02ce09d [kvm]
#13 [ffff885f9a043ea0] do_vfs_ioctl at ffffffff81220bef
Host Machine - Hung task Backtrace
We can see that the previous three tasks are waiting on the
mmap_sem. The most interesting part is the backtrace of numad:
crash> bt 2950 The disassembly analysis of numad call stack
#1 [ffff885f8fb4fb78] smp_call_function_many
#2 [ffff885f8fb4fbc0] native_flush_tlb_others
#3 [ffff885f8fb4fc08] flush_tlb_page
#4 [ffff885f8fb4fc30] ptep_clear_flush
#5 [ffff885f8fb4fc60] try_to_unmap_one
#6 [ffff885f8fb4fcd0] rmap_walk_ksm
#7 [ffff885f8fb4fd28] rmap_walk
#8 [ffff885f8fb4fd80] try_to_unmap
#9 [ffff885f8fb4fdc8] migrate_pages
#10 [ffff885f8fb4fe80] do_migrate_pages
Host Machine - Hung task Backtrace
I've tried to disassemble the code and finally find the stable_node->hlist is as long as
2306920 entries(Around 9.2GB memory merged into one page).
rmap_item list(stable_node->hlist):
stable_node: 0xffff881f836ba000 stable_node->hlist->first = 0xffff883f3e5746b0
struct hlist_head {
[0] struct hlist_node *first;
}
struct hlist_node {
[0] struct hlist_node *next;
[8] struct hlist_node **pprev;
}
crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst
$ wc -l rmap_item.lst
2306920 rmap_item.lst
KSM merge list extraction
Introduction to the KSM Stable Tree
(Stable/Unstable tree)
stable_node
stable_node stable_node
stable_nodestable_node
stable_node
stable_node stable_node stable_node
0 1 2 3
/sys/kernel/mm/ksm/merge_across_node=0
rmap_item rmap_item rmap_item
rmap_item
rmap_item rmap_item
rmap_itemrmap_item
rmap_item
rmap_item rmap_item rmap_item
0 1 2 3
root_unstable_tree[nr_node_ids]root_stable_tree[nr_node_ids]
The merge list is as long as
2306920 entries.
Automatic NUMA balancing
Local/Remote access
Cpu 0 Cpu 1 Cpu 2 Cpu 3 Cpu 4 Cpu 5 Cpu 6 Cpu 7
Process A
page pagepage
page page page
page page page
page page page
Process B Process C Process D Process E Process F Process G Process H
Local access
Remote access Node 0 Node 1
According to the memory access latency, it
would be better to migrate Process D to
node 1 and Process E to node 0. The
remote access page by Process A can be
migrated to node 0. However, it would also
need to consider the CPU loading before
migrating the processes.
https://kernel.ubuntu.com/~gavinguo/sf00131845/numa-131845.svg
FlameGraph of the performance problem
When migrating the ksm pages, numad needs to call the IPI to flush
the related TLB entries in CPUs which ever used the PTE.
Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page
deduplication limit
https://www.spinics.net/lists/linux-mm/msg125880.html
80b18dfa53bb ksm: optimize refile of stable_node_dup at the head of the
chain
8dc5ffcd5a74 ksm: swap the two output parameters of chain/chain_prune
0ba1d0f7c41c ksm: cleanup stable_node chain collapse case
b4fecc67cc56 ksm: fix use after free with merge_across_nodes = 0
2c653d0ee2ae ksm: introduce ksm_max_page_sharing per page
deduplication limit
Solution

More Related Content

What's hot

I/O, You Own: Regaining Control of Your Disk in the Presence of Bootkits
I/O, You Own: Regaining Control of Your Disk in the Presence of BootkitsI/O, You Own: Regaining Control of Your Disk in the Presence of Bootkits
I/O, You Own: Regaining Control of Your Disk in the Presence of BootkitsCrowdStrike
 
Linux Kernel Crashdump
Linux Kernel CrashdumpLinux Kernel Crashdump
Linux Kernel Crashdump
Marian Marinov
 
Proxy arp
Proxy arpProxy arp
Proxy arp
Marian Marinov
 
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Kernel Recipes 2019 - RCU in 2019 - Joel FernandesKernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Anne Nicolas
 
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelKernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Anne Nicolas
 
grsecurity and PaX
grsecurity and PaXgrsecurity and PaX
grsecurity and PaX
Kernel TLV
 
System Calls
System CallsSystem Calls
System Calls
David Evans
 
SSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and SchedulingSSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and Scheduling
David Evans
 
Deploying Prometheus stacks with Juju
Deploying Prometheus stacks with JujuDeploying Prometheus stacks with Juju
Deploying Prometheus stacks with Juju
J.J. Ciarlante
 
Semtex.c [CVE-2013-2094] - A Linux Privelege Escalation
Semtex.c [CVE-2013-2094] - A Linux Privelege EscalationSemtex.c [CVE-2013-2094] - A Linux Privelege Escalation
Semtex.c [CVE-2013-2094] - A Linux Privelege Escalation
Kernel TLV
 
Kernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementKernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power Management
Anne Nicolas
 
Kernel Recipes 2015 - Porting Linux to a new processor architecture
Kernel Recipes 2015 - Porting Linux to a new processor architectureKernel Recipes 2015 - Porting Linux to a new processor architecture
Kernel Recipes 2015 - Porting Linux to a new processor architecture
Anne Nicolas
 
Vm ware fuzzing - defcon russia 20
Vm ware fuzzing  - defcon russia 20Vm ware fuzzing  - defcon russia 20
Vm ware fuzzing - defcon russia 20
DefconRussia
 
Linux /proc filesystem for MySQL DBAs - FOSDEM 2021
Linux  /proc filesystem for MySQL DBAs - FOSDEM 2021Linux  /proc filesystem for MySQL DBAs - FOSDEM 2021
Linux /proc filesystem for MySQL DBAs - FOSDEM 2021
Valeriy Kravchuk
 
MariaDB Server on macOS - FOSDEM 2022 MariaDB Devroom
MariaDB Server on macOS -  FOSDEM 2022 MariaDB DevroomMariaDB Server on macOS -  FOSDEM 2022 MariaDB Devroom
MariaDB Server on macOS - FOSDEM 2022 MariaDB Devroom
Valeriy Kravchuk
 
Node day 2014
Node day 2014Node day 2014
Node day 2014
Trevor Norris
 
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomFlame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Valeriy Kravchuk
 
MINCS - containers in the shell script (Eng. ver.)
MINCS - containers in the shell script (Eng. ver.)MINCS - containers in the shell script (Eng. ver.)
MINCS - containers in the shell script (Eng. ver.)
Masami Hiramatsu
 
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacketCsw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
CanSecWest
 
Operating System Engineering Quiz
Operating System Engineering QuizOperating System Engineering Quiz
Operating System Engineering Quiz
Programming Homework Help
 

What's hot (20)

I/O, You Own: Regaining Control of Your Disk in the Presence of Bootkits
I/O, You Own: Regaining Control of Your Disk in the Presence of BootkitsI/O, You Own: Regaining Control of Your Disk in the Presence of Bootkits
I/O, You Own: Regaining Control of Your Disk in the Presence of Bootkits
 
Linux Kernel Crashdump
Linux Kernel CrashdumpLinux Kernel Crashdump
Linux Kernel Crashdump
 
Proxy arp
Proxy arpProxy arp
Proxy arp
 
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Kernel Recipes 2019 - RCU in 2019 - Joel FernandesKernel Recipes 2019 - RCU in 2019 - Joel Fernandes
Kernel Recipes 2019 - RCU in 2019 - Joel Fernandes
 
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernelKernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
Kernel Recipes 2019 - Hunting and fixing bugs all over the Linux kernel
 
grsecurity and PaX
grsecurity and PaXgrsecurity and PaX
grsecurity and PaX
 
System Calls
System CallsSystem Calls
System Calls
 
SSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and SchedulingSSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and Scheduling
 
Deploying Prometheus stacks with Juju
Deploying Prometheus stacks with JujuDeploying Prometheus stacks with Juju
Deploying Prometheus stacks with Juju
 
Semtex.c [CVE-2013-2094] - A Linux Privelege Escalation
Semtex.c [CVE-2013-2094] - A Linux Privelege EscalationSemtex.c [CVE-2013-2094] - A Linux Privelege Escalation
Semtex.c [CVE-2013-2094] - A Linux Privelege Escalation
 
Kernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power ManagementKernel Recipes 2015: Introduction to Kernel Power Management
Kernel Recipes 2015: Introduction to Kernel Power Management
 
Kernel Recipes 2015 - Porting Linux to a new processor architecture
Kernel Recipes 2015 - Porting Linux to a new processor architectureKernel Recipes 2015 - Porting Linux to a new processor architecture
Kernel Recipes 2015 - Porting Linux to a new processor architecture
 
Vm ware fuzzing - defcon russia 20
Vm ware fuzzing  - defcon russia 20Vm ware fuzzing  - defcon russia 20
Vm ware fuzzing - defcon russia 20
 
Linux /proc filesystem for MySQL DBAs - FOSDEM 2021
Linux  /proc filesystem for MySQL DBAs - FOSDEM 2021Linux  /proc filesystem for MySQL DBAs - FOSDEM 2021
Linux /proc filesystem for MySQL DBAs - FOSDEM 2021
 
MariaDB Server on macOS - FOSDEM 2022 MariaDB Devroom
MariaDB Server on macOS -  FOSDEM 2022 MariaDB DevroomMariaDB Server on macOS -  FOSDEM 2022 MariaDB Devroom
MariaDB Server on macOS - FOSDEM 2022 MariaDB Devroom
 
Node day 2014
Node day 2014Node day 2014
Node day 2014
 
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL DevroomFlame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
Flame Graphs for MySQL DBAs - FOSDEM 2022 MySQL Devroom
 
MINCS - containers in the shell script (Eng. ver.)
MINCS - containers in the shell script (Eng. ver.)MINCS - containers in the shell script (Eng. ver.)
MINCS - containers in the shell script (Eng. ver.)
 
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacketCsw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
 
Operating System Engineering Quiz
Operating System Engineering QuizOperating System Engineering Quiz
Operating System Engineering Quiz
 

Similar to Migrating KSM page causes the VM lock up as the KSM page merging list is too large - 2019 OSS China - Gavin Guo

Openstack 101
Openstack 101Openstack 101
Openstack 101
POSSCON
 
Reverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemReverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande Modem
Cyber Security Alliance
 
PFIセミナー資料 H27.10.22
PFIセミナー資料 H27.10.22PFIセミナー資料 H27.10.22
PFIセミナー資料 H27.10.22
Yuya Takei
 
Как понять, что происходит на сервере? / Александр Крижановский (NatSys Lab.,...
Как понять, что происходит на сервере? / Александр Крижановский (NatSys Lab.,...Как понять, что происходит на сервере? / Александр Крижановский (NatSys Lab.,...
Как понять, что происходит на сервере? / Александр Крижановский (NatSys Lab.,...
Ontico
 
How to debug ocfs2 hang problem
How to debug ocfs2 hang problemHow to debug ocfs2 hang problem
How to debug ocfs2 hang problem
Gang He
 
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
Amazon Web Services
 
Linux Crash Dump Capture and Analysis
Linux Crash Dump Capture and AnalysisLinux Crash Dump Capture and Analysis
Linux Crash Dump Capture and Analysis
Paul V. Novarese
 
Osol Pgsql
Osol PgsqlOsol Pgsql
Osol Pgsql
Emanuel Calvo
 
Advanced percona xtra db cluster in a nutshell... la suite plsc2016
Advanced percona xtra db cluster in a nutshell... la suite plsc2016Advanced percona xtra db cluster in a nutshell... la suite plsc2016
Advanced percona xtra db cluster in a nutshell... la suite plsc2016
Frederic Descamps
 
Metasploitable
MetasploitableMetasploitable
Trouble shooting apachecloudstack
Trouble shooting apachecloudstackTrouble shooting apachecloudstack
Trouble shooting apachecloudstackSailaja Sunil
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 Instances
Brendan Gregg
 
A little systemtap
A little systemtapA little systemtap
A little systemtap
yang bingwu
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Anne Nicolas
 
Crash_Report_Mechanism_In_Tizen
Crash_Report_Mechanism_In_TizenCrash_Report_Mechanism_In_Tizen
Crash_Report_Mechanism_In_TizenLex Yu
 
Analisis_avanzado_vmware
Analisis_avanzado_vmwareAnalisis_avanzado_vmware
Analisis_avanzado_vmware
virtualizacionTV
 
Advanced Root Cause Analysis
Advanced Root Cause AnalysisAdvanced Root Cause Analysis
Advanced Root Cause Analysis
Eric Sloof
 
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoringOSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
NETWAYS
 
Vagrant, Ansible, and OpenStack on your laptop
Vagrant, Ansible, and OpenStack on your laptopVagrant, Ansible, and OpenStack on your laptop
Vagrant, Ansible, and OpenStack on your laptop
Lorin Hochstein
 
Operating CloudStack: Sharing My Tool Box @ApacheCon NA'15
Operating CloudStack: Sharing My Tool Box @ApacheCon NA'15Operating CloudStack: Sharing My Tool Box @ApacheCon NA'15
Operating CloudStack: Sharing My Tool Box @ApacheCon NA'15
Remi Bergsma
 

Similar to Migrating KSM page causes the VM lock up as the KSM page merging list is too large - 2019 OSS China - Gavin Guo (20)

Openstack 101
Openstack 101Openstack 101
Openstack 101
 
Reverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande ModemReverse engineering Swisscom's Centro Grande Modem
Reverse engineering Swisscom's Centro Grande Modem
 
PFIセミナー資料 H27.10.22
PFIセミナー資料 H27.10.22PFIセミナー資料 H27.10.22
PFIセミナー資料 H27.10.22
 
Как понять, что происходит на сервере? / Александр Крижановский (NatSys Lab.,...
Как понять, что происходит на сервере? / Александр Крижановский (NatSys Lab.,...Как понять, что происходит на сервере? / Александр Крижановский (NatSys Lab.,...
Как понять, что происходит на сервере? / Александр Крижановский (NatSys Lab.,...
 
How to debug ocfs2 hang problem
How to debug ocfs2 hang problemHow to debug ocfs2 hang problem
How to debug ocfs2 hang problem
 
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
 
Linux Crash Dump Capture and Analysis
Linux Crash Dump Capture and AnalysisLinux Crash Dump Capture and Analysis
Linux Crash Dump Capture and Analysis
 
Osol Pgsql
Osol PgsqlOsol Pgsql
Osol Pgsql
 
Advanced percona xtra db cluster in a nutshell... la suite plsc2016
Advanced percona xtra db cluster in a nutshell... la suite plsc2016Advanced percona xtra db cluster in a nutshell... la suite plsc2016
Advanced percona xtra db cluster in a nutshell... la suite plsc2016
 
Metasploitable
MetasploitableMetasploitable
Metasploitable
 
Trouble shooting apachecloudstack
Trouble shooting apachecloudstackTrouble shooting apachecloudstack
Trouble shooting apachecloudstack
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 Instances
 
A little systemtap
A little systemtapA little systemtap
A little systemtap
 
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all startedKernel Recipes 2019 - ftrace: Where modifying a running kernel all started
Kernel Recipes 2019 - ftrace: Where modifying a running kernel all started
 
Crash_Report_Mechanism_In_Tizen
Crash_Report_Mechanism_In_TizenCrash_Report_Mechanism_In_Tizen
Crash_Report_Mechanism_In_Tizen
 
Analisis_avanzado_vmware
Analisis_avanzado_vmwareAnalisis_avanzado_vmware
Analisis_avanzado_vmware
 
Advanced Root Cause Analysis
Advanced Root Cause AnalysisAdvanced Root Cause Analysis
Advanced Root Cause Analysis
 
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoringOSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
OSDC 2017 - Werner Fischer - Linux performance profiling and monitoring
 
Vagrant, Ansible, and OpenStack on your laptop
Vagrant, Ansible, and OpenStack on your laptopVagrant, Ansible, and OpenStack on your laptop
Vagrant, Ansible, and OpenStack on your laptop
 
Operating CloudStack: Sharing My Tool Box @ApacheCon NA'15
Operating CloudStack: Sharing My Tool Box @ApacheCon NA'15Operating CloudStack: Sharing My Tool Box @ApacheCon NA'15
Operating CloudStack: Sharing My Tool Box @ApacheCon NA'15
 

Recently uploaded

WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
Event Management System Vb Net Project Report.pdf
Event Management System Vb Net  Project Report.pdfEvent Management System Vb Net  Project Report.pdf
Event Management System Vb Net Project Report.pdf
Kamal Acharya
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
Kamal Acharya
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
Kamal Acharya
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 

Recently uploaded (20)

WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Event Management System Vb Net Project Report.pdf
Event Management System Vb Net  Project Report.pdfEvent Management System Vb Net  Project Report.pdf
Event Management System Vb Net Project Report.pdf
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 

Migrating KSM page causes the VM lock up as the KSM page merging list is too large - 2019 OSS China - Gavin Guo

  • 1. Practicing Linux Crash/Panic Issue on Production and Cloud Server 2019 Shanghai Open Source Summit China Gavin Guo Technical Lead - Sustaining Engineering gavin.guo@canonical.com Wechat
  • 2. Migrating KSM page causes the VM lock up as the KSM page merging list is too large https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1680513 Enterprise Case Study
  • 3. Case Description After numad is enabled and there are several VMs running on the same host machine, the softlockup messages can be observed inside the VMs' dmesg. CPU: 3 PID: 22468 Comm: kworker/u32:2 Not tainted 4.4.0-47-generic #68-Ubuntu Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Workqueue: writeback wb_workfn (flush-252:0) [<ffffffff81104388>] smp_call_function_many+0x1f8/0x260 [<ffffffff810727d5>] native_flush_tlb_others+0x65/0x150 [<ffffffff81072b35>] flush_tlb_page+0x55/0x90
  • 4. This one seems a known issue. The bug is proactively handled by Linus when Dave Jones[3] issued the bug which happened on the bare metal machine. Tinoco[2] also found the bug in the nested KVM environment which happened when the IPI is sent out in the VCPU and it seems the problem coming from the LAPIC simulation of VMX. Chris Arges also involved in the debugging process and the debugging patch was given out by the Ingo Molnar, then Chris added some hacks to print out the debugging information. Unfortunately, after a long investigation, the root cause is still unknown. Investigation on the VM side [1]. smp/call: Detect stuck CSD locks https://patchwork.kernel.org/patch/6153801/ [2]. smp_call_function_single lockups https://lkml.org/lkml/2015/2/11/247 [3]. frequent lockups in 3.18rc4 https://lkml.org/lkml/2014/11/14/656
  • 5. I've prepared a hotfix kernel which would resend the IPI and print out the information when the softlockup happens. Unfortunately, the hotfix kernel doesn't print out the error message. That means my original thoughts are incorrect! The hotfix kernel source: http://kernel.ubuntu.com/git/gavinguo/ubuntu-xenial.git/log/?h=sf000 103690-csd-lock-debug Investigation on the VM side
  • 6. As I cannot find the clue inside the VMs, then try to investigate the host side. Host Machine - Hung task Backtrace
  • 7. # ksmd crash> bt 615 PID: 615 TASK: ffff881fa174a940 CPU: 15 COMMAND: "ksmd" #0 [ffff881fa1087cc0] __schedule at ffffffff818207ee #1 [ffff881fa1087d10] schedule at ffffffff81820ee5 #2 [ffff881fa1087d28] rwsem_down_read_failed at ffffffff81823d60 #3 [ffff881fa1087d98] call_rwsem_down_read_failed at ffffffff813f8324 #4 [ffff881fa1087df8] ksm_scan_thread at ffffffff811e613d #5 [ffff881fa1087ec8] kthread at ffffffff810a0528 #6 [ffff881fa1087f50] ret_from_fork at ffffffff8182538f Host Machine - Hung task Backtrace
  • 8. # khugepaged crash> bt 616 PID: 616 TASK: ffff881fa1749b80 CPU: 11 COMMAND: "khugepaged" #0 [ffff881fa108bc60] __schedule at ffffffff818207ee #1 [ffff881fa108bcb0] schedule at ffffffff81820ee5 #2 [ffff881fa108bcc8] rwsem_down_write_failed at ffffffff81823b32 #3 [ffff881fa108bd50] call_rwsem_down_write_failed at ffffffff813f8353 #4 [ffff881fa108bda8] khugepaged at ffffffff811f58ef #5 [ffff881fa108bec8] kthread at ffffffff810a0528 #6 [ffff881fa108bf50] ret_from_fork at ffffffff8182538f Host Machine - Hung task Backtrace
  • 9. # qemu-system-x86 crash> bt 12555 PID: 12555 TASK: ffff885fa1af6040 CPU: 55 COMMAND: "qemu-system-x86" #0 [ffff885f9a043a50] __schedule at ffffffff818207ee #1 [ffff885f9a043aa0] schedule at ffffffff81820ee5 #2 [ffff885f9a043ab8] rwsem_down_read_failed at ffffffff81823d60 #3 [ffff885f9a043b28] call_rwsem_down_read_failed at ffffffff813f8324 #4 [ffff885f9a043b88] kvm_host_page_size at ffffffffc02cfbae [kvm] #5 [ffff885f9a043ba8] mapping_level at ffffffffc02ead1f [kvm] #6 [ffff885f9a043bd8] tdp_page_fault at ffffffffc02f0b8a [kvm] #7 [ffff885f9a043c50] kvm_mmu_page_fault at ffffffffc02ea794 [kvm] #8 [ffff885f9a043c80] handle_ept_violation at ffffffffc01acda3 [kvm_intel] #9 [ffff885f9a043cb8] vmx_handle_exit at ffffffffc01afdab [kvm_intel] #10 [ffff885f9a043d48] vcpu_enter_guest at ffffffffc02e026d [kvm] #11 [ffff885f9a043dc0] kvm_arch_vcpu_ioctl_run at ffffffffc02e698f [kvm] #12 [ffff885f9a043e08] kvm_vcpu_ioctl at ffffffffc02ce09d [kvm] #13 [ffff885f9a043ea0] do_vfs_ioctl at ffffffff81220bef Host Machine - Hung task Backtrace
  • 10. We can see that the previous three tasks are waiting on the mmap_sem. The most interesting part is the backtrace of numad: crash> bt 2950 The disassembly analysis of numad call stack #1 [ffff885f8fb4fb78] smp_call_function_many #2 [ffff885f8fb4fbc0] native_flush_tlb_others #3 [ffff885f8fb4fc08] flush_tlb_page #4 [ffff885f8fb4fc30] ptep_clear_flush #5 [ffff885f8fb4fc60] try_to_unmap_one #6 [ffff885f8fb4fcd0] rmap_walk_ksm #7 [ffff885f8fb4fd28] rmap_walk #8 [ffff885f8fb4fd80] try_to_unmap #9 [ffff885f8fb4fdc8] migrate_pages #10 [ffff885f8fb4fe80] do_migrate_pages Host Machine - Hung task Backtrace
  • 11. I've tried to disassemble the code and finally find the stable_node->hlist is as long as 2306920 entries(Around 9.2GB memory merged into one page). rmap_item list(stable_node->hlist): stable_node: 0xffff881f836ba000 stable_node->hlist->first = 0xffff883f3e5746b0 struct hlist_head { [0] struct hlist_node *first; } struct hlist_node { [0] struct hlist_node *next; [8] struct hlist_node **pprev; } crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst $ wc -l rmap_item.lst 2306920 rmap_item.lst KSM merge list extraction
  • 12. Introduction to the KSM Stable Tree (Stable/Unstable tree) stable_node stable_node stable_node stable_nodestable_node stable_node stable_node stable_node stable_node 0 1 2 3 /sys/kernel/mm/ksm/merge_across_node=0 rmap_item rmap_item rmap_item rmap_item rmap_item rmap_item rmap_itemrmap_item rmap_item rmap_item rmap_item rmap_item 0 1 2 3 root_unstable_tree[nr_node_ids]root_stable_tree[nr_node_ids] The merge list is as long as 2306920 entries.
  • 13. Automatic NUMA balancing Local/Remote access Cpu 0 Cpu 1 Cpu 2 Cpu 3 Cpu 4 Cpu 5 Cpu 6 Cpu 7 Process A page pagepage page page page page page page page page page Process B Process C Process D Process E Process F Process G Process H Local access Remote access Node 0 Node 1 According to the memory access latency, it would be better to migrate Process D to node 1 and Process E to node 0. The remote access page by Process A can be migrated to node 0. However, it would also need to consider the CPU loading before migrating the processes.
  • 14. https://kernel.ubuntu.com/~gavinguo/sf00131845/numa-131845.svg FlameGraph of the performance problem When migrating the ksm pages, numad needs to call the IPI to flush the related TLB entries in CPUs which ever used the PTE.
  • 15. Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit https://www.spinics.net/lists/linux-mm/msg125880.html 80b18dfa53bb ksm: optimize refile of stable_node_dup at the head of the chain 8dc5ffcd5a74 ksm: swap the two output parameters of chain/chain_prune 0ba1d0f7c41c ksm: cleanup stable_node chain collapse case b4fecc67cc56 ksm: fix use after free with merge_across_nodes = 0 2c653d0ee2ae ksm: introduce ksm_max_page_sharing per page deduplication limit Solution