Linux Memory Management
Kamal Maiti
Sr. Linux System Engineer
Amdocs DVCI, Pune, India
AGENDA
 Basic concept of computer
 Hardware, firmware, driver, software, application
 CPU, RAM, How RAM used
 Moving Information within Computer
 Primary & Other Memory,
 Segment of RAM
 Memory Mapping, Process Address Space
 Page, Frame, Hugepage, MMU etc.
 Virtual Memory, PageCache
 Memory nodes, zones, lowmem
 NUMA
 Kernel Memory allocator
 Pagefault Handling, Tools, Memory leak, Memory related issues
 Hands-on Troubleshooting : sysrq, backtrace analysis, OOM messages investigation etc
BASIC CONCEPTS OF COMPUTER HARDWARE
 This model of the typical digital computer is often called the von
Neumann computer.
 Programs and data are stored in the same memory: primary memory
[Figure: block diagram showing the CPU (Central Processing Unit), Input Units, Output Units and Primary Memory]
HARDWARE, FIRMWARE, DRIVER, SOFTWARE, APPLICATION
Hardware : All physical computer devices, such as input and output
devices, motherboard, mouse, keyboard
Firmware : Vendor-provided low-level code that
interacts with the hardware to carry out the instructions
passed to the device.
Driver : Sits on top of the firmware and is used to interact with
the firmware or with the hardware directly.
Software/Application : Uses system calls
to call the kernel, and the kernel interacts with the driver to get the
output.
CPU
 The three major components of the CPU are:
1. Arithmetic Unit (Computations performed)
Accumulator (Results of computations kept here)
2. Control Unit (Has two locations where numbers are kept)
Instruction Register (Instruction placed here for
analysis)
Program Counter (Which instruction will be
performed next?)
3. Instruction Decoding Unit (Decodes the instruction)
 Motherboard: The place where most of the electronics including
the CPU are mounted.
RAM
 Commonly known as random access memory, or just
RAM
 Holds instructions and data needed for programs that
are currently running
 RAM is usually a volatile type of memory
 Contents of RAM are lost when power is turned off
HOW IS RAM USED?
Memory is used to store:
 i) Instructions -> needed to execute a program
 ii) Data -> While the computer is doing any job, the data that
has to be processed is stored in primary memory. This
data may come from an input device such as a keyboard or from a
secondary storage device such as a floppy disk.
MOVING INFORMATION WITHIN THE COMPUTER
 How do binary numerals move into, out of, and within the computer?
 Information is moved about in bytes, or multiple bytes called
words.
 Words are the fundamental units of information.
 The number of bits per word may vary per computer.
 The word length of most large IBM computers is 32 bits.
MOVING INFORMATION WITHIN THE COMPUTER …
 Bits that compose a word are passed in parallel from place to
place.
 Ribbon cables:
 Consist of several wires, molded together.
 One wire for each bit of the word or byte.
 Additional wires coordinate the activity of moving
information.
 Each wire sends information in the form of a voltage
pulse.
MOVING INFORMATION WITHIN THE COMPUTER …
 Example of sending the word WOW over the ribbon cable
 Voltage pulses corresponding to the ASCII codes would pass
through the cable.
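The bit patterns that would travel over the cable for the word WOW can be sketched as follows. This is only an illustration: the helper name `word_to_bits` is made up for this example, and it assumes one 8-bit byte per ASCII character.

```python
def word_to_bits(text):
    """Return the 8-bit binary pattern of each character's ASCII code."""
    return [format(ord(ch), "08b") for ch in text]


# Each bit of a byte would travel on its own wire of the ribbon cable.
for ch, bits in zip("WOW", word_to_bits("WOW")):
    print(ch, ord(ch), bits)
```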
PRIMARY MEMORY
 Primary storage or memory: Where the data & program that are
currently in operation or being accessed are stored during use.
 Consists of electronic circuits: Extremely fast and expensive.
 Two types:
 RAM (non-permanent)
 Programs and data can be stored here for the
computer’s use.
 Volatile: All information will be lost once the computer
shuts down.
 ROM (permanent)
 Contents do not change.
OTHER MEMORY
 ROM : Read-only memory built from transistors [used for storing video game software and in electronic musical
instruments]. ROM is mostly used to hold firmware.
 EPROM : Erasable Programmable Read-Only Memory
 EEPROM : Electrically Erasable Programmable Read-Only Memory
 Cache : Location in RAM where data is kept for a certain amount of time so
that it can be reused.
 Registers : Built from flip-flops [RS, D, JK, shift registers etc.]; they hold information
 Swap : Disk space used to meet the demand for more memory than the available RAM.
SEGMENT OF RAM
 Low mem, high mem, Normal mem, DMA, DMA32
 On a 32-bit architecture [DMA, Normal & HighMem] : the
address space range for addressing RAM is:
0x00000000 - 0xffffffff (4,294,967,295), i.e. 4 GB.
The user space range: 0x00000000 - 0xbfffffff or 3 GB
The kernel space range: 0xc0000000 - 0xffffffff or 1 GB
Linux reaches physical memory through the 1 GB kernel space in 2 pieces: LOWMEM (mapped permanently) and HIGHMEM (mapped temporarily on demand).
 On a 64-bit machine [DMA, DMA32 & Normal] : Normal
memory is everything beyond 4 GB
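The 3 GB / 1 GB split can be checked with a few lines of arithmetic. This is a minimal sketch that only uses the address ranges given on this slide; the helper names are made up for the example.

```python
GIB = 1 << 30  # bytes in one gigabyte (binary)

# Address ranges of the 32-bit split as given on the slide.
USER_START, USER_END = 0x00000000, 0xBFFFFFFF
KERNEL_START, KERNEL_END = 0xC0000000, 0xFFFFFFFF


def range_size(start, end):
    """Number of bytes in the inclusive address range [start, end]."""
    return end - start + 1


def is_kernel_address(addr):
    """True if a 32-bit virtual address falls into the kernel range."""
    return KERNEL_START <= addr <= KERNEL_END


print("user space  :", range_size(USER_START, USER_END) // GIB, "GB")
print("kernel space:", range_size(KERNEL_START, KERNEL_END) // GIB, "GB")
```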
MEMORY MAPPING
 Linux uses only 4 segments on 32-bit architectures:
 2 segments (code and data/stack) for KERNEL SPACE from [0xC000 0000] (3 GB) to [0xFFFF FFFF] (4 GB)
 2 segments (code and data/stack) for USER SPACE from [0] (0 GB) to [0xBFFF FFFF] (3 GB)
See virtual map : $ pmap <PID> , see stack : $ pstack <PID>
 Segmentation, Paging [to overcome the flaws in segmentation] :
 Paging allocates small virtual pages to each process so that they fit in RAM without wasting it.
PROCESS ADDRESS SPACE – 32 BIT ARCH
[Figure: process address space layout. Kernel Space spans 3 GB --> 4 GB and starts at 0xC0000000; User Space spans 0 GB --> 3 GB.]
Regions shown in User Space: File name and Environment, Arguments, Stack, Unused Memory, Shared Libs, Heap, Bss [Block Started by Symbol] (between _bss_start and _end), Data (ends at _edata), Text/code (ends at _etext), Header
Address label in the figure: 0x84000000
Text/Code Segment: contains the actual code
Data: contains initialized global variables
BSS: contains uninitialized global variables
Heap: dynamic memory
Stack: collection of frames, one per function call
PAGE & FRAME
 Paging, Demand Paging, Swapping
 Page Tables [4 levels on 64-bit, 2 levels on 32-bit]: Page Global Directory, Page Upper Directory,
Page Middle Directory, Page Table
 Min page size : getconf -a | grep -i page
 Life cycle of a page: active list --> inactive list --> dirty --> clean
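How an address maps onto pages can be sketched with simple integer arithmetic. The example below assumes a 4096-byte page for the worked numbers (the real value on a machine is what `getconf` or `mmap.PAGESIZE` reports), and the function names are made up for illustration.

```python
import mmap

# Page size of the machine running this script.
print("system page size:", mmap.PAGESIZE, "bytes")


def split_address(vaddr, page_size=4096):
    """Split a virtual address into (page number, offset within the page)."""
    return divmod(vaddr, page_size)


def pages_needed(nbytes, page_size=4096):
    """Number of whole pages required to hold nbytes (rounded up)."""
    return -(-nbytes // page_size)


page, offset = split_address(0x12345)
print(hex(page), hex(offset))
```

A 4097-byte buffer needs two pages with this rounding, which is the internal waste that paging accepts in exchange for fixed-size frames.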
SWAP, HUGE PAGE, MMU,TLB
 SWAP : Not all pages fit in RAM, so pages have to be moved to and from the storage
disk
 Hugepage : the default page size is 4 KB, but large programs use large chunks of memory.
Hence, larger pages are allowed. [sysctl -a | grep -i huge]
 MMU/TLB : The MMU is responsible for translating logical addresses to physical addresses. The TLB is a cache
of recent translations that is used by the MMU.
 Active/Inactive regions [cat /proc/meminfo]
 Shmem : shared memory area [ipcs -m]
 Buddyinfo : view memory fragmentation/allocation [cat /proc/buddyinfo]
 Cache : Used for speeding up I/O; sync flushes dirty data and forces it to be written to disk, bdflush does this
in the background [flush-253:0 in RHEL 6]
The buffer policy is first-in, first-out
The cache policy is Least Recently Used [LRU] [$ vmstat -S M 1]
VIRTUAL MEMORY, HOW PROGRAM MAPS?
 Executable text
 Executable data
 Heap space
 Stack
 Get the exact memory used by a process :
 $ pmap -x <pid>
 $ cat /proc/<pid>/status
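The Vm* lines of /proc/<pid>/status can be pulled out with a short parser. This is a sketch only: the sample text and its numbers are invented for illustration, and `parse_vm_fields` is a hypothetical helper name. On a live system the text would come from reading /proc/<pid>/status.

```python
# Invented sample in the "Key:<tab>value kB" layout of /proc/<pid>/status.
SAMPLE_STATUS = (
    "Name:\texample_process\n"
    "VmPeak:\t   12000 kB\n"
    "VmSize:\t   11000 kB\n"
    "VmRSS:\t    3400 kB\n"
    "VmSwap:\t       0 kB\n"
    "Threads:\t1\n"
)


def parse_vm_fields(status_text):
    """Return a dict of the Vm* fields (values in kB) from status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Vm"):
            fields[key] = int(value.split()[0])
    return fields


for name, kb in parse_vm_fields(SAMPLE_STATUS).items():
    print(f"{name:8s} {kb:8d} kB")
```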
PAGE CACHE MEMORY CONTROL
 vm.dirty_expire_centisecs=2000 // how old dirty data must be before it is eligible for writeout
 vm.dirty_writeback_centisecs=400 // how long the flusher threads wait between wakeups
 vm.dirty_background_ratio=5 // percentage of memory filled with dirty pages at which the pdflush/flush daemon will
start writing dirty data to disk
 vm.dirty_ratio=20 // percentage of memory filled with dirty pages at which the writing process itself will start writing data to disk
 vfs_cache_pressure [100] : controls the tendency of the kernel to reclaim the memory which is
used for caching of directory and inode objects
 Swappiness [60] : controls how aggressively the kernel will use swap space.
 To free caches:
To free pagecache: echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes : echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches
 Cache writes are done by : kernel thread pdflush/bdflush; in RHEL 6 it is flush.
 Life cycle of pages :
active list --> inactive list --> dirty --> clean
Link : https://www.kernel.org/doc/Documentation/sysctl/vm.txt
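To get a feel for the two dirty ratios, the thresholds can be estimated in megabytes. This is a rough sketch under a stated assumption: it applies the percentages to a single "available memory" figure supplied by the caller, whereas the kernel computes that base itself. The function name and the 8000 MB figure are made up for the example.

```python
def dirty_thresholds(available_mb, background_ratio=5, dirty_ratio=20):
    """Return (background threshold, blocking threshold) in MB.

    background: dirty data level at which the flush daemon starts writing.
    blocking:   dirty data level at which the writing process itself writes.
    """
    background = available_mb * background_ratio / 100
    blocking = available_mb * dirty_ratio / 100
    return background, blocking


bg, hard = dirty_thresholds(8000)
print(f"background writeback starts near {bg:.0f} MB of dirty data")
print(f"writers start flushing near {hard:.0f} MB of dirty data")
```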
PHYSICAL MEMORY ALLOCATION LIMIT
 CommitLimit : total memory that can be allocated, based on overcommit_ratio
 Committed_AS : memory currently allocated
 overcommit_memory : from 0 to 2
0 = heuristic overcommit: obvious overcommits of memory are refused //default
1 = always overcommit, no overcommit checking
2 = strict accounting: allocations are limited based on overcommit_ratio
 overcommit_ratio : % of RAM counted when overcommit_memory is set to 2, default value 50
Example : 4 GB RAM, 2 GB Swap, overcommit_memory=2, overcommit_ratio=50 , so
CommitLimit = 2 + (4*50/100) = 2 + 2 = 4 GB
Issue : Application failed to start due to shortage of memory, Needed to disable
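The CommitLimit example above can be reproduced in a few lines. This is a simplified sketch: it follows the slide's formula (swap plus a percentage of RAM) and ignores huge pages, which the kernel leaves out of the RAM part. The function name is made up for the example.

```python
def commit_limit_gb(ram_gb, swap_gb, overcommit_ratio=50):
    """CommitLimit when overcommit_memory=2: swap + RAM * ratio / 100."""
    return swap_gb + ram_gb * overcommit_ratio / 100


# Slide example: 4 GB RAM, 2 GB swap, overcommit_ratio=50
print(commit_limit_gb(ram_gb=4, swap_gb=2, overcommit_ratio=50), "GB")
```

Comparing this figure with Committed_AS from /proc/meminfo shows how close the system is to refusing new allocations in mode 2.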
WHY MEMORY CACHE IS REALLY REQUIRED
Speed up processing :
 $ cat > XYZ
 $ echo 3 > /proc/sys/vm/drop_caches
 $ time cat XYZ // takes more time: the file is read from disk
 $ time cat XYZ // takes less time: the file is served from the page cache
MEMORY NODES, ZONES IN 32 BIT & 64 BIT
 Zones on 32-bit :
 ZONE_DMA (0-16MB)
 ZONE_NORMAL (16MB-896MB)
 ZONE_HIGHMEM (896MB and above)
HIGHMEM's lower zones are NORMAL+DMA, NORMAL's lower zone is DMA.
 Zones on 64-bit :
 DMA : up to 16 MB
 DMA32 : up to 4 GB
 Normal : beyond 4 GB
 $ cat /proc/zoneinfo
 $ cat /proc/pagetypeinfo
 $ cat /proc/<pid>/numa_maps
 $ cat /proc/buddyinfo
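A line of /proc/buddyinfo lists, per zone, the number of free blocks of each order (order n is a block of 2^n contiguous pages). The sketch below turns one such line into a free-page count. The sample line and its numbers are invented for illustration, and `parse_buddyinfo_line` is a hypothetical helper name.

```python
# Invented sample line; columns after the zone name are orders 0, 1, 2, ...
SAMPLE_LINE = "Node 0, zone      DMA      1      1      0      0      2      1      1      0      1      1      3"


def parse_buddyinfo_line(line):
    """Return (node, zone name, total free pages) for one buddyinfo line."""
    parts = line.split()
    node = int(parts[1].rstrip(","))      # "0," -> 0
    zone = parts[3]
    counts = [int(c) for c in parts[4:]]  # free blocks per order
    free_pages = sum(count << order for order, count in enumerate(counts))
    return node, zone, free_pages


print(parse_buddyinfo_line(SAMPLE_LINE))
```

Many free pages spread over low orders with zeros in the high orders is the usual sign of fragmentation.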
LOW MEMORY, ZONE_RECLAIM
 "lowmem" often means NORMAL+DMA
 "lowmem" is not present in RHEL 6, 64-bit
 Reservation is controlled by : lowmem_reserve_ratio [DMA NORMAL HIGHMEM]
 cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32 // (1/256)*100 % = 0.39% of the nearest zone is reserved
 zone_reclaim_mode : how aggressively memory is reclaimed
when a zone runs out of memory; the values below can be ORed together
1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
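Because the zone_reclaim_mode values are bit flags, a setting such as 5 combines two behaviours. The sketch below decodes a value using the three meanings listed above and reproduces the 0.39% figure using the slide's reading of lowmem_reserve_ratio. The function names are made up for the example.

```python
# Bit flags of vm.zone_reclaim_mode as listed on the slide.
ZONE_RECLAIM_FLAGS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(value):
    """Return the list of behaviours enabled by a zone_reclaim_mode value."""
    return [name for bit, name in ZONE_RECLAIM_FLAGS.items() if value & bit]


def reserve_percent(ratio):
    """Reserved share implied by a lowmem_reserve_ratio entry: (1/ratio)*100."""
    return 100 / ratio


print(decode_zone_reclaim_mode(5))
print(round(reserve_percent(256), 2), "%")
```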
NON-UNIFORM MEMORY ACCESS (NUMA)
 NUMA concept :
NUMA placement – placement of processor & memory; manual – by the application,
MPI (Message Passing Interface)
 Place the application on the correct node
 Two memory policies – Node Local [after Linux boot], Interleave [during kernel boot]
 cat /proc/<pid>/numa_maps
 numactl -s //show policy
 numactl --hardware
 numactl [ --interleave nodes ] [ --preferred node ] [ --membind nodes ] [ --cpunodebind nodes ] [ --physcpubind cpus ] [ --localalloc ] [--] command {arguments ...}
Ref : http://www.redhat.com/summit/2012/pdf/2012-DevDay-Lab-NUMA-Hacker.pdf
NUMA MANAGEMENT
 numactl --physcpubind=0,1,2,3 example_process
 numactl --physcpubind=0-3 example_process
 numactl --cpunodebind=2 example_process //run on the CPUs of node 2
 numactl --physcpubind=0 --localalloc example_process
 numactl --membind=4 example_process
 numactl --cpunodebind=0 example_process //only execute the command on the CPUs of node 0
 numactl --cpubind=0 --membind=0,1 process // run process on node 0 with memory allocated on
nodes 0 and 1
 numactl --hardware
 cat /sys/devices/system/node/node*/numastat
 Allocation : $ watch -n1 numastat
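The per-node numastat files hold simple "name value" counters, so a quick miss ratio can be computed from them. This is a sketch with invented sample numbers; `parse_numastat` and `miss_ratio` are hypothetical helper names. On a live system the text would come from /sys/devices/system/node/node0/numastat.

```python
# Invented sample in the "counter value" layout of a node's numastat file.
SAMPLE_NUMASTAT = """\
numa_hit 900
numa_miss 100
numa_foreign 0
interleave_hit 10
local_node 890
other_node 110
"""


def parse_numastat(text):
    """Return a dict mapping counter name to integer value."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            name, value = line.split()
            stats[name] = int(value)
    return stats


def miss_ratio(stats):
    """Share of allocations that landed on this node although another was preferred."""
    total = stats["numa_hit"] + stats["numa_miss"]
    return stats["numa_miss"] / total if total else 0.0


print(f"miss ratio: {miss_ratio(parse_numastat(SAMPLE_NUMASTAT)):.1%}")
```

A miss ratio that keeps growing suggests the process is not placed on the node that holds its memory.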
KERNEL MEMORY ALLOCATORS
 Low-level page allocator :
 Buddy system for contiguous multi-page allocations
 Provides pages for
 in-kernel allocations (slab cache)
 vmalloc areas (kernel modules, multi-page data areas)
 page cache, anonymous user pages
 misc. other users
 Slab cache :
 Manages allocations of objects of the same type
 Large-scale users: inodes, dentries, block I/O, network ...
 kmalloc (generic allocator) implemented on top
 Tool : slabtop
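The buddy system hands out blocks of 2^order contiguous pages, and each block has a "buddy" it can merge with when both are free. The arithmetic behind that can be sketched as below. This is an illustration of the idea only, not kernel code, and the function names are made up.

```python
def order_for_pages(n_pages):
    """Smallest order whose block (2**order pages) can hold n_pages."""
    if n_pages < 1:
        raise ValueError("n_pages must be at least 1")
    return (n_pages - 1).bit_length()


def buddy_of(pfn, order):
    """Page frame number of the buddy of the block starting at pfn."""
    return pfn ^ (1 << order)


# A request for 5 pages is rounded up to an order-3 block (8 pages).
print(order_for_pages(5))
# The order-2 block at frame 8 pairs with the block at frame 12.
print(buddy_of(8, 2))
```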
PAGE FAULT HANDLING
 Hardware support :
 Accessing invalid pages causes 'page translation' check
 Writing to protected pages causes 'protection exception'
 Translation-exception identification provides address
 'Suppression on protection' facility essential!
 Linux kernel page fault handler :
 Determine address/access validity according to VMA
 Invalid accesses cause SIGSEGV delivery
 Valid accesses trigger: page-in, swap-in, copy-on-write
 Extra support for stack VMA: grows automatically
 Out-of-memory if overcommitted causes SIGBUS
TOOLS TO CHECK MEMORY USAGE
 Report paging statistics : sar -B
 Report memory utilization statistics : sar -r
 Report memory statistics : sar -R
 Report swap space utilization statistics : sar -S
 Current memory usage :
 free -m | -k | -g
 cat /proc/meminfo
 Memory allocation :
 cat /proc/buddyinfo
 VM memory allocation :
 pmap -x <PID>
 cat /proc/<pid>/status
 Display kernel slab cache & memory information in real time:
 slabtop
 vmstat
 ps
 top
 cat /proc/meminfo
 strace, gcore
MEMORY LEAK CHECK
 Usage check : historical sar report
 mtrace : built-in glibc function.
 Valgrind :
 valgrind --tool=memcheck --leak-check=full --show-reachable=yes snmpd -f -Lo
ISSUES RELATED TO MEMORY
 TCP/IP communication delay – RH cluster broken
 High cache usage : slows down the application / system
 Memory pressure : memory leak, application not tuned properly
 Memory fragmentation : hugepages not used
 OOM killer kills application : memory pressure; the OOM killer is enabled
by default and kills based on a badness value.
 Segmentation fault : Kernel reclaims in normal/low memory
region, hence no room for kernel, encounters segmentation
fault.
 Faulty memory : hardware failure or circuit failure in a chip; needs
diagnosis and replacement of the chip
TROUBLESHOOTING MEMORY ISSUE
 Memory & swap usage test :
swap_tendency = mapped_ratio/2 + distress + vm_swappiness
mapped_ratio = % of physical memory mapped into process address spaces
distress = how much trouble the kernel is having freeing memory
vm_swappiness = default 60
swap_tendency >= 100 : mapped pages are eligible for swap
swap_tendency < 100 : reclaim from page cache
 Sysrq :
echo 1 > /proc/sys/kernel/sysrq
echo m > /proc/sysrq-trigger
 backtrace analysis
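The swap_tendency test from this slide can be tried out with a few numbers. This is a sketch of the formula exactly as written above; the formula belongs to older kernels, and newer kernels may decide differently. The function names and the sample inputs are made up for the example.

```python
def swap_tendency(mapped_ratio, distress, vm_swappiness=60):
    """swap_tendency = mapped_ratio/2 + distress + vm_swappiness."""
    return mapped_ratio / 2 + distress + vm_swappiness


def reclaim_target(mapped_ratio, distress, vm_swappiness=60):
    """'swap' when swap_tendency >= 100, otherwise 'page cache'."""
    if swap_tendency(mapped_ratio, distress, vm_swappiness) >= 100:
        return "swap"
    return "page cache"


# 80% mapped memory, no distress, default swappiness: 40 + 0 + 60 = 100
print(swap_tendency(80, 0), reclaim_target(80, 0))
# 40% mapped memory: 20 + 0 + 60 = 80, so the page cache is reclaimed first
print(swap_tendency(40, 0), reclaim_target(40, 0))
```

With vm_swappiness lowered to 10, the first case drops to 50 and the kernel would prefer the page cache, which is why lowering swappiness reduces swapping.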
TROUBLESHOOTING
 OOM messages investigation :
Messages :
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461588] [] oom_kill_process+0x5c/0x80
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461591] [] out_of_memory+0xc5/0x1c0
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461595] [] __alloc_pages_nodemask+0x72c/0x740
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461599] [] __get_free_pages+0x1c/0x30
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461602] [] get_zeroed_page+0x12/0x20
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461606] [] fill_read_buffer.isra.8+0xaa/0xd0
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461609] [] sysfs_read_file+0x7d/0x90
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461613] [] vfs_read+0x8c/0x160
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461616] [] ? fill_read_buffer.isra.8+0xd0/0xd0
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461619] [] sys_read+0x3d/0x70
 Oct 25 07:28:34 nldedip4k031 kernel: [87976.461624] [] sysenter_do_call+0x12/0x28
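When reading such a trace, it helps to reduce each line to the function name and its offset/size pair. The sketch below does that with a regular expression on two of the lines above. The regex and the helper name `parse_frame` are assumptions made for this example, and a leading "?" is treated as marking a frame that may not be part of the real call chain.

```python
import re

# Matches "[? ]function+0xOFFSET/0xSIZE" at the end of a trace line.
FRAME_RE = re.compile(
    r"(\?\s+)?([A-Za-z_][\w.]*)\+0x([0-9a-f]+)/0x([0-9a-f]+)\s*$"
)


def parse_frame(line):
    """Return function, offset, size and reliability of one trace line, or None."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    unreliable, function, offset, size = match.groups()
    return {
        "function": function,
        "offset": int(offset, 16),
        "size": int(size, 16),
        "reliable": unreliable is None,
    }


LINES = [
    "Oct 25 07:28:34 nldedip4k031 kernel: [87976.461588] [] oom_kill_process+0x5c/0x80",
    "Oct 25 07:28:34 nldedip4k031 kernel: [87976.461616] [] ? fill_read_buffer.isra.8+0xd0/0xd0",
]
for line in LINES:
    print(parse_frame(line))
```

Read from bottom to top, the trace above shows a sysfs read (sys_read, vfs_read, sysfs_read_file) asking for a zeroed page, the page allocator failing, and out_of_memory invoking oom_kill_process.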
Q/A
Ref :
https://www.kernel.org/
https://www.redhat.com/en
http://www.tldp.org/LDP/tlk/mm/memory.html
https://en.wikipedia.org/wiki/Virtual_memory
https://lwn.net/
