1
2
Memory management in the Linux kernel
3
Memory management tasks
• Physical memory allocator
• Physical memory management
• Virtual memory allocator
• PTE management
• Memory allocator for kernel needs
4
Memory management subsystem
• More than 100K lines of code
• Buddy allocator
• Page replacement (“LRU” reclaim model)
• PTE management
• Slab/slob/slub kernel allocator
• Pagecache/writeback/readahead/swap
• Cgroup memory controller
• Compaction
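The buddy allocator keeps free memory in blocks of 2^order pages and merges a freed block with its "buddy" of the same order. A minimal sketch of that index arithmetic follows; the function names are illustrative and are not kernel symbols.

```python
def buddy_pfn(pfn: int, order: int) -> int:
    """Page frame number of the buddy of the order-`order` block starting at pfn."""
    # Buddies differ only in bit `order` of their starting page frame number.
    return pfn ^ (1 << order)


def merged_pfn(pfn: int, order: int) -> int:
    """Start of the order+1 block formed when a block merges with its buddy."""
    # Clearing bit `order` gives the lower of the two buddies.
    return pfn & ~(1 << order)


# A 4-page block (order 2) at pfn 8 has its buddy at pfn 12;
# merged, they form one 8-page block (order 3) starting at pfn 8.
print(buddy_pfn(8, 2), merged_pfn(12, 2))
```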
5
Hardware
• X86_64
• Paging (MMU, TLB, ...)
• 4KB, 2MB and 1GB pages
• NUMA
• 4-level page tables
• Hardware referenced bit
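With 4-level paging and 4KB pages, a 48-bit x86_64 virtual address splits into four 9-bit table indexes plus a 12-bit page offset. The sketch below shows that split; the dictionary keys follow the usual Linux level names (PGD, PUD, PMD, PTE), and the function name is hypothetical.

```python
def split_vaddr(vaddr: int) -> dict:
    """Split a 48-bit virtual address into 4-level page table indexes (4KB pages)."""
    return {
        "pgd": (vaddr >> 39) & 0x1FF,  # bits 47..39
        "pud": (vaddr >> 30) & 0x1FF,  # bits 38..30
        "pmd": (vaddr >> 21) & 0x1FF,  # bits 29..21
        "pte": (vaddr >> 12) & 0x1FF,  # bits 20..12
        "offset": vaddr & 0xFFF,       # bits 11..0
    }


# Build an address from known indexes and split it back.
addr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_vaddr(addr))
```

For a 2MB page the walk stops at the PMD level and the low 21 bits form the offset; for a 1GB page it stops at the PUD level with a 30-bit offset.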
6
Physical memory description
• Node (pg_data_t)
• Zone (struct zone)
• Page (struct page)
$ grep Node /proc/zoneinfo
Node 0, zone DMA
Node 0, zone DMA32
Node 0, zone Normal
Node 1, zone Normal
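The node/zone hierarchy in that output can be collected programmatically. The following is a small sketch that parses text in the format shown above; the helper name is an assumption, and on a live system the input would come from reading /proc/zoneinfo.

```python
import re
from collections import defaultdict


def zones_by_node(zoneinfo_text: str) -> dict:
    """Map each NUMA node id to the list of zone names found in zoneinfo text."""
    zones = defaultdict(list)
    for line in zoneinfo_text.splitlines():
        match = re.match(r"Node (\d+), zone\s+(\S+)", line)
        if match:
            zones[int(match.group(1))].append(match.group(2))
    return dict(zones)


# Sample taken from the slide; other lines of the real file are ignored.
sample = """Node 0, zone DMA
Node 0, zone DMA32
Node 0, zone Normal
Node 1, zone Normal"""
print(zones_by_node(sample))
```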
7
Virtual memory description
• Address space (struct mm_struct)
• VM area (struct vm_area_struct)
$ cat /proc/self/maps
00400000-0040c000 r-xp 00000000 08:03 2359718 /usr/bin/cat
0060b000-0060c000 r--p 0000b000 08:03 2359718 /usr/bin/cat
0060c000-0060d000 rw-p 0000c000 08:03 2359718 /usr/bin/cat
011a7000-011c8000 rw-p 00000000 00:00 0 [heap]
7f4d072e5000-7f4d0d80e000 r--p 00000000 08:03 2369473 /usr/lib/locale/locale-archive
7f4d0d80e000-7f4d0d9c2000 r-xp 00000000 08:03 2366682 /usr/lib64/libc-2.18.so
7f4d0d9c2000-7f4d0dbc2000 ---p 001b4000 08:03 2366682 /usr/lib64/libc-2.18.so
7f4d0dbc2000-7f4d0dbc6000 r--p 001b4000 08:03 2366682 /usr/lib64/libc-2.18.so
...
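Each line of that output describes one VM area: address range, permissions, file offset, device, inode and an optional path. A minimal parsing sketch, with a hypothetical Mapping type, could look like this:

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    path: str


def parse_maps_line(line: str) -> Mapping:
    """Parse one line of /proc/<pid>/maps into its fields."""
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    # Anonymous mappings have no path column.
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in addr.split("-"))
    return Mapping(start, end, perms, int(offset, 16), dev, int(inode), path)


heap = parse_maps_line("011a7000-011c8000 rw-p 00000000 00:00 0 [heap]")
print(heap.path, (heap.end - heap.start) // 1024, "KiB")
```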
8
File mappings
• File mappings (struct address_space)
• Radix tree with all resident pages
• Pagecache
• Major/minor pagefault
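A process can observe its own minor and major fault counters through getrusage(). The sketch below uses Python's standard resource module (Unix only); the buffer size and the assumed 4KB page stride are arbitrary choices for illustration.

```python
import resource


def fault_counts() -> tuple:
    """Return (minor, major) page fault counters of the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


before = fault_counts()
buf = bytearray(16 * 1024 * 1024)
# Write one byte per 4KB so every page of the buffer is actually touched;
# pages not yet backed by physical memory are typically filled in by minor faults.
for i in range(0, len(buf), 4096):
    buf[i] = 1
after = fault_counts()
print("minor:", after[0] - before[0], "major:", after[1] - before[1])
```

A major fault would additionally require reading the page from disk, for example when a file page is not resident in the pagecache.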
9
Kernel API
• __get_free_page()
• kmalloc()/kfree()
• vmalloc()
• ...
10
Userspace API
• pagefault
• mmap()/munmap()
• brk()
• mlock()/munlock()
• fadvise(), madvise()
• ...
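As a rough illustration of these calls from user space, the sketch below maps private anonymous memory, writes to it, and then gives the kernel a hint with madvise(). It relies on Python's mmap module (the madvise method needs Python 3.8 or later, hence the guard); the function name is hypothetical.

```python
import mmap


def demo_anonymous_mapping() -> bytes:
    """Map private anonymous memory, write to it, then hint MADV_DONTNEED."""
    size = 4 * mmap.PAGESIZE
    mm = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        mm[:5] = b"hello"          # the first write faults the page in
        data = bytes(mm[:5])
        if hasattr(mm, "madvise") and hasattr(mmap, "MADV_DONTNEED"):
            # Tell the kernel the contents are no longer needed; on Linux,
            # private anonymous pages are then dropped and read back as zeros.
            mm.madvise(mmap.MADV_DONTNEED)
        return data
    finally:
        mm.close()                 # munmap()


print(demo_anonymous_mapping())
```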
11
Memory reclaim
• Normal/direct reclaim (free pool)
• Per-node kswapd
• Working set
• Memory pressure
• File memory vs anonymous memory
• Swap
• OOM
12
“LRU” model
• 5 doubly linked lists: inactive file, active file, inactive anon, active anon, unevictable
• Referenced flag (PG_referenced) in struct page flags
13
List transition rules
• mark_page_accessed():
  – unreferenced -> referenced
  – inactive && referenced -> active
• shrink_inactive_list():
  – if (ptes referenced)
    • anonymous -> active
    • referenced -> active
    • (ptes referenced > 1) -> active (3.2)
    • (vm_flags & VM_EXEC) -> active (3.2)
    • otherwise: set referenced, rotate
  – else
    • reclaim
• shrink_active_list():
  – if (referenced && file && (vm_flags & VM_EXEC)) -> rotate
  – else -> inactive
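The rules above can be modeled as a small decision function. This is a simplified sketch of what the slide lists, not the kernel code: the Page class, its fields and the returned strings are all hypothetical, and the real logic in mm/vmscan.c handles more cases.

```python
from dataclasses import dataclass


@dataclass
class Page:
    anon: bool = False        # anonymous (swap-backed) page
    referenced: bool = False  # software Referenced flag in the page
    pte_refs: int = 0         # PTEs with the hardware referenced bit set
    vm_exec: bool = False     # mapped by a VM_EXEC area


def scan_inactive(page: Page) -> str:
    """Decision for a page at the tail of an inactive list."""
    if page.pte_refs > 0:
        if page.anon:
            return "activate"
        was_referenced = page.referenced
        page.referenced = True
        if was_referenced or page.pte_refs > 1 or page.vm_exec:
            return "activate"
        return "rotate"       # one more trip around the inactive list
    return "reclaim"


def scan_active(page: Page) -> str:
    """Decision for a page at the tail of an active list."""
    if page.pte_refs > 0 and not page.anon and page.vm_exec:
        return "rotate"       # referenced executable file pages stay active
    return "deactivate"


page = Page(pte_refs=1)
print(scan_inactive(page), scan_inactive(page))
```

In this model a file page referenced through a single PTE is kept on the inactive list the first time and activated only if it is seen referenced again.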
14
Memory pressure balancing
• nr_pages_to_scan = nr_pages / 2^priority
• priority = [12..0], i.e. scan 1/4096, 1/2048, 1/1024, ... of the list
• swappiness
• Active list is shrunk when active > inactive
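The scan-size formula is easy to tabulate. In the sketch below the LRU list size is a made-up value, and the function name is illustrative; only the formula and the starting priority of 12 come from the slide.

```python
DEF_PRIORITY = 12  # reclaim starts here and decreases toward 0 as pressure grows


def pages_to_scan(nr_pages: int, priority: int) -> int:
    """Number of pages scanned in one pass: nr_pages / 2^priority."""
    return nr_pages >> priority


lru_size = 1 << 20  # hypothetical list of 1Mi pages (4GiB with 4KB pages)
for priority in range(DEF_PRIORITY, -1, -1):
    print(priority, pages_to_scan(lru_size, priority))
```

Each time a pass fails to free enough memory the priority drops by one, so the scanned share of the list doubles until the whole list is scanned at priority 0.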
15
Yasearch-specific problems & solutions
• Working set > 1/2 available memory
• Memory thrashing
• promote_mapped_pages
• file_inactive_ratio
16
Monitoring & tools
• top
• vmtouch
• /proc/vmstat
• /proc/buddyinfo
• /proc/slabinfo
• perf top
• OOM messages in dmesg
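As an example of reading these files, each line of /proc/buddyinfo lists the number of free blocks per order (order 0 first) for one zone. The sketch below converts such a line into a free page count; the sample line contains made-up numbers and the function name is hypothetical.

```python
def free_pages(buddyinfo_line: str) -> int:
    """Total free pages in a zone, from one line of /proc/buddyinfo."""
    fields = buddyinfo_line.split()
    # fields: "Node", "<id>,", "zone", "<name>", then one count per order
    counts = [int(value) for value in fields[4:]]
    # A free block of a given order holds 2^order pages.
    return sum(count << order for order, count in enumerate(counts))


# Made-up sample line with 11 orders (0..10).
line = "Node 0, zone DMA 1 1 0 0 2 1 1 0 1 1 3"
print(free_pages(line))
```

Many free pages concentrated in the low orders, with zeros in the high orders, is a sign of fragmentation.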
17
Demonstration
18
Cgroups
• Each cgroup has own LRU lists.
• No common LRU (since 3.3)!
• Common free pool(s)
• Common kswapd thread(s)
• Global reclaim vs target reclaim
19
Memory controller
• memory.limit_in_bytes
• memory.soft_limit_in_bytes (will be deprecated)
• memory.use_hierarchy
• ...
20
Monitoring
• memory.usage_in_bytes
• memory.max_usage_in_bytes
• memory.stat
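The memory.stat file is a list of "name value" pairs, which makes it simple to read from a script. The sketch below parses such text; the sample values are invented, and on a real system the text would be read from the cgroup's directory, whose mount point varies between setups.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse the 'name value' lines of a cgroup memory.stat file."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value:
            stats[name] = int(value)
    return stats


# Invented sample; the per-list counters mirror the cgroup's own LRU lists.
sample = """cache 1048576
rss 524288
inactive_file 786432
active_file 262144
unevictable 0"""
stats = parse_memory_stat(sample)
print(stats["inactive_file"] + stats["active_file"])
```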
21
Accounting
• Each page belongs to one cgroup
• The cgroup that first accesses a page becomes its owner
• memory.move_charge_at_immigrate
22
Yasearch-specific problems & solutions
• memory.low_limit_in_bytes
• First accessed = owner? mlock()? low_limit?
• memory.recharge_on_pgfault
23
Compaction
• Migration of physical pages toward the top of the zone
• https://lwn.net/Articles/368869
• Broken in 3.3-3.7
• Replacement for lumpy reclaim
• Use perf top for problem diagnostics
24
Thank you for your attention!