Linux Kernel
Page Reclaim
吉田雅徳 (@siburu)
2014/7/27 (Sun)
An investigation into the basics of Linux's page reclaim.

Page reclaim

  1. Linux Kernel: Page Reclaim
     吉田雅徳 (@siburu), 2014/7/27 (Sun)
  2. Section 1: Recap of the Previous Session
  3. What's a Page Frame
     ❖ page frame = a page-sized, page-aligned piece of RAM
     ❖ struct page = a kernel structure in one-to-one correspondence with each page frame
     ❖ mem_map
       ❖ A single array of struct page that covers all RAM the kernel manages.
     ❖ With CONFIG_SPARSEMEM, however:
       ❖ There is no single mem_map.
       ❖ Instead, there is a list of 2MB-sized arrays of struct page.
       ❖ You must use __pfn_to_page(), __page_to_pfn(), or wrappers around them.
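
The sparse lookup on this slide can be sketched as below. This is only an illustration, not the kernel's implementation: the tiny section size, the `section_map` table, and the name `sketch_pfn_to_page()` are all hypothetical, and the real `__pfn_to_page()` is a macro with more bookkeeping.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for the kernel's struct page. */
struct page { unsigned long flags; };

#define PAGES_PER_SECTION 4UL   /* tiny value, for illustration only */
#define NR_SECTIONS       3UL

static struct page section0[PAGES_PER_SECTION];
static struct page section2[PAGES_PER_SECTION];

/* One struct page array per section; section 1 is a hole with no array. */
static struct page *section_map[NR_SECTIONS] = { section0, NULL, section2 };

/* Return the struct page for a page frame number, or NULL if the pfn
 * falls into a hole or beyond the last section. */
static struct page *sketch_pfn_to_page(unsigned long pfn)
{
    unsigned long sec = pfn / PAGES_PER_SECTION;

    if (sec >= NR_SECTIONS || section_map[sec] == NULL)
        return NULL;
    return &section_map[sec][pfn % PAGES_PER_SECTION];
}
```

With a single flat mem_map the same lookup would be one array index; the extra section step is what makes the accessor functions mandatory.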
  4. What's NUMA
     ❖ NUMA (Non-Uniform Memory Access)
       ❖ The system is composed of nodes.
       ❖ Each node is defined by a set of CPUs and one physical memory range.
       ❖ Memory access latency differs depending on the source and destination nodes.
     ❖ NUMA configuration
       ❖ ACPI provides the NUMA configuration:
         ❖ SRAT (Static Resource Affinity Table)
           ❖ Tells which CPUs and which memory range belong to which NUMA node.
         ❖ SLIT (System Locality Information Table)
           ❖ Tells how far each NUMA node is from every other node.
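
As a rough illustration of how a SLIT-style distance table might be consulted, the sketch below picks the closest remote node. The three-node table and its values are made up for the example (only the convention that a node's distance to itself is 10 comes from ACPI), and `sketch_nearest_remote_node()` is a hypothetical helper, not a kernel function.

```c
#define SKETCH_NR_NODES 3

/* Hypothetical distance table: row = source node, column = destination.
 * The diagonal uses 10, the conventional "local" distance. */
static const unsigned char sketch_slit[SKETCH_NR_NODES][SKETCH_NR_NODES] = {
    { 10, 20, 30 },
    { 20, 10, 20 },
    { 30, 20, 10 },
};

/* Return the remote node with the smallest distance from `from`.
 * Ties are resolved in favor of the lowest node number. */
static int sketch_nearest_remote_node(int from)
{
    int best = -1;

    for (int to = 0; to < SKETCH_NR_NODES; to++) {
        if (to == from)
            continue;
        if (best < 0 || sketch_slit[from][to] < sketch_slit[from][best])
            best = to;
    }
    return best;
}
```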
  5. What's a Memory Zone
     ❖ Physical memory is divided by address range:
       ❖ ZONE_DMA: <16MB
       ❖ ZONE_DMA32: <4GB
       ❖ ZONE_NORMAL: the rest
       ❖ ZONE_MOVABLE: none by default.
         ❖ This is used to define a hot-removable physical memory range.
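
The address-range split above can be written as a small classifier. This is a sketch using only the boundaries listed on the slide; the enum and function names are hypothetical, and ZONE_MOVABLE is left out because it is empty by default.

```c
#include <stdint.h>

enum sketch_zone_type {
    SKETCH_ZONE_DMA,
    SKETCH_ZONE_DMA32,
    SKETCH_ZONE_NORMAL
};

/* Classify a physical address by the zone boundaries from the slide. */
static enum sketch_zone_type sketch_zone_of(uint64_t phys_addr)
{
    if (phys_addr < (UINT64_C(16) << 20))   /* below 16MB */
        return SKETCH_ZONE_DMA;
    if (phys_addr < (UINT64_C(4) << 30))    /* below 4GB */
        return SKETCH_ZONE_DMA32;
    return SKETCH_ZONE_NORMAL;              /* the rest */
}
```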
  6. Memory node, zone
     [Diagram: the physical address space is split into Range1 and Range2; CPU1 to CPU4 are grouped into NUMA node1 and NUMA node2, and each node has its own struct pglist_data.]
     struct pglist_data {
         struct zone node_zones[MAX_NR_ZONES];
         …
     };
     ❖ Every pglist_data has a struct zone for each zone type (DMA through MOVABLE), although some of them may be empty.
  7. Memory Allocation
     1. First, the allocator checks the thresholds of each zone (threshold = watermark and dirty ratio).
        ❖ If every zone fails the check, the kernel enters the page reclaim path (= today's topic).
     2. If some zone passes, a page is allocated from that zone's buddy system.
        ❖ An order-0 page is allocated from the per-cpu cache.
        ❖ A higher-order page is taken from the per-order lists of pages.
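
The two steps above can be sketched as a zone-selection loop. This is a simplification under stated assumptions: only a single watermark is checked (the dirty-ratio check is omitted), the comparison is a plain "free pages after the allocation must not drop below the watermark", and `struct sketch_zone_info` and `sketch_pick_zone()` are hypothetical names.

```c
/* Hypothetical per-zone counters, in pages. */
struct sketch_zone_info {
    long free_pages;
    long watermark;
};

/* Return the index of the first zone that would stay at or above its
 * watermark after handing out 2^order pages, or -1 if every zone fails
 * (the case in which the kernel would fall into the reclaim path). */
static int sketch_pick_zone(const struct sketch_zone_info *zones,
                            int nr_zones, unsigned int order)
{
    long want = 1L << order;

    for (int i = 0; i < nr_zones; i++) {
        if (zones[i].free_pages - want >= zones[i].watermark)
            return i;
    }
    return -1;
}
```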
  8. Memory Deallocation
     ❖ A page is returned to the buddy system.
       ❖ An order-0 page is returned to the per-cpu cache via free_hot_cold_page().
         ❖ Cold page: a page estimated not to be in the CPU cache
           ❖ It is linked to the tail of the per-cpu cache's LRU list.
         ❖ Hot page: a page estimated to be in the CPU cache
           ❖ It is linked to the head of the per-cpu cache's LRU list.
       ❖ A higher-order page is returned directly to the per-order lists of pages.
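
The hot/cold placement can be illustrated with a small array-backed list. This is a sketch, not the kernel's per-cpu page list: the fixed capacity, the `struct sketch_pcp` type, and both function names are hypothetical, and taking allocations from the head is an assumption made so that hot pages are reused first.

```c
#define SKETCH_PCP_CAP 8

/* Hypothetical per-cpu cache: index 0 is the head (hot end),
 * index count-1 is the tail (cold end). */
struct sketch_pcp {
    unsigned long pfn[SKETCH_PCP_CAP];
    int count;
};

/* Free one order-0 page into the cache. Cold pages go to the tail,
 * hot pages go to the head. Returns 0 on success, -1 if full. */
static int sketch_pcp_free(struct sketch_pcp *pcp, unsigned long pfn, int cold)
{
    if (pcp->count == SKETCH_PCP_CAP)
        return -1;
    if (cold) {
        pcp->pfn[pcp->count] = pfn;
    } else {
        for (int i = pcp->count; i > 0; i--)
            pcp->pfn[i] = pcp->pfn[i - 1];
        pcp->pfn[0] = pfn;
    }
    pcp->count++;
    return 0;
}

/* Allocate from the head (assumption: hot pages are preferred).
 * Returns the pfn, or -1 if the cache is empty. */
static long sketch_pcp_alloc(struct sketch_pcp *pcp)
{
    unsigned long pfn;

    if (pcp->count == 0)
        return -1;
    pfn = pcp->pfn[0];
    for (int i = 1; i < pcp->count; i++)
        pcp->pfn[i - 1] = pcp->pfn[i];
    pcp->count--;
    return (long)pfn;
}
```

Freeing pfn 10 as hot, 20 as cold, and 30 as hot leaves the list as 30, 10, 20, so the most recently freed hot page is handed out first and the cold page last.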
  9. Buddy System
     [Diagram: a per-cpu cache of 4k pages with a HOT end and a COLD end serves order-0 (de)allocations; the per-zone buddy system keeps per-order lists, from order0 (4k) and order1 (8k) up to order10 (4m).]
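
The per-order sizes in the diagram, and the pairing that gives the buddy system its name, can be sketched with two one-line helpers. The 4k base page size follows the diagram; the function names are hypothetical, and the XOR pairing is the usual buddy-allocator convention rather than something the slide spells out.

```c
/* Size in bytes of a block of the given order, assuming 4k base pages. */
static unsigned long sketch_block_bytes(unsigned int order)
{
    return 4096UL << order;
}

/* Page frame number of the buddy of the block that starts at `pfn`.
 * `pfn` is assumed to be aligned to 2^order pages. */
static unsigned long sketch_buddy_of(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}
```

For example, order 10 gives 4194304 bytes (the "4m" blocks in the diagram), and the order-3 block starting at pfn 8 has the block starting at pfn 0 as its buddy.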
  10. Section 2: Page Reclaim
      2.1 Direct reclaim
      2.2 Daemon reclaim
  11. Review of the Page Allocation Flow
      ❖ __alloc_pages_nodemask (the basic page allocation function)
        ❖ get_page_from_freelist (1st: local zones, low wmark) → get_page_from_freelist (2nd: all zones)
        ❖ __alloc_pages_slowpath
          1. wake_all_kswapds (wake up the kswapd threads)
          2. get_page_from_freelist (3rd: all zones, min wmark)
          3. if {__GFP,PF}_MEMALLOC → __alloc_pages_high_priority
          4. __alloc_pages_direct_compact (asynchronous)
          5. __alloc_pages_direct_reclaim (reclaim pages directly in this context)
          6. if not did_some_progress → __alloc_pages_may_oom
          7. Retry (go to step 2) or __alloc_pages_direct_compact (synchronous)
  12. 2.1 Direct Reclaim (reclaim performed by the task that requested the allocation)
  13. __alloc_pages_direct_reclaim()
      ❖ __perform_reclaim
        ❖ current->flags |= PF_MEMALLOC
          ❖ So that the emergency reserves can be used if reclaim itself needs to allocate pages
        ❖ try_to_free_pages
          ❖ throttle_direct_reclaim
            ❖ if !pfmemalloc_watermark_ok → wait until kswapd makes it ok
          ❖ do_try_to_free_pages
        ❖ current->flags &= ~PF_MEMALLOC
      ❖ get_page_from_freelist
      ❖ drain_all_pages
      ❖ get_page_from_freelist
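  14. pfmemalloc_watermark_ok()
      ❖ ARGS
        ❖ pgdat (type: struct pglist_data)
      ❖ RETURN
        ❖ type: bool
        ❖ node's free_pages > 0.5 * node's min_wmark
      ❖ DESC
        ❖ Compares the amount of free pages with half of the min watermark, per node rather than per zone, and returns true if it is above.
        ❖ If it is below, returns false and also wakes up that node's kswapd.
        ❖ This function sets the threshold at which a node under memory pressure stops doing direct reclaim and leaves the work to kswapd.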
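
The node-level comparison described here can be sketched as follows. It is a simplified model, assuming the node totals are obtained by summing per-zone counters; `struct sketch_zone_stat` and the function name are hypothetical, and waking kswapd on failure is left out.

```c
#include <stdbool.h>

/* Hypothetical per-zone counters, in pages. */
struct sketch_zone_stat {
    unsigned long free_pages;
    unsigned long min_wmark;
};

/* Return true if the node's total free pages exceed half of the node's
 * total min watermark, i.e. direct reclaim need not be throttled. */
static bool sketch_pfmemalloc_watermark_ok(const struct sketch_zone_stat *zones,
                                           int nr_zones)
{
    unsigned long free_pages = 0;
    unsigned long reserve = 0;

    for (int i = 0; i < nr_zones; i++) {
        free_pages += zones[i].free_pages;
        reserve += zones[i].min_wmark;
    }
    return free_pages > reserve / 2;
}
```

Because the sums are taken over the whole node, one nearly empty zone does not throttle the caller as long as the node as a whole is above the threshold.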
  15. do_try_to_free_pages()
      ❖ Core function for page reclaim, called in 3 different situations
        ❖ try_to_free_pages() → Global reclaim path via __alloc_pages_nodemask()
        ❖ try_to_free_mem_cgroup_pages() → Per-memcg reclaim path
          ❖ Right before per-memcg slab allocation
          ❖ Right before per-memcg file page allocation
          ❖ Right before per-memcg anon page allocation
          ❖ Right before per-memcg swapin allocation
        ❖ shrink_all_memory() → Hibernation path
      ❖ Arguments: (1) struct zonelist *zonelist (2) struct scan_control *sc
  16. struct scan_control
      struct scan_control {
          unsigned long nr_scanned;
          unsigned long nr_reclaimed;
          unsigned long nr_to_reclaim;
          …
          int swappiness; // 0..100
          …
          struct mem_cgroup *target_mem_cgroup;
          …
          nodemask_t *nodemask;
      };
  17. What do_try_to_free_pages Does
      ❖ It loops over the following two steps
        ❖ shrink_zones()
          ❖ Described later
        ❖ wakeup_flusher_threads()
          ❖ Called whenever shrink_zones has scanned at least 1.5 times the reclaim target (scan_control::nr_to_reclaim).
          ❖ Asks all block devices (bdi) to write back at most as many pages as were scanned.
  18. shrink_zones()
      1. for_each_zone_zonelist_nodemask:
         1. mem_cgroup_soft_limit_reclaim
            ❖ while mem_cgroup_largest_soft_limit_node:
              ❖ mem_cgroup_soft_reclaim
            ❖ Before moving on to shrink_zone, reclaims pages from the memcgs that use this zone and exceed their limit
         2. shrink_zone
            ❖ foreach mem_cgroup_iter:
              ❖ shrink_lruvec
            ❖ In the global reclaim case, this iteration reclaims from the root memcg
      2. shrink_slab
         ❖ Slab will be covered in a later session
  19. shrink_lruvec()
      ❖ per-zone page freer
      1. get_scan_count
         ❖ Determines the target number of pages to reclaim
      2. while the target is not met:
         ❖ shrink_list(LRU_INACTIVE_ANON)
         ❖ shrink_list(LRU_ACTIVE_ANON)
         ❖ shrink_list(LRU_INACTIVE_FILE)
         ❖ shrink_list(LRU_ACTIVE_FILE)
      3. if inactive anonymous memory alone is insufficient:
         ❖ shrink_active_list
  20. shrink_list()
      ❖ Calls shrink_{active or inactive}_list; however, an active list is shrunk only when it is larger than its paired inactive list
      1. if an ACTIVE list is specified:
         ❖ if size of lru(ACTIVE) > size of lru(INACTIVE):
           ❖ shrink_active_list
      2. else:
         ❖ shrink_inactive_list
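
The decision on this slide fits in a few lines of C. The sketch below only returns which action would be taken; the enum and the function name are hypothetical, and the list sizes are passed in as plain numbers instead of being read from an lruvec.

```c
#include <stdbool.h>

enum sketch_shrink_action {
    SKETCH_DO_NOTHING,
    SKETCH_SHRINK_ACTIVE,
    SKETCH_SHRINK_INACTIVE
};

/* Decide what shrink_list() would do for one LRU list.
 * is_active:   the list being shrunk is an active list
 * nr_active:   size of the active list of the pair
 * nr_inactive: size of the paired inactive list */
static enum sketch_shrink_action sketch_shrink_list(bool is_active,
                                                    unsigned long nr_active,
                                                    unsigned long nr_inactive)
{
    if (is_active) {
        if (nr_active > nr_inactive)
            return SKETCH_SHRINK_ACTIVE;
        return SKETCH_DO_NOTHING;   /* active list is left alone */
    }
    return SKETCH_SHRINK_INACTIVE;
}
```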
  21. shrink_{active,inactive}_list
      ❖ shrink_active_list()
        1. Traverse the pages in an active list
        2. Find inactive pages in the list and move them to an inactive list
      ❖ shrink_inactive_list()
        ❖ foreach page:
          1. page_mapped(page) => try_to_unmap(page)
          2. if PageDirty(page) => pageout(page)
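  22. What Is an Inactive Page?
      ❖ When !laptop_mode
        ❖ The specified number of pages is simply taken from the tail of the active LRU list and treated as inactive pages
      ❖ When laptop_mode
        ❖ The specified number of clean pages is taken from the tail of the active LRU list and treated as inactive pages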
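
The difference between the two modes can be sketched as a scan from the tail of a list. This is an illustration only: the list is modeled as an array of dirty flags whose last element is the tail, and `sketch_isolate_from_tail()` is a hypothetical helper rather than a kernel function.

```c
#include <stdbool.h>

/* Pick up to `want` pages starting from the tail (index len-1) of a list
 * described by per-page dirty flags. In laptop mode, dirty pages are
 * skipped so that only clean pages are picked. The chosen indexes are
 * written to out_idx; the number of pages picked is returned. */
static int sketch_isolate_from_tail(const bool *dirty, int len, int want,
                                    bool laptop_mode, int *out_idx)
{
    int picked = 0;

    for (int i = len - 1; i >= 0 && picked < want; i--) {
        if (laptop_mode && dirty[i])
            continue;
        out_idx[picked++] = i;
    }
    return picked;
}
```

With dirty flags {clean, dirty, clean, dirty} and a request for two pages, normal mode picks indexes 3 and 2, while laptop mode picks indexes 2 and 0.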
  23. try_to_unmap()
      ❖ Unmaps a specified page from all of its mappings
        1. Set up struct rmap_walk_control.
        2. rmap_walk_{file, anon, or ksm}
           ❖ An rmap walk iterates over VMAs and unmaps the page from each of them
             A. file: traverse the address_space::i_mmap tree
             B. anon: traverse the anon_vma tree
             C. ksm: traverse all merged anon_vma trees
                ❖ each operation is similar to that for anon
  24. A. rmap_walk_file
      [Diagram: page → address_space (inode) → i_mmap (type: rb_root) → four vma → four pgtbl, each of which is unmapped.]
  25. B. rmap_walk_anon
      [Diagram: page → anon_vma → rb_root (type: rb_root) → four vma → four pgtbl, each of which is unmapped.]
  26. C. rmap_walk_ksm
      [Diagram: page → stable_node → hlist → four anon_vma → four vma → four pgtbl.]
  27. 2.2 Daemon Reclaim (reclaim performed by kswapd on behalf of allocating tasks)
  28. kswapd
      ❖ Processing overview
        1. Wake up
        2. balance_pgdat()
        3. Sleep
      ❖ balance_pgdat()
        ❖ Works until all zones of the pgdat are at or above the high watermark.
        ❖ reclaim function: kswapd_shrink_zone()
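
The "work until all zones are at or above the high watermark" loop can be modeled as below. This is a toy sketch: reclaim is simulated by adding a fixed batch of free pages to each zone that is below its watermark, and the struct and both function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical per-zone counters, in pages. */
struct sketch_kzone {
    unsigned long free_pages;
    unsigned long high_wmark;
};

/* True when every zone is at or above its high watermark. */
static bool sketch_pgdat_balanced(const struct sketch_kzone *zones, int nr_zones)
{
    for (int i = 0; i < nr_zones; i++) {
        if (zones[i].free_pages < zones[i].high_wmark)
            return false;
    }
    return true;
}

/* Keep "reclaiming" `batch` pages from each zone that is below its high
 * watermark until the whole node is balanced. Returns the total number
 * of pages reclaimed. A zero batch would never make progress, so it is
 * rejected up front. */
static unsigned long sketch_balance_pgdat(struct sketch_kzone *zones,
                                          int nr_zones, unsigned long batch)
{
    unsigned long reclaimed = 0;

    if (batch == 0)
        return 0;
    while (!sketch_pgdat_balanced(zones, nr_zones)) {
        for (int i = 0; i < nr_zones; i++) {
            if (zones[i].free_pages < zones[i].high_wmark) {
                zones[i].free_pages += batch;   /* stand-in for real reclaim */
                reclaimed += batch;
            }
        }
    }
    return reclaimed;
}
```

Once the function returns, the daemon in this model would go back to sleep until it is woken again.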