The e820 trap of Linux kernel hibernation

Presentation at COSCUP 2015

Coscup 2015-s4-e820-trap-20150815

    1. The e820 trap of Linux kernel hibernation. Aug 2015, COSCUP 2015, Taipei. Joey Lee, SUSE Labs Taipei
    2. Agenda
        • Fundamental
            • Hibernation (suspend to disk)
            • e820, EFI memmap
        • e820 shift
            • Platform vs. Shutdown
            • Memory size changing
        • EFI memmap shift
            • setup_data and nosave regions
            • EFI runtime services broken after S4
        • Challenges
        • Q&A
    3. Fundamental
    4. Memory (physical) [diagram: page frames from pfn = 0 to pfn = max_pfn]
    5. Memory (runtime) [diagram: 0 to max_pfn]
    6. Hibernation (suspend to disk)
        • Create a snapshot image of runtime memory.
        • Store the snapshot image to a swap partition or file.
        • Restore the snapshot image to memory.
    7. Hibernation (restore) [diagram: two memory bars, each from 0 to max_pfn; label: Memory restored]
    8. Memory (physical) [diagram: page frames from pfn = 0 to pfn = max_pfn]
    9. Memory (BIOS memory map) [diagram: two memory bars, each from 0 to max_pfn; label: Boot]
    10. e820
        • Wikipedia:
            • e820 is shorthand to refer to the facility by which the BIOS of x86-based computer systems reports the memory map to the operating system or boot loader.
            • It is accessed via the int 15h call, by setting the AX register to value E820 in hexadecimal. It reports which memory address ranges are usable and which are reserved for use by the BIOS.
    12. e820 entry type
        Type   | Kernel define | String in dmesg                     | Description
        -------+---------------+-------------------------------------+---------------------------------------------------------
        Type 1 | E820_RAM      | usable, System RAM                  | Usable (normal) RAM
        Type 2 | E820_RESERVED | reserved, reserved                  | Reserved - unusable
        Type 3 | E820_ACPI     | ACPI data, ACPI Tables              | ACPI reclaimable memory
        Type 4 | E820_NVS*     | ACPI NVS, ACPI Non-volatile Storage | ACPI NVS memory, ACPI Non-Volatile-Sleeping Memory (NVS)
        Type 5 | E820_UNUSABLE | Unusable, Unusable memory           | Area containing bad memory
        * drivers/acpi/nvs.c::suspend_nvs_*() handles ACPI NVS for S4
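To make the "String in dmesg" column concrete, here is a minimal Python sketch that extracts e820 entries from boot-log text. It assumes the `BIOS-e820: [mem 0x...-0x...] type` line format that appears in the log excerpts later in this talk; the function name `parse_e820` is hypothetical and not part of any kernel tool.

```python
import re

# Assumed line format, taken from the dmesg excerpts in this talk:
#   [    0.000000] BIOS-e820: [mem 0x...-0x...] usable
E820_LINE = re.compile(r"BIOS-e820: \[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\] (.+)$")


def parse_e820(dmesg_text):
    """Return a list of (start, end_inclusive, type_string) tuples."""
    entries = []
    for line in dmesg_text.splitlines():
        match = E820_LINE.search(line.strip())
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            entries.append((start, end, match.group(3)))
    return entries


sample = "[    0.000000] BIOS-e820: [mem 0x0000000068f45000-0x0000000069d4ffff] usable"
entries = parse_e820(sample)
```

Each tuple keeps the inclusive end address exactly as dmesg prints it, so two boots can be compared entry by entry.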
    13. Memory (BIOS memory map) [diagram: two memory bars, each from 0 to max_pfn; label: Boot]
    14. Memory (runtime) [diagram: memory bars from 0 to max_pfn split into regions; labels: Boot, usable, ACPI NVS, reserved, ACPI data, OS]
    15. EFI memory map
        • EFI spec v2.5
            • EFI_BOOT_SERVICES.GetMemoryMap()
                • Returns the current memory map.
            • 6.2 Memory Allocation Services
                • Table 25. Memory Type Usage before ExitBootServices()
                • Table 26. Memory Type Usage after ExitBootServices()
    17. e820 entry type vs. EFI memory region type
        E820 type | E820 entry type | EFI memory region type
        ----------+-----------------+--------------------------------------------
        Type 1    | E820_RAM        | EFI_LOADER_CODE (type 1)
                  |                 | EFI_LOADER_DATA (type 2)
                  |                 | EFI_BOOT_SERVICES_CODE (type 3)
                  |                 | EFI_BOOT_SERVICES_DATA (type 4)
                  |                 | EFI_CONVENTIONAL_MEMORY (type 7)
        Type 2    | E820_RESERVED   | EFI_RESERVED_TYPE (type 0)
                  |                 | EFI_RUNTIME_SERVICES_CODE (type 5)
                  |                 | EFI_RUNTIME_SERVICES_DATA (type 6)
                  |                 | EFI_MEMORY_MAPPED_IO (type 11)
                  |                 | EFI_MEMORY_MAPPED_IO_PORT_SPACE (type 12)
                  |                 | EFI_PAL_CODE (type 13)
        Type 3    | E820_ACPI       | EFI_ACPI_RECLAIM_MEMORY (type 9)
        Type 4    | E820_NVS        | EFI_ACPI_MEMORY_NVS (type 10)
        Type 5    | E820_UNUSABLE   | EFI_UNUSABLE_MEMORY (type 8)
        New*      | E820_PMEM       | EFI_PERSISTENT_MEMORY (type 14)
        * v4.2-rc4 arch/x86/boot/compressed/eboot.c::setup_e820()
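The table above can be read as a lookup from EFI memory type number to e820 entry type. The sketch below encodes exactly the rows of the table as a Python dictionary; the names `EFI_TO_E820` and `efi_type_to_e820` are illustrative only, and the kernel's own setup_e820() is C code that is not reproduced here.

```python
# EFI memory type number -> e820 entry type, transcribed from the table above.
EFI_TO_E820 = {
    1: "E820_RAM",        # EFI_LOADER_CODE
    2: "E820_RAM",        # EFI_LOADER_DATA
    3: "E820_RAM",        # EFI_BOOT_SERVICES_CODE
    4: "E820_RAM",        # EFI_BOOT_SERVICES_DATA
    7: "E820_RAM",        # EFI_CONVENTIONAL_MEMORY
    0: "E820_RESERVED",   # EFI_RESERVED_TYPE
    5: "E820_RESERVED",   # EFI_RUNTIME_SERVICES_CODE
    6: "E820_RESERVED",   # EFI_RUNTIME_SERVICES_DATA
    11: "E820_RESERVED",  # EFI_MEMORY_MAPPED_IO
    12: "E820_RESERVED",  # EFI_MEMORY_MAPPED_IO_PORT_SPACE
    13: "E820_RESERVED",  # EFI_PAL_CODE
    9: "E820_ACPI",       # EFI_ACPI_RECLAIM_MEMORY
    10: "E820_NVS",       # EFI_ACPI_MEMORY_NVS
    8: "E820_UNUSABLE",   # EFI_UNUSABLE_MEMORY
    14: "E820_PMEM",      # EFI_PERSISTENT_MEMORY
}


def efi_type_to_e820(efi_type):
    """Look up the e820 entry type for an EFI memory type number (0..14)."""
    return EFI_TO_E820[efi_type]
```

Note that boot services code and data (types 3 and 4) become ordinary RAM, while runtime services regions (types 5 and 6) stay reserved.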
    18. e820 shift
    21. e820 shift (1): Boot 1 vs. Boot 2
    22. e820 shift (2)
        • Boot:
              [ 0.000000] BIOS-e820: [mem 0x0000000068f45000-0x0000000069d4ffff] usable
        • Resume Boot:
              [ 0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
              [ 0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]
              [ 17.410733] PM: Image loading progress: 0%
              [ 17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
              [ 17.933469] IP: [<ffffffff810a1cf0>] load_image_lzo+0x810/0xe40
        • The page-fault address lies in a usable e820 entry in the original boot, but in a reserved entry in the resume boot.
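The overlap in the log above can be checked with a few lines of arithmetic. This Python sketch uses only the addresses from the two BIOS-e820 lines and the faulting address; the one outside assumption is `PAGE_OFFSET = 0xffff880000000000`, the x86_64 direct-mapping base used by kernels of that era, which turns the faulting virtual address into a physical one.

```python
# Assumption: direct-mapping base of x86_64 kernels of that era.
PAGE_OFFSET = 0xFFFF880000000000

# Inclusive ranges copied from the two BIOS-e820 lines above.
boot_usable = (0x68F45000, 0x69D4FFFF)
resume_reserved = (0x69D4F000, 0x69E12FFF)


def overlap(a, b):
    """Return the inclusive intersection of two ranges, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def contains(rng, addr):
    return rng[0] <= addr <= rng[1]


fault_pa = 0xFFFF880069D4F000 - PAGE_OFFSET
shifted = overlap(boot_usable, resume_reserved)
```

Under that assumption the fault lands on physical address 0x69d4f000, the single page that is usable in the first boot and reserved in the resume boot.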
    23. e820 shift (3) [diagram: memory maps of Boot and Resume Boot, each from 0 to max_pfn, split into regions labeled Boot, usable, ACPI NVS, reserved, ACPI data; callout: usable address in reserved region]
    24. Checking e820 shift:
        • Lee, Chun-Yi [PATCH] PM / hibernate: avoid unsafe pages in e820 reserved regions
            • Commit 84c91b7ae in v3.17-rc1
            • https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=84c91b7
            • Reverted by commit f82daee49 in v4.0
            • Waiting for Yinghai Lu's "[PATCH] x86: Kill E820_RESERVED_KERN"
        • Lee, Chun-Yi [PATCH] Hibernate: save e820 table to snapshot header for comparison
            • https://lkml.org/lkml/2014/8/11/166
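The idea of comparing the e820 table of the hibernating kernel with that of the resuming kernel can be sketched as follows. This is not the code of the patch above: the entry layout (`start`, `size`, `type`), the use of SHA-256, and the function names are all assumptions chosen for a short, self-contained illustration.

```python
import hashlib
import struct


def e820_digest(entries):
    """Hash a list of (start, size, type) e820 entries (hypothetical layout)."""
    digest = hashlib.sha256()
    for start, size, entry_type in entries:
        # Pack as two 64-bit values and one 32-bit value, little endian.
        digest.update(struct.pack("<QQI", start, size, entry_type))
    return digest.hexdigest()


def e820_matches(saved_digest, current_entries):
    """True if the resume-boot table matches the one saved with the image."""
    return saved_digest == e820_digest(current_entries)


boot_table = [(0x68F45000, 0xE0B000, 1)]    # usable, from the boot log
resume_table = [(0x68F45000, 0xE0A000, 1),  # usable region shrank by one page
                (0x69D4F000, 0xC4000, 2)]   # reserved, from the resume log
saved = e820_digest(boot_table)
```

A resume path built this way could refuse to load the image when `e820_matches(saved, resume_table)` is false, instead of faulting while restoring pages.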
    25. Platform vs. Shutdown (1)
        • Different modes of hibernation:
              cat /sys/power/disk
              [platform] shutdown reboot suspend
        • Platform mode depends on _S4 support by the BIOS:
              [ 1.080004] ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [_S4_] (20130725/hwxface-571)
        • ACPI spec 6.0:
            • Table 7-234 BIOS-Supplied Control Methods for System-Level Functions
                • _S4: Package that defines system _S4 state mode.
            • 16.3.2 BIOS Initialization of Memory (since ACPI v1.0):
                • Note: The memory information returned from the system address map reporting interfaces should be the same before and after an S4 sleep. OSPM will invoke E820 interfaces on IA-PC-based legacy systems or the GetMemoryMap() interface on UEFI-enabled systems.
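The `/sys/power/disk` output shown above marks the active mode with square brackets. A minimal Python sketch for reading that format follows; the helper name `parse_disk_modes` is hypothetical, and the sample string is the one from the slide rather than a live read of sysfs.

```python
def parse_disk_modes(text):
    """Split /sys/power/disk content into (selected_mode, available_modes)."""
    selected = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]   # strip the brackets around the active mode
            selected = token
        available.append(token)
    return selected, available


selected, available = parse_disk_modes("[platform] shutdown reboot suspend")
```

On a real system the same function could be fed the content of `/sys/power/disk`; writing a mode name back to that file selects it.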
    26. Platform vs. Shutdown (2)
        • Documentation/power/swsusp.txt in the kernel:
            • Q: What is the difference between "platform" and "shutdown"?
            • A: "platform" is actually right thing to do where supported, but "shutdown" is most reliable (except on ACPI systems).
        • Linux Kernel bug #77571:
            • https://bugzilla.kernel.org/show_bug.cgi?id=77571
            • The same page fault occurs when writing the snapshot image to the page buffer.
            • The bug reporter used "shutdown" rather than "platform". After switching to "platform", the reporter could no longer reproduce the issue.
        • It is better to use "platform" when the BIOS supports _S4. Users should be aware that using "shutdown" carries a risk.
    27. Memory size mismatch (1)
        • Log of the failed resume:
              PM: Loading and decompressing image data (495448 pages)...
              [ 3.834831] PM: Image mismatch: memory size
              [ 3.834851] PM: Read 1981792 kbytes in 0.01 seconds (198179.20 MB/s)
              [ 3.836147] PM: Error -1 resuming
              [ 3.836162] PM: Failed to load hibernation image, recovering.
        • Normally:
              On node 0 totalpages: 4177255
          When the issue happened:
              On node 0 totalpages: 4177256 <== mismatch
        • The number of physical pages is summed over all online nodes:
              for_each_online_node(nid)
                      phys_pages += node_present_pages(nid);
        • kernel/power/snapshot.c::check_header()
              if (!reason && info->num_physpages != get_num_physpages())
                      reason = "memory size";
              if (reason) {
                      printk(KERN_ERR "PM: Image mismatch: %s\n", reason);
                      return -EPERM;
              }
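The check above rejects the image as soon as the page count differs by a single page. The Python sketch below mirrors that comparison; it is a simplified model rather than the kernel code, and it uses the two `totalpages` values from the slide as stand-ins for the per-node present page counts.

```python
def get_num_physpages(node_present_pages):
    """Sum present pages over all online nodes (list of per-node counts)."""
    return sum(node_present_pages)


def check_header(image_num_physpages, node_present_pages):
    """Return a mismatch reason string, or None if the image is acceptable."""
    if image_num_physpages != get_num_physpages(node_present_pages):
        return "memory size"
    return None


# Stand-in values taken from the log lines on the slide.
image_pages = 4177255          # recorded when the snapshot was created
resume_nodes = [4177256]       # single node seen by the resume boot
reason = check_header(image_pages, resume_nodes)
```

With these inputs `reason` is "memory size", which corresponds to the "PM: Image mismatch: memory size" line in the log.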
    28. Memory size mismatch (2): memory map of Boot
    29. Memory size mismatch (3): memory map of Resume Boot
    30. EFI memmap shift
    31. Misidentification of nosave region (1) [figure labels: 1 page, in usable, not aligned, EFI_LOADER_DATA]
    32. setup_data and E820_RESERVED_KERN
        • setup_data: a linked list for carrying data along with boot_params to a later boot stage.
            • Allocated in the EFI stub, reserved via memblock and e820.
        • Yinghai Lu [PATCH] x86, boot: clean up setup_data handling
            • https://lkml.org/lkml/2015/2/28/272
        • SETUP_E820_EXT, SETUP_EFI, SETUP_DTB, SETUP_PCI, SETUP_KASLR
        • These setup_data chunks are not page-aligned when they are allocated. That leaves holes between e820 entries, which the kernel then registers as 1-page nosave regions. <== random address per boot!
    33. Misidentification of nosave region (2)
        • arch/x86/kernel/e820.c registers the hole between two e820 regions as a 1-page nosave region.
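How a non-page-aligned boundary turns into a 1-page nosave region can be reproduced with a small model. The Python sketch below is a simplified rendering of the rounding logic in the kernel's e820 nosave marking as the author of this note recalls it, not a copy of arch/x86/kernel/e820.c; the type value 128 for E820_RESERVED_KERN and the sample addresses are assumptions made for illustration.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT

E820_RAM = 1
E820_RESERVED_KERN = 128  # assumed value, used here only as a label


def pfn_up(addr):
    return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT


def pfn_down(addr):
    return addr >> PAGE_SHIFT


def mark_nosave_regions(e820_map):
    """Return nosave (start_pfn, end_pfn) ranges for (addr, size, type) entries."""
    nosave = []
    pfn = 0
    for addr, size, entry_type in e820_map:
        if pfn < pfn_up(addr):                 # gap before this entry
            nosave.append((pfn, pfn_up(addr)))
        pfn = pfn_down(addr + size)
        if entry_type not in (E820_RAM, E820_RESERVED_KERN):
            nosave.append((pfn_up(addr), pfn))
    return nosave


aligned = [(0x0, 0x100000, E820_RAM),
           (0x100000, 0x1000, E820_RESERVED_KERN),
           (0x101000, 0xFF000, E820_RAM)]
unaligned = [(0x0, 0x100010, E820_RAM),
             (0x100010, 0x70, E820_RESERVED_KERN),   # setup_data-like chunk
             (0x100080, 0xFFF80, E820_RAM)]
```

In the unaligned map the end of one entry is rounded down and the start of the next is rounded up, so pfn 0x100 falls into a "hole" even though every byte is covered by an entry.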
    34. Kill E820_RESERVED_KERN
        • Yinghai Lu [PATCH] x86: Kill E820_RESERVED_KERN
            • https://lkml.org/lkml/2015/2/28/274
            • Cleans up the setup_data handling and removes E820_RESERVED_KERN from the e820 regions, because setup_data is already protected by memblock.
            • Avoids wasting memory and fixes the page-alignment problem in e820.
        • Linux Kernel bug #96111: Unreliable hibernation on Lenovo X230
            • https://bugzilla.kernel.org/show_bug.cgi?id=96111
            • Commit 84c91b7ae in v3.17-rc1, reverted by commit f82daee49 in v4.0
        • Chen, Yu C [RFC PATCH] PM / hibernate: make sure each resuming page is in current memory zones
        • Waiting for Yinghai Lu's patch to kill E820_RESERVED_KERN
    35. EFI runtime services broken after S4 (1): on some machines
    36. EFI runtime services broken after S4 (2)
        • Resume Boot: VA 0xfffffffefd244e60 is in the Runtime Data region after hibernation resume:
              [ 0.125865] efi: mem26: [Runtime Data |RUN| | | | |WB|WT|WC|UC] pa=[0x00000000bb3e5000-0x00000000bb445000) va=[0xfffffffefd1e5000-0xfffffffefd245000) (0MB)
        • Boot: VA 0xfffffffefd244e60 is not mapped to any PA in the hibernating kernel (image kernel):
              [ 0.111002] efi: mem24: [Runtime Code |RUN| | | | |WB|WT|WC|UC] pa=[0x00000000bb385000-0x00000000bb3e5000) va=[0xfffffffefd585000-0xfffffffefd5e5000) (0MB)
              [ 0.125883] efi: mem25: [Runtime Data |RUN| | | | |WB|WT|WC|UC] pa=[0x00000000bb3e5000-0x00000000bb445000) va=[0xfffffffefd3e5000-0xfffffffefd445000) (0MB)
              [ 0.140764] efi: mem29: [Boot Data | | | | | |WB|WT|WC|UC] pa=[0x00000000bb7ff000-0x00000000bb800000) va=[0xfffffffefd1ff000-0xfffffffefd200000) (0MB)
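The claim on this slide can be verified directly from the `va=[start-end)` ranges in the two logs. In the Python sketch below the region tables are copied from those log lines; the looked-up address is written as 0xfffffffefd244e60, on the assumption that the 15-digit value printed on the original slide dropped one `f`.

```python
# Half-open VA ranges [start, end) copied from the efi: memNN log lines.
resume_boot = {
    "mem26 Runtime Data": (0xFFFFFFFEFD1E5000, 0xFFFFFFFEFD245000),
}
boot = {
    "mem24 Runtime Code": (0xFFFFFFFEFD585000, 0xFFFFFFFEFD5E5000),
    "mem25 Runtime Data": (0xFFFFFFFEFD3E5000, 0xFFFFFFFEFD445000),
    "mem29 Boot Data": (0xFFFFFFFEFD1FF000, 0xFFFFFFFEFD200000),
}


def find_region(va, regions):
    """Return the name of the region containing va, or None."""
    for name, (start, end) in regions.items():
        if start <= va < end:
            return name
    return None


# Assumption: the slide's address with the missing hex digit restored.
va = 0xFFFFFFFEFD244E60
in_resume = find_region(va, resume_boot)
in_boot = find_region(va, boot)
```

The address resolves to the Runtime Data region in the resume boot and to nothing in the image kernel's map, which matches the slide: the same physical Runtime Data region received a different virtual address in the two boots.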
    37. Memory mapping of EFI runtime services (1)
        • Borislav Petkov [PATCH] EFI: Runtime services virtual mapping
            • Commit d2f7cbe7, merged in the v3.14 kernel
            • We map the EFI regions needed for runtime services non-contiguously, with preserved alignment on virtual addresses starting from -4G down for a total max space of 64G.
        • Documentation/x86/x86_64/mm.txt
            • ->trampoline_pgd: We map EFI runtime services in the aforementioned PGD in the virtual range of 64Gb (arbitrarily set, can be raised if needed)
              0xffffffef00000000 - 0xffffffff00000000
    38. Memory mapping of EFI runtime services (2)
        • Virtual memory map (x86_64) of runtime services in trampoline_pgd [diagram: address space from 0x0000000000000000 to 0xffffffffffffffff; Runtime Code at 0x00000000bb385000 and Runtime Data at 0x00000000bb3e5000, with Boot Code and Boot Data, shown as 1:1 mapping workarounds; the same regions shown again in the 64 G window between 0xffffffef00000000 and 0xffffffff00000000, 4 G below the top]
        • arch/x86/platform/efi/efi_64.c::efi_map_region()
    39. Memory mapping of EFI runtime services (3)
        • In the -4G area [diagram: Runtime Code, Runtime Data, Boot Code and Boot Data placed in the 64 G window between 0xffffffef00000000 and 0xffffffff00000000, 2M-aligned]
        • arch/x86/platform/efi/efi_64.c::efi_map_region()
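The top-down, offset-preserving placement described on these slides can be modeled in a few lines. The Python sketch below is a simplified model of the allocation rule (start at -4G, move down by the region size, keep the PA's offset within a 2M page), written from the description above rather than copied from efi_map_region(); the function name `map_regions` and the extra 2M region in the second example are made up for illustration.

```python
PMD_SIZE = 2 * 1024 * 1024            # 2 MiB
PMD_MASK = ~(PMD_SIZE - 1)
EFI_VA_START = 0xFFFFFFFF00000000     # -4G
EFI_VA_END = 0xFFFFFFEF00000000       # bottom of the 64G window


def map_regions(regions):
    """Assign a VA to each (pa, size) region, top-down from EFI_VA_START."""
    efi_va = EFI_VA_START
    assigned = []
    for pa, size in regions:
        efi_va -= size
        offset = pa & (PMD_SIZE - 1)
        if offset == 0:
            efi_va &= PMD_MASK                      # PA is 2M-aligned
        else:
            prev_va = efi_va
            efi_va = (efi_va & PMD_MASK) + offset   # same offset in the 2M page
            if efi_va > prev_va:
                efi_va -= PMD_SIZE
        if efi_va < EFI_VA_END:
            raise ValueError("VA range overflow")
        assigned.append(efi_va)
    return assigned


runtime = [(0xBB385000, 0x60000), (0xBB3E5000, 0x60000)]  # PAs from the logs
first = map_regions(runtime)
# Hypothetical boot in which one extra 2M region is mapped before the others.
second = map_regions([(0xC0000000, PMD_SIZE)] + runtime)[1:]
```

Because each VA depends on everything mapped before it, one additional or differently sized region in the memory map moves all later runtime regions, while their offset within the 2M page stays tied to the PA.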
    40. Should fix runtime services address after S4
        • Lee, Chun-Yi [PATCH] x86_64/efi: Mapping Boot and Runtime EFI memory regions to different starting virtual address
        • The VA of EFI runtime services may change across hibernation, but that is fine as long as the PA does not change.
        • The EFI page table should be checked in more detail during hibernation recovery.
    41. Challenges
    42. Hibernation's Challenges
        • KASLR (kernel address space layout randomization)
            • Mutually exclusive with hibernation
        • Intel Rapid Start
            • A replacement for kernel hibernation
            • May also conflict with KASLR
        • NVDIMM
            • Hibernation is no longer needed
    43. Q&A
    44. SUSE is Hiring: please search "SUSE Careers" and http://www.104.com.tw/
    45. openSUSE Asia Summit 2015, Taipei, R.O.C. (Taiwan): Bring you to the free world
    48. Join us on: www.opensuse.org
