Las16 200 - firmware summit - ras what is it- why do we need it

Title: RAS What is it? Why do we need it?
A 101-style introduction to RAS, its purpose, and how we use it on ARM64. It covers the current status of the implementation in the ASWG specs and the Linux kernel, plus plans for future features that are essential for ARM64, followed by a discussion period.
Speakers: Yazen Ghannam, Fu Wei

Published in: Technology

  1. RAS: What is it? Why do we need it?
     Harb Abdulhamid (Qualcomm), Fu Wei (Red Hat), Yazen Ghannam (AMD)
  2. ENGINEERS AND DEVICES WORKING TOGETHER
     What is it?
     ● Reliability
       ○ Computation needs to be correct and reliable.
       ○ Failures and errors need to be detected and reported.
       ○ Computation needs to fail when an error is not handled.
     ● Availability
       ○ System needs to remain available as long as possible.
       ○ Errors should be corrected and failures handled so that operation can continue.
     ● Serviceability
       ○ System should provide information to the administrator to aid in system servicing.
       ○ Service time needs to be minimized to maximize uptime.
  3. Why do we need it?
     ● Increase in system uptime (productivity)
     ● Less time spent debugging bad or failing hardware (productivity/cost)
     ● Fewer hardware replacement calls (cost/mindshare)
  4. Hardware Architecture (How do we do it?)
     ● x86: Machine Check Exceptions (MCE) & Machine Check Architecture (MCA)
       ○ Architectural features/extensions.
       ○ Defines a register set that can be used for multiple devices (IMPORTANT!).
       ○ Poll for correctable errors.
       ○ APIC LVT or SMI interrupts for correctable thresholding and deferred errors.
       ○ MCE for uncorrectable errors.
     ● PCI-E: Advanced Error Reporting (AER)
       ○ Similar concepts to MCE/MCA.
     ● Implementation-specific features
       ○ ECC in memory controllers
       ○ ECC in I/O RAMs
       ○ Poison/bad data markers
       ○ Flooding I/O links (e.g. Sync Flood)
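
To make the "register set" point concrete, here is a minimal sketch of decoding an x86 MCi_STATUS value and splitting it into the correctable and uncorrectable cases listed above. The bit positions are the architectural ones; the function names are illustrative, and the classification is deliberately simplified (it ignores vendor-specific cases such as deferred errors).

```python
# Architectural flag bits in an x86 MCi_STATUS register.
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # hardware could not correct the error
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC holds additional information
    "ADDRV": 58,  # MCi_ADDR holds the failing address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a raw 64-bit MCi_STATUS value into named fields."""
    fields = {name: bool((status >> bit) & 1)
              for name, bit in MCI_STATUS_FLAGS.items()}
    fields["mca_error_code"] = status & 0xFFFF             # bits 15:0
    fields["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return fields


def classify(status: int) -> str:
    """Simplified policy: poll-and-log vs. machine check exception."""
    fields = decode_mci_status(status)
    if not fields["VAL"]:
        return "no error"
    return "uncorrectable" if fields["UC"] else "correctable"
```

A poller would call `classify` on each bank's status and only log the "correctable" results, while the "uncorrectable" case corresponds to what an MCE handler would see.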
  5. Platform Firmware (How do we do it?)
     ● Platform Firmware has intimate knowledge of the system and can handle RAS features not available through standardized mechanisms.
     ● Privileged code runs on the main cores or a separate microcontroller.
     ● Can mask registers from OS view and handle interrupts.
     ● Handling can be done without the OS’s knowledge, and information can be exposed to the OS if desired.
     ● Preferably, will use a standard mechanism, like ACPI, to inform the OS of errors.
     ● Can directly inform the sysadmin of errors using sideband communications like a baseboard management controller (BMC).
     ● Can pinpoint bad hardware for easy replacement.
  6. Kernel (How do we do it?)
     ● Error Detection and Correction (EDAC) for system-specific handling and decoding.
     ● ISA-specific handling in /arch.
     ● Drivers for PCI-E AER and ACPI.
     ● Ideally, most RAS code in the Kernel would be obsoleted by Platform Firmware handling of errors.
     ● Kernel would then be responsible only for reporting errors received through standard mechanisms (e.g. ACPI).
     ● Kernel could also perform error handling relevant at the kernel level (e.g. killing processes or retiring bad/poisoned pages).
  7. User-space (How do we do it?)
     ● mcelog
       ○ Generally considered obsolete.
       ○ x86 only.
       ○ Reads data from /dev/mcelog.
     ● rasdaemon
       ○ More active.
       ○ Can be updated to handle various platforms.
       ○ Reads data from Kernel tracepoints.
       ○ Can effectively obsolete EDAC modules for error decoding.
  8. ACPI (How do we do it?)
     ● We’ll get into this next...
  9. ACPI APEI BERT
     ● Scenario: record errors in an emergency (OS crash/reset)
     ● BERT: Boot Error Record Table
     ● Mechanism: reports unhandled errors that occurred in a previous boot.
       ○ WHERE the error records are
  10. UEFI spec CPER
  11. ACPI APEI BERT
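
The "WHERE" that BERT answers is just a length and a physical address following the standard ACPI table header. A minimal sketch of reading those two fields from a raw BERT table follows; the function name and returned keys are illustrative, and it assumes the table bytes were already obtained somehow (on Linux, ACPI tables are typically readable by root under /sys/firmware/acpi/tables/).

```python
import struct

# Standard ACPI table header: signature, length, revision, checksum,
# OEM ID, OEM table ID, OEM revision, creator ID, creator revision.
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")   # 36 bytes
# BERT body: Boot Error Region Length, Boot Error Region (physical address).
BERT_BODY = struct.Struct("<IQ")                # 12 bytes


def parse_bert(table: bytes) -> dict:
    """Return where the boot error region lives, per the BERT table."""
    if len(table) < ACPI_HEADER.size + BERT_BODY.size:
        raise ValueError("table too short to be a BERT")
    signature, length, revision, *_ = ACPI_HEADER.unpack_from(table, 0)
    if signature != b"BERT":
        raise ValueError("not a BERT table")
    if length > len(table):
        raise ValueError("table is truncated")
    # All bytes of a valid ACPI table sum to zero modulo 256.
    if sum(table[:length]) & 0xFF != 0:
        raise ValueError("bad checksum")
    region_length, region_address = BERT_BODY.unpack_from(table, ACPI_HEADER.size)
    return {
        "revision": revision,
        "region_length": region_length,
        "region_address": region_address,
    }
```

The region at `region_address` then holds the error status block with CPER-formatted records from the previous boot; decoding those is out of scope for this sketch.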
  12. ACPI APEI HEST
     ● Scenario: record errors at runtime (OS can still work)
     ● HEST: Hardware Error Source Table
     ● Mechanism: a standardized way for platforms to describe their error sources, using Error Source Structures that specify:
       ○ HOW to inform the OS
       ○ WHERE the error records are
       ○ WHEN records can be freed
  13. ACPI APEI HEST
     ● Error Source Structure:
       ○ For IA-32: MCE/CMC/NMI
       ○ For PCI: AER Root Port/Endpoint/Bridge
       ○ Generic Hardware: GHES v1/v2
     ● For ARM64: GHES v2
       ○ HOW to inform: Notification Structure
       ○ WHERE the error records are: Error Status Address (GAS: Generic Address Structure)
       ○ WHEN records can be freed: Read Ack Register
  14. ACPI APEI HEST
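
The "WHEN records can be freed" step for GHES v2 can be sketched as a read-modify-write of the Read Ack Register, using the Read Ack Preserve and Read Ack Write fields that accompany it in the GHES v2 structure. This is a minimal illustration, not kernel code: the class and function names are made up, and the register is modelled as a plain integer rather than a real memory-mapped access.

```python
from dataclasses import dataclass

MASK_64 = 0xFFFFFFFFFFFFFFFF


@dataclass
class ReadAck:
    """Acknowledgement fields taken from a GHES v2 error source entry."""
    bit_offset: int  # bit offset from the Read Ack Register's GAS
    preserve: int    # Read Ack Preserve: bits to keep from the current value
    write: int       # Read Ack Write: bits to set to signal "record consumed"


def ack_value(current: int, ack: ReadAck) -> int:
    """Compute the value the OS writes back after consuming an error record."""
    value = current & (ack.preserve << ack.bit_offset)
    value |= ack.write << ack.bit_offset
    return value & MASK_64


# Example: keep every bit except bit 0, then set bit 0 to acknowledge.
ack = ReadAck(bit_offset=0, preserve=0xFFFFFFFFFFFFFFFE, write=0x1)
new_value = ack_value(0xF0, ack)   # 0xF1
```

Once firmware observes the acknowledged value, it knows the error status block may be reused for the next record.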
  15. ACPI APEI ERST
     ● Scenario: record and retrieve errors in persistent storage
     ● ERST: Error Record Serialization Table
     ● Mechanism: an abstract set of operations that provides the details necessary to communicate with on-board persistent storage.
     ● Plan B: use the UEFI runtime variable services to carry out error record persistence operations.
  16. ACPI APEI EINJ
     ● Scenario: test the OSPM error handling stack
     ● EINJ: Error Injection Table
     ● Mechanism: an abstract set of operations that provides a generic interface through which OSPM can inject hardware errors into the platform without requiring platform-specific software.
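
As an illustration of what "injecting through a generic interface" looks like from the OS side, the sketch below drives the Linux EINJ debugfs files (error_type, param1, param2, error_inject). It assumes a kernel with the EINJ module loaded, debugfs mounted at /sys/kernel/debug, and root privileges; the function name and the `einj_dir` parameter are illustrative and exist mainly so the sequence can be tried against a scratch directory.

```python
from pathlib import Path

# Default location of the EINJ debugfs interface on Linux (assumed mounted).
EINJ_DIR = Path("/sys/kernel/debug/apei/einj")

# EINJ error type bit for a correctable memory error.
MEMORY_CORRECTABLE = 0x8


def inject_error(error_type: int, address: int, mask: int,
                 einj_dir: Path = EINJ_DIR) -> None:
    """Describe the error to inject, then trigger the injection."""
    (einj_dir / "error_type").write_text(f"{error_type:#x}\n")
    (einj_dir / "param1").write_text(f"{address:#x}\n")  # target physical address
    (einj_dir / "param2").write_text(f"{mask:#x}\n")     # address mask
    # Writing any value here asks the platform to perform the injection.
    (einj_dir / "error_inject").write_text("1\n")
```

For example, `inject_error(MEMORY_CORRECTABLE, 0x12345000, 0xFFFFFFFFFFFFF000)` would request a correctable memory error in the page at that address, if the platform advertises support for that error type.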
  17. RAS on ARM64
     ● Architectural support for RAS is not available, but it is not needed either.
     ● In other words, there is no need to follow the same historical path as other architectures.
     ● Focus should be on Platform Firmware handling of errors.
     ● Reporting should be through standard methods like ACPI.
     ● Will possibly need to implement kernel-relevant error handling based on information received from Platform Firmware.
  18. Current Work
     ● Add support for ACPI RAS features.
     ● Testing Platform Firmware to OS interface.
     ● No platform-specific RAS feature testing.
     ● Using modified QEMU for testing.
  19. Future Work
     ● Finish ACPI implementation.
     ● Investigate kernel handling of poisoned pages and processes.
     ● Investigate I/O-related error handling in the Kernel.
  20. Demo
  21. Thank You #LAS16
     For further information: www.linaro.org
     LAS16 keynotes and videos on: connect.linaro.org