
Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203

Session ID: SFO17-203
Session Name: Reliability, Availability, and Serviceability (RAS) on ARM64 status - SFO17-203
Speaker: Fu Wei
Track: LEG


★ Session Summary ★
This presentation gives an updated view of the RAS architecture on ARM64, based on the RAS extension (in ARMv8.2), SDEI (Software Delegated Exception Interface), APEI, and UEFI PI-SMM. It covers all the components of the new RAS architecture on ARM64 and gives the audience the current status and the next steps of development.
---------------------------------------------------
★ Resources ★
Event Page: http://connect.linaro.org/resource/sfo17/sfo17-203/
Presentation:
Video: https://www.youtube.com/watch?v=NReFBzbeWi0
---------------------------------------------------

★ Event Details ★
Linaro Connect San Francisco 2017 (SFO17)
25-29 September 2017
Hyatt Regency San Francisco Airport

---------------------------------------------------
Keyword:
'http://www.linaro.org'
'http://connect.linaro.org'
---------------------------------------------------
Follow us on Social Media
https://www.facebook.com/LinaroOrg
https://twitter.com/linaroorg
https://www.youtube.com/user/linaroorg?sub_confirmation=1
https://www.linkedin.com/company/1026961

Published in: Technology

  1. Reliability, Availability, and Serviceability (RAS) on AArch64
     Fu Wei (Linaro LEG), Supreeth Venkatesh (ARM)
  2. ENGINEERS AND DEVICES WORKING TOGETHER
     AGENDA
     1. Brief introduction of RAS
        ○ Definition, Importance, History
     2. RAS on AArch64
        ○ Overview
          ■ Hardware support
            ● RAS Extension
          ■ Software Architecture
            ● ARM-Trusted-Firmware, UEFI, APEI tables
            ● SDEI
        ○ Prototype Solution for Firmware First Error Handling
          ■ MM Secure Partition, Secure Partition Manager
          ■ Uncorrected error -- HEST & MM
          ■ Demo time
     3. Status and Future Plans
  3. Brief introduction of RAS
     ● What is RAS?
     ● Why do we need RAS?
     ● History of RAS
  4. What is RAS? -- Definition
     Reliability: continuity. Computation needs to be correct and reliable.
     Availability: readiness. The system needs to remain available as long as possible.
     Serviceability: the ability to undergo modifications and repairs. The system should provide information to the administrator to aid in system servicing.
     The RAS architecture primarily cares about errors produced by hardware.
  5. Why do we need RAS? -- Importance
     Impacts: continuity. Computation needs to be correct and reliable.
     Inevitability: although faults are rare, enterprise systems can be very large, so failures are inevitable. Systems therefore have to be maintained well, and Operating Expense (OPEX) for maintenance is unavoidable.
     OPEX for maintenance is reduced by
     1. replacing only failed parts
     2. scheduled maintenance (which is cheaper than unscheduled service outages)
  6. Why do we need RAS? -- Importance: Inevitability
     "DRAM Errors in the Wild: A Large-Scale Field Study" by Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber
     Important conclusions:
     ● The incidence of memory errors and the range of error rates across different DIMMs are much higher than previously reported.
     ● Memory errors are strongly correlated.
     ● The incidence of CEs (correctable errors) increases with age, while the incidence of UEs (uncorrectable errors) decreases with age (due to replacements).
     ● Error rates are unlikely to be dominated by soft errors.
     Benefit from ECC in DIMM:
     ● Single-bit error --> CE
     ● Avoid UEs (multi-bit errors) by acting early, while the errors are still single-bit CEs
     ● The statistical data of CEs/UEs can be a reference for maintenance, reducing the cost of unscheduled service outages.
  7. Server without RAS
     How to avoid "Inevitability"?
     "To Be Successful in Business, You Need a Little Luck." --Richard Branson
     [ASCII art inside a C comment: a Buddha figure, captioned "Buddha blessed No Error Forever"]
     [Image caption: Emperor Yongzheng Defeat Ba A Ge (Bug)]
  8. History
     ECC in memory controllers and I/O RAMs
     Machine Check Architecture (MCA)
     ● A mechanism in which the CPU reports hardware errors to the OS
       ○ Model-specific registers (MSRs)
         ■ set up machine checking
         ■ record detected errors
         ■ the information they contain is CPU specific
       ○ Machine Check Exception (MCE)
         ■ signals the detection of an uncorrected machine-check error
         ■ the handler collects information about the error from the MSRs
       ○ Utility: mcelog
     PCI-E: Advanced Error Reporting (AER)
     Linux kernel
     ● EDAC (Error Detection and Correction)
       ○ designed to report and possibly act on hardware errors
       ○ inspects the hardware directly (system-specific handling and decoding)
       ○ only supports memory controller and PCI/AGP errors
     Firmware First (FF)
     ● APEI
     ● UEFI
  9. RAS on AArch64
     ● Overview of Hardware & Software
     ● Prototype Solution for Firmware First Error Handling
  10. Hardware support for RAS
      ● CPU
        ○ ARMv8-A architecture (the RAS Extension is a mandatory extension to ARMv8.2)
        ○ EL2, EL3, or both
        ○ Virtualization extension or Security extensions or both
      ● GICv3
        ○ Interrupt routing modes
        ○ Private and shared interrupts (PPI/SPI)
        ○ Ability to set an interrupt pending
        ○ Event signaling and delegation
        ○ Interrupt groups/priority
      RAS Extension
      ● ESB (Error Synchronization Barrier) instructions
      ● RAS Extension registers
      ● Corrupted data poisoning
  11. RAS Extension
      ESB instruction
      ● ESB (Error Synchronization Barrier) can be used to isolate Unrecoverable errors.
      ● Software can determine that:
        ○ The error was reported as Unrecoverable.
        ○ The preferred return address of the SEI (SError interrupt) is an ESB instruction.
        ○ The software between that ESB and the previous ESB can be isolated.
        ○ ESB might update DISR_EL1 / DISR (Deferred Interrupt Status Register) and VDISR_EL2 / VDISR (Virtual Deferred Interrupt Status Register)
      RAS Extension registers:
      ● Feature Register/Component ID Register
      ● Error Record Register
        ○ Feature
        ○ Control
        ○ Record Primary Syndrome
        ○ Record Address Register
        ○ Record Miscellaneous Registers
      ● Hypervisor Configuration Register
      ● Virtual SError Exception Syndrome Register
      ● Secure Configuration Register
      Or
      ● Interrupt Register for Fault-Handling and Recovery
      ● Device Affinity and Architecture Register
  12. RAS Extension -- gather HW error info for FW
      ESB instruction
      ● Helps to locate the error
      RAS Extension registers
      ● Provide the error info to FW
      ● Let FW control the interrupt
      ARMv8-A RAS extensions standardize the interface between HW and FW
  13. Software Architecture
      Firmware First error handling requires standard interfaces between multiple SW components.
  14. Software Architecture
  15. Firmware
      ARM Trusted Firmware
      ● Reference EL3 Runtime (BL31)
        ○ Standard power control (PSCI)
        ○ Optional Trusted OS integration
      ● Trusted boot firmware
        ○ Optional
        ○ Compatible with other firmware (like EDK2)
      ● Applicable to all segments
      ● Open Source at GitHub with BSD-3-clause license
      UEFI
      ● Unified Extensible Firmware Interface: the firmware interface between the platform and the operating system.
      ● The predominant interfaces are in the boot services (BS), i.e. pre-OS. There are few runtime (RT) services.
      ● On AArch64, it (tianocore EDK2) works with ARM TF as BL33 in EL2
  16. APEI (ACPI Platform Error Interfaces)
      ● HEST: for runtime
      ● BERT: for last crash
      ● EINJ: for testing
      ● ERST: for storage
      Provides a standard way to convey error info from firmware to the OS
  17. APEI tables
      HEST (Hardware Error Source Table)
      Key info:
      ● HOW the error is triggered (notified)
      ● WHERE the error records are
      ● HOW to release the records' memory
      Error source types:
      ● For IA-32: MCE/CMC/NMI
      ● For PCI: AER Root Port/Endpoint/Bridge
      ● For generic hardware: GHES (Generic Hardware Error Source) v1/v2
      For ARM64: GHES v2
      ● HOW to get the trigger: Notification Structure
      ● WHERE the error records are: Error Status Address (GAS: Generic Address Structure)
      ● HOW to release the records' memory: Read Ack Register
  18. APEI tables
      BERT: Boot Error Record Table
      ● Records fatal errors, then reports them on the next boot
      CPER: Common Platform Error Record (defined in an appendix of the UEFI spec)
      ● A common record format through which the OS can receive a wide range of hardware error types.
  19. APEI tables -- ERST & EINJ
      ● ERST: Error Record Serialization Table
        ○ An operation abstraction: provides the details necessary to communicate with on-board persistent storage for error recording
      ● EINJ: Error Injection Table
        ○ An operation abstraction: provides a generic interface through which OSPM can inject hardware errors into the platform without requiring platform-specific software
  20. SDEI usage in RAS
      Software Delegated Exception Interface: an interface between FW & OS for registering, notifying and servicing system events using SMC/HVC.
      SDEI Specification (ARM DEN0054A)
  21. Prototype Solution for Error Handling
      ● MM Secure Partition, Secure Partition Manager
      ● Uncorrected error -- HEST & MM
  22. What Are We Doing
      ● Define standard interfaces to enable FF handling of AP RAS errors
      ● Demonstrate use of RAS extensions
      ● Demonstrate interfaces with reference software and platforms
        ○ uncorrected DIMM & CPU errors
  23. MM Secure Partition
      ● The MM Secure Partition implements management functions and runs in S-EL0 to achieve isolation from S-EL1 & EL3
        ○ Leverages existing firmware code based on EDK2: Standalone MM
      ● Partition communicates with ARM TF through a standard interface: MM_COMMUNICATE SMC
      ● Partition is managed by ARM TF
      ● ARM TF BL31 stage owns EL3 and S-EL1
      ● Secure partition resources are described in BL31 platform port
      ● Minimise code in EL3 and delegate RAS error handling
  24. Secure Partition Manager (SPM)
      ● Secure Partition Manager in BL31 exports a standard ABI to
        ○ Initialize the partition
        ○ Delegate SMC requests to the partition
  25. Uncorrected Error -- HEST & MM
  26. Uncorrected Error -- HEST & MM
      1. System boot: BootROM-->BL2-->BL3x
         a. BL31 initializes SPM (includes MM dispatcher) and SDEI dispatcher.
         b. UEFI (BL33), DXE, UEFI Platform Driver:
            i. queries SPI (Secure Partition Image, BL32) for error source info
            ii. SPI returns error source info back to UEFI
            iii. UEFI maps in the error record region and marks it as a Runtime Services Data Region
            iv. updates/adds error source info in HEST
      2. OS starts running: the HEST driver scans the HEST table and registers error handlers via SDEI
      3. A UE occurs; the event is routed to EL3 (SPM)
      4. SPM routes the event to the RAS error handler in S-EL0 (MM Foundation)
      5. MM Foundation creates the CPER blobs from the info in the RAS Extension registers
      6. SPM notifies SDEI to call the corresponding OS-registered handler
      7. OS gets the CPER blobs via the Error Status Address block, processes the error, and tries to recover
      8. OS reports the error event as a RAS event
      9. rasdaemon logs the error info from the RAS event to its recorder
  27. Demo Time
      ● Prototype RAS solution on FVP
        ○ arm-trusted-firmware (bl1, bl2, bl31)
        ○ tianocore edk2 (bl32, bl33)
        ○ Linux kernel, Shell command
  28. Status and Future Plans
      ● Current development status
      ● Ongoing development
      ● TODO list for Reference Solution
  29. Current development status
      ● Hardware
        ○ ARM engineers are working on FVP; the LEG team is developing on QEMU
        ○ RAS spec has been released (ARM DDI 0587A)
      ● Firmware
        ○ SDEI
          ■ SDEI Specification released (ARM DEN0054A)
          ■ SDEI added as a hardware error notification type in ACPI 6.2
          ■ Linux SDEI client implementation v3 patchset has been posted on the kvmarm and devicetree mailing lists
          ■ ACPICA support for SDEI upstreamed
          ■ SDEI DT bindings acked
          ■ ARM TF support posted to GitHub, including
            ● SDEI Dispatcher
            ● Framework for managing interrupts handled in EL3
      ● OS (Linux)
        ○ APEI on ARM64 can be enabled in the kernel
        ○ Memory failure support merged
  30. Ongoing development
      ● ARM TF
        ○ Simplify error interrupt handling for platform ports
        ○ Framework for handling External aborts (EA) in design
        ○ RAS Extensions support in design
          ■ ESB
          ■ RAS Error Record driver
      ● EDK2
        ○ Driver for creating APEI HEST under development
        ○ Library for creating APEI CPERs under development
        ○ Prototyping use of Standalone MM partition to create error records on QEMU
      ● OS (Linux)
        ○ KVM changes for virtualizing SDEI under development
  31. TODO list
      ● Hardware
        ○ Test on real hardware (ARMv8.2, including the RAS extension)
      ● Firmware
        ○ ARM-TF
          ■ Support for double fault handling
          ■ Support for v8.4 RAS Extensions
        ○ EDK2
          ■ Support for BERT
          ■ ERST and EINJ implementation
  32. Acknowledgments
      ● Achin Gupta (ARM)
      ● John Feeney (Red Hat)
      ● Leif Lindholm (ARM)
      ● Supreeth Venkatesh (ARM)
  33. Thank You
      #SFO17
      BUD17 keynotes and videos on: connect.linaro.org
      For further information: www.linaro.org
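
Slide 6 notes that with ECC a single-bit error becomes a CE, while multi-bit errors surface as UEs. The toy sketch below illustrates that distinction with an extended Hamming (SECDED) code over 4 data bits. It is not taken from the slides: real DIMM ECC typically protects 64-bit words with a wider code, and the function names encode and decode are made up for this illustration.

```python
# Illustrative SECDED sketch (assumption: 4 data bits, extended Hamming code).
# Real memory controllers use wider codes; the names here are hypothetical.

def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Positions 3, 5, 6, 7 carry data; positions 1, 2, 4 carry Hamming parity.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 is an overall parity bit; it separates 1-bit from 2-bit errors.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble) where status is 'OK', 'CE' or 'UE'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits in 1..7.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 0 while the total parity is still even
    if syndrome == 0 and overall == 0:
        status = "OK"
    elif overall == 1:
        # Odd parity: treat as a single flipped bit. The syndrome names the
        # position (0 means the overall parity bit itself). Corrected error.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Even parity but non-zero syndrome: two bits flipped. Uncorrected error.
        return "UE", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # inject a single-bit fault
    print(decode(word))          # corrected, data recovered
    word[2] ^= 1                 # a second fault in the same word
    print(decode(word))          # detected but not correctable
```

Counting how often decode returns CE for a given part is the kind of statistic the slide suggests feeding into scheduled maintenance, before a second fault in the same word turns it into a UE.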

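Slides 17 and 26 describe the GHES v2 contract: firmware places a record where the Error Status Address points, notifies the OS, and the OS releases the record memory through the Read Ack Register. The sketch below is a toy in-memory model of that handshake only. The class GhesV2Source and its methods are hypothetical names, a Python callback stands in for the SDEI notification, and a dict stands in for a CPER blob; none of this is the Linux or ARM TF implementation.

```python
# Toy model of the GHES v2 handshake (all names are illustrative assumptions).

class GhesV2Source:
    """One error source shared between firmware and the OS."""

    def __init__(self):
        self.error_status_block = None  # stands in for memory at the Error Status Address
        self.read_ack = True            # stands in for the Read Ack Register
        self.handler = None             # stands in for an SDEI-registered OS handler

    def register_handler(self, handler):
        """OS side: register a handler, as the HEST driver does via SDEI."""
        self.handler = handler

    def firmware_report(self, record):
        """Firmware side: publish a record and notify the OS.

        Returns False if the previous record has not been acknowledged yet,
        because the shared region must not be overwritten before the ack.
        """
        if not self.read_ack:
            return False
        self.error_status_block = record
        self.read_ack = False
        if self.handler is not None:
            self.handler(self)
        return True

    def os_consume(self):
        """OS side: read the record, then acknowledge so firmware can reuse the region."""
        record = self.error_status_block
        self.error_status_block = None
        self.read_ack = True
        return record


if __name__ == "__main__":
    log = []
    source = GhesV2Source()
    source.register_handler(lambda src: log.append(src.os_consume()))
    source.firmware_report({"section": "memory error", "severity": "recoverable"})
    print(log)
```

The point of the acknowledgement is visible when no handler is registered: a second firmware_report is refused until os_consume has run, which is how the shared record region is protected from being overwritten mid-read.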