SlideShare a Scribd company logo
SAN19-118:
RAS on ARM64
Reliability, Availability, and Serviceability (RAS) on
ARM64 status
Fu Wei <wefu@redhat.com>
AGENDA
1.Brief introduction
○ Architecture, RAS Extension, APEI, SDEI
2.Prototype overview
○ Firmware First Error Handling
■ Overview
■ APEI protocol (boot time)
■ CperLib (run time)
3.Status & Plans
○ SBSA QEMU & SPM
○ EINJ
Brief introduction
Architecture, RAS Extension, APEI, SDEI
RAS Architecture Overview
Hardware support for RAS
CPU
● ARMv8-A architecture (a mandatory
extension to ARMv8.2)
● EL2, EL3, or both
● Virtualization extension or Security
extensions or both
GICv3
● Interrupt routing modes
● Private and shared interrupts
(PPI/SPI)
● Ability to set an interrupt pending
event signaling and delegation
Interrupt groups/priority
RAS Extension
● ESB (Error Synchronization Barrier)
instructions
● RAS Extension registers
● Corrupted data poisoning
RAS Extension: Gather HW error info for FW
ESB instruction
Help to locate Error
RAS Extension registers
●Provide the error info to FW
●Control the Interrupt by FW
ARMv8-A RAS extensions standardize
the interface between HW and
FW
● a mandatory extension to ARMv8.2, an optional extension to the Armv8.0 and
Firmware
RAS
Extension
registers
Arm ® Reliability, Availability, and Serviceability
(RAS) Specification Documentations
The latest version for now is :
DDI 0587C.b
The latest version for now is :
DDI 0487E.a
● Secure Partition Manager in BL31
exports standard ABI to
− Initialize the partition
− Delegate SMC requests to the
partition
− For RAS, delegate RAS error
handling to a MM Secure
Partition of RAS
● MM Secure Partition implements
management functions , runs in S-
EL0
− Leverages existing firmware
code based on EDK2:
StandaloneMm
− Minimise code in EL3
Firmware support for RAS: SPM & MM Secure
Partition
Arm SPA Documentations
The latest version for now is :
DEN 0063 1.0.0-2 https://developer.arm.com/architectures/sec
urity-architectures/platform-security-
architecture
APEI
EINJERST
BERT HEST For runtimeFor last
crash
(critical error)
For Testing
For Storage Provides a standard way
to convey error info
from Firmware to OS
CPER
UEFI
For Error
info format
Firmware support for RAS: APEI (ACPI Platform Error Interfaces)
ACPI and UEFI Documentations
The latest version for now is :
Version 6.3
The latest version for now is :
Version 2.8
ACPI for the Armv8 RAS Extensions 1.0(Provisional)
This document describes ACPI extensions
that enable kernel first RAS handling for
systems that employ the Arm RAS
extensions.
For PEs, this specification covers:
1. Armv8 & 8.4 RAS extension
2. RAS system architecture v1.0 & v1.1.
AEST: Arm
Error Source
Table
Integrating
AEST into
HEST
Armv8 RAS
extension
section
ACPI CPER
Code First
From firmware to OS: SDEI usage in RAS
Software Delegated Exception Interface:An interface between FW & OS, for registering,
notifying and servicing system events using SMC/HVC. SDEI Specification (ARM DEN0054A)
Prototype overview
● Firmware First Error Handling
Firmware First Error Handling Overview
APEI protocol: Updating APEI table in boot time
CperLib : Generating the CPER blob in runtime
Status & Plans
SBSA QEMU & Secure Partition
TODO
SBSA QEMU & Secure Partition
❖ SBSA QEMU
➢ SBSA QEMU patchset has
been upstreamed
Great thanks to Hongbo and Radosław
❖ ARM-TF
➢ SBSA QEMU support has
been partly upstreamed,
and the rest of patches are
posted.
➢ StandaloneMm support(on
SBSA QEMU) patches are
debugged
SBSA QEMU & Secure Partition
❖ EDK2 and edk2-platform
➢ StandaloneMm support patchset has been upstreamed
➢ StandaloneMm support for SBSA QEMU patches are under
debugging.
➢ APEI protocol and CperLib patches are under debugging.
TODO
❖ Go on upstreaming ARM-TF
patchset (target: v2.2)
❖ Hardware
➢ Test on a real hardware
(ARMv8.2+)
❖Firmware
➢ EDK2:
■ Improve and upstream
APEI protocol and
CperLib code (target:
201912)
■ EINJ prototype(target:
before next Connect)
■ prototype solution of
BERT
EINJ: Error Injection with Common Fault
Injection Model Extension
Linux
EINJ
driver
APEI
Test
application
BL32 in SEL-0
MM
Foundation/Core
Non-secure mem for
firmware
RAS Node
registers
Common Fault
Injection Model
Extension
Trigger Error
Action Tables
EINJ
Shared mem
EDK2(BL33)
In EL-2
Acknowledgments
Achin Gupta
Arvind Chauhan
Charles Garcia-Tobin
Daniil Egranov
Dong Wei
Supreeth Venkatesh
Thomas Abraham
Leif Lindholm
Radosław Biernacki
Tanmay Jagdale
Zhang HongBo
Al Stone
John Feeney
Jon Masters
Thank you
Join Linaro to accelerate deployment of your Arm-based
solutions through collaboration
contactus@linaro.org
FYI
More detail for RAS on AArch64
What is RAS? -- Definition
Reliability
Continuity, Computation
needs be correct and reliable.
Availability
Readiness, System needs to
remain available as long as
possible.
Serviceability
Ability to undergo modifications
and repairs, system should provide
information to administrator to aid
in system servicing.
The RAS architecture primarily cares about
the ERRORs produced from HARDWARE.
Why do we need RAS? -- Reality
Impacts
Enterprise systems often provide
mission-critical services. Any
system failure or data corruption
impacts the customer’s business
and reputation
Inevitability
Although faults are rare,
enterprise systems can be
very large. So failures are
inevitable.
Why do we need RAS? -- Importance
So we inevitably have to maintain system very well.
For reducing the expense for maintenance, we should :
Replace
only failed
parts
Scheduled
maintenance
cheaper than
unscheduled service outages
History
ECC in memory controllers and I/O RAMs
Machine Check Architecture (MCA)
● A mechanism in which the CPU reports
hardware errors to the OS
• model-specific registers (MSRs)
■ set up machine checking
■ record detected errors
■ the info they contain is CPU
specific
• Machine Check Exception (MCE)
■ signals the detection of an
uncorrected machine-check
error
■ handler collect information
about error from MSRs
• Utility: mcelog
PCI-E: Advanced Error Reporting (AER)
Linux kernel
● EDAC (Error Detection and Correction)
• designed to report and possibly
act on hardware errors
• inspect the hardware directly
(system-specific handling and
decoding.)
• only support memory controller
and PCI/AGP errors
Firmware (first) FF
● APEI
● UEFI
Taxonomy of Error in Diagram
RAS Architecture
Firmware First error handling requires standard interfaces between multiple SW
components.
RAS Extension: ESB instruction
● ESB (Error Synchronization Barrier) can be used to synchronize Unrecoverable
errors(containable errors).
● Software can determine that:
○ The error was reported as Unrecoverable.
○ The preferred return address of the SEI is an ESB instruction.
○ The software between that ESB and the previous ESB can be isolated.
○ ESB might update :
■ DISR_EL1 / DISR (Deferred Interrupt Status Register)
■ VDISR_EL2 / VDISR (Virtual Deferred Interrupt Status Register)
RAS Extension: Registers
System register views
(in Arm ® Architecture Reference Manual)
● Processor Feature Register
● Memory Model Feature Register
● Error Record Register
○ Error Record ID Register
○ Error Record Select Register
○ [Selected Error Record] Feature Register
○ [...] Control Register
○ [...] Primary Status Register
○ [...] Address Register
○ [...] Miscellaneous Register [0-3]
● Deferred Interrupt Status Register
● Hypervisor Configuration Register
● Virtual SError Exception Syndrome
Register
● Virtual Deferred Interrupt Status Register
● Secure Configuration Register
Memory-mapped view
● Error Record Feature Register
● Error Record Control Register
● Error Record Primary Status Register
● Error Record Address Register
● Error Record Miscellaneous Register[0-3]
● Pseudo-fault Generation Feature register
● Pseudo-fault Generation Control register
● Pseudo-fault Generation Countdown register
● Error Group Status Register
● Implementation Identification Register
● Error Interrupt Configuration Register[FHI, ERI,
CRI][0-2]
● Error Interrupt Status Register
● Device Affinity Register
● Device Architecture Register
● Device Configuration Register
● Peripheral Identification Register[0-4]
● Component Identification Register[0-3]
APEI tables
BERT: Boot Error Record Table
Record fatal errors, then report it in the
second boot
CPER(in the Appendix of UEFI spec)
Common Platform Error Record
APEI tables
HEST: Hardware Error Source Table
● Key info:
HOW to get trigger
WHERE are the error records
HOW to release records’ mem
● For ARM64 : GHES v2
HOW to get trigger: Notification
Structure
WHERE are the error records:
Error Status Address
(GAS : Generic Address Structure)
HOW to release records’ mem:
Read Ack Register
For IA-32 :
MCE
CMC
NMI
For PCI: AER
Root Port
Endpoint
Bridge
For generic
hardware:
GHES V1/V2
(Generic
Hardware
Error Source)
APEI tables
ERST: Error Record Serialization Table
Operation abstract, provides details
necessary to communicate with on-board
persistent storage for error recording.
EINJ: Error Injection Table
Operation abstract, provides a generic
interface which OSPM can inject
hardware errors to the platform
without requiring platform specific
software.
ACPI for the Armv8 RAS Extensions 1.0(proposal-1)
AEST: Arm Error Source Table
AEST node are composed of the
following parts:
● A header
● A set of common fields for all error
nodes(Node generic data)
● A component specific section
● A section that describes the interfaces
associated with an error node.
● A section describing associated
interrupts with the error node.
ACPI for the Armv8 RAS Extensions 1.0(proposal-2)
Integrating AEST into HEST
Q: Do we have to add a brand new table for
the extension?
An alternative way is to integrate the “AEST”
nodes into the HEST directly by creating new
error source structure, Type: 12-ARMv8 RAS
Extension error node
ACPI for the Armv8 RAS Extensions 1.0(proposal-3)
CPER Armv8 RAS extension section
The section gathers the content of a RAS
extension node’s registers without any
specific information that can associate it
with components in the SoC, because it is
used alongside other error sections
already defined for CPER, such as
processor sections, generic and Arm, or
Memory sections.
Uncorrected Error -- HEST & MM
1. System boot: BootROM-->BL2-->BL3x
a. BL31 initializes SPM, SDEI
dispatcher and BL32(MM dispatcher)
b. UEFI (BL33), DXE, UEFI Platform
Driver:
i. query MM partitions by APEI
protocol for error source info
ii. MM partitions return error
source info back to UEFI
iii. UEFI map in and mark error
record region as Runtime
Services Data Region
iv. Update/add error source info
in HEST
2. OS starts running: HEST driver scan HEST
table and register error handlers by SDEI
3. UE occurred, the event will be routed to
EL3 (SPM)
4. SPM routes the event to RAS error
handler in S-EL0 (MM partitions)
5. MM Foundation (CperLib) creates the
CPER blobs by the info from RAS
Extension
6. SPM notifies SDEI to call the
corresponding OS registered handler
7. OS gets the CPER blobs by Error Status
Address block, process the error, try to
recovery.
8. report the error event by RAS event
9. rasdaemon log error info from RAS event
to recorder
Boot Time RunTime
APEI protocol: API
Main
● Apei.c
○ UpdateApei :
updates error source information to
the specified APEI(HEST/BERT) Table
typedef
EFI_STATUS
(EFIAPI *EFI_APEI_UPDATE) (
IN CONST EFI_APEI_PROTOCOL *This,
IN UINT32
Signature
);
● ApeiCommon.c
○ GetApeiTable
○ AcpiTableChecksum
For HEST table:
● ApeiHest.c
− UpdateHest
■ GetErrorSources
■ UpdateGhes
For BERT table(TBD):
● ApeiBert.c:
○ UpdateBert
■ GetErrorInfo
■ FillBert
CperLib: API
● CperInit:
parse the error source info, and return
the error record address info
EFI_STATUS
EFIAPI
CperInit (
IN EFI_APEI_ERROR_SOURCE *ErrorSource,
IN OUT UINTN *ErrorRecordAddress
);
The Dynamic Tables module passes the
ErrorSource to help CperLib get the
memory region info for other APIs, like
CperWrite
● CperWrite:
creates a CPER blob at the defined
memory region from Section info data
EFI_STATUS
EFIAPI
CperWrite (
IN SECTIONS_INFO *SectionInfo,
IN UINTN ErrorRecordAddress
);
Wrap the given section data into CPER
blob and put it into the specific
memory region
BERT : When catastrophic errors occur(TODO)
● if whole system has
crashed down, BMC or
other system controller
should generate the error
blob, and save it.
BERT: After emergency reboot(to be improved)

More Related Content

What's hot

SFO15-TR9: PSCI, ACPI (and UEFI to boot)
SFO15-TR9: PSCI, ACPI (and UEFI to boot)SFO15-TR9: PSCI, ACPI (and UEFI to boot)
SFO15-TR9: PSCI, ACPI (and UEFI to boot)
Linaro
 
LCU14 500 ARM Trusted Firmware
LCU14 500 ARM Trusted FirmwareLCU14 500 ARM Trusted Firmware
LCU14 500 ARM Trusted Firmware
Linaro
 
Arm Processors Architectures
Arm Processors ArchitecturesArm Processors Architectures
Arm Processors Architectures
Mohammed Hilal
 
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
44CON
 
LCU13: An Introduction to ARM Trusted Firmware
LCU13: An Introduction to ARM Trusted FirmwareLCU13: An Introduction to ARM Trusted Firmware
LCU13: An Introduction to ARM Trusted Firmware
Linaro
 
AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor Architecture
AMD
 
Pcie drivers basics
Pcie drivers basicsPcie drivers basics
Pcie drivers basics
Venkatesh Malla
 
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML AcceleratorsRISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V International
 
Accelerating Innovation from Edge to Cloud
Accelerating Innovation from Edge to CloudAccelerating Innovation from Edge to Cloud
Accelerating Innovation from Edge to Cloud
Rebekah Rodriguez
 
Introduction to armv8 aarch64
Introduction to armv8 aarch64Introduction to armv8 aarch64
Introduction to armv8 aarch64
Yi-Hsiu Hsu
 
Interrupts.ppt
Interrupts.pptInterrupts.ppt
Interrupts.ppt
SasiBhushan22
 
Hardware accelerated Virtualization in the ARM Cortex™ Processors
Hardware accelerated Virtualization in the ARM Cortex™ ProcessorsHardware accelerated Virtualization in the ARM Cortex™ Processors
Hardware accelerated Virtualization in the ARM Cortex™ Processors
The Linux Foundation
 
Linux on ARM 64-bit Architecture
Linux on ARM 64-bit ArchitectureLinux on ARM 64-bit Architecture
Linux on ARM 64-bit Architecture
Ryo Jin
 
Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power
Deepak Shankar
 
IP PCIe
IP PCIeIP PCIe
IP PCIe
SILKAN
 
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor CoreZen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
AMD
 
Project ACRN expose and pass through platform hidden PCIe devices to SOS
Project ACRN expose and pass through platform hidden PCIe devices to SOSProject ACRN expose and pass through platform hidden PCIe devices to SOS
Project ACRN expose and pass through platform hidden PCIe devices to SOS
Project ACRN
 
Multicore Processors
Multicore ProcessorsMulticore Processors
Multicore Processors
Smruti Sarangi
 
Dd sdram
Dd sdramDd sdram
Dd sdram
Ratnakar Varun
 
Linux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBLinux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKB
shimosawa
 

What's hot (20)

SFO15-TR9: PSCI, ACPI (and UEFI to boot)
SFO15-TR9: PSCI, ACPI (and UEFI to boot)SFO15-TR9: PSCI, ACPI (and UEFI to boot)
SFO15-TR9: PSCI, ACPI (and UEFI to boot)
 
LCU14 500 ARM Trusted Firmware
LCU14 500 ARM Trusted FirmwareLCU14 500 ARM Trusted Firmware
LCU14 500 ARM Trusted Firmware
 
Arm Processors Architectures
Arm Processors ArchitecturesArm Processors Architectures
Arm Processors Architectures
 
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
 
LCU13: An Introduction to ARM Trusted Firmware
LCU13: An Introduction to ARM Trusted FirmwareLCU13: An Introduction to ARM Trusted Firmware
LCU13: An Introduction to ARM Trusted Firmware
 
AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor Architecture
 
Pcie drivers basics
Pcie drivers basicsPcie drivers basics
Pcie drivers basics
 
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML AcceleratorsRISC-V & SoC Architectural Exploration for AI and ML Accelerators
RISC-V & SoC Architectural Exploration for AI and ML Accelerators
 
Accelerating Innovation from Edge to Cloud
Accelerating Innovation from Edge to CloudAccelerating Innovation from Edge to Cloud
Accelerating Innovation from Edge to Cloud
 
Introduction to armv8 aarch64
Introduction to armv8 aarch64Introduction to armv8 aarch64
Introduction to armv8 aarch64
 
Interrupts.ppt
Interrupts.pptInterrupts.ppt
Interrupts.ppt
 
Hardware accelerated Virtualization in the ARM Cortex™ Processors
Hardware accelerated Virtualization in the ARM Cortex™ ProcessorsHardware accelerated Virtualization in the ARM Cortex™ Processors
Hardware accelerated Virtualization in the ARM Cortex™ Processors
 
Linux on ARM 64-bit Architecture
Linux on ARM 64-bit ArchitectureLinux on ARM 64-bit Architecture
Linux on ARM 64-bit Architecture
 
Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power Evaluating UCIe based multi-die SoC to meet timing and power
Evaluating UCIe based multi-die SoC to meet timing and power
 
IP PCIe
IP PCIeIP PCIe
IP PCIe
 
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor CoreZen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64 Microprocessor Core
 
Project ACRN expose and pass through platform hidden PCIe devices to SOS
Project ACRN expose and pass through platform hidden PCIe devices to SOSProject ACRN expose and pass through platform hidden PCIe devices to SOS
Project ACRN expose and pass through platform hidden PCIe devices to SOS
 
Multicore Processors
Multicore ProcessorsMulticore Processors
Multicore Processors
 
Dd sdram
Dd sdramDd sdram
Dd sdram
 
Linux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKBLinux Kernel Booting Process (1) - For NLKB
Linux Kernel Booting Process (1) - For NLKB
 

Similar to Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118

Spike yuan server ras and uefi cper final
Spike yuan  server ras and uefi cper finalSpike yuan  server ras and uefi cper final
Spike yuan server ras and uefi cper final
parth bera
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 Servers
Linaro
 
Virtualization Support in ARMv8+
Virtualization Support in ARMv8+Virtualization Support in ARMv8+
Virtualization Support in ARMv8+
Aananth C N
 
Slimline Open Firmware
Slimline Open FirmwareSlimline Open Firmware
Slimline Open Firmware
Heiko Joerg Schick
 
ACPI In Windows Vista
ACPI In Windows VistaACPI In Windows Vista
ACPI In Windows Vista
spot2
 
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdfBuilding PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Paul Yang
 
Power8 sales exam prep
 Power8 sales exam prep Power8 sales exam prep
Power8 sales exam prep
Jason Wong
 
AMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_FinalAMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_Final
Alan Su
 
Q4.11: ARM Technology Update Plenary
Q4.11: ARM Technology Update PlenaryQ4.11: ARM Technology Update Plenary
Q4.11: ARM Technology Update Plenary
Linaro
 
Arm architecture
Arm architectureArm architecture
Arm architecture
MinYeop Na
 
The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
Heiko Joerg Schick
 
Developing a Windows CE OAL.ppt
Developing a Windows CE OAL.pptDeveloping a Windows CE OAL.ppt
Developing a Windows CE OAL.ppt
KundanSingh887495
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI Express
Heiko Joerg Schick
 
Q4.11: ARM Architecture
Q4.11: ARM ArchitectureQ4.11: ARM Architecture
Q4.11: ARM Architecture
Linaro
 
Assignment
AssignmentAssignment
Assignment
Abu Md Choudhury
 
Overview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit MicrocontrollersOverview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit Microcontrollers
Premier Farnell
 
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral IntegrationA 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
Talal Khaliq
 
AAME ARM Techcon2013 002v02 Advanced Features
AAME ARM Techcon2013 002v02 Advanced FeaturesAAME ARM Techcon2013 002v02 Advanced Features
AAME ARM Techcon2013 002v02 Advanced Features
Anh Dung NGUYEN
 
Unit vi (1)
Unit vi (1)Unit vi (1)
Unit vi (1)
Siva Nageswararao
 
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
ssuser65bfce
 

Similar to Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118 (20)

Spike yuan server ras and uefi cper final
Spike yuan  server ras and uefi cper finalSpike yuan  server ras and uefi cper final
Spike yuan server ras and uefi cper final
 
HKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 ServersHKG18-116 - RAS Solutions for Arm64 Servers
HKG18-116 - RAS Solutions for Arm64 Servers
 
Virtualization Support in ARMv8+
Virtualization Support in ARMv8+Virtualization Support in ARMv8+
Virtualization Support in ARMv8+
 
Slimline Open Firmware
Slimline Open FirmwareSlimline Open Firmware
Slimline Open Firmware
 
ACPI In Windows Vista
ACPI In Windows VistaACPI In Windows Vista
ACPI In Windows Vista
 
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdfBuilding PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
 
Power8 sales exam prep
 Power8 sales exam prep Power8 sales exam prep
Power8 sales exam prep
 
AMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_FinalAMulti-coreSoftwareHardwareCo-DebugPlatform_Final
AMulti-coreSoftwareHardwareCo-DebugPlatform_Final
 
Q4.11: ARM Technology Update Plenary
Q4.11: ARM Technology Update PlenaryQ4.11: ARM Technology Update Plenary
Q4.11: ARM Technology Update Plenary
 
Arm architecture
Arm architectureArm architecture
Arm architecture
 
The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
 
Developing a Windows CE OAL.ppt
Developing a Windows CE OAL.pptDeveloping a Windows CE OAL.ppt
Developing a Windows CE OAL.ppt
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI Express
 
Q4.11: ARM Architecture
Q4.11: ARM ArchitectureQ4.11: ARM Architecture
Q4.11: ARM Architecture
 
Assignment
AssignmentAssignment
Assignment
 
Overview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit MicrocontrollersOverview of ST7 8-bit Microcontrollers
Overview of ST7 8-bit Microcontrollers
 
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral IntegrationA 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
A 32-Bit Parameterized Leon-3 Processor with Custom Peripheral Integration
 
AAME ARM Techcon2013 002v02 Advanced Features
AAME ARM Techcon2013 002v02 Advanced FeaturesAAME ARM Techcon2013 002v02 Advanced Features
AAME ARM Techcon2013 002v02 Advanced Features
 
Unit vi (1)
Unit vi (1)Unit vi (1)
Unit vi (1)
 
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
Assembly Language for x86 Processors 7th Edition Chapter 2 : x86 Processor Ar...
 

Recently uploaded

How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
ToXSL Technologies
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
ICS
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
Green Software Development
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
Alina Yurenko
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
brainerhub1
 
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
kalichargn70th171
 
SQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure MalaysiaSQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure Malaysia
GohKiangHock
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
Bert Jan Schrijver
 
Liberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptxLiberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptx
Massimo Artizzu
 
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
safelyiotech
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
Rakesh Kumar R
 
fiscal year variant fiscal year variant.
fiscal year variant fiscal year variant.fiscal year variant fiscal year variant.
fiscal year variant fiscal year variant.
AnkitaPandya11
 
What next after learning python programming basics
What next after learning python programming basicsWhat next after learning python programming basics
What next after learning python programming basics
Rakesh Kumar R
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
dakas1
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 

Recently uploaded (20)

How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
 
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
 
SQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure MalaysiaSQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure Malaysia
 
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
J-Spring 2024 - Going serverless with Quarkus, GraalVM native images and AWS ...
 
Liberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptxLiberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptx
 
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
Safelyio Toolbox Talk Softwate & App (How To Digitize Safety Meetings)
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
 
fiscal year variant fiscal year variant.
fiscal year variant fiscal year variant.fiscal year variant fiscal year variant.
fiscal year variant fiscal year variant.
 
What next after learning python programming basics
What next after learning python programming basicsWhat next after learning python programming basics
What next after learning python programming basics
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 

Reliability, Availability, and Serviceability (RAS) on ARM64 status - SAN19-118

  • 1. SAN19-118: RAS on ARM64 Reliability, Availability, and Serviceability (RAS) on ARM64 status Fu Wei <wefu@redhat.com>
  • 2. AGENDA 1.Brief introduction ○ Architecture, RAS Extension, APEI, SDEI 2.Prototype overview ○ Firmware First Error Handling ■ Overview ■ APEI protocol (boot time) ■ CperLib (run time) 3.Status & Plans ○ SBSA QEMU & SPM ○ EINJ
  • 3. Brief introduction Architecture, RAS Extension, APEI, SDEI
  • 5. Hardware support for RAS CPU ● ARMv8-A architecture (a mandatory extension to ARMv8.2) ● EL2, EL3, or both ● Virtualization extension or Security extensions or both GICv3 ● Interrupt routing modes ● Private and shared interrupts (PPI/SPI) ● Ability to set an interrupt pending event signaling and delegation Interrupt groups/priority RAS Extension ● ESB (Error Synchronization Barrier) instructions ● RAS Extension registers ● Corrupted data poisoning
  • 6. RAS Extension: Gather HW error info for FW ESB instruction Help to locate Error RAS Extension registers ●Provide the error info to FW ●Control the Interrupt by FW ARMv8-A RAS extensions standardize the interface between HW and FW ● a mandatory extension to ARMv8.2, an optional extension to the Armv8.0 and Firmware RAS Extension registers
  • 7. Arm ® Reliability, Availability, and Serviceability (RAS) Specification Documentations The latest version for now is : DDI 0587C.b The latest version for now is : DDI 0487E.a
  • 8. ● Secure Partition Manager in BL31 exports standard ABI to − Initialize the partition − Delegate SMC requests to the partition − For RAS, delegate RAS error handling to a MM Secure Partition of RAS ● MM Secure Partition implements management functions , runs in S- EL0 − Leverages existing firmware code based on EDK2: StandaloneMm − Minimise code in EL3 Firmware support for RAS: SPM & MM Secure Partition
  • 9. Arm SPA Documentations The latest version for now is : DEN 0063 1.0.0-2 https://developer.arm.com/architectures/sec urity-architectures/platform-security- architecture
  • 10. APEI EINJERST BERT HEST For runtimeFor last crash (critical error) For Testing For Storage Provides a standard way to convey error info from Firmware to OS CPER UEFI For Error info format Firmware support for RAS: APEI (ACPI Platform Error Interfaces)
  • 11. ACPI and UEFI Documentations The latest version for now is : Version 6.3 The latest version for now is : Version 2.8
  • 12. ACPI for the Armv8 RAS Extensions 1.0(Provisional) This document describes ACPI extensions that enable kernel first RAS handling for systems that employ the Arm RAS extensions. For PEs, this specification covers: 1. Armv8 & 8.4 RAS extension 2. RAS system architecture v1.0 & v1.1. AEST: Arm Error Source Table Integrating AEST into HEST Armv8 RAS extension section ACPI CPER Code First
  • 13. From firmware to OS: SDEI usage in RAS Software Delegated Exception Interface:An interface between FW & OS, for registering, notifying and servicing system events using SMC/HVC. SDEI Specification (ARM DEN0054A)
  • 14. Prototype overview ● Firmware First Error Handling
  • 15. Firmware First Error Handling Overview
  • 16. APEI protocol: Updating APEI table in boot time
  • 17. CperLib : Generating the CPER blob in runtime
  • 18. Status & Plans SBSA QEMU & Secure Partition TODO
  • 19. SBSA QEMU & Secure Partition ❖ SBSA QEMU ➢ SBSA QEMU patchset has been upstreamed Great thanks to Hongbo and Radosław ❖ ARM-TF ➢ SBSA QEMU support has been partly upstreamed, and the rest of patches are posted. ➢ StandaloneMm support(on SBSA QEMU) patches are debugged
  • 20. SBSA QEMU & Secure Partition ❖ EDK2 and edk2-platform ➢ StandaloneMm support patchset has been upstreamed ➢ StandaloneMm support for SBSA QEMU patches are under debugging. ➢ APEI protocol and CperLib patches are under debugging.
  • 21. TODO ❖ Go on upstreaming ARM-TF patchset (target: v2.2) ❖ Hardware ➢ Test on a real hardware (ARMv8.2+) ❖Firmware ➢ EDK2: ■ Improve and upstream APEI protocol and CperLib code (target: 201912) ■ EINJ prototype(target: before next Connect) ■ prototype solution of BERT
  • 22. EINJ: Error Injection with Common Fault Injection Model Extension Linux EINJ driver APEI Test application BL32 in SEL-0 MM Foundation/Core Non-secure mem for firmware RAS Node registers Common Fault Injection Model Extension Trigger Error Action Tables EINJ Shared mem EDK2(BL33) In EL-2
  • 23. Acknowledgments Achin Gupta Arvind Chauhan Charles Garcia-Tobin Daniil Egranov Dong Wei Supreeth Venkatesh Thomas Abraham Leif Lindholm Radosław Biernacki Tanmay Jagdale Zhang HongBo Al Stone John Feeney Jon Masters
  • 24. Thank you Join Linaro to accelerate deployment of your Arm-based solutions through collaboration contactus@linaro.org
  • 25. FYI More detail for RAS on AArch64
  • 26. What is RAS? -- Definition Reliability Continuity, Computation needs be correct and reliable. Availability Readiness, System needs to remain available as long as possible. Serviceability Ability to undergo modifications and repairs, system should provide information to administrator to aid in system servicing. The RAS architecture primarily cares about the ERRORs produced from HARDWARE.
  • 27. Why do we need RAS? -- Reality Impacts Enterprise systems often provide mission-critical services. Any system failure or data corruption impacts the customer’s business and reputation Inevitability Although faults are rare, enterprise systems can be very large. So failures are inevitable.
  • 28. Why do we need RAS? -- Importance So we inevitably have to maintain system very well. For reducing the expense for maintenance, we should : Replace only failed parts Scheduled maintenance cheaper than unscheduled service outages
  • 29. History ECC in memory controllers and I/O RAMs Machine Check Architecture (MCA) ● A mechanism in which the CPU reports hardware errors to the OS • model-specific registers (MSRs) ■ set up machine checking ■ record detected errors ■ the info they contain is CPU specific • Machine Check Exception (MCE) ■ signals the detection of an uncorrected machine-check error ■ handler collect information about error from MSRs • Utility: mcelog PCI-E: Advanced Error Reporting (AER) Linux kernel ● EDAC (Error Detection and Correction) • designed to report and possibly act on hardware errors • inspect the hardware directly (system-specific handling and decoding.) • only support memory controller and PCI/AGP errors Firmware (first) FF ● APEI ● UEFI
  • 30. Taxonomy of Error in Diagram
  • 31. RAS Architecture Firmware First error handling requires standard interfaces between multiple SW components.
  • 32. RAS Extension: ESB instruction ● ESB (Error Synchronization Barrier) can be used to synchronize Unrecoverable errors(containable errors). ● Software can determine that: ○ The error was reported as Unrecoverable. ○ The preferred return address of the SEI is an ESB instruction. ○ The software between that ESB and the previous ESB can be isolated. ○ ESB might update : ■ DISR_EL1 / DISR (Deferred Interrupt Status Register) ■ VDISR_EL2 / VDISR (Virtual Deferred Interrupt Status Register)
  • 33. RAS Extension: Registers System register views (in Arm ® Architecture Reference Manual) ● Processor Feature Register ● Memory Model Feature Register ● Error Record Register ○ Error Record ID Register ○ Error Record Select Register ○ [Selected Error Record] Feature Register ○ [...] Control Register ○ [...] Primary Status Register ○ [...] Address Register ○ [...] Miscellaneous Register [0-3] ● Deferred Interrupt Status Register ● Hypervisor Configuration Register ● Virtual SError Exception Syndrome Register ● Virtual Deferred Interrupt Status Register ● Secure Configuration Register Memory-mapped view ● Error Record Feature Register ● Error Record Control Register ● Error Record Primary Status Register ● Error Record Address Register ● Error Record Miscellaneous Register[0-3] ● Pseudo-fault Generation Feature register ● Pseudo-fault Generation Control register ● Pseudo-fault Generation Countdown register ● Error Group Status Register ● Implementation Identification Register ● Error Interrupt Configuration Register[FHI, ERI, CRI][0-2] ● Error Interrupt Status Register ● Device Affinity Register ● Device Architecture Register ● Device Configuration Register ● Peripheral Identification Register[0-4] ● Component Identification Register[0-3]
  • 34. APEI tables BERT: Boot Error Record Table Record fatal errors, then report it in the second boot CPER(in the Appendix of UEFI spec) Common Platform Error Record
  • 35. APEI tables HEST: Hardware Error Source Table ● Key info: HOW to get trigger WHERE are the error records HOW to release records’ mem ● For ARM64 : GHES v2 HOW to get trigger: Notification Structure WHERE are the error records: Error Status Address (GAS : Generic Address Structure) HOW to release records’ mem: Read Ack Register For IA-32 : MCE CMC NMI For PCI: AER Root Port Endpoint Bridge For generic hardware: GHES V1/V2 (Generic Hardware Error Source)
  • 36. APEI tables ERST: Error Record Serialization Table Operation abstract, provides details necessary to communicate with on-board persistent storage for error recording. EINJ: Error Injection Table Operation abstract, provides a generic interface which OSPM can inject hardware errors to the platform without requiring platform specific software.
  • 37. ACPI for the Armv8 RAS Extensions 1.0(proposal-1) AEST: Arm Error Source Table AEST node are composed of the following parts: ● A header ● A set of common fields for all error nodes(Node generic data) ● A component specific section ● A section that describes the interfaces associated with an error node. ● A section describing associated interrupts with the error node.
  • 38. ACPI for the Armv8 RAS Extensions 1.0(proposal-2) Integrating AEST into HEST Q: Do we have to add a brand new table for the extension? An alternative way is to integrate the “AEST” nodes into the HEST directly by creating new error source structure, Type: 12-ARMv8 RAS Extension error node
  • 39. ACPI for the Armv8 RAS Extensions 1.0(proposal-3) CPER Armv8 RAS extension section The section gathers the content of a RAS extension node’s registers without any specific information that can associate it with components in the SoC, because it is used alongside other error sections already defined for CPER, such as processor sections, generic and Arm, or Memory sections.
  • 40. Uncorrected Error -- HEST & MM 1. System boot: BootROM-->BL2-->BL3x a. BL31 initializes SPM, SDEI dispatcher and BL32(MM dispatcher) b. UEFI (BL33), DXE, UEFI Platform Driver: i. query MM partitions by APEI protocol for error source info ii. MM partitions return error source info back to UEFI iii. UEFI map in and mark error record region as Runtime Services Data Region iv. Update/add error source info in HEST 2. OS starts running: HEST driver scan HEST table and register error handlers by SDEI 3. UE occurred, the event will be routed to EL3 (SPM) 4. SPM routes the event to RAS error handler in S-EL0 (MM partitions) 5. MM Foundation (CperLib) creates the CPER blobs by the info from RAS Extension 6. SPM notifies SDEI to call the corresponding OS registered handler 7. OS gets the CPER blobs by Error Status Address block, process the error, try to recovery. 8. report the error event by RAS event 9. rasdaemon log error info from RAS event to recorder Boot Time RunTime
  • 41. APEI protocol: API Main ● Apei.c ○ UpdateApei : updates error source information to the specified APEI(HEST/BERT) Table typedef EFI_STATUS (EFIAPI *EFI_APEI_UPDATE) ( IN CONST EFI_APEI_PROTOCOL *This, IN UINT32 Signature ); ● ApeiCommon.c ○ GetApeiTable ○ AcpiTableChecksum For HEST table: ● ApeiHest.c − UpdateHest ■ GetErrorSources ■ UpdateGhes For BERT table(TBD): ● ApeiBert.c: ○ UpdateBert ■ GetErrorInfo ■ FillBert
  • 42. CperLib: API ● CperInit: parse the error source info, and return the error record address info EFI_STATUS EFIAPI CperInit ( IN EFI_APEI_ERROR_SOURCE *ErrorSource, IN OUT UINTN *ErrorRecordAddress ); The Dynamic Tables module passes the ErrorSource to help CperLib get the memory region info for other APIs, like CperWrite ● CperWrite: creates a CPER blob at the defined memory region from Section info data EFI_STATUS EFIAPI CperWrite ( IN SECTIONS_INFO *SectionInfo, IN UINTN ErrorRecordAddress ); Wrap the given section data into CPER blob and put it into the specific memory region
  • 43. BERT : When catastrophic errors occur(TODO) ● if whole system has crashed down, BMC or other system controller should generate the error blob, and save it.
  • 44. BERT: After emergency reboot(to be improved)