SlideShare a Scribd company logo
1 of 38
Download to read offline
ARM64 Server RAS Solutions
Jonathan (Zhixiong) Zhang
Cavium Inc.
Agenda
● Overview
● Solutions
● Building blocks
● Reflections
Overview
Reliability, Availability, Serviceability
RAS is one of the most important aspects of ARM64 servers’ success
Progresses
● RAS on ARM64
● SFO17
● Fu Wei
● Focus on OS and
ACPI
Progresses
● RAS on AARCH64
● SFO17
● Fu Wei, Supreeth
Venkatesh
● Focus on prototype
solution
Discussion Focus
● RAS is a 3
dimensional
problem!
● 2 Dimensions: from
CPU, to board, to
rack
● 1 dimension: from
hardware, to
firmware to
software
● Product ready
solutions
Who needs what – end user, apps
developer● Accurate error reporting
● Right level error reporting
● Low cycle stealing
● Quarantined uncorrectable error
Who needs what – data centre admin
● Accurate error reporting
● Reliable error reporting
● Uniformed error handling policy
● Actionable error reporting
Who needs what – Platform Vendor
Validation through production software stack
● Example: Memory stability has to be validated through running all
hardware threads, exercising all DIMMs through OS memory test tool.
● Example: PCIe tuning issue can be spotted through thousands of
reboots.
Who needs what – SOC Vendor
Validation through production software stack
● Validation/diag tool does not exercise the SOC enough.
● Subsystem design/programming issue needs to be unmasked earlier
in the engineering cycle.
Cost-effective support process
● Expedited problem root cause.
● Reduced reliant on hardware debugger.
Solutions
Early boot time error notification
● Post code, to diagnose halted boot.
● SEL record, to diagnose successful boot with component failure
Run time error notification
Catastrophic Error
1. GPIO interrupt. Out Of Band.
2. SEL record. Out Of Band.
3. BERT record. In Band
4. Crash dump. Out Of Band and
In Band; Catastrophic error
only.
Non-catastrophic Error
1. GPIO interrupt. Out Of Band.
2. SEL record. Out Of Band.
3. HEST record. In Band
4. GHESv2 notification. In Band
a. Polled
b. Interrupt
c. SDEI
d. SEA
e. SEI
DRAM error handling
Boot time issue
● Reasons:
● DIMM training failure.
● SPD access failure.
● Reponses:
○ Halt the boot.
○ Continue booting with SEL record
sent.
Run time issue
● Correctable error
● Uncorrectable error in non-
secure memory
● Recoverable.
● Unrecovable, in non OS critical
region
● Unrecovable, in OS critical region.
● Uncorrectable error in secure
memory
DRAM error reporting
Error location types concerned:
a) Apps developer: Physical address
b) Data center admin: FRU (Field Replaceable Unit)
c) Platform vendor: DRAM address
Channel interleaving complicates address translation:
DRAM address
Transaction
address
Physical address
Memory scrubbing
DRAM scrubbing prevents future error from happening.
a. Firmware initiated on-demand scrubbing, upon memory error.
b. Firmware initiated interval based scrubbing, as defined by user.
c. OS initiated on-demand scrubbing.
1. Platform exposes scrubbing support capability to OS through ACPI RASF table.
2. OS setup, start, stop scrubbing through RASF PCC channel.
PCIe error handling -- workflow
Firmware First Handling being asked.
1. PCIe device (such as end point device) detects error.
2. The error is reported up through AER (Advanced Error Reporting)
mechanism.
3. PCIe Root Complex on the SoC issues error interrupt.
4. Firmware secure interrupt handler analyzes and reports the error.
PCIe error handling -- ACPI
a) OSC: OS conveys APEI support to platform.
b) OSC: When OS requests control over PCIe AER, platform does not give
the control.
c) APEI error source: GHESv2 error source, no PCIe RP AER structure, no
PCIe device AER structure.
d) CPER: PCIe error section.
PCIe error handling – SEL sample
a) Multiple SEL records generated per error.
b) Some are generated from error producer, others from error
consumer.
c) SEL records hold info such as error type (bus error, PERR, SERR,
etc.), BDF numbers, vendor/device IDs, error ID (bad TLP, bad DLLP,
etc.).
Power/Thermal management
Sensors
a. Platform sensors
b. DIMM sensors
c. On-chip sensors
Defenses
d. DIMM temperature based memory throttling
e. Other sensor temperature based Close Loop Thermal Throttling
f. Last defense: Software and hardware based thermal trip
Power optimizations
g. Autonomous DVFS
h. Turbo mode and CPPC
Catastrophic Error -- Detection
a) Hardware error interrupt handler in Trusted Firmware – A (AKA. ARM
Trusted Firmware).
b) Secure watchdog bite in on-chip management controller
a) Watchdog signal 0 routed as SPI to GIC.
b) Watchdog signal 1 routed to the platform, such as on-chip management controller
c) BMC (Board Management Controller) detected heart beat stop from
on-chip management controller
d) OS initiated through ACPI PDTT mechanism
Catastrophic Error – crash dump
a) Contains data needed for failure analysis.
b) Vendor proprietary tool provided to analyze the failure.
c) Modularized to provide fault tolerance.
d) Saved in both non-volatile storage and BMC.
Building Blocks
Early boot time considerations
Report issues happened before interrupt can be enabled:
a. Check error status register at certain check points.
b. Check error status register right before interrupt is enabled.
c. Enable hardware error interrupt as soon as possible.
Prevent false positives:
d. Filter out noises happened during training/initialization process.
e. Clear status registers before enabling hardware error interrupt.
SEL record
I2C bus access conflict resolution
a. Dedicated I2C connection between Trusted Firmware – A and BMC.
b. Dedicated I2C connection between non-secure world (UEFI/OS) and
BMC.
BMC to handle simultaneous SEL record access
ARM TF
UEFI/OS
BMC
I2C
I2C
Configuration
Early boot loader configuration
a. Some customers want boot to be halted upon single DIMM training
failure.
b. Other customers want boot to continue unless all DIMMs failed.
ARM TF configuration (For example, memory CE threshold needs)
c. Length of sliding window.
d. The threshold number.
Enablement of display/update of configurations in OS
e. Stored in a partition in BootRom, separated from UEFI variable
partition.
f. Exposed through UEFI variable run time service.
g. Protected from firmware update.
h. Able to be set to factory default.
Error injection
Memory Error
● Inject into a specific physical
memory
● Inject into a specific DIMM
location
PCIe Error
● Inject through Root Port
● Inject through End Point device
● Inject through PCIe analyzer
Reflections
Hardware design
Reduce cycle stealing
a. Route hardware error interrupt to on-chip management controller
b. Manage Correctable Error threshold in hardware
Standardized design
c. Support RAS extension
Firmware RAS Features
● Exception Handling Framework in Trusted Firmware – A.
○ A framework for triaging RAS errors in EL3 prior to handling or delegation to lower EL
○ Up-stream support available for triaging RAS error interrupts and delegation through
SDEI
○ Support for v8.2 RAS extension under review
■ Error synchronization barrier
■ Standard Error records
○ Support for v8.4 RAS extensions planned next
● Software Delegated Exception Interface dispatcher in Trusted Firmware
– A.
○ A mechanism to deliver high priority non-maskable events to Normal world
○ Hooks into EHF as a way of delegating RAS errors to Normal world
○ Up-stream support available for physical SDEI dispatcher
● RAS error handling in a Secure partition
○ Support for CPER creation in a EDK2 Standalone MM partition in S-EL0 under
development
○ Enables isolation of platform specific code in a EL3 managed sandbox
● DRAM error handling prototyped on SGI-575 using above components
Kernel RAS Features
● Memory failure handling merged
● All Arm specific APEI notifications (SEA, SEI, SDEI) work
● Patches to make APEI support aware of multiple NMI sources under
review
● Support for SDEI Client driver merged
● Support in APEI for registering SDEI notification handler under review
● Support for IESB in v8.2 RAS Extension merged
○ SErrors normally unmasked when in kernel
● Errors in KVM-Guest memory forwarded to Qemu/Kvmtool
● Support for notifying KVM-Guest under development
User space daemon
BMC
ARM Server Base Manageability Guide
Firmware First vs. Kernel First
● In general, firmware first is asked by server industry.
● Kernel first is desired for some products.
● ACPI table available for review that describes the RAS error node
topology to the kernel.
● https://connect.arm.com/dropzone/systemarch/DEN0061A-RAS_ACPI-1.0
alp1.pdf
Challenges
New technologies
a. NVDIMM: ACPI NVDIMM Sub Team working on a number of proposals.
b. New DIMM technologies: DDR5.
c. New bus technologies: PCIe 5.0; CCIX; Gen-Z.
SoC complexities
d. SoC instead of chip set.
e. SoC interconnect.
f. Large number of cores, big size of caches, fast speed buses
g. On-Chip controllers (PCIe/SATA/USB, etc.)
Acknowledgement
● Achin Gupta (ARM)
● Charles Garcia-Tobin (ARM)
Thank You
#HKG18
HKG18 keynotes and videos on: connect.linaro.org
For further information: jon.zhixiong.zhang@gmail.comwww.linaro.org

More Related Content

What's hot

AAME ARM Techcon2013 005v02 System Startup
AAME ARM Techcon2013 005v02 System StartupAAME ARM Techcon2013 005v02 System Startup
AAME ARM Techcon2013 005v02 System StartupAnh Dung NGUYEN
 
ARM AAE - Memory Systems
ARM AAE - Memory SystemsARM AAE - Memory Systems
ARM AAE - Memory SystemsAnh Dung NGUYEN
 
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_Architecture
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_ArchitectureARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_Architecture
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_ArchitectureRaahul Raghavan
 
AAME ARM Techcon2013 006v02 Implementation Diversity
AAME ARM Techcon2013 006v02 Implementation DiversityAAME ARM Techcon2013 006v02 Implementation Diversity
AAME ARM Techcon2013 006v02 Implementation DiversityAnh Dung NGUYEN
 
ARM AAE - Developing Code for ARM
ARM AAE - Developing Code for ARMARM AAE - Developing Code for ARM
ARM AAE - Developing Code for ARMAnh Dung NGUYEN
 
Interrupt handling
Interrupt handlingInterrupt handling
Interrupt handlingmaverick2203
 
AAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's modelAAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's modelAnh Dung NGUYEN
 
ARM® Cortex M Boot & CMSIS Part 1-3
ARM® Cortex M Boot & CMSIS Part 1-3ARM® Cortex M Boot & CMSIS Part 1-3
ARM® Cortex M Boot & CMSIS Part 1-3Raahul Raghavan
 
ARM® Cortex™ M Bootup_CMSIS_Part_2_3
ARM® Cortex™ M Bootup_CMSIS_Part_2_3ARM® Cortex™ M Bootup_CMSIS_Part_2_3
ARM® Cortex™ M Bootup_CMSIS_Part_2_3Raahul Raghavan
 
ARM Processor architecture
ARM Processor  architectureARM Processor  architecture
ARM Processor architecturerajkciitr
 
Introduction to Avr Microcontrollers
Introduction to Avr MicrocontrollersIntroduction to Avr Microcontrollers
Introduction to Avr MicrocontrollersMohamed Tarek
 
Ch1 2 kb_pcs7_v70_en
Ch1 2 kb_pcs7_v70_enCh1 2 kb_pcs7_v70_en
Ch1 2 kb_pcs7_v70_enconfidencial
 
Basics Of Embedded Systems
Basics Of Embedded SystemsBasics Of Embedded Systems
Basics Of Embedded Systemsarlabstech
 
ARM® Cortex™ M Energy Optimization - Using Instruction Cache
ARM® Cortex™ M Energy Optimization - Using Instruction CacheARM® Cortex™ M Energy Optimization - Using Instruction Cache
ARM® Cortex™ M Energy Optimization - Using Instruction CacheRaahul Raghavan
 
BeRTOS: Free Embedded RTOS
BeRTOS: Free Embedded RTOSBeRTOS: Free Embedded RTOS
BeRTOS: Free Embedded RTOSDeveler S.r.l.
 
OSMC 2014: Server Hardware Monitoring done right | Werner Fischer
OSMC 2014: Server Hardware Monitoring done right | Werner FischerOSMC 2014: Server Hardware Monitoring done right | Werner Fischer
OSMC 2014: Server Hardware Monitoring done right | Werner FischerNETWAYS
 

What's hot (20)

AAME ARM Techcon2013 005v02 System Startup
AAME ARM Techcon2013 005v02 System StartupAAME ARM Techcon2013 005v02 System Startup
AAME ARM Techcon2013 005v02 System Startup
 
ARM AAE - Memory Systems
ARM AAE - Memory SystemsARM AAE - Memory Systems
ARM AAE - Memory Systems
 
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_Architecture
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_ArchitectureARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_Architecture
ARM® Cortex™ M Bootup_CMSIS_Part_3_3_Debug_Architecture
 
AAME ARM Techcon2013 006v02 Implementation Diversity
AAME ARM Techcon2013 006v02 Implementation DiversityAAME ARM Techcon2013 006v02 Implementation Diversity
AAME ARM Techcon2013 006v02 Implementation Diversity
 
ARM AAE - Developing Code for ARM
ARM AAE - Developing Code for ARMARM AAE - Developing Code for ARM
ARM AAE - Developing Code for ARM
 
Interrupt handling
Interrupt handlingInterrupt handling
Interrupt handling
 
AAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's modelAAME ARM Techcon2013 001v02 Architecture and Programmer's model
AAME ARM Techcon2013 001v02 Architecture and Programmer's model
 
ARM® Cortex M Boot & CMSIS Part 1-3
ARM® Cortex M Boot & CMSIS Part 1-3ARM® Cortex M Boot & CMSIS Part 1-3
ARM® Cortex M Boot & CMSIS Part 1-3
 
ARM AAE - Architecture
ARM AAE - ArchitectureARM AAE - Architecture
ARM AAE - Architecture
 
Day1
Day1Day1
Day1
 
Ch7 v70 scl_en
Ch7 v70 scl_enCh7 v70 scl_en
Ch7 v70 scl_en
 
ARM® Cortex™ M Bootup_CMSIS_Part_2_3
ARM® Cortex™ M Bootup_CMSIS_Part_2_3ARM® Cortex™ M Bootup_CMSIS_Part_2_3
ARM® Cortex™ M Bootup_CMSIS_Part_2_3
 
ARM Processor architecture
ARM Processor  architectureARM Processor  architecture
ARM Processor architecture
 
Introduction to Avr Microcontrollers
Introduction to Avr MicrocontrollersIntroduction to Avr Microcontrollers
Introduction to Avr Microcontrollers
 
Ch1 2 kb_pcs7_v70_en
Ch1 2 kb_pcs7_v70_enCh1 2 kb_pcs7_v70_en
Ch1 2 kb_pcs7_v70_en
 
Basics Of Embedded Systems
Basics Of Embedded SystemsBasics Of Embedded Systems
Basics Of Embedded Systems
 
ARM® Cortex™ M Energy Optimization - Using Instruction Cache
ARM® Cortex™ M Energy Optimization - Using Instruction CacheARM® Cortex™ M Energy Optimization - Using Instruction Cache
ARM® Cortex™ M Energy Optimization - Using Instruction Cache
 
Introduction to stm32-part1
Introduction to stm32-part1Introduction to stm32-part1
Introduction to stm32-part1
 
BeRTOS: Free Embedded RTOS
BeRTOS: Free Embedded RTOSBeRTOS: Free Embedded RTOS
BeRTOS: Free Embedded RTOS
 
OSMC 2014: Server Hardware Monitoring done right | Werner Fischer
OSMC 2014: Server Hardware Monitoring done right | Werner FischerOSMC 2014: Server Hardware Monitoring done right | Werner Fischer
OSMC 2014: Server Hardware Monitoring done right | Werner Fischer
 

Similar to HKG18-116 - RAS Solutions for Arm64 Servers

Spike yuan server ras and uefi cper final
Spike yuan  server ras and uefi cper finalSpike yuan  server ras and uefi cper final
Spike yuan server ras and uefi cper finalparth bera
 
Understanding and Improving Device Access Complexity
Understanding and Improving Device Access ComplexityUnderstanding and Improving Device Access Complexity
Understanding and Improving Device Access Complexityasimkadav
 
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdfBuilding PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdfPaul Yang
 
Hardware Management Module
Hardware Management ModuleHardware Management Module
Hardware Management ModuleAero Plane
 
XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...
XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...
XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...The Linux Foundation
 
Current and Future of Non-Volatile Memory on Linux
Current and Future of Non-Volatile Memory on LinuxCurrent and Future of Non-Volatile Memory on Linux
Current and Future of Non-Volatile Memory on Linuxmountpoint.io
 
Las16 200 - firmware summit - ras what is it- why do we need it
Las16 200 - firmware summit - ras what is it- why do we need itLas16 200 - firmware summit - ras what is it- why do we need it
Las16 200 - firmware summit - ras what is it- why do we need itLinaro
 
laptop repairing course in delhi
laptop repairing course in delhilaptop repairing course in delhi
laptop repairing course in delhiAmit Gupta
 
HKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting ReviewHKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting ReviewLinaro
 
1 study of motherboard
1 study of motherboard1 study of motherboard
1 study of motherboardAnkit Dubey
 
Computer specifications
Computer specificationsComputer specifications
Computer specificationsLeonel Rivas
 
Crash dump analysis - experience sharing
Crash dump analysis - experience sharingCrash dump analysis - experience sharing
Crash dump analysis - experience sharingJames Hsieh
 
Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph Ceph Community
 

Similar to HKG18-116 - RAS Solutions for Arm64 Servers (20)

Spike yuan server ras and uefi cper final
Spike yuan  server ras and uefi cper finalSpike yuan  server ras and uefi cper final
Spike yuan server ras and uefi cper final
 
UNIT-III ES.ppt
UNIT-III ES.pptUNIT-III ES.ppt
UNIT-III ES.ppt
 
Understanding and Improving Device Access Complexity
Understanding and Improving Device Access ComplexityUnderstanding and Improving Device Access Complexity
Understanding and Improving Device Access Complexity
 
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdfBuilding PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
 
Hardware Management Module
Hardware Management ModuleHardware Management Module
Hardware Management Module
 
Introduction to Embedded Systems
Introduction to Embedded SystemsIntroduction to Embedded Systems
Introduction to Embedded Systems
 
XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...
XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...
XPDDS17: Keynote: Shared Coprocessor Framework on ARM - Oleksandr Andrushchen...
 
eMMC 5.0 Total IP Solution
eMMC 5.0 Total IP SolutioneMMC 5.0 Total IP Solution
eMMC 5.0 Total IP Solution
 
Current and Future of Non-Volatile Memory on Linux
Current and Future of Non-Volatile Memory on LinuxCurrent and Future of Non-Volatile Memory on Linux
Current and Future of Non-Volatile Memory on Linux
 
Las16 200 - firmware summit - ras what is it- why do we need it
Las16 200 - firmware summit - ras what is it- why do we need itLas16 200 - firmware summit - ras what is it- why do we need it
Las16 200 - firmware summit - ras what is it- why do we need it
 
laptop repairing course in delhi
laptop repairing course in delhilaptop repairing course in delhi
laptop repairing course in delhi
 
HKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting ReviewHKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting Review
 
Phytium 64 core cpu preview
Phytium 64 core cpu previewPhytium 64 core cpu preview
Phytium 64 core cpu preview
 
1 study of motherboard
1 study of motherboard1 study of motherboard
1 study of motherboard
 
Exam hp2 t17 aquestion 1
Exam hp2 t17 aquestion 1Exam hp2 t17 aquestion 1
Exam hp2 t17 aquestion 1
 
Computer specifications
Computer specificationsComputer specifications
Computer specifications
 
Crash dump analysis - experience sharing
Crash dump analysis - experience sharingCrash dump analysis - experience sharing
Crash dump analysis - experience sharing
 
Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph
 
The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
 
5120224.ppt
5120224.ppt5120224.ppt
5120224.ppt
 

More from Linaro

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloLinaro
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaLinaro
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraLinaro
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaLinaro
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018Linaro
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018Linaro
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...Linaro
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Linaro
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Linaro
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteLinaro
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopLinaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allLinaro
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorLinaro
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMULinaro
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MLinaro
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation Linaro
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootLinaro
 

More from Linaro (20)

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qa
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening Keynote
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8M
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
 

Recently uploaded

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Recently uploaded (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

HKG18-116 - RAS Solutions for Arm64 Servers

  • 1. ARM64 Server RAS Solutions Jonathan (Zhixiong) Zhang Cavium Inc.
  • 2. Agenda ● Overview ● Solutions ● Building blocks ● Reflections
  • 3. Overview Reliability, Availability, Serviceability RAS is one of the most important aspects of ARM64 servers’ success
  • 4. Progresses ● RAS on ARM64 ● SFO17 ● Fu Wei ● Focus on OS and ACPI
  • 5. Progresses ● RAS on AARCH64 ● SFO17 ● Fu Wei, Supreeth Venkatesh ● Focus on prototype solution
  • 6. Discussion Focus ● RAS is a 3 dimensional problem! ● 2 Dimensions: from CPU, to board, to rack ● 1 dimension: from hardware, to firmware to software ● Product ready solutions
  • 7. Who needs what – end user, apps developer● Accurate error reporting ● Right level error reporting ● Low cycle stealing ● Quarantined uncorrectable error
  • 8. Who needs what – data centre admin ● Accurate error reporting ● Reliable error reporting ● Uniformed error handling policy ● Actionable error reporting
  • 9. Who needs what – Platform Vendor Validation through production software stack ● Example: Memory stability has to be validated through running all hardware threads, exercising all DIMMs through OS memory test tool. ● Example: PCIe tuning issue can be spotted through thousands of reboots.
  • 10. Who needs what – SOC Vendor Validation through production software stack ● Validation/diag tool does not exercise the SOC enough. ● Subsystem design/programming issue needs to be unmasked earlier in the engineering cycle. Cost-effective support process ● Expedited problem root cause. ● Reduced reliant on hardware debugger.
  • 12. Early boot time error notification ● Post code, to diagnose halted boot. ● SEL record, to diagnose successful boot with component failure
  • 13. Run time error notification Catastrophic Error 1. GPIO interrupt. Out Of Band. 2. SEL record. Out Of Band. 3. BERT record. In Band 4. Crash dump. Out Of Band and In Band; Catastrophic error only. Non-catastrophic Error 1. GPIO interrupt. Out Of Band. 2. SEL record. Out Of Band. 3. HEST record. In Band 4. GHESv2 notification. In Band a. Polled b. Interrupt c. SDEI d. SEA e. SEI
  • 14. DRAM error handling Boot time issue ● Reasons: ● DIMM training failure. ● SPD access failure. ● Reponses: ○ Halt the boot. ○ Continue booting with SEL record sent. Run time issue ● Correctable error ● Uncorrectable error in non- secure memory ● Recoverable. ● Unrecovable, in non OS critical region ● Unrecovable, in OS critical region. ● Uncorrectable error in secure memory
  • 15. DRAM error reporting Error location types concerned: a) Apps developer: Physical address b) Data center admin: FRU (Field Replaceable Unit) c) Platform vendor: DRAM address Channel interleaving complicates address translation: DRAM address Transaction address Physical address
  • 16. Memory scrubbing DRAM scrubbing prevents future error from happening. a. Firmware initiated on-demand scrubbing, upon memory error. b. Firmware initiated interval based scrubbing, as defined by user. c. OS initiated on-demand scrubbing. 1. Platform exposes scrubbing support capability to OS through ACPI RASF table. 2. OS setup, start, stop scrubbing through RASF PCC channel.
  • 17. PCIe error handling -- workflow Firmware First Handling being asked. 1. PCIe device (such as end point device) detects error. 2. The error is reported up through AER (Advanced Error Reporting) mechanism. 3. PCIe Root Complex on the SoC issues error interrupt. 4. Firmware secure interrupt handler analyzes and reports the error.
  • 18. PCIe error handling -- ACPI a) OSC: OS conveys APEI support to platform. b) OSC: When OS requests control over PCIe AER, platform does not give the control. c) APEI error source: GHESv2 error source, no PCIe RP AER structure, no PCIe device AER structure. d) CPER: PCIe error section.
  • 19. PCIe error handling – SEL sample a) Multiple SEL records generated per error. b) Some are generated from error producer, others from error consumer. c) SEL records hold info such as error type (bus error, PERR, SERR, etc.), BDF numbers, vendor/device IDs, error ID (bad TLP, bad DLLP, etc.).
  • 20. Power/Thermal management Sensors a. Platform sensors b. DIMM sensors c. On-chip sensors Defenses d. DIMM temperature based memory throttling e. Other sensor temperature based Close Loop Thermal Throttling f. Last defense: Software and hardware based thermal trip Power optimizations g. Autonomous DVFS h. Turbo mode and CPPC
  • 21. Catastrophic Error -- Detection a) Hardware error interrupt handler in Trusted Firmware – A (AKA. ARM Trusted Firmware). b) Secure watchdog bite in on-chip management controller a) Watchdog signal 0 routed as SPI to GIC. b) Watchdog signal 1 routed to the platform, such as on-chip management controller c) BMC (Board Management Controller) detected heart beat stop from on-chip management controller d) OS initiated through ACPI PDTT mechanism
  • 22. Catastrophic Error – crash dump a) Contains data needed for failure analysis. b) Vendor proprietary tool provided to analyze the failure. c) Modularized to provide fault tolerance. d) Saved in both non-volatile storage and BMC.
  • 24. Early boot time considerations Report issues happened before interrupt can be enabled: a. Check error status register at certain check points. b. Check error status register right before interrupt is enabled. c. Enable hardware error interrupt as soon as possible. Prevent false positives: d. Filter out noises happened during training/initialization process. e. Clear status registers before enabling hardware error interrupt.
  • 25. SEL record I2C bus access conflict resolution a. Dedicated I2C connection between Trusted Firmware – A and BMC. b. Dedicated I2C connection between non-secure world (UEFI/OS) and BMC. BMC to handle simultaneous SEL record access ARM TF UEFI/OS BMC I2C I2C
  • 26. Configuration Early boot loader configuration a. Some customers want boot to be halted upon single DIMM training failure. b. Other customers want boot to continue unless all DIMMs failed. ARM TF configuration (For example, memory CE threshold needs) c. Length of sliding window. d. The threshold number. Enablement of display/update of configurations in OS e. Stored in a partition in BootRom, separated from UEFI variable partition. f. Exposed through UEFI variable run time service. g. Protected from firmware update. h. Able to be set to factory default.
  • 27. Error injection Memory Error ● Inject into a specific physical memory ● Inject into a specific DIMM location PCIe Error ● Inject through Root Port ● Inject through End Point device ● Inject through PCIe analyzer
  • 29. Hardware design Reduce cycle stealing a. Route hardware error interrupt to on-chip management controller b. Manage Correctable Error threshold in hardware Standardized design c. Support RAS extension
  • 30. Firmware RAS Features ● Exception Handling Framework in Trusted Firmware – A. ○ A framework for triaging RAS errors in EL3 prior to handling or delegation to lower EL ○ Up-stream support available for triaging RAS error interrupts and delegation through SDEI ○ Support for v8.2 RAS extension under review ■ Error synchronization barrier ■ Standard Error records ○ Support for v8.4 RAS extensions planned next ● Software Delegated Exception Interface dispatcher in Trusted Firmware – A. ○ A mechanism to deliver high priority non-maskable events to Normal world ○ Hooks into EHF as a way of delegating RAS errors to Normal world ○ Up-stream support available for physical SDEI dispatcher ● RAS error handling in a Secure partition ○ Support for CPER creation in a EDK2 Standalone MM partition in S-EL0 under development ○ Enables isolation of platform specific code in a EL3 managed sandbox ● DRAM error handling prototyped on SGI-575 using above components
  • 31. Kernel RAS Features ● Memory failure handling merged ● All Arm specific APEI notifications (SEA, SEI, SDEI) work ● Patches to make APEI support aware of multiple NMI sources under review ● Support for SDEI Client driver merged ● Support in APEI for registering SDEI notification handler under review ● Support for IESB in v8.2 RAS Extension merged ○ SErrors normally unmasked when in kernel ● Errors in KVM-Guest memory forwarded to Qemu/Kvmtool ● Support for notifying KVM-Guest under development
  • 33. BMC
  • 34. ARM Server Base Manageability Guide
  • 35. Firmware First vs. Kernel First ● In general, firmware first is asked by server industry. ● Kernel first is desired for some products. ● ACPI table available for review that describes the RAS error node topology to the kernel. ● https://connect.arm.com/dropzone/systemarch/DEN0061A-RAS_ACPI-1.0 alp1.pdf
  • 36. Challenges New technologies a. NVDIMM: ACPI NVDIMM Sub Team working on a number of proposals. b. New DIMM technologies: DDR5. c. New bus technologies: PCIe 5.0; CCIX; Gen-Z. SoC complexities d. SoC instead of chip set. e. SoC interconnect. f. Large number of cores, big size of caches, fast speed buses g. On-Chip controllers (PCIe/SATA/USB, etc.)
  • 37. Acknowledgement ● Achin Gupta (ARM) ● Charles Garcia-Tobin (ARM)
  • 38. Thank You #HKG18 HKG18 keynotes and videos on: connect.linaro.org For further information: jon.zhixiong.zhang@gmail.comwww.linaro.org