Xen RAS Status and Progress


                  Dugger, Donald D
                       Liu, Jinsong
                    Jiang, Yunhong
Agenda
• Xen RAS overview
• Xen RAS latest progress
   – Core error recovery
   – APEI support
   – Robust enhancement
• Call for co-work




                       Intel Confidential
Xen RAS overview
• Xen RAS motivation
   – A single hardware error can affect many VMs
   – Xen RAS: contain the error and handle it accordingly


• Error handling
   – CPU/memory errors: MCA (Machine Check Architecture)
   – I/O errors: AER (Advanced Error Reporting)
   – Platform errors: APEI (ACPI Platform Error Interfaces)




MCA: Machine Check Architecture
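[Diagram: MCA handling flow, from hardware up to the guests]
• HW: the CPU reports errors via polling or MCE/CMCI
• Xen: the Xen MCA handler either takes a recover action (page offline, cpu offline) or ends in system panic & reset
• dom0: vIRQ to the vIRQ handler and vMCA to the vMCE handler, feeding user space tools (FMA/mcelog)
• domU: vMCA to the vMCE handler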
Xen RAS status
Item                    Status      Comments
MCA infrastructure      supported   Moved from dom0 to hypervisor
CE and UCNA             supported   Userspace tools for logging and analysis
Uncore error recovery   supported   Memory scrubbing error; L3 explicit write-back error
Core error recovery     WIP         Data load error; instruction fetch error
APEI BERT               WAIT        Owned by dom0; waiting for kernel support
APEI ERST               supported   Dom0 and hypervisor cooperate
APEI EINJ               supported   Owned by dom0
APEI HEST/GHES          WIP         Dom0 and hypervisor cooperate



Xen RAS latest progress
• Core error recovery
   – A new MCA error type: an error in the current processor execution context
   – The CPU tags it as action required; it must be handled before execution resumes
   – Currently 2 types of architecturally defined core errors:
          •   Data Load Error
          •   Instruction Fetch Error
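
The "action required" tagging can be illustrated with the status flags of a machine-check bank. The sketch below assumes the IA32_MCi_STATUS bit layout from the Intel SDM (VAL, UC, PCC, S, AR) and a processor that supports software error recovery; the class names are the SDM's terms (core errors correspond to SRAR), and `classify` is a hypothetical helper, not Xen code.

```python
# Bit positions in IA32_MCi_STATUS, per the Intel SDM machine-check chapter.
MCI_STATUS_VAL = 1 << 63   # register holds a valid error
MCI_STATUS_UC = 1 << 61    # error was not corrected by hardware
MCI_STATUS_PCC = 1 << 57   # processor context corrupt
MCI_STATUS_S = 1 << 56     # signaled via a machine-check exception
MCI_STATUS_AR = 1 << 55    # action required before execution resumes


def classify(status: int) -> str:
    """Return a coarse error class for one MCi_STATUS value (hypothetical helper)."""
    if not status & MCI_STATUS_VAL:
        return "none"
    if not status & MCI_STATUS_UC:
        return "CE"        # corrected error
    if status & MCI_STATUS_PCC:
        return "fatal"     # context corrupt, no recovery possible
    if not status & MCI_STATUS_S:
        return "UCNA"      # uncorrected, no action required
    if status & MCI_STATUS_AR:
        return "SRAR"      # action required: core errors land here
    return "SRAO"          # action optional, e.g. uncore errors


# Sample value built from the flag bits above, not a real hardware log.
srar = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_AR
print(classify(srar))
```

Checking PCC before S and AR matters: once the processor context is corrupt, the recovery flags no longer help.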


• APEI support
   – ACPI Platform Error Interfaces
   – Brings existing h/w error mechanisms together as a coherent infrastructure
   – Consists of 4 separate tables
          •   Boot Error Record Table (BERT)
          •   Error Record Serialization Table (ERST)
          •   Error Injection Table (EINJ)
          •   Hardware Error Source Table (HEST)
    –   Using Linux 3.0 as dom0 saves much effort
          •   Much of the dom0 APEI code is reused
          •   Little maintenance effort; benefits from kernel improvements


Core Error Recovery
• Xen core error recovery
   – Basically the same MCA infrastructure as uncore error recovery
   – MCE exception ISR
         •   MCE is broadcast to all logical processors
         •   Determine whether the error falls in hypervisor or guest range
   –   If in hypervisor
         •   Reset the system
              –   Worst case: execution cannot resume

   –   If in guest
         •   Trigger vMCE to the affected guest
         •   Trigger vIRQ to dom0 for logging
         •   Error is contained in the guest
              –   Medium case: error in guest kernel, kill the guest
              –   Best case: error in guest app, kill the app

   –   Code is done; fine-grained testing needs kernel core recovery support
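
The decision flow on this slide can be sketched as two small functions, one for the hypervisor and one for the guest's own vMCE handler. This is a simplified model under stated assumptions: `CoreError`, `xen_actions`, and `guest_action` are hypothetical names, and the real handler works on MCA bank data rather than strings.

```python
from dataclasses import dataclass
from typing import List

HYPERVISOR = "hypervisor"  # hypothetical owner tag for Xen's own memory


@dataclass
class CoreError:
    owner: str             # HYPERVISOR or the name of the affected guest
    in_guest_kernel: bool  # only meaningful when a guest owns the error


def xen_actions(err: CoreError) -> List[str]:
    """Actions the hypervisor takes in its MCE handler."""
    if err.owner == HYPERVISOR:
        # Worst case: Xen itself cannot resume execution.
        return ["reset system"]
    return [f"inject vMCE into {err.owner}", "raise vIRQ to dom0 for logging"]


def guest_action(err: CoreError) -> str:
    """What the guest's own vMCE handler would do with the error."""
    if err.owner == HYPERVISOR:
        raise ValueError("hypervisor errors never reach a guest")
    return "kill the guest" if err.in_guest_kernel else "kill the app"


err = CoreError(owner="domU-1", in_guest_kernel=False)
print(xen_actions(err), guest_action(err))
```

Splitting the two functions mirrors the containment idea: Xen only decides who owns the error, and the guest decides how much of itself to sacrifice.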



APEI support
• Xen APEI support
  –   BERT
       •   Boot Error Record Table
             –   Records unhandled fatal errors that occurred in a previous boot
       •   Xen BERT
             –   Owned by dom0; waiting for kernel BERT support


  –   ERST
       •   Error Record Serialization Table
             –   Save/retrieve fatal errors to/from persistent storage
       •   Hypervisor ERST:
             –   Save errors
       •   Dom0 ERST:
             –   Retrieve/clear errors
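
The ERST split of responsibilities (hypervisor saves, dom0 retrieves and clears) can be pictured with a toy record store. This is only an illustrative model: `ErrorRecordStore` and its methods are hypothetical, and a real ERST is driven through ACPI-described operations on firmware-backed storage, not a Python dictionary.

```python
from typing import Dict, Optional


class ErrorRecordStore:
    """Toy stand-in for the persistent storage behind ERST."""

    def __init__(self) -> None:
        self._records: Dict[int, bytes] = {}
        self._next_id = 1

    def write(self, payload: bytes) -> int:
        """Hypervisor side: persist a fatal error record, return its id."""
        record_id = self._next_id
        self._next_id += 1
        self._records[record_id] = payload
        return record_id

    def read(self, record_id: int) -> Optional[bytes]:
        """Dom0 side: retrieve a saved record, or None if it is absent."""
        return self._records.get(record_id)

    def clear(self, record_id: int) -> bool:
        """Dom0 side: drop a record once it has been logged."""
        return self._records.pop(record_id, None) is not None


store = ErrorRecordStore()
rid = store.write(b"fatal MCE in bank 4")  # hypervisor, before the reset
saved = store.read(rid)                    # dom0, after the next boot
cleared = store.clear(rid)
```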




APEI support
• Xen APEI support
  –   EINJ
       •     Error Injection Table
              –   Mechanism through which OSPM can inject h/w errors
       •     Xen EINJ
              –   Owned by dom0
              –   Tested with the error types the current BIOS makes available


  –   HEST
       •     Hardware Error Source Table
              –   Platform-level description of error sources and error notifications
       •     Xen HEST
              –   Dom0 owns the SCI logic, because of ACPICA
              –   Hypervisor owns the NMI logic; the Xen APEI NMI handler is not ready yet
              –   Needs BIOS support for more error sources and notifications
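
The HEST ownership split can be written down as a small lookup, which makes explicit which component would service each notification type. The names here (`Notification`, `NOTIFICATION_OWNER`, `owner_of`) are hypothetical and cover only the two notification types mentioned on the slide.

```python
from enum import Enum


class Notification(Enum):
    SCI = "SCI"
    NMI = "NMI"


# Ownership as described on the slide.
NOTIFICATION_OWNER = {
    Notification.SCI: "dom0",        # handled through ACPICA in the dom0 kernel
    Notification.NMI: "hypervisor",  # Xen APEI NMI handler, not ready yet
}


def owner_of(kind: Notification) -> str:
    """Return which component services a given HEST notification type."""
    return NOTIFICATION_OWNER[kind]


print(owner_of(Notification.SCI), owner_of(Notification.NMI))
```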




Robust enhancement
• Xen RAS robustness enhancement
  –   Xen RAS robustness challenges
        •   Buggy BIOS
        •   Some error types are not yet supported by h/w
        •   Errors are hard to trigger and to test automatically
  –   Our work to make Xen RAS more robust
        •   Code cleanup and enhancement
        •   Currently supported errors have been triggered and tested
        •   QA added error-simulator tools and automated test scripts
        •   Enabling EINJ greatly helps debugging and testing
        •   Robustness work will continue as new platforms support more error types




Call for co-work
• I/O error handling
   – PCIe AER (Advanced Error Reporting)
   – For devices assigned to dom0 / PV domU
         •   Basically reuse the dom0/domU AER logic
   –   For devices assigned to HVM guests
         •   Needs PCIe AER support in qemu
              –   VALinux has done some work on standard qemu
              –   Port it to Xen qemu to get AER support




