Xen RAS Status and Progress

Published in: Technology, Business

  1. Xen RAS Status and Progress
     Dugger, Donald D; Liu, Jinsong; Jiang, Yunhong
  2. Agenda
     • Xen RAS overview
     • Xen RAS latest progress
       – Core error recovery
       – APEI support
       – Robust enhancement
     • Call for co-work
     Intel Confidential
  3. Xen RAS overview
     • Xen RAS motivation
       – Error affects many VMs
       – Xen RAS: error contained and handled accordingly
     • Error Handling
       – CPU/Memory error: MCA (Machine Check Architecture)
       – I/O error: AER (Advanced Error Reporting)
       – ACPI Platform Error Interfaces
  4. MCA: Machine Check Architecture
     [Diagram. dom0: user space tools (FMA/Mcelog), vIRQ handler, vMCE handler. domU: vMCE handler. Xen delivers vIRQ and vMCA to the domains. Xen MCA handler, fed by polling and MCE/CMCI from the CPU hardware, takes recovery actions: page offline, CPU offline, system panic & reset.]
  5. Xen RAS status

     Item                     Status      Comments
     MCA infrastructure       supported   Move from dom0 to hypervisor
     CE and UCNA              supported   Userspace tools logging and analysis
     Uncore error recovery    supported   Memory scrubbing error; L3 explicit write-back error
     Core error recovery      WIP         Data load error; Instruction fetch error
     APEI BERT                WAIT        Dom0 own, wait kernel ready
     APEI ERST                supported   Dom0 and hypervisor co-work
     APEI EINJ                supported   Dom0 own
     APEI HEST/GHES           WIP         Dom0 and hypervisor co-work
  6. Agenda
     • Xen RAS overview
     • Xen RAS latest progress
       – Core error recovery
       – APEI support
       – Robust enhancement
     • Call for co-work
  7. Xen RAS latest progress
     • Core error recovery
       – A new MCA error type: an error in the current processor execution context
       – The CPU tags it as action required; it must be handled before execution resumes
       – Currently 2 types of architecturally defined core errors:
         • Data Load Error
         • Instruction Fetch Error
     • APEI support
       – ACPI Platform Error Interfaces
       – Brings existing h/w error mechanisms together as a coherent infrastructure
       – Consists of 4 separate tables
         • Boot Error Record Table
         • Error Record Serialization Table
         • Error Injection Table
         • Hardware Error Source Table
       – Linux 3.0 as dom0 saves us much effort
         • Much of the dom0 APEI code is reused
         • Little maintenance effort; benefits from kernel improvements
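The error classes named on the slides (CE, UCNA, "action required") correspond to flag combinations in the per-bank IA32_MCi_STATUS register. The sketch below is illustrative only and is not taken from the Xen sources: the function name `classify_mci_status` and the returned labels are hypothetical, the bit positions follow the Intel SDM, and the S/AR decoding assumes a processor that supports software error recovery.

```python
# Flag bits of IA32_MCi_STATUS (positions as documented in the Intel SDM).
MCI_STATUS_VAL = 1 << 63  # register contains a valid error
MCI_STATUS_UC = 1 << 61   # error was not corrected by hardware
MCI_STATUS_EN = 1 << 60   # error reporting was enabled for this error
MCI_STATUS_PCC = 1 << 57  # processor context corrupt
MCI_STATUS_S = 1 << 56    # signaled via machine check exception
MCI_STATUS_AR = 1 << 55   # software recovery action required


def classify_mci_status(status: int) -> str:
    """Return a coarse error class for one bank's status value.

    Hypothetical helper: assumes the S and AR bits are meaningful,
    i.e. the CPU advertises software error recovery support.
    """
    if not status & MCI_STATUS_VAL:
        return "invalid"
    if not status & MCI_STATUS_UC:
        return "CE"        # corrected error
    if status & MCI_STATUS_PCC:
        return "fatal"     # context corrupt, execution cannot resume
    signaled = bool(status & MCI_STATUS_S)
    action_required = bool(status & MCI_STATUS_AR)
    if signaled and action_required:
        return "SRAR"      # action required before execution resumes
    if signaled:
        return "SRAO"      # recovery action optional
    if action_required:
        return "unknown"   # AR without S is not a defined combination
    return "UCNA"          # uncorrected, no action required
```

For example, a valid uncorrected status with both S and AR set classifies as "SRAR", the action-required class that the core errors above (data load, instruction fetch) fall into.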
  8. Core Error Recovery
     • Xen core error recovery
       – Basically the same MCA infrastructure as uncore error recovery
       – MCE exception ISR
         • MCE is broadcast to all logical processors
         • Check whether the error is in hypervisor or guest range
       – If in hypervisor
         • Reset system
           – Worst case, cannot resume execution
       – If in guest
         • Trigger vMCE to the affected guest
         • Trigger vIRQ to dom0 for logging
         • Error contained in guest
           – Medium case, error in guest kernel: kill the guest
           – Best case, error in guest app: kill the app
       – Code done; needs kernel core recovery support for fine-grained testing
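The decision flow on this slide can be summarized as a small sketch. It only mirrors the cases listed above; the names `ErrorContext` and `plan_core_error_recovery` and the action strings are hypothetical and do not correspond to Xen functions.

```python
from enum import Enum
from typing import List


class ErrorContext(Enum):
    """Where the core error was detected (hypothetical labels)."""
    HYPERVISOR = "hypervisor"
    GUEST_KERNEL = "guest kernel"
    GUEST_APP = "guest app"


def plan_core_error_recovery(context: ErrorContext) -> List[str]:
    """List the recovery steps the slide describes for each context."""
    if context is ErrorContext.HYPERVISOR:
        # Worst case: execution cannot resume.
        return ["reset system"]
    # Error is in a guest: notify the guest and log through dom0.
    actions = [
        "trigger vMCE to affected guest",
        "trigger vIRQ to dom0 for logging",
    ]
    if context is ErrorContext.GUEST_KERNEL:
        actions.append("kill the guest")   # medium case
    else:
        actions.append("kill the app")     # best case
    return actions
```

In both guest cases the error stays contained in the affected guest; only an error in the hypervisor itself forces a system reset.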
  9. APEI support
     • Xen APEI support
       – BERT
         • Boot Error Record Table
           – For unhandled fatal errors that occurred in a previous boot
         • Xen BERT
           – Owned by dom0; waiting for kernel BERT support
       – ERST
         • Error Record Serialization Table
           – Save/retrieve fatal errors to/from persistent storage
         • Hypervisor ERST:
           – Save error
         • Dom0 ERST:
           – Retrieve/clear error
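The ERST division of labor (hypervisor saves, dom0 retrieves and clears) can be pictured with a toy model. This is a sketch only: `ErstStore` is a hypothetical in-memory stand-in for the platform's persistent storage, and none of these method names come from Xen, Linux, or the ACPI specification.

```python
from typing import Dict, Optional


class ErstStore:
    """Toy in-memory stand-in for ERST-backed persistent storage."""

    def __init__(self) -> None:
        self._records: Dict[int, bytes] = {}
        self._next_id = 1

    def save(self, payload: bytes) -> int:
        """Hypervisor side: persist a fatal error record, return its ID."""
        record_id = self._next_id
        self._next_id += 1
        self._records[record_id] = payload
        return record_id

    def retrieve(self, record_id: int) -> Optional[bytes]:
        """Dom0 side: read a saved record, or None if it does not exist."""
        return self._records.get(record_id)

    def clear(self, record_id: int) -> bool:
        """Dom0 side: delete a record once it has been logged."""
        return self._records.pop(record_id, None) is not None
```

In this model the hypervisor would call `save` at the time of a fatal error, and dom0 would call `retrieve` followed by `clear` after the next boot.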
  10. APEI support
     • Xen APEI support
       – EINJ
         • Error Injection table
           – Mechanism through which OSPM can inject h/w errors
         • Xen EINJ
           – Owned by dom0
           – Tested with the error types the current BIOS makes available
       – HEST
         • Hardware Error Source Table
           – Platform-level description of error sources and error notifications
         • Xen HEST
           – Dom0 owns the SCI logic because of ACPICA
           – Hypervisor owns the NMI logic; the Xen APEI NMI handler is not ready yet
           – Needs BIOS support for more error sources and notifications
  11. Robust enhancement
     • Xen RAS robustness enhancement
       – Xen RAS robustness challenges
         • Buggy BIOS
         • Some error types not yet supported by h/w
         • Hard to trigger errors and automate tests
       – Our work to enhance Xen RAS robustness
         • Code cleanup and enhancement
         • Currently supported errors were triggered and tested
         • QA added error-simulator tools and automated test scripts
         • Enabling EINJ greatly helps debugging and testing
         • Robustness work will continue as new platforms support more error types
  12. Agenda
     • Xen RAS overview
     • Xen RAS latest progress
       – Core error recovery
       – APEI support
       – Robust enhancement
     • Call for co-work
  13. Call for co-work
     • I/O error handling
       – PCIe AER, Advanced Error Reporting
       – For devices assigned to dom0/PV domU
         • Basically reuse dom0/domU AER logic
       – For devices assigned to HVM guests
         • Needs PCIe AER support in qemu
           – Some VALinux work exists on standard qemu
           – Port it to Xen qemu with AER support
