Xen RAS Status and Progress
  • 1. Xen RAS Status and Progress
      Dugger, Donald D; Liu, Jinsong; Jiang, Yunhong
  • 2. Agenda
      • Xen RAS overview
      • Xen RAS latest progress
          – Core error recovery
          – APEI support
          – Robust enhancement
      • Call for co-work
      Intel Confidential
  • 3. Xen RAS overview
      • Xen RAS motivation
          – Error affects many VMs
          – Xen RAS: error contained and handled accordingly
      • Error Handling
          – CPU/Memory error: MCA (Machine Check Architecture)
          – I/O error: AER (Advanced Error Reporting)
          – ACPI Platform Error Interfaces
  • 4. MCA: Machine Check Architecture
      [Diagram: layers from top to bottom]
      • dom0: user space tools (FMA/mcelog), vIRQ handler, vMCE handler
      • domU: vMCE handler
      • Xen: Xen MCA handler, delivering vIRQ and vMCA to the domains
          – Recover actions: page offline, cpu offline, system panic & reset
      • CPU HW: errors reach Xen through polling, MCE, or CMCI
  • 5. Xen RAS status

      Item                    Status      Comments
      MCA infrastructure      supported   Move from dom0 to hypervisor
      CE and UCNA             supported   Userspace tools logging and analysis
      Uncore error recovery   supported   Memory scrubbing error; L3 explicit write-back error
      Core error recovery     WIP         Data load error; instruction fetch error
      APEI BERT               WAIT        Dom0 own, wait kernel ready
      APEI ERST               supported   Dom0 and hypervisor co-work
      APEI EINJ               supported   Dom0 own
      APEI HEST/GHES          WIP         Dom0 and hypervisor co-work
  • 6. Agenda
      • Xen RAS overview
      • Xen RAS latest progress
          – Core error recovery
          – APEI support
          – Robust enhancement
      • Call for co-work
  • 7. Xen RAS latest progress
      • Core error recovery
          – A new MCA error type: an error in the current processor execution context
          – The CPU tags it as action required; it must be dealt with before execution resumes
          – Currently 2 types of architecturally defined core error:
              • Data Load Error
              • Instruction Fetch Error
      • APEI support
          – ACPI Platform Error Interfaces
          – Brings existing h/w error mechanisms together as a coherent infrastructure
          – Consists of 4 separate tables
              • Boot Error Record Table
              • Error Record Serialization Table
              • Error Injection Table
              • Hardware Error Source Table
          – Linux 3.0 as dom0 saves us much effort
              • Much of the dom0 APEI code is reused
              • Little maintenance effort; benefits from kernel improvements
  • 8. Core Error Recovery
      • Xen core error recovery
          – Basically the same MCA infrastructure as uncore error recovery
          – MCE exception ISR
              • MCE broadcast to all logical processors
              • Check whether the error is in the range of the hypervisor or a guest
          – If in hypervisor
              • Reset system
                  – Worst case, cannot resume execution
          – If in guest
              • Trigger vMCE to affected guest
              • Trigger vIRQ to dom0 for logging
              • Error contained in guest
                  – Medium case, error in guest kernel, kill the guest
                  – Best case, error in guest app, kill the app
          – Code done; needs kernel core recovery support for fine-grained testing
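The decision flow on slide 8 can be sketched as a small function. This is only an illustrative model in Python, not Xen code: the `Context` enum, the `handle_core_error` name, and the action strings are all hypothetical, and the real handler runs inside the hypervisor's MCE exception path.

```python
from enum import Enum


class Context(Enum):
    """Where the core error was detected (hypothetical names for illustration)."""
    HYPERVISOR = "hypervisor"
    GUEST_KERNEL = "guest kernel"
    GUEST_APP = "guest app"


def handle_core_error(context, domain_id=None):
    """Return the ordered actions for a core error, following slide 8."""
    if context is Context.HYPERVISOR:
        # Worst case: execution cannot resume, so the system is reset.
        return ["reset system"]

    # Error is in a guest: notify the guest and let dom0 log it.
    actions = [
        f"trigger vMCE to domain {domain_id}",
        "trigger vIRQ to dom0 for logging",
    ]
    if context is Context.GUEST_KERNEL:
        # Medium case: the guest kernel is affected, so the guest is killed.
        actions.append(f"kill domain {domain_id}")
    else:
        # Best case: only an application in the guest is affected.
        actions.append(f"kill affected app in domain {domain_id}")
    return actions


if __name__ == "__main__":
    for ctx in Context:
        print(ctx.value, "->", handle_core_error(ctx, domain_id=3))
```

The point of the sketch is the containment property: only the hypervisor case affects the whole system, while both guest cases leave other domains untouched.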
  • 9. APEI support
      • Xen APEI support
          – BERT
              • Boot Error Record Table
                  – For an unhandled fatal error that occurred in a previous boot
              • Xen BERT
                  – Owned by dom0; waiting for kernel BERT support
          – ERST
              • Error Record Serialization Table
                  – Save/retrieve fatal error to/from persistent storage
              • Hypervisor ERST:
                  – Save error
              • Dom0 ERST:
                  – Retrieve/clear error
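The ERST division of labor on slide 9 (hypervisor saves, dom0 retrieves and clears) can be modeled with a toy record store. This is a sketch under stated assumptions: `ErrorRecordStore`, `hypervisor_save_fatal_error`, and `dom0_drain_records` are hypothetical names, and a Python dictionary stands in for the platform's persistent storage.

```python
from typing import Dict, List, Optional


class ErrorRecordStore:
    """Toy stand-in for ERST-backed persistent storage (hypothetical)."""

    def __init__(self) -> None:
        self._records: Dict[int, bytes] = {}
        self._next_id = 1

    def write(self, payload: bytes) -> int:
        """Persist a record and return its ID."""
        record_id = self._next_id
        self._next_id += 1
        self._records[record_id] = payload
        return record_id

    def read(self, record_id: int) -> Optional[bytes]:
        """Return the record, or None if it does not exist."""
        return self._records.get(record_id)

    def clear(self, record_id: int) -> bool:
        """Delete a record; report whether it existed."""
        return self._records.pop(record_id, None) is not None

    def ids(self) -> List[int]:
        return sorted(self._records)


def hypervisor_save_fatal_error(store: ErrorRecordStore, payload: bytes) -> int:
    """Hypervisor side: only saves the error before the system goes down."""
    return store.write(payload)


def dom0_drain_records(store: ErrorRecordStore) -> List[bytes]:
    """Dom0 side: retrieve every saved record, then clear it."""
    drained = []
    for record_id in store.ids():
        payload = store.read(record_id)
        if payload is not None:
            drained.append(payload)
            store.clear(record_id)
    return drained
```

In this model the hypervisor never reads or deletes records, which mirrors the split on the slide: the save path has to work at the moment of a fatal error, while retrieval and cleanup can wait until dom0 is running after the next boot.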
  • 10. APEI support
      • Xen APEI support
          – EINJ
              • Error Injection Table
                  – Mechanism through which OSPM can inject h/w errors
              • Xen EINJ
                  – Owned by dom0
                  – Testing done based on the error types available in the current BIOS
          – HEST
              • Hardware Error Source Table
                  – Platform-level description of error sources and error notifications
              • Xen HEST
                  – Dom0 owns the SCI logic because of ACPICA
                  – Hypervisor owns the NMI logic; the Xen APEI NMI handler is currently not ready
                  – Needs BIOS support for more error sources and notifications
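The HEST ownership split on slide 10 amounts to routing each notification type to the component that owns its logic. A minimal sketch, assuming only the two notification types named on the slide; the table and function names are hypothetical and nothing here reflects actual Xen or Linux interfaces.

```python
# Ownership as stated on slide 10 (names are hypothetical, for illustration only).
NOTIFICATION_OWNER = {
    "SCI": "dom0",        # dom0 owns the SCI logic because of ACPICA
    "NMI": "hypervisor",  # the hypervisor owns the NMI logic
}

# The slide states that the Xen APEI NMI handler is not ready yet.
FLAGGED_NOT_READY = {"NMI"}


def route_hest_notification(kind):
    """Return (owner, flagged_not_ready) for a HEST notification type."""
    owner = NOTIFICATION_OWNER.get(kind)
    if owner is None:
        raise ValueError(f"notification type not covered by the slide: {kind}")
    return owner, kind in FLAGGED_NOT_READY


if __name__ == "__main__":
    for kind in ("SCI", "NMI"):
        print(kind, "->", route_hest_notification(kind))
```

Any other notification type raises an error here because the slide does not say who would own it.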
  • 11. Robust enhancement
      • Xen RAS robustness enhancement
          – Xen RAS robustness challenges
              • Buggy BIOS
              • Some error types not supported by h/w yet
              • Hard to trigger errors and run automated tests
          – Our work to enhance Xen RAS robustness
              • Did some code cleanup & enhancement
              • Currently supported errors were triggered and tested
              • QA added error-simulator tools and an automated test script
              • EINJ enabling greatly helps debugging & testing
              • Robustness enhancement will continue as new platforms support more error types
  • 12. Agenda
      • Xen RAS overview
      • Xen RAS latest progress
          – Core error recovery
          – APEI support
          – Robust enhancement
      • Call for co-work
  • 13. Call for co-work
      • I/O error handling
          – PCIe AER, Advanced Error Reporting
          – For devices assigned to dom0/pv domU
              • Basically reuse dom0/domU AER logic
          – For devices assigned to hvm
              • Need PCIe AER support in qemu
                  – Some VALinux work on standard qemu
                  – Port it to Xen qemu with AER support