Xen RAS Status and Progress
Published in: Technology, Business

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
3,347
On Slideshare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Xen RAS Status and Progress
      Dugger, Donald D; Liu, Jinsong; Jiang, Yunhong
  • 2. Agenda
      • Xen RAS overview
      • Xen RAS latest progress
          – Core error recovery
          – APEI support
          – Robustness enhancement
      • Call for co-work
      Intel Confidential
  • 3. Xen RAS overview
      • Xen RAS motivation
          – An error can affect many VMs
          – Xen RAS: errors are contained and handled accordingly
      • Error handling
          – CPU/memory error: MCA (Machine Check Architecture)
          – I/O error: AER (Advanced Error Reporting)
          – ACPI Platform Error Interfaces (APEI)
  • 4. MCA: Machine Check Architecture
      [Diagram. Labels recovered from the slide:
       dom0: user space tools (FMA/mcelog), vIRQ handler, vMCE handler;
       domU: vMCE handler;
       Xen to domains: vIRQ, vMCA;
       Xen: MCA handler, recover actions (page offline, cpu offline, system panic & reset);
       CPU HW to Xen: polling, MCE/CMCI]
  • 5. Xen RAS status
      Item                    Status     Comments
      MCA infrastructure      supported  Move from dom0 to hypervisor
      CE and UCNA             supported  Userspace tools logging and analysis
      Uncore error recovery   supported  Memory scrubbing error; L3 explicit write-back error
      Core error recovery     WIP        Data load error; instruction fetch error
      APEI BERT               WAIT       Dom0 own, wait kernel ready
      APEI ERST               supported  Dom0 and hypervisor co-work
      APEI EINJ               supported  Dom0 own
      APEI HEST/GHES          WIP        Dom0 and hypervisor co-work
  • 6. Agenda
      • Xen RAS overview
      • Xen RAS latest progress
          – Core error recovery
          – APEI support
          – Robustness enhancement
      • Call for co-work
  • 7. Xen RAS latest progress
      • Core error recovery
          – A new MCA error type: an error in the current processor execution context
          – The CPU tags it as action required; it must be dealt with before execution resumes
          – Currently 2 types of architecturally defined core error:
              • Data Load Error
              • Instruction Fetch Error
      • APEI support
          – ACPI Platform Error Interfaces
          – Brings existing h/w error mechanisms together as a coherent infrastructure
          – Consists of 4 separate tables
              • Boot Error Record Table
              • Error Record Serialization Table
              • Error Injection Table
              • Hardware Error Source Table
          – Linux 3.0 as dom0 saves us much effort
              • Much of the dom0 APEI code is reused
              • Little maintenance effort; benefits from kernel improvements
  • 8. Core Error Recovery
      • Xen core error recovery
          – Basically the same MCA infrastructure as uncore error recovery
          – MCE exception ISR
              • MCE is broadcast to all logical processors
              • Error in range of hypervisor/guest
          – If in hypervisor
              • Reset system
                  – Worst case, cannot resume execution
          – If in guest
              • Trigger vMCE to affected guest
              • Trigger vIRQ to dom0 for logging
              • Error contained in guest
                  – Medium case: error in guest kernel, kill the guest
                  – Best case: error in guest app, kill the app
          – Code done; kernel core recovery is needed for fine-grained testing
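
The hypervisor/guest branching on this slide can be sketched as a small decision function. This is an illustrative Python model, not Xen code: the names `Context`, `Action`, and `handle_core_error` are hypothetical, and since the slide does not say which component carries out the "kill" steps, they are modeled only as outcomes.

```python
from enum import Enum, auto
from typing import List


class Context(Enum):
    """Where the faulting address falls (hypothetical classification)."""
    HYPERVISOR = auto()
    GUEST_KERNEL = auto()
    GUEST_APP = auto()


class Action(Enum):
    """Recovery steps named on the slide, modeled as plain labels."""
    RESET_SYSTEM = auto()
    INJECT_VMCE = auto()   # vMCE to the affected guest
    NOTIFY_DOM0 = auto()   # vIRQ to dom0 for logging
    KILL_GUEST = auto()
    KILL_APP = auto()


def handle_core_error(context: Context) -> List[Action]:
    """Return the ordered recovery steps for a core error in `context`."""
    if context is Context.HYPERVISOR:
        # Worst case: execution cannot resume, so the system is reset.
        return [Action.RESET_SYSTEM]

    # Error is contained in a guest: notify the guest and log via dom0.
    actions = [Action.INJECT_VMCE, Action.NOTIFY_DOM0]
    if context is Context.GUEST_KERNEL:
        actions.append(Action.KILL_GUEST)  # medium case
    else:
        actions.append(Action.KILL_APP)    # best case
    return actions


if __name__ == "__main__":
    for ctx in Context:
        print(ctx.name, [a.name for a in handle_core_error(ctx)])
```

Returning a list of steps instead of performing them keeps the sketch free of side effects, so each of the three cases on the slide can be checked in isolation.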
  • 9. APEI support
      • Xen APEI support
          – BERT
              • Boot Error Record Table
                  – For an unhandled fatal error that occurred in a previous boot
              • Xen BERT
                  – Owned by dom0; waiting for kernel BERT support to be ready
          – ERST
              • Error Record Serialization Table
                  – Save/retrieve fatal errors to/from persistent storage
              • Hypervisor ERST:
                  – Save error
              • Dom0 ERST:
                  – Retrieve/clear error
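
The ERST division of labor described here (hypervisor saves, dom0 retrieves and clears) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ErrorRecordStore`, `hypervisor_save`, and `dom0_drain` are hypothetical names, an in-memory dict stands in for persistent storage, and none of the real ACPI serialization mechanics are modeled.

```python
from typing import Dict, List, Optional, Tuple


class ErrorRecordStore:
    """Toy stand-in for ERST-backed persistent storage (hypothetical)."""

    def __init__(self) -> None:
        self._records: Dict[int, bytes] = {}
        self._next_id = 1

    def write(self, payload: bytes) -> int:
        """Store one error record and return its record ID."""
        record_id = self._next_id
        self._next_id += 1
        self._records[record_id] = payload
        return record_id

    def read_next(self) -> Optional[Tuple[int, bytes]]:
        """Return the oldest stored record, or None if the store is empty."""
        if not self._records:
            return None
        record_id = min(self._records)
        return record_id, self._records[record_id]

    def clear(self, record_id: int) -> bool:
        """Delete a record; return False if the ID was not present."""
        return self._records.pop(record_id, None) is not None


def hypervisor_save(store: ErrorRecordStore, payload: bytes) -> int:
    """Hypervisor side: save a fatal error record before going down."""
    return store.write(payload)


def dom0_drain(store: ErrorRecordStore) -> List[bytes]:
    """Dom0 side: retrieve every saved record, clearing each once read."""
    drained: List[bytes] = []
    while True:
        entry = store.read_next()
        if entry is None:
            return drained
        record_id, payload = entry
        drained.append(payload)
        store.clear(record_id)
```

For example, after two `hypervisor_save` calls, a single `dom0_drain` call returns both payloads in the order they were saved and leaves the store empty.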
  • 10. APEI support
      • Xen APEI support
          – EINJ
              • Error Injection Table
                  – Mechanism through which OSPM can inject h/w errors
              • Xen EINJ
                  – Owned by dom0
                  – Testing done based on the error types available in the current BIOS
          – HEST
              • Hardware Error Source Table
                  – Platform-level description of error sources and error notifications
              • Xen HEST
                  – Dom0 owns the SCI logic because of ACPICA
                  – Hypervisor owns the NMI logic; the Xen APEI NMI handler is currently not ready
                  – BIOS support is needed for more error sources and notifications
  • 11. Robustness enhancement
      • Xen RAS robustness enhancement
          – Xen RAS robustness challenges
              • Buggy BIOS
              • Some error types not yet supported by h/w
              • Hard to trigger errors and run automated tests
          – Our work to enhance Xen RAS robustness
              • Did some code cleanup & enhancement
              • Currently supported errors were triggered and tested
              • QA added error-simulator tools and an automated test script
              • EINJ enabling helps debugging & testing greatly
              • Robustness enhancement will continue as new platforms support more error types
  • 12. Agenda
      • Xen RAS overview
      • Xen RAS latest progress
          – Core error recovery
          – APEI support
          – Robustness enhancement
      • Call for co-work
  • 13. Call for co-work
      • I/O error handling
          – PCIe AER, Advanced Error Reporting
          – For devices assigned to dom0/pv domU
              • Basically reuse dom0/domU AER logic
          – For devices assigned to hvm
              • Need PCIe AER support in qemu
                  – Some VALinux work on standard qemu
                  – Porting to Xen qemu with AER support