XS Boston 2008 OpenSolaris

Frank Van der Linden: Xen and OpenSolaris Fault Management

Transcript

  • 1. Solaris FMA and Xen. Frank van der Linden, Sun Microsystems
  • 2. Overview • What is FMA? • Requirements to implement FMA • Changes made to Xen • Changes made to Solaris • Status / future work
  • 3. What is FMA? • Fault Management Architecture > Also known as Solaris Fault Management • Provides fault diagnosis and recovery • Without FMA: > Random panics > Cryptic error messages > Needless debugging or hardware swapfests • With FMA: > Detailed reports on failure > If possible, recovery
  • 4. What is FMA? (cont.) • Hardware fault handling for: > CPU > Cache > Memory > I/O (PCI) • Initial detection through traps (MCE) or polling • How to work around fatal errors? > CPU offline > Page retire
  • 5. [Chart: 20% / 57% / 15%]
  • 6. Requirement: hardware access • In the plain Solaris implementation, FMA needs access to the actual hardware: > Trap handlers > PCI interrupts > Reading MSRs for polling and diagnostics > Writing MSRs for testing / error injection • But this is a virtualized environment • Need to strike a balance between the access that FMA needs and the abstractions of the hypervisor
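
On bare metal, the FMA CPU code reads machine-check state directly with RDMSR, which is exactly the kind of access that goes away under a hypervisor. A minimal sketch of such a read, assuming a hypothetical helper (real Solaris has its own MSR routines):

    #include <stdint.h>

    /* IA32_MC0_STATUS: first MCA bank status register (architectural MSR 0x401). */
    #define IA32_MC0_STATUS 0x401U

    /*
     * Illustrative bare-metal MSR read. Under Xen, dom0 cannot read the MSRs
     * of an arbitrary physical CPU this way; a hypercall has to stand in.
     */
    static inline uint64_t
    rdmsr_raw(uint32_t msr)
    {
        uint32_t lo, hi;

        __asm__ __volatile__("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
        return (((uint64_t)hi << 32) | lo);
    }
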
  • 7. General design • Hypervisor does the low-level grunt work > Immediate retrieval of error telemetry for CPU and memory after an MCE / NMI > Hypercall interface to retrieve telemetry and to perform recovery actions • Solaris runs as dom0 • DomUs will be informed of fatal errors • Recoverable errors handled by dom0
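
As a rough illustration of the hypercall interface mentioned on this slide, a telemetry-fetch argument block could look like the sketch below; the structure and field names are assumptions for illustration, not the interface that was actually added to Xen.

    #include <stdint.h>

    #define MCA_MAX_BANKS 32

    /* Hypothetical argument block for a "fetch telemetry" hypercall. */
    struct mca_fetch {
        uint32_t flags;                       /* in: urgent (MCE/NMI) vs. polled telemetry */
        uint32_t cpu;                         /* in: physical CPU of interest */
        uint32_t nbanks;                      /* out: number of banks filled in */
        uint64_t mci_status[MCA_MAX_BANKS];   /* out: MCi_STATUS snapshots */
        uint64_t mci_addr[MCA_MAX_BANKS];     /* out: MCi_ADDR where valid */
        uint64_t mci_misc[MCA_MAX_BANKS];     /* out: MCi_MISC where valid */
    };

The dom0 FMA agent would fetch such a record after an MCE/NMI trap or a VIRQ notification and hand it to the diagnosis engines.
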
  • 8. Xen changes: trap handling and telemetry collection • Based on work by Christoph Egger @ AMD • Extend the MCE handling to collect telemetry > Read the MSRs > Store telemetry in a global structure • Domains can install an MCE/NMI trap handler for reporting on unrecoverable errors • Dom0 can install a VIRQ_MCA handler to be informed about recoverable errors (detected by polling) • Hypercalls to retrieve telemetry
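
In outline, the two notification paths this slide describes might be structured as below. Only VIRQ_MCA and the MCE/NMI trap delivery come from the slide; every helper name here is an illustrative stand-in, not the actual Xen change.

    /* Illustrative helpers assumed to exist elsewhere in the hypervisor. */
    int  collect_bank_telemetry(void);   /* read valid MCi_* banks into the global structure */
    void deliver_guest_mce_trap(void);   /* invoke the MCE/NMI handlers domains registered */
    void send_dom0_virq(int virq);       /* post a virtual IRQ to dom0 */
    #define VIRQ_MCA 100                 /* placeholder value, for illustration only */

    /* MCE trap path: unrecoverable errors. */
    void
    mce_handler(void)
    {
        collect_bank_telemetry();        /* store telemetry before anything else */
        deliver_guest_mce_trap();        /* domains report on the fatal error */
    }

    /* Periodic polling path: recoverable (corrected) errors. */
    void
    mca_poll(void)
    {
        if (collect_bank_telemetry() > 0)
            send_dom0_virq(VIRQ_MCA);    /* dom0 retrieves it via the telemetry hypercall */
    }
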
  • 9. Xen changes: fault injection • Fault injection is used for testing purposes • Two ways to inject faults: > Interpose fault data (MSR values) that are “read” later > Actually write MSRs • Add hypercall to either interpose data or actually write the MSR values > Only a subset of MSRs relating to MCA can be written
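
A sketch of how an injection tool in dom0 might drive that hypercall, mirroring the two modes listed above (interpose vs. actually write); the structure, flags, and wrapper name are hypothetical.

    #include <stdint.h>

    #define MCA_INJ_INTERPOSE 0x1   /* fake the value; later telemetry fetches "read" it */
    #define MCA_INJ_WRMSR     0x2   /* actually write the MSR on the target CPU */

    struct mca_inject {
        uint32_t flags;
        uint32_t cpu;               /* target physical CPU */
        uint32_t msr;               /* must be one of the permitted MCA MSRs */
        uint64_t value;
    };

    int mca_inject_hypercall(const struct mca_inject *);   /* hypothetical wrapper */

    /* Interpose a fake uncorrected-error status on CPU 2. */
    static int
    inject_fake_mc0_status(void)
    {
        struct mca_inject inj = {
            .flags = MCA_INJ_INTERPOSE,
            .cpu   = 2,
            .msr   = 0x401,                         /* IA32_MC0_STATUS */
            .value = (1ULL << 63) | (1ULL << 61),   /* VAL | UC bits */
        };

        return (mca_inject_hypercall(&inj));
    }
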
  • 10. Xen changes: exposing CPU data • FMA needs to know a little more about the actual CPUs than is available right now: > Hardware CPU/Core/Thread ID > Number of (active) cores > APIC ID > CPUID info • Hypercall to expose this information • Information is versioned; for an event that changes this information (e.g. CPU offline or online), the version is incremented
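
The versioned record might look roughly like the following; only the listed items (CPU/core/thread IDs, active core count, APIC ID, CPUID info) and the version counter come from the slide, and the field names are illustrative.

    #include <stdint.h>

    /* Hypothetical per-CPU record returned by the physical-CPU-info hypercall. */
    struct phys_cpu_info {
        uint32_t version;     /* incremented on events such as CPU offline/online */
        uint32_t cpu;         /* hardware CPU number */
        uint32_t core_id;
        uint32_t thread_id;
        uint32_t apic_id;
        uint32_t ncores;      /* number of active cores */
        uint32_t cpuid_fms;   /* family/model/stepping from CPUID */
    };

A consumer caches these records and re-fetches whenever the version reported by the hypervisor no longer matches the cached one.
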
  • 11. Xen changes: CPU offline • The decision to offline a CPU is made by the FMA recovery code, running on Solaris/dom0 • It can take a physical CPU offline through a hypercall • There is also a hypercall to bring a CPU back online • Restriction: physical CPU 0 seems to have some special tasks in the hypervisor and cannot currently be offlined
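
Seen from dom0, the recovery action then reduces to something like the sketch below; xen_physcpu_offline() stands in for the offline hypercall and is a hypothetical name.

    #include <errno.h>
    #include <stdint.h>

    int xen_physcpu_offline(uint32_t cpu);   /* hypothetical hypercall wrapper */

    /* FMA recovery action: retire a faulty physical CPU. */
    static int
    fma_retire_physcpu(uint32_t cpu)
    {
        if (cpu == 0)
            return (ENOTSUP);   /* CPU 0 has special duties in the hypervisor */

        return (xen_physcpu_offline(cpu));
    }
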
  • 12. Overview [Diagram: the FMA agent in Solaris dom0 exchanges data and actions with the hypervisor via hypercalls (HC), VIRQ, and trap notifications; domUs receive traps; the hypervisor gathers telemetry from MCE traps, polling, and PCI errors on the hardware]
  • 13. How to take a CPU offline (1) • Stop scheduling VCPUs on it > Scheduler picks from a new cpu set, cpu_sched_map > Take CPU out of cpu_sched_map • Force all VCPUs off the CPU > Go through all domains and force a migration of all their VCPUs onto a new physical CPU > Also migrate timers • This might break CPU affinity or dom0 pinning > Nothing we can do about that; just log the event
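
Put together, the first half of the offline sequence looks roughly like this inside the hypervisor. cpu_sched_map and the per-domain, per-VCPU sweep follow the slide; the migration and logging helpers are illustrative, and Xen-internal types are assumed.

    /* Sketch only: assumes Xen-internal types and illustrative helpers. */
    static void
    cpu_offline_phase1(unsigned int cpu)
    {
        struct domain *d;
        struct vcpu *v;

        /* 1. Stop scheduling here: take the CPU out of cpu_sched_map. */
        cpu_clear(cpu, cpu_sched_map);

        /* 2. Force every VCPU currently on this CPU, plus its timers, elsewhere. */
        for_each_domain(d) {
            for_each_vcpu(d, v) {
                if (v->processor != cpu)
                    continue;
                migrate_vcpu_and_timers(v, pick_online_cpu());
                /* May break affinity or dom0 pinning; nothing to do but log it. */
                log_affinity_break(d, v, cpu);
            }
        }
    }
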
  • 14. How to take a CPU offline (2) • Interrupts must be re-routed > There is some existing code for this • Race condition: PIRQ acks > May come in from a domain after the interrupt was re-routed > Can't wait for a domain to do the ack > So, just do the hardware ack if one is needed, and turn the PIRQ ack from the domain into a no-op when it comes in • Finally, put the CPU into a sleep loop
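
The PIRQ ack race is closed by doing any required hardware ack at re-route time and turning the domain's late ack into a no-op, roughly as below; the function and field names are illustrative, not the actual Xen code.

    /* Sketch only: guest ack of a PIRQ that was re-routed during CPU offline. */
    void
    guest_pirq_ack(struct domain *d, int pirq)
    {
        struct irq_desc *desc = domain_pirq_to_desc(d, pirq);   /* illustrative lookup */

        if (desc->rerouted_during_offline) {
            /*
             * The hardware ack was already done when the interrupt was moved
             * off the dying CPU; the domain's late ack becomes a no-op.
             */
            return;
        }

        hardware_ack_if_needed(desc);
    }
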
  • 15. Solaris changes • Don't depend on the hardware access interface > Make FMA modules, such as the CPU handling code, use an abstraction layer for MSR reads, topology info, etc. > Abstraction layers and libraries also come in handy for the SPARC architecture and LDoms virtualization • Don't depend on MSI functionality for PCI error reporting > Fall back on normal PIRQ if MSI not available • Retrieve actual physical CPU info when coming up • Set up NMI and MCE handlers for the dom0 case
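
One way to picture that abstraction layer is an ops vector the FMA CPU modules call instead of touching the hardware directly: a bare-metal backend issues RDMSR/WRMSR, while the dom0 backend issues the hypercalls described earlier. The names below are illustrative, not the actual Solaris interfaces.

    #include <stdint.h>

    struct phys_cpu_info;   /* physical CPU topology record (see the earlier sketch) */

    /* Hypothetical ops vector hiding how MSRs and topology info are reached. */
    typedef struct fma_hw_ops {
        uint64_t (*hw_rdmsr)(int physcpu, uint32_t msr);
        void     (*hw_wrmsr)(int physcpu, uint32_t msr, uint64_t val);
        int      (*hw_cpu_info)(int physcpu, struct phys_cpu_info *out);
    } fma_hw_ops_t;

    /* Registered at boot: a native vector on bare metal, a Xen vector under dom0. */
    extern const fma_hw_ops_t *fma_hw_ops;

The same split helps the SPARC/LDoms case mentioned on the slide, since only the backend differs.
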
  • 16. Status / Future work • Done: > Telemetry retrieval > Physical CPU info retrieval > CPU offline • Tested both for simulated errors and actual hardware errors • To be done: > CPU online > Page retire > Cleanup > Feed back the code • FMA domain?
  • 17. Questions? Frank van der Linden, fvdl@sun.com
