LCA13: BOF Reliability Accessibility and Serviceability (RAS)

Resource: LCA13
Name: BOF Reliability Accessibility and Serviceability (RAS)
Date: 06-03-2013
Speaker: Robert Richter




Presentation Transcript

  • LCA13, March 6, 2013. Robert Richter. BOF Reliability Accessibility and Serviceability (RAS)
  • Slide Agenda
  • Existing subsystems and tools
      • mcelog
      • EDAC drivers
      • generic tracepoint: trace_mce_record() (implemented only on x86)
      • some other arch specific kernel implementations (alpha, powerpc, sparc)
  • mcelog
      • x86 shared kernel MCE code for Intel and AMD
          ● NMI handler
          ● tracepoint
          ● MCE device (/dev/mce) for Intel
          ● AMD driver only logs to console
      • buffers not mmap'ed
      • mcelog tool works only for Intel
  • EDAC drivers
      • many separate drivers available
      • unified sysfs layout
      • edac-util (using sysfs)
      • memory only, no other events (cpu, io, power, etc.)
      • polls only sysfs, not event driven
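The sysfs polling model on the EDAC slide can be sketched roughly as follows. This is a minimal illustration, not the edac-util implementation: it assumes the usual EDAC layout in which each memory controller appears as a directory `mcN` under `/sys/devices/system/edac/mc` with `ce_count` and `ue_count` files. The function name `read_edac_counts` and its `root` parameter are hypothetical, introduced so the sketch can be pointed at a test directory.

```python
from pathlib import Path

# Usual location of the EDAC memory-controller tree on Linux (assumption:
# an EDAC driver is loaded; otherwise the directory does not exist).
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} sampled from sysfs.

    Hypothetical helper: counters are only sampled when this is called,
    which is the polling limitation the slide points out.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[key] = 0  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

A monitoring tool built this way has to call the function on a timer and compare successive samples, so an error is only noticed at the next poll.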
  • Tracepoints
      • trace_mce_record()
      • include/trace/events/mce.h
      • perf_event subsystem is used as kernel/user interface
      • currently only x86 kernel implementation
      • usable for other archs since tracepoints are generic in the kernel
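As a rough illustration of how userland finds a tracepoint such as `mce_record`, the sketch below looks up the event's numeric id in the tracing filesystem. It assumes the event tree is mounted at `/sys/kernel/debug/tracing` (newer kernels also expose it under `/sys/kernel/tracing`) and that the event lives in the `mce` group. The helper names `tracepoint_dir` and `tracepoint_id` and the `root` parameter are hypothetical.

```python
from pathlib import Path

# Assumed mount point of the tracing event tree (debugfs).
TRACING_ROOT = "/sys/kernel/debug/tracing"


def tracepoint_dir(system, event, root=TRACING_ROOT):
    """Directory holding the per-event control files (id, enable, format)."""
    return Path(root) / "events" / system / event


def tracepoint_id(system, event, root=TRACING_ROOT):
    """Return the numeric id of a tracepoint, or None if it is unavailable.

    Hypothetical helper. Reading the file usually requires root privileges.
    """
    try:
        return int((tracepoint_dir(system, event, root) / "id").read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("mce:mce_record id:", tracepoint_id("mce", "mce_record"))
```

The id is the value a `perf_event_open()` caller places in `perf_event_attr.config` together with type `PERF_TYPE_TRACEPOINT`, which is how the perf_event subsystem serves as the kernel/user interface for the event.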
  • Implementing RAS for ARM
      • mcelog not suitable: maintained and developed by Intel only
      • edac: memory errors only, bunch of individual drivers
      • only arch dependent RAS solutions exist
    Add another RAS for ARM? No, let's implement an arch independent RAS solution for Linux.
  • RAS daemon
      • follow a proposal by Borislav Petkov of a generic approach for a RAS daemon
      • patch set, not yet upstream
      • implement a generic and arch independent daemon
      • use tracepoints and the perf_event subsystem to collect events in a kernel ringbuffer
      • reuse perf_event userland to access the event buffers
      • add a RAS daemon to tools/ of the kernel repository
      • do a reference implementation for x86 and arm
  • RAS daemon - Advantages
      • generic approach
      • reuse of existing code (esp. perf code)
      • event-driven and mmap'ed buffers
      • supported by the kernel community, x86 maintainers seem to like it
      • standard and flexible framework allows easily adding features by the kernel community (same as for perf)
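The difference between an event-driven consumer and a polling one can be sketched as below. This is only an illustration of the waiting pattern: the proposal on the slides uses mmap'ed perf_event buffers, whereas this sketch assumes a plain readable file descriptor delivering one text record per line as a stand-in. The function name `drain_events` and the `handle` callback are hypothetical.

```python
import os
import selectors


def drain_events(fd, handle, timeout=1.0):
    """Sleep until fd becomes readable, then pass each line read to handle.

    Returns the number of lines handled; 0 means the timeout expired or
    nothing was read. Hypothetical helper: a line split across two reads
    is not reassembled in this sketch.
    """
    sel = selectors.DefaultSelector()
    sel.register(fd, selectors.EVENT_READ)
    handled = 0
    try:
        # select() blocks in the kernel; no CPU time is spent re-reading
        # counters while nothing has happened.
        if sel.select(timeout):
            data = os.read(fd, 65536)
            for line in data.decode(errors="replace").splitlines():
                if line:
                    handle(line)
                    handled += 1
    finally:
        sel.close()
    return handled
```

For example, with `r, w = os.pipe()` and a writer sending `b"a\nb\n"` to `w`, a call to `drain_events(r, print)` wakes up as soon as the data arrives instead of waiting for the next polling interval.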
  • RAS daemon - Work to do
      • start with memory error counting (ECC)
      • kernel:
          ● add persistent events to kernel:
              – always enabled since early boot
              – handle multiple users of event buffers
          ● add machine check drivers for ARM
      • userland:
          ● sharing code in tools/perf needs some rework
          ● implement initial version of a RAS daemon with basic feature set (logging and statistics)
          ● define and add more features
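The "basic feature set (logging and statistics)" for memory error counting could start from something as small as the sketch below. It is not taken from the proposed patch set; the class name `ErrorStats`, its methods, and the `source`/`kind` labels are all hypothetical and only show one way to keep per-source error counts.

```python
from collections import Counter


class ErrorStats:
    """Hypothetical in-memory counter of error events per (source, kind)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, source, kind):
        """Count one event, e.g. record("mc0", "corrected")."""
        self.counts[(source, kind)] += 1

    def total(self, kind):
        """Total number of events of one kind across all sources."""
        return sum(n for (_, k), n in self.counts.items() if k == kind)

    def report(self):
        """Return one log line per (source, kind), sorted for stable output."""
        return [
            f"{source} {kind} {n}"
            for (source, kind), n in sorted(self.counts.items())
        ]


if __name__ == "__main__":
    stats = ErrorStats()
    stats.record("mc0", "corrected")
    stats.record("mc0", "corrected")
    stats.record("mc1", "uncorrected")
    for line in stats.report():
        print(line)
```

A daemon would call `record()` from its event handler and emit `report()` periodically or on request; persistence across restarts is left out of this sketch.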