LCA13: BOF Reliability Accessibility and Serviceability (RAS)

Resource: LCA13

Name: BOF Reliability Accessibility and Serviceability (RAS)
Date: 06-03-2013
Speaker: Robert Richter

Usage Rights

© All Rights Reserved

Presentation Transcript

  • BOF Reliability Accessibility and Serviceability (RAS)
    LCA13, March 6, 2013
    Robert Richter <robert.richter@calxeda.com>
    (Group photograph at Linaro Connect in Copenhagen, Monday 29 Oct 2012)
  • www.linaro.org Agenda
  • Existing subsystems and tools
      • mcelog (http://www.mcelog.org/)
      • EDAC drivers
      • generic tracepoint: trace_mce_record() (implemented only on x86)
      • some other arch specific kernel implementations (alpha, powerpc, sparc)
  • mcelog
      • x86 shared kernel mce code for Intel and AMD
          ● nmi handler
          ● trace point
          ● MCE device (/dev/mce) for Intel
          ● AMD driver only logs to console
      • buffers not mmap'ed
      • mcelog tool works only for Intel
  • EDAC drivers
      • many separate drivers available
      • unified sysfs layout
      • edac-util (using sysfs)
      • memory only, no other events (cpu, io, power, etc.)
      • polls only sysfs, not event driven
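
The polling model described on the EDAC slide can be illustrated with a short sketch. It assumes the common EDAC sysfs layout, with one `mcN` directory per memory controller under `/sys/devices/system/edac/mc` and `ce_count` / `ue_count` files holding the corrected and uncorrected error totals. The function name `read_edac_counts` is hypothetical and not part of edac-util.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": n, "ue": n}} for each memory controller.

    'root' is assumed to follow the usual EDAC sysfs layout; an empty dict
    is returned when no EDAC driver is loaded or the path does not exist.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    # A tool built this way has to call the function repeatedly to notice
    # new errors; nothing notifies it when a counter changes.
    for name, entry in read_edac_counts().items():
        print(name, entry)
```

Because the counters are plain files, a consumer only learns about an error the next time it reads them, which is the limitation the slide points out.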
  • Tracepoints
      • trace_mce_record()
      • include/trace/events/mce.h
      • perf_event subsystem is used as kernel/user i/f
      • currently only x86 kernel implementation
      • usable for other archs since tracepoints are generic in the kernel
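
As a rough illustration of how userland finds a tracepoint before opening it through perf_event: each tracepoint exposes a numeric id in tracefs, and that id is what a perf_event consumer passes as the event config for a tracepoint-type event. The sketch below only performs the lookup. The helper name `tracepoint_id` is hypothetical, and the two tracefs mount points are assumptions about typical systems.

```python
from pathlib import Path

# Assumed tracefs mount points; which one exists depends on the system.
TRACEFS_ROOTS = ("/sys/kernel/tracing", "/sys/kernel/debug/tracing")


def tracepoint_id(system, event, roots=TRACEFS_ROOTS):
    """Return the numeric id of tracepoint system:event, or None.

    Looks for <root>/events/<system>/<event>/id under each candidate root
    and returns the first id that can be read and parsed.
    """
    for root in roots:
        id_file = Path(root) / "events" / system / event / "id"
        try:
            return int(id_file.read_text().strip())
        except (OSError, ValueError):
            continue  # not mounted here, no permission, or unexpected content
    return None


if __name__ == "__main__":
    # mce:mce_record is the event emitted by trace_mce_record(); it is only
    # present on kernels that implement it, and reading the id file usually
    # requires root.
    print(tracepoint_id("mce", "mce_record"))
```

The lookup is architecture-neutral: any architecture that emits the tracepoint would expose it the same way, which is the property the slide relies on.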
  • Implementing RAS for ARM
      • mcelog not suitable: maintained and developed by Intel only
      • edac: memory errors only, bunch of individual drivers
      • only arch dependent RAS solutions exist
    Add another RAS for ARM? No, let's implement an arch independent RAS solution for Linux.
  • RAS daemon
      • follow a proposal by Borislav Petkov of a generic approach for a RAS daemon
      • patch set, not yet upstream: https://lkml.org/lkml/2011/4/23/72
      • implement a generic and arch independent daemon
      • use tracepoints and the perf_event subsystem to collect events in a kernel ringbuffer
      • reuse perf_event userland to access the event buffers
      • add a RAS daemon to tools/ of the kernel repository
      • do a reference implementation for x86 and arm
  • RAS daemon - Advantages
      • generic approach
      • reuse of existing code (esp. perf code)
      • event-driven and mmap'ed buffers
      • supported by the kernel community, x86 maintainers seem to like it
      • standard and flexible framework allows easily adding features by the kernel community (same as for perf)
  • RAS daemon - Work to do
      • start with memory error counting (ECC)
      • kernel:
          ● add persistent events to kernel:
              – always enabled since early boot
              – handle multiple users of event buffers
          ● add machine check drivers for ARM
      • userland:
          ● sharing code in tools/perf needs some rework
          ● implement initial version of a RAS daemon with basic feature set (logging and statistics)
          ● define and add more features
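
The "basic feature set (logging and statistics)" for memory error counting could look roughly like the sketch below. It is not taken from the proposed patch set. The class name `ErrorStats` and the event format (a dict with `type` and `source` keys) are hypothetical, standing in for whatever records the daemon would decode from the perf_event buffers.

```python
import logging
from collections import Counter

log = logging.getLogger("ras-sketch")


class ErrorStats:
    """Logs each error event and keeps a running count per (type, source)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, event):
        # 'event' is a hypothetical dict such as {"type": "ce", "source": "mc0"},
        # where "ce" means a corrected and "ue" an uncorrected ECC error.
        key = (event["type"], event["source"])
        self.counts[key] += 1
        log.warning("memory error: type=%s source=%s total=%d",
                    key[0], key[1], self.counts[key])

    def summary(self):
        """Return a plain dict snapshot of the counters."""
        return dict(self.counts)


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    stats = ErrorStats()
    for ev in ({"type": "ce", "source": "mc0"},
               {"type": "ce", "source": "mc0"},
               {"type": "ue", "source": "mc1"}):
        stats.record(ev)
    print(stats.summary())
```

In an event-driven daemon, `record` would be called once per event delivered from the kernel ring buffer, so the counters stay current without polling.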