LCA13: BOF Reliability Accessibility and Serviceability (RAS)

Resource: LCA13
Name: BOF Reliability Accessibility and Serviceability (RAS)
Date: 06-03-2013
Speaker: Robert Richter

Published in: Technology

  1. BOF Reliability Accessibility and Serviceability (RAS)
     LCA13, March 6, 2013
     Robert Richter <robert.richter@calxeda.com>
     (Title image: group photograph at Linaro Connect in Copenhagen, Monday 29 Oct 2012)
  2. Agenda
     (slide footer: www.linaro.org)
  3. Existing subsystems and tools
     • mcelog (http://www.mcelog.org/)
     • EDAC drivers
     • generic tracepoint: trace_mce_record() (implemented only on x86)
     • some other arch specific kernel implementations (alpha, powerpc, sparc)
  4. mcelog
     • x86 shared kernel mce code for Intel and AMD
       ● nmi handler
       ● trace point
       ● MCE device (/dev/mce) for Intel
       ● AMD driver only logs to console
     • buffers not mmap'ed
     • mcelog tool works only for Intel
  5. EDAC drivers
     • many separate drivers available
     • unified sysfs layout
     • edac-util (using sysfs)
     • memory only, no other events (cpu, io, power, etc.)
     • polls only sysfs, not event driven
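The sysfs polling model described on the EDAC slide can be sketched as follows. This is a minimal illustration, not how edac-util itself is implemented. It assumes the memory-controller directories live under `/sys/devices/system/edac/mc` and expose `ce_count` and `ue_count` files; the function names are hypothetical.

```python
from pathlib import Path

# Assumed default location of the EDAC memory-controller directories.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: {"ce": int|None, "ue": int|None}} for each mc* dir."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc_dir / filename).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable: report it as unknown.
                entry[key] = None
        counts[mc_dir.name] = entry
    return counts


def changed_controllers(old, new):
    """Return the controllers whose counters differ between two polls."""
    return {mc: new[mc] for mc in new if old.get(mc) != new[mc]}
```

A caller has to invoke `read_edac_counts()` repeatedly and compare snapshots to notice new errors, which is the "not event driven" limitation the slide points out.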
  6. Tracepoints
     • trace_mce_record()
     • include/trace/events/mce.h
     • perf_event subsystem is used as kernel/user i/f
     • currently only x86 kernel implementation
     • usable for other archs since tracepoints are generic in the kernel
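Because tracepoints are generic, each one describes its record layout in a text `format` file that userland can parse without arch-specific knowledge. The sketch below parses such a description into a field list. It assumes the usual `field:<decl>; offset:<n>; size:<n>; signed:<n>;` line layout; the file is typically found under the tracing directory at `events/mce/mce_record/format`, and the function name is hypothetical.

```python
import re

# One "field:" line of a tracepoint format description.
FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);"
    r"(?:\s*signed:(?P<signed>\d+);)?"
)


def parse_tracepoint_format(text):
    """Return a list of field descriptions found in a format file's text."""
    fields = []
    for line in text.splitlines():
        m = FIELD_RE.match(line.strip())
        if m is None:
            continue  # header, blank line, or "print fmt:" line
        # The last token of the declaration is the field name.
        ctype, _, name = m.group("decl").strip().rpartition(" ")
        signed = m.group("signed")
        fields.append({
            "name": name.lstrip("*").split("[")[0],  # drop pointer/array syntax
            "type": ctype,
            "offset": int(m.group("offset")),
            "size": int(m.group("size")),
            "signed": None if signed is None else signed == "1",
        })
    return fields
```

With the offsets and sizes in hand, a daemon could decode raw records from the event buffer on any architecture that implements the tracepoint.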
  7. Implementing RAS for ARM
     • mcelog not suitable: maintained and developed by Intel only
     • edac: memory errors only, bunch of individual drivers
     • only arch dependent RAS solutions exist
     Add another RAS for ARM? No, let's implement an arch independent RAS solution for Linux.
  8. RAS daemon
     • follow a proposal by Borislav Petkov of a generic approach for a RAS daemon
     • patch set, not yet upstream: https://lkml.org/lkml/2011/4/23/72
     • implement a generic and arch independent daemon
     • use tracepoints and the perf_event subsystem to collect events in a kernel ringbuffer
     • reuse perf_event userland to access the event buffers
     • add a RAS daemon to tools/ of the kernel repository
     • do a reference implementation for x86 and arm
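The ring buffer idea behind this design can be illustrated with a toy model. This is a pure-Python sketch of the head/tail bookkeeping only, not the perf ABI: the real buffer is shared memory whose `data_head` and `data_tail` positions live in `struct perf_event_mmap_page`. The class name and the drop-when-full policy are assumptions made for the example.

```python
class EventRing:
    """Toy fixed-size ring buffer with one producer and one consumer."""

    def __init__(self, size):
        self.size = size
        self.buf = [None] * size
        self.head = 0   # total events written (producer side)
        self.tail = 0   # total events consumed (consumer side)
        self.lost = 0   # events dropped because the ring was full

    def push(self, event):
        """Producer: store an event, or count it as lost if the ring is full."""
        if self.head - self.tail == self.size:
            self.lost += 1
            return False
        self.buf[self.head % self.size] = event
        self.head += 1
        return True

    def drain(self):
        """Consumer: return all unread events in order and advance the tail."""
        events = []
        while self.tail < self.head:
            events.append(self.buf[self.tail % self.size])
            self.tail += 1
        return events
```

Positions only ever grow and are reduced modulo the size on access, so the consumer frees space for the producer simply by advancing `tail`.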
  9. RAS daemon - Advantages
     • generic approach
     • reuse of existing code (esp. perf code)
     • event-driven and mmap'ed buffers
     • supported by the kernel community, x86 maintainers seem to like it
     • standard and flexible framework allows easily adding features by the kernel community (same as for perf)
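The event-driven advantage can be shown in miniature: instead of rereading counters on a timer, the daemon sleeps until a file descriptor becomes readable. In the sketch below an ordinary pipe stands in for a perf event file descriptor, which is an assumption made so the example runs anywhere; the function name is hypothetical.

```python
import os
import selectors


def wait_for_events(fd, timeout=None):
    """Sleep until fd is readable; return its data, or b"" on timeout."""
    with selectors.DefaultSelector() as sel:
        sel.register(fd, selectors.EVENT_READ)
        if not sel.select(timeout):
            return b""  # nothing arrived within the timeout
        return os.read(fd, 4096)
```

For instance, after `r, w = os.pipe()` and `os.write(w, b"mce")`, the call `wait_for_events(r, 1)` returns the written bytes at once, while a call on an empty pipe returns only when the timeout expires.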
  10. RAS daemon - Work to do
     • start with memory error counting (ECC)
     • kernel:
       ● add persistent events to kernel:
         – always enabled since early boot
         – handle multiple users of event buffers
       ● add machine check drivers for ARM
     • userland:
       ● sharing code in tools/perf needs some rework
       ● implement initial version of a RAS daemon with basic feature set (logging and statistics)
       ● define and add more features
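The "logging and statistics" feature set for an initial daemon could look roughly like the sketch below. It is only an illustration of memory error counting under stated assumptions: the event type, its fields, and the class names are hypothetical and are not taken from the proposed patch set.

```python
import logging
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryErrorEvent:
    """Hypothetical decoded event; a real daemon would fill it from a record."""
    controller: str   # e.g. "mc0"
    corrected: bool   # True for a corrected (ECC) error


class RasStats:
    """Log each memory error and keep per-controller counters."""

    def __init__(self, logger=None):
        self.log = logger or logging.getLogger("ras")
        self.counts = Counter()

    def record(self, event):
        kind = "corrected" if event.corrected else "uncorrected"
        self.counts[(event.controller, kind)] += 1
        # Uncorrected errors are logged at a higher severity.
        level = logging.WARNING if event.corrected else logging.ERROR
        self.log.log(level, "%s memory error on %s", kind, event.controller)

    def summary(self):
        """Return {(controller, kind): count} for everything seen so far."""
        return dict(self.counts)
```

Keeping the counters keyed by controller and error kind leaves room for the later "define and add more features" step, such as thresholds per controller.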