CPU and Memory Events

  1. 1. CPU and Memory Events
  2. 2. Topics <ul><li>CPU architecture </li></ul><ul><li>Error reporting banks </li></ul><ul><li>Types of errors and handling </li></ul><ul><li>Addressing memory – discussion and example </li></ul><ul><li>Examples of various error messages </li></ul><ul><li>Utilities and programs </li></ul><ul><li>X64 DIMM replacement guidelines </li></ul>
  3. 3. CPU Architecture
  4. 5. Opteron Processor Overview
  5. 6. Dual Core Opteron
  6. 7. Cache and Memory
  7. 8. Cache Organisation
  8. 9. Cache Details <ul><li>L1 64 KB per core, 2-way set associative </li></ul><ul><li>L1 Data cache protected by ECC </li></ul><ul><li>L1 Instruction cache protected by parity </li></ul><ul><li>L2 cache 16-way set associative </li></ul><ul><li>L2 1 MB per core, holding both data and instructions </li></ul><ul><li>L2 Protected by ECC </li></ul><ul><li>Least Recently Used (LRU) replacement algorithm </li></ul>
  9. 10. Translation Lookaside Buffer <ul><li>L1 32 Entries </li></ul><ul><li>L1 Fully associative </li></ul><ul><li>L2 512 Entries </li></ul><ul><li>L2 4-way associative </li></ul>
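The cache figures above translate into a number of sets once a line size is fixed. The following is a minimal sketch of that arithmetic; the 64-byte line size is an assumption that does not appear on the slide, and `cache_sets` is a hypothetical helper name.

```python
def cache_sets(size_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache.

    line_bytes=64 is an assumption; the slide does not state the line size.
    """
    return size_bytes // (ways * line_bytes)


# L1: 64 KB, 2-way set associative (figures from the slide)
l1_sets = cache_sets(64 * 1024, ways=2)

# L2: 1 MB, 16-way set associative (figures from the slide)
l2_sets = cache_sets(1024 * 1024, ways=16)

print(l1_sets, l2_sets)
```

With these assumptions the L1 has 512 sets and the L2 has 1024 sets.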
  10. 11. Traditional Northbridge
  11. 12. Opteron Northbridge <ul><li>On Processor Die ( Node ) </li></ul><ul><li>Up to 3 Hyper Transport Link Interfaces </li></ul><ul><li>Memory controller </li></ul><ul><li>Interface to memory </li></ul><ul><li>Interface to CPU cores </li></ul><ul><li>ECC errors are detected and corrected here </li></ul><ul><li>On dual core Nodes – shared between CPUs </li></ul>
  12. 13. Opteron server overview <ul><li>Rev E CPUs DDR1 memory </li></ul><ul><li>Rev F CPUs (M2 systems) DDR2 memory </li></ul><ul><li>4 DIMM slots per CPU (at present) </li></ul><ul><li>Servers utilise both memory channels in parallel, allowing a 128 bit access to memory + 16 ECC bits </li></ul><ul><li>Chipkill mode (able to correct up to 4 bits in error if the bits lie within nybble boundaries) </li></ul><ul><li>Capability to address up to 1TB </li></ul>
  13. 14. Error Reporting Banks
  14. 15. Opteron Error Reporting Banks <ul><li>Bank 0 Data cache(DC) </li></ul><ul><li>Bank 1 Instruction Cache(IC) </li></ul><ul><li>Bank 2 Bus Unit (BU) </li></ul><ul><li>Bank 3 Load/Store Unit (LS) </li></ul><ul><li>Bank 4 Northbridge(NB) </li></ul>
  15. 16. Error Reporting Bank Registers <ul><li>Machine check control register (MCi_CTL) </li></ul><ul><li>Error reporting control register mask (MCi_CTL_MASK) </li></ul><ul><li>Machine check status register (MCi_STATUS) </li></ul><ul><li>Machine check address register(MCi_ADDR) </li></ul>
  16. 17. Role of registers <ul><li>MCi_CTL – allows control over what errors will be reported </li></ul><ul><li>MCi_CTL_MASK – allows additional control over the errors reported </li></ul><ul><li>MCi_STATUS – where error information gets reported, e.g. syndrome, type of error </li></ul><ul><li>MCi_ADDR – physical address of failure; important in memory errors (Northbridge, bank 4) </li></ul>
  17. 18. Decoding MCi Status Registers <ul><li>First discover which CPU or Node is reporting the error and which error bank is reporting </li></ul><ul><li>The decode of the status register is dependent on the failing bank </li></ul><ul><li>To decode the error: often the OS or a package on the OS will do much of the work for you </li></ul><ul><li>If you have a Windows system available then consider using MCAT (machine check analysis tool) http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_871_9033,00.html </li></ul>
  18. 19. Decoding MCi Status Registers (cont.) <ul><li>Utilities on the web, e.g. parsemce – use with caution </li></ul><ul><li>Use Infodoc 78336, 82833 </li></ul><ul><li>Manually use the “BIOS and Kernel Developer's Guides” (make sure you use the correct one – note Rev F has a different guide) http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_739_9003,00.html </li></ul><ul><li>Open a collaboration task </li></ul>
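For the manual route, a Northbridge (bank 4) status value can be split into its fields with a few shifts and masks. The following is only a sketch: the bit positions are assumptions based on my reading of the Rev E era BIOS and Kernel Developer's Guide and should be checked against the guide that matches the CPU revision. The function name `decode_nb_status` is hypothetical. The two STATUS values in the usage lines are taken from the mcelog examples on slides 52 and 54.

```python
# Assumed bit positions in the Northbridge MCi_STATUS register.
# Verify against the correct BIOS and Kernel Developer's Guide before use.
FLAGS = {
    63: "valid",
    62: "error overflow",
    61: "uncorrected",
    60: "error enabled",
    58: "address valid",
    57: "processor context corrupt",
    46: "corrected ECC error",
    45: "uncorrected ECC error",
}


def decode_nb_status(status):
    """Split a 64-bit Northbridge status value into named fields."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    syndrome_low = (status >> 47) & 0xFF   # bits 54:47
    syndrome_high = (status >> 24) & 0xFF  # bits 31:24, used in chipkill mode
    return {
        "flags": flags,
        "syndrome": (syndrome_high << 8) | syndrome_low,
        "extended_error_code": (status >> 16) & 0xF,  # bits 19:16
        "error_code": status & 0xFFFF,                # bits 15:0
    }


chipkill = decode_nb_status(0xD422400040080A13)  # STATUS from slide 52
plain_ecc = decode_nb_status(0xD418C00000000A13)  # STATUS from slide 54
print(hex(chipkill["syndrome"]), hex(plain_ecc["syndrome"]))
```

Under these assumptions the two values decode to syndromes 0x4044 and 0x31, which agree with the syndromes mcelog printed for the same events.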
  19. 20. CHIPKILL + SYNDROMES <ul><li>In the Opteron world chipkill is the ability to correct up to 4 contiguous memory bits </li></ul><ul><li>128 data bits + 16 ECC bits = 144 bits </li></ul><ul><li>Single symbol correction, double symbol detection </li></ul><ul><li>1 failing x4 memory chip can generate 16 separate syndromes </li></ul><ul><li>Syndromes can identify the failing bit or bits within a word </li></ul><ul><li>Syndromes will tell you which DIMM in a DIMM pair is failing. They will not identify a DIMM pair or the associated CPU </li></ul>
  20. 21. Portion of chipkill syndrome table 128 bit memory word
  21. 22. 64 bit memory word <ul><li>You may see this on workstations </li></ul><ul><li>Configurations with only 1 DIMM </li></ul><ul><li>64 bits + 8 bits ECC </li></ul><ul><li>Can only correct single bits </li></ul><ul><li>Detect double bit errors </li></ul><ul><li>Syndrome is 8 bits </li></ul>
  22. 23. 64 bit word ECC syndrome table
  23. 24. Error Types and handling
  24. 25. Correctable ECC errors <ul><li>BIOS will log to DMI/SEL during BIOS/POST </li></ul><ul><li>It is the responsibility of the OS to handle correctable errors </li></ul><ul><li>On V20z/V40z, nps reports errors to the SP if the threshold (2 errors in 6 hours) is exceeded. Note that this threshold does not correspond to the DIMM replacement guidelines (CR 6494195, 6386838); NSV 2.4.0.24 will fix this </li></ul><ul><li>How, if, and where correctable ECC errors are reported is dependent on the type and revision of the OS and on which packages are installed. </li></ul>
  25. 26. Handling Uncorrectable errors <ul><li>Two main methods. </li></ul><ul><li>Sync Flood analogous to SPARC “fatal reset” </li></ul><ul><li>Machine Check exception – interrupt which the OS handles (panics) </li></ul>
  26. 27. Sync Flood <ul><ul><li>Sync Flooding is a HyperTransport™ method used to stop data propagation in the case of a serious error. </li></ul></ul><ul><ul><li>The device that detects the error initiates the sync flood. </li></ul></ul><ul><ul><li>All others cease operation and transmit sync flood packets. </li></ul></ul><ul><ul><li>Packets finally reach the South Bridge (e.g. nVidia CK8-04). </li></ul></ul><ul><ul><li>BIOS has pre-programmed the SB to trigger the system RESET signal when a sync flood is detected </li></ul></ul><ul><ul><li>System reboots </li></ul></ul><ul><ul><li>During Boot Block and POST, BIOS analyzes the related error bits in all Nodes and reports the Sync Flood reasons </li></ul></ul><ul><ul><li>First step in debugging: get hold of the SEL. </li></ul></ul>
  27. 28. Sync Flood example SEL
      001 | 01/03/2007 | 21:43:00 | OEM #0x12 | | Asserted
      2101 | OEM record e0 | 00000000040f0c0200400000f2
      2201 | OEM record e0 | 01000000040000000000000000
      2301 | 01/03/2007 | 21:43:15 | Memory | Uncorrectable ECC | Asserted | CPU 1 DIMM 0
      2401 | 01/03/2007 | 21:43:15 | Memory | Memory Device Disabled | Asserted | CPU 1 DIMM 0
      2501 | 01/03/2007 | 21:43:18 | Memory p1.d1.fail | Predictive Failure Asserted
      2601 | 01/03/2007 | 20:43:12 | System Firmware Progress | Motherboard initialization | Asserted
  28. 29. Another example of sync flood error - not so friendly
      1501 | 04/10/2007 | 04:18:02 | OEM #0x12 | | Asserted
      1601 | OEM record e0 | 00004800001111002000000000
      1701 | OEM record e0 | 10ab0000000810000006040012
      1801 | OEM record e0 | 10ab0000001111002011110020
      1901 | OEM record e0 | 1800000000f60000010005001b
      1a01 | OEM record e0 | 180000000000000000dffe0000
      1b01 | OEM record e0 | 1900000000f200002000020c0f
      1c01 | OEM record e0 | 1a00000000f200001000020c0f
      1d01 | OEM record e0 | 1b00000000f200003000020c0f
      1e01 | OEM record e0 | 80004800001111032000000000
  29. 30. Machine check exception <ul><li>For certain unrecoverable errors Machine Check Exceptions are generated </li></ul><ul><li>Generates an interrupt and the OS handles or tries to handle the error eg panics. </li></ul>
  30. 31. Linux machine check exception example CPU 0: Machine Check Exception: 0000000000000004 CPU 0: Machine Check Exception: 0000000000000004 Bank 0: b600000000000185 at 0000000000000940 Kernel panic: CPU context corrupt The above is from kernel: 2.4.21-27.0.1.ELsmp #1 SMP
  31. 32. Machine check exception example Solaris WARNING: MCE: Bank 2: error code 0x863, mserrcode = 0x0ifying DMI Pool Data .... sched: #mc Machine check pid=0, pc=0xfffffffffb8233ea, sp=0xfffffe8000293ad8, eflags=0x216 cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f0<xmme,fxsr,pge,mce,pae,pse> cr2: 8073c62 cr3: d3a7000 cr8: c rdi: ffffffff812dadf0 rsi: ffffffff815f4df0 rdx: 1000 rcx: 42 r8: 1 r9: 1 rax: fffffe8000293c80 rbx: ffffffff81282e00 rbp: fffffe8000293b10 r10: 1 r11: 1 r12: 0 r13: ffffffff81282e00 r14: ffffffff81283318 r15: fffffe800025db40 fsb: ffffffff80000000 gsb: ffffffff81034000 ds: 43 es: 43 fs: 0 gs: 1c3 trp: 12 err: 0 rip: fffffffffb8233ea cs: 28 rfl: 216 rsp: fffffe8000293ad8
  32. 33. Memory Addressing and Interleaving
  33. 34. Example of a DIMM layout
  34. 35. Contiguous addressing versus Interleaving <ul><li>Contiguous – sequential addresses are allocated to the same rank of chips until the capacity is exhausted and then another rank of chips is addressed </li></ul><ul><li>Interleaving – Contiguous addresses are switched between different ranks of memory </li></ul><ul><li>Performance benefit to interleaving </li></ul><ul><li>Good discussion at URL: </li></ul>http://systems-tsc/twiki/pub/Products/SunFireX4100FaqPts/OpteronMemInterlvNotes.pdf
  35. 36. Interleaving <ul><li>Memory DIMMs need to be the same + power of 2 </li></ul><ul><li>Interleave at DIMM level (dual rank) </li></ul><ul><li>Interleave at DIMM pair level </li></ul><ul><li>Interleave at node level (not so common) </li></ul><ul><li>BIOS parameters </li></ul><ul><li>Complicates mapping address to DIMM pair </li></ul>
  36. 37. Rev F DIMM Interleave Addresses
  37. 38. Example of addressing <ul><li>X4100 2 CPUs </li></ul><ul><li>4 x 1GB DIMMs per CPU </li></ul><ul><li>Micron 18VDDF12872G-40BD3 </li></ul><ul><li>Dual rank DIMM </li></ul><ul><li>8 x 64 Meg memory chips/side + ECC chip </li></ul>
  38. 39. Simplified addressing – no interleave <ul><li>Possible 40 bits 0-39 to address 1TB </li></ul><ul><li>128 Bit memory access so first 4 bits is byte address so not used to address memory </li></ul><ul><li>Bits 4 -14 Column address </li></ul><ul><li>Bits 15 – 16 Internal “bank addressing” </li></ul><ul><li>Bits 17-29 Row address </li></ul><ul><li>Bit 30 Chip select ( other side of DIMM) </li></ul><ul><li>Bit 31 Chip select ( other DIMM pair) </li></ul><ul><li>Bit 32 Selects other node </li></ul>
  39. 40. Simplified addressing - interleave <ul><li>Possible 40 bits 0-39 to address 1TB </li></ul><ul><li>128 Bit memory access so first 4 bits are byte addresses so not used to address memory </li></ul><ul><li>Bits 4 -14 Column address </li></ul><ul><li>Bits 15 – 16 Internal “bank addressing” </li></ul><ul><li>Bit 17 Chip select (swapped with bit 30) </li></ul><ul><li>Bit 18 Chip select (swapped with bit 31) </li></ul><ul><li>Bits 19-31 Row address (bits 30, 31 swapped with bits 17 and 18) </li></ul><ul><li>Bit 32 Selects other node </li></ul>
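The two simplified bit layouts above can be written down as field extractions. This is a sketch of the slides' example configuration only (X4100, 4 x 1GB dual rank DIMMs per CPU); it ignores memory hole remapping, and the function and key names are hypothetical.

```python
def field(addr, low, high):
    """Return bits low..high (inclusive) of addr as an integer."""
    return (addr >> low) & ((1 << (high - low + 1)) - 1)


def decode_no_interleave(addr):
    """Simplified layout without interleaving, as listed on slide 39."""
    return {
        "byte": field(addr, 0, 3),               # within the 128 bit access
        "column": field(addr, 4, 14),
        "bank": field(addr, 15, 16),             # internal bank addressing
        "row": field(addr, 17, 29),
        "rank_select": field(addr, 30, 30),      # other side of DIMM
        "dimm_pair_select": field(addr, 31, 31),
        "node": field(addr, 32, 32),
    }


def decode_interleave(addr):
    """Simplified layout with interleaving, as listed on slide 40."""
    return {
        "byte": field(addr, 0, 3),
        "column": field(addr, 4, 14),
        "bank": field(addr, 15, 16),
        "rank_select": field(addr, 17, 17),      # swapped with bit 30
        "dimm_pair_select": field(addr, 18, 18), # swapped with bit 31
        "row": field(addr, 19, 31),
        "node": field(addr, 32, 32),
    }


addr = 0x140000000  # bits 32 and 30 set
print(decode_no_interleave(addr))
print(decode_interleave(addr))
```

The same address selects the second rank when no interleaving is in use, but lands in the row bits when interleaving is enabled, which is why interleaving complicates mapping an address to a DIMM pair.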
  40. 41. Memory/PCI Hole <ul><li>Gap in memory left for legacy I/O devices and drivers that use 32 bit addressing -situated under 4G (0xffffffff) </li></ul><ul><li>Can cause RAM to be unavailable </li></ul><ul><li>Opterons have capability to map around hole thus allowing all of installed RAM to be visible but this means Node address ranges are altered. </li></ul><ul><li>This is known as memory hoisting </li></ul><ul><li>For memory hole discussion see URL: http://techfiles.de/dmelanchthon/files/memory_hole.pdf </li></ul>
  41. 42. Effect of memory hole on address ranges <ul><li>Actual values will depend on configuration, BIOS revision, etc. </li></ul><ul><li>Example is for an X4100 M2 with no HBAs installed, BIOS revision 0ABJX034, running Red Hat Enterprise Linux AS release 4 (Nahant Update 4) </li></ul>
  42. 43. Technique to discover memory ranges on CPU for Linux systems <ul><li>cd /var/log </li></ul><ul><li>grep -i bootmem * </li></ul><ul><li>This is recorded in various files depending on the version and type of OS – most commonly in dmesg </li></ul>
  43. 45. Memory Hole address range without remapping. Node address range displayed at boot. Each Node has 4GB; node 0 has “lost” memory (a 4G address range would be 000000000000000-00000000ffffffff). The memory hole exists between dfffffff and ffffffff, a size of 20000000.
      [root@va64-x4100f-gmp03 log]# pwd
      /var/log
      [root@va64-x4100f-gmp03 log]# grep -i Bootmem mess*
      Bootmem setup node 0 000000000000000-00000000dfffffff
      Bootmem setup node 1 0000000100000000-00000001ffffffff
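Once the Bootmem lines have been found, matching a reported error address to a node is a range lookup. The sketch below parses the two lines shown on this slide; the helper names are hypothetical, and the lookup only identifies the node, not the DIMM pair.

```python
import re

BOOTMEM = re.compile(r"Bootmem setup node (\d+) ([0-9a-fA-F]+)-([0-9a-fA-F]+)")

# Boot log lines copied from the slide above.
LOG = """\
Bootmem setup node 0 000000000000000-00000000dfffffff
Bootmem setup node 1 0000000100000000-00000001ffffffff
"""


def parse_bootmem(text):
    """Return {node: (start, end)} from 'Bootmem setup node' lines."""
    ranges = {}
    for node, start, end in BOOTMEM.findall(text):
        ranges[int(node)] = (int(start, 16), int(end, 16))
    return ranges


def node_for_address(ranges, addr):
    """Return the node whose range contains addr, or None (e.g. the hole)."""
    for node, (start, end) in sorted(ranges.items()):
        if start <= addr <= end:
            return node
    return None


ranges = parse_bootmem(LOG)
print(node_for_address(ranges, 0xCF31F8F0))  # an address below the hole
```

An address inside the memory hole, such as 0xe0000000, belongs to no node in this configuration and the lookup returns `None`.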
  44. 46. Address range with memory remapping around hole (hoisting). In this case we do not lose the memory. RAM addressing is remapped around the memory hole, so the address range on Node 0 grows by 20000000, and the base and limit of node 1 both grow by 20000000.
      Bootmem setup node 0 0000000000000000-000000011fffffff
      Bootmem setup node 1 0000000120000000-000000021fffffff
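The two sets of Bootmem ranges can be reproduced with a little arithmetic. This is a simplified model under stated assumptions: nodes are laid out contiguously, the hole ends at 4G, and only the node whose range spans the hole is affected. `node_ranges` is a hypothetical helper, not something the BIOS or kernel exposes.

```python
FOUR_GB = 1 << 32


def node_ranges(node_sizes, hole_base, hoist):
    """Return [(base, limit), ...] per node for a hole from hole_base to 4G."""
    hole_size = FOUR_GB - hole_base
    ranges = []
    base = 0
    for size in node_sizes:
        limit = base + size - 1
        next_base = base + size
        if base <= hole_base <= limit:   # this node's range spans the hole
            if hoist:
                limit += hole_size       # RAM behind the hole moves above 4G
                next_base = limit + 1
            else:
                limit = hole_base - 1    # RAM behind the hole is not visible
        ranges.append((base, limit))
        base = next_base
    return ranges


sizes = [4 << 30, 4 << 30]  # two nodes with 4GB each, as in the example
for hoist in (False, True):
    print([(hex(b), hex(l)) for b, l in node_ranges(sizes, 0xE0000000, hoist)])
```

With a hole base of 0xe0000000 the model gives the same node ranges as the two boot logs: node 0 ends at 0xdfffffff without remapping and at 0x11fffffff with hoisting.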
  45. 47. Some examples of error reporting
  46. 48. Red Hat 3 Update 2 kernel: CPU 0: Silent Northbridge MCE kernel: Northbridge status 9443c100e3080a13 kernel: ECC syndrome bits e307 kernel: extended error chipkill ecc error kernel: link number 0 kernel: dram scrub error kernel: corrected ecc error kernel: error address valid kernel: error enable kernel: previous error lost kernel: error address 00000000cf31f8f0
  47. 49. Later Red Hat 3 example kernel: CPU 3: Silent Northbridge MCE kernel: Northbridge status d4194000:9b080a13 kernel: Error chipkill ecc error kernel: ECC error syndrome 9b32 kernel: bus error local node response, request didn't time out kernel: generic read kernel: memory access, level generic kernel: link number 0 kernel: corrected ecc error kernel: error overflow kernel: previous error lost kernel: NB error address 0000000ef28df0d8
  48. 50. Example of Red Hat 3 GART error CPU 3: Silent Northbridge MCE Northbridge status a60000010005001b processor context corrupt error address valid error uncorrected previous error lost GART TLB error generic level generic error address 000000007ffe40f0 extended error gart error link number 0 err cpu1 processor context corrupt error address valid error uncorrected previous error lost error address 000000007ffe40f0
  49. 51. Example of EDAC output EDAC MC0: CE - no information available: k8_edac Error Overflow set EDAC k8 MC0: extended error code: ECC chipkill x4 error EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic) EDAC MC0: CE page 0x1fe8e0, offset 0x128, grain 8, syndrome 0x3faf, row 3, channel 1, label "": k8_edac EDAC MC0: CE - no information available: k8_edac Error Overflow set EDAC k8 MC0: extended error code: ECC chipkill x4 error EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
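The useful line in the EDAC output is the "CE page" line. A minimal sketch for pulling its fields out follows; it assumes 4 KB pages when rebuilding the physical address from page and offset, and `parse_edac_ce` is a hypothetical helper name.

```python
import re

CE_LINE = re.compile(
    r"CE page 0x([0-9a-f]+), offset 0x([0-9a-f]+), grain (\d+), "
    r"syndrome 0x([0-9a-f]+), row (\d+), channel (\d+)"
)


def parse_edac_ce(line, page_shift=12):
    """Extract address, syndrome, row and channel from an EDAC CE line.

    page_shift=12 assumes 4 KB pages. Returns None if the line carries
    no page information.
    """
    match = CE_LINE.search(line)
    if match is None:
        return None
    page = int(match.group(1), 16)
    offset = int(match.group(2), 16)
    return {
        "address": (page << page_shift) | offset,
        "syndrome": int(match.group(4), 16),
        "row": int(match.group(5)),
        "channel": int(match.group(6)),
    }


line = ('EDAC MC0: CE page 0x1fe8e0, offset 0x128, grain 8, '
        'syndrome 0x3faf, row 3, channel 1, label "": k8_edac')
info = parse_edac_ce(line)
print(hex(info["address"]), hex(info["syndrome"]), info["row"], info["channel"])
```

Lines such as "CE - no information available" do not match and return `None`, so they can be skipped when scanning a log.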
  50. 52. MCE 1 HARDWARE ERROR. This is *NOT* a software problem! Please contact your hardware vendor CPU 2 4 northbridge TSC e169139a35188 ADDR fa00f7f8 Northbridge Chipkill ECC error Chipkill ECC syndrome = 4044 bit46 = corrected ecc error bit62 = error overflow (multiple errors) bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic' STATUS d422400040080a13 MCGSTATUS 0 Suse mcelog example kernel 2.6.16.27
  51. 53. Further Suse mcelog example MCE 31 HARDWARE ERROR. This is *NOT* a software problem! Please contact your hardware vendor CPU 3 1 instruction cache TSC 3e2dc434cdb5 ADDR fa378ac0 Instruction cache ECC error bit46 = corrected ecc error bit62 = error overflow (multiple errors) bus error 'local node origin, request didn't time out instruction fetch mem transaction memory access, level generic' STATUS d400400000000853 MCGSTATUS 0
  52. 54. ECC ( non chipkill example) CPU 2 4 northbridge TSC 3da2afa1102b ADDR f9076000 Northbridge ECC error ECC syndrome = 31 bit46 = corrected ecc error bit62 = error overflow (multiple errors) bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic' STATUS d418c00000000a13 MCGSTATUS 0
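When many mcelog records have to be compared, the key fields can be pulled out with a regular expression. The sketch below is based only on the record layout visible in the examples above and assumes every record carries an ADDR field; `parse_mcelog_record` is a hypothetical helper name.

```python
import re

RECORD = re.compile(
    r"CPU (\d+) (\d+) (.*?) TSC ([0-9a-f]+)\s+ADDR ([0-9a-f]+)"
    r".*?STATUS ([0-9a-f]+)\s+MCGSTATUS ([0-9a-f]+)",
    re.DOTALL,
)

# Record text copied from the non chipkill example above.
SAMPLE = """CPU 2 4 northbridge TSC 3da2afa1102b ADDR f9076000
Northbridge ECC error ECC syndrome = 31
bit46 = corrected ecc error bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction memory access, level generic'
STATUS d418c00000000a13 MCGSTATUS 0"""


def parse_mcelog_record(text):
    """Return CPU, bank, unit name, address and status from one record."""
    match = RECORD.search(text)
    if match is None:
        return None
    return {
        "cpu": int(match.group(1)),
        "bank": int(match.group(2)),
        "unit": match.group(3),
        "addr": int(match.group(5), 16),
        "status": int(match.group(6), 16),
    }


record = parse_mcelog_record(SAMPLE)
print(record["cpu"], record["bank"], record["unit"], hex(record["addr"]))
```

The bank number in the record (4 here) is the error reporting bank, so bank 4 together with the ADDR value is what is needed to start tracing a memory error.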
  53. 55. Confusing EDAC example: note two MC numbers reporting. eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic) eaebe242 kernel: MC1: CE page 0x25a58c, offset 0x688, grain 8, syndrome 0xf4, row 0, channel 1, label "": k8_edac eaebe242 kernel: MC1: CE - no information available: k8_edac Error Overflow set eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
  54. 56. FMA information examples <ul><li>This is the same error as the EDAC error example. </li></ul># fmdump -v -u 3dadae66-a6e0-67fc-ecf4-d9b7d46aea86 TIME UUID SUNW-MSG-ID Feb 18 15:42:41.1662 3dadae66-a6e0-67fc-ecf4-d9b7d46aea86 AMD-8000-3K 100% fault.memory.dimm_ck Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=3 Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=3 FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=3
  55. 57. fmd: [ID 441519 daemon.error] SUNW-MSG-ID: AMD-8000-3K, TYPE: Fault, VER: 1, SEVERITY: Major EVENT-TIME: Sat Mar 10 00:52:13 MET 2007 PLATFORM: Sun Fire X4100 Server, CSN: 0606AN1288 , HOSTNAME: siegert SOURCE: eft, REV: 1.16 EVENT-ID: 13441a52-c465-629b-ca9d-fc77b0e66354 DESC: The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-3K for more information. AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported. IMPACT: Total system memory capacity will be reduced as pages are retired. REC-ACTION: Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u <EVENT_ID> to identify the module.
  56. 58. # fmdump TIME UUID SUNW-MSG-ID Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K # fmadm faulty STATE RESOURCE / UUID -------- ---------------------------------------------------------------------- degraded mem:///motherboard=0/chip=0/memory-controller=0/dimm=1 13441a52-c465-629b-ca9d-fc77b0e66354 -------- ---------------------------------------------------------------------- # fmdump -v -u 13441a52-c465-629b-ca9d-fc77b0e66354 TIME UUID SUNW-MSG-ID Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K 100% fault.memory.dimm_ck Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1 Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=1 FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
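The FRU and resource strings in the fmdump output are simple name=value paths, so the chip and DIMM numbers can be extracted without much effort. A minimal sketch, with `parse_fmri` as a hypothetical helper name and assuming every path component has an integer value as in the examples above:

```python
def parse_fmri(fmri):
    """Split e.g. 'hc:///motherboard=0/chip=0/...' into (scheme, {name: int})."""
    scheme, _, path = fmri.partition(":///")
    parts = {}
    for component in path.split("/"):
        name, _, value = component.partition("=")
        parts[name] = int(value)
    return scheme, parts


scheme, fru = parse_fmri("hc:///motherboard=0/chip=0/memory-controller=0/dimm=1")
print(scheme, fru["chip"], fru["dimm"])
```

Here `chip` is the CPU (node) and `dimm` is the DIMM number on that CPU's memory controller, as reported by FMA.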
  57. 59. Example of FMA detecting CPU error Solaris handles machine check exception and FMA information is available on reboot
  58. 60. SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major EVENT-TIME: 0x459d66e9.0xbf18650 (0x687a83db95e45) i86pc, CSN: -, HOSTNAME: SOURCE: SunOS, REV: 5.10 Generic_118855-14 DESC: Errors have been detected that require a reboot to ensure system integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information. [Thu Jan 4 21:43:21 2007]AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry REC-ACTION: Save the error summary below in case telemetry cannot be saved [Thu Jan 4 21:43:21 2007] [Thu Jan 4 21:43:21 2007]ereport.cpu.amd.bu.l2t_par ena=7a83db8bc8500401 detector=[ > > version=0 scheme= "hc" hc-list=[...] ] bank-status=b60000000002017a bank-number=2 addr=5a0c addr-valid=1 ip=0 privileged=1 ereport.cpu.amd.bu.l2t_par ena=7a83db9517700401
  59. 61. System now panics and then reboots panic[cpu1]/thread=fffffe800032fc80: Unrecoverable Machine-Check Exception dumping to /dev/dsk/c0t0d0s1, offset 860356608,
  60. 62. SUNW-MSG-ID: AMD-8000-67, TYPE: Fault, VER: 1, Severity Major EVENT-TIME: Fri Jan 5 10:11:10 MET 2007 PLATFORM: Sun Fire X4200 Server, CSN: 0000000000 , HOSTNAME: z-app1.vpv.no1.asap-asp.net SOURCE: eft, REV: 1.16 EVENT-ID: bc534eb7-ca58-ecbf-b225-ddbb79045d8d DESC: The number of errors associated with this CPU has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-67 for more information. RESPONSE: An attempt will be made to remove this CPU from service. IMPACT: Performance of this system may be affected. REC-ACTION: Schedule a repair procedure to replace affected CPU. Use fmdump -v -u <EVENT_ID> to identify the module.
  61. 63. #>fmdump -v -u bc534eb7-ca58-ecbf-b225-ddbb79045d8d TIME UUID SUNW-MSG-ID Jan 05 10:11:10.6392 bc534eb7-ca58-ecbf-b225-ddbb79045d8d AMD-8000-67 100% fault.cpu.amd.l2cachetag Problem in: hc:///motherboard=0/chip=1/cpu=0 Affects: cpu:///cpuid=1 FRU: hc:///motherboard=0/chip=1
  62. 64. Some programs and utilities
  63. 65. HERD <ul><li>Hardware error report and decode </li></ul><ul><li>Installed as an RPM on top of SLES and Red Hat </li></ul><ul><li>Provided by Sun </li></ul><ul><li>Will report errors to the messages file and the service processor </li></ul><ul><li>Same command line options as mcelog </li></ul>http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/HERD
  64. 66. mcelog <ul><li>Linux kernels after 2.6.4 do not print recoverable machine check errors to the messages file or kernel log </li></ul><ul><li>Instead they are saved into /dev/mcelog </li></ul><ul><li>mcelog reads errors from /dev/mcelog and then deletes the entries </li></ul><ul><li>Typically run as a cron job </li></ul><ul><li>E.g. /usr/sbin/mcelog >> /var/log/mce (note this is not collected by sysreport) </li></ul><ul><li>Red Hat have implemented it as a daemon </li></ul><ul><li>See Red Hat advisory RHEA-2006-0134-7 </li></ul>
  65. 67. mcat <ul><li>Runs on windows machines </li></ul><ul><li>AMD utility to decode machine check status </li></ul><ul><li>Decodes Windows event log events </li></ul><ul><li>Can be fed status, bank and address to decode errors reported on other machines </li></ul><ul><li>Download from AMD </li></ul><ul><li>http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_871_9033,00.html </li></ul>
  66. 68. Newisys decoder <ul><li>Utility provided by Newisys to identify the failing DIMM for V20z/V40z http://systems-tsc/twiki/bin/view/Products/ProdTroubleshootingV20z </li></ul><ul><li>Can be used with extreme care on other Rev E systems to decode the Northbridge status and, if the memory DIMM used on the system is the same as on stinger, to help confirm the DIMM. </li></ul>
  67. 69. X64 Memory Replacement Policy
  68. 70. X64 Memory Replacement Policy <ul><li>Why – we expect memory to “fail”, i.e. a proportion of memory will experience transient correctable memory errors that will not re-occur, due to the physics of memory chips </li></ul><ul><li>Also, analysis has shown that in general memory does not degrade, i.e. correctable errors do not degenerate into uncorrectable errors </li></ul><ul><li>https://onestop/qco/x86dimm/index_x86dimm.shtml </li></ul><ul><li>FIN 102195 </li></ul>
  69. 71. Three rules to change DIMMs – I can't count <ul><li>UE failure reported by BIOS/POST </li></ul><ul><li>Solaris 10 U 2 – change a DIMM pair when the system tells you. </li></ul><ul><li>Any UE from systems not running Solaris that you are confident originates from memory </li></ul><ul><li>24 errors from a DIMM in 24 hours </li></ul>
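The last rule, 24 errors from a DIMM in 24 hours, is a sliding window count. The sketch below shows one way to apply it to a stream of correctable error events; the class name and DIMM labels are hypothetical, and it assumes timestamps (in seconds) arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class DimmErrorTracker:
    """Flag a DIMM once `threshold` errors fall inside a sliding window."""

    def __init__(self, threshold=24, window_seconds=24 * 3600):
        self.threshold = threshold
        self.window = window_seconds
        self.events = defaultdict(deque)  # DIMM label -> recent timestamps

    def record(self, dimm, timestamp):
        """Record one correctable error; return True if the rule is met."""
        recent = self.events[dimm]
        recent.append(timestamp)
        # Drop events that are a full window or more older than this one.
        while timestamp - recent[0] >= self.window:
            recent.popleft()
        return len(recent) >= self.threshold


tracker = DimmErrorTracker()
# One error per minute on the same DIMM: the 24th error meets the rule.
results = [tracker.record("CPU 1 DIMM 0", minute * 60) for minute in range(24)]
print(results[-2], results[-1])
```

Errors that arrive slowly, for example one every two hours, never accumulate 24 entries inside the window, so the DIMM is not flagged by this rule.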
  70. 72. Glossary of terms
  71. 73. Glossary of terms <ul><li>EDAC – Error Detection and Correction – term used by the Linux community for the project to handle and identify hardware based errors, formerly known as Bluesmoke </li></ul><ul><li>ECC – Error Correcting Code. In Opteron chipkill mode, 16 bits stored in memory along with 128 bits of data. These bits are created by generating parity from various data bits in the data word. </li></ul>
  72. 74. Glossary of terms <ul><li>Syndrome – In Opteron chipkill mode a 16 bit value (4 hexadecimal digits) which can identify the type of error and failing bits within a nybble. The syndrome is generated from comparing ( exclusive OR) the ECC code generated on the write to the ECC code generated on the read. </li></ul><ul><li>Rank – for the purposes of this TOI it can be considered as a set of memory chips which need a separate chip select signal to select the set of chips eg dual ranked DIMMs need two chip select signals sent from the CPU. DIMM interleaving is done between ranks. </li></ul>
  73. 75. Glossary of terms <ul><li>TLB – Translation Lookaside Buffer – cache in the processor used to map virtual addresses to physical addresses. </li></ul>