
Advanced Diagnostics 2



  1. Diagnostics 1
  2. Diagnostics 2
     ● SP Diags (stinger)
     ● Spdiag Tool (galaxy)
     ● SunVTS
     ● PC Check
     ● HDT (sundiag replacement)
     ● CSTH
     ● Herd & EDAC
     ● MCAT
     ● Bonnie
     ● Memtest
     ● Web Pages
     ● Decoder Tools
  3. Sun Confidential: Internal Only
     Day 2 PM and Acknowledgements
     • I have borrowed/stolen/copied* the following in this presentation:
     • Newisys decoder from Barry Wright
     • HDT from Bernward Schwartz
     • SGR from http://panacea/twiki/bin/view/SGR/WebHome
  4. SP Diags for V20/40z
     • Not to be confused with the “spdiag” Tool
     • Bootable CD (nsv or above required) or SP based
     • Enable diagnostic boot in BIOS for bootable CD
     • NSV installed on remote system and mounted locally by NFS.
  5. SP Diags for V20/40z
     ● Install diags:
         cp -r /mnt/cdrom/nsv_file /mnt/nsv/
         cd /mnt/nsv/
         unzip -a *.zip
         chmod 777 /mnt/nsv/diags/NSV_version_number/scripts
         chmod -R 755 /mnt/nsv/diags/NSV_version_number/mppc
     Note: Now ensure NFS is enabled on the server and that it can export the file system.
         sp add mount -r NFS_server_hostname:/directory_with_NSV_files -l /mnt
         sp update diags -p /mnt/diags/DIAGS_version#
  6. SP Diags for V20/40z
     ● diags start (for standalone)
     ● diags start -n (on-line nic, disk, mem)
     ● diags get state (confirm diags are loaded)
     ● diags get tests (list diagnostics tests)
     ● diags run tests -av
     ● diags run tests -av >/mnt/log/diags.log
     ● diags terminate
     Ensure diags, BIOS and drivers are compatible. Diags will fail to run otherwise.
  7. SP Diags for V20/40z
     • diags -h (this will show all syntax)
     • diags -a -v (full test)
     • Bootable CD:
       > diags terminate -n
       > diags start -n
       > diags run tests -a -v >diags.out &
       > tail -f diags.out
  8. The “spdiag” Tool (Galaxy)
     • SP based diagnostic
     • Tests i2c, voltage, fans, temp
     • Stop IPMI: /etc/init.d/ipmistack stop
     • /usr/local/bin/spdiag 1 g4 i2ctst
     • Reboot SP
  9. PC Check
     • Supplemental/Tools CD and now boot menu
     • AMD based X2100, X2100M2, X2200M2 and all new X4x40 platforms
     • All Intel based platforms
     • Monitor and keyboard
     • Serial port
     • Scripts, burn-in tests, loopback
  10. PC Check
      • Front Menu:
        System Information menu
        Advanced Diagnostics Tests
        Immediate Burn-in Testing
        Deferred Burn-in Testing
        Create Diagnostic Partition
        Show Results Summary
        Print Results Report
  11. PC Check
      • Burn-in Testing:
        > quick.tst - requires user input, no time-out
        > noinput.tst - no user input, good first test
        > full.tst - requires loopback & user input
      • Command Line:
        > Example: pccheck cpu.tst /BD
        > pccheck /? - shows all flags
        > pccheck suncsi.tst /IS /BD /KS /MH 30 /HMD 1m /HDD 1m /SD 5m
  12. SUNvts
      • What are you trying to test/replicate?
      • Local or bootable CD-ROM
      • Galaxy 2.2 cd contains vts6.3
      • GUI or command line
      • Unsupported platforms:
        > /opt/SUNWvts/lib/conf/platform.conf
        > smbios | grep Product
        > Boot with graphics head
        > Edit tty boot console=ttya,ttya-mode="9600,8,n,1,-"
  13. SUNvts
  14. HDT (Hardware Debug Tool)
      • PLEASE USE WITH CAUTION !!!
      • Will hang the host if OS running
      • Reboot SP after use
        http://panacea/twiki/bin/view/Products/Galaxydiag
      • Additional tools:
      • /usr/local/bin/ -nohdtl disables hdt test
      • /usr/local/bin/
  15. Platform Specifics
      • On G4:
        > HDT uses some signals over the i2c bus => IPMI on the SP has to be shut down. SP should be rebooted when done with hdt diags.
        > JTAG chain goes through all CPU modules => All slots must have a CPU or filler module inserted for HDT to work on G4
        > Direct access to all CPUs, default is cpu 0
      • Other Platforms:
        > Only CPU0 in JTAG chain
        > no i2c involved, only used for platform identification
      From Bernward Schwarte presentation.
  16. Getting Started & Cautions
      • hdt or hdtl?
        > Current hdt binary and some documentation at: under Galaxy->Pre-OS-Diagnostics
        > Copy to SP: scp hdt sunservice@<SPIP>:/coredump
        > ssh sunservice@<SPIP> password: changeme
        > cd /coredump (or check /usr/local/bin for the built in copy)
      • Caution: All hdt commands stop the CPUs. Some hdt commands will reset/power-cycle the system.
      • All command line parameters are interpreted as hex values!
      • ./hdt prints syntax of all available commands
      • ./hdt -pd 0 18 0
      • hdt leaves the CPU in HDT-mode when exiting; use the “-e” option to exit HDT-mode
  17. Available Commands/Diagnostics
      • Basics:
        > Single HDT command: -h (* Note: this is not -help)
        > Access io- and memory space: -mr, -mw, -ir, -iw
        > Access CPU registers: -rd, -rr, -rw
        > Single step: -hs
      • Control:
        > Reset: -xr [b c]
        > Stop at reset: -xs [b c p]
        > Resource init: “-hi” (sets up HT routing and resources)
        > Power On/Off: -o [0 off, 1 cycle, 2 on]
        > Set CPU: -c (G4 only)
        > Breakpoints: -bps -bpm -bpc
        > Exit: -e
  18. Diagnostics
      • Extended:
        > PCI configuration space access: -pr, -pw, -pd, -ps
        > “Dump” commands
          – Machine check: -dm
          – DIMM SPD: -dd
          – CMOS: -dc
          – SIO: -ds
          – Flash: -df
        > HT link testing: -a
          – Powercycles, stops at reset vector, sets all HT links, warm resets
  19. HDT Not Working
      > Depending on system state HDT can be non-functional
      > To capture some system/error state:
        – Reset system and stop at reset vector: hdt -xs b
        – Init HT routing and PCI bridge enumeration: hdt -hi
        – Dump Machine check and HT link status: hdt -dm -dl
      hdtDiag: Galaxy/Thumper HDT Diagnostics, Version 0.7.0
      -------------------------------------------------------
      hdtDiag: Error, HDT command failed, no CFF cpu 0
      hdtDiag: SysIdent: HDT access failing
      hdtDiag: defaulting to G12X
  20. HDT
      • Check Versions 0.8.0, 0.8.3, 0.9.9, 1.3, 1.4.1 etc
      • ./hdt -xs
      • ./hdt -hi
      • ./hdt -l -q or try ./hdt -l -a
      • ./hdt -e
      • Reboot SP
  21. CSTH (Continuous System Telemetry Harness)
      • Calls ipmitool to create a telemetry stream of:
        > volt, temp, current, fans and PSU variables
      ● Collect data and submit for analysis to engineering:
      ● ./start-csth-ipmi <spname> <splogin> <sppasswd> [--interval <numsecs>]
      ● Example:
        ➢ ./start-csth-ipmi test-sp admin test.pass 60 &
        ➢ ./stop-csth-ipmi test-sp
  22. CSTH (Example From an x4200)
  23. HERD (Hardware Error Report Decode)
      • Hardware error reporting and decoding from mcelog or via the command line with kernel 2.6.4 or above
      • Installed as an RPM on top of SLES and Red Hat
      • Provided by Sun Microsystems
      • Will report errors to the messages file and the service processor (if applicable)
      • Same command line options as mcelog
      • When using the herd -e function, it must be run on the same host that reported the errors.
  24. HERD (Hardware Error Report Decode)
      • Example from console / logs:
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: HARDWARE ERROR. This is *NOT* a software problem!
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: Please contact your hardware vendor
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: CPU 0 4 northbridge
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: TSC fcc73b11cf
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: ADDR 142110
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: Northbridge Chipkill ECC error
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: Chipkill ECC syndrome = 11ea
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit46 = corrected ecc error
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit57 = processor context corrupt
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: bit61 = error uncorrected
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic'
        Mar 5 18:03:01 va64-x2200c-gmp03 herd: STATUS b675410011080a13 MCGSTATUS 0
      • Example of running herd manually (pre herd install):
        # herd -e 142110
        000000142110: Cpu Node 0, DIMM 2
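The bit flags and the syndrome that herd prints can be cross-checked by hand from the STATUS word. The following is a minimal sketch, not herd's actual implementation: the bit positions (flags at bits 63 to 57, corrected/uncorrected ECC at bits 46/45, syndrome split across bits 54:47 and 31:24, extended error code at bits 19:16) are assumed from the family 0Fh northbridge status layout and should be confirmed against the AMD guide for your processor revision. The function and dictionary names are made up for illustration.

```python
# Sketch: decode an AMD K8 northbridge (bank 4) machine check STATUS word.
# Bit positions are an assumption based on the family 0Fh layout.

FLAG_BITS = {
    "valid": 63,
    "overflow": 62,
    "uncorrected": 61,
    "enabled": 60,
    "misc_valid": 59,
    "addr_valid": 58,
    "context_corrupt": 57,
    "corrected_ecc": 46,
    "uncorrected_ecc": 45,
}


def decode_nb_status(status):
    """Split a 64-bit status value into flags, syndrome and error codes."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    syndrome_low = (status >> 47) & 0xFF   # syndrome bits 7:0
    syndrome_high = (status >> 24) & 0xFF  # syndrome bits 15:8
    return {
        "flags": flags,
        "syndrome": (syndrome_high << 8) | syndrome_low,
        "ext_error_code": (status >> 16) & 0xF,
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # STATUS value taken from the herd log on this slide.
    herd = decode_nb_status(0xB675410011080A13)
    print(hex(herd["syndrome"]))  # 0x11ea, matching "Chipkill ECC syndrome = 11ea"
    print([name for name, is_set in herd["flags"].items() if is_set])
```

With the herd STATUS value this reproduces the three bits the log reports (46, 57 and 61) and the 11ea syndrome, which is a useful sanity check before trusting the layout on another platform.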
  25. EDAC (Kernel 2.6.20.xx and above)
      Two examples of edac, not working and working (x2200).

      Not working:
        Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
          << Multiple CE in quick succession or DIMM layout
        Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: Failed to translate InputAddr to csrow for address 0xbb2c2fc0
        Feb 25 06:50:57 va64-x2200c kernel: MC0: CE - no information available: k8_edac
        Feb 25 06:50:57 va64-x2200c kernel: MC0: CE - no information available: k8_edac Error Overflow set
        Feb 25 06:50:57 va64-x2200c kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
          ^^ Failed to translate due to overflow bit set
      This happens if more than one error has occurred before edac gets to it, or if edac does not understand the DIMM layout.

      Here is the correct format of edac's output:
        Mar 4 10:43:42 va64-x2200c kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
          << Always CPU 0 (reporting error)
        Mar 4 10:43:42 va64-x2200c kernel: MC0: CE page 0x100010, offset 0x10, grain 8, syndrome 0xa1e8, row 0, channel 0, label "": k8_edac
          ^^ This event tells you the actual offending CPU, which in this instance is CPU 0. (The label is not used by default, but Sun or the customer may populate it.)
        Mar 4 10:43:42 va64-x2200c kernel: EDAC k8 MC0: extended error code: ECC error
          << Decode below:
      MC0: CE error page 0x100010 adding offset of 0x10 = Address (0x10100010)
      Grain = 8 which is Chipkill
      Row 0, Channel 0 = CPU0, DIMM0

        Row      Channel 0   Channel 1
        ================================
        csrow0 | DIMM_A0   | DIMM_B0  |
        csrow1 | DIMM_A0   | DIMM_B0  |
        csrow2 | DIMM_A1   | DIMM_B1  |
        csrow3 | DIMM_A1   | DIMM_B1  |
        ================================
      If single rank DIMMs (1GB or less) are fitted, csrow1 and csrow3 are not used/available.
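The row/channel decode on this slide can be sketched as a small log parser. This is an illustration only: the regular expression is written against the single CE line shown above, the csrow-to-DIMM mapping is copied from the slide's table, and the names `parse_ce` and `CE_PATTERN` are hypothetical rather than part of edac.

```python
import re

# Pattern written against the k8_edac CE line shown on the slide.
CE_PATTERN = re.compile(
    r"MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def parse_ce(line):
    """Return the decoded CE fields, or None if the line carries no location."""
    match = CE_PATTERN.search(line)
    if match is None:
        return None
    event = {
        "mc": int(match.group("mc")),
        "page": int(match.group("page"), 16),
        "offset": int(match.group("offset"), 16),
        "grain": int(match.group("grain")),
        "syndrome": int(match.group("syndrome"), 16),
        "row": int(match.group("row")),
        "channel": int(match.group("channel")),
    }
    # Slide table: csrow0/1 -> slot 0, csrow2/3 -> slot 1; channel 0 -> A, 1 -> B.
    bank = "A" if event["channel"] == 0 else "B"
    event["dimm"] = "DIMM_%s%d" % (bank, event["row"] // 2)
    return event


if __name__ == "__main__":
    good = ('Mar 4 10:43:42 va64-x2200c kernel: MC0: CE page 0x100010, offset 0x10, '
            'grain 8, syndrome 0xa1e8, row 0, channel 0, label "": k8_edac')
    print(parse_ce(good)["dimm"])
```

Lines of the "CE - no information available" kind simply return None, which makes it easy to count how many events edac failed to translate.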
  26. EDAC Continued
      • Example output from the SP (not created by edac):
        1 | 02/23/2008 | 02:13:08 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
        >>> Edac log should be here but does not show - Instead, you just see the BIOS scrubber results <<<
        2 | 02/25/2008 | 16:27:55 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
        3 | 02/25/2008 | 17:27:58 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
        4 | 02/25/2008 | 18:28:00 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
        5 | 02/25/2008 | 19:28:02 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
        6 | 02/25/2008 | 20:28:04 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
        7 | 02/25/2008 | 21:28:06 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
        8 | 02/25/2008 | 22:28:08 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
        9 | 02/25/2008 | 23:28:10 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
        ... and so on ...
      • Do a cat of /proc/mc/0 to get a row/column summary of the events that occurred.
      • It's edac or herd, not both!!! They both try to grab /dev/mce events and report. (rmmod k8_edac to remove)
      • And remember, the SEL log is your friend, so always get an ipmi dump first before escalating or decoding.
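When a SEL dump runs to hundreds of entries, a per-DIMM tally is quicker to read than the raw list. A minimal sketch, assuming the pipe-separated layout shown on this slide (record, date, time, sensor, event, state); the function name is hypothetical:

```python
from collections import Counter


def count_correctable_ecc(sel_text):
    """Count asserted 'Correctable ECC' SEL entries per sensor name."""
    counts = Counter()
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 6:
            continue  # annotation or blank line, not a SEL record
        sensor, event, state = fields[3], fields[4], fields[5]
        if event == "Correctable ECC" and state == "Asserted":
            counts[sensor] += 1
    return counts


if __name__ == "__main__":
    sample = """\
1 | 02/23/2008 | 02:13:08 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
2 | 02/25/2008 | 16:27:55 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
3 | 02/25/2008 | 17:27:58 | Memory CPU0 DIMM0 | Correctable ECC | Asserted
"""
    for sensor, total in count_correctable_ecc(sample).items():
        print(sensor, total)
```

The hourly spacing of the entries in the slide's listing is what points at the BIOS scrubber; the tally only tells you which DIMM is accumulating events.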
  27. The “mcelog”
      • Linux kernels after 2.6.4 do not print recoverable machine check errors
      • Messages are saved in /var/log/mcelog
      • mcelog reads errors from /dev/mcelog and then deletes the entries
      • Typically run as a cron job:
        > /usr/sbin/mcelog >> /var/log/mcelog
        > *Note: this is not collected by sysreport
      • RedHat implemented it as a daemon
      • See RedHat advisory RHEA-2006-0134-7
  28. MCAT (Machine Check Analysis Tool)
      Takes input from a Windows Event Log entry and decodes the output:
        Event Source 62 - WMIxWDM
        Processor Number : 0
        Bank Number : 4
        Time Stamp (0x): 01C856C4 58A8C10D
        Error Status (0x): D4714000 E1080A13
        Error Address (0x): 00000000 A047BF50
        Error Misc. (0x): 00000000 00000000
        Single bit errors:
          Correctable ECC error
          Error address valid in MCi_ADDR
          Error reporting enabled
          Second error
          Error valid
      Cont: >>
        Bus Error Code:
          Participation processor: Local node responded to the request (RES)
          Time-out: Request did not time out
          Memory transaction type: Generic read (RD)
          I/O: DRAM memory access (MEM)
          Cache level: Generic (LG)
        North Bridge Error MC4:
          Extended Error Code: 0x8 - ChipKill ECC Error
          Error Code: 0x0A13
          DRAM memory access (MEM) Generic read (RD), on Generic (LG) cache
          ChipKill Syndrome: 0xE1E2
          Error address at 2564 MB
  29. MCAT Continued
      FRU Output for this failing platform:
      • This can be gathered by running ipmitool fru:
        FRU Device Description : p0.fru (ID 6)
          Product Manufacturer : ADVANCED MICRO DEVICES
          Product Name : DUAL CORE AMD OPTERON(TM) 275
          Product Part Number : 0F21
          Product Version : 02
        FRU Device Description : p0.d0.fru (ID 8)
          Product Manufacturer : MICRON TECHNOLOGY
          Product Name : 1024MB DDR 400 (PC3200) ECC
          Product Part Number : 18VDDF12872G-40BD3
          Product Version : 0300
          Product Serial : D7010058
      Continued: >>>
        FRU Device Description : p0.d1.fru (ID 9)
          Product Manufacturer : MICRON TECHNOLOGY
          Product Name : 1024MB DDR 400 (PC3200) ECC
          Product Part Number : 18VDDF12872G-40BD3
          Product Version : 0300
          Product Serial : D7010056
        FRU Device Description : p0.d2.fru (ID 10)
          Product Manufacturer : MICRON TECHNOLOGY
          Product Name : 1024MB DDR 400 (PC3200) ECC
          Product Part Number : 18VDDF12872G-40BD3
          Product Version : 0300
          Product Serial : D701A6F4
        FRU Device Description : p0.d3.fru (ID 11)
          Product Manufacturer : MICRON TECHNOLOGY
          Product Name : 1024MB DDR 400 (PC3200) ECC
          Product Part Number : 18VDDF12872G-40BD3
          Product Version : 0300
          Product Serial : D701A6EE
  30. Manual Diagnosis
      Windows reporting decode is performed as follows:
      • Processor Number: 0 means CPU 0. (If it said 4, that would be socket CPU4, not core 4.)
      • Error address at 2564 MB (i.e. between 2 and 3 GBytes).
      • From the FRU information, each DIMM is 1 GByte.
      • The DIMMs are numbered from closest to the CPU outwards, based on mapping. (DIMMs should be populated from the outside inward but are mapped closest to CPU outwards.) The BIOS sets up memory from DIMM0/1 outwards.
      • Assuming "optimal defaults": our Opterons use a 128-bit wide data path, so DIMM0 and DIMM1 are used as a pair.
      • These are single-rank DIMMs, but they are all the same, so "chipselect interleaving" is in use. The first 128KB are on DIMM0 and 1. The second 128KB are on DIMM2 and 3.
      • 2564/128 = 20.03 ----> chunk 20, an even number, which is in the DIMM0 and DIMM1 pair. (Always replace Opteron platform DIMMs in pairs.)
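The interleave arithmetic above can be written out as a short sketch. It follows the slide's working literally: the figures 2564 and 128 are used exactly as the slide divides them (the slide quotes one in MB and the other in KB, and this sketch does not try to reconcile the units), and it assumes chunks simply alternate between the two DIMM pairs. The function name and the pair list are illustrative only.

```python
def interleaved_pair(error_address, chunk_size, pairs):
    """Return (chunk index, DIMM pair) for an address under simple alternation."""
    chunk = error_address // chunk_size
    return chunk, pairs[chunk % len(pairs)]


if __name__ == "__main__":
    # Pairs in the order the BIOS maps them, per the slide.
    dimm_pairs = [("DIMM0", "DIMM1"), ("DIMM2", "DIMM3")]
    chunk, pair = interleaved_pair(2564, 128, dimm_pairs)
    print(chunk, pair)  # chunk 20 is even, so the first pair is selected
```

If node interleaving or a different DIMM population is in play, the alternation assumption no longer holds and the decoder tools later in this deck are the safer route.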
  31. Manual Diagnosis
      Manual diagnosis continued:
      • ChipKill Syndrome: 0xE1E2
      • Looking this up in table 26 of the AMD BIOS And Kernel Writer's Guide shows that this is symbol 0x1a. According to the text above table 26, this symbol maps to the upper 64 bits of the 128-bit data path.
      • DIMM0 (symbols 00h-0fh) provides the low 64 bits; DIMM1 (symbols 10h-1fh) provides the high 64 bits.
      • The check bits for the lower 64 bits are symbols 20h-21h and the check bits for the upper 64 bits are symbols 22h-23h.
      • Technical documentation including the AMD BIOS and Kernel Writers Guide is available from AMD via:,,30_182_739_9003,00.html
      • Remember to download the correct document for your processor revision:
        > Single/Dual core Opteron for x2100, x2200, x4100, x4200, x4500, x4600 etc is document family 0fh.
        > Quad core Opteron for supported platforms is document family 10h.
  32. Manual Diagnosis
      Chipkill Syndrome Table for 0Fh CPUs, 0-63 data bits

      Symbol  1h   2h   3h   4h   5h   6h   7h   8h   9h   ah   bh   ch   dh   eh   fh
      00h     e821 7c32 9413 bb44 5365 c776 2f57 dd88 35a9 a1ba 499b 66cc 8eed 1afe f2df
      01h     5d31 a612 fb23 9584 c8b5 3396 6ea7 eac8 b7f9 4cda 11eb 7f4c 227d d95e 846f
      02h     0001 0002 0003 0004 0005 0006 0007 0008 0009 000a 000b 000c 000d 000e 000f
      03h     2021 3032 1013 4044 6065 7076 5057 8088 a0a9 b0ba 909b c0cc e0ed f0fe d0df
      04h     5041 a082 f0c3 9054 c015 30d6 6097 e0a8 b0e9 402a 106b 70fc 20bd d07e 803f
      05h     be21 d732 6913 2144 9f65 f676 4857 3288 8ca9 e5ba 5b9b 13cc aded c4fe 7adf
      06h     4951 8ea2 c7f3 5394 1ac5 dd36 9467 a1e8 e8b9 2f4a 661b f27c bb2d 7cde 358f
      07h     74e1 9872 ec93 d6b4 a255 4ec6 3a27 6bd8 1f39 f3aa 874b bd6c c98d 251e 51ff
      08h     15c1 2a42 3f83 cef4 db35 e4b6 f177 4758 5299 6d1a 78db 89ac 9c6d a3ee b62f
      09h     3d01 1602 2b03 8504 b805 9306 ae07 ca08 f709 dc0a e10b 4f0c 720d 590e 640f
      0ah     9801 ec02 7403 6b04 f305 8706 1f07 bd08 2509 510a c90b d60c 4e0d 3a0e a20f
      0bh     d131 6212 b323 3884 e9b5 5a96 8ba7 1cc8 cdf9 7eda afeb 244c f57d 465e 976f
      0ch     e1d1 7262 93b3 b834 59e5 ca56 2b87 dc18 3dc9 ae7a 4fab 642c 85fd 164e f79f
      0dh     6051 b0a2 d0f3 1094 70c5 a036 c067 20e8 40b9 904a f01b 307c 502d 80de e08f
      0eh     a4c1 f842 5c83 e6f4 4235 1eb6 ba77 7b58 df99 831a 27db 9dac 396d 65ee c12f
      0fh     11c1 2242 3383 c8f4 d935 eab6 fb77 4c58 5d99 6e1a 7fdb 84ac 956d a6ee b72f
  33. Manual Diagnosis
      Chipkill Syndrome Table for 0Fh CPUs, 64-127 data bits

      Symbol  1h   2h   3h   4h   5h   6h   7h   8h   9h   ah   bh   ch   dh   eh   fh
      10h     45d1 8a62 cfb3 5e34 1be5 d456 9187 a718 e2c9 2d7a 68ab f92c bcfd 734e 369f
      11h     63e1 b172 d293 14b4 7755 a5c6 c627 28d8 4b39 99aa fa4b 3c6c 5f8d 8d1e eeff
      12h     b741 d982 6ec3 2254 9515 fbd6 4c97 33a8 84e9 ea2a 5d6b 11fc a6bd c87e 7f3f
      13h     dd41 6682 bbc3 3554 e815 53d6 8e97 1aa8 c7e9 7c2a a16b 2ffc f2bd 497e 943f
      14h     2bd1 3d62 16b3 4f34 64e5 7256 5987 8518 aec9 b87a 93ab ca2c e1fd f74e dc9f
      15h     83c1 c142 4283 a4f4 2735 65b6 e677 f858 7b99 391a badb 5cac df6d 9dee 1e2f
      16h     8fd1 c562 4ab3 a934 26e5 6c56 e387 fe18 71c9 3b7a b4ab 572c d8fd 924e 1d9f
      17h     4791 89e2 ce73 5264 15f5 db86 9c17 a3b8 e429 2a5a 6dcb f1dc b64d 783e 3faf
      18h     5781 a9c2 fe43 92a4 c525 3b66 6ce7 e3f8 b479 4a3a 1dbb 715c 26dd d89e 8f1f
      19h     bf41 d582 6ac3 2954 9615 fcd6 4397 3ea8 81e9 eb2a 546b 17fc a8bd c27e 7d3f
      1ah     9391 e1e2 7273 6464 f7f5 8586 1617 b8b8 2b29 595a cacb dcdc 4f4d 3d3e aeaf
      1bh     cce1 4472 8893 fdb4 3155 b9c6 7527 56d8 9a39 12aa de4b ab6c 678d ef1e 23ff
      1ch     a761 f9b2 5ed3 e214 4575 1ba6 bcc7 7328 d449 8a9a 2dfb 913c 365d 688e cfef
      1dh     ff61 55b2 aad3 7914 8675 2ca6 d3c7 9e28 6149 cb9a 34fb e73c 185d b28e 4def
      1eh     5451 a8a2 fcf3 9694 c2c5 3e36 6a67 ebe8 bfb9 434a 171b 7d7c 292d d5de 818f
      1fh     6fc1 b542 da83 19f4 7635 acb6 c377 2e58 4199 9b1a f4db 37ac 586d 82ee ed2f
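The table lookup described on the Manual Diagnosis slides can be sketched in a few lines. Only three rows of the table (06h, 10h and 1ah) are copied in to keep the example short; a real lookup would need all 32. The symbol-to-DIMM split follows the slide (00h-0fh low 64 bits on DIMM0, 10h-1fh high 64 bits on DIMM1). The data-bit calculation is an assumption: it treats each symbol as four consecutive data bits and the column as the error mask within the symbol, which agrees with the decoder output later in this deck (syndrome 5E34 reported as data bit 66). All names are illustrative.

```python
# Three rows copied from the 0Fh chipkill syndrome tables (columns 1h..fh).
SYNDROME_ROWS = {
    0x06: "4951 8ea2 c7f3 5394 1ac5 dd36 9467 a1e8 e8b9 2f4a 661b f27c bb2d 7cde 358f",
    0x10: "45d1 8a62 cfb3 5e34 1be5 d456 9187 a718 e2c9 2d7a 68ab f92c bcfd 734e 369f",
    0x1A: "9391 e1e2 7273 6464 f7f5 8586 1617 b8b8 2b29 595a cacb dcdc 4f4d 3d3e aeaf",
}


def lookup_syndrome(syndrome):
    """Return symbol, column, DIMM half and (if single-bit) data bit, or None."""
    for symbol, row in SYNDROME_ROWS.items():
        values = [int(value, 16) for value in row.split()]
        if syndrome not in values:
            continue
        column = values.index(syndrome) + 1  # table columns run 1h..fh
        half = "DIMM0 (low 64 bits)" if symbol <= 0x0F else "DIMM1 (high 64 bits)"
        data_bit = None
        if column & (column - 1) == 0:  # single bit set in the error mask
            # Assumption: symbol N covers data bits 4N..4N+3.
            data_bit = symbol * 4 + column.bit_length() - 1
        return {"symbol": symbol, "column": column, "half": half, "data_bit": data_bit}
    return None


if __name__ == "__main__":
    result = lookup_syndrome(0xE1E2)
    print(hex(result["symbol"]), hex(result["column"]), result["half"])
```

For the MCAT example, 0xE1E2 lands on symbol 1ah, which is in the 10h-1fh range and so points at the DIMM supplying the upper 64 bits, the same conclusion the slide reaches by hand.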
  34. Manual Diagnosis
      ECC Syndrome Table (for completion) for 0Fh CPUs (Single Error Correction, Double Error Detection):

                    n=0  n=1  n=2  n=3  n=4  n=5  n=6  n=7
      Bit (0+n)     ce   cb   d3   d5   d6   d9   da   dc
      Bit (8+n)     23   25   26   29   2a   2c   31   34
      Bit (16+n)    0e   0b   13   15   16   19   1a   1c
      Bit (24+n)    e3   e5   e6   e9   ea   ec   f1   f4
      Bit (32+n)    4f   4a   52   54   57   58   5b   5d
      Bit (40+n)    a2   a4   a7   a8   ab   ad   b0   b5
      Bit (48+n)    8f   8a   92   94   97   98   9b   9d
      Bit (56+n)    62   64   67   68   6b   6d   70   75
      Bit (64+n)    01   02   04   08   10   20   40   80

      *Typically used for single DIMM configurations
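The same kind of lookup works for the 8-bit ECC syndromes in this table. A minimal sketch, with the rows copied from the slide and the bit number computed as the row base plus n; the names are illustrative and unknown syndromes return None rather than a guess.

```python
# Rows copied from the ECC syndrome table; key is the row's base bit number.
ECC_ROWS = {
    0: "ce cb d3 d5 d6 d9 da dc",
    8: "23 25 26 29 2a 2c 31 34",
    16: "0e 0b 13 15 16 19 1a 1c",
    24: "e3 e5 e6 e9 ea ec f1 f4",
    32: "4f 4a 52 54 57 58 5b 5d",
    40: "a2 a4 a7 a8 ab ad b0 b5",
    48: "8f 8a 92 94 97 98 9b 9d",
    56: "62 64 67 68 6b 6d 70 75",
    64: "01 02 04 08 10 20 40 80",
}

# Flatten into syndrome -> bit number (base + n).
SYNDROME_TO_BIT = {
    int(value, 16): base + n
    for base, row in ECC_ROWS.items()
    for n, value in enumerate(row.split())
}


def ecc_bit(syndrome):
    """Return the bit number for a single-bit ECC syndrome, or None."""
    return SYNDROME_TO_BIT.get(syndrome)


if __name__ == "__main__":
    for syndrome in (0xCE, 0x34, 0x80):
        print(hex(syndrome), ecc_bit(syndrome))
```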
  35. Other Tools/Diags (Un-supported)
      • Bonnie
        > Benchmark to measure performance of filesystem
      • Memtest86+
        > Standalone bootable diagnostic
      • Original version
      • Other memory tool
      • Netperf or ttcp - google for them - network tools
  36. SGR
      • Situation appraisal – Recognise a problem
      • Problem Analysis - Find True Cause
        http://systems-tsc/twiki/pub/SGR/SgrtOnlineHelp/PA-guide.pdf
      The Steps in FTC are:
        * Define a Problem Statement
        * Describe the problem with a Problem Specification
        * Develop Possible Causes from either Experience or Differences and Changes
        * Identify the Most Probable Cause
        * Test the Most Probable Cause against the Problem Specification
        * Verify the Most Probable Cause
  37. Newisys MCE Decoder v20/40z
      What to gather from inventory get all -v:
        1. How many CPUs?
        2. How many DIMMs per CPU?
        3. What is the part number of the DIMM?
      NOTE: This is for V20/40z ONLY and only works on Northbridge Errors
  38. Details from CPU0 explained
      • Here you see 4 identical DIMMs on CPU0.
      • The DIMM Manufacture part # is: 36VDDF25672G-40BD2

      Name          Type    OEM Manufacture   Date        Hardware Revision  Part #
      CPU 0 DIMM 0  memory  2cffffffffffffff  2005-04-16  0200               36VDDF25672G-40BD2
      CPU 0 DIMM 1  memory  2cffffffffffffff  2005-04-16  0200               36VDDF25672G-40BD2
      CPU 0 DIMM 2  memory  2cffffffffffffff  2005-03-19  0200               36VDDF25672G-40BD2
      CPU 0 DIMM 3  memory  2cffffffffffffff  2005-03-19  0200               36VDDF25672G-40BD2
      DDR 0 VRM     memvrm  S-SCI448          2005-05-27  A01                S01479
      CPU 0 VRM     vrm     NA
  39. Determine Type & Rank of Dimm
      ** DIMMs can be single rank or dual rank. For a description of the differences see: or
      Browse to the Qualified Memory page:
      Compare your DIMM Manufacture part number to the list: 36VDDF25672G-40BD2
      This equates to a 2GB Micron Dual Rank DIMM:
        Micron:
          512MB: MT18VDDF6472G-40BG3   Die: G  Single Rank  SPD 1.0
          1GB:   MT18VDDF12872G-40BD3  Die: D  Single Rank  SPD 1.0
          2GB:   MT36VDDF25672G-40BD2  Die: D  Dual Rank    SPD 1.0
      Now we are ready to populate the Memory Decode Tool
  40. Warning! Decode Tool is Sun Internal
      This link cannot be shared with customers. It is internal for Sun use only. The link has the account and password in it.
  41. Populate the Decode Tool
  42. Information for Decode Tool
      • Enter the CPU that has the machine check (from the error): 0, 1, 2, or 3
      • Enter the platform type: 2100 = V20z, 4300 = V40z
      • Enter the machine check status (from the error)
      • Enter the machine check address (from the error)
      • Specify which CPUs have DIMMs (from inventory get all -v)
      • Specify which DIMMs are populated on each CPU (from inventory get all -v)
      • Specify the DIMM type (rank from the Qualified Memory page)
      • BIOS defaults: leave this at the default (place a √ in DIMM interleaving, 128 bit DIMM interface, and Chipkill ECC enabled; no √ in Node interleaving)
  43. Result Output
      Only one error is present
      Error details:
        K8_CPU-0 is reporting this corrected error:
        DRAM chipkill ECC error found by scrubber
        The DRAM error was at address '00000000 9C6A0B30' (2 GB range)
        This error is related to DIMM 1 on K8_CPU-0
        The ECC syndrome ('5E34'x) maps to a correctable error at data bit 66
        Within the DIMM, this would be an error at physical bit 2
        Processor was responding to another source of the transaction
        Transaction was a read
      Error classification:
        Error type: DRAM ECC
        Error severity: Corrected
        Error enabled: yes
        Error recovered: yes
        Possible sympathy: no
        Error address: '000000009C6A0B30'x
        Address type: Physical
  44. Anything Else........
      • Newisys Machine Check (northbridge only)
      • V20/40z only
      • http://systems-tsc/twiki/pub/Products/ProdTroubleshootingV20z/V20z-V40z-Memory-DIMM
      • Windows Debugging
        >
      • MCAT
      • en/assets/content_type/utilities/mcatsetup.exe
      Machine Check Analysis Tool (MCAT) is a command line utility that takes a Windows System Event Log (.evt) file as an argument and decodes the MCA error logs into human readable format. MCAT can alternatively take MCE error information as raw register hexadecimal values on the command line.
  45. Diagnostics Complete!