Hardware Management Module

  • Go to slide 21 to show architectural differences between single and dual core
  • Direct Connect Architecture is AMD's new umbrella term for all the technologies that differentiate AMD from Intel.
  • AMD64 processors were designed from the ground up to support multi-core, giving a seamless migration from single-core to multi-core with the same infrastructure and the same power envelope. Think about it: higher performance, same power envelope, same infrastructure. Think about all the things you don’t have to change: motherboard, power supply, cooling solutions, heat sinks. All this requires is a processor swap and a BIOS update. This is an easy migration path for customers to increase their computing power in a cost-effective, non-disruptive manner. Launched 21 April 2005. Dual-core Opteron leverages AMD Direct Connect Architecture to connect two CPUs on one die along with the memory and I/O, which improves overall system performance and efficiency. It is socket compatible with existing AMD Opteron 940-pin sockets that support 90nm (95W/80A). Only Rev E based systems are dual-core upgradable with a BIOS update. Compatible with x86 and AMD64 applications.
  • Main differences between Opteron steppings
  • A stick of memory that contains registers holds data for one full clock cycle before passing it on, which generally incurs a small performance hit. Registered memory is all about scalability and stability.
  • This shows optimum memory placement in V20z and V40z for full capacity usage of 128 bit memory bus.

    1. 1. X64 Workshop Hardware Management <ul><li>Andy Harness </li></ul><ul><ul><li>Systems-TSC Technical Community </li></ul></ul>
    2. 2. Managing X64 Hardware <ul><li>AMD CPU Types </li></ul><ul><li>System Upgrades - CPUs </li></ul><ul><li>System Upgrades - Memory </li></ul><ul><li>Identification of M2/Non-M2 Systems </li></ul><ul><li>Updating ILOM Firmware </li></ul><ul><li>System Board Replacement </li></ul><ul><ul><li>Updating FRUID data </li></ul></ul><ul><li>Current Issues </li></ul>
    3. 3. Single-core Opteron <ul><li>AMD64 </li></ul><ul><ul><li>Direct Connect Architecture </li></ul></ul><ul><ul><li>Integrated DDR DRAM Memory Controller </li></ul></ul><ul><ul><li>HyperTransport Interconnect Technology </li></ul></ul>CPU0 1MB L2 Cache CPU0 1MB L2 Cache
    4. 4. AMD Direct Connect Architecture <ul><li>The AMD Direct-Connect Architecture eliminates traditional system bottlenecks created by Front Side Bus (FSB) architecture </li></ul><ul><li>Best approach of directly interconnecting CPU, memory, and I/O resources : </li></ul><ul><ul><ul><li>Direct connection of CPUs to each other using Coherent HyperTransport links </li></ul></ul></ul><ul><ul><ul><li>Direct connection of CPUs to I/O resources using HyperTransport links </li></ul></ul></ul><ul><ul><ul><li>Direct connection of CPUs to memory using integrated DDR memory controller </li></ul></ul></ul><ul><ul><ul><li>Direct connection between CPUs on the same die (Dual-core) </li></ul></ul></ul>HyperTransport HyperTransport I/O
    5. 5. Dual-core Opteron <ul><li>AMD64 was designed as CMP (Chip-level Multi-Processing) from the start with Crossbar Switch and System Request Queue (CPU1 uses 2nd port on SRQ) </li></ul><ul><li>Each core has dedicated 1MB L2 Cache </li></ul><ul><li>Both cores share the memory controller and HyperTransport™ interconnects </li></ul><ul><li>Performance characterization of single-core based systems has revealed that the Memory and HyperTransport bandwidths are under-utilized even while running high-end server workloads </li></ul>CPU0 1MB L2 Cache
    6. 6. AMD CPUs - Next Generation Opteron (Rev F) <ul><li>Continuity </li></ul><ul><ul><ul><li>Same 32/64-bit execution core </li></ul></ul></ul><ul><ul><ul><li>Same Power envelope </li></ul></ul></ul><ul><ul><ul><li>Same AMD Direct Connect Architecture w/ up to 3 HT links per CPU </li></ul></ul></ul><ul><ul><ul><li>Same 1 MB L2 cache per core </li></ul></ul></ul><ul><li>New Features </li></ul><ul><ul><ul><li>Second generation Opteron design </li></ul></ul></ul><ul><ul><ul><li>Dual-Core only </li></ul></ul></ul><ul><ul><ul><li>Seamless Dual-Core to Quad-Core upgradeability in same thermal envelope and socket </li></ul></ul></ul><ul><ul><ul><li>AMD Virtualization (AMD-V) hardware assisted support </li></ul></ul></ul><ul><ul><ul><li>Sockets F (LGA-1207) or AM2 (PGA-940) </li></ul></ul></ul>
    7. 7. AMD CPUs - Next Generation Opteron (Rev F) <ul><li>Steppings F2 & F3 </li></ul><ul><ul><ul><li>Rev F Opteron CPUs shipping in CY06Q4 are stepping F2 </li></ul></ul></ul><ul><ul><ul><li>AMD were due to start production of F3 processors in November 2006 (Series 2000 & 8000) and December 2006 (Series 1000) </li></ul></ul></ul><ul><ul><ul><li>Stepping F3 has several Errata fixes </li></ul></ul></ul><ul><ul><ul><ul><li>#133: Internal Termination Missing on Some Test Pins </li></ul></ul></ul></ul><ul><ul><ul><ul><li>#153: Potential System Hang in Multiprocessor systems with ≥ 14 Cores </li></ul></ul></ul></ul><ul><ul><ul><ul><li>#157: SMIs that are not Intercepted May Cause Unpredictable System Behaviour </li></ul></ul></ul></ul><ul><ul><ul><li>Stepping F3 allows higher speed bins than F2 </li></ul></ul></ul><ul><ul><ul><li>AMD has released BIOS code that supports mixing of F2 & F3 CPUs in a single system. It is up to each Product Team to decide testing/qualification of such configurations </li></ul></ul></ul>
    8. 8. AMD CPUs – Opteron Rev F Model Naming: 4-digit nomenclature of the form [1/2/8][2][XX]. 1st digit = scalability, max number of processors supported (1, 2, or 8); 2nd digit = socket generation (Rev F = 2); 3rd & 4th digits = relative performance compared to other processors in the series
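The four-digit naming scheme on this slide can be decoded mechanically. The following is a minimal sketch, assuming the model number arrives as a plain four-character string; the `OpteronModel` type and `parse_opteron_model` function are hypothetical names introduced for illustration, not part of any AMD or Sun tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpteronModel:
    max_processors: int        # 1st digit: scalability (1, 2, or 8)
    socket_generation: int     # 2nd digit: 2 means Rev F
    relative_performance: int  # 3rd and 4th digits


def parse_opteron_model(model: str) -> OpteronModel:
    """Split a 4-digit Opteron model number into the fields named on the slide."""
    if len(model) != 4 or not model.isdigit():
        raise ValueError(f"expected a 4-digit model number, got {model!r}")
    max_processors = int(model[0])
    if max_processors not in (1, 2, 8):
        raise ValueError(f"unexpected scalability digit in {model!r}")
    return OpteronModel(max_processors, int(model[1]), int(model[2:]))


if __name__ == "__main__":
    print(parse_opteron_model("2218"))
```

The relative-performance field only ranks processors within one series, so comparing it across series (for example a 12XX part against an 82XX part) is not meaningful.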
    9. 9. AMD Opteron Rev. E and Rev. F
    10. 10. Opteron Steppings & Sockets <ul><li>Rev F ≠ Socket F </li></ul><ul><ul><li>Rev F is the Processor Stepping </li></ul></ul><ul><ul><li>Socket F (1207) is the physical socket </li></ul></ul><ul><ul><li>Rev F Opteron uses different sockets </li></ul></ul><ul><ul><li>Opteron Series 2000 & 8000 use socket F (1207) </li></ul></ul><ul><ul><li>Opteron Series 1000 use Socket AM2 </li></ul></ul><ul><ul><li>Socket F will be used by Opteron G (Quad Core) </li></ul></ul>
    11. 11. AMD Opteron Rev F Model Naming <ul><li>1200 Series </li></ul><ul><ul><li>100 Series replacement </li></ul></ul><ul><ul><li>Single socket only </li></ul></ul><ul><ul><li>1 HT link, Not Cache Coherent </li></ul></ul><ul><ul><li>Socket AM2 </li></ul></ul><ul><li>2200 Series </li></ul><ul><ul><li>200 Series replacement </li></ul></ul><ul><ul><li>Up to 2 sockets </li></ul></ul><ul><ul><li>3 HT links, 1 is Cache Coherent </li></ul></ul><ul><ul><li>Socket 1207 </li></ul></ul><ul><li>8200 Series </li></ul><ul><ul><li>800 Series replacement </li></ul></ul><ul><ul><li>Up to 8 sockets </li></ul></ul><ul><ul><li>3 HT links, ALL are Cache Coherent </li></ul></ul><ul><ul><li>Socket 1207 </li></ul></ul>
    12. 12. System Upgrades <ul><li>CPU Upgrade Options </li></ul><ul><li>Memory Upgrade Options </li></ul>
    13. 13. System Upgrades <ul><li>Can I Upgrade from a Rev E CPU to a Rev F CPU? </li></ul><ul><li>Generally NO because... </li></ul><ul><ul><li>CPU sockets are different </li></ul></ul><ul><ul><li>Memory Type is different </li></ul></ul><ul><li>What about X4600? </li></ul><ul><ul><li>ALL CPU/Mem boards including DIMMs need to be replaced </li></ul></ul><ul><ul><li>The BIOS/ILOM firmware must be updated to M2 version during the upgrade procedure </li></ul></ul>
    14. 14. System Upgrades - CPUs <ul><li>CPU General Rules </li></ul><ul><ul><li>All CPUs must be the same model </li></ul></ul><ul><ul><li>All CPUs must be the same clock speed </li></ul></ul><ul><ul><li>No Mixing of Single/Dual Core </li></ul></ul><ul><ul><li>All CPUs should be the same Stepping </li></ul></ul><ul><ul><li>May require specific firmware levels </li></ul></ul>
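The general CPU rules above lend themselves to a simple pre-upgrade check. This is a sketch only, assuming the CPU inventory has already been collected by hand or from the operating system; the `Cpu` record and `check_cpu_mix` function are hypothetical. Because the slide says steppings "should" match rather than "must", a stepping mismatch is reported as a warning and not as an error.

```python
from typing import List, NamedTuple


class Cpu(NamedTuple):
    model: str      # e.g. "2218"
    clock_mhz: int
    cores: int      # 1 = single core, 2 = dual core
    stepping: str   # e.g. "F2"


def check_cpu_mix(cpus: List[Cpu]) -> List[str]:
    """Return a list of rule violations for the CPUs in one system."""
    problems = []
    if len({c.model for c in cpus}) > 1:
        problems.append("error: CPUs are not all the same model")
    if len({c.clock_mhz for c in cpus}) > 1:
        problems.append("error: CPUs are not all the same clock speed")
    if len({c.cores for c in cpus}) > 1:
        problems.append("error: single-core and dual-core CPUs are mixed")
    if len({c.stepping for c in cpus}) > 1:
        problems.append("warning: CPUs are not all the same stepping")
    return problems


if __name__ == "__main__":
    system = [Cpu("2218", 2600, 2, "F2"), Cpu("2218", 2600, 2, "F3")]
    for line in check_cpu_mix(system):
        print(line)
```

The last rule on the slide (specific firmware levels) depends on the platform compatibility tables and is deliberately left out of this sketch.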
    15. 15. Opteron Steppings More info: http://www.amd.com/processorquickrefguide
    16. 16. System Upgrades - CPUs Sun Fire V20z Compatibility Between Components and Software for Each Server Version
    17. 17. System Upgrades - CPUs Sun Fire V20z CPU Options and Part Numbers
    18. 18. System Upgrades - CPUs Sun Fire V40z Compatibility Between Components and Software for Each Server Version
    19. 19. System Upgrades - CPUs Sun Fire V40z CPU Options and Part Numbers
    20. 20. System Upgrades - Memory <ul><li>General Memory Configuration Rules </li></ul><ul><ul><li>Memory Types </li></ul></ul><ul><ul><ul><li>Registered (Buffered) vs Unbuffered (Unregistered) DIMMs </li></ul></ul></ul><ul><ul><ul><li>DDR2 vs DDR1 </li></ul></ul></ul><ul><ul><li>Memory slots </li></ul></ul><ul><ul><li>Memory placement </li></ul></ul>
    21. 21. Buffered vs. Unbuffered DIMMs <ul><ul><li>Registered DIMMs have additional registers placed between the CPUs and DIMMs </li></ul></ul><ul><ul><ul><li>Improves signal integrity, allowing longer traces </li></ul></ul></ul><ul><ul><ul><li>and larger, more reliable memory subsystems </li></ul></ul></ul><ul><ul><ul><li>Up to 4GB per DIMM slot, 8 DIMM slots per CPU (4 typical) </li></ul></ul></ul><ul><ul><li>Unbuffered DIMMs do not have these registers </li></ul></ul><ul><ul><ul><li>Lower cost and slightly higher performance </li></ul></ul></ul><ul><ul><ul><li>Up to 4 DIMM slots per CPU </li></ul></ul></ul><ul><ul><ul><li>Sun Fire X2100 w/AMD Opteron 100 Series CPU </li></ul></ul></ul>
    22. 22. Memory Speed vs. DIMM Slots/CPU <ul><li>4 DIMM slots per Opteron processor </li></ul><ul><ul><li>Optimized for memory performance </li></ul></ul><ul><ul><li>Enables the use of higher performing DIMMs </li></ul></ul><ul><ul><li>4 DIMMs @ DDR400 outperform 8 DIMMs @ DDR266 </li></ul></ul><ul><li>8 DIMM slots per Opteron processor </li></ul><ul><ul><li>Trades memory performance for capacity </li></ul></ul><ul><ul><ul><li>DDR400 DIMMs can only be used to populate 4 DIMM slots, leaving the remaining 4 unused </li></ul></ul></ul><ul><ul><ul><li>DDR333 DIMMs can only be used to populate 6 DIMM slots, leaving the remaining 2 unused </li></ul></ul></ul><ul><ul><ul><li>DDR266 DIMMs can be used to populate 8 DIMM slots – but with a very noticeable performance impact </li></ul></ul></ul><ul><ul><li>8 DIMM slots per socket require physically larger systems (X2200-M2) </li></ul></ul>
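The slot limits quoted for an 8-slot-per-CPU design amount to a small lookup table. A minimal sketch, using only the three figures from the slide; the function name `fastest_speed_for` is hypothetical.

```python
# Maximum number of DIMM slots per CPU that can be populated at each speed,
# as stated on the slide for an 8-slot design.
MAX_SLOTS_BY_SPEED = {"DDR400": 4, "DDR333": 6, "DDR266": 8}

# Speeds ordered from fastest to slowest.
SPEEDS_FASTEST_FIRST = ("DDR400", "DDR333", "DDR266")


def fastest_speed_for(dimm_count: int) -> str:
    """Return the fastest DDR speed that still allows dimm_count DIMMs per CPU."""
    if dimm_count < 1:
        raise ValueError("at least one DIMM is required")
    for speed in SPEEDS_FASTEST_FIRST:
        if dimm_count <= MAX_SLOTS_BY_SPEED[speed]:
            return speed
    raise ValueError(f"{dimm_count} DIMMs exceed the 8 slots per CPU")


if __name__ == "__main__":
    for count in (4, 6, 8):
        print(count, "DIMMs per CPU ->", fastest_speed_for(count))
```

This makes the capacity trade-off explicit: going from 4 to 8 populated slots per CPU forces a drop from DDR400 to DDR266.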
    23. 23. Memory Placement <ul><li>The AMD Opteron processor’s memory controller works in 64-bit (single channel) or 128-bit (dual channel) mode ECC operation. </li></ul><ul><li>For best memory performance, AMD recommends running in 128-bit mode ECC operation. </li></ul><ul><li>To enable 128-bit mode, DIMMs should be populated in 2 identical pairs such that they each occupy one-half of the AMD Opteron processor’s 128-bit memory controller interface. </li></ul><ul><li>This is a logical View at bus level which does not represent physical location of DIMMs on Motherboard. </li></ul>64-bit Pair 2 64-bit Pair 1
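The pairing rule for 128-bit mode can be expressed as a quick check. This sketch assumes each 64-bit half of a pair is described by a single string (for example size and speed) and that "identical" means those descriptions are equal; the function `can_run_128_bit` and that representation are illustrative assumptions, not how the BIOS actually decides.

```python
from typing import List, Optional, Tuple

# One pair = the two DIMMs that together fill the 128-bit interface.
# None marks an empty slot.
DimmPair = Tuple[Optional[str], Optional[str]]


def can_run_128_bit(pairs: List[DimmPair]) -> bool:
    """True if every populated pair holds two identical DIMMs."""
    populated = [p for p in pairs if p != (None, None)]
    if not populated:
        return False  # no memory installed at all
    return all(a is not None and a == b for a, b in populated)


if __name__ == "__main__":
    print(can_run_128_bit([("1GB DDR400", "1GB DDR400"), (None, None)]))
    print(can_run_128_bit([("1GB DDR400", None)]))
```

As the slide notes, this is a logical view at bus level; mapping logical pairs to physical slots depends on the motherboard.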
    24. 24. DDR2 vs DDR memory <ul><li>240 pins vs 184 pins </li></ul><ul><li>Keyway in different position </li></ul><ul><li>Label description, e.g. PC2-5300 vs PC3200 </li></ul>
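The label convention mentioned above (PC2-5300 for DDR2, PC3200 for DDR) is regular enough to classify a module from its label. A minimal sketch; `classify_module_label` is a hypothetical helper, and the number in the label is taken to be the module's peak bandwidth in MB/s, which is the usual meaning of these labels.

```python
import re
from typing import Tuple

LABEL_PATTERN = re.compile(r"PC(2-)?(\d+)")


def classify_module_label(label: str) -> Tuple[str, int, int]:
    """Return (generation, pin_count, peak_bandwidth_mb_s) for a DIMM label."""
    match = LABEL_PATTERN.fullmatch(label.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised module label: {label!r}")
    is_ddr2 = match.group(1) is not None
    generation = "DDR2" if is_ddr2 else "DDR"
    pin_count = 240 if is_ddr2 else 184  # pin counts from the slide
    return generation, pin_count, int(match.group(2))


if __name__ == "__main__":
    print(classify_module_label("PC2-5300"))
    print(classify_module_label("PC3200"))
```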
    25. 25. Identification of M2/Non-M2 systems <ul><li>Using the Service Processor </li></ul><ul><li>From the Operating System </li></ul><ul><li>Physical Identification </li></ul>
    26. 26. Identification of M2/Non-M2 systems <ul><li>Common Differences </li></ul><ul><ul><li>1/2/8000 Series CPUs v 1/2/800 series CPUs </li></ul></ul><ul><ul><li>Heatsinks </li></ul></ul><ul><ul><li>DDR2 v DDR1 memory </li></ul></ul><ul><ul><li>PCIexpress v PCI-X </li></ul></ul><ul><li>Reference Documentation </li></ul><ul><ul><li>http://www.sun.com/blueprints/1106/820-0373.pdf </li></ul></ul>
    27. 27. Identification of M2/Non-M2 systems <ul><li>Product Name </li></ul><ul><ul><li>Shown in BIOS Boot Screen </li></ul></ul><ul><ul><li>BIOS Setup Utility </li></ul></ul><ul><ul><li>ILOM CLI under /SYS/MB </li></ul></ul><ul><ul><li>IPMItool fru output </li></ul></ul><ul><li>CPU Information </li></ul><ul><ul><li>Solaris - psrinfo -pv </li></ul></ul><ul><ul><li>Linux – cat /proc/cpuinfo </li></ul></ul><ul><ul><li>Windows – Look in Device Manager for Processors </li></ul></ul>
    28. 28. Identification of M2/Non-M2 systems
    29. 29. Identification of M2/Non-M2 systems
    30. 30. Identification of M2/Non-M2 systems
    31. 31. Identification of M2/Non-M2 systems <ul><li>-> show /SYS/MB </li></ul><ul><li>/SYS/MB </li></ul><ul><ul><li>Targets: </li></ul></ul><ul><ul><li>BAT </li></ul></ul><ul><ul><li>NET0 </li></ul></ul><ul><ul><li>NET1 </li></ul></ul><ul><ul><li>P0 </li></ul></ul><ul><ul><li>P1 </li></ul></ul><ul><li>Properties: </li></ul><ul><ul><li>SEEPROM = </li></ul></ul><ul><ul><li>Product Information: </li></ul></ul><ul><ul><li>manufacturer name = SUN MICROSYSTEMS </li></ul></ul><ul><ul><li>product name = Sun Fire X4100 M2 </li></ul></ul><ul><ul><li>version = (no information) </li></ul></ul><ul><ul><li>serial number = 0640BD0152 </li></ul></ul><ul><ul><li>part number = 602-3482-01 </li></ul></ul><ul><ul><li>T_AMB = 21.000000 degrees C </li></ul></ul><ul><ul><li>V0_VDD = No reading available </li></ul></ul><ul><ul><li>V0_VDDIO = No reading available </li></ul></ul><ul><ul><li>V0_VTT = No reading available </li></ul></ul><ul><ul><li>V1_VDD = No reading available </li></ul></ul><ul><ul><li>V1_VDDIO = No reading available </li></ul></ul><ul><ul><li>V1_VTT = No reading available </li></ul></ul><ul><ul><li>V_+12V = No reading available </li></ul></ul><ul><ul><li>V_+1V2 = No reading available </li></ul></ul><ul><ul><li>V_+1V5 = No reading available </li></ul></ul><ul><ul><li>V_+2V5 = No reading available </li></ul></ul><ul><ul><li>V_+3V3MAIN = No reading available </li></ul></ul><ul><ul><li>V_+3V3STBY = 3.252400 Volts </li></ul></ul><ul><ul><li>V_+5V = No reading available </li></ul></ul><ul><ul><li>V_-12V = No reading available </li></ul></ul><ul><li>Commands: </li></ul><ul><li>cd </li></ul><ul><li>show </li></ul><ul><li>-> </li></ul>
    32. 32. Identification of M2/Non-M2 systems <ul><li>FRU Device Description : mb.fru (ID 2) </li></ul><ul><li>Chassis Type : Rack Mount Chassis </li></ul><ul><li>Chassis Part Number : 000-0000-00 </li></ul><ul><li>Chassis Serial : 0226-0638LHF013N </li></ul><ul><li>Board Product : ASSY,MOTHERBOARD,X4600,REV F </li></ul><ul><li>Board Serial : 1762TH1-0625001244 </li></ul><ul><li>Board Part Number : 501-7638-01 </li></ul><ul><li>Board Extra : 01 </li></ul><ul><li>Board Extra : G4F_MB </li></ul><ul><li>Product Manufacturer : SUN MICROSYSTEMS </li></ul><ul><li>Product Name : SUN FIRE X4600 </li></ul><ul><li>Product Part Number : 602-3472-01 </li></ul><ul><li>Product Serial : 0640AM0978 </li></ul><ul><li>Output from ipmitool fru print command </li></ul>
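The `ipmitool fru print` output shown above is a list of "key : value" lines, so it can be turned into a dictionary for scripted checks. This is a sketch under the assumption that the output has been captured as text; `parse_fru` and `has_m2_suffix` are hypothetical helper names. Keys such as "Board Extra" repeat, so each key maps to a list of values.

```python
from typing import Dict, List


def parse_fru(text: str) -> Dict[str, List[str]]:
    """Parse 'key : value' lines into a dict of key -> list of values."""
    fields: Dict[str, List[str]] = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator:  # skip lines without a colon
            fields.setdefault(key.strip(), []).append(value.strip())
    return fields


def has_m2_suffix(product_name: str) -> bool:
    """True if the product name contains a separate 'M2' token."""
    return "M2" in product_name.upper().split()


if __name__ == "__main__":
    sample = (
        "Board Product : ASSY,MOTHERBOARD,X4600,REV F\n"
        "Board Extra : 01\n"
        "Board Extra : G4F_MB\n"
        "Product Name : SUN FIRE X4600\n"
    )
    fru = parse_fru(sample)
    print(fru["Product Name"][0], has_m2_suffix(fru["Product Name"][0]))
```

In the slide's sample the product name carries no M2 suffix even though the board description mentions REV F, so a product-name check like this is only a first hint and should be combined with the other identification methods.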
    33. 33. Identification of M2/Non-M2 systems X2100 M2 X2100
    34. 34. Identification of M2/Non-M2 systems X4100 M2 X4100
    35. 35. Identification of M2/Non-M2 systems X4200 M2 X4200
    36. 36. Identification of M2/Non-M2 systems Ultra20 M2 Ultra20
    37. 37. Identification of M2/Non-M2 systems Ultra40 M2 Ultra40
    38. 38. Updating ILOM Firmware <ul><li>The upgrade of the ILOM firmware on the Galaxy range of systems should be carried out in a specific manner to avoid loss of system FRUID data and to ensure the safe operation of the system </li></ul><ul><li>Upgrading in steps is still required, contrary to some messages in circulation </li></ul><ul><li>SW1.0 ==>SW1.1==>SW1.2 or SW1.2.1-->SW1.3 </li></ul><ul><ul><li>==> requires a pre-flash followed by a flash upgrade, in that order </li></ul></ul><ul><ul><li>--> no pre-flash needed. </li></ul></ul>
    39. 39. Updating ILOM Firmware The G12 (not G12F) step by step upgrade process, starting from a system with SW1.0a (ILOM build 6464) would look like below. Step #1 is only for early access systems with ILOM build 6169. Your entry point in this process will depend on what firmware level the system is running. For example, with a system with ILOM build 9306, start from step #4. 1. SW1.0a - ILOM build 6464 URL: http://www.sun.com/download/products.xml?id=436bd009 2. Pre-flash script: ilom.X4100-preflash_1.2.sh 3. SW1.1 - ILOM build 9306 URL: http://www.sun.com/download/products.xml?id=442f01f5 4. Pre-flash script ilom.X4100-preflash_1.2.sh 5. SW 1.2 - ILOM 1.0.5 build 12029c URL: http://www.sun.com/download/products.xml?id=44cfd445 6. SW 1.3 - ILOM 1.1.1 (or 1.1.1.1) build 15632 URL: http://www.sun.com/download/products.xml?id=45b94409
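The stepped upgrade path (SW1.0, SW1.1, SW1.2 or SW1.2.1, then SW1.3) can be written down as a small planner that lists the remaining steps from a given starting release. A minimal sketch; `upgrade_steps` is a hypothetical function and the step strings are illustrative, with the pre-flash requirement taken from the arrows on the previous slide.

```python
from typing import List

RELEASES = ["1.0", "1.1", "1.2", "1.3"]

# Whether moving *to* this release requires the pre-flash script first.
NEEDS_PREFLASH = {"1.1": True, "1.2": True, "1.3": False}


def upgrade_steps(current: str) -> List[str]:
    """List the steps needed to bring a system from `current` up to SW1.3."""
    # SW1.2.1 sits at the same point in the chain as SW1.2.
    start = "1.2" if current == "1.2.1" else current
    if start not in RELEASES:
        raise ValueError(f"unknown software release: {current!r}")
    steps = []
    for target in RELEASES[RELEASES.index(start) + 1:]:
        if NEEDS_PREFLASH[target]:
            steps.append(f"run pre-flash script before SW{target}")
        steps.append(f"flash SW{target}")
    return steps


if __name__ == "__main__":
    for step in upgrade_steps("1.0"):
        print(step)
```

Starting from SW1.1 this yields a pre-flash, SW1.2 and then SW1.3, which matches the slide's example of entering the process at step #4 for ILOM build 9306.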
    40. 40. Updating FRUID data <ul><li>After a service intervention or firmware update, the FRUID information stored in the SEEPROM for one or more parts may no longer be correct and may need to be re-entered manually </li></ul><ul><li>On Galaxy type systems, this can be accomplished using the service processor utility “servicetool” </li></ul>
    41. 41. Updating FRUID data <ul><li>Example of Good Data </li></ul><ul><li>-> show /SYS/MB/SEEPROM </li></ul><ul><li>Properties: </li></ul><ul><li>SEEPROM = </li></ul><ul><li>Product Information: </li></ul><ul><li>manufacturer name = SUN MICROSYSTEMS </li></ul><ul><li>product name = SUN FIRE X4200 </li></ul><ul><li>version = (no information) </li></ul><ul><li>serial number = 0550AN026D </li></ul><ul><li>part number = 602-3103-01 </li></ul><ul><li>Example of Bad Data </li></ul><ul><li>-> show /SYS/MB/SEEPROM </li></ul><ul><li>Properties: </li></ul><ul><li>SEEPROM = </li></ul><ul><li>Product Information: </li></ul><ul><li>manufacturer name = SUN MICROSYSTEMS </li></ul><ul><li>product name = SUN FIRE X4100 </li></ul><ul><li>version = (no information) </li></ul><ul><li>serial number = 0000000000 </li></ul><ul><li>part number = 602-0000-00 </li></ul>
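The difference between the good and bad examples above is that the bad record holds zero-filled placeholders. A minimal sketch of a check for that pattern, assuming the serial number and part number have already been read from `show /SYS/MB/SEEPROM`; `looks_unprogrammed` is a hypothetical helper and the zero-fill heuristic is inferred from the slide's single bad example, so other failure patterns may exist.

```python
def looks_unprogrammed(serial_number: str, part_number: str) -> bool:
    """True if either value is empty or a zero-filled placeholder."""
    # A serial number made only of zeros, such as 0000000000.
    serial_blank = serial_number.strip("0") == ""
    # A part number whose groups after the first are all zeros, such as 602-0000-00.
    _, _, tail = part_number.partition("-")
    part_blank = tail.replace("-", "").strip("0") == ""
    return serial_blank or part_blank


if __name__ == "__main__":
    print(looks_unprogrammed("0550AN026D", "602-3103-01"))
    print(looks_unprogrammed("0000000000", "602-0000-00"))
```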
    42. 42. Updating FRUID data <ul><li>Example of Updating System Board FRUID data </li></ul><ul><ul><li>Log in to Service Processor as user “sunservice” </li></ul></ul><ul><ul><li>ebusy> ssh -l sunservice va64-x4200c-sp-gmp03 </li></ul></ul><ul><ul><li>sunservice@va64-x4200c-sp-gmp03's password: changeme </li></ul></ul><ul><ul><li>Issue the servicetool command, and answer the questions </li></ul></ul><ul><ul><li>[(flash)root@SUNSP00144F0E27BD:~]# servicetool --board_replaced=mainboard </li></ul></ul><ul><ul><li>--fru_product_part_number --fru_product_serial_number </li></ul></ul>
    43. 43. Updating FRUID data Servicetool is going to update the mainboard FRU with product and chassis information collected from the removed mainboard. The following preconditions must be true for this to work: * The new mainboard must be installed. * The service processor must not have been replaced with the motherboard. * The service processor firmware must not have been upgraded prior to the motherboard replacement; do firmware upgrades after component swaps! Do you want to continue (y|n)? y Mainboard FRU configuration has been updated. Servicetool is going to update the mainboard FRU product part number. Do you want to continue (y|n)? y When entering values, do not use quotes; If you require embedded quotes, escape them with three backslashes; e.g. \\\"
    44. 44. Updating FRUID data What is the new product part number? 602-3103-01 The product part number has been updated. The new part number is: "602-3103-01" Servicetool is going to update the mainboard FRU product serial number. Do you want to continue (y|n)? y When entering values, do not use quotes; If you require embedded quotes, escape them with three backslashes; e.g. \\\" What is the new product serial number? 0550AN026D The product serial number has been updated. The new serial number is: "0550AN026D" Updating FRUs... done [(flash)root@SUNSP00144F0E27BD:~]#
    45. 45. Current Issues <ul><li>FAB 102770 Galaxy power busbar connections </li></ul><ul><li>CR 6335741 X4100 PCI Riser causing reboots </li></ul><ul><li>CR 6515060 X4600 Randomly powers off </li></ul><ul><li>CR 6537731 X4600M2 DIMM slots labelled wrongly </li></ul>
    46. 46. FAB 102770 Thermal Issue on Galaxy <ul><li>Failure to properly tighten the System/Motherboard or the DC Power Distribution Board bus bar connections on Galaxy may lead to a thermal event </li></ul><ul><ul><li>New Nut design </li></ul></ul><ul><ul><li>Screws should be torqued to 7.5 in-lb (0.847 N·m) and nuts to 18 in-lb (2.034 N·m). </li></ul></ul><ul><ul><li>Busbar test to monitor 12 V rail with and without load </li></ul></ul><ul><ul><ul><li>http://nsgrelease.sfbay/galaxy12/releases/G12x-SW1.3-rc38/ops/061215/ </li></ul></ul></ul><ul><ul><ul><li>http://sdpsweb.central/FIN_FCO/FAB/102770/SPE/busbar </li></ul></ul></ul>
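The torque figures quoted for the bus bar fasteners can be cross-checked with a short conversion. The sketch below uses the exact definitions of the inch (0.0254 m) and the pound-force (4.4482216152605 N); the function name is hypothetical.

```python
# 1 inch-pound (in-lbf) expressed in newton metres.
NEWTON_METRES_PER_INCH_POUND = 0.0254 * 4.4482216152605


def inch_pounds_to_newton_metres(inch_pounds: float) -> float:
    """Convert a torque from inch-pounds to newton metres."""
    return inch_pounds * NEWTON_METRES_PER_INCH_POUND


if __name__ == "__main__":
    for name, torque in (("screws", 7.5), ("nuts", 18.0)):
        print(f"{name}: {torque} in-lb = {inch_pounds_to_newton_metres(torque):.3f} N·m")
```

Both results agree with the newton metre values on the slide to three decimal places.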
    47. 47. FAB 102770 Thermal Issue on Galaxy <ul><li>1) Copy the latest busbar tool to the service processor /coredump directory. </li></ul><ul><li>scp busbar sunservice@?sp_ip?:/coredump <cr> </li></ul><ul><li>(where ?sp_ip? is the target IP address) </li></ul><ul><li>.....continue connecting (yes/no)? yes <cr> </li></ul><ul><li>password: changeme <cr> </li></ul><ul><li>2) ssh into the targeted system </li></ul><ul><li>ssh sunservice@?sp_ip? <cr> </li></ul><ul><li>password: changeme <cr> </li></ul><ul><li># cd /coredump </li></ul>
    48. 48. FAB 102770 Thermal Issue on Galaxy busbar <loopcnt> <system name> loopcnt - This is the number of times you wish busbar to run. If this value is 0 then busbar will run forever. System name - This specifies the machine type to test. Below is a list of systems known to busbar. system name g1 = Galaxy1 g2 = Galaxy2 g1e = Galaxy1e g2e = Galaxy2e g1f = Galaxy1f g2f = Galaxy2f cnst = Constellation For example, to run busbar 3 times on Constellation I would use the following command: ./busbar 3 cnst Description: The busbar tool was designed to find systems with poor bus bar connections. This is done by reading the 12 volt sensor twice. The first time the 12 volt sensor is read with the system in reset and the fans spun down so as to minimize the load on the system. The second time the 12 volt sensor is read with the system running and the fans at their highest rpm so as to maximize the load on the system. The two numbers are compared; if the difference between the two is greater than 5% then there may be a problem with the bus bar connection and an error is generated.
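The pass/fail logic described for the busbar tool comes down to one comparison of two readings. The following is a sketch of that comparison only, not the tool's actual code; it assumes the difference is measured relative to the low-load reading, which the slide does not specify, and `busbar_suspect` is a hypothetical name.

```python
def busbar_suspect(idle_volts: float, loaded_volts: float,
                   threshold: float = 0.05) -> bool:
    """True if the 12 V reading changes by more than `threshold` under load.

    idle_volts:   reading with the system in reset and fans spun down
    loaded_volts: reading with the system running and fans at full speed
    """
    if idle_volts <= 0:
        raise ValueError("idle reading must be positive")
    relative_change = abs(idle_volts - loaded_volts) / idle_volts
    return relative_change > threshold


if __name__ == "__main__":
    print(busbar_suspect(12.0, 11.9))  # small sag under load
    print(busbar_suspect(12.0, 11.2))  # large sag under load
```

A healthy connection shows almost the same voltage in both states; a loose connection adds resistance, so the rail sags once the load current rises.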
    49. 49. FAB 102770 Thermal Issue on Galaxy
    50. 50. CR 6335741 X4100 PCI Riser causing reboots <ul><li>Excessive ring-back noise during write cycles from the option card installed on the 133MHz Slot (slot 1) </li></ul><ul><li>Workaround – only use slot 0 (100MHz) </li></ul><ul><li>Replace Riser Card 501-6914-01 with 501-6914-02 </li></ul><ul><ul><li>Engineering suggest replacing both risers </li></ul></ul><ul><li>Until FCO is available raise CIC </li></ul>
    51. 51. CR 6515060 X4600 Randomly Powers Off <ul><li>Small percentage of CPUs giving ThermTrip errors </li></ul><ul><ul><li>BIOS 44 (SW1.3), available end of March, will log ThermTrip events as default. If applied, ThermTrip events will cause the system to shut down. When the system is powered on, there will be a message stating a ThermTrip event has occurred. This message is also logged and can be retrieved by ipmitool. </li></ul></ul><ul><ul><li>Further debugging using latest version of HDT required to isolate failing CPU. Requires debug BIOS which disables shutdown on ThermTrip so only Sun badged engineers allowed to use it </li></ul></ul>
    52. 52. CR 6515060 X4600 Randomly Powers Off <ul><li>Thermtrip detected by hdtl – Example </li></ul><ul><li>[(flash)root@SUNSP00144F26E93F:/coredump]# ./hdtl -y </li></ul><ul><li>hdtDiag: Galaxy/Thumper HDT Diagnostics, Version 0.9.6 </li></ul><ul><li>------------------------------------------------------- </li></ul><ul><li>hdtDiag: ThermTrip Diags </li></ul><ul><li>Stopping IPMI Stack....Done. </li></ul><ul><li>no dbdry cpu 00 </li></ul><ul><li>hdtDiag: resetting system, hard reset </li></ul><ul><li>hdtDiag: waiting for power good </li></ul><ul><li>hdtDiag: Power on 01 </li></ul><ul><li>stopped at reset vector </li></ul><ul><li>hdtDiag: Galaxy4: 8 - CPU Configuration mod 00ff </li></ul><ul><li>hdtDiag: Power is on </li></ul><ul><li>hdtDiag: relocating 8132 </li></ul><ul><li>hdtDiag: ck08 : 0x005110de </li></ul><ul><li>CK08 System Control Register space 0x4400 </li></ul><ul><li>00 : 0x09000000 </li></ul><ul><li>04 : 0x00000000 </li></ul><ul><li><snip> </li></ul>
    53. 53. CR 6515060 X4600 Randomly Powers Off <ul><li></snip> </li></ul><ul><li>f8 : 0x00000000 </li></ul><ul><li>fc : 0x00000000 </li></ul><ul><li>TCO Status Register : 0x00410002 </li></ul><ul><li>THERMTRIP_STS asserted : THERMTRIP detected by ck08 </li></ul><ul><li>TCO Ctrl Register : 0x00ff0800 </li></ul><ul><li>THERMTRIP_RST set : G2/S5 shutdown on THERMTRIP enabled </li></ul><ul><li>hdtDiag: reset to get to ASP's </li></ul><ul><li>hdtDiag: resetting system, hard reset </li></ul><ul><li>hdtDiag: exit HDT mode </li></ul><ul><li>hdtDiag: cpu 0 Slot A : 0x11510120 ENABLED </li></ul><ul><li>hdtDiag: cpu 1 Slot C : 0x104f0020 ENABLED </li></ul><ul><li>hdtDiag: cpu 2 Slot B : 0x0f570220 ENABLED </li></ul><ul><li>hdtDiag: cpu 3 Slot D : 0x0f4d0220 ENABLED </li></ul><ul><li>hdtDiag: cpu 4 Slot E : 0x104c0120 ENABLED </li></ul><ul><li>hdtDiag: cpu 5 Slot F : 0x0f4f0220 ENABLED </li></ul><ul><li>hdtDiag: cpu 6 Slot G : 0x10490120 ENABLED </li></ul><ul><li>hdtDiag: cpu 7 Slot H : 0x10460320 ENABLED </li></ul><ul><li>[(flash)root@SUNSP00144F26E93F:/coredump]# </li></ul>
    54. 54. CR 6515060 X4600 Randomly Powers Off <ul><li>Hdtl output using engineering BIOS </li></ul><ul><li></snip> </li></ul><ul><li>f8 : 0x00000000 </li></ul><ul><li>fc : 0x00000000 </li></ul><ul><li>TCO Status Register : 0x00400000 </li></ul><ul><li>THERMTRIP_STS de-asrt : No THERMTRIP detected by ck08 </li></ul><ul><li>TCO Ctrl Register : 0x00fe1000 </li></ul><ul><li>THERMTRIP_RST set : G2/S5 shutdown on THERMTRIP disabled </li></ul><ul><li>hdtDiag: cpu 0 Slot A : 0x115e0120 ENABLED </li></ul><ul><li>hdtDiag: cpu 1 Slot C : 0x105f0020 ENABLED </li></ul><ul><li>hdtDiag: cpu 2 Slot B : 0x0f63022a ENABLED THERMTRIP CORE0 </li></ul><ul><li>hdtDiag: ===> CPU slot B : reset caused by THERMTRIP </li></ul><ul><li>hdtDiag: cpu 3 Slot D : 0x0f560220 ENABLED </li></ul><ul><li>hdtDiag: cpu 4 Slot E : 0x10590120 ENABLED </li></ul><ul><li>hdtDiag: cpu 5 Slot F : 0x0f5a0220 ENABLED </li></ul><ul><li>hdtDiag: cpu 6 Slot G : 0x10560120 ENABLED </li></ul><ul><li>hdtDiag: cpu 7 Slot H : 0x10500320 ENABLED </li></ul><ul><li>[(flash)root@SUNSP00144F26E93F:/coredump]# </li></ul>
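With the engineering BIOS, the failing CPU is the one whose per-CPU status line carries a THERMTRIP marker. A minimal sketch of extracting that from captured `hdtl` output; the regular expression is based only on the two sample outputs shown on these slides, and `thermtrip_cpus` is a hypothetical helper, so other hdtl versions may format lines differently.

```python
import re
from typing import List, Tuple

# Matches lines such as:
#   hdtDiag: cpu 2 Slot B : 0x0f63022a ENABLED THERMTRIP CORE0
CPU_LINE = re.compile(
    r"hdtDiag: cpu (\d+) Slot (\w) : (0x[0-9a-fA-F]+) (\w+)(?: THERMTRIP (\w+))?"
)


def thermtrip_cpus(output: str) -> List[Tuple[int, str, str]]:
    """Return (cpu_number, slot, core) for every CPU line flagged THERMTRIP."""
    flagged = []
    for line in output.splitlines():
        match = CPU_LINE.search(line)
        if match and match.group(5):
            flagged.append((int(match.group(1)), match.group(2), match.group(5)))
    return flagged


if __name__ == "__main__":
    sample = (
        "hdtDiag: cpu 1 Slot C : 0x105f0020 ENABLED\n"
        "hdtDiag: cpu 2 Slot B : 0x0f63022a ENABLED THERMTRIP CORE0\n"
        "hdtDiag: ===> CPU slot B : reset caused by THERMTRIP\n"
    )
    print(thermtrip_cpus(sample))
```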
    55. 55. CR 6537731/6538830 X4600-M2 DIMM slots labelled wrongly in service manual and on service label <ul><li>G4's top cover service label, pn 263-2329-04-50, X4600 M2. In step #10 "DIMM Insertion" the DIMM slot number information is INCORRECT </li></ul><ul><li>Currently it shows: [Board top edge] </li></ul><ul><ul><li>DIMM0 - White - Pair 0 </li></ul></ul><ul><ul><li>DIMM1 - White - Pair 0 </li></ul></ul><ul><ul><li>DIMM2 - Black - Pair 1 </li></ul></ul><ul><ul><li>DIMM3 - Black - Pair 1 </li></ul></ul><ul><li>The label should read: [Board top edge] </li></ul><ul><ul><li>DIMM3 - White - Load first - Pair 1 </li></ul></ul><ul><ul><li>DIMM2 - White - Load first - Pair 1 </li></ul></ul><ul><ul><li>DIMM1 - Black - Pair 0 </li></ul></ul><ul><ul><li>DIMM0 - Black - Pair 0 </li></ul></ul>