{DESCRIPTION} {TRANSCRIPT} Welcome, this section covers the Troubleshooting aspects of servers. This is Topic 11 in a series of topics of the System x Technical Principles Course – XTW01.
{DESCRIPTION} {TRANSCRIPT} At the completion of this topic, you should be able to: identify basic troubleshooting questions to consider; identify the six possible states of a system; and identify diagnostic tools that are available to gather and analyze information for each given system state.
{DESCRIPTION} {TRANSCRIPT} This course is designed to familiarize you with troubleshooting tools that will help you determine the appropriate solution to your system’s problem. We will examine the six system states used to gather and analyze problem information, introduce the basic tools that help identify system issues, and introduce the Dynamic System Analysis (DSA) tool, which collects and analyzes system information to aid in diagnosing system problems. The next slide describes the troubleshooting questions to consider.
{DESCRIPTION} {TRANSCRIPT} When troubleshooting System x and BladeCenter blade servers, knowing the answers to these questions can help lead you toward a quicker fix. It also will give you an idea of how to enter the diagnostic package and where to look for error indications. In the next slides, we will take a look at other tools available to help you diagnose and solve hardware-related problems as well as software installation and configuration options.
{DESCRIPTION} {TRANSCRIPT} This section examines the six system states.
{DESCRIPTION} {TRANSCRIPT} When trying to identify a system problem, consider the six possible states of a system. All IBM xSeries, eServer (AMD processor-based) and System x servers start in a uniform manner. All have a common set of interfaces that indicate how far through the power-up sequence the server has progressed. Knowing how far the system gets helps in determining what should have happened. This will help you in your problem analysis.
{DESCRIPTION} {TRANSCRIPT} All servers are supported by documentation, which forms part of the tool set for both information gathering and information analysis. For example, a Problem Determination and Service Guide (PDSG) contains a list of errors that may occur during POST (information gathering) and also contains probable causes of each error (information analysis). Not all information sources are available in all system states. Let’s take a closer look at the individual states.
{DESCRIPTION} {TRANSCRIPT} In this state, there are likely to be no indicators from the system. However, that fact is an indicator in itself. A system with no AC power cannot start so there will be no fan noise, the disks will not spin and there will be no lights. Your information gathering tools are your eyes and ears. Sight and sound clues are often overlooked but a lack of visual or audible indicators can be used for problem isolation. The PDSG contains information in several places about such situations and offers guidance on how to test the condition and what action to take to resolve it.
{DESCRIPTION} {TRANSCRIPT} Here, AC power is present. There may be a visual indicator on the outside of the power supply. Depending on the configuration of the system, there may also be event logs from a BMC or an RSA/management module if one is installed in the system. The information gathering tools for BMC are the SMBridge utility and the IBM SVCCon tool. The BMC log is in IPMI standard format and may require some interpretation to identify any problems. The RSA log is in human readable format and will indicate more clearly what is wrong. Remember that the tools from system state 1 are also available in this state. The PDSG will be useful in understanding any event log information that is available.
{DESCRIPTION} {TRANSCRIPT} In this state, something has caused the system to fail in POST. If POST can start, there may be audible warnings or visual warnings in the form of POST messages such as check point codes, adapter messages, etc. In addition, if POST is able to load the setup and diagnostic routines, you may be able to use these tools. If so, both have the ability to view the event logs of the system, giving you another option to view system status. If POST fails on an adapter, the adapter may display messages that give clues to what is failing. Information on adapter failure messages may be contained in the PDSG, RETAIN tips or on the IBM Support Web site. All of these tools are useful in trying to understand the nature of the problem.
{DESCRIPTION} {TRANSCRIPT} Here, POST has completed. This is a strong indicator that the hardware has checked out and is working as designed. However, it does not mean that the hardware is ‘fit for purpose’. As the ‘purpose’ is to have the hardware behave as a server, something is stopping the software from loading. Areas to examine here may include diagnostic checks on disks, the disk boot sequence and any RAID configurations to confirm that a viable RAID array exists and can load an OS. Again, RETAIN tips and the IBM support Web site may have information that can help diagnose why the OS is failing to start.
{DESCRIPTION} {TRANSCRIPT} In this state, the OS has started its boot process but is failing to complete and provide a console. At this stage, the OS itself may give indicators of what is wrong. For example, a Microsoft Windows server OS shows progress indicators on the screen and may report software error messages. This information is likely to be documented on the Microsoft support Web site. Some of the information may also be contained within IBM RETAIN tips. It is important to check all available sources when information is available and verify any fault diagnosis with two confirmations if they are available.
{DESCRIPTION} {TRANSCRIPT} Finally, this state indicates that the system started correctly. This means that, at some point, everything was working as designed and the server was fit for purpose. However, something subsequently failed. An example of this is where a disk fails after the OS loads, causing the system to reboot. Depending on any fault tolerant characteristics of the server, the restart may have been successful. If it was, this opens up new opportunities for information gathering and analysis, in the form of OS event logs and the DSA tool. As with all previous states, if multiple information sources are available, all should be checked to look for matches in symptoms to increase your confidence level in the diagnosis.
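The six states and the information sources mentioned for each can be summarized as a small lookup table. The following is a minimal sketch only: the state names are paraphrased labels rather than official IBM terms, and treating sources as cumulative across all states is a simplifying assumption (the transcript states this explicitly only for state 1 tools in state 2).

```python
from enum import IntEnum


class SystemState(IntEnum):
    """Six system states; names are paraphrased labels, not official terms."""
    NO_AC_POWER = 1
    AC_POWER_PRESENT = 2
    POST_FAILED = 3
    POST_COMPLETE_OS_NOT_LOADING = 4
    OS_BOOT_INCOMPLETE = 5
    FAILED_AFTER_STARTUP = 6


# Information sources first mentioned in the transcript for each state.
SOURCES_INTRODUCED = {
    SystemState.NO_AC_POWER: ["sight and sound", "PDSG"],
    SystemState.AC_POWER_PRESENT: [
        "power supply indicator",
        "BMC event log (via SMBridge or SVCCon)",
        "RSA/management module event log",
    ],
    SystemState.POST_FAILED: [
        "POST audible and visual messages",
        "setup and diagnostic routines",
        "adapter messages",
        "RETAIN tips",
        "IBM Support Web site",
    ],
    SystemState.POST_COMPLETE_OS_NOT_LOADING: [
        "disk diagnostics",
        "disk boot sequence",
        "RAID configuration",
    ],
    SystemState.OS_BOOT_INCOMPLETE: [
        "OS progress indicators and error messages",
        "OS vendor support Web site",
    ],
    SystemState.FAILED_AFTER_STARTUP: ["OS event logs", "DSA"],
}


def sources_for(state):
    """Return sources for a state, assuming earlier sources remain usable."""
    sources = []
    for earlier in SystemState:  # iterates in definition order, 1 through 6
        if earlier <= state:
            sources.extend(SOURCES_INTRODUCED[earlier])
    return sources
```

For example, `sources_for(SystemState.AC_POWER_PRESENT)` returns the state 1 sources followed by the power supply indicator and the two event logs.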
{DESCRIPTION} {TRANSCRIPT} Multiple information sources may not always be available for a given fault. If they are, you must always check all sources against the available analysis tools. Ideally, you are looking for a match. If there is a mismatch between two information sources, you may need to look for a third source to help you to be confident that you have identified the cause of the problem.
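The cross-checking rule can be expressed as a short function. This is an illustrative sketch, not part of any IBM tool: the function name, the status strings, and the majority rule used to settle a disagreement are all assumptions made for the example.

```python
from collections import Counter


def cross_check(findings):
    """Compare suspected causes reported by different information sources.

    findings maps a source name (for example "BMC log") to the cause that
    source points to. Returns a (cause, status) tuple.
    """
    if not findings:
        return None, "no information"
    counts = Counter(findings.values())
    cause, votes = counts.most_common(1)[0]
    if len(findings) == 1:
        # A single source cannot be confirmed by a match.
        return cause, "unconfirmed"
    if votes >= 2 and votes > len(findings) / 2:
        # At least two sources agree and they form a majority.
        return cause, "confirmed"
    return None, "mismatch: consult another source"
```

With two sources that disagree, the result is a mismatch; adding a third source that agrees with one of them yields a confirmed cause.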
{DESCRIPTION} {TRANSCRIPT} Always use recognized reference points. Tools such as RETAIN tips, the PDSG and the IBM support Web site contain the collective knowledge of the teams who designed the system, along with the accumulated knowledge of the many people who support the product. Your experience is, of course, extremely valuable. If you have seen a fault many times and you are working with someone who is seeing the fault for the first time, you can bring your experience to bear and help your team members. Remember to explain why a fault exists, not just how to fix it so that next time, your team members can fully understand the circumstances surrounding the fault.
{DESCRIPTION} {TRANSCRIPT} This section introduces some of the basic data-gathering diagnostic tools, starting with the light path diagnostic tool.
{DESCRIPTION} {TRANSCRIPT} The light path diagnostics allow you to quickly identify the type of system error that occurred by monitoring and reporting the health of the processors, main memory, hard disk drives, PCI adapters, fans, power supplies, VRMs, and the internal system temperature. The server is designed so that any LEDs that are illuminated remain illuminated when the server shuts down, as long as the power source is good. This feature helps you isolate the problem if an error causes the server to shut down. The system board also contains LEDs beside specific components (such as DIMM Slot 12) that identify the failed part. The light path diagnostics work even when the server is unplugged. The two buttons shown on the light path panel are Remind and Reset. Remind: You can use the remind button on the light path diagnostics panel to put the system-error LED on the operator information panel into Remind mode. When you press the remind button, you acknowledge the error but indicate that you will not take immediate action. The system-error LED flashes while it is in Remind mode and stays in Remind mode until one of the following conditions occurs: all known errors are corrected; the server is restarted; or a new error occurs, causing the system-error LED to be lit again. Reset: Use this button to force an immediate system restart.
{DESCRIPTION} {TRANSCRIPT} This section identifies the three types of service processor: Baseboard Management Controller (BMC), Remote Supervisor Adapter (RSA), and Advanced Management Module (AMM).
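The Remind mode behavior described above can be modeled as a small state machine. This is a sketch for illustration only: the class and method names are hypothetical, and what the LED does after a restart while errors are still present is an assumption, since the transcript says only that a restart ends Remind mode.

```python
class SystemErrorLed:
    """Toy model of the system-error LED and its Remind mode."""

    def __init__(self):
        self.active_errors = set()
        self.mode = "off"  # one of "off", "lit", "remind" (flashing)

    def report_error(self, name):
        # A new error lights the LED again, which ends Remind mode.
        self.active_errors.add(name)
        self.mode = "lit"

    def press_remind(self):
        # Acknowledge the error without taking immediate action.
        if self.mode == "lit":
            self.mode = "remind"

    def correct_error(self, name):
        self.active_errors.discard(name)
        if not self.active_errors:
            # All known errors are corrected.
            self.mode = "off"

    def restart(self):
        # Assumption: errors still present after a restart relight the LED.
        self.mode = "lit" if self.active_errors else "off"
```

Pressing Remind while the LED is lit moves it to the flashing state; reporting a further error returns it to the lit state, matching the third exit condition.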
{DESCRIPTION} {TRANSCRIPT} This table lists the three types of service processor hardware management options available for IBM System x servers and BladeCenter chassis. These are separate service processors used to control power to the device and perform management and diagnostic functions. The baseboard management controller (BMC) or mini baseboard management controller (Mini-BMC) is a specialized microcontroller embedded on the motherboard and used to control some System x servers. It is the intelligence in the Intelligent Platform Management Interface (IPMI) architecture. The BMC manages the interface between system management software and platform hardware. Different types of sensors built into the server report to the BMC on parameters such as temperature, cooling fan speeds, power mode, and operating system (OS) status, to name a few. The BMC monitors these sensors and can send alerts to a system administrator via the network if any of the parameters do not stay within preset limits, indicating a potential failure of the system. The administrator can also remotely communicate with the BMC to take some corrective action, such as resetting or power cycling the system to get a hung OS running again. Standard in some System x servers and an option in others, the RSA expands BMC capability by allowing you to perform systems management functions whether your server is operational or not. The RSA can be accessed either in-band through a device driver, or out-of-band over serial or Ethernet. The AMM provides system-management functions and keyboard/video/mouse (KVM) switching for all of the blade servers in a BladeCenter chassis that support KVM. Each BladeCenter chassis comes with at least one advanced management module. The Remote Supervisor Adapter or the Remote Supervisor Adapter II and the Advanced Management Module (AMM) monitor and report the status and health of your system’s components via standalone Web interfaces.
The Web interfaces contain similar tasks, such as System Status and Event Log, where you can view possible errors that may indicate a potential failure, and messages captured on the server’s hardware environment.
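The way a BMC compares sensor readings against preset limits can be sketched in a few lines. This is a conceptual illustration only: a real BMC does this in firmware and exposes the results through IPMI, and the sensor names, limit values, and message format below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    """Preset lower and upper bounds for one sensor (hypothetical values)."""
    low: float
    high: float


def out_of_range(readings, limits):
    """Return one alert message per reading that falls outside its limits.

    readings maps a sensor name to its current value; limits maps a sensor
    name to its Limit. Sensors without a configured limit are skipped.
    """
    alerts = []
    for sensor, value in readings.items():
        limit = limits.get(sensor)
        if limit is None:
            continue
        if not (limit.low <= value <= limit.high):
            alerts.append(f"{sensor}: {value} outside {limit.low}..{limit.high}")
    return alerts


# Hypothetical example: a slow fan triggers an alert, a normal temperature does not.
example_limits = {"cpu_temp_c": Limit(5, 85), "fan1_rpm": Limit(2000, 12000)}
example_alerts = out_of_range({"cpu_temp_c": 62, "fan1_rpm": 1200}, example_limits)
```

In this sketch the alert list would then be handed to whatever notification path is configured, which corresponds to the BMC alerting an administrator over the network.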
{DESCRIPTION} {TRANSCRIPT} This section introduces the Dynamic System Analysis (DSA) tools.
{DESCRIPTION} {TRANSCRIPT} IBM Dynamic System Analysis (DSA) collects and analyzes system information to aid in diagnosing system problems. The system information is collected into a compressed XML file that can be sent to IBM Service. By default, DSA output is created in the \IBM_Support directory of the hard disk defined by the %SystemDrive% environment variable. Additionally, users can view the system information through optionally generated HTML Web pages. DSA creates a merged log that allows users to easily identify cause-and-effect relationships from different log sources in the system. Optionally, DSA can also run diagnostics on the installed components. Two versions of DSA are available. The first, DSA Portable Edition, runs from the command prompt on a supported system without altering any system files or system settings. It expands to temporary space on the target system, runs, and deletes all intermediate files after execution completes. Its design and packaging allow it to collect system information in sensitive customer environments with only temporary use of system resources. The second version, DSA Installable Edition, provides a permanent installation of DSA onto a system. This installation shares a similar command prompt interface with the Portable Edition. With DSA Installable Edition, you can get an UpdateXpress comparison analysis to verify whether your firmware and drivers are current.
{DESCRIPTION} {TRANSCRIPT} DSA Portable Edition is not installed on the target system. When run, DSA Portable Edition expands to a temporary directory, which is removed after DSA information collection is completed. DSA Portable Edition is designed to fit on removable media such as a CD or USB key. The removable media must be supported for use with the server on which you plan to run DSA Portable Edition. Installation of DSA requires 15 MB of disk space. DSA requires 50 to 100 MB of available memory during the data collection process. The amount of memory required for this process depends on the size of the logs being collected from the system. To view the information that is collected by DSA, you must use Internet Explorer 6.0 with Service Pack 1 (or later), Mozilla 1.4.0 (or later), or Firefox 1.04 (or later). To display the DSA data in a Web browser, 30 to 100 MB of available memory is required. The exact amount of memory required depends on the size of the logs being viewed. At the time of writing, DSA is launched by running a single MS Windows executable file (ibm_utl_dsa_200p_windows_noarch.exe) or a Linux distribution-specific shell script. It supports various command-line options, and the command might vary according to the released version of the program. The -t option sends the collected data to IBM System x Service and Support. DSA uses File Transfer Protocol (FTP) to transfer the compressed XML output file to IBM Service. When DSA is run with the -v option, it creates a subdirectory that contains HTML files that you can view with a Web browser.
{DESCRIPTION} {TRANSCRIPT} DSA Installable Edition can be run from the Start menu, from a command prompt or from a Linux shell prompt, and has the same requirements as the Portable Edition. The installation can be performed either manually or in unattended mode. Once installed, the DSA executable file is collectall.exe and, as in the Portable Edition, it supports command-line options. When the file is launched with the -u option, the user can designate a fully qualified path to an UpdateXpress CD or CD image. Alternatively, using the -ul option instructs DSA to use the online UpdateXpress index. An additional command-line utility, rtdcli.exe, is included with DSA Installable Edition to provide more control over the execution of the diagnostic tests. When DSA is launched from the Start menu, the HTML files are generated automatically.
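A script that launches DSA Portable Edition might assemble its command line as follows. This is a hedged sketch: only the -v and -t options and the \IBM_Support default directory come from the material above; the helper names, the "C:" fallback drive, and the option order are assumptions, and the options may differ between DSA releases.

```python
import ntpath


def build_dsa_command(executable, view_html=False, send_to_ibm=False):
    """Build an argument list for DSA Portable Edition (nothing is executed)."""
    args = [executable]
    if view_html:
        args.append("-v")  # also create a subdirectory of HTML files
    if send_to_ibm:
        args.append("-t")  # transfer the compressed XML output to IBM Service
    return args


def default_output_dir(environ):
    """Return the default DSA output directory for a Windows environment.

    environ is a mapping such as os.environ; the "C:" fallback is an
    assumption for the case where SystemDrive is not set.
    """
    drive = environ.get("SystemDrive", "C:")
    return ntpath.join(drive + "\\", "IBM_Support")
```

The returned list can be passed to `subprocess.run` on the target server. Using `ntpath` rather than `os.path` keeps the Windows path handling consistent even if the script is prepared on another platform.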
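The two UpdateXpress options can be captured in a small command builder. This is a sketch under stated assumptions: the function name is hypothetical, passing the path as a separate argument after -u is a guess at the syntax, and treating -u and -ul as mutually exclusive is an interpretation of the word "alternatively" rather than documented behavior.

```python
def build_collectall_command(updatexpress_path=None, use_online_index=False):
    """Build an argument list for collectall.exe (nothing is executed)."""
    if updatexpress_path and use_online_index:
        # Assumption: the two comparison sources are alternatives.
        raise ValueError("choose either a local UpdateXpress path or the online index")
    args = ["collectall.exe"]
    if updatexpress_path:
        # Fully qualified path to an UpdateXpress CD or CD image.
        args += ["-u", updatexpress_path]
    elif use_online_index:
        args.append("-ul")  # use the online UpdateXpress index
    return args
```

Calling the builder with no arguments yields a plain collection run without the UpdateXpress comparison.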
{DESCRIPTION} {TRANSCRIPT} IBM has developed an embedded Preboot DSA (Dynamic System Analysis) diagnostics tool for the IBM System x3850 M2 and x3950 M2 because previous tools such as PC Doctor did not meet the requirements of the newer IBM System x systems. Preboot DSA is an NVRAM-based version of the Dynamic System Analysis tool. The Technical Support teams use it to collect system and component-level information, operating system driver information, hardware event logs from various hardware components, and operating system event logs in order to diagnose system problems. DSA and Preboot DSA collect information that can be viewed locally or uploaded to an IBM internal FTP server, giving the Technical Support teams remote access from any location at any time when a deeper analysis of system state information or error logs is required. It is the primary method of testing the major components of the server. This image shows the options available from the main Preboot DSA interface. At this time, Preboot DSA is a feature of selected servers.
{DESCRIPTION} {TRANSCRIPT} You can access the Preboot DSA once the system reaches state 4 (completion of POST) by pressing the F2 key when the F2 prompt is displayed on screen. By default, the system first displays the memory test menu; select Quit using the right arrow key on your keyboard, and then select ‘Quit to DSA’. The Preboot DSA diagnostic program might appear to be unresponsive for an unusual length of time (up to 10 minutes) when you start the program. This is normal operation while the program loads. It is recommended that you power on all attached devices before powering on the server.
{DESCRIPTION} {TRANSCRIPT} Next, the Preboot DSA displays a command main menu that allows you to make selections by typing the following commands: gui takes you to the graphical environment; cmd offers various commands as options; copy copies DSA results to removable media; exit exits the program; help is also available. If you type the “cmd” option, as shown in the second image, it provides the IBM DSA Interactive list of options, which allows you to collect, view, or display various tests and results on your system components.
{DESCRIPTION} {TRANSCRIPT} This is a screen shot of the Diagnostics option. You can use this submenu to perform a variety of diagnostic tests on system hardware, such as a CPU or memory stress test.
{DESCRIPTION} {TRANSCRIPT} This is a screen shot of the System Information option. Use the submenu to obtain an overview about your system or your multinode partition.
{DESCRIPTION} {TRANSCRIPT} The advice above is not specific to any one server; it is generic to any problem handling. Comparing the configuration and software set-up between “working” and “non-working” systems will often lead to problem resolution. With the variety of hardware and software combinations that can be encountered, the following information can assist you in problem determination. If possible, have this information available when requesting assistance from Service Support and Engineering functions.
{DESCRIPTION} {TRANSCRIPT} Any time parts are removed for reseating or replaced with a new part, or any cables are unplugged for any reason, the potential to disturb other components of the system is very high. Especially in multi-board systems, it is vital that code levels are matched; that is, that they can work together across the boards. You must also run a full pass of diagnostics on the part that was replaced and then, if that passes, run a “quick test” on the entire system to make sure no other errors were induced during the maintenance activity. You should also run the Light Path Diagnostic LED test to ensure that the LEDs will be effective in indicating future errors.
{DESCRIPTION} {TRANSCRIPT} This slide presents a glossary of acronyms and terms used in this topic.
{DESCRIPTION} {TRANSCRIPT} Having completed this topic, you should be able to: identify basic troubleshooting questions to consider; identify the six possible states of a system; and identify diagnostic tools that are available to gather and analyze information for each given system state.
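The order of the post-maintenance checks can be written down as a short routine. This is only a sketch of the sequence: the three callables are hypothetical stand-ins for whatever diagnostic procedures are actually used, and each is assumed to return True on a pass.

```python
def verify_after_maintenance(part_diagnostics, system_quick_test, led_test):
    """Run the post-maintenance checks in the recommended order.

    Each argument is a callable that returns True when its check passes.
    The system quick test runs only if the replaced part passes its full
    diagnostics; the LED test is run in either case.
    """
    results = {"part diagnostics": part_diagnostics()}
    if results["part diagnostics"]:
        results["system quick test"] = system_quick_test()
    results["light path LED test"] = led_test()
    return results
```

A result without a "system quick test" entry signals that the replaced part itself failed and needs attention before the rest of the system is tested.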
{DESCRIPTION} This screen displays html links. {TRANSCRIPT} Listed are some additional resources that will help you learn more about the IBM System x. IBM offers a rich library of resources on a variety of topics - from customized Web-based education to downloadable brochures, planning and installation guides on popular solutions, as well as maintaining IBM Systems.
{DESCRIPTION} Displays the statement of “End of Presentation” in the center of the slide. {TRANSCRIPT} Thank you!