Case study-1: The backtrace contained strings such as ‘execute_command_internal’, ‘parse_and_execute’, ...
Case study-2: TAC case: 612476435. Problem Description: CallManager service coredumps every 2 and a half ...
Case study-2: The backtrace indicates that it's an intentional coredump. Hence, need to review the P...
Case study-2: The %VM Used counter appears to be high. The VMSize for CCM is high. Also, note how ...
Case study-2: Escalation was submitted to the Business Unit (BU). Filed a software defect CSCtc70568 wi...
Commonly Found Crash Defects: CSCsv49493 - 7828-H3 goes down with journal aborted error; CSCta73022 - 7...
TSRT Crashes

  1. Troubleshooting Communications Manager Crashes, Cores, Service Restarts
     Nikhil Phansalkar, Adam Frankel
     Cisco Unified Communications
     © 2006 Cisco Systems, Inc. All rights reserved. Cisco Confidential
  2. Overview
     In this presentation we will focus on troubleshooting the following issues:
     • Service Crashes
       - Identify and debug coredumps
       - Troubleshoot services not starting up properly
       - Common issues that trigger service failures (licensing, DNS, etc.)
     • Server Crashes
       - Symptoms of hardware failure
       - File system corruption
       - Kernel panic
       - Using netdump to troubleshoot kernel panic
       - ASR (Automatic Server Recovery)
       - IMM (Integrated Management Module)
     • Case Studies
  3. Identifying an Application Core
     How do you determine that a coredump has occurred on a system? Here are the typical symptoms of a coredump:
     • The server remained up, but service was temporarily affected.
     • An alert was generated from RTMT about a core file being generated.
     • A message appears in Event Viewer - Application Log.
  4. Identifying an Application Core
     How do you determine which application generated the coredump file?
     • Right-click the alert and select Alert Detail. This shows which application generated the core, the time of the core, and the server that had the core.
     • Use the CLI command to list all cores present on the system:
         utils core list            [for CUCM ver 5.x, 6.x]
         utils core active list     [for CUCM ver 7.x and later]
     In the above examples, it is the CCM application that generated the coredump.
  5. Generating Backtrace
     Use the following CLI command to generate a backtrace:
         utils core analyze <CoreFilename>          [for CUCM ver 5.x, 6.x]
         utils core active analyze <CoreFilename>   [for CUCM ver 7.x and later]
     Option-1: Generate the backtrace using the CLI command in the customer environment. The core analysis may cause a momentary increase in CPU utilization. For busy systems, it is advised to run this command during off-hours.
     Option-2: Generate the backtrace on a lab server.
     • Download and retrieve the core file from the production system.
     • Upload the core file to /var/log/active/core on a lab server (requires root access). The lab server should be running the exact same CUCM version.
     • Execute the CLI command on the lab server.
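The version split between the two command families can be captured in a small helper. This is only an illustrative sketch: the function name `core_commands` and the idea of passing the major version as an integer are assumptions made for the example; the command strings themselves are the ones listed on the slides.

```python
def core_commands(major_version, core_filename):
    """Return the 'list' and 'analyze' CLI commands for a CUCM major version.

    Hypothetical helper: the slides only state that 5.x/6.x use
    'utils core ...' and 7.x and later use 'utils core active ...'.
    """
    if major_version < 5:
        raise ValueError("the slides only cover CUCM 5.x and later")
    prefix = "utils core" if major_version in (5, 6) else "utils core active"
    return [f"{prefix} list", f"{prefix} analyze {core_filename}"]


if __name__ == "__main__":
    for command in core_commands(7, "<CoreFilename>"):
        print(command)
```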
  6. Search Topic
     Use the first 4 to 6 lines of the backtrace to formulate a search string for Topic. For the backtrace shown on this slide, the following search string can be used as a starting point:
         _STL::list PickupMemberDnTable::findSubscribedMemberDnList PickupMonitoring::sendNotifyReq
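Building the search string from the top frames can be sketched in a few lines of Python. The regular expression assumes gdb-style frame lines (`#N 0xADDR in function (...)`), and the sample frames below are hypothetical: only the three function names come from the slide, while the addresses, arguments, and file names are made up for illustration.

```python
import re

# Matches gdb-style frames such as "#3 0x0050d58b in ns::func (args) ...".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.+?)\s*\(")


def search_string(backtrace, max_frames=6):
    """Join the function names of the first max_frames frames, without repeats."""
    names = []
    frames_seen = 0
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue
        frames_seen += 1
        if match.group(1) not in names:
            names.append(match.group(1))
        if frames_seen == max_frames:
            break
    return " ".join(names)


# Hypothetical frames; only the function names are taken from the slide.
SAMPLE = """\
#0  0x08a11111 in _STL::list () at list.h:10
#1  0x08a22222 in PickupMemberDnTable::findSubscribedMemberDnList (this=0x1) at a.cpp:1
#2  0x08a33333 in PickupMonitoring::sendNotifyReq (this=0x2) at b.cpp:2
"""

if __name__ == "__main__":
    print(search_string(SAMPLE, max_frames=3))
```

Generic runtime frames (for example `raise` or `abort`) usually make poor search terms, so in practice the result may need manual trimming before searching.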
  7. Review Results of Topic Search
     Check if there are any known bugs applicable to the customer’s CUCM version.
  8. Troubleshoot Unresolved Coredumps
     If the backtrace does not match an existing bug, then the following data should be collected for analysis:
     • Event Viewer - System Log
     • Event Viewer - Application Log
     • RIS DataCollector PerfMonLog
     • Logs (set to Detailed/Debug trace level) for the service that generated the coredump. It is a good idea to get CallManager logs even if CallManager is not the application that crashed.
     • Coredump file (required to submit an escalation to BU).
  9. Troubleshoot Unresolved Coredumps
     • The logs provide an indication of the system activity prior to the crash.
     • The intention is to isolate any unique events or errors that may have been a factor in triggering the coredump.
     • If the coredump has occurred multiple times, check for repeating patterns of any particular event or error. Identifying the circumstances leading up to the coredump typically expedites the resolution of these issues.
     • Finally, open an escalation with the Business Unit. Use the template on the escalation page to ensure that you have collected all the required information.
     • If it is not a known issue, then you may well be the proud submitter of a new software defect!
  10. Intentional Coredumps: Resource Starvation
     A CallManager service may generate a coredump intentionally. This could be due to:
     • High CPU utilization on the system. CCM may not get access to CPU resources and may crash itself on purpose in order to recover from that state.
     • A blocked thread. If a thread that CCM is trying to use is blocked, CCM crashes itself in an attempt to get out of this state.
  12. Intentional Coredumps: Due to Mem Leak
     • Sometimes, a memory leak may trigger a coredump.
     • Because of an OS limitation, any individual process can allocate a maximum of 3 GB of memory.
     • If the process tries to allocate memory beyond this limit, an intentional coredump is generated.
     • Refer to the next slide to see what the backtrace looks like in this situation.
  13. Intentional Coredumps: Due to Mem Leak
     backtrace
     ===================================
     #0  0x00a157a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
     #1  0x01276825 in raise () from /lib/tls/libc.so.6
     #2  0x01278289 in abort () from /lib/tls/libc.so.6
     #3  0x0050d58b in __gnu_cxx::__verbose_terminate_handler () from /usr/local/cm/lib/libstlport.so.5.1
     #4  0x0050b2a1 in __cxxabiv1::__terminate () from /usr/local/cm/lib/libstlport.so.5.1
     #5  0x0050b2d6 in std::terminate () from /usr/local/cm/lib/libstlport.so.5.1
     #6  0x0050b41f in __cxa_throw () from /usr/local/cm/lib/libstlport.so.5.1
     #7  0x0050b86c in operator new () from /usr/local/cm/lib/libstlport.so.5.1
     #8  0x0a06bb2d in SdlProcessBase::operator new (size=102700) at SdlProcessBase.cpp:105
     #9  0x0a0014e2 in H245SessionManager::create (parentId={mSdlProcessName = 0x0, mSdlNodeId = 4, mSdlAppId = 100, mSdlProcessNumber = 150, mSdlProcessInstance = 2629}, vH245TerminalType=H245_Gateway, vH245TransportConnectionMode=H245Client, vH245IpAddress=404699044, vH245IpPort=40076, vTCPTos=96, vPassThruMSD=false, vTCSTimeout=10, vFastStartInd=0, vFsAudioOutgoingLCN=0, vFsAudioIncomingLCN=0, pktCaptureContext=0xbffab74d "", allowTCPKeepAlivesForH323=true) at ProcessH245SessionManager.cpp:221
     #10 0x08a5629c in H245Interface::start_Transition (this=0xbff99008, s=@0x5c70990) at /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:123
     #11 0x08a99354 in H245Interface::fireSignal (this=0xbff99008, sdlSignal=@0x5c70990) at /vob/ccm/Common/Include/Sdl/SdlProcessBase.hpp:175
     #12 0x0a06c904 in SdlProcessBase::inputSignal (this=0xbff99008, rSignal=0x5c70990, traceType=SdlSystemLog::SignalRouterThread, highPriority=0, normalPriority=0, lowPriority=0, veryLowPriority=0, lazyPriority=0, dbUpdatePriority=0) at SdlProcessBase.cpp:397
     #13 0x0a0746ce in SdlRouter::callProcess (this=0xe225ac0, _sdlSignal=0x5c70990, _deleteSignal=@0x36b8d07, _traceType=SdlSystemLog::SignalRouterThread, _hp=0, _np=0, _lp=0, _vlp=0, _lzp=0, _dbp=0) at SdlRouter.cpp:371
     #14 0x0a0740f3 in SdlRouter::scheduler (sdlRouter=0xe225ac0) at SdlRouter.cpp:281
     #15 0x05514bd7 in ACE_OS_Thread_Adapter::invoke (this=0xfe57a30) at OS_Thread_Adapter.cpp:94
     #16 0x054d5087 in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137
     #17 0x00db73cc in start_thread () from /lib/tls/libpthread.so.0
     #18 0x0131a96e in clone () from /lib/tls/libc.so.6
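The pattern in this backtrace (a failed `operator new` that throws, followed by `std::terminate`, `abort`, and `raise`) can be recognized programmatically. The sketch below is a rough heuristic, not a Cisco tool: the function names `frame_names` and `classify`, the labels it returns, and the choice to inspect only the top eight frames are all assumptions made for illustration.

```python
import re

# Matches gdb-style frames such as "#7 0x0050b86c in operator new () from ...".
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.+?)\s*\(")


def frame_names(backtrace):
    """Return the function name of every frame line, in order."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is not None:
            names.append(match.group(1))
    return names


def classify(backtrace, top=8):
    """Heuristically label a backtrace based on its top frames."""
    names = frame_names(backtrace)[:top]
    aborted = "abort" in names
    threw = "__cxa_throw" in names
    allocating = any(name.endswith("operator new") for name in names)
    if aborted and threw and allocating:
        return "intentional: allocation failure"
    if aborted:
        return "intentional: abort"
    return "unclassified"


# First frames of the backtrace on the slide above.
MEM_LEAK_TOP = """\
#0 0x00a157a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x01276825 in raise () from /lib/tls/libc.so.6
#2 0x01278289 in abort () from /lib/tls/libc.so.6
#3 0x0050d58b in __gnu_cxx::__verbose_terminate_handler () from /usr/local/cm/lib/libstlport.so.5.1
#4 0x0050b2a1 in __cxxabiv1::__terminate () from /usr/local/cm/lib/libstlport.so.5.1
#5 0x0050b2d6 in std::terminate () from /usr/local/cm/lib/libstlport.so.5.1
#6 0x0050b41f in __cxa_throw () from /usr/local/cm/lib/libstlport.so.5.1
#7 0x0050b86c in operator new () from /usr/local/cm/lib/libstlport.so.5.1
#8 0x0a06bb2d in SdlProcessBase::operator new (size=102700) at SdlProcessBase.cpp:105
"""

if __name__ == "__main__":
    print(classify(MEM_LEAK_TOP))
```

A label from this heuristic is only a hint. As the slides note, an intentional coredump is a symptom, and the PerfMon logs are still needed to find the underlying cause.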
  14. Troubleshoot Intentional Coredumps
     • Intentional coredumps typically generate similar backtraces.
     • Searching Topic may yield several hits, but they may not always be pertinent to the issue you are troubleshooting.
     • Remember: an intentional coredump is a symptom of some other problem.
     • If you see an intentional coredump, retrieving and analyzing the PerfMon logs is crucial for determining the CPU and memory utilization prior to the coredump, since that will lead you to the root cause.
  15. Services Not Starting
     • A service not starting is different from a service crash. Often the service never started on system boot.
     • Some possible culprits:
       - Licensing
       - Database
       - Disk space
       - services.conf corruption
       - Software defect
  16. Services Not Starting
     • Run “utils service list” via the CLI:
       - Is the service deactivated?
       - Is the service “Commanded out of service”?
       - Is the service in a “[STOPPED]” state?
     • Assess which services are expected to be started but are not.
  17. Licensing
     • If CCM is not starting, verify in the License Unit Report that the SW_Feature license is loaded and that sufficient NODE licenses are available.
  18. Verify disk space
     • ‘show status’ displays disk usage for the active, inactive, and common partitions.
     • Verify that none are above 97% disk usage.
     • Some services require disk space on the active partition to start and on the common partition for logging purposes.
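The 97% rule is easy to apply once the usage figures have been read off ‘show status’. The sketch below deliberately does not parse the command output, whose exact layout is not shown in the slides; it assumes the percentages have already been collected into a dictionary, and the function name and sample values are hypothetical.

```python
def partitions_over_threshold(usage_percent, threshold=97.0):
    """Return the names of partitions whose usage exceeds the threshold."""
    return sorted(name for name, pct in usage_percent.items() if pct > threshold)


if __name__ == "__main__":
    # Hypothetical figures read manually from 'show status'.
    usage = {"active": 62, "inactive": 61, "common": 98}
    print(partitions_over_threshold(usage))
```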
  19. Symptoms of DB Problems
     • If multiple services will not start and no logs are being written, there may be a problem with Informix.
     • Verify that “A Cisco DB” has started.
     • Run ‘show tech dbstateinfo’:
       - Determine if Informix is online (first line).
       - Find #RSAM to compare the number of DB sessions and the DB memory used per user, similar to ‘onstat -g ses’.
     • Check the Informix logs for DB errors:
         activelog cm/log/informix/ccm.log
  20. Symptoms of DB Problems
     Check for any user with excess sessions open, or for any single session that is using excess DB memory. This may identify a process that needs to be investigated further.
  21. Informix/DNS
     • CSCsw88022 - Database should still start and function when DNS is unavailable. This is fixed as of 7.1(1), as sqlhosts no longer uses DNS.
     • If “dns” is present in the “hosts” line of /etc/nsswitch.conf, then Informix relies on DNS to start up properly (pre 7.1).
     • Check ‘utils network host [fqdn/ip]’. Make sure that external resolution works properly for all CUCM servers, both forward and reverse.
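Both checks on this slide can be sketched in Python for use on a lab machine with file and network access (on CUCM itself the supported route is the CLI command above). The function names are hypothetical. The resolver calls are standard-library `socket.gethostbyname` and `socket.gethostbyaddr`, passed in as parameters so they can be replaced in tests; comparing the reverse name to the queried name exactly is a simplifying assumption, since some environments return a short name or an alias.

```python
import socket


def hosts_line_uses_dns(nsswitch_text):
    """Return True if the 'hosts:' line of nsswitch.conf content lists dns."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("hosts:"):
            return "dns" in line.split(":", 1)[1].split()
    return False


def forward_reverse_ok(hostname,
                       forward=socket.gethostbyname,
                       reverse=socket.gethostbyaddr):
    """Return True if hostname resolves to an IP that resolves back to it."""
    try:
        ip_address = forward(hostname)
        reverse_name = reverse(ip_address)[0]
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return False
    return reverse_name.lower().rstrip(".") == hostname.lower().rstrip(".")


if __name__ == "__main__":
    print(hosts_line_uses_dns("hosts:      files dns\n"))
```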
  22. Services Deactivated After Reboot
     • The ‘services.conf’ file is located in /usr/local/platform/conf.
     • It contains a list of which services to activate on boot.
     • If the disk is full, this file might be recreated as a zero-byte file. This will cause all services to be deactivated on startup.
     • Remedy the disk situation.
     • As a workaround, restore services.conf from another server or a lab server of the same version.
     • After service is restored, advise the customer to rebuild the corrupted node.
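With root access, the zero-byte condition described above can be confirmed with a one-line size check. The helper name below is hypothetical, and the path in the usage line simply combines the directory and file name given on the slide.

```python
import os


def is_zero_byte(path):
    """Return True if path is an existing regular file with no content."""
    return os.path.isfile(path) and os.path.getsize(path) == 0


if __name__ == "__main__":
    print(is_zero_byte("/usr/local/platform/conf/services.conf"))
```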
  23. Troubleshoot Server Freezes
     Problem symptoms:
     • The server had been running fine for a number of minutes, months, or years and then suddenly stopped responding.
     • The server cannot be accessed via the web, ssh, or the console.
     • All CUCM services stopped responding.
  24. Troubleshoot Server Freezes
     • Check the console for any messages, e.g.:
         EXT3-fs error (device sda6) in start_transaction: Journal has aborted
     • The errors may also be written to Event Viewer - System Log, but this can only be viewed after a system reboot. Note that it may not capture all messages displayed on the console.
       Note: you can access the console using iLO (on HP servers) or using IMM (on supported IBM servers).
     • Reboot the server. A recovery disc may be required to ensure that the file system has fully recovered.
     • Check for hardware issues.
     • If none of the above reveals the cause, then enable netdump using the CLI to gather information for subsequent failures.
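When console or syslog text has been captured to a file, messages of this form can be picked out with a regular expression. The pattern below is derived only from the single example line on the slide, so treating it as the general EXT3 error format is an assumption; the names `EXT3_ERROR_RE` and `find_ext3_errors` are hypothetical.

```python
import re

# Derived from the one example message on the slide.
EXT3_ERROR_RE = re.compile(
    r"EXT3-fs error \(device (?P<device>[^)]+)\) in (?P<function>\w+): (?P<message>.+)"
)


def find_ext3_errors(text):
    """Return a dict (device, function, message) for each EXT3 error found."""
    return [match.groupdict() for match in EXT3_ERROR_RE.finditer(text)]


if __name__ == "__main__":
    captured = "EXT3-fs error (device sda6) in start_transaction: Journal has aborted\n"
    for error in find_ext3_errors(captured):
        print(error["device"], error["function"], error["message"])
```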
  25. dmesg
     • dmesg (for "display message") can be used to print the message buffer of the kernel.
     • This buffer contains diagnostic messages (for example, when I/O devices encounter errors). The messages are typically displayed on the console, but the console output can quickly get overwritten.
     • If the file system becomes read-only, syslog messages are no longer written to the syslog file on disk, but the messages still exist in kernel memory.
     • dmesg provides a mechanism to review these messages at a later time.
     • Currently, this command has to be executed as root. There is an enhancement defect, CSCtc59353, to get this information directly from the admin CLI.
  26. Hardware Problems: Server Self Diagnostics
     Power On Self Test (POST)
     • During boot-up, the server tests all hardware for functionality.
     • Failure of any device results in a POST error, which is indicated on screen, by an audible error (beeps), or by an amber/red light.
     • Hard drives have an indicator light: green is the normal running state; amber or red indicates a problem.
     • Inspect the hardware report for SMART errors. These may occur if a disk has a large number of bad sectors. In this case the light may still be green.
     • Lights on the front of the server and on the motherboard can help indicate failing hardware.
     If there is a red or amber light on the front of the server, run the vendor diagnostics to get more details.
  27. Vendor Diagnostics (HP/IBM)
     • IBM and HP require bootable hardware diagnostics discs to be run.
     • IBM servers require DSA.
     • HP servers require SmartStart.
     • Detailed steps are provided in the email templates on TAC-Wiki:
       http://tac-wiki/Communications_Manager_Hardware_failure
  28. File System Issues
     A forced reboot or hard reset can damage the file systems and prevent the server from booting. This can also be caused by a firmware bug or a hardware problem (e.g., a bad hard drive).
     Symptoms:
     • The server does not boot completely. The console may indicate:
         *** An error occurred during the file system check.
         *** Dropping you to a shell; the system will reboot
         *** when you leave the shell.
         Give root password for maintenance
         (or type Control-D to continue):
     • The server displays file system related errors on boot:
         EXT3-fs error (device ...) in start_transaction: Journal has aborted
     • The server indicates that a manual file system check (fsck) is required.
  29. File System Issues
     Resolution:
     • Boot the server using the CUCM recovery disc.
     • Execute the automatic and manual file system check.
     • It is always suggested to use the latest recovery disc regardless of product version.
       Note: Prior to CUCM 6.1.4 and CUCM 7.0.2, the recovery disc contained manual [m] and automatic [f] fsck options. The automatic option [f] was not effective and sometimes did not resolve the issue. The manual option [m] worked fine in all cases. Starting with CUCM 6.1.4 and CUCM 7.0.2, the fsck logic was enhanced and the recovery disc menu was updated to contain the automatic option only [refer to CSCsu08170].
     • Not all file system corruptions can be fixed. You might have to perform a fresh install and execute a DRS restore.
     • If the system is still experiencing issues, this points to hardware failure. Install new hard drives and then perform a fresh install with DRS recovery.
     • A frequently observed bug is CSCta73022. If the /common partition is affected, BU recommends rebuilding the server.
  30. Kernel Panic
     • A kernel panic is an action taken by an operating system upon detecting an internal fatal error from which it cannot safely recover.
     • Attempts by the operating system to read an invalid or non-permitted memory address are a common source of kernel panics.
     • In many cases, the operating system could continue operation after memory violations have occurred. However, the system is in an unstable state, and rather than risking security breaches and data corruption, the operating system stops to prevent further damage and facilitate diagnosis of the error.
     • A kernel panic may also occur as a result of a hardware failure or a bug in the operating system.
     • This is similar to the Windows "Bug Check" (aka "Blue Screen of Death").
     • IPVMS, CSA and FIOR are the Cisco kernel modules that may cause a kernel panic. You can try disabling them as a workaround.
  31. Netdump
     • Use netdump to troubleshoot kernel panic issues.
     • Netdump uses UDP port 6666.
     • The dump contains information that indicates where the kernel panicked.
     • Netdump utilizes a client-server model.
     • Netdump does not work with NIC teaming enabled.
  32. Configuring Netdump
     Configure the netdump server:
     1. Log in to the server designated as the netdump server.
     2. Start the netdump server:
          utils netdump server start
     3. Enter the following command for each of the netdump client machines:
          utils netdump server add-client <Ip-Addr-of-netdump-client>
     4. Enter the following command to verify the status of the netdump server:
          utils netdump server status
     5. Use the following command to verify the clients on the list:
          utils netdump server list-clients
  34. Configuring Netdump
     Configure the netdump client:
     1. Log in to the server designated as the netdump client.
     2. Start the netdump client:
          utils netdump client start <Ip-Addr-of-netdump-server>
     3. Enter the following command to verify the status of the netdump client:
          utils netdump client status
  35. Configuring Netdump
     Verify that the client and server are communicating.
     • After configuring the netdump server and netdump client, execute the following command on the netdump server:
         file list activelog crash/
     • You should see a new sub-directory named with the client IP address and the date and time when it started:
         admin:file list activelog crash/
         <dir> 14.48.60.80-2010-03-05-11:30
         <dir> magic
         <dir> scripts
         dir count = 3, file count = 0
         admin:
     • A new sub-directory is created each time the netdump client is restarted.
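The sub-directory name encodes the client address and start time, so a listing can be filtered down to client directories mechanically. The sketch below assumes an IPv4 address followed by a `YYYY-MM-DD-HH:MM` stamp, which is inferred from the single example in the listing above; the function names are hypothetical.

```python
import ipaddress
from datetime import datetime


def parse_netdump_dir(name):
    """Split '<client-ip>-<YYYY-MM-DD-HH:MM>' into (ip address, datetime).

    Raises ValueError for names that do not follow that assumed layout.
    """
    ip_text, _, stamp = name.partition("-")
    client_ip = ipaddress.ip_address(ip_text)
    started = datetime.strptime(stamp, "%Y-%m-%d-%H:%M")
    return client_ip, started


def client_dirs(names):
    """Keep only directory names that parse as netdump client directories."""
    kept = []
    for name in names:
        try:
            parse_netdump_dir(name)
        except ValueError:
            continue
        kept.append(name)
    return kept


if __name__ == "__main__":
    listing = ["14.48.60.80-2010-03-05-11:30", "magic", "scripts"]
    print(client_dirs(listing))
```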
  36. Netdump: Example
     !!DO NOT TRY THIS IN A PRODUCTION ENVIRONMENT!!
     On the netdump client machine, trigger a kernel panic:
     The console displays:
  37. Netdump: Example
     The netdump diagnostic information is stored in a sub-directory of the /var/crash location on the netdump server:
     Contents of the log file:
  38. ASR: Automatic Server Recovery
     • Applicable only to HP servers. Enabled by default.
     • ASR is implemented via the HP ASM (Advanced System Management) driver.
     • ASR is implemented via a 10-minute countdown timer.
     • During regular operation, the ASM driver frequently resets this timer to prevent it from counting down to zero.
     • If the timer counts down to 0, it is assumed that the operating system is locked up, and the system automatically attempts to reboot.
     • Collect the IML (Integrated Management Log) logs from the system using the following command:
         file view system-management-log

         ID    Severity  Initial Time      Update Time       Count
         -------------------------------------------------------------
         0000  Critical  20:44 04/02/2007  20:44 04/02/2007  0001
         LOG: ASR Detected by System ROM
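The countdown-and-reset behavior described above is a classic watchdog timer, which can be modeled in a few lines. This is a toy simulation for illustration only, not HP's driver logic: the class name, its methods, and the use of seconds are assumptions; only the 10-minute timeout comes from the slide.

```python
class WatchdogTimer:
    """Toy model of a countdown timer that a healthy OS keeps resetting."""

    def __init__(self, timeout_s=600.0):  # 10 minutes, per the slide
        self.timeout_s = timeout_s
        self.remaining_s = timeout_s

    def kick(self):
        """Reset the countdown, as the driver does during normal operation."""
        self.remaining_s = self.timeout_s

    def tick(self, elapsed_s):
        """Advance time; return True once the countdown has reached zero."""
        self.remaining_s = max(0.0, self.remaining_s - elapsed_s)
        return self.remaining_s == 0.0


if __name__ == "__main__":
    watchdog = WatchdogTimer()
    watchdog.tick(300)         # OS still healthy after 5 minutes
    watchdog.kick()            # driver resets the timer
    print(watchdog.tick(600))  # no reset for 10 minutes: recovery would fire
```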
  39. IMM: Integrated Management Module
     • Newer IBM servers such as the 7835-I3 and the 7845-I3 include IBM’s IMM.
     • The IMM has an OS Watchdog feature that is similar to HP’s ASR. This feature is disabled by default.
     • Refer to CSCte05285, which tracks the enhancement request to include the server recovery functionality in the new IBM servers.
     • You can access the IMM using its own Ethernet port (labelled ‘System Mgmt’).
  45. IMM: Integrated Management Module
     The IMM is set initially with a user name of USERID and a password of PASSW0RD (with a zero, not the letter O).
  52. HP vs. IBM
     Automated Recovery
       HP:  Enabled in all HP servers by default. To view corresponding logs: ‘file view system-management-log’
       IBM: Supported in newer IBM servers only [7835-I3 and 7845-I3] via IMM. Disabled by default. To view corresponding logs: <TBD>
     In-depth vendor diagnostics (requires downtime)
       HP:  SmartStart CD (bootable)
       IBM: DSA CD (bootable)
     High-level system diagnostics (does not require downtime)
       HP and IBM, CLI commands:
         utils create report hardware
         utils diagnose test
         show hardware
         show environment
  53. 53. Case study-1
      • TAC case: 611181361.
      • Problem description: The customer opened a TAC case to investigate the following alarm:
        04/06/2009 20:38:26.455 LPM|GenAlarm: AlarmName = CoreDumpFileFound, DeviceName = fm11d-bq50vcm1, AlarmMsg = CoreDumpFileFound
        TotalCoresFound : 1
        CoreDetails : The following lists up to 6 cores dumped by corresponding applications.
        Core1 : Unknown (core.3733.11.showtechCCMDB.s.1239075504)
        AppID : Cisco Log Partition Monitoring Tool
  54. 54. Case study-1
      • Backtrace:
        #0 0x080ba54c in glob_filename ()
        #1 0x080ba5a2 in glob_filename ()
        #2 0x080ba5a2 in glob_filename ()
        #3 0x080ba5a2 in glob_filename ()
        #4 0x080ba5a2 in glob_filename ()
        #5 0x080ba5a2 in glob_filename ()
        #6 0x080ba5a2 in glob_filename ()
        #7 0x080ba5a2 in glob_filename ()
        #8 0x080ba5a2 in glob_filename ()
        #9 0x080823b2 in shell_glob_filename ()
        #10 0x0807ed3d in expand_words_shellexp ()
        #11 0x0807f26c in expand_words_shellexp ()
        #12 0x0807ec19 in expand_words ()
        #13 0x08069766 in execute_command_internal ()
        #14 0x08066d9c in execute_command_internal ()
        #15 0x08094822 in parse_and_execute ()
        #16 0x0807b3b2 in command_substitute ()
        #17 0x0807e223 in pat_subst ()
        #18 0x08079700 in cond_expand_word ()
        #19 0x080797c1 in cond_expand_word ()
        #20 0x08079819 in expand_string_unsplit ()
        #21 0x08079478 in string_rest_of_args ()
        #22 0x08078f8c in strip_trailing_ifs_whitespace ()
        #23 0x08079029 in do_assignment ()
        #24 0x0807f2b4 in expand_words_shellexp ()
        #25 0x0807ec19 in expand_words ()
        #26 0x08069766 in execute_command_internal ()
        #27 0x08066d9c in execute_command_internal ()
        #28 0x08067f09 in execute_command_internal ()
        #29 0x08066fde in execute_command_internal ()
        #30 0x080668a4 in execute_command ()
        #31 0x08067ed2 in execute_command_internal ()
        #32 0x08066fde in execute_command_internal ()
        #33 0x080668a4 in execute_command ()
        #34 0x08067ed2 in execute_command_internal ()
        #35 0x08066fde in execute_command_internal ()
        #36 0x080668a4 in execute_command ()
        #37 0x08067ed2 in execute_command_internal ()
        #38 0x08066fde in execute_command_internal ()
        #39 0x080668a4 in execute_command ()
        #40 0x08068e94 in execute_command_internal ()
        #41 0x08066f6d in execute_command_internal ()
        #42 0x080668a4 in execute_command ()
        #43 0x0805c969 in reader_loop ()
        #44 0x0805ae9b in main ()
  55. 55. Case study-1
      • The backtrace contained function names such as ‘execute_command_internal’, ‘parse_and_execute’, and ‘expand_words_shellexp’.
      • This most likely meant that the coredump was related to a CLI command.
      • Next, the following traces were retrieved and analyzed:
        - Cisco CallManager Admin
        - IPT Platform CLI Logs
      • The IPT Platform CLI logs revealed that “show tech locales” was the last CLI command executed just before the coredump occurred.
      • A Topic search did not yield any known bugs.
      • An escalation was submitted to the Business Unit (BU).
      • CSCsz24566 was then filed. It was eventually resolved by the BU.
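The reasoning in this case study (spotting shell-interpreter function names in a backtrace) can be sketched in Python. This is only an illustration, not a Cisco tool: `SHELL_HINTS`, `frame_functions`, and `looks_like_shell_core` are hypothetical names, the hint list contains only the three function names quoted on the slide, and the regular expression assumes gdb-style frames of the form `#N 0xADDR in function (...)`.

```python
import re

# Function names quoted in the case study. The set name is hypothetical.
SHELL_HINTS = {
    "execute_command_internal",
    "parse_and_execute",
    "expand_words_shellexp",
}

# Assumed frame layout: "#13 0x08069766 in execute_command_internal ()".
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+([\w:~]+)")


def frame_functions(backtrace):
    """Return the function name of every frame, in backtrace order."""
    return [match.group(2) for match in FRAME_RE.finditer(backtrace)]


def looks_like_shell_core(backtrace):
    """True if any frame is one of the shell-related hint functions."""
    return any(name in SHELL_HINTS for name in frame_functions(backtrace))


sample = """
#12 0x0807ec19 in expand_words ()
#13 0x08069766 in execute_command_internal ()
#15 0x08094822 in parse_and_execute ()
"""

print(frame_functions(sample))
print(looks_like_shell_core(sample))
```

A match is only a hint that the CLI logs should be reviewed next. It does not identify the command that was running.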
  56. 56. Case study-2
      • TAC case: 612476435.
      • Problem description: The CallManager service coredumps every two and a half days.
        admin:utils core active list
        Size   Date                  Core File Name
        =================================================================
               2009-09-13 08:03:25   core.9800.6.ccm.1252843074
               2009-09-15 15:58:52   core.2497.6.ccm.1253044183
               2009-09-18 00:03:38   core.3564.6.ccm.1253245847
               2009-09-20 08:00:16   core.6676.6.ccm.1253447596
               2009-09-22 16:00:18   core.8282.6.ccm.1253649103
      • Backtrace:
        #0 0x001627a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
        #1 0x00d64815 in raise () from /lib/tls/libc.so.6
        #2 0x00d66279 in abort () from /lib/tls/libc.so.6
        #3 0x084c4e7a in preabort () at ProcessCMProcMon.cpp:101
        #4 0x084c4e92 in IntentionalAbort (reason=0xa9fdbdc "CallManagers timers appear incorrect. This may be due to CPU or blocked function. Attempting to restart CallManager.") at ProcessCMProcMon.cpp:106
        #5 0x084c66c3 in CMProcMon::verifySdlTimerServices () at ProcessCMProcMon.cpp:843
        #6 0x084c7035 in CMProcMon::callManagerMonitorThread (cmProcMon=0xec122d0) at ProcessCMProcMon.cpp:439
        #7 0x0107e5fb in ACE_OS_Thread_Adapter::invoke (this=0xf3ef3b8) at OS_Thread_Adapter.cpp:94
        #8 0x01040cbf in ace_thread_adapter (args=0x0) at Base_Thread_Adapter.cpp:137
        #9 0x002dc3cc in start_thread () from /lib/tls/libpthread.so.0
        #10 0x00e061ae in clone () from /lib/tls/libc.so.6
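The regular spacing of these cores can be checked directly from the file names. The sketch below assumes the name layout is `core.<pid>.<signal>.<application>.<epoch seconds>`. That layout is inferred from the names on this slide, not taken from Cisco documentation, and `parse_core_name` is a hypothetical helper. Under that assumption the `6` field would be signal 6 (SIGABRT on Linux), which is consistent with the `abort ()` frame in the backtrace.

```python
def parse_core_name(name):
    """Split a core file name using the ASSUMED layout
    core.<pid>.<signal>.<application>.<epoch seconds>."""
    parts = name.split(".")
    if len(parts) < 5 or parts[0] != "core":
        raise ValueError(f"unexpected core file name: {name}")
    return {
        "pid": int(parts[1]),
        "signal": int(parts[2]),
        # The application field may itself contain dots.
        "app": ".".join(parts[3:-1]),
        "epoch": int(parts[-1]),
    }


# Core file names copied from the 'utils core active list' output above.
cores = [
    "core.9800.6.ccm.1252843074",
    "core.2497.6.ccm.1253044183",
    "core.3564.6.ccm.1253245847",
    "core.6676.6.ccm.1253447596",
    "core.8282.6.ccm.1253649103",
]

epochs = [parse_core_name(name)["epoch"] for name in cores]

# Hours elapsed between each pair of consecutive cores.
hours_between = [(later - earlier) / 3600 for earlier, later in zip(epochs, epochs[1:])]

for name, hours in zip(cores[1:], hours_between):
    print(f"{name}: {hours:.1f} hours after the previous core")
```

With the names above, every interval comes out at roughly 56 hours. A period that regular fits a resource that is exhausted at a steady rate better than it fits a random crash.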
  57. 57. Case study-2
      • The backtrace indicates that this is an intentional coredump.
      • The next step is therefore to review the Perfmon data and check for:
        - CPU utilization
        - Memory leaks
      • The CPU utilization looks steady prior to the coredump.
  58. 58. Case study-2
      • The %VM Used counter appears to be high.
      • The VMSize for CCM is high. Also note how the line slopes upward, which signifies increasing memory usage over time.
      => The data points to a CCM memory leak.
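An upward-sloping VMSize line can also be quantified rather than judged by eye. The sketch below fits a least-squares line through (hours, VMSize) samples. The sample values are hypothetical and are not taken from this case, `slope_per_hour` is an illustrative helper, and exporting the counters from Perfmon is not shown.

```python
def slope_per_hour(samples):
    """Least-squares slope for a list of (hours, value) pairs."""
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    return numerator / denominator


# HYPOTHETICAL readings: (hours since service start, VMSize in MB).
vmsize_samples = [(0, 1200), (12, 1290), (24, 1385), (36, 1470), (48, 1560)]

growth = slope_per_hour(vmsize_samples)
print(f"VMSize grows by about {growth:.2f} MB per hour")
```

A slope that stays positive across several restart cycles supports the memory-leak reading. A slope near zero would suggest looking elsewhere.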
  59. 59. Case study-2
      • An escalation was submitted to the Business Unit (BU).
      • A software defect, CSCtc70568, was filed on the BU's recommendation.
      • High-level analysis of why CCM coredumped: Because of the memory leak, an internal data structure grew very large. When a new entry was subsequently added, the data structure had to be resized to accommodate it. The resize operation took a long time, and the CallManager service coredumped as a result.
      • CSCtc70568 was eventually marked as a duplicate of CSCsx25778.
  60. 60. Commonly Found Crash Defects
      • CSCsv49493 – 7828-H3 goes down with journal aborted error
      • CSCta73022 – 7835-I2/7845-I2 file system read-only mode journal aborted error
      • CSCtb89163 – CER defect for above
      • CSCtb79203 – 7845H server read only
      • CSCte19556 – Core while deleting H323 Gateway part of RG
      • CSCtd58872 – Cdcc to check the return value from getSideGivenCI prevent CCM core
      • CSCte44391 – kpml message over 24 character causes ccm coredump
      • CSCsl74589 – HardwareFailureAlert is raised due to iLO 2 Comm Error
      • CSCsl01006 – CCM core when making call while updating pickup group
      • CSCsk21012 – process core due to File size limit exceeded
  61. 61. Q/A
      • Questions?