Open Mic: crash, hang, monitoring


Published in: Technology, Business

  1. Open Mic on Overview of Domino server crash, hang and health check. 8th August, 2013
  2. Open Mic Team
     - Ranjit Rai – Lotus Technical Advisor, focusing on entire Notes/Domino
     - Hansraj Mali – Lotus Technical Advisor, focusing on entire Notes/Domino
     - Vinayak Tavargeri – Lotus Support Manager, Open Mic Facilitator
     - Jayaval Rajendran – Lotus Technical Advisor, focusing on entire Notes/Domino
     - Sukanya Deepthi – Lotus Technical Support Engineer, Presenter
  3. Agenda
     ➢ What is a Domino server crash?
     ➢ Causes of a Domino server crash
     ➢ What is NSD / causes of an incomplete NSD
     ➢ What is a Domino server hang / performance issue
     ➢ Causes of a Domino server hang / performance issue
     ➢ Data collection for a crash
     ➢ Performance/hang data collection
     ➢ How to monitor server health (SAI, DCT, DDM and server commands)
     ➢ What the Domino Diagnostic Probe is and how to use it
     ➢ Best practices
     ➢ Resources
     ➢ Q&A
  4. Definition of Domino Server crash
     ➢ A crash is a controlled shutdown of the Domino processes.
     ➢ In most cases the shutdown handler generates an NSD so that the cause of the crash can be identified and prevented.
     ➢ The shutdown handler terminates every process that loads the Notes runtime (Nnotes.dll).
     ➢ In most cases a descriptive message pointing to the cause of the crash can be found in console.log. For example:
       Thread=[00E3:012T] PANIC: Insufficient Memory
  5. What could be the causes of a server to crash?
     ● In most cases, a defect in the code.
     ● Processing an uninitialized piece of data, which can cause an 'Access violation'.
     ● Accessing a corrupt data structure. A crash in this case may be desirable, since continuing could cause corrupt data to be written to disk.
     ● Unhandled/unknown errors.
     ● Low resources, e.g. memory.
     ● The OS can simply shut down the processes (uncontrolled) because they use too many resources. This may not produce an NSD.
  6. What is NSD?
     ➢ NSD - Notes System Diagnostic
     ➢ It is the most important diagnostic data used in troubleshooting issues such as crashes, hangs, performance problems, memory-related problems and any errors on the console.
     ➢ It is enabled by default in the Server document --> Basics tab --> Fault Recovery section --> “Run NSD to Collect Diagnostic Information”.
     ➢ The NSD log file is created by default in the IBM_TECHNICAL_SUPPORT directory. Its name includes the date and time when the file was created.
     ➢ The name format is: nsd_<Platform>_<ServerName>_YYYY_MM_DD@HH_MM_SS.log
     Note: It is recommended to leave the default settings on.
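When several NSD files pile up in IBM_TECHNICAL_SUPPORT, the name format above is enough to recover which server and which moment each file belongs to. The following is a minimal sketch, assuming the platform tag contains no underscore; the function name `parse_nsd_name` and the sample file name are hypothetical and only illustrate the format quoted on the slide.

```python
import re
from datetime import datetime

# Pattern follows the name format quoted on the slide:
# nsd_<Platform>_<ServerName>_YYYY_MM_DD@HH_MM_SS.log
# Assumption: the platform tag itself contains no underscore.
NSD_NAME = re.compile(
    r"^nsd_(?P<platform>[^_]+)_(?P<server>.+)_"
    r"(?P<stamp>\d{4}_\d{2}_\d{2}@\d{2}_\d{2}_\d{2})\.log$",
    re.IGNORECASE,
)


def parse_nsd_name(filename):
    """Return (platform, server, created) or None if the name does not match."""
    match = NSD_NAME.match(filename)
    if match is None:
        return None
    created = datetime.strptime(match.group("stamp"), "%Y_%m_%d@%H_%M_%S")
    return match.group("platform"), match.group("server"), created


# Hypothetical file name, used only to exercise the pattern.
print(parse_nsd_name("nsd_W32I_MAIL01_2013_08_08@10_15_30.log"))
```

Sorting the parsed timestamps makes it easy to pick out the NSD that was written closest to the time of the incident.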
  7. Reasons for Incomplete NSD
     - On Windows, do not click the Cancel button while NSD is running at the time of a crash.
     - On UNIX, do not kill any or all processes with the nsd -kill command.
  8. My server crashed, Now what?
     1. Send the NSD and console.log files from the IBM_TECHNICAL_SUPPORT folder to the Support team.
     2. Support matches the fatal stack against the known defects for the release the customer is running.
        2.1 If there is a match and a coded fix exists:
            → A hotfix is provided. If the next MR already contains the fix, an upgrade may be suggested instead and is preferable in most cases; running more than a handful of hotfixes can be bad.
            → If no hotfix is available, a new fix request is submitted. It usually takes a day or two to build.
        2.2 If the crash is new, an extensive review/collaboration is usually carried out to find the cause of the crash, and a means of alleviating it can be suggested. For example, a crash caused by corrupt data or a low-memory condition can often be alleviated by running maintenance on the database or by reducing memory usage.
  9. Continued..
        2.3 If there is not enough data to move forward, more data is needed with some debug ini settings turned on (e.g. for memory overwrite/corruption). This may require a few iterations of data collection with special ini settings enabled. Support is usually explicit about this.
     3. Once a fix is provided, monitor the server after applying it.
     IMPORTANT: Be sure to turn off any ini settings that were suggested during the course of troubleshooting.
  10. Best Practices in minimizing crashes/downtimes
      ● Always run the latest MR.
      ● Run only what is needed.
      ● Perform periodic maintenance of the key databases.
      ● Enable transaction logging on large servers. This ensures the server comes back online quickly after a crash.
      ● If NSD takes a long time to finish, report this to Support. Support may suggest a few parameters to speed it up depending on the conditions, e.g. nsd -nomemcheck and -nodirlist can speed up NSD generation.
  11. What is hang/performance
      A hang is a situation where the Domino server is still running and the Domino console is still visible, but one or more tasks on the server are not responding to requests. These tasks may still be active, yet they do not answer. It is a state that sometimes occurs when programs do not run as designed. Most of the time a hang is caused by a low-level loop or by the permanent unavailability of a resource, and it leads to serious performance issues. In this case NSD is not generated automatically; it must be run manually while the issue is occurring.
  12. Causes of Domino Server Hang/Performance
      ➢ Resource problems (insufficient resources)
      ➢ Third-party application conflicts
      ➢ Hardware problems such as:
          - High CPU usage
          - High memory usage
          - Slow disk I/O
          - Network-related issues
      ➢ In general, server hangs are more difficult to analyze than server crashes.
  13. Data collection for Domino hang/performance
      ➢ Enable the debug parameters below:
          - debug_threadid=1
          - Console_log_enabled=1
          - debug_show_timeout=1
          - debug_capture_timeout=1
          - Server_show_performance=1
      ➢ Run back-to-back manual NSDs.
      ➢ Collect the NSDs, console log and semdebug files.
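Before reproducing a hang it helps to confirm that all five parameters listed above are really set. The sketch below checks the text of a notes.ini for them; the function name `missing_debug_settings` is hypothetical, parameter names are compared case-insensitively, and lines starting with a semicolon are treated as comments.

```python
# The five parameters and values are taken from the slide above.
REQUIRED = {
    "debug_threadid": "1",
    "console_log_enabled": "1",
    "debug_show_timeout": "1",
    "debug_capture_timeout": "1",
    "server_show_performance": "1",
}


def missing_debug_settings(ini_text):
    """Return the required parameters that are absent or not set to 1."""
    found = {}
    for line in ini_text.splitlines():
        line = line.strip()
        # Skip blanks, commented-out lines and lines without a key=value pair.
        if not line or line.startswith(";") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        found[key.strip().lower()] = value.strip()
    return sorted(name for name, want in REQUIRED.items() if found.get(name) != want)


sample = "[Notes]\ndebug_threadid=1\nConsole_log_enabled=1\n"
print(missing_debug_settings(sample))
```

An empty result means every parameter from the slide is present and set to 1.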
  14. Difference between crash and hang
      Crash:
        1. All Domino processes end.
        2. NSD runs automatically.
      Hang:
        1. Domino processes keep running, but users receive no response from Domino to their requests.
        2. NSD must be run manually.
  15. What to do when Domino server is completely down
      - When the Domino server is continuously crashing or throwing semaphores at startup, first try the following to bring it up:
          - Recreate log.nsf
          - Recreate the mail boxes
          - Recreate transaction logging (if it is enabled)
      - If the Domino server still does not come up, comment out the ServerTasks line in notes.ini by putting a semicolon (;) in front of it, then load the tasks one after the other to see which task has the issue.
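The last step above can be sketched in a few lines. This is an illustration only, not an IBM tool: the helper names `comment_out_server_tasks` and `server_tasks` are hypothetical, and the code works on the text of notes.ini rather than on the file, so a backup copy can be kept before anything is written back.

```python
def comment_out_server_tasks(ini_text):
    """Prefix the ServerTasks line with a semicolon, leaving other lines untouched."""
    lines = []
    for line in ini_text.splitlines():
        # Match only "ServerTasks=", not related keys such as "ServerTasksAt1=".
        if line.strip().lower().startswith("servertasks="):
            lines.append(";" + line)
        else:
            lines.append(line)
    return "\n".join(lines)


def server_tasks(ini_text):
    """Return the task names from the ServerTasks line, in order."""
    for line in ini_text.splitlines():
        if line.strip().lower().startswith("servertasks="):
            _, _, value = line.partition("=")
            return [task.strip() for task in value.split(",") if task.strip()]
    return []


# Hypothetical notes.ini content, used only for the demonstration.
sample = "[Notes]\nServerTasks=Update,Replica,Router\nServerTasksAt1=Catalog,Design"
print(comment_out_server_tasks(sample))
print(server_tasks(sample))
```

With the line commented out, each name returned by `server_tasks` can be loaded manually from the server console, one at a time, until the failing task is found.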
  16. Monitoring Server health
      - Server health can be checked in a few ways, as shown below. Here we can see the CPU utilization of each task (the same can be done with Task Manager on Windows or the topas command on UNIX).
  17. Continued..
      - Using Domino Configuration Tuner: DCT pulls information from the notes.ini file, Server documents and Configuration documents, and examines the configuration settings to see whether anything is out of line. It can be downloaded for free from the link below:
        8&context=SWA00&dc=D400&q1=dct
      - Open it, choose the server, name the scan, and click Run.
  18. Continued..
      - Review the results.
  19. Continued..
      - Using Domino Domain Monitoring: here we can see the status of all processes. The main value to check is the availability index. If it is very low, the server may have performance issues. The value can be checked with the command "show ai", which also gives the recommended value for server_transinfo_range.
  20. Continued..
      - Each server in a cluster periodically determines its own workload based on the response time of the requests it has processed recently. The workload is expressed as a number from 0 to 100, where 0 indicates a heavily loaded server and 100 indicates a lightly loaded server. This number is called the server availability index (SAI). As response times increase, the server availability index decreases. The index is based on the expansion factor, which indicates the current workload on a server. You can gauge the average expansion factor during busy times and use this chart to determine a value of Server_Transinfo_Range that will yield approximately the desired SAI. The expansion factor can be obtained from Domino statistics.
  21. Continued..
      - Below we can check the statistics of a particular server. The same can be obtained with the “sh stat” command from the Domino server console.
  22. Console commands
      - “Sh Server”: shows how long the server has been up and running, transactions per minute, the peak number of transactions and when it occurred, the availability index, pending and dead mail, etc.
  23. Continued..
      - “sh task debug”: shows the task details.
  24. Continued..
      - “Sh stat”: shows all statistics. Here we can see the average queue length; if it is greater than 2, there may be disk I/O issues. We can also see the CPU utilization and memory utilization of each task.
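The queue-length rule of thumb above is easy to apply to saved “sh stat” output. The sketch below assumes the output has been captured as text with one `name = value` pair per line; the stat names in the sample (`Platform.LogicalDisk.1.AvgQueueLen` and so on) and the values are assumptions for illustration, so check the names your own server reports. The helper names `parse_stats` and `busy_disks` are hypothetical.

```python
def parse_stats(text):
    """Turn 'name = value' lines into a dictionary of strings."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            stats[name.strip()] = value.strip()
    return stats


def busy_disks(stats, limit=2.0):
    """Return queue-length stats whose value exceeds the limit from the slide."""
    flagged = {}
    for name, value in stats.items():
        if "avgqueuelen" not in name.lower():
            continue
        try:
            number = float(value.replace(",", ""))
        except ValueError:
            continue  # Ignore values that are not plain numbers.
        if number > limit:
            flagged[name] = number
    return flagged


# Assumed sample output; real stat names and values will differ per server.
sample = """
  Platform.LogicalDisk.1.AvgQueueLen = 3.4
  Platform.LogicalDisk.2.AvgQueueLen = 0.2
  Server.Users = 120
"""
print(busy_disks(parse_stats(sample)))
```

Any disk returned by `busy_disks` is a candidate for a closer look with the OS-level tools.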
  25. Continued..
      - “Sh user debug”: shows how many users are connected, their idle time, network address, etc. If sessions stay idle for a long time, use the parameter “server_session_timeout=xx”, where xx is a value in minutes. This forces the server to close a session that has been idle for xx minutes and frees the session memory it was using. 30-45 minutes is the minimum recommended setting for this parameter.
  26. Domino Diagnostic Probe
      ● DDP is a small Java utility provided by Lotus Support to monitor Lotus Domino servers (Domino 8.x and above). It is intended for servers that have been intermittently slow or unresponsive. The probe is a Java process (dbopen.jar) that runs in the background. It probes a specific Domino server over time by invoking database open transactions. If it detects a slow response, it invokes NSD to gather diagnostic data.
      ● How the probe works: the probe is started from a command prompt and runs as a standalone process, not as a Domino server process. It uses the identity of the Domino server and attempts to open a session and the database specified by the -database [-d] parameter every n seconds, as specified by the -polling [-p] parameter. If the time it takes to open the database exceeds the time specified by the -threshold [-t] parameter, the NSD program is launched to collect diagnostic data.
      Note: You must use the IBM Java that comes with Lotus Domino. The probe has not been tested with SUN Java and is therefore not supported.
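The polling logic described above can be sketched as a simple loop. This is not the code of dbopen.jar, only an illustration of the idea under stated assumptions: `open_database` and `run_nsd` are hypothetical callables supplied by the caller, and `max_polls` is added so the sketch terminates.

```python
import time


def run_probe(open_database, run_nsd, threshold_s, polling_s, max_polls,
              clock=time.monotonic, sleep=time.sleep):
    """Time a database open every polling_s seconds; call run_nsd when it is slow.

    Returns the number of polls that exceeded threshold_s.
    """
    slow_polls = 0
    for poll in range(max_polls):
        start = clock()
        open_database()              # hypothetical: opens a session and the database
        elapsed = clock() - start
        if elapsed > threshold_s:    # corresponds to the -threshold [-t] parameter
            run_nsd()                # hypothetical: launches NSD to collect data
            slow_polls += 1
        if poll < max_polls - 1:
            sleep(polling_s)         # corresponds to the -polling [-p] parameter
    return slow_polls
```

The clock and sleep functions are parameters so the loop can be exercised with a fake clock instead of waiting in real time.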
  27. Continued..
      ● Options for the probe: the table below describes the options for the Domino Diagnostic Probe utility.
      To stop the probe, always use the quit command at the command prompt from which you started it. Do not use Ctrl-C or close the window without first issuing a quit command. If you do, the java.exe process terminates abnormally and you will see messages on the Domino server console. In that event, the computer will need to be restarted.
  28. Continued..
      ● Setting up to run the probe:
      1. Create a new database of any type on the Domino server to be monitored, for example maildomprobe.nsf. Once it is created, open the database and go to File -> Application -> Access Control. In the Access Control List, add and then highlight the Domino server name, and change the User type field to Unspecified as shown below. Save the changes.
  29. Continued..
      2. Copy dbopen.jar to the Domino program directory. From a command prompt, switch to the directory where the server’s notes.ini is located (typically the data directory on UNIX and the program directory on Windows). Start the probe from the command prompt using the syntax example below.
         Windows:
         jvm\bin\java -jar dbopen.jar -d maildomprobe.nsf -t 3 -p 30 -nsdoptions "-nomemcheck" -outfile C:\Domino\data\IBM_TECHNICAL_SUPPORT\DomPerfMon.txt
      ● For UNIX and iSeries, check the link below:
      Note: If the Domino server becomes unresponsive for an extended period of time, the probe will execute NSD three times. Once the Domino server is restarted, the probe resumes normal operation.
  30. Best Practices
      - If transaction logging and DAOS are enabled, make sure they are on a different, dedicated drive (not the drive holding the Domino data directory).
      - Maintain plenty of free disk space on the drive where the data directory is located.
      - Schedule monthly server restarts.
      - Do not schedule maintenance tasks (such as compact, updall, fixup) during business hours, and make sure they finish before business hours begin.
      - Mail delivery can be slow when a large group is listed in the bcc field of a mail message. Use the notes.ini variable Disable_BCC_group_expansion=1 to disable bcc group expansion.
  31. Determining if the OS resources are causing the performance issue
      ● AIX - nmon
      ● Windows - perfmon
      ● Linux - linperf
      ● Solaris - sarmon
  32. Resources
      ➢ Title: How to run a manual NSD for Notes/Domino on Windows
      ➢ Title: How to run NSD manually on a Domino server for UNIX platforms
      ➢ Title: How to automate the collection of memory dumps
      ➢ Title: Overview of HTTP Request Logs for Domino Web server
      ➢ Title: What is a basic definition of a Domino server hang and crash?
  33. Q & A