Back from the Dead: When Bad Code Kills a Good Server

It's Friday and a new customer calls. Their mission critical app is taking :05 to open documents and the users are quite concerned. Where do you start when handed a 20-year-old application you have never seen, on a server you barely know? Join two IBM Champions as they dissect a complex Domino performance problem from both the administration and development side to provide a complete customer solution. This session includes best practices around problem-solving techniques and a checklist you can use internally to quickly solve problems you encounter.

Published in: Technology

  1. 1. Back from the Dead: When Bad Code Kills a Good Server May 2, 2017
  2. 2. This webinar is brought to you as part of the free monthly webinar series from:
  3. 3. Howard Greenberg @TLCC Courtney Carter @Teamstudio Bill Malchisky Jr. @BillMalchisky Serdar Basegmez @serdar_basegmez
  4. 4. teamstudio.com/blog
  5. 5. TLCC Courses • The Leader in Notes and Domino Training since 1997 • Self Paced Distance Learning Courses for Notes/Domino – XPages, Development, and Administration (user too!) • OnSite Private Classes • Mentoring/Consulting Services • Free demo courses – Intro. To XPages Development – Application Development 1
  6. 6. XPages Courses! FREE !! Introduction to XPages Development JavaScript for XPages Development XPages Development 1 XPages Development 2 Rapid XPages Development using Application Layout and Dojo UI Controls Java 1 for XPages Developers Java 2 for XPages Developers
  7. 7. Don’t Miss Engage!!! Mon-Tue, May 8-9, 2017 Antwerp, Belgium 80+ Sessions, all in English. TOP speakers from all over the world! 5 Tracks: Business & Strategy, Development, Administration & Deployment, Emerging Technologies and Big Data & Analytics ALL FOR FREE!!! https://engage.ug/
  8. 8. And for the U.S. - MWLUG August 8 – 10th, 2017 Alexandria, VA (just a few minutes from Washington, DC airport) Registration and Call for Abstracts are now open! Over 40 technical sessions and workshops Breakfast and lunch for two days Networking with other IBM Professionals Tuesday evening reception Wednesday networking and fun event Access to experts of IBM solutions Free workshops ALL FOR $75!!! www.mwlug.com
  9. 9. Upcoming and Recorded Webinars • June - SmartNSF - 100% Smart - and in color! www.tlcc.com/xpages-webinar View Previous Webinars (use url above)
  10. 10. Asking Questions – Q and A at the end Use the Orange Arrow button to expand the GoToWebinar panel Then ask your questions in the Questions pane! We will answer your questions verbally at the end of the webinar
  11. 11. Back from the Dead: When Bad Code Kills a Good Server Serdar Basegmez Bill Malchisky @sbasegmez @BillMalchisky
  12. 12. Back from the Dead: When Bad Code Kills a Good Server Serdar Basegmez - Developi - @serdar_basegmez William Malchisky Jr. - ESS - @BillMalchisky DEV-1661 IBM Connect 2017 Conference, 20-23 February 2017
  13. 13. Legal Disclaimer © IBM Corporation 2017. All Rights Reserved. The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results. IBM Lotus® Domino® IBM Lotus® Notes® Lotus® Redbooks® Red Hat® is a registered trademark of Red Hat, Inc. Apple, Mac, Mac OS, iPad, iPhone, and OS X are trademarks of Apple Inc., registered in the U.S. and other countries. Microsoft, Active Directory, and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both. VMware is a registered trademark of VMware, Inc. in the United States and/or other jurisdictions.
Other company, product, or service names may be trademarks or service marks of others. All references to Acme Corporation and Acme, Inc. refer to a fictitious company and are used for illustration purposes only.
  14. 14. Our Story in Forty-five Minutes • Preface • Chapter I - The Beginning • Chapter 2 - Searching for Clues • Chapter 3 - Creating a Solid Platform • Chapter 4 - The Softside of Performance Gains • The Final Chapter - Results
  15. 15. Disclaimer "Ladies and Gentlemen. The story you are about to see is true; the names have been changed to protect the innocent." --Dragnet
  16. 16. Disclaimer "Ladies and Gentlemen. The story you are about to see is true; the names have been changed to protect the innocent." --Dragnet For example... Acme Corporation is now referred to as Acme, Inc.
  17. 17. Setting Expectations • What we will cover • Problem analysis • Troubleshooting skills • Best practices • The performance impact of suboptimal applications • What we omitted • Boring, rambling, dry lecture • Useless drivel
  18. 18. Our Story in Forty-five Minutes • Preface • Chapter I - The Beginning • Chapter 2 - Searching for Clues • Chapter 3 - Creating a Solid Platform • Chapter 4 - The Softside of Performance Gains • The Final Chapter - Results
  19. 19. Customer Calls • "We're having a problem. Can you help?" • "Absolutely. What's happening?" • "Our mission critical DB is really $%&@#$^& our users. It's way too slow. It takes less time to reboot [Windows 3.1 on an i386 with 32MB RAM] than to open a document." • "Any idea what changed?" • "We don't know. We have not touched the box."
  20. 20. Why Domino Servers Fail? • Lack of expertise and/or knowledge • Unplanned and/or unexpected expansion • No dedicated administrator • No change management • No monitoring • Workaround overloading
  21. 21. Our Story in Forty-five Minutes • Preface • Chapter I - The Beginning • Chapter 2 - Searching for Clues • Chapter 3 - Creating a Solid Platform • Chapter 4 - The Softside of Performance Gains • The Final Chapter - Results
  22. 22. "Round Up the Usual Suspects" • While waiting for access, request the following • Helps establish the level of criticality • notes.ini • log.nsf • sh tasks • top • vmstat • iostat • df -h • Ping results from the affected user(s) to the server • mount • swapon -s • Server NAB DB copy, sans users
  23. 23. Data, Data Everywhere • Ran DCT - returned a few items, but nothing applicable to the performance issue experienced • Checked Domino stats • Located a key issue - needle in haystack • SAI fluctuated wildly, frequently, plummeting to 18% for minutes then rising sharply again • Locate any recent NSD files for analysis
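A swinging SAI like the one described in slide 23 is easier to spot once the samples are scanned for sustained dips rather than single low readings. The sketch below is only an illustration: it assumes the Server Availability Index has already been collected at fixed intervals into a plain list, and the threshold, minimum run length, and sample values are hypothetical choices, not figures from the case.

```python
def sustained_dips(samples, threshold=50, min_len=3):
    """Return (start, end) index pairs for runs of at least `min_len`
    consecutive samples that fall below `threshold`."""
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value < threshold:
            if start is None:
                start = i  # a new low run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the data.
    if start is not None and len(samples) - start >= min_len:
        runs.append((start, len(samples) - 1))
    return runs


# Hypothetical one-per-minute SAI readings.
sai = [85, 82, 18, 19, 18, 20, 84, 40, 86, 18, 18, 18]
dips = sustained_dips(sai)
print(dips)
```

The single low reading at index 7 is ignored because it does not last long enough; only the two multi-minute dips are reported.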
  24. 24. Quick Example - iostat, vmstat
      malchw@san-domino:~$ iostat
      Linux 3.13.0-83-generic (san-domino) 03/23/2016 _x86_64_ (8 CPU)
      avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
                 6.21   0.25     3.69     0.51    0.00  89.34
      Device:    tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
      sda      45.34    2075.44     778.25  6028264  2260469
      sdb       0.36       1.52       0.03     4422       80
      dm-0     24.51     117.04     186.80   339957   542584
      dm-1     16.17     415.61      79.82  1207173   231836
      dm-2     17.64    1540.92     511.61  4475713  1485996
      malchw@san-domino:~$ vmstat
      procs -----------memory---------- --swap-- ---io-- -system-- -------cpu-------
       r  b swpd     free   buff   cache si so  bi bo  in  cs us sy id wa st
       1  0    0 16943764 153144 7941660  0  0 262 98 144 681  6  4 89  1  0
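When many vmstat captures come back from a customer, a small parser can help pick out the columns that matter for memory pressure. The following is a minimal sketch that reads the sample from slide 24. The function names and the I/O wait limit are assumptions made for illustration. Keep in mind that vmstat run without an interval reports averages since boot.

```python
VMSTAT_SAMPLE = """\
procs -----------memory---------- --swap-- ---io-- -system-- -------cpu-------
 r  b swpd     free   buff   cache si so  bi bo  in  cs us sy id wa st
 1  0    0 16943764 153144 7941660  0  0 262 98 144 681  6  4 89  1  0
"""


def parse_vmstat(text):
    """Return the last data row of vmstat output as a dict of column -> int."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[-2].split()  # column names sit directly above the data row
    values = [int(v) for v in lines[-1].split()]
    return dict(zip(header, values))


def memory_pressure_hints(row, iowait_limit=10):
    """Flag swap activity or high I/O wait (the limit is an illustrative assumption)."""
    hints = []
    if row["si"] > 0 or row["so"] > 0:
        hints.append("swapping: pages are moving to or from swap")
    if row["wa"] >= iowait_limit:
        hints.append("high I/O wait")
    return hints


row = parse_vmstat(VMSTAT_SAMPLE)
hints = memory_pressure_hints(row)
print(row["free"], row["wa"], hints)
```

For this healthy sample no hints are raised; a server that is page faulting heavily would typically show non-zero `si`/`so` values.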
  25. 25. Pro Tip on Data Collection • Watch the server when nobody else does • Lots of strange things happen on servers overnight • Observed the system processing over one million records in :15 twice a week, at different times Example: no one at Acme, Inc. knew this occurred or why
  26. 26. Initial Data Analysis - OS • Swap space 50% of installed memory • Memory was under 1GB for mission critical server Several key DBs contained 100k+ docs • Combination created page faulting plague further eroding performance • System properly patched • Free space adequate
  27. 27. Initial Data Analysis - Notes.ini • Obvious but important data points • Server layout • Where items located • Recognized server.id file • Server tasks Contrast to sh tasks requested earlier • No obvious problems
  28. 28. Initial Data Analysis - Amgr • Agents running all hours of the night and day • Agents running from DBs actively being compacted • Agents running from DBs when updall and fixup running • Not all scheduled agents needed to run all weekend
  29. 29. Initial Data Analysis - Log.nsf • Compact still running when updall Program fires off • Compact never finished before execution time ceiling hit Left largest DBs in a completely suboptimal state • Connected to servers that did not exist • Scheduled replication documents • Significant delays with replica synchronization • Ensured data never properly synchronized across domain • Certain connection documents only covered two DBs
  30. 30. Initial Data Analysis - DBs • Several big DBs last fixup completed two years ago • Most heavily used files 30-75% Used • Many views means clicking one forces a new index build • No design, document, or attachment compression • Design server task citing non-existent templates
  31. 31. Our Story in Forty-five Minutes • Preface • Chapter I - The Beginning • Chapter 2 - Searching for Clues • Chapter 3 - Creating a Solid Platform • Chapter 4 - The Softside of Performance Gains • The Final Chapter - Results
  32. 32. Tier 1 - OS • Swap space - No set rule these days 1.5x - 2.0x RAM is good rule of thumb • Memory - 4GB per processor on busy servers • VMware settings if available Avoid temptation of too many processors • Review partitions and free space
  33. 33. Additional OS Considerations • Check that previously made system changes stick Unfamiliar servers can exhibit odd behavior • Check IBM Technotes for any recent performance issues • Once OS is working, check to ensure that virtualization is optimal
  34. 34. Tier 2 - Domino • Space properly Program Documents Avoid overlap with agents and other Programs • Pause agent schedule during maintenance • Schedule a weekend to complete first full maintenance set First full compact will take much longer than you realize • Create maintenance schedule with tasks agreed to by business line managers Ensures all needed jobs are available when needed
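The advice in slide 34 about spacing Program documents so they do not overlap with agents can be checked mechanically once start times and realistic durations are known. The sketch below is one possible approach, not the presenters' tooling: the job names, start times, and durations are hypothetical, and it only handles windows that start on the same day.

```python
from datetime import datetime, timedelta


def window(name, start_hhmm, minutes):
    """Build a (name, start, end) tuple from a start time and a duration."""
    start = datetime.strptime(start_hhmm, "%H:%M")
    return (name, start, start + timedelta(minutes=minutes))


def find_overlaps(jobs):
    """Return pairs of job names whose [start, end) windows intersect."""
    ordered = sorted(jobs, key=lambda job: job[1])
    overlaps = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later job can overlap job a
            overlaps.append((name_a, name_b))
    return overlaps


# Hypothetical schedule: measured durations matter more than planned ones.
jobs = [
    window("compact", "01:00", 180),
    window("updall", "03:00", 60),
    window("nightly-agent", "02:30", 45),
]
overlaps = find_overlaps(jobs)
print(overlaps)
```

Feeding in the durations actually observed in log.nsf, rather than the durations people expect, is what exposes cases such as compact still running when updall starts.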
  35. 35. Additional Items to Fix • Review all enabled Domino features to ensure they function properly • Simple configuration miscues can impact negatively • Cluster replication unable to locate a cluster member • DNS errors create lookup delays • Remove unneeded, deprecated network ports
  36. 36. Our Story in Forty-five Minutes • Preface • Chapter I - The Beginning • Chapter 2 - Searching for Clues • Chapter 3 - Creating a Solid Platform • Chapter 4 - The Softside of Performance Gains • The Final Chapter - Results
  37. 37. Where are We? • Domino admin handled the first level treatment • Server performs well, but not good enough • Triangulated the issue to a mission-critical application • Now what?
  38. 38. Somebody Else’s Code Source: http://ryankennedy.io/running-the-deep-dream
  39. 39. Why Domino Apps Fail? • Lack of expertise and/or knowledge • Developers evolved from power users • Architecture overloading • Unplanned and/or unexpected expansion • Undocumented code and/or business process • No change management • Quick & dirty development
  40. 40. Developers vs Performance Issues • There is no magic pill for finding a performance issue • Many problems are circumstantial Depends on who/when/how… • Repeating the problem in a controlled environment • Need for Proof! • The most difficult part of the task • Need to be systematic
  41. 41. Science Just Works! • Research and Assessment, • Speculation for fixes, • Experiment, • Prove! http://www.wired.com/2013/04/whats-wrong-with-the-scientific-method/
  42. 42. Methodology Research Symptoms (e.g. logs, performance data, etc.) Story (e.g. user input) Application code Hypothesis Speculation on possible reasons Search for ‘Usual Suspects’ Experiment Testing for possible reasons Analyze Check symptoms if fixed Conclusion Issue validated and proved to be fixed.
  43. 43. Research & Assessment • What to collect, based on the symptom; • CPU/memory load, hangs, spikes, crashes, etc. • All the time, the same time everyday or random? • Experienced by specific users? • We are looking for a pattern between incidents.
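Looking for a pattern between incidents, as slide 43 suggests, often starts with nothing more than bucketing incident times by hour of day. This is a small sketch under stated assumptions: the timestamps are invented, and the 50% share used to call an hour "dominant" is an arbitrary illustrative cut-off.

```python
from collections import Counter
from datetime import datetime


def incidents_by_hour(timestamps):
    """Count incidents per hour of day from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def dominant_hour(timestamps, min_share=0.5):
    """Return the hour holding at least `min_share` of incidents, else None."""
    counts = incidents_by_hour(timestamps)
    if not counts:
        return None
    hour, hits = counts.most_common(1)[0]
    return hour if hits / len(timestamps) >= min_share else None


# Hypothetical incident log: user-reported hang times.
incidents = [
    "2017-03-06T14:05:00",
    "2017-03-07T14:20:00",
    "2017-03-08T09:40:00",
    "2017-03-09T14:02:00",
]
hour = dominant_hour(incidents)
print(hour)
```

A dominant hour is only a lead. It narrows the search to whatever is scheduled around that time, such as an agent or a Program document.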
  44. 44. Data Collection Checklist Log/NSD/Semaphore files Server configuration (inc. notes.ini) Server monitoring and statistics data Web logs (for web application issues) XPages and OSGi logs (for XPages specific issues) Application and dependencies
  45. 45. Isolate the Application • Sometimes, even opening in DDE may cause issues! e.g. XPages components are automatically built • Application code might have side effects e.g. Updating on another data source, adding audit logs, performance degradation on the server, etc. • There will be dependencies • Once isolated, we can start inspection…
  46. 46. Usual Suspects • Database corruptions • @Today/@Now in views • Code snippets acting like an admin Updating views, replicating databases, running server commands • Code snippets using the worst practices Search in a large database, wrong looping, etc. • Anything that fits into the pattern if there is one e.g. An agent matching the incident timing
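The @Today/@Now suspect from slide 46 can be screened for in bulk if the view formulas are available as text. The sketch below assumes the selection and column formulas have already been exported into a dictionary keyed by view name; how they are exported is left open, and the view names and formulas shown are hypothetical.

```python
import re

# Matches @Today or @Now as whole function names, in any letter case.
TIME_FUNCTION = re.compile(r"@(today|now)\b", re.IGNORECASE)


def views_with_time_formulas(view_formulas):
    """Return the sorted names of views whose formula text uses @Today or @Now."""
    return sorted(
        name
        for name, formula in view_formulas.items()
        if TIME_FUNCTION.search(formula)
    )


# Hypothetical export of view formulas.
formulas = {
    "Open Orders": 'SELECT Form = "Order" & DueDate < @Today',
    "All Orders": 'SELECT Form = "Order"',
    "Status": '@If(@Now > Expires; "Expired"; "Active")',
}
flagged = views_with_time_formulas(formulas)
print(flagged)
```

A text match only produces candidates for review; each flagged view still needs a look at how often it is opened and how large its index is.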
  47. 47. Nothing yet? Digging deeper!
  48. 48. Team Up! • Deeper investigation needs a team effort • Admins and Developers should collaborate • A test setup to simulate the production environment • Intensive / Controlled debugging sessions in limited time windows • Sharing expertise • Experimenting on production should be the last resort • Once a repeatable error is found, cooperate for a solution
  49. 49. Examples
  50. 50. Example Case - Analysis • JVM Crash with the HTTP task • Random times • No pattern in the log • Memory dumps point to a leak in the JVM Heap • Inspected XPages applications, nothing found • Triangulated the problem to one XPages app, following clues in intensive debugging on memory • Isolated the application for a load test, nothing found • Increased logging, to collect more data, no hope!
  51. 51. Example Case - Resolution • Checked the server configuration and noticed • Logging data incomplete • Removed exclusions • New logs pointed to the problem: • Search software crawling a specific page • Page generates state data and fills up the memory • Simulated the same crash on the test environment • One line of code fixed the issue
  52. 52. Another Case - Analysis • A mission critical application at a bank • Web application with 2000+ users • CPU spikes and random hangs, mostly afternoon • Logs are clear, no crashes, no error messages • Isolated the application, inspected the ‘usual suspects’ • Found a web agent updating a view! • Triangulated the problem using web logs and SEMDEBUG • But, cannot validate the issue on the test environment…
  53. 53. Another Case - Resolution • Cooperated with the Domino admin • Detailed assessment on the server configuration • We found the issue! • “ServerTasksAt14” running an updall task. • Another Program file running updall on a specific database, every 30 minutes • Applied to the test platform, validated by a load test • Problem solved!
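The ServerTasksAt14 finding in slide 53 suggests a quick check worth running on any unfamiliar server: list every `ServerTasksAt<hour>` entry in notes.ini and see which ones fire during business hours. The sketch below is an illustration only. The notes.ini excerpt is hypothetical apart from the `ServerTasksAt14` updall entry named in the slide, and the 08:00 to 18:00 business window is an assumption.

```python
import re

# Hypothetical notes.ini excerpt.
NOTES_INI = """\
ServerTasks=Update,Replica,Router,AMgr,AdminP,HTTP
ServerTasksAt2=Updall
ServerTasksAt14=Updall
ServerTasksAt5=Statlog
"""

SCHEDULED = re.compile(r"^ServerTasksAt(\d{1,2})=(.*)$", re.IGNORECASE)


def scheduled_tasks(ini_text):
    """Map hour of day -> list of tasks from ServerTasksAt<hour> lines."""
    schedule = {}
    for line in ini_text.splitlines():
        match = SCHEDULED.match(line.strip())
        if match:
            hour = int(match.group(1))
            tasks = [t.strip() for t in match.group(2).split(",") if t.strip()]
            schedule[hour] = tasks
    return schedule


def tasks_in_business_hours(schedule, start=8, end=18):
    """Keep only the entries that run between `start` and `end` (exclusive)."""
    return {h: t for h, t in sorted(schedule.items()) if start <= h < end}


daytime = tasks_in_business_hours(scheduled_tasks(NOTES_INI))
print(daytime)
```

This only covers notes.ini; Program documents, like the one running updall every 30 minutes in this case, live in the Domino Directory and need a separate review.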
  54. 54. Our Story in Forty-five Minutes • Preface • Chapter I - The Beginning • Chapter 2 - Searching for Clues • Chapter 3 - Creating a Solid Platform • Chapter 4 - The Softside of Performance Gains • The Final Chapter - Results
  55. 55. Quality Analysis Yields Quality Results • Page faults reduced to zero • General DB usage and administration tasks work well • SAI now over 80% • Weird overnight (agent) system operations resolved • Key DBs have 93% used space now • All DBs compressed: design, documents, all attachments • Program documents, agents all adjusted: finish, no overlap
  56. 56. Note on Performance When done properly, few users tend to notice the change, but if removed they will all complain
  57. 57. Teamwork vs. Performance Neither an admin nor a developer could solve all of these issues alone!
  58. 58. Bonus Slide • You can get help inspecting applications and servers • Teamstudio is the sponsor today! Cooperteam MartinScott Teamstudio Ytria
  59. 59. Serdar Başeğmez • IBM Champion (2011 - 2017) • Developi Information Systems, Istanbul • OpenNTF / LUGTR / LotusNotus.com • Featured on… Engage UG, IBM Connect, ICON UK, NotesIn9…
  60. 60. William Malchisky Jr. • IBM Champion (2011 - 2017) • Effective Software Solutions, LLC • Co-founder of Linuxfest at Lotusphere/Connect • Speaker at 25+ Lotus/IBM related events/LUGs • Co-authored two IBM Redbooks • Co-wrote the IBM Education Administration track for Domino 8.5
  61. 61. Follow Up - Contact Information Serdar Basegmez serdar.basegmez@developi.com @serdar_basegmez Skype: sbasegmez Blog: lotusnotus.com Bill Malchisky Jr. william.malchisky@effectivesoftware.com @billmalchisky Skype: FairTaxBill Blog: billmal.com
  62. 62. Questions and Answers
  63. 63. Questions???? Use the Orange Arrow button to expand the GoToWebinar panel Then ask your questions in the Questions panel! Remember, we will answer your questions verbally
  64. 64. @sbasegmez @BillMalchisky @TLCCLtd @Teamstudio Upcoming Events: Connect Comes to the UK, May 4th Engage in Antwerp Belgium, May 8-9 DNUG meeting in Germany on May 31 to June 1 Social Connections in Chicago, IL on June 1-2 MWLUG in Alexandria, VA on August 8-10, 2017 Question and Answer Time! Teamstudio Questions? contactus@teamstudio.com 978-712-0924 TLCC Questions? howardg@tlcc.com 888-241-8522 or 561-953-0095 Howard Greenberg Courtney Carter Serdar Basegmez Bill Malchisky