This document provides an overview of common digital performance killers that can affect collaborative applications like Domino, Traveler and Sametime. It identifies several key areas that can degrade performance including disk I/O, memory constraints, overloaded virtual hosts, congested proxy servers, firewall and load balancer timeouts, and misconfigured LDAP servers. Specific examples are provided to illustrate how problems in these areas negatively impacted the performance of applications at various companies. The document emphasizes the importance of ongoing performance monitoring and tuning of the entire IT environment for high availability applications.
3. Why Are We Here?
§ Two of IBM's senior troubleshooters
§ Combined 35+ years of experience with varied customer environments
§ 60% of reported problems with collaborative applications are environmental
– NOT addressed by fixes/patches
– NOT addressed by IBM software configurations
§ High availability apps (Domino, Traveler, Sametime) are often the first to manifest lower-level problems
– Sound familiar? “Our other apps don't have any problems”
§ Most of these performance killers are avoidable
§ Performance tuning & monitoring your entire environment is key!
4. Disk I/O – Local Drives
§ Slow/congested disks affect EVERYTHING
– Application performance
– Operating system performance (e.g. paging file)
§ Optimal read/write performance – less than 15ms per transaction
§ Hallmark symptom – disk queue lengths
– These queue lengths indicate the number of transactions awaiting disk service – not their size
– Every disk manufacturer agrees – queue lengths > 2.0 indicate poor performance
§ Monitor: Platform statistics (Domino), Perfmon (Windows), iostat (Linux)
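The two rules of thumb above (under 15 ms per transaction, queue length no higher than 2.0) can be applied mechanically to collected samples. The following is a minimal sketch only: the `DiskSample` structure and the sample values are assumptions for illustration, and real numbers would come from Platform statistics, Perfmon or iostat.

```python
from dataclasses import dataclass

# Thresholds taken from the slide above.
MAX_LATENCY_MS = 15.0
MAX_QUEUE_LENGTH = 2.0


@dataclass
class DiskSample:
    """One monitoring sample (hypothetical structure, not any tool's output format)."""
    device: str
    latency_ms: float     # average time per read/write transaction
    queue_length: float   # number of transactions awaiting disk service


def disk_warnings(sample: DiskSample) -> list[str]:
    """Return one message per threshold that the sample violates."""
    warnings = []
    if sample.latency_ms >= MAX_LATENCY_MS:
        warnings.append(
            f"{sample.device}: latency {sample.latency_ms:.1f} ms "
            f"(want < {MAX_LATENCY_MS:.0f} ms)"
        )
    if sample.queue_length > MAX_QUEUE_LENGTH:
        warnings.append(
            f"{sample.device}: queue length {sample.queue_length:.1f} "
            f"(want <= {MAX_QUEUE_LENGTH:.1f})"
        )
    return warnings


if __name__ == "__main__":
    # Made-up sample values for illustration.
    for s in [DiskSample("D:", 8.2, 0.7), DiskSample("E:", 31.5, 4.3)]:
        for message in disk_warnings(s):
            print(message)
```

A single bad sample is rarely meaningful; in practice the check would be run over a series of samples taken during peak load.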
5. Disk Shares (Client-side)
§ Commonly used in virtual desktop environments
– Citrix, VMware VDI
§ Large user data (e.g. Domino databases) stored on remote disk farms
§ May substantially degrade file operations (especially upload/download)
§ Clients may have applications installed on remote disks
§ Smart idea to have multiple file servers for critical data
§ Monitor: All usual disk I/O metrics apply
6. Disk I/O – SAN and NAS
§ Same basic standards – 15ms per transaction, disk queues <= 2.0
§ Complicated by multiple applications sharing same disk chassis
– Other applications “hammering” SAN can create problems for you
§ SAN: Check HBA configuration on servers and cache configuration on SAN device(s)
– Don't forget max queue depth!
§ NAS: Network latency/congestion most common performance factor
§ Monitor: Platform statistics, Perfmon, iostat, network analysis
§ Best way to ensure performance: examine average disk latency (<15-20 ms) for reads and writes separately (don't pay attention to total I/O operations)
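To illustrate why read and write latency should be examined separately, here is a minimal sketch that derives both averages from two snapshots of cumulative I/O counters. The `IoCounters` fields are illustrative names; Linux exposes comparable cumulative counters (operations completed and milliseconds spent, per direction) in /proc/diskstats, but nothing here parses a real tool's output.

```python
from dataclasses import dataclass


@dataclass
class IoCounters:
    """Cumulative counters for one device (illustrative field names)."""
    reads: int      # read operations completed
    read_ms: int    # total milliseconds spent on reads
    writes: int     # write operations completed
    write_ms: int   # total milliseconds spent on writes


def avg_latency_ms(ops: int, busy_ms: int) -> float:
    """Average milliseconds per operation; 0.0 when there were no operations."""
    return busy_ms / ops if ops else 0.0


def latency_between(before: IoCounters, after: IoCounters) -> tuple[float, float, float]:
    """Return (read_avg, write_avg, combined_avg) in ms for the interval."""
    d_reads = after.reads - before.reads
    d_read_ms = after.read_ms - before.read_ms
    d_writes = after.writes - before.writes
    d_write_ms = after.write_ms - before.write_ms
    return (
        avg_latency_ms(d_reads, d_read_ms),
        avg_latency_ms(d_writes, d_write_ms),
        avg_latency_ms(d_reads + d_writes, d_read_ms + d_write_ms),
    )


if __name__ == "__main__":
    # Made-up snapshots: many fast reads, fewer slow writes.
    start = IoCounters(reads=0, read_ms=0, writes=0, write_ms=0)
    end = IoCounters(reads=9000, read_ms=45000, writes=1000, write_ms=40000)
    read_avg, write_avg, combined = latency_between(start, end)
    print(f"read {read_avg} ms, write {write_avg} ms, combined {combined} ms")
```

With these made-up numbers the combined average is 8.5 ms, which looks healthy, while writes alone average 40 ms, well beyond the 15-20 ms guideline.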
7. Memory Constraints
§ Memory consumption can vary widely with load
– Be aware of growth in userbase, or added mobile devices
§ Pay particular attention to JVMs
– Know their memory configuration
– Consult tuning documentation
– Can be a particular concern in WebSphere environments
§ Can be exacerbated by high paging rates in the OS, or strapped kernel caches
§ Monitor: Perfmon (Committed Bytes, % Committed Bytes In Use), vmstat
8. Overcommitted Virtual Hosts
§ Growing problem with rise of virtual environments
§ Performance problems can be triggered by demands of OTHER VMs
§ May show up as memory constraints, CPU contention, network latency
§ May be paired with disk I/O constraints
§ Pay attention to WebSphere logs for CPU contention
§ Dynamic load re-distribution (moving VMs to a new host) can cause problems for HA or near-real-time apps (and lessen your HA to boot!)
§ Monitor: Check %CPUReady in VMware statistics (> 5% = contention)
§ Can be addressed with resource pooling/prioritization
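VMware performance charts commonly report CPU ready time as milliseconds accumulated per sampling interval rather than as a percentage. The sketch below shows the usual conversion so the value can be compared with the 5% guideline above. It assumes a 20-second real-time sampling interval and that a multi-vCPU VM's value is the sum across its vCPUs; both are assumptions to verify against your own environment, and the function names are illustrative.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU ready summation (ms per interval) into a per-vCPU percentage.

    interval_s: length of the sampling interval in seconds (assumed 20 s here).
    vcpus: number of vCPUs the summation covers (assumption: values are summed).
    """
    if interval_s <= 0 or vcpus < 1:
        raise ValueError("interval_s must be positive and vcpus at least 1")
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


def is_contended(ready_pct: float, threshold_pct: float = 5.0) -> bool:
    """Apply the '> 5% = contention' rule of thumb from the slide."""
    return ready_pct > threshold_pct


if __name__ == "__main__":
    # Made-up readings for a single-vCPU VM.
    for ready_ms in (400, 1500):
        pct = cpu_ready_percent(ready_ms)
        print(f"{ready_ms} ms ready -> {pct:.1f}% contended={is_contended(pct)}")
```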
9. Congested Proxy Servers
§ Major contributor to end-user performance
§ Frequently seen in new cloud deployments (i.e. large added baseline load)
§ Tends to appear during “peak times”
§ Tends to affect multiple applications (and general Internet traffic)
§ Can be exacerbated with increased file upload/download traffic (e.g. Connections Files)
§ Can affect extranet users (if reverse proxy servers in use)
§ Can be confirmed with HTTP analysis
§ Test/monitor with: HttpWatch, Firebug, Rational Performance Tester
10. Proxy Server Example
§ ACME - enterprise with 50,000 users
§ All Internet web traffic required to go through farm of 3 proxy servers
§ ACME migrated messaging to the cloud
§ Massive surge in HTTP/HTTPS traffic swamped proxy servers
§ User reports focused on applications – and which applications were most commonly used?
– No one thinks twice if a random website is slow...
§ Resolved by expanding proxy capacity (adding proxy servers)
11. Firewall & Load Balancer Timeouts
§ Often conflict with application-layer timeout settings
§ Load balancer timeouts can result in arbitrarily high (re)connection rates
– Check session affinity/“stickiness” timeouts
§ Create situations where neither endpoint has clear picture of connectivity
§ Often indicated by “connection reset” or “connection timed out” log errors
§ Monitor/confirm via network analysis
– Red flag: TCP retransmissions & RST in existing connection
– Red flag: TCP RSTs appearing “out of nowhere”
§ Mitigate by ensuring that application-layer timeout is the shortest
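The mitigation above amounts to a simple ordering rule that can be checked whenever a timeout changes anywhere along the path. A minimal sketch, assuming the idle timeouts have been gathered by hand into a dictionary; the component names and minute values are hypothetical.

```python
def timeout_conflicts(app_layer: str, timeouts_min: dict[str, float]) -> list[str]:
    """Return components whose idle timeout is not longer than the application's.

    Any component listed here may drop an idle connection before the
    application closes it in an orderly way.
    """
    app_timeout = timeouts_min[app_layer]
    return sorted(
        name
        for name, timeout in timeouts_min.items()
        if name != app_layer and timeout <= app_timeout
    )


if __name__ == "__main__":
    # Hypothetical idle timeouts, in minutes.
    path = {"app_session": 60, "firewall": 30, "load_balancer": 45}
    print(timeout_conflicts("app_session", path))

    # Lowering the application-layer timeout below every device clears the list.
    path["app_session"] = 25
    print(timeout_conflicts("app_session", path))
```

An empty result means the application-layer timeout is the shortest, so idle sessions are closed by an endpoint rather than silently dropped by an intermediate device.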
12. Firewall Timeout Example
§ ACME has a significant number of extranet users
§ Users complain that if Notes client had been idle for more than 30 minutes, client “freezes” for 20-30 seconds when they resume activity (check mail, send a draft email, refresh a view, etc.)
§ Initial review of logs showed “server not responding” in client logs, and “connection reset” or “connection broken” indicators in server logs
§ Network traffic analysis showed connections established normally, but eventually going through the retransmission/timeout cycle (resulting in 20+ second delay)
§ When both endpoints show symptoms of retransmissions and timeouts, we suspect that an intermediate device is interfering
§ Resolved by lowering Domino Server Session timeout below Firewall's 30 minutes (allows for orderly closure of idle sessions)
13. Firewall Timeout
[Diagram, three stages: CLIENT connects to SERVER (idle timeout: 60m) through FIREWALL (idle timeout: 30m)]
§ Start: connection open between CLIENT and SERVER through FIREWALL
§ After 30 minutes idle, firewall SILENTLY “drops state”
§ The next time either side tries to use the connection: 3-5 retransmissions, then give up with TCP RST
14. Software Firewalls (Client & Server)
§ Often installed by default
– Do you know if your standard image includes one?
§ Often affect even localhost connections
§ Do not usually include timeout capability
§ Must be configured for specific apps/ports/port ranges
§ May exhibit symptoms of “some things work, others don't”
§ Usually a problem on client side
§ Recommend: disabling software firewalls on servers
15. Network Appliances
§ Often used to improve WAN performance (e.g. Riverbed, Blue Coat)
– Includes content caching, bandwidth throttling, packet shaping
– Packet shaping and bandwidth throttling may also be introduced by routers and switches due to Quality of Service (QoS) policies
§ When misbehaving, can cause a variety of problems:
– Email attachment problems (packet shaping)
– Higher WAN loads
– SYN-ACK problems, causing general problems on target App server
§ Red flags: WAN behavior different from LAN behavior, network/application diagnostics point to a network issue
§ Monitor: Network bandwidth usage, dropped packets, connection resets
16. Network Accelerator/Packet Shaping Example
§ Admins upgrade Domino server's OS from AIX 6 to AIX 7
§ Suddenly, users could not access large attachments over the WAN
– Notes Client experiences small delay, then produces error “Remote System no longer responding”
§ Network capture client side shows that server acknowledges full attachment size
– After downloading first part of attachment, window size is reset
– Server stops sending additional attachment data, client produces error
– WAN connections routed through Blue Coat device that performed packet shaping
– Turned off packet shaping, problem went away
– Packet Shaping was deliberately truncating download
– Network team did not see any “errors”, since no packets were dropped
– TCP window behavior changed slightly from AIX 6 to AIX 7; the Blue Coat device needed to account for the change in its application-specific policy settings
17. LDAP Performance/Misconfiguration/Search Filters
§ Usually shows up as authentication delays
§ May also cause “slow lookups”
§ LDAP server performance may be affected by other applications
§ Overly complex search filters can degrade LDAP performance
– How many different ways do users need to authenticate?
§ Use of large (or nested) groups also affects performance
§ Active Directory: Consult Global Catalog Server, NOT domain controllers
§ Mitigate with standard (and simple) search filters
§ Monitor: LDAP and host statistics
§ Note: HA apps depend upon HA LDAP!
18. LDAP Search Filter Example
§ ACME's authentication filter
(&(|(objectclass=person)(objectclass=EuroPerson))(|(uid=%s)(cn=%s)(mail=%s)(employeeid=%s)))
§ This filter demands 6 LDAP comparisons per record queried
§ First deployment of a worldwide application brought the LDAP infrastructure to its knees
§ Resolved by simplifying LDAP schema (removing distinction between person and EuroPerson) and setting a standard “how you will authenticate” policy (using only email address or employee ID)
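The comparison count for a filter can be estimated by counting its leaf assertions. The sketch below does this with a simple regular expression rather than a full LDAP filter parser, and treats the count as an upper bound, since a server may short-circuit an OR. The `SIMPLIFIED_FILTER` string is a guess at what the resolved filter could look like based on the description above; it is not taken from ACME's configuration.

```python
import re

# Matches one leaf assertion such as (uid=%s); skips the &, | and ! operators.
LEAF_ASSERTION = re.compile(r"\((?![&|!])[^()=]+=[^()]*\)")


def count_comparisons(ldap_filter: str) -> int:
    """Count leaf assertions in an LDAP search filter (worst-case comparisons)."""
    return len(LEAF_ASSERTION.findall(ldap_filter))


# The filter from the slide above.
ACME_FILTER = (
    "(&(|(objectclass=person)(objectclass=EuroPerson))"
    "(|(uid=%s)(cn=%s)(mail=%s)(employeeid=%s)))"
)

# Hypothetical simplified filter: one object class, email address or employee ID.
SIMPLIFIED_FILTER = "(&(objectclass=person)(|(mail=%s)(employeeid=%s)))"

if __name__ == "__main__":
    print(count_comparisons(ACME_FILTER))        # 6
    print(count_comparisons(SIMPLIFIED_FILTER))  # 3
```

Under these assumptions the simplified filter halves the worst-case comparisons per record.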
19. SQL Servers
§ An important component for Traveler
§ Mail file and connection metadata is contained on SQL server
§ Traveler must consult SQL before it knows if there are relevant changes in the mail file to push to the user
§ Any SQL performance hiccups can dramatically affect Traveler performance
§ Mobile adoption may drive you to the breaking point
§ Monitor: SQL server and host statistics, including disk latency
20. Third-Party Plugins/Addins/Extensions
§ Can affect both servers and clients
§ Notes plugins can add menu options; browser plugins can modify JavaScript/CSS
§ Server-side extensions can introduce external dependencies
– Example: archival plugin may use SAN/NAS/UNC drives
§ Even if the primary external app is disabled, extension DLLs will still load & execute
– e.g. Anti-Virus, Mail Signature plugins
§ Don't forget authentication addins
– May contribute to overall LDAP load
§ Monitor: May be specific to plugin/extension, application stats will apply
21. Third Party Addin Example
§ ACME users intermittently report a 2-5 second delay when Notes Client sends email
– No apparent network delays
– Diagnostics show creation of Note in mail.box requires 3-5 seconds
– Additional debug shows that Ext Mgr Plugin is invoked, takes 95% of the time
§ ACME uses a third party mail signature app that adds a pre-existing signature
– Even though they disable the third party task, the delay still occurs
– Domino still loads any referenced DLLs; depending on how they are written, they may introduce delays (cannot assume that these DLLs are 'lite')
§ Resolved by removing the extmgr DLL from notes.ini (Third Party Vendor required to investigate)
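When investigating a case like this, a useful first step is to list which Extension Manager add-ins a notes.ini actually references. A minimal sketch, assuming they appear as a comma-separated list on the `EXTMGR_ADDINS` line; the sample file contents and add-in names below are made up.

```python
def extmgr_addins(notes_ini_text: str) -> list[str]:
    """Return the add-in names listed on the EXTMGR_ADDINS line, if any."""
    for line in notes_ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().upper() == "EXTMGR_ADDINS":
            return [name.strip() for name in value.split(",") if name.strip()]
    return []


if __name__ == "__main__":
    # Made-up notes.ini fragment with hypothetical add-in names.
    sample = "[Notes]\nDirectory=/local/notesdata\nEXTMGR_ADDINS=sigext,avscan\n"
    print(extmgr_addins(sample))
```

Each name returned is a candidate for the kind of delay described above, whether or not the vendor's main task is running.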
22. In Summary
§ Very few of these problems come “out of nowhere”
§ They're often the consequence of growth
§ Get an idea NOW of what “normal”/“good” looks like!
§ You may be suffering from one (or more) of these problems right now
§ Engage your server, network and/or security teams NOW to avoid problems with new or expanded deployments
23. Questions?
§ Where's YOUR pain point?
THANKS FOR BEING HERE!
rob_gearhart@us.ibm.com wes_morgan@us.ibm.com
@wesmorgan1