Stabilising a Large IBM Connections Environment

Berlin, October 16-17 2018
Stabilising a Large IBM
Connections Environment
Martijn de Jong
@martdj


Social Connections 14 Berlin, October 16-17 2018
Who am I
• M.Sc. Electrical Engineering at the University of Delft, The Netherlands
• Psychology & Ergonomics at the University of Stellenbosch, South Africa
• Worked with IBM Domino in development, administration and as an instructor since 2000
• Working for ilionx since 2004
• Worked with IBM Connections since 2012, for 2 of the 3 largest accounts in the Netherlands
Martijn de Jong
mdejong@ilionx.com
twitter.com/martdj
nl.linkedin.com/in/martdj
blog.martdj.nl

Life beyond Connections
• Climbing
• Musicals

The Case
• Client with 22K employees (7K of whom were added 3 months prior to my arrival)
• IBM Connections 5.5 CR3
• Everything installed on Windows 2012
• In a private cloud on MS Azure
• MS SQL Server 2012 as database server
• 7 WebSphere servers (1 Dmgr/Cognos/Analytics, 4 Connections applications, 2 Docs
viewer/conversion)
• Connections clustered. 2 servers per cluster
• 4 - 10 IHS servers
• IBM Connections Engagement Center (ICEC) is the homepage/start page for all employees
• Next to standard applications and ICEC, Communities Surveys, Cognos, Kudos
Boards, Kudos Analytics, DomainPatrol Social and ConnectionsExpert are installed

The Problem
• Connections would simply become
unavailable during the day. Only solution at
the time: A full environment restart which
would take about 30 minutes. This would
happen on average weekly.
• The former administrator was gone

Agenda
• Squeaky SQL
• Craving Coordinator
• Marauding Movies
• Agonising Assumptions
• Plundering Push Notifications
• Bickering Blogs

Squeaky SQL

Squeaky SQL
• After a high demand had been put on the
SQL server (for example, by using Kudos
Analytics), the Connections environment
would start to crack with SQL errors in the
logs
• Memory usage on SQL server: 100%
• “Solution”: restart environment

Squeaky SQL
• History:
• MS SQL was installed by Azure/
Windows admin
• Databases/users created by former
administrator

Squeaky SQL
• Configuration:
• 2 servers
• Active-passive cluster
• Server 1: 14GB memory
• Server 2: 28GB memory
• All partitions (data/logs/temp) on one (not so fast)
disk
• No limitations to memory usage of SQL server

Squeaky SQL
• Cause:
• Lack of memory on server 1 
(if server 1 was used, Connections would
crash sooner)
• SQL server would allocate all available
memory and not release it. Windows OS
would start to swap

Squeaky SQL
• Solution:
• Double memory on SQL Server 1
• Limit max memory of SQL server to
24GB
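
As a rough sketch of the memory cap above: SQL Server's limit is set through the `max server memory (MB)` option of `sp_configure`. The Python helper below only builds the T-SQL text for the 24GB figure from the slide. The function name is hypothetical, and choosing the value and running the statement is best done together with a DBA.

```python
def max_memory_tsql(limit_gb: int) -> str:
    """Build T-SQL that caps SQL Server's memory at limit_gb gigabytes."""
    limit_mb = limit_gb * 1024  # sp_configure expects the value in MB
    return "\n".join([
        # 'max server memory (MB)' is an advanced option, so expose those first.
        "EXEC sp_configure 'show advanced options', 1;",
        "RECONFIGURE;",
        f"EXEC sp_configure 'max server memory (MB)', {limit_mb};",
        "RECONFIGURE;",
    ])


print(max_memory_tsql(24))
```

Leaving a few GB below the machine's physical memory keeps room for Windows itself, which is what stops the swapping described earlier.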

Lesson learned:
Get a DBA to help you with the configuration of
your SQL backend

Craving Coordinator

Craving Coordinator
• The next problem I noticed was with
clustering
• The WebSphere Application Servers view
looked like this:
• The WebSphere Application Clusters view
looked like this:

Craving Coordinator
• A lot of these errors in SystemOut.logs:
AgentClassImp W HMGR1001W: An attempt to receive a message of type GrowAgentRequest for Agent Agent: :
[_ham.serverid:ConnectionsCell01ConnectionsNode11SearchServer01]
[drs_inst_name:ic/services/cache/OAuth20DBClientCache][drs_inst_id: 1512698926654][ibm_agent.seq:1227]
[drs_mode:0][drs_agent_id:
CommunitiesServer01ic/services/cache/OAuth20DBClientCache9266541] in AgentClass AgentClass :
[policy:DefaultNOOPPolicy][drs_grp_id:
ConnectionsReplicationDomain] failed. The exception is
com.ibm.wsspi.hamanager.HAGroupMemberAlreadyExistsException: The member already exists
at com.ibm.ws.hamanager.impl.HAManagerImpl.joinGroup(HAManagerImpl.java:179)
at com.ibm.ws.hamanager.agent.AgentImpl.<init>(AgentImpl.java:174)
at com.ibm.ws.hamanager.agent.AgentClassImpl.onMessage(AgentClassImpl.java:429)
at com.ibm.ws.hamanager.impl.HAGroupImpl.doOnMessage(HAGroupImpl.java:794)
at com.ibm.ws.hamanager.impl.HAGroupImpl$HAGroupUserCallback.doCallback(HAGroupImpl.java:1382)
at com.ibm.ws.hamanager.impl.Worker.run(Worker.java:64)
at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1881)

Craving Coordinator
• Both the cluster viewer and the error message show problems
with the High Availability Manager (HAM)
• The WebSphere HAM is the component that is responsible for the
automatic failover support.
• The error message would occur in case the HA manager is not
able to obtain a communications thread from the thread pool
• The location of the services that depend on the HAM is managed
by the core group coordinator
• The core group coordinator can’t manage these services properly
if it is craving for resources…

Craving Coordinator
• Your Deployment Manager is the primary
target for the core group coordinator task

Craving Coordinator
• History:
• Kudos Analytics was previously installed on same
servers as half the Connections applications
• Former administrator had had an outage when
Analytics was heavily used
• He moved Kudos Analytics to an Appserver on
Dmgr machine
• Together with Cognos

Craving Coordinator
• The configuration:
• Dmgr machine memory: 14 GB
• Max heap size Cognos: 6 GB
• Max heap size Kudos Analytics: 6 GB
• Heap size node agent: 768 MB
• Heap Dmgr: 1 GB
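
Adding up these numbers shows how little room was left on the machine. A small Python sketch of the arithmetic, using only the figures from the slide (native JVM memory on top of the heap is not counted, so the real situation was tighter still):

```python
# Maximum heap sizes on the Dmgr machine, in MB (figures from the slide).
max_heaps_mb = {
    "Cognos": 6 * 1024,
    "Kudos Analytics": 6 * 1024,
    "Node agent": 768,
    "Dmgr": 1 * 1024,
}
machine_memory_mb = 14 * 1024

total_heap_mb = sum(max_heaps_mb.values())
headroom_mb = machine_memory_mb - total_heap_mb

print(f"Total max heap: {total_heap_mb} MB of {machine_memory_mb} MB")
print(f"Left for Windows and JVM native memory: {headroom_mb} MB")
```

With all heaps at their maximum, only 256 MB would remain for Windows and everything else on the machine.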

Craving Coordinator
• Solution:
• Assign more memory to the core group
coordinator if you have a lot of JVMs
(Transport Memory Size: 200 MB
instead of 100 MB)
• Set a parameter for higher
efficiency:
IBM_CS_HAM_PROTOCOL_VERSION = 6.0.2.31
• Set preferred coordinator servers.
Choose servers with enough
resources

Lesson learned:
Don’t underestimate the importance of your Deployment
Manager. Make sure your Deployment Manager always has
enough resources!

Marauding Movies

Marauding Movies
• Problem:
• Connections environment crashed. 2 (out of 4) main WebSphere
Application servers became totally unreachable.
• When we could finally log on to one server, we saw that memory
usage was 100% (usually 20GB free), as was CPU usage
• One JVM used 24GB of memory (max heap size 2GB): the Files
server
• Initial “solution”: We blocked traffic to the Connections environment
to allow all servers to start up except for the Files servers. Then we
allowed traffic again to give users access to the other applications

Marauding Movies
• Investigation of the logs showed a large
occurrence of a specific file
“inn.Challenge_total.mp4”
• The file was 305 MB
• It was downloaded over 50,000 times in
less than 2 days…
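
A log investigation like this can be sketched in a few lines of Python. This is only an illustration: it assumes an access log in Common Log Format, and the sample lines, client addresses, timestamps and URL paths below are made up. Only the file name comes from the slide.

```python
import re
from collections import Counter

# Matches the request part of a Common Log Format line (assumed log format).
REQUEST = re.compile(r'"(?:GET|POST|HEAD) (\S+) [^"]*"')


def top_requests(lines, n=5):
    """Count requests per URL path and return the n most frequent."""
    counts = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            path = match.group(1).split("?")[0]  # ignore query strings
            counts[path] += 1
    return counts.most_common(n)


# Made-up sample lines; a real run would iterate over the log file instead.
sample = [
    '10.0.0.1 - - [01/Jan/2018:09:00:00 +0100] "GET /files/inn.Challenge_total.mp4 HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2018:09:00:01 +0100] "GET /blogs/home HTTP/1.1" 200 512',
    '10.0.0.3 - - [01/Jan/2018:09:00:02 +0100] "GET /files/inn.Challenge_total.mp4 HTTP/1.1" 200 1024',
]

for path, hits in top_requests(sample):
    print(hits, path)
```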

Marauding Movies
• Cause:
• The movie was embedded in a Blog post
• The Blog post was part of the blog that’s incorporated
in the Engagement Center’s homepage
• Every time a user would go to the company’s
homepage, the browser would try to download the file
• The environment couldn’t take this load
(50,000 × 305 MB ≈ 15.25 TB)

Marauding Movies
• Solution:
• Delete the movie
• Start FilesCluster
• Find the user who posted the movie
• Instruct the user to NEVER do that again

Lesson learned:
The IBM Engagement Center homepage can put an enormous
load on specific servers. Instruct the users who post on the
homepage well.
but…
Why did this crash the WebSphere FilesCluster?!?

Agonising Assumptions

• “Assumption is the mother
of all fuckups…” 
— Travis Dane
• Previous administrator
assured me files were
downloaded through IBM
HTTP Server
• He seemed correct
• But…

Lesson learned:
If you replace a former administrator, check the whole
environment. Don’t assume everything was configured correctly

Plundering Push Notifications

• This happened before I came in
• The Connections environment had become very slow
• Investigation showed that the web servers had run
out of threads
• Most threads were used by the Push notification
application
• The previous administrator had solved this by
disabling this application

• Background
• “IBM HTTP Server on Windows has a Parent process
and a single multi-threaded Child process. 
On 64-bit Windows operating systems, each instance of
IHS is limited to approximately 2000 ThreadsPerChild” 
— IBM Connections 6.0 tuning guide
• Push server connections stay open for a long time and
occupy these scarce threads
• Especially when not tuned

• Check your webservers using the server-status
page (server-status?auto is handy
for automation)
• The W’s in the scoreboard (workers sending
a reply) are from the Push notifications
application
• If you’re regularly low on idle workers,
change your push notification timeout
parameter (see tuning guide)
• Linux servers can handle far more
connections
• On Windows, using httpd-la.exe instead
of httpd.exe can double the number of
connections your webserver can handle (see
http://www-01.ibm.com/support/docview.wss?uid=swg1PI04922)
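
To illustrate why server-status?auto is handy for automation, here is a minimal Python sketch that parses its plain-text output. `BusyWorkers`, `IdleWorkers` and `Scoreboard` are the field names Apache's mod_status uses. The sample text and the alert threshold are made up, and fetching the page (for example with `urllib.request.urlopen`) is left out.

```python
def parse_server_status(text: str) -> dict:
    """Extract worker counts from the text returned by server-status?auto."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value.strip()
    scoreboard = fields.get("Scoreboard", "")
    return {
        "busy": int(fields.get("BusyWorkers", 0)),
        "idle": int(fields.get("IdleWorkers", 0)),
        # "W" marks a worker that is sending a reply.
        "sending_reply": scoreboard.count("W"),
    }


# Made-up sample output, shortened to eight worker slots.
sample = "BusyWorkers: 5\nIdleWorkers: 3\nScoreboard: WWW_K_W_\n"

status = parse_server_status(sample)
print(status)
if status["idle"] < 50:  # hypothetical alert threshold
    print("Low on idle workers")
```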

• Chosen solution
• Moving to Linux was not an option
• Timeout parameter 40000
• Configure 10(!) webservers
• Of which 4 are on by default
• Others are started as needed (using a runbook on Azure)
• With this configuration, the push notification application
was successfully re-enabled

Lesson learned:
Windows is not a suitable platform for IBM HTTP Server in a large environment.
If you have to use it, watch out for your Push Notification application. Tune it if
necessary.

Bickering Blogs

Bickering Blogs
• The problem:
• The blogs application would become
slow and then unavailable
• As the ICEC homepage shows blogs on
the homepage, this problem is highly
visible for the client

Bickering Blogs
• The cause:
• Certain actions (updating the hit counter, updating the likes counter)
would create a deadlock between the individual blog servers and the
blogs database
• This would result in hung threads in the blogs application
• The number of hung threads would rise to the maximum available
threads in about an hour
• When this happens, the Blogs application becomes unavailable
Example from SystemOut.log:
[13-3-18 9:38:13:696 CET] 000000f4 ThreadMonitor W WSVR0605W: Thread "WebContainer : 1" (00000162) has been active for 711627 milliseconds and may be hung. There is/are 11 thread(s) in total in the server that may be hung.

Bickering Blogs
• The solution:
• We raised a PMR with IBM about this problem.
Despite multiple fixes, the problem
remains to this day
• So no solution yet!

Bickering Blogs
• The workaround:
• We use a PowerShell script to monitor the
SystemOut.log of the blogs servers for the
occurrence of the hung threads
• If they are found, a mail is sent to the
administrators
• We log on and hard kill the Blog server process
(stopping the blogs servers nicely does not work)
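
The monitoring idea can be sketched as follows. This is not the PowerShell script from the talk but a Python illustration of the same check: scan the log text for WSVR0605W warnings and read the reported number of hung threads. The function name and the alert threshold are hypothetical, and sending the mail is left out.

```python
import re

# WSVR0605W is the WebSphere warning for a thread that may be hung.
HUNG = re.compile(
    r"WSVR0605W: .*?There is/are (\d+) thread\(s\) in total", re.DOTALL
)


def max_hung_threads(log_text: str) -> int:
    """Return the highest hung-thread count reported in the log text, or 0."""
    counts = [int(m.group(1)) for m in HUNG.finditer(log_text)]
    return max(counts, default=0)


# The warning shown on the earlier slide, joined into a single line.
sample = (
    '[13-3-18 9:38:13:696 CET] 000000f4 ThreadMonitor W WSVR0605W: '
    'Thread "WebContainer : 1" (00000162) has been active for 711627 '
    'milliseconds and may be hung. There is/are 11 thread(s) in total '
    'in the server that may be hung.'
)

hung = max_hung_threads(sample)
if hung >= 5:  # hypothetical alert threshold
    print(f"Blogs server reports {hung} hung threads, alert the administrators")
```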

Where are we now?
• The Connections environment has been
stable this entire year with no major
outages
• Usage of the Connections environment is
still rising
• And the customer is happy :-)

Technical details
For more technical details, check my blog:
https://blog.martdj.nl

Questions
