Your logging stack runs as an app inside your OpenShift platform. When something goes wrong within the platform itself, how do you make sure you can still access the logs to diagnose the problem? We take you through our journey running OpenShift on AWS and how we arrived at a good answer to that question.
1. Hands-off logging for OpenShift in AWS
Lessons learned in Deloitte Platform Engineering
Amir Moghimi, Lead Platform Engineer
Jason Howard, Lead Platform Engineer
2. Who runs the logging stack?
If your logging stack is running as apps inside your OpenShift platform and something goes wrong within the platform itself, how do you make sure you can still access the logs to diagnose the problem?
3. How do you guarantee you can access the logs and not miss entries due to load or maintenance issues?
Tip
Elasticsearch is not trivial software to run and maintain in production.
4. Meet James.
He recently started doing a bit of operations as part of a DevOps team. He loved the ideas behind the DevOps movement but had little operations experience.
5. A support ticket
He gets a call to look into why OpenShift is not able to deploy any new containers. The first thing he checks is the logs in Kibana, but he only finds that there have been no new log entries for hours.
6. This operational immaturity left James frustrated, and he quit the DevOps team.
Tip
Always remember the impact of developers not considering the operational complexities they introduce in their code.
7. Don’t run the full logging stack!
Let someone else do it for you. (With a little help from AWS)
Tip
OpenShift’s built-in logging stack (EFK) can help you ship the logs to another logging stack.
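As a rough sketch, the cluster-side Fluentd can forward logs out of the platform with a secure_forward output. The hostname, shared key, and certificate path below are placeholders, not values from our setup:

```
# Hypothetical secure_forward output on the OpenShift Fluentd side.
<match **>
  @type secure_forward
  self_hostname "#{ENV['HOSTNAME']}"
  shared_key my_shared_secret            # assumption: replace with your own key
  secure yes
  ca_cert_path /etc/fluent/keys/ca.crt   # CA that signed the aggregator's cert
  <server>
    host fluentd-aggregator.example.com  # hypothetical external aggregator
    port 24284
  </server>
</match>
```

The point of this indirection is that the receiving Fluentd lives outside the cluster, so it keeps shipping logs even when the platform itself is unhealthy.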
8. OpenShift documentation:
“Sending logs directly to an AWS Elasticsearch instance is not supported. Use Fluentd Secure Forward to direct logs to an instance of Fluentd that you control and that is configured with the fluent-plugin-aws-elasticsearch-service plug-in.”
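On the receiving side, the standalone Fluentd accepts the secure_forward traffic and writes it to AWS ES using the plug-in the documentation names. A minimal sketch, assuming a hypothetical endpoint URL and region:

```
# Hypothetical aggregator config: receive from the cluster, write to AWS ES.
<source>
  @type secure_forward
  self_hostname fluentd-aggregator.example.com  # placeholder hostname
  shared_key my_shared_secret                   # must match the sender's key
  secure yes
  cert_path /etc/fluentd/keys/server.crt
  private_key_path /etc/fluentd/keys/server.key
</source>

<match **>
  @type aws-elasticsearch-service   # fluent-plugin-aws-elasticsearch-service
  logstash_format true              # daily logstash-YYYY.MM.DD indices
  <endpoint>
    url https://search-mydomain.eu-west-1.es.amazonaws.com  # placeholder
    region eu-west-1
    # With no static keys, the plug-in signs requests using the
    # instance's IAM role credentials.
  </endpoint>
</match>
```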
10. Then, James would find the logs in AWS ES
He could get to the error message from the registry pod showing that the disk was full. He was then able to go and free some disk space, while wondering why the OpenShift registry is not able to garbage-collect images when the disk is full!
11. It’s no surprise we use AWS Elasticsearch regularly.
There are many things that can go wrong with ES.
Tip
Have you heard of split brain, where each node elects itself as the new master (thinking that the other master-eligible node has died) and the result is two separate clusters?
12. Make sure you configure the following parameters for the Fluentd log shipper appropriately:
buffer_chunk_limit
buffer_queue_limit
num_threads
flush_interval
Also, the default 200M memory limit is too low; 400–500M is more reasonable under load.
Tip
AWS ES has a bulk upload (HTTP payload) limit that depends on the instance size. Fluentd ships each buffer chunk as a single bulk request, so buffer_chunk_limit should not exceed the bulk upload limit of your ES instance; buffer_chunk_limit times buffer_queue_limit bounds the total data Fluentd will buffer.
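Putting those parameters together, a buffer section might look like the sketch below. The numbers are illustrative assumptions, not recommendations; size them against your own log volume and your ES instance's payload limit:

```
# Hypothetical buffer tuning for the AWS ES output.
<match **>
  @type aws-elasticsearch-service
  buffer_chunk_limit 8m    # keep each chunk under the ES bulk/payload limit
  buffer_queue_limit 32    # 8m x 32 = up to 256m of buffered log data
  num_threads 2            # flush chunks in parallel
  flush_interval 5s        # how often chunks are flushed to ES
  <endpoint>
    url https://search-mydomain.eu-west-1.es.amazonaws.com  # placeholder
    region eu-west-1
  </endpoint>
</match>
```

Larger chunks mean fewer, bigger bulk requests; a longer queue absorbs ES slowdowns at the cost of memory, which is why the shipper's memory limit also needs raising.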