These slides were presented during a technical event at my organization. They give an overview of how to find the root cause of unexpected system-down events, and are mainly useful for Linux or Unix system administrators. I tried to cover all aspects of the topic. It took me more than 2 hours to present these slides, but they can also be covered in a shorter time. The gray slide background is there to hide the company logo and to preserve the confidentiality of the private template. However, the knowledge is not restricted :)
3. Why RCA is important
Business Impact
Loss of money due to outages.
Disruption in availability of services.
Risk of re-occurrence of the issue.
Finding the culprit behind the scene.
Security breach or human error.
4. General Approach (Non-Technical)
RCA is a method of problem solving.
There can be more than one root cause behind an issue.
The purpose is to identify a solution or workaround that prevents
recurrence at the lowest cost and in the simplest way.
RCFA (Root Cause Failure Analysis) recognizes that complete
prevention of recurrence by one corrective action is not always
possible.
Well-known methods/tools: 5 Whys, Pareto analysis, cause-and-effect
model, etc.
5. Technical Approach (Basic)
Is the unexpected reboot the effect of some planned activity?
Were there any recent configuration changes (sw/hw)?
What do the recent logs suggest?
Was any unusual behaviour (or log) spotted? (console)
Is there some relation between the occurrences of the events?
Do we have a reliable power source? (UPS)
6. Step Forward
Is it a virtual or physical system?
Check the logs recorded by the hypervisor and/or hardware (mcelog,
IML logs, ASR events, hypervisor utilization, etc.).
Is this system part of a cluster? Was any fencing event recorded?
Try to find out whether it is a real OS issue or is related to
application/network/storage.
Is this the result of some malicious script running on the system?
Is there any anti-virus installed and running on the system?
Which panic parameters are set on the system?
7. Deep Diving
Is there any known bug in the running kernel? (search Bugzilla)
Is this issue reproducible on demand? Is there any possible
workaround?
Does a replica of the system exhibit similar behaviour? Compare
the initramfs of the replicas to find any differences.
Is there some known issue with the combination of the OS
version and a particular application running on the system?
Is there any sign of abnormal resource utilization near the event?
Was a complete or partial dump captured for the reboot?
Check the vmcore-dmesg.txt logs and try to find known issues on
the vendor portal.
8. Server Hung Scenario
Do not confuse it with an application hang scenario. Do all
checks. There is no standard definition of an OS hung situation.
Some facts regarding crash vs hang situations:
• A crash often immediately follows a problem in kernel space, such as a programming
error, defective hardware, or an unsupported operation.
• During a crash, oops messages are displayed, which helps in diagnosis.
• A crash or panic is easier to troubleshoot. It provides a stack trace and panic task
details.
• System hangs are more subtle. A hang can be the result of a simple temporary
performance issue caused by inefficient algorithms, or as complicated as a
deadlock.
• No oops messages are displayed on the console, so we don't know which thread caused the hang.
This makes hang issues more complicated to analyze.
Take a snapshot of the virtual guest and extract a memory dump,
or trigger a panic using the available panic techniques. Make sure the
panic initiates the memory dump mechanism.
9. User Initiated
The "exiting on signal 15" message is the last message that syslog
service emits during normal shutdown.
The presence of this message in the messages file indicates a directed
shutdown of the system, either by a user or by a program.
Is there any system health monitoring software running which may issue
the 'shutdown' command? For example:
• Automatic system recovery software.
• Hardware monitoring tools.
• UPS software with shutdown capability, etc.
To find out which user it was, set audit rules or use a script.
Check the secure logs and bash history of users for the shutdown event.
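The check for a directed shutdown can be scripted. The following is a minimal sketch, assuming the messages file has already been read into a list of lines; the function name and the sample log lines are hypothetical, only the marker text comes from the slide.

```python
# Marker text that the syslog service writes on a directed shutdown (from the slide).
SHUTDOWN_MARKER = "exiting on signal 15"


def find_directed_shutdowns(lines):
    """Return (line_number, text) pairs for every line carrying the marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SHUTDOWN_MARKER in line:
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Hypothetical sample; real input would come from /var/log/messages.
    sample = [
        "Jan  1 10:00:00 host kernel: eth0: link up\n",
        "Jan  1 10:05:00 host syslogd 1.4.1: exiting on signal 15\n",
    ]
    for number, text in find_directed_shutdowns(sample):
        print(f"line {number}: {text}")
```

If no line matches before the reboot, the shutdown was probably not a directed one, and the other scenarios in these slides become more likely.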
10. Cluster Initiated
A cluster reboots a system using a fencing mechanism. Common clustering
software includes Oracle Clusterware, VCS and RedHat Cluster.
Contrary to common belief, high availability is not the highest priority
of an HA cluster, but only the second; data integrity comes first.
There are two classes of fencing methods: one disables the node
itself, the other disallows access to resources such as shared disks.
A cluster can fall victim to conditions called Split Brain and Amnesia.
Clusters use a process called “STONITH” in order to correct the issue;
this simply means the healthy nodes kill the sick node.
I/O fencing is one of the important features of VCS, whereas Oracle-RAC
simply gives the message "Please Reboot" to the sick node. The node
bounces itself and rejoins the cluster. RedHat cluster uses the fence device
configuration to handle fencing events.
One can also set a fence delay to allow the OS to capture a vmcore for fencing
events.
11. Hardware Faults
The most common hardware errors that are captured on the system are:
• Memory errors or Error Correction Code (ECC) problems.
• Inadequate cooling / processor over-heating.
• System bus errors. Cache errors in the processor or hardware.
• Firmware bugs, EDAC and NMIs.
The kernel takes the immediate actions (such as killing processes) and
mcelog decodes the errors.
mcelog is the user-space backend for logging machine check errors
reported by the hardware to the kernel.
• Typical MCE message: "HARDWARE ERROR. This is *NOT* a software problem!"
12. Panic Parameters
These are used to deliberately panic the system when certain
conditions are met. This is necessary for debugging purposes.
• 1) kernel.hung_task_panic
• 2) kernel.softlockup_panic
• 3) vm.panic_on_oom: This parameter panics the kernel on oom-killer
events and captures a vmcore if the kdump service is running as expected.
• 4) kernel.panic_on_io_nmi
• 5) kernel.unknown_nmi_panic: It uses the computer's NMI switch to force a
kernel panic on a hung system.
• 6) kernel.panic_on_oops
• 7) kernel.panic_on_unrecovered_nmi
• 8) kernel.nmi_watchdog: The NMI watchdog monitors system interrupts and
initiates a reboot if the system appears to have hung.
• 9) kernel.panic_on_stackoverflow
• 10) kernel.panic [secs]
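To answer "which panic parameters are set on the system?", the values can be read straight from /proc/sys, where a sysctl name such as kernel.panic_on_oops maps to the file kernel/panic_on_oops. This is a minimal sketch; the helper names are hypothetical, and parameters that do not exist on a given kernel are reported as None.

```python
from pathlib import Path

# Parameter names taken from the list above.
PANIC_PARAMS = [
    "kernel.hung_task_panic",
    "kernel.softlockup_panic",
    "vm.panic_on_oom",
    "kernel.panic_on_io_nmi",
    "kernel.unknown_nmi_panic",
    "kernel.panic_on_oops",
    "kernel.panic_on_unrecovered_nmi",
    "kernel.nmi_watchdog",
    "kernel.panic_on_stackoverflow",
    "kernel.panic",
]


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_panic_params(root="/proc/sys"):
    """Return {name: value or None} for every parameter in PANIC_PARAMS."""
    values = {}
    for name in PANIC_PARAMS:
        try:
            values[name] = sysctl_path(name, root).read_text().strip()
        except OSError:
            # Not built into this kernel, or not readable by this user.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_panic_params().items():
        print(f"{name} = {value if value is not None else '(not available)'}")
```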
13. Panic Strings
These panic strings explain the cause of the panic, but they are not always
sufficient to determine the actual cause.
When a kernel panic occurs, the system usually displays a message on
the console and all system activity stops.
• Kernel BUG at net/sunrpc/sched.c:695!
• BUG: unable to handle kernel paging request at xxxxx
• BUG: unable to handle kernel NULL pointer dereference at xxxxx / (null)
• divide error: 0000 [#1] SMP
• Kernel panic - not syncing: softlockup: hung tasks / hung_task: blocked tasks
• Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
• Kernel panic - not syncing: out of memory, panic_on_oom is selected
• Kernel panic - not syncing: Out of memory and no killable processes..
• Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for
details.
• Kernel panic - not syncing: NMI IOCK error: Not continuing / NMI: Not continuing / nmi watchdog
• Kernel panic - not syncing: Fatal Machine check
• Kernel panic - not syncing: Attempted to kill init!
• Kernel panic - not syncing: GAB: Port h halting system due to client process failure
14. Kernel logging
Syslog is a standard logging facility. It collects messages from various
programs and services, including the kernel, and stores them, depending
on setup, in a set of log files typically under /var/log.
The “/var/log/messages” file stores valuable, non-debug and
non-critical messages. This log should be considered the "general system
activity" log.
Administrators use the log rotation facility to maintain historical data. One
can also change the logging level based on the requirements of the setup.
# Common call traces seen in messages are :
• OOM-killer and memory stats.
• Softlockup logs for various cores.
• Page allocation failures.
• Segfaults : Signifies an error in one particular process.
kernel: fmg[6335]: segfault at 0xffffd2dc rip 0xffffd2dc rsp 00000000ffffd1bc errorX
• Trap divide error : Application crash due to “divide by zero”
kernel: nmupm[2792] trap divide error rip:804a39a rsp:ffa4eb24 error:X
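Segfault lines like the one above can be pulled out of the messages file to see which process keeps crashing. The following is a minimal sketch built around the line format shown on this slide; the function name is hypothetical, and accepting "ip" as well as "rip" is an assumption meant to cover kernels that use the shorter register name.

```python
import re

# Matches e.g. "fmg[6335]: segfault at 0xffffd2dc rip 0xffffd2dc ..."
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]:? segfault at (?P<address>[0-9a-fA-Fx]+) "
    r"(?:rip|ip) (?P<ip>[0-9a-fA-Fx]+)"
)


def parse_segfault(line):
    """Return a dict describing the segfault, or None if the line does not match."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    return {
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
        "address": match.group("address"),
        "ip": match.group("ip"),
    }


if __name__ == "__main__":
    sample = ("kernel: fmg[6335]: segfault at 0xffffd2dc "
              "rip 0xffffd2dc rsp 00000000ffffd1bc errorX")
    print(parse_segfault(sample))
```

A faulting address equal to the instruction pointer, as in this sample, suggests the process jumped to an invalid address rather than reading bad data.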
15. OOM call traces
The out_of_memory function is called when the system memory
(including swap) has been fully allocated to a point where regular system
activities cannot be performed until some of that memory is freed.
The code in mm/oom_kill.c terminates one or more processes based on their badness()
score, a heuristic that tries to avoid killing innocent tasks.
<snip/>
Node 0 DMA: 3*4kB 2*8kB 2*16kB 3*32kB 2*64kB 2*128kB ... 3*4096kB = 15132kB
Node 0 DMA32: 452*4kB ..
Node 0 Normal: 13315*4kB .. <<<
[..]
Free swap = 0kB <<<
Total swap = 8388604kB
[..]
kernel: httpd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0 <<<
kernel:
kernel: Call Trace:
[<ffffffff800c3a6a>] out_of_memory+0x8e/0x2f5
[<ffffffff8000f2eb>] __alloc_pages+0x245/0x2ce
[<ffffffff80012a62>] __do_page_cache_readahead+0x95/0x1d9
</snip>
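The "invoked oom-killer" line tells us which process triggered the OOM killer and the order of the allocation it was attempting. This is a minimal sketch for extracting those fields; the function name is hypothetical, and the pattern deliberately skips anything between gfp_mask and order because other kernel versions may print extra fields there.

```python
import re

# Matches e.g. "httpd invoked oom-killer: gfp_mask=0x201d2, order=0, ..."
OOM_RE = re.compile(
    r"(?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp_mask>0x[0-9a-fA-F]+).*?order=(?P<order>-?\d+)"
)


def parse_oom_invocation(line):
    """Return the process name, gfp_mask and order, or None if the line does not match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    return {
        "comm": match.group("comm"),
        "gfp_mask": int(match.group("gfp_mask"), 16),
        "order": int(match.group("order")),
    }


if __name__ == "__main__":
    sample = "kernel: httpd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0"
    print(parse_oom_invocation(sample))
```

Note that the process named here is only the one whose allocation triggered the OOM killer; it is not necessarily the process that consumed the memory.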
16. D-state call traces
These messages serve as a warning that something may not be
operating optimally. They do not necessarily indicate a serious problem
and any blocked processes should eventually proceed when the system
recovers.
The “khungtaskd” kernel thread detects tasks stuck in D-state
(Uninterruptible Sleep (UN)) for longer than a specified time period and
writes the following type of message to the system log:
<snip/>
INFO: task syslogd:2643 blocked for more than 120 seconds. <<<
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <<<
syslogd D ffff81000237eaa0 0 2643 1 2646 2634
(NOTLB) <<<
ffff8101352c3d88 0000000000000086 ffff8101352c3d98 ffffffff80063ff8
0000000000001000 0000000000000009 ffff81013d2c57e0 ffff810102ac1820
0000340b30708992 0000000000000571 ffff81013d2c59c8 000000010000089f
Call Trace: <<<
[<ffffffff80063ff8>] thread_return+0x62/0xfe
[...]
[<ffffffff8005e28d>] tracesys+0xd5/0xe0
</snip>
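While the system is still responsive, D-state tasks can also be spotted directly, without waiting for khungtaskd. This is a different technique from the log message above: a minimal sketch that parses one line in the /proc/<pid>/stat format, where the state letter follows the command name in parentheses. The function names and the sample line are hypothetical.

```python
def task_state(stat_line):
    """Parse 'pid (comm) state ...' and return (pid, comm, state).

    The command name may itself contain spaces or parentheses, so the
    split is done on the LAST closing parenthesis.
    """
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def is_uninterruptible(stat_line):
    """True if the task is in D-state (uninterruptible sleep)."""
    return task_state(stat_line)[2] == "D"


if __name__ == "__main__":
    # Hypothetical sample; real input would be the content of /proc/<pid>/stat.
    sample = "2643 (syslogd) D 1 2643 2643 0 -1 4194560"
    print(task_state(sample), is_uninterruptible(sample))
```

A single sample only shows that a task is in D-state at that instant; it is the duration that matters, which is why khungtaskd reports tasks blocked for more than the configured timeout.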
17. Soft-lockup call traces
Soft lockups are situations in which the kernel's scheduler subsystem has
not been given a chance to perform its job.
It can be caused by defects in the kernel, by hardware issues or by
extremely high workloads.
<snip/>
kernel: BUG: soft lockup - CPU#7 stuck for 206s! [sosreport:14372] <<<
kernel: Modules linked in: rpcsec_gss_krb5 nfsd..vsock(U) ipv6 .. vmware_balloon .. vmxnet3 ..
dm_mod [last unloaded: speedstep_lib] <<<
[..]
/440BX Desktop Reference Platform
kernel: RIP: 0010:[<ffffffff81162cbd>] [<ffffffff81162cbd>] s_show+0x1ad/0x330 <<<
kernel: RSP: 0018:ffff8801e482fd98 EFLAGS: 00000202
kernel: RAX: 0000000000000000 RBX: ffff8801e482fe18 RCX: ffff88043febfb80 <<<
kernel: RDX: 0000000000000000 RSI: 00000000000036a7 RDI: ffff88043febfb60
[...]
kernel: <d> 00000000000036a7 ffff880437830f00 ffff8801e482fe18 ffff88031e3f1640
kernel: Call Trace:
kernel: [<ffffffff8119db87>] ? seq_read+0x267/0x3f0 <<<
kernel: [<ffffffff81054c30>] ? __dequeue_entity+0x30/0x50 .....
</snip>
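When many soft-lockup messages are logged, it helps to tabulate which CPU was stuck, for how long, and in which task. The following is a minimal sketch based on the message format shown above; the function name is hypothetical.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#7 stuck for 206s! [sosreport:14372]"
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return cpu, seconds, comm and pid, or None if the line does not match."""
    match = SOFT_LOCKUP_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        "seconds": int(match.group("seconds")),
        "comm": match.group("comm"),
        "pid": int(match.group("pid")),
    }


if __name__ == "__main__":
    sample = "kernel: BUG: soft lockup - CPU#7 stuck for 206s! [sosreport:14372]"
    print(parse_soft_lockup(sample))
```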
18. Page allocation failures
The kernel frequently needs to allocate chunks of memory for the
temporary storage of data and structures. Sometimes an allocation
demands many physically contiguous pages, which may not always be
available. In such cases the memory allocator may choose to fail the
allocation request.
Common causes are memory pressure, memory fragmentation, an exhausted
memory zone and drivers with different service routines.
• The usual workaround is to check the value of vm.min_free_kbytes and double it. Also,
setting vm.zone_reclaim_mode to 0 can help to avoid memory congestion issues.
<snip/>
kernel: swapper: page allocation failure. order:2, mode:0x20 <<<
kernel: Pid: 0, comm: swapper Not tainted 2.6.32-220.4.1.el6.x86_64 #1
kernel: Call Trace:
kernel: <IRQ> [<ffffffff81123daf>] ? __alloc_pages_nodemask+0x77f/0x940
kernel: [<ffffffff8115dc62>] ? kmem_getpages+0x62/0x170
kernel: [<ffffffff8115e87a>] ? fallback_alloc+0x1ba/0x270
kernel: [<ffffffff8115e2cf>] ? cache_grow+0x2cf/0x320
kernel: [<ffffffff8115e5f9>] ? ____cache_alloc_node+0x99/0x160 ...
</snip>
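The "order" field in the failure message is the base-2 logarithm of the number of contiguous pages requested. A small worked example, assuming a 4 kB page size (the usual value on x86_64, which matches the kernel in the trace); the function name is hypothetical.

```python
def allocation_size(order, page_size=4096):
    """Return (pages, bytes) for a page allocation of the given order.

    An order-N request asks for 2**N physically contiguous pages.
    page_size defaults to 4096 bytes, which is an assumption.
    """
    pages = 1 << order
    return pages, pages * page_size


if __name__ == "__main__":
    pages, size = allocation_size(2)
    print(f"order:2 -> {pages} contiguous pages, {size // 1024} kB")
```

So the order:2 failure above was a request for 4 contiguous pages (16 kB), which can fail on a fragmented system even when plenty of single free pages remain.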
19. SysRq
It is a 'magical' key combo that you can hit, and to which the kernel will
respond regardless of whatever else it is doing, even if the console is
unresponsive.
The sysrq key is one of the best (and sometimes the only) ways to
determine what a machine is really doing. It is useful when a server
appears to be "hung" or for diagnosing elusive, transient, kernel-related
problems.
For security reasons, the SysRq key is disabled by default.
• Enabling sysrq gives someone with physical console access extra
abilities. It is recommended to disable it when not troubleshooting a problem, or
to ensure that physical console access is properly secured.
There are several sysrq events (and ways to trigger them) that can be used once the
sysrq facility is enabled.
• # echo h > /proc/sysrq-trigger
Commonly used options are :
• m - dump info about memory allocation
• t - dump thread state information
• c - intentionally crash the system
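The echo command above can be wrapped in a small helper so that only a single, valid key is ever written. This is a minimal sketch; the function name is hypothetical, it must run as root to write to the real trigger file, and the path is a parameter so the helper can be tried against an ordinary file first.

```python
def trigger_sysrq(key, trigger_path="/proc/sysrq-trigger"):
    """Write one sysrq command key to the trigger file and return the key.

    WARNING: some keys are destructive. 'c' crashes the system on purpose,
    so never call this with 'c' outside a planned kdump test.
    """
    if len(key) != 1 or not key.isalnum():
        raise ValueError("sysrq key must be a single letter or digit")
    with open(trigger_path, "w") as trigger:
        trigger.write(key)
    return key
```

For example, trigger_sysrq("m") asks the kernel to dump memory allocation information, which then appears in the kernel log.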
20. Kdump
Kdump is a mechanism that uses kexec to capture the crash dump. The crash
dump, also known as a “vmcore”, can be captured using
kdump/diskdump/netdump/xendump/LKCD/vmss2core etc.
kexec is a fastboot mechanism that allows booting a Linux kernel from
the context of an already running kernel without going through the BIOS.
A crash dump captures the state of the kernel at the moment of panic. It is
a snapshot of the physical memory at the time of the crash.
• A vmcore can be collected using the following methods:
• Automatically, when the kernel panics (panic parameters) or oopses. This can be due to a bug in
the kernel or in a third-party driver, to memory corruption, or to hardware problems.
• Manually, when the admin uses sysrq, the NMI switch, or takes a snapshot.
• Limitations of a vmcore: it is not useful for analysing a healthy system; it cannot capture
historical logs; it is complex and requires expertise to analyse.
• Configuring kdump and starting the service is not sufficient; testing kdump is a must.
Also find out the supported and unsupported kdump targets for the particular OS vendor.
• Multiple factors affect vmcore generation, e.g. clustering, HP
systems, bonding, network cards/modules, virtualization, etc.
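One quick sanity check before testing kdump is whether memory was reserved for the capture kernel, which is done with the crashkernel= boot parameter. This is a minimal sketch that looks for it in a kernel command line string; the function name and the sample command line are hypothetical, and on a live system the string would come from /proc/cmdline.

```python
def crashkernel_setting(cmdline):
    """Return the value of the first crashkernel= parameter, or None if absent."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


if __name__ == "__main__":
    # Hypothetical sample; on a real system read /proc/cmdline instead.
    sample = "ro root=/dev/mapper/vg-root crashkernel=128M quiet"
    print(crashkernel_setting(sample))
```

A None result means no reservation was requested at boot, so the kdump service cannot capture a vmcore. A value being present does not prove that kdump works, which is why an actual test crash is still needed.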
21. Bugs
A software bug is a failure or flaw in a program that produces undesired
or incorrect results. It’s an error that prevents the application from
functioning as it should.
There are many reasons for software bugs. The most common reason is
human error in software design and coding.
The BUG_ON() macro acts similarly to panic, but is placed intentionally
in code to check for abnormal conditions.
The vmcore and vmcore-dmesg.txt help to identify bugs. Bugs can be
in any software, but a bug in a device driver or in the kernel can cause outages.
An example of a kernel bug is a divide by zero in the find_busiest_group() function
causing a kernel panic in RHEL6 kernels.
A deadlock bug in “vmtoolsd” causing a system hang is an example of
an external software bug leading to a system panic condition.
22. Preparing for Future
Configure kdump on all systems. It has no side effects.
Configure audit rules based on business requirements.
Properly configure the cluster settings and test them.
Tune the system as per the application vendor's guidelines.
Be ready with a backup plan.
Patch regularly.