DevOps_SelfHealing

Self-Healing in the Cloud based Manufacturing Execution System
Atul Dhingra

Abstract: A “virtual” manufacturing execution and
security platform (vMES) provides a novel way to
manage outsourced production plants. Factory
personnel use services provided by this system over a
cloud, which adds the ability to gather real-time data
from devices or the environment, provide service-level
insights and gain visibility through a “control tower”
that supports decisions connecting the top floor to the
plant floor. Benefits of the virtualized plant floor
include separation of compute, storage and services from
the physical machines on the plant floor. The compute
function relocates to a central data center, and the risk
of downtime or service errors is greatly reduced. The
move to cloud-based services for manufacturing plants is
done in phases by introducing a hierarchical topology in
the network of factories. The first phase addressed the
challenge of virtualization. Responding to unforeseen
events, and meeting delivery commitments despite them,
is a pressing challenge for today’s virtual manufacturing
systems. The second phase addresses automated response
to errors, and this paper describes a framework to enable
that automation.
I. INTRODUCTION
A Self Healing module is automation across
infrastructure and services that connects elements of
the environment, with a specific focus on actions taken
in response to failure events. It helps automate daily
operations, improve productivity and accelerate
troubleshooting through failure detection and problem
remediation. A self-healing architecture takes
continuous delivery to an advanced level. In the past,
when a failure happened in a system, human intervention
was needed: operations personnel tried to identify the
cause and then apply a fix. Many failures recur and
require a standard fix; some have no ready and
identifiable resolution, while a few errors may have
dependencies. Delay in identification and resolution
translates directly into lost productivity for the
organization. It is important that failure monitoring
happens in real time and that intelligent remediation is
applied in an automated manner. This paper proposes a
framework, integrated with DevOps tools, that helps
monitor the system, detect failures and eventually
provide a resolution. The framework also brings
automation and collaboration together, so that
operations and DevOps teams gain better productivity and
faster response times.
II. SELF-HEAL USE CASES FOR VMES
For a cloud transition, whether for services or for
software, it is essential to automate not only the
elasticity of the service or environment but also the
behavior of the service on error. Error recovery needs
to be as fluid as the delivery of the positive case. An
informed decision on the failing instance of a service
is therefore vital to scalability. Failure detection is
done by a prober component, followed by corrective
actions by a resolver, and finally a notifier keeps track
of important events and notifies the operator. After
notification, the use cases also include escalating to a
human actor if a defined threshold is crossed. This Self
Heal Module is called SHM hereafter.
Figure 1: Why Self Healing
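
The prober, resolver and notifier sequence described above can be sketched roughly as follows. This is a minimal illustration, not the SHM implementation: the class name `SelfHealLoop`, the `max_attempts` threshold and the in-memory `events` list standing in for the notifier are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SelfHealLoop:
    """Hypothetical prober -> resolver -> notifier loop for one service."""
    probe: Callable[[], bool]      # prober: returns True when the service is healthy
    resolve: Callable[[], None]    # resolver: corrective action, e.g. a restart
    max_attempts: int = 3          # assumed escalation threshold
    attempts: int = 0
    events: List[str] = field(default_factory=list)  # stands in for the notifier

    def notify(self, message: str) -> None:
        """Record an event; a real notifier would alert the operator."""
        self.events.append(message)

    def tick(self) -> str:
        """Run one monitoring cycle and return the resulting state."""
        if self.probe():
            self.attempts = 0
            return "healthy"
        if self.attempts >= self.max_attempts:
            # Threshold crossed: stop retrying and hand over to a human actor.
            self.notify("escalated to operator")
            return "escalated"
        self.attempts += 1
        self.resolve()
        self.notify(f"corrective action {self.attempts}/{self.max_attempts}")
        return "resolving"
```

Calling `tick()` on a schedule gives the behavior described in the text: corrective actions are attempted and counted, and once the assumed threshold is reached the loop escalates instead of retrying.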
A. Failure Detection and Resolution
The use cases for vMES include restarting hung
processes (caused by memory leaks or high disk I/O) and
tracking the number of restart attempts made by SHM.
Another important use case is to detect hardware failures
and move the virtual machines, with their application
workloads, to a different physical or virtual host. Any
error during this migration is reported and escalated.
Highly available applications can suffer performance
degradation, so automatic provisioning of new virtual
machines helps maintain performance and uptime. Cron
jobs that are not designed to run in parallel can be
moved to another machine after a hardware failure
without human intervention. Restarting a non-responding
host after identifying and verifying SSH login failures,
for example, falls into a similar category. Failure
detection also extends to security failures such as DoS
attacks on the network, where SHM creates iptables rules
or TCP wrappers entries on the fly to block unknown IP
domains or rogue clients. In case of web services, the
SHM adds IP addresses to the web server’s blacklist
database. Another major use case concerns database
systems: SHM automatically adjusts DB data propagation
settings or data transfer limits based on the Internet
bandwidth available at the plant. SHM removes bad nodes
from DB clusters to avoid application failures, detects
slow SQL queries, and analyzes and optimizes DB tables to
reduce disk fragmentation. Log indexing is another area:
the need is to detect fast growth in system or application
logs and run rotation scripts before disk space problems
occur. This is done by tracking changes in the open file
descriptors of each participating process and logging the
rate of growth. Monitoring rapid or steady growth can lead
to further actions, such as aborting the offending process.
SHM also tracks the number of network connections or
concurrent service requests against system load and applies
hierarchy-based process prioritization when problems arise.
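
The log growth check can be illustrated with a small sketch. It is a simplification under stated assumptions: it samples a log file's size directly instead of walking each process's open file descriptors, and the function names and the two thresholds (`rotate_at`, `abort_at`, in bytes per second) are hypothetical values chosen for the example.

```python
import os


def sample_size(path, now):
    """Take one (timestamp, size in bytes) sample of a log file."""
    return (now, os.path.getsize(path))


def growth_rate(samples):
    """Average growth in bytes per second between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    elapsed = t1 - t0
    return (s1 - s0) / elapsed if elapsed > 0 else 0.0


def choose_action(samples, rotate_at=1_000.0, abort_at=100_000.0):
    """Map a growth rate to an action; both thresholds are assumed values."""
    rate = growth_rate(samples)
    if rate >= abort_at:
        return "abort_process"   # runaway growth: stop the offending process
    if rate >= rotate_at:
        return "rotate_logs"     # fast growth: rotate before the disk fills
    return "none"
```

A monitor could collect samples with `sample_size(path, time.time())` at a fixed interval and pass the recent window to `choose_action`; the returned string would then be mapped to a rotation script or a process abort.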
B. Automated Troubleshooting
A foundational use case for SHM is to raise the
capability of tier (n-1) support so that it can
troubleshoot problems without tier (n) support. Specific
use cases in this category include capturing the disk
usage of the system and applications at appropriate
directory depth levels and posting the results to a bug
tracking system automatically. Another is to detect web
service errors and capture the logs for that time range
for future reference. This includes detecting core dumps
or thread dumps and analyzing core files with debugger
tools. Troubleshooting of out-of-memory conditions or
high load averages is made faster by analyzing system data
with tools such as sar, netstat, lsof, vmstat and top.
Data migration failures and error logs from applications
or DB servers are correlated to arrive at the root cause
of the problem being debugged.
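
As an illustration of capturing disk usage at a given depth level, the sketch below aggregates file sizes per directory down to a chosen depth, similar in spirit to `du --max-depth`. The function name `disk_usage_by_depth` is hypothetical, and posting the result to a bug tracking system is left out because the paper does not name one.

```python
import os


def disk_usage_by_depth(root, max_depth=1):
    """Return {directory: total bytes} for directories up to max_depth below root.

    Files deeper than max_depth are rolled up into their ancestor at max_depth.
    """
    root = os.path.abspath(root)
    totals = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        size = 0
        for name in filenames:
            try:
                size += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == "." else rel.split(os.sep)
        # Credit this directory's files to itself and to every ancestor
        # that lies within the requested depth.
        for depth in range(min(len(parts), max_depth) + 1):
            key = os.path.join(root, *parts[:depth])
            totals[key] = totals.get(key, 0) + size
    return totals
```

The resulting dictionary can be sorted by size and formatted as the body of a report, so that tier (n-1) support sees which directories are consuming space without logging in to the host.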
C. Advanced Continuous Delivery
The continuous delivery (CD) use case starts with building
and testing applications with build tools (e.g. Jenkins)
and provisioning a new host with the complete application
stack, followed by a roll-forward or rollback based on
performance data. It also covers provisioning new Test
Stations (virtual machines with application libraries) in
the plant that will run “atomic” actions. Reliable
deployment of new software with graceful rollback, system
snapshots for debugging, and backup of old data to
preconfigured destinations (another host) when disk space
is at a premium also fall under this category.
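
The decision to roll forward or roll back based on performance data might look like the sketch below. The paper does not say which metrics or tolerances are used, so the `PerfSample` fields and the default limits are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfSample:
    """Hypothetical performance summary for one deployment."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile response time in milliseconds


def release_decision(baseline, candidate,
                     max_error_increase=0.01, max_latency_ratio=1.2):
    """Return "roll_forward" or "rollback"; both tolerances are assumed values."""
    if candidate.error_rate > baseline.error_rate + max_error_increase:
        return "rollback"   # the new build fails noticeably more often
    if candidate.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"   # the new build is noticeably slower
    return "roll_forward"
```

In this sketch the baseline would come from the currently deployed stack and the candidate from the newly provisioned host, so the comparison is made before traffic is moved permanently.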
III. FRAMEWORK OVERVIEW
The SHM system has loosely coupled components that
communicate over message queues and scale horizontally
to deliver this automation. The components include:
• Event listeners receive events from monitoring
systems, external applications or DevOps tools.
• Event triggers generate an action or a workflow.
• Actions are SSH commands, REST API calls or
custom Python scripts run when an event is triggered.
• Rules map triggers to actions.
• Workflows are sets of actions, defined in order and
with conditions, that pass data between actions.
• Action logs are recorded for future analysis
and debugging.
Event listeners receive event messages from various
external systems. Event triggers generate actions based
on rules, and the actions are placed in a message queue.
Actions generated by workflows are also placed in message
queues. Workers pick up the queued actions, perform them
and log the results for analytics and availability reports.
Figure 2: Self Healing Module
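
The flow from event to rule to queued action to worker can be sketched in a few lines. This is an illustrative model only: an in-process `queue.Queue` stands in for the message queue, actions are plain Python callables rather than SSH or REST calls, and the names `Rule`, `dispatch` and `run_worker` are hypothetical.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Maps a trigger (an event type) to one or more action names."""
    trigger: str
    actions: List[str]   # more than one entry behaves like a simple workflow


def dispatch(event: Dict, rules: List[Rule], action_queue: queue.Queue) -> int:
    """Event trigger: enqueue the actions of every rule matching the event."""
    queued = 0
    for rule in rules:
        if rule.trigger == event["type"]:
            for name in rule.actions:
                action_queue.put((name, event))
                queued += 1
    return queued


def run_worker(action_queue: queue.Queue,
               registry: Dict[str, Callable[[Dict], str]],
               log: List[Dict]) -> None:
    """Worker: drain the queue, run each action and record the result."""
    while True:
        try:
            name, event = action_queue.get_nowait()
        except queue.Empty:
            return
        try:
            result, status = registry[name](event), "ok"
        except Exception as exc:   # a failed action must not stop the worker
            result, status = repr(exc), "failed"
        log.append({"action": name, "host": event.get("host"),
                    "status": status, "result": result})
```

Because listeners and workers only share the queue, more workers can be added without changing the listeners, which is the loose coupling and horizontal scaling the section describes. The `log` list plays the role of the action logs used for later analysis.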
IV. FUTURE OUTLOOK
SHM is planned to integrate with IoT-based manufacturing
equipment in order to capture logs and run diagnostics.
This can reduce the time taken to repair equipment by
providing crucial information to the technicians or the
software actors.
