Autonomic Computing and Self-Healing Systems
Colorado State University
Dr. France / Dr. Georg
Self-adapting systems are an important part of many critical systems. While the
ideas and models for self-adapting systems have been around for quite a while,
current research and development has made many strides towards true self-healing,
auto-modifying systems. These systems can be designed to adapt at run-time to the
needs of the underlying system, based on changes in the system as a whole.
Self-adapting systems have both advantages and disadvantages, some of which are
general to all self-adapting systems and others that
are specific to implementations. This paper will include a thorough description of several
systems as well as the specific advantages, disadvantages and known issues along with
There are four characteristics of self-managing systems: self-configuration, self-
optimization, self-healing and self-protection. For the system to be complete in its
implementation all four of the characteristics must be satisfied. In the current
environment this has proven to be a difficult endeavor. Addressing all four points can
make a system grow to an unwieldy size and level of complication. There have been
many advances in the design of tools and implementations in recent years that have made
these autonomic, self-healing systems close to a reality instead of a pipe dream.
The development of these self-managing systems is built
around Model Driven Engineering (MDE). “In MDE, a model is an abstraction or reduced
representation of a system that is built for specific purposes”. This reduced abstraction
simplifies the system design so that the overall behavior and interactions can be mapped
out. Once this overview is completed, appropriate self-representations become important
to the continuation of the development. “It is critical that such representations be causally
connected”. This is important because (1) “the model as interrogated should provide
up-to-date and exact information about the system to drive subsequent adaptation
decisions; and (2) if the model is causally connected, then adaptations can be made at the
model level rather than at the system level”.
In order to achieve these systems, developers have had to both develop the tools
for designing the systems and use the tools to build the actual systems. This paper will
begin with a thorough description of autonomic computing and self-healing systems. It
will then describe the tools utilized and will follow with detailed descriptions of several
past and current systems that incorporate the self-managing ideals.
“The term autonomic computing was first used by IBM in 2001 to describe
computing systems that are said to be self-managing”. Autonomic computing is
centered on the idea of removing human intervention in the system. The main goal is to
design and then develop systems that will adapt to changes in their environment on their
own. The description of autonomic computing from IBM compared it to the complexity
of the human body and the autonomic nervous system because of its self-managing
“A system with autonomic capabilities installs, configures, tunes and maintains its
own components at runtime”. IBM describes the four main properties of self-
management as self-configuration, self-optimization, self-healing and self-protection.
Self-configuration means that a system can reconfigure itself based on high-level goals
and modeling. Self-optimization means that a system will make the best use of its
resources. Self-healing is the ability to detect, diagnose and repair issues or problems.
Self-protection means that a system can protect itself from malicious attacks and from
unintended or inadvertent changes.
There are five levels of autonomicity proposed by IBM as the Autonomic
Computing Adoption Model Levels. Level 1 is the Basic level. At this level, the system
elements are managed by highly skilled staff and changes are made manually. Level 2 is
referred to as the Managed level. At this level, the system monitors itself and is
intelligent enough to reduce some of the burden on the system administrators. The
Predictive level, level 3, is characterized by the system using behavioral modeling to
recognize system-wide behavioral patterns and to suggest fixes to the IT staff. Level 4 is
the Adaptive level. At this level, human interaction is minimized and the tools that were
used at level 3 are automated further so that the burden on the IT staff is minimal. The
fully Autonomic level is the final level. At level 5, systems are able to self-manage almost
all functionality related to the needs of the system.
The basic building block in an autonomic system is the Autonomic Element (AE).
An autonomic element is a software-based component that is responsible for managing
sub-systems. “Autonomic elements may cooperate to achieve a common goal … [such
as] servers in a cluster optimizing the allocation of resources to application to minimize
the overall response time or execution time of the applications”. Autonomic
elements implement the MAPE-K loop as the control loop for managing the sub-
system. A complete description of the MAPE-K loop appears in the Tools section of
this paper.
Variability models can be used to build autonomic elements and systems at
runtime. “The use of variability models at runtime brings new opportunities for
autonomic capabilities by reutilizing the efforts invested at design time”. Systems
leveraging the variability model can use knowledge of the design to attain autonomic
modifications at compile-time and can further use system modeling to self-modify at
runtime. Standards such as the meta-data exchange allow the models that were used at
design-time to be reused at run-time. The drawback of variability models
is the potential for exponential growth in the number of possible state transitions. “In
order to manage variability and avoid the combinatorial explosion of artifacts needed to
support this variability, [software tools] focus on variation points and variants instead of
focusing on whole configurations”. A very specific type of autonomic system is the self-
healing system. These systems are often designed using the variability model.
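The scale of the combinatorial explosion mentioned above can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that every variation point is chosen independently and that no constraints rule out any combination; the function names and the example system are hypothetical.

```python
from math import prod


def count_configurations(variants_per_point):
    """Whole configurations, assuming each variation point is chosen independently."""
    return prod(variants_per_point)


def count_variants(variants_per_point):
    """Individual variants that have to be modeled when working per variation point."""
    return sum(variants_per_point)


# Hypothetical system: 10 variation points with 3 variants each.
points = [3] * 10
print(count_configurations(points))  # 59049 whole configurations
print(count_variants(points))        # 30 variants
```

Under these assumptions, modeling variants grows linearly with the number of variation points while enumerating whole configurations grows exponentially, which is the motivation given in the quotation.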
“A self-healing system is one that: replaces traditional error messages with robust
error detection, handling and correction that produces telemetry for automated diagnosis,
provides automated diagnosis and response from the error telemetry for hardware and
software entities, provides recursive fine-grained restart of services based upon
knowledge of their dependencies, presents simplified administrative interactions for
diagnosed problems and their effects on services and resources”. There are several
ways that self-healing systems can be implemented. The simplest implementation is
through redundancy. Adding duplicate components for critical systems, all the way up to
duplication of the entire system, can allow for fail-safe operations. The issue is
that this approach is inherently wasteful of resources: the redundant components could
be used productively with a more innovative implementation. In addition,
redundancy only addresses total failures of components in the system. It does not address
degradation in the system. To heal these types of issues, a more robust solution was
In many implementations of self-healing systems, a multi-faceted approach is
needed. “Two distinct elements are required for the development of self-healing systems.
First, an automated or semi-automated agent must be present to make the decision of
when and how to affect repair on a system. Second, an infrastructure for actually
executing the repair strategy must be available to that agent”. The use of managers is
the favored approach. Solaris 10, for example, uses managers to address faults and
service issues. The fault manager uses a system-level model to determine when a failure or
degradation has occurred and searches through a dynamic list of solutions to determine
the most suitable one. The service manager restarts services and
applications that have failed or degraded below a predetermined threshold. The
application gives this threshold to the service manager when it enters use
on the system.
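The threshold-based restart behavior described above can be sketched as follows. This is a minimal illustration and not the Solaris implementation: the `Service` and `ServiceManager` classes, the numeric health score and the `probe` and `restart` callbacks are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Service:
    """A managed service; every field is a hypothetical illustration."""
    name: str
    threshold: float             # minimum acceptable health, supplied by the application
    probe: Callable[[], float]   # returns the current health score (higher is better)
    restart: Callable[[], None]  # repair action the manager may invoke


class ServiceManager:
    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def register(self, service: Service) -> None:
        """Called when an application enters use and hands over its threshold."""
        self._services[service.name] = service

    def check_once(self) -> List[str]:
        """Restart every service whose health is below its own threshold."""
        restarted = []
        for service in self._services.values():
            if service.probe() < service.threshold:
                service.restart()
                restarted.append(service.name)
        return restarted


# Simulated service whose health recovers after a restart.
state = {"health": 0.2}
manager = ServiceManager()
manager.register(Service("web", 0.5, lambda: state["health"],
                         lambda: state.update(health=1.0)))
print(manager.check_once())  # ['web']
print(manager.check_once())  # []
```

Because each service carries its own threshold, the manager needs no application-specific knowledge beyond what was registered.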
A very popular approach to developing self-healing systems is architecture-
centric. “An architectural style is a collection of constraints on components, connectors
and their configurations targeted towards a family of systems with shared characteristics”.
In an architecture-based self-healing system, the changes needed to repair a running
system have to be machine readable by the underlying system as well as the describing
systems. This machine-readable change instruction is referred to as an architectural
difference or diff. A diff describes the difference in the system before and after the repair.
A diff is composed of components, links, connectors and interfaces.
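One way to picture an architectural diff is as a set of additions and removals for each kind of element. The sketch below assumes a deliberately simple representation, in which an architecture is a set of names per element kind and a link is a pair of names; it is not the format used by any particular architecture description language, and all function names are hypothetical.

```python
KINDS = ("components", "connectors", "interfaces", "links")


def make_architecture(**elements):
    """Build an architecture as one immutable set of elements per kind."""
    return {kind: frozenset(elements.get(kind, ())) for kind in KINDS}


def diff(before, after):
    """Describe what must be added and removed to turn `before` into `after`."""
    return {
        kind: {"add": after[kind] - before[kind],
               "remove": before[kind] - after[kind]}
        for kind in KINDS
    }


def apply_diff(architecture, change):
    """Apply a diff to an architecture and return the repaired architecture."""
    return {
        kind: (architecture[kind] - change[kind]["remove"]) | change[kind]["add"]
        for kind in KINDS
    }


before = make_architecture(
    components={"client", "server"},
    connectors={"rpc"},
    links={("client", "rpc"), ("rpc", "server")},
)
after = make_architecture(
    components={"client", "server", "backup"},
    connectors={"rpc"},
    links={("client", "rpc"), ("rpc", "backup")},
)
repair = diff(before, after)
print(sorted(repair["components"]["add"]))   # ['backup']
print(sorted(repair["links"]["remove"]))     # [('rpc', 'server')]
print(apply_diff(before, repair) == after)   # True
```

Because the diff is plain data, the same object can be inspected by an analysis tool before it is handed to whatever mechanism executes the repair.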
According to Mikic-Rakic et al., “self-healing systems should satisfy:
adaptability, dynamicity, awareness, autonomy, robustness, distributability, mobility and
traceability”.
1. Adaptability: The system must allow changes to both the static
   and dynamic portions of the system.
2. Dynamicity: Addresses the run-time changes that a system is
   able to make.
3. Awareness: The architectural style must allow for self-
   monitoring in the system.
4. Autonomy: Autonomy is achieved when the system is
   able to address the anomalies it discovers.
5. Robustness: The architectural style should allow for the system
   to respond to unforeseen conditions.
6. Distributability: The system must have good performance in
7. Mobility: The architecture should allow for modifications to
   the location of components in the system.
8. Traceability: A system should allow for a direct correlation
   between the model and the run-time execution.
Some of these requirements are basic building blocks for systems and are used to enforce
a system level hierarchy on the data-flow and basic structure of the system while others
administer the dynamic changes to the system based on the data-flows. These dynamic
indicators analyze specific aspects of the system and address and implement the needed
changes. “The ability to dynamically repair a system at runtime based on its architecture
requires several capabilities: the ability to describe the current architecture of the system;
the ability to express an arbitrary change to that architecture that will serve as a repair
plan; the ability to analyze the result of the repair to gain confidence that the change is
valid; and the ability to execute the repair plan on a running system without restarting the
system”.
While self-healing systems are an innovative idea, there are still many issues in
their design and implementation that must be overcome in order for them to develop into
an integral part of a computer system. “Self-healing functionality for users and
administrators of a modern operating system [must] provide fine-grained fault isolation
and restart where possible of any component –hardware or software – that experiences a
problem”. Without this fine-grained fault isolation, fixing the problem becomes
overkill in most situations. Any general fault will be addressed with a similar approach:
restart the component or application. This is not always an optimal or applicable solution.
In real-time and critical systems, the downtime required for
such a heavy-handed solution is often not available. With a finer level of fault isolation,
small problems can be fixed in a way that does not cripple the system even for a short
period.
A second issue is tool integration. Seamless integration is “especially important in
the context of self-healing systems since no human can be involved in manually
transforming tool outputs or invoking tools”. This integration has to be performed
with multiple tools. Current self-healing systems are so complex that no single tool
encompasses all the needs. With multiple tools added to an already complex system, the
system can become unwieldy. To ameliorate this complexity, using tools that have been
thoroughly tested is of the utmost importance. In addition, the
complexity of fix models can grow exponentially as the system grows. Predetermined
solutions and exhaustive solution searches can also lead to problems. The solution space
can grow exponentially, and determining the best solution to a problem can be subjective,
whereas computing systems are only capable of making objective decisions.
The third issue facing self-healing systems follows from a potential solution to the
previously described problem. Building solutions to problems at runtime or at
component/application integration time based on detailed models of the system is a
potential solution, but the main issue with this is that “in an open system, upfront system
analysis is at best of limited, heuristic usefulness”. Because open
systems tend to be dynamic and are molded by their environment, design-time models
become less relevant as the system grows and changes during runtime. While these
design-time models are important, a combination of all types of solutions is needed to
build robust systems that can grow and morph into what is truly needed.
There are many tools designed for building autonomic and self-healing systems.
The first and most widely used is the MAPE-K loop. MAPE-K stands for
monitor, analyze, plan, execute and knowledge. The monitor “collates and aggregates
information received into the system and attempts to characterize any symptoms relating
to the way the system is running”. The analyze phase processes the symptoms to
determine if the issues at hand need to be addressed. At the plan step, the system decides
which changes to make and how to make them. The execute stage
is where the plan is implemented and activated. Knowledge is not a fifth step in the
cycle but the data shared by the other four; after a completed execute phase, the results
recorded there are used to determine whether the plan was
successful in bringing about the necessary changes.
Figure 1. MAPE-K Loop Design 
The MAPE-K loop design was first proposed by IBM as a solution to autonomic
computing design. Systems built with the MAPE-K ideal tend to be robust. “The MAPE-
K loop is controlled by a manager, an embedded part of the autonomic element that
coordinates the individual activities”. This manager is an integrated part of the
underlying system. Often systems will have several managers running the MAPE-K loop
simultaneously, autonomously and in a distributed fashion. The way that the autonomic
managers interact with each other is determined by the autonomic computing architecture.
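A single manager running the monitor, analyze, plan and execute steps can be sketched in a few lines. The example below is an illustration under stated assumptions rather than IBM's design: the managed resource is a hypothetical worker pool, the only symptom is load per worker, and the knowledge part is modeled as a plain dictionary that the steps read and update.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ManagedResource:
    """Hypothetical managed element: a pool of workers serving some load."""
    load: float
    workers: int


@dataclass
class AutonomicManager:
    resource: ManagedResource
    max_load_per_worker: float = 10.0
    knowledge: dict = field(default_factory=lambda: {"history": []})

    def monitor(self):
        # Collect raw data and turn it into a symptom.
        return {"load_per_worker": self.resource.load / self.resource.workers}

    def analyze(self, symptoms):
        # Decide whether the symptom needs to be addressed at all.
        return symptoms["load_per_worker"] > self.max_load_per_worker

    def plan(self, symptoms):
        # Work out how many workers are needed to get back under the limit.
        needed = math.ceil(self.resource.load / self.max_load_per_worker)
        return {"add_workers": needed - self.resource.workers}

    def execute(self, change_plan):
        self.resource.workers += change_plan["add_workers"]

    def run_once(self):
        """Run one pass of the loop; return True if a change was made."""
        symptoms = self.monitor()
        self.knowledge["history"].append(symptoms)
        if not self.analyze(symptoms):
            return False
        change_plan = self.plan(symptoms)
        self.execute(change_plan)
        self.knowledge["last_plan"] = change_plan
        return True


pool = ManagedResource(load=45.0, workers=2)
manager = AutonomicManager(pool)
print(manager.run_once(), pool.workers)  # True 5
print(manager.run_once(), pool.workers)  # False 5
```

The second pass finds nothing to repair, which is how the recorded history shows that the earlier plan had the intended effect.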
This integration of several managers leads to the next logical step of developing
autonomic software product lines (ASPLs). ASPLs can self-manage a large and complex
system and interact with other systems both local and distributed in order to deal with
product variations and dynamic system changes. There has been a good deal of research
into the ASPL concept and the Software Engineering Institute (SEI) has developed a
framework. This framework divides systems into three general categories: core assets
development, product development and management. The core assets are the basic
components in the software product line (SPL). They can range from business artifacts to
reference architectures. Product development covers what is built with the core assets:
the larger portions of the system that perform the objectives of the design. The final
portion is management. Management provides maintenance for all product
developments and monitors the system to determine, based on models, where potential
problems are likely to occur.
“Fractal is an advanced component model and associated on-growing [sic]
programming and management support devised initially by France Telecom and INRIA
since 2001”. The Fractal model is based on component-based software engineering. It
makes use of components; interfaces, which are interaction points between those
components; and bindings, which are the communication channels between components.
Fractal also uses the concepts of membranes and contents. “The membrane exercises an
arbitrary reflexive control over its content”. A membrane is made up of a set of
controllers. “The model is recursive with sharing at arbitrary levels”. The model is
programming-language independent and extensible. Bindings are controlled through the
specific programming. “The Fractal project targets the development of a reflective
component technology for the construction of highly adaptable and reconfigurable
distributed systems”. Fractal enforces a limited number of architectural structures.
This allows the systems to be more robust, as the specified component need not exist at
runtime in order for management to succeed.
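The vocabulary of components, interfaces, bindings, membranes and contents can be made concrete with a small sketch. The code below is not the Fractal API and does not follow any of its implementations; the `Component` class and its `add`, `bind`, `start` and `call` methods are hypothetical names chosen to mirror the concepts described above.

```python
class Component:
    """Hypothetical component with a membrane (control methods) and a content."""

    def __init__(self, name, provides=None):
        self.name = name
        self.provides = dict(provides or {})  # server interfaces: name -> callable
        self.bindings = {}                    # client interface -> (target, server interface)
        self.content = []                     # sub-components (the model is recursive)
        self.started = False

    # Membrane: control operations over the content and the bindings.
    def add(self, child):
        self.content.append(child)

    def bind(self, client_itf, target, server_itf):
        if server_itf not in target.provides:
            raise KeyError(f"{target.name} does not provide {server_itf!r}")
        self.bindings[client_itf] = (target, server_itf)

    def start(self):
        for child in self.content:
            child.start()
        self.started = True

    # Functional side: calls travel along bindings.
    def call(self, client_itf, *args):
        target, server_itf = self.bindings[client_itf]
        return target.provides[server_itf](*args)


console = Component("console", {"log": lambda msg: f"console: {msg}"})
file_log = Component("file", {"log": lambda msg: f"file: {msg}"})
app = Component("app")
root = Component("root")
for child in (app, console, file_log):
    root.add(child)

app.bind("logger", console, "log")
root.start()
print(app.call("logger", "hello"))   # console: hello
app.bind("logger", file_log, "log")  # reconfigure the binding at runtime
print(app.call("logger", "hello"))   # file: hello
```

Rebinding the `logger` interface changes the behavior of `app` without touching its code, which is the kind of runtime reconfiguration a component model is meant to support.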
The final tool analyzed is SmartAdapters. SmartAdapters are used to decrease the
complexity of dynamically adaptive systems. The first step in the use of SmartAdapters is
the maintenance of a high-level representation model of the running systems. Maintaining
the high-level model allows for a quicker and more thorough response to issues as they
arise. “SmartAdapters automatically generate an extensible Aspect-Oriented Modeling
framework specific to [the] metamodel”. These metamodels are used to control the
potentially exponential growth of solutions. “Using Aspect-Oriented weavers, whole
configurations can be built on-demand by selecting a set of aspects in practice [using]
SmartAdapters”. Components that occur in all configurations are the base models
and are used to weave the aspects of the system.
“In SmartAdapters, an aspect is composed of three parts: i) an advice model,
representing what [to] weave, ii) a pointcut model, representing where [to] weave the
aspect and iii) weaving directives specifying how to weave the advice model at the join
points matching the pointcut model”. The advice model is a portion of the model that
is potentially having an issue. The pointcut model also represents a portion of the model
but it is described by the roles in the system. The weaving directives specify how to
weave an aspect building from the advice model to the pointcut model using the domain-
specific language of the system.
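The three parts of an aspect named in the quotation can be illustrated with a toy weaver. This sketch does not use SmartAdapters or its domain-specific language: the dictionary-based model, the `Aspect` class and the `weave` function are assumptions made for the example, with the pointcut expressed as a predicate over element roles.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict

Model = Dict[str, dict]  # element name -> {"role": str, "uses": [element names]}


@dataclass
class Aspect:
    advice: Model                            # what to weave: a model fragment
    pointcut: Callable[[str, dict], bool]    # where to weave: selects join points
    directive: Callable[[Model, str], None]  # how to weave at one join point


def weave(base: Model, aspect: Aspect) -> Model:
    """Return a new model with the aspect woven in; the base model is untouched."""
    woven = copy.deepcopy(base)
    join_points = [name for name, element in base.items()
                   if aspect.pointcut(name, element)]
    if join_points:
        woven.update(copy.deepcopy(aspect.advice))
    for name in join_points:
        aspect.directive(woven, name)
    return woven


base = {
    "web": {"role": "server", "uses": ["db"]},
    "api": {"role": "server", "uses": ["db"]},
    "db": {"role": "storage", "uses": []},
}
caching = Aspect(
    advice={"cache": {"role": "cache", "uses": ["db"]}},
    pointcut=lambda name, element: element["role"] == "server",
    directive=lambda model, name: model[name]["uses"].insert(0, "cache"),
)
woven = weave(base, caching)
print(woven["web"]["uses"])  # ['cache', 'db']
print(base["web"]["uses"])   # ['db']
```

Selecting a different set of aspects and weaving them into the same base yields a different configuration on demand, without storing every configuration in advance.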
Early self-managing projects were funded by DARPA for the military. The first
was the Situational Awareness System (SAS). It was created to aid in communication
between soldiers on the battlefield. The communication devices had to be durable and
able to deal with harsh conditions and potential jamming of the communication channels.
The design of the system was a distributed peer-to-peer system with self-healing
“The DARPA Self-Regenerative Systems program started in 2004 is a project that
aims to develop technology for building military computing systems that provide critical
functionality at all times, in spite of damage caused by unintentional errors or attacks”.
The four aspects of this project are (1) software made resistant to errors and attacks,
(2) binary code that is modifiable to make attacks harder when trying to exploit
vulnerabilities, (3) a scalable architecture that is intrusion tolerant and (4) the ability to
build systems that can attempt to detect malicious inside users and block attempts to
attack the system.
NASA, in 2005, began work on the Autonomous NanoTechnology Swarm
(ANTS). The project was designed to launch a swarm of 1000 small spacecraft into an
asteroid belt and use the information gathered to determine which asteroids were
interesting enough for further investigation. The spacecraft would be required to use
autonomic techniques to continually elect a leader and rebuild communication channels.
MAPE-K implementations include the Autonomic toolkit, ABLE, Kinesthetics
eXtreme and Self-Management Tightly Coupled with Application. The Autonomic
toolkit is a prototype implementation of the MAPE-K loop built in Java but able to
communicate through XML with other applications. ABLE is also a toolkit designed by
IBM, but it is intended for use in multi-agent systems that need self-management
implementations. Kinesthetics eXtreme is a complete autonomic loop designed mainly in
Java that is focused on adding autonomic abilities to legacy systems that may not have
been designed with autonomic capabilities. Finally, Self-Management Tightly Coupled
with Application is a project with the goal of developing middleware frameworks that
offer self-management functionality to applications.
There are currently eight platforms that support Fractal components in multiple
programming languages. “Julia was historically (2002) the first Fractal implementation”.
It was developed and used by France Telecom. It is based in Java and was
developed to prove that component-based systems did not have to perform inefficiently.
“Think is a C implementation of Fractal”. Think is also available through France
Telecom but has development assistance from STMicroelectronics. Think is used to
build kernels of all sorts. The kernels range from exo-kernels and micro-kernels to low-
memory complete operating systems. “ProActive is a distributed and asynchronous
implementation of Fractal targeting grid computing”. France Telecom developed
ProActive as middleware for parallel, concurrent and distributed computing grids. It is
object based and allows for asynchronous deployment and management. “AOKell is a
Java implementation by INRIA Jacquard”. It is similar to Julia but based on AOP,
using membranes for load-time weaving. Its performance is similar to that of Julia.
“FractNet is a .Net implementation of the Fractal component model developed by the
LSR laboratory”. It is a port of AOKell to the .Net platform and is similar in design and
performance. “Flone is a Java implementation of the Fractal component model
developed by INRIA Sardes for teaching purposes”. It is not a full implementation but
instead is a group of APIs that simplify the Fractal model so that it is more easily
understood by students. “FracTalk is an experimental Smalltalk implementation of the
Fractal component model developed at Ecole des Mines de Douai”. FracTalk focuses
on dynamic elements in component-based programming. “Plasma is a C++ experimental
implementation of Fractal developed at INRIA Sardes”. It is dedicated to building
multimedia applications that are self-adaptive. Fractal also has a complete repertoire of
open components for middleware and operating systems.
The smart home feature model is an autonomic computing solution described by
Cetina et al. The design is such that a smart home can be fully automated and yet still
dynamically adjust to the changing patterns of the residents and the addition and removal
of components. The goal of autonomic computing for smart homes is “to reduce …
configuration effort, [so that] smart homes can provide the following autonomic
capabilities: Self-configuration. New kinds of devices can be incorporated into the
system; Self-healing. When a device is removed or fails, the system should adapt itself to
offer its services using alternative components; Self-adaptation. Users’ needs differ and
change over time. The system should adjust its services to fulfill user preferences”.
This behavior is similar to context adaptation. The Model-Based Reconfiguration Engine
(MoRE) is used to implement the management of the models used in the system. The
operations of the engine are used to determine how to evolve the system to meet future
needs and reconfigurations. Reconfiguration actions fall into three categories:
1. Component actions: components that must be installed, uninstalled or
2. Channel actions: Communications for active components
3. Model actions: updates to the MoRE model after the component and channel
actions occur. 
Figure 2. Smart Home Model 
Because any change to the system can trigger a need for a change to the model,
the high-level models must be maintained and updated. This allows the system to
continually gather and process information about the dynamicity of the system and
affords the MoRE the ability to develop and implement solutions.
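The three categories of reconfiguration actions can be sketched as a comparison between the current configuration and a target configuration. The code below is an illustration and not MoRE itself: the dictionary layout, the action names (`install`, `uninstall`, `open`, `close`, `update`) and the ordering of the actions are assumptions made for the example.

```python
def plan_reconfiguration(current, target):
    """Return an ordered list of (category, action, subject) tuples.

    Channels to departing components are closed before those components are
    uninstalled, and new channels are opened only after new components are
    installed. The model action always comes last.
    """
    actions = []
    for channel in sorted(current["channels"] - target["channels"]):
        actions.append(("channel", "close", channel))
    for component in sorted(current["components"] - target["components"]):
        actions.append(("component", "uninstall", component))
    for component in sorted(target["components"] - current["components"]):
        actions.append(("component", "install", component))
    for channel in sorted(target["channels"] - current["channels"]):
        actions.append(("channel", "open", channel))
    if actions:
        actions.append(("model", "update", None))
    return actions


# Hypothetical smart home: a failed motion sensor is replaced by a camera.
current = {
    "components": {"motion_sensor", "alarm"},
    "channels": {("motion_sensor", "alarm")},
}
target = {
    "components": {"camera", "alarm"},
    "channels": {("camera", "alarm")},
}
for action in plan_reconfiguration(current, target):
    print(action)
```

For this example the plan closes the old channel, uninstalls the motion sensor, installs the camera, opens the new channel and finally updates the model, so the model reflects the system only after the component and channel actions have occurred.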
Self-healing and autonomic systems have begun to integrate themselves into
many more generalized computing systems. The tools that are used to build these systems
have been developed and optimized to make the best use of the underlying models and
component architectures. The MAPE-K loop and the Fractal modeling system are two of
the most accepted and widely used tools for developing these autonomic systems.
The MAPE-K loop, proposed by IBM, is used to develop manager-driven systems
that interact to build solutions. These solutions are then integrated into the running
system. Many active self-healing systems are based on MAPE-K. These include the
Autonomic toolkit, ABLE, Kinesthetics eXtreme and Self-Management Tightly Coupled
with Application. All of these implementations have well-documented success.
The Fractal modeling system, based on component models, was designed and
implemented by France Telecom and has been used to implement multiple systems
across many different languages. The Fractal modeling system is much more complicated
than the MAPE-K loop but the systems appear to be more straightforward to implement.
These and other tools have been used to build multiple systems ranging from
smart homes to deep-space multi-object space probes. The unifying aspect of all the
systems built to be autonomic and self-healing is that they tend to be complicated. This
level of complexity can grow exponentially as the size of the system grows. To overcome
this complexity, models and metamodels are often used to describe the runtime systems.
These models are built to speed the processing of solutions but as often as not just add
more complexity and sprawl to the system.
While all of the tools described and used to build the implementations of self-
healing systems are useful and well-developed, they would likely work best in a
conglomeration. By merging the best aspects of the tools and design models,
functionality will grow at a faster rate than the complexity of the implementation.
Managers used in the MAPE-K loop could be utilized in the MoRE to build more
cohesive systems. These systems and managers using the Fractal model management
ideals would be more likely to have lower reaction time to changes in the environment
and additionally would suffer from less model creep when describing the potential
solution space. In addition to this merger of the best portions from multiple tools, there is
a potential to use database-style storage of known good configurations as they are
implemented along with new advanced and streamlined searching technologies to
expedite the implementation of changes as needed.
Autonomic systems and self-healing systems are going to become a part of most
systems as the computing industry embraces the ideas and realizes the inherent good in
the design. These systems will become more complex and all-encompassing as they
become more commonplace. New tools will have to be developed to aid in the
implementation of these new systems and a better understanding of system modeling will
be needed by all individuals that interact with the development of these systems. Industry
will embrace the usefulness of these systems, but for the implementations to
succeed, steps must be taken to merge the best aspects of the tools and to train
engineers on modeling and best practices.
Abbas, N.; Andersson, J.; Loewe, W. “Autonomic Software Product Lines (ASPL).”
Proceedings of the 7th international conference on Autonomic computing (ICAC
'10). ACM, New York, NY, USA, pp.324-331. 2010
Blair, G.; Bencomo, N.; France, R.B. “Models@ run.time.” Computer, vol.42,
no.10, pp.22-27, Oct. 2009
Cetina, C.; Giner, P.; Fons, J.; Pelechano, V. “Autonomic computing through reuse
of variability models at runtime: the case of the smart homes.” Computer, vol.42,
no.10, pp.37-43, Oct. 2009
Cheung-Foo-Wo, D.; Tigli, J.; Lavirotte, S.; Riveill, M. “Self-adaptation of event-
driven component-oriented middleware using aspects of assembly.” Proceedings
of the 5th international workshop on Middleware for pervasive and ad-hoc
computing. ACM, New York, NY, USA, pp.31-36. 2006
Coupaye, T.; Stefani, J-B. “Fractal component-based software engineering.”
Proceedings of the 2006 conference on Object-oriented technology: ECOOP
2006 workshop reader (ECOOP'06), Springer-Verlag, Berlin, Heidelberg, pp.117-
Dabrowski, C.; Mills, K. “Understanding self-healing in service-discovery
systems.” Proceedings of the first workshop on Self-healing systems (WOSS '02),
ACM, New York, NY, USA, pp.15-20. 2002
Dashofy, E.; Hoek, A.; Taylor, R. “Towards architecture-based self-healing
systems.” Proceedings of the first workshop on Self-healing systems (WOSS '02),
ACM, New York, NY, USA, pp.21-26. 2002
Fickas, S.; Hall, R. “Self-Healing Open Systems.” Proceedings of the first
workshop on Self-healing systems (WOSS '02), ACM, New York, NY, USA, pp.99-
George, S.; Evans, D.; Davidson, L. “A biologically inspired programming model
for self-healing systems.” Proceedings of the first workshop on Self-healing
systems (WOSS '02), ACM, New York, NY, USA, pp.102-104. 2002
Huebscher, M.; McCann, J. “A survey of autonomic computing—degrees, models,
and applications.” ACM Comput. Surv. 40, 3, Article 7 (August 2008), 28 pages.
Maoz, S. “Using Model-Based Traces as Runtime Models.” Computer, vol.42,
no.10, pp.28-36, Oct. 2009
Mengusoglu, E.; Pickering, B. “Automated management and service provisioning
model for distributed devices.” Proceedings of the 2007 workshop on Automating
service quality: Held at the International Conference on Automated Software
Engineering (ASE). ACM, New York, NY, USA, pp.38-41. 2007
Mikic-Rakic, M.; Mehta, N.; Medvidovic, N. “Architectural style requirements for
self-healing systems.” Proceedings of the first workshop on Self-healing
systems (WOSS '02), ACM, New York, NY, USA, pp.49-54. 2002
Morin, B.; Barais, O.; Jezequel, J.-M.; Fleurey, F.; Solberg, A. “Models@
Run.time to Support Dynamic Adaptation.” Computer, vol.42, no.10, pp.44-51,
Morin, B.; Fleurey, F.; Bencomo, N.; Jezequel, J.-M.; Solberg, A.; Dehlen, V.; Blair,
G. “An Aspect-Oriented and Model-Driven Approach for Managing Dynamic
Variability.” Proceedings of the 11th international conference on Model Driven
Engineering Languages and Systems (MoDELS '08), Springer-Verlag, Berlin,
Heidelberg, pp.782-796. 2008
Morin, B.; Barais, O.; Nain, G.; Jezequel, J. “Taming Dynamically Adaptive
Systems using models and aspects.” Proceedings of the 31st International
Conference on Software Engineering (ICSE '09). IEEE Computer Society,
Washington, DC, USA, pp.122-132. 2009
Shapiro, M. “Self-Healing in Modern Operating Systems.” Queue 2, 9, pp.66-75,
Weyns, D.; Malek, S.; Andersson, D. “FORMS: a formal reference model for self-
adaptation.” Proceedings of the 7th international conference on Autonomic
computing (ICAC '10). ACM, New York, NY, USA, pp.205-214. 2010