Simplified Cost Efficient Distributed System
MD. Nadim Hossain Sonet, S.M. Tahsinur Refat Emon
Department of Computer Science, American International University-Bangladesh
{nh.sonet, tr.emon}@yahoo.com
Abstract — Availability, Reliability and Dependability are desired features for any kind of distributed system. Replication is a well-known and widely used technique for achieving fault tolerance with the help of a monitoring service in a distributed system, which enhances service availability. A proxy is also a well-appreciated architectural element for distributed systems; it works transparently between clients and services, hiding the original roots of a running service.

We propose a Simplified Cost Efficient Distributed System architecture which relies on replication and recovery techniques using a monitoring service, a proxy service to handle service calls, and a specialized server architecture which serves as both a backup and a standby service provider.

This paper presents an overview of our proposed distributed system and demonstrates an alternative technique for replication and recovery.

Keywords — Cost Efficient, Monitor, Backup, Replication, Recovery.
I. INTRODUCTION
Reliability and availability are the main concerns when dealing with distributed systems and applications. These requirements are continuously increasing in scientific and commercial applications such as finance, industrial control, telecommunications and so on. One solution for achieving fault tolerance in a distributed system is to build software on top of fault-tolerant hardware. This is a viable solution for some domains, but it is tightly coupled and can hardly be reused or scaled; economic factors also prevent large-scale adoption of such approaches. More efficient software-based fault-tolerance techniques depend on replication, enhanced through management support for control and consistency. From the fault-tolerance perspective, the major advantage of a distributed system is that "it can be easily made redundant, which is the core of fault-tolerance techniques" [7].
In this paper, we propose a "Simplified Cost Efficient Distributed System" for a small office or business environment, though it can also be made to work for large-scale distributed systems. We consider the following general model of a distributed system: the system consists of a set of processes which run different services, connected through communication links; the connections are used by the processes to exchange data and messages. We consider different types of faults: service node failures, communication media failures, backup failures and transmission delays.
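As a minimal sketch, the general model above (processes running services, connected by communication links, with four fault classes) can be expressed as a small data model. The class and field names below are illustrative assumptions for exposition, not taken from our implementation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Fault(Enum):
    """The four fault classes considered in this paper."""
    SERVICE_NODE_FAILURE = auto()
    COMMUNICATION_MEDIA_FAILURE = auto()
    BACKUP_FAILURE = auto()
    TRANSMISSION_DELAY = auto()

@dataclass
class Process:
    """A process running one or more services."""
    name: str
    services: list = field(default_factory=list)

@dataclass
class Link:
    """A communication link used by two processes to exchange messages."""
    ends: tuple

# A two-node example: a service node and its backup, joined by one link.
system = {
    "processes": [Process("node-1", ["billing"]),
                  Process("backup-1", ["billing"])],
    "links": [Link(("node-1", "backup-1"))],
}
```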
Most recent research on distributed systems is directed towards large-scale implementations, but for that research to be truly successful, small and mid-range implementations should also be taken into consideration. Very little work is going on in this domain, so there remains a gap where distributed systems fail to fulfill the demand. This is why we are motivated to contribute to this domain; this paper presents our research work on small and mid-range distributed systems.
In our approach, the process of dealing with faults is split into distinct activities. Using an error detection policy, we identify that the system is in an invalid state, meaning some component or node in the system has failed. To do this, we rely on a real-time monitoring service for the distributed system, built from custom modules within the HAWKEYE Framework [1]. This service module performs background audits to determine whether a service node is functioning properly. To make sure that the effects of errors are limited, the failed components must be isolated so that their effects do not propagate further. A proxy service helps to abstract the internal complexity and failures from the client's viewpoint, which keeps the system simple from the top view. In the error recovery phase, errors and their effects are removed by restoring the system or service to a valid state. In a distributed system, a single fault may cause many cascading errors; through our replication and recovery service, if an error occurs or a node becomes unavailable, we recover it to an earlier valid state.
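The detection activity described above can be sketched as a simple background audit loop that marks a node as failed after several consecutive missed probes and isolates it from further audits. The class and method names below are hypothetical and do not reflect the actual HAWKEYE module interfaces.

```python
class NodeMonitor:
    """Illustrative background-audit loop for service node failure detection."""

    def __init__(self, nodes, max_misses=3):
        # nodes: mapping of node name -> probe callable returning True if alive
        self.nodes = nodes
        self.max_misses = max_misses
        self.misses = {name: 0 for name in nodes}
        self.failed = set()

    def audit_once(self):
        """One audit pass: probe every live node, isolate repeated failures."""
        for name, probe in self.nodes.items():
            if name in self.failed:
                continue  # already isolated; its effects must not propagate
            if probe():
                self.misses[name] = 0  # healthy probe resets the counter
            else:
                self.misses[name] += 1
                if self.misses[name] >= self.max_misses:
                    # Isolate the node and hand it to the recovery service,
                    # which restores it to an earlier valid state.
                    self.failed.add(name)

monitor = NodeMonitor({"svc-a": lambda: True, "svc-b": lambda: False})
for _ in range(3):
    monitor.audit_once()
# monitor.failed is now {'svc-b'}: three consecutive missed probes
```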
The remainder of this paper is organized as follows: Section 2 surveys the existing work and argues the originality of our approach. Section 3 presents the architectural overview of the system. Section 4 explains how our proposed architecture works with the help of several distinct modules. Section 5 presents the performance evaluation of our solution, while Section 6 concludes this paper.
II. RELATED WORK
Approaches to the dependability problem usually focus on fault detection and fault tolerance rather than fault removal. Reference [2] introduces an adaptive system for fault detection in grids together with a policy-based recovery mechanism: fault detection is achieved by monitoring the system, and several recovery mechanisms are available, such as replication and checkpointing. Another approach, less related to large-scale distributed systems, is presented in [3], which addresses tolerance to hardware faults through virtualization and is categorized under Loosely Synchronized Redundant Virtual Machines (LSRVM). Another aspect of dependability is security. In the case of grid systems, [5] identifies a set of base requirements (single authentication, credentials protection, interoperability with local security solutions, etc.) and proposes both an architectural model for security and a public key cryptography infrastructure based on a reference implementation (the Grid Security Infrastructure, GSI). However, all existing solutions for security in distributed systems have unresolved issues [6]. Most ongoing research on distributed systems considers large-scale systems; little importance is given to cost-efficient, simplified distributed systems. During our research we did not find significant work on small-scale distributed systems, so we had to brainstorm and generate our own ideas for a small and simplified system; the papers above helped us understand different types of distributed systems and architectures.
III. ARCHITECTURAL OVERVIEW
Our proposed hybrid distributed system
architecture is built on the minimal set of
functionality necessary to ensure reliability and
availability. At present, the development of
dependable small-scale distributed systems is
technologically limited and lacks suitable
environments; an example of such a limitation is the
weak resilience of operating systems. Our solution is
to include a complete set of mechanisms that
guarantee the various technical needs. The
architectural model includes two specialized sets of
components [7]; this paper focuses on the second set.
The first set of architectural components ensures
dependability and reliability at the OS level. At the
core of the operating system lies the OS kernel. A
reference monitor component resides within kernel
space; it cannot be circumvented or modified, and it
must be simple and compact enough to be readily
understood. This component validates all attempts to
access system resources based on input provided by
the second component, the security policy, which
provides complete mediation: the validity of every
request is verified against sets of permissions. All
detected problems are reported and stored for further
inspection [4].
The second set of architectural components acts
on the middleware layer of the distributed system and
ensures dependability between the different hosts
using the system. At the bottom of the architecture
lies the core of the system, designed to orchestrate
the functionality provided by the other components;
its role is to integrate them and provide a dependable
execution environment. This layer is built around a
proxy service that implements the dependability
mechanisms based on a set of services. The proxy
service can mask possible faults in a way that is
completely transparent to clients. The architecture
allows generic integration with various technologies
and the development and evaluation of new models,
and its components can be interconnected with the
services available in such systems. These
middleware-layer components include the set of
mechanisms necessary to guarantee the security
needs of distributed systems and can take security
decisions based on monitored data [7].
Figure 1: Proposed Distributed System Architecture.
IV. WORK PROCEDURE
A. Proxy Service:
A proxy service is capable of masking possible
faults or service failures in a way that is completely
transparent to the client. In a distributed architecture
the proxy service not only masks possible faults and
failures, but also optimizes access to the distributed
services.
Our proposed solution masks possible failures
and optimizes access to the static services; if an error
occurs, replication ensures that the client still
receives the service without interruption. The
requests a client sends when invoking a service are
intercepted by the proxy service, which asks the
monitoring service for the URI of the appropriate
service provider node. Once the monitoring service
returns the URI, the proxy forwards the client request
to that node. If the node goes down or an error
occurs, the proxy, in coordination with the
monitoring service, re-forwards the request to the
backup and standby service provider node. The
approach is presented in Figure 1.
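The interception-and-reforwarding flow described above can be sketched as follows. All class and method names here are hypothetical illustrations, not the paper's implementation:

```python
# Sketch of the proxy's request flow: look up the provider's URI via the
# monitoring service, forward the request, and mask any failure by
# re-forwarding to the backup and standby provider. Names are illustrative.

class Node:
    def __init__(self, uri, healthy=True):
        self.uri = uri
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.uri} is down")
        return f"response from {self.uri} for {request}"

class MonitoringService:
    def __init__(self, nodes, standby):
        self.nodes = nodes      # service name -> primary provider Node
        self.standby = standby  # backup and standby service provider

    def lookup(self, service):
        return self.nodes[service]

class ProxyService:
    """Intercepts client requests; failures stay invisible to the client."""
    def __init__(self, monitoring):
        self.monitoring = monitoring

    def invoke(self, service, request):
        node = self.monitoring.lookup(service)  # ask monitoring for the URI
        try:
            return node.handle(request)
        except ConnectionError:
            # The fault is masked: re-forward to the standby provider.
            return self.monitoring.standby.handle(request)
```

A client holding a `ProxyService` reference never learns whether its request was served by the primary node or, after a crash, by the standby provider.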
The proxy service also balances the load between
services. Implementing a load-balancing policy
ensures a more efficient use of resources and leads to
a smaller response time [7].
The architecture of the system implementing the
proxy pattern is represented in Figure 1. In this
architecture a proxy is situated between the client and
the real service; this has the advantage that the proxy
can control access to the real service and can execute
various actions each time a service is invoked.
The messages sent to the real service are
intercepted using serialization and deserialization
techniques and are redirected to the real service by
the proxy service. All these actions are transparent to
both the client and the real service: throughout the
entire communication, the client believes it is
connected to the real service rather than to the proxy.
The monitoring and load-balancing service
centralizes all information regarding the service
provider nodes and helps choose the standby service
provider node in critical circumstances, such as when
the CPU load crosses its maximum capacity.
B. Monitoring Service:
The proxy service takes decisions based on real-
time information received from a monitoring service.
The role of the monitoring service is crucial both in
terms of information and for the high availability of
the system.
In our approach to a cost-efficient, simplified
distributed system we use service replication and
guarantee that all processing nodes achieve state
consistency, both in the absence of failures and after
failure recovery. The monitoring service receives a
request from the proxy service for the URI of a node
that can provide the requested service to the client. It
keeps track of every running node's logical and
physical information in real time, informs the proxy
about a suitable node, and continues to track that
process. If a failure occurs, it informs the replication
and recovery service, which immediately stops
replicating the failed node and requests that the
node's running process be transferred to the backup
and standby service provider without interrupting the
execution of the client request. In the meantime, the
proxy also asks the monitoring service for the URI of
the backup and standby service provider node. To
monitor our proposed distributed system we consider
a monitoring tool named Hawkeye.
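A minimal sketch of the monitoring service's bookkeeping described above, assuming a simple heartbeat-based registry; the names and data layout are illustrative, not the paper's implementation:

```python
# The monitoring service tracks each node's state in real time, answers the
# proxy's URI lookups, and records failures for the replication and recovery
# service. All identifiers here are assumptions for illustration.
import time

class MonitoringService:
    def __init__(self, standby_uri):
        self.nodes = {}          # uri -> {"service", "alive", "last_seen"}
        self.standby_uri = standby_uri
        self.failure_log = []    # consumed by replication & recovery

    def heartbeat(self, uri, service):
        """A node reports itself alive, with the service it provides."""
        self.nodes[uri] = {"service": service, "alive": True,
                           "last_seen": time.time()}

    def uri_for(self, service):
        """Answer the proxy's lookup: a live provider, else the standby."""
        for uri, info in self.nodes.items():
            if info["service"] == service and info["alive"]:
                return uri
        return self.standby_uri

    def report_failure(self, uri):
        """Mark a node dead and notify the replication/recovery side."""
        if uri in self.nodes:
            self.nodes[uri]["alive"] = False
            self.failure_log.append((uri, time.time()))
```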
HAWKEYE Framework:
Hawkeye is a monitoring tool developed by the
Condor group, designed to automate problem
detection, such as high CPU load, high network
traffic, or resource failure, as well as software
maintenance within a distributed system. Its
underlying infrastructure builds on the Condor and
ClassAd technologies. The main use case Hawkeye
was built to address is offering warnings using
Trigger ClassAds; it also allows for easier software
maintenance within a pool.
Figure 2: HAWKEYE Framework.
Hawkeye involves two fundamental ideas:
1. The use of the Condor ClassAd language to
identify resources in a pool;
2. ClassAd matchmaking to execute jobs based on
the attribute values of resources, in order to identify
problems in a pool.
A ClassAd is a set of attribute/value pairs (e.g.,
“operating system” and “Linux”).
The manager performs ClassAd matchmaking
between a Trigger ClassAd, submitted by a client,
and all Startd ClassAds.
A Trigger ClassAd specifies an event and a job to
execute if the event occurs. For example, if a machine
advertises a Startd ClassAd with a CPU load greater
than 50, the manager will not execute jobs on that
machine.
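A toy illustration of the matchmaking idea, modelling a ClassAd as a plain dictionary and a trigger as a predicate over it. This is a deliberate simplification of Condor's actual ClassAd language:

```python
# ClassAd-style matchmaking in miniature: a ClassAd is a dict of
# attribute/value pairs; a trigger pairs an event condition with a job to
# run when the condition matches a Startd ClassAd.

def matches(trigger, startd_classad):
    """True when the trigger's event condition holds for this ClassAd."""
    return trigger["condition"](startd_classad)

# Trigger: fire a warning when the advertised CPU load exceeds 50.
high_load_trigger = {
    "condition": lambda ad: ad.get("CpuLoad", 0) > 50,
    "job": "send_warning",
}

busy = {"OpSys": "Linux", "CpuLoad": 73}
idle = {"OpSys": "Linux", "CpuLoad": 12}
```

Here the manager would run `high_load_trigger["job"]` for the `busy` machine and skip the `idle` one.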
The architecture of Hawkeye comprises four
major components:
• Pool
• Manager
• Monitoring Agent
• Module
The components are organized in a four-level
hierarchical structure.
Pool: A set of computers, one of which serves as the
manager while the remaining computers serve as
monitoring agents.
Manager: The head computer in the pool, which
collects and stores (in an indexed resident database)
monitoring information from each agent registered to
it. It is also the central target for queries about the
status of any pool member.
Monitoring Agent: A distributed information service
component that collects ClassAds from each of its
modules and integrates them into a single Startd
ClassAd. At fixed intervals, the agent sends the
Startd ClassAd to its registered manager. An agent
can also directly answer queries about a particular
module; however, the client must first consult the
manager for the agent's IP address.
Module: A sensor that advertises resource
information in ClassAd format.
C. Replication and Recovery Service:
(I) The main idea of replication is to keep data
safe and accessible at all times. This makes the
service available to the user at any time and reduces
access latency. In our replication design, we consider
a special type of replication architecture, keeping
cost efficiency and simplicity in mind to suit
mid-sized organizational usage.
Our replication component consists of listener,
performer, storer, and eraser modules. The listener
tracks system changes, the performer executes
operations, the storer saves system state, and the
eraser deletes previous data in a structured manner.
The listener continuously checks for updates and
validates data; when new data arrives, it passes the
information to the performer module. The performer
receives the serialized data (an object converted into
a formatted message for exchange between services)
and deserializes it (reconverting the formatted
message into an actual object for execution) to make
it usable. On successful deserialization, the object
state is saved for further use. Finally, the eraser
module runs at a fixed interval, set to one hour, to
erase previous data after a new request has been
successfully executed and the data has been sent to
the backup server for storage as a history log. Note
that all nodes except the backup and standby service
provider provide distinct services, which are
replicated in real time by the replication service; if
any node crashes, the replication service redirects its
data to the backup and standby service to avoid
service interruption.
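The four replication modules can be sketched as follows, assuming JSON as the serialization format. The module boundaries follow the text; everything else is an assumption:

```python
# Listener validates incoming updates, performer deserializes them, storer
# saves the resulting object state, and eraser ships old states to the
# backup log before dropping them. Class and attribute names are illustrative.
import json

class Replicator:
    def __init__(self):
        self.states = []      # storer: saved object states, newest last
        self.backup_log = []  # history shipped to the backup server

    def listen(self, raw_update):
        """Listener: validate the update before handing it to the performer."""
        try:
            json.loads(raw_update)
        except json.JSONDecodeError:
            return False
        self.perform(raw_update)
        return True

    def perform(self, serialized):
        """Performer: reconvert the formatted message into a usable object."""
        obj = json.loads(serialized)
        self.store(obj)

    def store(self, obj):
        """Storer: save the deserialized state for later recovery."""
        self.states.append(obj)

    def erase(self):
        """Eraser: move all but the newest state to the backup log."""
        self.backup_log.extend(self.states[:-1])
        self.states = self.states[-1:]
```

In the paper's design `erase` would run hourly; here it is invoked explicitly so the state transitions are easy to follow.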
(II) Recovery refers to avoiding accidental loss of
data due to unavoidable circumstances and
reconstructing the system so that it is functional
again.
In our recovery design, the recovery component
consists of listener and performer modules: the
listener tracks system changes and the performer
executes operations. The listener checks for any flag
issued by the monitoring service regarding a node
crash; if a message is received, it is passed to the
performer module. The performer validates the
node's failure-time information. If the crash occurred
within the current time window, i.e., less than one
hour ago, it checks the replication component's
index, serializes that node's most recent data, and
sends it to the corresponding node for restoration.
Otherwise, it requests the backup server to pass the
last stored data to the corresponding node,
reconstructing the failed node.
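The performer's branching rule can be expressed compactly; the function name and the time representation are illustrative assumptions:

```python
# Crashes newer than one hour are restored from the replication component's
# index (most recent replicated state); older crashes fall back to the
# backup server's last periodic snapshot.

ONE_HOUR = 3600  # seconds

def choose_restore_source(crash_time, now):
    """Pick where the failed node's state should come from."""
    if now - crash_time < ONE_HOUR:
        return "replication-index"
    return "backup-server"
```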
D. Backup and Standby Service:
(I) Backup mainly refers to a logging mechanism
that keeps a brief snapshot of the system.
In our backup architecture, the backup server
consistently monitors all nodes and stores their data
periodically, every two hours. This maintains a
historical log of the system and of any successful
system changes. For better cost efficiency and
reduced hardware requirements, the backup server
keeps the system's data for one week, after which it
replaces the oldest data with the most recent.
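The two-hour snapshot cadence and one-week rolling window can be sketched with a bounded buffer; the data-structure choice is our assumption, not the paper's:

```python
# Snapshots are taken every two hours and kept for one week; once the window
# is full, the oldest snapshot is replaced by the newest.
from collections import deque

SNAPSHOTS_PER_WEEK = 7 * 24 // 2  # one snapshot per two hours -> 84

class BackupServer:
    def __init__(self):
        # A bounded deque silently drops the oldest entry when full.
        self.snapshots = deque(maxlen=SNAPSHOTS_PER_WEEK)

    def take_snapshot(self, system_state):
        self.snapshots.append(system_state)
```

A bounded deque keeps the hardware requirement fixed, which matches the cost-efficiency goal stated above.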
(II) The standby service component is our special
provision for managing system availability with
check-pointing. It sits idle in the normal workflow
and comes into action when a node crashes or when
all nodes are busy processing existing requests and
cannot accept new ones because they have reached
maximum load.
If a node crashes, the standby component acts as
the alternate node and provides services to users as if
the system were fully functional from the outside
view. If the current service-providing nodes are at
their maximum load level, it processes new incoming
requests, keeping the system functional and stable
from the user's point of view and thus preserving
system availability.
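The standby activation rule above can be sketched as a routing decision; the load threshold and all names are illustrative assumptions:

```python
# A request is served by its target node only while that node is alive and
# below its load ceiling; a crashed or saturated node sends the request to
# the standby provider instead.

MAX_LOAD = 50  # illustrative CPU-load ceiling

def route(nodes, target):
    """nodes: uri -> {"alive": bool, "load": int}; returns the serving node."""
    info = nodes.get(target)
    if info and info["alive"] and info["load"] <= MAX_LOAD:
        return target
    return "standby"
```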
V. PERFORMANCE EVALUATION
To measure our proposed system's performance
we compared it with the reference system on service
time, cost efficiency, and error masking. A visual
comparison of the two systems is presented below.
Service Time:
For the service-time comparison, we plotted Time on
the X-axis and Performance on the Y-axis. The graph
shows that although the reference architecture
provides better performance, our proposed
architecture is more stable.
[Graph: Performance (Y-axis, 0-10) vs. Time (X-axis, 1-6) for the Proposed and Reference Architectures.]
Cost Efficiency:
For the cost-efficiency comparison, we plotted
Resource on the X-axis and Cost on the Y-axis. The
graph shows that our architecture is the clear winner
in terms of cost efficiency.
Error Masking:
For the error-masking comparison, we plotted
Request on the X-axis and Failure on the Y-axis. The
graph shows that the reference architecture performs
much better than the proposed architecture in terms
of error masking.
[Graph: Cost (Y-axis, 0-20) vs. Resource (X-axis, 1-7) for the Reference and Proposed Architectures.]
[Graph: Failure (Y-axis, 0-8) vs. Request (X-axis, 1-6) for the two architectures, with the underlying data:]

Request                  1  2  3  4  5  6
Proposed Architecture    5  4  5  7  5  4
Reference Architecture   7  5  3  4  3  4
VI. CONCLUSION
In this paper we presented an architectural
approach for satisfying dependability requirements in
small-scale, cost-efficient distributed systems.
Dependability remains a key element of application
development and is by far one of the most important
issues still unsolved by recent research efforts.
Our work is concerned with developing a
distributed system that is simple, cost efficient, and
practical for a small office or business environment.
We therefore proposed a design maintaining a
hierarchical architectural model that allows a unitary
and aggregate approach to dependability
requirements while preserving the scalability of
small-scale distributed systems. Our original
contribution addresses the architectural model and
the combination of existing monitoring, data
management and backup, security, and fault-
tolerance solutions.
The system is still in the conceptual design phase,
but we are working towards implementing this
architecture for a small-scale partially distributed
system in the future.
REFERENCES
[1] X. Zhang, J. Freschl, and J. M. Schopf, “A Performance
Study of Monitoring and Information Services for
Distributed Systems”, Department of Computer Science,
University of Chicago.
[2] H. Jin, X. Shi, W. Qiang, and D. Zou, “DRIC:
Dependable Grid Computing Framework”, IEICE
Transactions on Information and Systems, E89-D(2),
2006.
[3] A. Cox, K. Mohanram, and S. Rixner, “Dependable ≠
Unaffordable”, 1st Workshop on Architectural and
System Support for Improving Software Dependability,
San Jose, California, 2006.
[4] V. Cristea, C. Dobre, F. Pop, C. Stratan, A. Costan, C.
Leordeanu, and E. Tirsa, “Models and Techniques for
Ensuring Reliability, Safety, Availability and Security
of Large Scale Distributed Systems”, in Proc. of the
17th International Conference on Control Systems and
Computer Science, HiperGrid09, Bucharest, Romania,
2009.
[5] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke, “A
Security Architecture for Computational Grids”, in
Proc. Fifth Conf. on Computer and Communications
Security, ACM, 1998.
[6] A. Arenas, “State of the Art Survey on Trust and
Security in Grid Computing Systems”, Technical
Report, Council for the Central Laboratory of the
Research Councils, UK, 2006.
[7] A. Costan, C. Dobre, F. Pop, C. Leordeanu, and V.
Cristea, “A Fault Tolerance Approach for Distributed
Systems Using Monitoring Based Replication”,
University Politehnica of Bucharest.