DATA CENTER ENVIRONMENT
[Figure 1. Load balancing and resource reallocation. (1) Cluster A is underprovisioned; Cluster B is overprovisioned. (2) Preparing to remove: Cluster B vacates the target node and redistributes its load. (3) The target node is removed from Cluster B. (4) The target node is added to Cluster A. (5) Cluster A redistributes its load to include the new node; both clusters are now appropriately provisioned.]
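The reallocation sequence that Figure 1 depicts can be sketched as a small controller routine. This is a toy illustration, not the API of any clustering product; the `Cluster` class, its one-work-unit-per-node capacity model, and the `pick_target_node` and `rebalance` names are all assumptions made for the sketch:

```python
# Illustrative sketch of the Figure 1 sequence: move one node from an
# overprovisioned cluster to an underprovisioned one. All class and
# method names are hypothetical, not a real clustering interface.

class Cluster:
    def __init__(self, name, nodes, demand):
        self.name = name
        self.nodes = list(nodes)   # node identifiers
        self.demand = demand       # work units the cluster must serve

    def capacity(self):
        return len(self.nodes)     # assume one work unit per node

    def utilization(self):
        return self.demand / self.capacity()

def pick_target_node(cluster):
    # Step 2: an identification algorithm chooses which node to yield.
    # Trivially the last node here; a real algorithm would minimize
    # client impact or time to yield.
    return cluster.nodes[-1]

def rebalance(under, over):
    # Steps 2-5: vacate, remove, add, redistribute.
    if under.utilization() > 1.0 and over.utilization() < 1.0:
        node = pick_target_node(over)
        over.nodes.remove(node)    # Cluster B vacates and releases the node
        under.nodes.append(node)   # Cluster A absorbs it and redistributes
        return node
    return None

a = Cluster("A", ["a1", "a2"], demand=3)        # underprovisioned
b = Cluster("B", ["b1", "b2", "b3"], demand=2)  # overprovisioned
moved = rebalance(a, b)
print(moved, a.utilization(), b.utilization())  # b3 1.0 1.0
```

A production identification algorithm would weigh administrator-defined objectives, such as client impact or time to yield, rather than simply taking the last node.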
The statistical multiplexing of hardware across application clusters is at the core of the virtual data center concept. In the virtual data center, administrators physically configure hardware only once; software creates logical associations among hardware components as needed. For example, virtual LANs (VLANs) can be configured through software that resides on a network switch. By managing hardware as pools of similar, easily “relocated” components, the virtual data center can optimize real-time, dynamic resource allocation purely through software.

Balancing loads and resources
Two mechanisms work in tandem to achieve maximum performance and optimal resource utilization in the virtual data center:

• Load balancing: The ability to redistribute work across the nodes in a cluster, commensurate with each node’s processing capacity
• Resource balancing: The ability to move nodes among clusters in order to increase and decrease cluster sizes, and thus cluster processing capacities

Load balancing occurs within clusters; resource balancing occurs between clusters. Achieving synergy in both workload distribution and resource utilization is the main goal of the virtual data center. Although simple in concept, the implementation is nontrivial.

A centrally acting controller that orchestrates operations across the entire data center could ideally make use of three load-balancing and resource-balancing mechanisms: redistribute work, remove node, and add node.2 For example, Figure 1 shows how the virtual data center could balance loads and resources between two clusters running different clustering software, as follows:

1. Before reallocation. Cluster A is experiencing heavy demand and is nearing the saturation point; it is underprovisioned because it requires additional hardware resources. Cluster B has spare capacity; it is overprovisioned because it has an abundance of hardware resources.
2. Preparing to remove. An identification algorithm targets a node in Cluster B for transfer to Cluster A. System administrators can program the logic in the identification algorithm to align with business objectives, such as minimizing impact to clients or minimizing time to yield of the target
2 For more information about virtual data center management software and the role of the global engine, see “Managing the Virtual Data Center” by J. Craig Lowery, Ph.D., in Dell Power Solutions, August 2003.
2 POWER SOLUTIONS November 2003
(that is, the time until the cluster makes the node available). Cluster B uses workload redistribution methods (discussed in the next section) to vacate this node.
3. Remove. The vacant target node is removed from Cluster B.
4. Add. The target node is added to Cluster A, which begins to redistribute its work across the new cluster member.
5. After reallocation. A steady-state workload exists in Cluster A; both Clusters A and B are provisioned appropriately.

As shown in Figure 1, the redistribute, add, and remove operations enable the reallocation of hardware resources according to changes in demand. The add operation is easy to implement because it does not require an immediate reaction from the affected cluster; new nodes are integrated in a nondisruptive fashion during the redistribute operation. The remove operation also is simple because work can be redistributed to vacate a target node before removing that node. Clearly, of the three operations, redistribute is the most critical.

Redistributing workloads
Whereas load-balancing mechanisms determine the appropriate allocation of work across nodes, workload redistribution is the means of achieving that allocation. That is, workload redistribution techniques move a thread of execution—a job, a process, or a session—among nodes.

Assigning new jobs to nodes, or job scheduling, is a primary function of cluster operating systems. Usually the scheduler chooses the least utilized node for a new job, as shown in Figure 2, but it can employ other criteria as well. Once a job starts on a node, it runs to completion on that node.

[Figure 2. Job scheduling workload redistribution]

A job is usually defined as a process group spanning a time from creation to completion. Alternatively, a job may be defined as the unit of work performed by an always-resident server process in response to a particular client request; in this instance, the request represents the job. Generally, for workloads characterized by uniformly short job lengths, utilization is nearly equivalent across the nodes in steady state. However, when job lengths are generally unknown or highly variable, job completion times are impossible for the job scheduler to predict. Consequently, utilization across nodes in the cluster becomes skewed over time as the scheduler makes inefficient assignments.

In addition, unpredictable job completion times can diminish the value of the remove operation: When the cluster must vacate a node, the job scheduler excludes that node from new assignments. However, because the completion times for existing jobs on the node are unpredictable, determining when the node will be ready also is impossible.

Managing process migration
The workload redistribution method most difficult to implement is process migration. In this approach, the job scheduler initially assigns a process (that is, a job) to one node. The cluster management software subsequently suspends the process, moves it to a different node, and resumes execution (see Figure 3).

To appreciate the difficulties of process migration, consider the types and sizes of state information that general processes own, and that this state information must be copied from one system to another to effect a migration. In the absence of overhead, process migration provides the greatest flexibility and the fastest response to configuration changes for jobs that possess very little state information. Sometimes migrating processes with sizable memory structures—such as arrays, large local files, temporary working files, network connections, and user interface I/O paths—is more time-consuming than simply letting the processes finish on the node to which they were initially assigned.

[Figure 3. Process migration workload redistribution]
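The suspend/copy/resume cycle that Figure 3 depicts can be sketched with Python’s `pickle` module standing in for the state transfer between nodes. The toy job below keeps its entire state in one dictionary, an assumption that makes migration trivial; real processes own memory, file handles, and network connections that are far harder to capture, which is exactly the difficulty discussed above:

```python
import pickle

# Toy "process": its entire state is one explicit dictionary, so it can
# be serialized and shipped. The size of this state determines how long
# the suspend/copy/resume cycle takes.

def run(state, steps):
    # Advance the job by a number of steps.
    for _ in range(steps):
        state["counter"] += 1
    return state

# Node 1: start the job, then suspend it mid-flight.
state = {"job_id": 7, "counter": 0}
state = run(state, steps=3)      # partial progress on node 1

image = pickle.dumps(state)      # serialize ("copy") the state off the node

# Node 2: restore the image and resume execution where it left off.
resumed = pickle.loads(image)
resumed = run(resumed, steps=2)
print(resumed["counter"])        # 5
```

The larger the serialized image, the longer the transfer, which is why migration can cost more than simply letting the job finish where it started.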
If migration overhead were uniformly low, process migration would be the ideal workload redistribution mechanism. A refinement of the process migration concept is session or transaction migration, whereby a vacating process on one node hands off an in-progress transaction to a new or idle process on another node (see Figure 4). A combination of shared storage and network communication typically facilitates the hand-off.

[Figure 4. Session migration workload redistribution (before and after hand-off)]

Several options exist for implementing the session migration mechanism. For example, processes could keep state information in local storage, copying it to shared storage only when they must vacate the node. However, this approach offers little improvement over process migration.

Conversely, processes could work directly from shared storage all the time, allowing a process to vacate almost immediately if necessary. In this scenario, another process on a different node could pick up where the vacating process left off simply by accessing the share. Although long possible in shared-memory multiprocessor systems, this approach was considered impractical in the general case of cooperating servers on a LAN. However, relatively inexpensive, high-performance interconnects such as InfiniBand™ and Gigabit Ethernet3 fabrics are enabling bandwidth-intensive cooperative activities, such as working directly from a storage area network (SAN) and remote direct memory access (RDMA).

Given environments in which low-cost, commodity hardware components can communicate at performance levels approaching that of the system bus, system designers finally can consider workload redistribution methods such as session migration for mass deployment. Because workload redistribution is the key to load and resource balancing, high-performance interconnects are critical to the success of the virtual data center.

One interesting variation of process migration that also can leverage new interconnect technologies is virtual machine–hosted operating systems. Products such as VMware™ GSX Server™ and ESX Server™ software and Microsoft® Virtual Server run several guest operating systems concurrently on a host operating system by simulating a reference implementation of node hardware. All state information (the image) for the guest operating system resides in a large file, normally located in the local file system of the host. This method makes possible suspending the execution of the guest operating system with its current state fixed in the image file, moving the file to another host system, and resuming execution. The transfer time for the image file introduces problems similar to those of process migration. However, by keeping the image file in shared storage and enabling multiple host systems to access the image file across a high-performance interconnect, nearly instant migration of an entire execution environment is attainable.

The drawbacks to this method include the overhead for supporting the virtual machine and the large-size granularity at the operating system level rather than at the process level. Administrators could assign one process per virtual operating system to achieve finer granularity, but the virtualization overhead would become prohibitive. Even so, the combination of virtual operating systems, shared storage, and high-performance interconnects can be considered a step forward in achieving the goals of the virtual data center.

Achieving the primary goal of transparency
For at least 20 years, the concepts of load balancing, resource balancing, and workload redistribution have appeared in academic literature describing distributed operating systems.4 Much research into creating such systems has led to many of the advances now incorporated into the virtual data center concept.
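Returning to the session migration approach described above, the shared-storage hand-off can be sketched as follows. The file layout, the session field names, and the use of a temporary directory in place of a SAN-backed share are all illustrative assumptions:

```python
import json
import os
import tempfile

# Sketch of session migration through shared storage: the vacating node
# checkpoints an in-progress session to a shared location, and a new or
# idle process on another node picks it up. A temporary directory stands
# in for a SAN-backed shared file system.

share = tempfile.mkdtemp()

def vacate(session, share_dir):
    # Vacating process: persist the session state and stop working on it.
    path = os.path.join(share_dir, f"session-{session['id']}.json")
    with open(path, "w") as f:
        json.dump(session, f)
    return path

def resume(path):
    # Process on another node: pick up where the vacating process left off.
    with open(path) as f:
        return json.load(f)

session = {"id": 42, "step": 3, "total_steps": 10}
path = vacate(session, share)          # node vacates almost immediately
taken_over = resume(path)              # peer node resumes the transaction
taken_over["step"] += 1
print(taken_over["id"], taken_over["step"])  # 42 4
```

Working directly from the share all the time, rather than copying state only at vacate time, is what makes the hand-off nearly instantaneous when the interconnect is fast enough.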
3 This term indicates compliance with IEEE standard 802.3ab for Gigabit Ethernet, and does not connote actual operating speed of 1 Gbps. For high-speed transmission, connection to a Gigabit Ethernet server and network infrastructure is required.
4 “Distributed Operating Systems” by Andrew S. Tanenbaum and Robbert Van Renesse in Association for Computing Machinery (ACM) Computing Surveys, vol. 17, no. 4, December 1985.
One key differentiating characteristic of distributed operating systems, as opposed to traditional operating systems, is transparency. That is, users of the system neither know, nor need to know, which network components are cooperating to service their requests or how those components accomplish their tasks.

Because the virtual data center is a form of the distributed operating system, transparency is a primary goal. To be truly successful, the virtual data center must be able to host applications that are indifferent to underlying hardware details and location, and are unaware that they may be subject to migration. Application developers should not have to worry about synchronizing with reconfiguration events. Ideally, the virtual data center presents applications with a virtual machine that provides continuity of execution at all times, making no special demands on applications to accommodate reconfiguration below the virtualization layer.

Some data center management software uses network boot capabilities to make a node execute entirely from shared storage, similar to diskless workstations. This method allows a node to change “personalities” by rebooting to a different image under the direction of some controlling agent, thereby providing location transparency. Although certainly useful, this method is of limited benefit in the virtual data center because it is essentially coarse-grained job scheduling; administrators must wait for all executing jobs to complete before rebooting the node to a new image.

One way to achieve finer time granularity (that is, to move applications around the data center more quickly) without sacrificing transparency is to construct environments specifically designed to relieve applications of the burdens associated with load balancing and workload redistribution. For example, many Oracle® products—most notably Oracle9i™ Real Application Clusters databases—include this capability, which Oracle refers to as grid computing.5 Application programming interfaces (APIs) exist for location-transparent data access and processing, while the Oracle software stack provides a consistent virtualization layer above the hardware and operating system. Session migration is a natural extension of the grid features currently available in Oracle products.

Moving closer to the ideal virtual data center
Achieving cluster reconfiguration and load balancing that is both predictable and transparent is extremely difficult and presents several trade-offs. Job scheduling is transparent because no migration occurs, but it is not predictable. Process migration also is transparent but not predictable, because a potentially large amount of state information must be transferred. Session migration is more predictable when processes work directly from shared storage across high-speed network interconnects, but this predictability often is achieved at the expense of transparency. Products coming to market implement each of these approaches, some in ways that may eventually overcome their traditional shortcomings.

Despite these challenges, much progress has been made in moving toward the ideal virtual data center, and the pace of development is quickening. Technologies described years ago in academic journals are finally taking shape as tangible products. High-performance interconnects, virtualization-ready execution environments, and standard components and protocols are paving the way to the eventual realization of this long-pursued computing model.

J. Craig Lowery, Ph.D. (email@example.com) is chief security architect and a software architect and strategist in the Dell™ Product Group–Software Engineering. Craig has an M.S. and a Ph.D. in Computer Science from Vanderbilt University and a B.S. in Computing Science and Mathematics from Mississippi College. His primary areas of interest include computer networking, security, and performance modeling.

FOR MORE INFORMATION

Microsoft Virtual Server: http://www.microsoft.com/windowsserver2003/evaluation/trial/virtualserver.mspx
Oracle Real Application Clusters: http://www.oracle.com/ip/rac_home.html
VMware: http://www.vmware.com
5 The Oracle definition of grid computing differs somewhat from other connotations of grid computing. For more information about grid computing, visit http://www.globus.org.