The Global Computing challenges behind:
Capacity Planning
&
Data Center Architecture
Stephen J. Gasparini
GET434 - Computing Challenges
December 12, 2014
Page 2 of 27
Capacity Planning & Data Center Architecture
Table of Contents
Title page………………………………………page 1
Table of Contents…………………....…………page 2
Executive summary……………………….....…page 3
Body of Report…………………………......…..pages 4-20
Introduction……………..…………..…….pages 4-5
Capacity planning…………………………pages 5-8
Data Center Architecture………...………..pages 9-14
Challenges/Solutions...……………………pages 15-16
Case example..…………………………....pages 17-19
Recommendation……………………….…page 20
Lessons Learned……………………………..…..page 21
Glossary……………………………………….…pages 22-23
Bibliography…………………………...….……..page 24
Appendices………………………………..……..pages 25-27
Page 3 of 27
Capacity Planning & Data Center Architecture
Executive Summary
Capacity planning and data center architecture are two semi-related technical topics, each with
its own challenges. Capacity planning deals with the challenge of attempting to accurately
predict the resource requirements of the future. Data center architecture deals with the everlasting
challenge of preventing and battling downtime.
Data center architecture begins during the planning of a new data center. The most
important part is determining the average and peak resource requirements of the new data center: you
need enough resources for current and future requirements, but not so much excess that money is
simply wasted. The most accurate way to predict these requirements is through the capacity planning
3-phase process. Although capacity planning is the best way to predict, there are still too many
unknowns and variables to accurately predict how much of each resource will be required in
the future.
Another very important challenge of data center architecture is the prevention of downtime. I rank
capacity planning higher, however, because in order to prevent downtime, you need to first create a
data center with the correct resource requirements. Failing to provide enough resources can itself create
downtime, which is why resource planning is the more important focus.
Many things can cause downtime, including electrical outages, damaged network
infrastructure, and even natural disasters. Anything that causes a network to become inaccessible to
its users causes downtime. Downtime costs companies of all sizes anywhere from hundreds of dollars
to millions of dollars in losses for each hour it lasts.
Best practices are the rules and guidelines followed by data center architects in order to prevent
downtime. With enough preparation you won't have to deal with it often, but even companies
like Facebook and Google have occasional downtime; it is so brief and seldom that most people don't
even notice it. Typically, the more money spent on a data center, disaster recovery, and redundancy,
the less downtime a company will have. The fact is that downtime can only be mitigated, never
eliminated; there will always be a natural or man-made cause of downtime for everyone, even Google.
For anyone looking to solve the challenge of accurately predicting future resource requirements,
we have capacity planning, and the 3-phase process, which will give you the most accurate prediction
possible with the knowledge and facts of today.
Those looking to solve the challenge of downtime should focus on preventing it using
best practices, monitoring their systems for possible causes, and having a plan to fix it when it
inevitably strikes. That is the best way to address downtime. It cannot be completely eliminated as a
problem, but by using data center best practices, along with proper monitoring and maintenance,
downtime will become a very small one.
Page 4 of 27
Capacity Planning & Data Center Architecture
Introduction
It is difficult for a capacity planner or data center architect to be accurate to any great degree when
planning for the current and future needs of a particular data center. Capacity planning is the biggest
solution to some of the challenges behind data center architecture. In data center architecture there are
many prevalent challenges, but two are harder to avoid than the rest: preventing downtime and
predicting future capacity requirements. Most of the practices in good data center architecture revolve
around preventing possible downtime. This is because the data center is there to provide a service,
usually to make money from that service, and when the service is down the data center is not fulfilling
its purpose, and is possibly costing the owner a lot of money. The practice behind good capacity
planning will increase the accuracy of any predictions made about future requirements.
I have investigated the practices behind good and bad capacity planning and data center
architecture, the challenges behind them, and solutions to those challenges. I have identified the links
and differences between the two topics. My findings revolve around the notion that data center
architecture is the main topic and that predicting resource usage is one of its many challenges, but the
most important one when planning a new data center. Capacity planning, when done correctly, is part
of good data center architecture and a solution to one of its biggest challenges.
Based on my investigation, my recommendation for data center architecture is to follow best
practices and use capacity planning as a guideline. This is because capacity planning is only accurate
to a certain point and cannot account for all the unknowns the future will most likely hold, but it is the
best information we have to go on in the present. Sometimes even capacity planners and data center
architects have to make assumptions and educated guesses about what the future will be like, usually
backed up with facts and testing.
Some of the key best practices to maximize availability in data center architecture are: making
things efficient, cost-effective, simple, modular, scalable, and flexible; regularly scheduled maintenance
and cleaning; physical security and protection from natural and man-made disasters; redundancy and
modularity in everything; efficient and smart physical architecture and system design; and, most
importantly, preventing downtime wherever possible.
Capacity Planning
Capacity planning is the prediction of the resource requirements for a data center. It is more
accurate than server sizing because server sizing is an estimate of hardware based upon the
applications, peak performance levels, and expected activity. Capacity planning is backed up by
technical performance data, acquired through testing. Although capacity planning is the best
benchmark we have, the sad truth is there are simply too many variables to accurately predict how
much of each resource will be required until it is too late (Jayaswal 144-145).
There are many questions to ask during capacity planning, and with proper testing you can
answer most of them. First you determine your current service level requirements: the
behind-the-scenes workloads. Once completed, you will know how much of each resource is being
used, by whom, and on which machines. Next you measure your current capacity usage and the
overall capacity available. This means you determine the maximum resource usage you are currently
prepared for and your current utilization. What is measured is how heavily the CPU, I/O, applications,
memory, and other parts of the machines are being used. You also need to determine when your peak
resource usage will equal or exceed your current capabilities. It is important to separate the peak
performance measurements from the average usage. You need enough resources allocated to handle
the peak workloads, but if you don't have enough to constantly manage the average workload efficiently, you
will have problems with high utilization and low efficiency (Rich 2-12). Once you record your
current usage, capacity, and user load, you can scale them to the user load you will plan for in the future.
Some of the unknown variables include future trends, exact requirements, applications, peak
usage, and average performance levels. As you can see, many things about the future are impossible to
predict. These are risks every data center architect will face. You can only plan based
on current expectations. It is not easy to be a good capacity planner; appendix 1 shows some suggested
skills one should have in order to be a successful capacity planner (Schiesser 1).
Another way to explain the capacity planning process is as the "three-phase process" described
in chapter 12 of Administering Data Centers. The phases are: Phase 1: define the customer's
requirements; Phase 2: measure or estimate current resource utilization; Phase 3: size the new server.
The first phase assesses the workload for the new environment and establishes the users'
latency expectations. Also important is collecting information on current and future
requirements, applications, and the type and amount of workloads and acceptable latency.
Estimating CPU requirements in Phase 1 involves several factors. A good way to start is to ask
questions such as: what size sorts will be done, and in memory or on disk? Will there be parsing or
complex navigation? Can the CPU handle the size of the mathematical manipulation being done
(Jayaswal 146)?
Memory is also assessed in Phase 1. One approach is sandbagging, or adding extra memory to
be safe, but that is expensive and inefficient. It is also important not to undershoot the memory
requirements. The SGA (system global area) must be sized correctly. You must also determine the
maximum number of application users, because that is a huge factor in determining the amount of
required memory, I/O throughput, and CPU usage for the application and back-up database servers.
Many factors impact the amount of memory dedicated to each user, including the type of operations
performed, the amount of shared images, and the amount and type of sorting and parsing.
When estimating the number and size of disks required, you must be sure to factor in I/O
spindles and database archiving. It is important to keep I/O spread across several spindles, usually
done with several small disks attached directly to the server or attached via SAN fabric. Sometimes
spreading I/O around can be difficult and unnecessary. Some items, such as databases, tablespaces,
binaries, and redo logs, should be stored on separate disks.
Latency is the response time between the servers and the users. The ideal latency would be
0.00 seconds, meaning the user doesn't have to wait at all. Obviously that is impossible, but getting as
close to zero as you can is the goal. This is done by providing enough resources in all the right places.
It is important to identify the worst-case acceptable latency for different types of workloads.
The type and amount of current workloads need to be measured. Memory consumption, CPU
usage, and I/O usage must be recorded at average and peak performance levels. This helps set the bar
for average and peak levels in the future. The ratio of users to workload is also important for
developing a scale for future requirements.
Phase 2 estimates and measures CPU and memory usage for each individual computer and user.
This is done by testing the existing workloads of applications. If the applications aren't available,
resource usage is estimated from data provided by the application vendors or from independent tests.
The CPU workload caused by a particular computation is determined by multiplying CPU usage
by the duration of the CPU load. Workload is measured in performance unit-seconds. The best way
to measure utilization is by running a pre-timed computation. This indicates how fast a CPU is
running compared to how fast it would be expected to run: because the computation is pre-timed, we
know roughly how long a CPU of a given size should take, so a CPU can be determined to be faster or
slower than expected based on its test results.
Memory consumption is tricky because of the previously discussed tension between buying enough
memory that you don't run out and avoiding sandbagging, which is expensive and inefficient.
Multiple areas of memory must be considered and accounted for, such as operating system memory,
kernel and system library memory, file system buffer memory, and user and application/database
requirements.
Phase 3 is sizing the new server requirements. This is done using the information acquired
in the first two phases from testing and estimating. We use those numbers to project future
requirements based on the current number of users and the expected number of future users. This gives
us our latency, utilization, memory, CPU, and other requirements to abide by (Jayaswal 143-151). It is
important to recognize that scaling is not always as simple as 1200 * 150% = 1800, thanks to
incommensurate scaling. Incommensurate scaling means that when a system is scaled, not everything
increases at the same rate. The same concept holds if you were to take a mouse and scale it to the
size of an elephant: the mouse would be crushed, because its weight would increase with the cube of
its dimensions while its height increases only linearly (Saltzer 1.1.1.3).
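A toy sketch of that caveat in Python; the scaling exponent is a made-up illustration of incommensurate scaling, not a figure from the report:

```python
def project_requirement(current_value, growth_factor, scaling_exponent=1.0):
    """Project a future resource requirement.

    growth_factor: e.g. 1.5 for 50% more users.
    scaling_exponent: 1.0 is naive linear scaling; values above 1.0
    model a resource that grows faster than the user count.
    """
    return current_value * growth_factor ** scaling_exponent

print(project_requirement(1200, 1.5))       # linear: 1800.0
print(project_requirement(1200, 1.5, 1.4))  # incommensurate: grows faster
```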
CPU estimates are used to predict the number of CPUs needed based on the number of users
and the workload at any given time. The inputs to the CPU-requirement formula are the total CPU
needed for computation, the number of users, the projected computations per second, and the estimated
CPU workload per computation. You must also account for the operating system, kernel processes,
application processes, and system response-time requirements. Adding CPUs does not scale linearly;
this is called the SMP factor and can be observed in appendix 2.
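As a hedged sketch, the inputs listed above can be combined like this; the exact formula, the SMP efficiency value, and the overhead fraction are assumptions for illustration (appendix 2 holds the report's actual SMP figures):

```python
def cpus_required(users, computations_per_sec, workload_per_computation,
                  cpu_capacity, smp_efficiency=0.85, overhead_fraction=0.10):
    """Estimate how many CPUs are needed.

    workload_per_computation: performance unit-seconds per computation.
    cpu_capacity: performance units one CPU delivers per second.
    smp_efficiency: each added CPU contributes only a fraction of a
    full CPU (the SMP factor), so adding CPUs does not scale linearly.
    """
    # total demand, padded for OS/kernel/application overhead
    demand = users * computations_per_sec * workload_per_computation
    demand *= 1 + overhead_fraction
    n, effective = 1, cpu_capacity
    while effective < demand:
        n += 1
        effective += cpu_capacity * smp_efficiency
    return n

print(cpus_required(100, 2, 0.5, 50))  # 3
```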
Memory estimates are also derived from the numbers determined in the first two phases.
Similar to CPU sizing, memory needs to be sized for all aspects of the system, including OS processes,
the kernel, the file system buffer, applications, and database shared space. All aspects must be
predicted per user and scaled appropriately to figure out the requirements for the new system. It is
important to remember that if there is not enough memory for peak usage, problems will occur
(Jayaswal 151-154).
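The per-area memory sum described above might look like this; the area names follow the report, but the figures in the example call are hypothetical:

```python
def memory_required_mb(users, mb_per_user, os_mb, kernel_mb,
                       fs_buffer_mb, app_db_shared_mb):
    """Sum the memory areas the report lists: OS, kernel,
    file system buffer, shared application/database space,
    plus a per-user allowance scaled to the user count."""
    fixed = os_mb + kernel_mb + fs_buffer_mb + app_db_shared_mb
    return fixed + users * mb_per_user

print(memory_required_mb(100, 8, 512, 256, 1024, 2048))  # 4640
```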
Data Center Architecture
Data center architecture is the design and implementation of a data center, new or old, and
involves intense planning. Capacity planning gives us an accurate depiction of the minimum average
and peak requirements the data center needs. Next comes the design of the new data center. The
design for a data center has two layers, each with its own challenges, solutions, and best practices.
The first layer deals with the software on the servers and machines, such as applications, I/O, the OS,
services, network tools, memory data, processing information, and data/administrative tools. The
second is the physical layer, addressing the actual space, physical machines, network infrastructure,
electrical system, and HVAC system.
The system design and the physical layout of the data center need to be efficient and
cost-effective, and provide the best all-around user and administrator experience. The data center
needs to keep availability as close to 100% as possible. If anything involving the data center goes
wrong or fails, it can cost the company, and its users, money and business, depending on the duration
of the downtime. Downtime cost is calculated by taking the number of workers who can't work or are
working on fixing the system, multiplied by their average hourly wage, multiplied by the duration of
downtime in hours, plus any lost revenue. So a company with thirty workers getting paid $20 an hour
for a two-hour downtime is losing $1,200 plus lost revenue. A large company like Google can lose
millions of dollars an hour.
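The downtime cost formula above translates directly into code:

```python
def downtime_cost(workers, hourly_wage, hours_down, lost_revenue=0.0):
    """Workers idled (or working on fixing the system) x average hourly
    wage x hours of downtime, plus any lost revenue."""
    return workers * hourly_wage * hours_down + lost_revenue

# the report's example: thirty workers at $20/hr, down two hours
print(downtime_cost(30, 20, 2))  # 1200
```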
The first layer is much less challenging, especially once capacity planning is completed. This
layer is usually started before, but not finished until, construction of the physical layer is complete.
Before finalizing this layer, if the capacities planned for have changed (they will), they can be updated
to meet new expectations, as long as the physical space is allotted (buffer, excess, or expansion space).
A good data center architect and capacity planner will have come close with their earlier requirement
predictions because they used capacity planning.
Most data center architects and system analysts will design a system that is simple, scalable,
modular, flexible, secure, and maximizes availability; these are the key concepts all good systems
are designed around. The system needs to be set up so all the programs (applications, OS, I/O,
services, network tools, memory data, and data/admin tools) are separated onto their appropriate
machines, but configured to link and talk to their appropriate program neighbors.
It is important to ensure security is implemented, working, and tested throughout this layer
before opening the data center to the raw internet; this is the layer where all the sensitive data
is located. Security measures that should be taken at this level include a DMZ, firewalls,
authentication, authorization, and logins. Once the system is finished, the data center can be opened
for business.
Just like the first layer, the design of any good data center should follow four key
requirements: simple, scalable, modular, and flexible. Simple enough that anyone in the field could
step in, look around, and know what they were looking at. Scalable for the future, when the data
center inevitably no longer has enough resources for its requirements. Modular so everything is
divided into its own sections and sub-sections, making locating specific machines and making repairs
simple. Finally, flexible, so when something doesn't work as planned or new management has its own
plan, the design can be adapted to fit another situation.
An architect designing a data center should plan in advance, for the worst, for growth, for
changes, and against vandalism. The architect and company need to plan far in advance to get
everything right the first time; retrofitting a data center a second time is expensive and not desirable.
Planning for the worst ensures 24/7 uptime in worst-case scenarios. Planning for growth and changes
falls under scalability and flexibility. Unfortunately, vandalism exists, from teens rebelling to rival
companies or groups who want to put you out of business. The architect should simplify his design constantly,
and ensure everything is labeled in advance and when being built. This includes labeling anything and
everything from cables, ports and wires to machines, racks, and rooms.
It is important to choose a physically secure location from the start. An ideal location is safe
from natural disasters such as hurricanes, floods, tornados, and earthquakes. A location known for
security and safety is important; a high-crime area is no place for hundreds of thousands to millions of
dollars of equipment. Also important in a location is the availability of a reliable power source.
Another, often forgotten, factor is the availability of local talent already living there, who could
potentially fill various important positions.
A big decision that needs to be made from the start, and not changed if possible, is whether or
not to have a raised floor. A raised floor benefits the machines, but it requires a subfloor, ramps,
additional building codes, and special floor tiles rated for the weight. It is beneficial because
everything from network cables and electricity to HVAC can run out of sight under the floor in the
plenum. With a raised floor it is important to account for the weight of everything on everything.
This includes the weight of the racks, machines, people, forklifts, tiles, and anything else that might be
held up by the sub-floor structure. As seen in appendix 3, the weight of a server room can add up
quickly with additional racks holding multiple machines. All of it has to be accounted for, in both
point-load and static-load terms, so the floor is never compromised.
Network infrastructure in a data center is extremely important to uptime because it connects
the entire system, and the outside world, to its network. The network should be adequately connected
to the outside world through authorized areas like DMZs, with enough bandwidth to ensure outside
users can always connect and get the level of service they require. When creating the network, it is
important to benefit from modularity, using PODs, patch panels, and network switches to separate
parts of the system that can afford to be separate, in order to benefit the system overall.
All network cables, and other cables and wires, should be redundant, properly labeled, follow
a color coding, respect the minimum cable bend radius to avoid damage, and be set up to avoid
tangling, which causes issues. Examples of good and bad cabling can be observed in appendix 4.
Redundancy ensures that if one link goes down, the server is still accessible. When counting cables
for a data center, they add up quickly; everything links to multiple places, making the number of
cables grow far faster than the number of servers. Every time you add a server you could potentially
be adding a dozen or so cables.
Power distribution is similar to the network infrastructure in that every sub-system has its
own power requirements and should be supplied with modularity to account for this. The main goal
of power distribution is to have sufficient and reliable power running throughout the data center. It
should be redundant, like the network cabling, to ensure no single points of failure.
In capacity planning we estimated the resource requirements; power distribution is done
similarly, because each piece of equipment has its own requirements. The power distribution system
must account for power on all levels, from individual machines on racks, to entire server rooms, to
the entire building. Also included are the requirements of the HVAC system, fire control, lighting,
monitoring, the NOC, and security. An electrical system can, and should, be modular, using circuit
breakers or PDUs and/or providing electricity room by room. It must separate single-phase and
three-phase power, because those go to their own respective users. ESD needs to be accounted for to
prevent people or machines from being damaged by an imbalance in electric charge, usually by means
of discharge grounding points.
The power distribution must have an adequate back-up in order to prepare for worst-case
scenarios where the primary power provider goes offline. Usually that means another power provider
or, more likely, back-up generators. The back-up system must have an adequate UPS that can
maintain the load, even at peak performance levels, until the back-up power source kicks in. Back-up
generators usually take 20-60 seconds to kick in fully, so that is the time the UPS needs to carry the
data center completely on its own.
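A rough UPS sizing sketch based on that 20-60 second bridge window; the safety margin and the example load are illustrative assumptions, not figures from the report:

```python
def ups_energy_kwh(peak_load_kw, generator_start_seconds, safety_margin=1.25):
    """Minimum UPS energy (kWh) needed to carry the peak load until
    the back-up generators come fully online."""
    return peak_load_kw * (generator_start_seconds / 3600) * safety_margin

# hypothetical 500 kW peak load, worst-case 60-second generator start
print(round(ups_energy_kwh(500, 60), 2))  # 10.42
```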
Another way to ensure uptime is the HVAC system. This system ensures the machines and
data center will run constantly without problems. The HVAC is responsible for keeping the machines
within an acceptable temperature and humidity range at all times, and within the optimal range most of
the time. The acceptable range for most servers is between 50 and 90 degrees Fahrenheit with
humidity between 25% and 75%. The optimal range is 70 to 74 degrees Fahrenheit with humidity
between 45% and 50%. The optimal range is a narrow window, but it matters because the reliability
and longevity of electronics depend on temperature and humidity. In fact, the reliability of electronics
drops 50% for every 18 degrees above 70 Fahrenheit. Air flow is important because if you force air in
but not out, the air farther from the ventilation will differ drastically from the air closer to it. Usually
the cool, dry air is ventilated through the plenum and through perforated tiles below the machines, and
forced upwards, where it cools the machines, warms, and is captured by a hot air return in the
sub-ceiling, because hot air rises.
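The halving rule of thumb quoted above implies a simple reliability factor:

```python
def relative_reliability(temp_f):
    """Reliability relative to operation at 70 F, using the rule that
    reliability halves for every 18 degrees F above 70."""
    if temp_f <= 70:
        return 1.0
    return 0.5 ** ((temp_f - 70) / 18)

print(relative_reliability(88))   # 0.5  (one 18-degree step above 70)
print(relative_reliability(106))  # 0.25 (two steps)
```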
The HVAC's effectiveness can be affected by several factors: whether there is proper air
circulation, the placement of the racks, bottom-to-top versus top-to-bottom cooling, and front-to-front
versus front-to-back rack placement. Machines at the front and top of the racks are typically hotter
than those at the bottom and back. To fight this imbalance, it is important to have high-flow racks and
machine placement so that plenty of cool air reaches the top of the racks. It is also more effective to
have back-to-back rows, where every other aisle is alternately hot or cold; this keeps all the hot air in
the same place, aiding heat dispersion. If the racks were instead all placed front-to-back, the hot and
cold aisles would mix, which is not good for air flow or heat dispersion. See an example of an HVAC
in appendix 6.
Once all the challenges of building your data center to fit the planned capacity have been
overcome, the data center must be maintained in order to prevent downtime. Good data center
maintenance usually means having an NOC that monitors the data center 24/7, or at least during peak
hours. Network monitoring can be done by a third party or by system administrators. Constantly
monitoring your network ensures you can immediately address any problems. If redundancy is done
right, most problems still have another layer of infrastructure to get through before they cause an
outage. Fixing problems before they happen is key to uptime. SNMP is a powerful monitoring
protocol used to ensure all the systems and devices are working properly; it lets you know what
resources are out there, and can even give you status and health updates of specific devices or systems
(Jayaswal 27-91).
Both physical and network security are extremely important. If you have data, it needs to be
secured to some level to prevent someone from accessing, tampering with, or stealing it. While
thinking about security, you also have to protect the data center and its data from nature: your location
should be secure both from other people and from natural disasters. Physical security usually varies by
location, but can range from cameras and guards at a low-risk location to RFID cards, PIN codes,
tail-gate sensors, and more at a highly secure one (Kassner 1). Logical security is just as important as
physical security, if not more so. It can vary widely depending on what type of data you are dealing
with, ranging anywhere from basic firewalls, onsite/offsite back-ups, authorization, and antivirus to a
disaster-recovery plan, a DMZ, encryption/decryption, authentication, etc.
Lastly, the data center should be properly cleaned, repaired, and tested on a regular basis. This
ensures the machines keep running and do not become or remain damaged, and that all the systems and
back-up systems are working. One of the biggest causes of overheating is a fan becoming clogged
with dust so it no longer cools the machine; the small particles build up and create dust bunnies that
can be dangerous to the machines (Jayaswal 61-69, 495-536).
Challenges/Solutions
An overview of the main challenges and sub-challenges found in data center architecture and
capacity planning leaves us with two main topics again, predicting the future resource requirements,
and keeping downtime6
as close to zero as possible. This is because when you are setting out to create
a data center, you will always need to know how much of each resources the new data center will need
and keep those resources up and running for your users.
Some key challenges under capacity planning are: predicting the future expansion of resource
needs, determining which resources to sandbag and by how much, and accurately predicting future
resource requirements. What makes predicting future expansion difficult is the unpredictability of
many aspects of the future. Knowing where to sandbag is difficult because it would be safest to
sandbag everywhere. Lastly, accurately predicting future resource requirements is challenging because
there are so many variables and unknowns about the future, including the first two challenges.
A solution to the challenge of predicting future requirements lies within the individual company.
Predictions are made about a company's future, usually by the marketing department. If the VP of
marketing says that in five years users will increase 50% with the same average usage, plus or minus
some degree of accuracy, then that is the number of users you plan for five years from now, plus or
minus the accuracy of that prediction.
Where to sandbag resources, and by how much, can be predicted from market trends.
Obviously we would like to avoid sandbagging as much as possible, but it is also far better to have
more resources than not enough. The big question is where the biggest expansion in resource needs
will be; we will sandbag a little bit for all resources to be safe, but if some resources seem more
unpredictable than others, allocate extra room there. As for how much to sandbag, it should be as
small as possible: if we know an unpredictable resource may exceed the amount predicted, but won't
exceed it by more than 5%, we can sandbag 5% of that resource just in case.
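That sandbagging rule can be expressed as a one-line buffer; the 5% default mirrors the report's example, and the demand value in the call is hypothetical:

```python
def planned_capacity(predicted_demand, sandbag_fraction=0.05):
    """Predicted requirement plus a small sandbag buffer for the
    resources whose growth is hardest to predict."""
    return predicted_demand * (1 + sandbag_fraction)

print(planned_capacity(1000))  # 1050.0
```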
Once you've addressed the first two challenges, the third is easier to deal with. The solution to
predicting future resource requirements is the 3-phase process. If the 3-phase process of capacity
planning is followed, you will increase the accuracy of your predictions. The 3-phase process provides
testing, benchmarks, and formulae useful for predicting future resource requirements based on current
resource usage. Testing current resources and usage, then using the formulae to scale the requirements
to the future number of users, produces the most accurate prediction possible given the information we
have now. It is important to remember that the resource requirements can usually be modified to fit
changing trends, right up until the data center is being built.
The key challenge involved with data center architecture is downtime. Downtime has many
sub-challenges under it: natural disasters, network infrastructure, security, electrical power, and
temperature/humidity. Basically, when it comes down to it, anything that can go wrong in the system
could cause downtime, and most likely will if all steps to prevent it fail. Downtime is a large and
difficult challenge because it has so many causes, and can only be prevented by keeping the entire
system running perfectly.
The challenge of downtime6 is solved by solving the sub-challenges that cause it. Downtime6 can only be prevented, never eliminated, because many of its causes are recurring. If we design the data center with modularity12 and redundancy, however, we gain time to locate and fix potential problems before they cause an outage: a failure would have to occur in both the primary and the redundant measures. The sub-solutions include: a location safe from natural and man-made disasters; strong network infrastructure through redundancy, modularity12, and best practices; adequate security on the system and in the data center; electrical redundancy and modularity12, a UPS23, and a back-up electrical system; and an HVAC10 system that follows best practices and is sufficient for the data center (Jayaswal 27-91, 143-154).
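The value of redundancy can be quantified with a standard reliability-engineering approximation (an illustration, not a formula from the source): a set of independent redundant copies is down only when every copy fails at once.

```python
def parallel_availability(per_copy, n_copies):
    """Availability of n independent redundant copies of a component.

    The system fails only if all copies fail at the same time, so the
    combined failure probability is the per-copy failure probability
    raised to the n-th power. Assumes failures are independent, which
    real shared infrastructure (power, cooling) can violate.
    """
    return 1 - (1 - per_copy) ** n_copies

# A single 99%-available component versus a redundant pair:
print(round(parallel_availability(0.99, 1), 4))  # 0.99
print(round(parallel_availability(0.99, 2), 4))  # 0.9999
```

This is why a second, redundant path buys so much: one extra copy turns two nines into four, provided the copies do not share a failure mode.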
Case Example
OneNeck IT Solutions is a data center company that has reinvented some traditional methods of data center architecture in their newly renovated Minnesota data center. They are praised by reviewers and their own customers for their availability3 and security, despite being non-traditional in some aspects of their designs. What looks non-traditional to some, however, is innovation in the eyes of others. They have created unique ways to complete tasks in a data center, and have taken some aspects of their systems further in depth than other companies. They pride themselves on providing a great user experience, being financially efficient, and being energy efficient. They are so confident in their ability to provide 100% uninterrupted availability3 that anything less is refunded to their customers.
Located in Eden Prairie, MN, this facility, one of nine OneNeck data centers, is a marvel of new technology and innovation. It was designed to incorporate new energy-saving and efficiency-boosting technologies. They have seen such great results that they recently added 6,000 sq. ft. of raised-floor space, increasing their total floor space to 18,000 sq. ft. Normally, adding more space is expensive and should be planned for during the initial design, but the original design was modular12 and scalable, so expansion was not only possible but a great investment.
All their cables, wires, ventilation, and utilities run in the sub-floor, including their gaseous fire-suppression system. They have what is known as a cold-air plenum15, because the cold air that cools the machines is transported through the sub-floor. The sub-floor includes the network infrastructure, HVAC10, electrical, and any redundancies within those systems. The only perforated tiles directing air flow out of the sub-floor are beneath the server racks. The warmed air is gathered above the racks by ducting and sheet metal, then sent into the drop ceiling and returned to be cooled and re-circulated. The entire HVAC10 system is closed, and thus efficient.
OneNeck cools the air using two CRAC (computer-room-air-conditioning) heat exchangers.
The returning hot air is cooled either by a water/glycol mixture pumped through the cooling-tower heat exchangers outside or, on especially hot days, by mechanical air-conditioners in the CRAC units. When the cooling towers handle all the heat exchange, the air conditioning is essentially free, and this being Minnesota, OneNeck tends to have a low A/C bill. When the outside temperature drops below freezing, a DCIM automated system by Honeywell turns off the cooling towers' water pumps and drains the lines. Besides managing the A/C system, the Honeywell Building Automated System (DCIM) manages the raised-floor temperature and humidity, all power systems, physical security, and asset management.
OneNeck provides many service options to their customers, many of whom are healthcare and government organizations. Cloud and hosting solutions include: cloud servers, private clouds, hybrid clouds, cloud storage, desktops in the cloud, and colocation. Managed services include: applications, databases, networks, servers, end-user support, disaster recovery4, security and compliance, and communication and collaboration. ERP application management is offered for Oracle, Microsoft, Infor, and SAP. Professional services include: IT assessments, design, migrations and implementations, IT roadmaps and planning, and technology consulting. Lastly, they offer IT hardware resale for Cisco, EMC, HP, VMware, Citrix, F5, and NetApp products.
They have a small NOC13 in the entrance to their data center. Next is the raised-floor computing area. To get there, you travel through a secure hallway where the inside doors self-lock if the outside door is opened, and vice-versa; this preserves the integrity of the air flow. Their security is organized per room: most customers have their own room within the data center where their machines are located. Their security measures include RFID cards18, PIN codes, dual-iris biometric scanners, and state-of-the-art tail-gate sensors22.
The power supply comes from three different substations on the power grid and mates with their transformers behind the data center. If the local power fails, their eco-friendly UPS23 system
automatically picks up the slack until the three huge diesel generators kick on. The UPS23 system sits between the power grid and generators and the building, meaning it is always on. The UPS23 system conditions all incoming power and uses a flywheel generator that is still spinning and converting momentum to electricity when the power turns off. It can maintain the whole data center on no batteries or fuel, pure momentum, for a full 60 seconds, more than enough time for their 9-second generators to reach full power. OneNeck even ensured they contracted a diesel company with gravity-fed fuel, because if OneNeck doesn't have electricity, neither will the diesel pumps (Kassner 1).
Using a hypothetical situation about this data center, I will express my findings. Let's assume that before expanding the data center, OneNeck hired a new capacity planner and data center architect. Let's also assume OneNeck does not independently have the ability to predict its current or future requirements.
The capacity planner ran tests to determine the desired additional resources for the profit given and came up with x amount of resources, and determined that OneNeck was currently operating at a peak of 90% utilization25 of their current resources, with a resource capacity of 2x over 12,000 sq. ft. The capacity planner told the architect that in order to meet OneNeck's goal of adding x resources, they needed to add 6,000 sq. ft. of computing space. Luckily, the original architect designed the data center to be modular12 and scalable, so all they had to do was run the new cables, wires, and ventilation to the additional servers and floor space, and it should all work normally.
The architect planned the expansion just as OneNeck's current system was set up. After the expansion, the entire data center still remained within the transformers', UPSs'23, generators', and HVAC's10 maximum capacities. The only way this situation could work out this perfectly is if data center architecture's best practices were evident from the start, with both architects, and if the capacity planning properly followed the 3 phases of capacity planning.
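The hypothetical's arithmetic can be checked directly, treating x as one abstract unit of resources and assuming the expansion packs resources at the same density per square foot as the existing floor:

```python
def expansion_floor_space(current_capacity, current_sq_ft, added_capacity):
    """Floor space needed to house added resources, assuming the expansion
    uses the same resource density (units per sq. ft.) as the existing
    build-out. Density is rarely this uniform in practice."""
    density = current_capacity / current_sq_ft  # resource units per sq. ft.
    return added_capacity / density

# Existing: capacity of 2x over 12,000 sq. ft.  Goal: add x more resources.
# Treat x as 1 unit, so the current capacity is 2 units:
print(round(expansion_floor_space(2, 12000, 1)))  # 6000
```

The result matches the 6,000 sq. ft. the capacity planner specified, which is exactly what proportional scaling predicts.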
Recommendation
My recommendation is for anyone looking to overcome the challenges of data center architecture, who will potentially, and probably, run into problems in two key areas: predicting future requirements and preventing downtime6. Both of these technical challenges seem too big to solve, but they can both be managed. The prediction challenge can be made as accurate as possible through proper capacity planning, which is usually reliable. Preventing downtime6 is nearly as simple as following an instruction booklet; the best practices aren't always the best for every situation, but if you stay as close to them as possible, downtime6 should not be an issue in most situations.
I recommend always using the 3-phase capacity planning process to predict future resource requirements. If followed, the predictions should be as accurate as they possibly can be without knowing the variables and unknowns of the future. The 3-phase process will allow you to assess your current average and peak utilization25 and maximum capacity. Using those, anyone can apply the formulae to scale their current resources and determine future needs for a given situation.
I recommend following as many, if not all, of the best practices for data center architecture. It is also important to remember the four key elements to design and build by: scalable, flexible, modular12, and simple. Using redundancy wherever possible almost eliminates the risk of a single point of failure and reduces the risk of most preventable failures. Availability3 should always be a huge factor in the direction a data center architect takes.
Lessons Learned
While conducting this research and compiling this report, I came across several challenges
myself. I also learned some important new facts and concepts through my research. Besides the
technical facts, some very important lessons were learned, not related to capacity planning and data
center architecture. Many of the lessons I learned apply to life and business.
The first lesson is to always plan for the worst-case scenario. If you plan to use 100% of your time to complete a task, you have failed at planning. Set-backs happen when they are least convenient; this idea is so universal that a law about it was named after Murphy. Plan to be set back and you won't be disappointed by the time you end up not needing. Plan for the worst, so if the worst happens you're prepared. The worst feeling is when something goes wrong and you have no idea what to do.
A second lesson learned is, when conducting research, to always keep searching for more; the next source you would have found may be the best one. I found an article with some very good points after my paper was already well underway. Luckily, having learned the first lesson, I had time left to add it. The job is not done until it is overdone.
Lastly, I learned how to compile many technical facts and areas into a report that makes sense. If you first use a few paragraphs to explain your idea, then create an outline of the final idea, you'll end up doing most of the hard work without realizing it. Normally I wouldn't have done a project outline; this one was required, so I did it, and I realized why I should always outline a project, paper, report, or anything of this size.
Glossary
1. Authentication – Determines whether a user is actually a valid user and has a proper login and
credentials. Ex. Computer accounts allow certain people to log in (Jayaswal 17).
2. Authorization – Determines whether a particular host or user is allowed to view or change any
particular information. Ex. The admin account on your computer has authorization to do things
a guest account would not, like change passwords (Jayaswal 206).
3. Availability – The amount of time a system is usable. Usually calculated as a percentage of an elapsed year. Ex. 99% availability equates to 87.6 hours of downtime6 each year (Jayaswal 6).
4. Disaster-Recovery Plan – A plan in place in case of an extended period of outage. Can have
servers dedicated to act as a secure back-up in the event of data loss. Usually has servers that
take over in the event the primary servers fail (Jayaswal 18).
5. DMZ – De-Militarized Zone – A network subnet that contains servers you want more open access to than the internal networks. It is more vulnerable and visible to outsiders. Acts, with a firewall9, as a buffer between the outside internet and the network inside (Jayaswal 496).
6. Downtime – Duration of time where a provided service is inaccessible or offline for any reason
usually: maintenance, power failure, error, network/infrastructure problems, broken cables, etc.
Measured in seconds, minutes, hours, or, in worst cases, days (Jayaswal 5).
7. Encryption/Decryption – Scrambles data using a key before sending. Only intended recipients
or someone with the key to unscramble the data can comprehend the data (Jayaswal 179).
8. ESD – Electro-Static Discharge – A charge difference between two points, people, or devices
that causes a discharge of electricity. Can be small and harmless, like the shock from rubbing
your hair on a balloon, or powerful, damaging, and even deadly, like lightning (Jayaswal 78).
9. Firewall – An internal program that monitors your network connection and stops any bad traffic from getting through the "wall" it establishes. It can also be a specialized router that filters data based on source and destination addresses, ports, content, etc., allowing only authorized traffic to pass through (Jayaswal 486).
10. HVAC – Heating, Ventilation, and Air-Conditioning – General term for the A/C system in a data
center. Controls temperature, humidity, and sometimes fire suppression (Jayaswal 28).
11. Latency – Time delay of data traffic through a network or switch, measured in seconds or milliseconds. I.e. how long it takes for a user to get a response from a system (Jayaswal 131).
12. Modularity – A design concept that separates a system into several components, and each
component can be separately designed, implemented, managed, and replaced (Saltzer 1.3).
13. NOC – Network Operations Center – A facility, usually located outside a data center, dedicated and staffed with people, usually 24/7, to monitor the availability3 of all devices and services in a data center. NOCs use protocols like SNMP20 to help monitor data center(s) (Jayaswal 6).
14. PDU – Power Distribution Unit – An electrical distribution box, fed by a high amp, three phase
connector, with power outlets and circuit breakers included (Jayaswal 77).
15. Plenum – The space between a sub-floor and the raised floor or sub-ceiling and ceiling, almost
always located in a data center. Usually about 2 feet high, large enough to contain network
cables, electrical wires, ventilation system, and sometimes plumbing (Jayaswal 596).
16. POD – Point Of Distribution – A rack containing network switches, terminal servers, and cable patch ports. Used to distribute a network from this point to many end points (Jayaswal 54).
17. Point-load – The weight or load on a single point. Usually refers to the weight a leg of a rack exerts on the tile below it. Ex. If a four-legged rack weighing 100lbs has one leg on a tile, the point-load of that leg on that tile is 25lbs (Jayaswal 44).
18. RFID card – Radio Frequency IDentification – A card with a unique radio signal to grant and
distribute access, usually to unlock a door, or several doors, without a key (PC.net 1).
19. SGA – System Global Area – A region of shared memory, allocated in RAM, used by an Oracle database instance (Burleson 1).
20. SNMP – Simple Network Management Protocol – Sends reports on the status and health of systems, networks, and devices to a central location, usually a NOC13 (Jayaswal 62).
21. Static-load – The total weight on a single tile, floor, etc. If a tile has 1 leg from a four-legged rack weighing 100lbs and 2 legs from a four-legged rack weighing 200lbs, the static-load on that tile is 125lbs (25 + 100), assuming no other weight is being exerted on that tile (Jayaswal 44).
22. Tail-gate sensor – An electronic sensor that can determine when, and how many of, a
person/people or an object has passed through a point or doorway (Kassner 1).
23. UPS – Uninterruptible Power Supply – a large battery, or other device, capable of sustaining the
power capacity for a given amount of time. Used to power a system, until back-up generators
or alternative sources take over, in the case of a power failure (Jayaswal 73).
24. Uptime – Duration of time a service is accessible or online. The maximum uptime, which is usually the target, is 24 hours a day, 7 days a week, 365 days in a normal year (Jayaswal 6).
25. Utilization – The fraction or percentage at which a particular resource is being used with respect to its maximum capability. Ex. A component capable of 20 Mbits/second operating at 5 Mbits/second has 25% utilization, i.e. 25% of its potential is being used (Jayaswal 131).
Sources used:
Burleson, Donald. "Oracle Concepts - SGA System Global Area." Oracle Concepts. DBA-Oracle.com,
Jan. 2014. Web. Dec. 2014.
Jayaswal, Kailash. Administering Data Centers: Servers, Storage, and Voice over IP. Indianapolis, IN:
Wiley Pub., 2006. eBook.
Kassner, Michael. "OneNeck IT Solutions' Minnesota Data Center Uses New Technology to Improve
Service." TechRepublic.com. Tech Republic, Oct. 2014. Web. Oct. 2014.
PC.net. "Definition of RFID." Definition of RFID. PC.net, Aug. 2009. Web. Nov. 2014.
Rich, Joe. "How to Do Capacity Planning." TeamQuest.com. TeamQuest, Jan. 2010. Web. Nov. 2014.
Saltzer, J. H., and Frans Kaashoek. Principles of Computer System Design: An Introduction.
Burlington, MA: Morgan Kaufmann, 2009. eBook.
Schiesser, Rich. "How to Develop an Effective Capacity Planning Process." Computerworld.com. Computer World, Mar. 2010. Web. Nov. 2014.
Appendices
Appendix 1: Characteristics of a good capacity planner (Schiesser 1).
Appendix 2: SMP Factor for Adding CPUs (Jayaswal 152).
Appendix 3: Server racks (Jayaswal 28).
Appendix 4: Good vs. Bad cable practices (Jayaswal 53).
Appendix 5: NOC13 center (Jayaswal 63).
Appendix 6: HVAC10 system (Jayaswal 88).
More Related Content

What's hot

Business Continuity and Recovery Planning for Power Outages
Business Continuity and Recovery Planning for Power OutagesBusiness Continuity and Recovery Planning for Power Outages
Business Continuity and Recovery Planning for Power OutagesARC Advisory Group
 
Building a Project Storyboard with Matt Hansen at StatStuff
Building a Project Storyboard with Matt Hansen at StatStuffBuilding a Project Storyboard with Matt Hansen at StatStuff
Building a Project Storyboard with Matt Hansen at StatStuffMatt Hansen
 
The DMAIC Roadmap (Levels 1 & 2) with Matt Hansen at StatStuff
The DMAIC Roadmap (Levels 1 & 2) with Matt Hansen at StatStuffThe DMAIC Roadmap (Levels 1 & 2) with Matt Hansen at StatStuff
The DMAIC Roadmap (Levels 1 & 2) with Matt Hansen at StatStuffMatt Hansen
 
Project Financial Benefits with Matt Hansen at StatStuff
Project Financial Benefits with Matt Hansen at StatStuffProject Financial Benefits with Matt Hansen at StatStuff
Project Financial Benefits with Matt Hansen at StatStuffMatt Hansen
 
Defining a Project Scope with Matt Hansen at StatStuff
Defining a Project Scope with Matt Hansen at StatStuffDefining a Project Scope with Matt Hansen at StatStuff
Defining a Project Scope with Matt Hansen at StatStuffMatt Hansen
 
Developing a Project Strategy Using IPO-FAT Tool with Matt Hansen at StatStuff
Developing a Project Strategy Using IPO-FAT Tool with Matt Hansen at StatStuffDeveloping a Project Strategy Using IPO-FAT Tool with Matt Hansen at StatStuff
Developing a Project Strategy Using IPO-FAT Tool with Matt Hansen at StatStuffMatt Hansen
 
Lean and Six Sigma Project Methodologies by Matt Hansen at StatStuff (S03)
Lean and Six Sigma Project Methodologies by Matt Hansen at StatStuff (S03)Lean and Six Sigma Project Methodologies by Matt Hansen at StatStuff (S03)
Lean and Six Sigma Project Methodologies by Matt Hansen at StatStuff (S03)Matt Hansen
 
Risk Assessment with a FMEA Tool
Risk Assessment with a FMEA ToolRisk Assessment with a FMEA Tool
Risk Assessment with a FMEA ToolMatt Hansen
 
Production Planning and Scheduling
Production Planning and SchedulingProduction Planning and Scheduling
Production Planning and SchedulingManish kumar
 
Accenture: Outlook What C Suite Should Know About Analytics 2011
Accenture: Outlook What C Suite Should Know About Analytics 2011Accenture: Outlook What C Suite Should Know About Analytics 2011
Accenture: Outlook What C Suite Should Know About Analytics 2011Brian Crotty
 

What's hot (12)

Oracle 0472
Oracle 0472Oracle 0472
Oracle 0472
 
Business Continuity and Recovery Planning for Power Outages
Business Continuity and Recovery Planning for Power OutagesBusiness Continuity and Recovery Planning for Power Outages
Business Continuity and Recovery Planning for Power Outages
 
Building a Project Storyboard with Matt Hansen at StatStuff
Building a Project Storyboard with Matt Hansen at StatStuffBuilding a Project Storyboard with Matt Hansen at StatStuff
Building a Project Storyboard with Matt Hansen at StatStuff
 
The DMAIC Roadmap (Levels 1 & 2) with Matt Hansen at StatStuff
The DMAIC Roadmap (Levels 1 & 2) with Matt Hansen at StatStuffThe DMAIC Roadmap (Levels 1 & 2) with Matt Hansen at StatStuff
The DMAIC Roadmap (Levels 1 & 2) with Matt Hansen at StatStuff
 
Project Financial Benefits with Matt Hansen at StatStuff
Project Financial Benefits with Matt Hansen at StatStuffProject Financial Benefits with Matt Hansen at StatStuff
Project Financial Benefits with Matt Hansen at StatStuff
 
Defining a Project Scope with Matt Hansen at StatStuff
Defining a Project Scope with Matt Hansen at StatStuffDefining a Project Scope with Matt Hansen at StatStuff
Defining a Project Scope with Matt Hansen at StatStuff
 
Developing a Project Strategy Using IPO-FAT Tool with Matt Hansen at StatStuff
Developing a Project Strategy Using IPO-FAT Tool with Matt Hansen at StatStuffDeveloping a Project Strategy Using IPO-FAT Tool with Matt Hansen at StatStuff
Developing a Project Strategy Using IPO-FAT Tool with Matt Hansen at StatStuff
 
Lean and Six Sigma Project Methodologies by Matt Hansen at StatStuff (S03)
Lean and Six Sigma Project Methodologies by Matt Hansen at StatStuff (S03)Lean and Six Sigma Project Methodologies by Matt Hansen at StatStuff (S03)
Lean and Six Sigma Project Methodologies by Matt Hansen at StatStuff (S03)
 
Risk Assessment with a FMEA Tool
Risk Assessment with a FMEA ToolRisk Assessment with a FMEA Tool
Risk Assessment with a FMEA Tool
 
Production Planning and Scheduling
Production Planning and SchedulingProduction Planning and Scheduling
Production Planning and Scheduling
 
Accenture: Outlook What C Suite Should Know About Analytics 2011
Accenture: Outlook What C Suite Should Know About Analytics 2011Accenture: Outlook What C Suite Should Know About Analytics 2011
Accenture: Outlook What C Suite Should Know About Analytics 2011
 
Escalation lets do it right
Escalation   lets do it rightEscalation   lets do it right
Escalation lets do it right
 

Similar to Final Report GET434

Capacity Planning and Headroom Analysis for Taming Database Replication Latency
Capacity Planning and Headroom Analysis for Taming Database Replication LatencyCapacity Planning and Headroom Analysis for Taming Database Replication Latency
Capacity Planning and Headroom Analysis for Taming Database Replication LatencyZhenyun Zhuang
 
3 Data Center infrastructure Design Mistakes.pptx
3 Data Center infrastructure Design Mistakes.pptx3 Data Center infrastructure Design Mistakes.pptx
3 Data Center infrastructure Design Mistakes.pptxBluechip Gulf IT Services
 
What is Data Center Management_ - Modius _ DCIM - Data Center Infrastructure ...
What is Data Center Management_ - Modius _ DCIM - Data Center Infrastructure ...What is Data Center Management_ - Modius _ DCIM - Data Center Infrastructure ...
What is Data Center Management_ - Modius _ DCIM - Data Center Infrastructure ...hrutikeshAnpat
 
Maintworld NEXUS v2
Maintworld NEXUS v2Maintworld NEXUS v2
Maintworld NEXUS v2Rafael Tsai
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
Basic-Project-Estimation-1999
Basic-Project-Estimation-1999Basic-Project-Estimation-1999
Basic-Project-Estimation-1999Michael Wigley
 
TierPoint white paper_How_to_Position_Cloud_ROI_2015
TierPoint white paper_How_to_Position_Cloud_ROI_2015TierPoint white paper_How_to_Position_Cloud_ROI_2015
TierPoint white paper_How_to_Position_Cloud_ROI_2015sllongo3
 
Map r whitepaper_zeta_architecture
Map r whitepaper_zeta_architectureMap r whitepaper_zeta_architecture
Map r whitepaper_zeta_architectureNarender Kumar
 
Future of data center
Future of data centerFuture of data center
Future of data centeraditya panwar
 
Analyzing data, performance and impacts in construction
Analyzing data, performance and impacts in constructionAnalyzing data, performance and impacts in construction
Analyzing data, performance and impacts in constructionMichael Pink
 
Business Continuity Getting Started
Business Continuity Getting StartedBusiness Continuity Getting Started
Business Continuity Getting Startedmxp5714
 
Reducing Time Spent On Requirements
Reducing Time Spent On RequirementsReducing Time Spent On Requirements
Reducing Time Spent On RequirementsByron Workman
 
The Four Pillars of Analytics Technology Whitepaper
The Four Pillars of Analytics Technology WhitepaperThe Four Pillars of Analytics Technology Whitepaper
The Four Pillars of Analytics Technology WhitepaperEdgar Alejandro Villegas
 
Successful_BC_Strategy.pdf
Successful_BC_Strategy.pdfSuccessful_BC_Strategy.pdf
Successful_BC_Strategy.pdfmykovalenko1
 
Building a Business Continuity Capability
Building a Business Continuity CapabilityBuilding a Business Continuity Capability
Building a Business Continuity CapabilityRod Davis
 
New solutions for production dilemmas
New solutions for production dilemmasNew solutions for production dilemmas
New solutions for production dilemmasarmandogo92
 

Similar to Final Report GET434 (20)

Best Practices for Planning your Datacenter
Best Practices for Planning your DatacenterBest Practices for Planning your Datacenter
Best Practices for Planning your Datacenter
 
Capacity Planning and Headroom Analysis for Taming Database Replication Latency
Capacity Planning and Headroom Analysis for Taming Database Replication LatencyCapacity Planning and Headroom Analysis for Taming Database Replication Latency
Capacity Planning and Headroom Analysis for Taming Database Replication Latency
 
DRP.ppt
DRP.pptDRP.ppt
DRP.ppt
 
3 Data Center infrastructure Design Mistakes.pptx
3 Data Center infrastructure Design Mistakes.pptx3 Data Center infrastructure Design Mistakes.pptx
3 Data Center infrastructure Design Mistakes.pptx
 
What is Data Center Management_ - Modius _ DCIM - Data Center Infrastructure ...
What is Data Center Management_ - Modius _ DCIM - Data Center Infrastructure ...What is Data Center Management_ - Modius _ DCIM - Data Center Infrastructure ...
What is Data Center Management_ - Modius _ DCIM - Data Center Infrastructure ...
 
Maintworld NEXUS v2
Maintworld NEXUS v2Maintworld NEXUS v2
Maintworld NEXUS v2
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
Oasize llnl
Oasize llnlOasize llnl
Oasize llnl
 
Basic-Project-Estimation-1999
Basic-Project-Estimation-1999Basic-Project-Estimation-1999
Basic-Project-Estimation-1999
 
TierPoint white paper_How_to_Position_Cloud_ROI_2015
TierPoint white paper_How_to_Position_Cloud_ROI_2015TierPoint white paper_How_to_Position_Cloud_ROI_2015
TierPoint white paper_How_to_Position_Cloud_ROI_2015
 
Map r whitepaper_zeta_architecture
Map r whitepaper_zeta_architectureMap r whitepaper_zeta_architecture
Map r whitepaper_zeta_architecture
 
Future of data center
Future of data centerFuture of data center
Future of data center
 
Analyzing data, performance and impacts in construction
Analyzing data, performance and impacts in constructionAnalyzing data, performance and impacts in construction
Analyzing data, performance and impacts in construction
 
Business Continuity Getting Started
Business Continuity Getting StartedBusiness Continuity Getting Started
Business Continuity Getting Started
 
Reducing Time Spent On Requirements
Reducing Time Spent On RequirementsReducing Time Spent On Requirements
Reducing Time Spent On Requirements
 
The Four Pillars of Analytics Technology Whitepaper
The Four Pillars of Analytics Technology WhitepaperThe Four Pillars of Analytics Technology Whitepaper
The Four Pillars of Analytics Technology Whitepaper
 
Successful_BC_Strategy.pdf
Successful_BC_Strategy.pdfSuccessful_BC_Strategy.pdf
Successful_BC_Strategy.pdf
 
Building a Business Continuity Capability
Building a Business Continuity CapabilityBuilding a Business Continuity Capability
Building a Business Continuity Capability
 
Report on medical center
Report on medical centerReport on medical center
Report on medical center
 
New solutions for production dilemmas
New solutions for production dilemmasNew solutions for production dilemmas
New solutions for production dilemmas
 

Final Report GET434

  • 1. The Global Computing challenges behind: Capacity Planning & Data Center Architecture Stephen J. Gasparini GET434 - Computing Challenges December 12, 2014
  • 2. Page 2 of 27 Capacity Planning & Data Center Architecture Table of Contents Title page………………………………………page 1 Table of Contents…………………....…………page 2 Executive summary……………………….....…page 3 Body of Report…………………………......…..pages 4-20 Introduction……………..…………..…….pages 4-5 Capacity planning…………………………pages 5-8 Data Center Architecture………...………..pages 9-14 Challenges/Solutions...……………………pages 15-16 Case example..…………………………....pages 17-19 Recommendation……………………….…page 20 Lessons Learned……………………………..…..page 21 Glossary……………………………………….…pages 22-23 Bibliography…………………………...….……..page 24 Appendices………………………………..……..pages 25-27
  • 3. Page 3 of 27 Capacity Planning & Data Center Architecture Executive Summary Capacity planning and data center architecture are two semi-related technical topics each with their own technical challenges. Capacity planning deals with the challenge of attempting to accurately predict the resource requirements of the future. Data center architecture deals with the everlasting challenge of preventing and battling downtime6 . Data center architecture is started during the planning to build of a new data center. The most important part is determining the average and peak resource requirements of the new data center, you need enough resources for current and future requirements, but not too much excess to where it is simply just wasting money. The most accurate way to predict these requirements is through the capacity planning 3-phase process. Although capacity planning the best was to predict, there are still too many unknowns and variables to accurately predict how much of each resource will be required in the future. Another very important challenge of data center architecture is prevention of downtime6 . I rank capacity planning higher however, because in order to prevent downtime6 , you need to first create a data center with the correct resource requirements. Failing to provide enough resources can also create downtime6 , hence why resource planning is more important to focus on. Many things can actually cause downtime6 . They include things like: electrical outages, damaged network infrastructure, or even natural disasters, etc. Anything that would cause a network to become inaccessible to its users causes downtime6 . Downtime6 costs companies of all sizes anywhere from hundreds of dollars to millions of dollars in losses for each hour of downtime. Best practices are rules and regulations taken by data center architects in order to prevent downtime6 . 
Hopefully if you prepare enough you won’t have to deal with it often, but even companies like Facebook and Google have occasional downtime6 . It’s so small and seldom, most people don’t even notice it. Typically, the more money spend on a data center, disaster recovery4 , and redundancy, the less downtime6 a company will have. The fact is downtime6 can only be prevented, there will always be a natural or artificial chance and cause of downtime6 for everyone, even Google. For anyone looking to solve the challenge of accurately predicting future resource requirements, we have capacity planning, and the 3-phase process, which will give you the most accurate prediction possible with the knowledge and facts of today. Those looking to solve the challenge of downtime6 should focus on trying to prevent it using best practices, monitor their system for any possible causes, and have a plan to fix it when it inevitably strikes. That is the best way to solve downtime6 . It cannot be completely eliminated as a problems, but by using the best practices for data centers, and proper monitor and maintenance, downtime6 will become a very small problem.
Introduction

It is difficult for a capacity planner or data center architect to be accurate to any great degree when planning for the current and future needs of a particular data center. Capacity planning is the biggest solution to some of the challenges behind data center architecture. In data center architecture there are many prevalent challenges, but two are harder to avoid than the rest: preventing downtime6 and predicting future capacity requirements. Most of the practices in good data center architecture revolve around preventing possible downtime6. This is because the data center is there to provide a service, usually to make money from that service; when the service is down, the data center is not fulfilling its purpose, and is possibly costing the owner a lot of money. The practice behind good capacity planning increases the accuracy of any predictions made about future requirements.

I have investigated the practices behind good and bad capacity planning and data center architecture, the challenges behind them, and solutions to those challenges. I have identified the links and differences between the two topics. My findings revolve around the notion that data center architecture is the main topic and predicting resource usage is one of its many challenges, but the most important one when planning a new data center. Capacity planning is a part of good data center architecture, and a solution to one of its biggest challenges, when it is done correctly. Based on my investigation, my recommendation for data center architecture is to follow best practices and use capacity planning as a guideline. This is because capacity planning is only accurate to a certain point and cannot account for all the unknowns the future will most likely hold, but it is the best we have to go on in the present.
Sometimes even capacity planners and data center architects have to assume and guess at what the future will be like, usually backing those guesses up with facts and testing.
Some of the key best practices to maximize availability3 in data center architecture are: making things efficient, cost-effective, simple, modular12, scalable, and flexible; regularly scheduled maintenance and cleaning; physical security and protection from natural and man-made disasters; redundancy and modularity12 in everything; efficient and smart physical architecture and system design; and, most importantly, preventing downtime6 wherever possible.

Capacity Planning

Capacity planning is the prediction of the resource requirements for a data center. It is more accurate than server sizing, because server sizing is an estimate of hardware based upon the applications, peak performance levels, and expected activity, while capacity planning is backed up by technical performance data acquired through testing. Although capacity planning is the best benchmark we have, the sad truth is that there are simply too many variables to predict exactly how much of each resource will be required until it is too late (Jayaswal 144-145).

There are many questions to ask during capacity planning, and with proper testing you can answer most of them. First you determine your current service level requirements: the behind-the-scenes workloads. Once this is completed, you will know how much of each resource is being used, by whom, and on how many (and which) machines. Next you measure your current capacity usage and overall capacity available. This means you determine the maximum resource usage you are currently prepared for and your current utilization25. What is measured is how heavily the CPU, I/O, applications, memory, and other parts of the machines are being used. You also need to determine when your peak resource usage will equal or exceed your current capacity. It is important to separate the peak performance measurements from the average usage.
You need enough resources allocated to handle the peak workloads, but if you don't have enough to efficiently manage the average workload all the time, you
will have problems with high utilization25 and low efficiency (Rich 2-12). Once you record your current usage, capacity, and user load, you can scale them to the user load you plan for in the future. Some of the unknown variables include future trends, exact requirements, applications, peak usage, average performance levels, etc. As you can see, there are many things that are impossible to predict about the future. These are risks every data center architect will face; you can only plan based on current expectations. It is not easy to be a good capacity planner: appendix 1 shows some suggested skills one should have in order to be successful at it (Schiesser 1).

Another way to explain the capacity planning process is as the "three-phase process" described in chapter 12 of Administering Data Centers. The phases are: Phase 1, define the customer's requirements; Phase 2, measure or estimate current resource utilization25; Phase 3, size the new server.

The first phase assesses the workload for the new environment and establishes what the users' latency11 expectations are. Also important is collecting information on current and future requirements, applications, and the type and amount of workloads and acceptable latency11. Estimating CPU requirements in Phase 1 involves several factors. It helps to ask questions such as: how large are the sorts that will be done, and will they happen in memory or on disk? Will there be parsing or complex navigation? Can the CPU handle the size of the mathematical manipulation being done? (Jayaswal 146). Memory is also assessed in Phase 1. One approach is sandbagging, or adding extra memory to be safe, but that is expensive and inefficient. It is also important not to undershoot the memory requirements. The SGA19 (system global area) must be sized correctly.
You must also determine the maximum number of application users, because that is a huge factor in determining the amount of required memory, I/O throughput, and CPU usage for the application and back-up database servers. There are many factors that impact the amount of memory dedicated to each user, including the type of operations performed, the number of shared images, and the amount and type of sorting and parsing.
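To make the memory arithmetic concrete, here is a minimal sketch of per-user memory sizing. All figures (the fixed overhead, per-user megabytes, user count, and headroom) are invented for illustration; they are not from the report or from Jayaswal.

```python
def size_memory_mb(fixed_mb, per_user_mb, max_users, headroom=0.10):
    """Fixed server memory (OS, kernel, SGA, buffers) plus per-user memory,
    with a small safety margin on top instead of heavy sandbagging."""
    required = fixed_mb + per_user_mb * max_users
    return required * (1 + headroom)

# Hypothetical example: 4096 MB fixed, 8 MB per user, 2000 peak users.
total_mb = size_memory_mb(4096, 8, 2000)
```

The headroom parameter is the compromise the text describes: enough margin that you never run short, without the cost of sandbagging everywhere.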
When estimating the number and size of disks required, you must be sure to factor in I/O spindles and database archiving. It is important to keep I/O spread across several spindles, usually done with several small disks attached directly to the server or attached via SAN fabric. Sometimes spreading around I/O can be difficult and unnecessary. Some items should be archived on separate disks, such as databases, tablespaces, binaries, and redo logs.

Latency11 is the response time between the servers and the users. The ideal latency11 would be 0.00 seconds, meaning the user doesn't have to wait at all. Obviously that is impossible, but getting as close to zero as you can is the goal. This is done by providing enough resources in all the right places. It is important to identify the worst acceptable latency11 for different types of workloads.

The type and amount of current workloads need to be measured. Memory consumption, CPU usage, and I/O usage must be recorded at both average and peak performance levels. This helps set the bar for average and peak levels in the future. The ratio of users to workload is also important for developing a scale for future requirements.

Phase 2 estimates and measures CPU and memory usage for each individual computer and user. This is done by testing the existing workloads of applications. If the applications aren't available, resource usage is estimated from data supplied by the application vendors or from independent tests. The CPU workload caused by a particular computation is determined by multiplying CPU usage by the duration of the CPU load; workload is measured in performance unit-seconds. The best way to measure utilization25 is by running a pre-timed computation. This indicates how fast one CPU is running compared to how fast it would be expected to run.
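The pre-timed computation idea can be sketched roughly as follows. The fixed workload and the expected time are arbitrary stand-ins, not real calibration figures.

```python
import time

def speed_ratio(workload, expected_seconds):
    """Time a fixed, pre-timed workload and compare against the seconds
    this class of CPU is expected to take. A ratio above 1.0 means the
    CPU is slower than expected; below 1.0 means faster."""
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    return elapsed / expected_seconds

def fixed_workload():
    # A repeatable, pre-defined computation to benchmark against.
    total = 0
    for i in range(1_000_000):
        total += i * i
    return total

ratio = speed_ratio(fixed_workload, expected_seconds=0.5)
```

Running the same fixed workload on every candidate machine gives comparable ratios, which is exactly what makes the pre-timed computation a useful utilization25 benchmark.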
Because the computation is pre-timed, we know how many seconds a CPU of size x should take, and that a CPU of size 2x should take y seconds. A CPU expected to take y seconds can then be determined to be faster or slower than expected, based on its actual test results. Memory consumption is tricky because of the previously discussed tension between buying enough memory so
you don't run out, and avoiding sandbagging, which is expensive and inefficient. Multiple areas of memory must be considered and accounted for: operating system memory, the kernel, system library memory, file system buffer memory, and user and application/database requirements.

Phase 3 is sizing the new server requirements. This is done by using the information acquired in the first two phases from testing and estimating. We use those numbers to project future requirements based on the current number of users and the expected number of future users. This gives us our latency11, utilization25, memory, CPU, and other requirements to abide by (Jayaswal 143-151). It is important to recognize that scaling is not always as simple as 1200 * 150% = 1800, thanks to incommensurate scaling. Incommensurate scaling means that when a system is scaled, not everything increases at the same rate. The same concept holds if you were to scale a mouse to the size of an elephant: the mouse would be crushed, because its weight grows much faster than the strength of its bones (Saltzer 1.1.1.3).

CPU estimates are used to predict the number of CPUs needed based on the number of users and the workload at any given time. The CPU requirement is computed from the total CPU needed for computation: the number of users, the projected computations per second, and the estimated CPU workload per computation. Also necessary to account for are the operating system, kernel processes, application processes, and system response time requirements. Adding CPUs does not scale linearly; this is called the SMP factor and can be observed in appendix 2.

Memory estimates are also derived from the numbers determined in the first two phases. Similar to CPU sizing, memory needs to be sized for all aspects of the system, including OS processes, the kernel, the file system buffer, applications, and database shared space.
All aspects must be predicted per user and scaled appropriately to figure out the requirements for the new system. It is important to remember that if there is not enough memory for peak usage, problems will occur (Jayaswal 151-154).
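A rough sketch of the Phase 3 CPU-count estimate, folding in a diminishing-returns SMP factor so that adding CPUs does not scale linearly. The 0.85 factor and all workload numbers are assumptions for illustration; a real SMP factor would come from measurements like those in appendix 2.

```python
def cpus_needed(users, comps_per_sec, load_per_comp, cpu_capacity,
                smp_factor=0.85, max_cpus=1024):
    """Demand is users x computations/sec x workload per computation
    (performance unit-seconds per second). Each added CPU contributes
    slightly less than the previous one (the SMP factor)."""
    demand = users * comps_per_sec * load_per_comp
    cpus, effective = 1, cpu_capacity
    while effective < demand:
        cpus += 1
        if cpus > max_cpus:
            raise ValueError("demand not reachable with this SMP factor")
        effective += cpu_capacity * (smp_factor ** (cpus - 1))
    return cpus

# Hypothetical: 1,000 users, 2 computations/sec each, 0.005 unit-seconds
# per computation, one CPU delivering 2.0 units of capacity per second.
n = cpus_needed(1000, 2, 0.005, 2.0)
```

Note that with the SMP factor, nine CPUs are needed here where naive linear scaling would suggest five; that gap is exactly the incommensurate-scaling effect the text warns about.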
Data Center Architecture

Data center architecture is the design and implementation of a data center, new or old, and involves intense planning. By using capacity planning, we have an accurate depiction of the minimum average and peak requirements the data center needs. Next comes the design of the new data center. The design has two layers, each with its own challenges, solutions, and best practices. The first layer deals with the software on the servers and machines: applications, I/O, OS, services, network tools, memory data, processing information, and data/administrative tools. The second is the physical layer, addressing the actual space, physical machines, network infrastructure, electrical system, and HVAC10 system. The system design and the physical layout of the data center need to be efficient and cost-effective, and provide the best all-around user and administrator experience.

The data center needs to keep availability3 as close to 100% as possible. If anything involving the data center goes wrong or fails, it can cost the company, and its users, money and business, depending on the duration of the downtime6. Downtime6 cost is calculated by multiplying the number of workers who cannot work (or who are working on fixing the system) by their average hourly wage, multiplied by the duration of the downtime6 in hours, plus any lost revenue. So a company with thirty workers paid $20 an hour loses $1,200 in a two-hour outage, plus lost revenue. A large company like Google can lose millions of dollars an hour.

The first layer is a lot less challenging, especially once capacity planning is completed. This layer is usually started before, but not finished until, construction of the physical layer is completed.
Before finalizing this layer, if the capacities planned for have changed (and they will), they can be updated and modified to meet the new expectations, as long as the physical space was allotted (buffer, excess, or expansion space). A good data center architect and capacity planner will have come close with their previous requirement predictions, because they used capacity planning.
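The downtime6 cost formula above reduces to a one-line calculation; the figures match the report's own thirty-worker example.

```python
def downtime_cost(workers, hourly_wage, hours, lost_revenue=0.0):
    """Idle or repair-duty workers x average hourly wage x hours of
    downtime, plus any revenue lost during the outage."""
    return workers * hourly_wage * hours + lost_revenue

# The report's example: 30 workers at $20/hour for a 2-hour outage.
cost = downtime_cost(30, 20, 2)  # $1,200, plus any lost revenue
```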
Most data center architects and system analysts will design a system that is simple, scalable, modular12, flexible, secure, and maximizes availability3; these are the key concepts all good systems are designed around. The system needs to be set up so all the programs (applications, OS, I/O, services, network tools, memory data, and data/admin tools) are separated onto their appropriate machines, but programmed to link and talk to their appropriate program neighbors. It is important to ensure security is implemented, working, and tested throughout this layer before opening the data center to the raw internet; this is the layer where all the sensitive data is located. Security measures that should be taken at this level include: a DMZ5, firewalls9, authentication1, authorization2, and logins. Once the system is finished, the data center can be opened for business.

Just like the first layer, the design of any good data center should follow four key requirements: simple, scalable, modular12, and flexible. Simple enough that anyone in the field could step in, look around, and know what they were looking at. Scalable for the future, when inevitably the data center no longer has enough resources for its requirements. Modular12, so everything is divided into its own sections and sub-sections, making locating specific machines and making repairs simple. Finally, flexible, so when something doesn't work as planned, or new management has its own plan, the design can be adapted to fit another situation.

An architect designing a data center should plan in advance, for the worst, for growth, for changes, and against vandalism. The architect and company need to plan far in advance to get everything right the first time; retrofitting a data center a second time is expensive and not desirable. Planning for the worst ensures 24/7 uptime24 in worst-case scenarios.
Planning for growth and changes falls under scalability and flexibility. Unfortunately, vandalism exists, from rebelling teens to rival companies or groups who want to put you out of business. The architect should constantly simplify his design,
and ensure everything is labeled in advance and while being built. This includes labeling anything and everything, from cables, ports, and wires to machines, racks, and rooms.

It is important to choose a physically secure location to start. An ideal location is safe from natural disasters such as hurricanes, floods, tornados, earthquakes, etc. A location known for security and safety is important; a high-crime neighborhood is no place for hundreds of thousands to millions of dollars of equipment. Also important in a location is the availability of a reliable power source. Another, often forgotten, factor is the availability of local talent already living there who could fill various important positions.

A big decision that needs to be made from the start, and not changed if possible, is whether or not to have a raised floor. A raised floor is beneficial to the machines, but it requires a subfloor, ramps, more building codes, and special floor tiles rated for weight. It is beneficial because everything from network cables and electricity to HVAC10 can run out of sight under the floor in the plenum15. With a raised floor it is important to account for the weight of everything resting on it: the racks, machines, people, forklifts, tiles, and anything else that might be held up by the sub-floor structure. As seen in appendix 3, the weight of a server room adds up quickly as racks holding multiple machines are added. All of it has to be accounted for in point-load17 and static-load22 calculations so the floor is never compromised.

Network infrastructure in a data center is extremely important to uptime24 because it connects the entire system, and the outside world, to its network.
The network should be adequately connected to the outside world through authorized areas like DMZs5, with enough bandwidth to ensure outside users can always connect and get the level of service they require. When creating the network, it is important to benefit from modularity12, using PODs16, patch panels, and network switches to separate parts of the system that can afford to be separate, in order to benefit the system overall.
All network cables, and other cables and wires, should be redundant, properly labeled, color-coded, routed no tighter than the minimum cable bend radius to avoid damage, and laid out to avoid tangling, which causes issues. Examples of good and bad cabling can be observed in appendix 4. Redundancy ensures that if one link goes down, the server is still accessible. Cables in a data center add up quickly: everything links to multiple places, so the number of cables grows much faster than the number of servers. Every time you add a server, you could be adding a dozen or so cables.

Power distribution is similar to the network infrastructure in that every sub-system has its own power requirements and should be supplied with modularity12 to account for this. The main goal of power distribution is to have sufficient and reliable power running throughout the data center. It should be redundant, like the network cables, to ensure no single points of failure. Just as we estimated resource requirements in capacity planning, power is estimated similarly, because each piece of equipment has its own requirements. The power distribution system must account for power on all levels, from individual machines on racks, to entire server rooms, to the entire building. Also included are the requirements of the HVAC10 system, fire control, lighting, monitoring, the NOC13, and security. An electrical system can, and should, be modular12, using circuit breakers or PDUs14 and/or providing electricity room by room. It must separate single-phase and three-phase power, because each goes to its own respective users. ESD8 needs to be accounted for to prevent people or machines from being damaged by an imbalance of electric charge, usually by providing discharge grounding points.
The power distribution must have an adequate back-up in order to prepare for worst-case scenarios where the primary power provider goes offline. Usually that means another power provider or, more likely, back-up generators. The back-up system must have an adequate UPS23 that can
maintain the load, even at peak performance levels, until the back-up power source kicks in. Back-up generators usually take 20-60 seconds to come fully online, so that is how long the UPS23 must maintain the data center completely on its own.

Another way to ensure uptime24 is the HVAC10 system, which keeps the machines and the data center running constantly without problems. The HVAC10 is responsible for keeping the machines within an acceptable temperature and humidity range at all times, and within the optimal range most of the time. The acceptable range for most servers is between 50 and 90 degrees Fahrenheit with humidity between 25% and 75%. The optimal range is 70 to 74 degrees Fahrenheit with humidity between 45% and 50%. The optimal range is a narrow window, but an important one, because the reliability and longevity of electronics depend on their temperature and humidity. In fact, the reliability of electronics drops 50% for every 18 degrees Fahrenheit above 70.

Air flow is important because if you are forcing air in but not out, the air farther from the ventilation will differ drastically from the air closer to it. Usually the cool, dry air is ventilated through the plenum15 and up through perforated tiles below the machines, where it cools the machines, warms, and, because hot air rises, is captured by a hot-air return in the sub-ceiling. The HVAC's10 effectiveness can be affected by several factors: whether there is proper air circulation, the placement of the racks, bottom-to-top versus top-to-bottom cooling, and front-front versus front-back rack placement. Machines at the front and top of the racks typically run hotter than those at the bottom and back. To fight this imbalance, it is important to have high-flow racks and machine placement so plenty of cool air reaches the top of the racks.
Also, it is more effective to have back-to-back rows, where every other aisle is hot or cold; this keeps all the hot air in the same place, aiding heat dispersion. If the racks were instead placed front-to-back, the hot and cold air would mix in the same aisles, which is not good for air flow or heat dispersion. See an example of an HVAC10 system in appendix 6.
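The 50%-per-18-degrees reliability rule quoted above can be sketched as a simple function. Treating 70 degrees Fahrenheit as a baseline reliability of 1.0 is my own framing of the rule, not a figure from the report.

```python
def relative_reliability(temp_f):
    """Reliability of electronics relative to a 70 F baseline, halving
    for every 18 degrees Fahrenheit above 70."""
    if temp_f <= 70:
        return 1.0
    return 0.5 ** ((temp_f - 70) / 18.0)

r_88 = relative_reliability(88)  # 18 degrees over: half the reliability
```

This is why the narrow 70-74 degree optimal window matters: even modest drift above it compounds into a large reliability penalty.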
Once all the challenges of building your data center to the capacity planned for have been overcome, it must be maintained in order to prevent any downtime6. Good data center maintenance usually means having an NOC13 that monitors the data center 24/7, or at least during peak hours. Network monitoring can be done by a third party or by system administrators. Constantly monitoring your network ensures you can immediately address any problems. If redundancy is done right, most problems still have another layer of infrastructure before they cause an outage. Fixing problems before they happen is key to uptime24. SNMP21 is a powerful monitoring protocol used to ensure all the systems and devices are working properly; it lets you know what resources are out there and can even give you status and health updates for specific devices or systems (Jayaswal 27-91).

Both physical and network security are extremely important. If you have data, it needs to be secured to some level to prevent someone from accessing, tampering with, or stealing it. While thinking about security, you also have to protect the data center and its data from nature; your location should be secure both from other people and from natural disasters. Physical security usually varies by location, and can range from cameras and guards at a low-risk location to RFID cards18, PIN codes, tail-gate sensors22, and more at a highly secure one (Kassner 1). Logical security is just as, if not more, important than physical security. Depending on what type of data you are dealing with, it can range anywhere from basic firewalls9, onsite/offsite back-ups, authorization2, and antivirus to a disaster-recovery plan4, a DMZ5, encryption/decryption7, authentication1, etc.

Lastly, the data center should be properly cleaned, repaired, and tested on a regular basis.
This ensures the machines will keep running and not become or remain damaged, and that all the systems and back-up systems are working. One of the biggest causes of overheating is a fan clogged with dust that no longer cools its machine; the small particles build up and create dust bunnies that are dangerous to the machines (Jayaswal 61-69, 495-536).
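Since the goal is availability3 as close to 100% as possible, it helps to see how an availability figure translates into an annual downtime budget. This framing and the targets shown are illustrative additions, not numbers from the report.

```python
def availability(uptime_hours, downtime_hours):
    """Fraction of time the service was reachable."""
    return uptime_hours / (uptime_hours + downtime_hours)

def downtime_budget_hours_per_year(target):
    """Hours of downtime per year that a given availability target allows."""
    return (1 - target) * 365 * 24

a = availability(8756, 4)                       # 4 hours down in a year
budget = downtime_budget_hours_per_year(0.999)  # a hypothetical target
```

Even a seemingly high 99.9% target allows under nine hours of downtime in a whole year, which is why monitoring and redundancy get so much attention.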
Challenges/Solutions

An overview of the main challenges and sub-challenges found in data center architecture and capacity planning leaves us with two main topics again: predicting future resource requirements, and keeping downtime6 as close to zero as possible. When you set out to create a data center, you will always need to know how much of each resource the new data center will need, and then keep those resources up and running for your users.

Some key challenges under capacity planning are: predicting the future expansion of resource needs, determining what and how much to sandbag, and accurately predicting future resource requirements. What makes predicting future expansion difficult is the unpredictability of many aspects of the future. Knowing where to sandbag is difficult because it would be safest to sandbag everywhere. Lastly, accurately predicting future resource requirements is challenging because there are so many variables and unknowns about the future, especially the first two challenges.

A solution to the challenge of predicting future requirements lies within the individual company. There are predictions made about a company's future, usually by the marketing department. If the VP of marketing says that in five years users will increase 50% with the same average usage, plus or minus a degree of accuracy, then that is the number of users five years from now that you plan for, plus or minus the accuracy of that prediction. Where to sandbag resources, and how much, can likewise be predicted from market trends. Obviously we'd like to avoid sandbagging as much as possible, but it is also a lot better to have more resources than not enough.
The big question is where the biggest expansion in resource needs will appear. Obviously we will sandbag a little bit for all resources to be safe, but if some resources seem more unpredictable than others, allocate extra room there. As for how much to sandbag, it should be as little as possible: if we know an unpredictable resource may exceed the
amount predicted, but not by more than 5%, we can sandbag 5% of that resource just in case. Once you've addressed the first two challenges, the third is easier to deal with. The solution to predicting future resource requirements is the 3-phase process. If the 3-phase process of capacity planning is followed, you will increase the accuracy of your predictions. The 3-phase process has testing, benchmarks, and formulae useful in predicting future resource requirements based on current resource usage. Testing current resources and usage, then using the formulae to scale the requirements to the future number of users, produces the most accurate prediction possible given the information we have now. It is important to remember that the resource requirements can usually be modified to fit changing trends right up until the data center is being built.

The key challenge involved with data center architecture is downtime6. Downtime6 has many sub-challenges under it: natural disasters, network infrastructure, security, electrical power, and temperature/humidity. Basically, when it comes down to it, anything that can go wrong in the system could cause downtime6, and most likely will if all steps to prevent it fail. Downtime6 is a large and difficult challenge because it has so many causes, and can only be prevented by keeping the entire system running perfectly. The challenge of downtime6 is solved by solving the sub-challenges that cause it. We can only mitigate downtime6, because many of its causes are recurring. If we design the data center with modularity12 and redundancy, however, this gives us time to locate and fix potential problems before they cause an outage; a problem would have to occur in both the primary and the redundant measures taken.
The sub-solutions include: a location safe from natural and man-made disasters; strong network infrastructure through redundancy, modularity12, and best practices; adequate security on the system and in the data center; electrical redundancy and modularity12, a UPS23, and a back-up electrical system; and an HVAC10 system that follows best practices and is sufficient for the data center (Jayaswal 27-91, 143-154).
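The growth-plus-sandbag arithmetic described in this section can be sketched as follows. The 50% growth figure echoes the marketing example above and the 5% sandbag matches the example cap; the plus-or-minus 10% accuracy band and the starting capacity are invented for illustration.

```python
def plan_capacity(current, growth_rate, accuracy, sandbag=0.05):
    """Scale current capacity by predicted growth, take the high end of
    the prediction's accuracy band, then add a small sandbag margin for
    the resources that remain unpredictable."""
    predicted = current * (1 + growth_rate)
    high_end = predicted * (1 + accuracy)
    return high_end * (1 + sandbag)

# 1,000 units today, 50% growth predicted, +/-10% accuracy, 5% sandbag.
target = plan_capacity(1000, 0.50, 0.10)
```

Keeping the sandbag as a separate, explicit parameter makes it easy to hold it near zero for predictable resources and raise it only where the trends look volatile.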
Case Example

OneNeck IT Solutions is a data center company that has reinvented some traditional methods of data center architecture in its newly renovated Minnesota data center. They are praised by reviewers and their own customers for availability3 and security, despite being non-traditional in some aspects of their designs. However, designs that are non-traditional to some are innovations in the eyes of others. They have created unique ways to complete tasks in a data center, and have taken some aspects of their systems further in depth than other companies. They pride themselves on providing a great user experience, being financially efficient, and being energy efficient. They are so confident in their ability to provide 100% uninterrupted availability3 that anything less is refunded to their customers.

Located in Eden Prairie, MN, one of OneNeck's nine data centers is a marvel of new technology and innovation. It was designed to incorporate new, energy-saving and efficiency-boosting technologies. They have seen such great results that they recently added 6,000 sq. ft. of raised-floor space, increasing total floor space to 18,000 sq. ft. Normally adding more space is expensive and should be planned for during the initial design, but the original design was modular12 and scalable, so expansion was not only possible but a great investment.

All their cables, wires, ventilation, and utilities run in the sub-floor, including their gaseous fire suppression system. They have what is known as a cold-air plenum15, because the cold air that cools the machines is transported through the sub-floor. The sub-floor includes the network infrastructure, HVAC10, electrical, and any redundancies for those systems. The only perforated tiles directing air flow out of the sub-floor are beneath the server racks.
The warmed air is gathered above the racks by ducting and sheet metal, then sent into the drop ceiling and returned to be cooled and re-circulated. The entire HVAC10 system is closed, and thus efficient. OneNeck cools the air using two CRAC (computer room air conditioning) heat exchangers.
The returning hot air is cooled either by a water/glycol mixture pumped through the cooling-tower heat exchangers outside or, on especially hot days, by mechanical air conditioners in the CRAC units. When the cooling towers handle all the heat exchange, the air conditioning is free, and being in Minnesota, OneNeck tends to have a low A/C bill. When the outside temperature drops below freezing, a DCIM automation system by Honeywell turns off the cooling towers' water pumps and drains the lines. Besides managing the A/C system, the Honeywell Building Automation System (DCIM) manages the raised-floor temperature and humidity, all power systems, physical security, and asset management.

OneNeck provides many service options to its customers, many of whom are in healthcare and government. Cloud and hosting solutions include: cloud servers, private clouds, hybrid clouds, cloud storage, desktops in the cloud, and colocation. Managed services include: applications, databases, networks, servers, end-user support, disaster recovery4, security and compliance, and communication and collaboration. ERP application management is offered for Oracle, Microsoft, Infor, and SAP. Professional services include: IT assessments, design, migrations and implementations, IT roadmaps and planning, and technology consulting. Lastly, they offer IT hardware resale for Cisco, EMC, HP, VMware, Citrix, F5, and NetApp products.

They have a small NOC13 at the entrance to their data center. Next is the raised-floor computing area. To get there you travel through a secure hallway where the inside doors self-lock if the outside door is opened, and vice-versa; this preserves the integrity of the air flow. Their security is organized room by room; most customers have their own room within the data center where their machines are located.
Security measures include RFID cards, PIN codes, dual-iris biometric scanners, and state-of-the-art tail-gate sensors. Power comes from three different substations on the power grid and feeds the transformers behind the data center. If local power fails, their eco-friendly UPS system
automatically picks up the slack until the three huge diesel generators kick in. The UPS system sits between the power grid and generators and the building, meaning it is always on. It conditions all incoming power and uses a flywheel generator that is still spinning, converting momentum to electricity, when grid power fails. It can sustain the whole data center with no batteries or fuel, pure momentum, for a full 60 seconds: more than enough time for the 9-second generators to reach full power. OneNeck even contracted a diesel company with gravity-fed fuel, because if OneNeck doesn't have electricity, neither will the diesel pumps (Kassner 1).

Using a hypothetical situation about this data center, I will express my findings. Assume that before expanding the data center, OneNeck hired a new capacity planner and a new data center architect, and that OneNeck could not independently predict its current or future requirements. The capacity planner ran tests to determine the additional resources desired for the given profit and arrived at x resources; he also determined that OneNeck was operating at a peak of 90% utilization of its current resources, with a resource capacity of 2x over 12,000 sq. ft. The capacity planner told the architect that in order to meet OneNeck's goal of adding x resources, they needed to add 6,000 sq. ft. of computing space. Luckily, the original architect had designed the data center to be modular and scalable, so all they had to do was run the new cables, wires, and ventilation to the additional servers and floor space, and everything worked normally. The architect planned the expansion exactly as OneNeck's current system was set up. After the expansion, the entire data center still remained within the maximum capacity of the transformers, UPSs, generators, and HVAC system.
The only way this situation could work out so perfectly is if data center architecture's best practices were evident from the start, with both architects, and if the capacity planner properly followed the 3-phase capacity planning process.
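The floor-space arithmetic in this hypothetical can be checked with a short sketch. The function name and its assumption of uniform resource density per square foot are my own illustration; the report only gives the resulting numbers:

```python
def extra_floor_space(current_capacity, current_sq_ft, added_resources):
    """Floor space needed for the added resources, assuming resource
    density (capacity per square foot) stays constant after the expansion.
    """
    density = current_capacity / current_sq_ft  # resources per sq. ft.
    return added_resources / density
```

With a capacity of 2x over 12,000 sq. ft. and x resources to add, this reproduces the 6,000 sq. ft. of additional computing space from the scenario.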
Recommendation

My recommendation is for anyone looking to overcome the challenges of data center architecture, who will probably run into problems in two key areas: predicting future requirements and preventing downtime. Both of these technical challenges seem too big to solve, but both can be managed. The prediction challenge can be made as accurate as possible through proper capacity planning, which is usually accurate. Preventing downtime is nearly as simple as following an instruction booklet: the best practices aren't always best for every situation, but if you stay as close to them as possible, downtime should not be an issue in most situations.

I recommend always using the 3-phase capacity planning process to predict future resource requirements. If followed, the predictions should be as accurate as they possibly can be without knowing the variables and unknowns of the future. The 3-phase process lets you assess your current average and peak utilization and your maximum capacity; with those numbers, anyone can use the formulae to scale current resources and determine future needs for a given situation.

I recommend following as many of the best practices for data center architecture as possible, if not all of them. It is also important to remember the four key elements to design and build by: scalable, flexible, modular, and simple. Using redundancy wherever possible almost eliminates the risk of a single point of failure and reduces the risk of most preventable failures. Availability should always be a huge factor in the direction a data center architect takes.
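As an illustration of the kind of scaling formula the 3-phase process feeds, here is a sketch of projecting peak load and sizing capacity against it. The function names and the 25% headroom figure are my own assumptions, not quoted from any source:

```python
def projected_peak_load(current_capacity, peak_utilization, growth_factor):
    """Today's peak load (capacity x peak utilization) scaled by the
    expected workload growth over the planning horizon.
    """
    return current_capacity * peak_utilization * growth_factor

def capacity_needed(current_capacity, peak_utilization, growth_factor,
                    headroom=0.25):
    """Capacity required so the projected peak still leaves headroom
    (here a hypothetical 25%) for spikes and estimation error.
    """
    load = projected_peak_load(current_capacity, peak_utilization,
                               growth_factor)
    return load / (1 - headroom)
```

For example, a data center at 90% peak utilization expecting 1.5x growth would need roughly 1.8x its current capacity to keep 25% headroom.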
Lessons Learned

While conducting this research and compiling this report, I came across several challenges myself, and I learned some important new facts and concepts. Besides the technical facts, some very important lessons were learned that are not specific to capacity planning and data center architecture; many of them apply to life and business.

The first lesson is to always plan for the worst-case scenario. If you plan to use 100% of your time to complete a task, you have failed at planning. Setbacks happen when they are least convenient; this idea is so universal that the law named for it bears Murphy's name. Plan to be set back and you won't be disappointed by the time you ended up not needing. Plan for the worst, so that if the worst happens you're prepared. The worst feeling is when something goes wrong and you have no idea what to do.

A second lesson is that when conducting research, you should always keep searching for more; the next thing you would have found may be the best one. I found an article with some very good points after my paper was well underway. Luckily, having learned the first lesson, I had time to add it. The job is not done until it is overdone.

Lastly, I learned how to compile many technical facts and areas into a coherent report. If you first use a few paragraphs to explain your idea, then create an outline of the final idea, you'll end up doing most of the hard work without realizing it. Normally I wouldn't have done a project outline; this one was required, so I did it, and I realized why I should always outline a project, paper, report, or anything of this size.
Glossary

1. Authentication – Determines whether a user is actually a valid user with a proper login and credentials. Ex. Computer accounts allow certain people to log in (Jayaswal 17).
2. Authorization – Determines whether a particular host or user is allowed to view or change particular information. Ex. The admin account on your computer has authorization to do things a guest account cannot, like change passwords (Jayaswal 206).
3. Availability – The amount of time a system is usable, usually calculated as a percentage of an elapsed year. Ex. 99% availability equates to 87.6 hours of downtime each year (Jayaswal 6).
4. Disaster-Recovery Plan – A plan in place in case of an extended outage. Can have servers dedicated to act as a secure back-up in the event of data loss; usually has servers that take over if the primary servers fail (Jayaswal 18).
5. DMZ – De-Militarized Zone – A network subnet containing servers to which you want more open access than the internal networks; it is more vulnerable and visible to outsiders. Acts as a buffer between the outside internet and the network inside (Jayaswal 496).
6. Downtime – Duration of time during which a provided service is inaccessible or offline for any reason, usually maintenance, power failure, error, network/infrastructure problems, broken cables, etc. Measured in seconds, minutes, hours, or, in worst cases, days (Jayaswal 5).
7. Encryption/Decryption – Scrambles data using a key before sending; only intended recipients, or someone with the key to unscramble the data, can comprehend it (Jayaswal 179).
8. ESD – Electro-Static Discharge – A charge difference between two points, people, or devices that causes a discharge of electricity. Can be small and harmless, like the shock from rubbing your hair on a balloon, or powerful, damaging, and even deadly, like lightning (Jayaswal 78).
9. Firewall – An internal program that monitors your network connection and stops any bad traffic from getting through the "wall" it establishes. It can also be a specialized router that filters data based on source and destination addresses, ports, content, etc., allowing only authorized traffic to pass through (Jayaswal 486).
10. HVAC – Heating, Ventilation, and Air-Conditioning – General term for the A/C system in a data center. Controls temperature, humidity, and sometimes fire suppression (Jayaswal 28).
11. Latency – Time delay of data traffic through a network or switch, measured in seconds or milliseconds, i.e. how long it takes for a user to get a response from a system (Jayaswal 131).
12. Modularity – A design concept that separates a system into several components, each of which can be separately designed, implemented, managed, and replaced (Saltzer 1.3).
13. NOC – Network Operations Center – A facility, usually located outside a data center, dedicated to and staffed with people, usually 24/7, who monitor the availability of all devices and services in a data center. NOCs use software like SNMP to help monitor data centers (Jayaswal 6).
14. PDU – Power Distribution Unit – An electrical distribution box, fed by a high-amp, three-phase connector, with power outlets and circuit breakers included (Jayaswal 77).
15. Plenum – The space between a sub-floor and the raised floor, or a sub-ceiling and the ceiling, almost always found in a data center. Usually about 2 feet high; large enough to contain network cables, electrical wires, the ventilation system, and sometimes plumbing (Jayaswal 596).
16. POD – Point Of Distribution – A rack containing network switches, terminal servers, and cable patch ports, used to distribute a network from this point to many end points (Jayaswal 54).
17. Point-load – The weight or load on a single point; usually refers to the weight a leg of a rack exerts on the tile below it. Ex. If a four-legged rack weighing 100 lbs has a leg on a tile, the point load of that leg on that tile is 25 lbs (Jayaswal 44).
18. RFID card – Radio Frequency IDentification – A card with a unique radio signal used to grant and distribute access, usually to unlock a door, or several doors, without a key (PC.net 1).
19. SGA – System Global Area – Areas of shared memory, usually dedicated to RAM (Burleson 1).
20. SNMP – Simple Network Management Protocol – Sends reports on the status and health of systems, networks, and devices to a central location, usually a NOC (Jayaswal 62).
21. Static-load – The total weight on a single tile, floor, etc. If a tile bears 1 leg from a rack weighing 100 lbs and 2 legs from a rack weighing 200 lbs, the static load on that tile is 125 lbs, assuming four legs per rack and no other weight on that tile (Jayaswal 44).
22. Tail-gate sensor – An electronic sensor that can determine when, and how many, people or objects have passed through a point or doorway (Kassner 1).
23. UPS – Uninterruptible Power Supply – A large battery, or other device, capable of sustaining the power load for a given amount of time. Used to power a system until back-up generators or alternative sources take over in the case of a power failure (Jayaswal 73).
24. Uptime – Duration of time a service is accessible or online. The maximum uptime, which is usually the target, is 24 hours a day, 7 days a week, 365 days in a normal year (Jayaswal 6).
25. Utilization – The fraction or percentage at which a particular resource is being used with respect to its maximum capability. Ex. A 20 Mbits/second CPU working at 5 Mbits/second has 25% utilization, i.e. 25% of its potential is being used (Jayaswal 131).
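The availability, utilization, and load examples in this glossary boil down to simple arithmetic, sketched here. The function names are mine, and the four-legs-per-rack assumption mirrors the point-load example:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a normal year

def downtime_hours(availability_pct):
    """Annual downtime implied by an availability percentage.
    99% availability -> 87.6 hours of downtime per year."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

def utilization(actual, capacity):
    """Fraction of a resource's maximum capability in use."""
    return actual / capacity

def point_load(rack_weight_lbs, legs=4):
    """Weight one leg of a rack exerts on the tile beneath it,
    assuming the rack's weight is spread evenly over its legs."""
    return rack_weight_lbs / legs
```

The static-load example follows directly: one leg of a 100 lb rack plus two legs of a 200 lb rack gives 25 + 2 × 50 = 125 lbs on the tile.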
Sources used:

Burleson, Donald. "Oracle Concepts - SGA System Global Area." Oracle Concepts. DBA-Oracle.com, Jan. 2014. Web. Dec. 2014.

Jayaswal, Kailash. Administering Data Centers: Servers, Storage, and Voice over IP. Indianapolis, IN: Wiley Pub., 2006. eBook.

Kassner, Michael. "OneNeck IT Solutions' Minnesota Data Center Uses New Technology to Improve Service." TechRepublic.com. Tech Republic, Oct. 2014. Web. Oct. 2014.

PC.net. "Definition of RFID." PC.net, Aug. 2009. Web. Nov. 2014.

Rich, Joe. "How to Do Capacity Planning." TeamQuest.com. TeamQuest, Jan. 2010. Web. Nov. 2014.

Saltzer, J. H., and Frans Kaashoek. Principles of Computer System Design: An Introduction. Burlington, MA: Morgan Kaufmann, 2009. eBook.

Schiesser, Rich. "How to Develop an Effective Capacity Planning Process." Computerworld.com. Computer World, Mar. 2010. Web. Nov. 2014.
Appendices

Appendix 1: Characteristics of a good capacity planner (Schiesser 1).

Appendix 2: SMP Factor for Adding CPUs (Jayaswal 152).
Appendix 3: Server racks (Jayaswal 28).

Appendix 4: Good vs. bad cable practices (Jayaswal 53).

Appendix 5: NOC center (Jayaswal 63).
Appendix 6: HVAC system (Jayaswal 88).