PRObE
CIOReview | February 2016
The Navigator for Enterprise Solutions | February 26, 2016 | CIOREVIEW.COM
Repurposing Supercomputers: What Happens on “The Other Side?”
By Andree Jacobson, CIO, New Mexico Consortium & Project Manager, PRObE
Government and industry alike invest heavily in massive computer
systems to satisfy the insatiable demand for compute power of today’s
society. Compared to other types of equipment, computers have an
unusually short life-span. After only a few years of operation, most
computers are replaced with faster, better systems. With all this
hardware being continuously decommissioned, where does it all go?
Government, industry, and research facilities keep building larger
and faster supercomputers, a natural effect of trying to keep up with
the ever-growing demand for compute cycles to perform the critical
scientific calculations required to ensure the safety of our nation
or the profitability of a company. The technology competition is
essentially a modern version of the space race that occurred during
the Cold War, as the country with the fastest computer will perform
the most advanced science.
As these massive computer systems are built and put into production,
the “Top 500” list, presented at a supercomputing conference every
six months, reveals the current state of the race. For the last three
years, China’s “Tianhe-2” computer system, with a peak of 54.9
PetaFLOPS (quadrillion floating-point operations per second), has
been in the lead, followed by the U.S. Department of Energy’s Oak
Ridge National Laboratory system called “Titan” at roughly half the
performance of its Chinese counterpart.
Running these systems requires several megawatts of power and costs
millions of dollars a year. In industry, massive corporations like
Google, Amazon, and Facebook build their own data centers around the
world to supply enough compute power to meet the needs of their
hundreds of millions of users. Each of these data centers can host
tens of thousands of individual computers, often referred to as nodes.
A node is at least as powerful as your average office or home
computer. Many have co-processors (like GPUs or Xeon Phis) to speed up
calculations; some have disks, and most have fast networking
capabilities. The end result is a massive amount of hardware that has
one thing in common: at some point, inevitably, each and every node
needs to be discarded.
Live or Let Die?
Andree Jacobson, CIO for the New Mexico Consortium (NMC), focuses on
the fate of these decommissioned supercomputers. He is the project
manager for PRObE (the Parallel Reconfigurable Observational
Environment), an NSF-funded compute facility hosted by the NMC in Los
Alamos, NM. The NMC is a non-profit organization whose purpose is to
improve the research environment in New Mexico by facilitating
collaborations between Los Alamos National Laboratory (LANL) and the
three research universities in the state. PRObE is a pilot project
designed to determine the feasibility of using repurposed
supercomputer hardware for research purposes. Gary Grider, Division
Leader for High Performance Computing at LANL, came up with the idea
for PRObE in 2006 after concluding that many of LANL’s computer
systems are decommissioned and subsequently destroyed despite still
having quite a bit
of useful life left in them. Many facilities deal with their
decommissioned systems by putting them on trucks and driving them to
a secure facility, where the components are placed in an industrial
metal shredder that chops them into tiny pieces, which are then
melted down to recover precious metals. But does something that might
have cost $30M just three or four years prior really possess only
scrap value today? Neither Grider nor Jacobson thought so, and they
co-wrote the NSF proposal together with collaborators from Carnegie
Mellon University and the University of Utah. In October 2010 the NMC
was awarded $10M from the NSF to build PRObE.
From a pure profitability standpoint, the answer to the scrap value
question is probably yes. Based on historical trends, upgrading a
system that is four years into production usually achieves about
double the performance in a tenth of the floor footprint and at one
half to two thirds of the power consumption. As we will see, the
operational expenses (OPEX) for running an outdated computer system
quickly exceed the capital expense (CAPEX) of a new, more efficient
system together with its reduced OPEX.
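To make the upgrade trend concrete, the short Python sketch below converts the quoted figures (double the performance, a tenth of the floor space, one half to two thirds of the power) into resources needed per unit of work, relative to the old system. The function and constant names are illustrative only, and the calculation assumes the historical ratios hold exactly, which real procurements will not.

```python
from fractions import Fraction

# Ratios quoted in the article for replacing a four-year-old system.
SPEEDUP = Fraction(2)                                # about double the performance
FLOOR_FRACTION = Fraction(1, 10)                     # a tenth of the floor footprint
POWER_FRACTIONS = (Fraction(1, 2), Fraction(2, 3))   # half to two thirds of the power


def per_unit_of_work(resource_fraction: Fraction, speedup: Fraction = SPEEDUP) -> Fraction:
    """Resource the new system needs per unit of work, relative to the old one (old = 1)."""
    return resource_fraction / speedup


# Hypothetical helper names; the arithmetic uses only the ratios above.
power_low, power_high = (per_unit_of_work(p) for p in POWER_FRACTIONS)
floor = per_unit_of_work(FLOOR_FRACTION)

print(f"Power per unit of work: {power_low} to {power_high} of the old system")
print(f"Floor space per unit of work: {floor} of the old system")
```

Under these assumptions, the new system does the same amount of work with a quarter to a third of the electricity and a twentieth of the floor space, which is why the scrap-value answer is "probably yes" on profitability alone.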
Many universities that begin deploying cluster-style research
computing resort to using discarded desktop computers. However, these
cobbled-together systems are simply not adequate to meet the needs of
researchers who require very large computer systems to perform their
research. To the average person or researcher at a university, this
means a decommissioned supercomputer might be worth significantly
more than its scrap value, because these older systems can provide
more plentiful and more powerful computational capabilities than
would otherwise be available.
A Different Approach
PRObE is an answer to getting these
decommissioned systems into the hands of
people who can use them, but setting up
and maintaining large clusters containing
more than 1000 nodes requires overcoming
several obstacles:
1) Sheer volume: Decommissioning, moving, inspecting,
troubleshooting, and bringing thousands of old computers back online
takes significant time and effort. Also, unlike when a system is
slated for destruction, care must be taken throughout the
decommissioning process so that parts are not damaged.
2) Space: A computer system with 1000 or more nodes and appropriate
interconnect networks will likely require about 40 to 50 full racks
of computer equipment. PRObE has capacity for 1 MW of compute power,
about 280 tons of cooling, and 3000 sq ft of server room space to
house these large machines. This is sufficient for housing two large
and a few smaller clusters.
3) Electricity cost: 1 MW costs around $1M per year in New Mexico.
It is a required OPEX and, in PRObE’s case, is provided by NSF
funding. This is not a typical setup, but since there is no
procurement cost for the computers, the electricity is covered
instead. This allows PRObE to provide the compute services to the
community at no cost to the individual users.
4) Lack of spare parts: Vendors do not necessarily keep old spare
parts around once a product has reached end-of-life, and sometimes
the vendor of an old system has vanished altogether. In such cases,
the only outlet is the gray market, such as eBay and other vendors
specializing in reused computer equipment. In PRObE’s case, LANL’s
systems are usually larger than what PRObE can house, so a sufficient
number of spares (typically about 20 percent) can accompany each
system. Machines can also be cannibalized to keep the system running
once the spares run out.
5) Staff to operate: PRObE is successful primarily because of the
workforce we use to build and maintain the clusters. In particular,
our staff is creative: they can both assemble and maintain the
hardware even with limited funds. Instead of hiring consultants or
full-time staff members to perform this work, PRObE relies on local
high-school and early college talent, which is also a wonderful way
to train young people. Over the past 6 years we have employed close
to 40 high school students who spend a couple of hours with us each
week. During summers and winter break, these students often work full
time. To PRObE this is an affordable solution, and the students get
hands-on experience building large computer systems.
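The figures in the list above can be sanity-checked with a few lines of arithmetic. The Python sketch below is a back-of-the-envelope illustration, not PRObE's actual planning method. The helper names are hypothetical, the electricity calculation assumes a constant 1 MW load all year, and the spares calculation assumes "20 percent" means 20 percent of the deployed node count.

```python
HOURS_PER_YEAR = 24 * 365        # 8,760 hours, ignoring leap years
KW_PER_TON_OF_COOLING = 3.517    # 1 ton of refrigeration = 12,000 BTU/h, about 3.517 kW


def implied_price_per_kwh(annual_cost_usd: float, load_mw: float) -> float:
    """Electricity price implied by an annual bill, assuming a constant load."""
    kwh_per_year = load_mw * 1000 * HOURS_PER_YEAR
    return annual_cost_usd / kwh_per_year


def nodes_per_rack(nodes: int, racks: int) -> float:
    """Average rack density for a cluster spread evenly over its racks."""
    return nodes / racks


def spare_nodes(nodes: int, spare_percent: int = 20) -> int:
    """Spare nodes to set aside (assumes the percentage applies to deployed nodes)."""
    return nodes * spare_percent // 100


def cooling_capacity_kw(tons: float) -> float:
    """Heat a cooling plant can remove, converted from tons of refrigeration to kW."""
    return tons * KW_PER_TON_OF_COOLING


print(f"Implied electricity price: ${implied_price_per_kwh(1_000_000, 1.0):.3f} per kWh")
print(f"Rack density: {nodes_per_rack(1000, 50):.0f} to {nodes_per_rack(1000, 40):.0f} nodes per rack")
print(f"Spares for a 1000-node system: {spare_nodes(1000)}")
print(f"280 tons of cooling removes about {cooling_capacity_kw(280):.0f} kW of heat")
```

Under these assumptions, $1M per year for 1 MW works out to roughly 11 cents per kWh, a 1000-node cluster averages 20 to 25 nodes per rack, and 280 tons of cooling is close to the 1 MW of power the room can deliver, so the quoted capacities are mutually consistent.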
The Future
PRObE is fortunate that the NSF sees the value in what we do, the
training that we provide, and the scientific value these older
systems can contribute to the academic and scientific communities.
Without NSF support, PRObE would not be possible. While the operation
of PRObE requires both skill and creativity, the work is rewarding
and the scientific benefits are real, as exemplified by the many
citations PRObE regularly receives in the scientific literature.