More Than 30 Years of Experience
in Scientific Computing
In the early 1980s, numerous computer companies were created, some of which later transformed into big players in the IT market. Technical innovations brought dramatic changes to the nascent computer market. In Tübingen, close to one of Germany's oldest and most renowned universities, transtec was founded.
In the early days, transtec focused on reselling DEC computers
and peripherals, delivering high-performance workstations to
university institutes and research facilities. In 1987, SUN/Sparc
and storage solutions broadened the portfolio, enhanced by
IBM/RS6000 products in 1991. These were the typical worksta-
tions and server systems for high performance computing then,
used by the majority of researchers worldwide.
In the late 90s, transtec was one of the first companies to offer
highly customized HPC cluster solutions based on standard Intel
architecture servers, some of which entered the TOP 500 list of
the world’s fastest computing systems.
Thus, given this background and history, it is fair to say that transtec looks back on more than 30 years of experience in scientific computing; our track record shows more than 750 HPC installations. With this experience, we know exactly what customers'
demands are and how to meet them. High performance and ease
of management – this is what customers require today.
HPC systems are of course required to deliver peak performance, as their name indicates, but that is not enough: they must also be easy to han-
dle. Unwieldy design and operational complexity must be avoid-
ed or at least hidden from administrators and particularly users
of HPC computer systems.
This brochure focusses on where transtec HPC solutions excel.
transtec HPC solutions use the latest and most innovative technology: Bright Cluster Manager as the technology leader for unified HPC cluster management, the leading-edge Moab HPC Suite for job and workload management, Intel Cluster Ready certification as an independent quality standard for our systems, and Panasas HPC storage systems for the highest performance and the real ease of management required of a reliable HPC storage system. Again, with
these components, usability, reliability and ease of management
are central issues that are addressed, even in a highly heteroge-
neous environment. transtec is able to provide customers with
well-designed, extremely powerful solutions for Tesla GPU com-
puting, as well as thoroughly engineered Intel Xeon Phi systems.
Intel's InfiniBand Fabric Suite makes managing a large InfiniBand fabric easier than ever before, and Numascale provides excellent technology for AMD-based large-SMP systems. transtec masterfully combines these excellent, well-chosen, readily available components into a fine-tuned, customer-specific, and thoroughly designed HPC solution.
Your decision for a transtec HPC solution means you opt for the most intensive customer care and the best service in HPC. Our experts will be glad to bring in their expertise and support to assist you at any stage, from HPC design to daily cluster operations to HPC Cloud Services.
Last but not least, transtec HPC Cloud Services provide custom-
ers with the possibility to have their jobs run on dynamically pro-
vided nodes in a dedicated datacenter, professionally managed
and individually customizable. Numerous standard applications
like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like
Gromacs, NAMD, VMD, and others are pre-installed, integrated
into an enterprise-ready cloud management environment, and
ready to run.
Have fun reading the transtec HPC Compass 2015/16!
High Performance Computing
Interview with Dr. Oliver Tennert
Dr. Oliver Tennert, Director of HPC Solutions at transtec AG, discusses the broad range of uses of High Performance Computing (HPC), the benefits of turnkey solutions, and the increasing popularity of Big Data Analytics.
Dr. Tennert, in which industries have you seen an increase in
demand for high performance computing?
We expect an annual average growth of approximately 7% over the next five years across all sectors, even in traditional HPC-dominated industries such as CAE (Computer-Aided Engineering) and Life Sciences.
For which type of applications is HPC usually deployed? Could
you give us a concrete example of its application?
High performance computing is basically divided into two areas
of application: simulation on the one hand and data analysis on
the other. Take the automotive industry for example.
During the vehicle design and development stage, a great deal of effort is invested in crash simulations, while the number of material-intensive crash tests, which used to be performed regularly, has been minimized; physical tests are now only conducted for verification purposes.
The analysis of sequenced genome data for hereditary diseases is a prime example of how data analysis is used effectively. Other examples from the field of molecular genetics include measuring the evolutionary relationships between different species or the effect of viruses on the germline.
High performance computing is often criticized on the grounds that an HPC installation is far too expensive and complex for "ordinary" organisations. Which arguments would you use to counter these accusations?
[Grinning:] In the past we have sold HPC solutions to many cus-
tomers who would describe themselves as “ordinary” and who
have been extremely satisfied with the results. But there is a valid point here: using a complex HPC system successfully involves hiding this complexity from users and administrators. When you
take a look under the hood of a car, you will see that it is also a
complex machine. However, drivers do not have to concern them-
selves with features such as transmission technology and ABS
electronics when driving the vehicle.
It is therefore important to ensure you implement a straightfor-
ward and enterprise-ready management and service concept.
transtec offers an intelligent portfolio of service and manage-
ment concepts which are compatible with the paradigms of
“easy to manage” and “easy to use”.
What are the challenges when launching, installing and operat-
ing an HPC system? What should administrators bear in mind to
ensure that the HPC system remains manageable?
When looking for an HPC systems provider, customers should
make sure that the provider has precisely understood their needs
and requirements. This is what differentiates a solutions provider
such as transtec from a product-centric hardware supplier or
manufacturer. The optimum sizing of the entire solution and the
above mentioned easy manageability has a direct impact on the
customer’s or user’s productivity.
Once the system has been delivered, customers want to start productive operation as soon as possible. transtec provides customers with turnkey HPC solutions: depending on the customer's requirements, this includes onsite installation and full integration into the customer's environment.
What is the role of HPC when deploying cloud computing and
big data analyses in large organisations?
When deploying cloud computing, it is important to make a dis-
tinction: what we today call a "private cloud" is essentially
– leaving technological details to one side – a tried-and-tested
concept in which companies can consolidate and provide ac-
cess to their computing capacities from one central location.
The use of “public cloud” providers such as Amazon has signifi-
cantly dropped in light of previous NSA reports. Security aspects
have always been a serious obstacle for cloud computing and, in
terms of HPC, the situation becomes all the more complicated
for an industrial organisation as high performance computing is
an essential element in the organisation's value chain. The data the organisation uses for work is a vital asset and critical to its success. Big data, on the other hand – meaning the handling and analysis of large data volumes – is essentially nothing new. Recently, however, the amount of data has increased at a significantly faster pace than the available computing capacity.
For example, with the use of biochemical methods in "next generation sequencing", complete genome sequences can today be generated and prepared for further analysis at a far faster rate than ever before. The need for HPC solutions has thus become increasingly evident.
While we are on the topic of big data: To
what extent are high-performance, modern
in-memory computing technologies competing
with HPC systems?
This depends on whether you see these technologies in direct
competition to each other! Shared in-memory databases and the
associated computing technologies are innovative concepts for
the particular application, for example real-time analytics or data
mining. We believe that both are contemporary technologies that will coexist for quite some time. From transtec's point of view, these are HPC technologies, and some of our customers are already using them on HPC solutions we have provided.
What is expected of today's supercomputers? What does the
future hold for the HPC segment – and what follows on from
Petaflop and Petabyte?
The Top500 list ranks the fastest computers in the world. "Flops" is used as a measure of performance, in other words the number of floating point operations per second. "Tianhe-2" from China currently holds the number one position with an amazing 34 Petaflops, in other words 34 quadrillion floating point operations per second. However, even rank 500, scoring 133 Teraflops (133 trillion floating point operations per second), is still an impressive figure. That is rough-
ly the equivalent of the consolidated performance of
2,000 well configured PCs.
Exabyte follows Petabyte. Over the last
few years it has been widely discussed
on the HPC market what kind of sys-
tem would be needed to manage these
data volumes, particularly in terms of capacity,
power supply and cooling. Significant technology advancements are expected to be required before this type of supercomputer can be produced; sheer scale alone will not be enough to manage these volumes.
Keyword - security: What is the best way to protect HPC systems
in an age of the NSA scandal and ever increasing cybercrime?
Basically the same applies to HPC systems as it does to company
notebooks and workstations where IT security is concerned:
access by unauthorised parties must be prohibited by deploying
secure IT structures. As described above, enormous volumes of
critical organisation data are stored on an HPC system which is
why efficient data management with effective access control are
essential in HPC environments.
The network connections between the data centre and user
workstations must also be secured on a physical level. This
especially applies when the above-mentioned “private cloud”
scenarios are set up across various locations to allow them to
work together regardless of the particular country or time zone.
Table of Contents and Introduction
High Performance Computing
Performance Turns Into Productivity
Flexible Deployment With xCAT
Service and Customer Care From A to Z
Vector Supercomputing
The NEC SX Architecture
History of the SX Series
Advanced Cluster Management Made Easy
Easy-to-use, Complete and Scalable
Cloud Bursting With Bright
Intelligent HPC Workload Management
Moab HPC Suite – Enterprise Edition
Moab HPC Suite – Grid Option
Optimizing Accelerators with Moab HPC Suite
Moab HPC Suite – Application Portal Edition
Moab HPC Suite – Remote Visualization Edition
Remote Visualization and Workflow Optimization
NICE EnginFrame: A Technical Computing Portal
Desktop Cloud Virtualization
Virtual GPU Technology
Intel Cluster Ready
A Quality Standard for HPC Clusters
Intel Cluster Ready builds HPC Momentum
The transtec Benchmarking Center
Parallel NFS
The New Standard for HPC Storage
What's New in NFS 4.1?
Panasas HPC Storage
A Parallel File System
A Parallel File System
NVIDIA GPU Computing
What is GPU Computing?
Kepler GK110/210 GPU Computing Architecture
A Quick Refresher on CUDA
Intel Xeon Phi Coprocessor
The Architecture
An Outlook on Knights Landing and Omni-Path
InfiniBand High-Speed Interconnect
The Architecture
Intel MPI Library 4.0 Performance
Intel Fabric Suite 7
NumaConnect Background
Redefining Scalable OpenMP and MPI
Big Data
Cache Coherence
NUMA and ccNUMA
IBM Platform HPC
Enterprise-Ready Cluster & Workload Management
What's New in IBM Platform LSF 8
IBM Platform MPI 8.1
General Parallel File System (GPFS)
What is GPFS?
What's New in GPFS Version 3.5
High Performance Computing (HPC) has been with us from the very beginning of the
computer era. High-performance computers were built to solve numerous problems
which the “human computers” could not handle.
The term HPC just hadn't been coined yet. More importantly, some of the early principles have changed fundamentally.
High Performance Computing – Performance Turns Into Productivity
Performance Turns Into Productivity
HPC systems in the early days were much different from those
we see today. First, we saw enormous mainframes from large
computer manufacturers, including a proprietary operating
system and job management system. Second, at universities
and research institutes, workstations made inroads and scien-
tists carried out calculations on their dedicated Unix or VMS
workstations. In either case, if you needed more computing
power, you scaled up, i.e. you bought a bigger machine. Today
the term High-Performance Computing has gained a fundamen-
tally new meaning. HPC is now perceived as a way to tackle
complex mathematical, scientific or engineering problems. The
integration of industry standard, “off-the-shelf” server hardware
into HPC clusters facilitates the construction of computer net-
works of such power that one single system could never achieve.
The new paradigm for parallelization is scaling out.
Computer-supported simulations of realistic processes (so-called Computer-Aided Engineering, CAE) have established themselves as a key pillar in the field of science and research alongside theory
and experimentation. It is nowadays inconceivable that an air-
craft manufacturer or a Formula One racing team would operate
without using simulation software. And scientific calculations,
such as in the fields of astrophysics, medicine, pharmaceuticals
and bio-informatics, will to a large extent be dependent on super-
computers in the future. Software manufacturers long ago rec-
ognized the benefit of high-performance computers based on
powerful standard servers and ported their programs to them.
The main advantage of scale-out supercomputers is just that:
they are infinitely scalable, at least in principle. Since they are
based on standard hardware components, such a supercomput-
er can be expanded whenever the computational capacity of the system is no longer sufficient, simply by add-
different technology can be avoided in most cases.
“transtec HPC solutions are meant to provide customers with unpar-
alleled ease-of-management and ease-of-use. Apart from that, decid-
ing for a transtec HPC solution means deciding for the most intensive
customer care and the best service imaginable.”
Dr. Oliver Tennert
Director Technology Management & HPC Solutions
SW tech BOX
Variations on the theme: MPP and SMP
Parallel computations exist in two major variants today. Ap-
plications running in parallel on multiple compute nodes are
frequently so-called Massively Parallel Processing (MPP) appli-
cations. MPP indicates that the individual processes can each
utilize exclusive memory areas. This means that such jobs are
predestined to be computed in parallel, distributed across the
nodes in a cluster. The individual processes can thus utilize the
separate units of the respective node – especially the RAM, the
CPU power and the disk I/O.
Communication between the individual processes is implement-
ed in a standardized way through the MPI software interface
(Message Passing Interface), which abstracts the underlying
network connections between the nodes from the processes.
However, the MPI standard (current version 2.0) merely requires
source code compatibility, not binary compatibility, so an off-the-
shelf application usually needs specific versions of MPI libraries
in order to run. Examples of MPI implementations are OpenMPI,
MPICH2, MVAPICH2, Intel MPI or – for Windows clusters – MS-MPI.
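To illustrate the message-passing model, here is a minimal sketch of an MPI program in Fortran (not taken from any particular application; the program name and the payload are made up): every process learns its own rank, and rank 0 sends one value to rank 1, with the MPI library hiding the underlying network.

program mpi_sketch
   use mpi
   implicit none
   integer :: ierr, rank, nprocs, status(MPI_STATUS_SIZE)
   double precision :: payload

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)    ! which process am I?
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)  ! how many processes are running?

   if (rank == 0) then
      payload = 42.0d0
      ! send one double-precision value to rank 1; MPI chooses the actual transport
      call MPI_Send(payload, 1, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
   else if (rank == 1) then
      call MPI_Recv(payload, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
      print *, 'rank 1 received', payload
   end if

   call MPI_Finalize(ierr)
end program mpi_sketch

Such a program is built with the wrapper compiler of the chosen MPI implementation and started with its launcher (typically mpif90 and mpirun, though the exact names vary between implementations), for example with one process per cluster node.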
If the individual processes engage in a large amount of commu-
nication, the response time of the network (latency) becomes im-
portant. Latency in a Gigabit Ethernet or a 10GE network is typi-
cally around 10 µs. High-speed interconnects such as InfiniBand,
reduce latency by a factor of 10 down to as low as 1 µs. Therefore,
high-speed interconnects can greatly speed up total processing.
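A rough, illustrative calculation (the message count is made up for the example): if each process of a tightly coupled job exchanges 100,000 small messages with its neighbours, latency alone accounts for about one second at 10 µs per message, but only for about 0.1 seconds at 1 µs – time during which the CPUs mostly wait for data.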
The other frequently used variant is called SMP applications.
SMP, in this HPC context, stands for Shared Memory Processing.
It involves the use of shared memory areas, the specific imple-
mentation of which is dependent on the choice of the underlying
operating system. Consequently, SMP jobs generally only run on
a single node, where they can in turn be multi-threaded and thus
be parallelized across the number of CPUs per node. For many
HPC applications, both the MPP and SMP variant can be chosen.
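As a minimal sketch of the SMP variant (the loop is only an illustration; the directives are standard OpenMP): the iterations of a data-parallel loop are distributed across the threads of one node, which all operate on the same arrays in that node's shared memory.

program smp_sketch
   use omp_lib
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   double precision, allocatable :: a(:), b(:), c(:)

   allocate(a(n), b(n), c(n))
   b = 1.0d0
   c = 2.0d0

   ! the loop iterations are spread across the threads of a single node;
   ! all threads share the arrays a, b and c
   !$omp parallel do
   do i = 1, n
      a(i) = b(i) + c(i)
   end do
   !$omp end parallel do

   print *, 'threads available:', omp_get_max_threads(), '  a(1) =', a(1)
end program smp_sketch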
Many applications are not inherently suitable for parallel execu-
tion. In such a case, there is no communication between the in-
dividual compute nodes, and therefore no need for a high-speed
network between them; nevertheless, multiple computing jobs
can be run simultaneously and sequentially on each individual
node, depending on the number of CPUs.
In order to ensure optimum computing performance for these
applications, it must be examined how many CPUs and cores de-
liver the optimum performance.
We find applications of this sequential type of work typically in
the fields of data analysis or Monte-Carlo simulations.
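As a hedged sketch of this pattern (the estimator and the seeding scheme are made up for the example): each node simply runs its own independent instance of a small program such as the following, each instance with a different seed, and the individual results are combined afterwards.

program mc_sketch
   implicit none
   integer, parameter :: samples = 1000000
   integer :: i, seed_size, hits
   integer, allocatable :: seed(:)
   real :: x, y

   ! every instance uses its own seed (e.g. derived from the job number),
   ! so that the runs are statistically independent of each other
   call random_seed(size=seed_size)
   allocate(seed(seed_size))
   seed = 12345                      ! replace per job, e.g. with the job ID
   call random_seed(put=seed)

   hits = 0
   do i = 1, samples
      call random_number(x)
      call random_number(y)
      if (x*x + y*y <= 1.0) hits = hits + 1
   end do

   ! a simple Monte-Carlo estimate of pi from this single instance
   print *, 'pi estimate:', 4.0 * real(hits) / real(samples)
end program mc_sketch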
Flexible Deployment With xCAT
xCAT as a Powerful and Flexible Deployment Tool
xCAT (Extreme Cluster Administration Tool) is an open source
toolkit for the deployment and low-level administration of HPC
cluster environments, small as well as large ones.
xCAT provides simple commands for hardware control, node
discovery, the collection of MAC addresses, and node deployment with local disks (diskful) or without (diskless). The
cluster configuration is stored in a relational database. Node
groups for different operating system images can be defined.
Also, user-specific scripts can be executed automatically at installation time.
xCAT Provides the Following Low-Level Administrative Functions:
Remote console support
Parallel remote shell and remote copy commands
Plugins for various monitoring tools like Ganglia or Nagios
Hardware control commands for node discovery, collecting MAC addresses, remote power switching and resetting of nodes
Automatic configuration of syslog, remote shell, DNS, DHCP,
and ntp within the cluster
Extensive documentation and man pages
For cluster monitoring, we install and configure the open source
tool Ganglia or the even more powerful open source solution Na-
gios, according to the customer’s preferences and requirements.
Local Installation or Diskless Installation
We offer a diskful or a diskless installation of the cluster nodes.
A diskless installation means the operating system is hosted par-
tially within the main memory, larger parts may or may not be
included via NFS or other means. This approach allows for deploying large numbers of nodes very efficiently, and the cluster
is up and running within a very small timescale. Also, updating
the cluster can be done in a very efficient way. For this, only the
boot image has to be updated, and the nodes have to be reboot-
ed. After this, the nodes run either a new kernel or even a new
operating system. Moreover, with this approach, partitioning the
cluster can also be very efficiently done, either for testing pur-
poses, or for allocating different cluster partitions for different
users or applications.
Development Tools, Middleware, and Applications
According to the application, optimization strategy, or underly-
ing architecture, different compilers lead to code results of very
different performance. Moreover, different, mainly commercial, applications require different MPI implementations. And even
when the code is self-developed, developers often prefer one MPI
implementation over another.
According to the customer’s wishes, we install various compilers,
MPI middleware, as well as job management systems like ParaStation, Grid Engine, Torque/Maui, or the very powerful Moab HPC Suite for high-level cluster management.
transtec HPC as a Service
You will get a range of applications like LS-Dyna, ANSYS,
Gromacs, NAMD etc. from all kinds of areas pre-installed,
integrated into an enterprise-ready cloud and workload
management system, and ready to run. Are you missing your application of choice? Ask us: HPC@transtec.de
transtec Platform as a Service
You will be provided with dynamically allocated compute nodes for running your individual code. The operating sys-
tem will be pre-installed according to your requirements.
Common Linux distributions like RedHat, CentOS, or SLES
are the standard. Do you need another distribution?
Ask us: HPC@transtec.de
transtec Hosting as a Service
You will be provided with hosting space inside a profession-
ally managed and secured datacenter where you can have
your machines hosted, managed, maintained, according to
your requirements. Thus, you can build up your own private
cloud. What range of hosting and maintenance services do you need? Tell us: HPC@transtec.de
HPC @ transtec:
Services and Customer Care from A to Z
transtec AG has over 30 years of experience in scientific computing and delivers highly customized HPC clusters based on standard components to academic and industry customers, with the customer-centric approach that transtec is well known for. Every transtec HPC solution is more than just a rack full of hardware – it is a comprehensive solution with everything the HPC user, owner, and operator need. In the early stages of any customer's HPC project, transtec experts provide extensive and detailed consulting to the customer – they benefit from expertise and experience. Consulting is followed by benchmarking of different systems with either specially crafted customer code or standard benchmarking routines; this aids customers in sizing and devising the optimal and detailed HPC configuration. Each and every piece of HPC hardware undergoes burn-in and thorough testing before shipping – longer if necessary. We make sure that any hardware shipped meets the customer's requirements. transtec HPC solutions are turnkey solutions. By default, a transtec HPC cluster has everything installed and configured – from hardware and operating system through libraries, compilers and developer tools to the customer's production applications.
Onsite delivery means onsite integration into the customer's production environment, be it connecting to the corporate network or setting up software and configuration parts. transtec HPC clusters are ready-to-run systems – we deliver, you turn the key, the system delivers high performance. Every HPC project entails transfer to production: IT operation processes and policies apply, and the customer's administrators are trained hands-on, introduced to hardware components and software, with all operational aspects of configuration management.
transtec services do not stop when the implementation project ends. Beyond transfer to production, transtec takes care. transtec
offers a variety of support and service options, tailored to the cus-
tomer’s needs. When you are in need of a new installation, a major
reconfiguration or an update of your solution – transtec is able to
support your staff and, if you lack the resources for maintaining the
cluster yourself, maintain the HPC solution for you.
From Professional Services to Managed Services for daily operations, transtec is your complete HPC service and solution provider. transtec's high standards of performance, reliability and dependability assure your productivity. With Managed Services, operation and administration of the HPC cluster are handled by transtec service specialists. transtec's on-demand services help provide access to HPC resources whenever customers
need them, for example, because they do not have the possibility
of owning and running an HPC cluster themselves, due to lacking
infrastructure, know-how, or admin staff.
transtec HPC Cloud Services
Last but not least transtec’s services portfolio evolves as custom-
ers‘ demands change. Starting this year, transtec is able to provide
computing power to customers who are in need of more capacity
than they own, which is why this workflow model is sometimes
called computing-on-demand. With these dynamically provided resources, customers have the possibility to have their jobs run on
HPC nodes in a dedicated datacenter, professionally managed and
secured, and individually customizable. Numerous standard ap-
plications like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes
like Gromacs, NAMD, VMD, and others are pre-installed, integrated
into an enterprise-ready cloud and workload management environ-
ment, and ready to run.
Alternatively, whenever customers are in need of space for host-
ing their own HPC equipment because they do not have the space
capacity or cooling and power infrastructure themselves, transtec
is also able to provide Hosting Services to those customers who'd like to have their equipment professionally hosted, maintained, and managed. Customers can thus build up their own private cloud!
Are you interested in any of transtec’s broad range of HPC related
services? Write us an email to HPC@transtec.de. We’ll be happy to
hear from you!
Initially, in the 80s and 90s, the notion of supercomputing was almost a synonym for
vector-computing. This concept is now back in different flavors: almost every architecture uses SIMD, the acronym for "single instruction, multiple data".
Vector Supercomputing –
The NEC SX Architecture
History of HPC, and what we learn
During recent years, systems based on the x86 architecture have domi-
nated the HPC market, with GPUs and many-core systems re-
cently making their inroads. But this period is coming to an end,
and it is again time for a differentiated complementary HPC-
targeted system. NEC will develop such a system.
The Early Days
When did HPC really start? Probably one would think about the
time when Seymour Cray developed high-end machines at CDC
and then started his own company Cray Research, with the legendary Cray-1 as its first product. These systems were very special, they were expensive, and access was extremely limited for researchers and students. But the business case
was obvious in many fields, scientifically as well as commercially.
These Cray machines were vector-computers, initially featuring
only one CPU, as parallelization was far from being mainstream,
even hardly a matter of research. And code development was
easier than today, a code of 20000 lines was already considered
“complex” at that time, and there were not so many standard
codes and algorithms available anyway. For this reason code-de-
velopment could adapt to any kind of architecture to achieve optimal performance.
In 1981 NEC started marketing its first vector supercomputer, the
SX-2, a single-CPU-vector-architecture like the Cray-1. The SX-2
was a milestone: it was the first CPU to exceed one GFlops of peak performance.
The Attack of the “Killer-Micros”
In the 1990s the situation changed drastically, and what we see
today is a direct consequence. The “killer-micros” took over, new
parallel, later “massively parallel” systems (MPP) were built based
on micro-processor architectures which were predominantly
targeted to commercial sectors, or in the case of Intel, AMD or
Motorola to PCs, to products which were successfully invading
the daily life, partially more than even experts had anticipated.
Clearly one could save a lot of cost by using, or at most slightly adapting, these CPUs for high-end systems, and by fo-
cusing investments on other ingredients, like interconnect tech-
nology. At the same time, Linux was increasingly adopted and offered a cheap, generally available OS platform that allowed everybody to use HPC at a reasonable price.
The development of standard CPUs was sponsored by different businesses, and another expensive part, the memory architecture with its many banks, could be built from commodity components as well. The resulting systems comprised many distinct so-called nodes, complete systems by themselves, plus some kind of interconnect technology,
which could also be taken from standard equipment partially.
The impact on the user was the necessity to code for these “dis-
tributed memory systems”. Standards for coding evolved, PVM
and later MPI, plus some proprietary paradigms. MPI is dominat-
ing still today.
In the high-end the special systems were still dominating, be-
cause they had architectural advantages and the codes were run-
ning with very good efficiency on such systems. Most codes were
not parallel at all or at least not well parallelized, let alone on
systems with distributed memory. The skills and experiences to
generate parallel versions of codes were lacking in those compa-
nies selling high-end-machines and also on the customers’ side.
So it took some time for the MPPs to take over. But it was inevi-
table, because this situation, sometimes called “democratization
of HPC”, was positive for both the commercial and the scientific
developments. From that time on researchers and students in
academia and industry got sufficient access to resources they
could not have dreamt of before.
The x86 Dominance
One could dispute whether the x86 architecture is the most advanced or elegant, but undoubtedly it became the dominant one. It was developed for mass-markets, this was the strate-
gy, and HPC was a side-effect, sponsored by the mass-market.
This worked successfully and the proportion of x86 Linux clus-
ters in the Top500 list increased year by year. Because of the
mass-markets it made sense economically to invest a lot into
the development of LSI technology (large-scale integration).
The related progress, which led to more and more transistors
on the chips and increasing frequencies, helped HPC applica-
tions, which got more, faster and cheaper systems with each
generation. Software did not have to adapt a lot between gen-
erations and the software developments rather focused initial-
ly on the parallelization for distributed memory, later on new
functionalities and features. This worked for quite some years.
Systems of sufficient size to produce relevant results for typical
applications are cheap, and most codes do not really scale to a
level of parallelism to use thousands of processes. Some light-
house projects do, after a lot of development which is not easily
affordable, but the average researcher or student is often com-
pletely satisfied with something between a few and perhaps a
few hundred nodes of a standard Linux cluster.
In the meantime already a typical Intel-based dual-socket-node
of a standard cluster has 24 cores, soon a bit more. So paralleliza-
tion is increasingly important. There is “Moore’s law”, invented by
Gordon E. Moore, a co-founder of the Intel Corporation, in 1965. It states that the number of transistors on an integrated circuit doubles about every 18 months, leading to an exponential increase. This statement was often misinterpreted to mean that the per-
formance of a CPU would double every 18 months, which is mis-
leading. We observe ever increasing core-counts, the transistors
are just there, so why not? That applications might not be able to
use these many cores is another question, in principle the poten-
tial performance is there.
The “standard benchmark” in the HPC-scene is the so-called HPL,
High Performance LINPACK, and the famous Top500 list is built
based on this. This HPL benchmark is only one of three parts of
the original LINPACK benchmark, this is often forgotten! But the
other parts did not show such a good progress in performance,
their scalability is limited, so they could not serve to spawn the kind of competition, even between nations, that brings a lot of prestige.
Moreover there are at most a few real codes in the world with
similar characteristics as the HPL. In other words: it does not tell
a lot. This is a highly political situation and probably quite some
money is wasted which would better be invested in brain-ware.
Anyway it is increasingly doubtful that LSI technology will ad-
vance at the same pace in the future. The lattice constant of silicon is 5.43 Å. LSI technology is now using 14 nm processes, the Intel Broadwell generation, which is about 25 times the lattice constant, with 7 nm LSIs on the long-term horizon. Some scientists as-
sume there is still some room to improve even more, but once
the features on the LSI have a thickness of only a few atoms one
cannot get another factor of ten. Sometimes the impact of some
future “disruptive technology” is claimed …
Another obstacle is the cost to build a chip fab for a new genera-
tion of LSI technology. Certainly there are ways to optimize both
production and cost, and there will be bright ideas, but with cur-
rent generations a fab costs at least many billion $. There have
been ideas of slowing down the innovation cycle on that side to
merely be able to recover the investment.
For many HPC applications it is not the compute power of the
CPU which counts, it is the speed with which the CPU can
exchange data with the memory, the so-called memory band-
width. If it is neglected the CPUs will wait for data, the perfor-
mance will decrease. To describe the balance between CPU
performance and memory bandwidth one often uses the ratio
between operations and the amount of bytes that can be trans-
ferred while they are running, the “byte-per-flop-ratio”. Tradi-
tionally vector machines were always great in this regard, and
this is to continue as far as current plans are known. And this is
the reason why a vector machine can realize a higher fraction
of its theoretical performance on a real application than a scalar architecture can.
One other but really overwhelming problem is the increasing
power consumption of systems. Looking at different genera-
tions of Linux clusters one can assume that a dual-socket-node,
depending on how it is equipped with additional features, will
consume between 350 Watt and 400 Watt, a figure that will slightly increase with future generations. Even in Europe sites are talking about elec-
tricity budgets in excess of 1 MWatt, even beyond 10 MWatts. One
also has to cool the system, which adds something like 10-30%
to the electricity cost! The cost to run such a system over 5 years
will be in the same order of magnitude as the purchase price!
The power consumption will grow with the frequency of the CPU, and this is the dominant reason why frequencies are stagnating, if not de-
creasing between product generations. There are more cores, but
individual cores of an architecture are getting slower, not faster!
This has to be compensated on the application side by increased
scalability, which is not always easy to achieve. In the past us-
ers easily got a higher performance for almost every application
with each new generation; this is no longer the case. These problems are fundamental ones, and they have been known for quite some time. Consequently this
leads to attempts to get additional performance by other means.
And the developments in the market clearly indicate that high-
end users are willing to consider the adaptation of codes.
To utilize available transistors CPUs with more complex func-
tionality are designed, utilizing SIMD instructions or providing
more operations per cycle than just an add and a multiply. SIMD
is “Single Instruction Multiple Data”. Initially at Intel the SIMD-in-
structions provided two results in 64-bit-precision per cycle rath-
er than one, in the meantime with AVX it is even four, and on Xeon Phi even eight.
This makes a lot of sense. The electricity consumed by the “ad-
ministration” of the operations, code fetch and decode, configu-
ration of paths on the CPU etc. is a significant portion of the pow-
er consumption. If the energy is used not only for one operation
but for two or even four, the energy per operation is obviously reduced.
In essence, and in order to overcome the performance bottleneck, the architectures developed in the direction which to a larger extent was already realized in the vector supercomputers decades ago.
There is almost no architecture left without such features. But
SIMD operations require application code to be vectorized. Previously, code developments and programming paradigms did not care for fine-grain data parallelism, which is the basis of the applicability of SIMD, so vector machines could not show their full strength. Today, strictly speaking, there is no scalar CPU left; standard x86 systems do not achieve their optimal performance on purely scalar code either!
During recent years GPUs and many-core architectures are in-
creasingly used in the HPC market. Although these systems are
somewhat diverse in detail, in any case the degree of parallelism
needed is much higher than in standard CPUs. This points into
the same direction, a low-level parallelism is used to address the
bottlenecks. In the case of Nvidia GPUs the individual so-called
CUDA cores, relatively simple entities, are steered by a more
complex central processor, realizing a high level of SIMD paral-
lelism. In the case of Xeon Phi the individual cores are also less
powerful, but they are used in parallel using coding paradigms
which operate on the level of loop nests, an approach which was
already used on vector supercomputers during the 1980s!
So these alternative architectures not only go into the same
direction, the utilization of fine-grain parallelism, they also in-
dicate that coding styles which suit real vector machines will dominate future software developments. Note that at least some of these systems have an economic backing by mass markets.
SW tech BOX
History of the SX Series
As already stated NEC started marketing the first commercially
available generation of this series, the SX-2, in 1981. By now the
latest generation is the SX-ACE.
In 1998 Prof. Tadashi Watanabe, the developer of the SX series
and the major driver of NEC's HPC business, received the Eckert-Mauchly Award, a renowned US award, for these achievements. He was a very pragmatic developer, a knowledgeable scientist and a
visionary manager, all at the same time, and a great person to
talk to. From the beginning the SX-architecture was different
from the competing Cray-vectors, it already integrated some of
the features that RISC CPUs at that time were using, but com-
bined them with vector registers and pipelines.
For example the SX could reorder instructions in hardware on
the fly, which Cray machines could not do. In effect this made
the system much easier to use. The single processor SX-2 was the
fastest CPU at that time, the first one to break the 1 GFlops bar-
rier. Naturally the compiler technology for automatic vectoriza-
tion and optimization was important.
With the SX-3 NEC developed a high-end shared-memory system.
It had up to four very strong CPUs. The resolution of memory ac-
cess conflicts by CPUs working in parallel, and the consistency of
data throughout the whole shared-memory system are non-triv-
ial, and the development of such a memory architecture was a major achievement. With the SX-4, NEC moved to CMOS technology, which was a big advantage over the competing Cray T90 with regard to price and power consumption. The T90 was using
ECL technology and achieved 1.8 GFlops peak performance. The
SX-4 had a lower frequency because of CMOS, but because of the
higher density of the LSIs it featured a higher level of parallelism
on the CPU, leading to 2 GFlops per CPU with up to 32 CPUs.
The memory architecture to support those was a major leap
ahead. On this system one could typically reach an efficiency
of 20-40% of peak performance measured throughout complete
applications, far from what is achieved on standard components
today. With this system NEC also implemented the first genera-
tion of its own proprietary interconnect, the so-called IXS, which
allowed connecting several of these strong nodes into one highly
powerful system. The IXS featured advanced functions like a
global barrier and some hardware fault-tolerance, things which
other companies tackled years later.
With the SX-5 this direction was continued, it was a very suc-
cessful system pushing the limits of CMOS-based CPUs and mul-
ti-node configurations. At that time slowly the efforts on the us-
ers’ side to parallelize their codes using “message passing”, PVM
and MPI at that time, started to surface, so the need for a huge
shared memory system slowly disappeared. Still there were quite
some important codes in the scene which were only parallel us-
ing a shared-memory paradigm: initially the vendor-specific "microtasking" and "autotasking" (NEC and some other vendors could automatically parallelize code), and later the standard OpenMP. So there was still sufficient need for a huge and highly capable shared-memory system.
But later it became clear that more and more codes could utilize
distributed memory systems, and moreover the memory-archi-
tecture of huge shared memory systems became increasingly
complex and that way costly. A memory is made of “banks”, in-
dividual pieces of memory which are used in parallel. In order
to provide the data at the necessary speed for a CPU a certain
minimum amount of banks per CPU is required, because a bank
needs some cycles to recover after every access. Consequently
the number of banks had to scale with the number of CPUs, so
that the complexity of the network connecting CPUs and banks
was growing with the square of the number of CPUs. And if the
CPUs became stronger they needed more banks, which made it
even worse. So this concept became too expensive.
At the same time NEC had a lot of experience in mixing MPI- and
OpenMP parallelization in applications, and it turned out that for
the very fast CPUs the shared-memory parallelization was typi-
cally not really efficient for more than four or eight threads. So
the concept was adapted accordingly, which led to the SX-6 and
the famous Earth-Simulator, which kept the #1-spot of the Top500
for some years.
The Earth Simulator was a different machine, but based
on the same technology level and architecture as the SX-6. With the SX-6, for the first time the complete vector CPU was implemented on one single chip. The CPU by itself was not the challenge, but the interface to the memory to support the bandwidth needed for this chip, basically the "pins". This kind of packaging technology was always a strong point in NEC's technology portfolio. The SX-7 was based on the same technology level, but featured 32 CPUs on one shared memory.
The SX-8 then was a new implementation again with some ad-
vanced features which directly addressed the necessities of cer-
tain code-structures frequently encountered in applications. The
communication between the world-wide organization of appli-
cation-analysts and the hard- and software developers in Japan made this possible.
With the SX-9 the number of CPUs per node was pushed up again
to 16 because of certain demands in the market. Just as an in-
dication for the complexity of the memory architecture: the
system had 16384 banks! The peak-performance of one node was 1.6 TFlops.
Keep in mind that the efficiency achieved on real applications
was still in the range of 10% in bad cases, up to 20% or even more,
depending on the code. One should not compare this peak perfor-
mance with some standard chip today, this would be misleading.
The Current Product: SX-ACE
In November 2014 NEC announced the next generation of the
SX-vector-product, the SX-ACE. Some systems are already in-
stalled and accepted, also in Europe.
The main design target was the reduction of power-consump-
tion per sustained performance, and measurements have prov-
en this has been achieved. In order to do so and in line with the
observation that the shared-memory parallelization with vector
CPUs will be useful on four cores, perhaps eight, but normally
not beyond that, the individual node consists of a single-chip-
four-core processor, which integrates the memory controllers and the
interface to the communication interconnect, a next generation
of the proprietary IXS.
That way a node is a small board, leading to a very compact de-
sign. The complicated network between the memory and the
CPUs, which took about 70% of the power-consumption of an SX-
9, was eliminated. And naturally a new generation of LSI-and PCB
technology also leads to reductions.
With the memory bandwidth of 256 GB/s this machine realizes
one “byte per flop”, a measure of the balance between the pure
compute power and the memory bandwidth of a chip. In compar-
ison nowadays typical standard processors have in the order of
one eighth of this!
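To make the ratio concrete – a rough calculation assuming the SX-ACE node peak of 256 GFlops, a figure not stated explicitly above: 256 GB/s divided by 256 GFlops gives one byte per flop, whereas a contemporary two-socket standard server with roughly 1 TFlops of peak performance and on the order of 120 GB/s of memory bandwidth ends up at little more than 0.1 bytes per flop.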
Memory bandwidth is increasingly a problem, this is known in
the market as the “memory wall”, and is a hot topic since more
than a decade. Standard components are built for a different
market, where memory bandwidth is not as important as it is for
HPC applications. Two of these node boards are put into a mod-
ule, eight such modules are combined in one cage, and four of
the cages, i.e. 64 nodes, are contained in one rack. The standard
configuration of 512 nodes consists of eight racks.
NEC already installed several SX-ACE systems in Europe, and
more installations are ongoing. The advantages of the system,
the superior memory bandwidth, the strong CPU with good effi-
ciency, will provide the scientists with performance levels on cer-
tain codes which otherwise could not be achieved. Moreover the
system has proven to be superior in terms of power consumption
per sustained application performance, and this is a real hot topic.
LSI technologies have improved drastically during the last
two decades, but the ever increasing power consumption of
systems now poses serious limitations. Tasks like fetching in-
structions, decoding them or configuring paths on the chip
consume energy, so it makes sense to execute these activities
not for one operation only, but for a set of such operations.
This is meant by SIMD.
The underlying structure to enable the usage of SIMD instructions is called "data parallelism". For example the following loop is data-parallel: the individual loop iterations could be executed in any arbitrary order, and the results would stay the same:

do i = 1, n
   a(i) = b(i) + c(i)

The complete independence of the loop iterations allows executing them in parallel, for example by means of SIMD instructions.
Naturally one could also use other means to parallelize such a
loop, e.g. for a very long loop one could execute chunks of loop
iterations on different processors of a shared-memory machine
using OpenMP. But this will cause overhead, so it will not be effi-
cient for small trip counts, while a SIMD feature in hardware can
deal with such a fine granularity.
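A small sketch of how this fine granularity can be expressed in practice (using the standard OpenMP simd directive; the routine itself is made up for the example): instead of spawning threads, the compiler is merely asked to map the data-parallel loop onto the SIMD units of a single core.

subroutine add_vectors(n, a, b, c)
   implicit none
   integer, intent(in) :: n
   double precision, intent(out) :: a(n)
   double precision, intent(in)  :: b(n), c(n)
   integer :: i
   ! vectorized with SIMD instructions; unlike a thread-parallel version
   ! there is no thread start-up or synchronization overhead, so this
   ! pays off even for small trip counts n
   !$omp simd
   do i = 1, n
      a(i) = b(i) + c(i)
   end do
   !$omp end simd
end subroutine add_vectors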
As opposed to the example above the following loop, a recursion,
is not data parallel:
do i = 1, n-1
a(i+1) = a(i) + c(i)
Obviously a change of the order of the loop iterations, for example the special case of starting with a loop-index of n-1 and decreasing it afterwards, will lead to different results. Therefore these iterations cannot be executed in parallel.

The following description is very much simplified, but explains the basic principles: operations like add and multiply consist of so-called segments, which are executed one after the other in a so-called pipeline. While each segment acts on the data for some kind of intermediate step, only after the execution of the whole pipeline is the final result available.

It is similar to an assembly line in the automotive industry. Now, the segments do not need to deal with the same operation at a given point in time; they could be kept active if sufficient inputs are fed into the pipeline. Again, in the automotive industry the different stages of the assembly line are working on different cars, and nobody would wait for a car to be completed before starting the next one in the first stage of the production, that way keeping almost all stages inactive. Such a situation would be resembled by the following picture:

The time or number of cycles extends to the right-hand side; assuming five stages (vertical direction), the red squares indicate which stage is busy at the given cycle. Obviously this is not efficient, because white squares indicate transistors which consume power without actively producing results. And the compute performance is not optimal: it takes five cycles to compute a single result.

Similarly to the assembly line in the automotive industry, this is what we want to achieve:

After the "latency", five cycles, just the length of the pipeline, we get a result every cycle. Obviously it is easy to feed the pipeline with sufficient independent input data in the case of different loop iterations of a data-parallel loop.

If this is not possible one could also use different operations within the execution of a complicated loop body within one given loop iteration. But typically operations in such a loop body need results from previous operations of the same iteration, so one would rather arrive at a utilization of the pipeline like this:

Some operations can be calculated almost in parallel, others have to wait for previous calculations to finish because they need input. In total the utilization of the pipeline is better, but still not optimal.
This was the situation before the trend to SIMD computing was
revitalized. Before that the advances of LSI technology always
provided an increased performance with every new generation
of product. But then the race for increasing frequency stopped,
and now architectural enhancements are important again.
Single Instruction Multiple Data (SIMD)
Hardware realizations of SIMD have been present since the first
Cray-1, but this was a vector architecture. Nowadays the instruc-
tion sets SSE and AVX in their different generations and flavors
use it, and similarly for example CUDA coding is also a way to uti-
lize data parallelism.
The “width” of the SIMD-instruction, the amount of loop itera-
tions to execute in parallel by one instruction, is kept relatively
small in the case of SSE (two for double precision) and AVX (four
for double precision). The implementations of SIMD in stand-
ard “scalar” processors typically involve add-and-multiply units
which can produce several results per clock-cycle. They also fea-
ture registers of the same “width” to keep several input parame-
ters or several results.
In a way the pipelines are "doubled" for SSE, but both are activated by the same single instruction:

The difference between those SIMD architectures and SIMD with a real vector architecture like the NEC SX-ACE, the only remaining vector system, lies in the existence of vector registers.

In order to provide sufficient input to the functional units, the vector registers have proven to be very efficient. In a way they are a very natural match to the idea of a pipeline – or assembly line. Again thinking about the automotive industry, there one would have a stock of components to constantly feed the assembly line. The following viewgraph resembles this situation:

When an instruction is issued it takes some cycles for the first result to appear, but after that one would get a new result every cycle, until the vector register, which has a certain length, is completely processed. Naturally this can only work if the loop iterations are independent, that means if the loop is data parallel. It would not work if one computation would need a result of another computation as input while this is just being processed.

In fact the SX vector machines always were combining both ideas, the vector register and parallel functional units, so-called "pipe-sets", four in this picture:

There is another advantage of the vector architecture in the SX, and this proves to be very important. The scalar portion of the CPU is acting independently, and while the pipelines are kept busy with input from the vector registers, the inevitable scalar instructions which come with every loop can be executed. This involves checking of loop counters, updating of address pointers and the like. That way these instructions essentially do not consume time. The amount of cycles available to "hide" those is the length of the vector registers divided by the number of pipe sets.

In fact the white space in the picture above can partially be hidden as well, because new vector instructions can be issued in advance to immediately take over once the vector register with the input data for the previous instruction is completely processed. The instruction-issue unit has the complete status information of all available resources, specifically pipes and vector registers.
High Performance Computing (HPC) clusters serve a crucial role at research-intensive
organizations in industries such as aerospace, meteorology, pharmaceuticals, and oil
and gas exploration.
The thing they have in common is a requirement for large-scale computations. Bright's solution for HPC puts the power of supercomputing within the reach of just about any organization.
Advanced Cluster Management Made Easy
The Bright Advantage
Bright Cluster Manager for HPC makes it easy to build and oper-
ate HPC clusters using the server and networking equipment of
your choice, tying them together into a comprehensive, easy-to-manage solution.
Bright Cluster Manager for HPC lets customers deploy complete
clusters over bare metal and manage them effectively. It pro-
vides single-pane-of-glass management for the hardware, the
operating system, HPC software, and users. With Bright Cluster
Manager for HPC, system administrators can get clusters up and
running quickly and keep them running reliably throughout their
life cycle – all with the ease and elegance of a fully featured, en-
terprise-grade cluster manager.
Bright Cluster Manager was designed by a group of highly expe-
rienced cluster specialists who perceived the need for a funda-
mental approach to the technical challenges posed by cluster
management. The result is a scalable and flexible architecture,
forming the basis for a cluster management solution that is ex-
tremely easy to install and use, yet is suitable for the largest and
most complex clusters.
Scale clusters to thousands of nodes
Bright Cluster Manager was designed to scale to thousands of
nodes. It is not dependent on third-party (open source) software
that was not designed for this level of scalability. Many advanced
features for handling scalability and complexity are included:
Management Daemon with Low CPU Overhead
The cluster management daemon (CMDaemon) runs on every
node in the cluster, yet has a very low CPU load – it will not cause
any noticeable slow-down of the applications running on your nodes.
Because the CMDaemon provides all cluster management func-
tionality, the only additional daemon required is the workload
manager (or queuing system) daemon. For example, no daemons for monitoring tools such as Ganglia or Nagios are required. This minimizes the overall load of cluster management software on the compute nodes.
The cluster installer takes you through the installation process and offers advanced options such as "Express" and "Remote".
Multiple Load-Balancing Provisioning Nodes
Node provisioning – the distribution of software from the head
node to the regular nodes – can be a significant bottleneck in
large clusters. With Bright Cluster Manager, provisioning capabil-
ity can be off-loaded to regular nodes, ensuring scalability to a
virtually unlimited number of nodes.
When a node requests its software image from the head node,
the head node checks which of the nodes with provisioning ca-
pability has the lowest load and instructs the requesting node to download or update its image from that provisioning node.
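The load-balancing idea can be illustrated with a short sketch. The node names, load figures and selection rule below are invented for illustration; Bright's internal provisioning logic is not public and will differ in detail.

# Illustrative sketch of load-balanced provisioning: the head node picks
# the provisioning node with the lowest current load to serve an image
# request.  All names and numbers here are made up.

provisioners = {
    "node001": {"load": 0.42, "image": "default-image"},
    "node002": {"load": 0.08, "image": "default-image"},
    "head":    {"load": 0.75, "image": "default-image"},
}

def pick_provisioner(requested_image, provisioners):
    """Return the least-loaded node that can serve the requested image."""
    candidates = [(info["load"], name)
                  for name, info in provisioners.items()
                  if info["image"] == requested_image]
    if not candidates:
        raise LookupError(f"no provisioner offers {requested_image}")
    return min(candidates)[1]

print(pick_provisioner("default-image", provisioners))   # -> node002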
Synchronized Cluster Management Daemons
The CMDaemons are synchronized in time to execute tasks in
exact unison. This minimizes the effect of operating system jit-
ter (OS jitter) on parallel applications. OS jitter is the “noise” or
“jitter” caused by daemon processes and asynchronous events
such as interrupts. OS jitter cannot be totally prevented, but for
parallel applications it is important that any OS jitter occurs at
the same moment in time on every compute node.
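The effect of synchronizing housekeeping work can be pictured with a small sketch: if every node aligns its periodic daemon activity to the same boundaries of a synchronized clock, the jitter hits all processes of a parallel job at the same moment. The period and helper names below are made up; this is a conceptual illustration, not Bright's implementation.

# Conceptual sketch: align periodic daemon work to shared clock boundaries
# so that OS jitter occurs simultaneously on all nodes.

import time

PERIOD = 30.0  # seconds between housekeeping runs, aligned on all nodes

def sleep_until_next_slot(period=PERIOD):
    """Sleep until the next period boundary of the (synchronized) clock."""
    now = time.time()
    next_slot = (int(now // period) + 1) * period
    time.sleep(next_slot - now)

def housekeeping():
    pass  # placeholder: collect metrics, run health checks, ...

def main(iterations=3):
    for _ in range(iterations):
        sleep_until_next_slot()
        housekeeping()

# main()  # would run three synchronized housekeeping cycles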
Bright Cluster Manager supports redundant head nodes and re-
dundant provisioning nodes that can take over from each other
in case of failure. Both automatic and manual failover configura-
tions are supported.
Other redundancy features include support for hardware and
software RAID on all types of nodes in the cluster.
Bright Cluster Manager supports diskless nodes, which can be
useful to increase overall Mean Time Between Failure (MTBF),
particularly on large clusters. Nodes with only InfiniBand and no
Ethernet connection are possible too.
Bright Cluster Manager 7 for HPC
For organizations that need to easily deploy and manage HPC clusters, Bright Cluster Manager for HPC lets customers deploy
complete clusters over bare metal and manage them effectively.
It provides single-pane-of-glass management for the hardware,
the operating system, the HPC software, and users.
With Bright Cluster Manager for HPC, system administrators can
quickly get clusters up and running and keep them running relia-
bly throughout their life cycle – all with the ease and elegance of
a fully featured, enterprise-grade cluster manager.
Simple Deployment Process – Just answer a few questions
about your cluster and Bright takes care of the rest. It installs
all of the software you need and creates the necessary config-
uration files for you, so you can relax and get your new cluster
up and running right – first time, every time.
Head Nodes and Regular Nodes
A cluster can have different types of nodes, but it will always
have at least one head node and one or more “regular” nodes.
A regular node is a node which is controlled by the head node
and which receives its software image from the head node or
from a dedicated provisioning node.
Regular nodes come in different flavors.
Some examples include:
Failover Node – A node that can take over all functionalities
of the head node when the head node becomes unavailable.
Compute Node – A node which is primarily used for computation.
Login Node – A node which provides login access to users of the cluster. It is often also used for compilation and job submission.
I/O Node – A node which provides access to disk storage.
Provisioning Node – A node from which other regular nodes can download their software image.
Workload Management Node – A node that runs the central workload manager, also known as the queuing system.
Subnet Management Node – A node that runs an InfiniBand subnet manager.
More types of nodes can easily be defined and configured
with Bright Cluster Manager. In simple clusters there is only
one type of regular node – the compute node – and the head
node fulfills all cluster management roles that are described
in the regular node types mentioned above.
All types of nodes (head nodes and regular nodes) run the same
CMDaemon. However, depending on the roles that have been assigned to the node, the CMDaemon fulfills different tasks; a single node may, for example, have a 'Login Role', a 'Storage Role' and a 'PBS Client Role' at the same time.
Elements of the Architecture
The architecture of Bright Cluster Manager is implemented by the following key elements:
The Cluster Management Daemon (CMDaemon)
The Cluster Management Shell (CMSH)
The Cluster Management GUI (CMGUI)
A Bright cluster is managed by the CMDaemon on the head node,
which communicates with the CMDaemons on the other nodes
over encrypted connections.
The CMDaemon on the head node exposes an API which can also
be accessed from outside the cluster. This is used by the cluster
management Shell and the cluster management GUI, but it can
also be used by other applications if they are programmed to ac-
cess the API. The API documentation is available on request.
The diagram below shows a cluster with one head node, three
regular nodes and the CMDaemon running on each node.
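Since the API documentation is only available on request, the following sketch is hypothetical: it merely shows the general pattern of a client sending a JSON request to a management daemon over an encrypted connection. Host name, port, path and payload are invented for illustration and do not reflect the real API.

# Hypothetical sketch of a client talking to a management daemon over an
# encrypted connection.  Host, port, path and payload are invented;
# consult the vendor's API documentation for the real interface.

import json
import ssl
import urllib.request

def query_daemon(host, path, payload, cafile):
    ctx = ssl.create_default_context(cafile=cafile)
    req = urllib.request.Request(
        f"https://{host}:8081{path}",             # port 8081 is made up
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)

# Example (would only work against a real, reachable daemon):
# print(query_daemon("headnode", "/json", {"call": "listNodes"},
#                    cafile="/path/to/ca.pem"))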
Can Install on Bare Metal – With Bright Cluster Manager there
is nothing to pre-install. You can start with bare metal serv-
ers, and we will install and configure everything you need. Go
from pallet to production in less time than ever.
Comprehensive Metrics and Monitoring – Bright Cluster Man-
ager really shines when it comes to showing you what’s going
on in your cluster. It’s beautiful graphical user interface lets
you monitor resources and services across the entire cluster
in real time.
Powerful User Interfaces – With Bright Cluster Manager you
don’t just get one great user interface, we give you two. That
way, you can choose to provision, monitor, and manage your
HPC clusters with a traditional command line interface or our
beautiful, intuitive graphical user interface.
Integrated workload management ensures HPC applications get the resources they require according to the policies of your organization. Your cluster will prioritize workloads according to your business priorities.
Bright Cluster Manager comes with a complete set of HPC tools and libraries so you will be ready to develop, debug, and deploy your HPC code right away.
“The building blocks for transtec HPC solutions must be chosen
according to our goals ease-of-management and ease-of-use. With
Bright Cluster Manager, we are happy to have the technology leader
at hand, meeting these requirements, and our customers value that.”
HPC Solution Engineer
Monitor your entire cluster at a glance with Bright Cluster Manager
Quickly build and deploy an HPC cluster
When you need to get an HPC cluster up and running, don’t waste
time cobbling together a solution from various open source
tools. Bright’s solution comes with everything you need to set up
a complete cluster from bare metal and manage it with a beauti-
ful, powerful user interface.
Whether your cluster is on-premise or in the cloud, Bright is a
virtual one-stop-shop for your HPC project.
Build a cluster in the cloud
Bright’s dynamic cloud provisioning capability lets you build an
entire cluster in the cloud or expand your physical cluster into
the cloud for extra capacity. Bright allocates the compute re-
sources in Amazon Web Services automatically, and on demand.
Build a hybrid HPC/Hadoop cluster
Does your organization need HPC clusters for technical comput-
ing and Hadoop clusters for Big Data? The Bright Cluster Man-
ager for Apache Hadoop add-on enables you to easily build and
manage both types of clusters from a single pane of glass. Your
system administrators can monitor all cluster operations, man-
age users, and repurpose servers with ease.
Driven by research
Bright has been building enterprise-grade cluster management
software for more than a decade. Our solutions are deployed in
thousands of locations around the globe. Why reinvent the wheel when you can rely on our experts to help you implement the ideal cluster management solution for your organization?
Clusters in the cloud
Bright Cluster Manager can provision and manage clusters that
are running on virtual servers inside the AWS cloud as if they
were local machines. Use this feature to build an entire cluster
in AWS from scratch or extend a physical cluster into the cloud
when you need extra capacity.
Use native services for storage in AWS – Bright Cluster Manager
lets you take advantage of inexpensive, secure, durable, flexible
and simple storage services for data use, archiving, and backup
in the AWS cloud.
Make smart use of AWS for HPC – Bright Cluster Manager can
save you money by instantiating compute resources in AWS only
when they are needed. It uses built-in intelligence to create in-
stances only after the data is ready for processing and the back-
log in on-site workloads requires it.
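The policy described above – create instances only when the data is staged and the on-site backlog justifies it – can be sketched as a simple decision rule. The thresholds, the AMI and instance type, and the direct boto3 call below are illustrative assumptions, not Bright's actual mechanism.

# Simplified cloud-bursting policy in the spirit of the text: start cloud
# instances only when input data is staged and the on-site backlog is
# large enough.  Thresholds and AWS parameters are placeholders.

import boto3

def should_burst(queued_jobs, free_local_slots, data_staged,
                 backlog_threshold=20):
    return data_staged and (queued_jobs - free_local_slots) > backlog_threshold

def launch_cloud_nodes(count, ami="ami-0123456789abcdef0",
                       itype="c4.8xlarge"):
    ec2 = boto3.client("ec2")      # region/credentials taken from the env
    return ec2.run_instances(ImageId=ami, InstanceType=itype,
                             MinCount=count, MaxCount=count)

# The launch call is commented out so the sketch has no side effects:
if should_burst(queued_jobs=64, free_local_slots=16, data_staged=True):
    print("bursting: would launch 4 cloud nodes")
    # launch_cloud_nodes(count=4)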
You decide which metrics to monitor. Just drag a component into the display area and Bright Cluster Manager instantly creates a graph of the data you need.
High performance meets efficiency
Initially, massively parallel systems constitute a challenge to
both administrators and users. They are complex beasts. Anyone
building HPC clusters will need to tame the beast, master the
complexity and present users and administrators with an easy-
to-use, easy-to-manage system landscape.
Leading HPC solution providers such as transtec achieve this
goal. They hide the complexity of HPC under the hood and match
high performance with efficiency and ease-of-use for both users
and administrators. The “P” in “HPC” gains a double meaning:
“Performance” plus “Productivity”.
Cluster and workload management software like Moab HPC
Suite, Bright Cluster Manager or Intel Fabric Suite provide the
means to master and hide the inherent complexity of HPC sys-
tems. For administrators and users, HPC clusters are presented
as single, large machines, with many different tuning parame-
ters. The software can also provide a unified view of existing clusters if unified management becomes a requirement at any point in time after the first installation. Thus,
daily routine tasks such as job management, user management,
queue partitioning and management, can be performed easily
with either graphical or web-based tools, without any advanced
scripting skills or technical expertise required from the adminis-
trator or user.
Bright Cluster Manager unleashes and manages
the unlimited power of the cloud
Create a complete cluster in Amazon EC2, or easily extend your
onsite cluster into the cloud. Bright Cluster Manager provides a
“single pane of glass” to both your on-premise and cloud resourc-
es, enabling you to dynamically add capacity and manage cloud
nodes as part of your onsite cluster.
Cloud utilization can be achieved in just a few mouse clicks, with-
out the need for expert knowledge of Linux or cloud computing.
With Bright Cluster Manager, every cluster is cloud-ready, at no
extra cost. The same powerful cluster provisioning, monitoring,
scheduling and management capabilities that Bright Cluster
Manager provides to onsite clusters extend into the cloud.
The Bright advantage for cloud utilization
Ease of use: Intuitive GUI virtually eliminates user learning
curve; no need to understand Linux or EC2 to manage the system.
Alternatively, cluster management shell provides powerful
scripting capabilities to automate tasks.
Complete management solution: Installation/initialization,
provisioning, monitoring, scheduling and management in
one integrated environment.
Integrated workload management: Wide selection of work-
load managers included and automatically configured with
local, cloud and mixed queues.
Single pane of glass, complete visibility and control: cloud and onsite nodes are visible from a single console with drill-downs and historic data.
Efficient data management via data-aware scheduling: Auto-
matically ensures data is in position at start of computation;
delivers results back when complete.
Secure, automatic gateway: Mirrored LDAP and DNS services
inside Amazon VPC, connecting local and cloud-based nodes
for secure communication over VPN.
Cost savings: More efficient use of cloud resources; support
for spot instances, minimal user intervention.
Two cloud scenarios
Bright Cluster Manager supports two cloud utilization scenarios:
“Cluster-on-Demand” and “Cluster Extension”.
Scenario 1: Cluster-on-Demand
The Cluster-on-Demand scenario is ideal if you do not have a clus-
ter onsite, or need to set up a totally separate cluster. With just a
few mouse clicks you can instantly create a complete cluster in
the public cloud, for any duration of time.
Scenario 2: Cluster Extension
The Cluster Extension scenario is ideal if you have a cluster onsite
but you need more compute power, including GPUs. With Bright
Cluster Manager, you can instantly add EC2-based resources to
your onsite cluster, for any duration of time. Extending into the
cloud is as easy as adding nodes to an onsite cluster – there are
only a few additional, one-time steps after providing the public
cloud account information to Bright Cluster Manager.
The Bright approach to managing and monitoring a cluster in
the cloud provides complete uniformity, as cloud nodes are
managed and monitored the same way as local nodes:
Software image management
Integrated workload management
Interfaces – GUI, shell, user portal
Monitoring and health checking
Compilers, debuggers, MPI libraries, mathematical libraries
and environment modules
Bright Cluster Manager also provides additional features that are
unique to the cloud.
Amazon spot instance support – Bright Cluster Manager enables
users to take advantage of the cost savings offered by Amazon’s
spot instances. Users can specify the use of spot instances, and
Bright will automatically schedule as available, reducing the cost
to compute without the need to monitor spot prices and schedule manually.
Hardware Virtual Machine (HVM) – Bright Cluster Manager auto-
matically initializes all Amazon instance types, including Cluster
Compute and Cluster GPU instances that rely on HVM virtualization.
Amazon VPC – Bright supports Amazon VPC setups, which allow compute nodes in EC2 to be placed in an isolated network, there-
by separating them from the outside world. It is even possible to
route part of a local corporate IP network to a VPC subnet in EC2,
so that local nodes and nodes in EC2 can communicate easily.
Data-aware scheduling – Data-aware scheduling ensures that
input data is automatically transferred to the cloud and made
accessible just prior to the job starting, and that the output data
is automatically transferred back. There is no need to wait for the
data to load prior to submitting jobs (delaying entry into the job
queue), nor any risk of starting the job before the data transfer is
complete (crashing the job). Users submit their jobs, and Bright’s
data-aware scheduling does the rest.
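What data-aware scheduling automates can be pictured as a three-step sequence: stage the input data, start the job only once the data is in place, and transfer the results back afterwards. The helper functions, paths and submit command below are placeholders; Bright performs these steps for the user automatically.

# Conceptual sketch of the sequence a data-aware scheduler enforces.

import subprocess

def stage(src, dst):
    # e.g. rsync over SSH; any transfer tool could be substituted here
    subprocess.run(["rsync", "-a", src, dst], check=True)

def run_data_aware_job(input_dir, cloud_scratch, result_dir, submit_cmd):
    stage(input_dir, cloud_scratch)                  # 1. data in place first
    subprocess.run(submit_cmd, check=True)           # 2. job starts only now
    stage(cloud_scratch + "/results/", result_dir)   # 3. results come back

# run_data_aware_job("/data/case42/", "cloudnode:/scratch/case42",
#                    "/data/case42/results/", ["qsub", "job.sh"])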
Bright Cluster Manager provides the choice:
“To Cloud, or Not to Cloud”
Not all workloads are suitable for the cloud. The ideal situation
for most organizations is to have the ability to choose between
onsite clusters for jobs that require low latency communication,
complex I/O or sensitive data; and cloud clusters for many other
types of jobs.
Bright Cluster Manager delivers the best of both worlds: a pow-
erful management solution for local and cloud clusters, with
the ability to easily extend local clusters into the cloud without
compromising the provisioning, monitoring or management of the cloud nodes.
One or more cloud nodes can be configured under the Cloud Settings tab.
While all HPC systems face challenges in workload demand, workflow constraints,
resource complexity, and system scale, enterprise HPC systems face more stringent
challenges and expectations.
Enterprise HPC systems must meet mission-critical and priority HPC workload demands for commercial businesses, business-oriented research, and academic institutions.
Solutions to Enterprise
High-Performance Computing Challenges
The Enterprise has requirements for dynamic scheduling, provisioning and management of multi-step/multi-application services across HPC, cloud, and big data environments. Its workloads directly impact revenue, product delivery, and organizational objectives.
Enterprise HPC systems must process intensive simulation and
data analysis more rapidly, accurately and cost-effectively to
accelerate insights. By improving workflow and eliminating job
delays and failures, the business can more efficiently leverage
data to make data-driven decisions and achieve a competitive
advantage. The Enterprise is also seeking to improve resource
utilization and manage efficiency across multiple heterogene-
ous systems. User productivity must increase to speed the time
to discovery, making it easier to access and use HPC resources
and expand to other resources as workloads demand.
Enterprise-Ready HPC Workload Management
Moab HPC Suite – Enterprise Edition accelerates insights by uni-
fying data center resources, optimizing the analysis process,
and guaranteeing services to the business. Moab 8.1 for HPC
systems and HPC cloud continually meets enterprise priorities
through increased productivity, automated workload uptime,
and consistent SLAs. It uses the battle-tested and patented Moab
intelligence engine to automatically balance the complex, mis-
sion-critical workload priorities of enterprise HPC systems. En-
terprise customers benefit from a single integrated product that
brings together key Enterprise HPC capabilities, implementation
services, and 24/7 support.
It is imperative to automate workload workflows to speed time
to discovery. The following new use cases play a key role in im-
proving overall system performance:
Moab HPC Suite – Enterprise Edition
Moab is the “brain” of an HPC system, intelligently optimizing workload
throughput while balancing service levels and priorities.
Intelligent HPC Workload Management Across
Infrastructure and Organizational Complexity
Moab HPC Suite – Enterprise Edition acts as the “brain” of a high
performance computing (HPC) system to accelerate and auto-
mate complex decision making processes. The patented Moab in-
telligence engine provides multi-dimensional policies that mimic
real-world decision making.
These multi-dimensional, real-world policies are essential in scheduling workload to optimize job speed, job success and resource utilization. Moab gathers decision-making data from and automates actions through your system's existing mix of resource managers.
This enables all the dimensions of real-time granular resource
attributes, resource state, as well as current and future resource
commitments to be modeled against workload/application re-
quirements. All of these dimensions are then factored into the
most efficient scheduling decisions. Moab also dramatically sim-
plifies the management tasks and processes across these com-
plex, heterogeneous environments. Moab works with many of
the major resource management and industry-standard resource monitoring tools, and covers mixed hardware, network, storage, and software resources.
Moab HPC Suite – Enterprise Edition multi-dimensional policies
are also able to factor in organizational priorities and complex-
ities when scheduling workload. Moab ensures workload is
processed according to organizational priorities and commit-
ments and that resources are shared fairly across users, groups
and even multiple organizations. This enables organizations to manage these organizational complexities with simple policy-based settings.
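A multi-dimensional, policy-based priority can be pictured as a weighted sum over several factors. The factors, weights and job records below are invented for illustration; Moab's real policy engine is considerably richer.

# Toy illustration of multi-dimensional, policy-based prioritization:
# several weighted factors are combined into one priority score.

WEIGHTS = {"qos": 1000, "fairshare": 200, "queue_hours": 10, "resource_fit": 50}

def priority(job):
    return sum(WEIGHTS[k] * job.get(k, 0.0) for k in WEIGHTS)

jobs = [
    {"name": "crash-sim", "qos": 2, "fairshare": -0.3,
     "queue_hours": 4, "resource_fit": 0.9},
    {"name": "post-proc", "qos": 1, "fairshare": 0.5,
     "queue_hours": 12, "resource_fit": 0.4},
]

for job in sorted(jobs, key=priority, reverse=True):
    print(f"{job['name']}: priority {priority(job):.0f}")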
Moab HPC Suite – Enterprise Edition is architected to integrate
on top of your existing resource managers and other types of
management tools in your environment to provide policy-based
scheduling and management of workloads. It makes the optimal
complex decisions based on all of the data it integrates from the
various resource managers and then orchestrates the job and
management actions through those resource managers.
Moab’s flexible management integration makes it the ideal
choice to integrate with existing or new systems as well as to
manage your high performance computing system as it grows
and expands in the future.
Elastic Computing – unifies data center resources by assisting the business in managing resource expansion through bursting to private/public clouds and other data center resources, utilizing OpenStack as a common platform (a generic OpenStack sketch follows after this list)
Performance and Scale – optimizes the analysis process by
dramatically increasing TORQUE and Moab throughput and
scalability across the board
Tighter cooperation between Moab and TORQUE – harmonizing these key components to reduce overhead and improve communication between the HPC scheduler and the resource manager
More Parallelization – increasing the decoupling of Moab's and TORQUE's network communication so that Moab is less dependent on TORQUE's responsiveness, resulting in a 2x speed improvement and cutting the duration of the average scheduling iteration in half
Accounting Improvements – allowing toggling between
multiple modes of accounting with varying levels of en-
forcement, providing greater flexibility beyond strict allo-
cation options. Additionally, new up-to-date accounting
balances provide real-time insights into usage tracking.
Viewpoint Admin Portal – guarantees services to the busi-
ness by allowing admins to simplify administrative reporting,
workload status tracking, and job resource viewing
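As a generic sketch of bursting to an OpenStack-based resource pool, the following uses the openstacksdk library to start an additional node. The cloud name, image, flavor and network IDs are placeholders, and this is not Moab's Elastic Computing implementation; it only illustrates the underlying idea of policy-driven expansion.

# Generic sketch of adding capacity through OpenStack with openstacksdk.
# All identifiers are placeholders.

import openstack

def add_burst_node(name, image_id, flavor_id, network_id, cloud="mycloud"):
    conn = openstack.connect(cloud=cloud)          # reads clouds.yaml
    server = conn.compute.create_server(
        name=name, image_id=image_id, flavor_id=flavor_id,
        networks=[{"uuid": network_id}],
    )
    return conn.compute.wait_for_server(server)

# add_burst_node("burst-001", "IMAGE-UUID", "FLAVOR-UUID", "NETWORK-UUID")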
Moab HPC Suite – Enterprise Edition accelerates insights by in-
creasing overall system, user and administrator productivity,
achieving more accurate results that are delivered faster and at
a lower cost from HPC resources. Moab provides the scalability,
90-99 percent utilization, and simple job submission required to
maximize productivity. This ultimately speeds the time to discov-
ery, aiding the enterprise in achieving a competitive advantage
due to its HPC system. Enterprise use cases and capabilities in-
clude the following:
Moab HPC Suite – Enterprise Edition
“With Moab HPC Suite, we can meet very demanding customers’
requirements as regards unified management of heterogeneous
cluster environments, grid management, and provide them with flex-
ible and powerful configuration and reporting options. Our custom-
ers value that highly.”
HPC Solution Architect
OpenStack integration to offer virtual and physical resource
provisioning for IaaS and PaaS
Performance boost to achieve a 3x improvement in overall optimization speed
Advanced workflow data staging to enable improved cluster
utilization, multiple transfer methods, and new transfer types
Advanced power management with clock frequency control and additional power state options to reduce energy costs
Workload-optimized allocation policies and provisioning to
get more results out of existing heterogeneous resources and
reduce costs, including topology-based allocation
Unify workload management across heterogeneous clusters
by managing them as one cluster, maximizing resource availa-
bility and administration efficiency
Optimized, intelligent scheduling packs workloads and back-
fills around priority jobs and reservations while balancing
SLAs to efficiently use all available resources
Optimized scheduling and management of accelerators, both
Intel Xeon Phi and GPGPUs, for jobs to maximize their utiliza-
tion and effectiveness
Simplified job submission and management with advanced
job arrays and templates
Showback or chargeback for pay-for-use so actual resource
usage is tracked with flexible chargeback rates and reporting
by user, department, cost center, or cluster
Multi-cluster grid capabilities manage and share workload
across multiple remote clusters to meet growing workload
demand or surges
Guarantee Services to the Business
Job and resource failures in enterprise HPC systems lead to de-
layed results, missed organizational opportunities, and missed
objectives. Moab HPC Suite – Enterprise Edition intelligently
automates workload and resource uptime in the HPC system to
ensure that workloads complete successfully and reliably, avoid
failures, and guarantee services are delivered to the business.
The Enterprise benefits from these features:
Intelligent resource placement prevents job failures with
granular resource modeling, meeting workload requirements
and avoiding at-risk resources
Auto-response to failures and events with configurable ac-
tions to pre-failure conditions, amber alerts, or other metrics
Workload-aware future maintenance scheduling that helps
maintain a stable HPC system without disrupting workload
Real-world expertise for fast time-to-value and system uptime with included implementation, training and 24/7 support resources
Auto SLA Enforcement
Moab HPC Suite – Enterprise Edition uses the powerful Moab in-
telligence engine to optimally schedule and dynamically adjust
workload to consistently meet service level agreements (SLAs),
guarantees, or business priorities. This automatically ensures
that the right workloads are completed at the optimal times,
taking into account the many competing departments, priorities, and SLAs to be balanced. Moab provides the following capabilities:
Usage accounting and budget enforcement that schedules resources and reports on usage in line with resource-sharing agreements and precise budgets (includes usage limits, usage reports, auto budget management, and dynamic fairshare policies) – a small accounting sketch follows after this list
SLA and priority policies that make sure the highest priority
workloads are processed first (includes Quality of Service and
hierarchical priority weighting)
Continuous plus future scheduling that ensures priorities and guarantees are proactively met as conditions and workloads change (e.g. future reservations, pre-emption)
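Showback or chargeback accounting boils down to multiplying recorded usage by agreed rates. The departments, rates and usage records in the following sketch are invented for illustration; Moab's accounting is of course far more complete.

# Small illustration of showback/chargeback style accounting: core-hours
# per job are multiplied by a per-department rate and aggregated.

RATES = {"engineering": 0.05, "research": 0.02}   # currency per core-hour

usage = [
    {"dept": "engineering", "cores": 128, "hours": 6.5},
    {"dept": "research",    "cores": 512, "hours": 1.0},
    {"dept": "engineering", "cores": 32,  "hours": 24.0},
]

def chargeback(usage, rates):
    bill = {}
    for rec in usage:
        cost = rec["cores"] * rec["hours"] * rates[rec["dept"]]
        bill[rec["dept"]] = bill.get(rec["dept"], 0.0) + cost
    return bill

for dept, cost in chargeback(usage, RATES).items():
    print(f"{dept}: {cost:.2f}")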
Moab HPC Suite – Grid Option
Grid- and Cloud-Ready Workload Management
The benefits of a traditional HPC environment can be extended
to more efficiently manage and meet workload demand with
the multi-cluster grid and HPC cloud management capabilities in
Moab HPC Suite – Enterprise Edition. Key capabilities include:
Self-service job submission portal enables users to easily sub-
mit and track HPC cloud jobs at any time from any location
with little training required
Pay-for-use show back or chargeback so actual resource usage
is tracked with flexible chargeback rates reporting by user, de-
partment, cost center, or cluster
On-demand workload-optimized node provisioning dynami-
cally provisions nodes based on policies, to better meet workload demands
Manage and share workload across multiple remote clusters
to meet growing workload demand or surges by adding on the
Moab HPC Suite – Grid Option.
Choose the Moab HPC Suite That’s Right for You
This following chart provides a high-level comparison of the dif-
ference in key use cases and capabilities that are enabled be-
tween the different editions of Moab HPC Suite to intelligently
manage and optimize HPC workload efficiency. Customers can
use this chart to help determine which edition will best address
their HPC workload management challenges, maximize their
workload and resource productivity, and help them meet their
service level performance goals. To explore any use cases or ca-
pability compared in the chart further, see the HPC Workload
Management Use Cases.
Basic Edition is best for static, non-dynamic basic clusters of all sizes that need to optimize workload throughput and utilization while balancing priorities
Enterprise Edition is best for large clusters with dynamic, mission-critical workloads and complex service levels and priorities to balance. It is also for organizations that want to integrate, or that require, HPC cloud capabilities for their HPC cluster systems. Enterprise Edition provides additional use cases these organizations need, such as:
Accounting for usage of cluster resources for usage budgeting and pay-for-use showback or chargeback
Workload-optimized node provisioning
Heterogeneous, multi-cluster workload management
Workload-aware power management
Application Portal Edition is suited for customers who need an
application-centric job submission experience to simplify and
extend high performance computing to more users without re-
quiring specialized HPC training. It provides all the benefits of
the Enterprise Edition plus meeting these requirements with
a technical computing application portal that supports most
common applications in manufacturing, oil and gas, life sciences,
academic, government and other industries.
Remote Visualization Edition is suited for customers with many users, internal or external to their organization, of 3D/CAE/CAD and other visualization applications who want
to reduce the hardware, networking and management costs
of running their visualization workloads. This edition also
provides the added benefits of improved workforce produc-
tivity and collaboration while optimizing the utilization and
guaranteeing shared resource access for users moving from
technical workstations to more efficient shared resources in
a technical compute cloud. It provides all the benefits of the Enterprise Edition plus these additional capabilities and benefits.
transtec HPC solutions are designed for maximum flexibility
and ease of management. We not only offer our customers
the most powerful and flexible cluster management solution
out there, but also provide them with customized setup and
site-specific configuration. Whether a customer needs a dynamic Linux-Windows dual-boot solution, unified management of different clusters at different sites, or the fine-tuning of the Moab scheduler for implementing fine-grained policy configuration – transtec not only gives you the framework at hand,
but also helps you adapt the system according to your special
needs. Needless to say, when customers are in need of special training, transtec will be there to provide customers, administrators, or users with specially adjusted Educational Services.
Many years of experience in High Performance Computing have enabled us to develop efficient concepts for installing and deploying HPC clusters. For this, we leverage well-proven 3rd-party tools, which we assemble into a total solution and adapt and configure according to the customer's requirements.
We manage everything from the complete installation and configuration of the operating system and necessary middleware components like job and cluster management systems up to the customer's applications.
Moab HPC Suite Use Case | Basic | Enterprise | Application Portal | Remote Visualization
Massive multifaceted scalability | x | x | x | x
Workload-optimized resource allocation | x | x | x | x
Optimized, intelligent scheduling of workload and resources | x | x | x | x
Optimized accelerator scheduling management (Intel MIC / GPGPU) | x | x | x | x
Simplified user job submission and management – Application Portal | – | x | x | x
Administrator dashboard and management | x | x | x | x
Unified workload management across heterogeneous clusters | Single cluster only | x | x | x
Workload-optimized node provisioning | – | x | x | x
Visualization workload optimization | – | – | – | x
Workload-aware auto-power management | – | x | x | x
Intelligent resource placement for failure prevention | x | x | x | x
Workload-aware maintenance scheduling | x | x | x | x
Basic deployment, training and premium (24x7) support service | – | x | x | x
Auto response to failures and events | x | x | x | x
Auto SLA enforcement:
Resource sharing and usage policies | x | x | x | x
SLA and priority policies | x | x | x | x
Continuous and future reservations scheduling | x | x | x | x
Usage accounting and budget enforcement | – | x | x | x
Grid- and Cloud-Ready:
Manage and share workload across multiple clusters in a wide area grid | available for all editions via the Moab HPC Suite – Grid Option
User self-service: job request management | x | x | x | x
Pay-for-use showback and chargeback | – | x | x | x
Workload-optimized node provisioning | – | x | x | x
Moab HPC Suite – Application Portal Edition
Moab HPC Suite – Application Portal Edition provides easy single-point access to common technical applications, data and HPC resources, enabling more of your internal staff and partners to collaborate from anywhere using HPC applications and resources without any specialized HPC training, while your intellectual property stays secure.
The optimization policies Moab provides significantly reduce the cost and complexity of keeping up with compute demand by optimizing resource sharing, utilization and throughput while still balancing different service levels and priorities.
Remove the Barriers to Harnessing HPC
Organizations of all sizes and types are looking to harness the
potential of high performance computing (HPC) to enable and
accelerate more competitive product design and research. To do
this, you must find a way to extend the power of HPC resources to
designers, researchers and engineers not familiar with or trained
on using the specialized HPC technologies. Simplifying the user
access to and interaction with applications running on more
efficient HPC clusters removes the barriers to discovery and in-
novation for all types of departments and organizations. Moab
HPC Suite – Application Portal Edition provides easy single-point
access to common technical applications, data, HPC resources,
and job submissions to accelerate the research, analysis and collaboration process for end users while reducing computing costs.
HPC Ease of Use Drives Productivity
Moab HPC Suite – Application Portal Edition improves ease of
use and productivity for end users using HPC resources with sin-
gle-point access to their common technical applications, data,
resources, and job submissions across the design and research
process. It simplifies the specification and running of jobs on HPC
resources to a few simple clicks on an application-specific template, with key capabilities including: