11/16/2014
Copyright @IITBHU_CSE
Jagjit Singh(12100EN003)
Vivek Garg(12100EN009)
Shivam Anand(12100EN012)
Kshitij Singh(12100EN061)
Introduction
Supercomputer architectures first appeared in the 1960s, and many changes have been made since then.
Early supercomputer architectures
pioneered by Seymour Cray relied on
compact innovative designs and local
parallelism to achieve superior
computational performance. However, in
time the demand for increased
computational power ushered in the age
of massively parallel systems.
While the supercomputers of the 1970s
used only a few processors, in the 1990s,
machines with many thousands of
processors began to appear and as the
20th century came to an end, massively
parallel supercomputers with tens of
thousands of "off-the-shelf" processors
were the norm. In the 21st century,
supercomputers can use as many as
100,000 processors connected by very fast
connections.
Systems with a massive number of processors generally take one of two paths. In one approach, e.g. grid computing, the processing power of a large number of computers in distributed, diverse administrative domains is opportunistically used whenever a computer is available. The other approach is to use many processors in close proximity to each other, e.g. in a computer cluster. In such a centralized massively parallel system the speed and flexibility of the interconnect become very important, and modern supercomputers have used approaches ranging from enhanced Infiniband systems to three-dimensional torus interconnects.
As the price/performance of general
purpose graphic processors (GPGPUs) has
improved, many petaflop supercomputers
such as Tianhe-I and Nebulae have started
to depend on them. However, other systems such as the K computer continue to use conventional processors such as SPARC-based designs, and the overall applicability of GPGPUs in general-purpose high performance computing applications has been the subject of debate: while a GPGPU may be tuned to perform well on specific benchmarks, its applicability to everyday algorithms may be limited unless significant effort is spent to tune the application towards it. Nevertheless, GPUs are gaining ground, and in 2012 the Jaguar supercomputer was transformed into Titan by replacing CPUs with GPUs.
As the number of independent processors
in a supercomputer increases, the method
by which they access data in the file
system and how they share and
access secondary storage resources
becomes important. Over the years a number of systems for distributed file management have been developed, e.g. the IBM General Parallel File System, FhGFS, the Parallel Virtual File System, Hadoop, etc. A number of supercomputers on the TOP100 list, such as the Tianhe-I, use Linux's Lustre file system.
Background
The CDC 6600 series of computers were
very early attempts at supercomputing
and gained their advantage over the
existing systems by relegating work
to peripheral devices, freeing the CPU
(Central Processing Unit) to process
valuable data. With the
Minnesota FORTRAN compiler the 6600
could sustain 500 kiloflops on standard
mathematical operations.
Other early supercomputers like the Cray
1 and Cray 2 that appeared afterwards
used a small number of fast processors
that worked in harmony and were
uniformly connected to the largest
amount of shared memory that could be
managed at the time.
Parallel processing at the processor level was introduced by these early
architectures, with innovations such
as vector processing, in which the
processor can perform several operations
during one clock cycle, rather than having
to wait for successive cycles.
In time, as the number of processors
increased, different issues regarding
architecture emerged. Two issues that
need to be addressed as the number of
processors increases are the distribution
of processing and memory. In the
distributed memory approach, each
processor is packaged physically close
with some local memory. The memory
that is associated with other processors is
then "further away" based on
bandwidth and latency parameters in non-
uniform memory access.
Pipelining was an innovation of the 1960s, and by the 1970s the use of vector processors was well established. By the 1980s, many supercomputers used parallel vector processors, and parallel vector processing had firmly gained ground by 1990.
In early systems the relatively small number of processors allowed them to easily use a shared memory architecture, which lets processors access a common pool of memory. Initially a common approach was uniform memory access (UMA), in which the access time to a memory location was similar for all processors. The use of non-uniform memory access (NUMA) allowed a processor to access its own local memory faster than other memory locations, whereas cache-only memory architectures (COMA) allowed the local memory of each processor to be used as a cache, thus requiring coordination as memory values changed.
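As a concrete, hedged illustration of the NUMA idea on a modern machine, the sketch below uses the Linux libnuma API to place one buffer on a processor's local node and, when a second node exists, another buffer on a remote node; the node numbers and buffer size are illustrative only.

```c
/* Minimal NUMA placement sketch (illustrative). Build with: gcc numa_demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    size_t bytes = 64 * 1024 * 1024;
    /* Place one buffer on node 0 (local to the CPUs of node 0) and, if a
     * second node exists, another buffer on node 1 (remote for those CPUs). */
    double *local  = numa_alloc_onnode(bytes, 0);
    double *remote = numa_num_configured_nodes() > 1
                   ? numa_alloc_onnode(bytes, 1) : NULL;
    if (!local) return 1;
    /* Touch the pages so they are actually placed on the requested nodes. */
    for (size_t i = 0; i < bytes / sizeof(double); i++) local[i] = 1.0;
    if (remote)
        for (size_t i = 0; i < bytes / sizeof(double); i++) remote[i] = 1.0;
    printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());
    numa_free(local, bytes);
    if (remote) numa_free(remote, bytes);
    return 0;
}
```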
As the number of processors increases,
efficient inter-processor
communication and synchronization on a
supercomputer becomes a challenge.
Many different approaches may be used
to achieve this goal. For example, in the early 1980s the Cray X-MP system used shared registers. In this approach, all processors could access shared registers that did not move data back and forth but were used only for inter-processor synchronization and communication. However, inherent
challenges in managing a large amount of
shared memory among many processors
resulted in a move to more distributed architectures.
PROCESSOR   YEAR  CLOCK (MHz)  REGISTERS  ELEMENTS PER REGISTER  FUNCTIONAL UNITS
CRAY-1      1976   80          8          64                     6
CRAY X-MP   1983  120          8          64                     8
CRAY Y-MP   1988  166          8          64                     8
NEC SX/2    1984  160          8 + 8192   256 variable           16
CRAY C-90   1991  240          8          128                    8
NEC SX/4    1995  400          8 + 8192   256 variable           16
CRAY J-90   1995  100          8          64                     8
CRAY T-90   1996  500          8          128                    8
NEC SX/5    1999
Approaches to supercomputing
Distributed supercomputing
Opportunistic Supercomputing is a form
of networked grid computing whereby a
“super virtual computer” of many loosely
coupled volunteer computing machines
performs very large computational tasks.
Grid computing has been applied to a
number of large-scale embarrassingly
parallel problems that require
supercomputing scale of performance.
However, basic grid and cloud
computing approaches that rely
on volunteer computing cannot handle
traditional supercomputing tasks such as
fluid dynamic simulations.
The fastest grid computing system is the distributed computing project Folding@home, which reported 43.1 petaflops of x86 processing power as of June 2014. Of this, 42.5 petaflops are contributed by clients running on various GPUs, and the rest from various CPU systems.
The BOINC platform hosts a number of
distributed computing projects. By May
2011, BOINC recorded a processing power
of as much as 5.5 petaflops through over
480,000 active computers on the
network. The most active project
(measured by computational power)
reports processing power of over
700 teraflops through as many as 33,000 active computers.
As of May 2011, GIMPS's distributed Mersenne prime search achieved about 60 teraflops through over 25,000 registered computers. The Internet PrimeNet Server supports GIMPS's grid computing approach, one of the earliest and most successful grid computing projects, running since 1997.
Quasi-opportunistic approaches
Quasi-opportunistic supercomputing is a
form of distributed computing whereby
the “super virtual computer” of a large
number of networked, geographically dispersed computers performs computing
tasks that demand huge processing
power. Quasi-opportunistic
supercomputing aims to provide a higher
quality of service than opportunistic grid
computing by achieving more control over
the assignment of tasks to distributed
resources and the use of intelligence
about the availability and reliability of
individual systems within the
supercomputing network. Quasi-opportunistic distributed execution of demanding parallel computing software in grids is achieved through the implementation of grid-wise allocation agreements, co-allocation subsystems, communication topology-aware allocation mechanisms, fault-tolerant message passing libraries and data pre-conditioning.
Massive, centralized parallelism
During the 1980s, as the computing power
demand increased, the trend to a much
larger number of processors began,
bringing in the age of massively
parallel systems, with distributed
memory and file systems, given that shared memory architectures could not scale to a large number of processors. Hybrid approaches such
as distributed shared memory also
appeared after the early systems.
The computer clustering approach
connects a number of readily available
computing nodes (e.g. personal
computers used as servers) via a fast,
private local area network. The activities
of the computing nodes are orchestrated
by "clustering middleware" which is a
software layer that sits atop the nodes
and allows the users to treat the cluster as, by and large, one cohesive computing unit,
for example via a single system
image concept.
Computer clustering relies on a
centralized management approach which
makes the nodes available as
orchestrated shared servers. It is different
from other approaches such as peer to
peer or grid computing which also use a
large number of nodes, but with a far
more distributed nature. By the 21st century, the TOP500 organization's semiannual list of the 500 fastest supercomputers often included many clusters, such as the world's fastest in 2011, the K computer, which has a distributed memory, cluster architecture.
When a large number of local semi-
independent computing nodes are used
(e.g. in a cluster architecture) the speed
and flexibility of the interconnect
becomes very important. Modern
supercomputers have taken various
approaches to resolve this issue,
e.g. Tianhe-1 uses a proprietary high-
speed network based on
the Infiniband QDR, enhanced
with FeiTeng-1000 CPUs. On the other
hand, the Blue Gene/L system uses a
three-dimensional torus interconnect with
auxiliary networks for global
communications. In this approach each
node is connected to its six nearest
neighbours. Likewise a torus was used by
the Cray T3E.
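To make the torus idea concrete, the hedged C sketch below computes the six nearest neighbours of a node at coordinates (x, y, z) in an X by Y by Z torus; the wrap-around at the edges is what distinguishes a torus from a plain three-dimensional mesh. The dimensions and node numbering are illustrative, not those of Blue Gene/L or the Cray T3E.

```c
#include <stdio.h>

/* Linearize 3-D torus coordinates into a node rank. */
static int rank_of(int x, int y, int z, int X, int Y, int Z) {
    return (x * Y + y) * Z + z;
}

/* Print the six nearest neighbours of node (x, y, z); the modulo arithmetic
 * provides the wrap-around links that make the mesh a torus. */
static void torus_neighbours(int x, int y, int z, int X, int Y, int Z) {
    printf("+x: %d  -x: %d\n", rank_of((x + 1) % X, y, z, X, Y, Z),
                               rank_of((x - 1 + X) % X, y, z, X, Y, Z));
    printf("+y: %d  -y: %d\n", rank_of(x, (y + 1) % Y, z, X, Y, Z),
                               rank_of(x, (y - 1 + Y) % Y, z, X, Y, Z));
    printf("+z: %d  -z: %d\n", rank_of(x, y, (z + 1) % Z, X, Y, Z),
                               rank_of(x, y, (z - 1 + Z) % Z, X, Y, Z));
}

int main(void) {
    torus_neighbours(0, 0, 0, 8, 8, 8);   /* a corner node wraps to the far faces */
    return 0;
}
```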
Massive centralized systems at times use
special-purpose processors designed for a
specialised application, and may use field-
programmable gate arrays (FPGA) chips to
gain performance by sacrificing generality.
Examples of such special-purpose supercomputers include Belle, Deep Blue, and Hydra for playing chess, MDGRAPE-3 for protein structure computation via molecular dynamics, and Deep Crack for breaking the DES cipher.
Massive distributed parallelism
Grid computing uses a large number of
computers in diverse, distributed
administrative domains which makes it an
opportunistic approach which uses
resources whenever they are available. An
example is BOINC, a volunteer-based, opportunistic grid system. Some BOINC applications have
reached multi-petaflop levels by using
close to half a million computers
connected on the web, whenever
volunteer resources become
available. However, these types of results
often do not appear in the TOP500 ratings
because they do not run the general
purpose Linpack benchmark.
Although grid computing has had success in parallel task execution, demanding
supercomputer applications such
as weather simulations or computational
fluid dynamics have not been successful,
partly due to the barriers in reliable sub-
assignment of a large number of tasks as
well as the reliable availability of
resources at a given time.
In quasi-opportunistic supercomputing a large number of geographically dispersed computers are orchestrated with built-in safeguards. The quasi-opportunistic approach goes beyond volunteer computing on highly distributed systems such as BOINC, or general grid computing on a system such as Globus, by
allowing the middleware to provide
almost seamless access to many
computing clusters so that existing
programs in languages such
as Fortran or C can be distributed among
multiple computing resources.
Quasi-opportunistic supercomputing aims
to provide a higher quality of service
than opportunistic resource sharing. The
quasi-opportunistic approach enables the
execution of demanding applications
within computer grids by establishing grid-wise resource allocation agreements and fault-tolerant message passing that abstractly shields against failures of the underlying resources, thus maintaining some opportunism while allowing a higher level of control.
Vector processing principles
An ordered set of scalar data items is known as a vector; all the data items are of the same type and are stored in memory. Generally the vector elements are ordered with a fixed addressing increment between successive elements, called the stride.
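For example, the address of element i of a vector with base address base and stride s (counted in elements) is base + i*s. The hedged C sketch below walks one column of a row-major matrix, where the stride equals the row length; the function name and layout are illustrative only.

```c
/* Sum column j of an n x m row-major matrix: consecutive vector elements
 * are m doubles apart, i.e. the access has base a + j and stride m. */
double column_sum(const double *a, int n, int m, int j) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i * m + j];   /* element i of the column vector */
    return sum;
}
```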
A vector processor includes processing elements, vector registers, register counters and functional pipelines to perform vector operations. Vector processing involves arithmetic or logical operations applied to vectors, whereas scalar processing operates on one datum at a time. The conversion from scalar code to vector code is called vectorization.
Vector processors are special-purpose computers that match a range of scientific computing tasks. These tasks usually involve large active data sets, often with poor locality, and long run times. In addition, vector processors provide vector instructions. These instructions operate in a pipeline, sequentially on all elements of the vector registers. Some properties of vector instructions are:
 Since the calculation of every
result is independent of the
calculation of previous results it
allows a very deep pipeline without data hazards.
 A vector instruction specifies a large amount of work, since it is equivalent to executing an entire loop. Hence, the instruction bandwidth requirement is decreased.
 Vector instructions that require
memory have a predefined access
pattern that can easily be
predicted. If the vector elements
are all near each other, then
obtaining the vector from a set of
heavily interleaved memory banks
works extremely well. Because a
single access is initiated for the
entire vector rather than to a
single word, the high latency of
starting a main memory access versus accessing a cache is amortized. Thus, the cost of the
latency to main memory is seen
only once for the entire vector,
rather than once for each word of
the vector.
 Control hazards are no longer
present since an entire loop is
replaced by a vector instruction
whose behaviour is determined
beforehand.
Typical vector operations include (integer and floating point):
 Add two vectors to produce a
third.
 Subtract two vectors to produce a
third
 Multiply two vectors to produce a
third
 Divide two vectors to produce a
third
 Load a vector from memory
 Store a vector to memory.
These instructions could be augmented to
do typical array operations:
 Inner product of two vectors
(multiply and accumulate sums)
 Outer product of two vectors
(produce an array from vectors)
 Product of (small) arrays (this
would match the programming
language APL which uses vectors
and arrays as primitive data
elements.)
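In scalar code, each of the operations above corresponds to an entire loop. The hedged C sketch below shows what a single vector-add instruction and a single inner-product (multiply-and-accumulate) instruction are equivalent to, assuming an illustrative 64-element vector length:

```c
#define VLEN 64   /* illustrative vector register length */

/* c = a + b : the work performed by one vector-add instruction. */
void vadd(double c[VLEN], const double a[VLEN], const double b[VLEN]) {
    for (int i = 0; i < VLEN; i++)
        c[i] = a[i] + b[i];
}

/* Inner product of a and b : the work of one multiply-and-accumulate
 * (vector reduction) instruction. */
double vdot(const double a[VLEN], const double b[VLEN]) {
    double sum = 0.0;
    for (int i = 0; i < VLEN; i++)
        sum += a[i] * b[i];
    return sum;
}
```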
Hence vector processing is faster and
much more efficient than scalar
processing. Both SIMD computers and
pipelined processors can perform vector
operations. Vector processing can generate one result per clock cycle by exploiting pipelining and segmentation. It also reduces memory access conflicts and software overhead.
Depending on the vectorization ratio in user programs and the speed ratio between vector and scalar operations, a vector processor can achieve a manifold speedup, of up to 10 to 20 times compared with conventional machines.
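A standard way to estimate this speedup (a sketch, not taken from the report) is: if f is the fraction of the work that is vectorized and r is the speed ratio of vector to scalar execution, then

$$ S(f, r) = \frac{1}{(1 - f) + f/r} $$

For instance, f = 0.95 and r = 20 give S of roughly 10, so the 10 to 20 times figure quoted above requires f to be very close to 1.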
Vector instruction types
Six types of vector instructions are:
 Vector-vector instructions
One or two vector operands are fetched from their vector registers, pass through a functional pipeline unit, and produce results in another vector register.
 Vector scalar instructions
 Vector memory instructions
 Vector reduction instructions
 Gather and scatter instructions
These use two vector registers to gather or scatter vector elements randomly throughout memory. 'Gather' fetches the non-zero elements of a sparse vector from memory using an index vector; 'scatter' does the opposite, storing a dense vector back into memory at the indexed non-zero positions of the sparse vector (see the C sketch after this list).
 Masking instructions
These instructions use a mask
vector to expand or to compress a
vector to an index vector that is
either longer or shorter.
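A hedged C sketch of the gather and scatter operations described above; the index vector idx records which positions of the sparse vector x participate, and v is the dense vector of gathered values. Names and types are illustrative.

```c
/* Gather: pull the indexed elements of a sparse vector x into a dense
 * vector v, using an index vector idx of length n. */
void gather(double *v, const double *x, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        v[i] = x[idx[i]];
}

/* Scatter: the inverse operation, storing the dense values back into the
 * sparse vector at the indexed positions. */
void scatter(double *x, const double *v, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        x[idx[i]] = v[i];
}
```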
Vector access memory schemes
Usually, multiple access paths pipeline the
flow of vector operands between the
main memory and vector registers.
 Vector operand specifications
Vector operands can be arbitrarily long, and vector elements are not necessarily stored in contiguous memory locations. To access a vector, its base address, stride, and length must be specified. Since every vector register has a fixed number of component registers, only a small part of a long vector can be loaded into a vector register in a fixed number of cycles (see the strip-mining sketch after this list).
 C-Access memory organisation.
 S-Access memory organisation.
 C/S-Access memory organisation.
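Because a vector register holds only a fixed number of elements, a long vector is processed in register-sized chunks, a technique usually called strip mining. A hedged C sketch of the idea, assuming an illustrative maximum vector length of 64 elements:

```c
#define MVL 64   /* illustrative maximum vector length of the hardware */

/* y = y + a*x over n elements, processed one register-sized strip at a time;
 * each inner loop corresponds to one vector operation's worth of work. */
void saxpy_strip_mined(double *y, const double *x, double a, int n) {
    for (int start = 0; start < n; start += MVL) {
        int len = (n - start < MVL) ? (n - start) : MVL;  /* final, short strip */
        for (int i = 0; i < len; i++)
            y[start + i] += a * x[start + i];
    }
}
```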
The effect of cache design on vector computers
Cache memories have proven to be very
successful in the case of general purpose
computers to boost system performance.
However, their use in vector processing
has not yet been fully established.
Generally, the existing supercomputer
vector processors do not have cache
memories because of the results drawn
from the following points:
 Generally the data sets of
numerical programs are too large
for the cache sizes provided by the
present technology. Sweep accesses of a large vector may end up completely reloading the cache before the processor can reuse the data.
 Sequential addressing, a crucial assumption in conventional caches, may not hold in vectorised numerical algorithms, which usually access data with a certain stride, i.e. the difference between the addresses of consecutive vector elements.
 Register files and highly
interleaved memories are usually
used to achieve a high memory
bandwidth required for vector
processing.
It is not clear whether cache memories can boost the performance of such systems. Although cache memories have the capability to boost the performance of future vector processors, numerous reasons counter the use of vector caches. A single miss in the vector cache results in a number of processor stall cycles equal to the entire memory access time, whereas the memory accesses of a vector processor without a cache are fully pipelined. In order to benefit from a vector cache, the miss ratio must be kept extremely small. In general, cache misses can be classified into these categories:
 Compulsory miss
 Capacity miss
 Conflict miss
Compulsory misses occur on the initial loading of data, which is easily pipelined in a vector computer. Capacity misses are due to the size restrictions of a cache in retaining data between references. If algorithms are blocked as mentioned, the capacity misses can be folded into the compulsory misses during the initial loading of every block of data, given that the block size is smaller than the cache. Finally, conflict misses play a deciding role in the vector processing environment. Conflicts occur when elements of the same vector map to the same cache line, or when lines from two different vectors compete for the same cache line. Since the conflict misses that reduce vector cache performance depend on the vector access stride, the size of an application problem can be adjusted to give a good access stride for a given machine. This approach burdens the programmer with knowing the architectural details of the machine, and it is infeasible for many applications.
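The interaction between stride and conflict misses can be made concrete with a hedged sketch: in a direct-mapped cache, the set an address falls into is (address / line_size) mod num_sets, so a stride that is a multiple of line_size times num_sets maps every element of a vector onto the same set. The cache parameters below are illustrative only.

```c
#include <stdio.h>

#define LINE_SIZE 64          /* bytes per cache line (illustrative)          */
#define NUM_SETS  512         /* sets in a direct-mapped cache (power of two) */

/* Set index a byte address maps to in a direct-mapped cache. */
static unsigned set_index(unsigned long addr) {
    return (addr / LINE_SIZE) % NUM_SETS;
}

int main(void) {
    unsigned long base = 0;
    /* Stride of 32768 bytes (a multiple of LINE_SIZE * NUM_SETS): every
     * element of the vector lands in set 0 and evicts its predecessor. */
    for (int i = 0; i < 8; i++)
        printf("element %d -> set %u\n", i, set_index(base + i * 32768UL));
    return 0;
}
```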
Ideas like prime-mapped cache schemes have been studied. This cache organization reduces the cache misses caused by cache line interference, which is critical in numerical applications, while the cache lookup time of the new mapping scheme stays the same as that of conventional caches. Generation of cache addresses for the prime-mapped cache can be done in parallel with normal address calculations, and it takes less time than the normal address calculation because of the special properties of the Mersenne prime. Thus, the new mapping scheme does not add any performance penalty in terms of cache access. With this mapping scheme, the cache memory can deliver a large performance boost, which will grow as the speed gap between processor and memory increases.
GPU based supercomputing
The demand for an increased Personal
Computer (PC) graphics subsystem
performance never ceases. The GPU is an
ancillary coprocessor subsystem,
connected to an internal high-speed bus
and memory-mapped into global memory
resources. Computer vision, gaming, and advanced graphics design applications have driven sharp performance boosts and increased the variety and algorithmic efficiency of the relevant graphics standards. All this is part of a
supplant dedicated workstations for a
host of compute intensive applications. At
a deeper level GPU evolution depends on
the assumption of a processing model
that can achieve the highest possible
performance for a wide variety of graphics
algorithms. This then drives all relevant
aspects of hardware architecture and
design. The most efficient GPU processing
model is Single Instruction Multiple Data
(SIMD). The SIMD model has been of great use in traditional vector processor/supercomputer designs (e.g. Cray X-MP, Convex C1, CDC Star-100), through its ability to boost datapath throughput by executing processing threads concurrently. The SIMD concept has
been employed in recent CPU
architectural advancements, like the IBM
Cell processor, x86 with MMX extensions,
SPARC VIS, Sun MAJC, ARM NEON etc.
The SIMD processing model adopted for the GPU can be used for general classes of scientific computation not specifically associated with graphics applications. This was the start of the General Purpose computing on GPU (GPGPU) movement and the basis for many examples of GPU-accelerated scientific processing.
GPGPU depends closely upon Application Programming Interface (API) access to GPU processing resources; the GPU API abstracts much of the complexity associated with manipulating hardware resources and provides convenient access to I/O, memory management, and thread management functionality in the form of generic programming function calls (e.g. from C, C++, Python, Java). Thus, GPU hardware is
virtualized as a standard programming
resource, facilitating uninhibited
application development incorporating
GPU acceleration. APIs that are currently
in use include NVIDIA’s Compute Unified
Device Architecture (CUDA) and ATI’s
Data Parallel Virtual Machine (DPVM).
GPU Architecture
A SIMD GPU is organized as a collection of 'N' distinct multiprocessors, each consisting of 'M' distinct thread processors. Each multiprocessor operates on an ensemble of threads managed and scheduled as a single entity (a 'warp'), so SIMD instruction fetch and execution, shared-memory access, and cache operations are completely synchronized. Memory is usually organized hierarchically, with global/device memory transactions mediated by high-speed bus transactions (e.g. PCIe, HyperTransport).
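A rough C emulation of the execution model just described (an illustrative sketch, not vendor code): each of the M thread processors in a multiprocessor applies the same instruction to a different data element, so one warp step behaves like a short lock-step loop.

```c
#define M 32   /* thread processors per multiprocessor (illustrative) */

/* One SIMD step of a warp: every thread executes the same instruction
 * (here an element-wise add) on its own data element. */
void warp_step(float *out, const float *a, const float *b, int base) {
    for (int tid = 0; tid < M; tid++)
        out[base + tid] = a[base + tid] + b[base + tid];
}
```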
A feature of the CPU/GPU processing architecture is that GPU processing is essentially non-blocking: the CPU may continue processing as soon as a work unit has been written to the GPU transaction buffer. GPU work-unit assembly/disassembly and I/O at the GPU transaction buffer can to a large extent be hidden, in which case GPU performance effectively dominates the performance of the entire system. Optimal GPU processing gain is achieved at an I/O constraint boundary whereby thread processors never stall due to lack of data.
The maximum achievable speedup is
governed by Amdahl’s Law: any
acceleration (‘A’) due to thread
parallelization will critically depend upon:
 The fraction of code that can be parallelized ('P')
 The degree of parallelization (‘N’),
and
 Any overhead associated with
parallelization
This gives a theoretical maximum acceleration for the application. CPU code pipelining (i.e. overlap with GPU processing) must also be factored into any calculation of 'P'; pipelining effectively parallelizes CPU and GPU code segments, reducing the non-parallelized fraction (1 - P) and thereby increasing the effective P. Hence, well-motivated software architecture design can take advantage of this effect, greatly increasing the acceleration potential of the complete application.
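A small hedged helper that makes this concrete: with parallel fraction P, degree of parallelization N, and an overhead term expressed as an extra serial fraction of the original run time (for example work-unit assembly and transaction-buffer I/O), the achievable acceleration is bounded as follows. The parameter values are illustrative.

```c
#include <stdio.h>

/* Amdahl-style bound on acceleration: P is the parallelizable fraction,
 * N the degree of parallelization, and overhead an extra serial fraction. */
static double amdahl(double P, double N, double overhead) {
    return 1.0 / ((1.0 - P) + P / N + overhead);
}

int main(void) {
    printf("P=0.90, N=240, no overhead : %.2fx\n", amdahl(0.90, 240, 0.0));
    printf("P=0.99, N=240, no overhead : %.2fx\n", amdahl(0.99, 240, 0.0));
    printf("P=0.99, N=240, 1%% overhead: %.2fx\n", amdahl(0.99, 240, 0.01));
    return 0;
}
```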
21st-century architectural trends
The air cooled IBM Blue
Gene supercomputer architecture trades
processor speed for low power
consumption so that a larger number of
processors can be used at room
temperature, by using normal air-
conditioning. The second-generation Blue Gene/P system is distinguished by the fact that each chip can act as a 4-way symmetric multiprocessor and also includes the logic for node-to-node communication. At 371 MFLOPS/W, the system is very energy efficient.
The K computer uses a water-cooling system, homogeneous processors and a distributed memory system with a cluster architecture. It uses more than 80,000 SPARC-based processors, each with eight cores, for a total of over 700,000 cores, almost twice as many as any other system, housed in more than 800 cabinets, each with 96 computing nodes (each with 16 GB of memory) and 6 I/O nodes. Although it is more powerful than the next five systems on the TOP500 list combined, at 824.56 MFLOPS/W it has the lowest power-to-performance ratio of any current major supercomputer system. The follow-up system, called the PRIMEHPC FX10, uses the same six-dimensional torus interconnect, but only one SPARC processor per node.
Unlike the K computer, the Tianhe-1A system uses a hybrid architecture that integrates CPUs and GPUs. It uses more than 14,000 Xeon general-purpose processors and more than 7,000 Nvidia Tesla graphics processors on about 3,500 blades. It has 112 computer cabinets and 262 terabytes of distributed memory; 2 petabytes of disk storage are implemented via Lustre clustered files. Tianhe-1 uses a proprietary high-speed communication network to connect the processors; this proprietary interconnect is based on Infiniband QDR, enhanced with Chinese-made FeiTeng-1000 CPUs. The interconnect is twice as fast as Infiniband, but slower than some interconnects on other supercomputers.
The limits of specific approaches continue
to be tested through large scale
experiments; for example, in 2011 IBM ended
its participation in the Blue
Waters petaflops project at the University
of Illinois. The Blue Waters architecture
was based on the IBM POWER7 processor
and intended to have 200,000 cores with
a petabyte of "globally addressable
memory" and 10 petabytes of disk
space. The goal of a sustained petaflop led to design choices that optimized single-core performance and used a lower number of cores, which was expected to help performance on programs that did not scale well to a large number of processors. The large globally addressable
memory architecture aimed to solve
memory address problems in an efficient
manner, for the same type of
programs. Blue Waters had been expected
to run at sustained speeds of at least one
petaflop which relied on the specific
water-cooling approach to manage heat.
The National Science Foundation spent
about $200 million on the project in the
first four years of operation. IBM released the Power 775 computing node derived from that project's technology soon afterwards, but effectively abandoned the Blue Waters approach.
Architectural experiments are continuing
in a number of directions, for example
the Cyclops64 system uses a
supercomputer on a chip approach,
contrasting the use of massive distributed
processors. Each 64-bit Cyclops64 chip
contains 80 processors with the entire
system using a globally
addressable memory architecture. The processors are connected with a non-internally blocking crossbar switch and communicate with each other via global interleaved memory; there is no data cache in the architecture, but half of each SRAM bank can be used as scratchpad memory. Although this type of architecture allows unstructured parallelism in a dynamically non-contiguous memory system, it also produces challenges in the efficient mapping of parallel algorithms to a many-core system.
Issues and challenges
We could significantly increase the
performance of a processor by issuing
multiple instructions per clock cycle and
by deeply pipelining the execution units
to allow greater exploitation of instruction
level parallelism. But there are serious
difficulties in exploiting ever larger
degrees of instruction level parallelism.
As we increase both the width of
instruction issue and the depth of the
machine pipelines, we also increase the
number of independent instructions
required to keep the processor busy with
useful work. This means an increase in the
number of partially executed instructions
that can be in flight at one time. For a
dynamically scheduled machine, hardware structures, such as reorder buffers, instruction windows, and rename
register files, must grow to have sufficient
capacity to hold all in-flight instructions,
and worse, the number of ports on each
element of these structures must grow
with the issue width. The logic to track
dependencies between all in-flight
instructions grows quadratically in the
number of instructions. Even a VLIW
machine, which is statically scheduled and
shifts more of the scheduling burden to
the compiler, needs more registers, more
ports per register, and more hazard
interlock logic (assuming a design where
hardware manages interlocks after issue
time) to support more in-flight
instructions, which similarly causes quadratic increases in circuit size and complexity. This rapid increase in circuit
complexity makes it difficult to build
machines that can control large numbers
of in-flight instructions, which limits
practical issue widths and pipeline depths.
Vector processors were successfully
commercialized long before instruction
level parallel machines and take an
alternative approach to controlling
multiple functional units with deep
pipelines. Vector processors provide high-
level operations that work on vectors. A
typical vector operation might add two
floating-point vectors of 64 elements to
obtain a single 64-element vector result.
This instruction is equivalent to an entire
loop, in which each iteration is computing
one of the 64 elements of the result and
updating the indices, and branching back
to the beginning. Vector instructions have
several important properties that solve
most of the problems mentioned above:
A single vector instruction describes a
great deal of work—it is equivalent to
executing an entire loop where each
instruction represents tens or hundreds of
operations, and so the instruction fetch
and decode bandwidth needed to keep
multiple deeply pipelined functional units
busy is dramatically reduced.
By using a vector instruction, the compiler
or programmer indicates that the
computation of each result in the vector is
independent of the computation of other
results in the same vector and so
hardware does not have to check for data
hazards within a vector instruction. The
elements in the vector can be computed
using an array of parallel functional units,
or a single very deeply pipelined
functional unit, or any mixed
configuration of parallel and pipelined
functional units.
Hardware need only check for data
hazards between two vector instructions
once per vector operand and not once for
every element within the vectors. That
means the dependency checking logic
required between two vector instructions
is approximately the same as that
required between two scalar instructions,
but now many more elemental operations
can be in flight for the same complexity of
control logic.
Vector instructions that access memory have a known access pattern, so fetching the vector from a set of heavily interleaved memory banks works very well if the vector's elements are all adjacent. The high latency of initiating a main memory access versus accessing a cache is amortized because a single access is initiated for the entire vector rather than for a single word. Hence the cost of the latency
to main memory is seen only once for the
entire vector and not for each word of the
vector.
Because an entire loop is replaced by a vector instruction whose behaviour is predetermined, the control hazards that would normally arise from the loop branch are non-existent. For these
reasons, vector operations can be made
faster than a sequence of scalar
operations on the same number of data
items, and if the application domain can
use them frequently, designers are
motivated to include vector units. As
mentioned above, vector processors
pipeline and parallelize the operations on
the individual elements of a vector. The
operations include not only the arithmetic
operations, but also memory accesses and
effective address calculations. Also, most
high-end vector processors allow multiple
vector instructions to be in progress at the
same time, creating further parallelism
among the operations on different
vectors.
Vector processors are particularly useful
for large scientific and engineering
applications, such as car crash simulations
and weather forecasting, for which a
typical job might take dozens of hours of
supercomputer time running over multi-gigabyte data sets. Multimedia
applications can also benefit from vector
processing, as they contain abundant data
parallelism and process large data
streams. A high-speed pipelined processor
will usually use a cache to avoid forcing
memory reference instructions to have
very long latency. Unfortunately, big
scientific programs often have very large
active data sets that are sometimes
accessed with low locality, yielding poor performance from the memory
hierarchy. This problem could be
overcome by not caching these structures
if it were possible to determine the
memory access patterns and pipeline the
memory accesses efficiently. Compiler
assistance and novel cache architectures
through blocking and prefetching are
decreasing these memory hierarchy
problems, but still they continue to be
serious in some applications.
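Blocking (tiling) restructures a loop nest so that a small working set is reused while it is still resident in the cache. A hedged sketch of a blocked matrix multiply follows, with an illustrative block size; the real choice of block size depends on the cache of the target machine.

```c
#define BS 64   /* block size, chosen so a few BS x BS tiles fit in cache (illustrative) */

/* C += A * B for n x n row-major matrices, processed in BS x BS tiles so
 * that each tile is reused from cache instead of being refetched from memory. */
void matmul_blocked(double *C, const double *A, const double *B, int n) {
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```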
Application
Supercomputers can be used in scientific and business applications, but they are better suited to scientific applications. Large multinational banks and corporations use small supercomputers. Applications include special effects in film, weather forecasting, processing of geological and genetic-decoding data, aerodynamic and structural design, and the simulation of weapons of mass destruction. Users include film makers, geological data processing agencies, national weather forecasting agencies, space agencies, genetics research organizations, government agencies, scientific laboratories, military and defence systems, research groups and large corporations.
Simulation
Duplicating an environment is called simulation. It is done for reasons such as:
 training users
 predicting or forecasting a result
 when physical experimentation is not possible
 when physical experimentation is very expensive
Expensive machines are simulated before their actual construction to prevent economic losses and to save time. Life-threatening stunts are simulated before being performed, which can reveal any technical or other fault and prevent damage.
Movies
Supercomputers are used to produce special effects. Movies like Star Trek, Starfighter, Babylon 5, the Terminator sequels, Dante's Peak, Asteroid, Jurassic Park, The Lost World, the Matrix sequels, Lord of the Rings, Godzilla and many recent movies have special effects generated on supercomputers.
Weather forecasting
Data collected from a worldwide network of satellites, ground stations and airplanes is fed into a supercomputer for analysis to forecast the weather. Thousands of variables are involved in weather forecasting, and they can only be processed on a supercomputer. Accurate predictions cannot yet be made beyond about one month, because still more powerful computers would be needed.
Oil Exploration
To determine the most productive oil exploration sites, millions of pieces of data are processed. Processing geological data involves billions of pieces of data and thousands of variables, a very complex calculation requiring very large computing power.
Genetics engineering
Supercomputers are used for the processing and decoding of genetic data by genetics scientists and engineers, in research and development aimed at protecting human beings from hereditary diseases. Since genetic data processing involves thousands of factors, supercomputers are the best choice. The latest developments, such as gene mapping and cloning, also require the capabilities of supercomputers.
Space exploration
Great achievements in space exploration would be impossible without supercomputers. The remarkable accuracy and precision of the landing of Pathfinder on Mars is further proof of the capabilities of this machine. The famous IBM RISC/6000 processor technology was used as the in-flight computer; it was modified for the project, radiation-hardened, and called the RAD6000.
Aerodynamic designing of airplanes
In the manufacture of airplanes, supercomputers are used to simulate the passage of air around separate pieces of the plane and then to combine the results. Today's supercomputers are still unable to simulate the passage of air around an entire aircraft.
Aerospace and structural designing
Simulation in aerospace and structural design was used for the space station and the space plane. These projects required extensive experiments, some of which are physically impossible; for example, the proposed space station would collapse under its own weight if built in the gravity of Earth. The space plane must be able to take off from a runway on Earth and accelerate directly into orbit at speeds greater than 8,800 miles per hour. Most of these conditions cannot be duplicated; the simulation and modelling for these designs and tests involve processing billions of pieces of data and solving numerous complex mathematical calculations, tasks well suited to supercomputers.
Nuclear weapons
Simulation is also used in the production of weapons of mass destruction, to simulate the results of an atomic or nuclear bomb design. For this reason, the US government is very cautious about the production and export of such computers to several nations. Some of the famous export deals include:
• America provided a Cray X-MP supercomputer to India for weather data processing.
• The USA supplied a supercomputer to China for peaceful nuclear research.
• International Business Machines Corporation exported a RISC/6000 SP supercomputer to Russia, which was used for nuclear and atomic research. This deal brought IBM under strong criticism from the US government.
Conclusion and future work
Given the current rate of progress, industry experts estimate that supercomputers will reach 1 exaflops (10^18, one quintillion FLOPS) by 2018. China has described plans to have a 1 exaflops supercomputer online by 2018. Using the Intel MIC multi-core processor architecture, which is Intel's response to GPU systems, SGI plans to achieve a 500-fold increase in performance by 2018 in order to reach one exaflops. Samples of MIC chips with 32 cores, which combine vector processing units with a standard CPU, have become available.
The government of India has also stated ambitions for an exaflops-range supercomputer, which it hopes to complete by 2017. In November 2014 it was reported that India is working on the fastest supercomputer ever, set to work at 132 exaflops. Supercomputers with this new architecture could be out within the next year. The aim is to improve data processing at the memory, storage and I/O levels.
That will help break down parallel
computational tasks into small parts,
reducing the compute cycles required to
solve problems. That is one way to
overcome economic and scaling
limitations of parallel computing that
affect conventional computing models.
Memory, storage and I/O work in tandem
to boost system performance, but there
are bottlenecks with present
supercomputing models. A lot of energy
and time is wasted in continuously moving
large chunks of data between processors,
memory and storage. Decreasing the amount of data that has to be moved could help process data up to three times faster than current supercomputing models.
When working with petabytes and
exabytes of data, moving this amount of
data is extremely inefficient and time
consuming, so the processing can be moved to the data by providing compute capability throughout the system hierarchy.
IBM has built the world's fastest computers for decades, including the third- and fifth-fastest on a recent Top 500 list. But the amount of data being fed to servers is outpacing the growth of supercomputing speeds. Networks are not getting faster, chip clock speeds are not increasing, and data-access times are not improving much.
Applications no longer live in the classic
compute microprocessors; instead
application and workflow computation are
distributed throughout the system
hierarchy.
A simple example is reducing the size of data sets by decomposing information in storage, which can then be moved to the memory of the computer. That type of model can be applied to oil and gas workflows, which typically take months, and it would significantly shorten the time required to make decisions about drilling.
A hierarchy of storage and memory that includes non-volatile RAM offers much lower latency and higher bandwidth, without the requirement to move the data all the way back to central storage.
Conventional computing architectures follow the von Neumann approach, in which data is brought into a processor, operated on, and put back into memory. Most computer systems today still work on this architecture, which was formulated in the 1940s by the mathematician John von Neumann.
At the level of the individual compute element, the von Neumann approach continues. At the level of the system, however, an additional way to compute is provided: moving the computation to the data. There are multiple ways to reduce latency in a system and to reduce the amount of data that has to be moved; this saves energy as well as time.
Moving computing closer to data in storage or memory is not a new concept. Appliances and servers with CPUs targeted at specific workloads are being built, with storage, memory and processing subsystems disaggregated into separate boxes; these can be improved by optimizing entire supercomputing workflows that involve simulation, modeling, visualization and complex analytics on massive data sets.
The model will work in research areas such as oil and gas exploration, life sciences, materials research and weather modelling. Applications will need to be written and well defined for processing at the different levels, and IBM is working with institutions, companies and researchers to define software models for key sectors.
The fastest supercomputers today are ranked with the LINPACK benchmark, a simple measurement based on floating-point operations. IBM is not ignoring the Top 500, but is offering a different approach to enhance supercomputing. LINPACK is good for measuring speed, but it under-represents the utility of supercomputers, and the benchmark does not fully account for specialized processing elements such as integer processing units and FPGAs.
The Top 500 list measures some aspects of the behaviour of compute nodes, but it is not complete in its characterization of workflows that require merging modelling, simulation and analytics, and many classic applications are only moderately related to what LINPACK measures. Different organizations building supercomputers have tuned their software to take advantage of LINPACK, which makes it a poorer measurement of supercomputing performance.
The actual performance of some specialized applications goes far beyond LINPACK, and IBM's argument seems convincing. There are also companies developing computers that give a new spin on how data is accessed and interpreted. D-Wave Systems is offering what is believed to be the world's first and only quantum-based computer, which is being used by NASA, Lockheed Martin and Google for specific tasks; others are still in the experimental phase. IBM has built an experimental computer with a chip designed to mimic a human brain.
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 

Recently uploaded (20)

UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 

Super-Computer Architecture

  • 3. Background
The CDC 6600 series of computers were very early attempts at supercomputing; they gained their advantage over existing systems by relegating work to peripheral devices, freeing the CPU (Central Processing Unit) to process valuable data. With the Minnesota FORTRAN compiler the 6600 could sustain 500 kiloflops on standard mathematical operations.
Other early supercomputers, like the Cray 1 and Cray 2 that appeared afterwards, used a small number of fast processors that worked in harmony and were uniformly connected to the largest amount of shared memory that could be managed at the time.
These early architectures introduced parallel processing at the processor level, with innovations such as vector processing, in which the processor performs several operations during one clock cycle rather than waiting for successive cycles. In time, as the number of processors increased, different architectural issues emerged. Two issues that need to be addressed as the number of processors increases are the distribution of processing and the distribution of memory. In the distributed memory approach, each processor is packaged physically close to some local memory; the memory associated with other processors is then "further away" in terms of bandwidth and latency, giving non-uniform memory access.
Pipelining was an innovation of the 1960s, and by the 1970s the use of vector processors was well established. By the 1980s many supercomputers used parallel vector processors, and parallel vector processing had gained further ground by 1990.
In early systems the relatively small number of processors allowed them to use a shared memory architecture, in which processors access a common pool of memory. A common early approach was uniform memory access (UMA), in which the access time to a memory location was similar for all processors. Non-uniform memory access (NUMA) allowed a processor to access its own local memory faster than other memory locations, whereas cache-only memory architectures (COMA) allowed the local memory of each processor to be used like a cache, requiring coordination as memory values changed (a small allocation sketch follows this slide).
As the number of processors increases, efficient inter-processor communication and synchronization on a supercomputer becomes a challenge. Many different approaches may be used to achieve this goal. For example, in the early 1980s the Cray X-MP system used shared registers: registers that could be accessed by all processors but were used only for inter-processor synchronization and communication, not for moving bulk data back and forth. However, inherent challenges in managing a large amount of shared memory among many processors resulted in a move to more distributed architectures.
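The local-versus-remote distinction that NUMA introduces is something application code can exploit explicitly. The following is a minimal sketch, assuming a Linux system with libnuma available (compile with -lnuma); the node number, buffer size and access loop are illustrative choices for the example, not something taken from the slides.

    /* Minimal sketch (not from the slides): place a buffer on a specific
     * NUMA node with libnuma so that a thread running on that node sees
     * the lower "local" access latency described above. Error handling
     * is kept minimal. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return 1;
        }
        int node = 0;                        /* place the data on node 0 */
        size_t bytes = 64 * 1024 * 1024;     /* 64 MiB working set */
        double *local = numa_alloc_onnode(bytes, node);
        if (!local) return 1;
        numa_run_on_node(node);              /* run this thread near its data */
        for (size_t i = 0; i < bytes / sizeof(double); i++)
            local[i] = (double)i;            /* touches only node-local memory */
        printf("touched %zu doubles on node %d\n", bytes / sizeof(double), node);
        numa_free(local, bytes);
        return 0;
    }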
  • 4. Representative vector processors:

    Processor   Year  Clock (MHz)  Registers  Elements per register  Functional units
    CRAY-1      1976      80           8               64                   6
    CRAY-XMP    1983     120           8               64                   8
    CRAY-YMP    1988     166           8               64                   8
    NEC SX/2    1984     160       8 + 8192       256 (variable)           16
    CRAY C-90   1991     240           8              128                   8
    NEC SX/4    1995     400       8 + 8192       256 (variable)           16
    CRAY J-90   1995     100           8               64                   8
    CRAY T-90   1996     500           8              128                   8
    NEC SX/5    1999     (remaining columns not given in the source)

Approaches to supercomputing

Distributed supercomputing

Opportunistic supercomputing is a form of networked grid computing whereby a "super virtual computer" of many loosely coupled volunteer computing machines performs very large computational tasks. Grid computing has been applied to a number of large-scale embarrassingly parallel problems that require supercomputing-scale performance (a minimal work-unit sketch follows this slide). However, basic grid and cloud computing approaches that rely on volunteer computing cannot handle traditional supercomputing tasks such as fluid dynamics simulations.

The fastest grid computing system is a distributed computing project that reported 43.1 petaflops of x86 processing power as of June 2014; of this, 42.5 petaflops were contributed by clients running on various GPUs and the rest by various CPU systems. The BOINC platform hosts a number of distributed computing projects. By May 2011, BOINC had recorded processing power of as much as 5.5 petaflops through over 480,000 active computers on the network. The most active project (measured by computational power) reports processing power of over 700 teraflops through as many as 33,000 active computers. As of May 2011, GIMPS's distributed Mersenne prime search achieves about 60 teraflops through over 25,000 registered computers. The Internet PrimeNet Server has supported GIMPS's grid computing approach, one of the earliest and most successful grid computing projects, since 1997.
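To make the "embarrassingly parallel" label concrete, the sketch below splits a job into work units that need no communication with each other; each call to run_work_unit could just as well execute on a different volunteer machine and only the partial counts would be merged centrally. The Monte Carlo pi estimate, unit count and seeds are invented for the illustration and are not taken from any particular grid project.

    /* Minimal sketch of an embarrassingly parallel job: estimate pi by
     * Monte Carlo sampling, split into fully independent work units. */
    #include <stdio.h>

    /* Tiny linear congruential generator so each unit is self-contained. */
    static double next_uniform(unsigned long long *state) {
        *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
        return (double)((*state >> 33) & 0x7FFFFFFF) / 2147483648.0;
    }

    /* One work unit: count samples that land inside the unit circle. */
    static long run_work_unit(unsigned long long seed, long samples) {
        unsigned long long state = seed;
        long hits = 0;
        for (long i = 0; i < samples; i++) {
            double x = next_uniform(&state);
            double y = next_uniform(&state);
            if (x * x + y * y <= 1.0)
                hits++;
        }
        return hits;
    }

    int main(void) {
        const int units = 8;             /* work units; a grid would scatter these */
        const long samples = 1000000;    /* samples per unit */
        long total = 0;
        for (int u = 0; u < units; u++)  /* run serially here for simplicity */
            total += run_work_unit(12345ULL + u, samples);
        printf("pi is approximately %f\n",
               4.0 * (double)total / (double)(units * samples));
        return 0;
    }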
  • 5. Quasi-opportunistic approaches

Quasi-opportunistic supercomputing is a form of distributed computing whereby the "super virtual computer" of a large number of networked, geographically dispersed computers performs computing tasks that demand huge processing power. Quasi-opportunistic supercomputing aims to provide a higher quality of service than opportunistic grid computing by achieving more control over the assignment of tasks to distributed resources and by using intelligence about the availability and reliability of individual systems within the supercomputing network. Quasi-opportunistic distributed execution of demanding parallel computing software in grids is achieved through the implementation of grid-wise allocation agreements, co-allocation subsystems, communication-topology-aware allocation mechanisms, fault-tolerant message passing libraries and data pre-conditioning.

Massive, centralized parallelism

During the 1980s, as the demand for computing power increased, the trend towards a much larger number of processors began, ushering in the age of massively parallel systems with distributed memory and distributed file systems, given that shared memory architectures could not scale to a large number of processors. Hybrid approaches such as distributed shared memory also appeared after the early systems.

The computer clustering approach connects a number of readily available computing nodes (e.g. personal computers used as servers) via a fast, private local area network. The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as, by and large, one cohesive computing unit, for example via a single system image concept. Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers. It is distinct from other approaches such as peer-to-peer or grid computing, which also use a large number of nodes but with a far more distributed nature. By the 21st century, the TOP500 organization's semiannual list of the 500 fastest supercomputers often included many clusters, such as the world's fastest in 2011, the K computer, which has a distributed memory, cluster architecture.

When a large number of local semi-independent computing nodes are used (e.g. in a cluster architecture) the speed and flexibility of the interconnect becomes very important. Modern supercomputers have taken various approaches to address this issue; e.g. Tianhe-1 uses a proprietary high-speed network based on Infiniband QDR, enhanced with FeiTeng-1000 CPUs. On the other hand, the Blue Gene/L system uses a three-dimensional torus interconnect with auxiliary networks for global communications. In this approach each node is connected to its six nearest neighbours; a similar torus was used by the Cray T3E.
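The six-nearest-neighbour structure just described is easy to reason about in code: each node has integer coordinates, and its neighbours are found by stepping one hop along each axis with wrap-around. The sketch below is only an illustration of that topology; the 8x8x8 dimensions and the node-ID scheme are arbitrary choices, not the actual Blue Gene/L or Cray T3E routing logic.

    /* Minimal sketch: the six nearest neighbours of node (x,y,z) in an
     * NX x NY x NZ three-dimensional torus, with wrap-around at the edges. */
    #include <stdio.h>

    #define NX 8
    #define NY 8
    #define NZ 8

    /* Flatten torus coordinates into a node id. */
    static int node_id(int x, int y, int z) {
        return (z * NY + y) * NX + x;
    }

    static void print_neighbours(int x, int y, int z) {
        int d[6][3] = { {1,0,0}, {-1,0,0}, {0,1,0}, {0,-1,0}, {0,0,1}, {0,0,-1} };
        printf("node %d neighbours:", node_id(x, y, z));
        for (int i = 0; i < 6; i++) {
            int nx = (x + d[i][0] + NX) % NX;   /* wrap-around keeps the   */
            int ny = (y + d[i][1] + NY) % NY;   /* topology a torus rather */
            int nz = (z + d[i][2] + NZ) % NZ;   /* than an open mesh       */
            printf(" %d", node_id(nx, ny, nz));
        }
        printf("\n");
    }

    int main(void) {
        print_neighbours(0, 0, 0);   /* even an "edge" node has six links */
        print_neighbours(3, 4, 5);
        return 0;
    }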
  • 6. Massive centralized systems at times use special-purpose processors designed for a specialised application, and may use field-programmable gate array (FPGA) chips to gain performance by sacrificing generality. Examples of such special-purpose supercomputers include Belle, Deep Blue and Hydra for playing chess, MDGRAPE-3 for protein structure computation and molecular dynamics, and Deep Crack for breaking the DES cipher.

Massive distributed parallelism

Grid computing uses a large number of computers in diverse, distributed administrative domains, which makes it an opportunistic approach that uses resources whenever they are available. An example is BOINC, a volunteer-based, opportunistic grid system. Some BOINC applications have reached multi-petaflop levels by using close to half a million computers connected on the web, whenever volunteer resources become available. However, these types of results often do not appear in the TOP500 ratings because they do not run the general-purpose Linpack benchmark. Although grid computing has had success in parallel task execution, demanding supercomputer applications such as weather simulations or computational fluid dynamics have not been successful, partly due to the barriers in reliable sub-assignment of a large number of tasks as well as the reliable availability of resources at a given time.

In quasi-opportunistic supercomputing a large number of geographically dispersed computers are orchestrated with built-in safeguards. The quasi-opportunistic approach goes beyond volunteer computing on highly distributed systems such as BOINC, or general grid computing on a system such as Globus, by allowing the middleware to provide almost seamless access to many computing clusters, so that existing programs in languages such as Fortran or C can be distributed among multiple computing resources. Quasi-opportunistic supercomputing aims to provide a higher quality of service than opportunistic resource sharing. It enables the execution of demanding applications within computer grids by establishing grid-wise resource allocation agreements and fault-tolerant message passing that abstractly shields against the failures of the underlying resources, maintaining some opportunism while allowing a higher level of control.

Vector processing principles

An ordered set of scalar data items, all of the same type and stored in memory, is known as a vector. Generally the vector elements are ordered with a fixed addressing increment between successive elements, called the stride. A vector processor includes processing elements, vector registers, register counters and functional pipelines to perform vector operations.
  • 7. Vector processing involves arithmetic or logical operations applied to vectors, whereas scalar processing operates on one datum at a time. The conversion from scalar code to vector code is called vectorization.

Vector processors are special-purpose computers that match a range of scientific computing tasks. These tasks usually involve large active data sets, often with poor locality, and long run times. In addition, vector processors provide vector instructions, which operate in a pipeline, working sequentially on all elements of the vector registers. Some properties of vector instructions are:

- Since the calculation of every result is independent of the calculation of previous results, a very deep pipeline can be used without any data hazards.
- A vector instruction specifies a large amount of work, since it is the equivalent of executing an entire loop. Hence the instruction bandwidth requirement is decreased.
- Vector instructions that access memory have a predefined access pattern that can easily be predicted. If the vector elements are all near each other, then fetching the vector from a set of heavily interleaved memory banks works extremely well. Because a single access is initiated for the entire vector rather than for a single word, the high latency of starting a main memory access, as opposed to accessing a cache, is amortized. Thus the cost of the latency to main memory is incurred only once for the entire vector, rather than once for each word of the vector.
- Control hazards are no longer present, since an entire loop is replaced by a vector instruction whose behaviour is determined beforehand.

Typical vector operations (a C sketch of several of these follows this slide) include, for integer and floating point:

- Add two vectors to produce a third.
- Subtract two vectors to produce a third.
- Multiply two vectors to produce a third.
- Divide two vectors to produce a third.
- Load a vector from memory.
- Store a vector to memory.

These instructions could be augmented to do typical array operations:

- Inner product of two vectors (multiply and accumulate sums).
- Outer product of two vectors (produce an array from two vectors).
- Product of (small) arrays (this would match the programming language APL, which uses vectors and arrays as primitive data elements).

Hence vector processing is faster and much more efficient than scalar processing. Both SIMD computers and pipelined processors can perform vector operations. Vector processing generates one result per clock cycle by exploiting the pipelining and segmentation concepts. It also reduces memory access conflicts and software overhead.
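As a concrete illustration of the operations listed above, the loops below express a vector add, an inner product and an outer product in plain C. On a vector machine each loop would map onto a handful of vector instructions; the pragmas are only hints and assume an OpenMP-capable compiler (e.g. gcc with -fopenmp-simd), which is an assumption of this sketch rather than anything the slides prescribe.

    /* Minimal sketch: "typical vector operations" written as C loops that a
     * vectorizing compiler (or a vector ISA) can turn into vector
     * instructions. */
    #include <stddef.h>

    /* c = a + b : element-wise vector add */
    void vec_add(const double *a, const double *b, double *c, size_t n) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* inner (dot) product: a multiply-accumulate reduction */
    double vec_dot(const double *a, const double *b, size_t n) {
        double sum = 0.0;
        #pragma omp simd reduction(+:sum)
        for (size_t i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /* outer product: every pair (i, j) produces one element of an n x m array */
    void vec_outer(const double *a, const double *b, double *out,
                   size_t n, size_t m) {
        for (size_t i = 0; i < n; i++) {
            #pragma omp simd
            for (size_t j = 0; j < m; j++)
                out[i * m + j] = a[i] * b[j];
        }
    }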
  • 8. Depending on the vectorization ratio in user programs and the speed ratio between vector and scalar operations, a vector processor can achieve a manifold speedup, as much as 10 to 20 times compared with conventional machines.

Vector instruction types

Six types of vector instructions are (gather and scatter are sketched in code after this slide):

- Vector-vector instructions: one or two vector operands are fetched from their vector registers, pass through a functional pipeline unit, and produce results in another vector register.
- Vector-scalar instructions
- Vector-memory instructions
- Vector reduction instructions
- Gather and scatter instructions: these use two vector registers to gather or scatter vector elements randomly throughout memory. 'Gather' fetches from memory the non-zero elements of a sparse vector using an index vector; 'scatter' does the opposite, storing a packed vector back into the indexed non-zero positions of a sparse vector in memory.
- Masking instructions: these use a mask vector to compress or expand a vector to a correspondingly shorter or longer index vector.
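The gather and scatter semantics described above are easy to express as scalar loops, which is exactly what the hardware replaces with single vector operations. The sketch below is a minimal C illustration; the array contents, names and sizes are invented for the example.

    /* Minimal sketch of gather/scatter. A real vector machine performs each
     * of these loops as one instruction operating on a vector register plus
     * an index register. */
    #include <stdio.h>

    /* gather: dense[i] = sparse[idx[i]] */
    void gather(const double *sparse, const int *idx, double *dense, int n) {
        for (int i = 0; i < n; i++)
            dense[i] = sparse[idx[i]];
    }

    /* scatter: sparse[idx[i]] = dense[i] */
    void scatter(double *sparse, const int *idx, const double *dense, int n) {
        for (int i = 0; i < n; i++)
            sparse[idx[i]] = dense[i];
    }

    int main(void) {
        double sparse[8] = { 0, 3.5, 0, 0, 1.25, 0, 0, 2.0 };
        int idx[3] = { 1, 4, 7 };          /* positions of the non-zeros */
        double dense[3];
        gather(sparse, idx, dense, 3);     /* dense = {3.5, 1.25, 2.0} */
        dense[1] *= 2.0;                   /* operate on the packed values */
        scatter(sparse, idx, dense, 3);    /* write the results back in place */
        printf("%g %g %g\n", sparse[1], sparse[4], sparse[7]);
        return 0;
    }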
  • 9. Vector access memory schemes

Usually, multiple access paths pipeline the flow of vector operands between main memory and the vector registers.

- Vector operand specifications: vector operands can be arbitrarily long, and vector elements need not be stored in contiguous memory locations. To access a vector, its base address, stride and length must be specified (a small addressing sketch follows this slide). Since every vector register has a predefined number of component registers, only a small part of a long vector can be loaded into a vector register in a fixed number of cycles.
- C-access memory organisation.
- S-access memory organisation.
- C/S-access memory organisation.

The effect of cache design on vector computers

Cache memories have proven very successful at boosting system performance in general-purpose computers. However, their use in vector processing has not yet been fully established. Most existing supercomputer vector processors do not have cache memories, for the following reasons:

- The data sets of numerical programs are generally too large for the cache sizes provided by present technology. Sweep accesses over a large vector may completely reload the cache before the processor can reuse the cached data.
- Sequential addressing, a crucial assumption in conventional caches, may not hold in vectorised numerical algorithms, which usually access data with a certain stride, i.e. the difference between the addresses of consecutive vector elements.
- Register files and highly interleaved memories are usually used to achieve the high memory bandwidth required for vector processing, and it is not clear whether cache memories can boost the performance of such systems.

Although cache memories have the potential to boost the performance of future vector processors, several factors count against vector caches. A single miss in the vector cache stalls the processor for a number of cycles equal to the entire memory access time, whereas the memory accesses of a vector processor without a cache are fully pipelined. To benefit from a vector cache, the miss ratio must therefore be kept extremely small. In general, cache misses can be classified into these categories:

- Compulsory misses
- Capacity misses
- Conflict misses

Compulsory misses occur on the initial loading of data, which is easily pipelined in a vector computer. Capacity misses arise from the size limits of a cache in retaining data between references; if algorithms are blocked as mentioned, capacity misses can be folded into the compulsory misses incurred during the initial loading of each block of data, provided the block size is smaller than the cache. Finally, conflict misses play the deciding role in the vector processing environment. Conflicts occur when elements of the same vector map to the same cache line, or when lines from two different vectors compete for the same cache line.
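The base/stride/length description of a vector operand is easy to see in code. The sketch below is a minimal C illustration (the matrix dimensions are arbitrary): walking a column of a row-major matrix is the classic strided access, where consecutive elements are a full row apart in memory, which is also the pattern that defeats a conventional cache line.

    /* Minimal sketch: a vector operand described by base address, stride
     * and length, as in the operand specification above. */
    #include <stdio.h>

    #define ROWS 4
    #define COLS 5

    /* Sum a strided vector: element i lives at base[i * stride]. */
    static double strided_sum(const double *base, long stride, long length) {
        double s = 0.0;
        for (long i = 0; i < length; i++)
            s += base[i * stride];
        return s;
    }

    int main(void) {
        double m[ROWS][COLS];
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                m[r][c] = r * 10 + c;
        /* column 2 as a vector: base = &m[0][2], stride = COLS, length = ROWS */
        printf("column sum = %g\n", strided_sum(&m[0][2], COLS, ROWS));
        return 0;
    }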
  • 10. Since the conflict misses that degrade vector cache performance depend on the vector access stride, the size of an application problem can be adjusted to produce a favourable access stride for a given machine. This approach burdens the programmer with knowing the architectural details of the machine, and it is infeasible for many applications. Ideas like prime-mapped cache schemes have therefore been studied. This cache organization reduces the cache misses due to cache-line interference that are critical in numerical applications, while the cache lookup time stays the same as in conventional caches. The cache addresses for accessing the prime-mapped cache can be generated in parallel with the normal address calculation, and this address generation takes less time than the normal address calculation because of the special properties of the Mersenne prime (a small sketch of this shift-and-add trick follows this slide). Thus the new mapping scheme does not impose any performance penalty on cache access. With this mapping scheme the cache can deliver a large performance boost, which will grow as the speed gap between processor and memory widens.

GPU based supercomputing

The demand for increased Personal Computer (PC) graphics subsystem performance never ceases. The GPU is an ancillary coprocessor subsystem, connected to an internal high-speed bus and memory-mapped into global memory resources. Computer vision, gaming and advanced graphics design applications have driven sharp MIPS performance gains and growing variety and algorithmic efficiency in the relevant graphics standards. All of this is part of a larger evolutionary trend whereby PCs supplant dedicated workstations for a host of compute-intensive applications.

At a deeper level, GPU evolution rests on the assumption of a processing model that can achieve the highest possible performance for a wide variety of graphics algorithms, and this drives all relevant aspects of hardware architecture and design. The most efficient GPU processing model is Single Instruction Multiple Data (SIMD). The SIMD model served traditional vector processor/supercomputer designs well (e.g. Cray X-MP, Convex C1, CDC Star-100) through its ability to speed up datapath calculation by concurrent execution of processing threads, and the SIMD concept has also been employed in more recent CPU architectural advancements such as the IBM Cell processor, x86 with MMX extensions, SPARC VIS, Sun MAJC and ARM NEON.

The SIMD processing model adopted for the GPU can be used for general classes of scientific computation not specifically associated with graphics. This was the start of the General Purpose computing on GPU (GPGPU) movement and the basis for many examples of GPU-accelerated scientific processing. GPGPU depends closely on Application Programming Interface (API) access to GPU processing resources; the GPU API abstracts much of the complexity associated with manipulating hardware resources and provides convenient access to I/O, memory management and thread management functionality in the form of generic programming function calls (e.g. from C, C++, Python, Java). Thus GPU hardware is virtualized as a standard programming resource, facilitating application development that incorporates GPU acceleration. APIs currently in use include NVIDIA's Compute Unified Device Architecture (CUDA) and ATI's Data Parallel Virtual Machine (DPVM).
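The "special properties of the Mersenne prime" mentioned above come down to arithmetic: reducing an address modulo p = 2^k - 1 needs only shifts and adds, no division, so the prime-mapped index can be formed in parallel with the ordinary address calculation. The sketch below shows only that arithmetic idea, not the full published prime-mapped cache design; the choice of k and the example address are arbitrary.

    /* Minimal sketch: address mod (2^k - 1) by folding, i.e. without a
     * divider. k must be chosen so that 2^k - 1 is actually prime
     * (e.g. k = 5, 7, 13, 17, 19, 31). */
    #include <stdio.h>

    static unsigned long mod_mersenne(unsigned long x, unsigned int k) {
        unsigned long p = (1UL << k) - 1;      /* the Mersenne modulus */
        while (x > p)
            x = (x & p) + (x >> k);            /* fold the high bits back in */
        return (x == p) ? 0 : x;               /* p mod p is 0 */
    }

    int main(void) {
        unsigned int k = 13;                   /* 2^13 - 1 = 8191, prime */
        unsigned long addr = 0x3FCDE8UL;       /* an arbitrary word address */
        printf("cache set = %lu (check: %lu)\n",
               mod_mersenne(addr, k), addr % ((1UL << k) - 1));
        return 0;
    }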
  • 11. GPU Architecture

A SIMD GPU is organized as a collection of 'N' distinct multiprocessors, each consisting of 'M' distinct thread processors. Each multiprocessor operates on an ensemble of threads that is managed and scheduled as a single entity (a 'warp'); in this way SIMD instruction fetch and execution, shared-memory access and cache operations are completely synchronized. Memory is usually organized hierarchically, with Global/Device memory transactions mediated by high-speed bus transactions (e.g. PCIe, HyperTransport).

A feature of the CPU/GPU processing architecture is that GPU processing is essentially non-blocking: the CPU may continue processing as soon as a work unit has been written to the GPU transaction buffer. GPU work-unit assembly/disassembly and I/O at the GPU transaction buffer can to a large extent be hidden, in which case GPU performance will effectively dominate the performance of the entire system. Optimal GPU processing gain is achieved at the I/O constraint boundary at which the thread processors never stall for lack of data.

The maximum achievable speedup is governed by Amdahl's Law (written out after this slide): any acceleration 'A' due to thread parallelization depends critically upon

- the fraction of code that can be parallelized ('P'),
- the degree of parallelization ('N'), and
- any overhead associated with parallelization.

This gives a theoretical maximum acceleration for the application. CPU code pipelining (i.e. overlap with GPU processing) must also be factored into any estimate of 'P'; pipelining effectively parallelizes CPU and GPU code segments, reducing the non-parallelized code fraction (1 - P). Where this reduction is substantial, well-motivated software architecture design can exploit the effect, greatly increasing the acceleration potential of the complete application.

21st-century architectural trends

The air-cooled IBM Blue Gene supercomputer architecture trades processor speed for low power consumption so that a larger number of processors can be used at room temperature with normal air conditioning. The second-generation Blue Gene/P system is distinguished by the fact that each chip can act as a 4-way symmetric multiprocessor and also includes the logic for node-to-node communication; at 371 MFLOPS/W the system is very energy efficient.

The K computer has a water-cooling system, homogeneous processors and a distributed memory system with a cluster architecture. It uses more than 80,000 SPARC-based processors, each with eight cores, for a total of over 700,000 cores, almost twice as many as any other system, housed in more than 800 cabinets, each with 96 computing nodes (each with 16 GB of memory) and 6 I/O nodes. Although it is more powerful than the next five systems on the TOP500 list combined, at 824.56 MFLOPS/W it also has the lowest power-to-performance ratio of any current major supercomputer system. The follow-up system, the PRIMEHPC FX10, uses the same six-dimensional torus interconnect, but with only one SPARC processor per node.
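The bound the slide attributes to Amdahl's Law can be written out explicitly; the numbers below are an invented worked example (overhead ignored), not measurements from any of the systems discussed.

    % Amdahl's Law: speedup A from parallelizing a fraction P of the work
    % across N processors, ignoring parallelization overhead.
    \[
      A(P, N) = \frac{1}{(1 - P) + \dfrac{P}{N}}
    \]
    % Worked example: P = 0.95, N = 100 gives
    % A = 1 / (0.05 + 0.0095) = 1 / 0.0595 \approx 16.8, far below N.
    % As N grows without bound, A approaches 1 / (1 - P) = 20.
    % Pipelining CPU work with GPU work effectively shrinks (1 - P),
    % which is why it raises the ceiling on the achievable acceleration.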
  • 12. Unlike the K computer, the Tianhe-1A system uses a hybrid architecture and integrates CPUs and GPUs. It uses more than 14,000 Xeon general-purpose processors and more than 7,000 Nvidia Tesla graphics-based processors on about 3,500 blades. It has 112 computer cabinets and 262 terabytes of distributed memory; 2 petabytes of disk storage are implemented via Lustre clustered files. Tianhe-1 uses a proprietary high-speed communication network to connect the processors; the proprietary interconnect was based on Infiniband QDR, along with Chinese-made FeiTeng-1000 CPUs. The interconnect is twice as fast as Infiniband, but slower than some interconnects on other supercomputers.

The limits of specific approaches continue to be tested through large-scale experiments; for example, in 2011 IBM ended its participation in the Blue Waters petaflops project at the University of Illinois. The Blue Waters architecture was based on the IBM POWER7 processor and was intended to have 200,000 cores with a petabyte of "globally addressable memory" and 10 petabytes of disk space. The goal of a sustained petaflop led to design choices that optimized single-core performance and a lower number of cores, which was expected to help performance on programs that did not scale well to a large number of processors. The large globally addressable memory architecture aimed to solve memory addressing problems efficiently for the same type of programs. Blue Waters had been expected to run at sustained speeds of at least one petaflop and relied on a specific water-cooling approach to manage heat. In the first four years of the project, the National Science Foundation spent about $200 million on it. IBM soon released the Power 775 computing node derived from that project's technology, but effectively abandoned the Blue Waters approach.

Architectural experiments are continuing in a number of directions; for example, the Cyclops64 system uses a supercomputer-on-a-chip approach, in contrast to the use of massive distributed processors. Each 64-bit Cyclops64 chip contains 80 processors, and the entire system uses a globally addressable memory architecture. The processors are connected by a non-internally-blocking crossbar switch and communicate with each other via global interleaved memory. There is no data cache in the architecture, but half of each SRAM bank can be used as scratchpad memory. Although this type of architecture allows unstructured parallelism in a dynamically non-contiguous memory system, it also produces challenges in efficiently mapping parallel algorithms to such a many-core system.

Issues and challenges

We could significantly increase the performance of a processor by issuing multiple instructions per clock cycle and by deeply pipelining the execution units
  • 13. to allow greater exploitation of instruction-level parallelism. But there are serious difficulties in exploiting ever larger degrees of instruction-level parallelism. As we increase both the width of instruction issue and the depth of the machine pipelines, we also increase the number of independent instructions required to keep the processor busy with useful work, which means an increase in the number of partially executed instructions that can be in flight at one time. In a dynamically scheduled machine, hardware structures such as reorder buffers, instruction windows and rename register files must grow to have sufficient capacity to hold all in-flight instructions, and, worse, the number of ports on each element of these structures must grow with the issue width. The logic to track dependencies between all in-flight instructions grows quadratically in the number of instructions. Even a VLIW machine, which is statically scheduled and shifts more of the scheduling burden to the compiler, needs more registers, more ports per register, and more hazard interlock logic (assuming a design where hardware manages interlocks after issue time) to support more in-flight instructions, which similarly causes quadratic increases in circuit size and complexity. This rapid increase in circuit complexity makes it difficult to build machines that can control large numbers of in-flight instructions, and thus limits practical issue widths and pipeline depths.

Vector processors were commercialized successfully long before instruction-level parallel machines and take an alternative approach to controlling multiple functional units with deep pipelines. Vector processors provide high-level operations that work on vectors. A typical vector operation might add two floating-point vectors of 64 elements to obtain a single 64-element vector result. This instruction is equivalent to an entire loop, in which each iteration computes one of the 64 elements of the result, updates the indices and branches back to the beginning (the sketch after this slide shows the loop that such an instruction replaces).

Vector instructions have several important properties that solve most of the problems mentioned above. A single vector instruction describes a great deal of work: it is equivalent to executing an entire loop, where each instruction represents tens or hundreds of operations, so the instruction fetch and decode bandwidth needed to keep multiple deeply pipelined functional units busy is dramatically reduced. By using a vector instruction, the compiler or programmer indicates that the computation of each result in the vector is independent of the computation of other results in the same vector, so the hardware does not have to check for data hazards within a vector instruction. The elements in the vector can be computed using an array of parallel functional units, a single very deeply pipelined functional unit, or any mixed configuration of parallel and pipelined functional units.
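The equivalence described above is easiest to see side by side. The sketch below is illustrative C only; the vector-instruction form is shown as a comment because the actual opcodes depend on the machine and are not given in the slides.

    /* Minimal sketch: the scalar loop that a single 64-element vector add
     * replaces. On a vector machine the whole function body collapses to
     * roughly "load V1, load V2, V3 = V1 + V2, store V3" (exact instruction
     * names vary by ISA). */
    #define VLEN 64

    void vadd64(const double *a, const double *b, double *c) {
        for (int i = 0; i < VLEN; i++) {   /* 64 iterations: index update,   */
            c[i] = a[i] + b[i];            /* compare and branch every time  */
        }
    }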
  • 14. Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors. That means the dependency-checking logic required between two vector instructions is approximately the same as that required between two scalar instructions, yet many more elemental operations can be in flight for the same complexity of control logic.

Vector instructions that access memory have a known access pattern, so fetching the vector from a set of heavily interleaved memory banks works very well if the vector's elements are all adjacent. The high latency of initiating a main memory access, versus accessing a cache, is amortized because a single access is initiated for the entire vector rather than for a single word. Hence the cost of the latency to main memory is incurred only once for the entire vector, not once for each word of the vector.

Because an entire loop is replaced by a vector instruction whose behaviour is predetermined, the control hazards that would normally arise from the loop branch are non-existent.

For these reasons, vector operations can be made faster than a sequence of scalar operations on the same number of data items, and designers are motivated to include vector units if the application domain can use them frequently. As mentioned above, vector processors pipeline and parallelize the operations on the individual elements of a vector. The operations include not only the arithmetic operations but also memory accesses and effective address calculations. In addition, most high-end vector processors allow multiple vector instructions to be in progress at the same time, creating further parallelism among the operations on different vectors.

Vector processors are particularly useful for large scientific and engineering applications, such as car crash simulations and weather forecasting, for which a typical job might take dozens of hours of supercomputer time running over multi-gigabyte data sets. Multimedia applications can also benefit from vector processing, as they contain abundant data parallelism and process large data streams.

A high-speed pipelined processor will usually use a cache to avoid forcing memory reference instructions to have very long latency. Unfortunately, big scientific programs often have very large active data sets that are sometimes accessed with low locality, yielding poor performance from the memory hierarchy. This problem could be overcome by not caching these structures, if it were possible to determine the memory access patterns and pipeline the memory accesses efficiently. Compiler assistance and novel cache architectures, through blocking and prefetching (a short blocked matrix-multiply sketch follows this slide), are reducing these memory hierarchy problems, but they continue to be serious in some applications.
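Blocking, mentioned above, simply restructures a loop nest so that a small tile of the data is reused while it is still resident in cache. The sketch below applies the idea to matrix multiplication; the block size of 64 is an arbitrary illustrative choice that would normally be tuned to the cache, and C is assumed to be zero-initialized by the caller.

    /* Minimal sketch of loop blocking (tiling): C += A * B on n x n
     * row-major matrices, working on one cache-sized tile at a time. */
    #define BLOCK 64

    static int min_int(int a, int b) { return a < b ? a : b; }

    void matmul_blocked(int n, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    /* the (ii, kk, jj) tile stays in cache while it is reused */
                    for (int i = ii; i < min_int(ii + BLOCK, n); i++)
                        for (int k = kk; k < min_int(kk + BLOCK, n); k++) {
                            double aik = A[i * n + k];
                            for (int j = jj; j < min_int(jj + BLOCK, n); j++)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }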
  • 15. Applications

Supercomputers can be used in both scientific and business applications, but they are better suited to scientific work. Large multinational banks and corporations use small supercomputers. Applications include special effects in film, weather forecasting, processing of geological data and genetic decoding data, aerodynamic and structural design, weapons of mass destruction, and simulation. Users include film makers, geological data processing agencies, national weather forecasting agencies, space agencies, genetics research organizations, government agencies, scientific laboratories, military and defence systems, research groups and large corporations.

Simulation

Duplicating an environment is called simulation. It is done for reasons such as training users, predicting or forecasting a result, and when physical experimentation is impossible or very expensive. Expensive machines are simulated before their actual construction to prevent economic losses and save time. Life-threatening stunts are simulated before being performed, which can reveal technical or other faults and prevent damage.

Movies

Supercomputers are used to produce special effects. Movies like Star Trek, Star Fighter, Babylon 5, the Terminator sequel, Dante's Peak, Asteroid, Jurassic Park, The Lost World, the Matrix sequel, Lord of the Rings and Godzilla, and many of the latest movies, have special effects generated on supercomputers.

Weather forecasting

Data collected from a worldwide network of space satellites, ground stations and airplanes is fed into a supercomputer for analysis to forecast the weather. Thousands of variables are involved in weather forecasting, and they can only be processed on a supercomputer. Accurate predictions cannot yet be made beyond about one month, because that would require still more powerful computers.

Oil exploration

To determine the most productive oil exploration sites, millions of pieces of data are processed. Processing geological data involves billions of data points and thousands of variables, a very complex calculation requiring very large computing power.

Genetic engineering

Used for the processing and decoding of genetic data, this capability is used by genetics
  • 16. scientists and engineers for research and development aimed at immunizing human beings against hereditary diseases. Since genetic data processing involves thousands of factors, supercomputers are the best choice. The latest developments, like gene mapping and cloning, also require the capabilities of supercomputers.

Space exploration

Great achievements in space exploration are simply impossible without supercomputers. The remarkable accuracy and perfection of the Pathfinder landing on Mars is another proof of the capabilities of this machine. The famous IBM RISC/6000 processor technology was used as the in-flight computer; it was modified for the project, radiation-hardened and called RAD/6000.

Aerodynamic design of airplanes

In the manufacturing of airplanes, supercomputers are used to simulate the passage of air around separate pieces of the plane and then combine the results; today's supercomputers are still unable to simulate the passage of air around an entire aircraft.

Aerospace and structural design

Simulation in aerospace and structural design was used for the space station and the space plane. These projects required extensive experiments, some of which are physically impossible. For example, the proposed space station would collapse under its own weight if built in the gravity of Earth, and the space plane must be able to take off from a runway on Earth and accelerate directly into orbit at speeds greater than 8,800 miles per hour. Most of these conditions cannot be duplicated, so the simulation and modelling for these designs and tests involve processing billions of pieces of data and solving numerous complex mathematical calculations on supercomputers.

Nuclear weapons

Simulation is also used in the production of weapons of mass destruction, to model the results of an atomic or nuclear bomb design. For this reason the US government is very cautious about the production and export of such computers to several nations. Some of the famous export deals include:
• America provided a Cray X-MP supercomputer to India for weather data processing.
• The USA supplied a supercomputer to China for peaceful nuclear research.
• International Business Machines Corporation exported an RISC/6000 SP supercomputer to Russia, which was used for nuclear and atomic research; this deal brought IBM strong criticism from the US government.
  • 17. 32 cores, which combine VPU with standard CPU, have become available. The government of India has also stated ambitions for an exaflop-range supercomputer, which they hope to complete by 2017.].In November 2014 it was reported that India is working on the Fastest supercomputer ever which is set to work at 132 Exaflops per second. Supercomputers with this new architecture could be out within the next year. The aim is to improve data processing at the memory, storage and I/O levels. That will help break down parallel computational tasks into small parts, reducing the compute cycles required to solve problems. That is one way to overcome economic and scaling limitations of parallel computing that affect conventional computing models. Memory, storage and I/O work in tandem to boost system performance, but there are bottlenecks with present supercomputing models. A lot of energy and time is wasted in continuously moving large chunks of data between processors, memory and storage. Decreasing the amount of data that has to be moved, which could help process data increase three times faster than current supercomputing models. When working with petabytes and exabytes of data, moving this amount of data is extremely inefficient and time consuming, so processing to the data can be moved by providing compute capability throughout the system hierarchy. IBM has built the world's fastest computers for decagon, including the third- and fifth-fastest, according to a recent Top 500 list. But the amount of data being put to servers is outpacing the growth of supercomputing speeds. Networks are not going faster, the chip clock speeds are not increasing and there is not a huge increase in data-access time. Applications no longer live in the classic compute microprocessors; instead application and workflow computation are distributed throughout the system hierarchy. A simple example of reducing the size of data sets by decomposing information in storage, which can then be moved to memory of the computer. That type of model can be applied to oil and gas workflow -- which typically takes months - - and it would significantly shorten the time required to make decisions about drilling. A hierarchy of storage and memory including non-volatile RAM, which means much lower latency, higher bandwidths, without the requirement to move the data all the way back to central storage. Following conventional computing architectures such as the Von Neumann's approach, in which data is put into a processor, calculated and put back in the memory. Most of the computer systems today work on the type of architecture only, which was derived in the 1940's by mathematician named John von Neumann. At the individual compute element level, we continue the Von Neumann's approach. At the level of the system, however, an additional way to compute, which is to move the evaluate to the data is provided. There are multiple ways to reduce latency in a system and reduce the amount of data which has to be moved. This saves energy as well as time. Moving computing closer to data in storage or memory is not a new concept. appliances and servers with CPUs targeted at specific workloads, and with
  • 18. systems that disaggregate storage, memory and processing subsystems into separate boxes. These designs can be improved by optimizing entire supercomputing workflows that involve simulation, modelling, visualization and complex analytics on massive data sets. The model will work in research areas like oil and gas exploration, life sciences, materials research and weather modelling. Applications will need to be written and well defined for processing at the different levels, and IBM is working with institutions, companies and researchers to define software models for key sectors.

The fastest supercomputers today are ranked with the LINPACK benchmark, a simple measurement based on floating-point operations. IBM is not ignoring the Top 500, but is promoting a different approach to enhancing supercomputing. LINPACK is good for measuring speed, but it has under-represented the utility of supercomputers, and the benchmark does not fully account for specialized processing elements such as integer units and FPGAs. The Top 500 list measures some elements of the behaviour of compute nodes, but it is not a complete characterization of workflows that require merging modelling, simulation and analytics, and many classic applications are only moderately related to what LINPACK measures. Different organizations building supercomputers have learned to tune their software to take advantage of LINPACK, which makes it a poorer measure of supercomputing performance. The actual performance of some specialized applications goes far beyond LINPACK, and IBM's argument seems convincing.

There are also companies developing computers that put a new spin on how data is accessed and interpreted. D-Wave Systems is offering what is believed to be the world's first and only quantum-based computer, which is being used by NASA, Lockheed Martin and Google for specific tasks. Other efforts are still at the experimental stage; IBM, for instance, has built an experimental computer with a chip designed to mimic a human brain.