Copyright © 2012, Elsevier Inc. All rights reserved.
Distributed and Cloud Computing
K. Hwang, G. Fox and J. Dongarra

Chapter 2: Computer Clusters for Scalable Parallel Computing
(suggested for 3 lectures in 150 minutes)

Prepared by Kai Hwang
University of Southern California
March 30, 2012
Figure 2.3 The Top-500 supercomputer performance from 1993 to 2010 (Courtesy of http://www.top500.org [25])
Figure 2.2 Architectural share of the Top-500
systems
(Courtesy of http://www.top500.org [25])
What is a computing cluster?
 A computing cluster consists of a collection of interconnected, stand-alone/complete computers that work cooperatively together as a single, integrated computing resource. A cluster exploits parallelism at the job level and supports distributed computing with higher availability.
 A typical cluster:
 Merges multiple system images into an SSI (single-system image) at certain functional levels
 Applies low-latency communication protocols
 Is more loosely coupled than an SMP, even with an SSI
Multicomputer Clusters:
 Cluster: a network of computers supported by middleware and interacting by message passing (a minimal message-passing sketch follows this list)
 PC Cluster (most Linux clusters)
 Workstation Cluster (NOW, COW)
 Server cluster or server farm
 Cluster of SMPs or ccNUMA systems
 Cluster-structured massively parallel processors (MPP) – about 85% of the Top-500 systems
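As a minimal sketch of the message-passing style named above, assuming an MPI runtime and the mpi4py package are available on the cluster and the job is launched with something like mpiexec -n 4 python hello_cluster.py (none of these tools or names is prescribed by this chapter):

# Minimal message-passing sketch with mpi4py; the MPI runtime, the package,
# and the script name are illustrative assumptions only.
from mpi4py import MPI

comm = MPI.COMM_WORLD        # communicator spanning every launched process
rank = comm.Get_rank()       # this process's id within the cluster job
size = comm.Get_size()       # total number of cooperating processes

if rank == 0:
    # Rank 0 collects a short report from every other node via messages.
    for src in range(1, size):
        print(comm.recv(source=src, tag=0))
else:
    comm.send("hello from rank %d of %d" % (rank, size), dest=0, tag=0)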
Multi-Computer Cluster Components
Operational Benefits of Clustering
 System availability (HA): a cluster offers inherently high system availability due to the redundancy of hardware, operating systems, and applications (a heartbeat/failover sketch follows this list)
 Hardware fault tolerance: a cluster has some degree of redundancy in most system components, including both hardware and software modules
 OS and application reliability: run multiple copies of the OS and applications, and tolerate individual failures through this redundancy
 Scalability: add servers to a cluster, or more clusters to a network, as application needs arise
 High performance: running cluster-enabled programs yields higher throughput
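As a rough illustration of how redundancy is turned into availability, the sketch below polls a primary node and promotes a standby when the primary stops answering. The node names, the use of ping as the heartbeat, and the timeout are all hypothetical choices, not a prescribed HA mechanism:

# Hypothetical heartbeat/failover loop; node names, ICMP ping as the
# heartbeat, and the 5-second timeout are illustrative only (Linux ping flags assumed).
import subprocess
import time

PRIMARY, STANDBY = "node1", "node2"   # hypothetical cluster hosts
TIMEOUT_S = 5

def alive(host):
    """One ping as a stand-in for a real cluster heartbeat protocol."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", str(TIMEOUT_S), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

while True:
    if not alive(PRIMARY) and alive(STANDBY):
        print("primary lost -- promoting", STANDBY, "(failover would start here)")
        break
    time.sleep(TIMEOUT_S)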
Cluster Opportunities:
 MPP/DSM: compute across multiple systems – parallelism
 Network RAM: page across other nodes' idle memory instead of local disk (a conceptual sketch follows this list)
 Software RAID: a file system supporting parallel I/O, reliability, and mass storage
 Multi-path communication: communicate across multiple networks – Ethernet, ATM, Myrinet
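Purely as a conceptual sketch of the network-RAM idea (not any particular DSM or remote-paging system), a node might export a memory page to a peer's idle RAM over the network instead of swapping it to disk. The peer address, port, and pickle wire format below are illustrative assumptions:

# Conceptual network-RAM sketch: page out to a peer's idle memory over TCP
# rather than to local disk. Peer address, port, and the pickle-based wire
# format are assumptions, not a real remote-paging protocol.
import pickle
import socket

PEER = ("node2.cluster.example", 5000)   # hypothetical idle-memory server

def page_out(page_id, data):
    """Ship one memory page to the remote idle-memory store."""
    with socket.create_connection(PEER) as s:
        s.sendall(pickle.dumps(("put", page_id, data)))

def page_in(page_id):
    """Fetch a previously exported page back from the peer."""
    with socket.create_connection(PEER) as s:
        s.sendall(pickle.dumps(("get", page_id)))
        s.shutdown(socket.SHUT_WR)            # tell the server we are done sending
        return pickle.loads(s.makefile("rb").read())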
Resource Sharing in a Cluster of Computers
Issues in Cluster Design
 Size Scalability (physical & application)
 Enhanced Availability (failure management)
 Single System Image (middleware, OS extensions)
 Fast Communication (networks & protocols)
 Load Balancing (CPU, net, memory, disk)
 Security and Encryption (clusters and grids)
 Distributed Environment (user-friendly)
 Manageability (jobs and resources)
 Programmability (simple API required)
 Applicability (cluster- and grid-awareness)
Compute Node Architectures:
Cluster Interconnects:
IBM BlueGene/L Supercomputer: The World's Fastest Message-Passing MPP, Built in 2005
Built jointly by IBM and LLNL teams and funded by the US DoE ASCI Research Program
Overview of Blue Gene/L
 Blue Gene/L is a supercomputer jointly developed by IBM and Lawrence Livermore National Laboratory
 It occupies 17 of the top 100 slots in the rankings at top500.org, including 5 of the top 10
 360 TeraFLOPS theoretical peak speed
 Largest configuration:
 At Lawrence Livermore Nat'l Lab
 Runs simulations of the US nuclear weapon stockpile
 64 physical racks
 65,536 compute nodes
 Torus interconnection network of 64 x 32 x 32 (neighbor addressing sketched below)
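To make the torus interconnect concrete, the sketch below computes the six nearest neighbors of a node in a 64 x 32 x 32 3-D torus, with wraparound at the edges; it only illustrates torus addressing in general, not Blue Gene/L's actual routing hardware:

# Neighbor addressing in a 3-D torus of the 64 x 32 x 32 shape quoted above.
# Illustrative only; this is not Blue Gene/L's routing implementation.
DIMS = (64, 32, 32)

def torus_neighbors(x, y, z):
    """Yield the six (x, y, z) coordinates adjacent to a node, with wraparound."""
    coords = (x, y, z)
    for axis in range(3):
        for step in (-1, +1):
            n = list(coords)
            n[axis] = (n[axis] + step) % DIMS[axis]   # wrap at the torus edge
            yield tuple(n)

print(sorted(torus_neighbors(0, 0, 31)))   # the z = 31 neighbor wraps back to z = 0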
Cluster Middleware
 Resides between the OS and applications and offers an infrastructure for supporting:
 Single System Image (SSI)
 System Availability (SA)
 SSI makes the collection appear as a single machine (a globalized view of system resources), e.g., telnet cluster.myinstitute.edu
 Checkpointing and process migration
What is Single System Image (SSI)?
 A single system image is the illusion, created by software or hardware, that presents a collection of resources as one integrated, powerful resource.
 SSI makes the cluster appear like a single machine to the user, to applications, and to the network.
 A cluster with multiple system images is nothing but a collection of independent computers (distributed systems in general).
Desired SSI Services
 Single Entry Point: telnet cluster.usc.edu rather than telnet node1.cluster.usc.edu (a round-robin sketch follows this list)
 Single File Hierarchy: xFS, AFS, Solaris MC Proxy
 Single Control Point: management from a single GUI
 Single Virtual Networking over multiple physical networks
 Single Memory Space: Network RAM / DSM
 Single Job Management: GlUnix, Codine, LSF, etc.
 Single User Interface: like CDE in Solaris/NT
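One common way to realize a single entry point is to advertise one cluster name and spread incoming logins across several interchangeable login nodes. The sketch below does this with a trivial round-robin picker; the first host name comes from the slide above, the others are hypothetical, and real clusters typically do this in DNS or a front-end load balancer rather than in client code:

# Single-entry-point sketch: one advertised name, many interchangeable login
# nodes, picked round-robin. Illustrative only.
import itertools

LOGIN_NODES = ["node1.cluster.usc.edu", "node2.cluster.usc.edu",
               "node3.cluster.usc.edu"]          # node2/node3 are hypothetical
_next_node = itertools.cycle(LOGIN_NODES)

def entry_point(alias="cluster.usc.edu"):
    """Map the single advertised cluster name onto one concrete login node."""
    chosen = next(_next_node)
    print(alias, "->", chosen)
    return chosen

entry_point()   # users always type the same name; the cluster picks the node
entry_point()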
Single Entry Point to access a
Cluster from any physical point
Four SSI Features in Networking, I/O Space, Memory Sharing, and Cluster Control
Availability Support Functions
 Single I/O Space (SIO):
 Any node can access any peripheral or disk device without knowledge of its physical location.
 Single Process Space (SPS):
 Every process has a cluster-wide process ID, and processes communicate through signals, pipes, etc., as if they were on a single node.
 Checkpointing and Process Migration (sketched below):
 Save process state and intermediate results from memory for rollback recovery after failures.
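A minimal sketch of the checkpoint-and-rollback idea, assuming the application state fits in a picklable dict and that a shared path is visible to whichever node restarts the job; real cluster checkpointing saves whole process images transparently, which this does not attempt:

# Minimal checkpoint/rollback-recovery sketch. The shared path is a
# hypothetical location on a cluster-wide file system.
import os
import pickle

CKPT = "/shared/app.ckpt"        # hypothetical path on shared storage

def save_checkpoint(state):
    with open(CKPT + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(CKPT + ".tmp", CKPT)      # atomic rename: never a torn checkpoint

def recover():
    """Roll back to the last saved state, or start fresh if none exists."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "partial_sum": 0}

state = recover()                        # a restarted node resumes here
for step in range(state["step"], 100):
    state["partial_sum"] += step
    state["step"] = step + 1
    if step % 10 == 0:
        save_checkpoint(state)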
Distributed RAID – The RAID-x Architecture (block placement sketched below)
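As a generic illustration of distributed RAID (not the actual RAID-x data layout), the sketch below stripes data blocks across the nodes' local disks and places each block's mirror on a different node, so the loss of any single node leaves every block available:

# Generic distributed-RAID placement sketch: stripe blocks across nodes and
# mirror each block on a different node. Illustrative only; not the RAID-x layout.
def place_blocks(n_blocks, n_nodes):
    """Return {block: (primary_node, mirror_node)} with mirrors off-node."""
    layout = {}
    for b in range(n_blocks):
        primary = b % n_nodes                # stripe across the nodes
        mirror = (primary + 1) % n_nodes     # mirror lands on a different node
        layout[b] = (primary, mirror)
    return layout

# Any single node can fail and every block still has a surviving copy.
print(place_blocks(8, 4))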
Single Points of Failure in SMP and Clusters
Figure 2.4 Country share of the Top-500
supercomputers over time [25]
Figure 2.5 Application-area share of Top-500 systems over time.
(Courtesy of http://www.top500.org [25])
Top-500 Release in June 2010
The Cray XT5 Jaguar Supercomputer
IBM Roadrunner System
A proposed NVIDIA GPU processor chip architecture with 128 cores (160 GFLOPS each) plus 8 latency processors (LPs) connected to 1,024 SRAMs (L2 caches) by a NoC, where the MCs are the memory controllers connecting to off-chip DRAMs and NI is the network interface to the next level of network (Courtesy of Bill Dally, reprinted with permission [10]).
The architecture of a GPU cluster built with a hierarchical network of processor chips (GPUs) that can deliver 2.6 PFlops per cabinet. It takes at least N = 400 cabinets to achieve the desired PFlops or EFlops performance (arithmetic check below). (Courtesy of Bill Dally, reprinted with permission [10].)
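A quick back-of-the-envelope check of the figures quoted on these two slides, using only the numbers stated there (128 cores at 160 GFLOPS per chip, 2.6 PFlops per cabinet, N = 400 cabinets):

# Sanity-check arithmetic using only the numbers quoted on the two slides above.
cores_per_chip = 128
gflops_per_core = 160
print("per chip:", cores_per_chip * gflops_per_core / 1e3, "TFLOPS")   # 20.48 TFLOPS

pflops_per_cabinet = 2.6
cabinets = 400
print("system:", pflops_per_cabinet * cabinets / 1e3, "EFLOPS")        # 1.04 EFLOPS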
Papers/Books on Clusters and MPPs:
1. G. Bell, J. Gray, and A. Szalay, "Petascale Computational Systems: Balanced Cyber-Infrastructure in a Data-Centric World," IEEE Computer, 2006.
2. K. Hwang, G. Fox, and J. Dongarra, Distributed and Cloud Computing, Chapter 2, Morgan Kaufmann, 2011.
3. G. F. Pfister, In Search of Clusters, 2nd edition, Prentice-Hall, NJ, 2001.

Updated Studies: Conduct an updated study of the top 5 systems in the latest Top-500 list. Produce an updated Table 2.3 and describe the No. 1 system in detail, like the treatment of Tianhe-1A, Jaguar, and Roadrunner in 2011, 2010, and 2009, respectively, in Chapter 2.