Lesson2 Presentation Transcript

  • Overview of Parallel Systems Architectures
    • Grand challenge problems
    • Shared memory multiprocessors
    • Distributed memory multicomputers
    • Static/direct link interconnects
    • Cluster computers
    • Computational grids
    • Formal classification of parallel architectures
    Parallel and distributed computers
  • Demand for Computational Speed
    • Continual demand for greater computational speed from a computer system than is currently possible
    • Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems
    • Computations must be completed within a reasonable time period
  • Grand Challenge Problems
    • A grand challenge problem is one that cannot be solved in a reasonable amount of time with today’s computers
    • Obviously an execution time of 10 years is always unreasonable
    • However, a grand challenge problem is not unsolvable
  • Examples
    • Global weather forecasting
    • Modeling the motion of astronomical bodies
    • Modeling large DNA structures
    • Simulating the human brain
  • Brain simulation
    • The human brain contains 100,000,000,000 (10^11) neurons
    • Each neuron receives input from 1000 others
    • Computing one change of brain “state” therefore requires 10^14 calculations
    • If each calculation could be done in 1 µs, one state change would take ~3 years to compute
    • (This problem also presents grand challenges in storage requirements)
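The brain simulation estimate can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the variable names are illustrative, and it assumes the per-calculation time on the slide is 1 microsecond and that one calculation is needed per neuron input.

```python
# Back-of-envelope check of the brain simulation estimate.
NEURONS = 100_000_000_000      # 10^11 neurons (from the slide)
INPUTS_PER_NEURON = 1_000      # each neuron receives input from 1000 others
SECONDS_PER_CALC = 1e-6        # assumption: 1 microsecond per calculation

calculations = NEURONS * INPUTS_PER_NEURON      # 10^14 calculations
seconds = calculations * SECONDS_PER_CALC       # about 10^8 seconds
years = seconds / (365 * 24 * 3600)             # seconds in a 365-day year

print(f"{calculations:.0e} calculations, about {years:.1f} years per state change")
```

At 10^8 seconds per state change the run time is a little over 3 years, which is what makes this a grand challenge problem on a single processor.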
  • Grand challenge problems are found in many fields of scientific research
    • Astronomy and astrophysics
    • Fluid dynamics
    • Meso-macro scale environmental modeling
    • Biomedical imaging
    • Molecular biology
    • Molecular design
    • Cognition
    • Nuclear power and weapons simulation
  • Parallel Computing
    • Using more than one processor to solve a problem
    • Motives
      • The idea is that n processors operating simultaneously can achieve the result n times faster. In practice this ideal is not reached, for various reasons.
      • Fault tolerance
      • Large amount of memory available
  • Background
    • Parallel computers (computers with more than one processor) and their programming have been around for more than 40 years
  • Gill writes in 1958:
    • “... There is therefore nothing new in the idea of parallel programming, but its application to computers. The author cannot believe that there will be any insuperable difficulty in extending it to computers. It is not to be expected that the necessary programming techniques will be worked out overnight. Much experimenting remains to be done. After all, the techniques that are commonly used in programming today were only won at the cost of considerable toil several years ago. In fact the advent of parallel programming may do something to revive the pioneering spirit in programming which seems at the present to be degenerating into a rather dull and routine occupation ...”
    • Gill, S. (1958), “Parallel Programming,” The Computer Journal, vol. 1, April, pp. 2-10.
  • Conventional Computer
    • Consists of a processor executing a program stored in a (main) memory
    • Instructions pass from main memory to the processor; data passes to and from the processor
    • Each object in main memory is located by its address
    • Addresses start at 0 and extend to 2^n - 1, where n is the number of bits in the address
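The address range can be made concrete with a small sketch. The helper name `address_range` is hypothetical and exists only for illustration.

```python
def address_range(n_bits: int) -> tuple[int, int]:
    """Return the lowest and highest address reachable with an n-bit address."""
    return 0, 2 ** n_bits - 1

# Print the range for two common address widths.
for n in (16, 32):
    low, high = address_range(n)
    print(f"{n}-bit addresses run from {low} to {high}")
```

For example, a 16-bit address can name 65,536 distinct locations, numbered 0 to 65,535.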
  • Types of Parallel Computers
    • Shared memory multiprocessor (SMM)
    • Distributed memory multicomputer (DMM)
  • Shared Memory Multiprocessor System
    • Natural way to extend the single processor model
    • Multiple processors are connected to multiple memory modules through an interconnection network
    • All memory is shared across all processors via a single address space
  • SMM Examples
    • Dual and quad Pentiums
    • Power Mac G5s
      • Dual processor (2 GHz each)
  • Quad Pentium Shared Memory Multiprocessor
    • [Figure: four processors, each with its own L1 cache, L2 cache and bus interface, attached to a processor/memory bus; the bus connects through a memory controller to the shared memory and through an I/O interface to the I/O bus]
  • Shared memory
    • Any memory location is accessible by any of the processors
    • A single address space exists, meaning that each memory location is given a unique address within a single range of addresses
    • Generally, shared memory programming is more convenient, although the programmer must control access to shared data
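What "access to shared data controlled by the programmer" means can be sketched with threads and a lock. This is an illustration only: Python threads stand in for processors sharing one address space, and the names `counter`, `add_many` and `run` are made up for the example.

```python
import threading

counter = 0                    # shared data: every thread sees the same variable
lock = threading.Lock()        # access control supplied by the programmer

def add_many(times: int) -> None:
    """Increment the shared counter, holding the lock for each update."""
    global counter
    for _ in range(times):
        with lock:             # only one thread may update at a time
            counter += 1

def run(num_threads: int = 4, times: int = 10_000) -> int:
    """Start the threads, wait for them all, and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

print(run())   # 4 threads x 10,000 increments each
```

Because every update happens while the lock is held, the final count always equals the number of threads times the number of increments. Without the lock, concurrent read-modify-write updates could be lost.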
  • Building SMM systems
    • Building SMM machines with more than 4 sockets/processors is very difficult and very expensive
    • 8-socket, 32-processor Opteron systems are available relatively cheaply
    • E.g. Sun Microsystems E10000 “Starfire” server
      • 64 processors
      • Price: several million US dollars
  • Distributed Memory Multicomputer
    • Complete computers, each with a processor and local memory, linked by some type of interconnection network
    • Computers exchange messages over the interconnection network
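The message-passing style that goes with distributed memory can be sketched as follows. This is a simulation under stated assumptions, not a real multicomputer: threads stand in for the separate computers, a `queue.Queue` stands in for the interconnection network, and the function names are hypothetical. The point is that each "computer" touches only its own local data and shares results only by sending a message.

```python
import queue
import threading

def computer(local_data: list[int], network: queue.Queue) -> None:
    """One 'computer': computes on its local memory, then sends a message."""
    network.put(sum(local_data))

def distributed_sum(data: list[int], num_computers: int = 4) -> int:
    """Split the data across the computers and add up their partial sums."""
    network: queue.Queue = queue.Queue()   # stand-in for the interconnect
    nodes = [
        threading.Thread(target=computer,
                         args=(data[rank::num_computers], network))
        for rank in range(num_computers)
    ]
    for node in nodes:
        node.start()
    # Receive exactly one message from each computer.
    total = sum(network.get() for _ in range(num_computers))
    for node in nodes:
        node.join()
    return total

print(distributed_sum(list(range(1, 101))))   # sum of 1..100 is 5050
```

On a real multicomputer the queue would be replaced by a message-passing library running over the network.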
  • Interconnection networks
    • Static/direct link interconnection networks
    • Cluster interconnects/ networks
  • Static network message passing multicomputers
    • Computers connected by direct links
    • [Figure: nodes labelled P, M and C joined by direct links]
  • Static Link Interconnection Topologies
    • Ring
    • Tree
    • 2-D and 3-D arrays
    • Hypercubes
  • Mesh (2D array)
    • Each node is a computer (i.e. processor/memory)
  • Cube (3D array)
    • Wire up the connections to form a 3-D lattice, with computers at the lattice points (the figure labels the vertices of one cube 000 to 111)
    • i.e. each interior computer is directly wired to 6 adjacent computers (two per dimension)
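  • Tesseract (4D hypercube)
    • Hypercubes were popular in the 1980s but are not now
    • i.e. in a 4-D array each interior computer is directly wired to 8 adjacent computers (two per dimension); in a binary 4-D hypercube of 16 computers, each computer has 4 direct links (one per dimension)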
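The neighbour counts for these topologies can be checked with a short sketch. The function names are hypothetical. It assumes two conventions: in a binary hypercube, nodes carry bit labels such as 000 to 111 and two nodes are linked when their labels differ in exactly one bit; in a lattice (array), a node is linked to the nodes one step away along each axis.

```python
def hypercube_neighbours(node: int, dimensions: int) -> list[int]:
    """Nodes whose binary label differs from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dimensions)]

def label(node: int, dimensions: int) -> str:
    """Binary label of a node, zero-padded to one bit per dimension."""
    return format(node, f"0{dimensions}b")

def lattice_neighbours(coord: tuple[int, ...], side: int) -> list[tuple[int, ...]]:
    """In-bounds neighbours one step away along each axis of a lattice."""
    result = []
    for axis in range(len(coord)):
        for step in (-1, 1):
            value = coord[axis] + step
            if 0 <= value < side:
                result.append(coord[:axis] + (value,) + coord[axis + 1:])
    return result

# Neighbours of node 000 in a 3-D binary hypercube.
print([label(n, 3) for n in hypercube_neighbours(0b000, 3)])
# Interior nodes of a 3-D and a 4-D lattice with 3 nodes per side.
print(len(lattice_neighbours((1, 1, 1), 3)))
print(len(lattice_neighbours((1, 1, 1, 1), 3)))
```

An interior lattice node has two neighbours per dimension (6 in 3-D, 8 in 4-D), while a binary hypercube node has one per dimension.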
  • Thinking Machines Corp. CM-2 (The Connection Machine)
    • Released: 1987
    • Processors: 65536
    • Memory: 512 MB
    • I/O channels: 8
    • Transfer rate: 320 MB/s
    • 4-D hypercube interconnect
    • One preserved in the National Museum of American History, Smithsonian Institution
  • Cluster interconnects
    • Static link interconnects fell out of favour during the 1990s: too expensive!
    • Networks of workstations (NOWs) became a very attractive alternative to expensive supercomputers and parallel computer systems for high performance computing in the early 1990s
  • Key advantages
    • Very high performance workstations and PCs readily available at low cost
    • The latest processors can easily be incorporated into the system as they become available (future-proof)
    • Existing software can be used or modified
  • Beowulf clusters
    • A group of interconnected “commodity” computers achieving high performance with low cost
    • Typically built with commodity interconnects (high speed Ethernet) and the Linux OS
    • The name Beowulf comes from the NASA Goddard Space Flight Center cluster project
  • Cluster interconnect hardware
    • Originally fast Ethernet on low cost clusters
    • Gigabit Ethernet - easy upgrade path
    • More specialized/higher performance
      • Myrinet - 2.4 Gbits/sec
      • cLan
      • SCI (Scalable Coherent Interface)
      • QsNet
      • InfiniBand
  • Symmetrical Multiprocessor cluster
    • Can have a cluster of shared memory computers (symmetrical multiprocessors)
    • [Figure: SMP Computer 0 to SMP Computer n-1, each with its own processors and memories, joined by an interconnection]
  • Earth Simulator at JAMSTEC, Yokohama, Japan
    • 640 processor nodes
    • 8 vector processors/16 GB per node
    • Some applications:
      • Ocean-atmosphere simulations
      • Interior Earth simulations
      • Holistic algorithm research
  • Massey’s “Helix” Cluster
    • Beowulf cluster of 65 Linux PC boxes
    • 65 nodes, each with 2 × AMD processors and 1 GB
    • Ethernet interconnect
    • See http://helix.massey.ac.nz
  • Cluster of rack mounted servers
    • Apple Xserve G5
    • Computing unit is a blade with dual processors and shared memory
    • Interconnect by Gigabit Ethernet
    • Much more space efficient than clustering PC boxes, but more expensive
  • Computational Grids
    • The components of a parallel computer could be interconnected over the internet
    • Grid computing involves the application of internet distributed computing resources to a single problem
  • E.g. seti@home
    • A radio telescope scans the sky looking for alien signals
    • A server sends “work units” over the internet to internet-enabled home PCs and collects the results
    • ~4000 users in a 24 hour period
    • See http://setiathome.ssl.berkeley.edu
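The "server sends work units and collects the results" pattern can be sketched locally. This is not the seti@home protocol: a thread pool stands in for the internet-connected home PCs, and `analyse`, `server` and the "strongest sample" analysis are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def analyse(work_unit: list[float]) -> float:
    """Hypothetical analysis: report the strongest sample in a work unit."""
    return max(work_unit)

def server(signal: list[float], unit_size: int, num_pcs: int = 4) -> list[float]:
    """Cut the signal into work units, farm them out, and collect the results."""
    work_units = [signal[i:i + unit_size]
                  for i in range(0, len(signal), unit_size)]
    with ThreadPoolExecutor(max_workers=num_pcs) as pcs:
        # map() returns the results in the same order as the work units.
        return list(pcs.map(analyse, work_units))

print(server([1.0, 5.0, 2.0, 9.0, 3.0, 4.0], unit_size=2))
```

The work units are independent of each other, which is what lets them be handed to whichever PC happens to be available.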
  • Grid Computing
    • An extreme way of achieving parallelism
    • Involves developing software tools that allow internet distributed computing resources to function effectively as one machine
    • Resources need not be just processors - they can be databases, robotic systems, etc
    • See http://www.gridcomputing.com
  • Classification of Parallel Architectures
    • Flynn (1966) created a classification for computers based upon instruction streams and data streams
    • There are four types:
    • SISD Single instruction stream single data stream
    • SIMD Single instruction stream multiple data stream
    • MISD Multiple instruction stream single data stream
    • MIMD Multiple instruction stream multiple data stream
  • Single Instruction Stream Single Data Stream (SISD)
    • In a single processor computer, a single stream of instructions is generated by the program
    • The instructions operate on a single stream of data items
    • Algorithms for SISD computers do not contain any parallelism
    • [Figure: a control unit sends the instruction stream to the processor; the data stream passes between the processor and memory]
  • Single Instruction Stream Multiple Data Stream (SIMD)
    • A specially designed computer in which a single instruction stream comes from a single program, but multiple data streams exist
    • The instructions from the program are broadcast to more than one processor
    • Each processor executes the same instruction in synchronism, but using different data
    • Example: vector computers
  • SIMD Architecture
    • The processors operate synchronously and a global clock is used to ensure lockstep operation
    • [Figure: a single control unit broadcasts the instruction stream to processors P1, P2, ..., PN; each processor has its own data stream to the shared memory or interconnection network]
  • SIMD application example
    • Add two matrices: C = A + B
    • Say we have two matrices A and B of order 2 and we have 4 processors, i.e. we wish to calculate:
      • C11 = A11 + B11
      • C12 = A12 + B12
      • C21 = A21 + B21
      • C22 = A22 + B22
    • The same instruction (add the two numbers) is sent to each processor, but each processor receives different data
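The matrix addition example can be written out as a sketch in which one instruction is applied to four data pairs. It runs sequentially, so it only models the SIMD idea. The sample values and the names `add` and `simd_add` are made up for illustration.

```python
def add(a: int, b: int) -> int:
    """The single instruction that is broadcast to every processor."""
    return a + b

def simd_add(A: list[list[int]], B: list[list[int]]) -> list[list[int]]:
    """Apply the same instruction to every (A[i][j], B[i][j]) data pair."""
    rows, cols = len(A), len(A[0])
    # One data stream per processor: processor (i, j) receives A[i][j], B[i][j].
    data_streams = [(A[i][j], B[i][j]) for i in range(rows) for j in range(cols)]
    # On a SIMD machine these additions would happen in lockstep.
    results = [add(a, b) for a, b in data_streams]
    return [results[r * cols:(r + 1) * cols] for r in range(rows)]

A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
print(simd_add(A, B))
```

With these sample matrices the four "processors" compute 1+10, 2+20, 3+30 and 4+40, one element of C each.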
  • Multiple Instruction Stream Single Data Stream (MISD)
    • A computer with multiple processors each sharing a common memory
    • There are multiple streams of instructions and one stream of data
    • [Figure: several control units, each driving its own processor; all processors take their data from one memory]
  • MISD example
    • Check whether a number Z is prime
    • Each processor is assigned a set of test divisors in its instruction stream
    • Each processor takes Z as input and tries to divide it by its divisors
    • MISD is awkward to implement and such machines are just experimental
    • No commercial MISD machine exists
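The primality example can be sketched as follows. It runs sequentially and only models the MISD idea: every "processor" sees the same Z but works through its own set of divisors. Limiting the divisors to 2..sqrt(Z) and dealing them out round-robin are assumptions made for this sketch, and the function names are hypothetical.

```python
import math

def make_divisor_sets(z: int, num_processors: int) -> list[list[int]]:
    """Deal the test divisors 2..sqrt(z) out to the processors round-robin."""
    divisors = range(2, math.isqrt(z) + 1)
    return [list(divisors[p::num_processors]) for p in range(num_processors)]

def processor(z: int, divisors: list[int]) -> bool:
    """One processor: True if none of its own divisors divides z."""
    return all(z % d != 0 for d in divisors)

def is_prime(z: int, num_processors: int = 4) -> bool:
    """z is prime if every processor fails to find a divisor."""
    if z < 2:
        return False
    return all(processor(z, divisors)
               for divisors in make_divisor_sets(z, num_processors))

print(is_prime(97), is_prime(91))   # 91 = 7 x 13
```

Each processor runs a different sequence of division tests on the single input Z.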
  • Multiple Instruction Stream Multiple Data Stream (MIMD)
    • General purpose multiprocessor system: each processor has a separate program, and one instruction stream is generated from each program for each processor
    • Each instruction stream operates upon different data
    • The most general and most useful of our classifications
  • MIMD architecture
    • Each processor operates under the control of an instruction stream issued by its own control unit
    • Processors operate asynchronously in general
    • [Figure: control units C1, C2, ..., CN each drive one of the processors P1, P2, ..., PN; the processors are joined by shared memory or an interconnection network]
  • MIMD computers
    • MIMD machines with shared memory are described as tightly coupled (quad Pentiums, Mac G5s, …)
    • MIMD machines with an interconnection network are described as loosely coupled (Beowulf and rack mounted clusters, etc.)
    • We will work with MIMD computers in this course
  • MIMD program structure
    • Multiple Program Multiple Data (MPMD)
      • Each processor will have its own program to execute
    • Single Program Multiple Data (SPMD)
      • A single source program is written, and each processor executes its own personal copy of the program
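The SPMD structure can be sketched with one function that every "processor" runs, using its rank to pick its own share of the data. This is an illustration under assumptions: threads stand in for the processors, and `program`, `run_spmd`, `rank` and `size` are names chosen for the example, not taken from the course.

```python
import threading

def program(rank: int, size: int, data: list[int], results: list[int]) -> None:
    """The single program: every processor runs this same code."""
    my_part = data[rank::size]       # the rank selects this copy's data
    results[rank] = sum(my_part)     # each copy writes only its own slot

def run_spmd(data: list[int], size: int = 4) -> list[int]:
    """Launch `size` copies of the program and gather their results."""
    results = [0] * size
    copies = [threading.Thread(target=program, args=(rank, size, data, results))
              for rank in range(size)]
    for copy in copies:
        copy.start()
    for copy in copies:
        copy.join()
    return results

print(run_spmd(list(range(8))))   # one partial sum per rank
```

Under MPMD, each thread would instead be given a different target function.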