Multiprocessor architecture and programming
 


Parallel Computing Architecture and Programming Techniques.

Antecedents of Parallel Computing
Introduction to Parallel Architectures
Parallel Programming Concepts
Parallel Design Patterns
Performance & Optimization
Parallel Compilers
Actual Cases
Future of Parallel Architectures


Multiprocessor architecture and programming: Presentation Transcript

  • Parallel Computing Architecture & Programming Techniques Raul Goycoolea S. Solution Architect Manager Oracle Enterprise Architecture Group
  • <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 216 February 2012
  • Antecedents of Parallel Computing
  • The “Software Crisis” “To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem." -- E. Dijkstra, 1972 Turing Award Lecture Raul Goycoolea S. Multiprocessor Programming 416 February 2012
  • The First Software Crisis • Time Frame: ’60s and ’70s • Problem: Assembly Language Programming; computers could handle larger, more complex programs • Needed to get Abstraction and Portability without losing Performance Raul Goycoolea S. Multiprocessor Programming 516 February 2012
  • How Did We Solve The First Software Crisis? • High-level languages for von Neumann machines: FORTRAN and C • Provided a “common machine language” for uniprocessors • Common properties: single flow of control, single memory image • Differences: register file, ISA, functional units Raul Goycoolea S. Multiprocessor Programming 616 February 2012
  • The Second Software Crisis • Time Frame: ’80s and ’90s • Problem: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers, even though computers could handle larger, more complex programs • Needed to get Composability, Malleability and Maintainability; high performance was not an issue, it was left to Moore’s Law Raul Goycoolea S. Multiprocessor Programming 716 February 2012
  • How Did We Solve the Second Software Crisis? • Object Oriented Programming C++, C# and Java • Also… Better tools • Component libraries, Purify Better software engineering methodology • Design patterns, specification, testing, code reviews Raul Goycoolea S. Multiprocessor Programming 816 February 2012
  • Today: Programmers are Oblivious to Processors • Solid boundary between Hardware and Software • Programmers don’t have to know anything about the processor High level languages abstract away the processors Ex: Java bytecode is machine independent Moore’s law does not require the programmers to know anything about the processors to get good speedups • Programs are oblivious of the processor and work on all processors A program written in the ’70s using C still works and is much faster today • This abstraction provides a lot of freedom for the programmers Raul Goycoolea S. Multiprocessor Programming 916 February 2012
  • The Origins of a Third Crisis • Time Frame: 2005 to 20?? • Problem: Sequential performance is left behind by Moore’s law • Needed continuous and reasonable performance improvements to support new features and larger datasets • While sustaining portability, malleability and maintainability without unduly increasing the complexity faced by the programmer; critical to keep up with the current rate of evolution in software Raul Goycoolea S. Multiprocessor Programming 1016 February 2012
  • The Road to Multicore: Moore’s Law. Chart: performance (vs. VAX-11/780) grew at roughly 25%/year until the mid-1980s and roughly 52%/year afterwards, while transistor counts climbed from the 8086 and 286 through the 386, 486, Pentium, P2, P3, P4 and Itanium to over a billion transistors (Itanium 2), 1978–2016. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. Raul Goycoolea S. Multiprocessor Programming 1116 February 2012
  • The Road to Multicore: Uniprocessor Performance (SPECint). Chart: SPECint2000 results, 1985–2007, for Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4 and Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, and AMD K6, K7 and x86-64. Raul Goycoolea S. Multiprocessor Programming 1216 February 2012
  • The Road to Multicore: Uniprocessor Performance (SPECint) General-purpose unicores have stopped historic performance scaling Power consumption Wire delays DRAM access latency Diminishing returns of more instruction-level parallelism Raul Goycoolea S. Multiprocessor Programming 1316 February 2012
  • Power Consumption (watts). Chart: power consumption, 1985–2007, for the same processor families (Intel 386 through Itanium, Alpha, Sparc, MIPS, HP PA, PowerPC, AMD), rising from around 1 W toward 100 W. Raul Goycoolea S. Multiprocessor Programming 1416 February 2012
  • Power Efficiency (watts/spec). Chart: watts per SPEC, 1982–2006, for the same processor families, on a 0–0.7 scale. Raul Goycoolea S. Multiprocessor Programming 1516 February 2012
  • Range of a Wire in One Clock Cycle. Chart: for a 400 mm2 die (data from the SIA Roadmap), the fraction of the die reachable in one clock cycle shrinks as process geometry scales from 0.26 to 0.06 microns and clock rates rise from 700 MHz to 13.5 GHz, 1996–2014. Raul Goycoolea S. Multiprocessor Programming 1616 February 2012
  • DRAM Access Latency. Chart: processor performance improves about 60%/yr (2X/1.5 yr) while DRAM improves about 9%/yr (2X/10 yrs), 1980–2004. • Access times are a speed of light issue • Memory technology is also changing: SRAMs are getting harder to scale, DRAM is no longer cheapest cost/bit • Power efficiency is an issue here as well Raul Goycoolea S. Multiprocessor Programming 1716 February 2012
  • CPU Architecture: heat is becoming an unmanageable problem. Chart: power density (W/cm2) from the 4004, 8008, 8080, 8085, 8086, 286, 386 and 486 to the Pentium, heading toward hot-plate, nuclear-reactor, rocket-nozzle and Sun’s-surface levels; there is a cube relationship between cycle time and power. Source: Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W). Raul Goycoolea S. Multiprocessor Programming 1816 February 2012
  • Diminishing Returns • The ’80s: Superscalar expansion 50% per year improvement in performance Transistors applied to implicit parallelism - pipelined processor (10 CPI --> 1 CPI) • The ’90s: The Era of Diminishing Returns Squeezing out the last implicit parallelism 2-way to 6-way issue, out-of-order issue, branch prediction 1 CPI --> 0.5 CPI Performance below expectations; projects delayed & canceled • The ’00s: The Beginning of the Multicore Era The need for Explicit Parallelism Raul Goycoolea S. Multiprocessor Programming 1916 February 2012
  • Unicores are headed for extinction; now everything is multicore. Timeline (2H 2004 – 2H 2006): IBM Power 4 and 5 dual cores since 2001; MIT Raw, 16 cores, 2002; Intel Pentium D (Smithfield); cancelled Intel Tejas & Jayhawk unicore (4 GHz P4); Intel Pentium Extreme 3.2 GHz dual core; AMD Opteron dual core; Intel Montecito, 1.7 billion transistors, dual-core IA/64; Intel Tanglewood, dual-core IA/64; Intel Dempsey, dual-core Xeon; Intel Yonah, dual-core mobile; IBM Power 6 dual core; Sun Olympus and Niagara, 8 processor cores; IBM Cell, scalable multicore.
  • Multicores Future. Chart: number of cores per chip, 1970–2010, from unicores (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, PA-8800, Power4, Power6, Opteron, Xeon MP, PExtreme, Tanglewood, Yonah) to multicores with 2 to 512 cores (Cell, Opteron 4P, Niagara, Xbox360, Raw, Cavium Octeon, Raza XLR, CSR-1, Intel Tflops, Picochip PC102, Cisco, Broadcom 1480, Ambric AM2045). Raul Goycoolea S. Multiprocessor Programming 2116 February 2012
  • <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 2216 February 2012
  • Introduction to Parallel Architectures
  • Traditionally, software has been written for serial computation: • To be run on a single computer having a single Central Processing Unit (CPU) • A problem is broken into a discrete series of instructions • Instructions are executed one after another • Only one instruction may execute at any moment in time What is Parallel Computing? Raul Goycoolea S. Multiprocessor Programming 2416 February 2012
  • What is Parallel Computing? In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: • To be run using multiple CPUs • A problem is broken into discrete parts that can be solved concurrently • Each part is further broken down to a series of instructions • Instructions from each part execute simultaneously on different CPUs Raul Goycoolea S. Multiprocessor Programming 2516 February 2012
  • Options in Parallel Computing? The compute resources might be: • A single computer with multiple processors; • An arbitrary number of computers connected by a network; • A combination of both. The computational problem should be able to: • Be broken apart into discrete pieces of work that can be solved simultaneously; • Execute multiple program instructions at any moment in time; • Be solved in less time with multiple compute resources than with a single compute resource. Raul Goycoolea S. Multiprocessor Programming 2616 February 2012
  • 27
  • The Real World is Massively Parallel • Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence. For example: • Galaxy formation • Planetary movement • Weather and ocean patterns • Tectonic plate drift • Rush hour traffic • Automobile assembly line • Building a jet • Ordering a hamburger at the drive through. Raul Goycoolea S. Multiprocessor Programming 2816 February 2012
  • Architecture Concepts Von Neumann Architecture • Named after the Hungarian mathematician John von Neumann who first authored the general requirements for an electronic computer in his 1945 papers • Since then, virtually all computers have followed this basic design, differing from earlier computers which were programmed through “hard wiring” • Comprised of four main components: • Memory • Control Unit • Arithmetic Logic Unit • Input/Output • Read/write, random access memory is used to store both program instructions and data • Program instructions are coded data which tell the computer to do something • Data is simply information to be used by the program • Control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task. • Arithmetic Logic Unit performs basic arithmetic operations • Input/Output is the interface to the human operator Raul Goycoolea S. Multiprocessor Programming 2916 February 2012
  • Flynn’s Taxonomy • There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy. • Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple. • The matrix below defines the 4 possible classifications according to Flynn: Raul Goycoolea S. Multiprocessor Programming 3016 February 2012
  • Single Instruction, Single Data (SISD): • A serial (non-parallel) computer • Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle • Single Data: Only one data stream is being used as input during any one clock cycle • Deterministic execution • This is the oldest and even today, the most common type of computer • Examples: older generation mainframes, minicomputers and workstations; most modern day PCs. Raul Goycoolea S. Multiprocessor Programming 3116 February 2012
  • Single Instruction, Single Data (SISD): Raul Goycoolea S. Multiprocessor Programming 3216 February 2012
  • Single Instruction, Multiple Data (SIMD): • A type of parallel computer • Single Instruction: All processing units execute the same instruction at any given clock cycle • Multiple Data: Each processing unit can operate on a different data element • Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing. • Synchronous (lockstep) and deterministic execution • Two varieties: Processor Arrays and Vector Pipelines • Examples: • Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV • Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10 • Most modern computers, particularly those with graphics processor units (GPUs) employ SIMD instructions and execution units. Raul Goycoolea S. Multiprocessor Programming 3316 February 2012
  • Single Instruction, Multiple Data (SIMD): ILLIAC IV MasPar TM CM-2 Cell GPU Cray X-MP Cray Y-MP Raul Goycoolea S. Multiprocessor Programming 3416 February 2012
  • • A type of parallel computer • Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams. • Single Data: A single data stream is fed into multiple processing units. • Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971). • Some conceivable uses might be: • multiple frequency filters operating on a single signal stream • multiple cryptography algorithms attempting to crack a single coded message. Multiple Instruction, Single Data (MISD): Raul Goycoolea S. Multiprocessor Programming 3516 February 2012
  • Multiple Instruction, Single Data (MISD): Raul Goycoolea S. Multiprocessor Programming 3616 February 2012
  • • A type of parallel computer • Multiple Instruction: Every processor may be executing a different instruction stream • Multiple Data: Every processor may be working with a different data stream • Execution can be synchronous or asynchronous, deterministic or non-deterministic • Currently, the most common type of parallel computer - most modern supercomputers fall into this category. • Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs. Note: many MIMD architectures also include SIMD execution sub-components Multiple Instruction, Multiple Data (MIMD): Raul Goycoolea S. Multiprocessor Programming 3716 February 2012
  • Multiple Instruction, Multiple Data (MIMD): Raul Goycoolea S. Multiprocessor Programming 3816 February 2012
  • Multiple Instruction, Multiple Data (MIMD): IBM Power HP Alphaserver Intel IA32/x64 Oracle SPARC Cray XT3 Oracle Exadata/Exalogic Raul Goycoolea S. Multiprocessor Programming 3916 February 2012
  • Parallel Computer Memory Architecture Shared Memory Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space. Multiple processors can operate independently but share the same memory resources. Changes in a memory location effected by one processor are visible to all other processors. Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA. Uniform Memory Access (UMA): • Most commonly represented today by Symmetric Multiprocessor (SMP) machines • Identical processors Non-Uniform Memory Access (NUMA): • Often made by physically linking two or more SMPs • One SMP can directly access memory of another SMP 40 Raul Goycoolea S. Multiprocessor Programming 4016 February 2012
  • Parallel Computer Memory Architecture Shared Memory 41 Shared Memory (UMA) Shared Memory (NUMA) Raul Goycoolea S. Multiprocessor Programming 4116 February 2012
  • Basic structure of a centralized shared-memory multiprocessor. Diagram: four processors, each with one or more levels of cache, sharing one physical memory. Multiple processor-cache subsystems share the same physical memory, typically connected by a bus. In larger designs, multiple buses, or even a switch may be used, but the key architectural property remains: uniform access time to all memory from all processors. Raul Goycoolea S. Multiprocessor Programming 4216 February 2012
  • Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Processor + Cache I/OMemory Interconnection Network Basic Architecture of a Distributed Multiprocessor Consists of individual nodes containing a processor, some memory, typically some I/O, and an interface to an interconnection network that connects all the nodes. Individual nodes may contain a small number of processors, which may be interconnected by a small bus or a different interconnection technology, which is less scalable than the global interconnection network. Raul Goycoolea S. Multiprocessor Programming 4316 February 2012
  • Communication how do parallel operations communicate data results? Synchronization how are parallel operations coordinated? Resource Management how are a large number of parallel tasks scheduled onto finite hardware? Scalability how large a machine can be built? Issues in Parallel Machine Design Raul Goycoolea S. Multiprocessor Programming 4416 February 2012
  • <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 4516 February 2012
  • Parallel Programming Concepts
  • Implicit vs. Explicit Parallelism. Diagram: a spectrum from implicit to explicit parallelism, from parallelism extracted by hardware in superscalar processors, through the compiler, to parallelism exposed directly by explicitly parallel architectures. Raul Goycoolea S. Multiprocessor Programming 4716 February 2012
  • Implicit Parallelism: Superscalar Processors Explicit Parallelism Shared Instruction Processors Shared Sequencer Processors Shared Network Processors Shared Memory Processors Multicore Processors Outline Raul Goycoolea S. Multiprocessor Programming 4816 February 2012
  • Implicit Parallelism: Superscalar Processors. Issue varying numbers of instructions per clock. Statically scheduled: uses compiler techniques, in-order execution. Dynamically scheduled: extracting ILP by examining 100’s of instructions, scheduling them in parallel as operands become available, renaming registers to eliminate anti-dependences, out-of-order execution, speculative execution. Raul Goycoolea S. Multiprocessor Programming 4916 February 2012
  • Pipelining Execution. Diagram: instructions i through i+4 each pass through IF (instruction fetch), ID (instruction decode), EX (execution) and WB (write back), overlapped one stage apart across cycles 1–8. Raul Goycoolea S. Multiprocessor Programming 5016 February 2012
  • Super-Scalar Execution. Diagram: a 2-issue super-scalar machine issues one integer and one floating-point instruction per cycle, each pair passing through IF, ID, EX and WB across cycles 1–7. Raul Goycoolea S. Multiprocessor Programming 5116 February 2012
  • Data Dependence and Hazards. Intrinsic data dependence (aka true dependence) on instructions: I: add r1,r2,r3 J: sub r4,r1,r3 If two instructions are data dependent, they cannot execute simultaneously, be completely overlapped, or execute out-of-order. If the data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard. Raul Goycoolea S. Multiprocessor Programming 5216 February 2012
  • HW/SW must preserve program order: order instructions would execute in if executed sequentially as determined by original source program Dependences are a property of programs Importance of the data dependencies 1) indicates the possibility of a hazard 2) determines order in which results must be calculated 3) sets an upper bound on how much parallelism can possibly be exploited Goal: exploit parallelism by preserving program order only where it affects the outcome of the program ILP and Data Dependencies, Hazards Raul Goycoolea S. Multiprocessor Programming 5316 February 2012
  • Name Dependence #1: Anti-dependence. Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; there are 2 versions of name dependence. Instr J writes an operand before Instr I reads it: I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”. If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard. Raul Goycoolea S. Multiprocessor Programming 5416 February 2012
  • Name Dependence #2: Output Dependence. Instr J writes an operand before Instr I writes it: I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”. If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard. Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict. Register renaming resolves name dependences for registers; renaming can be done either by the compiler or by hardware. Raul Goycoolea S. Multiprocessor Programming 5516 February 2012
  • Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order if p1 { S1; }; if p2 { S2; } S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. Control dependence need not be preserved willing to execute instructions that should not have been executed, thereby violating the control dependences, if can do so without affecting correctness of the program Speculative Execution Control Dependencies Raul Goycoolea S. Multiprocessor Programming 5616 February 2012
  • Greater ILP: Overcome control dependence by hardware speculating on outcome of branches and executing program as if guesses were correct Speculation ⇒ fetch, issue, and execute instructions as if branch predictions were always correct Dynamic scheduling ⇒ only fetches and issues instructions Essentially a data flow execution model: Operations execute as soon as their operands are available Speculation Raul Goycoolea S. Multiprocessor Programming 5716 February 2012
  • Speculation is Rampant in Modern Superscalars. Different predictors: branch prediction, value prediction, prefetching (memory access pattern prediction). Inefficient: predictions can go wrong and wrongly predicted data has to be flushed; even when it does not hurt performance, it consumes power. Raul Goycoolea S. Multiprocessor Programming 5816 February 2012
  • Implicit Parallelism: Superscalar Processors Explicit Parallelism Shared Instruction Processors Shared Sequencer Processors Shared Network Processors Shared Memory Processors Multicore Processors Outline Raul Goycoolea S. Multiprocessor Programming 5916 February 2012
  • Parallelism is exposed to software Compiler or Programmer Many different forms Loosely coupled Multiprocessors to tightly coupled VLIW Explicit Parallel Processors Raul Goycoolea S. Multiprocessor Programming 6016 February 2012
  • Little’s Law: Parallelism = Throughput × Latency (throughput in operations per cycle, latency of one operation in cycles). To maintain a throughput of T operations per cycle when each operation has a latency of L cycles, you need T×L independent operations. For fixed parallelism: decreased latency allows increased throughput; decreased throughput allows increased latency tolerance. Raul Goycoolea S. Multiprocessor Programming 6116 February 2012
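A quick worked instance of the law (the numbers are illustrative, not from the deck):

    P = T \times L, \qquad T = 4\ \text{ops/cycle},\; L = 5\ \text{cycles}
    \;\Rightarrow\; P = 4 \times 5 = 20\ \text{independent operations in flight}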
  • Types of Software Parallelism. Diagram: Instruction-Level Parallelism (ILP), Pipelining, Data-Level Parallelism (DLP) and Thread-Level Parallelism (TLP), each illustrated over time. Raul Goycoolea S. Multiprocessor Programming 6216 February 2012
  • Translating Parallelism Types. Diagram: how instruction-parallel, data-parallel, thread-parallel and pipelined forms of parallelism map onto one another. Raul Goycoolea S. Multiprocessor Programming 6316 February 2012
  • What is concurrency? What is a sequential program? A single thread of control that executes one instruction and, when it is finished, executes the next logical instruction. What is a concurrent program? A collection of autonomous sequential threads, executing (logically) in parallel. The implementation (i.e. execution) of a collection of threads can be: Multiprogramming – threads multiplex their executions on a single processor. Multiprocessing – threads multiplex their executions on a multiprocessor or a multicore system. Distributed Processing – processes multiplex their executions on several different machines. Raul Goycoolea S. Multiprocessor Programming 6416 February 2012
  • Concurrency is not (only) parallelism Interleaved Concurrency Logically simultaneous processing Interleaved execution on a single processor Parallelism Physically simultaneous processing Requires a multiprocessors or a multicore system Time Time A B C A B C Concurrency and Parallelism Raul Goycoolea S. Multiprocessor Programming 6516 February 2012
  • There are a lot of ways to use Concurrency in Programming Semaphores Blocking & non-blocking queues Concurrent hash maps Copy-on-write arrays Exchangers Barriers Futures Thread pool support Other Types of Synchronization Raul Goycoolea S. Multiprocessor Programming 6616 February 2012
  • Deadlock Two or more threads stop and wait for each other Livelock Two or more threads continue to execute, but make no progress toward the ultimate goal Starvation Some thread gets deferred forever Lack of fairness Each thread gets a turn to make progress Race Condition Some possible interleaving of threads results in an undesired computation result Potential Concurrency Problems Raul Goycoolea S. Multiprocessor Programming 6716 February 2012
  • Concurrency and Parallelism are important concepts in Computer Science Concurrency can simplify programming However it can be very hard to understand and debug concurrent programs Parallelism is critical for high performance From Supercomputers in national labs to Multicores and GPUs on your desktop Concurrency is the basis for writing parallel programs Next Lecture: How to write a Parallel Program Parallelism Conclusions Raul Goycoolea S. Multiprocessor Programming 6816 February 2012
  • Architecture Recap: two primary patterns of multicore architecture design. Shared memory (Ex: Intel Core 2 Duo/Quad): one copy of data shared among many cores; atomicity, locking and synchronization essential for correctness; many scalability issues. Distributed memory (Ex: Cell): cores primarily access local memory; explicit data exchange between cores; data distribution and communication orchestration is essential for performance. Diagram: processors P1..Pn sharing a single memory through an interconnection network vs. processors P1..Pn each with a local memory M1..Mn connected by an interconnection network. Raul Goycoolea S. Multiprocessor Programming 6916 February 2012
  • Processor 1…n ask for X There is only one place to look Communication through shared variables Race conditions possible Use synchronization to protect from conflicts Change how data is stored to minimize synchronization P1 P2 P3 Pn Memory x Interconnection Network Programming Shared Memory Processors Raul Goycoolea S. Multiprocessor Programming 7016 February 2012
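To make the race-condition and synchronization points concrete, here is a minimal pthreads sketch (an illustration, not taken from the deck): two threads update one shared variable, and the mutex is what keeps the result correct.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 1000000

    long counter = 0;                        /* shared variable              */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);       /* without the lock/unlock pair */
            counter++;                       /* this read-modify-write races */
            pthread_mutex_unlock(&lock);     /* and the total comes up short */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected %d)\n", counter, 2 * ITERS);
        return 0;
    }

Compile with -pthread; dropping the mutex calls usually makes the printed counter fall short of the expected total, which is exactly the race the slide warns about.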
  • Example of Parallelization. Data parallel: perform the same computation but operate on different data. A single process can fork multiple concurrent threads; each thread encapsulates its own execution path; each thread has local state and shared resources; threads communicate through shared resources such as global memory. for (i = 0; i < 12; i++) C[i] = A[i] + B[i]; Diagram: fork (threads) into iterations i = 0..3, i = 4..7, i = 8..11, then join (barrier). Raul Goycoolea S. Multiprocessor Programming 7116 February 2012
  • Example Parallelization with Threads (diagram: fork into threads covering i = 0..3, 4..7, 8..11, then join at a barrier):

    #include <pthread.h>

    int A[12] = {...};
    int B[12] = {...};
    int C[12];

    void *add_arrays(void *arg) {
      int start = (int)(long)arg;
      int i;
      for (i = start; i < start + 4; i++)
        C[i] = A[i] + B[i];
      return NULL;
    }

    int main(int argc, char *argv[]) {
      pthread_t thread_ids[3];
      int rc, t;
      for (t = 0; t < 3; t++) {
        rc = pthread_create(&thread_ids[t], NULL /* attributes */,
                            add_arrays /* function */,
                            (void *)(long)(t * 4) /* args to function */);
      }
      pthread_exit(NULL);
    }

    Raul Goycoolea S. Multiprocessor Programming 7216 February 2012
  • Types of Parallelism. Data parallelism: perform the same computation but operate on different data (fork threads, join at a barrier). Control parallelism: perform different functions. pthread_create(/* thread id */, /* attributes */, /* function */, /* args to function */); Raul Goycoolea S. Multiprocessor Programming 7316 February 2012
  • Uniform Memory Access (UMA) Centrally located memory All processors are equidistant (access times) Non-Uniform Access (NUMA) Physically partitioned but accessible by all Processors have the same address space Placement of data affects performance Memory Access Latency in Shared Memory Architectures Raul Goycoolea S. Multiprocessor Programming 7416 February 2012
  • Coverage or extent of parallelism in algorithm Granularity of data partitioning among processors Locality of computation and communication … so how do I parallelize my program? Summary of Parallel Performance Factors Raul Goycoolea S. Multiprocessor Programming 7516 February 2012
  • <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 7616 February 2012
  • Parallel Design Patterns
  • Common Steps to Create a Parallel Program. Diagram: a sequential computation is partitioned by decomposition into tasks, the tasks are assigned to processes (p0..p3), the processes are orchestrated, and finally mapped onto processors (P0..P3), yielding the parallel program.
  • Identify concurrency and decide at what level to exploit it Break up computation into tasks to be divided among processes Tasks may become available dynamically Number of tasks may vary with time Enough tasks to keep processors busy Number of tasks available at a time is upper bound on achievable speedup Decomposition (Amdahl’s Law)
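The Amdahl's Law referenced in the slide title bounds the speedup any such decomposition can achieve; written out (the 90% parallel fraction below is only an illustration, not a figure from the deck):

    \text{Speedup}(N) = \frac{1}{(1 - p) + p/N},
    \qquad p = 0.9,\ N = 16:\quad \text{Speedup} = \frac{1}{0.1 + 0.9/16} \approx 6.4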
  • Specify mechanism to divide work among core Balance work and reduce communication Structured approaches usually work well Code inspection or understanding of application Well-known design patterns As programmers, we worry about partitioning first Independent of architecture or programming model But complexity often affect decisions! Granularity
  • Computation and communication concurrency Preserve locality of data Schedule tasks to satisfy dependences early Orchestration and Mapping
  • Provides a cookbook to systematically guide programmers Decompose, Assign, Orchestrate, Map Can lead to high quality solutions in some domains Provide common vocabulary to the programming community Each pattern has a name, providing a vocabulary for discussing solutions Helps with software reusability, malleability, and modularity Written in prescribed format to allow the reader to quickly understand the solution and its context Otherwise, too difficult for programmers, and software will not fully exploit parallel hardware Parallel Programming by Pattern
  • Berkeley architecture professor Christopher Alexander In 1977, patterns for city planning, landscaping, and architecture in an attempt to capture principles for “living” design History
  • Example 167 (p. 783)
  • Design Patterns: Elements of Reusable Object- Oriented Software (1995) Gang of Four (GOF): Gamma, Helm, Johnson, Vlissides Catalogue of patterns Creation, structural, behavioral Patterns in Object-Oriented Programming
  • Algorithm Expression Finding Concurrency Expose concurrent tasks Algorithm Structure Map tasks to processes to exploit parallel architecture 4 Design Spaces Software Construction Supporting Structures Code and data structuring patterns Implementation Mechanisms Low level mechanisms used to write parallel programs Patterns for Parallel Programming. Mattson, Sanders, and Massingill (2005). Patterns for Parallelizing Programs
  • Here’s my algorithm, where’s the concurrency? Diagram: an MPEG decoder. The MPEG bit stream passes through VLD into macroblocks and motion vectors, then splits: frequency-encoded macroblocks go through ZigZag, IQuantization, IDCT and Saturation (producing spatially encoded macroblocks), while differentially coded motion vectors go through Motion Vector Decode and Repeat (producing motion vectors). The paths join at Motion Compensation, and the recovered picture goes through Picture Reorder, Color Conversion and Display.
  • Here’s my algorithm, where’s the concurrency? (same MPEG decoder diagram). Task decomposition: independent coarse-grained computation, inherent to the algorithm; a sequence of statements (instructions) that operate together as a group; corresponds to some logical part of the program; usually follows from the way the programmer thinks about a problem.
  • Here’s my algorithm, where’s the concurrency? (same MPEG decoder diagram). Task decomposition: parallelism in the application. Data decomposition: the same computation is applied to small data chunks derived from a large data set.
  • Here’s my algorithm, where’s the concurrency? (same MPEG decoder diagram). Task decomposition: parallelism in the application. Data decomposition: same computation, many data. Pipeline decomposition: data assembly lines, producer-consumer chains.
  • Guidelines for Task Decomposition. Algorithms start with a good understanding of the problem being solved. Programs often naturally decompose into tasks; two common decompositions are function calls and distinct loop iterations. It is easier to start with many tasks and later fuse them than to start with too few tasks and later try to split them.
  • Guidelines for Task Decomposition. Flexibility: the program design should afford flexibility in the number and size of tasks generated; tasks should not be tied to a specific architecture; fixed tasks vs. parameterized tasks. Efficiency: tasks should have enough work to amortize the cost of creating and managing them, and should be sufficiently independent that managing dependencies doesn’t become the bottleneck. Simplicity: the code has to remain readable and easy to understand and debug.
  • Data decomposition is often implied by task decomposition Programmers need to address task and data decomposition to create a parallel program Which decomposition to start with? Data decomposition is a good starting point when Main computation is organized around manipulation of a large data structure Similar operations are applied to different parts of the data structure Guidelines for Data Decomposition Raul Goycoolea S. Multiprocessor Programming 9316 February 2012
  • Common Data Decompositions. Array data structures: decomposition of arrays along rows, columns, or blocks. Recursive data structures: for example, decomposition of trees into sub-trees. Diagram: a problem is recursively split into subproblems, each subproblem is computed, and the partial results are merged into the solution. Raul Goycoolea S. Multiprocessor Programming 9416 February 2012
  • Flexibility Size and number of data chunks should support a wide range of executions Efficiency Data chunks should generate comparable amounts of work (for load balancing) Simplicity Complex data compositions can get difficult to manage and debug Raul Goycoolea S. Multiprocessor Programming 9516 February 2012 Guidelines for Data Decompositions
  • Data is flowing through a sequence of stages Assembly line is a good analogy What’s a prime example of pipeline decomposition in computer architecture? Instruction pipeline in modern CPUs What’s an example pipeline you may use in your UNIX shell? Pipes in UNIX: cat foobar.c | grep bar | wc Other examples Signal processing Graphics ZigZag IQuantization IDCT Saturation Guidelines for Pipeline Decomposition Raul Goycoolea S. Multiprocessor Programming 9616 February 2012
  • <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 9716 February 2012
  • Performance & Optimization
  • Review: Keys to Parallel Performance. Coverage or extent of parallelism in the algorithm (Amdahl’s Law). Granularity of partitioning among processors: communication cost and load balancing. Locality of computation and communication: communication between processors or between processors and their memories.
  • Communication Cost Model: C = f × (o + l + n/m/B + t_c − overlap), where f is the frequency of messages, o the overhead per message (at both ends), l the network delay per message, m the number of messages, n the total data sent (so n/m is the data per message), B the bandwidth along the path (determined by the network), t_c the cost induced by contention per message, and overlap the amount of latency hidden by concurrency with computation.
  • synchronization point Get Data Compute Get Data CPU is idle Memory is idle Compute Overlapping Communication with Computation
  • Computation to communication ratio limits performance gains from pipelining Get Data Compute Get Data Compute Where else to look for performance? Limits in Pipelining Communication
  • Determined by program implementation and interactions with the architecture Examples: Poor distribution of data across distributed memories Unnecessarily fetching data that is not used Redundant data fetches Artifactual Communication
  • In uniprocessors, CPU communicates with memory Loads and stores are to uniprocessors as _______ and ______ are to distributed memory multiprocessors How is communication overlap enhanced in uniprocessors? Spatial locality Temporal locality “get” “put” Lessons From Uniprocessors
  • CPU asks for data at address 1000 Memory sends data at address 1000 … 1064 Amount of data sent depends on architecture parameters such as the cache block size Works well if CPU actually ends up using data from 1001, 1002, …, 1064 Otherwise wasted bandwidth and cache capacity Spatial Locality
  • Main memory access is expensive Memory hierarchy adds small but fast memories (caches) near the CPU Memories get bigger as distance from CPU increases CPU asks for data at address 1000 Memory hierarchy anticipates more accesses to same address and stores a local copy Works well if CPU actually ends up using data from 1000 over and over and over … Otherwise wasted cache capacity main memory cache (level 2) cache (level 1) Temporal Locality
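The address-1000 examples above can be seen in a short C sketch (illustrative, not from the deck): the row-major loop touches consecutive addresses and reuses each fetched cache line (spatial locality), while the column-major loop strides across rows and wastes most of each line.

    #include <stddef.h>

    #define N 1024
    static double a[N][N];              /* C stores this array row-major    */

    double sum_row_major(void) {        /* good spatial locality            */
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i][j];           /* consecutive addresses: each      */
        return s;                       /* fetched cache line is fully used */
    }

    double sum_col_major(void) {        /* poor spatial locality            */
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i][j];           /* stride of N*sizeof(double) bytes */
        return s;                       /* most of each line goes unused    */
    }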
  • Data is transferred in chunks to amortize communication cost Cell: DMA gets up to 16K Usually get a contiguous chunk of memory Spatial locality Computation should exhibit good spatial locality characteristics Temporal locality Reorder computation to maximize use of data fetched Reducing Artifactual Costs in Distributed Memory Architectures
  • Tasks mapped to execution units (threads) Threads run on individual processors (cores) finish line: sequential time + longest parallel time Two keys to faster execution Load balance the work among the processors Make execution on each processor faster sequential parallel sequential parallel Single Thread Performance
  • Understanding Performance. Need some way of measuring performance. Coarse grained measurements:

    % gcc sample.c
    % time a.out
    2.312u 0.062s 0:02.50 94.8%
    % gcc sample.c –O3
    % time a.out
    1.921u 0.093s 0:02.03 99.0%

    … but did we learn much about what’s going on? (The alternative: record the start time and stop time around the region of interest.)

    #define N (1 << 23)
    #define T (10)
    #include <stdio.h>
    #include <string.h>
    double a[N], b[N];
    void cleara(double a[N]) {
      int i;
      for (i = 0; i < N; i++) { a[i] = 0; }
    }
    int main() {
      double s = 0, s2 = 0;
      int i, j;
      for (j = 0; j < T; j++) {
        for (i = 0; i < N; i++) { b[i] = 0; }
        cleara(a);
        memset(a, 0, sizeof(a));
        for (i = 0; i < N; i++) {
          s += a[i] * b[i];
          s2 += a[i] * a[i] + b[i] * b[i];
        }
      }
      printf("s %f s2 %f\n", s, s2);
    }
  • Increasingly possible to get accurate measurements using performance counters Special registers in the hardware to measure events Insert code to start, read, and stop counter Measure exactly what you want, anywhere you want Can measure communication and computation duration But requires manual changes Monitoring nested scopes is an issue Heisenberg effect: counters can perturb execution time time stopclear/start Measurements Using Counters
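Reading true hardware counters requires a library or tool (PAPI and VTune appear a few slides later), but the insert-start/read/stop pattern the slide describes looks roughly like this portable sketch built on clock_gettime (an assumed stand-in for a counter API, for illustration only):

    #include <stdio.h>
    #include <time.h>

    static double now_sec(void) {            /* "read counter"             */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        double t0 = now_sec();               /* start of measured region   */
        volatile double s = 0.0;
        for (long i = 0; i < 50000000L; i++) /* the code being measured    */
            s += i * 0.5;
        double t1 = now_sec();               /* end of measured region     */
        printf("region took %.3f s (s = %g)\n", t1 - t0, (double)s);
        return 0;
    }

As the slide notes, instrumenting by hand like this requires manual changes and can perturb the very execution being measured.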
  • Event-based profiling Interrupt execution when an event counter reaches a threshold Time-based profiling Interrupt execution every t seconds Works without modifying your code Does not require that you know where problem might be Supports multiple languages and programming models Quite efficient for appropriate sampling frequencies Dynamic Profiling
  • Cycles (clock ticks) Pipeline stalls Cache hits Cache misses Number of instructions Number of loads Number of stores Number of floating point operations … Counter Examples
  • Processor utilization Cycles / Wall Clock Time Instructions per cycle Instructions / Cycles Instructions per memory operation Instructions / Loads + Stores Average number of instructions per load miss Instructions / L1 Load Misses Memory traffic Loads + Stores * Lk Cache Line Size Bandwidth consumed Loads + Stores * Lk Cache Line Size / Wall Clock Time Many others Cache miss rate Branch misprediction rate … Useful Derived Measurements
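As a worked instance of two of these ratios (the counts are illustrative only, not measurements from the deck):

    \text{IPC} = \frac{\text{Instructions}}{\text{Cycles}} = \frac{8 \times 10^{9}}{4 \times 10^{9}} = 2,
    \qquad
    \frac{\text{Instructions}}{\text{Loads} + \text{Stores}} = \frac{8 \times 10^{9}}{2 \times 10^{9}} = 4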
  • application source run (profiles execution) performance profile binary object code compiler binary analysis interpret profile source correlation Common Profiling Workflow
  • GNU gprof Widely available with UNIX/Linux distributions gcc –O2 –pg foo.c –o foo ./foo gprof foo HPC Toolkit http://www.hipersoft.rice.edu/hpctoolkit/ PAPI http://icl.cs.utk.edu/papi/ VTune http://www.intel.com/cd/software/products/asmo-na/eng/vtune/ Many others Popular Runtime Profiling Tools
  • Performance in Uniprocessors: time = compute + wait. Instruction level parallelism: multiple functional units, deeply pipelined, speculation, ... Data level parallelism: SIMD (Single Instruction, Multiple Data), short vector instructions (multimedia extensions); hardware is simpler, no heavily ported register files; instructions are more compact; reduces instruction fetch bandwidth. Complex memory hierarchies: multiple levels of caches, many outstanding misses, prefetching, …
  • Single Instruction, Multiple Data SIMD registers hold short vectors Instruction operates on all elements in SIMD register at once a b c Vector code for (int i = 0; i < n; i += 4) { c[i:i+3] = a[i:i+3] + b[i:i+3] } SIMD register Scalar code for (int i = 0; i < n; i+=1) { c[i] = a[i] + b[i] } a b c scalar register Single Instruction, Multiple Data
  • SIMD in Major Instruction Set Architectures (ISAs):

    Instruction Set | Architecture | SIMD Width | Floating Point
    AltiVec         | PowerPC      | 128        | yes
    MMX/SSE         | Intel        | 64/128     | yes
    3DNow!          | AMD          | 64         | yes
    VIS             | Sun          | 64         | no
    MAX2            | HP           | 64         | no
    MVI             | Alpha        | 64         | no
    MDMX            | MIPS V       | 64         | yes

    For example, the Cell SPU has 128 128-bit registers; all instructions are SIMD instructions; registers are treated as short vectors of 8/16/32-bit integers or single/double-precision floats.
  • Library calls and inline assembly Difficult to program Not portable Different extensions to the same ISA MMX and SSE SSE vs. 3DNow! Compiler vs. Crypto Oracle T4 Using SIMD Instructions
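As one concrete instance of the intrinsics/library route the slide mentions, here is a hedged x86 SSE sketch of the earlier c[i] = a[i] + b[i] loop (it assumes an SSE-capable CPU and that n is a multiple of 4; other ISAs in the table above would use their own, non-portable intrinsics):

    #include <xmmintrin.h>                     /* SSE intrinsics             */

    /* c[i] = a[i] + b[i], four floats per instruction.                     */
    void add_simd(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats              */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);    /* one SIMD add               */
            _mm_storeu_ps(&c[i], vc);          /* store 4 results            */
        }
    }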
  • Tune the parallelism first Then tune performance on individual processors Modern processors are complex Need instruction level parallelism for performance Understanding performance requires a lot of probing Optimize for the memory hierarchy Memory is much slower than processors Multi-layer memory hierarchies try to hide the speed gap Data locality is essential for performance Programming for Performance
  • May have to change everything! Algorithms, data structures, program structure Focus on the biggest performance impediments Too many issues to study everything Remember the law of diminishing returns Programming for Performance
  • <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Actual Cases • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 12216 February 2012
  • Parallel Compilers
  • Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Generation of Parallel Loops Communication Code Generation Compilers Outline Raul Goycoolea S. Multiprocessor Programming 12416 February 2012
  • Types of Parallelism and how each is typically exploited: Instruction Level Parallelism (ILP): scheduling and hardware. Task Level Parallelism (TLP): mainly by hand. Loop Level Parallelism (LLP) or Data Parallelism: hand or compiler generated. Pipeline Parallelism: hardware or streaming. Divide and Conquer Parallelism: recursive functions. Raul Goycoolea S. Multiprocessor Programming 12516 February 2012
  • 90% of the execution time in 10% of the code Mostly in loops If parallel, can get good performance Load balancing Relatively easy to analyze Why Loops? Raul Goycoolea S. Multiprocessor Programming 12616 February 2012
  • FORALL No “loop carried dependences” Fully parallel FORACROSS Some “loop carried dependences” Programmer Defined Parallel Loop Raul Goycoolea S. Multiprocessor Programming 12716 February 2012
  • Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Generation of Parallel Loops Communication Code Generation Outline Raul Goycoolea S. Multiprocessor Programming 12816 February 2012
  • Finding FORALL Loops out of FOR loops Examples FOR I = 0 to 5 A[I+1] = A[I] + 1 FOR I = 0 to 5 A[I] = A[I+6] + 1 For I = 0 to 5 A[2*I] = A[2*I + 1] + 1 Parallelizing Compilers Raul Goycoolea S. Multiprocessor Programming 12916 February 2012
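For reference, only the last two example loops are FORALL candidates: A[I+1] = A[I] + 1 carries a true dependence from one iteration to the next, while A[I] = A[I+6] + 1 and A[2*I] = A[2*I+1] + 1 write and read disjoint sets of elements. A minimal sketch of how the second loop could be marked fully parallel using OpenMP (OpenMP is an assumption for illustration, not the deck's notation):

    /* compile with: cc -fopenmp forall.c */
    void forall_example(int A[12]) {
        /* writes A[0..5], reads A[6..11]: disjoint index sets, so there is
           no loop-carried dependence and the loop is a FORALL              */
        #pragma omp parallel for
        for (int i = 0; i <= 5; i++)
            A[i] = A[i + 6] + 1;
    }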
  • Dependences. True dependence: a = …; … = a. Anti dependence: … = a; a = …. Output dependence: a = …; a = …. Definition: a data dependence exists for dynamic instances i and j iff: either i or j is a write operation; i and j refer to the same variable; and i executes before j. How about array accesses within loops? Raul Goycoolea S. Multiprocessor Programming 13016 February 2012
  • Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Generation of Parallel Loops Communication Code Generation Outline Raul Goycoolea S. Multiprocessor Programming 13116 February 2012
  • Array Access in a Loop: FOR I = 0 to 5: A[I] = A[I] + 1. Diagram: the iteration space I = 0..5 maps onto the data space A[0..12]; each iteration I reads and writes only A[I]. Raul Goycoolea S. Multiprocessor Programming 13216 February 2012
  • Recognizing FORALL Loops. Find data dependences in the loop: for every pair of array accesses to the same array, if the first access has at least one dynamic instance (an iteration) in which it refers to a location in the array that the second access also refers to in at least one of the later dynamic instances (iterations), then there is a data dependence between the statements. (Note that the same access can pair with itself – output dependences.) Definition: a loop-carried dependence is a dependence that crosses a loop boundary. If there are no loop-carried dependences, the loop is parallelizable. Raul Goycoolea S. Multiprocessor Programming 13316 February 2012
  • What is the Dependence? FOR I = 1 to n, FOR J = 1 to n: A[I, J] = A[I-1, J+1] + 1. FOR I = 1 to n, FOR J = 1 to n: A[I] = A[I-1] + 1. (Diagrams over the I, J iteration spaces.) Raul Goycoolea S. Multiprocessor Programming 13416 February 2012
  • Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities Generation of Parallel Loops Communication Code Generation Outline Raul Goycoolea S. Multiprocessor Programming 13516 February 2012
  • Scalar Privatization Reduction Recognition Induction Variable Identification Array Privatization Interprocedural Parallelization Loop Transformations Granularity of Parallelism Increasing Parallelization Opportunities Raul Goycoolea S. Multiprocessor Programming 13616 February 2012
  • Example FOR i = 1 to n X = A[i] * 3; B[i] = X; Is there a loop carried dependence? What is the type of dependence? Scalar Privatization Raul Goycoolea S. Multiprocessor Programming 13716 February 2012
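The dependence on X here is only a name (anti/output) dependence, since every iteration writes X before reading it, so giving each processor its own copy of X removes it. A minimal sketch using OpenMP's private clause (OpenMP is an assumption for illustration, not the deck's notation, and the loop is written 0-based):

    /* compile with: cc -fopenmp privatize.c */
    void scale(const int *A, int *B, int n) {
        int X;
        #pragma omp parallel for private(X)   /* each thread gets its own X */
        for (int i = 0; i < n; i++) {
            X = A[i] * 3;                     /* the privatized scalar      */
            B[i] = X;
        }
    }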
  • Reduction Analysis: Only associative operations The result is never used within the loop Transformation Integer Xtmp[NUMPROC]; Barrier(); FOR i = myPid*Iters to MIN((myPid+1)*Iters, n) Xtmp[myPid] = Xtmp[myPid] + A[i]; Barrier(); If(myPid == 0) { FOR p = 0 to NUMPROC-1 X = X + Xtmp[p]; … Reduction Recognition Raul Goycoolea S. Multiprocessor Programming 13816 February 2012
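The hand transformation on the slide (per-processor partial sums in Xtmp, combined after a barrier) is what a reduction clause generates automatically; a minimal OpenMP sketch (OpenMP is an assumption for illustration, not part of the deck):

    /* compile with: cc -fopenmp reduce.c */
    int sum(const int *A, int n) {
        int X = 0;
        #pragma omp parallel for reduction(+:X)  /* per-thread partial sums */
        for (int i = 0; i < n; i++)              /* combined after the loop */
            X += A[i];
        return X;
    }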
  • Example FOR i = 0 to N A[i] = 2^i; After strength reduction t = 1 FOR i = 0 to N A[i] = t; t = t*2; What happened to loop carried dependences? Need to do opposite of this! Perform induction variable analysis Rewrite IVs as a function of the loop variable Induction Variables Raul Goycoolea S. Multiprocessor Programming 13916 February 2012
  • Similar to scalar privatization However, analysis is more complex Array Data Dependence Analysis: Checks if two iterations access the same location Array Data Flow Analysis: Checks if two iterations access the same value Transformations Similar to scalar privatization Private copy for each processor or expand with an additional dimension Array Privatization Raul Goycoolea S. Multiprocessor Programming 14016 February 2012
  • Interprocedural Parallelization. Function calls will make a loop unparallelizable: reduction of available parallelism, with a lot of parallelism left only in inner loops. Solutions: interprocedural analysis, inlining. Raul Goycoolea S. Multiprocessor Programming 14116 February 2012
  • Cache Coherent Shared Memory Machine Generate code for the parallel loop nest No Cache Coherent Shared Memory or Distributed Memory Machines Generate code for the parallel loop nest Identify communication Generate communication code Communication Code Generation Raul Goycoolea S. Multiprocessor Programming 14216 February 2012
  • Eliminating redundant communication Communication aggregation Multi-cast identification Local memory management Communication Optimizations Raul Goycoolea S. Multiprocessor Programming 14316 February 2012
  • Automatic parallelization of loops with arrays Requires Data Dependence Analysis Iteration space & data space abstraction An integer programming problem Many optimizations that’ll increase parallelism Transforming loop nests and communication code generation Fourier-Motzkin Elimination provides a nice framework Summary Raul Goycoolea S. Multiprocessor Programming 14416 February 2012
  • <Insert Picture Here> Program Agenda • Antecedents of Parallel Computing • Introduction to Parallel Architectures • Parallel Programming Concepts • Parallel Design Patterns • Performance & Optimization • Parallel Compilers • Future of Parallel Architectures Raul Goycoolea S. Multiprocessor Programming 14516 February 2012
  • Future of Parallel Architectures
  • "I think there is a world market for maybe five computers.“ – Thomas Watson, chairman of IBM, 1949 "There is no reason in the world anyone would want a computer in their home. No reason.” – Ken Olsen, Chairman, DEC, 1977 "640K of RAM ought to be enough for anybody.” – Bill Gates, 1981 Predicting the Future is Always Risky Raul Goycoolea S. Multiprocessor Programming 14716 February 2012
  • Evolution Relatively easy to predict Extrapolate the trends Revolution A completely new technology or solution Hard to Predict Paradigm Shifts can occur in both Future = Evolution + Revolution Raul Goycoolea S. Multiprocessor Programming 14816 February 2012
  • Outline
    Evolution
      Trends
      Architecture
      Languages, Compilers and Tools
    Revolution
      Crossing the Abstraction Boundaries
  • Evolution
    Look at the trends:
      Moore's Law
      Power consumption
      Wire delay
      Hardware complexity
      Parallelizing compilers
      Program design methodologies
    Design drivers are different in different generations
  • The Road to Multicore: Moore's Law
    [Chart: performance relative to the VAX-11/780 and number of transistors per chip, 1978–2016, for the 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2; performance grows at roughly 25%/year and later 52%/year while transistor counts scale toward one billion. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
  • The Road to Multicore: Uniprocessor Performance (SPECint)
    [Chart: SPECint2000 performance, 1985–2007, for the Intel 386, 486, Pentium, Pentium 2/3/4 and Itanium, the Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, and AMD K6, K7 and x86-64.]
  • The Road to Multicore: Uniprocessor Performance (SPECint)
    General-purpose unicores have stopped their historic performance scaling:
      Power consumption
      Wire delays
      DRAM access latency
      Diminishing returns of more instruction-level parallelism
  • Power Consumption (watts)
    [Chart: power consumption in watts (log scale, 1 to 1000 W), 1985–2007, for the same processor families: Intel 386 through Pentium 4 and Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, and AMD K6, K7 and x86-64.]
  • Power Efficiency (watts/spec)
    [Chart: watts per SPEC, 0 to 0.7, 1982–2006, for the same processor families: Intel 386 through Pentium 4 and Itanium, Alpha, Sparc, MIPS, HP PA, PowerPC and AMD.]
  • Range of a Wire in One Clock Cycle
    [Chart: process feature size in microns (0.26 down to 0.06) versus year (1996–2014), with clock-frequency contours from 700 MHz to 13.5 GHz showing how far a signal can travel across a 400 mm² die in one cycle. From the SIA Roadmap.]
  • DRAM Access Latency
    [Chart: performance, 1980–2004 (log scale): microprocessors improve at ~60%/year (2x every 1.5 years) while DRAM improves at ~9%/year (2x every 10 years).]
    • Access times are a speed-of-light issue
    • Memory technology is also changing:
      SRAM is getting harder to scale
      DRAM is no longer the cheapest cost/bit
    • Power efficiency is an issue here as well
  • CPU Architecture: Heat Becoming an Unmanageable Problem
    [Chart: power density in W/cm² (log scale, 1 to 10,000), 1970–2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium, climbing past the power density of a hot plate toward that of a nuclear reactor, a rocket nozzle and the Sun's surface. Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W).]
    Cube relationship between cycle time and power
  • Compiling for Instruction-Level Parallelism
    [Timeline, 1970–2010: improvement in automatic parallelization]
      Vectorization technology
      Automatic parallelizing compilers for FORTRAN
      Prevalence of type-unsafe languages and complex data structures (C, C++)
      Type-safe languages (Java, C#)
      Demand driven by multicores?
  • Multicores Future
    [Chart: number of cores per chip, 1970–2010 (log scale, 1 to 512). Single-core designs (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, PA-8800, Power4, Power6, Opteron, Yonah, Tanglewood) give way to multicores such as the Cell, Opteron 4P, Xeon MP, Niagara, Xbox 360, Broadcom 1480, Raw, Cavium Octeon, Raza XLR, Cisco CSR-1, Intel Tflops, Picochip PC102 and Ambric AM2045, with core counts doubling toward the hundreds.]
  • Outline
    Evolution
      Trends
      Architecture
      Languages, Compilers and Tools
    Revolution
      Crossing the Abstraction Boundaries
  • Novel Opportunities in Multicores
    Don't have to contend with uniprocessors:
      The era of Moore's-Law-induced performance gains is over!
      Parallel programming will be required by the masses – not just a few supercomputer super-users
    Not your same old multiprocessor problem:
      How does going from multiprocessors to multicores impact programs?
      What changed? Where is the impact?
        Communication bandwidth
        Communication latency
  • Communication Bandwidth
    How much data can be communicated between two cores?
    Off-chip (traditional multiprocessor): ~32 Gbit/s; on-chip (multicore): ~300 Tbit/s (roughly 10,000x more)
    What changed?
      Number of wires: I/O is the true bottleneck; on-chip wire density is very high
      Clock rate: I/O is slower than on-chip
      Multiplexing: no sharing of pins
    Impact on the programming model?
      Massive data exchange is possible
      Data movement is not the bottleneck, so processor affinity is not that important
  • Communication Latency
    How long does a round-trip communication take?
    Off-chip: ~200 cycles; on-chip: ~4 cycles (roughly 50x lower)
    What changed?
      Length of wire: very short wires are faster
      Pipeline stages: no multiplexing; on-chip is much closer; bypass and speculation?
    Impact on the programming model?
      Ultra-fast synchronization
      Can run real-time apps on multiple cores
  • Past, Present and the Future?
    [Diagram: a traditional multiprocessor (processing elements with private caches, each attached to its own memory), a basic multicore such as IBM Power (several PEs with caches sharing memory), and an integrated multicore such as the 8-core, 8-thread Oracle T4 (PEs, caches and on-chip interconnect on a single die).]
  • Summary
    • As technology evolves, the inherent flexibility of multiprocessors lets them adapt to new requirements
    • Processors can be used at any time for many kinds of applications
    • Optimization adapts processors to high-performance requirements
  • References
    • Author: Raul Goycoolea, Oracle Corporation.
    • A search on the WWW for "parallel programming" or "parallel computing" will yield a wide variety of information.
    • Recommended reading:
      • "Designing and Building Parallel Programs", Ian Foster. http://www-unix.mcs.anl.gov/dbpp/
      • "Introduction to Parallel Computing", Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. http://www-users.cs.umn.edu/~karypis/parbook/
      • "Overview of Recent Supercomputers", A.J. van der Steen, Jack Dongarra. www.phys.uu.nl/~steen/web03/overview.html
    • MIT Multicore Programming Class 6.189, Prof. Saman Amarasinghe
    • Photos/graphics have been created by the author, obtained from non-copyrighted, government or public-domain sources (such as http://commons.wikimedia.org/), or used with the permission of authors of other presentations and web pages.
  • Keep in Touch
    Raul Goycoolea Seoane
    Twitter: http://twitter.com/raul_goycoolea
    Facebook: http://www.facebook.com/raul.goycoolea
    Linkedin: http://www.linkedin.com/in/raulgoy
    Blog: http://blogs.oracle.com/raulgoy/
  • Questions?