[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
Published on http://cs264.org
Transcript of "[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns"

  1. 1. Massively Parallel Computing, CS 264 / CSCI E-292. Lecture #2: Architecture, Theory & Patterns | February 1st, 2011. Nicolas Pinto (MIT, Harvard), pinto@mit.edu
  2. 2. Objectives• introduce important computational thinking skills for massively parallel computing• understand hardware limitations• understand algorithm constraints• identify common patterns
  3. 3. During this course, we’ll try to “adapt for CS264” and use existing material ;-)
  4. 4. Outline• Thinking Parallel• Architecture• Programming Model• Bits of Theory• Patterns
  5. 5. Motivation: • The most economic number of components in an IC will double every year • Historically: CPUs get faster (hardware reaching frequency limitations) • Now: CPUs get wider. slide by Matthew Bolitho
  6. 6. Motivation: • Rather than expecting CPUs to get twice as fast, expect to have twice as many! • Parallel processing for the masses • Unfortunately, parallel programming is hard: algorithms and data structures must be fundamentally redesigned. slide by Matthew Bolitho
  7. 7. Thinking Parallel
  8. 8. Getting your feet wet• Common scenario: “I want to make the algorithm X run faster, help me!”• Q: How do you approach the problem?
  9. 9. How?
  10. 10. How?• Option 1: wait• Option 2: gcc -O3 -msse4.2• Option 3: xlc -O5• Option 4: use parallel libraries (e.g. (cu)blas)• Option 5: hand-optimize everything!• Option 6: wait more
  11. 11. What else ?
  12. 12. How about analysis ?
  13. 13. Getting your feet wet: Algorithm X v1.0 Profiling Analysis on Input 10x10x10 [bar chart: time (s) for load_data(), foo(), bar(), yey(); legend: “100% parallelizable” vs. “sequential in nature”; bars at roughly 50, 29, 10 and 11 s] Q: What is the maximum speed up ?
  14. 14. Getting your feet wet: same profile as above. A: 2X ! :-(
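A side note (not on the slides): the 2X bound is just Amdahl’s law. The given answer implies that the sequential portion accounts for 50 of the 100 seconds, so the parallelizable fraction is p = 0.5 and the speedup on N workers is bounded by

\[
S(N) = \frac{1}{(1-p) + p/N}, \qquad \lim_{N\to\infty} S(N) = \frac{1}{1-p} = \frac{1}{1-0.5} = 2 .
\]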
  15. 15. Getting your feet wet: Algorithm X v1.0 Profiling Analysis on Input 100x100x100 [bar chart: time (s) for load_data(), foo(), bar(), yey(); legend: “100% parallelizable” vs. “sequential in nature”; one bar near 9,000 s, the others around 350, 250 and 300 s] Q: and now?
  16. 16. You need to...• ... understand the problem (duh!)• ... study the current (sequential?) solutions and their constraints• ... know the input domain• ... profile accordingly• ... “refactor” based on new constraints (hw/sw)
  17. 17. A better way ? (...doesn’t scale!) Speculation: (input) domain-aware optimization using some sort of probabilistic modeling ?
  18. 18. Some Perspective: the “problem tree” for scientific problem solving [tree nodes: Technical Problem to be Analyzed; Consultation with experts; Scientific Model "A" / Model "B"; Theoretical analysis; Discretization "A" / Discretization "B"; Experiments; Iterative equation solver / Direct elimination equation solver; Parallel implementation / Sequential implementation] Figure 11: The “problem tree” for scientific problem solving. There are many options to try to achieve the same goal. from Scott et al. “Scientific Parallel Computing” (2005)
  19. 19. Computational Thinking• translate/formulate domain problems into computational models that can be solved efficiently by available computing resources• requires a deep understanding of their relationships adapted from Hwu & Kirk (PASI 2011)
  20. 20. Getting ready... [diagram: Architecture, Programming Models, Compilers, Algorithms, Languages, Patterns; Parallel Thinking; Parallel Computing; APPLICATIONS] adapted from Scott et al. “Scientific Parallel Computing” (2005)
  21. 21. Fundamental Skills• Computer architecture• Programming models and compilers• Algorithm techniques and patterns• Domain knowledge
  22. 22. Computer Architecturecritical in understanding tradeoffs btw algorithms • memory organization, bandwidth and latency; caching and locality (memory hierarchy) • floating-point precision vs. accuracy • SISD, SIMD, MISD, MIMD vs. SIMT, SPMD
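To make the “floating-point precision vs. accuracy” point concrete, here is a small illustrative C program (mine, not from the slides): accumulating a value that is not exactly representable shows how single precision drifts much further than double precision.

   #include <stdio.h>

   int main(void)
   {
       float  fs = 0.0f;
       double ds = 0.0;

       /* add 0.1 ten million times; 0.1 has no exact binary representation */
       for (int i = 0; i < 10000000; ++i)
       {
           fs += 0.1f;
           ds += 0.1;
       }

       printf("float  sum: %f\n", fs);   /* visibly far from 1000000 */
       printf("double sum: %f\n", ds);   /* very close to 1000000    */
       return 0;
   }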
  23. 23. Programming models for optimal data structure and code execution• parallel execution models (threading hierarchy)• optimal memory access patterns• array data layout and loop transformations
  24. 24. Algorithms and patterns• toolbox for designing good parallel algorithms• it is critical to understand their scalability and efficiency• many have been exposed and documented• sometimes hard to “extract”• ... but keep trying!
  25. 25. Domain Knowledge• abstract modeling• mathematical properties• accuracy requirements• coming back to the drawing board to expose more/better parallelism ?
  26. 26. You can do it!• thinking parallel is not as hard as you may think• many techniques have been thoroughly explained...• ... and are now “accessible” to non-experts !
  27. 27. Architecture
  28. 28. Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  29. 29. Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  30. 30. What’s in a computer?adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  31. 31. What’s in a computer? Processor Intel Q6600 Core2 Quad, 2.4 GHzadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  32. 32. What’s in a computer? Die Processor (2×) 143 mm2 , 2 × 2 cores Intel Q6600 Core2 Quad, 2.4 GHz 582,000,000 transistors ∼ 100Wadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  33. 33. What’s in a computer?adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  34. 34. What’s in a computer? Memoryadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  35. 35. Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines
  36. 36. A Basic Processor Memory Interface Address ALU Address Bus Data Bus Register File Flags Internal Bus Insn. fetch PC Data ALU Control Unit (loosely based on Intel 8086)adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  37. 37. How all of this fits together Everything synchronizes to the Clock. Control Unit (“CU”): The brains of the Memory Interface operation. Everything connects to it. Address ALU Address Bus Data Bus Bus entries/exits are gated and Register File Flags (potentially) buffered. Internal Bus CU controls gates, tells other units Insn. fetch PC Control Unit Data ALU about ‘what’ and ‘how’: • What operation? • Which register? • Which addressing mode?adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  38. 38. What is. . . an ALU? Arithmetic Logic Unit One or two operands A, B Operation selector (Op): • (Integer) Addition, Subtraction • (Logical) And, Or, Not • (Bitwise) Shifts (equivalent to multiplication by power of two) • (Integer) Multiplication, Division Specialized ALUs: • Floating Point Unit (FPU) • Address ALU Operates on binary representations of numbers. Negative numbers represented by two’s complement.adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
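As a tiny illustration of two’s complement (my example, not part of the slide), printing the bit pattern of an 8-bit -5 shows the “invert the bits and add one” encoding:

   #include <stdio.h>
   #include <stdint.h>

   int main(void)
   {
       int8_t  x    = -5;
       uint8_t bits = (uint8_t) x;   /* same bit pattern, viewed as unsigned */

       for (int b = 7; b >= 0; --b)
           putchar(((bits >> b) & 1) ? '1' : '0');
       putchar('\n');                /* prints 11111011, i.e. ~00000101 + 1 */
       return 0;
   }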
  39. 39. What is. . . a Register File? Registers are On-Chip Memory %r0 • Directly usable as operands in %r1 Machine Language %r2 • Often “general-purpose” %r3 • Sometimes special-purpose: Floating %r4 point, Indexing, Accumulator %r5 • Small: x86 64: 16×64 bit GPRs %r6 • Very fast (near-zero latency) %r7adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  40. 40. How does computer memory work? One (reading) memory transaction (simplified): D0..15 Processor Memory A0..15 ¯ R/W CLKadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  46. 46. How does computer memory work? One (reading) memory transaction (simplified): D0..15 Processor Memory A0..15 ¯ R/W CLK Observation: Access (and addressing) happens in bus-width-size “chunks”.adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  47. 47. What is. . . a Memory Interface? Memory Interface gets and stores binary words in off-chip memory. Smallest granularity: Bus width Tells outside memory • “where” through address bus • “what” through data bus Computer main memory is “Dynamic RAM” (DRAM): Slow, but small and cheap.adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  48. 48. Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  49. 49. A Very Simple Program 4: c7 45 f4 05 00 00 00 movl $0x5,−0xc(%rbp) b: c7 45 f8 11 00 00 00 movl $0x11,−0x8(%rbp) int a = 5; 12: 8b 45 f4 mov −0xc(%rbp),%eax int b = 17; 15: 0f af 45 f8 imul −0x8(%rbp),%eax int z = a ∗ b; 19: 89 45 fc mov %eax,−0x4(%rbp) 1c: 8b 45 fc mov −0x4(%rbp),%eax Things to know: • Addressing modes (Immediate, Register, Base plus Offset) • 0xHexadecimal • “AT&T Form”: (we’ll use this) <opcode><size> <source>, <dest>adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  50. 50. A Very Simple Program: Intel Form 4: c7 45 f4 05 00 00 00 mov DWORD PTR [rbp−0xc],0x5 b: c7 45 f8 11 00 00 00 mov DWORD PTR [rbp−0x8],0x11 12: 8b 45 f4 mov eax,DWORD PTR [rbp−0xc] 15: 0f af 45 f8 imul eax,DWORD PTR [rbp−0x8] 19: 89 45 fc mov DWORD PTR [rbp−0x4],eax 1c: 8b 45 fc mov eax,DWORD PTR [rbp−0x4] • “Intel Form”: (you might see this on the net) <opcode> <sized dest>, <sized source> • Goal: Reading comprehension. • Don’t understand an opcode? Google “<opcode> intel instruction”.adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  51. 51. Machine Language Loops 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp int main() 4: c7 45 f8 00 00 00 00 movl $0x0,−0x8(%rbp) { b: c7 45 fc 00 00 00 00 movl $0x0,−0x4(%rbp) int y = 0, i ; 12: eb 0a jmp 1e <main+0x1e> 14: 8b 45 fc mov −0x4(%rbp),%eax for ( i = 0; 17: 01 45 f8 add %eax,−0x8(%rbp) y < 10; ++i) 1a: 83 45 fc 01 addl $0x1,−0x4(%rbp) y += i; 1e: 83 7d f8 09 cmpl $0x9,−0x8(%rbp) return y; 22: 7e f0 jle 14 <main+0x14> 24: 8b 45 f8 mov −0x8(%rbp),%eax } 27: c9 leaveq 28: c3 retq Things to know: • Condition Codes (Flags): Zero, Sign, Carry, etc. • Call Stack: Stack frame, stack pointer, base pointer • ABI: Calling conventionsadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  52. 52. Machine Language Loops (same listing as above). Want to make those yourself? Write myprogram.c, then: $ cc -c myprogram.c and $ objdump --disassemble myprogram.o. Things to know: • Condition Codes (Flags): Zero, Sign, Carry, etc. • Call Stack: Stack frame, stack pointer, base pointer • ABI: Calling conventions. adapted from Berger & Klöckner (NYU 2010)
  53. 53. We know how a computer works! All of this can be built in about 4000 transistors. (e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600) So what exactly is Intel doing with the other 581,996,000 transistors? Answer:adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  54. 54. We know how a computer works! All of this can be built in about 4000 transistors. (e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600) So what exactly is Intel doing with the other 581,996,000 transistors? Answer: Make things go faster!adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  55. 55. We know how a computer works! All of this can be built in about 4000 transistors. (e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600) So what exactly is Intel doing with the other 581,996,000 transistors? Answer: Make things go faster! Goal now: Understand sources of slowness, and how they get addressed. Remember: High Performance Computingadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  56. 56. The High-Performance Mindset Writing high-performance Codes Mindset: What is going to be the limiting factor? • ALU? • Memory? • Communication? (if multi-machine) Benchmark the assumed limiting factor right away. Evaluate • Know your peak throughputs (roughly) • Are you getting close? • Are you tracking the right limiting factor?adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  57. 57. Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  58. 58. Source of Slowness: Memory Memory is slow. Distinguish two different versions of “slow”: • Bandwidth • Latency → Memory has long latency, but can have large bandwidth. Size of die vs. distance to memory: big! Dynamic RAM: long intrinsic latency!adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  59. 59. Source of Slowness: Memory Memory is slow. Distinguish two different versions of “slow”: • Bandwidth • Latency → Memory has long latency, but can have large bandwidth. Idea: Put a look-up table of recently-used data onto the chip. Size of die vs. distance to memory: big! → “Cache” Dynamic RAM: long intrinsic latency!adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  60. 60. The Memory Hierarchy Hierarchy of increasingly bigger, slower memories: faster Registers 1 kB, 1 cycle L1 Cache 10 kB, 10 cycles L2 Cache 1 MB, 100 cycles DRAM 1 GB, 1000 cycles Virtual Memory 1 TB, 1 M cycles (hard drive) biggeradapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  61. 61. Performance of computer system [figure, from Scott et al. “Scientific Parallel Computing” (2005): performance vs. size of problem being solved; performance steps down as the problem goes from fitting within registers, to fitting within cache, to fitting within main memory, to requiring secondary (disk) memory, to being too big for the system]
  62. 62. The Memory Hierarchy Hierarchy of increasingly bigger, slower memories: Registers 1 kB, 1 cycle L1 Cache 10 kB, 10 cycles L2 Cache 1 MB, 100 cycles DRAM 1 GB, 1000 cycles Virtual Memory 1 TB, 1 M cycles (hard drive) How might data locality factor into this? What is a working set?adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  63. 63. Cache: Actual Implementation Demands on cache implementation: • Fast, small, cheap, low-power • Fine-grained • High “hit”-rate (few “misses”) Problem: Goals at odds with each other: Access matching logic expensive! Solution 1: More data per unit of access matching logic → Larger “Cache Lines” Solution 2: Simpler/less access matching logic → Less than full “Associativity” Other choices: Eviction strategy, sizeadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  64. 64. Cache: Associativity [diagrams: memory locations 0, 1, 2, 3, ... mapping onto cache slots 0-3, shown for a Direct Mapped cache and for a 2-way set associative cache] adapted from Berger & Klöckner (NYU 2010)
  65. 65. Cache: Associativity (same diagrams as above, plus a plot of miss rate versus cache size on the Integer portion of SPEC CPU2000 [Cantin, Hill 2003]) adapted from Berger & Klöckner (NYU 2010)
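To connect lines and sets to actual addresses, here is an illustrative sketch (my example, not from the slides) of how a toy direct-mapped cache with 64-byte lines and 64 sets would split an address into offset, set index and tag:

   #include <stdio.h>
   #include <stdint.h>

   #define LINE_SIZE 64u   /* bytes per cache line (assumed) */
   #define NUM_SETS  64u   /* number of sets (assumed)       */

   int main(void)
   {
       uintptr_t addr   = 0x12345678u;
       uintptr_t offset = addr % LINE_SIZE;               /* byte within the line     */
       uintptr_t set    = (addr / LINE_SIZE) % NUM_SETS;  /* which slot it must go to */
       uintptr_t tag    = addr / (LINE_SIZE * NUM_SETS);  /* distinguishes aliases    */

       printf("offset=%lu set=%lu tag=0x%lx\n",
              (unsigned long) offset, (unsigned long) set, (unsigned long) tag);
       return 0;
   }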
  66. 66. Cache Example: Intel Q6600/Core2 Quad
   --- L1 data cache ---    fully associative cache = false; threads sharing this cache = 0x0 (0); processor cores on this die = 0x3 (3); system coherency line size = 0x3f (63); ways of associativity = 0x7 (7); number of sets - 1 (s) = 63
   --- L1 instruction ---   fully associative cache = false; threads sharing this cache = 0x0 (0); processor cores on this die = 0x3 (3); system coherency line size = 0x3f (63); ways of associativity = 0x7 (7); number of sets - 1 (s) = 63
   --- L2 unified cache --- fully associative cache = false; threads sharing this cache = 0x1 (1); processor cores on this die = 0x3 (3); system coherency line size = 0x3f (63); ways of associativity = 0xf (15); number of sets - 1 (s) = 4095
   More than you care to know about your CPU: http://www.etallen.com/cpuid.html adapted from Berger & Klöckner (NYU 2010)
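If you just want the cache geometry programmatically, glibc exposes (non-portable) sysconf keys for it; a minimal sketch, assuming a Linux/glibc system:

   #include <stdio.h>
   #include <unistd.h>

   int main(void)
   {
       /* glibc-specific keys; other libcs may return -1 or 0 here */
       long l1d_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
       long l1d_size = sysconf(_SC_LEVEL1_DCACHE_SIZE);
       long l2_size  = sysconf(_SC_LEVEL2_CACHE_SIZE);

       printf("L1d line: %ld B, L1d size: %ld B, L2 size: %ld B\n",
              l1d_line, l1d_size, l2_size);
       return 0;
   }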
  67. 67. Measuring the Cache I

   #include <stdlib.h>   /* malloc, free */

   void go(unsigned count, unsigned stride)
   {
     const unsigned arr_size = 64 * 1024 * 1024;
     int *ary = (int *) malloc(sizeof(int) * arr_size);

     for (unsigned it = 0; it < count; ++it)
     {
       for (unsigned i = 0; i < arr_size; i += stride)
         ary[i] *= 17;
     }

     free(ary);
   }

   adapted from Berger & Klöckner (NYU 2010)
  69. 69. Measuring the Cache II

   void go(unsigned array_size, unsigned steps)
   {
     int *ary = (int *) malloc(sizeof(int) * array_size);
     unsigned asm1 = array_size - 1;   /* mask assumes array_size is a power of two */

     for (unsigned i = 0; i < steps; ++i)
       ary[(i * 16) & asm1]++;

     free(ary);
   }

   adapted from Berger & Klöckner (NYU 2010)
  71. 71. Measuring the Cache III

   void go(unsigned array_size, unsigned stride, unsigned steps)
   {
     char *ary = (char *) malloc(sizeof(int) * array_size);
     unsigned p = 0;

     for (unsigned i = 0; i < steps; ++i)
     {
       ary[p]++;
       p += stride;
       if (p >= array_size)
         p = 0;
     }

     free(ary);
   }

   adapted from Berger & Klöckner (NYU 2010)
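A minimal driver one might use to time these kernels and expose the cache effects (my sketch, not from the slides; the sizes and the POSIX clock_gettime timer are assumptions):

   #include <stdio.h>
   #include <time.h>

   void go(unsigned count, unsigned stride);   /* version I above */

   static double seconds(void)
   {
       struct timespec ts;
       clock_gettime(CLOCK_MONOTONIC, &ts);
       return ts.tv_sec + 1e-9 * ts.tv_nsec;
   }

   int main(void)
   {
       /* sweep the stride; per-element time rises with stride
          until each access touches its own cache line */
       for (unsigned stride = 1; stride <= 1024; stride *= 2)
       {
           double t0 = seconds();
           go(1, stride);
           double t1 = seconds();
           printf("stride %4u : %.3f s\n", stride, t1 - t0);
       }
       return 0;
   }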
  73. 73. Mike Bauer (Stanford)
  74. 74. http://sequoia.stanford.edu/Tue 4/5/11: Guest Lecture by Mike Bauer (Stanford)
  75. 75. Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  76. 76. Source of Slowness: Sequential Operation IF Instruction fetch ID Instruction Decode EX Execution MEM Memory Read/Write WB Result Writebackadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  77. 77. Solution: Pipeliningadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  78. 78. Pipelining (MIPS, 110,000 transistors)adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  79. 79. Issues with Pipelines Pipelines generally help performance–but not always. Possible issues: • Stalls • Dependent Instructions • Branches (+Prediction) • Self-Modifying Code “Solution”: Bubbling, extra circuitryadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  80. 80. Intel Q6600 Pipelineadapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  81. 81. Intel Q6600 Pipeline New concept: Instruction-level parallelism (“Superscalar”)adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  82. 82. Programming for the Pipeline. How to upset a processor pipeline:

   for (int i = 0; i < 1000; ++i)
     for (int j = 0; j < 1000; ++j)
     {
       if (j % 2 == 0)
         do_something(i, j);
     }

   . . . why is this bad? adapted from Berger & Klöckner (NYU 2010)
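One common fix, shown as a sketch (do_something is the slide’s placeholder): step over the even values directly, so the inner loop does not test and branch on every iteration.

   void do_something(int i, int j);   /* placeholder from the slide */

   void work(void)
   {
       for (int i = 0; i < 1000; ++i)
           for (int j = 0; j < 1000; j += 2)   /* same calls, no per-iteration test */
               do_something(i, j);
   }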
  83. 83. A Puzzle int steps = 256 ∗ 1024 ∗ 1024; int [] a = new int[2]; // Loop 1 for (int i =0; i<steps; i ++) { a[0]++; a[0]++; } // Loop 2 for (int i =0; i<steps; i ++) { a[0]++; a[1]++; } Which is faster? . . . and why?adapted from Berger & Klöckner (NYU 2010) Intro Basics Assembly Memory Pipelines
  84. 84. Two useful Strategies (adapted from Berger & Klöckner, NYU 2010)

   Loop unrolling:

   for (int i = 0; i < 1000; ++i)
     do_something(i);

   becomes

   for (int i = 0; i < 500; i += 2)
   {
     do_something(i);
     do_something(i+1);
   }

   Software pipelining:

   for (int i = 0; i < 1000; ++i)
   {
     do_a(i);
     do_b(i);
   }

   becomes

   for (int i = 0; i < 500; i += 2)
   {
     do_a(i);
     do_a(i+1);
     do_b(i);
     do_b(i+1);
   }
  85. 85. SIMD. Control Units are large and expensive; Functional Units are simple and cheap. → Increase the Function/Control ratio: control several functional units with one control unit; all execute the same operation. [diagram: SIMD = one Instruction Pool driving several functional units over a Data Pool] GCC vector extensions:

   typedef int v4si __attribute__ ((vector_size (16)));
   v4si a, b, c;
   c = a + b;   // +, -, *, /, unary minus, ^, |, &, ~, %

   Will revisit for OpenCL, GPUs. adapted from Berger & Klöckner (NYU 2010)
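A complete little program using that extension, as a sketch (assumes GCC or Clang; element subscripting on vector types needs a reasonably recent compiler):

   #include <stdio.h>

   typedef int v4si __attribute__ ((vector_size (16)));   /* 4 x 32-bit int */

   int main(void)
   {
       v4si a = { 1, 2, 3, 4 };
       v4si b = { 10, 20, 30, 40 };
       v4si c = a + b;                 /* one SIMD add covers all four lanes */

       for (int i = 0; i < 4; ++i)
           printf("%d ", c[i]);
       printf("\n");                   /* 11 22 33 44 */
       return 0;
   }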
  86. 86. Architecture• What’s in a (basic) computer?• Basic Subsystems• Machine Language• Memory Hierarchy• Pipelines• CPUs to GPUs
  87. 87. GPUs ? • Designed for math-intensive, parallel problems • More transistors dedicated to ALU than to flow control and data cache
  88. 88. “CPU-style” cores: Fetch/Decode, ALU (Execute), Execution Context, plus out-of-order control logic, fancy branch predictor, memory pre-fetcher, and a (big) data cache. SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/ Credit: Kayvon Fatahalian (Stanford)
  89. 89. Slimming down. Idea #1: Remove components that help a single instruction stream run fast. [remaining: Fetch/Decode, ALU (Execute), Execution Context] Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  90. 90. More Space: Double the Number of Cores: two cores (two fragments in parallel). [diagram: two copies of Fetch/Decode, ALU (Execute), Execution Context, each running the same fragment program] Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  91. 91. . . . and again: four cores (four fragments in parallel). Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  92. 92. . . . and again: sixteen cores (sixteen fragments in parallel). 16 cores = 16 simultaneous instruction streams. Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  93. 93. . . . and again: sixteen cores (sixteen fragments in parallel). → 16 independent instruction streams. Reality: instruction streams not actually very different/independent. 16 cores = 16 simultaneous instruction streams. Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  94. 94. Recall: simple processing core. Saving Yet More Space. [Fetch/Decode, ALU (Execute), Execution Context] Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  95. 95. Recall: simple processing core. Saving Yet More Space. Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs → SIMD. Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  96. 96. Recall: simple processing core. Add ALUs. Saving Yet More Space. Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs → SIMD processing. [diagram: one Fetch/Decode driving ALUs 1-8, per-ALU contexts Ctx, Shared Ctx Data] Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  97. 97. Add ALUs. Saving Yet More Space. Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs → SIMD processing. [diagram: one Fetch/Decode driving ALUs 1-8, per-ALU contexts Ctx, Shared Ctx Data] Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  98. 98. Gratuitous Amounts of Parallelism! (128 fragments in parallel) 16 cores = 128 ALUs = 16 simultaneous instruction streams. http://www.youtube.com/watch?v=1yH_j8-VVLo Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  99. 99. Gratuitous Amounts of Parallelism! Example: 128 instruction streams in parallel; 16 independent groups of 8 synchronized streams. 16 cores = 128 ALUs = 16 simultaneous instruction streams. http://www.youtube.com/watch?v=1yH_j8-VVLo Credit: Kayvon Fatahalian (Stanford). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  100. 100. Intro PyOpenCL What and Why? OpenCLRemaining Problem: Slow Memory Problem Memory still has very high latency. . . . . . but we’ve removed most of the hardware that helps us deal with that. We’ve removed caches branch prediction out-of-order execution So what now? slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  101. 101. Intro PyOpenCL What and Why? OpenCLRemaining Problem: Slow Memory Problem Memory still has very high latency. . . . . . but we’ve removed most of the hardware that helps us deal with that. We’ve removed caches branch prediction Idea #3 out-of-order execution Even more parallelism So what now? + Some extra memory = A solution! slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  102. 102. Remaining Problem: Slow Memory. Memory still has very high latency... but we’ve removed most of the hardware that helps us deal with that (caches, branch prediction, out-of-order execution). Idea #3: Even more parallelism + some extra memory = a solution! [diagram: one Fetch/Decode, eight ALUs, per-ALU contexts Ctx, Shared Ctx Data] slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  103. 103. Remaining Problem: Slow Memory. [diagram: the same core, with its execution contexts grouped as 1, 2, 3, 4] Idea #3: Even more parallelism + some extra memory = a solution! slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  104. 104. Intro PyOpenCL What and Why? OpenCLGPU Architecture Summary Core Ideas: 1 Many slimmed down cores → lots of parallelism 2 More ALUs, Fewer Control Units 3 Avoid memory stalls by interleaving execution of SIMD groups (“warps”) Credit: Kayvon Fatahalian (Stanford) slide by Andreas Kl¨ckner o GPU-Python with PyOpenCL and PyCUDA
  105. 105. Is it free? What are the consequences? The program must be more predictable: • data access coherency • program flow. slide by Matthew Bolitho
  106. 106. Some terminology: shared memory (global memory accessible by all processors; information exchanged through shared variables) vs. distributed memory (private memory per processor; information exchanged by sending data over an interconnection network using explicit communication operations). [diagrams: processors P with memories M attached to an Interconnection Network, in a shared-memory and in a distributed-memory arrangement] Now: mostly hybrid.
  107. 107. Some More Terminology. One way to classify machines distinguishes between shared memory (global memory can be accessed by all processors or cores; information exchanged between threads using shared variables written by one thread and read by another; need to coordinate access to shared variables) and distributed memory (private memory for each processor, only accessible by this processor, so no synchronization for memory accesses needed; information exchanged by sending data from one processor to another via an interconnection network using explicit communication operations).
  108. 108. Programming Model (Overview)
  109. 109. GPU ArchitectureCUDA Programming Model
  110. 110. Connection: Hardware ↔ Programming Model. [diagram: a grid of cores, each with Fetch/Decode, 32 kiB Ctx Private (“Registers”) and 16 kiB Ctx Shared] slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  111. 111. Connection: Hardware ↔ Programming Model. How many cores? Idea: Program as if there were “infinitely” many cores, and as if there were “infinitely” many ALUs per core. slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  112. 112. Connection: Hardware ↔ Programming Model. Consider: Which is easy to do automatically? Parallel program → sequential hardware, or sequential program → parallel hardware? slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  113. 113. Connection: Hardware ↔ Programming Model. Software representation (a grid indexed along Axis 0 and Axis 1) ↔ Hardware (the array of cores). slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA
  114. 114. Connection: Hardware ↔ Programming Model. Grid (Kernel: Function on Grid); (Work) Group or “Block”; (Work) Item or “Thread”. Software representation ↔ Hardware. slide by Andreas Klöckner, GPU-Python with PyOpenCL and PyCUDA