Lecture3
    Lecture3 Presentation Transcript

    • CS-416 Parallel and Distributed Systems
      Jawwad Shamsi
      Lecture #3
      20th January 2010
    • Announcement
      Possible Name Change to
      High Performance Computing
    • Recap
      Pipelining
      Vector Instruction
      Super Scalar Execution
    • Super-Scalar Execution
    • Dependencies
      Data Dependency
      Resource dependency
      Branch Dependency
    • Dynamic Instruction Issue
      3rd Segment
      Processor needs capability of
      Out of order sequencing
    • Limitations of Memory Systems
      Latency
      Bandwidth
    • Effect of Latency - Example
      1 GHz processor (1 ns clock cycle)
      100 ns memory latency
      Two multiply-add units: four floating-point instructions in each 1 ns cycle
      Peak rating: 4 GFLOPS
      Memory latency is 100 cycles; block size is one word
      The processor must wait 100 cycles before it can process the data
      Effective peak speed: 1 floating-point operation / 100 ns
      10 MFLOPS
    • Effect of Bandwidth
      Processor: 1 GHz
      DRAM latency: 100 cycles
      With a block size of one word, the processor takes 100 cycles to fetch each word
      Therefore, the algorithm performs one FLOP every 100 cycles, for a peak speed of 10 MFLOPS
      Increase the block size?
    • for (i = 0; i < 1000; i++) {
          column_sum[i] = 0.0;
          for (j = 0; j < 1000; j++)
              column_sum[i] += b[j][i];
      }
    • Pre-fetching
      Multi-Threading
    • Impact of bandwidth on multithreaded programs
      Threads share memory and cache
      Cache size per thread will be limited
      Limited cache-hit ratio
      Decrease in effective bandwidth
    • Simple Execution
      for (i = 0; i < n; i++)
          c[i] = dot_product(get_row(a, i), b);
    • Threaded Execution
      for (i = 0; i < n; i++)
          c[i] = create_thread(dot_product, get_row(a, i), b);
    • for (i = 0; i < 1000; i++)
          column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
          for (i = 0; i < 1000; i++)
              column_sum[i] += b[j][i];