Introduction to Parallel Computing
BY
MISS ZARAH ZAINAB
LECTURER CS DEPARTMENT
"For over a decades that the organization of a single computer has
reached its limits and that truly significant advances can be made only
by interconnection of a multiplicity of computers." Gene Amdahl in
1967.
Content
Implicit Parallelism: Trends in Microprocessor Architectures
Limitations of Memory System Performance
Communication Costs in Parallel Machines
9/30/2023 2
Introduction to Parallelism
The traditional logical view of a sequential computer consists of a memory connected to a processor
via a data path.
All three components – processor, memory, and data path – present bottlenecks to the overall
processing rate of a computer system.
A number of architectural innovations over the years have addressed these bottlenecks. One of the
most important innovations is multiplicity: the use of multiple processors, memory units, and data paths.
This lecture presents an overview of important architectural concepts as they relate to parallel
processing.
Implicit Parallelism: Trends in Microprocessor Architectures
Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of
magnitude).
Higher levels of device integration have made available a large number of transistors.
The question of how best to utilize these resources is an important one.
Current processors use these resources in multiple functional units and execute multiple instructions in the
same cycle.
The precise way these instructions are selected and executed provides impressive diversity in architectures.
a. Pipelining Execution Method
Pipelining overlaps various stages of instruction execution to improve performance.
At a high level of abstraction, an instruction can be executed while the next one is being decoded
and the next one is being fetched.
To explain the idea of pipelining, consider the car-assembly example on the next slide.
Contd…
If the assembly of a car, taking 100 time units, can be broken into 10 pipelined stages of 10 units
each, a single assembly line can produce a car every 10 time units! This represents a 10-fold speedup
over producing cars entirely serially, one after the other.
For further details, we will take a closer look at pipelining in a separate set of slides.
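The car-assembly arithmetic above can be sketched in a few lines (a minimal illustration, not from the slides; the function name is ours):

```python
# Minimal sketch of the car-assembly analogy: once a balanced pipeline
# is full, one item completes every (total_time / n_stages) time units.
def pipelined_interval(total_time, n_stages):
    """Time between finished items for a balanced pipeline."""
    return total_time / n_stages

serial_interval = 100                        # one car per 100 units, serially
pipe_interval = pipelined_interval(100, 10)  # 10 stages of 10 units each
speedup = serial_interval / pipe_interval
print(pipe_interval, speedup)                # 10.0 10.0
```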
Contd…
Pipelining, however, has several limitations.
The speed of a pipeline is eventually limited by the slowest stage.
For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-
the-art Pentium processors).
However, in typical program traces, every fifth or sixth instruction is a conditional jump! This
requires very accurate branch prediction.
The penalty of a misprediction grows with the depth of the pipeline, since a larger number of
instructions will have to be flushed.
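How the misprediction penalty grows with depth can be illustrated with a simple analytical sketch (the parameters are illustrative assumptions, not from the slides: one conditional jump every 5 instructions, a 5% misprediction rate, and a full-pipeline flush on each miss):

```python
# Illustrative model: effective cycles per instruction (CPI) when a
# misprediction flushes a pipeline of the given depth.
def effective_cpi(depth, branch_freq=0.2, mispredict_rate=0.05):
    flush_penalty = depth                  # cycles lost per misprediction
    return 1 + branch_freq * mispredict_rate * flush_penalty

for depth in (5, 10, 20):
    print(depth, effective_cpi(depth))     # penalty grows with depth
```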
b. Superscalar Execution Method
A scalar processor executes a single instruction per clock cycle; a superscalar processor can
execute more than one instruction during a clock cycle.
A basic processor, in general, executes one instruction per clock cycle. To improve processing
speed beyond this limit, the superscalar processor combines pipelining with multiple functional
units, enabling it to issue two or more instructions per clock cycle.
Superscalar Execution
Scheduling of instructions is determined by a number of factors:
True Data Dependency: The result of one operation is an input to the next.
Resource Dependency: Two operations require the same resource.
Branch Dependency: Scheduling instructions across conditional branch statements cannot be done
deterministically a priori.
The scheduler, a piece of hardware, looks at a large number of instructions in an instruction
queue and selects an appropriate number of instructions to execute concurrently based on these
factors.
The complexity of this hardware is an important constraint on superscalar processors.
Superscalar Execution: Efficiency Considerations
Not all functional units can be kept busy at all times.
If during a cycle, no functional units are utilized, this is referred to as vertical waste.
If during a cycle, only some of the functional units are utilized, this is referred to as horizontal
waste.
Due to limited parallelism in typical instruction traces, dependencies, or the inability of the
scheduler to extract parallelism, the performance of superscalar processors is eventually limited.
Conventional microprocessors typically support four-way superscalar execution.
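Vertical and horizontal waste can be counted directly from an issue trace; the sketch below uses made-up issue counts for a four-way processor (illustrative data, not from the slides):

```python
# Count wasted issue slots in a 4-way superscalar trace.
# issued[i] = number of functional units kept busy in cycle i.
WIDTH = 4
issued = [4, 2, 0, 3, 1, 0, 4, 2]

vertical = sum(1 for n in issued if n == 0)                   # fully idle cycles
horizontal = sum(WIDTH - n for n in issued if 0 < n < WIDTH)  # partly idle slots
utilization = sum(issued) / (WIDTH * len(issued))
print(vertical, horizontal, utilization)                      # 2 8 0.5
```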
Limitations of Memory System Performance
The memory system, not processor speed, is often the bottleneck for many applications.
Memory system performance is largely captured by two parameters, latency and bandwidth.
Latency is the time from the issue of a memory request to the time the data is available at the
processor.
Bandwidth is the rate at which data can be pumped to the processor by the memory system.
Memory System Performance: Bandwidth and Latency
It is very important to understand the difference between latency and bandwidth.
Consider the example of a fire-hose. If the water comes out of the hose two seconds after the
hydrant is turned on, the latency of the system is two seconds.
Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the
bandwidth of the system is 5 gallons/second.
If you want immediate response from the hydrant, it is important to reduce latency.
If you want to fight big fires, you want high bandwidth.
Memory Latency: An Example
Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100
ns (no caches). Assume that the processor has two multiply-add units and is capable of executing
four instructions in each cycle of 1 ns. The following observations follow:
◦ The peak processor rating is 4 GFLOPS.
◦ Since the memory latency is equal to 100 cycles and block size is one word, every time a
memory request is made, the processor must wait 100 cycles before it can process the data.
Memory Latency: An Example (continued)
On the above architecture, consider the problem of computing a dot-product of two vectors.
◦ A dot-product computation performs one multiply-add on a single pair of vector elements, i.e.,
each floating point operation requires one data fetch.
◦ It follows that the peak speed of this computation is limited to one floating point operation
every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating!
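The 10 MFLOPS bound follows directly from the numbers above; a quick check of the arithmetic, matching the slide's assumptions:

```python
# Memory-bound rate for the dot-product: with no cache, each floating
# point operation waits out one 100 ns DRAM access.
peak_gflops = 4 * 1.0                       # 4 instructions/cycle at 1 GHz
mem_latency_ns = 100
bound_mflops = 1e9 / mem_latency_ns / 1e6   # one operation per access
print(bound_mflops)                         # 10.0
print(bound_mflops / (peak_gflops * 1000))  # 0.0025 of peak
```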
Improving Effective Memory Latency Using Caches
Caches are small and fast memory elements between the processor and DRAM.
This memory acts as a low-latency high-bandwidth storage.
If a piece of data is repeatedly used, the effective latency of this memory system can be reduced
by the cache.
The fraction of data references satisfied by the cache is called the cache hit ratio of the
computation on the system.
The cache hit ratio achieved by a code on a memory system often determines its performance.
Impact of Caches: Example
Consider the architecture from the previous example. In this case, we introduce a cache of size 32
KB with a latency of 1 ns or one cycle. We use this setup to multiply two matrices A and B of
dimensions 32 × 32. We have carefully chosen these numbers so that the cache is large enough to
store matrices A and B, as well as the result matrix C.
Impact of Caches: Example (continued)
The following observations can be made about the problem:
◦ Fetching the two matrices into the cache corresponds to fetching 2K words, which takes approximately
200 µs.
◦ Multiplying two n × n matrices takes 2n3 operations. For our problem, this corresponds to 64K
operations, which can be performed in 16K cycles (or 16 µs) at four instructions per cycle.
◦ The total time for the computation is therefore approximately the sum of time for load/store operations
and the time for the computation itself, i.e., 200 + 16 µs.
◦ This corresponds to a peak computation rate of 64K operations in 216 µs, or approximately 303 MFLOPS.
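The arithmetic of this example can be reproduced as follows (using the slide's rounded 200 µs load time):

```python
# Cache example: 32 x 32 matrix multiply on the 1 GHz, 4-way processor.
n = 32
ops = 2 * n ** 3                 # 2n^3 = 64K multiply/add operations
cycles = ops // 4                # four instructions per cycle -> 16K cycles
compute_us = cycles / 1000       # 1 ns per cycle -> ~16 us
load_us = 200                    # fetching the 2K input words from DRAM
rate_mflops = ops / (load_us + compute_us)   # ops per microsecond == MFLOPS
print(round(rate_mflops))        # 303
```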
c. Communication Costs in Parallel Machines
Along with idling and contention, communication is a major overhead in parallel programs.
The cost of communication is dependent on a variety of features including
the programming model semantics,
the network topology,
data handling and routing,
and associated software protocols.
Message Passing Costs in Parallel Computers
The total time to transfer a message over a network comprises the following:
Startup time (ts): Time spent at sending and receiving nodes (executing the routing algorithm,
programming routers, etc.).
Per-hop time (th): This time is a function of the number of hops and includes factors such as switch
latencies, network delays, etc.
Per-word transfer time (tw): This time includes all overheads that are determined by the length of the
message. This includes bandwidth of links, error checking and correction, etc.
Packet Routing
Store-and-forward makes poor use of communication resources.
Packet routing breaks messages into packets and pipelines them through the network.
Since packets may take different paths, each packet must carry routing information, error
checking, sequencing, and other related header information.
The total communication time for packet routing is approximated by
t_comm = ts + th · l + tw · m,
where l is the number of links traversed and m is the message size in words.
The factor tw accounts for overheads in packet headers.
Cut-Through Routing
Cut-through routing takes the concept of packet routing to an extreme by further dividing messages
into basic units called flits (flow-control digits).
Since flits are typically small, the header information must be minimized.
This is done by forcing all flits to take the same path, in sequence.
A tracer message first programs all intermediate routers. All flits then take the same route.
Error checks are performed on the entire message, as opposed to flits.
No sequence numbers are needed.
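Using the parameters ts, th, and tw defined earlier, the standard cost models can be contrasted in a short sketch (the example values are made up for illustration):

```python
# Communication time for an m-word message over l links.
def store_and_forward(ts, th, tw, m, l):
    # each intermediate node receives the whole message before forwarding
    return ts + (m * tw + th) * l

def cut_through(ts, th, tw, m, l):
    # flits are pipelined through the network, so the per-word
    # cost is paid only once rather than at every hop
    return ts + l * th + m * tw

print(store_and_forward(10, 1, 0.5, 100, 8))   # 418.0
print(cut_through(10, 1, 0.5, 100, 8))         # 68.0
```

For long messages over many hops, the pipelined model is dramatically cheaper, which is why cut-through routing dominates in practice.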
Routing Mechanisms for Interconnection Networks
How does one compute the route that a message takes from source to destination?
Routing must prevent deadlocks - for this reason, we use dimension-ordered or e-cube
routing.
Routing must avoid hot-spots - for this reason, two-step routing is often used. In this case, a
message from source s to destination d is first sent to a randomly chosen intermediate
processor i and then forwarded to destination d.
End of Lecture-02

More Related Content

Similar to week_2Lec02_CS422.pptx

TCP INCAST AVOIDANCE BASED ON CONNECTION SERIALIZATION IN DATA CENTER NETWORKS
TCP INCAST AVOIDANCE BASED ON CONNECTION SERIALIZATION IN DATA CENTER NETWORKSTCP INCAST AVOIDANCE BASED ON CONNECTION SERIALIZATION IN DATA CENTER NETWORKS
TCP INCAST AVOIDANCE BASED ON CONNECTION SERIALIZATION IN DATA CENTER NETWORKSIJCNCJournal
 
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
 Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridadMarketing Donalba
 
Pipeline Mechanism
Pipeline MechanismPipeline Mechanism
Pipeline MechanismAshik Iqbal
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpeSAT Publishing House
 
Low power network on chip architectures: A survey
Low power network on chip architectures: A surveyLow power network on chip architectures: A survey
Low power network on chip architectures: A surveyCSITiaesprime
 
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...ITIIIndustries
 
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPA ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPijaceeejournal
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Parallel Algorithms Advantages and Disadvantages
Parallel Algorithms Advantages and DisadvantagesParallel Algorithms Advantages and Disadvantages
Parallel Algorithms Advantages and DisadvantagesMurtadha Alsabbagh
 
Interconnect technology Final Revision
Interconnect technology Final RevisionInterconnect technology Final Revision
Interconnect technology Final RevisionMatthew Agostinelli
 
Application Behavior-Aware Flow Control in Network-on-Chip
Application Behavior-Aware Flow Control in Network-on-ChipApplication Behavior-Aware Flow Control in Network-on-Chip
Application Behavior-Aware Flow Control in Network-on-ChipIvonne Liu
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSijdpsjournal
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentEricsson
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Traffic-aware adaptive server load balancing for softwaredefined networks
Traffic-aware adaptive server load balancing for softwaredefined networks Traffic-aware adaptive server load balancing for softwaredefined networks
Traffic-aware adaptive server load balancing for softwaredefined networks IJECEIAES
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingeSAT Journals
 
Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresCloudLightning
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel ComputingRoshan Karunarathna
 

Similar to week_2Lec02_CS422.pptx (20)

TCP INCAST AVOIDANCE BASED ON CONNECTION SERIALIZATION IN DATA CENTER NETWORKS
TCP INCAST AVOIDANCE BASED ON CONNECTION SERIALIZATION IN DATA CENTER NETWORKSTCP INCAST AVOIDANCE BASED ON CONNECTION SERIALIZATION IN DATA CENTER NETWORKS
TCP INCAST AVOIDANCE BASED ON CONNECTION SERIALIZATION IN DATA CENTER NETWORKS
 
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
 Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
Procesamiento multinúcleo óptimo para aplicaciones críticas de seguridad
 
Pipeline Mechanism
Pipeline MechanismPipeline Mechanism
Pipeline Mechanism
 
Hardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmpHardback solution to accelerate multimedia computation through mgp in cmp
Hardback solution to accelerate multimedia computation through mgp in cmp
 
Low power network on chip architectures: A survey
Low power network on chip architectures: A surveyLow power network on chip architectures: A survey
Low power network on chip architectures: A survey
 
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
 
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPA ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
shashank_spdp1993_00395543
shashank_spdp1993_00395543shashank_spdp1993_00395543
shashank_spdp1993_00395543
 
Parallel Algorithms Advantages and Disadvantages
Parallel Algorithms Advantages and DisadvantagesParallel Algorithms Advantages and Disadvantages
Parallel Algorithms Advantages and Disadvantages
 
Interconnect technology Final Revision
Interconnect technology Final RevisionInterconnect technology Final Revision
Interconnect technology Final Revision
 
Application Behavior-Aware Flow Control in Network-on-Chip
Application Behavior-Aware Flow Control in Network-on-ChipApplication Behavior-Aware Flow Control in Network-on-Chip
Application Behavior-Aware Flow Control in Network-on-Chip
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Traffic-aware adaptive server load balancing for softwaredefined networks
Traffic-aware adaptive server load balancing for softwaredefined networks Traffic-aware adaptive server load balancing for softwaredefined networks
Traffic-aware adaptive server load balancing for softwaredefined networks
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
 
Simulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud InfrastructuresSimulation of Heterogeneous Cloud Infrastructures
Simulation of Heterogeneous Cloud Infrastructures
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
 
Multiprocessing operating systems
Multiprocessing operating systemsMultiprocessing operating systems
Multiprocessing operating systems
 

Recently uploaded

Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacingjaychoudhary37
 

Recently uploaded (20)

Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
microprocessor 8085 and its interfacing
microprocessor 8085  and its interfacingmicroprocessor 8085  and its interfacing
microprocessor 8085 and its interfacing
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 

week_2Lec02_CS422.pptx

  • 1. Introduction to Parallel Computing BY MISS ZARAH ZAINAB LECTURER CS DEPARTMENT "For over a decades that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers." Gene Amdahl in 1967.
  • 2. Content Implicit Parallelism: Trends in Microprocessor Architectures Limitations of Memory System Performance Communication Costs in Parallel Machines 9/30/2023 2
  • 3. Introduction to parallelism The traditional logical view of a sequential computer consists of a memory connected to a processor via a data path. All three components – processor, memory, and data path – present bottlenecks to the overall processing rate of a computer system. A number of architectural innovations over the years have addressed these bottlenecks. One of the most important innovations is multiplicity an overview of important architectural concepts are presented in this lecture as they relate to parallel processing. 9/30/2023 3
  • 4. Implicit Parallelism: Trends in MicroprocessorArchitectures Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude). Higher levels of device integration have made available a large number of transistors. The question of how best to utilize these resources is an important one. Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle. The precise way these instructions are selected and executed provides impressive diversity in architectures. 9/30/2023 4
  • 5. a. Pipelining Execution Method Pipelining overlaps various stages of instruction execution to achieve performance. At a high level of abstraction, an instruction can be executed while the next one is being decoded and the next one is being fetched. To explain this idea of pipelining consider the following example in the next slide of car assembly. 9/30/2023 5
  • 6. Contd… If the assembly of a car, taking 100 time units, can be broken into 10 pipelined stages of 10 units each, a single assembly line can produce a car every 10 time units! This represents a 10-fold speedup over producing cars entirely serially, one after the other. For further details lets have a look at pipelining in separate ppt slides in a glance. 9/30/2023 6
  • 7. Contd… Pipelining, however, has several limitations. The speed of a pipeline is eventually limited by the slowest stage. For this reason, conventional processors rely on very deep pipelines (20 stage pipelines in state-of- the-art Pentium processors). However, in typical program traces, every 5-6th instruction is a conditional jump! This requires very accurate branch prediction. The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed. 9/30/2023 7
  • 8. b. Superscalar Execution Method A scalar processor executes a single instruction per clock cycle; a superscalar processor can execute more than one instruction during a clock cycle. To improve processing speed beyond what a single pipeline allows, superscalar processors combine pipelining with multiple functional units, enabling them to execute two or more instructions per clock cycle.
  • 10. Superscalar Execution Scheduling of instructions is determined by a number of factors: True Data Dependency: the result of one operation is an input to the next. Resource Dependency: two operations require the same resource. Branch Dependency: scheduling instructions across conditional branch statements cannot be done deterministically a priori. The scheduler, a piece of hardware, examines a large number of instructions in an instruction queue and selects an appropriate set to execute concurrently based on these factors. The complexity of this hardware is an important constraint on superscalar processors.
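A true data dependency can be illustrated with a toy instruction encoding; the tuple format and register names below are purely illustrative, not a real ISA:

```python
# Each toy instruction: (opcode, destination, source1, source2)
i1 = ("add", "r1", "r2", "r3")   # r1 = r2 + r3
i2 = ("mul", "r4", "r1", "r5")   # r4 = r1 * r5

def true_data_dependency(first, second):
    # `second` truly depends on `first` if it reads the register that
    # `first` writes; the scheduler cannot issue them in the same cycle.
    return first[1] in second[2:]

print(true_data_dependency(i1, i2))   # True: i2 reads r1
```

A real scheduler performs this kind of check, in hardware, across a window of dozens of queued instructions every cycle, which is where the complexity constraint mentioned above comes from.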
  • 11. Superscalar Execution: Efficiency Considerations Not all functional units can be kept busy at all times. If, during a cycle, no functional units are utilized, this is referred to as vertical waste. If only some of the functional units are utilized, this is referred to as horizontal waste. Due to limited parallelism in typical instruction traces, dependencies, or the inability of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited. Conventional microprocessors typically support four-way superscalar execution.
  • 12. Limitations of Memory System Performance The memory system, not processor speed, is often the bottleneck for many applications. Memory system performance is largely captured by two parameters, latency and bandwidth. Latency is the time from the issue of a memory request to the time the data is available at the processor. Bandwidth is the rate at which data can be pumped to the processor by the memory system.
  • 13. Memory System Performance: Bandwidth and Latency It is very important to understand the difference between latency and bandwidth. Consider the example of a fire-hose. If the water comes out of the hose two seconds after the hydrant is turned on, the latency of the system is two seconds. Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the bandwidth of the system is 5 gallons/second. If you want immediate response from the hydrant, it is important to reduce latency. If you want to fight big fires, you want high bandwidth.
  • 14. Memory Latency: An Example Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The following observations can be made: ◦ The peak processor rating is 4 GFLOPS. ◦ Since the memory latency is 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.
  • 15. Memory Latency: An Example On the above architecture, consider the problem of computing a dot-product of two vectors. ◦ A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch. ◦ It follows that the peak speed of this computation is limited to one floating point operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating!
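The two observations translate into a short calculation using only the numbers given above (1 ns cycle, 100 ns DRAM latency, four instructions per cycle):

```python
CYCLE_NS = 1        # 1 GHz clock
LATENCY_NS = 100    # DRAM latency, no caches
IPC = 4             # two multiply-add units -> 4 FLOPs per cycle

peak_gflops = IPC / CYCLE_NS          # 4 GFLOPS peak rating

# Dot product: every FLOP needs one operand fetched from memory, so
# the processor completes one FLOP per 100 ns access, i.e. 10 FLOPs
# per microsecond, which is 10 MFLOPS.
achieved_mflops = 1000 / LATENCY_NS

print(peak_gflops, achieved_mflops)   # 4.0 vs 10.0
```

The achieved rate is a factor of 400 below peak, entirely due to latency, which motivates the caches introduced on the next slide.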
  • 16. Improving Effective Memory Latency Using Caches Caches are small, fast memory elements between the processor and DRAM. This memory acts as low-latency, high-bandwidth storage. If a piece of data is repeatedly used, the effective latency of the memory system can be reduced by the cache. The fraction of data references satisfied by the cache is called the cache hit ratio of the computation on the system. The cache hit ratio achieved by a code on a memory system often determines its performance.
  • 17. Impact of Caches: Example Consider the architecture from the previous example. In this case, we introduce a cache of size 32 KB with a latency of 1 ns or one cycle. We use this setup to multiply two matrices A and B of dimensions 32 × 32. We have carefully chosen these numbers so that the cache is large enough to store matrices A and B, as well as the result matrix C.
  • 18. Impact of Caches: Example (continued) The following observations can be made about the problem: ◦ Fetching the two matrices into the cache corresponds to fetching 2K words, which takes approximately 200 µs. ◦ Multiplying two n × n matrices takes 2n3 operations. For our problem, this corresponds to 64K operations, which can be performed in 16K cycles (or 16 µs) at four instructions per cycle. ◦ The total time for the computation is therefore approximately the sum of the time for load/store operations and the time for the computation itself, i.e., 200 + 16 µs. ◦ This corresponds to a peak computation rate of 64K operations in 216 µs, or about 303 MFLOPS.
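Redoing the slide's arithmetic with exact values is a useful check (the slide rounds the 2048 × 100 ns fetch time down to 200 µs); the structure of the calculation is exactly as described above:

```python
N = 32               # matrix dimension
LATENCY_NS = 100     # per-word DRAM access time
IPC = 4              # operations per 1 ns cycle

words = 2 * N * N                       # fetch A and B: 2048 words
fetch_us = words * LATENCY_NS / 1000    # 204.8 us (slide rounds to 200)
ops = 2 * N ** 3                        # 65536 ops for n x n matmul
compute_us = ops / IPC / 1000           # 16.384 us
mflops = ops / (fetch_us + compute_us)  # ops per microsecond = MFLOPS
print(round(mflops))
```

With the slide's rounded figures the rate is 64K/216 ≈ 303 MFLOPS; with exact values it is about 296 MFLOPS. Either way, the cache buys a roughly 30-fold improvement over the uncached 10 MFLOPS of the previous example.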
  • 19. c. Communication Costs in Parallel Machines Along with idling and contention, communication is a major overhead in parallel programs. The cost of communication is dependent on a variety of features including the programming model semantics, the network topology, data handling and routing, and associated software protocols.
  • 20. Message Passing Costs in Parallel Computers The total time to transfer a message over a network comprises the following: Startup time (ts): time spent at the sending and receiving nodes (executing the routing algorithm, programming routers, etc.). Per-hop time (th): a function of the number of hops, including factors such as switch latencies and network delays. Per-word transfer time (tw): all overheads determined by the length of the message, including link bandwidth, error checking and correction, etc.
  • 21. Packet Routing Store-and-forward makes poor use of communication resources. Packet routing breaks messages into packets and pipelines them through the network. Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information. The total communication time for packet routing is approximated by tcomm = ts + th·l + tw·m, where l is the number of hops and m the message size in words. The factor tw accounts for overheads in packet headers.
  • 22. Cut-Through Routing Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits (flow-control digits). Since flits are typically small, the header information must be minimized. This is done by forcing all flits to take the same path, in sequence. A tracer message first programs all intermediate routers; all flits then take the same route. Error checks are performed on the entire message, as opposed to individual flits, and no sequence numbers are needed.
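The payoff of cut-through routing is easiest to see by comparing the standard cost models: store-and-forward pays the per-word cost at every hop, cut-through pays it only once. The parameter values below are made-up illustrative numbers, not measurements:

```python
ts, th, tw = 50.0, 2.0, 0.5   # startup, per-hop, per-word time (us)
m, l = 1000, 4                # message length in words, number of hops

# Store-and-forward: the whole message is received and retransmitted
# at every one of the l hops.
t_sf = ts + (m * tw + th) * l

# Cut-through: flits are pipelined through the l routers, so the
# per-word term is paid only once.
t_ct = ts + l * th + m * tw

print(t_sf, t_ct)   # 2058.0 vs 558.0
```

For long messages over many hops the store-and-forward time grows with the product m·l, while the cut-through time grows only with their sum, which is why modern networks use cut-through (or the closely related wormhole) routing.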
  • 23. Routing Mechanisms for Interconnection Networks How does one compute the route that a message takes from source to destination? Routing must prevent deadlocks - for this reason, we use dimension-ordered or e-cube routing. Routing must avoid hot-spots - for this reason, two-step routing is often used. In this case, a message from source s to destination d is first sent to a randomly chosen intermediate processor i and then forwarded to destination d.
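On a hypercube, dimension-ordered (e-cube) routing is simple to express: XOR the source and destination addresses and correct the differing bits in increasing dimension order. A minimal sketch:

```python
def ecube_route(src, dst):
    # Sequence of node addresses visited when correcting address bits
    # lowest dimension first (deadlock-free on a hypercube, since every
    # message traverses dimensions in the same fixed order).
    path, node = [src], src
    diff, dim = src ^ dst, 0
    while diff:
        if diff & 1:
            node ^= 1 << dim       # flip this dimension's bit
            path.append(node)
        diff >>= 1
        dim += 1
    return path

print(ecube_route(0b000, 0b101))   # [0, 1, 5] on a 3-cube
```

Two-step (randomized) routing would simply call this twice: once from s to a random intermediate node i, then from i to d, spreading traffic to avoid hot-spots at the cost of up to twice the hops.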