ADVANCED COMPUTER
ARCHITECTURE
PRESENTED BY,
KRISHNA
1
Contents
• Vector Architecture
• SIMD instruction set extensions for multimedia
• Graphics processing units (GPUs)
• Comparison between vector architectures and GPUs
• Comparison between multimedia SIMD computers and GPUs
• Loop-level parallelism
• Finding dependences
2
Data parallelism
• Data parallelism is a form of parallelization across
multiple processors in parallel computing
environments.
• It focuses on distributing the data across different
nodes, which operate on the data in parallel. It can be
applied to regular data structures such as arrays and
matrices by working on each element in parallel.
3
SIMD
• Single Instruction, Multiple Data
• SIMD is a class of parallel computers in Flynn's taxonomy.
• It describes computers with multiple processing
elements that perform the same operation on multiple
data points simultaneously.
• Three variations of SIMD:
• vector architectures,
• multimedia SIMD instruction set extensions,
• graphics processing units (GPUs).
4
• SIMD architectures can exploit significant
data‐level parallelism for:
– matrix‐oriented scientific computing
– media‐oriented image and sound processors
• Most modern CPU designs include SIMD
instructions to improve the
performance of multimedia applications.
5
Vector Architecture
• Vector architectures are easier to understand
and to compile to than the other SIMD variations,
• but until recently they were considered too
expensive for microprocessors.
6
Vector Architecture
• Vector architectures grab sets of data
elements scattered about memory, place
them into large, sequential register files,
operate on data in those register files, and
then disperse the results back into memory.
7
Vector Architecture
• Vector registers are allocated and managed by the compiler
• Register files act as compiler-controlled buffers
• They are used to hide memory latency
• and to leverage memory bandwidth
8
Basic structure of a vector architecture
9
Components of vector architecture
• Vector registers
– Each register holds a 64-element vector, 64 bits per element
– Register file has 16 read ports and 8 write ports
• Vector functional units
– Fully pipelined
– Data and control hazards are detected
• Vector load-store unit
– Fully pipelined
– Words move between memory and registers
– One word per clock cycle after an initial latency
• Scalar registers
– 32 general-purpose registers
– 32 floating-point registers
10
How Vector Processors Work: An
Example
• Let’s take a typical vector problem, DAXPY
(double-precision a × X plus Y):
Y = a × X + Y
• X and Y are vectors, initially resident in memory, and a
is a scalar.
• Here is the VMIPS code for DAXPY.
L.D F0,a load scalar a
LV V1,Rx load vector X
MULVS.D V2,V1,F0 vector-scalar multiply
LV V3,Ry load vector Y
ADDVV.D V4,V2,V3 add
SV V4,Ry store the result
11
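For reference, the same DAXPY computation expressed as a plain scalar loop (a Python sketch; the VMIPS sequence above produces the identical result, but with one instruction per whole vector rather than one per element):

```python
def daxpy(a, x, y):
    """Y = a*X + Y, element by element.

    Maps onto the VMIPS code: LV loads x and y, MULVS.D forms
    a*x[i] for every i, ADDVV.D adds y[i], and SV stores back.
    """
    return [a * xi + yi for xi, yi in zip(x, y)]
```

For example, `daxpy(2.0, [1.0, 2.0], [10.0, 20.0])` gives `[12.0, 24.0]`.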
• In the MIPS code
• every ADD waits for the MUL, and every S.D waits for the ADD
• In VMIPS
• the pipeline stalls only for the first vector element;
subsequent elements flow smoothly down
the pipeline.
• A pipeline stall is required only once per vector
instruction!
12
Memory banks
• Memory system must be designed to support high bandwidth
for vector loads and stores
• Spread accesses across multiple banks
– Control bank addresses independently
– Load or store non-sequential words
– Support multiple vector processors sharing the same memory
• Example:
– 32 processors, each generating 4 loads and 2 stores/cycle
– Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns
– How many memory banks needed?
13
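Working the example through (a sketch of the standard calculation): the 32 processors together issue 32 × (4 + 2) = 192 references per cycle, and each SRAM bank stays busy for ⌈15 / 2.167⌉ = 7 processor cycles per access, so:

```python
import math

processors = 32
refs_per_cycle = processors * (4 + 2)   # 4 loads + 2 stores each = 192 refs/cycle
cpu_cycle_ns = 2.167
sram_cycle_ns = 15.0

# A bank stays busy for the full SRAM cycle, measured in processor cycles.
bank_busy_cycles = math.ceil(sram_cycle_ns / cpu_cycle_ns)   # 7 cycles

# To sustain 192 new references every cycle, that many banks must be
# free each cycle while earlier accesses are still in flight.
banks_needed = refs_per_cycle * bank_busy_cycles             # 192 * 7 = 1344
```

So 1344 memory banks are needed to sustain the full request rate.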
Vector Execution Time
• Execution time depends on three factors:
‐ Length of operand vectors
‐ Structural hazards
‐ Data dependencies
• VMIPS functional units consume one element
per clock cycle
• Execution time is approximately the vector length
14
Vector Processor Advantages
• Each instruction generates a lot of work
• Reduces instruction fetch bandwidth
• Highly regular memory access pattern
• Interleaving multiple banks for higher memory
bandwidth
• No need to explicitly code loops
• Fewer branches in the instruction sequence
15
Vector Processor Limitations
• Memory bandwidth can easily become a
bottleneck, especially if the
compute-to-memory-operation balance is
not maintained
16
SIMD Instruction Set Extensions for
Multimedia
• Media applications operate on data types
narrower than the native word size
– Graphics systems use 8 bits per primary color
– Audio samples use 8–16 bits
• By partitioning its carry chains, a 256-bit adder can perform
– 16 simultaneous operations on 16-bit operands
– 32 simultaneous operations on 8-bit operands
17
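The partitioned-adder idea can even be mimicked in software ("SWAR"). A minimal sketch, assuming eight unsigned 8-bit lanes packed into one 64-bit integer; masking off each lane's high bit keeps carries from rippling into the neighboring lane, exactly what the partitioned carry chain of a multimedia SIMD adder does in hardware:

```python
LOW7 = 0x7F7F7F7F7F7F7F7F   # low 7 bits of every 8-bit lane
HIGH = 0x8080808080808080   # high bit of every 8-bit lane

def packed_add8(a, b):
    """Add eight packed 8-bit lanes at once, wrapping within each lane."""
    low = (a & LOW7) + (b & LOW7)    # carries stop at each lane's bit 7
    # Fold the two high bits back in with XOR, which adds them
    # without generating a carry into the next lane.
    return (low ^ (a & HIGH) ^ (b & HIGH)) & 0xFFFFFFFFFFFFFFFF
```

For example, `packed_add8(0x01FF, 0x0101)` yields `0x0200`: lane 0 wraps (0xFF + 0x01 → 0x00) without disturbing lane 1 (0x01 + 0x01 → 0x02).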
• In contrast to vector architectures, which offer
an elegant instruction set that is intended to
be the target of a vectorizing compiler, SIMD
extensions have three major omissions:
• Multimedia SIMD extensions fix the number of
operands in the opcode
• Multimedia SIMD extensions: No sophisticated
addressing modes
• No mask registers
18
• The goal of these SIMD extensions has been to
accelerate carefully written libraries rather
than to serve as a target for compiler-generated code
19
Why are SIMD extensions popular?
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need less memory bandwidth than a vector architecture
• Each data transfer is separate and aligned in memory
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector
architecture
20
GPU
What is GPU?
• It is a processor optimized for 2D/3D graphics, video,
visual computing, and display.
• It is highly parallel, highly multithreaded
multiprocessor optimized for visual computing.
• It provides real-time visual interaction with computed
objects via graphics, images, and video.
• It serves as both a programmable graphics processor
and a scalable parallel computing platform.
• Heterogeneous Systems: combine a GPU with a CPU
21
GPU Evolution
• 1980s – No GPU; PCs used a VGA controller
• 1990s – More functions added to the VGA controller
• 1997 – 3D acceleration functions:
Hardware for triangle setup and rasterization
Texture mapping
Shading
• 2000 – Single-chip graphics processor
(origin of the term “GPU”)
• 2005 – Massively parallel programmable
processors
• 2007 – CUDA (Compute Unified Device
Architecture)
22
Similarities and Differences between
Vector Architectures and GPUs
• Both architectures are designed to execute
data-level parallel programs.
• The VMIPS register file holds entire vectors—
that is, a contiguous block of 64 doubles. In
contrast, a single vector in a GPU is
distributed across the registers of all SIMD
lanes.
• These calculations are implicit in a vector
architecture but explicit in GPUs.
23
• With respect to conditional branch instructions,
both architectures implement them using mask
registers.
• GPUs hide memory latency using multithreading,
but this option is not available to vector
architectures.
• The control processor in a vector architecture
broadcasts operations to all vector lanes, but
such a control processor is absent in GPUs
24
Similarities and Differences between
Multimedia SIMD Computers and
GPUs
• Both are multiprocessors whose processors
use multiple SIMD lanes, although GPUs have
more processors and many more lanes.
• Both use hardware multithreading to improve
processor utilization, although GPUs have
hardware support for many more threads
25
Similarities
• Both have similar performance ratios between
single-precision and double-precision floating-
point arithmetic.
• Both use caches, although GPUs use smaller
streaming caches while multicore computers use
large multilevel caches that try to contain whole
working sets.
• Both use a 64-bit address space, although the
physical main memory is much smaller in GPUs.
26
Similarities and Differences between
Multimedia SIMD Computers and GPUs
27
Loop Level Parallelism
• Loop-level parallelism is normally analyzed at
the source level or close to it, while most
analysis of ILP is done once instructions have
been generated by the compiler.
• Loop-level analysis involves determining what
dependences exist among the operands in a
loop across the iterations of that loop.
28
• The analysis of loop-level parallelism focuses
on determining whether data accesses in later
iterations are dependent on data values
produced in earlier iterations.
• Such dependence is called a loop-carried
dependence.
29
• e.g. for (i=1; i<=1000; i++)
x[i] = x[i] + s;
• The computation in each iteration is
independent of the previous iterations, so
the loop is parallel. The two uses of x[i]
are within a single iteration.
• Thus the loop iterations are parallel (independent of
each other).
30
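A quick check of that claim: because iteration i touches only x[i], the iterations can be run in any order (or concurrently) and still produce the same result. A Python sketch (the function names are illustrative):

```python
def add_scalar_forward(x, s):
    out = list(x)
    for i in range(len(out)):            # original iteration order
        out[i] = out[i] + s
    return out

def add_scalar_reversed(x, s):
    out = list(x)
    for i in reversed(range(len(out))):  # reversed order gives the same
        out[i] = out[i] + s              # answer: no loop-carried dependence
    return out
```

Both functions return identical results for any input, which is exactly what makes the loop safe to parallelize.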
for (i=1; i<=100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
• S1 uses the value B[i] computed by S2 in the
previous iteration (loop-carried dependence)
31
This dependence is not circular:
S1 depends on S2, but S2 does not depend on S1.
32
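Because the dependence is not circular, the loop can be transformed so that each iteration is independent: peel the first instance of S1 and the last instance of S2 out of the loop, and pair each B[i+1] with the A[i+1] that consumes it. A Python sketch of this standard transformation (arrays assumed to be sized n + 2 so that indices 1 through n + 1 are valid):

```python
def original_loop(A, B, C, D, n):
    for i in range(1, n + 1):
        A[i] = A[i] + B[i]              # S1: uses B[i] from the previous iteration
        B[i + 1] = C[i] + D[i]          # S2

def transformed_loop(A, B, C, D, n):
    A[1] = A[1] + B[1]                  # peeled first instance of S1
    for i in range(1, n):               # iterations are now independent:
        B[i + 1] = C[i] + D[i]          # each B[i+1] is produced and
        A[i + 1] = A[i + 1] + B[i + 1]  # consumed in the same iteration
    B[n + 1] = C[n] + D[n]              # peeled last instance of S2
```

Both versions leave A and B in identical states, but in the transformed loop the dependence is within an iteration, so the iterations can run in parallel.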
Finding Dependences
• Finding dependences in a program is very
important for renaming and executing
instructions in parallel.
• Arrays and pointers make finding
dependences difficult.
• Assume array indices are affine, i.e., of
the form a × i + b, where a and b are
constants.
• The GCD test can be used to detect dependences.
33
GCD test
• Assume we store to an array with index value
a × i + b and load from an array with index
value c × i + d.
• If a loop-carried dependence exists, then
GCD(c, a) must divide (d − b).
• The converse does not hold: even when GCD(c, a)
divides (d − b), a dependence is not guaranteed,
because the test ignores the loop bounds.
34
Example
for(i=1; i<=100; i=i+1) {
x[2*i+3] = x[2*i] * 5.0;
}
a = 2, b = 3, c = 2, d = 0
GCD(a, c) = 2
d − b = −3
2 does not divide −3 ⇒ a
dependence is not possible
35
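The test is easy to code directly (a sketch, assuming a store to x[a*i + b] and a load from x[c*i + d]); applying it to the example above reproduces the "no dependence" answer:

```python
from math import gcd

def dependence_possible(a, b, c, d):
    """GCD test: a dependence between the store x[a*i+b] and the
    load x[c*i+d] requires GCD(a, c) to divide d - b.

    Returns False only when a dependence is provably impossible;
    True means a dependence MAY exist (the test ignores loop bounds).
    """
    return (d - b) % gcd(a, c) == 0
```

For the example, `dependence_possible(2, 3, 2, 0)` is `False`: GCD(2, 2) = 2 does not divide −3, so the loop carries no dependence.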

More Related Content

What's hot

Symmetric encryption and message confidentiality
Symmetric encryption and message confidentialitySymmetric encryption and message confidentiality
Symmetric encryption and message confidentialityCAS
 
Cache coherence
Cache coherenceCache coherence
Cache coherenceEmployee
 
Memory management
Memory managementMemory management
Memory managementcpjcollege
 
multiprocessors and multicomputers
 multiprocessors and multicomputers multiprocessors and multicomputers
multiprocessors and multicomputersPankaj Kumar Jain
 
Computer networks network layer,routing
Computer networks network layer,routingComputer networks network layer,routing
Computer networks network layer,routingDeepak John
 
Centralized shared memory architectures
Centralized shared memory architecturesCentralized shared memory architectures
Centralized shared memory architecturesGokuldhev mony
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel ComputingAkhila Prabhakaran
 
Distributed System Management
Distributed System ManagementDistributed System Management
Distributed System ManagementIbrahim Amer
 
Parallel computing and its applications
Parallel computing and its applicationsParallel computing and its applications
Parallel computing and its applicationsBurhan Ahmed
 
Distributed Shared Memory
Distributed Shared MemoryDistributed Shared Memory
Distributed Shared MemoryPrakhar Rastogi
 
Multivector and multiprocessor
Multivector and multiprocessorMultivector and multiprocessor
Multivector and multiprocessorKishan Panara
 
Mobile transport layer - traditional TCP
Mobile transport layer - traditional TCPMobile transport layer - traditional TCP
Mobile transport layer - traditional TCPVishal Tandel
 
Terminologies Used In Big data Environments,G.Sumithra,II-M.sc(computer scien...
Terminologies Used In Big data Environments,G.Sumithra,II-M.sc(computer scien...Terminologies Used In Big data Environments,G.Sumithra,II-M.sc(computer scien...
Terminologies Used In Big data Environments,G.Sumithra,II-M.sc(computer scien...sumithragunasekaran
 

What's hot (20)

Symmetric encryption and message confidentiality
Symmetric encryption and message confidentialitySymmetric encryption and message confidentiality
Symmetric encryption and message confidentiality
 
Cache coherence
Cache coherenceCache coherence
Cache coherence
 
Memory management
Memory managementMemory management
Memory management
 
multiprocessors and multicomputers
 multiprocessors and multicomputers multiprocessors and multicomputers
multiprocessors and multicomputers
 
Parallelism
ParallelismParallelism
Parallelism
 
Raid
Raid Raid
Raid
 
Google BigTable
Google BigTableGoogle BigTable
Google BigTable
 
Computer networks network layer,routing
Computer networks network layer,routingComputer networks network layer,routing
Computer networks network layer,routing
 
Centralized shared memory architectures
Centralized shared memory architecturesCentralized shared memory architectures
Centralized shared memory architectures
 
NUMA overview
NUMA overviewNUMA overview
NUMA overview
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
 
Distributed System Management
Distributed System ManagementDistributed System Management
Distributed System Management
 
Parallel computing and its applications
Parallel computing and its applicationsParallel computing and its applications
Parallel computing and its applications
 
Distributed Shared Memory
Distributed Shared MemoryDistributed Shared Memory
Distributed Shared Memory
 
Pram model
Pram modelPram model
Pram model
 
Semaphore
SemaphoreSemaphore
Semaphore
 
Multivector and multiprocessor
Multivector and multiprocessorMultivector and multiprocessor
Multivector and multiprocessor
 
Hash Function
Hash FunctionHash Function
Hash Function
 
Mobile transport layer - traditional TCP
Mobile transport layer - traditional TCPMobile transport layer - traditional TCP
Mobile transport layer - traditional TCP
 
Terminologies Used In Big data Environments,G.Sumithra,II-M.sc(computer scien...
Terminologies Used In Big data Environments,G.Sumithra,II-M.sc(computer scien...Terminologies Used In Big data Environments,G.Sumithra,II-M.sc(computer scien...
Terminologies Used In Big data Environments,G.Sumithra,II-M.sc(computer scien...
 

Similar to Advanced computer architecture

COA-Unit4-PPT.pptx
COA-Unit4-PPT.pptxCOA-Unit4-PPT.pptx
COA-Unit4-PPT.pptxRuhul Amin
 
BIL406-Chapter-2-Classifications of Parallel Systems.ppt
BIL406-Chapter-2-Classifications of Parallel Systems.pptBIL406-Chapter-2-Classifications of Parallel Systems.ppt
BIL406-Chapter-2-Classifications of Parallel Systems.pptKadri20
 
CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptxAbcvDef
 
Parallel Processors (SIMD)
Parallel Processors (SIMD) Parallel Processors (SIMD)
Parallel Processors (SIMD) Ali Raza
 
Parallel Processors (SIMD)
Parallel Processors (SIMD) Parallel Processors (SIMD)
Parallel Processors (SIMD) Ali Raza
 
Array Processors & Architectural Classification Schemes_Computer Architecture...
Array Processors & Architectural Classification Schemes_Computer Architecture...Array Processors & Architectural Classification Schemes_Computer Architecture...
Array Processors & Architectural Classification Schemes_Computer Architecture...Sumalatha A
 
PARALLELISM IN MULTICORE PROCESSORS
PARALLELISM  IN MULTICORE PROCESSORSPARALLELISM  IN MULTICORE PROCESSORS
PARALLELISM IN MULTICORE PROCESSORSAmirthavalli Senthil
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureBalaji Vignesh
 
High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer ArchitectureSubhasis Dash
 
ehhhhhhhhhhhhhhhhhhhhhhhhhjjjjjllaye.pptx
ehhhhhhhhhhhhhhhhhhhhhhhhhjjjjjllaye.pptxehhhhhhhhhhhhhhhhhhhhhhhhhjjjjjllaye.pptx
ehhhhhhhhhhhhhhhhhhhhhhhhhjjjjjllaye.pptxEliasPetros
 
Throughput oriented aarchitectures
Throughput oriented aarchitecturesThroughput oriented aarchitectures
Throughput oriented aarchitecturesNomy059
 
Advanced processor principles
Advanced processor principlesAdvanced processor principles
Advanced processor principlesDhaval Bagal
 
Computer system Architecture. This PPT is based on computer system
Computer system Architecture. This PPT is based on computer systemComputer system Architecture. This PPT is based on computer system
Computer system Architecture. This PPT is based on computer systemmohantysikun0
 
Pipelining, processors, risc and cisc
Pipelining, processors, risc and ciscPipelining, processors, risc and cisc
Pipelining, processors, risc and ciscMark Gibbs
 
Multiprocessor.pptx
 Multiprocessor.pptx Multiprocessor.pptx
Multiprocessor.pptxMuhammad54342
 

Similar to Advanced computer architecture (20)

COA-Unit4-PPT.pptx
COA-Unit4-PPT.pptxCOA-Unit4-PPT.pptx
COA-Unit4-PPT.pptx
 
BIL406-Chapter-2-Classifications of Parallel Systems.ppt
BIL406-Chapter-2-Classifications of Parallel Systems.pptBIL406-Chapter-2-Classifications of Parallel Systems.ppt
BIL406-Chapter-2-Classifications of Parallel Systems.ppt
 
CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptx
 
Parallel Processors (SIMD)
Parallel Processors (SIMD) Parallel Processors (SIMD)
Parallel Processors (SIMD)
 
Parallel Processors (SIMD)
Parallel Processors (SIMD) Parallel Processors (SIMD)
Parallel Processors (SIMD)
 
Array Processors & Architectural Classification Schemes_Computer Architecture...
Array Processors & Architectural Classification Schemes_Computer Architecture...Array Processors & Architectural Classification Schemes_Computer Architecture...
Array Processors & Architectural Classification Schemes_Computer Architecture...
 
PARALLELISM IN MULTICORE PROCESSORS
PARALLELISM  IN MULTICORE PROCESSORSPARALLELISM  IN MULTICORE PROCESSORS
PARALLELISM IN MULTICORE PROCESSORS
 
Unit 5 Advanced Computer Architecture
Unit 5 Advanced Computer ArchitectureUnit 5 Advanced Computer Architecture
Unit 5 Advanced Computer Architecture
 
High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer Architecture
 
CA UNIT IV.pptx
CA UNIT IV.pptxCA UNIT IV.pptx
CA UNIT IV.pptx
 
Lec3 final
Lec3 finalLec3 final
Lec3 final
 
ehhhhhhhhhhhhhhhhhhhhhhhhhjjjjjllaye.pptx
ehhhhhhhhhhhhhhhhhhhhhhhhhjjjjjllaye.pptxehhhhhhhhhhhhhhhhhhhhhhhhhjjjjjllaye.pptx
ehhhhhhhhhhhhhhhhhhhhhhhhhjjjjjllaye.pptx
 
Throughput oriented aarchitectures
Throughput oriented aarchitecturesThroughput oriented aarchitectures
Throughput oriented aarchitectures
 
Advanced processor principles
Advanced processor principlesAdvanced processor principles
Advanced processor principles
 
High performance computing
High performance computingHigh performance computing
High performance computing
 
Computer system Architecture. This PPT is based on computer system
Computer system Architecture. This PPT is based on computer systemComputer system Architecture. This PPT is based on computer system
Computer system Architecture. This PPT is based on computer system
 
Pipelining, processors, risc and cisc
Pipelining, processors, risc and ciscPipelining, processors, risc and cisc
Pipelining, processors, risc and cisc
 
Node architecture
Node architectureNode architecture
Node architecture
 
Aca module 1
Aca module 1Aca module 1
Aca module 1
 
Multiprocessor.pptx
 Multiprocessor.pptx Multiprocessor.pptx
Multiprocessor.pptx
 

Recently uploaded

18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 

Recently uploaded (20)

18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 

Advanced computer architecture

  • 2. Contents • Vector Architecture • SIMD instruction set & extension for multimedia • Graphic processing unit(GPU) • Comparison between vector Architecture &GPU • Comparison between multimedia SIMD computers and GPU • Loop level parallelism • Finding dependencies 2
  • 3. Data parallelism • Data parallelism is a form of parallelization across multiple processors in Parallel computing environments. • It focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures like arrays and matrices by working on each element in parallel. 3
  • 4. SIMD • Single Instruction, Multiple Data • SIMD, is a class of Parallel computers in Flynn's taxonomy. • It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. • Three variations of SIMD: • vector architectures, • multimedia SIMD instruction set extensions, • graphics processing units (GPUs). 4
  • 5. • SIMD architectures can exploit significant data‐level parallelism for: – matrix‐oriented scientific computing – media‐oriented image and sound processors • Most modern CPU designs include SIMD instructions in order to improve the performance of multimedia use. 5
  • 6. Vector Architecture • Vector architectures are easier to understand and to compile to than other SIMD variations, • But they were considered too expensive for microprocessors until recently. 6
  • 7. Vector Architecture • Vector architectures grab sets of data elements scattered about memory, place them into large, sequential register files, operate on data in those register files, and then disperse the results back into memory. 7
  • 8. Vector Architecture • Registers are controlled by compiler • Register files act as compiler controlled buffers • Used to hide memory latency • Leverage memory bandwidth 8
  • 9. Basic structure of a vector architecture 9
  • 10. Components of vector architecture • Vector registers ‐ Each register holds a 64‐element, 64 bits/element vector ‐ Register file has 16 read ports and 8 write ports • Vector functional units • Fully pipelined • Data and control hazards are detected • Vector load‐store unit ‐ Fully pipelined ‐ Words move between registers ‐ One word per clock cycle after initial latency • Scalar registers • 32 general‐purpose registers • 32 floating‐point registers 10
  • 11. How Vector Processors Work: An Example • Let’s take a typical vector problem, Y = a × X + Y • X and Y are vectors, initially resident in memory, and a is a scalar. • Here is the VMIPS code for DAXPY. L.D F0,a load scalar a LV V1,Rx load vector X MULVS.D V2,V1,F0 vector-scalar multiply LV V3,Ry load vector Y ADDVV.D V4,V2,V3 add SV V4,Ry store the result 11
  • 12. • In MIPS Code • ADD waits for MUL, SD waits for ADD • In VMIPS • Stall once for the first vector element, subsequent elements will flow smoothly down the pipeline. • Pipeline stall required once per vector instruction! 12
  • 13. Memory banks • Memory system must be designed to support high bandwidth for vector loads and stores • Spread accesses across multiple banks – Control bank addresses independently – Load or store non sequential words – Support multiple vector processors sharing the same memory • Example: – 32 processors, each generating 4 loads and 2 stores/cycle – Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns – How many memory banks needed? 13
  • 14. Vector Execution Time • Execution time depends on three factors: ‐ Length of operand vectors ‐ Structural hazards ‐ Data dependencies • VMIPS functional units consume one element per clock cycle • Execution time is approximately the vector length 14
  • 15. Vector Processor Advantages • Each instruction generates a lot of work • Reduces instruction fetch bandwidth • Highly regular memory access pattern • Interleaving multiple banks for higher memory bandwidth • No need to explicitly code loops • Fewer branches in the instruction sequence 15
  • 16. Vector Processor Limitations • Memory (bandwidth) can easily become a bottleneck, especially if 1. compute/memory operation balance is not maintained 16
  • 17. SIMD Instruction Set Extensions for Multimedia • Media applications operate on data types narrower than the native word size ‐Graphics systems use 8 bits per primary color ‐ Audio samples use 8‐16 bits ‐ 256‐bit adder ‐ 16 simultaneous operations on 16 bits ‐ 32 simultaneous operations on 8 bits 17
• 18. • In contrast to vector architectures, which offer an elegant instruction set intended to be the target of a vectorizing compiler, SIMD extensions have three major omissions: • They fix the number of data operands in the opcode, so there is no vector length register • They have no sophisticated addressing modes, such as strided or gather-scatter accesses • They have no mask registers 18
  • 19. • The goal of these SIMD extensions has been to accelerate carefully written libraries rather than for the compiler to generate them 19
• 20. Why is it popular? • Costs little to add to the standard arithmetic unit • Easy to implement • Needs less memory bandwidth than a vector architecture • Each data transfer is a single block aligned in memory • Uses much smaller register space • Fewer operands • No need for the sophisticated mechanisms of vector architectures 20
• 21. GPU What is a GPU? • It is a processor optimized for 2D/3D graphics, video, visual computing, and display. • It is a highly parallel, highly multithreaded multiprocessor optimized for visual computing. • It provides real-time visual interaction with computed objects via graphics, images, and video. • It serves as both a programmable graphics processor and a scalable parallel computing platform. • Heterogeneous systems: combine a GPU with a CPU 21
• 22. GPU Evolution • 1980s – No GPU; the PC used a VGA controller • 1990s – More functions added to the VGA controller • 1997 – 3D acceleration functions: hardware for triangle setup and rasterization, texture mapping, and shading • 2000 – A single-chip graphics processor (origin of the term GPU) • 2005 – Massively parallel programmable processors • 2007 – CUDA (Compute Unified Device Architecture) 22
  • 23. Similarities and Differences between Vector Architectures and GPUs • Both architectures are designed to execute data-level parallel programs. • The VMIPS register file holds entire vectors— that is, a contiguous block of 64 doubles. In contrast, a single vector in a GPU would be distributed across the registers of all SIMD Lanes. • The calculations are implicit in vector architecture but they are explicit in GPUs. 23
• 24. • With respect to conditional branch instructions, both architectures implement them using mask registers. • GPUs hide memory latency using multithreading, which is not possible in a vector architecture. • The control processor in a vector architecture broadcasts operations to all the vector lanes, but GPUs have no such control processor 24
  • 25. Similarities and Differences between Multimedia SIMD Computers and GPUs • Both are multiprocessors whose processors use multiple SIMD lanes, although GPUs have more processors and many more lanes. • Both use hardware multithreading to improve processor utilization, although GPUs have hardware support for many more threads 25
• 26. Similarities • Both have similar performance ratios between single-precision and double-precision floating-point arithmetic. • Both use caches, although GPUs use smaller streaming caches and multicore computers use large multilevel caches that try to contain whole working sets completely. • Both use a 64-bit address space, although the physical main memory is much smaller in GPUs. 26
  • 27. Similarities and Differences between Multimedia SIMD Computers and GPUs 27
  • 28. Loop Level Parallelism • Loop-level parallelism is normally analyzed at the source level or close to it, while most analysis of ILP is done once instructions have been generated by the compiler. • Loop-level analysis involves determining what dependences exist among the operands in a loop across the iterations of that loop. 28
  • 29. • The analysis of loop-level parallelism focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations. • Such dependence is called a loop-carried dependence. 29
• 30. • e.g. for (i=1; i<=1000; i++) x[i] = x[i] + s; • The computation in each iteration is independent of the previous iterations, so the loop is parallel. Both uses of x[i] occur within a single iteration. • Thus the loop iterations are parallel (independent of each other). 30
• 31.
for (i=1; i<=100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
• S1 uses the value B[i] computed by S2 in the previous iteration (loop-carried dependence) 31
• 32. This dependence is not circular: S1 depends on S2, but S2 does not depend on S1. 32
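Because the dependence is not circular, the loop from the previous slide can be restructured so that the remaining dependence falls within a single iteration, making the iterations of the new loop body independent of one another (a sketch of the standard transformation; the array sizes and function names are illustrative):

```c
#define N 100

/* Original loop: S1 uses the B[i] that S2 wrote in the previous
   iteration -- a loop-carried (but non-circular) dependence. */
void original(double A[], double B[], const double C[], const double D[]) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];          /* S1 */
        B[i + 1] = C[i] + D[i];      /* S2 */
    }
}

/* Transformed loop: peel S1 of the first iteration and S2 of the
   last; inside the new loop, A[i+1] uses the B[i+1] computed in the
   same iteration, so there is no loop-carried dependence. */
void transformed(double A[], double B[], const double C[], const double D[]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}
```

Both versions compute identical results, but the body of the transformed loop can be run in parallel across iterations.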
• 33. Finding Dependences • Finding dependences in a program is very important for renaming and for executing instructions in parallel. • Arrays and pointers make finding dependences very difficult. • Assume array indices are affine, i.e., of the form a × i + b, where a and b are constants. • The GCD test can be used to detect dependences. 33
• 34. GCD test • Suppose we store to an array element with index a × i + b and load from an element with index c × i + d. • A loop-carried dependence exists only if GCD(a, c) divides (d − b). • Even when GCD(a, c) does divide (d − b), a dependence may still not exist, because the test ignores the loop bounds. 34
• 35. Example
for (i=1; i<=100; i=i+1) {
x[2*i+3] = x[2*i] * 5.0;
}
a = 2, b = 3, c = 2, d = 0
GCD(a, c) = 2; d − b = −3
2 does not divide −3 ⇒ no dependence is possible 35
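The GCD test itself is only a few lines of code (a sketch; the function names are illustrative):

```c
#include <stdlib.h>   /* abs */

/* Euclid's algorithm for the greatest common divisor. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test for a store to x[a*i + b] and a load from x[c*i + d]:
   returns 1 if a loop-carried dependence is POSSIBLE, 0 if it is
   ruled out. A result of 1 does not guarantee a dependence exists,
   because the test ignores the loop bounds. */
int gcd_test(int a, int b, int c, int d) {
    return (d - b) % gcd(abs(a), abs(c)) == 0;
}
```

For the slide's example, gcd_test(2, 3, 2, 0) evaluates GCD(2, 2) = 2 against d − b = −3; since 2 does not divide −3, the test rules the dependence out.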