Digital Design:
An Embedded Systems
Approach Using Verilog
Chapter 9
Accelerators
Portions of this work are from the book, Digital Design: An Embedded
Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan
Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.
Verilog
Digital Design — Chapter 9 — Accelerators 2
Performance and Parallelism
• A processor core performs steps in sequence
  • Performance limited by the instruction rate
• Accelerating performance
  • Perform steps in parallel
  • Takes less time overall to complete an operation
• Instruction-level parallelism
  • Within a processor core
  • Pipelining, multiple-issue
• Accelerators
  • Custom hardware for parallel operations
Achievable Parallelism
• How many steps can be performed at once?
  • Regularly structured data
  • Independent processing steps
• Examples
  • Video and image pixel processing
  • Audio or sensor signal processing
• Constrained by data dependencies
  • Operations that depend on results of previous steps
Algorithm Kernels
• Algorithm: specification of the required processing steps
  • Often expressed in a programming language
• Kernel: the part that involves the most intensive, repetitive processing
  • “10% of operations take 90% of the time”
• Accelerating a kernel with parallel hardware gives the best payback
Amdahl’s Law
• Time for an algorithm is t
• Fraction f is spent on a kernel
  $t = (1 - f)t + ft$
• Accelerator speeds up kernel by a factor s
  $t' = (1 - f)t + \frac{ft}{s}$
• Overall speedup factor s'
  $s' = \frac{t}{t'} = \frac{1}{(1 - f) + \frac{f}{s}}$
  • For large f, s' → s
  • For small f, s' → 1
Amdahl’s Law Example
• An algorithm with two kernels
  • Kernel 1: 80% of time, can be sped up 10 times
  • Kernel 2: 15% of time, can be sped up 100 times
• Which speedup gives best overall improvement?
  • For kernel 1: $s' = \frac{1}{(1 - 0.8) + \frac{0.8}{10}} = \frac{1}{0.2 + 0.08} = 3.57$
  • For kernel 2: $s' = \frac{1}{(1 - 0.15) + \frac{0.15}{100}} = \frac{1}{0.85 + 0.0015} = 1.17$
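The two results above can be checked with a few lines of Python. This is only a sketch of the speedup formula from the previous slide; the function name `amdahl_speedup` is an assumption made for illustration and does not come from the slides.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup s' when a fraction f of the run time is sped up by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


# The two kernels from the example slide.
kernel1 = amdahl_speedup(0.80, 10.0)
kernel2 = amdahl_speedup(0.15, 100.0)
print(f"kernel 1: {kernel1:.2f}x, kernel 2: {kernel2:.2f}x")
# prints "kernel 1: 3.57x, kernel 2: 1.17x"
```

Accelerating the larger kernel by a modest factor pays off far more than accelerating the smaller kernel by a large one.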
Parallel Architectures
• An architecture for an accelerator specifies
  • Processing blocks
  • Data flow between them
• Parallelism through replication
  • Multiple identical blocks operating on different data elements
  • Works well when elements can be processed independently
Parallel Architectures
• Parallelism through pipelining
  • Break a computation into steps, perform them in assembly-line fashion
  • Latency (time to complete a single operation) is not increased
  • Throughput (rate of completion of operations) is increased
    • Ideally by a factor equal to the number of pipeline stages
[Figure: three-stage pipeline; data in → step 1 → step 2 → step 3 → data out]
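The latency and throughput claims can be illustrated with a small timing model. It is a sketch under idealized assumptions that are not in the slides: all stages take the same time, pipeline registers add no delay, and the 10 ns stage time is a made-up figure. The function name `pipeline_timing` is likewise hypothetical.

```python
def pipeline_timing(n_ops: int, n_stages: int, stage_time_ns: float):
    """Return (unpipelined_ns, pipelined_ns) for a run of n_ops operations."""
    latency_ns = n_stages * stage_time_ns      # one operation, start to finish
    unpipelined_ns = n_ops * latency_ns        # each operation waits for the previous one
    # Once the pipeline is full, one result completes every stage time.
    pipelined_ns = latency_ns + (n_ops - 1) * stage_time_ns
    return unpipelined_ns, pipelined_ns


for n_ops in (1, 10, 1000):
    slow, fast = pipeline_timing(n_ops, n_stages=3, stage_time_ns=10.0)
    print(n_ops, slow, fast, round(slow / fast, 2))
```

For a single operation the two times are equal, so latency is unchanged. As the number of operations grows, the ratio approaches the number of stages (3 here) without reaching it.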
Direct Memory Access (DMA)
• Input/output data for accelerators must be transferred at high speed
  • Using the processor would be too slow
• Direct memory access
  • I/O controller and accelerator transfer data to and from memory autonomously
  • Program supplies starting address and length
Bus Arbitration
• Bus masters take turns to use bus to access slaves
  • Controlled by a bus arbiter
• Arbitration policies
  • Priority, round-robin, …
[Figure: processor, accelerator, and controller each connect to the arbiter through request and grant signals and share the memory bus to the memory]
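The two policies named above differ in how they treat masters that request at the same time. The Python sketch below contrasts them; the function names, and the convention that index 0 has the highest priority, are assumptions for illustration only.

```python
def round_robin_grant(requests, last_granted):
    """Grant the next requesting master after last_granted, wrapping around.

    requests is a list of booleans, one per bus master.
    Returns the granted index, or None if no master is requesting.
    """
    n = len(requests)
    for step in range(1, n + 1):
        candidate = (last_granted + step) % n
        if requests[candidate]:
            return candidate
    return None


def fixed_priority_grant(requests):
    """Grant the lowest-index requester (index 0 is assumed highest priority)."""
    for index, requesting in enumerate(requests):
        if requesting:
            return index
    return None


# Three masters all request continuously for six bus cycles.
requests = [True, True, True]
last = 2
rr_order = []
for _ in range(6):
    last = round_robin_grant(requests, last)
    rr_order.append(last)
prio_order = [fixed_priority_grant(requests) for _ in range(6)]
print(rr_order)    # [0, 1, 2, 0, 1, 2]
print(prio_order)  # [0, 0, 0, 0, 0, 0]
```

Round-robin shares the bus evenly, while fixed priority can starve the lower-priority masters for as long as a higher-priority one keeps requesting.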
Block-Processing Accelerator
• Data arranged in regular groups of contiguous memory locations
  • Accelerator works block by block
  • E.g., images in blocks of 8 × 8 × 16-bit pixels
• Datapath comprises
  • Memory access: address generation, counters
  • Computation section
  • Control section: finite-state machine(s)
Stream-Processing Accelerator
• Streams of data from an input source
  • E.g., high-speed sensors
• Digital signal processing (DSP)
  • Analog sensor signal converted to stream of digital sample values
  • Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)
Processor/Accelerator Interface
• Embedded software controls an accelerator
  • Providing control parameters
  • Synchronizing operations
• Input/output registers and interrupts
  • Interact with the control sequencer
Case Study: Edge Detection
• Illustration of accelerator design
• Edge detection in video processing
  • Identify where image intensity changes abruptly
  • Typically at the boundary of objects
  • First step in identifying objects in a scene
• Application areas
  • Video surveillance, computer vision, …
• For this case study
  • Monochrome images of 640 × 480 × 8-bit pixels
  • Stored row-by-row in memory
  • Pixel values: 0 (black) – 255 (white)
Sobel Edge Detection
• Compute derivatives of intensity in x and y directions
• Look for minima and maxima (where intensity changes most rapidly)
The Sobel Algorithm
• Use convolution to approximate partial derivatives Dx and Dy at each position
  • Weighted sum of value of a pixel and its eight nearest neighbors
  • Coefficients represented using a 3×3 convolution mask
• Sobel masks for x and y derivatives
  $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$
  $D_x(i,j) = O(i,j) \ast G_x \qquad D_y(i,j) = O(i,j) \ast G_y$
The Sobel Algorithm
• Combine partial derivatives
  $|D| = \sqrt{D_x^2 + D_y^2}$
• Since we just want maxima and minima in magnitude, approximate as:
  $|D| = |D_x| + |D_y|$
• Edge pixels don’t have eight neighbors
  • Skip computation of |D| for edges
  • Just set them to 0 using software
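To see how far the sum of absolute values can stray from the true magnitude, the Python sketch below compares the two over a coarse grid of derivative values. The grid step of 60 and the function names are arbitrary choices for illustration; the range ±1020 is the Dx/Dy range for 8-bit pixels.

```python
import math


def exact_magnitude(dx: int, dy: int) -> float:
    return math.sqrt(dx * dx + dy * dy)


def approx_magnitude(dx: int, dy: int) -> int:
    return abs(dx) + abs(dy)


worst_ratio = 1.0
never_underestimates = True
for dx in range(-1020, 1021, 60):
    for dy in range(-1020, 1021, 60):
        exact = exact_magnitude(dx, dy)
        approx = approx_magnitude(dx, dy)
        if approx < exact - 1e-9:
            never_underestimates = False
        if exact > 0:
            worst_ratio = max(worst_ratio, approx / exact)

print(never_underestimates)      # True
print(round(worst_ratio, 3))     # 1.414
```

The approximation never underestimates the magnitude and overestimates it by at most a factor of √2 (when |Dx| = |Dy|), in exchange for avoiding squarers and a square-root unit in hardware.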
The Algorithm in Pseudocode
for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end
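The pseudocode translates almost line for line into a software reference model, which can later serve as a check on the accelerator's output. The Python version below is a sketch, not the book's code: it takes the image as a list of rows, shifts the mask indices by one because Python lists start at 0, and leaves border pixels at 0 as the previous slide suggests.

```python
GX = [[-1, 0, +1], [-2, 0, +2], [-1, 0, +1]]
GY = [[+1, +2, +1], [0, 0, 0], [-1, -2, -1]]


def sobel(image):
    """Return |D| = |Dx| + |Dy| for interior pixels; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    result = [[0] * cols for _ in range(rows)]
    for row in range(1, rows - 1):
        for col in range(1, cols - 1):
            sumx = sumy = 0
            for i in (-1, 0, +1):
                for j in (-1, 0, +1):
                    pixel = image[row + i][col + j]
                    sumx += pixel * GX[i + 1][j + 1]
                    sumy += pixel * GY[i + 1][j + 1]
            result[row][col] = abs(sumx) + abs(sumy)
    return result


# 5 x 6 test image: black on the left, white on the right (a vertical edge).
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel(image)
for line in edges:
    print(line)
```

On this test image the interior rows come out as `[0, 0, 1020, 1020, 0, 0]`: the two columns straddling the edge reach the maximum |Dx| of 1020, and every other pixel is 0.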
Data Formats and Rates
• Pixel values: 0 to 255 (8 bits)
• Coefficients are 0, ±1 and ±2
  • Partial products: –510 to +510 (10 bits)
  • Dx and Dy: –1020 to +1020 (11 bits)
  • |D|: 0 to 2040 (11 bits)
• Final pixel value: scale back to 8 bits
• Video rate: 30 frames/sec
  • 640 × 480 = 307,200 pixels
  • 307,200 × 30 ≈ 10 million pixels/sec
Data Dependencies
• Pixels can be computed independently
• For each pixel:
System Architecture
• Data dependencies suggest a pipeline
• Coefficient multiplies are simple shift/negate, so merge with adder stage
Memory Bandwidth
• Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
  • Memory is 32 bits wide, byte addressable
  • Bandwidth = 50M operations/sec
• Camera produces 10 Mpixels/sec
  • Accelerator needs to process at this rate
  • (8 reads + 1 write) × 10 Mpixels/sec = 90M operations/sec
  • Greater than memory bandwidth
Memory Bandwidth
• Read 4 pixels at once from each of previous, current, and next rows
  • Store in accelerator to compute multiple derivative image pixels
• Produce derivative pixels row-by-row, left-to-right
  • Read 3 × 32-bit words for every 4th derivative pixel computed
  • Write 4 pixels at a time
• (3 reads + 1 write) / 4 × 10 Mpixels/sec = 10M operations/sec = 20% of available memory bandwidth
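The bandwidth budget from this slide and the previous one can be reproduced as plain arithmetic. The Python sketch below uses the figures given on the slides; the constant and variable names are chosen here for illustration.

```python
MEMORY_OP_NS = 20            # one 32-bit read or write
PIXEL_RATE = 10_000_000      # pixels/sec, rounded up from 640 * 480 * 30
PIXELS_PER_WORD = 4          # 32-bit word / 8-bit pixel

available = 1_000_000_000 // MEMORY_OP_NS     # memory operations per second

# One pixel per access: 8 neighbor reads and 1 result write for every pixel.
naive = (8 + 1) * PIXEL_RATE

# Word-wide access: 3 reads and 1 write for every group of 4 pixels.
grouped = (3 + 1) * PIXEL_RATE // PIXELS_PER_WORD

print(available, naive, grouped)              # 50000000 90000000 10000000
print(naive / available, grouped / available) # 1.8 0.2
```

Accessing one pixel at a time would need 1.8 times the available bandwidth, while word-wide access needs only 20% of it, leaving the rest for the processor.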
Sobel Accelerator Architecture
Accelerator Sequence
• Steady state
  • Write 4 result pixels
  • Read 4 pixels for previous, current, next rows
  • Compute for 4 cycles
  • Repeat…
• Start of row
  • Omit writes until pipeline full
• End of row
  • Omit reads to drain pipeline
Memory Operation Timing
• Steady state
Pixel Datapath
// Computation datapath signals
reg [31:0] prev_row, curr_row, next_row;
reg [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg [7:0] abs_D;
reg [31:0] result_row;
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if (prev_row_load) prev_row <= dat_i;
  else if (shift_en) prev_row[31:8] <= prev_row[23:0];
... // Current row register
... // Next row register
function [10:0] abs (input signed [10:0] x);
  abs = x >= 0 ? x : -x;
endfunction
...
Pixel Datapath
always @(posedge clk_i) // Computation pipeline
  if (shift_en) begin
    D = abs(Dx) + abs(Dy);
    abs_D <= D[10:3];
    Dx <= - $signed({3'b000, O[-1][-1]})
          + $signed({3'b000, O[-1][+1]})
          - ($signed({3'b000, O[ 0][-1]}) << 1)
          + ($signed({3'b000, O[ 0][+1]}) << 1)
          - $signed({3'b000, O[+1][-1]})
          + $signed({3'b000, O[+1][+1]});
    Dy <=   $signed({3'b000, O[-1][-1]})
          + ($signed({3'b000, O[-1][ 0]}) << 1)
          + $signed({3'b000, O[-1][+1]})
          - $signed({3'b000, O[+1][-1]})
          - ($signed({3'b000, O[+1][ 0]}) << 1)
          - $signed({3'b000, O[+1][+1]});
...
Pixel Datapath
    O[-1][-1] <= O[-1][ 0];
    O[-1][ 0] <= O[-1][+1];
    O[-1][+1] <= prev_row[31:24];
    O[ 0][-1] <= O[ 0][ 0];
    O[ 0][ 0] <= O[ 0][+1];
    O[ 0][+1] <= curr_row[31:24];
    O[+1][-1] <= O[+1][ 0];
    O[+1][ 0] <= O[+1][+1];
    O[+1][+1] <= next_row[31:24];
  end
always @(posedge clk_i) // Result row register
  if (shift_en) result_row <= {result_row[23:0], abs_D};
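The datapath keeps a 3×3 window of pixels and shifts in one new column per enabled clock cycle, so each pixel is read from the row registers only once. The Python sketch below mimics that behavior for three image rows. It is a behavioral approximation only: it ignores the pipeline register delays of the Verilog, and the function name `window_stream` is hypothetical.

```python
def window_stream(prev_row, curr_row, next_row):
    """Slide a 3x3 window along three image rows, one new column per step.

    Yields (column, |D| >> 3) for each position where the window is full.
    """
    window = [[0, 0, 0] for _ in range(3)]   # window[r][c] with r, c in 0..2
    for col, column in enumerate(zip(prev_row, curr_row, next_row)):
        for r in range(3):
            # Shift the row left by one pixel and insert the new pixel on the right.
            window[r] = [window[r][1], window[r][2], column[r]]
        if col < 2:
            continue                          # window not yet full
        dx = (-window[0][0] + window[0][2]
              - 2 * window[1][0] + 2 * window[1][2]
              - window[2][0] + window[2][2])
        dy = (window[0][0] + 2 * window[0][1] + window[0][2]
              - window[2][0] - 2 * window[2][1] - window[2][2])
        # Report the center column; drop 3 LSBs as D[10:3] does.
        yield col - 1, (abs(dx) + abs(dy)) >> 3


rows = [[0, 0, 0, 255, 255, 255]] * 3
results = list(window_stream(*rows))
print(results)   # [(1, 0), (2, 127), (3, 127), (4, 0)]
```

The value 127 is 1020 with its three least-significant bits dropped, matching the 8-bit scaling performed by `abs_D <= D[10:3]`.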
Address Generation
• Given an image in memory at base address B
  • Address for pixel in row r, column c is B + r × 640 + c
• Base address (B) is fixed
• Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
• Use word-aligned addresses
  • Two least-significant bits always 00
  • Increment word address by 1
Address Generation
Address Generation
always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base <= dat_i[21:2];
always @(posedge clk_i) // O address offset counter
  if (offset_reset) O_offset <= 0;
  else if (O_offset_cnt_en) O_offset <= O_offset + 1;
always @(posedge clk_i) // D base address register
  if (D_base_ce) D_base <= dat_i[21:2];
always @(posedge clk_i) // D address offset counter
  if (offset_reset) D_offset <= 0;
  else if (D_offset_cnt_en) D_offset <= D_offset + 1;
...
Address Generation
assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr = D_base + D_offset;
assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                                     D_addr;
assign adr_o[1:0] = 2'b00;
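The word-address arithmetic can be cross-checked against the byte-address formula B + r × 640 + c. The Python sketch below does this for the second group of four pixels. The base address 0x008000 is the value the testbench later writes to O_base; the function names are assumptions for illustration.

```python
ROW_BYTES = 640   # one image row of 8-bit pixels


def pixel_address(base: int, row: int, col: int) -> int:
    """Byte address of the pixel at (row, col), as on the earlier slide."""
    return base + row * ROW_BYTES + col


def word_addresses(o_base_word: int, offset: int):
    """Word addresses of the previous, current, and next row groups."""
    prev_addr = o_base_word + offset
    return prev_addr, prev_addr + 640 // 4, prev_addr + 1280 // 4


O_BASE = 0x008000
o_base_word = O_BASE >> 2                 # drop the two always-zero low bits
prev_w, curr_w, next_w = word_addresses(o_base_word, offset=1)
print(hex(prev_w << 2), hex(curr_w << 2), hex(next_w << 2))
# prints "0x8004 0x8284 0x8504"
```

Shifting each word address left by two bits gives the byte addresses of columns 4 to 7 in rows 0, 1, and 2, the same values that `pixel_address` returns for column 4 of those rows.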
Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.
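From the software side, the register map behaves roughly like the Python model below. It is a behavioral sketch rather than the accelerator's implementation: the class name `SobelRegisters` is hypothetical, the base addresses are the ones used in the testbench later in the deck, and setting `done` by hand stands in for the accelerator finishing an image.

```python
class SobelRegisters:
    """Behavioral model of the accelerator's control/status registers."""
    INT_EN, START, O_BASE, D_BASE = 0, 4, 8, 12   # write offsets from the table
    STATUS = 0                                     # read offset (shares offset 0)

    def __init__(self):
        self.int_en = False
        self.done = False
        self.started = False
        self.o_base = 0
        self.d_base = 0

    def write(self, offset: int, value: int) -> None:
        if offset == self.INT_EN:
            self.int_en = bool(value & 1)
        elif offset == self.START:
            self.started = True                    # value ignored
        elif offset == self.O_BASE:
            self.o_base = value
        elif offset == self.D_BASE:
            self.d_base = value
        else:
            raise ValueError(f"no register at offset {offset}")

    def read(self, offset: int) -> int:
        if offset != self.STATUS:
            return 0                               # other registers read as 0
        value = int(self.done)
        self.done = False                          # reading status clears the interrupt
        return value

    @property
    def int_req(self) -> bool:
        return self.int_en and self.done


regs = SobelRegisters()
regs.write(SobelRegisters.O_BASE, 0x008000)
regs.write(SobelRegisters.D_BASE, 0x053000 + 640)
regs.write(SobelRegisters.INT_EN, 1)
regs.write(SobelRegisters.START, 0)
regs.done = True                                   # stand-in for end of processing
print(regs.int_req)                                # True
print(regs.read(SobelRegisters.STATUS), regs.int_req)   # 1 False
```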
Slave Bus Interface
assign start = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;
always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];
always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    // Reading the status register clears done (and hence the interrupt).
    done <= 1'b0;
assign int_req = int_en && done;
...
Slave Bus Interface
always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;
// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i)
    if (adr_i == 2'b00)
      dat_o = {31'b0, done}; // status register read
    else
      dat_o = 32'b0; // other registers read as 0
  else
    dat_o = result_row; // for master write
Control Sequencing
• Use a finite-state machine
• Counters keep track of rows (0 to 477) and columns (0 to 159)
• See textbook for details of FSM output functions
State Transition Diagram
Accelerator Verification
• Simulation-based verification of each section of the accelerator
  • Slave bus operations
  • Computation sequencing
  • Master bus operations
  • Address generation
  • Pixel computation
• Testbench including the accelerator
  • Bus functional processor model
  • Simplified memory and bus arbiter models
Sobel Verification Testbench
[Figure: testbench block diagram; the processor BFM, Sobel accelerator, memory model, and arbiter are connected by a multiplexed bus (muxes and connections)]
Processor Bus Functional Model
initial begin // Processor bus-functional model
  cpu_adr_o <= 23'h000000;
  cpu_sel_o <= 4'b0000;
  cpu_dat_o <= 32'h00000000;
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  @(negedge rst);
  @(posedge clk);
  // Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
  // Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
  // Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
  // Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
  // End of write operations
...
Processor Bus Functional Model
  cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
      cpu_sel_o <= 4'b1111;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
      if (cpu_dat_i[0]) disable loop;
    end
  end
end
Memory Bus Functional Model
always begin // Memory bus-functional model
  mem_ack_o <= 1'b0;
  mem_dat_o <= 32'h00000000;
  @(posedge clk);
  while (!(bus_cyc && mem_stb_i)) @(posedge clk);
  if (!bus_we)
    mem_dat_o <= 32'h00000000; // in place of read data
  mem_ack_o <= 1'b1;
  @(posedge clk);
end
Bus Arbiter
• Uses sobel_cyc_o and cpu_cyc_o as request inputs
• If both request at the same time, give accelerator priority
• Mealy FSM
Bus Arbiter
always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else arbiter_current_state <= arbiter_next_state;
always @* // Arbiter logic (combinational, so blocking assignments are used)
  case (arbiter_current_state)
    sobel: if (sobel_cyc_o) begin
        sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
      else if (!sobel_cyc_o && cpu_cyc_o) begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
      end
      else begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
    cpu: if (cpu_cyc_o) begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
      end else if (sobel_cyc_o && !cpu_cyc_o) begin
        sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end else begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
  endcase
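The arbiter's behavior is easier to follow when stepped through cycle by cycle. The Python sketch below mirrors the case statement above; the function name `arbiter_step` and the request sequence are made up for illustration.

```python
def arbiter_step(state: str, sobel_req: bool, cpu_req: bool):
    """Return (sobel_gnt, cpu_gnt, next_state) for one clock cycle."""
    if state == "sobel":
        if sobel_req:
            return True, False, "sobel"
        if cpu_req:
            return False, True, "cpu"
        return False, False, "sobel"
    if state == "cpu":
        if cpu_req:
            return False, True, "cpu"
        if sobel_req:
            return True, False, "sobel"
        return False, False, "sobel"
    raise ValueError(f"unknown state {state!r}")


trace = []
state = "sobel"                                    # reset state
for sobel_req, cpu_req in [(0, 1), (1, 1), (1, 0), (1, 1), (0, 0)]:
    sobel_gnt, cpu_gnt, state = arbiter_step(state, bool(sobel_req), bool(cpu_req))
    trace.append((sobel_gnt, cpu_gnt, state))
for entry in trace:
    print(entry)
```

In the second cycle both masters request, yet the processor keeps the grant because it already owns the bus. The accelerator wins a simultaneous request only when the arbiter is in the sobel state, which is also where it returns whenever the bus goes idle.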
Simulation Results
• See waveforms in textbook
  • Demonstrates sequencing and address generation
• But what about…
  • Data values computed correctly
  • Interactions between processor and accelerator
• Need to use more sophisticated verification techniques
  • Due to complexity of the design
Summary
• Accelerators boost performance using parallel hardware
  • Replication, pipelining, …
• Amdahl’s Law
  • Best payback from accelerating a kernel
• DMA avoids processor overhead
• Verification requires advanced techniques
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 

Digital Design Chapter 9 - Accelerating Performance with Parallel Hardware

  • 1. Digital Design: An Embedded Systems Approach Using Verilog. Chapter 9: Accelerators. Portions of this work are from the book, Digital Design: An Embedded Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.
  • 2. Performance and Parallelism
      - A processor core performs steps in sequence
          - Performance limited by the instruction rate
      - Accelerating performance
          - Perform steps in parallel
          - Takes less time overall to complete an operation
      - Instruction-level parallelism
          - Within a processor core
          - Pipelining, multiple-issue
      - Accelerators
          - Custom hardware for parallel operations
  • 3. Achievable Parallelism
      - How many steps can be performed at once?
          - Regularly structured data
          - Independent processing steps
      - Examples
          - Video and image pixel processing
          - Audio or sensor signal processing
      - Constrained by data dependencies
          - Operations that depend on results of previous steps
  • 4. Algorithm Kernels
      - Algorithm: specification of the required processing steps
          - Often expressed in a programming language
      - Kernel: the part that involves the most intensive, repetitive processing
          - “10% of operations take 90% of the time”
      - Accelerating a kernel with parallel hardware gives the best payback
  • 5. Amdahl’s Law
      - Time for an algorithm is $t$
      - Fraction $f$ is spent on a kernel: $t = ft + (1 - f)t$
      - Accelerator speeds up kernel by a factor $s$: $t' = \frac{ft}{s} + (1 - f)t$
      - Overall speedup factor $s'$: $s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)}$
          - For large $f$, $s' \to s$
          - For small $f$, $s' \to 1$
  • 6. Amdahl’s Law Example
      - An algorithm with two kernels
          - Kernel 1: 80% of time, can be sped up 10 times
          - Kernel 2: 15% of time, can be sped up 100 times
      - Which speedup gives best overall improvement?
          - For kernel 1: $s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$
          - For kernel 2: $s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$
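As a worked extension of the example above (not on the original slide), the same reasoning can be applied to the case where both kernels are accelerated at once. This assumes the two kernels do not overlap, so the fractions simply add, and uses the usual multi-kernel form of Amdahl's Law with kernel fractions $f_k$ and speedups $s_k$:

```latex
s' = \frac{1}{\sum_k \frac{f_k}{s_k} + \bigl(1 - \sum_k f_k\bigr)}
   = \frac{1}{\frac{0.8}{10} + \frac{0.15}{100} + (1 - 0.8 - 0.15)}
   = \frac{1}{0.08 + 0.0015 + 0.05}
   = \frac{1}{0.1315}
   \approx 7.60
```

Even with both accelerators in place, the unaccelerated 5% of the run time accounts for more than a third of the remaining time, which is why the overall speedup stays well below either kernel's individual factor.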
  • 7. Parallel Architectures
      - An architecture for an accelerator specifies
          - Processing blocks
          - Data flow between them
      - Parallelism through replication
          - Multiple identical blocks operating on different data elements
          - Works well when elements can be processed independently
  • 8. Parallel Architectures
      - Parallelism through pipelining
          - Break a computation into steps and perform them in assembly-line fashion
          - Latency (time to complete a single operation) is not reduced (ideally it is unchanged)
          - Throughput (rate of completion of operations) is increased
              - Ideally by a factor equal to the number of pipeline stages
      - [Figure: data in -> step 1 -> step 2 -> step 3 -> data out]
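The assembly-line idea on the pipelining slide can be sketched in Verilog as follows. This is a minimal illustration under stated assumptions, not code from the book: the module name `pipe3`, its ports, and the three per-stage operations (add one, shift left, invert) are all hypothetical placeholders for real computation steps.

```verilog
// Hypothetical three-stage pipeline. The names and the operation performed
// in each stage are illustrative assumptions, not taken from the slides.
module pipe3 #(parameter W = 8) (
  input  wire         clk,
  input  wire         rst,
  input  wire         valid_in,
  input  wire [W-1:0] data_in,
  output reg          valid_out,
  output reg  [W-1:0] data_out
);
  reg         v1, v2;   // valid flags that travel alongside the data
  reg [W-1:0] r1, r2;   // registers between the stages

  // Valid bits: one flag per stage, shifted along every clock cycle
  always @(posedge clk)
    if (rst) begin
      v1 <= 1'b0;  v2 <= 1'b0;  valid_out <= 1'b0;
    end
    else begin
      v1 <= valid_in;  v2 <= v1;  valid_out <= v2;
    end

  // Data path: each stage works on a different data element in the same cycle
  always @(posedge clk) begin
    r1       <= data_in + 1'b1;  // step 1
    r2       <= r1 << 1;         // step 2
    data_out <= ~r2;             // step 3
  end
endmodule
```

Each input takes three clock cycles to reach `data_out`, but a new input can be accepted on every cycle, so three elements are in flight at once.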
  • 9. Direct Memory Access (DMA)
      - Input/output data for accelerators must be transferred at high speed
          - Using the processor would be too slow
      - Direct memory access
          - I/O controller and accelerator transfer data to and from memory autonomously
          - Program supplies starting address and length
  • 10. Bus Arbitration
      - Bus masters take turns to use bus to access slaves
          - Controlled by a bus arbiter
      - Arbitration policies
          - Priority, round-robin, …
      - [Figure: processor, accelerator, and controller connected to memory over the memory bus, each with request and grant lines to the arbiter]
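To make the round-robin policy mentioned on the arbitration slide concrete, here is a minimal two-master sketch. It is an assumption-laden illustration rather than the book's arbiter: the module name `rr_arbiter2` and its ports are hypothetical, and it re-arbitrates every clock cycle, whereas a real bus arbiter would hold a grant until the current transfer completes.

```verilog
// Hypothetical two-master round-robin arbiter (illustrative sketch only).
// Simplifying assumption: a grant lasts one clock cycle.
module rr_arbiter2 (
  input  wire       clk,
  input  wire       rst,
  input  wire [1:0] request,  // request[0]: master 0, request[1]: master 1
  output reg  [1:0] grant
);
  reg last;  // index of the master that was granted most recently

  // Grant logic: a lone requester wins; on a tie, the master that did
  // not go last wins.
  always @*
    case (request)
      2'b01:   grant = 2'b01;
      2'b10:   grant = 2'b10;
      2'b11:   grant = last ? 2'b01 : 2'b10;
      default: grant = 2'b00;
    endcase

  // Remember which master was served most recently
  always @(posedge clk)
    if (rst)           last <= 1'b1;  // so that master 0 wins the first tie
    else if (grant[0]) last <= 1'b0;
    else if (grant[1]) last <= 1'b1;
endmodule
```

With both masters requesting continuously, the grant alternates between them, so neither can starve the other. A fixed-priority arbiter would instead always resolve the `2'b11` case in favor of the same master.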
  • 11. Block-Processing Accelerator
      - Data arranged in regular groups of contiguous memory locations
      - Accelerator works block by block
          - E.g., images in blocks of 8 × 8 × 16-bit pixels
      - Datapath comprises
          - Memory access: address generation, counters
          - Computation section
          - Control section: finite-state machine(s)
  • 12. Stream-Processing Accelerator
      - Streams of data from an input source
          - E.g., high-speed sensors
      - Digital signal processing (DSP)
          - Analog sensor signal converted to stream of digital sample values
          - Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)
  • 13. Processor/Accelerator Interface
      - Embedded software controls an accelerator
          - Providing control parameters
          - Synchronizing operations
      - Input/output registers and interrupts
          - Interact with the control sequencer
  • 14. Case Study: Edge Detection
      - Illustration of accelerator design
      - Edge detection in video processing
          - Identify where image intensity changes abruptly
          - Typically at the boundary of objects
          - First step in identifying objects in a scene
      - Application areas
          - Video surveillance, computer vision, …
      - For this case study
          - Monochrome images of 640 × 480 × 8-bit pixels
          - Stored row-by-row in memory
          - Pixel values: 0 (black) to 255 (white)
  • 15. Sobel Edge Detection
      - Compute derivatives of intensity in x and y directions
      - Look for minima and maxima (where intensity changes most rapidly)
  • 16. The Sobel Algorithm
      - Use convolution to approximate partial derivatives $D_x$ and $D_y$ at each position
          - Weighted sum of value of a pixel and its eight nearest neighbors
          - Coefficients represented using a 3×3 convolution mask
      - Sobel masks for x and y derivatives

        $$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$

        $$D_x(i,j) = O(i,j) \ast G_x \qquad D_y(i,j) = O(i,j) \ast G_y$$
  • 17. The Sobel Algorithm
      - Combine partial derivatives: $|D| = \sqrt{D_x^2 + D_y^2}$
      - Since we just want maxima and minima in magnitude, approximate as: $|D| = |D_x| + |D_y|$
      - Edge pixels (those on the image border) don’t have eight neighbors
          - Skip computation of $|D|$ for edges
          - Just set them to 0 using software
  • 18. The Algorithm in Pseudocode

      for (row = 1; row <= 478; row = row + 1) begin
        for (col = 1; col <= 638; col = col + 1) begin
          sumx = 0; sumy = 0;
          for (i = -1; i <= +1; i = i + 1) begin
            for (j = -1; j <= +1; j = j + 1) begin
              sumx = sumx + O[row+i][col+j] * Gx[i][j];
              sumy = sumy + O[row+i][col+j] * Gy[i][j];
            end
          end
          D[row][col] = abs(sumx) + abs(sumy);
        end
      end
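The two inner loops of the pseudocode above can be unrolled into a small reference function for one output pixel, which may be handy as a golden model when checking data values in simulation. This is a sketch under assumptions: the names `sobel_ref`, `sobel_mag`, and `p00` to `p22` are hypothetical and do not appear in the slides.

```verilog
// Reference model for one output pixel of the pseudocode above.
// All names here (sobel_ref, sobel_mag, p00..p22) are hypothetical.
module sobel_ref;

  // pRC is the neighbor at row offset R-1 and column offset C-1, so
  // p00 = O[row-1][col-1] and p22 = O[row+1][col+1].
  // The center pixel has weight 0 in both masks, so it is not an input.
  function [10:0] sobel_mag (input [7:0] p00, p01, p02,
                             input [7:0] p10,      p12,
                             input [7:0] p20, p21, p22);
    integer dx, dy;
    begin
      dx = (p02 + 2*p12 + p22) - (p00 + 2*p10 + p20);  // Gx: right column minus left column
      dy = (p00 + 2*p01 + p02) - (p20 + 2*p21 + p22);  // Gy: top row minus bottom row
      if (dx < 0) dx = -dx;
      if (dy < 0) dy = -dy;
      sobel_mag = dx + dy;                             // |D| = |Dx| + |Dy|
    end
  endfunction

  initial begin
    // Vertical edge: left and middle columns black, right column white
    $display("|D| = %0d", sobel_mag(8'd0, 8'd0, 8'd255,
                                    8'd0,       8'd255,
                                    8'd0, 8'd0, 8'd255));
    // Uniform gray neighborhood: no edge
    $display("|D| = %0d", sobel_mag(8'd100, 8'd100, 8'd100,
                                    8'd100,         8'd100,
                                    8'd100, 8'd100, 8'd100));
  end
endmodule
```

For the vertical edge, Dx is 4 × 255 = 1020 and Dy is 0, so the first line should report 1020. The uniform neighborhood should report 0.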
  • 19. Data Formats and Rates
      - Pixel values: 0 to 255 (8 bits)
      - Coefficients are 0, ±1 and ±2
          - Partial products: -510 to +510 (10 bits)
          - Dx and Dy: -1020 to +1020 (11 bits)
          - |D|: 0 to 2040 (11 bits)
      - Final pixel value: scale back to 8 bits
      - Video rate: 30 frames/sec
          - 640 × 480 = 307,200 pixels
          - 307,200 × 30 ≈ 10 million pixels/sec
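The ranges and bit widths listed on this slide follow from the largest pixel value and the mask coefficients. A short derivation, assuming two's-complement representation for the signed quantities:

```latex
\begin{aligned}
|\text{partial product}| &\le 2 \times 255 = 510 < 2^{9}
  &&\Rightarrow \text{10 bits, signed} \\
|D_x|,\ |D_y| &\le (1 + 2 + 1) \times 255 = 1020 < 2^{10}
  &&\Rightarrow \text{11 bits, signed} \\
|D| = |D_x| + |D_y| &\le 2 \times 1020 = 2040 < 2^{11}
  &&\Rightarrow \text{11 bits, unsigned} \\
\lfloor 2040 / 2^{3} \rfloor &= 255
  &&\Rightarrow \text{dropping the 3 low bits gives 8 bits}
\end{aligned}
```

The last line is one way to read "scale back to 8 bits". It matches the `D[10:3]` slice used in the pixel datapath code later in the deck.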
  • 20. Data Dependencies
      - Pixels can be computed independently
      - For each pixel:
  • 21. System Architecture
      - Data dependencies suggest a pipeline
      - Coefficient multiplies are simple shift/negate, so merge with adder stage
  • 22. Memory Bandwidth
      - Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
          - Memory is 32 bits wide, byte addressable
          - Bandwidth = 50M operations/sec
      - Camera produces 10 Mpixels/sec
          - Accelerator needs to process at this rate
      - (8 reads + 1 write) × 10 Mpixels/sec = 90M operations/sec
          - Greater than memory bandwidth
  • 23. Memory Bandwidth
      - Read 4 pixels at once from each of previous, current, and next rows
          - Store in accelerator to compute multiple derivative image pixels
      - Produce derivative pixels row-by-row, left-to-right
          - Read 3 × 32-bit words for every 4th derivative pixel computed
          - Write 4 pixels at a time
      - (3 reads + 1 write) / 4 × 10 Mpixels/sec = 10M operations/sec = 20% of available memory bandwidth
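The bandwidth figures on the two slides above can be checked with a few lines of arithmetic, using only the numbers stated there (20 ns per memory operation, roughly 10 Mpixels/sec from the camera):

```latex
\begin{aligned}
\text{available bandwidth} &= \frac{1\ \text{operation}}{20\ \text{ns}}
  = 50 \times 10^{6}\ \text{operations/sec} \\
\text{one pixel at a time} &= (8 + 1) \times 10 \times 10^{6}
  = 90 \times 10^{6}\ \text{operations/sec} > 50 \times 10^{6} \\
\text{four pixels per word} &= \frac{3 + 1}{4} \times 10 \times 10^{6}
  = 10 \times 10^{6}\ \text{operations/sec} \\
\text{utilization} &= \frac{10 \times 10^{6}}{50 \times 10^{6}} = 20\%
\end{aligned}
```

Reading whole 32-bit words and keeping neighboring pixels inside the accelerator cuts memory traffic by a factor of nine, which is what makes the design feasible on this memory system.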
  • 24. Sobel Accelerator Architecture
  • 25. Accelerator Sequence
      - Steady state
          - Write 4 result pixels
          - Read 4 pixels for previous, current, next rows
          - Compute for 4 cycles
          - Repeat…
      - Start of row
          - Omit writes until pipeline full
      - End of row
          - Omit reads to drain pipeline
  • 26. Memory Operation Timing
      - Steady state
  • 27. Pixel Datapath

      // Computation datapath signals
      reg [31:0] prev_row, curr_row, next_row;
      reg [7:0]  O [-1:+1][-1:+1];
      reg signed [10:0] Dx, Dy, D;
      reg [7:0]  abs_D;
      reg [31:0] result_row;
      ...
      // Computational datapath
      always @(posedge clk_i) // Previous row register
        if (prev_row_load)
          prev_row <= dat_i;
        else if (shift_en)
          prev_row[31:8] <= prev_row[23:0];
      ... // Current row register
      ... // Next row register

      function [10:0] abs (input signed [10:0] x);
        abs = x >= 0 ? x : -x;
      endfunction
      ...

  • 28. Pixel Datapath

      always @(posedge clk_i) // Computation pipeline
        if (shift_en) begin
          D = abs(Dx) + abs(Dy);
          abs_D <= D[10:3];
          Dx <= - $signed({3'b000, O[-1][-1]})
                + $signed({3'b000, O[-1][+1]})
                - ($signed({3'b000, O[ 0][-1]}) << 1)
                + ($signed({3'b000, O[ 0][+1]}) << 1)
                - $signed({3'b000, O[+1][-1]})
                + $signed({3'b000, O[+1][+1]});
          Dy <= $signed({3'b000, O[-1][-1]})
                + ($signed({3'b000, O[-1][ 0]}) << 1)
                + $signed({3'b000, O[-1][+1]})
                - $signed({3'b000, O[+1][-1]})
                - ($signed({3'b000, O[+1][ 0]}) << 1)
                - $signed({3'b000, O[+1][+1]});
          ...

  • 29. Pixel Datapath

          O[-1][-1] <= O[-1][ 0];
          O[-1][ 0] <= O[-1][+1];
          O[-1][+1] <= prev_row[31:24];
          O[ 0][-1] <= O[ 0][ 0];
          O[ 0][ 0] <= O[ 0][+1];
          O[ 0][+1] <= curr_row[31:24];
          O[+1][-1] <= O[+1][ 0];
          O[+1][ 0] <= O[+1][+1];
          O[+1][+1] <= next_row[31:24];
        end

      always @(posedge clk_i) // Result row register
        if (shift_en)
          result_row <= {result_row[23:0], abs_D};
  • 30. Address Generation
      - Given an image in memory at base address B
          - Address for pixel in row r, column c is B + r × 640 + c
      - Base address (B) is fixed
      - Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
      - Use word-aligned addresses
          - Two least-significant bits always 00
          - Increment word address by 1
  • 31. Address Generation
  • 32. Address Generation

      always @(posedge clk_i) // O base address register
        if (O_base_ce)
          O_base <= dat_i[21:2];

      always @(posedge clk_i) // O address offset counter
        if (offset_reset)
          O_offset <= 0;
        else if (O_offset_cnt_en)
          O_offset <= O_offset + 1;

      always @(posedge clk_i) // D base address register
        if (D_base_ce)
          D_base <= dat_i[21:2];

      always @(posedge clk_i) // D address offset counter
        if (offset_reset)
          D_offset <= 0;
        else if (D_offset_cnt_en)
          D_offset <= D_offset + 1;
      ...

  • 33. Address Generation

      assign O_prev_addr = O_base + O_offset;
      assign O_curr_addr = O_prev_addr + 640/4;
      assign O_next_addr = O_prev_addr + 1280/4;
      assign D_addr      = D_base + D_offset;

      assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                           curr_row_load ? O_curr_addr :
                           next_row_load ? O_next_addr :
                                           D_addr;
      assign adr_o[1:0]  = 2'b00;
  • 34. Control/Status Registers

      Register  Offset  Read/Write  Purpose
      Int_en    0       Write-only  Interrupt enable (bit 0).
      Start     4       Write-only  Write causes image processing to start (value ignored).
      O_base    8       Write-only  Original image base address.
      D_base    12      Write-only  Derivative image base address + 640.
      Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.
  • 35. Slave Bus Interface

      assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
      assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
      assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;

      always @(posedge clk_i) // Interrupt enable register
        if (rst_i)
          int_en <= 1'b0;
        else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
          int_en <= dat_i[0];

      always @(posedge clk_i) // Status register
        if (rst_i)
          done <= 1'b0;
        else if (done_set)
          // This occurs when last write is acknowledged,
          // and so cannot coincide with a read of the status register.
          done <= 1'b1;
        else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
          // An acknowledged read of the status register clears done
          done <= 1'b0;

      assign int_req = int_en && done;
      ...
  • 36. Slave Bus Interface

      always @(posedge clk_i) // Generate ack output
        ack_o <= cyc_i && stb_i && !ack_o;

      // Wishbone data output multiplexer
      always @*
        if (cyc_i && stb_i && !we_i)
          if (adr_i == 2'b00)
            dat_o = {31'b0, done};  // status register read
          else
            dat_o = 32'b0;          // other registers read as 0
        else
          dat_o = result_row;       // for master write

  • 37. Control Sequencing
      - Use a finite-state machine
      - Counters keep track of rows (0 to 477) and columns (0 to 159)
      - See textbook for details of FSM output functions
  • 38. State Transition Diagram
  • 39. Accelerator Verification
      - Simulation-based verification of each section of the accelerator
          - Slave bus operations
          - Computation sequencing
          - Master bus operations
          - Address generation
          - Pixel computation
      - Testbench including the accelerator
          - Bus functional processor model
          - Simplified memory and bus arbiter models
  • 40. Sobel Verification Testbench
      - [Figure: Processor BFM, Sobel Accelerator, Memory Model, and Arbiter connected by a multiplexed bus (muxes and connections)]
  • 41. Processor Bus Functional Model

      initial begin // Processor bus-functional model
        cpu_adr_o <= 23'h000000;
        cpu_sel_o <= 4'b0000;
        cpu_dat_o <= 32'h00000000;
        cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
        @(negedge rst);
        @(posedge clk);
        // Write 008000 (hex) to O_base_addr register
        bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
        // Write 053000 + 280 (hex) to D_base_addr register
        bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
        // Write 1 to interrupt control register (enable interrupt)
        bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
        // Write to start register (data value ignored)
        bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
        // End of write operations
        ...

  • 42. Processor Bus Functional Model

        cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
        begin: loop
          forever begin
            #10000;
            @(posedge clk);
            // Read status register
            cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
            cpu_sel_o <= 4'b1111;
            cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
            @(posedge clk);
            while (!cpu_ack_i) @(posedge clk);
            cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
            if (cpu_dat_i[0]) disable loop;
          end
        end
      end

  • 43. Memory Bus Functional Model

      always begin // Memory bus-functional model
        mem_ack_o <= 1'b0;
        mem_dat_o <= 32'h00000000;
        @(posedge clk);
        while (!(bus_cyc && mem_stb_i)) @(posedge clk);
        if (!bus_we)
          mem_dat_o <= 32'h00000000; // in place of read data
        mem_ack_o <= 1'b1;
        @(posedge clk);
      end
  • 44. Bus Arbiter
      - Uses sobel_cyc_o and cpu_cyc_o as request inputs
      - If both request at the same time, give accelerator priority
      - Mealy FSM
  • 45. Bus Arbiter

      always @(posedge clk) // Arbiter FSM register
        if (rst)
          arbiter_current_state <= sobel;
        else
          arbiter_current_state <= arbiter_next_state;

      always @* // Arbiter logic (combinational, so blocking assignments)
        case (arbiter_current_state)
          sobel:
            if (sobel_cyc_o) begin
              sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
            end
            else if (!sobel_cyc_o && cpu_cyc_o) begin
              sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
            end
            else begin
              sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
            end
          cpu:
            if (cpu_cyc_o) begin
              sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
            end
            else if (sobel_cyc_o && !cpu_cyc_o) begin
              sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
            end
            else begin
              sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
            end
        endcase
  • 46. Simulation Results
      - See waveforms in textbook
          - Demonstrates sequencing and address generation
      - But what about…
          - Data values computed correctly
          - Interactions between processor and accelerator
      - Need to use more sophisticated verification techniques
          - Due to complexity of the design
  • 47. Summary
      - Accelerators boost performance using parallel hardware
          - Replication, pipelining, …
      - Amdahl’s Law
          - Best payback from accelerating a kernel
      - DMA avoids processor overhead
      - Verification requires advanced techniques

Editor's Notes

  1. 24 September 2021