Digital Design Chapter 9 - Accelerating Performance with Parallel Hardware

Digital Design:
An Embedded Systems
Approach Using Verilog
Chapter 9
Accelerators
Portions of this work are from the book, Digital Design: An Embedded
Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan
Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.

Verilog
Digital Design — Chapter 9 — Accelerators 2
Performance and Parallelism
 A processor core performs steps in sequence
 Performance limited by the instruction rate
 Accelerating performance
 Perform steps in parallel
 Takes less time overall to complete an operation
 Instruction-level parallelism
 Within a processor core
 Pipelining, multiple-issue
 Accelerators
 Custom hardware for parallel operations

Achievable Parallelism
 How many steps can be performed at
once?
 Regularly structured data
 Independent processing steps
 Examples
 Video and image pixel processing
 Audio or sensor signal processing
 Constrained by data dependencies
 Operations that depend on results of
previous steps

Algorithm Kernels
 Algorithm: specification of the required
processing steps
 Often expressed in a programming
language
 Kernel: the part that involves the most
intensive, repetitive processing
 “10% of operations take 90% of the time”
 Accelerating a kernel with parallel
hardware gives the best payback

Amdahl’s Law
 Time for an algorithm is t
 Fraction f is spent on a kernel
$$t = ft + (1 - f)t$$
 Accelerator speeds up
kernel by a factor s
$$t' = \frac{ft}{s} + (1 - f)t$$
 Overall speedup factor s'
$$s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)}$$
 For large f, s' → s
 For small f, s' → 1

Amdahl’s Law Example
 An algorithm with two kernels
 Kernel 1: 80% of time, can be sped up 10 times
 Kernel 2: 15% of time, can be sped up 100 times
 Which speedup gives best overall improvement?
 For kernel 1:
$$s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$$
 For kernel 2:
$$s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$$
 Accelerating kernel 1 gives the better overall improvement
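
As a further illustration that is not part of the original example, the same formula, s' = 1 / (f/s + (1 − f)), can be extended to the case where both kernels are accelerated. This sketch assumes the two kernels do not overlap in time and that the remaining 5% of the algorithm is unaffected:

```latex
s' = \frac{1}{\frac{0.8}{10} + \frac{0.15}{100} + (1 - 0.8 - 0.15)}
   = \frac{1}{0.08 + 0.0015 + 0.05}
   \approx 7.60
```

Under these assumptions the unaccelerated 5% limits the overall speedup to at most 1/0.05 = 20, however fast the two accelerators are.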

Parallel Architectures
 An architecture for an accelerator
specifies
 Processing blocks
 Data flow between them
 Parallelism through replication
 Multiple identical blocks operating on
different data elements
 Works well when elements can be
processed independently

Parallel Architectures
 Parallelism through pipelining
 Break a computation into steps, perform them in
assembly-line fashion
 Latency (time to complete a single operation) is
not reduced
 Throughput (rate of completion of operations) is
increased
 Ideally by a factor equal to the number of pipeline stages
[Figure: data in → step 1 → step 2 → step 3 → data out]
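
A minimal sketch of such a pipeline in Verilog may help make the latency/throughput distinction concrete. The module name `pipeline3`, the port names, and the three arithmetic steps are hypothetical and chosen only for illustration; they do not come from the case study.

```verilog
// Hypothetical 3-stage pipeline computing y = ((x + 1) * 3) - 2.
// Each step is separated from the next by a register.
module pipeline3 (
  input  wire        clk,
  input  wire [7:0]  x,   // a new operand may be supplied every cycle
  output reg  [15:0] y    // result for the operand supplied 3 cycles earlier
);
  reg [8:0]  s1;  // result of step 1 (at most 256)
  reg [10:0] s2;  // result of step 2 (at most 768)

  always @(posedge clk) begin
    s1 <= x + 9'd1;     // step 1
    s2 <= s1 * 11'd3;   // step 2
    y  <= s2 - 16'd2;   // step 3
  end
endmodule
```

Each result still takes three clock cycles to emerge, but once the pipeline is full a new result is produced on every cycle.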

Direct Memory Access (DMA)
 Input/output data for accelerators
must be transferred at high speed
 Using the processor would be too slow
 Direct memory access
 I/O controller and accelerator transfer data
to and from memory autonomously
 Program supplies starting address and
length

Bus Arbitration
 Bus masters take turns to use bus to access
slaves
 Controlled by a bus arbiter
 Arbitration policies
 Priority, round-robin,
…
[Figure: the processor, accelerator, and controller each exchange request and grant signals with the arbiter; all share the memory bus to the memory]
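
To illustrate the round-robin policy mentioned above, here is a minimal sketch of a two-master arbiter. The module name `rr_arbiter2`, its ports, and the registered-grant structure are assumptions made for this example; it is not the arbiter used later in the case study, which gives the accelerator fixed priority.

```verilog
// Hypothetical two-master round-robin arbiter; all names are illustrative.
module rr_arbiter2 (
  input  wire clk,
  input  wire rst,          // synchronous reset
  input  wire req0, req1,   // bus requests from master 0 and master 1
  output reg  gnt0, gnt1    // bus grants; at most one is asserted
);
  reg last;  // master that was granted the bus most recently

  // True while the current owner is still requesting the bus
  wire busy = (gnt0 && req0) || (gnt1 && req1);

  // Master 0 wins when it requests and either master 1 is idle
  // or master 1 was served most recently
  wire pick0 = req0 && (!req1 || last == 1'b1);
  wire pick1 = req1 && !pick0;

  always @(posedge clk)
    if (rst) begin
      gnt0 <= 1'b0;
      gnt1 <= 1'b0;
      last <= 1'b1;   // so that master 0 goes first after reset
    end
    else if (!busy) begin
      gnt0 <= pick0;
      gnt1 <= pick1;
      if (pick0)      last <= 1'b0;
      else if (pick1) last <= 1'b1;
    end
endmodule
```

Because the grants are registered, a master waits at least one clock cycle between requesting and being granted the bus. When both masters request continuously, ownership alternates each time the current owner releases its request.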

Block-Processing Accelerator
 Data arranged in regular groups of
contiguous memory locations
 Accelerator works block by block
 E.g., images in blocks of 8 × 8 × 16-bit
pixels
 Datapath comprises
 Memory access: address generation,
counters
 Computation section
 Control section: finite-state machine(s)

Stream-Processing Accelerator
 Streams of data from an input source
 E.g., high-speed sensors
 Digital signal processing (DSP)
 Analog sensor signal converted to stream
of digital sample values
 Filtering, gain/attenuation, frequency-
domain conversion (Fourier transform)

Processor/Accelerator Interface
 Embedded software controls an
accelerator
 Providing control parameters
 Synchronizing operations
 Input/output registers and interrupts
 Interact with the control sequencer

Case Study: Edge Detection
 Illustration of accelerator design
 Edge detection in video processing
 Identify where image intensity changes abruptly
 Typically at the boundary of objects
 First step in identifying objects in a scene
 Application areas
 Video surveillance, computer vision, …
 For this case study
 Monochrome images of 640 × 480 × 8-bit pixels
 Stored row-by-row in memory
 Pixel values: 0 (black) – 255 (white)

Sobel Edge Detection
 Compute derivatives of intensity in x
and y directions
 Look for minima and maxima (where
intensity changes most rapidly)

The Sobel Algorithm
 Use convolution to approximate partial
derivatives Dx and Dy at each position
 Weighted sum of value of a pixel and its eight
nearest neighbors
 Coefficients represented using a 3×3 convolution
mask
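 Sobel masks for x and y derivatives
$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$
$$D_x(i,j) = O(i,j) \ast G_x \qquad D_y(i,j) = O(i,j) \ast G_y$$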

The Sobel Algorithm
 Combine partial derivatives
$$|D| = \sqrt{D_x^2 + D_y^2}$$
 Since we just want maxima and minima
in magnitude, approximate as:
$$|D| \approx |D_x| + |D_y|$$
 Pixels on the image border don’t have eight neighbors
 Skip computation of |D| for border pixels
 Just set them to 0 using software

The Algorithm in Pseudocode
// O is the original image, D the derivative image;
// border rows and columns are skipped
for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end

Data Formats and Rates
 Pixel values: 0 to 255 (8 bits)
 Coefficients are 0, ±1 and ±2
 Partial products: –510 to +510 (10 bits)
 Dx and Dy: –1020 to +1020 (11 bits)
 |D|: 0 to 2040 (11 bits)
 Final pixel value: scale back to 8 bits
 Video rate: 30 frames/sec
 640 × 480 = 307,200 pixels
 307,200 × 30  10 million pixels/sec

Data Dependencies
 Pixels can be computed independently
 For each pixel:

System Architecture
 Data dependencies suggest a pipeline
 Coefficient multiplies are simple shift/negate, so
merge with adder stage

Memory Bandwidth
 Assume memory read/write takes 20ns
(2 cycles of 100MHz clock)
 Memory is 32 bits wide, byte addressable
 Bandwidth = 50M operations/sec
 Camera produces 10Mpixels/sec
 Accelerator needs to process at this rate
 (8 reads + 1 write) × 10Mpixel/sec
= 90M operations/sec
 Greater than memory bandwidth

Memory Bandwidth
 Read 4 pixels at once from each of previous,
current, and next rows
 Store in accelerator to compute multiple derivative
image pixels
 Produce derivative pixels row-by-row, left-to-
right
 Read 3 × 32-bit words for every 4th derivative
pixel computed
 Write 4 pixels at a time
 (3 reads + 1 write) / 4 × 10Mpixel/sec
= 10M operations/sec
= 20% of available memory bandwidth

Sobel Accelerator Architecture

Accelerator Sequence
 Steady state
 Write 4 result pixels
 Read 4 pixels for previous,
current, next rows
 Compute for 4 cycles
 Repeat…
 Start of row
 Omit writes until pipeline
full
 End of row
 Omit reads to drain
pipeline

Memory Operation Timing
 Steady state

Pixel Datapath
// Computation datapath signals
reg [31:0] prev_row, curr_row, next_row;  // 4 pixels from each of three rows
reg [7:0] O [-1:+1][-1:+1];               // 3x3 window of original pixels
reg signed [10:0] Dx, Dy, D;
reg [7:0] abs_D;                          // |D| scaled to 8 bits
reg [31:0] result_row;                    // 4 result pixels awaiting write
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if (prev_row_load) prev_row <= dat_i;
  else if (shift_en) prev_row[31:8] <= prev_row[23:0];
... // Current row register
... // Next row register
function [10:0] abs (input signed [10:0] x);
  abs = x >= 0 ? x : -x;
endfunction
...

Pixel Datapath
always @(posedge clk_i) // Computation pipeline
  if (shift_en) begin
    // D is an intermediate value (blocking assignment) formed from the
    // Dx and Dy values registered on the previous shift
    D = abs(Dx) + abs(Dy);
    abs_D <= D[10:3];  // scale to 8 bits
    // Multiplication by 2 is a left shift; 8-bit pixels are
    // zero-extended to 11 bits before being treated as signed
    Dx <= - $signed({3'b000, O[-1][-1]})
          + $signed({3'b000, O[-1][+1]})
          - ($signed({3'b000, O[ 0][-1]}) << 1)
          + ($signed({3'b000, O[ 0][+1]}) << 1)
          - $signed({3'b000, O[+1][-1]})
          + $signed({3'b000, O[+1][+1]});
    Dy <= $signed({3'b000, O[-1][-1]})
          + ($signed({3'b000, O[-1][ 0]}) << 1)
          + $signed({3'b000, O[-1][+1]})
          - $signed({3'b000, O[+1][-1]})
          - ($signed({3'b000, O[+1][ 0]}) << 1)
          - $signed({3'b000, O[+1][+1]});
    ...
    // Shift the 3x3 window one pixel; the new right-hand column comes
    // from the most-significant byte of each row register
    O[-1][-1] <= O[-1][0];
    O[-1][ 0] <= O[-1][+1];
    O[-1][+1] <= prev_row[31:24];
    O[ 0][-1] <= O[0][ 0];
    O[ 0][ 0] <= O[0][+1];
    O[ 0][+1] <= curr_row[31:24];
    O[+1][-1] <= O[+1][ 0];
    O[+1][ 0] <= O[+1][+1];
    O[+1][+1] <= next_row[31:24];
  end
always @(posedge clk_i) // Result row register
  if (shift_en) result_row <= {result_row[23:0], abs_D};

Address Generation
 Given an image in memory at base
address B
 Address for pixel in row r, column c is
B + r × 640 + c
 Base address (B) is fixed
 Offset (r × 640 + c) increments by 4 for
each group of 4 pixels read/written
 Use word-aligned addresses
 Two least-significant bits always 00
 Increment word address by 1
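
As a worked example of the address calculation, take the base address B = 0x8000 (the value the testbench later writes to the O_base register) and a hypothetical pixel at row r = 2, column c = 8, chosen here only for illustration:

```latex
\begin{aligned}
\text{byte address} &= B + r \times 640 + c = 32768 + 1280 + 8 = 34056 = \texttt{0x8508} \\
\text{word address} &= 34056 / 4 = 8514 = \texttt{0x2142}
\end{aligned}
```

Because c is a multiple of 4 here, the byte address is word-aligned and its two least-significant bits are 00. Moving on to the next group of 4 pixels adds 4 to the byte address, which is 1 in word-address terms.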

Address Generation
// Base registers and offset counters hold word addresses (bits 21:2)
always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base <= dat_i[21:2];
always @(posedge clk_i) // O address offset counter
  if (offset_reset) O_offset <= 0;
  else if (O_offset_cnt_en) O_offset <= O_offset + 1;
always @(posedge clk_i) // D base address register
  if (D_base_ce) D_base <= dat_i[21:2];
always @(posedge clk_i) // D address offset counter
  if (offset_reset) D_offset <= 0;
  else if (D_offset_cnt_en) D_offset <= D_offset + 1;
...
// A row of 640 pixels occupies 640/4 = 160 words
assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr = D_base + D_offset;
assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                     D_addr;
assign adr_o[1:0] = 2'b00; // word-aligned

Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.

Slave Bus Interface
// adr_i selects the register: 00 = Int_en/Status, 01 = Start,
// 10 = O_base, 11 = D_base
assign start = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;
always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];
always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    // Reading the status register clears done (and the interrupt)
    done <= 1'b0;
assign int_req = int_en && done;
...
always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;
// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i) begin
    if (adr_i == 2'b00)
      dat_o = {31'b0, done}; // status register read
    else
      dat_o = 32'b0; // other registers read as 0
  end
  else
    dat_o = result_row; // for master write

Control Sequencing
 Use a finite-state machine
 Counters keep track of rows (0 to 477) and
columns (0 to 159)
 See textbook for details of FSM output
functions

State Transition Diagram

Accelerator Verification
 Simulation-based verification of each section
of the accelerator
 Slave bus operations
 Computation sequencing
 Master bus operations
 Address generation
 Pixel computation
 Testbench including the accelerator
 Bus functional processor model
 Simplified memory and bus arbiter models

Sobel Verification Testbench
[Figure: processor BFM, Sobel accelerator, memory model, and arbiter, connected by a multiplexed bus (muxes and connections)]

Processor Bus Functional Model
initial begin // Processor bus-functional model
  cpu_adr_o <= 23'h000000;
  cpu_sel_o <= 4'b0000;
  cpu_dat_o <= 32'h00000000;
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  @(negedge rst);
  @(posedge clk);
  // Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
  // Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
  // Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
  // Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
  // End of write operations
  ...
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
      cpu_sel_o <= 4'b1111;
      // Assert the bus control signals for a read; without them
      // the slave never acknowledges and the wait below would hang
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0;
      if (cpu_dat_i[0]) disable loop; // done bit set: stop polling
    end
  end
end

Memory Bus Functional Model
always begin // Memory bus-functional model
  mem_ack_o <= 1'b0;
  mem_dat_o <= 32'h00000000;
  @(posedge clk);
  // Wait for a bus cycle addressed to the memory
  while (!(bus_cyc && mem_stb_i)) @(posedge clk);
  if (!bus_we)
    mem_dat_o <= 32'h00000000; // in place of read data
  mem_ack_o <= 1'b1;           // acknowledge for one cycle
  @(posedge clk);
end

Bus Arbiter
 Uses sobel_cyc_o and cpu_cyc_o
as request inputs
 If both request at the same time, give
accelerator priority
 Mealy FSM

Bus Arbiter
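always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else arbiter_current_state <= arbiter_next_state;
// Branches left empty on the slide are filled in here by symmetry
// (an assumption): grant the other requester, or grant nothing and
// return to the sobel state when the bus is idle.
always @* // Arbiter logic
  case (arbiter_current_state)
    sobel: if (sobel_cyc_o) begin
        sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
      else if (!sobel_cyc_o && cpu_cyc_o) begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
      end
      else begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
    cpu: if (cpu_cyc_o) begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
      end else if (sobel_cyc_o && !cpu_cyc_o) begin
        sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end else begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
  endcase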

Simulation Results
 See waveforms in textbook
 Demonstrates sequencing and address
generation
 But what about…
 Data values computed correctly
 Interactions between processor and
accelerator
 Need to use more sophisticated
verification techniques
 Due to complexity of the design
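
One common way to check that data values are computed correctly is to compare the accelerator's output against a behavioral reference model. The following is a minimal self-checking sketch of that idea, not the textbook's testbench: the module name `sobel_ref_tb`, the function `sobel_ref`, and the test windows are all hypothetical. It assumes the same scaling as the datapath, where the 11-bit |D| is reduced to 8 bits by dropping the three least-significant bits.

```verilog
// Hypothetical reference model and self-check; names are illustrative.
module sobel_ref_tb;

  // Reference model: |Dx| + |Dy| for a 3x3 window, scaled to 8 bits
  // by dropping the three least-significant bits
  function [7:0] sobel_ref (
    input [7:0] p00, p01, p02,   // previous row: left, middle, right
    input [7:0] p10, p11, p12,   // current row (p11 has zero weight)
    input [7:0] p20, p21, p22    // next row
  );
    integer dx, dy, d;
    begin
      dx = -p00 + p02 - 2*p10 + 2*p12 - p20 + p22;
      dy =  p00 + 2*p01 + p02 - p20 - 2*p21 - p22;
      d  = (dx < 0 ? -dx : dx) + (dy < 0 ? -dy : dy);
      sobel_ref = d >> 3;
    end
  endfunction

  initial begin
    // Vertical edge: left column black (0), right column white (255).
    // Dx = 4 * 255 = 1020 and Dy = 0, so |D| = 1020 and 1020 >> 3 = 127.
    if (sobel_ref(0, 128, 255,  0, 128, 255,  0, 128, 255) !== 8'd127)
      $display("FAIL: vertical edge");
    // Uniform region: both derivatives are 0
    else if (sobel_ref(90, 90, 90,  90, 90, 90,  90, 90, 90) !== 8'd0)
      $display("FAIL: uniform region");
    else
      $display("PASS");
    $finish;
  end
endmodule
```

In a fuller testbench, a function like this could be applied to the image data held in the memory model and compared with each result pixel the accelerator writes back.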

Summary
 Accelerators boost performance using
parallel hardware
 Replication, pipelining, …
 Amdahl’s Law
 Best payback from accelerating a kernel
 DMA avoids processor overhead
 Verification requires advanced
techniques
