Digital Design:
An Embedded Systems
Approach Using Verilog
Chapter 9
Accelerators
Portions of this work are from the book, Digital Design: An Embedded
Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan
Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.
Verilog
Digital Design — Chapter 9 — Accelerators 2
Performance and Parallelism
• A processor core performs steps in sequence
  - Performance limited by the instruction rate
• Accelerating performance
  - Perform steps in parallel
  - Takes less time overall to complete an operation
• Instruction-level parallelism
  - Within a processor core
  - Pipelining, multiple-issue
• Accelerators
  - Custom hardware for parallel operations
Achievable Parallelism
• How many steps can be performed at once?
• Regularly structured data
  - Independent processing steps
• Examples
  - Video and image pixel processing
  - Audio or sensor signal processing
• Constrained by data dependencies
  - Operations that depend on results of previous steps
Algorithm Kernels
• Algorithm: specification of the required processing steps
  - Often expressed in a programming language
• Kernel: the part that involves the most intensive, repetitive processing
  - “10% of operations take 90% of the time”
• Accelerating a kernel with parallel hardware gives the best payback
Amdahl’s Law
• Time for an algorithm is t
  - Fraction f is spent on a kernel: $t = ft + (1 - f)t$
• Accelerator speeds up kernel by a factor s: $t' = \frac{ft}{s} + (1 - f)t$
• Overall speedup factor s': $s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)}$
  - For large f, $s' \to s$
  - For small f, $s' \to 1$
Amdahl’s Law Example
• An algorithm with two kernels
  - Kernel 1: 80% of time, can be sped up 10 times
  - Kernel 2: 15% of time, can be sped up 100 times
• Which speedup gives best overall improvement?
  - For kernel 1: $s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$
  - For kernel 2: $s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$
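The comparison can be pushed one step further by asking how much each kernel could ever contribute. The worked equations below are a sketch, not part of the original slides: they take the limit of Amdahl's Law as s grows without bound, and then assume (an added assumption) that both kernels are accelerated at once while the remaining 5% of the time is unchanged.

```latex
% Upper bound on overall speedup as the kernel speedup s grows without limit
\[
  s'_{\max} = \lim_{s \to \infty} \frac{1}{\frac{f}{s} + (1 - f)} = \frac{1}{1 - f}
\]
% Kernel 1 (f = 0.8) and kernel 2 (f = 0.15)
\[
  \frac{1}{1 - 0.8} = 5, \qquad \frac{1}{1 - 0.15} = \frac{1}{0.85} \approx 1.18
\]
% Assumption: both kernels accelerated together, remaining 5% not accelerated
\[
  s' = \frac{1}{\frac{0.8}{10} + \frac{0.15}{100} + 0.05} = \frac{1}{0.1315} \approx 7.6
\]
```

Even an infinitely fast accelerator for kernel 2 could not reach the 3.57 obtained from a tenfold speedup of kernel 1, which is why the fraction f matters more than the raw factor s.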
Parallel Architectures
• An architecture for an accelerator specifies
  - Processing blocks
  - Data flow between them
• Parallelism through replication
  - Multiple identical blocks operating on different data elements
  - Works well when elements can be processed independently
Parallel Architectures
• Parallelism through pipelining
  - Break a computation into steps, perform them in assembly-line fashion
  - Latency (time to complete a single operation) is not reduced
  - Throughput (rate of completion of operations) is increased
    - Ideally by a factor equal to the number of pipeline stages
[Figure: data in → step 1 → step 2 → step 3 → data out]
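The three-step pipeline in the figure can be sketched in Verilog as follows. This is a minimal illustration under stated assumptions, not code from the textbook: the module name, the WIDTH parameter, and the three placeholder operations (increment, shift, invert) are all hypothetical stand-ins for real processing steps.

```verilog
// Hypothetical three-stage pipeline; the operations are placeholders only.
module pipeline3 #(parameter WIDTH = 8) (
  input  wire             clk,
  input  wire [WIDTH-1:0] data_in,
  output reg  [WIDTH-1:0] data_out
);
  reg [WIDTH-1:0] stage1, stage2;   // registers between the steps

  always @(posedge clk) begin
    stage1   <= data_in + 1'b1;     // step 1 (placeholder operation)
    stage2   <= stage1 << 1;        // step 2 (placeholder operation)
    data_out <= ~stage2;            // step 3 (placeholder operation)
  end
endmodule
```

A new input is accepted on every clock edge, and each result appears three clock edges after its input was sampled. Throughput is one result per cycle, while the latency of any single operation is still three cycles.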
Direct Memory Access (DMA)
• Input/output data for accelerators must be transferred at high speed
  - Using the processor would be too slow
• Direct memory access
  - I/O controller and accelerator transfer data to and from memory autonomously
  - Program supplies starting address and length
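The idea that the program supplies only a starting address and a length can be sketched as an address generator. The module below is a hypothetical illustration: its name, port names, and the 22-bit address width are assumptions, and it models only the address counting, not a complete DMA controller.

```verilog
// Hypothetical DMA address generator; names and widths are illustrative only.
module dma_addr_gen #(parameter AW = 22) (
  input  wire          clk,
  input  wire          rst,
  input  wire          start,       // pulsed by software after writing the parameters
  input  wire [AW-1:0] start_addr,  // starting word address supplied by the program
  input  wire [AW-1:0] length,      // number of words to transfer
  input  wire          ack,         // memory acknowledges the current transfer
  output reg  [AW-1:0] addr,        // address of the current transfer
  output wire          busy
);
  reg [AW-1:0] remaining;           // words still to be transferred

  assign busy = (remaining != 0);

  always @(posedge clk)
    if (rst) begin
      addr      <= 0;
      remaining <= 0;
    end
    else if (start && !busy) begin  // load the parameters, begin the transfer
      addr      <= start_addr;
      remaining <= length;
    end
    else if (busy && ack) begin     // one word done, advance to the next
      addr      <= addr + 1'b1;
      remaining <= remaining - 1'b1;
    end
endmodule
```

Once started, the counter advances on each acknowledged transfer without any further processor involvement, and busy falls when the last word has been moved.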
Bus Arbitration
• Bus masters take turns to use bus to access slaves
  - Controlled by a bus arbiter
• Arbitration policies
  - Priority, round-robin, …
[Figure: the processor, accelerator, and controller each exchange request and grant signals with the arbiter and share the memory bus to reach memory]
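As an illustration of the round-robin policy, here is a small sketch of a two-master arbiter. It is a hypothetical example, not the arbiter used later in the case study (which gives the accelerator priority), and all names are assumptions. It re-arbitrates on every clock cycle, so it suits single-cycle transfers; holding a grant for a multi-cycle bus transaction would need extra state.

```verilog
// Hypothetical two-master round-robin arbiter (re-arbitrates every cycle).
module rr_arbiter2 (
  input  wire clk,
  input  wire rst,
  input  wire req0, req1,   // requests from master 0 and master 1
  output wire gnt0, gnt1    // at most one grant is asserted at a time
);
  reg last;                 // index of the master granted most recently

  // When both masters request, grant the one that did NOT have the bus last.
  assign gnt0 = req0 && (!req1 || last == 1'b1);
  assign gnt1 = req1 && (!req0 || last == 1'b0);

  always @(posedge clk)
    if (rst)       last <= 1'b1;   // master 0 wins the first tie
    else if (gnt0) last <= 1'b0;
    else if (gnt1) last <= 1'b1;
endmodule
```

With both requests held high, the grants alternate between the two masters, whereas a fixed-priority arbiter would keep granting the same master.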
Block-Processing Accelerator
• Data arranged in regular groups of contiguous memory locations
  - Accelerator works block by block
  - E.g., images in blocks of 8 × 8 × 16-bit pixels
• Datapath comprises
  - Memory access: address generation, counters
  - Computation section
  - Control section: finite-state machine(s)
Stream-Processing Accelerator
• Streams of data from an input source
  - E.g., high-speed sensors
• Digital signal processing (DSP)
  - Analog sensor signal converted to stream of digital sample values
  - Filtering, gain/attenuation, frequency-domain conversion (Fourier transform)
Processor/Accelerator Interface
• Embedded software controls an accelerator
  - Providing control parameters
  - Synchronizing operations
• Input/output registers and interrupts
  - Interact with the control sequencer
Case Study: Edge Detection
• Illustration of accelerator design
• Edge detection in video processing
  - Identify where image intensity changes abruptly
  - Typically at the boundary of objects
  - First step in identifying objects in a scene
• Application areas
  - Video surveillance, computer vision, …
• For this case study
  - Monochrome images of 640 × 480 × 8-bit pixels
  - Stored row-by-row in memory
  - Pixel values: 0 (black) to 255 (white)
Sobel Edge Detection
• Compute derivatives of intensity in x and y directions
• Look for minima and maxima (where intensity changes most rapidly)
The Sobel Algorithm
• Use convolution to approximate partial derivatives Dx and Dy at each position
  - Weighted sum of value of a pixel and its eight nearest neighbors
  - Coefficients represented using a 3×3 convolution mask
• Sobel masks for x and y derivatives
  $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$, $G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$
  $D_x(i, j) = O(i, j) \otimes G_x$, $D_y(i, j) = O(i, j) \otimes G_y$
The Sobel Algorithm
• Combine partial derivatives: $|D| = \sqrt{D_x^2 + D_y^2}$
• Since we just want maxima and minima in magnitude, approximate as: $|D| = |D_x| + |D_y|$
• Pixels on the image border don’t have eight neighbors
  - Skip computation of |D| for border pixels
  - Just set them to 0 using software
The Algorithm in Pseudocode
// O is the original image, D the derivative image; border pixels are skipped
for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end
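To make the loop body concrete, the worked example below applies it to a single 3×3 neighborhood. The pixel values are an invented illustration (a vertical step edge with two black columns and one white column), not data from the slides.

```latex
% Hypothetical neighborhood: a vertical edge, black (0) on the left, white (255) on the right
\[
  O = \begin{bmatrix} 0 & 0 & 255 \\ 0 & 0 & 255 \\ 0 & 0 & 255 \end{bmatrix}
\]
% Only the right-hand column contributes to sumx (coefficients +1, +2, +1)
\[
  \mathit{sumx} = (+1)(255) + (+2)(255) + (+1)(255) = 1020
\]
% The top and bottom rows cancel in sumy
\[
  \mathit{sumy} = (+1)(255) + (-1)(255) = 0
\]
\[
  |D| \approx |\mathit{sumx}| + |\mathit{sumy}| = 1020, \qquad
  \sqrt{1020^2 + 0^2} = 1020
\]
```

For a purely horizontal or vertical edge the approximation matches the exact magnitude. On a diagonal edge with $|D_x| = |D_y| = a$ it gives $2a$ rather than $\sqrt{2}\,a$, which does not move the locations of the maxima.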
Data Formats and Rates
• Pixel values: 0 to 255 (8 bits)
• Coefficients are 0, ±1 and ±2
  - Partial products: –510 to +510 (10 bits)
  - Dx and Dy: –1020 to +1020 (11 bits)
  - |D|: 0 to 2040 (11 bits)
• Final pixel value: scale back to 8 bits
• Video rate: 30 frames/sec
  - 640 × 480 = 307,200 pixels
  - 307,200 × 30 = 9,216,000 ≈ 10 million pixels/sec
Data Dependencies
• Pixels can be computed independently
• For each pixel:
System Architecture
• Data dependencies suggest a pipeline
• Coefficient multiplies are simple shift/negate, so merge with adder stage
Memory Bandwidth
• Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
  - Memory is 32 bits wide, byte addressable
  - Bandwidth = 50M operations/sec
• Camera produces 10 Mpixels/sec
  - Accelerator needs to process at this rate
  - (8 reads + 1 write) × 10 Mpixels/sec = 90M operations/sec
  - Greater than memory bandwidth
Memory Bandwidth
• Read 4 pixels at once from each of previous, current, and next rows
  - Store in accelerator to compute multiple derivative image pixels
• Produce derivative pixels row-by-row, left-to-right
  - Read 3 × 32-bit words for every 4th derivative pixel computed
  - Write 4 pixels at a time
• (3 reads + 1 write) / 4 × 10 Mpixels/sec = 10M operations/sec = 20% of available memory bandwidth
Sobel Accelerator Architecture
Accelerator Sequence
• Steady state
  - Write 4 result pixels
  - Read 4 pixels for previous, current, next rows
  - Compute for 4 cycles
  - Repeat…
• Start of row
  - Omit writes until pipeline full
• End of row
  - Omit reads to drain pipeline
Memory Operation Timing
• Steady state
Pixel Datapath
// Computation datapath signals
reg [31:0] prev_row, curr_row, next_row;
reg [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg [7:0] abs_D;
reg [31:0] result_row;
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if (prev_row_load) prev_row <= dat_i;
  else if (shift_en) prev_row[31:8] <= prev_row[23:0];
... // Current row register
... // Next row register
function [10:0] abs (input signed [10:0] x);
  abs = x >= 0 ? x : -x;
endfunction
...
Pixel Datapath
always @(posedge clk_i) // Computation pipeline
  if (shift_en) begin
    // D is a blocking intermediate: uses Dx and Dy registered on the previous edge
    D = abs(Dx) + abs(Dy);
    abs_D <= D[10:3]; // scale the 11-bit sum back to 8 bits
    Dx <= - $signed({3'b000, O[-1][-1]})
          + $signed({3'b000, O[-1][+1]})
          - ($signed({3'b000, O[ 0][-1]}) << 1)
          + ($signed({3'b000, O[ 0][+1]}) << 1)
          - $signed({3'b000, O[+1][-1]})
          + $signed({3'b000, O[+1][+1]});
    Dy <=   $signed({3'b000, O[-1][-1]})
          + ($signed({3'b000, O[-1][ 0]}) << 1)
          + $signed({3'b000, O[-1][+1]})
          - $signed({3'b000, O[+1][-1]})
          - ($signed({3'b000, O[+1][ 0]}) << 1)
          - $signed({3'b000, O[+1][+1]});
    ...
Pixel Datapath
    O[-1][-1] <= O[-1][ 0];
    O[-1][ 0] <= O[-1][+1];
    O[-1][+1] <= prev_row[31:24];
    O[ 0][-1] <= O[ 0][ 0];
    O[ 0][ 0] <= O[ 0][+1];
    O[ 0][+1] <= curr_row[31:24];
    O[+1][-1] <= O[+1][ 0];
    O[+1][ 0] <= O[+1][+1];
    O[+1][+1] <= next_row[31:24];
  end
always @(posedge clk_i) // Result row register
  if (shift_en) result_row <= {result_row[23:0], abs_D};
Address Generation
• Given an image in memory at base address B
  - Address for pixel in row r, column c is B + r × 640 + c
• Base address (B) is fixed
• Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
• Use word-aligned addresses
  - Two least-significant bits always 00
  - Increment word address by 1
Address Generation
Address Generation
always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base <= dat_i[21:2];
always @(posedge clk_i) // O address offset counter
  if (offset_reset) O_offset <= 0;
  else if (O_offset_cnt_en) O_offset <= O_offset + 1;
always @(posedge clk_i) // D base address register
  if (D_base_ce) D_base <= dat_i[21:2];
always @(posedge clk_i) // D address offset counter
  if (offset_reset) D_offset <= 0;
  else if (D_offset_cnt_en) D_offset <= D_offset + 1;
...
Address Generation
assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;  // one row further on: 640 bytes = 160 words
assign O_next_addr = O_prev_addr + 1280/4; // two rows further on: 320 words
assign D_addr = D_base + D_offset;
assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                                     D_addr;
assign adr_o[1:0] = 2'b00; // word-aligned byte address
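A short worked example may help tie the word addresses back to byte addresses. It uses the register values written by the processor bus functional model in the verification testbench (008000 hex for the original image, 053280 hex for the derivative image) and assumes both offset counters are still zero at the start of the image.

```latex
% Word address = byte address / 4 (bits 21:2); byte address = word address * 4
\begin{align*}
  O_{\text{base}} &= \mathtt{0x008000} / 4 = \mathtt{0x2000} \\
  O_{\text{prev}} &= \mathtt{0x2000} + 0 = \mathtt{0x2000}
      && \to \text{byte address } \mathtt{0x008000} \\
  O_{\text{curr}} &= \mathtt{0x2000} + 160 = \mathtt{0x20A0}
      && \to \text{byte address } \mathtt{0x008280} \\
  O_{\text{next}} &= \mathtt{0x2000} + 320 = \mathtt{0x2140}
      && \to \text{byte address } \mathtt{0x008500} \\
  D_{\text{addr}} &= \mathtt{0x053280} / 4 + 0 = \mathtt{0x14CA0}
      && \to \text{byte address } \mathtt{0x053280}
\end{align*}
```

The three read addresses are exactly one row (640 bytes, 280 hex) apart, and the first write lands one row into the derivative image because its first row is a border row.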
Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears interrupt.
Slave Bus Interface
assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;
always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];
always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    // Reading the status register clears done, and with it the interrupt
    done <= 1'b0;
assign int_req = int_en && done;
...
Slave Bus Interface
always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;
// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i)
    if (adr_i == 2'b00)
      dat_o = {31'b0, done}; // status register read
    else
      dat_o = 32'b0; // other registers read as 0
  else
    dat_o = result_row; // for master write
Control Sequencing
• Use a finite-state machine
• Counters keep track of rows (0 to 477) and columns (0 to 159)
• See textbook for details of FSM output functions
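The row and column counters mentioned above might look something like the sketch below. This is not the textbook's sequencer: the module name, the control inputs (start, col_cnt_en), and the flag outputs are hypothetical. Only the counting ranges (160 groups of 4 pixels per row, 478 computed rows) come from the slides.

```verilog
// Hypothetical row/column counters for the control sequencer.
module row_col_counters (
  input  wire       clk,
  input  wire       start,       // begin a new image (assumed control signal)
  input  wire       col_cnt_en,  // advance by one 4-pixel group (assumed control signal)
  output reg  [7:0] col,         // 0 to 159
  output reg  [8:0] row,         // 0 to 477
  output wire       last_col,
  output wire       last_row
);
  assign last_col = (col == 8'd159);
  assign last_row = (row == 9'd477);

  always @(posedge clk)
    if (start) begin
      col <= 8'd0;
      row <= 9'd0;
    end
    else if (col_cnt_en) begin
      if (last_col) begin
        col <= 8'd0;                       // wrap to the start of the next row
        if (!last_row) row <= row + 1'b1;  // stay on the final row when done
      end
      else
        col <= col + 1'b1;
    end
endmodule
```

A finite-state machine could use last_col to switch between the steady-state and end-of-row sequences, and last_col together with last_row to detect the end of the image.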
State Transition Diagram
Accelerator Verification
• Simulation-based verification of each section of the accelerator
  - Slave bus operations
  - Computation sequencing
  - Master bus operations
  - Address generation
  - Pixel computation
• Testbench including the accelerator
  - Bus functional processor model
  - Simplified memory and bus arbiter models
Sobel Verification Testbench
[Figure: testbench block diagram with the processor BFM, Sobel accelerator, memory model, and arbiter connected through the multiplexed bus (muxes and connections)]
Processor Bus Functional Model
initial begin // Processor bus-functional model
  cpu_adr_o <= 23'h000000;
  cpu_sel_o <= 4'b0000;
  cpu_dat_o <= 32'h00000000;
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  @(negedge rst);
  @(posedge clk);
  // Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
  // Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
  // Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
  // Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
  // End of write operations
  ...
Processor Bus Functional Model
  cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
      cpu_sel_o <= 4'b1111;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
      if (cpu_dat_i[0]) disable loop;
    end
  end
end
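The bus_write task called by the processor model is not shown on the slides. The sketch below is one guess at what it could look like, modeled on the status-register read sequence above; it is not the textbook's task. The surrounding module, the stand-in slave that acknowledges every access, and the example address are all hypothetical, added only so the sketch can be simulated on its own.

```verilog
`timescale 1ns/1ns
// Hypothetical stand-alone testbench illustrating a possible bus_write task.
module bus_write_sketch;
  reg         clk = 1'b0;
  reg  [22:0] cpu_adr_o;
  reg  [3:0]  cpu_sel_o;
  reg  [31:0] cpu_dat_o;
  reg         cpu_cyc_o, cpu_stb_o, cpu_we_o;
  reg         cpu_ack_i = 1'b0;

  always #5 clk = ~clk;   // 100 MHz clock

  // Stand-in slave (assumption): acknowledges each access for one cycle
  always @(posedge clk)
    cpu_ack_i <= cpu_cyc_o && cpu_stb_o && !cpu_ack_i;

  // Drive one write cycle and wait for it to be acknowledged
  task bus_write (input [22:0] adr, input [31:0] dat);
    begin
      cpu_adr_o <= adr;
      cpu_sel_o <= 4'b1111;
      cpu_dat_o <= dat;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b1;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
    end
  endtask

  initial begin
    cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
    @(posedge clk);
    bus_write(23'h000008, 32'h00008000);  // example address, chosen arbitrarily
    $display("write acknowledged at time %0t", $time);
    $finish;
  end
endmodule
```

Packaging the handshake in a task keeps the sequence of register writes in the processor model readable, with one call per register.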
Memory Bus Functional Model
always begin // Memory bus-functional model
  mem_ack_o <= 1'b0;
  mem_dat_o <= 32'h00000000;
  @(posedge clk);
  while (!(bus_cyc && mem_stb_i)) @(posedge clk);
  if (!bus_we)
    mem_dat_o <= 32'h00000000; // in place of read data
  mem_ack_o <= 1'b1;
  @(posedge clk);
end
Bus Arbiter
• Uses sobel_cyc_o and cpu_cyc_o as request inputs
• If both request at the same time, give accelerator priority
• Mealy FSM
Bus Arbiter
always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else arbiter_current_state <= arbiter_next_state;
always @* // Arbiter logic (combinational, so blocking assignments are used)
  case (arbiter_current_state)
    sobel: if (sobel_cyc_o) begin
             sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
           end
           else if (!sobel_cyc_o && cpu_cyc_o) begin
             sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
           end
           else begin
             sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
           end
    cpu:   if (cpu_cyc_o) begin
             sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
           end else if (sobel_cyc_o && !cpu_cyc_o) begin
             sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
           end else begin
             sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
           end
  endcase
Simulation Results
• See waveforms in textbook
  - Demonstrates sequencing and address generation
• But what about…
  - Data values computed correctly
  - Interactions between processor and accelerator
• Need to use more sophisticated verification techniques
  - Due to complexity of the design
Summary
• Accelerators boost performance using parallel hardware
  - Replication, pipelining, …
• Amdahl’s Law
  - Best payback from accelerating a kernel
• DMA avoids processor overhead
• Verification requires advanced techniques
