Digital Design Chapter 9 - Accelerating Performance with Parallel Hardware
1. Digital Design:
An Embedded Systems
Approach Using Verilog
Chapter 9
Accelerators
Portions of this work are from the book, Digital Design: An Embedded
Systems Approach Using Verilog, by Peter J. Ashenden, published by Morgan
Kaufmann Publishers, Copyright 2007 Elsevier Inc. All rights reserved.
2. Verilog
Digital Design — Chapter 9 — Accelerators 2
Performance and Parallelism
A processor core performs steps in sequence
Performance limited by the instruction rate
Accelerating performance
Perform steps in parallel
Takes less time overall to complete an operation
Instruction-level parallelism
Within a processor core
Pipelining, multiple-issue
Accelerators
Custom hardware for parallel operations
3. Verilog
Achievable Parallelism
How many steps can be performed at
once?
Regularly structured data
Independent processing steps
Examples
Video and image pixel processing
Audio or sensor signal processing
Constrained by data dependencies
Operations that depend on results of
previous steps
4. Verilog
Algorithm Kernels
Algorithm: specification of the required
processing steps
Often expressed in a programming
language
Kernel: the part that involves the most
intensive, repetitive processing
“10% of operations take 90% of the time”
Accelerating a kernel with parallel
hardware gives the best payback
5. Verilog
Amdahl’s Law
Time for an algorithm is t
Fraction f is spent on a kernel
$t = (1 - f)t + ft$
Accelerator speeds up kernel by a factor s
$t' = (1 - f)t + \frac{ft}{s}$
Overall speedup factor s'
$s' = \frac{t}{t'} = \frac{1}{(1 - f) + \frac{f}{s}}$
For large f, s' ≈ s
For small f, s' ≈ 1
6. Verilog
Amdahl’s Law Example
An algorithm with two kernels
Kernel 1: 80% of time, can be sped up 10 times
Kernel 2: 15% of time, can be sped up 100 times
Which speedup gives best overall improvement?
For kernel 1:
$s' = \frac{1}{(1 - 0.8) + \frac{0.8}{10}} = \frac{1}{0.2 + 0.08} = 3.57$
For kernel 2:
$s' = \frac{1}{(1 - 0.15) + \frac{0.15}{100}} = \frac{1}{0.85 + 0.0015} = 1.17$
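One way to see why kernel 1 gives the better result despite its smaller speedup factor is to let s grow without bound in s' = 1 / ((1 − f) + f/s). The (1 − f) term then caps the overall speedup. The short derivation below uses only the fractions from the example; it is a supplementary check, not part of the original slides.

```latex
\[
  \lim_{s \to \infty} \frac{1}{(1 - f) + \frac{f}{s}} = \frac{1}{1 - f}
\]
\[
  \text{Kernel 1: } \frac{1}{1 - 0.8} = 5
  \qquad
  \text{Kernel 2: } \frac{1}{1 - 0.15} \approx 1.18
\]
```

So even an arbitrarily fast accelerator for kernel 2 could not reach the 3.57 obtained by speeding up kernel 1 ten times.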
7. Verilog
Parallel Architectures
An architecture for an accelerator
specifies
Processing blocks
Data flow between them
Parallelism through replication
Multiple identical blocks operating on
different data elements
Works well when elements can be
processed independently
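As a rough illustration of replication, the sketch below uses a Verilog generate loop to create four identical lanes, each working on its own 8-bit element of a packed 32-bit word. The module name `replicated_add`, the lane count, and the choice of addition as the per-element operation are assumptions made for this example only.

```verilog
// Hypothetical example: four identical lanes, each processing one
// 8-bit element independently of the others.
module replicated_add
  ( input  [31:0] a,    // four packed 8-bit elements
    input  [31:0] b,    // four packed 8-bit elements
    output [31:0] y );  // element-wise sums (each wraps modulo 256)

  genvar k;
  generate
    for (k = 0; k < 4; k = k + 1) begin : lane
      // One copy of the processing block per data element
      assign y[8*k +: 8] = a[8*k +: 8] + b[8*k +: 8];
    end
  endgenerate

endmodule
```

Because no lane uses another lane's result, all four sums are produced at the same time.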
8. Verilog
Parallel Architectures
Parallelism through pipelining
Break a computation into steps, perform them in
assembly-line fashion
Latency (time to complete a single operation) is
not reduced
Throughput (rate of completion of operations) is
increased
Ideally by a factor equal to the number of pipeline stages
[Figure: data in → step 1 → step 2 → step 3 → data out]
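The assembly-line idea can be sketched as a chain of registers, one per step. In the minimal example below the module name `pipeline3`, the 8-bit width, and the operation performed in each step are placeholders chosen for illustration; only the register structure is the point.

```verilog
// Hypothetical three-stage pipeline. The operation in each step is a
// placeholder; only the register structure matters here.
module pipeline3
  ( input            clk,
    input      [7:0] data_in,
    output reg [7:0] data_out );

  reg [7:0] s1, s2;   // pipeline registers between steps

  always @(posedge clk) begin
    s1       <= data_in + 8'd1;   // step 1
    s2       <= s1 << 1;          // step 2
    data_out <= ~s2;              // step 3
  end

endmodule
```

A new `data_in` value can be accepted on every clock edge, while each individual result appears on `data_out` three clock edges after its input was sampled.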
9. Verilog
Direct Memory Access (DMA)
Input/output data for accelerators
must be transferred at high speed
Using the processor would be too slow
Direct memory access
I/O controller and accelerator transfer data
to and from memory autonomously
Program supplies starting address and
length
10. Verilog
Bus Arbitration
Bus masters take turns to use bus to access
slaves
Controlled by a bus arbiter
Arbitration policies
Priority, round-robin, …
[Figure: processor, accelerator, and controller each exchange request and grant signals with the arbiter and share the memory bus to memory]
11. Verilog
Block-Processing Accelerator
Data arranged in regular groups of
contiguous memory locations
Accelerator works block by block
E.g., images in blocks of 8 × 8 × 16-bit
pixels
Datapath comprises
Memory access: address generation,
counters
Computation section
Control section: finite-state machine(s)
12. Verilog
Stream-Processing Accelerator
Streams of data from an input source
E.g., high-speed sensors
Digital signal processing (DSP)
Analog sensor signal converted to stream
of digital sample values
Filtering, gain/attenuation, frequency-
domain conversion (Fourier transform)
13. Verilog
Processor/Accelerator Interface
Embedded software controls an
accelerator
Providing control parameters
Synchronizing operations
Input/output registers and interrupts
Interact with the control sequencer
14. Verilog
Case Study: Edge Detection
Illustration of accelerator design
Edge detection in video processing
Identify where image intensity changes abruptly
Typically at the boundary of objects
First step in identifying objects in a scene
Application areas
Video surveillance, computer vision, …
For this case study
Monochrome images of 640 × 480 × 8-bit pixels
Stored row-by-row in memory
Pixel values: 0 (black) – 255 (white)
15. Verilog
Sobel Edge Detection
Compute derivatives of intensity in x
and y directions
Look for minima and maxima (where
intensity changes most rapidly)
16. Verilog
The Sobel Algorithm
Use convolution to approximate partial
derivatives Dx and Dy at each position
Weighted sum of value of a pixel and its eight
nearest neighbors
Coefficients represented using a 3×3 convolution
mask
Sobel masks for x and y derivatives
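$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$
$G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$
$D_x(i, j) = O(i, j) * G_x$
$D_y(i, j) = O(i, j) * G_y$
where O is the original image and * denotes convolution with the mask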
17. Verilog
The Sobel Algorithm
Combine partial derivatives
$|D| = \sqrt{D_x^2 + D_y^2}$
Since we just want maxima and minima
in magnitude, approximate as:
$|D| = |D_x| + |D_y|$
Pixels on the image border don’t have eight neighbors
Skip computation of |D| for border pixels
Just set them to 0 using software
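To get a feel for how far the sum of absolute values can stray from the true magnitude, the standard inequality below bounds one by the other. This is a supplementary note with an arbitrarily chosen numeric example, not taken from the original slides.

```latex
\[
  \sqrt{D_x^2 + D_y^2} \;\le\; |D_x| + |D_y| \;\le\; \sqrt{2}\,\sqrt{D_x^2 + D_y^2}
\]
\[
  \text{Example: } D_x = 3,\; D_y = 4:\quad
  \sqrt{3^2 + 4^2} = 5, \qquad |3| + |4| = 7
\]
```

The approximation never underestimates the magnitude and overestimates it by at most a factor of about 1.41, which is acceptable when the goal is only to locate where the magnitude is large.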
18. Verilog
The Algorithm in Pseudocode
for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end
19. Verilog
Data Formats and Rates
Pixel values: 0 to 255 (8 bits)
Coefficients are 0, ±1 and ±2
Partial products: –510 to +510 (10 bits)
Dx and Dy: –1020 to +1020 (11 bits)
|D|: 0 to 2040 (11 bits)
Final pixel value: scale back to 8 bits
Video rate: 30 frames/sec
640 × 480 = 307,200 pixels
307,200 × 30 ≈ 10 million pixels/sec
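The bit widths listed above follow from the largest pixel value (255) and the mask coefficients. The working below spells this out, assuming two's-complement representation for the signed quantities; the symbol g for a single coefficient is introduced here for illustration.

```latex
\[
  |O \cdot g| \le 255 \times 2 = 510 < 2^{9} = 512
  \;\Rightarrow\; \text{10-bit signed}
\]
\[
  |D_x|,\, |D_y| \le 255 \times (1 + 2 + 1) = 1020 < 2^{10} = 1024
  \;\Rightarrow\; \text{11-bit signed}
\]
\[
  |D| = |D_x| + |D_y| \le 2 \times 1020 = 2040 < 2^{11} = 2048
  \;\Rightarrow\; \text{11-bit unsigned}
\]
```

The bound on each derivative is reached when the three pixels under the positive coefficients are all 255 and the three under the negative coefficients are all 0, or the reverse.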
20. Verilog
Data Dependencies
Pixels can be computed independently
For each pixel:
21. Verilog
System Architecture
Data dependencies suggest a pipeline
Coefficient multiplies are simple shift/negate, so
merge with adder stage
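A minimal sketch of how the coefficient multiplies can fold into the adders: multiplying by 2 is a left shift, and negative coefficients turn into a subtraction. The module name `sobel_pixel`, the port names, and the purely combinational structure are assumptions for illustration (the design described here is pipelined), and dividing by 8 is only one possible way to scale back to 8 bits.

```verilog
// Hypothetical combinational datapath for one derivative pixel.
// pRC is the original pixel at row R, column C of the 3x3 neighborhood
// (row 0 = previous row, column 0 = left column).
module sobel_pixel
  ( input  [7:0]  p00, p01, p02,   // previous row
    input  [7:0]  p10,      p12,   // current row (center coefficient is 0)
    input  [7:0]  p20, p21, p22,   // next row
    output [10:0] abs_D,           // |Dx| + |Dy|, 0 to 2040
    output [7:0]  pixel_out );     // assumed scaling: divide by 8

  // Operands are zero-extended to 11 bits by the assignment context,
  // so the subtraction yields an 11-bit two's-complement result.
  wire signed [10:0] Dx = (p02 + (p12 << 1) + p22) - (p00 + (p10 << 1) + p20);
  wire signed [10:0] Dy = (p00 + (p01 << 1) + p02) - (p20 + (p21 << 1) + p22);

  // Absolute values: negate when the sign bit is set
  wire [10:0] abs_Dx = Dx[10] ? -Dx : Dx;
  wire [10:0] abs_Dy = Dy[10] ? -Dy : Dy;

  assign abs_D     = abs_Dx + abs_Dy;
  assign pixel_out = abs_D[10:3];   // 2040 / 8 = 255

endmodule
```

No multiplier is needed anywhere in the datapath, which is why the multiply stage can be merged into the adder stage.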
22. Verilog
Memory Bandwidth
Assume memory read/write takes 20ns
(2 cycles of 100MHz clock)
Memory is 32 bits wide, byte addressable
Bandwidth = 50M operations/sec
Camera produces 10Mpixels/sec
Accelerator needs to process at this rate
(8 reads + 1 write) × 10Mpixel/sec
= 90M operations/sec
Greater than memory bandwidth
23. Verilog
Memory Bandwidth
Read 4 pixels at once from each of previous,
current, and next rows
Store in accelerator to compute multiple derivative
image pixels
Produce derivative pixels row-by-row, left-to-
right
Read 3 × 32-bit words for every 4th derivative
pixel computed
Write 4 pixels at a time
(3 reads + 1 write) / 4 × 10Mpixel/sec
= 10M operations/sec
= 20% of available memory bandwidth
25. Verilog
Accelerator Sequence
Steady state
Write 4 result pixels
Read 4 pixels for previous,
current, next rows
Compute for 4 cycles
Repeat…
Start of row
Omit writes until pipeline
full
End of row
Omit reads to drain
pipeline
30. Verilog
Address Generation
Given an image in memory at base
address B
Address for pixel in row r, column c is
B + r × 640 + c
Base address (B) is fixed
Offset (r × 640 + c) increments by 4 for
each group of 4 pixels read/written
Use word-aligned addresses
Two least-significant bits always 00
Increment word address by 1
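The word-aligned addressing scheme can be sketched as a counter that stores only the upper address bits and appends the two constant zero bits on output. The module name `addr_gen`, the `load` and `next` controls, and the 32-bit default address width are hypothetical choices for this example.

```verilog
// Hypothetical address generator: holds a word address and steps it by 1
// for each group of 4 pixels (one 32-bit word).
module addr_gen #(parameter AW = 32)   // address width is an assumption
  ( input           clk,
    input           load,   // load base address at the start of an image
    input           next,   // advance to the next group of 4 pixels
    input  [AW-1:0] base,   // byte address B (assumed word aligned)
    output [AW-1:0] addr ); // byte address presented to memory

  reg [AW-3:0] word_addr;   // byte address with the two 00 bits dropped

  always @(posedge clk)
    if (load)
      word_addr <= base[AW-1:2];
    else if (next)
      word_addr <= word_addr + 1'b1;

  // Two least-significant bits are always 00
  assign addr = {word_addr, 2'b00};

endmodule
```

Incrementing the word address by 1 is equivalent to adding 4 to the byte offset, but needs a narrower incrementer.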
34. Verilog
Control/Status Registers
Register   Offset   Read/Write   Purpose
Int_en     0        Write-only   Interrupt enable (bit 0).
Start      4        Write-only   Write causes image processing to start (value ignored).
O_base     8        Write-only   Original image base address.
D_base     12       Write-only   Derivative image base address + 640.
Status     0        Read-only    Processing done (bit 0). Reading clears interrupt.
35. Verilog
Slave Bus Interface
assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;

always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];

always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    // A read of the status register clears done (and the interrupt)
    done <= 1'b0;

assign int_req = int_en && done;
...
36. Verilog
Slave Bus Interface
always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;

// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i)
    if (adr_i == 2'b00)
      dat_o = {31'b0, done}; // status register read
    else
      dat_o = 32'b0;         // other registers read as 0
  else
    dat_o = result_row;      // for master write
37. Verilog
Control Sequencing
Use a finite-state machine
Counters keep track of rows (0 to 477) and
columns (0 to 159)
See textbook for details of FSM output
functions
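The row and column counters might look like the sketch below. It assumes one `advance` pulse per group of 4 pixels, so 160 column counts cover a 640-pixel row and 478 row counts cover the rows that have neighbors above and below. The module and signal names are hypothetical, and the FSM that would drive `advance` is not shown.

```verilog
// Hypothetical row/column counters for the control sequencer.
module row_col_counters
  ( input            clk,
    input            rst,
    input            advance,   // asserted once per group of 4 pixels
    output reg [8:0] row,       // 0 to 477
    output reg [7:0] col,       // 0 to 159
    output           last );    // high at the final group of the image

  assign last = (row == 9'd477) && (col == 8'd159);

  always @(posedge clk)
    if (rst) begin
      row <= 9'd0;
      col <= 8'd0;
    end
    else if (advance) begin
      if (col == 8'd159) begin
        // End of row: wrap the column and step (or wrap) the row
        col <= 8'd0;
        row <= (row == 9'd477) ? 9'd0 : row + 1'b1;
      end
      else
        col <= col + 1'b1;
    end

endmodule
```

An FSM could use `col == 159` to decide when to drain the pipeline at the end of a row and `last` to raise the done flag.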
39. Verilog
Accelerator Verification
Simulation-based verification of each section
of the accelerator
Slave bus operations
Computation sequencing
Master bus operations
Address generation
Pixel computation
Testbench including the accelerator
Bus functional processor model
Simplified memory and bus arbiter models
40. Verilog
Sobel Verification Testbench
Processor
BFM
Sobel
Accelerator
Memory
Model
Arbiter
Multiplexed Bus: Muxes and Connections
41. Verilog
Processor Bus Functional Model
initial begin // Processor bus-functional model
  cpu_adr_o <= 23'h000000;
  cpu_sel_o <= 4'b0000;
  cpu_dat_o <= 32'h00000000;
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  @(negedge rst);
  @(posedge clk);
  // Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
  // Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
  // Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
  // Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
  // End of write operations
  ...
42. Verilog
Processor Bus Functional Model
  cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
      cpu_sel_o <= 4'b1111;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
      if (cpu_dat_i[0]) disable loop;
    end
  end
end
43. Verilog
Memory Bus Functional Model
always begin // Memory bus-functional model
  mem_ack_o <= 1'b0;
  mem_dat_o <= 32'h00000000;
  @(posedge clk);
  while (!(bus_cyc && mem_stb_i)) @(posedge clk);
  if (!bus_we)
    mem_dat_o <= 32'h00000000; // in place of read data
  mem_ack_o <= 1'b1;
  @(posedge clk);
end
44. Verilog
Bus Arbiter
Uses sobel_cyc_o and cpu_cyc_o
as request inputs
If both request at the same time, give
accelerator priority
Mealy FSM
45. Verilog
Bus Arbiter
always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else     arbiter_current_state <= arbiter_next_state;

always @* // Arbiter logic (combinational, so blocking assignments are used)
  case (arbiter_current_state)
    sobel:
      if (sobel_cyc_o) begin
        sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
      else if (!sobel_cyc_o && cpu_cyc_o) begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
      end
      else begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
    cpu:
      if (cpu_cyc_o) begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
      end
      else if (sobel_cyc_o && !cpu_cyc_o) begin
        sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
      else begin
        sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
      end
  endcase
46. Verilog
Simulation Results
See waveforms in textbook
Demonstrates sequencing and address
generation
But what about…
Data values computed correctly
Interactions between processor and
accelerator
Need to use more sophisticated
verification techniques
Due to complexity of the design
47. Verilog
Summary
Accelerators boost performance using
parallel hardware
Replication, pipelining, …
Amdahl’s Law
Best payback from accelerating a kernel
DMA avoids processor overhead
Verification requires advanced
techniques