We propose a parametrized memory template for applications with parallel 'for' loops. The template's parameters reflect important trade-offs made during system design. The template is incorporated in our high level synthesis (HLS) compiler, where the template's parameters are adjusted to the application. The template fits parallel 'for' loops with no loop dependencies and sequential bodies. We found two alternative template implementations using our compiler. In the future, we will develop templates for other types of 'for' loops. These will be added to the compiler, which will then identify the template that works best for the application it is compiling. Once a template is selected, the compiler will use design space exploration to select the best combination of template parameters for the targeted hardware and application.
A parallel 'for' loop memory template for a high level synthesis compiler
1. A parallel for loop memory template for a high level synthesis compiler
Craig Moore
Wim Meeus, Harald Devos, and Dirk Stroobandt
Euromicro Conference on Digital System Design
Lille, France
02/09/2010
2. Outline
● High Level Synthesis
● Hardware Development
● External Memory
● Burst memory transfers
● Parallel For Loops
● Memory Template Overview
● Small Example
● Future Work
● Conclusions
30/06/2010 Craig Moore, DSD 02/09/2010 2
6. Memory Templates as Tools
● HDL programmers:
  ● Have a toolkit of memory designs
  ● Use the right tool for the job
  ● Manually adapt their designs
● HLS compilers should:
  ● Have a toolkit of templates
  ● Adapt the template to the app
  ● Evaluate each template
  ● Suggest the best template
7. Basic Steps for any Algorithm
1) Read values from memory
2) Process each value
3) Store output in memory
for (int i = start; i < end; i++)
{
    b[i] = func(a[i]);
}
9. External Memory for FPGAs
● A bottleneck
● Sequential in nature
● Number of values returned each cycle depends on bus width
● Each memory request requires a handshake
10. Adapting to the Bottleneck
● Stream values from memory
● Pre-fetch values
● Read/write more than one value each clock cycle
● Store values locally to mask latency
● Reduce number of requests
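The adaptations above can be sketched as a small software model. This is only an illustration under stated assumptions, not the template's hardware: external memory is modelled as a plain array, `memcpy` stands in for one burst transfer, and `BURST_WORDS`, `process_in_bursts` and the body of `func` are hypothetical names and values chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BURST_WORDS 8  /* assumed burst length in words; not from the slides */

/* Placeholder loop body standing in for func() on slide 7. */
static int func(int x) { return 2 * x + 1; }

/* Software model: each memcpy represents one burst (one request, many
 * values). Returns the number of memory requests issued. */
static size_t process_in_bursts(const int *a, int *b, size_t n)
{
    int in_buf[BURST_WORDS];   /* local input buffer, masks read latency */
    int out_buf[BURST_WORDS];  /* local output buffer, collects a write burst */
    size_t requests = 0;

    assert(a != NULL && b != NULL);
    for (size_t base = 0; base < n; base += BURST_WORDS) {
        size_t len = (n - base < BURST_WORDS) ? n - base : BURST_WORDS;

        memcpy(in_buf, a + base, len * sizeof a[0]);   /* read burst */
        requests++;
        for (size_t i = 0; i < len; i++)               /* independent iterations */
            out_buf[i] = func(in_buf[i]);
        memcpy(b + base, out_buf, len * sizeof b[0]);  /* write burst */
        requests++;
    }
    return requests;
}
```

With 20 values and a burst length of 8, this model issues 6 requests (3 read bursts and 3 write bursts) instead of the 40 that one request per value would need.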
23. Parallel for Loop
● Each iteration is run in parallel
● No loop dependencies
  ● Loop transformations to remove them
Example with Dependencies
for i = 1 to 4
{
    a(i) = a(i) + 1
    b(i) = a(i - 1) + a(i + 1)
}
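To make the dependency concrete, the sketch below writes the slide's loop in C next to one possible dependency-free rewrite. The rewrite (snapshot the input array, then account for the known increment of the left neighbour) is an assumption chosen for illustration; the slides do not say which loop transformations the compiler applies. `FIRST`, `LAST`, `LEN` and both function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define FIRST 1
#define LAST  4
#define LEN   (LAST + 2)  /* indices 0..LAST+1 are touched */

/* Loop from the slide: iteration i reads a[i-1], which iteration i-1 has
 * already incremented, so the iterations cannot run in parallel. */
static void with_dependencies(int a[LEN], int b[LEN])
{
    assert(a != NULL && b != NULL);
    for (int i = FIRST; i <= LAST; i++) {
        a[i] = a[i] + 1;
        b[i] = a[i - 1] + a[i + 1];
    }
}

/* Rewritten loop: every iteration reads only the snapshot a_old, so no
 * iteration depends on the result of another. */
static void without_dependencies(int a[LEN], int b[LEN])
{
    int a_old[LEN];

    assert(a != NULL && b != NULL);
    memcpy(a_old, a, sizeof a_old);
    for (int i = FIRST; i <= LAST; i++) {
        /* a[i-1] was incremented by the previous iteration, except
         * a[FIRST-1], which lies outside the loop range. */
        int left = a_old[i - 1] + (i > FIRST ? 1 : 0);
        a[i] = a_old[i] + 1;
        b[i] = left + a_old[i + 1];
    }
}
```

Both functions produce the same `a` and `b` for the same input; only the second has iterations that are independent of each other.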
25. Template Overview
Requests read bursts and controls execution of data paths, waits for output buffer if it is full
29. Manual Design
Controls access to memory, grants permission based on request (output buffer priority)
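The arbitration rule in this annotation can be sketched as a fixed-priority choice. This is a minimal model that assumes only two requesters (a read side and the output buffer); the type `grant_t` and the function `arbitrate` are hypothetical names, not taken from the design.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { GRANT_NONE, GRANT_READ, GRANT_WRITE } grant_t;

/* Fixed-priority arbitration: a pending output-buffer (write) request wins
 * over a pending read request. */
static grant_t arbitrate(bool read_req, bool write_req)
{
    grant_t g = GRANT_NONE;

    if (write_req)
        g = GRANT_WRITE;
    else if (read_req)
        g = GRANT_READ;

    /* A read grant is only issued for an active read request. */
    assert(g != GRANT_READ || read_req);
    return g;
}
```

Giving the output buffer priority lets it drain first, which fits the note elsewhere in the slides that execution waits when the output buffer is full.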
30. Manual Design
Controls access to memory, grants permission based on request (output buffer priority)
Non-pipelined loop bodies executing in parallel.
Requests read bursts and controls execution of data paths, waits for output buffer if it is full
With enough values, performs write bursts.
Starts and stops execution
31. Byte-Enable Signal
● Multiple values for each memory transaction
● Tells which bytes to replace and preserve
[Figure legend, slides 31-35: Ignore, Enable]
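A byte-enable mask can be modelled in a few lines. The sketch below assumes a 64-bit bus (`BUS_BYTES` = 8) and one mask bit per byte of the bus; those values and the names `byte_enable` and `masked_write` are illustrative assumptions, not details from the slides.

```c
#include <assert.h>
#include <stdint.h>

#define BUS_BYTES 8  /* assumed 64-bit memory bus; not from the slides */

/* Build a mask with one bit per byte of the bus: bit k set means byte k is
 * replaced, bit k clear means byte k in memory is preserved. */
static uint8_t byte_enable(unsigned first_word, unsigned num_words,
                           unsigned word_bytes)
{
    unsigned first_byte = first_word * word_bytes;
    unsigned num_bytes = num_words * word_bytes;
    uint8_t mask = 0;

    assert(word_bytes > 0 && first_byte + num_bytes <= BUS_BYTES);
    for (unsigned k = first_byte; k < first_byte + num_bytes; k++)
        mask |= (uint8_t)(1u << k);
    return mask;
}

/* Model of one bus-wide write transaction that honours the mask. */
static void masked_write(uint8_t mem[BUS_BYTES],
                         const uint8_t data[BUS_BYTES], uint8_t enable)
{
    for (unsigned k = 0; k < BUS_BYTES; k++)
        if (enable & (1u << k))
            mem[k] = data[k];
}
```

For example, writing two 16-bit words starting at word position 1 enables bytes 2 to 5 (mask `0x3C`) and leaves bytes 0, 1, 6 and 7 untouched.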
37. Parametrized Template
Parameters
● Memory Bus Width = M
● Word Width = W
● Max Words = A = M / W
● Input FIFOs = X = Cx * A
● Iterations = Output FIFOs = N = CN * X
● Burst Length
● Output FIFO Length
● Iteration Length
● Input FIFO Length
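The relations between the parameters can be checked with a short calculation. The sketch assumes that W divides M and that Cx and CN are positive integer multipliers; the slides give the formulas but not these constraints, and the struct and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    unsigned m;   /* memory bus width in bits (M) */
    unsigned w;   /* word width in bits (W) */
    unsigned cx;  /* multiplier Cx, assumed to be a positive integer */
    unsigned cn;  /* multiplier CN, assumed to be a positive integer */
} template_params;

typedef struct {
    unsigned a;  /* max words per bus transfer, A = M / W */
    unsigned x;  /* number of input FIFOs, X = Cx * A */
    unsigned n;  /* iterations = output FIFOs, N = CN * X */
} derived_params;

static derived_params derive(template_params p)
{
    derived_params d;

    assert(p.w > 0 && p.m % p.w == 0);  /* assumed: W divides M */
    assert(p.cx > 0 && p.cn > 0);
    d.a = p.m / p.w;
    d.x = p.cx * d.a;
    d.n = p.cn * d.x;
    return d;
}
```

For instance, M = 64, W = 16, Cx = 2 and CN = 2 give A = 4, X = 8 and N = 16.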
44. Example – Reading Values
45. Example – Processing Values
46. Example – Writing Values
[Figure legend, slides 44-46: values in memory, values to be read, byte enabled, byte disabled, values processed]
47. Future Work
● More templates for other parallel for loops
  ● Pipelined loop body
  ● Data reuse
● Compiler identifies parallel for loop
  ● No keywords
  ● Check for loop dependencies, and do loop transformations if required
● Compiler suggests best memory template
  ● Chosen based on performance estimate
● Design space exploration using templates
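One simple form of design space exploration is an exhaustive search over candidate parameter values, ranked by a performance estimate. The sketch below shows only that search loop. It is not the planned compiler feature: the `config` fields, the `explore` function and especially `toy_estimate` are hypothetical, and the toy estimate is an arbitrary placeholder rather than a hardware model.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned cx, cn, burst_len; } config;

/* Caller-supplied performance estimate; lower is better. */
typedef double (*estimator)(config c);

/* Evaluate every combination of the candidate values and return the
 * configuration with the lowest estimate. */
static config explore(const unsigned *cx, size_t n_cx,
                      const unsigned *cn, size_t n_cn,
                      const unsigned *bl, size_t n_bl, estimator est)
{
    assert(n_cx > 0 && n_cn > 0 && n_bl > 0 && est != NULL);
    config best = { cx[0], cn[0], bl[0] };
    double best_cost = est(best);

    for (size_t i = 0; i < n_cx; i++)
        for (size_t j = 0; j < n_cn; j++)
            for (size_t k = 0; k < n_bl; k++) {
                config c = { cx[i], cn[j], bl[k] };
                double cost = est(c);
                if (cost < best_cost) { best = c; best_cost = cost; }
            }
    return best;
}

/* Placeholder estimate used only to exercise explore(): squared distance
 * from an arbitrary target point. It does NOT model real hardware. */
static double toy_estimate(config c)
{
    double dx = (double)c.cx - 2.0;
    double dn = (double)c.cn - 1.0;
    double db = (double)c.burst_len - 16.0;
    return dx * dx + dn * dn + db * db;
}
```

Because a template instance is generated quickly, an exhaustive loop like this is affordable for small candidate sets; a real flow would replace `toy_estimate` with an area or throughput estimate for the target device.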
48. Conclusions
● HLS tools don't create memory designs
● Manual memory designs can take days/weeks/months to complete
● Parametrized memory template designs are generated in seconds
● Easy to perform design space exploration using different parameter values and/or templates