Motivation Intro Memory Network Communication Measurements Parallella
Parallella: Embedded HPC For Everybody
Jacob Erlbeck
Sysmocom s.f.m.c. GmbH
Berlin
Software conference for parallel programming, concurrency and multicore systems, Karlsruhe, May 5-7, 2014
© Jacob Erlbeck, 2014, Para//el 2014, Karlsruhe, May 5-7
The Parallella
The Parallella (2)
It’s cool!
Credit card size
Co-processors of multiple boards can be linked
Inexpensive
Software and design files are Open Source (github)
GCC / GDB / GNU tool chain
What did I want to know?
Suitable for ...
Audio processing?
Software defined radio?
Stream analysis?
Real performance values
How much of the peak performance rates do I get?
How does it compare to other platforms (Dual Cortex A9)?
What else?
Is the system easy or difficult to use or understand?
Are there helpful libraries or frameworks?
Which tools are available?
Example
The example problem
Input matrix I, convolved with the 3 × 3 stencil

    d n d
    n c n
    d n d

O(i,j) = d·I(i-1,j-1) + n·I(i,j-1) + d·I(i+1,j-1)
       + n·I(i-1,j)   + c·I(i,j)   + n·I(i+1,j)
       + d·I(i-1,j+1) + n·I(i,j+1) + d·I(i+1,j+1)

with 1 = c + 4d + 4n
Apply a 3 × 3 stencil filter to a 1000 × 1000 matrix
That is 998 × 998 × 9 multiplications and additions
On 16 cores at 600 MHz with fused multiply-add, the FPUs should need about 0.9 ms
The problem is described in Brown Deer Technology's STDCL documentation for the Parallella, see
www.browndeertechnology.com/docs/app_note_programming_parallella_using_stdcl.pdf
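This back-of-the-envelope estimate can be checked with a few lines of C (a sketch using the figures from the slides: 16 cores at 600 MHz, one fused multiply-add per core per cycle):

```c
#include <assert.h>

/* FMA count for the 998 x 998 interior points, 9 taps each */
long stencil_fma_ops(void)
{
    return 998L * 998L * 9L; /* 8,964,036 operations */
}

/* Ideal FPU time in milliseconds, one fused multiply-add per core per cycle */
double ideal_time_ms(long ops, int cores, double clock_hz)
{
    return 1000.0 * ops / (cores * clock_hz);
}
```

For 16 cores at 600 MHz this gives about 0.93 ms, the quoted 0.9 ms.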
Software
Programming frameworks
Preinstalled
Epiphany specific libraries
e-lib Target library, access to registers and hardware units, context information, utilities
e-hal Host library, access to the co-processor, loading and starting kernels
newlib Port of libc/libm that runs on the co-processor
Generic frameworks
libcoprthr POSIX-like threading abstraction for co-processors
OpenCL Compiler and libraries
STDCL Simplified layer on top of the above (host side)
Preinstalled (2)
Tools
GNU Tools GNU suite of binutils and compilers: e-gcc/e-g++, e-nm, e-objdump, ...
e-server Remote GDB debugging proxy for Epiphany cores
e-run Single core emulator, supports tracing & profiling
e-gdb GDB for the Epiphany, remote and emulation
e-tools Load programs, read/write core data, reset cores
OS
Linaro Linaro 14.01 / Ubuntu ’Saucy’ 13.10
The tools and libraries can be built and used on standard computers, e.g. for cross-compiling and emulation
Implementation
Example Implementation (STDCL)
Host Part
stencil2d_host.c (snippet, similar to the STDCL app note)
int w=1000; int h=1000;
float d=0.01/8; float n=d; float c=0.99;
size_t size = sizeof(float)*w*h;
float* in = clmalloc(stdacc, size, 0);
float* out = clmalloc(stdacc, size, 0);
// initialize ndr, in, out, ctx
clmsync(ctx, 0, in, ...);
clmsync(ctx, 0, out, ...);
clexec(ctx, 0, &ndr, stencil2d_kern,
in, out, w, h, c, n, d);
clmsync(ctx, 0, out, ...);
clwait(ctx, 0, CL_ALL_EVENT);
Example Implementation (OpenCL)
Co-processor Kernel
stencil2d_kern.cl (snippet, similar to the STDCL app note)
void stencil2d_kern(
float* in, float* out, int w, int h,
float c, float n, float d)
{ // initialize x1, x2, y1, y2 based on core id
for (int y = y1; y < y2; y++)
for (int x = x1; x < x2; x++) {
int k = x+y*w;
out[k] =
d*in[k-w-1] + n*in[k-w] + d*in[k-w+1] +
n*in[k-1] + c*in[k] + n*in[k+1] +
d*in[k+w-1] + n*in[k+w] + d*in[k+w+1];
}
}
Testing
Trying it out
Let’s see ...
linaro@linaro-nano: stencil/stdcl/
$ clcc -k -o kernel.o -c stencil2d_kern.cl
$ gcc -o stencil2d.x stencil2d_host.c kernel.o
$ sudo ./stencil2d.x
time used 1.184 s
$ ./stencil2d.x -stdcpu
time used 0.281 s
$
Oops!
99.9 % of the time is spent on something other than floating-point ops!
Using the ARM CPUs is 4 times faster than the Epiphany!
What to do
What is happening here?
Questions
What is 99.9 % of the time being spent on?
How can we fix it?
Next steps
Do measurements
Look at the board’s architecture
Try to improve the test program accordingly
Iterate ...
Measurements (1)
Setup vs. computation
Modifications
Measure setup and computation time separately
Original kernel (times in ms)
              Host   Epiphany   TE/TH
Set up        252    388         150 %
Computation    32    773        2410 %
Parallella Architecture
Parallella Architecture Overview
Epiphany Architecture
Epiphany Architecture Overview
MIMD Architecture
On-chip 2D mesh network
One shared 4 GiB address space (except for the first 1 MiB)
16 and 64 core versions available in silicon
256, 1024, and 4095 core versions offered as IP
Multiple devices can be linked together via 4 eLinks
Single Epiphany Node
Overview
Components
eCore processor
32kiB SRAM memory
Mesh network interface
2 DMA controllers
2 event counters
Data busses 64 bit wide
Address busses 32 bit wide
Network bus 104 bit wide
Processor (eCore)
Processor features
RISC architecture
Load/Store of 8, 16, 32, and 64 bit words
64 general purpose 32 bit registers
ALU/FPU: 32 bit only
No SIMD instructions
All registers are also memory mapped
Instruction pipeline (5 - 8 stages)
RISC: ALU/FPU only operate on registers, memory access is only done via load/store instructions
Pipeline stalls until all register dependencies are fulfilled
Local RAM
RAM features
32 kiB SRAM
Organized in 4 × 8 kiB banks that can be accessed in parallel
Used for code and data
Access in 1 clock cycle
External memory can be used for code and data
No cache for external memory
Memory Model
Memory address ranges
Node local        0x00000    – 0xFFFFF       1 MiB
  Local SRAM      0x00000    – 0x07FFF      32 kiB
  Local registers 0xF0000    – 0xFFFFF      64 kiB
External DRAM     0x8E000000 – 0x8FFFFFFF   32 MiB
Map local to global addresses: | row, 6 bit | col, 6 bit | local address, 20 bit |

Set the user interrupt on core (33, 11) → core id 0x84B:
*(unsigned *)((0x84B << 20) | 0xF042C) = 0x20;

Read external DRAM at offset 0x1234:
val = *(unsigned *)(0x8E000000 + 0x1234);
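The address arithmetic can be written out directly; `global_addr` below is a hypothetical helper for illustration, not an e-lib function:

```c
#include <assert.h>
#include <stdint.h>

/* Compose a global address from the 6-bit row, 6-bit column and
 * 20-bit local offset */
uint32_t global_addr(unsigned row, unsigned col, uint32_t local)
{
    uint32_t coreid = (row << 6) | col;        /* (33,11) -> 0x84B */
    return (coreid << 20) | (local & 0xFFFFF); /* core id selects the 1 MiB page */
}
```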
Default memory layout of the Parallella as seen from each node
Row   Col 8                Col 9                Col 10               Col 11
32    Core (0,0)           Core (0,1)           Core (0,2)           Core (0,3)
      0x80800000           0x80900000           0x80A00000           0x80B00000
33    Core (1,0)           Core (1,1)           Core (1,2)           Core (1,3)
      0x84800000           0x84900000           0x84A00000           0x84B00000
34    Core (2,0)           Core (2,1)           Core (2,2)           Core (2,3)
      0x88800000           0x88900000           0x88A00000           0x88B00000
35    Core (3,0)           Core (3,1)           Core (3,2)           Core (3,3)
      0x8C800000           0x8C900000           0x8CA00000           0x8CB00000

Row 0, column 0 is the node's own local memory, aliased at global address 0x00000000.
External DRAM is mapped at 0x8E000000, east of the core columns.
Rows above 32 reach external cores via the NORTH link, rows below 35 via the SOUTH link.
Example Revisited
External memory accesses
void stencil2d_kern(
float* in, float* out, int w, int h,
float c, float n, float d)
{
for (int y = y1; y < y2; y++)
for (int x = x1; x < x2; x++) {
int k = x+y*w;
out[k] =
d*in[k-w-1] + n*in[k-w] + d*in[k-w+1] +
n*in[k-1] + c*in[k] + n*in[k+1] +
d*in[k+w-1] + n*in[k+w] + d*in[k+w+1];
}
}
Implementing caching
Modifications
Use local SRAM to cache 3 rows at a time; this reduces external float reads from 9 to 1 per output value
Use register variables
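The caching idea can be sketched as plain host C (hypothetical code, not the actual kernel; on the Epiphany the three row buffers would sit in local SRAM and each core would process its own stripe):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stencil with three cached input rows: each input value is fetched
 * from 'external' memory exactly once instead of nine times. */
void stencil2d_rowcache(const float *in, float *out, int w, int h,
                        float c, float n, float d)
{
    float *rows[3];
    for (int i = 0; i < 3; i++) {
        rows[i] = malloc(sizeof(float) * w);
        memcpy(rows[i], in + i * w, sizeof(float) * w); /* preload rows 0..2 */
    }
    for (int y = 1; y < h - 1; y++) {
        const float *r0 = rows[(y - 1) % 3]; /* row above */
        const float *r1 = rows[y % 3];       /* current row */
        const float *r2 = rows[(y + 1) % 3]; /* row below */
        for (int x = 1; x < w - 1; x++)
            out[x + y * w] =
                d*r0[x-1] + n*r0[x] + d*r0[x+1] +
                n*r1[x-1] + c*r1[x] + n*r1[x+1] +
                d*r2[x-1] + n*r2[x] + d*r2[x+1];
        if (y + 2 < h) /* recycle the oldest buffer for the next input row */
            memcpy(rows[(y + 2) % 3], in + (y + 2) * w, sizeof(float) * w);
    }
    for (int i = 0; i < 3; i++)
        free(rows[i]);
}
```

Because 1 = c + 4d + 4n, the stencil reproduces linear input data exactly, which makes a convenient correctness check.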
Measurements (times in ms)
              Host   Epiphany   TE/TH    TE/THmin
No caching     32    773        2410 %   2410 %
Caching        78    152         194 %    475 %
+ Registers    65    129         198 %    402 %
Access to the external memory is the bottleneck
Is there still room for improvements?
Architecture
On-Chip Network (eMesh)
Properties
Separate Networks
cMesh Write data on-chip (fast, async)
rMesh Read data (slow, high latency)
xMesh Write data from/to external devices (DRAM, async)
Transaction per clock cycle: 64 bit data
External transaction per clock cycle: 8 bit (16 bit peak) data
DRAM accesses do not disturb on-chip write transactions
Back-pressure (push-back) on congestion
Read-after-write can return the old value
Transactions
Messages
Write Indication to write data atomically (includes data and destination address)
Read Request to create a write transaction (includes source and destination address)
Testset Request for an atomic TESTSET (includes data, source and destination address)
Data size is 8, 16, 32, or 64 bit
Data is read/written atomically
Messages include control bits (routing mode, interrupt, end-of-block)
Routing
Standard routing algorithm
1 If the column does not match: route horizontally
2 If the row does not match: route vertically
3 Both match: route to the attached core
Other routing methods can be selected at the sending core
External memory is accessed as if it were cores
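A sketch of the per-node routing decision (hypothetical model, not hardware code; rows grow to the south and columns to the east, as in the memory map above):

```c
#include <assert.h>

enum dir { CORE, EAST, WEST, NORTH, SOUTH };

/* Standard eMesh routing at one node: fix the column first (horizontal),
 * then the row (vertical), then deliver to the attached core. */
enum dir route(int row, int col, int dst_row, int dst_col)
{
    if (col != dst_col)
        return dst_col > col ? EAST : WEST;
    if (row != dst_row)
        return dst_row > row ? SOUTH : NORTH;
    return CORE;
}
```

External DRAM at core id 0x8E0 (row 35, column 32) is accordingly reached via the EAST link.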
Routing examples
Using 64 bit accesses
Modifications
Read and write 2 floats (32 bit each) at a time
Measurements (times in ms)
              Host   Epiphany   TE/TH    TE/THmin
No caching     32    773        2410 %   2410 %
Caching        78    152         194 %    475 %
+ Registers    65    129         198 %    402 %
+ 64 bit       50     76         152 %    237 %
Using 64bit transactions is more efficient
Is there still room for improvements?
Read transactions
[Message sequence chart 'External load': the eCore issues a read_req that travels via the eLink and the FPGA (FIFO/AXI) to the external DRAM; the memory read is returned as a write_ind to the eCore.]
’load’ operations stall the eCore until the data has arrived
This adds latency
The read requests share bandwidth with write indications
This can reduce (write) throughput
DMA controller
DMA controller features
2 independent DMA controllers
Configurations can be chained per controller
Basically implements ... (without chaining)
do_dma(*dst, dinc[2], *src, sinc[2], count[2])
{
    for (o = count[0]; o > 0; o--) {
        for (i = count[1]; ; i--) {
            *(item_t *)dst = *(item_t *)src;
            if (i == 1)
                break;
            dst += dinc[1]; src += sinc[1];
        }
        dst += dinc[0]; src += sinc[0];
    }
}
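A runnable host-side model of the same 2D loop (`model_dma` is a hypothetical helper mirroring the pseudocode, not the e-lib DMA API; increments are in bytes and are applied after each item, respectively after each row):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t item_t;

/* 2D strided copy: count[1] items per row, count[0] rows */
void model_dma(item_t *dst, const ptrdiff_t dinc[2],
               const item_t *src, const ptrdiff_t sinc[2], const int count[2])
{
    for (int o = count[0]; o > 0; o--) {          /* outer: rows */
        for (int i = count[1]; ; i--) {           /* inner: items per row */
            *dst = *src;
            if (i == 1)
                break;
            dst = (item_t *)((char *)dst + dinc[1]);
            src = (const item_t *)((const char *)src + sinc[1]);
        }
        dst = (item_t *)((char *)dst + dinc[0]);  /* jump to the next row */
        src = (const item_t *)((const char *)src + sinc[0]);
    }
}
```

With an inner source increment of one item and an outer increment that skips the remainder of the row, this gathers a sub-matrix into a packed buffer.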
DMA Transactions
[Message sequence chart 'Using the DMA': the eCore starts the DMA (dma_start) and keeps running; the DMA controller loops, issuing read_req messages through the eLink/FPGA to the external DRAM and forwarding the data read from memory as write_ind messages, until the buffer is exhausted (buf_ex).]
Using the DMA controller
Modifications
Port the example to the e-lib
Use the DMA to read/write rows asynchronously
linaro@linaro-nano: stencil/elib/
$ gcc -Wall -o stencil.o -c stencil.c
$ gcc -Wall -o stencil_host stencil.o -le-hal
$ e-gcc -Wall -O3 -ffast-math -c kern.c -o kern.o
$ e-gcc -T fast.ldf kern.o -o kern.elf -le-lib
$ e-objcopy --srec-forceS3 --output-target srec kern.elf kern.srec
$ sudo ./stencil_host -K kern -R100
time used 5.130 s (0.219 s + 100 * 0.049 s)
$
Using the DMA controller (2)
Measurements (times in ms)
              Host   Epiphany   TE/TH    TE/THmin
No caching     32    773        2410 %   2410 %
Caching        78    152         194 %    475 %
+ Registers    65    129         198 %    402 %
+ 64 bit       50     76         152 %    237 %
Set-up         —     220        —        —
+ DMA          —      49        —        153 %
Using the DMA avoids stalling the eCore
Is there still room for improvements?
External Memory
Read/Write Throughput From/To External Memory
By number of cores
[Plot: throughput (MB/s) over 4-16 cores, ranging roughly from 50 to 250 MB/s; curves for Write 4096 ('slow' DMA), Write 256, Read 4096, and Read + Write.]
Reduce transaction rate
Modifications
Use the DMA 'slow' mode (use the outer loop, which increases the transaction interval)
Measurements (times in ms)
                 Host   Epiphany   TE/THmin   FPU rate
DMA 'fast'       —       49        153 %      2 %
DMA 'slow'       —       44        137 %      2 %
DMA write-only   —       26        (81 %)     3 %
Further improvements?
Measurement overview (times in ms)
             Time (ms)   Clk/float   FPU rate
Set up       220 – 388   —           —
No caching   773         7240        —
Caching      152         1459        —
Registers    129         1238        —
64 bit        50          480        —
DMA 'fast'    49          470        2 %
DMA 'slow'    44          422        2 %
Local mem      2.2         21        38 %
The stencil problem is too 'small' per float (only 5 % FPU usage)
2.4 ms × 38 % FPU instruction rate ≈ the estimated 0.9 ms
More Measurements
Measuring Throughput
Overview (16 cores, 600 MHz clock)
                MB/s     clock/float   Peak    Remarks
Read DRAM       139      275           46 %    sync, slow
Write DRAM      152      252           50 %    sync, slow
R/W DRAM        2 × 86   441           57 %    sync
Read (0,3)      519      74            —       sync, slow
Write (0,3)     4802     8             100 %   no sync, loop
Read next       5952     6.5           —       no sync
Write next      21796    1.7           28 %    no sync
Stencil DRAM    2 × 91   422           60 %
Stencil local   1774     21            43 %
The reached throughput of 91 MB/s matches the measured maximum R/W rate
Raw Read/Write Throughput From/To External Memory
Under continuous overload, throughput differs
The rows don’t affect each other
Results
Learned From Working With the Example
Set-up time Prefer long-running kernels (≫ 200 ms)
DRAM Avoid accessing the external memory twice for the same data, cache locally instead
Write only Avoid reading remote data (latency, throughput)
DMA read Use the DMA asynchronously to read data
FPU Be computation intensive per external data value
Compiler Using register variables and optimization options (-O3 -ffast-math) yields good results
Applications
Audio processing
≈ 200 channels of 24 bit at 96 kHz sampling rate (in & out)
Less if external DRAM is needed for delay lines
Video processing
≈ 1 stream of 720p HD, 16 bit/pixel at 46 fps (in & out) / at 80 fps (out)
Stream analysis
≈ 4 GBit/s data stream (in only)
Less if external DRAM is needed to store data
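These figures can be sanity-checked against the measured DRAM rates (about 139 MB/s read and 152 MB/s write; the helper functions are hypothetical, MB = 10^6 bytes):

```c
#include <assert.h>

/* Audio stream bandwidth in MB/s, one direction */
double audio_mb_s(int channels, int bytes_per_sample, double rate_hz)
{
    return channels * bytes_per_sample * rate_hz / 1e6;
}

/* Uncompressed video bandwidth in MB/s, one direction */
double video_mb_s(int w, int h, int bytes_per_pixel, double fps)
{
    return (double)w * h * bytes_per_pixel * fps / 1e6;
}
```

200 channels of 24 bit at 96 kHz come to about 57.6 MB/s each way; 720p at 16 bit/pixel is about 85 MB/s each way at 46 fps, or about 147 MB/s write-only at 80 fps, i.e. just under the measured limits.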
SW-Architecture Considerations
Throughput is limited
Can be an issue with star topologies
Starvation on network overload
Does not compromise throughput
Can be an issue with work-stealing scheduling
Barriers can help to ensure fairness
Reading is slower than writing
Throughput is highly asymmetric (by design)
Can be an issue with shared memory synchronization and reference counting
Application To Other Architectures
Write vs. Read
Where reading is also slower
Main memory accesses in current mainstream CPUs
NUMA interconnects
Access to IO devices, e.g. via PCIe
Access to remote data in clusters
Considerations
Prefer sending copies over sharing local data
Use asynchronous messaging
Closing Measurements
Finally
Thank you
Many thanks to
Sylvain Munaut for lending me a board
Links
Parallella http://www.parallella.org
Epiphany Docs http://www.adapteva.com/all-documents/
http://www.adapteva.com/analyst-reports/
Specifications http://www.parallella.org/board/
STDCL/Coprthr http://www.browndeertechnology.com/resources.htm
STDCL App Note http://www.browndeertechnology.com/docs/app_note_programming_parallella_using_stdcl.pdf
MPR article http://www.adapteva.com/wp-content/uploads/2011/06/adapteva_mpr.pdf
Contact
Jacob Erlbeck, jacob.erlbeck@gmail.com
Copyright
© Jacob Erlbeck, 2014. Please contact the author if you wish to redistribute the work as a whole or in parts.
Write Throughput To External Memory
By number of cores and size
[3D plot: throughput (MB/s, roughly 100-250) over 4-16 cores and chunk sizes of 1,000-4,000 B.]
Throughput By Transfer Method
Overview (16 cores, 600 MHz clock)
              eLib   Loop   DMA     DMA     DMA     DMA          Peak
                                    slow    sync    slow, sync   (spec.)
Read DRAM     48     94     131     139     128     136          300/1200
Write DRAM    77     152    143     152     140     149          300/1200
Read (0,3)    270    536    471     501     485     519          —
Write (0,3)   2405   4802   4528    4757    3787    3508         4800
Read col 3    1069   2134   1873    1997    1796    1899         —
Write col 3   7779   7651   16522   9807    9624    7894         19200
Read next     1750   2834   5952    5414    5076    4616         —
Write next    8749   7651   21769   9807    14571   7895         (76800)
Read self     2275   3485   6530    5439    5602    4501         —
Write self    8785   7651   21758   9807    14628   7895         (76800)
According to the errata lists in the Epiphany III/IV data sheets (E16G301 and E64G401), the peak node→eMesh rate is currently limited.
Raw Read/Write Throughput From/To Columns 3 (East)
Raw Read/Write Throughput From/To Core (0, 3)
Measurement Summary
On-Chip Mesh Behaviour
Reading is much slower than writing
Overload can lead to core starvation
Overall throughput is maintained on overload
Cores can receive a constant data stream at peak rate
Cores can only send significantly below the peak rate (errata)
Alexandre Moneger
 
Dataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and toolsDataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and tools
Stefano Salsano
 

Similar to Parallella: Embedded HPC For Everybody (20)

Toward an Open and Unified Model for Heterogeneous and Accelerated Multicore ...
Toward an Open and Unified Model for Heterogeneous and Accelerated Multicore ...Toward an Open and Unified Model for Heterogeneous and Accelerated Multicore ...
Toward an Open and Unified Model for Heterogeneous and Accelerated Multicore ...
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBM
 
Brief Introduction to Parallella
Brief Introduction to ParallellaBrief Introduction to Parallella
Brief Introduction to Parallella
 
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOGDOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG
 
Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/Invoke
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
OSDC 2017 - Werner Fischer - Open power for the data center
OSDC 2017 - Werner Fischer - Open power for the data centerOSDC 2017 - Werner Fischer - Open power for the data center
OSDC 2017 - Werner Fischer - Open power for the data center
 
OSDC 2017 | Linux Performance Profiling and Monitoring by Werner Fischer
OSDC 2017 | Linux Performance Profiling and Monitoring by Werner FischerOSDC 2017 | Linux Performance Profiling and Monitoring by Werner Fischer
OSDC 2017 | Linux Performance Profiling and Monitoring by Werner Fischer
 
OSDC 2017 | Open POWER for the data center by Werner Fischer
OSDC 2017 | Open POWER for the data center by Werner FischerOSDC 2017 | Open POWER for the data center by Werner Fischer
OSDC 2017 | Open POWER for the data center by Werner Fischer
 
Disaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFDisaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoF
 
A Quick Introduction to Programmable Logic
A Quick Introduction to Programmable LogicA Quick Introduction to Programmable Logic
A Quick Introduction to Programmable Logic
 
Lec 10-linux-review
Lec 10-linux-reviewLec 10-linux-review
Lec 10-linux-review
 
BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!
 
Digital design with Systemc
Digital design with SystemcDigital design with Systemc
Digital design with Systemc
 
Microkernel-based operating system development
Microkernel-based operating system developmentMicrokernel-based operating system development
Microkernel-based operating system development
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
07 - Bypassing ASLR, or why X^W matters
07 - Bypassing ASLR, or why X^W matters07 - Bypassing ASLR, or why X^W matters
07 - Bypassing ASLR, or why X^W matters
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wall
 
05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR matters05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR matters
 
Dataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and toolsDataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and tools
 

Recently uploaded

Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
ScyllaDB
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptxAI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
Sunil Jagani
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
Mydbops
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfLee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
leebarnesutopia
 

Recently uploaded (20)

Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptxAI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfLee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
 

Parallella: Embedded HPC For Everybody

  • 1. Parallella: Embedded HPC For Everybody
    Jacob Erlbeck, Sysmocom s.f.m.c. GmbH, Berlin
    Softwarekonferenz für Parallel Programming, Concurrency und Multicore-Systeme, Karlsruhe, 5.-7. Mai 2014
    Sections: Motivation · Intro · Memory · Network · Communication · Measurements · Parallella
    © Jacob Erlbeck, 2014, Para//el 2014, Karlsruhe, 5.-7. Mai
  • 2. The Parallella
    (Board photo)
  • 3. The Parallella (2)
    It's cool!
    - Credit card size
    - Co-processors of multiple boards can be linked
    - Inexpensive
    - Software and design files are Open Source (github)
    - GCC / GDB / GNU tool chain
  • 4. What did I want to know?
    Suitable for ...
    - Audio processing?
    - Software defined radio?
    - Stream analysis?
    Real performance values
    - How much of the peak performance rates do I get?
    - How does it compare to other platforms (Dual Cortex A9)?
    What else?
    - Is the system easy or difficult to use or understand?
    - Are there helpful libraries or frameworks?
    - Which tools are available?
  • 5. Example: The example problem
    Apply a 3 × 3 stencil filter to a 1000 × 1000 input matrix I:

        d n d
        n c n
        d n d

    O(i,j) = d·I(i-1,j-1) + n·I(i,j-1) + d·I(i+1,j-1)
           + n·I(i-1,j)   + c·I(i,j)   + n·I(i+1,j)
           + d·I(i-1,j+1) + n·I(i,j+1) + d·I(i+1,j+1),
    with 1 = c + 4d + 4n.

    - That is 998 × 998 × 9 multiplications and additions
    - On 16 cores at 600 MHz with fused multiply-add, this should take 0.9 ms in the FPUs
    This problem is described in Brown Deer Technology's STDCL documentation for the Parallella, see
    www.browndeertechnology.com/docs/app_note_programming_parallella_using_stdcl.pdf
  • 6. Software / Programming frameworks: Preinstalled
    Epiphany specific libraries
    - e-lib: target library; access to registers and hardware units, context information, utilities
    - e-hal: host library; access to the co-processor, loading and starting kernels
    - newlib: port of libc/libm that runs on the co-processor
    Generic frameworks
    - libcoprthr: POSIX-like threading abstraction for co-processors
    - OpenCL: compiler and libraries
    - STDCL: simplified layer on top of the above (host side)
  • 7. Software / Programming frameworks: Preinstalled (2)
    Tools
    - GNU tools: GNU suite of binutils and compilers: e-gcc/e-g++, e-nm, e-objdump, ...
    - e-server: remote GDB debugging proxy for Epiphany cores
    - e-run: single core emulator, supports tracing & profiling
    - e-gdb: GDB for the Epiphany, remote and emulation
    - e-tools: load programs, read/write core data, reset cores
    OS
    - Linaro 14.01 / Ubuntu 'Saucy' 13.10
    The tools and libraries can be built and used on standard computers, e.g. for cross-compiling and emulation
  • 8. Implementation: Example implementation (STDCL), host part
    stencil2d_host.c (snippet, similar to the STDCL app note):

        int w = 1000;
        int h = 1000;
        float d = 0.01/8;
        float n = d;
        float c = 0.99;
        size_t size = sizeof(float)*w*h;
        float* in = clmalloc(stdacc, size, 0);
        float* out = clmalloc(stdacc, size, 0);
        // initialize ndr, in, out, ctx
        clmsync(ctx, 0, in, ...);
        clmsync(ctx, 0, out, ...);
        clexec(ctx, 0, &ndr, stencil2d_kern, in, out, w, h, c, n, d);
        clmsync(ctx, 0, out, ...);
        clwait(ctx, 0, CL_ALL_EVENT);
  • 9. Implementation: Example implementation (OpenCL), co-processor kernel
    stencil2d_kern.cl (snippet, similar to the STDCL app note):

        void stencil2d_kern(float* in, float* out, int w, int h,
                            float c, float n, float d)
        {
            // initialize x1, x2, y1, y2 based on core id
            for (y = y1; y < y2; y++)
                for (x = x1; x < x2; x++) {
                    int k = x + y*w;
                    out[k] = d*in[k-w-1] + n*in[k-w] + d*in[k-w+1]
                           + n*in[k-1]   + c*in[k]   + n*in[k+1]
                           + d*in[k+w-1] + n*in[k+w] + d*in[k+w+1];
                }
        }
  • 10. Testing: Trying it out
    Let's see ...

        linaro@linaro-nano: stencil/stdcl/
        $ clcc -k -o kernel.o -c stencil2d_kern.cl
        $ gcc -o stencil2d.x stencil2d_host.c kernel.o
        $ sudo ./stencil2d.x
        time used 1.184 s
        $ ./stencil2d.x -stdcpu
        time used 0.281 s
        $

    Oops!
    - 99.9 % of the time is not spent on floating point operations!
    - It is 4 times faster to use the ARM CPUs than the Epiphany!
  • 11. Testing: What to do. What is happening here?
    Questions
    - What is 99.9 % of the time spent on?
    - How can we fix it?
    Next steps
    - Do measurements
    - Look at the board's architecture
    - Try to improve the test program accordingly
    - Iterate ...
  • 12. Testing: Measurements (1), setup vs. computation
    Modifications
    - Measure setup and computation time separately
    Original kernel (times in ms):

                      Host   Epiphany   TE/TH
        Set up         252        388    150 %
        Computation     32        773   2410 %
  • 13. Parallella Architecture: Overview
    (Block diagram)
  • 14. Epiphany Architecture: Overview
    - MIMD architecture
    - On-chip 2D mesh network
    - One shared 4 GiB address space (except for the first 1 MiB)
    - 16 and 64 core versions available in silicon
    - 256, 1024, and 4095 core versions offered as IP
    - Multiple devices can be linked together via 4 eLinks
  • 15. Epiphany Architecture: Single Epiphany node, overview
    Components
    - eCore processor
    - 32 kiB SRAM memory
    - Mesh network interface
    - 2 DMA controllers
    - 2 event counters
    Bus widths
    - Data buses: 64 bit
    - Address buses: 32 bit
    - Network bus: 104 bit
  • 16. Epiphany Architecture: Single Epiphany node, processor (eCore)
    Processor features
    - RISC architecture
    - Load/store of 8, 16, 32, and 64 bit words
    - 64 general purpose 32 bit registers
    - ALU/FPU: 32 bit only
    - No SIMD instructions
    - All registers are also memory mapped
    - Instruction pipeline (5-8 stages)
    RISC: the ALU/FPU only operate on registers; memory access is only done via load/store instructions.
    The pipeline stalls until all register dependencies are fulfilled.
  • 17. Epiphany Architecture: Single Epiphany node, local RAM
    RAM features
    - 32 kiB SRAM
    - Organized in 4 × 8 kiB banks that can be accessed in parallel
    - Used for code and data
    - Access in 1 clock cycle
    - External memory can be used for code and data
    - No cache for external memory
  • 18. Epiphany Architecture: Memory model
    Memory address ranges

        Node local         0x00000    - 0xFFFFF       1 MiB
          Local SRAM       0x00000    - 0x07FFF      32 kiB
          Local registers  0xF0000    - 0xFFFFF      64 kiB
        External DRAM      0x8E000000 - 0x8FFFFFFF   32 MiB

    Mapping local to global addresses: | Row, 6 bit | Col, 6 bit | Local address, 20 bit |

    Set a user interrupt on core (33, 11) → core id 0x84B:

        *(unsigned *)((0x84B << 20) | 0xF042C) = 0x20;

    Read external DRAM at offset 0x1234:

        val = *(unsigned *)(0x8E000000 + 0x1234);
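The bit layout above can be sketched as a small helper. This is illustrative code, not part of e-lib: `global_addr()` is an assumed name, and the mapping (6-bit row, 6-bit column, 20-bit local address) is taken from the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the local-to-global address mapping described above.
 * The 12-bit core id is the 6-bit mesh row followed by the 6-bit
 * column; the low 20 bits carry the node-local address. */
static uint32_t global_addr(uint32_t row, uint32_t col, uint32_t local)
{
    uint32_t core_id = (row << 6) | col;        /* e.g. (33, 11) -> 0x84B */
    return (core_id << 20) | (local & 0xFFFFF); /* keep the 20-bit local part */
}
```

For core (33, 11) this yields base address 0x84B00000, so the register at local offset 0xF042C appears globally at 0x84BF042C, as in the interrupt example above.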
  • 19. Epiphany Architecture: Memory model (2)
    Default memory layout of the Parallella as seen from each node (mesh row/column → global base address):

        Row 0:       local address space, 0x00000000
        Rows 1-31:   external cores via the NORTH link
        Row 32:  Core (0,0) 0x80800000   (0,1) 0x80900000   (0,2) 0x80A00000   (0,3) 0x80B00000
        Row 33:  Core (1,0) 0x84800000   (1,1) 0x84900000   (1,2) 0x84A00000   (1,3) 0x84B00000
        Row 34:  Core (2,0) 0x88800000   (2,1) 0x88900000   (2,2) 0x88A00000   (2,3) 0x88B00000
        Row 35:  Core (3,0) 0x8C800000   (3,1) 0x8C900000   (3,2) 0x8CA00000   (3,3) 0x8CB00000
        Rows 36-63:  external cores via the SOUTH link
        Ext DRAM:    0x8E000000 (east of the core array)
  • 20. Example: Example revisited, external memory accesses
    Every in[] and out[] access in the kernel goes to external memory (there is no cache):

        void stencil2d_kern(float* in, float* out, int w, int h,
                            float c, float n, float d)
        {
            for (y = y1; y < y2; y++)
                for (x = x1; x < x2; x++) {
                    int k = x + y*w;
                    out[k] = d*in[k-w-1] + n*in[k-w] + d*in[k-w+1]
                           + n*in[k-1]   + c*in[k]   + n*in[k+1]
                           + d*in[k+w-1] + n*in[k+w] + d*in[k+w+1];
                }
        }
  • 21. Example: Example revisited, implementing caching
    Modifications
    - Use the local SRAM to cache 3 rows at a time; this reduces external float reads from 9 to 1 per output value
    - Use register variables
    Measurements (times in ms):

                      Host   Epiphany   TE/TH    TE/THmin
        No caching      32        773   2410 %     2410 %
        Caching         78        152    194 %      475 %
        + Registers     65        129    198 %      402 %

    Access to the external memory is the bottleneck. Is there still room for improvement?
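The 3-row caching idea can be sketched in plain C. This is a hedged sketch, not the talk's actual kernel: `W`, `rows[]`, and the function names are illustrative, and on the Epiphany `rows[]` would live in local SRAM while `in`/`out` sit in external DRAM. Each input row is copied once and a 3-row window rotates as `y` advances, so each input float is fetched once instead of nine times.

```c
#include <assert.h>
#include <string.h>

#define W 8  /* toy row width; the talk uses 1000 */

/* Apply the 3x3 stencil to one row, reading from three cached rows. */
static void stencil_row(const float *up, const float *mid, const float *dn,
                        float *out, float c, float n, float d)
{
    for (int x = 1; x < W - 1; x++)
        out[x] = d*up[x-1]  + n*up[x]  + d*up[x+1]
               + n*mid[x-1] + c*mid[x] + n*mid[x+1]
               + d*dn[x-1]  + n*dn[x]  + d*dn[x+1];
}

/* Rotate a 3-row window over the input: each row is copied into the
 * fast local buffer exactly once. */
static void stencil_cached(const float *in, float *out, int h,
                           float c, float n, float d)
{
    float rows[3][W];                         /* stand-in for local SRAM */
    memcpy(rows[0], in,     sizeof rows[0]);
    memcpy(rows[1], in + W, sizeof rows[1]);
    for (int y = 1; y < h - 1; y++) {
        memcpy(rows[(y + 1) % 3], in + (size_t)(y + 1) * W, sizeof rows[0]);
        stencil_row(rows[(y - 1) % 3], rows[y % 3], rows[(y + 1) % 3],
                    out + (size_t)y * W, c, n, d);
    }
}
```

With c + 4n + 4d = 1 and a constant input, every interior output value equals the input value, which makes the filter easy to sanity-check.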
  • 22. Architecture: On-chip network (eMesh), properties
    Separate networks
    - cMesh: write data on-chip (fast, async)
    - rMesh: read data (slow, high latency)
    - xMesh: write data from/to external devices (DRAM, async)
    Properties
    - One transaction per clock cycle: 64 bit data
    - External transactions per clock cycle: 8 bit (16 bit peak) data
    - DRAM accesses do not disturb on-chip write transactions
    - Back-pressure (push-back) on congestion
    - Read-after-write can return the old value
  • 23. Architecture: On-chip network (eMesh), transactions
    Messages
    - Write: indication to write data atomically (includes data and destination address)
    - Read: request to create a write transaction (includes source and destination address)
    - Testset: request for an atomic TESTSET (includes data, source and destination address)
    Properties
    - Data size is 8, 16, 32, or 64 bit
    - Data is read/written atomically
    - Messages include control bits (routing mode, interrupt, end-of-block)
  • 24. Architecture: On-chip network (eMesh), routing
    Standard routing algorithm
    1. If the column does not match: route horizontally
    2. If the row does not match: route vertically
    3. Both match: route to the attached core
    - Other routing methods can be selected at the sending core
    - External memory is accessed as if it were cores
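The three routing rules above amount to a per-hop decision that can be sketched in a few lines. The names and the direction convention (rows growing southward, columns eastward, matching the memory-layout slide) are assumptions for illustration, not taken from the Epiphany documentation.

```c
#include <assert.h>

/* Column-first ("XY") routing decision, as described on the slide:
 * fix the column, then the row, then deliver to the attached core. */
typedef enum { GO_EAST, GO_WEST, GO_SOUTH, GO_NORTH, AT_CORE } hop_t;

static hop_t next_hop(unsigned row, unsigned col,
                      unsigned dst_row, unsigned dst_col)
{
    if (dst_col != col)                          /* 1. column mismatch */
        return dst_col > col ? GO_EAST : GO_WEST;
    if (dst_row != row)                          /* 2. row mismatch */
        return dst_row > row ? GO_SOUTH : GO_NORTH;
    return AT_CORE;                              /* 3. both match */
}
```

Because every router applies the same rule, a message follows one horizontal run and then one vertical run, which keeps the routing deadlock-free.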
  • 25. Architecture: On-chip network (eMesh), routing examples
    (Routing diagrams)
  • 26. Example: Example revisited, using 64 bit accesses
    Modifications
    - Read and write 2 floats (32 bit each) at a time
    Measurements (times in ms):

                      Host   Epiphany   TE/TH    TE/THmin
        No caching      32        773   2410 %     2410 %
        Caching         78        152    194 %      475 %
        + Registers     65        129    198 %      402 %
        + 64 bit        50         76    152 %      237 %

    Using 64 bit transactions is more efficient. Is there still room for improvement?
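The "2 floats at a time" change can be sketched as moving float pairs through 64-bit loads and stores, which the eMesh carries as single 64-bit transactions. This is an illustrative sketch, not the talk's code; `memcpy()` of 8 bytes is used to sidestep alignment and aliasing pitfalls, and a compiler can turn it into doubleword accesses for suitably aligned data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy n_pairs pairs of floats using one 64-bit read and one 64-bit
 * write per pair instead of two 32-bit accesses each. */
static void copy_pairs64(float *dst, const float *src, int n_pairs)
{
    for (int i = 0; i < n_pairs; i++) {
        uint64_t pair;                            /* two packed floats */
        memcpy(&pair, src + 2 * i, sizeof pair);  /* one 64-bit read */
        memcpy(dst + 2 * i, &pair, sizeof pair);  /* one 64-bit write */
    }
}
```

Halving the transaction count matters here because each mesh transaction carries at most 64 bits, so wider accesses use the available per-cycle bandwidth better.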
  • 27. Architecture: Read transactions
    (Message sequence chart: eCore → eLink → FIFO/AXI FPGA → external DRAM; the read_req is forwarded, the memory is read, and a write_ind carries the data back)
    External load
    - 'load' operations stall the eCore until the data has arrived; this adds latency
    - The read requests share bandwidth with write indications; this can reduce (write) throughput
  • 28. Architecture: Single Epiphany node, DMA controller
    DMA controller features
    - 2 independent DMA controllers
    - Configurations can be chained per controller
    Basically implements (without chaining):

        do_dma(*dst, dinc[2], *src, sinc[2], count[2])
        {
            for (o = count[0]; o > 0; o--) {
                for (i = count[1]; ; i--) {
                    *(item_t *)dst = *(item_t *)src;
                    if (i == 1)
                        break;
                    dst += dinc[1];
                    src += sinc[1];
                }
                dst += dinc[0];
                src += sinc[0];
            }
        }
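The pseudocode above can be made runnable as a 2D strided copy with an outer and an inner count/stride pair. This mirrors the chaining-free behaviour sketched on the slide, not the real register-programmed engine; `item_t` is fixed to 8-bit items here for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t item_t;

/* 2D strided copy: count[1] items per inner run with strides
 * dinc[1]/sinc[1], repeated count[0] times with the outer strides
 * dinc[0]/sinc[0] applied between runs. */
static void do_dma(item_t *dst, const ptrdiff_t dinc[2],
                   const item_t *src, const ptrdiff_t sinc[2],
                   const size_t count[2])
{
    for (size_t o = count[0]; o > 0; o--) {
        for (size_t i = count[1]; ; i--) {
            *dst = *src;
            if (i == 1)
                break;
            dst += dinc[1];        /* inner-loop strides */
            src += sinc[1];
        }
        dst += dinc[0];            /* outer-loop strides */
        src += sinc[0];
    }
}
```

With, say, an inner count of 3 and an outer source stride that skips to the next matrix row, this gathers rows of a larger matrix into a contiguous local buffer, which is exactly how the stencil example later feeds its row cache.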
  • 29. Architecture: DMA transactions
    (Message sequence chart: the eCore issues dma_start and keeps running; the DMA engine loops, sending read_req messages toward the external DRAM and turning the returning memory reads into write_ind messages toward the destination buffer)
  • 30. Example: Example revisited, using the DMA controller
    Modifications
    - Port the example to the e-lib
    - Use the DMA to read/write rows asynchronously

        linaro@linaro-nano: stencil/elib/
        $ gcc -Wall -o stencil.o -c stencil.c
        $ gcc -Wall -o stencil stencil.o -le-hal
        $ e-gcc -Wall -O3 -ffast-math -c kern.c -o kern.o
        $ e-gcc -T fast.ldf kern.o -o kern.elf -le-lib
        $ e-objcopy --srec-forceS3 --output-target srec kern.elf kern.srec
        $ sudo ./stencil_host -K kern -R100
        time used 5.130 s (0.219 s + 100 * 0.049 s)
        $
  • 31. Example: Example revisited, using the DMA controller (2)
    Measurements (times in ms):

                      Host   Epiphany   TE/TH    TE/THmin
        No caching      32        773   2410 %     2410 %
        Caching         78        152    194 %      475 %
        + Registers     65        129    198 %      402 %
        + 64 bit        50         76    152 %      237 %
        Set-up           -        220       -          -
        + DMA            -         49       -      153 %

    Using the DMA avoids stalling the eCore. Is there still room for improvement?
  • 32. External Memory: Read/write throughput from/to external memory, by number of cores
    (Plot: throughput in MB/s over 4-16 cores, roughly 50-250 MB/s; curves: Write 4096 ('slow' DMA), Write 256, Read 4096, Read + Write)
  • 33. Example: Example revisited, reducing the transaction rate
    Modifications
    - Use the DMA 'slow' mode (use the outer loop; this increases the transaction interval)
    Measurements (times in ms):

                          Host   Epiphany   TE/THmin   FPU rate
        DMA 'fast'           -         49      153 %        2 %
        DMA 'slow'           -         44      137 %        2 %
        DMA write-only       -         26      (81 %)       3 %
  • 34. Example: Example revisited, further improvements?
    Measurement overview (times in ms):

                      Time (ms)   Clk/float   FPU rate
        Set up        220 - 388           -          -
        No caching          773        7240          -
        Caching             152        1459          -
        Registers           129        1238          -
        64 bit               50         480          -
        DMA 'fast'           49         470        2 %
        DMA 'slow'           44         422        2 %
        Local mem           2.2          21       38 %

    - The stencil problem is too 'small' per float (it uses only 5 %)
    - 2.2 ms × 38 % FPU instruction rate ≈ the estimated 0.9 ms
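The "Clk/float" column appears to be the aggregate clock budget of all 16 cores at 600 MHz divided over the 1000 × 1000 matrix elements; this derivation is an assumption that fits the table, not something stated on the slide.

```c
#include <assert.h>
#include <math.h>

/* Aggregate clocks per matrix element: time * (16 cores * 600 MHz)
 * spread over 1000 * 1000 floats. */
static double clk_per_float(double time_ms)
{
    const double cores = 16.0, clock_hz = 600e6, floats = 1000.0 * 1000.0;
    return time_ms / 1000.0 * cores * clock_hz / floats;
}
```

For example, the "64 bit" row's 50 ms works out to 480 clocks per float, matching the table.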
  • 35. More Measurements: Measuring throughput, overview (16 cores, 600 MHz clock)

                        MB/s   clock/float   Peak    Remarks
        Read DRAM        139           275    46 %   sync, slow
        Write DRAM       152           252    50 %   sync, slow
        R/W DRAM      2 × 86           441    57 %   sync
        Read (0,3)       519            74       -   sync, slow
        Write (0,3)     4802             8   100 %   no sync, loop
        Read next       5952           6.5       -   no sync
        Write next     21796           1.7    28 %   no sync
        Stencil DRAM  2 × 91           422    60 %
        Stencil local   1774            21    43 %

    The reached throughput of 91 MB/s ≈ the measured maximum R/W rate.
• 36. More Measurements: Raw Read/Write Throughput From/To External Memory
On continuous overload, throughput differs.
The rows don't affect each other.
• 37. Results: Learned From Working With the Example
Set-up time: prefer long-running kernels (≫ 200 ms)
DRAM: avoid accessing the external memory twice for the same data; cache locally instead
Write only: avoid reading remote data (latency, throughput)
DMA read: use the DMA asynchronously to read data
FPU: be computation-intensive per external data value
Compiler: using register variables and optimization options (-O3 -ffast-math) yields good results
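The 'DMA read' advice above amounts to double buffering: fetch the next block while computing on the current one. A minimal host-side sketch under my own assumptions (`process_stream`, `BLOCK` are illustrative names; `memcpy` stands in for an asynchronous DMA start, and the comment marks where a real kernel would wait for DMA completion):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 4   /* illustrative block size; real code would use larger blocks */

/* Double-buffered streaming: while block k is processed out of local
 * buffer k&1, block k+1 is already being fetched into the other buffer. */
static void process_stream(const float *src, float *dst, size_t nblocks)
{
    float buf[2][BLOCK];
    memcpy(buf[0], src, sizeof(buf[0]));              /* fetch first block */
    for (size_t k = 0; k < nblocks; k++) {
        if (k + 1 < nblocks)                          /* start 'async' fetch */
            memcpy(buf[(k + 1) & 1], src + (k + 1) * BLOCK, sizeof(buf[0]));
        /* ... a real kernel would wait here for the DMA of block k ... */
        for (int j = 0; j < BLOCK; j++)               /* compute on local copy */
            dst[k * BLOCK + j] = 2.0f * buf[k & 1][j];
    }
}
```

With a real asynchronous DMA the fetch and the compute loop overlap in time, which is exactly what kept the eCore from stalling in the measurements above.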
• 38. Results: Applications
Audio processing: ≈ 200 channels à 24 bit at 96 kHz sampling rate (in & out); less if external DRAM is needed for delay lines
Video processing: ≈ 1 stream 720p HD, 16 bit/pixel at 46 fps (in & out) / at 80 fps (out)
Stream analysis: ≈ 4 GBit/s data stream (in only); less if external DRAM is needed to store data
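The audio figure follows from the measured DRAM rates: 200 channels of 24-bit (3-byte) samples at 96 kHz need about 58 MB/s per direction, which fits under the ≈ 86 MB/s concurrent read/write rate from the throughput overview. A back-of-envelope check (the helper name `audio_mb_per_s` is my own; the rates are from this talk):

```c
#include <assert.h>

/* Required external-memory bandwidth per direction for raw audio:
 * channels * sample rate * bytes per sample, in MB/s (10^6 bytes/s). */
static double audio_mb_per_s(int channels, int rate_hz, int bytes_per_sample)
{
    return (double)channels * rate_hz * bytes_per_sample / 1e6;
}
```

200 × 96000 × 3 bytes/s ≈ 57.6 MB/s per direction, below the measured 86 MB/s each way for concurrent read/write.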
• 39. Results: SW-Architecture Considerations
Throughput is limited: can be an issue with star topologies
Starvation on network overload: does not compromise throughput; can be an issue with work-stealing scheduling; barriers can help to ensure fairness
Reading is slower than writing: throughput is highly asymmetric (by design); can be an issue with shared-memory synchronization and reference counting
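One common way around the read/write asymmetry is write-based signalling: the producer writes both the value and a ready flag into memory that is local to the consumer, so the consumer only ever polls its own fast local memory and no slow remote read crosses the mesh. A single-threaded sketch under stated assumptions (`mailbox`, `producer_send`, `consumer_poll` are illustrative names; on the Epiphany the mailbox would live in the consumer core's SRAM, here a plain global stands in for it):

```c
#include <assert.h>

/* Mailbox placed in the CONSUMER's local memory. */
struct mailbox {
    volatile int ready;   /* flag the consumer polls locally */
    float value;
};

static struct mailbox box;

/* Producer side: a remote WRITE, which is cheap on the mesh. */
static void producer_send(float v)
{
    box.value = v;
    box.ready = 1;        /* flag written last, after the payload */
}

/* Consumer side: a LOCAL read, no mesh round trip. Returns 1 on data. */
static int consumer_poll(float *out)
{
    if (!box.ready)
        return 0;
    *out = box.value;
    box.ready = 0;
    return 1;
}
```

On real hardware the write ordering of payload and flag would additionally need a memory barrier or fence; that detail is omitted in this sketch.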
• 40. Results: Application To Other Architectures
Write vs. read, where reading is also slower:
Main memory accesses in current mainstream CPUs
NUMA interconnects
Access to IO devices, e.g. via PCIe
Access to remote data in clusters
Considerations: prefer sending copies over sharing local data; use asynchronous messaging
• 41. Closing Measurements Finally
Thank you. Many thanks to Sylvain Munaut for lending me a board.

Links:
Parallella: http://www.parallella.org
Epiphany Docs: http://www.adapteva.com/all-documents/ and http://www.adapteva.com/analyst-reports/
Specifications: http://www.parallella.org/board/
STDCL/Coprthr: http://www.browndeertechnology.com/resources.htm
STDCL App Note: http://www.browndeertechnology.com/docs/app_note_programming_parallella_using_stdcl.pdf
MPR article: http://www.adapteva.com/wp-content/uploads/2011/06/adapteva_mpr.pdf
• 42. Contact
Jacob Erlbeck, jacob.erlbeck@gmail.com
Copyright © Jacob Erlbeck, 2014. Please contact the author if you wish to redistribute the work as a whole or in parts.
• 43. Write Throughput To External Memory
[Figure: throughput (MB/s, 100-250) vs. number of cores (4-16) and chunk size (1,000-4,000 B)]
• 44. Throughput By Transfer Method

Overview (16 cores, 600 MHz clock), in MB/s:

               eLib   Loop    DMA    DMA    DMA     DMA        Peak
                                   'slow'  sync  'slow',sync  (spec.)
Read DRAM        48     94    131    139    128    136        300/1200
Write DRAM       77    152    143    152    140    149        300/1200
Read (0,3)      270    536    471    501    485    519               —
Write (0,3)    2405   4802   4528   4757   3787   3508            4800
Read col 3     1069   2134   1873   1997   1796   1899               —
Write col 3    7779   7651  16522   9807   9624   7894           19200
Read next      1750   2834   5952   5414   5076   4616               —
Write next     8749   7651  21769   9807  14571   7895         (76800)
Read self      2275   3485   6530   5439   5602   4501               —
Write self     8785   7651  21758   9807  14628   7895         (76800)

According to the errata lists in the Epiphany III/IV data sheets (E16G301 and E64G401), the peak node→eMesh rate is currently limited.
• 45. Raw Read/Write Throughput From/To Column 3 (East)
[Figure only]
• 46. Raw Read/Write Throughput From/To Core (0,3)
[Figure only]
• 47. Results: Measurement Summary, On-Chip Mesh Behaviour
Reading is much slower than writing.
Overload can lead to core starvation.
Overall throughput is maintained on overload.
Cores can receive a constant data stream at peak rate.
Cores can only send significantly below the peak rate (errata).