5. Challenges
• Hard to code. Fully manual memory allocation and management makes complex coding difficult.
• Hard to debug. The Epiphany doesn't share memory with Linux.
• Temperature. After a week of frustration I realized I needed to put a fan over it.
• Documentation. The SDK and examples are poor and frequently broken, with few beginner examples and a small community of users.
My “thermal management solution”
6. Process Synchronization
• Each core runs a process, not a thread
– Every core can run a different process
– "Workgroups" can be created in the SDK
• Functions exist in OpenCL, COPRTHR, and the eSDK for synchronizing processes
– Mutexes are only provided between cores
– SDK examples tend to busy-wait on single flag bits for synchronization
• MPI and OpenMP are currently not supported for the coprocessor
– Some "community" projects are in the works… not much of a community, though
7. Memory Management
• "Shared" DRAM
– Memory allocated specifically for the Epiphany using e_alloc
– ~160 MB/s (https://parallella.org/forums/viewtopic.php?f=10&t=1978)
• SRAM in each core
– Only 32 kB available
– 4 GB/s (1 GB/s in practice per DMA channel)
– Use the DMA channel functions to transfer memory between cores
– Can't use malloc! Allocations must be tracked manually
– Have to know the addresses on other cores you want to send data to
– Must watch out for both code size and stack growth
Figure: layout of the 32 kB core memory — program image, then matrix buffers (essentially the heap), with the stack growing from the end.
8. Chip Architecture
• 32 kB SRAM per core for program + stack
• ~2 GB/s DMA transfers between cores
• ~150 MB/s to transfer to/from shared DRAM
Figure (graphic from Adapteva): the DMA engine frees up the processor.
9. SUMMA/Blocking Implementation
Figure: the matrix is split into blocks; each core copies its designated sub-block, then SUMMA is executed on the sub-blocks.
Figure: example code copying sub-blocks from shared DRAM (~150 MB/s) to the Epiphany (~2 GB/s core-to-core). Note: the ~1000x1000 matrix size limitation is due to the Parallella Linux shared-memory size.
10. Results
Figure: matrix multiplication execution time (s) vs. matrix side size (0–1000) for a single Epiphany core, 2x2 / 3x3 / 4x4 core grids, naive ARM, and blocked ARM.
11. Epiphany Scaling

| Epiphany Version | Grid Side Size | Time (s) | Speedup vs. Single Core |
|------------------|----------------|----------|-------------------------|
| E16G3            | 1              | 317.2    | 1                       |
| E16G3            | 2              | 80.9     | 3.92                    |
| E16G3            | 3              | 35.43    | 8.95                    |
| E16G3            | 4              | 21.5     | 14.76                   |
| E64G4            | 8              | 7.7      | 41.24                   |
| E256G4           | 16             | 1.98     | 160.02                  |
| E1KG4            | 32             | 0.51     | 620.96                  |
| E4KG4            | 64             | 0.13     | 2409.56                 |

(Grid sizes above 4 are estimates; only the 16-core E16G3 was available for measurement.)
More cores -> Larger Blocks -> Exponentially Less Blocking
Figure: speedup (vs. single core) against grid side size 1–4, measured and estimated; power-law fit y = 1.0083x^1.9562, R² = 0.9995.
12. Conclusions
• Potentially powerful device, especially in embedded AI
applications with large search spaces
– Needs passive cooling
• 32kB SRAM is extremely limiting
– Needs either L2 cache or just some kind of faster near-chip
shared memory
– Really a limitation of the Parallella architecture, not the Epiphany
• Incredibly difficult to code
– SDK & Documentation needs improvement
– Better debugging tools needed ASAP!
Editor's Notes
Epiphany is a co-processor architecture by Adapteva
It’s a matrix of tiny RISC CPUs connected by a communications framework
Unlike other MIMD co-processors (Intel Xeon Phi) everything exists on a single chip
Adapteva generally sells these processors for OEM use
The Parallella board is a dev board for this processor – raised close to a million on Kickstarter
The chip provided with the Parallella is 16 core
Adapteva believes this can scale up to 4096 cores, but the only other one they’re producing is 64
The 16 core is 32 GFLOPS
For comparison, a high end i5 mobile processor is around 40-50 GFLOPS
Need 2 versions of gcc – one for host and one for Epiphany
Host loads executable onto Epiphany and starts it
The Parallella was extremely difficult to develop on
There are some SDKs to facilitate multi-threading
Better off using Adapteva's SDK
The problem with MPI and OpenMP is the limited memory in the core
Very explicit memory management – need to pass address pointer to each function and increment
Can’t use malloc to keep track
Need to start at some offset for the program
Stack grows from the end
Need to be very careful about balancing stack vs. heap space
Also need to set some pointers explicitly for DMA transfers
Adapteva calls this “a network on a chip”
Fast inter-core memory transfers
Very slow transfers to DRAM – want to work on largest matrix block possible at a time
Block distributed among cores, then SUMMA used to perform multiplication
Could potentially require a lot of loops – great deal of overhead
Pretty much expected
9-16 cores needed to beat non-blocked multiplication on ARM
ARM is shown as an example, but isn’t a good benchmark
Speedup from 1 to 16 cores is substantial
Slightly less than 16x due to inter-core communication