Motivation Intro Memory Network Communication Measurements Parallella
Parallella: Embedded HPC For Everybody
Jacob Erlbeck
Sysmocom s.f.m.c. GmbH
Berlin
Software conference for parallel programming, concurrency and multicore systems, Karlsruhe, May 5-7, 2014
© Jacob Erlbeck, 2014, Para//el 2014, Karlsruhe, May 5-7
The Parallella
The Parallella (2)
It’s cool!
Credit card size
Co-processors of multiple boards can be linked
Inexpensive
Software and design files are Open Source (github)
GCC / GDB / GNU tool chain
What did I want to know?
Suitable for ...
Audio processing?
Software defined radio?
Stream analysis?
Real performance values
How much of the peak performance rates do I get?
How does it compare to other platforms (Dual Cortex A9)?
What else?
Is the system easy or difficult to use or understand?
Are there helpful libraries or frameworks?
Which tools are available?
Example
The example problem
Input matrix I, convolved with the 3 × 3 stencil

    d n d
    n c n
    d n d

O(i,j) = d·I(i-1,j-1) + n·I(i,j-1) + d·I(i+1,j-1)
       + n·I(i-1,j)   + c·I(i,j)   + n·I(i+1,j)
       + d·I(i-1,j+1) + n·I(i,j+1) + d·I(i+1,j+1)

with 1 = c + 4d + 4n
Apply a 3 × 3 stencil filter to a 1000 × 1000 matrix
That is 998 × 998 × 9 multiplications and additions
On 16 cores at 600 MHz with fused multiply-add, the FPUs should need about 0.9 ms
The problem is described in Brown Deer Technology's STDCL documentation for the Parallella, see
www.browndeertechnology.com/docs/app_note_programming_parallella_using_stdcl.pdf
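This back-of-the-envelope estimate can be checked with a few lines of C (a sketch using the figures from the slides: 16 cores at 600 MHz, one fused multiply-add per core per cycle):

```c
#include <assert.h>

/* FMA count for the 998 x 998 interior points, 9 taps each */
long stencil_fma_ops(void)
{
    return 998L * 998L * 9L; /* 8,964,036 operations */
}

/* Ideal FPU time in milliseconds, one fused multiply-add per core per cycle */
double ideal_time_ms(long ops, int cores, double clock_hz)
{
    return 1000.0 * ops / (cores * clock_hz);
}
```

For 16 cores at 600 MHz this gives about 0.93 ms, the quoted 0.9 ms.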
Software
Programming frameworks
Preinstalled
Epiphany specific libraries
e-lib Target library, access to registers and hardware units, context information, utilities
e-hal Host library, access to the co-processor, loading and starting kernels
newlib Port of libc/libm that runs on the co-processor
Generic frameworks
libcoprthr POSIX-like threading abstraction for co-processors
OpenCL Compiler and libraries
STDCL Simplified layer on top of the above (host side)
Preinstalled (2)
Tools
GNU Tools GNU suite of binutils and compilers: e-gcc/e-g++, e-nm, e-objdump, ...
e-server Remote GDB debugging proxy for Epiphany cores
e-run Single core emulator, supports tracing & profiling
e-gdb GDB for the Epiphany, remote and emulation
e-tools Load programs, read/write core data, reset cores
OS
Linaro Linaro 14.01 / Ubuntu ’Saucy’ 13.10
The tools and libraries can be built and used on standard computers, e.g. for cross-compiling and emulation
Implementation
Example Implementation (STDCL)
Host Part
stencil2d_host.c (snippet, similar to the STDCL app note)
int w=1000; int h=1000;
float d=0.01/8; float n=d; float c=0.99;
size_t size = sizeof(float)*w*h;
float* in = clmalloc(stdacc, size, 0);
float* out = clmalloc(stdacc, size, 0);
// initialize ndr, in, out, ctx
clmsync(ctx, 0, in, ...);
clmsync(ctx, 0, out, ...);
clexec(ctx, 0, &ndr, stencil2d_kern,
in, out, w, h, c, n, d);
clmsync(ctx, 0, out, ...);
clwait(ctx, 0, CL_ALL_EVENT);
Example Implementation (OpenCL)
Co-processor Kernel
stencil2d_kern.cl (snippet, similar to the STDCL app note)
void stencil2d_kern(
float* in, float* out, int w, int h,
float c, float n, float d)
{ // initialize x1, x2, y1, y2 based on core id
for (int y = y1; y < y2; y++)
for (int x = x1; x < x2; x++) {
int k = x+y*w;
out[k] =
d*in[k-w-1] + n*in[k-w] + d*in[k-w+1] +
n*in[k-1] + c*in[k] + n*in[k+1] +
d*in[k+w-1] + n*in[k+w] + d*in[k+w+1];
}
}
Testing
Trying it out
Let’s see ...
linaro@linaro-nano: stencil/stdcl/
$ clcc -k -o kernel.o -c stencil2d_kern.cl
$ gcc -o stencil2d.x stencil2d_host.c kernel.o
$ sudo ./stencil2d.x
time used 1.184 s
$ ./stencil2d.x -stdcpu
time used 0.281 s
$
Oops!
99.9 % of the time is spent on something other than floating-point ops!
Using the ARM CPUs is 4 times faster than the Epiphany!
What to do
What is happening here?
Questions
What is 99.9 % of the time being spent on?
How can we fix it?
Next steps
Do measurements
Look at the board’s architecture
Try to improve the test program accordingly
Iterate ...
Measurements (1)
Setup vs. computation
Modifications
Measure setup and computation time separately
Original kernel (times in ms)
              Host   Epiphany   TE/TH
Set up        252    388         150 %
Computation    32    773        2410 %
Parallella Architecture
Parallella Architecture Overview
Epiphany Architecture
Epiphany Architecture Overview
MIMD Architecture
On-chip 2D mesh network
One shared 4 GiB address space (except for the first 1 MiB)
16 and 64 core versions available in silicon
256, 1024, and 4095 core versions offered as IP
Multiple devices can be linked together via 4 eLinks
Single Epiphany Node
Overview
Components
eCore processor
32kiB SRAM memory
Mesh network interface
2 DMA controllers
2 event counters
Data busses 64 bit wide
Address busses 32 bit wide
Network bus 104 bit wide
Processor (eCore)
Processor features
RISC architecture
Load/Store of 8, 16, 32, and 64 bit words
64 general purpose 32 bit registers
ALU/FPU: 32 bit only
No SIMD instructions
All registers are also memory mapped
Instruction pipeline (5 - 8 stages)
RISC: ALU/FPU only operate on registers, memory access is only done via load/store instructions
Pipeline stalls until all register dependencies are fulfilled
Local RAM
RAM features
32 kiB SRAM
Organized in 4 × 8 kiB banks that can be accessed in parallel
Used for code and data
Access in 1 clock cycle
External memory can be used for code and data
No cache for external memory
Memory Model
Memory address ranges
Node local        0x00000    – 0xFFFFF       1 MiB
  Local SRAM      0x00000    – 0x07FFF      32 kiB
  Local registers 0xF0000    – 0xFFFFF      64 kiB
External DRAM     0x8E000000 – 0x8FFFFFFF   32 MiB
Map local to global addresses: | row, 6 bit | col, 6 bit | local address, 20 bit |

Set the user interrupt on core (33, 11) → core id 0x84B:
*(unsigned *)((0x84B << 20) | 0xF042C) = 0x20;

Read external DRAM at offset 0x1234:
val = *(unsigned *)(0x8E000000 + 0x1234);
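The address arithmetic can be written out directly; `global_addr` below is a hypothetical helper for illustration, not an e-lib function:

```c
#include <assert.h>
#include <stdint.h>

/* Compose a global address from the 6-bit row, 6-bit column and
 * 20-bit local offset */
uint32_t global_addr(unsigned row, unsigned col, uint32_t local)
{
    uint32_t coreid = (row << 6) | col;        /* (33,11) -> 0x84B */
    return (coreid << 20) | (local & 0xFFFFF); /* core id selects the 1 MiB page */
}
```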
Default memory layout of the Parallella as seen from each node
Row   Col 8                Col 9                Col 10               Col 11
32    Core (0,0)           Core (0,1)           Core (0,2)           Core (0,3)
      0x80800000           0x80900000           0x80A00000           0x80B00000
33    Core (1,0)           Core (1,1)           Core (1,2)           Core (1,3)
      0x84800000           0x84900000           0x84A00000           0x84B00000
34    Core (2,0)           Core (2,1)           Core (2,2)           Core (2,3)
      0x88800000           0x88900000           0x88A00000           0x88B00000
35    Core (3,0)           Core (3,1)           Core (3,2)           Core (3,3)
      0x8C800000           0x8C900000           0x8CA00000           0x8CB00000

Row 0, column 0 is the node's own local memory, aliased at global address 0x00000000.
External DRAM is mapped at 0x8E000000, east of the core columns.
Rows above 32 reach external cores via the NORTH link, rows below 35 via the SOUTH link.
Example Revisited
External memory accesses
void stencil2d_kern(
float* in, float* out, int w, int h,
float c, float n, float d)
{
for (int y = y1; y < y2; y++)
for (int x = x1; x < x2; x++) {
int k = x+y*w;
out[k] =
d*in[k-w-1] + n*in[k-w] + d*in[k-w+1] +
n*in[k-1] + c*in[k] + n*in[k+1] +
d*in[k+w-1] + n*in[k+w] + d*in[k+w+1];
}
}
Implementing caching
Modifications
Use local SRAM to cache 3 rows at a time; this reduces external float reads from 9 to 1 per output value
Use register variables
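The caching idea can be sketched as plain host C (hypothetical code, not the actual kernel; on the Epiphany the three row buffers would sit in local SRAM and each core would process its own stripe):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stencil with three cached input rows: each input value is fetched
 * from 'external' memory exactly once instead of nine times. */
void stencil2d_rowcache(const float *in, float *out, int w, int h,
                        float c, float n, float d)
{
    float *rows[3];
    for (int i = 0; i < 3; i++) {
        rows[i] = malloc(sizeof(float) * w);
        memcpy(rows[i], in + i * w, sizeof(float) * w); /* preload rows 0..2 */
    }
    for (int y = 1; y < h - 1; y++) {
        const float *r0 = rows[(y - 1) % 3]; /* row above */
        const float *r1 = rows[y % 3];       /* current row */
        const float *r2 = rows[(y + 1) % 3]; /* row below */
        for (int x = 1; x < w - 1; x++)
            out[x + y * w] =
                d*r0[x-1] + n*r0[x] + d*r0[x+1] +
                n*r1[x-1] + c*r1[x] + n*r1[x+1] +
                d*r2[x-1] + n*r2[x] + d*r2[x+1];
        if (y + 2 < h) /* recycle the oldest buffer for the next input row */
            memcpy(rows[(y + 2) % 3], in + (y + 2) * w, sizeof(float) * w);
    }
    for (int i = 0; i < 3; i++)
        free(rows[i]);
}
```

Because 1 = c + 4d + 4n, the stencil reproduces linear input data exactly, which makes a convenient correctness check.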
Measurements (times in ms)
              Host   Epiphany   TE/TH    TE/THmin
No caching     32    773        2410 %   2410 %
Caching        78    152         194 %    475 %
+ Registers    65    129         198 %    402 %
Access to the external memory is the bottleneck
Is there still room for improvements?
Architecture
On-Chip Network (eMesh)
Properties
Separate Networks
cMesh Write data on-chip (fast, async)
rMesh Read data (slow, high latency)
xMesh Write data from/to external devices (DRAM, async)
Transaction per clock cycle: 64 bit data
External transaction per clock cycle: 8 bit (16 bit peak) data
DRAM accesses do not disturb on-chip write transactions
Back-pressure (push-back) on congestion
Read-after-write can return the old value
Transactions
Messages
Write Indication to write data atomically (includes data and destination address)
Read Request to create a write transaction (includes source and destination address)
Testset Request for an atomic TESTSET (includes data, source and destination address)
Data size is 8, 16, 32, or 64 bit
Data is read/written atomically
Messages include control bits (routing mode, interrupt, end-of-block)
Routing
Standard routing algorithm
1 If the column does not match: route horizontally
2 If the row does not match: route vertically
3 Both match: route to the attached core
Other routing methods can be selected at the sending core
External memory is accessed as if it were cores
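A sketch of the per-node routing decision (hypothetical model, not hardware code; rows grow to the south and columns to the east, as in the memory map above):

```c
#include <assert.h>

enum dir { CORE, EAST, WEST, NORTH, SOUTH };

/* Standard eMesh routing at one node: fix the column first (horizontal),
 * then the row (vertical), then deliver to the attached core. */
enum dir route(int row, int col, int dst_row, int dst_col)
{
    if (col != dst_col)
        return dst_col > col ? EAST : WEST;
    if (row != dst_row)
        return dst_row > row ? SOUTH : NORTH;
    return CORE;
}
```

External DRAM at core id 0x8E0 (row 35, column 32) is accordingly reached via the EAST link.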
Routing examples
Using 64 bit accesses
Modifications
Read and write 2 floats (32 bit each) at a time
Measurements (times in ms)
              Host   Epiphany   TE/TH    TE/THmin
No caching     32    773        2410 %   2410 %
Caching        78    152         194 %    475 %
+ Registers    65    129         198 %    402 %
+ 64 bit       50     76         152 %    237 %
Using 64bit transactions is more efficient
Is there still room for improvements?
Read transactions
[Message sequence chart 'External load': the eCore issues a read_req that travels via the eLink and the FPGA (FIFO/AXI) to the external DRAM; the memory read is returned as a write_ind to the eCore.]
’load’ operations stall the eCore until the data has arrived
This adds latency
The read requests share bandwidth with write indications
This can reduce (write) throughput
DMA controller
DMA controller features
2 independent DMA controllers
Configurations can be chained per controller
Basically implements ... (without chaining)
do_dma(*dst, dinc[2], *src, sinc[2], count[2])
{
    for (o = count[0]; o > 0; o--) {
        for (i = count[1]; ; i--) {
            *(item_t *)dst = *(item_t *)src;
            if (i == 1)
                break;
            dst += dinc[1]; src += sinc[1];
        }
        dst += dinc[0]; src += sinc[0];
    }
}
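A runnable host-side model of the same 2D loop (`model_dma` is a hypothetical helper mirroring the pseudocode, not the e-lib DMA API; increments are in bytes and are applied after each item, respectively after each row):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t item_t;

/* 2D strided copy: count[1] items per row, count[0] rows */
void model_dma(item_t *dst, const ptrdiff_t dinc[2],
               const item_t *src, const ptrdiff_t sinc[2], const int count[2])
{
    for (int o = count[0]; o > 0; o--) {          /* outer: rows */
        for (int i = count[1]; ; i--) {           /* inner: items per row */
            *dst = *src;
            if (i == 1)
                break;
            dst = (item_t *)((char *)dst + dinc[1]);
            src = (const item_t *)((const char *)src + sinc[1]);
        }
        dst = (item_t *)((char *)dst + dinc[0]);  /* jump to the next row */
        src = (const item_t *)((const char *)src + sinc[0]);
    }
}
```

With an inner source increment of one item and an outer increment that skips the remainder of the row, this gathers a sub-matrix into a packed buffer.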
DMA Transactions
[Message sequence chart 'Using the DMA': the eCore starts the DMA (dma_start) and keeps running; the DMA controller loops, issuing read_req messages through the eLink/FPGA to the external DRAM and forwarding the data read from memory as write_ind messages, until the buffer is exhausted (buf_ex).]
Using the DMA controller
Modifications
Port the example to the e-lib
Use the DMA to read/write rows asynchronously
linaro@linaro-nano: stencil/elib/
$ gcc -Wall -o stencil.o -c stencil.c
$ gcc -Wall -o stencil_host stencil.o -le-hal
$ e-gcc -Wall -O3 -ffast-math -c kern.c -o kern.o
$ e-gcc -T fast.ldf kern.o -o kern.elf -le-lib
$ e-objcopy --srec-forceS3 --output-target srec kern.elf kern.srec
$ sudo ./stencil_host -K kern -R100
time used 5.130 s (0.219 s + 100 * 0.049 s)
$
Using the DMA controller (2)
Measurements (times in ms)
              Host   Epiphany   TE/TH    TE/THmin
No caching     32    773        2410 %   2410 %
Caching        78    152         194 %    475 %
+ Registers    65    129         198 %    402 %
+ 64 bit       50     76         152 %    237 %
Set-up         —     220        —        —
+ DMA          —      49        —        153 %
Using the DMA avoids stalling the eCore
Is there still room for improvements?
External Memory
Read/Write Throughput From/To External Memory
By number of cores
[Plot: throughput (MB/s) over 4-16 cores, ranging roughly from 50 to 250 MB/s; curves for Write 4096 ('slow' DMA), Write 256, Read 4096, and Read + Write.]
Reduce transaction rate
Modifications
Use the DMA 'slow' mode (use the outer loop, which increases the transaction interval)
Measurements (times in ms)
                 Host   Epiphany   TE/THmin   FPU rate
DMA 'fast'       —       49        153 %      2 %
DMA 'slow'       —       44        137 %      2 %
DMA write-only   —       26        (81 %)     3 %
Further improvements?
Measurement overview (times in ms)
             Time (ms)   Clk/float   FPU rate
Set up       220 – 388   —           —
No caching   773         7240        —
Caching      152         1459        —
Registers    129         1238        —
64 bit        50          480        —
DMA 'fast'    49          470        2 %
DMA 'slow'    44          422        2 %
Local mem      2.2         21        38 %
The stencil problem is too 'small' per float (only 5 % FPU usage)
2.4 ms × 38 % FPU instruction rate ≈ the estimated 0.9 ms
More Measurements
Measuring Throughput
Overview (16 cores, 600 MHz clock)
                MB/s     clock/float   Peak    Remarks
Read DRAM       139      275           46 %    sync, slow
Write DRAM      152      252           50 %    sync, slow
R/W DRAM        2 × 86   441           57 %    sync
Read (0,3)      519      74            —       sync, slow
Write (0,3)     4802     8             100 %   no sync, loop
Read next       5952     6.5           —       no sync
Write next      21796    1.7           28 %    no sync
Stencil DRAM    2 × 91   422           60 %
Stencil local   1774     21            43 %
The reached throughput of 91 MB/s matches the measured maximum R/W rate
Raw Read/Write Throughput From/To External Memory
Under continuous overload, throughput differs
The rows don’t affect each other
Results
Learned From Working With the Example
Set-up time Prefer long-running kernels (≫ 200 ms)
DRAM Avoid accessing the external memory twice for the same data, cache locally instead
Write only Avoid reading remote data (latency, throughput)
DMA read Use the DMA asynchronously to read data
FPU Be computation intensive per external data value
Compiler Using register variables and optimization options (-O3 -ffast-math) yields good results
Applications
Audio processing
≈ 200 channels of 24 bit at 96 kHz sampling rate (in & out)
Less if external DRAM is needed for delay lines
Video processing
≈ 1 stream of 720p HD, 16 bit/pixel at 46 fps (in & out) / at 80 fps (out)
Stream analysis
≈ 4 GBit/s data stream (in only)
Less if external DRAM is needed to store data
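These figures can be sanity-checked against the measured DRAM rates (about 139 MB/s read and 152 MB/s write; the helper functions are hypothetical, MB = 10^6 bytes):

```c
#include <assert.h>

/* Audio stream bandwidth in MB/s, one direction */
double audio_mb_s(int channels, int bytes_per_sample, double rate_hz)
{
    return channels * bytes_per_sample * rate_hz / 1e6;
}

/* Uncompressed video bandwidth in MB/s, one direction */
double video_mb_s(int w, int h, int bytes_per_pixel, double fps)
{
    return (double)w * h * bytes_per_pixel * fps / 1e6;
}
```

200 channels of 24 bit at 96 kHz come to about 57.6 MB/s each way; 720p at 16 bit/pixel is about 85 MB/s each way at 46 fps, or about 147 MB/s write-only at 80 fps, i.e. just under the measured limits.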
SW-Architecture Considerations
Throughput is limited
Can be an issue with star topologies
Starvation on network overload
Does not compromise throughput
Can be an issue with work-stealing scheduling
Barriers can help to ensure fairness
Reading is slower than writing
Throughput is highly asymmetric (by design)
Can be an issue with shared memory synchronization and reference counting
Application To Other Architectures
Write vs. Read
Where reading is also slower
Main memory accesses in current mainstream CPUs
NUMA interconnects
Access to IO devices, e.g. via PCIe
Access to remote data in clusters
Considerations
Prefer sending copies over sharing local data
Use asynchronous messaging
Closing Measurements
Finally
Thank you
Many thanks to
Sylvain Munaut for lending me a board
Links
Parallella http://www.parallella.org
Epiphany Docs http://www.adapteva.com/all-documents/
http://www.adapteva.com/analyst-reports/
Specifications http://www.parallella.org/board/
STDCL/Coprthr http://www.browndeertechnology.com/resources.htm
STDCL App Note http://www.browndeertechnology.com/docs/app_note_programming_parallella_using_stdcl.pdf
MPR article http://www.adapteva.com/wp-content/uploads/2011/06/adapteva_mpr.pdf
Contact
Jacob Erlbeck, jacob.erlbeck@gmail.com
Copyright
© Jacob Erlbeck, 2014. Please contact the author if you wish to redistribute the work as a whole or in parts.
Write Throughput To External Memory
By number of cores and size
[3D plot: throughput (MB/s, roughly 100-250) over 4-16 cores and chunk sizes of 1,000-4,000 B.]
Throughput By Transfer Method
Overview (16 cores, 600 MHz clock)
              eLib   Loop   DMA     DMA     DMA     DMA          Peak
                                    slow    sync    slow, sync   (spec.)
Read DRAM     48     94     131     139     128     136          300/1200
Write DRAM    77     152    143     152     140     149          300/1200
Read (0,3)    270    536    471     501     485     519          —
Write (0,3)   2405   4802   4528    4757    3787    3508         4800
Read col 3    1069   2134   1873    1997    1796    1899         —
Write col 3   7779   7651   16522   9807    9624    7894         19200
Read next     1750   2834   5952    5414    5076    4616         —
Write next    8749   7651   21769   9807    14571   7895         (76800)
Read self     2275   3485   6530    5439    5602    4501         —
Write self    8785   7651   21758   9807    14628   7895         (76800)
According to the errata lists in the Epiphany III/IV data sheets (E16G301 and E64G401), the peak node→eMesh rate is currently limited.
Raw Read/Write Throughput From/To Columns 3 (East)
Raw Read/Write Throughput From/To Core (0, 3)
Measurement Summary
On-Chip Mesh Behaviour
Reading is much slower than writing
Overload can lead to core starvation
Overall throughput is maintained on overload
Cores can receive a constant data stream at peak rate
Cores can only send significantly below the peak rate (errata)
Alexandre Moneger
 
Dataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and toolsDataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and tools
Stefano Salsano
 

Similar to Parallella: Embedded HPC For Everybody (20)

Toward an Open and Unified Model for Heterogeneous and Accelerated Multicore ...
Toward an Open and Unified Model for Heterogeneous and Accelerated Multicore ...Toward an Open and Unified Model for Heterogeneous and Accelerated Multicore ...
Toward an Open and Unified Model for Heterogeneous and Accelerated Multicore ...
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBM
 
Brief Introduction to Parallella
Brief Introduction to ParallellaBrief Introduction to Parallella
Brief Introduction to Parallella
 
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOGDOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG
 
Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/Invoke
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
OSDC 2017 - Werner Fischer - Open power for the data center
OSDC 2017 - Werner Fischer - Open power for the data centerOSDC 2017 - Werner Fischer - Open power for the data center
OSDC 2017 - Werner Fischer - Open power for the data center
 
OSDC 2017 | Linux Performance Profiling and Monitoring by Werner Fischer
OSDC 2017 | Linux Performance Profiling and Monitoring by Werner FischerOSDC 2017 | Linux Performance Profiling and Monitoring by Werner Fischer
OSDC 2017 | Linux Performance Profiling and Monitoring by Werner Fischer
 
OSDC 2017 | Open POWER for the data center by Werner Fischer
OSDC 2017 | Open POWER for the data center by Werner FischerOSDC 2017 | Open POWER for the data center by Werner Fischer
OSDC 2017 | Open POWER for the data center by Werner Fischer
 
Disaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFDisaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoF
 
A Quick Introduction to Programmable Logic
A Quick Introduction to Programmable LogicA Quick Introduction to Programmable Logic
A Quick Introduction to Programmable Logic
 
Lec 10-linux-review
Lec 10-linux-reviewLec 10-linux-review
Lec 10-linux-review
 
BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!BKK16-103 OpenCSD - Open for Business!
BKK16-103 OpenCSD - Open for Business!
 
Digital design with Systemc
Digital design with SystemcDigital design with Systemc
Digital design with Systemc
 
Microkernel-based operating system development
Microkernel-based operating system developmentMicrokernel-based operating system development
Microkernel-based operating system development
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
07 - Bypassing ASLR, or why X^W matters
07 - Bypassing ASLR, or why X^W matters07 - Bypassing ASLR, or why X^W matters
07 - Bypassing ASLR, or why X^W matters
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wall
 
05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR matters05 - Bypassing DEP, or why ASLR matters
05 - Bypassing DEP, or why ASLR matters
 
Dataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and toolsDataplane programming with eBPF: architecture and tools
Dataplane programming with eBPF: architecture and tools
 

Recently uploaded

Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
ScyllaDB
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptxAI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
Sunil Jagani
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
Mydbops
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
FilipTomaszewski5
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
UiPathCommunity
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Neo4j
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfLee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
leebarnesutopia
 

Recently uploaded (20)

Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptxAI in the Workplace Reskilling, Upskilling, and Future Work.pptx
AI in the Workplace Reskilling, Upskilling, and Future Work.pptx
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
 
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - MydbopsMySQL InnoDB Storage Engine: Deep Dive - Mydbops
MySQL InnoDB Storage Engine: Deep Dive - Mydbops
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeckPoznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
Poznań ACE event - 19.06.2024 Team 24 Wrapup slidedeck
 
Day 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio FundamentalsDay 2 - Intro to UiPath Studio Fundamentals
Day 2 - Intro to UiPath Studio Fundamentals
 
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid ResearchHarnessing the Power of NLP and Knowledge Graphs for Opioid Research
Harnessing the Power of NLP and Knowledge Graphs for Opioid Research
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfLee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdf
 

Parallella: Embedded HPC For Everybody

  • 1. Parallella: Embedded HPC For Everybody
    Jacob Erlbeck, Sysmocom s.f.m.c. GmbH, Berlin
    Softwarekonferenz für Parallel Programming, Concurrency und Multicore-Systeme, Karlsruhe, 5.-7. Mai 2014
    Sections: Motivation · Intro · Memory · Network · Communication · Measurements · Parallella
    © Jacob Erlbeck, 2014, Para//el 2014, Karlsruhe, 5.-7. Mai
  • 2. The Parallella
    (Board photo)
  • 3. The Parallella (2)
    It's cool!
    - Credit card size
    - Co-processors of multiple boards can be linked
    - Inexpensive
    - Software and design files are Open Source (github)
    - GCC / GDB / GNU tool chain
  • 4. What did I want to know?
    Suitable for ...
    - Audio processing?
    - Software defined radio?
    - Stream analysis?
    Real performance values
    - How much of the peak performance rates do I get?
    - How does it compare to other platforms (Dual Cortex A9)?
    What else?
    - Is the system easy or difficult to use or understand?
    - Are there helpful libraries or frameworks?
    - Which tools are available?
  • 5. Example: The example problem
    Apply a 3 × 3 stencil filter to a 1000 × 1000 input matrix I:

        d n d
        n c n
        d n d

    O(i,j) = d·I(i-1,j-1) + n·I(i,j-1) + d·I(i+1,j-1)
           + n·I(i-1,j)   + c·I(i,j)   + n·I(i+1,j)
           + d·I(i-1,j+1) + n·I(i,j+1) + d·I(i+1,j+1),
    with 1 = c + 4d + 4n.

    - That is 998 × 998 × 9 multiplications and additions
    - On 16 cores at 600 MHz with fused multiply-add, this should take 0.9 ms in the FPUs
    This problem is described in Brown Deer Technology's STDCL documentation for the Parallella, see
    www.browndeertechnology.com/docs/app_note_programming_parallella_using_stdcl.pdf
  • 6. Software / Programming frameworks: Preinstalled
    Epiphany specific libraries
    - e-lib: target library; access to registers and hardware units, context information, utilities
    - e-hal: host library; access to the co-processor, loading and starting kernels
    - newlib: port of libc/libm that runs on the co-processor
    Generic frameworks
    - libcoprthr: POSIX-like threading abstraction for co-processors
    - OpenCL: compiler and libraries
    - STDCL: simplified layer on top of the above (host side)
  • 7. Software / Programming frameworks: Preinstalled (2)
    Tools
    - GNU tools: GNU suite of binutils and compilers: e-gcc/e-g++, e-nm, e-objdump, ...
    - e-server: remote GDB debugging proxy for Epiphany cores
    - e-run: single core emulator, supports tracing & profiling
    - e-gdb: GDB for the Epiphany, remote and emulation
    - e-tools: load programs, read/write core data, reset cores
    OS
    - Linaro 14.01 / Ubuntu 'Saucy' 13.10
    The tools and libraries can be built and used on standard computers, e.g. for cross-compiling and emulation
  • 8. Implementation: Example implementation (STDCL), host part
    stencil2d_host.c (snippet, similar to the STDCL app note):

        int w = 1000;
        int h = 1000;
        float d = 0.01/8;
        float n = d;
        float c = 0.99;
        size_t size = sizeof(float)*w*h;
        float* in = clmalloc(stdacc, size, 0);
        float* out = clmalloc(stdacc, size, 0);
        // initialize ndr, in, out, ctx
        clmsync(ctx, 0, in, ...);
        clmsync(ctx, 0, out, ...);
        clexec(ctx, 0, &ndr, stencil2d_kern, in, out, w, h, c, n, d);
        clmsync(ctx, 0, out, ...);
        clwait(ctx, 0, CL_ALL_EVENT);
  • 9. Implementation: Example implementation (OpenCL), co-processor kernel
    stencil2d_kern.cl (snippet, similar to the STDCL app note):

        void stencil2d_kern(float* in, float* out, int w, int h,
                            float c, float n, float d)
        {
            // initialize x1, x2, y1, y2 based on core id
            for (y = y1; y < y2; y++)
                for (x = x1; x < x2; x++) {
                    int k = x + y*w;
                    out[k] = d*in[k-w-1] + n*in[k-w] + d*in[k-w+1]
                           + n*in[k-1]   + c*in[k]   + n*in[k+1]
                           + d*in[k+w-1] + n*in[k+w] + d*in[k+w+1];
                }
        }
  • 10. Testing: Trying it out
    Let's see ...

        linaro@linaro-nano: stencil/stdcl/
        $ clcc -k -o kernel.o -c stencil2d_kern.cl
        $ gcc -o stencil2d.x stencil2d_host.c kernel.o
        $ sudo ./stencil2d.x
        time used 1.184 s
        $ ./stencil2d.x -stdcpu
        time used 0.281 s
        $

    Oops!
    - 99.9 % of the time is not spent on floating point operations!
    - It is 4 times faster to use the ARM CPUs than the Epiphany!
  • 11. Testing: What to do. What is happening here?
    Questions
    - What is 99.9 % of the time spent on?
    - How can we fix it?
    Next steps
    - Do measurements
    - Look at the board's architecture
    - Try to improve the test program accordingly
    - Iterate ...
  • 12. Testing: Measurements (1), setup vs. computation
    Modifications
    - Measure setup and computation time separately
    Original kernel (times in ms):

                      Host   Epiphany   TE/TH
        Set up         252        388    150 %
        Computation     32        773   2410 %
  • 13. Parallella Architecture: Overview
    (Block diagram)
  • 14. Epiphany Architecture: Overview
    - MIMD architecture
    - On-chip 2D mesh network
    - One shared 4 GiB address space (except for the first 1 MiB)
    - 16 and 64 core versions available in silicon
    - 256, 1024, and 4095 core versions offered as IP
    - Multiple devices can be linked together via 4 eLinks
  • 15. Epiphany Architecture: Single Epiphany node, overview
    Components
    - eCore processor
    - 32 kiB SRAM memory
    - Mesh network interface
    - 2 DMA controllers
    - 2 event counters
    Bus widths
    - Data buses: 64 bit
    - Address buses: 32 bit
    - Network bus: 104 bit
  • 16. Epiphany Architecture: Single Epiphany node, processor (eCore)
    Processor features
    - RISC architecture
    - Load/store of 8, 16, 32, and 64 bit words
    - 64 general purpose 32 bit registers
    - ALU/FPU: 32 bit only
    - No SIMD instructions
    - All registers are also memory mapped
    - Instruction pipeline (5-8 stages)
    RISC: the ALU/FPU only operate on registers; memory access is only done via load/store instructions.
    The pipeline stalls until all register dependencies are fulfilled.
  • 17. Epiphany Architecture: Single Epiphany node, local RAM
    RAM features
    - 32 kiB SRAM
    - Organized in 4 × 8 kiB banks that can be accessed in parallel
    - Used for code and data
    - Access in 1 clock cycle
    - External memory can be used for code and data
    - No cache for external memory
  • 18. Epiphany Architecture: Memory model
    Memory address ranges

        Node local         0x00000    - 0xFFFFF       1 MiB
          Local SRAM       0x00000    - 0x07FFF      32 kiB
          Local registers  0xF0000    - 0xFFFFF      64 kiB
        External DRAM      0x8E000000 - 0x8FFFFFFF   32 MiB

    Mapping local to global addresses: | Row, 6 bit | Col, 6 bit | Local address, 20 bit |

    Set a user interrupt on core (33, 11) → core id 0x84B:

        *(unsigned *)((0x84B << 20) | 0xF042C) = 0x20;

    Read external DRAM at offset 0x1234:

        val = *(unsigned *)(0x8E000000 + 0x1234);
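The bit layout above can be sketched as a small helper. This is illustrative code, not part of e-lib: `global_addr()` is an assumed name, and the mapping (6-bit row, 6-bit column, 20-bit local address) is taken from the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the local-to-global address mapping described above.
 * The 12-bit core id is the 6-bit mesh row followed by the 6-bit
 * column; the low 20 bits carry the node-local address. */
static uint32_t global_addr(uint32_t row, uint32_t col, uint32_t local)
{
    uint32_t core_id = (row << 6) | col;        /* e.g. (33, 11) -> 0x84B */
    return (core_id << 20) | (local & 0xFFFFF); /* keep the 20-bit local part */
}
```

For core (33, 11) this yields base address 0x84B00000, so the register at local offset 0xF042C appears globally at 0x84BF042C, as in the interrupt example above.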
  • 19. Epiphany Architecture: Memory model (2)
    Default memory layout of the Parallella as seen from each node (mesh row/column → global base address):

        Row 0:       local address space, 0x00000000
        Rows 1-31:   external cores via the NORTH link
        Row 32:  Core (0,0) 0x80800000   (0,1) 0x80900000   (0,2) 0x80A00000   (0,3) 0x80B00000
        Row 33:  Core (1,0) 0x84800000   (1,1) 0x84900000   (1,2) 0x84A00000   (1,3) 0x84B00000
        Row 34:  Core (2,0) 0x88800000   (2,1) 0x88900000   (2,2) 0x88A00000   (2,3) 0x88B00000
        Row 35:  Core (3,0) 0x8C800000   (3,1) 0x8C900000   (3,2) 0x8CA00000   (3,3) 0x8CB00000
        Rows 36-63:  external cores via the SOUTH link
        Ext DRAM:    0x8E000000 (east of the core array)
  • 20. Example: Example revisited, external memory accesses
    Every in[] and out[] access in the kernel goes to external memory (there is no cache):

        void stencil2d_kern(float* in, float* out, int w, int h,
                            float c, float n, float d)
        {
            for (y = y1; y < y2; y++)
                for (x = x1; x < x2; x++) {
                    int k = x + y*w;
                    out[k] = d*in[k-w-1] + n*in[k-w] + d*in[k-w+1]
                           + n*in[k-1]   + c*in[k]   + n*in[k+1]
                           + d*in[k+w-1] + n*in[k+w] + d*in[k+w+1];
                }
        }
  • 21. Example: Example revisited, implementing caching
    Modifications
    - Use the local SRAM to cache 3 rows at a time; this reduces external float reads from 9 to 1 per output value
    - Use register variables
    Measurements (times in ms):

                      Host   Epiphany   TE/TH    TE/THmin
        No caching      32        773   2410 %     2410 %
        Caching         78        152    194 %      475 %
        + Registers     65        129    198 %      402 %

    Access to the external memory is the bottleneck. Is there still room for improvement?
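The 3-row caching idea can be sketched in plain C. This is a hedged sketch, not the talk's actual kernel: `W`, `rows[]`, and the function names are illustrative, and on the Epiphany `rows[]` would live in local SRAM while `in`/`out` sit in external DRAM. Each input row is copied once and a 3-row window rotates as `y` advances, so each input float is fetched once instead of nine times.

```c
#include <assert.h>
#include <string.h>

#define W 8  /* toy row width; the talk uses 1000 */

/* Apply the 3x3 stencil to one row, reading from three cached rows. */
static void stencil_row(const float *up, const float *mid, const float *dn,
                        float *out, float c, float n, float d)
{
    for (int x = 1; x < W - 1; x++)
        out[x] = d*up[x-1]  + n*up[x]  + d*up[x+1]
               + n*mid[x-1] + c*mid[x] + n*mid[x+1]
               + d*dn[x-1]  + n*dn[x]  + d*dn[x+1];
}

/* Rotate a 3-row window over the input: each row is copied into the
 * fast local buffer exactly once. */
static void stencil_cached(const float *in, float *out, int h,
                           float c, float n, float d)
{
    float rows[3][W];                         /* stand-in for local SRAM */
    memcpy(rows[0], in,     sizeof rows[0]);
    memcpy(rows[1], in + W, sizeof rows[1]);
    for (int y = 1; y < h - 1; y++) {
        memcpy(rows[(y + 1) % 3], in + (size_t)(y + 1) * W, sizeof rows[0]);
        stencil_row(rows[(y - 1) % 3], rows[y % 3], rows[(y + 1) % 3],
                    out + (size_t)y * W, c, n, d);
    }
}
```

With c + 4n + 4d = 1 and a constant input, every interior output value equals the input value, which makes the filter easy to sanity-check.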
  • 22. Architecture: On-chip network (eMesh), properties
    Separate networks
    - cMesh: write data on-chip (fast, async)
    - rMesh: read data (slow, high latency)
    - xMesh: write data from/to external devices (DRAM, async)
    Properties
    - One transaction per clock cycle: 64 bit data
    - External transactions per clock cycle: 8 bit (16 bit peak) data
    - DRAM accesses do not disturb on-chip write transactions
    - Back-pressure (push-back) on congestion
    - Read-after-write can return the old value
  • 23. Architecture: On-chip network (eMesh), transactions
    Messages
    - Write: indication to write data atomically (includes data and destination address)
    - Read: request to create a write transaction (includes source and destination address)
    - Testset: request for an atomic TESTSET (includes data, source and destination address)
    Properties
    - Data size is 8, 16, 32, or 64 bit
    - Data is read/written atomically
    - Messages include control bits (routing mode, interrupt, end-of-block)
  • 24. Architecture: On-chip network (eMesh), routing
    Standard routing algorithm
    1. If the column does not match: route horizontally
    2. If the row does not match: route vertically
    3. Both match: route to the attached core
    - Other routing methods can be selected at the sending core
    - External memory is accessed as if it were cores
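The three routing rules above amount to a per-hop decision that can be sketched in a few lines. The names and the direction convention (rows growing southward, columns eastward, matching the memory-layout slide) are assumptions for illustration, not taken from the Epiphany documentation.

```c
#include <assert.h>

/* Column-first ("XY") routing decision, as described on the slide:
 * fix the column, then the row, then deliver to the attached core. */
typedef enum { GO_EAST, GO_WEST, GO_SOUTH, GO_NORTH, AT_CORE } hop_t;

static hop_t next_hop(unsigned row, unsigned col,
                      unsigned dst_row, unsigned dst_col)
{
    if (dst_col != col)                          /* 1. column mismatch */
        return dst_col > col ? GO_EAST : GO_WEST;
    if (dst_row != row)                          /* 2. row mismatch */
        return dst_row > row ? GO_SOUTH : GO_NORTH;
    return AT_CORE;                              /* 3. both match */
}
```

Because every router applies the same rule, a message follows one horizontal run and then one vertical run, which keeps the routing deadlock-free.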
  • 25. Architecture: On-chip network (eMesh), routing examples
    (Routing diagrams)
  • 26. Example: Example revisited, using 64 bit accesses
    Modifications
    - Read and write 2 floats (32 bit each) at a time
    Measurements (times in ms):

                      Host   Epiphany   TE/TH    TE/THmin
        No caching      32        773   2410 %     2410 %
        Caching         78        152    194 %      475 %
        + Registers     65        129    198 %      402 %
        + 64 bit        50         76    152 %      237 %

    Using 64 bit transactions is more efficient. Is there still room for improvement?
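The "2 floats at a time" change can be sketched as moving float pairs through 64-bit loads and stores, which the eMesh carries as single 64-bit transactions. This is an illustrative sketch, not the talk's code; `memcpy()` of 8 bytes is used to sidestep alignment and aliasing pitfalls, and a compiler can turn it into doubleword accesses for suitably aligned data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy n_pairs pairs of floats using one 64-bit read and one 64-bit
 * write per pair instead of two 32-bit accesses each. */
static void copy_pairs64(float *dst, const float *src, int n_pairs)
{
    for (int i = 0; i < n_pairs; i++) {
        uint64_t pair;                            /* two packed floats */
        memcpy(&pair, src + 2 * i, sizeof pair);  /* one 64-bit read */
        memcpy(dst + 2 * i, &pair, sizeof pair);  /* one 64-bit write */
    }
}
```

Halving the transaction count matters here because each mesh transaction carries at most 64 bits, so wider accesses use the available per-cycle bandwidth better.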
  • 27. Architecture: Read transactions
    (Message sequence chart: eCore → eLink → FIFO/AXI FPGA → external DRAM; the read_req is forwarded, the memory is read, and a write_ind carries the data back)
    External load
    - 'load' operations stall the eCore until the data has arrived; this adds latency
    - The read requests share bandwidth with write indications; this can reduce (write) throughput
  • 28. Architecture: Single Epiphany node, DMA controller
    DMA controller features
    - 2 independent DMA controllers
    - Configurations can be chained per controller
    Basically implements (without chaining):

        do_dma(*dst, dinc[2], *src, sinc[2], count[2])
        {
            for (o = count[0]; o > 0; o--) {
                for (i = count[1]; ; i--) {
                    *(item_t *)dst = *(item_t *)src;
                    if (i == 1)
                        break;
                    dst += dinc[1];
                    src += sinc[1];
                }
                dst += dinc[0];
                src += sinc[0];
            }
        }
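The pseudocode above can be made runnable as a 2D strided copy with an outer and an inner count/stride pair. This mirrors the chaining-free behaviour sketched on the slide, not the real register-programmed engine; `item_t` is fixed to 8-bit items here for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t item_t;

/* 2D strided copy: count[1] items per inner run with strides
 * dinc[1]/sinc[1], repeated count[0] times with the outer strides
 * dinc[0]/sinc[0] applied between runs. */
static void do_dma(item_t *dst, const ptrdiff_t dinc[2],
                   const item_t *src, const ptrdiff_t sinc[2],
                   const size_t count[2])
{
    for (size_t o = count[0]; o > 0; o--) {
        for (size_t i = count[1]; ; i--) {
            *dst = *src;
            if (i == 1)
                break;
            dst += dinc[1];        /* inner-loop strides */
            src += sinc[1];
        }
        dst += dinc[0];            /* outer-loop strides */
        src += sinc[0];
    }
}
```

With, say, an inner count of 3 and an outer source stride that skips to the next matrix row, this gathers rows of a larger matrix into a contiguous local buffer, which is exactly how the stencil example later feeds its row cache.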
  • 29. Architecture: DMA transactions
    (Message sequence chart: the eCore issues dma_start and keeps running; the DMA engine loops, sending read_req messages toward the external DRAM and turning the returning memory reads into write_ind messages toward the destination buffer)
  • 30. Example: Example revisited, using the DMA controller
    Modifications
    - Port the example to the e-lib
    - Use the DMA to read/write rows asynchronously

        linaro@linaro-nano: stencil/elib/
        $ gcc -Wall -o stencil.o -c stencil.c
        $ gcc -Wall -o stencil stencil.o -le-hal
        $ e-gcc -Wall -O3 -ffast-math -c kern.c -o kern.o
        $ e-gcc -T fast.ldf kern.o -o kern.elf -le-lib
        $ e-objcopy --srec-forceS3 --output-target srec kern.elf kern.srec
        $ sudo ./stencil_host -K kern -R100
        time used 5.130 s (0.219 s + 100 * 0.049 s)
        $
  • 31. Example: Example revisited, using the DMA controller (2)
    Measurements (times in ms):

                      Host   Epiphany   TE/TH    TE/THmin
        No caching      32        773   2410 %     2410 %
        Caching         78        152    194 %      475 %
        + Registers     65        129    198 %      402 %
        + 64 bit        50         76    152 %      237 %
        Set-up           -        220       -          -
        + DMA            -         49       -      153 %

    Using the DMA avoids stalling the eCore. Is there still room for improvement?
  • 32. External Memory: Read/write throughput from/to external memory, by number of cores
    (Plot: throughput in MB/s over 4-16 cores, roughly 50-250 MB/s; curves: Write 4096 ('slow' DMA), Write 256, Read 4096, Read + Write)
  • 33. Example: Example revisited, reducing the transaction rate
    Modifications
    - Use the DMA 'slow' mode (use the outer loop; this increases the transaction interval)
    Measurements (times in ms):

                          Host   Epiphany   TE/THmin   FPU rate
        DMA 'fast'           -         49      153 %        2 %
        DMA 'slow'           -         44      137 %        2 %
        DMA write-only       -         26      (81 %)       3 %
  • 34. Example: Example revisited, further improvements?
    Measurement overview (times in ms):

                      Time (ms)   Clk/float   FPU rate
        Set up        220 - 388           -          -
        No caching          773        7240          -
        Caching             152        1459          -
        Registers           129        1238          -
        64 bit               50         480          -
        DMA 'fast'           49         470        2 %
        DMA 'slow'           44         422        2 %
        Local mem           2.2          21       38 %

    - The stencil problem is too 'small' per float (it uses only 5 %)
    - 2.2 ms × 38 % FPU instruction rate ≈ the estimated 0.9 ms
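The "Clk/float" column appears to be the aggregate clock budget of all 16 cores at 600 MHz divided over the 1000 × 1000 matrix elements; this derivation is an assumption that fits the table, not something stated on the slide.

```c
#include <assert.h>
#include <math.h>

/* Aggregate clocks per matrix element: time * (16 cores * 600 MHz)
 * spread over 1000 * 1000 floats. */
static double clk_per_float(double time_ms)
{
    const double cores = 16.0, clock_hz = 600e6, floats = 1000.0 * 1000.0;
    return time_ms / 1000.0 * cores * clock_hz / floats;
}
```

For example, the "64 bit" row's 50 ms works out to 480 clocks per float, matching the table.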
  • 35. More Measurements: Measuring throughput, overview (16 cores, 600 MHz clock)

                        MB/s   clock/float   Peak    Remarks
        Read DRAM        139           275    46 %   sync, slow
        Write DRAM       152           252    50 %   sync, slow
        R/W DRAM      2 × 86           441    57 %   sync
        Read (0,3)       519            74       -   sync, slow
        Write (0,3)     4802             8   100 %   no sync, loop
        Read next       5952           6.5       -   no sync
        Write next     21796           1.7    28 %   no sync
        Stencil DRAM  2 × 91           422    60 %
        Stencil local   1774            21    43 %

    The reached throughput of 91 MB/s ≈ the measured maximum R/W rate.
• 36. More Measurements: Raw Read/Write Throughput From/To External Memory
On continuous overload, throughput differs.
The rows don't affect each other.
• 37. Results: Learned From Working With the Example
Set-up time: prefer long-running kernels (≫ 200 ms)
DRAM: avoid accessing the external memory twice for the same data; cache locally instead
Write only: avoid reading remote data (latency, throughput)
DMA read: use the DMA asynchronously to read data
FPU: be computation-intensive per external data value
Compiler: using register variables and optimization options (-O3 -ffast-math) yields good results
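The 'DMA read' advice above amounts to double buffering: fetch the next block while computing on the current one. A minimal host-side sketch under my own assumptions (`process_stream`, `BLOCK` are illustrative names; `memcpy` stands in for an asynchronous DMA start, and the comment marks where a real kernel would wait for DMA completion):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 4   /* illustrative block size; real code would use larger blocks */

/* Double-buffered streaming: while block k is processed out of local
 * buffer k&1, block k+1 is already being fetched into the other buffer. */
static void process_stream(const float *src, float *dst, size_t nblocks)
{
    float buf[2][BLOCK];
    memcpy(buf[0], src, sizeof(buf[0]));              /* fetch first block */
    for (size_t k = 0; k < nblocks; k++) {
        if (k + 1 < nblocks)                          /* start 'async' fetch */
            memcpy(buf[(k + 1) & 1], src + (k + 1) * BLOCK, sizeof(buf[0]));
        /* ... a real kernel would wait here for the DMA of block k ... */
        for (int j = 0; j < BLOCK; j++)               /* compute on local copy */
            dst[k * BLOCK + j] = 2.0f * buf[k & 1][j];
    }
}
```

With a real asynchronous DMA the fetch and the compute loop overlap in time, which is exactly what kept the eCore from stalling in the measurements above.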
• 38. Results: Applications
Audio processing: ≈ 200 channels à 24 bit at 96 kHz sampling rate (in & out); less if external DRAM is needed for delay lines
Video processing: ≈ 1 stream 720p HD, 16 bit/pixel at 46 fps (in & out) / at 80 fps (out)
Stream analysis: ≈ 4 GBit/s data stream (in only); less if external DRAM is needed to store data
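The audio figure follows from the measured DRAM rates: 200 channels of 24-bit (3-byte) samples at 96 kHz need about 58 MB/s per direction, which fits under the ≈ 86 MB/s concurrent read/write rate from the throughput overview. A back-of-envelope check (the helper name `audio_mb_per_s` is my own; the rates are from this talk):

```c
#include <assert.h>

/* Required external-memory bandwidth per direction for raw audio:
 * channels * sample rate * bytes per sample, in MB/s (10^6 bytes/s). */
static double audio_mb_per_s(int channels, int rate_hz, int bytes_per_sample)
{
    return (double)channels * rate_hz * bytes_per_sample / 1e6;
}
```

200 × 96000 × 3 bytes/s ≈ 57.6 MB/s per direction, below the measured 86 MB/s each way for concurrent read/write.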
• 39. Results: SW-Architecture Considerations
Throughput is limited: can be an issue with star topologies
Starvation on network overload: does not compromise throughput; can be an issue with work-stealing scheduling; barriers can help to ensure fairness
Reading is slower than writing: throughput is highly asymmetric (by design); can be an issue with shared-memory synchronization and reference counting
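One common way around the read/write asymmetry is write-based signalling: the producer writes both the value and a ready flag into memory that is local to the consumer, so the consumer only ever polls its own fast local memory and no slow remote read crosses the mesh. A single-threaded sketch under stated assumptions (`mailbox`, `producer_send`, `consumer_poll` are illustrative names; on the Epiphany the mailbox would live in the consumer core's SRAM, here a plain global stands in for it):

```c
#include <assert.h>

/* Mailbox placed in the CONSUMER's local memory. */
struct mailbox {
    volatile int ready;   /* flag the consumer polls locally */
    float value;
};

static struct mailbox box;

/* Producer side: a remote WRITE, which is cheap on the mesh. */
static void producer_send(float v)
{
    box.value = v;
    box.ready = 1;        /* flag written last, after the payload */
}

/* Consumer side: a LOCAL read, no mesh round trip. Returns 1 on data. */
static int consumer_poll(float *out)
{
    if (!box.ready)
        return 0;
    *out = box.value;
    box.ready = 0;
    return 1;
}
```

On real hardware the write ordering of payload and flag would additionally need a memory barrier or fence; that detail is omitted in this sketch.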
• 40. Results: Application To Other Architectures
Write vs. read, where reading is also slower:
Main memory accesses in current mainstream CPUs
NUMA interconnects
Access to IO devices, e.g. via PCIe
Access to remote data in clusters
Considerations: prefer sending copies over sharing local data; use asynchronous messaging
• 41. Closing Measurements Finally
Thank you. Many thanks to Sylvain Munaut for lending me a board.

Links:
Parallella: http://www.parallella.org
Epiphany Docs: http://www.adapteva.com/all-documents/ and http://www.adapteva.com/analyst-reports/
Specifications: http://www.parallella.org/board/
STDCL/Coprthr: http://www.browndeertechnology.com/resources.htm
STDCL App Note: http://www.browndeertechnology.com/docs/app_note_programming_parallella_using_stdcl.pdf
MPR article: http://www.adapteva.com/wp-content/uploads/2011/06/adapteva_mpr.pdf
• 42. Contact
Jacob Erlbeck, jacob.erlbeck@gmail.com
Copyright © Jacob Erlbeck, 2014. Please contact the author if you wish to redistribute the work as a whole or in parts.
• 43. Write Throughput To External Memory
[Figure: throughput (MB/s, 100-250) vs. number of cores (4-16) and chunk size (1,000-4,000 B)]
• 44. Throughput By Transfer Method

Overview (16 cores, 600 MHz clock), in MB/s:

               eLib   Loop    DMA    DMA    DMA     DMA        Peak
                                   'slow'  sync  'slow',sync  (spec.)
Read DRAM        48     94    131    139    128    136        300/1200
Write DRAM       77    152    143    152    140    149        300/1200
Read (0,3)      270    536    471    501    485    519               —
Write (0,3)    2405   4802   4528   4757   3787   3508            4800
Read col 3     1069   2134   1873   1997   1796   1899               —
Write col 3    7779   7651  16522   9807   9624   7894           19200
Read next      1750   2834   5952   5414   5076   4616               —
Write next     8749   7651  21769   9807  14571   7895         (76800)
Read self      2275   3485   6530   5439   5602   4501               —
Write self     8785   7651  21758   9807  14628   7895         (76800)

According to the errata lists in the Epiphany III/IV data sheets (E16G301 and E64G401), the peak node→eMesh rate is currently limited.
• 45. Raw Read/Write Throughput From/To Column 3 (East)
[Figure only]
• 46. Raw Read/Write Throughput From/To Core (0,3)
[Figure only]
• 47. Results: Measurement Summary, On-Chip Mesh Behaviour
Reading is much slower than writing.
Overload can lead to core starvation.
Overall throughput is maintained on overload.
Cores can receive a constant data stream at peak rate.
Cores can only send significantly below the peak rate (errata).