5. Challenges
• Hard to code. Fully manual memory allocation and management makes complex coding difficult.
• Hard to debug. The Epiphany doesn't share memory with Linux.
• Temperature. After a week of frustration I realized I needed to put a fan over it.
• Documentation. The SDK and examples are poor and frequently broken, with few beginner examples and a small community of users.
My “thermal management solution”
6. Process Synchronization
• Each core runs a process, not a thread
– Every core can run a different process
– "Workgroups" can be created in the SDK
• Functions exist in OpenCL, COPRTHR, and the eSDK for synchronizing processes
– Mutexes are only provided between cores
– SDK examples tend to busy-wait on single flag bits for synchronization
• MPI and OpenMP are currently not supported for the coprocessor
– Some "community" projects are in the works… not much of a community, though
7. Memory Management
• "Shared" DRAM
– Memory allocated specifically for the Epiphany using e_alloc
– ~160 MB/s (https://parallella.org/forums/viewtopic.php?f=10&t=1978)
• SRAM in each core
– Only 32 kB available
– 4 GB/s (1 GB/s in practice per DMA channel)
– Use the DMA channel functions to transfer memory between cores
– Can't use malloc! Allocations must be tracked manually
– Have to know the addresses on other cores you want to send data to
– Must watch out for both code size and stack growth
Figure: layout of the 32 kB core memory — program image, then matrix buffers (essentially the heap), with the stack growing from the end.
8. Chip Architecture
• 32 kB SRAM per core for program + stack
• ~2 GB/s DMA transfers between cores
• ~150 MB/s to transfer to/from shared DRAM
Figure (graphic from Adapteva): the DMA engine frees up the processor.
9. SUMMA/Blocking Implementation
Figure: the matrix is split into blocks; each core copies its designated sub-block, then SUMMA is executed on the sub-blocks.
Figure: example code copying sub-blocks from shared DRAM (~150 MB/s) to the Epiphany (~2 GB/s core-to-core). Note: the ~1000x1000 matrix size limitation is due to the Parallella Linux shared-memory size.
10. Results
Figure: matrix multiplication execution time (s) vs. matrix side size (0–1000) for a single Epiphany core, 2x2 / 3x3 / 4x4 core grids, naive ARM, and blocked ARM.
11. Epiphany Scaling

| Epiphany Version | Grid Side Size | Time (s) | Speedup vs. Single Core |
|------------------|----------------|----------|-------------------------|
| E16G3            | 1              | 317.2    | 1                       |
| E16G3            | 2              | 80.9     | 3.92                    |
| E16G3            | 3              | 35.43    | 8.95                    |
| E16G3            | 4              | 21.5     | 14.76                   |
| E64G4            | 8              | 7.7      | 41.24                   |
| E256G4           | 16             | 1.98     | 160.02                  |
| E1KG4            | 32             | 0.51     | 620.96                  |
| E4KG4            | 64             | 0.13     | 2409.56                 |

(Grid sizes above 4 are estimates; only the 16-core E16G3 was available for measurement.)
More cores -> Larger Blocks -> Exponentially Less Blocking
Figure: speedup (vs. single core) against grid side size 1–4, measured and estimated; power-law fit y = 1.0083x^1.9562, R² = 0.9995.
12. Conclusions
• Potentially powerful device, especially in embedded AI
applications with large search spaces
– Needs passive cooling
• 32kB SRAM is extremely limiting
– Needs either L2 cache or just some kind of faster near-chip
shared memory
– Really a limitation of the Parallella architecture, not the Epiphany
• Incredibly difficult to code
– SDK & Documentation needs improvement
– Better debugging tools needed ASAP!
Editor's Notes
Epiphany is a co-processor architecture by Adapteva
It’s a matrix of tiny RISC CPUs connected by a communications framework
Unlike other MIMD co-processors (Intel Xeon Phi) everything exists on a single chip
Adapteva generally sells these processors for OEM use
The Parallella board is a dev board for this processor – raised close to a million on Kickstarter
The chip provided with the Parallella is 16 core
Adapteva believes this can scale up to 4096 cores, but the only other one they’re producing is 64
The 16 core is 32 GFLOPS
For comparison, a high end i5 mobile processor is around 40-50 GFLOPS
Need 2 versions of gcc – one for host and one for Epiphany
Host loads executable onto Epiphany and starts it
The Parallella was extremely difficult to develop on
There are some SDKs to facilitate multi-threading
Better off using Adapteva's SDK
The problem with MPI and OpenMP is the limited memory in the core
Very explicit memory management – need to pass address pointer to each function and increment
Can’t use malloc to keep track
Need to start at some offset for the program
Stack grows from the end
Need to be very careful about balancing stack vs. heap space
Also need to set some pointers explicitly for DMA transfers
Adapteva calls this “a network on a chip”
Fast inter-core memory transfers
Very slow transfers to DRAM – want to work on largest matrix block possible at a time
Block distributed among cores, then SUMMA used to perform multiplication
Could potentially require a lot of loops – great deal of overhead
Pretty much expected
9-16 cores needed to beat non-blocked multiplication on ARM
ARM is shown as an example, but isn’t a good benchmark
Speedup from 1 to 16 cores is substantial
Slightly less than 16x due to inter-core communication