Scalable Matrix Multiplication for the 16 Core Epiphany Co-Processor
Louis Loizides
May 2nd 2015
Parallella Board
16 core MIMD Epiphany Co-Processor
Zynq ARM processor / FPGA
Image from Adapteva
Epiphany Versions
32 GFLOPS: 16 core Epiphany on Parallella
5 TFLOPS?: 4096 core Epiphany
Graphic from Adapteva
Compiling
Host: *.c -> gcc -> Host prog
Device: *.c -> e-gcc (ELDF) -> *.elf -> e-objcopy -> *.srec (Device prog)
Execution: Host prog -> HAL (Hardware definition file) -> Device prog
Challenges
• Hard to code. Need for very manual memory allocation and management makes complex coding difficult.
• Hard to debug. Epiphany doesn’t share memory with Linux.
• Temperature. After a week of frustration I realized I needed to put a fan over it.
• Documentation. SDK and examples are poor and frequently broken. Few beginner examples. Small community of users.
My “thermal management solution”
Process Synchronization
• Each core runs a process, not a thread
– Every core can run a different process
– “Workgroups” can be created in SDK
• Functions exist in OpenCL, COPRTHR and eSDK for synchronizing processes
– Mutexes only provided between cores
– SDK examples tend to wait for single bits to change for synchronization
• MPI, OpenMP currently not supported for coprocessor
– Some “community” projects in the works… not much of a community though
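The flag-waiting style of synchronization mentioned above can be illustrated with a small host-side analogy. This is only a sketch using Python threads, not the eSDK: the function name `run_flag_barrier` and the `flags`/`data` arrays are hypothetical, and each simulated core owns one flag that it sets when its data is ready, then spins until every other flag has changed.

```python
import threading
import time


def run_flag_barrier(num_cores=4):
    """Simulate cores that signal readiness by setting one flag each."""
    flags = [0] * num_cores      # one "ready" flag per simulated core
    data = [None] * num_cores    # value each core publishes before signalling
    seen = [None] * num_cores    # what each core observed after the barrier

    def core(rank):
        data[rank] = rank * rank   # do some work and publish the result
        flags[rank] = 1            # signal: my data is ready
        while not all(flags):      # spin until every core has signalled
            time.sleep(0)          # yield so the other threads can run
        seen[rank] = list(data)    # everyone's data is now safe to read

    threads = [threading.Thread(target=core, args=(r,)) for r in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Every simulated core should observe the complete `data` list once it leaves the spin loop, which is the property the flag-based barrier is meant to provide.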
Memory Management
• “Shared” DRAM
– Memory allocated specifically for Epiphany using e_alloc
– 160 MB/s (https://parallella.org/forums/viewtopic.php?f=10&t=1978)
• SRAM in each core
– Only 32kB available
– 4 GB/s (1 GB/s in practice per DMA channel)
– Use DMA channel functions to transfer memory between cores
– Can’t use malloc!!! – must keep track manually
– Have to know addresses on other cores you want to send data to
– Must watch out for both code size and stack growth
[Diagram: layout of the 32kB of memory. Labels: Prog.; Stack; (matrix buffers go here… essentially the heap); Debugging]
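Since malloc is unavailable, buffer addresses have to be tracked by hand. A minimal sketch of that bookkeeping follows, written in Python purely for illustration. The class `SramLayout`, the 8 kB code size and the 2 kB stack reserve are all assumptions, not figures from the slides; only the 32 kB total comes from the talk.

```python
SRAM_SIZE = 32 * 1024  # bytes of local memory per core (figure from the slide)


class SramLayout:
    """Hand-rolled bump allocator: program at the bottom, stack at the top."""

    def __init__(self, code_bytes, stack_reserve, size=SRAM_SIZE):
        if code_bytes + stack_reserve > size:
            raise MemoryError("code and stack alone exceed local memory")
        self.next_free = code_bytes          # first byte after the program
        self.limit = size - stack_reserve    # buffers must stay below the stack
        self.buffers = {}                    # name -> (start offset, length)

    def reserve(self, name, nbytes, align=8):
        """Reserve nbytes for a named buffer and return its start offset."""
        start = (self.next_free + align - 1) // align * align
        if start + nbytes > self.limit:
            raise MemoryError(f"no room for buffer {name!r}")
        self.buffers[name] = (start, nbytes)
        self.next_free = start + nbytes
        return start


# Hypothetical sizes: 8 kB of code, 2 kB reserved for the stack.
layout = SramLayout(code_bytes=8 * 1024, stack_reserve=2 * 1024)
block_bytes = 32 * 32 * 4                    # one 32x32 block of 4-byte floats
a_off = layout.reserve("A", block_bytes)
b_off = layout.reserve("B", block_bytes)
c_off = layout.reserve("C", block_bytes)
```

Under these assumed sizes, three 32x32 float blocks fit, while three 48x48 blocks would not; the allocator raises an error instead of silently letting buffers collide with the stack.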
Chip Architecture
• 32kB SRAM per core for program + stack
• ~2 GB/s DMA transfers between cores
• ~150 MB/s to transfer to/from shared DRAM
DMA engine frees up processor
Graphic from Adapteva
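To put the two bandwidth figures in perspective, here is a back-of-the-envelope calculation. It assumes 4-byte single-precision elements and 1 MB = 10^6 bytes; the helper `transfer_seconds` is hypothetical and the result ignores latency and setup cost.

```python
def transfer_seconds(num_bytes, bytes_per_second):
    """Idealized transfer time: size divided by sustained bandwidth."""
    return num_bytes / bytes_per_second


n = 1000                          # matrix side, near the size limit in the talk
matrix_bytes = n * n * 4          # assumed 4-byte floats

dram_time = transfer_seconds(matrix_bytes, 150e6)   # ~150 MB/s shared DRAM
core_time = transfer_seconds(matrix_bytes, 2e9)     # ~2 GB/s core-to-core DMA
ratio = dram_time / core_time

print(f"shared DRAM: {dram_time * 1e3:.1f} ms, inter-core: {core_time * 1e3:.1f} ms")
```

Under these assumptions a single pass over one matrix through shared DRAM takes roughly 13 times longer than moving the same data between cores, which is why each block fetched from DRAM should be reused as much as possible.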
SUMMA/Blocking Implementation
Block matrix
Execute SUMMA on sub-blocks
Each core copies its designated sub-block
Example code - copy sub-blocks from shared DRAM to Epiphany
[Diagram: DRAM <-> Epiphany at 150 MB/s; between Epiphany cores at 2 GB/s]
Note: ~1000x1000 matrix size limitation due to Parallella Linux shared memory size
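The SUMMA step can be sketched as a serial simulation. This is not the author's Epiphany code: the function names are hypothetical, the grid of cores is modelled as nested Python lists, and the row and column broadcasts are modelled as plain reads rather than DMA transfers.

```python
def matmul_add(c, a, b):
    """In place: c += a @ b for square blocks stored as lists of lists."""
    n = len(a)
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]


def split_blocks(m, p):
    """Split an n x n matrix into a p x p grid of (n/p) x (n/p) blocks."""
    bs = len(m) // p
    return [[[row[j * bs:(j + 1) * bs] for row in m[i * bs:(i + 1) * bs]]
             for j in range(p)] for i in range(p)]


def summa(a, b, p):
    """Multiply a @ b on a simulated p x p core grid using SUMMA."""
    n = len(a)
    assert n % p == 0, "matrix side must be divisible by the grid side"
    bs = n // p
    a_blk, b_blk = split_blocks(a, p), split_blocks(b, p)
    c_blk = [[[[0] * bs for _ in range(bs)] for _ in range(p)] for _ in range(p)]
    for k in range(p):                      # one broadcast step per block column
        for i in range(p):
            a_ik = a_blk[i][k]              # core (i, k) broadcasts along grid row i
            for j in range(p):
                b_kj = b_blk[k][j]          # core (k, j) broadcasts down grid column j
                matmul_add(c_blk[i][j], a_ik, b_kj)   # core (i, j) accumulates
    # Stitch the per-core result blocks back into one matrix.
    return [sum((c_blk[i][j][r] for j in range(p)), [])
            for i in range(p) for r in range(bs)]
```

Each simulated core (i, j) only ever holds its own block of C plus one block of A and one of B per step, which is the property that makes the scheme fit in small local memories.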
Results
[Figure: Matrix Multiplication Execution Times. x-axis: Matrix Side Size (0 to 1000); y-axis: Execution Time (s) (0 to 350). Series: Single Epiphany Core, 2x2 Core Grid, 3x3 Core Grid, 4x4 Core Grid, ARM Naive, ARM Blocked]
Epiphany Version   Grid Side Size   Epiphany Time (s)   Speedup vs. Single Core
-                  1                317.2               1
-                  2                80.9                3.92
-                  3                35.43               8.95
E16G3              4                21.5                14.76
E64G4              8                7.7                 41.24
E256G4             16               1.98                160.02
E1KG4              32               0.51                620.96
E4KG4              64               0.13                2409.56
More cores -> Larger Blocks -> Exponentially Less Blocking
[Figure: Speedups vs. Grid Side Size. x-axis: Grid Side Size (1 to 4); y-axis: Speedup (vs. Single Core) (0 to 16). Fitted curve: y = 1.0083 x^1.9562, R² = 0.9995. Legend: Estimated]
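The speedup column and the fitted power law can be recomputed from the four timings for grid sides 1 to 4. The sketch below assumes the curve came from an ordinary least-squares fit in log-log space, which the slides do not state; the function names are hypothetical.

```python
import math

grid_sides = [1, 2, 3, 4]
times = [317.2, 80.9, 35.43, 21.5]   # Epiphany times (s) from the results table


def speedups(ts):
    """Speedup of each run relative to the first (single-core) run."""
    return [ts[0] / t for t in ts]


def power_law_fit(xs, ys):
    """Least-squares fit of y = a * x**b, done as a line fit on log-log data."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b


s = speedups(times)
a, b = power_law_fit(grid_sides, s)
print([round(v, 2) for v in s], round(a, 4), round(b, 4))
```

With these inputs the exponent comes out close to 2, meaning speedup grows roughly with the number of cores (the square of the grid side), in line with the fitted curve on the slide.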
Conclusions
• Potentially powerful device, especially in embedded AI applications with large search spaces
– Needs passive cooling
• 32kB SRAM is extremely limiting
– Needs either L2 cache or just some kind of faster near-chip shared memory
– Really a limitation of the Parallella architecture, not Epiphany
• Incredibly difficult to code
– SDK & documentation need improvement
– Better debugging tools needed ASAP!

Editor's Notes

  1. Epiphany is a co-processor architecture by Adapteva. It’s a matrix of tiny RISC CPUs connected by a communications framework. Unlike other MIMD co-processors (Intel Xeon Phi), everything exists on a single chip. Adapteva generally sells these processors for OEM use. The Parallella board is a dev board for this processor; it raised close to a million on Kickstarter.
  2. The chip provided with the Parallella is 16 core. Adapteva believes this can scale up to 4096 cores, but the only other one they’re producing is 64. The 16 core is 32 GFLOPS. For comparison, a high end i5 mobile processor is around 40-50 GFLOPS.
  3. Need 2 versions of gcc: one for the host and one for Epiphany. The host loads the executable onto Epiphany and starts it.
  4. The Parallella was extremely difficult to develop on.
  5. There are some SDKs to facilitate multi-threading. Better off using Adapteva’s SDK. The problem with MPI and OpenMP is the limited memory in the core.
  6. Very explicit memory management: need to pass an address pointer to each function and increment it. Can’t use malloc to keep track. Need to start at some offset for the program. Stack grows from the end. Need to be very careful about balancing stack vs. heap space. Also need to set some pointers explicitly for DMA transfers.
  7. Adapteva calls this “a network on a chip”. Fast inter-core memory transfers. Very slow transfers to DRAM, so want to work on the largest matrix block possible at a time.
  8. Block distributed among cores, then SUMMA used to perform multiplication. Could potentially require a lot of loops, which means a great deal of overhead.
  9. Pretty much expected. 9-16 cores needed to beat non-blocked multiplication on ARM. ARM is shown as an example, but isn’t a good benchmark.
  10. Speedup from 1 to 16 cores is substantial. More than 16x due to inter-core communication.