The pocl Kernel Compiler
Clay Chang
CPU versus GPU
CPU:
• Sophisticated Control
• Branch Prediction
• Out-of-Order Execution
• Large Cache
GPU:
• Little Control
• No or Limited Branch Prediction
• Simple Execution
• Small or no cache
• Lots of ALUs
OpenCL as the Portable API
Why OpenCL for CPU
• Multi-core CPUs are out there
  • E.g. MediaTek Tri-Cluster 10-core SoC
• Mobile GPU is already busy
  • ~25% occupied by system UI in Android
• Not every program runs well on a GPU
  • Heavy Branch Divergence
• OpenCL makes it easy to exploit multi-core and SIMD
  • Imagine: writing pthread + SIMD in assembly or intrinsics
Running OpenCL Kernels on CPU
• One thread per work-item?
  • Thousands of threads being created
  • Context-switching problems
  • How to synchronize threads?
• How about running one work-group on a CPU thread?
Related Work
• Twin Peaks: a software platform for heterogeneous computing on general-purpose and graphics processors
• MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
• Clover (http://people.freedesktop.org/~steckdenis/clover)
• Shamrock (https://git.linaro.org/gpgpu/shamrock.git)
What is pocl
• POrtable Computing Language
• An efficient implementation of the OpenCL standard which can be easily adapted for new targets
• http://github.com/pocl/pocl
• Main developer: Pekka Jääskeläinen from Tampere University of Technology
• Supported architectures: CPU, tce, cellspu, HSA
• Current version: 0.11
Components in pocl
The pocl Kernel Compiler
[Figure: OpenCL Kernel Source → Clang / LLVM (at clBuildProgram(…)) → Single Work-item Kernel → pocl Kernel Compiler (at clEnqueueNDRangeKernel(…, local_size, …)) → Transformed Kernel]
pocl Compilation Chain
1. Compile the kernel (OpenCL C) with Clang
2. Link with target-specific built-in functions, such as sin, cos, geom_distance, etc.
3. Work-group Function Generation / Parallel Work-item Loops Creation
4. Backend Optimizations (auto-vectorization, …) and CodeGen
Work-group Function Generation
The work-group size comes from clEnqueueNDRangeKernel(…., group_size, ….)
work_group_function() {
    for (int i = 0; i < work_group_size; i++) {   /* WI-loop */
        /* Kernel (single work-item) */
    }
}
What if there are barriers?
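For a kernel without barriers, the WI-loop idea can be sketched in plain C. This is only an illustration, not pocl's generated code: `vec_add_work_item` and `vec_add_work_group` are hypothetical names, and the `local_id` parameter stands in for `get_local_id(0)`.

```c
#include <stddef.h>

/* Hypothetical single work-item kernel body: c[i] = a[i] + b[i].
   local_id stands in for get_local_id(0) (an assumption of this sketch). */
static void vec_add_work_item(size_t local_id,
                              const float *a, const float *b, float *c)
{
    c[local_id] = a[local_id] + b[local_id];
}

/* Work-group function: one call executes every work-item of the group
   sequentially on the calling CPU thread. */
void vec_add_work_group(size_t local_size,
                        const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < local_size; i++) {   /* WI-loop */
        vec_add_work_item(i, a, b, c);
    }
}
```

Because the loop iterations are independent, a compiler backend is free to vectorize or unroll the WI-loop.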
Semantics of barrier Synchronization
OpenCL 1.2 rev19 p.30:
“… the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all…”
if (tid % 2) {
    …
    barrier();   /* reached only by odd work-items: violates the rule above */
    …
}
Kernel Without barriers
• A node in a CFG is a basic block (BB)
• BB: branchless sequence of instructions
• A BB is executed as an entity, from the first instruction to the last
• An edge in a CFG represents a branch in the control flow
• Multiple exit BBs are allowed
• The pocl Kernel Compiler generates a WI-loop around the CFG
Types of Barrier
Unconditional barriers
• A barrier that dominates the exit node
Conditional barriers
• Barriers placed inside
  • if-else
  • for-loop (b-loop)
Kernel with unconditional barriers
• The pocl Kernel Compiler creates WI-loops before and after the barrier
• This forms an algorithm:
Algorithm 1: Parallel region formation when the kernel does not contain conditional barriers.
Step 1: Ensure there is an implicit barrier at the entry and the exit nodes of the kernel function and that there is only one exit node in the kernel function. This is a safe starting condition as it does not affect any execution order restrictions.
Step 2: Perform a depth-first-search traversal of the kernel CFG. Ignore the possible back edges to avoid infinite loops and to include the loops of the kernel in the parallel region.
Step 3: When encountering a barrier, create a parallel region by calling CreateSubgraph for the previously encountered barrier and the newly found barrier.
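The effect of Algorithm 1 on a kernel with one unconditional barrier can be sketched in C as two consecutive WI-loops. This is a hand-written illustration under assumptions, not pocl output: the function name, the `MAX_LOCAL_SIZE` bound, and the use of a stack array in place of `__local` memory are all hypothetical.

```c
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative upper bound (assumption) */

/* Kernel being modelled (OpenCL C, for reference only):
 *     tmp[lid] = in[lid];
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[lid] = tmp[n - 1 - lid];
 */
void reverse_work_group(size_t n, const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];   /* stands in for __local memory */

    if (n > MAX_LOCAL_SIZE)
        return;                /* outside the scope of this sketch */

    /* Parallel region 1: from the implicit entry barrier to the barrier. */
    for (size_t lid = 0; lid < n; lid++)
        tmp[lid] = in[lid];

    /* The barrier itself disappears: finishing the first loop guarantees
       that every work-item has reached it. */

    /* Parallel region 2: from the barrier to the implicit exit barrier. */
    for (size_t lid = 0; lid < n; lid++)
        out[lid] = tmp[n - 1 - lid];
}
```

Splitting the kernel at the barrier keeps the synchronization semantics while still leaving each region as a simple loop over work-items.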
A CFG with Two Conditional barriers
Algorithm 2: Tail duplication for parallel region formation in the case of conditional barriers in the kernel.
Step 1: Perform a depth-first traversal of the CFG, starting at the entry node.
Step 2: Each time a new, unprocessed conditional barrier is found, use CreateSubgraph to produce a sub-CFG from that barrier to the next exit node (duplicate the tail).
Step 3: Replicate the created sub-CFG using ReplicateCFG. In order to reduce code duplication, merge the tails from the same unconditional barrier paths. That is, replicate the basic blocks only after the last barrier that is unconditionally reachable from the one at hand.
Step 4: Start the algorithm at each of the found barrier successors.
A CFG with Two Conditional barriers: After Tail Duplication
Easier for WI-loops creation!
“Peel” the First Loop Iteration
No more ambiguous branches in WI-loops!
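One way to picture tail duplication plus peeling is the C sketch below. It is an assumption-laden illustration, not pocl's actual code: the kernel, the `region0`/`region1` split, and the `region_exit` enum are hypothetical. The first work-item is run on its own to learn which exit of the region is taken; by the barrier rule all other work-items must take the same one.

```c
#include <stddef.h>

enum region_exit { EXIT_BARRIER, EXIT_KERNEL_END };

/* Kernel being modelled (OpenCL C, for reference only):
 *     a[lid] = lid * 2;
 *     if (flag) { barrier(...); b[lid] = a[(lid + 1) % n]; }
 *     c[lid] = a[lid] + 1;          // the "tail"
 */

/* Region 0: from kernel entry to either the conditional barrier or,
   through one copy of the tail, to the kernel exit. */
static enum region_exit region0(size_t lid, int flag, int *a, int *c)
{
    a[lid] = (int)lid * 2;
    if (flag)
        return EXIT_BARRIER;
    c[lid] = a[lid] + 1;             /* tail copy on the barrier-free path */
    return EXIT_KERNEL_END;
}

/* Region 1: code after the barrier, with its own duplicated tail. */
static void region1(size_t lid, size_t n, const int *a, int *b, int *c)
{
    b[lid] = a[(lid + 1) % n];
    c[lid] = a[lid] + 1;             /* tail copy on the barrier path */
}

void cond_barrier_work_group(size_t n, int flag, int *a, int *b, int *c)
{
    if (n == 0)
        return;

    /* Peeled first iteration: work-item 0 decides which exit is taken. */
    enum region_exit taken = region0(0, flag, a, c);

    /* Remaining work-items follow the same path through region 0. */
    for (size_t lid = 1; lid < n; lid++)
        (void)region0(lid, flag, a, c);

    /* Only the barrier path continues into a second WI-loop. */
    if (taken == EXIT_BARRIER) {
        for (size_t lid = 0; lid < n; lid++)
            region1(lid, n, a, b, c);
    }
}
```

After the peeled iteration, the choice of the next parallel region is a single branch made once per work-group rather than once per work-item.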
Barriers in Kernel Loops
Insert implicit barriers at:
1. The end of the loop pre-header block
2. Before the loop latch branch
3. After the PhiNode region of the loop header block
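The result of treating a barrier loop (b-loop) this way can be sketched in C: the kernel loop becomes an outer loop executed once per work-group, with WI-loops for the regions between barriers inside it. This is a simplified illustration under assumptions: the kernel, the function name, and `MAX_LOCAL_SIZE` are hypothetical, and the loop counter is kept as a single shared variable because it is the same for every work-item.

```c
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative upper bound (assumption) */

/* Kernel being modelled (OpenCL C, for reference only):
 *     for (int k = 0; k < steps; k++) {
 *         tmp[lid] = buf[lid] + buf[(lid + 1) % n];
 *         barrier(CLK_LOCAL_MEM_FENCE);
 *         buf[lid] = tmp[lid];
 *         barrier(CLK_LOCAL_MEM_FENCE);
 *     }
 */
void bloop_work_group(size_t n, int steps, int *buf)
{
    int tmp[MAX_LOCAL_SIZE];   /* stands in for __local memory */

    if (n > MAX_LOCAL_SIZE)
        return;                /* outside the scope of this sketch */

    for (int k = 0; k < steps; k++) {          /* b-loop, once per group */
        /* Parallel region between loop header and first barrier. */
        for (size_t lid = 0; lid < n; lid++)
            tmp[lid] = buf[lid] + buf[(lid + 1) % n];

        /* Parallel region between first barrier and the loop latch. */
        for (size_t lid = 0; lid < n; lid++)
            buf[lid] = tmp[lid];
    }
}
```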
Horizontal Inner-Loop Parallelization
More parallelization after loop interchange
blockWidth unknown until runtime
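A minimal C sketch of the interchange, assuming a hypothetical kernel in which each work-item sums `block_width` values (the kernel, function names, and data layout are illustrative, not taken from the slides). In the first version the WI-loop is outermost; in the second it is innermost, so the innermost trip count is the known work-group size even though `block_width` is only known at run time.

```c
#include <stddef.h>

/* Kernel being modelled (OpenCL C, for reference only):
 *     int acc = 0;
 *     for (k = 0; k < block_width; k++) acc += in[k * n + lid];
 *     out[lid] = acc;
 */

/* Before interchange: WI-loop outside, kernel loop inside. */
void sum_wi_outer(size_t n, size_t block_width, const int *in, int *out)
{
    for (size_t lid = 0; lid < n; lid++) {
        int acc = 0;
        for (size_t k = 0; k < block_width; k++)
            acc += in[k * n + lid];
        out[lid] = acc;
    }
}

/* After interchange: the kernel loop is outside and the WI-loop is
   innermost, iterating over consecutive elements across work-items. */
void sum_wi_inner(size_t n, size_t block_width, const int *in, int *out)
{
    for (size_t lid = 0; lid < n; lid++)
        out[lid] = 0;

    for (size_t k = 0; k < block_width; k++)
        for (size_t lid = 0; lid < n; lid++)
            out[lid] += in[k * n + lid];
}
```

Both functions compute the same result; the second form gives a vectorizer a contiguous inner loop across work-items.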
Handling of Kernel Variables
1. There will be two parallel regions
2. a's lifetime is only in the first parallel region (it's a temporary variable)
3. b's lifetime spans both parallel regions
Context Array
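The context-array idea can be sketched in C as follows. The kernel and all names are hypothetical and only mirror the a/b situation described above: `a` stays an ordinary local because it dies inside the first parallel region, while `b` gets one slot per work-item because it is read after the barrier.

```c
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative upper bound (assumption) */

/* Kernel being modelled (OpenCL C, for reference only):
 *     int a = in[lid] * 2;          // temporary, dead before the barrier
 *     int b = a + 1;                // live across the barrier
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[lid] = b + in[(lid + 1) % n];
 */
void context_array_work_group(size_t n, const int *in, int *out)
{
    int b_ctx[MAX_LOCAL_SIZE];   /* context array: one copy of b per work-item */

    if (n > MAX_LOCAL_SIZE)
        return;                  /* outside the scope of this sketch */

    /* Parallel region 1: a is a plain scalar, b is saved per work-item. */
    for (size_t lid = 0; lid < n; lid++) {
        int a = in[lid] * 2;
        b_ctx[lid] = a + 1;
    }

    /* Parallel region 2: each work-item restores its own b. */
    for (size_t lid = 0; lid < n; lid++)
        out[lid] = b_ctx[lid] + in[(lid + 1) % n];
}
```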
References
• Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg: "pocl: A Performance-Portable OpenCL Implementation", International Journal of Parallel Programming, Springer, August 2014.
• http://github.com/pocl/pocl

More Related Content

What's hot

A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFoholiab
 
Sisteme de Operare: Memorie virtuala
Sisteme de Operare: Memorie virtualaSisteme de Operare: Memorie virtuala
Sisteme de Operare: Memorie virtualaAlexandru Radovici
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machineAlexei Starovoitov
 
Performance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux KernelPerformance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux Kernellcplcp1
 
Continguous Memory Allocator in the Linux Kernel
Continguous Memory Allocator in the Linux KernelContinguous Memory Allocator in the Linux Kernel
Continguous Memory Allocator in the Linux KernelKernel TLV
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and moreBrendan Gregg
 
BPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLabBPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLabTaeung Song
 
Understanding a kernel oops and a kernel panic
Understanding a kernel oops and a kernel panicUnderstanding a kernel oops and a kernel panic
Understanding a kernel oops and a kernel panicJoseph Lu
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringScyllaDB
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingViller Hsiao
 
Multi Processors And Multi Computers
 Multi Processors And Multi Computers Multi Processors And Multi Computers
Multi Processors And Multi ComputersNemwos
 
Chorus - Distributed Operating System [ case study ]
Chorus - Distributed Operating System [ case study ]Chorus - Distributed Operating System [ case study ]
Chorus - Distributed Operating System [ case study ]Akhil Nadh PC
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecturehugo lu
 
Operating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsOperating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsPeter Tröger
 
Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...Adrian Huang
 

What's hot (20)

Cache memory
Cache memoryCache memory
Cache memory
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
 
Character drivers
Character driversCharacter drivers
Character drivers
 
Sisteme de Operare: Memorie virtuala
Sisteme de Operare: Memorie virtualaSisteme de Operare: Memorie virtuala
Sisteme de Operare: Memorie virtuala
 
Cache
CacheCache
Cache
 
BPF - in-kernel virtual machine
BPF - in-kernel virtual machineBPF - in-kernel virtual machine
BPF - in-kernel virtual machine
 
Performance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux KernelPerformance Analysis Tools for Linux Kernel
Performance Analysis Tools for Linux Kernel
 
Linux device drivers
Linux device drivers Linux device drivers
Linux device drivers
 
Notes on NUMA architecture
Notes on NUMA architectureNotes on NUMA architecture
Notes on NUMA architecture
 
Continguous Memory Allocator in the Linux Kernel
Continguous Memory Allocator in the Linux KernelContinguous Memory Allocator in the Linux Kernel
Continguous Memory Allocator in the Linux Kernel
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and more
 
BPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLabBPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLab
 
Understanding a kernel oops and a kernel panic
Understanding a kernel oops and a kernel panicUnderstanding a kernel oops and a kernel panic
Understanding a kernel oops and a kernel panic
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
Meet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracingMeet cute-between-ebpf-and-tracing
Meet cute-between-ebpf-and-tracing
 
Multi Processors And Multi Computers
 Multi Processors And Multi Computers Multi Processors And Multi Computers
Multi Processors And Multi Computers
 
Chorus - Distributed Operating System [ case study ]
Chorus - Distributed Operating System [ case study ]Chorus - Distributed Operating System [ case study ]
Chorus - Distributed Operating System [ case study ]
 
The linux networking architecture
The linux networking architectureThe linux networking architecture
The linux networking architecture
 
Operating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsOperating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management Concepts
 
Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...
 

Similar to The pocl Kernel Compiler

SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMUSFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMULinaro
 
Adding a BOLT pass
Adding a BOLT passAdding a BOLT pass
Adding a BOLT passAmir42407
 
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulBetter Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulHostedbyConfluent
 
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulBetter Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulHostedbyConfluent
 
Share the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardShare the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardJian-Hong Pan
 
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV Cluster
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV ClusterMethod of NUMA-Aware Resource Management for Kubernetes 5G NFV Cluster
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV Clusterbyonggon chun
 
Not breaking userspace: the evolving Linux ABI
Not breaking userspace: the evolving Linux ABINot breaking userspace: the evolving Linux ABI
Not breaking userspace: the evolving Linux ABIAlison Chaiken
 
Softcore processor.pptxSoftcore processor.pptxSoftcore processor.pptx
Softcore processor.pptxSoftcore processor.pptxSoftcore processor.pptxSoftcore processor.pptxSoftcore processor.pptxSoftcore processor.pptx
Softcore processor.pptxSoftcore processor.pptxSoftcore processor.pptxSnehaLatha68
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
Control Flow Analysis
Control Flow AnalysisControl Flow Analysis
Control Flow AnalysisEdgar Barbosa
 
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander KukushkinPGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander KukushkinEqunix Business Solutions
 
Developer's Guide to Knights Landing
Developer's Guide to Knights LandingDeveloper's Guide to Knights Landing
Developer's Guide to Knights LandingAndrey Vladimirov
 
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesJeff Larkin
 
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo..."Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...Yandex
 
IRQs: the Hard, the Soft, the Threaded and the Preemptible
IRQs: the Hard, the Soft, the Threaded and the PreemptibleIRQs: the Hard, the Soft, the Threaded and the Preemptible
IRQs: the Hard, the Soft, the Threaded and the PreemptibleAlison Chaiken
 
Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Deepak Kumar
 
Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021Jian-Hong Pan
 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack eurobsdcon
 
淺談 Live patching technology
淺談 Live patching technology淺談 Live patching technology
淺談 Live patching technologySZ Lin
 
Achieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsAchieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsJiannan Ouyang, PhD
 

Similar to The pocl Kernel Compiler (20)

SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMUSFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
SFO15-202: Towards Multi-Threaded Tiny Code Generator (TCG) in QEMU
 
Adding a BOLT pass
Adding a BOLT passAdding a BOLT pass
Adding a BOLT pass
 
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulBetter Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
 
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulBetter Kafka Performance Without Changing Any Code | Simon Ritter, Azul
Better Kafka Performance Without Changing Any Code | Simon Ritter, Azul
 
Share the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development BoardShare the Experience of Using Embedded Development Board
Share the Experience of Using Embedded Development Board
 
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV Cluster
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV ClusterMethod of NUMA-Aware Resource Management for Kubernetes 5G NFV Cluster
Method of NUMA-Aware Resource Management for Kubernetes 5G NFV Cluster
 
Not breaking userspace: the evolving Linux ABI
Not breaking userspace: the evolving Linux ABINot breaking userspace: the evolving Linux ABI
Not breaking userspace: the evolving Linux ABI
 
Softcore processor.pptxSoftcore processor.pptxSoftcore processor.pptx
Softcore processor.pptxSoftcore processor.pptxSoftcore processor.pptxSoftcore processor.pptxSoftcore processor.pptxSoftcore processor.pptx
Softcore processor.pptxSoftcore processor.pptxSoftcore processor.pptx
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Control Flow Analysis
Control Flow AnalysisControl Flow Analysis
Control Flow Analysis
 
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander KukushkinPGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
 
Developer's Guide to Knights Landing
Developer's Guide to Knights LandingDeveloper's Guide to Knights Landing
Developer's Guide to Knights Landing
 
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC DirectivesFortranCon2020: Highly Parallel Fortran and OpenACC Directives
FortranCon2020: Highly Parallel Fortran and OpenACC Directives
 
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo..."Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
 
IRQs: the Hard, the Soft, the Threaded and the Preemptible
IRQs: the Hard, the Soft, the Threaded and the PreemptibleIRQs: the Hard, the Soft, the Threaded and the Preemptible
IRQs: the Hard, the Soft, the Threaded and the Preemptible
 
Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)Implementation of Soft-core processor on FPGA (Final Presentation)
Implementation of Soft-core processor on FPGA (Final Presentation)
 
Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021Let's trace Linux Lernel with KGDB @ COSCUP 2021
Let's trace Linux Lernel with KGDB @ COSCUP 2021
 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack
 
淺談 Live patching technology
淺談 Live patching technology淺談 Live patching technology
淺談 Live patching technology
 
Achieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-KernelsAchieving Performance Isolation with Lightweight Co-Kernels
Achieving Performance Isolation with Lightweight Co-Kernels
 

Recently uploaded

%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyviewmasabamasaba
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024VictoriaMetrics
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Hararemasabamasaba
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...masabamasaba
 

Recently uploaded (20)

Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 

The pocl Kernel Compiler

  • 1. The pocl Kernel Compiler Clay Chang
  • 2. CPU versus GPU • Sophiscated Control • Branch Prediction • Out-of-Order Execution • Large Cache • Little Control • No or Limited Branch Prediction • Simple Execution • Small or no cache • Lots of ALUs
  • 3. OpenCL as the Portable API
  • 4. Why OpenCL for CPU • Multi-core CPUs are out there • E.g. MediaTek Tri-Cluster 10-core SoC • The mobile GPU is already busy • ~25% occupied by the system UI in Android • Not every program runs well on a GPU • Heavy branch divergence • OpenCL makes it easy to exploit multi-core and SIMD • Imagine: writing pthread + SIMD in assembly or intrinsics
  • 5. Running OpenCL Kernels on CPU • One thread per work-item? • Thousands of threads being created • Context-switching problems • How to synchronize threads? • How about running one work-group on a CPU thread?
  • 6. Related Work • Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors • MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs • Clover (http://people.freedesktop.org/~steckdenis/clover) • Shamrock (https://git.linaro.org/gpgpu/shamrock.git)
  • 7. What is pocl • POrtable Computing Language • An efficient implementation of the OpenCL standard which can be easily adapted for new targets • http://github.com/pocl/pocl • Main developer: Pekka Jääskeläinen from Tampere University of Technology • Supported architectures: CPU, tce, cellspu, HSA • Current version: 0.11
  • 9. The pocl Kernel Compiler OpenCL Kernel Source Clang / LLVM pocl Kernel Compiler clBuildProgram(…) clEnqueueNDRangeKernel (…, local_size, …) Single Work-item Kernel Transformed Kernel
  • 10. pocl Compilation Chain • Step 1: Compile the kernel (OpenCL C) with Clang • Step 2: Link with target-specific built-in functions, such as sin, cos, geom_distance, etc… • Step 3: Work-group function generation / parallel work-item loop creation • Step 4: Backend optimizations (auto-vecs, …) and CodeGen
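  • 11. Work-group Function Generation • The single work-item kernel is wrapped in a work-item loop (WI-loop) whose trip count is the group size passed to clEnqueueNDRangeKernel(…., group_size, ….): Work-group_function() { for (int i = 0; i < work-group_size; i++) { kernel (single work-item) } } • What if there are barriers?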
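A minimal C sketch of the work-item loop described on slide 11, assuming a one-dimensional work-group. The names `scale_kernel` and `scale_work_group`, and the explicit `local_id` parameter, are hypothetical and only illustrate the shape of the transformation; this is not the code pocl actually generates.

```c
#include <stddef.h>

/* Hypothetical single work-item kernel body. In OpenCL C the index would
   come from get_local_id(0); here it is passed in explicitly. */
static void scale_kernel(float *data, float factor, size_t local_id)
{
    data[local_id] *= factor;
}

/* Work-group function: one CPU thread executes every work-item of the
   group by iterating over the local ids (the WI-loop). */
void scale_work_group(float *data, float factor, size_t work_group_size)
{
    for (size_t i = 0; i < work_group_size; i++) {
        scale_kernel(data, factor, i);
    }
}
```

Calling `scale_work_group(data, 2.0f, 4)` on a four-element array doubles each element, which is what four independent work-items would have done.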
  • 12. Semantics of barrier Synchronization • OpenCL 1.2 rev19 p.30: “… the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all…” • Example of a barrier that only some work-items would reach, which this rule forbids: if (tid % 2) { …. barrier(); … }
  • 13. Kernel Without barriers • A node in a CFG is a basic block (BB) • BB: a branchless sequence of instructions • A BB is executed as a unit, from the first instruction to the last • An edge in a CFG represents a branch in the control flow • Multiple exit BBs are allowed • The pocl kernel compiler generates a WI-loop around the CFG
  • 14. Types of Barrier • Unconditional barriers: a barrier that dominates the exit node • Conditional barriers: barriers placed inside an if-else or a for-loop (b-loop)
  • 15. Kernel with unconditional barriers
      • The pocl kernel compiler creates WI-loops before and after the barrier.
      • Algorithm 1: Parallel region formation when the kernel does not contain conditional barriers.
          Step 1: Ensure there is an implicit barrier at the entry and the exit nodes of the kernel function and that there is only one exit node in the kernel function. This is a safe starting condition as it does not affect any execution order restrictions.
          Step 2: Perform a depth-first-search traversal of the kernel CFG. Ignore the possible back edges to avoid infinite loops and to include the loops of the kernel in the parallel region.
          Step 3: When encountering a barrier, create a parallel region by calling CreateSubgraph for the previously encountered barrier and the newly found barrier.
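The effect of Algorithm 1 on a kernel with one unconditional barrier can be sketched in C as two back-to-back WI-loops, one per parallel region. The kernel shown in the comment and the name `rotate_work_group` are hypothetical examples, and the sketch assumes a one-dimensional work-group of size `n`.

```c
#include <stddef.h>

/* Hypothetical kernel, written per work-item in OpenCL C terms:
     tmp[lid] = in[lid] * 2;
     barrier(CLK_LOCAL_MEM_FENCE);
     out[lid] = tmp[(lid + 1) % n];
   The barrier splits the body into two parallel regions. */
void rotate_work_group(const int *in, int *tmp, int *out, size_t n)
{
    /* Parallel region 1: everything before the barrier. */
    for (size_t lid = 0; lid < n; lid++) {
        tmp[lid] = in[lid] * 2;
    }

    /* The barrier is implicit here: region 1 has completed for all
       work-items before region 2 starts. */

    /* Parallel region 2: everything after the barrier. */
    for (size_t lid = 0; lid < n; lid++) {
        out[lid] = tmp[(lid + 1) % n];
    }
}
```

A single WI-loop around the whole body would read `tmp[(lid + 1) % n]` before the neighbouring work-item had written it; splitting at the barrier preserves the required ordering.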
  • 16. A CFG with Two Conditional barriers
      • Algorithm 2: Tail duplication for parallel region formation in the case of conditional barriers in the kernel.
          Step 1: Perform a depth-first traversal of the CFG, starting at the entry node.
          Step 2: Each time a new, unprocessed conditional barrier is found, use CreateSubgraph to produce a sub-CFG from that barrier to the next exit node (duplicate the tail).
          Step 3: Replicate the created sub-CFG using ReplicateCFG. In order to reduce code duplication, merge the tails from the same unconditional barrier paths. That is, replicate the basic blocks only after the last barrier that is unconditionally reachable from the one at hand.
          Step 4: Start the algorithm at each of the found barrier successors.
  • 17. A CFG with Two Conditional barriers, After Tail Duplication • WI-loop creation becomes easier • [Figure: the CFG after tail duplication; branches marked "?" are still ambiguous]
  • 18. “Peel” the First Loop Iteration • No more ambiguous branches in WI-loops!
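A rough C sketch of peeling, under the assumption that the branch condition is uniform across the work-group (as the barrier rule on slide 12 requires). The kernel in the comment and the names `region_before_branch` and `peeled_work_group` are hypothetical; the point is only that the first work-item is run outside the loop to learn which successor region the whole group must execute.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical kernel, per work-item:
     x[lid] += 1;
     if (*flag) { barrier(...); y[lid] = x[(lid + 1) % n]; }
     else       { y[lid] = -1; }
   This helper is the region before the branch; it returns the decision. */
static bool region_before_branch(int *x, const int *flag, size_t lid)
{
    x[lid] += 1;
    return *flag != 0;
}

void peeled_work_group(int *x, int *y, const int *flag, size_t n)
{
    if (n == 0) {
        return;
    }

    /* Peeled first iteration: work-item 0 decides which path is taken. */
    bool take_barrier_path = region_before_branch(x, flag, 0);

    /* Remaining work-items run the same region; by the barrier rule their
       decision must agree with work-item 0, so it is not checked again. */
    for (size_t lid = 1; lid < n; lid++) {
        (void)region_before_branch(x, flag, lid);
    }

    if (take_barrier_path) {
        /* Region after the conditional barrier. */
        for (size_t lid = 0; lid < n; lid++) {
            y[lid] = x[(lid + 1) % n];
        }
    } else {
        /* Duplicated tail without a barrier. */
        for (size_t lid = 0; lid < n; lid++) {
            y[lid] = -1;
        }
    }
}
```

Because the decision is known after the peeled iteration, each WI-loop has a single, unambiguous successor region.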
  • 19. Barriers in Kernel Loops • Insert implicit barriers at: 1. The end of the loop pre-header block 2. Before the loop latch branch 3. After the PhiNode region of the loop header block
  • 20. Horizontal Inner-Loop Parallelization • More parallelization after loop interchange • blockWidth is unknown until runtime
  • 21. Handling of Kernel Variables 1. There will be two parallel regions 2. a's lifetime is only in the first parallel region (it's a temporary variable) 3. B's lifetime spans both parallel regions, so it is kept in a context array
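A small C sketch of what a kernel loop containing a barrier (a b-loop) can look like once the implicit barriers are in place: the kernel loop runs once per work-group and the WI-loops move inside it. It assumes the trip count `steps` is the same for every work-item; the kernel and the name `rotate_steps_work_group` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical kernel, per work-item:
     for (k = 0; k < steps; k++) {
         tmp[lid] = a[(lid + 1) % n];
         barrier(...);
         a[lid] = tmp[lid];
         barrier(...);   // corresponds to the barrier before the latch
     } */
void rotate_steps_work_group(int *a, int *tmp, size_t n, int steps)
{
    for (int k = 0; k < steps; k++) {          /* kernel loop (b-loop) */
        /* Parallel region from the loop header to the barrier. */
        for (size_t lid = 0; lid < n; lid++) {
            tmp[lid] = a[(lid + 1) % n];
        }
        /* Parallel region from the barrier to the loop latch. */
        for (size_t lid = 0; lid < n; lid++) {
            a[lid] = tmp[lid];
        }
    }
}
```

Each iteration rotates the array left by one position, and no work-item starts iteration `k + 1` before all have finished iteration `k`.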
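The loop interchange mentioned on slide 20 can be illustrated in C as follows. This is a sketch under the assumption that the inner kernel loop has a uniform trip count `block_width` known only at run time; the function names and the column-sum kernel are hypothetical.

```c
#include <stddef.h>

/* Before: the WI-loop is outermost, and each work-item runs its own
   inner kernel loop over k. */
void column_sum_wi_outer(const float *m, float *acc, size_t n,
                         size_t block_width)
{
    for (size_t lid = 0; lid < n; lid++) {
        for (size_t k = 0; k < block_width; k++) {
            acc[lid] += m[k * n + lid];
        }
    }
}

/* After: the kernel loop is outermost and the WI-loop is innermost, so
   consecutive iterations touch adjacent elements and are independent. */
void column_sum_wi_inner(const float *m, float *acc, size_t n,
                         size_t block_width)
{
    for (size_t k = 0; k < block_width; k++) {
        for (size_t lid = 0; lid < n; lid++) {
            acc[lid] += m[k * n + lid];
        }
    }
}
```

Both versions compute the same sums; the second exposes a contiguous innermost loop across work-items, which is generally easier for an auto-vectorizer to handle.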
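A C sketch of the context-array idea, assuming a fixed upper bound on the work-group size. The kernel in the comment, the variable names `a` and `b`, the constant `MAX_WORK_GROUP_SIZE`, and the function `context_work_group` are all hypothetical: a value that is only used inside one parallel region stays a plain local, while a value that is live across the barrier gets one slot per work-item.

```c
#include <stddef.h>

enum { MAX_WORK_GROUP_SIZE = 256 };  /* assumed limit for this sketch */

/* Hypothetical kernel, per work-item:
     int a = in[lid] * 2;        // temporary, first region only
     int b = a + 1;              // live across the barrier
     scratch[lid] = a;
     barrier(...);
     out[lid] = b + scratch[(lid + 1) % n];
   Returns 0 on success, -1 if n exceeds the assumed limit. */
int context_work_group(const int *in, int *scratch, int *out, size_t n)
{
    int b_context[MAX_WORK_GROUP_SIZE];  /* context array for b */

    if (n > MAX_WORK_GROUP_SIZE) {
        return -1;
    }

    /* Parallel region 1: a is an ordinary local of the loop body. */
    for (size_t lid = 0; lid < n; lid++) {
        int a = in[lid] * 2;
        b_context[lid] = a + 1;   /* save b for the next region */
        scratch[lid] = a;
    }

    /* Parallel region 2: b is restored from the context array. */
    for (size_t lid = 0; lid < n; lid++) {
        out[lid] = b_context[lid] + scratch[(lid + 1) % n];
    }
    return 0;
}
```

Keeping temporaries out of the context array avoids extra memory traffic and leaves them free to live in registers.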
  • 22. References • Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg: "pocl: A Performance-Portable OpenCL Implementation", International Journal of Parallel Programming, Springer, August 2014. • http://github.com/pocl/pocl

Editor's Notes

  1. A, B, and D form a parallel region, and from B there is a branch into the middle of the work-item loop of another parallel region (ABEHI). If at least one work-item takes the branch after B that can lead to a barrier, the rest of the work-items must follow, so the first loop iteration is peeled.