SlideShare a Scribd company logo
1 of 20
Download to read offline
Intel Open-source SYCL compiler project
Konstantin Bobrovskii, Intel
2019 Khronos Embedded Vision Summit
Agenda
SYCL intro
Intel Open-Source SYCL project
Q&A
2
SYCL Intro
3
Intel confidential 4
What is SYCL?
Single-source heterogeneous programming using STANDARD C++
 Use C++ templates and lambda functions for host & device code
Aligns the hardware acceleration of OpenCL with direction of the C++ standard
C++ Kernel Language
Low Level Control
‘GPGPU’-style separation of
device-side kernel source
code and host code
Single-source C++
Programmer Familiarity
Approach also taken by
C++ AMP and OpenMP
Developer Choice
The development of the two specifications are aligned so
code can be easily shared between the two approaches
5
Reactive to OpenCL Pros and Cons:
• OpenCL has a well-defined, portable
execution model.
• OpenCL is too verbose for many
application developers.
• OpenCL remains a C API and only
recently supported C++ kernels.
• Just-in-time compilation model and
disjoint source code is awkward and
contrary to HPC usage models.
Proactive about Future C++:
• SYCL is based on purely modern C++
and should feel familiar to C++11 users.
• SYCL expected to run ahead of C++Next
regarding heterogeneity and
parallelism. ISO C++ of tomorrow may
look a lot like SYCL.
• Not held back by C99 or C++03
compatibility goals.
Why SYCL? Reactive and Proactive Motivation:
6
nstream.cl
__kernel void nstream(
int length,
double s,
__global double * A,
__global double * B,
__global double * C)
{
int i = get_global_id(0);
A[i] += B[i] + s * C[i];
}
nstream.cpp
cl::Context gpu(CL_DEVICE_TYPE_GPU, &err);
cl::CommandQueue queue(gpu);
cl::Program program(gpu,
prk::opencl::loadProgram(“nstream.cl”), true);
auto kernel = cl::make_kernel<int, double, cl::Buffer,
cl::Buffer, cl::Buffer>(program, “nstream”, &err);
auto d_a = cl::Buffer(gpu, begin(h_a), end(h_a));
auto d_b = cl::Buffer(gpu, begin(h_b), end(h_b));
auto d_c = cl::Buffer(gpu, begin(h_c), end(h_c));
kernel(cl::EnqueueArgs(queue, cl::NDRange(length)),
length, scalar, d_a, d_b, d_c);
queue.finish();
OpenCL Example (w/ C++ Wrappers):
Intel confidential 7
SYCL Example
sycl::gpu_selector device_selector;
sycl::queue q(device_selector);
sycl::buffer<double> d_A { h_A.data(), h_A.size() };
sycl::buffer<double> d_B { h_B.data(), h_B.size() };
sycl::buffer<double> d_C { h_C.data(), h_C.size() };
q.submit([&](sycl::handler& h)
{
auto A = d_A.get_access<sycl::access::mode::read_write>(h);
auto B = d_B.get_access<sycl::access::mode::read>(h);
auto C = d_C.get_access<sycl::access::mode::read>(h);
h.parallel_for<class nstream>(sycl::range<1>{length}, [=] (sycl::item<1> i) {
A[i] += B[i] + s * C[i];
});
});
q.wait();
Retains OpenCL’s ability to easily target
different devices in the same thread.
Parallel loops are explicit like C++ vs. implicit in OpenCL.
Kernel code does not need to live in a separate part of the program.
Accessors create DAG to trigger data
movement and represent execution
dependencies.
Intel confidential 8
SYCL Example: Graph of Asynchronous Executions
myQueue.submit([&](handler& cgh) {
auto A = a.get_access<access::mode::read>(cgh);
auto B = b.get_access<access::mode::read>(cgh);
auto C = c.get_access<access::mode::discardwrite>(cgh);
cgh.parallel_for<class add1>( range<2>{N, M},
[=](id<2> index) { C[index] = A[index] + B[index]; });
});
myQueue.submit([&](handler& cgh) {
auto A = a.get_access<access::mode::read>(cgh);
auto C = c.get_access<access::mode::read>(cgh);
auto D = d.get_access<access::mode::write>(cgh);
cgh.parallel_for<class add2>( range<2>{P, Q},
[=](id<2> index) { D[index] = A[index] + C[index]; });
});
myQueue.submit([&](handler& cgh) {
auto A = a.get_access<access::mode::read>(cgh);
auto D = d.get_access<access::mode::read>(cgh);
auto E = e.get_access<access::mode::write>(cgh);
cgh.parallel_for<class add3>( range<2>{S, T},
[=](id<2> index) { E[index] = A[index] + D[index]; });
});
add2
add3
add1
A B
A D
A E
C
D
Intel confidential 9
SYCL Example: Graph of Asynchronous Executions
myQueue.submit([&](handler& cgh) {
auto A = a.get_access<access::mode::read>(cgh);
auto B = b.get_access<access::mode::read>(cgh);
auto C = c.get_access<access::mode::discardwrite>(cgh);
cgh.parallel_for<class add1>( range<2>{N, M},
[=](id<2> index) { C[index] = A[index] + B[index]; });
});
myQueue.submit([&](handler& cgh) {
auto A = a.get_access<access::mode::read>(cgh);
auto D = d.get_access<access::mode::read>(cgh);
auto E = e.get_access<access::mode::discardwrite>(cgh);
cgh.parallel_for<class add2>( range<2>{P, Q},
[=](id<2> index) { E[index] = A[index] + D[index]; });
});
myQueue.submit([&](handler& cgh) {
auto C = c.get_access<access::mode::read>(cgh);
auto E = e.get_access<access::mode::read>(cgh);
auto F = f.get_access<access::mode::discardwrite>(cgh);
cgh.parallel_for<class add3>( range<2>{S, T},
[=](id<2> index) { F[index] = C[index] + E[index]; });
});
add2
add3
add1
A B A D
F
C
• SYCL queues are out-of-order by default – data dependencies order kernel executions
• Will also be able to use in-order queue policies to simplify porting
E
10
Hierarchical parallelism (logical view)
 Fundamentally top down expression of parallelism
 Many embedded features and details, not covered here
parallel_for_work_group (…) {
…
parallel_for_work_item (…) {
…
}
}
Intel Open-Source SYCL Project
Motivation, location, architecture, plugin interface, future directions
11
Motivation
Why SYCL
 SYCL is an open standard
 Can target any offload device
 Expressive and highly productive
Why Open source
 Promote SYCL as the offload programming model
 Foster collaboration and innovation in the industry
12
Location and information
13
Github: https://github.com/intel/llvm/tree/sycl - fork from llvm repo
Primary project goal is contribution to llvm.org:
 RFC: https://lists.llvm.org/pipermail/cfe-dev/2019-January/060811.html
First changes to the clang driver are already committed:
https://reviews.llvm.org/D57768
Detailed plan for upstream: https://github.com/intel/llvm/issues/49
Overview of Changes atop LLVM/Clang
14
SYCL runtime. Core runtime components, SYCL API definition and implementation
Compilation driver. Two-step host + device compilation, separate compilation and
linking, AOT compilation
Front End. Device compiler diagnostics, device code marking and outlining, address
space handling, integration header generation,
Tools.
 clang-offload-bundler. Generates “fat objects”. Updated to support SYCL offload kind and partial
linking to enable “fat static libraries”.
 clang-offload-wrapper. New tool to create fat binaries – host + device executables in one file. Portable
analog of OpenMP offload linker script.
LLVM. New sycldevice environment to customize spir target
Fat CPU
binary
SpirV
Device
code
Device compiler
SYCL Compiler Architecture
Single host + multiple device compilers
3rd-party host compiler can be used
 Integration header with kernel details
Support for JIT and AOT compilation
Support for separate compilation and linking
SYCL
Sources
SYCL
FE
Host compiler
AOT
compilation
SPIR-V
LLVMIR
LLVM IR to
SPIR-V
Device
code
SYCL offload runtime
SPIR-V
Back End
JIT
compilation
LoopNest
materializer
Loop Opt
SYCL headers
over OpenCL*
SYCL headers
over OpenMP
Vectorizer
Scalar
Opt
Integration
header
.bc & Integrated
Back Ends
SPIR-V
Back Ends
Single
CPU
binary OpenMP-based SYCL host
device runtime
Host Device Back
End
MT+SIMD path
* OpenCL dependence is to be moved under the Plugin Interface 15
SYCL host device runtime
Simplified Compilation Flow for JIT scenario
llvm-link
src1.cpp
.bc
src2.cpp
.bc
linked .bc
llvm-spirv
.spirv image
Fat binary
.bc_lib
device library
Device compiler
Middle End
clang-offload-wrapper
Host compiler
Integration
header generator
header1.h
header2.h
ld
.o .o
Device bin .o
SYCL RT headers
.lib
host library
Device bin
16
SYCL Runtime Architecture
17
SYCL
application
SYCL
runtime
L1RI plugin
Native runtime
& driver
Device
SYCL runtime library
Scheduler Device mgr
Program &
kernel mgr
SYCL API
Memory
mgr
CPU
TBB RT
SYCL host / host device
SPIR-V
Device X exe
Host device RT
interace
PI/OpenCL plugin
OpenMP RT
SYCL Runtime Plugin Interface (PI)
Device Z
OpenCL Runtime
…
PI types & services Device binary mgmt
DeviceX native RT
PI/X RT plugin
Device Y
Other
layers
Device X
PI discovery &
plugin infra
Modular architecture
Plugin interface to support
multiple back-ends
Support concurrent offload to
multiple devices
Support for device code
versioning
 Multiple device binaries in a single fat
binary
SYCL Runtime Plugin Interface
18
libsycl.so
OCL ICD like discovery logic
PI_deviceX_plugin.so PI_OpenCL_plugin.so
libOpenCL.so
libdeviceX_rt.so libOCL_Y_rt.so libOCL_Z_rt.so
libOCL_CPU_rt.so
dlopen dlopen
dlopen
Khronos ICD-compatible OpenCL installation “Custom” OpenCL installation
Non-OpenCL runtime
Defines a programming model atop OpenCL
concepts & model
Abstracts away the device layer from the SYCL
runtime
 Allows implementations for multiple devices co-
exist
Future plans
19
NDRange subgroups
Ordered queue
Ahead of time compilation
Scheduler improvements
Unified Shared Memory
Hierarchical parallelism
Generic address space
Windows support
Specialization constants
Documentation update
Fat static libraries
Device code distribution per cl::sycl::program
Q&A
20

More Related Content

Similar to 3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf

Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
Two C++ Tools: Compiler Explorer and Cpp Insights
Two C++ Tools: Compiler Explorer and Cpp InsightsTwo C++ Tools: Compiler Explorer and Cpp Insights
Two C++ Tools: Compiler Explorer and Cpp InsightsAlison Chaiken
 
不深不淺,帶你認識 LLVM (Found LLVM in your life)
不深不淺,帶你認識 LLVM (Found LLVM in your life)不深不淺,帶你認識 LLVM (Found LLVM in your life)
不深不淺,帶你認識 LLVM (Found LLVM in your life)Douglas Chen
 
How to Connect SystemVerilog with Octave
How to Connect SystemVerilog with OctaveHow to Connect SystemVerilog with Octave
How to Connect SystemVerilog with OctaveAmiq Consulting
 
"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from Intel"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from IntelEdge AI and Vision Alliance
 
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP..."Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...Edge AI and Vision Alliance
 
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScyllaDB
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - CompilationsHSA Foundation
 
GNU Compiler Collection - August 2005
GNU Compiler Collection - August 2005GNU Compiler Collection - August 2005
GNU Compiler Collection - August 2005Saleem Ansari
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLinaro
 
clWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUclWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUJohn Colvin
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLScyllaDB
 
An Introductory course on Verilog HDL-Verilog hdl ppr
An Introductory course on Verilog HDL-Verilog hdl pprAn Introductory course on Verilog HDL-Verilog hdl ppr
An Introductory course on Verilog HDL-Verilog hdl pprPrabhavathi P
 
Optimizing NN inference performance on Arm NEON and Vulkan
Optimizing NN inference performance on Arm NEON and VulkanOptimizing NN inference performance on Arm NEON and Vulkan
Optimizing NN inference performance on Arm NEON and Vulkanax inc.
 
TestUpload
TestUploadTestUpload
TestUploadZarksaDS
 
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019corehard_by
 
Track c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eveTrack c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -evechiportal
 
Kernel Recipes 2015 - So you want to write a Linux driver framework
Kernel Recipes 2015 - So you want to write a Linux driver frameworkKernel Recipes 2015 - So you want to write a Linux driver framework
Kernel Recipes 2015 - So you want to write a Linux driver frameworkAnne Nicolas
 

Similar to 3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf (20)

Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
 
Two C++ Tools: Compiler Explorer and Cpp Insights
Two C++ Tools: Compiler Explorer and Cpp InsightsTwo C++ Tools: Compiler Explorer and Cpp Insights
Two C++ Tools: Compiler Explorer and Cpp Insights
 
不深不淺,帶你認識 LLVM (Found LLVM in your life)
不深不淺,帶你認識 LLVM (Found LLVM in your life)不深不淺,帶你認識 LLVM (Found LLVM in your life)
不深不淺,帶你認識 LLVM (Found LLVM in your life)
 
gcc and friends
gcc and friendsgcc and friends
gcc and friends
 
How to Connect SystemVerilog with Octave
How to Connect SystemVerilog with OctaveHow to Connect SystemVerilog with Octave
How to Connect SystemVerilog with Octave
 
"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from Intel"Making OpenCV Code Run Fast," a Presentation from Intel
"Making OpenCV Code Run Fast," a Presentation from Intel
 
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP..."Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
"Efficient Implementation of Convolutional Neural Networks using OpenCL on FP...
 
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them AllScylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
Scylla Summit 2022: ScyllaDB Rust Driver: One Driver to Rule Them All
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
 
GNU Compiler Collection - August 2005
GNU Compiler Collection - August 2005GNU Compiler Collection - August 2005
GNU Compiler Collection - August 2005
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC session
 
clWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPUclWrap: Nonsense free control of your GPU
clWrap: Nonsense free control of your GPU
 
High-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQLHigh-speed Database Throughput Using Apache Arrow Flight SQL
High-speed Database Throughput Using Apache Arrow Flight SQL
 
An Introductory course on Verilog HDL-Verilog hdl ppr
An Introductory course on Verilog HDL-Verilog hdl pprAn Introductory course on Verilog HDL-Verilog hdl ppr
An Introductory course on Verilog HDL-Verilog hdl ppr
 
Optimizing NN inference performance on Arm NEON and Vulkan
Optimizing NN inference performance on Arm NEON and VulkanOptimizing NN inference performance on Arm NEON and Vulkan
Optimizing NN inference performance on Arm NEON and Vulkan
 
Verilog
VerilogVerilog
Verilog
 
TestUpload
TestUploadTestUpload
TestUpload
 
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
The Hitchhiker's Guide to Faster Builds. Viktor Kirilov. CoreHard Spring 2019
 
Track c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eveTrack c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eve
 
Kernel Recipes 2015 - So you want to write a Linux driver framework
Kernel Recipes 2015 - So you want to write a Linux driver frameworkKernel Recipes 2015 - So you want to write a Linux driver framework
Kernel Recipes 2015 - So you want to write a Linux driver framework
 

More from JunZhao68

1-MIV-tutorial-part-1.pdf
1-MIV-tutorial-part-1.pdf1-MIV-tutorial-part-1.pdf
1-MIV-tutorial-part-1.pdfJunZhao68
 
GOP-Size_report_11_16.pdf
GOP-Size_report_11_16.pdfGOP-Size_report_11_16.pdf
GOP-Size_report_11_16.pdfJunZhao68
 
02-VariableLengthCodes_pres.pdf
02-VariableLengthCodes_pres.pdf02-VariableLengthCodes_pres.pdf
02-VariableLengthCodes_pres.pdfJunZhao68
 
MHV-Presentation-Forman (1).pdf
MHV-Presentation-Forman (1).pdfMHV-Presentation-Forman (1).pdf
MHV-Presentation-Forman (1).pdfJunZhao68
 
CODA_presentation.pdf
CODA_presentation.pdfCODA_presentation.pdf
CODA_presentation.pdfJunZhao68
 
http3-quic-streaming-2020-200121234036.pdf
http3-quic-streaming-2020-200121234036.pdfhttp3-quic-streaming-2020-200121234036.pdf
http3-quic-streaming-2020-200121234036.pdfJunZhao68
 
NTTW4-FFmpeg.pdf
NTTW4-FFmpeg.pdfNTTW4-FFmpeg.pdf
NTTW4-FFmpeg.pdfJunZhao68
 
03-Reznik-DASH-IF-workshop-2019-CAE.pdf
03-Reznik-DASH-IF-workshop-2019-CAE.pdf03-Reznik-DASH-IF-workshop-2019-CAE.pdf
03-Reznik-DASH-IF-workshop-2019-CAE.pdfJunZhao68
 
Practical Programming.pdf
Practical Programming.pdfPractical Programming.pdf
Practical Programming.pdfJunZhao68
 
Overview_of_H.264.pdf
Overview_of_H.264.pdfOverview_of_H.264.pdf
Overview_of_H.264.pdfJunZhao68
 
20160927-tierney-improving-performance-40G-100G-data-transfer-nodes.pdf
20160927-tierney-improving-performance-40G-100G-data-transfer-nodes.pdf20160927-tierney-improving-performance-40G-100G-data-transfer-nodes.pdf
20160927-tierney-improving-performance-40G-100G-data-transfer-nodes.pdfJunZhao68
 
Wojciech Przybyl - Efficient Trick Modes with MPEG-DASH.pdf
Wojciech Przybyl - Efficient Trick Modes with MPEG-DASH.pdfWojciech Przybyl - Efficient Trick Modes with MPEG-DASH.pdf
Wojciech Przybyl - Efficient Trick Modes with MPEG-DASH.pdfJunZhao68
 
100G Networking Berlin.pdf
100G Networking Berlin.pdf100G Networking Berlin.pdf
100G Networking Berlin.pdfJunZhao68
 
20230320-信息技术-人工智能系列深度报告:AIGC行业综述篇——开启AI新篇章-国海证券.pdf
20230320-信息技术-人工智能系列深度报告:AIGC行业综述篇——开启AI新篇章-国海证券.pdf20230320-信息技术-人工智能系列深度报告:AIGC行业综述篇——开启AI新篇章-国海证券.pdf
20230320-信息技术-人工智能系列深度报告:AIGC行业综述篇——开启AI新篇章-国海证券.pdfJunZhao68
 
2020+HESP+Technical+Deck+-+HESP+Alliance.pdf
2020+HESP+Technical+Deck+-+HESP+Alliance.pdf2020+HESP+Technical+Deck+-+HESP+Alliance.pdf
2020+HESP+Technical+Deck+-+HESP+Alliance.pdfJunZhao68
 
5 - Advanced SVE.pdf
5 - Advanced SVE.pdf5 - Advanced SVE.pdf
5 - Advanced SVE.pdfJunZhao68
 

More from JunZhao68 (16)

1-MIV-tutorial-part-1.pdf
1-MIV-tutorial-part-1.pdf1-MIV-tutorial-part-1.pdf
1-MIV-tutorial-part-1.pdf
 
GOP-Size_report_11_16.pdf
GOP-Size_report_11_16.pdfGOP-Size_report_11_16.pdf
GOP-Size_report_11_16.pdf
 
02-VariableLengthCodes_pres.pdf
02-VariableLengthCodes_pres.pdf02-VariableLengthCodes_pres.pdf
02-VariableLengthCodes_pres.pdf
 
MHV-Presentation-Forman (1).pdf
MHV-Presentation-Forman (1).pdfMHV-Presentation-Forman (1).pdf
MHV-Presentation-Forman (1).pdf
 
CODA_presentation.pdf
CODA_presentation.pdfCODA_presentation.pdf
CODA_presentation.pdf
 
http3-quic-streaming-2020-200121234036.pdf
http3-quic-streaming-2020-200121234036.pdfhttp3-quic-streaming-2020-200121234036.pdf
http3-quic-streaming-2020-200121234036.pdf
 
NTTW4-FFmpeg.pdf
NTTW4-FFmpeg.pdfNTTW4-FFmpeg.pdf
NTTW4-FFmpeg.pdf
 
03-Reznik-DASH-IF-workshop-2019-CAE.pdf
03-Reznik-DASH-IF-workshop-2019-CAE.pdf03-Reznik-DASH-IF-workshop-2019-CAE.pdf
03-Reznik-DASH-IF-workshop-2019-CAE.pdf
 
Practical Programming.pdf
Practical Programming.pdfPractical Programming.pdf
Practical Programming.pdf
 
Overview_of_H.264.pdf
Overview_of_H.264.pdfOverview_of_H.264.pdf
Overview_of_H.264.pdf
 
20160927-tierney-improving-performance-40G-100G-data-transfer-nodes.pdf
20160927-tierney-improving-performance-40G-100G-data-transfer-nodes.pdf20160927-tierney-improving-performance-40G-100G-data-transfer-nodes.pdf
20160927-tierney-improving-performance-40G-100G-data-transfer-nodes.pdf
 
Wojciech Przybyl - Efficient Trick Modes with MPEG-DASH.pdf
Wojciech Przybyl - Efficient Trick Modes with MPEG-DASH.pdfWojciech Przybyl - Efficient Trick Modes with MPEG-DASH.pdf
Wojciech Przybyl - Efficient Trick Modes with MPEG-DASH.pdf
 
100G Networking Berlin.pdf
100G Networking Berlin.pdf100G Networking Berlin.pdf
100G Networking Berlin.pdf
 
20230320-信息技术-人工智能系列深度报告:AIGC行业综述篇——开启AI新篇章-国海证券.pdf
20230320-信息技术-人工智能系列深度报告:AIGC行业综述篇——开启AI新篇章-国海证券.pdf20230320-信息技术-人工智能系列深度报告:AIGC行业综述篇——开启AI新篇章-国海证券.pdf
20230320-信息技术-人工智能系列深度报告:AIGC行业综述篇——开启AI新篇章-国海证券.pdf
 
2020+HESP+Technical+Deck+-+HESP+Alliance.pdf
2020+HESP+Technical+Deck+-+HESP+Alliance.pdf2020+HESP+Technical+Deck+-+HESP+Alliance.pdf
2020+HESP+Technical+Deck+-+HESP+Alliance.pdf
 
5 - Advanced SVE.pdf
5 - Advanced SVE.pdf5 - Advanced SVE.pdf
5 - Advanced SVE.pdf
 

Recently uploaded

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 

Recently uploaded (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 

3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf

  • 1. Intel Open-source SYCL compiler project Konstantin Bobrovskii, Intel 2019 Khronos Embedded Vision Summit
  • 4. Intel confidential 4 What is SYCL? Single-source heterogeneous programming using STANDARD C++  Use C++ templates and lambda functions for host & device code Aligns the hardware acceleration of OpenCL with direction of the C++ standard C++ Kernel Language Low Level Control ‘GPGPU’-style separation of device-side kernel source code and host code Single-source C++ Programmer Familiarity Approach also taken by C++ AMP and OpenMP Developer Choice The development of the two specifications are aligned so code can be easily shared between the two approaches
  • 5. 5 Reactive to OpenCL Pros and Cons: • OpenCL has a well-defined, portable execution model. • OpenCL is too verbose for many application developers. • OpenCL remains a C API and only recently supported C++ kernels. • Just-in-time compilation model and disjoint source code is awkward and contrary to HPC usage models. Proactive about Future C++: • SYCL is based on purely modern C++ and should feel familiar to C++11 users. • SYCL expected to run ahead of C++Next regarding heterogeneity and parallelism. ISO C++ of tomorrow may look a lot like SYCL. • Not held back by C99 or C++03 compatibility goals. Why SYCL? Reactive and Proactive Motivation:
  • 6. 6 nstream.cl __kernel void nstream( int length, double s, __global double * A, __global double * B, __global double * C) { int i = get_global_id(0); A[i] += B[i] + s * C[i]; } nstream.cpp cl::Context gpu(CL_DEVICE_TYPE_GPU, &err); cl::CommandQueue queue(gpu); cl::Program program(gpu, prk::opencl::loadProgram(“nstream.cl”), true); auto kernel = cl::make_kernel<int, double, cl::Buffer, cl::Buffer, cl::Buffer>(program, “nstream”, &err); auto d_a = cl::Buffer(gpu, begin(h_a), end(h_a)); auto d_b = cl::Buffer(gpu, begin(h_b), end(h_b)); auto d_c = cl::Buffer(gpu, begin(h_c), end(h_c)); kernel(cl::EnqueueArgs(queue, cl::NDRange(length)), length, scalar, d_a, d_b, d_c); queue.finish(); OpenCL Example (w/ C++ Wrappers):
  • 7. Intel confidential 7 SYCL Example sycl::gpu_selector device_selector; sycl::queue q(device_selector); sycl::buffer<double> d_A { h_A.data(), h_A.size() }; sycl::buffer<double> d_B { h_B.data(), h_B.size() }; sycl::buffer<double> d_C { h_C.data(), h_C.size() }; q.submit([&](sycl::handler& h) { auto A = d_A.get_access<sycl::access::mode::read_write>(h); auto B = d_B.get_access<sycl::access::mode::read>(h); auto C = d_C.get_access<sycl::access::mode::read>(h); h.parallel_for<class nstream>(sycl::range<1>{length}, [=] (sycl::item<1> i) { A[i] += B[i] + s * C[i]; }); }); q.wait(); Retains OpenCL’s ability to easily target different devices in the same thread. Parallel loops are explicit like C++ vs. implicit in OpenCL. Kernel code does not need to live in a separate part of the program. Accessors create DAG to trigger data movement and represent execution dependencies.
  • 8. Intel confidential 8 SYCL Example: Graph of Asynchronous Executions myQueue.submit([&](handler& cgh) { auto A = a.get_access<access::mode::read>(cgh); auto B = b.get_access<access::mode::read>(cgh); auto C = c.get_access<access::mode::discardwrite>(cgh); cgh.parallel_for<class add1>( range<2>{N, M}, [=](id<2> index) { C[index] = A[index] + B[index]; }); }); myQueue.submit([&](handler& cgh) { auto A = a.get_access<access::mode::read>(cgh); auto C = c.get_access<access::mode::read>(cgh); auto D = d.get_access<access::mode::write>(cgh); cgh.parallel_for<class add2>( range<2>{P, Q}, [=](id<2> index) { D[index] = A[index] + C[index]; }); }); myQueue.submit([&](handler& cgh) { auto A = a.get_access<access::mode::read>(cgh); auto D = d.get_access<access::mode::read>(cgh); auto E = e.get_access<access::mode::write>(cgh); cgh.parallel_for<class add3>( range<2>{S, T}, [=](id<2> index) { E[index] = A[index] + D[index]; }); }); add2 add3 add1 A B A D A E C D
  • 9. Intel confidential 9 SYCL Example: Graph of Asynchronous Executions myQueue.submit([&](handler& cgh) { auto A = a.get_access<access::mode::read>(cgh); auto B = b.get_access<access::mode::read>(cgh); auto C = c.get_access<access::mode::discardwrite>(cgh); cgh.parallel_for<class add1>( range<2>{N, M}, [=](id<2> index) { C[index] = A[index] + B[index]; }); }); myQueue.submit([&](handler& cgh) { auto A = a.get_access<access::mode::read>(cgh); auto D = d.get_access<access::mode::read>(cgh); auto E = e.get_access<access::mode::discardwrite>(cgh); cgh.parallel_for<class add2>( range<2>{P, Q}, [=](id<2> index) { E[index] = A[index] + D[index]; }); }); myQueue.submit([&](handler& cgh) { auto C = c.get_access<access::mode::read>(cgh); auto E = e.get_access<access::mode::read>(cgh); auto F = f.get_access<access::mode::discardwrite>(cgh); cgh.parallel_for<class add3>( range<2>{S, T}, [=](id<2> index) { F[index] = C[index] + E[index]; }); }); add2 add3 add1 A B A D F C • SYCL queues are out-of-order by default – data dependencies order kernel executions • Will also be able to use in-order queue policies to simplify porting E
  • 10. 10 Hierarchical parallelism (logical view)  Fundamentally top down expression of parallelism  Many embedded features and details, not covered here parallel_for_work_group (…) { … parallel_for_work_item (…) { … } }
  • 11. Intel Open-Source SYCL Project Motivation, location, architecture, plugin interface, future directions 11
  • 12. Motivation Why SYCL  SYCL is an open standard  Can target any offload device  Expressive and highly productive Why Open source  Promote SYCL as the offload programming model  Foster collaboration and innovation in the industry 12
  • 13. Location and information 13 Github: https://github.com/intel/llvm/tree/sycl - fork from llvm repo Primary project goal is contribution to llvm.org:  RFC: https://lists.llvm.org/pipermail/cfe-dev/2019-January/060811.html First changes to the clang driver are already committed: https://reviews.llvm.org/D57768 Detailed plan for upstream: https://github.com/intel/llvm/issues/49
  • 14. Overview of Changes atop LLVM/Clang 14 SYCL runtime. Core runtime components, SYCL API definition and implementation Compilation driver. Two-step host + device compilation, separate compilation and linking, AOT compilation Front End. Device compiler diagnostics, device code marking and outlining, address space handling, integration header generation, Tools.  clang-offload-bundler. Generates “fat objects”. Updated to support SYCL offload kind and partial linking to enable “fat static libraries”.  clang-offload-wrapper. New tool to create fat binaries – host + device executables in one file. Portable analog of OpenMP offload linker script. LLVM. New sycldevice environment to customize spir target
  • 15. Fat CPU binary SpirV Device code Device compiler SYCL Compiler Architecture Single host + multiple device compilers 3rd-party host compiler can be used  Integration header with kernel details Support for JIT and AOT compilation Support for separate compilation and linking SYCL Sources SYCL FE Host compiler AOT compilation SPIR-V LLVMIR LLVM IR to SPIR-V Device code SYCL offload runtime SPIR-V Back End JIT compilation LoopNest materializer Loop Opt SYCL headers over OpenCL* SYCL headers over OpenMP Vectorizer Scalar Opt Integration header .bc & Integrated Back Ends SPIR-V Back Ends Single CPU binary OpenMP-based SYCL host device runtime Host Device Back End MT+SIMD path * OpenCL dependence is to be moved under the Plugin Interface 15 SYCL host device runtime
  • 16. Simplified Compilation Flow for JIT scenario llvm-link src1.cpp .bc src2.cpp .bc linked .bc llvm-spirv .spirv image Fat binary .bc_lib device library Device compiler Middle End clang-offload-wrapper Host compiler Integration header generator header1.h header2.h ld .o .o Device bin .o SYCL RT headers .lib host library Device bin 16
  • 17. SYCL Runtime Architecture 17 SYCL application SYCL runtime L1RI plugin Native runtime & driver Device SYCL runtime library Scheduler Device mgr Program & kernel mgr SYCL API Memory mgr CPU TBB RT SYCL host / host device SPIR-V Device X exe Host device RT interace PI/OpenCL plugin OpenMP RT SYCL Runtime Plugin Interface (PI) Device Z OpenCL Runtime … PI types & services Device binary mgmt DeviceX native RT PI/X RT plugin Device Y Other layers Device X PI discovery & plugin infra Modular architecture Plugin interface to support multiple back-ends Support concurrent offload to multiple devices Support for device code versioning  Multiple device binaries in a single fat binary
  • 18. SYCL Runtime Plugin Interface 18 libsycl.so OCL ICD like discovery logic PI_deviceX_plugin.so PI_OpenCL_plugin.so libOpenCL.so libdeviceX_rt.so libOCL_Y_rt.so libOCL_Z_rt.so libOCL_CPU_rt.so dlopen dlopen dlopen Khronos ICD-compatible OpenCL installation “Custom” OpenCL installation Non-OpenCL runtime Defines a programming model atop OpenCL concepts & model Abstracts away the device layer from the SYCL runtime  Allows implementations for multiple devices co- exist
  • 19. Future plans 19 NDRange subgroups Ordered queue Ahead of time compilation Scheduler improvements Unified Shared Memory Hierarchical parallelism Generic address space Windows support Specialization constants Documentation update Fat static libraries Device code distribution per cl::sycl::program