SlideShare a Scribd company logo
1 of 1
Download to read offline
Abstract
Performance on Xeon CPU
Performance on Xeon Phi
Take Home
Vectorized Data Movement Primitives
Funding Acknowledgements
This work was funded by the EU projects SAGE (671500) and E2Data
(780245), DFG Stratosphere (606902), and the German Ministry for
Education and Research as BBDC (01IS14013A) and Software
Campus (01IS12056).
Open Source Repository
Tobias Behrens
1
, Viktor Rosenfeld
1
, Jonas Traub
2
, Sebastian Breß
1,2
, Volker Markl
1,2
Efficient SIMD Vectorization for Hashing in OpenCL
1
firstname.lastname@dfki.de
2
firstname.lastname@tu-berlin.de
Hashing is at the core of many efficient database operators
such as hash-based joins and aggregations.
Significant speedup was shown for vectorized hash table
operations using processor specific low-level intrinsics.
We present portable and vectorized hashing primitives
using the parallel programming framework OpenCL.
●Scalar (C) Scalar (OpenCL) Vectorized (Intrinsics) Vectorized (OpenCL)
ProbeBuild
L1 L2
● ● ● ●
● ● ●
● ●
●
●
● ● ● ●
0
1
2
0.0
2.5
5.0
7.5
10.0
4
kB
16
kB
64
kB
256
kB
1
M
B
4
M
B
16
M
B
64
M
B
Hash Table Size
GB/Second
L1 L2
● ●
●
●
●
●
●
● ● ● ● ● ● ●
0.00
0.05
0.10
0.15
0.20
0.0
0.4
0.8
1.2
1.6
4
kB
16
kB
64
kB
256
kB
1
M
B
4
M
B
16
M
B
64
M
B
Hash Table Size
Bil.Keys/Second
●Scalar (C) Scalar (OpenCL) Vectorized (Intrinsics) Vectorized (OpenCL)
Gather OpenCL
① Build
② Probe
1 / / t abl es : Bucket ∗table , uint 8 mask
2 / / Output : uint8 vect or
3 vect or . s0 = t abl e [mask . s0 ]. key ;
4 vect or . s1 = t abl e [mask . s1 ]. key ;
5 / / . . . up to vect or . s7 = t abl e [mask . s7 ]. key
Probe
Table
Hash
Table
Result
Table
hash(vector) equal(vector,
h_table)?
Hash
Table
Hash
Table
hash(vector)
Build
Table
free(h_table)?
1 / / I nputs : ui nt 64 t ∗ table , 128i index
2 / / Output : m256i res
3 m128i index R = mm shuffl e epi 32( index , MM SHUFFLE ( 1 , 0, 3, 2) ) ;
4 m128i i12 = mm cvt epi 32 epi 64 ( index ) ;
5 m128i i34 = mm cvt epi 32 epi 64 ( index R ) ;
6 si ze t i 1 = mm cvt si 128 si 64 ( i12) ;
7 si ze t i 3 = mm cvt si 128 si 64 ( i34) ;
8 m128i d1 = mm l oadl epi 64 ( ( m128i ∗)&t abl e [ i 1 ]) ;
9 m128i d3 = mm l oadl epi 64 ( ( m128i ∗)&t abl e [ i 3 ]) ;
10 i12 = mm sr l i si 128 ( i12 , 8) ;
11 i34 = mm sr l i si 128 ( i34 , 8) ;
12 si ze t i 2 = mm cvt si 128 si 64 ( i12) ;
13 si ze t i 4 = mm cvt si 128 si 64 ( i34) ;
14 m128i d2 = mm l oadl epi 64 ( ( m128i ∗)&t abl e [ i 2 ]) ;
15 m128i d4 = mm l oadl epi 64 ( ( m128i ∗)&t abl e [ i 4 ]) ;
16 m256i d12 = mm256 castsi128 si256( mm unpacklo epi64( d1, d2) ) ;
17 m256i d34 = mm256 castsi128 si256( mm unpacklo epi64( d3, d4) ) ;
18 m256i res = mm256 permute2x128 si256( d12 , d34 , MM SHUFFLE ( 0 , 2, 0, 0) ) ;
Gather Intel Intrinsics
vs.
efficient
enough?
OpenCL
Processor Specific
Intrinsic Y
Processor Specific
Intrinsic X
Result
Table
Selective Store
Copies data from
selectable SIMD lanes
to contiguous memory.
Probe / Build
Table
Selective Load
Copies data from
contiguous memory
into selectable
SIMD lanes.
Hash
Table
Gather
Copies data from
discontiguous
memory into
SIMD lanes. Hash
Table
Scatter
Copies data from
SIMD lanes into
discontiguous
memory.
L1 L2 L3
● ● ● ● ● ●
●
● ● ● ● ●
●
● ●
0.0
0.5
1.0
1.5
2.0
0
2
4
6
84
kB
16
kB
64
kB
256
kB
1
M
B
4
M
B
16
M
B
64
M
B
Hash Table Size
GB/Second
Probe
L1 L2 L3
● ● ● ● ● ● ● ● ●
● ● ● ● ● ●
●
● ● ●
●
● ● ●
●
● ● ● ●
0.0
0.5
1.0
1.5
2.0
0
4
8
12
16
4
kB
16
kB
64
kB
256
kB
1
M
B
4
M
B
16
M
B
64
M
B
Hash Table Size
Bil.Keys/Second
Without call overhead
Build
Portable and Maintainable Code Build is overhead dominated, OpenCL-based
probe outperforms scalar implementation.
Intrinsic-based implementation
outperforms OpenCL-based on Xeon Phi.
OpenCL reduces code complexity and
ensures portability of vectorized primitives.
OpenCL-based vectorized hashing
outperforms scalar hashing on Xeon CPU.
Processor specific intrinsics are still
faster, especially on Xeon Phi.
github.com/TU-Berlin-DIMA/OpenCL-SIMD-hashing

More Related Content

What's hot

Activity Recognition Through Complex Event Processing: First Findings
Activity Recognition Through Complex Event Processing: First Findings Activity Recognition Through Complex Event Processing: First Findings
Activity Recognition Through Complex Event Processing: First Findings Sylvain Hallé
 
Ch01 basic concepts_nosoluiton
Ch01 basic concepts_nosoluitonCh01 basic concepts_nosoluiton
Ch01 basic concepts_nosoluitonshin
 
PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...Andrey Karpov
 
Rainer Grimm, “Functional Programming in C++11”
Rainer Grimm, “Functional Programming in C++11”Rainer Grimm, “Functional Programming in C++11”
Rainer Grimm, “Functional Programming in C++11”Platonov Sergey
 
Advance features of C++
Advance features of C++Advance features of C++
Advance features of C++vidyamittal
 
White-Box Testing on Methods
White-Box Testing on MethodsWhite-Box Testing on Methods
White-Box Testing on MethodsMinhas Kamal
 
Richard Salter: Using the Titanium OpenGL Module
Richard Salter: Using the Titanium OpenGL ModuleRichard Salter: Using the Titanium OpenGL Module
Richard Salter: Using the Titanium OpenGL ModuleAxway Appcelerator
 
Ceng232 Decoder Multiplexer Adder
Ceng232 Decoder Multiplexer AdderCeng232 Decoder Multiplexer Adder
Ceng232 Decoder Multiplexer Addergueste731a4
 
Rsa in CTF
Rsa in CTFRsa in CTF
Rsa in CTFSoL ymx
 
software-vulnerability-detectionPresentation
software-vulnerability-detectionPresentationsoftware-vulnerability-detectionPresentation
software-vulnerability-detectionPresentationClaude Goubet
 
8086 labmanual
8086 labmanual8086 labmanual
8086 labmanualiravi9
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - CompilationsHSA Foundation
 
On using Haskell DSL - CLaSH, to implement a simple digital stopwatch on FPGA
On using Haskell DSL - CLaSH, to implement a simple digital stopwatch on FPGAOn using Haskell DSL - CLaSH, to implement a simple digital stopwatch on FPGA
On using Haskell DSL - CLaSH, to implement a simple digital stopwatch on FPGAmjgajda
 
Listing for CircleFunctions
Listing for CircleFunctionsListing for CircleFunctions
Listing for CircleFunctionsDerek Dhammaloka
 
Yurii Shevtsov "V8 + libuv = Node.js. Under the hood"
Yurii Shevtsov "V8 + libuv = Node.js. Under the hood"Yurii Shevtsov "V8 + libuv = Node.js. Under the hood"
Yurii Shevtsov "V8 + libuv = Node.js. Under the hood"OdessaJS Conf
 

What's hot (19)

Activity Recognition Through Complex Event Processing: First Findings
Activity Recognition Through Complex Event Processing: First Findings Activity Recognition Through Complex Event Processing: First Findings
Activity Recognition Through Complex Event Processing: First Findings
 
Ch01 basic concepts_nosoluiton
Ch01 basic concepts_nosoluitonCh01 basic concepts_nosoluiton
Ch01 basic concepts_nosoluiton
 
PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...PVS-Studio team experience: checking various open source projects, or mistake...
PVS-Studio team experience: checking various open source projects, or mistake...
 
Rainer Grimm, “Functional Programming in C++11”
Rainer Grimm, “Functional Programming in C++11”Rainer Grimm, “Functional Programming in C++11”
Rainer Grimm, “Functional Programming in C++11”
 
Advance features of C++
Advance features of C++Advance features of C++
Advance features of C++
 
Dma
DmaDma
Dma
 
White-Box Testing on Methods
White-Box Testing on MethodsWhite-Box Testing on Methods
White-Box Testing on Methods
 
Richard Salter: Using the Titanium OpenGL Module
Richard Salter: Using the Titanium OpenGL ModuleRichard Salter: Using the Titanium OpenGL Module
Richard Salter: Using the Titanium OpenGL Module
 
Ss matlab solved
Ss matlab solvedSs matlab solved
Ss matlab solved
 
Debugging TV Frame 0x0C
Debugging TV Frame 0x0CDebugging TV Frame 0x0C
Debugging TV Frame 0x0C
 
Ceng232 Decoder Multiplexer Adder
Ceng232 Decoder Multiplexer AdderCeng232 Decoder Multiplexer Adder
Ceng232 Decoder Multiplexer Adder
 
Rsa in CTF
Rsa in CTFRsa in CTF
Rsa in CTF
 
software-vulnerability-detectionPresentation
software-vulnerability-detectionPresentationsoftware-vulnerability-detectionPresentation
software-vulnerability-detectionPresentation
 
5th Semester (Dec-2015; Jan-2016) Computer Science and Information Science En...
5th Semester (Dec-2015; Jan-2016) Computer Science and Information Science En...5th Semester (Dec-2015; Jan-2016) Computer Science and Information Science En...
5th Semester (Dec-2015; Jan-2016) Computer Science and Information Science En...
 
8086 labmanual
8086 labmanual8086 labmanual
8086 labmanual
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
 
On using Haskell DSL - CLaSH, to implement a simple digital stopwatch on FPGA
On using Haskell DSL - CLaSH, to implement a simple digital stopwatch on FPGAOn using Haskell DSL - CLaSH, to implement a simple digital stopwatch on FPGA
On using Haskell DSL - CLaSH, to implement a simple digital stopwatch on FPGA
 
Listing for CircleFunctions
Listing for CircleFunctionsListing for CircleFunctions
Listing for CircleFunctions
 
Yurii Shevtsov "V8 + libuv = Node.js. Under the hood"
Yurii Shevtsov "V8 + libuv = Node.js. Under the hood"Yurii Shevtsov "V8 + libuv = Node.js. Under the hood"
Yurii Shevtsov "V8 + libuv = Node.js. Under the hood"
 

Similar to Efficient SIMD Vectorization for Hashing in OpenCL

Logic Design - Chapter 5: Part1 Combinattional Logic
Logic Design - Chapter 5: Part1 Combinattional LogicLogic Design - Chapter 5: Part1 Combinattional Logic
Logic Design - Chapter 5: Part1 Combinattional LogicGouda Mando
 
VHDL Design and FPGA Implementation of a High Data Rate Turbo Decoder based o...
VHDL Design and FPGA Implementation of a High Data Rate Turbo Decoder based o...VHDL Design and FPGA Implementation of a High Data Rate Turbo Decoder based o...
VHDL Design and FPGA Implementation of a High Data Rate Turbo Decoder based o...IJECEIAES
 
NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)Igalia
 
Topological Sorting
Topological SortingTopological Sorting
Topological SortingShahDhruv21
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...David Walker
 
Reverse Engineering Dojo: Enhancing Assembly Reading Skills
Reverse Engineering Dojo: Enhancing Assembly Reading SkillsReverse Engineering Dojo: Enhancing Assembly Reading Skills
Reverse Engineering Dojo: Enhancing Assembly Reading SkillsAsuka Nakajima
 
Designing C++ portable SIMD support
Designing C++ portable SIMD supportDesigning C++ portable SIMD support
Designing C++ portable SIMD supportJoel Falcou
 
ANALYSIS & DESIGN OF COMBINATIONAL LOGIC
ANALYSIS & DESIGN OF COMBINATIONAL LOGICANALYSIS & DESIGN OF COMBINATIONAL LOGIC
ANALYSIS & DESIGN OF COMBINATIONAL LOGICSupanna Shirguppe
 
EdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaEdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaLisa Hua
 
IP Core Design of Hight Lightweight Cipher and its Implementation
IP Core Design of Hight Lightweight Cipher and its Implementation IP Core Design of Hight Lightweight Cipher and its Implementation
IP Core Design of Hight Lightweight Cipher and its Implementation csandit
 
IP CORE DESIGN OF HIGHT LIGHTWEIGHT CIPHER AND ITS IMPLEMENTATION
IP CORE DESIGN OF HIGHT LIGHTWEIGHT CIPHER AND ITS IMPLEMENTATIONIP CORE DESIGN OF HIGHT LIGHTWEIGHT CIPHER AND ITS IMPLEMENTATION
IP CORE DESIGN OF HIGHT LIGHTWEIGHT CIPHER AND ITS IMPLEMENTATIONcscpconf
 
Designing Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.ProtoDesigning Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.ProtoJoel Falcou
 
Vectorization with LMS: SIMD Intrinsics
Vectorization with LMS: SIMD IntrinsicsVectorization with LMS: SIMD Intrinsics
Vectorization with LMS: SIMD IntrinsicsETH Zurich
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesDongsun Kim
 
Chapter 06 Combinational Logic Functions
Chapter 06 Combinational Logic FunctionsChapter 06 Combinational Logic Functions
Chapter 06 Combinational Logic FunctionsSSE_AndyLi
 

Similar to Efficient SIMD Vectorization for Hashing in OpenCL (20)

Logic Design - Chapter 5: Part1 Combinattional Logic
Logic Design - Chapter 5: Part1 Combinattional LogicLogic Design - Chapter 5: Part1 Combinattional Logic
Logic Design - Chapter 5: Part1 Combinattional Logic
 
VHDL Design and FPGA Implementation of a High Data Rate Turbo Decoder based o...
VHDL Design and FPGA Implementation of a High Data Rate Turbo Decoder based o...VHDL Design and FPGA Implementation of a High Data Rate Turbo Decoder based o...
VHDL Design and FPGA Implementation of a High Data Rate Turbo Decoder based o...
 
NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)NIR on the Mesa i965 backend (FOSDEM 2016)
NIR on the Mesa i965 backend (FOSDEM 2016)
 
Dsp manual print
Dsp manual printDsp manual print
Dsp manual print
 
Topological Sorting
Topological SortingTopological Sorting
Topological Sorting
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
Reverse Engineering Dojo: Enhancing Assembly Reading Skills
Reverse Engineering Dojo: Enhancing Assembly Reading SkillsReverse Engineering Dojo: Enhancing Assembly Reading Skills
Reverse Engineering Dojo: Enhancing Assembly Reading Skills
 
Designing C++ portable SIMD support
Designing C++ portable SIMD supportDesigning C++ portable SIMD support
Designing C++ portable SIMD support
 
10 multiplexers-de mux
10 multiplexers-de mux10 multiplexers-de mux
10 multiplexers-de mux
 
ANALYSIS & DESIGN OF COMBINATIONAL LOGIC
ANALYSIS & DESIGN OF COMBINATIONAL LOGICANALYSIS & DESIGN OF COMBINATIONAL LOGIC
ANALYSIS & DESIGN OF COMBINATIONAL LOGIC
 
EdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaEdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for Java
 
IP Core Design of Hight Lightweight Cipher and its Implementation
IP Core Design of Hight Lightweight Cipher and its Implementation IP Core Design of Hight Lightweight Cipher and its Implementation
IP Core Design of Hight Lightweight Cipher and its Implementation
 
IP CORE DESIGN OF HIGHT LIGHTWEIGHT CIPHER AND ITS IMPLEMENTATION
IP CORE DESIGN OF HIGHT LIGHTWEIGHT CIPHER AND ITS IMPLEMENTATIONIP CORE DESIGN OF HIGHT LIGHTWEIGHT CIPHER AND ITS IMPLEMENTATION
IP CORE DESIGN OF HIGHT LIGHTWEIGHT CIPHER AND ITS IMPLEMENTATION
 
Designing Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.ProtoDesigning Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.Proto
 
Vectorization with LMS: SIMD Intrinsics
Vectorization with LMS: SIMD IntrinsicsVectorization with LMS: SIMD Intrinsics
Vectorization with LMS: SIMD Intrinsics
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method Names
 
Chapter 06 Combinational Logic Functions
Chapter 06 Combinational Logic FunctionsChapter 06 Combinational Logic Functions
Chapter 06 Combinational Logic Functions
 
VHDL Part 4
VHDL Part 4VHDL Part 4
VHDL Part 4
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
 

More from Jonas Traub

Definitely not Java! A Hands-on Introduction to Efficient Functional Programm...
Definitely not Java! A Hands-on Introduction to Efficient Functional Programm...Definitely not Java! A Hands-on Introduction to Efficient Functional Programm...
Definitely not Java! A Hands-on Introduction to Efficient Functional Programm...Jonas Traub
 
Efficient Data Stream Processing in the Internet of Things - SoftwareCampus A...
Efficient Data Stream Processing in the Internet of Things - SoftwareCampus A...Efficient Data Stream Processing in the Internet of Things - SoftwareCampus A...
Efficient Data Stream Processing in the Internet of Things - SoftwareCampus A...Jonas Traub
 
code.talks 2019 - Scotty: Efficient Window Aggregation for your Stream Proces...
code.talks 2019 - Scotty: Efficient Window Aggregation for your Stream Proces...code.talks 2019 - Scotty: Efficient Window Aggregation for your Stream Proces...
code.talks 2019 - Scotty: Efficient Window Aggregation for your Stream Proces...Jonas Traub
 
FlinkForward Berlin 2019 - Scotty: Efficient Window Aggregation with General ...
FlinkForward Berlin 2019 - Scotty: Efficient Window Aggregation with General ...FlinkForward Berlin 2019 - Scotty: Efficient Window Aggregation with General ...
FlinkForward Berlin 2019 - Scotty: Efficient Window Aggregation with General ...Jonas Traub
 
Analyzing Efficient Stream Processing on Modern Hardware (VLDB 2019 Presentat...
Analyzing Efficient Stream Processing on Modern Hardware (VLDB 2019 Presentat...Analyzing Efficient Stream Processing on Modern Hardware (VLDB 2019 Presentat...
Analyzing Efficient Stream Processing on Modern Hardware (VLDB 2019 Presentat...Jonas Traub
 
Database Research at TU Berlin DIMA and DFKI IAM - USA Excursion Slides 2019
Database Research at TU Berlin DIMA and DFKI IAM - USA Excursion Slides 2019Database Research at TU Berlin DIMA and DFKI IAM - USA Excursion Slides 2019
Database Research at TU Berlin DIMA and DFKI IAM - USA Excursion Slides 2019Jonas Traub
 
Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)
Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)
Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)Jonas Traub
 
Resense: Transparent Record and Replay of Sensor Data in the Internet of Thin...
Resense: Transparent Record and Replay of Sensor Data in the Internet of Thin...Resense: Transparent Record and Replay of Sensor Data in the Internet of Thin...
Resense: Transparent Record and Replay of Sensor Data in the Internet of Thin...Jonas Traub
 
Flink Forward 2018: Efficient Window Aggregation with Stream Slicing
Flink Forward 2018: Efficient Window Aggregation with Stream SlicingFlink Forward 2018: Efficient Window Aggregation with Stream Slicing
Flink Forward 2018: Efficient Window Aggregation with Stream SlicingJonas Traub
 
Scotty: Efficient Window Aggregation for Out-of-Order Stream Processing
Scotty: Efficient Window Aggregation for Out-of-Order Stream ProcessingScotty: Efficient Window Aggregation for Out-of-Order Stream Processing
Scotty: Efficient Window Aggregation for Out-of-Order Stream ProcessingJonas Traub
 
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...Jonas Traub
 
UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S...
UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S...UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S...
UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S...Jonas Traub
 
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...Jonas Traub
 
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...Jonas Traub
 
I²: Interactive Real-Time Visualization for Streaming Data
I²: Interactive Real-Time Visualization for Streaming DataI²: Interactive Real-Time Visualization for Streaming Data
I²: Interactive Real-Time Visualization for Streaming DataJonas Traub
 
LWA 2015: The Apache Flink Platform (Poster)
LWA 2015: The Apache Flink Platform (Poster)LWA 2015: The Apache Flink Platform (Poster)
LWA 2015: The Apache Flink Platform (Poster)Jonas Traub
 
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream AnalysisLWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream AnalysisJonas Traub
 

More from Jonas Traub (17)

Definitely not Java! A Hands-on Introduction to Efficient Functional Programm...
Definitely not Java! A Hands-on Introduction to Efficient Functional Programm...Definitely not Java! A Hands-on Introduction to Efficient Functional Programm...
Definitely not Java! A Hands-on Introduction to Efficient Functional Programm...
 
Efficient Data Stream Processing in the Internet of Things - SoftwareCampus A...
Efficient Data Stream Processing in the Internet of Things - SoftwareCampus A...Efficient Data Stream Processing in the Internet of Things - SoftwareCampus A...
Efficient Data Stream Processing in the Internet of Things - SoftwareCampus A...
 
code.talks 2019 - Scotty: Efficient Window Aggregation for your Stream Proces...
code.talks 2019 - Scotty: Efficient Window Aggregation for your Stream Proces...code.talks 2019 - Scotty: Efficient Window Aggregation for your Stream Proces...
code.talks 2019 - Scotty: Efficient Window Aggregation for your Stream Proces...
 
FlinkForward Berlin 2019 - Scotty: Efficient Window Aggregation with General ...
FlinkForward Berlin 2019 - Scotty: Efficient Window Aggregation with General ...FlinkForward Berlin 2019 - Scotty: Efficient Window Aggregation with General ...
FlinkForward Berlin 2019 - Scotty: Efficient Window Aggregation with General ...
 
Analyzing Efficient Stream Processing on Modern Hardware (VLDB 2019 Presentat...
Analyzing Efficient Stream Processing on Modern Hardware (VLDB 2019 Presentat...Analyzing Efficient Stream Processing on Modern Hardware (VLDB 2019 Presentat...
Analyzing Efficient Stream Processing on Modern Hardware (VLDB 2019 Presentat...
 
Database Research at TU Berlin DIMA and DFKI IAM - USA Excursion Slides 2019
Database Research at TU Berlin DIMA and DFKI IAM - USA Excursion Slides 2019Database Research at TU Berlin DIMA and DFKI IAM - USA Excursion Slides 2019
Database Research at TU Berlin DIMA and DFKI IAM - USA Excursion Slides 2019
 
Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)
Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)
Efficient Window Aggregation with General Stream Slicing (EDBT 2019, Best Paper)
 
Resense: Transparent Record and Replay of Sensor Data in the Internet of Thin...
Resense: Transparent Record and Replay of Sensor Data in the Internet of Thin...Resense: Transparent Record and Replay of Sensor Data in the Internet of Thin...
Resense: Transparent Record and Replay of Sensor Data in the Internet of Thin...
 
Flink Forward 2018: Efficient Window Aggregation with Stream Slicing
Flink Forward 2018: Efficient Window Aggregation with Stream SlicingFlink Forward 2018: Efficient Window Aggregation with Stream Slicing
Flink Forward 2018: Efficient Window Aggregation with Stream Slicing
 
Scotty: Efficient Window Aggregation for Out-of-Order Stream Processing
Scotty: Efficient Window Aggregation for Out-of-Order Stream ProcessingScotty: Efficient Window Aggregation for Out-of-Order Stream Processing
Scotty: Efficient Window Aggregation for Out-of-Order Stream Processing
 
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...
Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive W...
 
UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S...
UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S...UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S...
UZH Stream Reasoning Workshop 2018: Optimized On-Demand Data Streaming from S...
 
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of ...
 
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
 
I²: Interactive Real-Time Visualization for Streaming Data
I²: Interactive Real-Time Visualization for Streaming DataI²: Interactive Real-Time Visualization for Streaming Data
I²: Interactive Real-Time Visualization for Streaming Data
 
LWA 2015: The Apache Flink Platform (Poster)
LWA 2015: The Apache Flink Platform (Poster)LWA 2015: The Apache Flink Platform (Poster)
LWA 2015: The Apache Flink Platform (Poster)
 
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream AnalysisLWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis
 

Recently uploaded

GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesNeo4j
 
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...OnePlan Solutions
 
how-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfhow-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfMehmet Akar
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Gáspár Nagy
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationHelp Desk Migration
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems ApproachNeo4j
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Soroosh Khodami
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1KnowledgeSeed
 
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024Primacy Infotech
 
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfMicrosoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfQ-Advise
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfsteffenkarlsson2
 
CompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfCompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfFurqanuddin10
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfmbmh111980
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAlluxio, Inc.
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024Shane Coughlan
 
The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionWave PLM
 
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfVictor Lopez
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabbereGrabber
 
SQL Injection Introduction and Prevention
SQL Injection Introduction and PreventionSQL Injection Introduction and Prevention
SQL Injection Introduction and PreventionMohammed Fazuluddin
 

Recently uploaded (20)

GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
 
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
 
how-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfhow-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdf
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
 
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfMicrosoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
 
CompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfCompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdf
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion Production
 
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
SQL Injection Introduction and Prevention
SQL Injection Introduction and PreventionSQL Injection Introduction and Prevention
SQL Injection Introduction and Prevention
 

Efficient SIMD Vectorization for Hashing in OpenCL

  • 1. Abstract Performance on Xeon CPU Performance on Xeon Phi Take Home Vectorized Data Movement Primitives Funding Acknowledgements This work was funded by the EU projects SAGE (671500) and E2Data (780245), DFG Stratosphere (606902), and the German Ministry for Education and Research as BBDC (01IS14013A) and Software Campus (01IS12056). Open Source Repository Tobias Behrens 1 , Viktor Rosenfeld 1 , Jonas Traub 2 , Sebastian Breß 1,2 , Volker Markl 1,2 Efficient SIMD Vectorization for Hashing in OpenCL 1 firstname.lastname@dfki.de 2 firstname.lastname@tu-berlin.de Hashing is at the core of many efficient database operators such as hash-based joins and aggregations. Significant speedup was shown for vectorized hash table operations using processor specific low-level intrinsics. We present portable and vectorized hashing primitives using the parallel programming framework OpenCL. ●Scalar (C) Scalar (OpenCL) Vectorized (Intrinsics) Vectorized (OpenCL) ProbeBuild L1 L2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1 2 0.0 2.5 5.0 7.5 10.0 4 kB 16 kB 64 kB 256 kB 1 M B 4 M B 16 M B 64 M B Hash Table Size GB/Second L1 L2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.00 0.05 0.10 0.15 0.20 0.0 0.4 0.8 1.2 1.6 4 kB 16 kB 64 kB 256 kB 1 M B 4 M B 16 M B 64 M B Hash Table Size Bil.Keys/Second ●Scalar (C) Scalar (OpenCL) Vectorized (Intrinsics) Vectorized (OpenCL) Gather OpenCL ① Build ② Probe 1 / / t abl es : Bucket ∗table , uint 8 mask 2 / / Output : uint8 vect or 3 vect or . s0 = t abl e [mask . s0 ]. key ; 4 vect or . s1 = t abl e [mask . s1 ]. key ; 5 / / . . . up to vect or . s7 = t abl e [mask . s7 ]. key Probe Table Hash Table Result Table hash(vector) equal(vector, h_table)? Hash Table Hash Table hash(vector) Build Table free(h_table)? 1 / / I nputs : ui nt 64 t ∗ table , 128i index 2 / / Output : m256i res 3 m128i index R = mm shuffl e epi 32( index , MM SHUFFLE ( 1 , 0, 3, 2) ) ; 4 m128i i12 = mm cvt epi 32 epi 64 ( index ) ; 5 m128i i34 = mm cvt epi 32 epi 64 ( index R ) ; 6 si ze t i 1 = mm cvt si 128 si 64 ( i12) ; 7 si ze t i 3 = mm cvt si 128 si 64 ( i34) ; 8 m128i d1 = mm l oadl epi 64 ( ( m128i ∗)&t abl e [ i 1 ]) ; 9 m128i d3 = mm l oadl epi 64 ( ( m128i ∗)&t abl e [ i 3 ]) ; 10 i12 = mm sr l i si 128 ( i12 , 8) ; 11 i34 = mm sr l i si 128 ( i34 , 8) ; 12 si ze t i 2 = mm cvt si 128 si 64 ( i12) ; 13 si ze t i 4 = mm cvt si 128 si 64 ( i34) ; 14 m128i d2 = mm l oadl epi 64 ( ( m128i ∗)&t abl e [ i 2 ]) ; 15 m128i d4 = mm l oadl epi 64 ( ( m128i ∗)&t abl e [ i 4 ]) ; 16 m256i d12 = mm256 castsi128 si256( mm unpacklo epi64( d1, d2) ) ; 17 m256i d34 = mm256 castsi128 si256( mm unpacklo epi64( d3, d4) ) ; 18 m256i res = mm256 permute2x128 si256( d12 , d34 , MM SHUFFLE ( 0 , 2, 0, 0) ) ; Gather Intel Intrinsics vs. efficient enough? OpenCL Processor Specific Intrinsic Y Processor Specific Intrinsic X Result Table Selective Store Copies data from selectable SIMD lanes to contiguous memory. Probe / Build Table Selective Load Copies data from contiguous memory into selectable SIMD lanes. Hash Table Gather Copies data from discontiguous memory into SIMD lanes. Hash Table Scatter Copies data from SIMD lanes into discontiguous memory. L1 L2 L3 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.5 1.0 1.5 2.0 0 2 4 6 84 kB 16 kB 64 kB 256 kB 1 M B 4 M B 16 M B 64 M B Hash Table Size GB/Second Probe L1 L2 L3 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.5 1.0 1.5 2.0 0 4 8 12 16 4 kB 16 kB 64 kB 256 kB 1 M B 4 M B 16 M B 64 M B Hash Table Size Bil.Keys/Second Without call overhead Build Portable and Maintainable Code Build is overhead dominated, OpenCL-based probe outperforms scalar implementation. Intrinsic-based implementation outperforms OpenCL-based on Xeon Phi. OpenCL reduces code complexity and ensures portability of vectorized primitives. OpenCL-based vectorized hashing outperforms scalar hashing on Xeon CPU. Processor specific intrinsics are still faster, especially on Xeon Phi. github.com/TU-Berlin-DIMA/OpenCL-SIMD-hashing