Revisiting Co-Processing for Hash Joins
on the Coupled CPU-GPU Architecture

By
Jiong He, Mian Lu, Bingsheng He

Presented by
Mohamed Ragab Moawad
Umm... GPU??
GPU: Graphics Processing Unit
CPU: Central Processing Unit
The GPU complements the CPU, which performs general-purpose
processing, by handling graphics calculations more efficiently.
GPUs can also accelerate video transcoding, image processing, and
other complex computations through GPGPU (general-purpose computing
on graphics processing units).
E.g., CUDA (Compute Unified Device Architecture).
THEN AND NOW
Intel ASCI Red/9632: 2,379 GFLOPS
- Fastest supercomputer (world), 1999
PARAM PADMA: 1,000 GFLOPS
- Fastest supercomputer (India), 2003
NVIDIA GTX 780 Ti: 5,046 GFLOPS
- Fastest GPU (consumer), 2013
GFLOPS: giga floating-point operations per second,
a measure of computer performance.
Memory is the boss? NO!

                    GTX 680 (2 GB)      GT 430 (4 GB)
Memory              2 GB                4 GB
Price               Rs. 41,195          Rs. 4,520
Computation power   3,090 GFLOPS        269 GFLOPS
GPU VS CPU
• A GPU is tailored for highly parallel operation, while a CPU
executes programs serially.
• GPUs have significantly faster and more advanced memory
interfaces, as they need to move much more data than CPUs.
• The CPU is optimized for sequential code performance.
• The GPU is specialized for compute-intensive, highly parallel
computation.
• The GPU has evolved into a highly parallel, multithreaded,
many-core processor with very high computational horsepower and
very high memory bandwidth.
CPU-GPU ARCHITECTURES
There are two CPU-GPU architectures.
DISCRETE CPU-GPU ARCH.
• The older model.
• The GPU is usually connected to the CPU over a PCI-e bus.
PCI-e
PCI Express (Peripheral Component Interconnect Express), officially
abbreviated as PCI-e, is a high-speed serial computer expansion bus
standard designed to replace the older PCI, PCI-X, and AGP bus
standards. PCI-e offers numerous improvements over those standards,
including higher maximum system bus throughput, lower I/O pin count,
smaller physical footprint, and better performance scaling for bus
devices.
PROBLEM!!!
• The relatively low bandwidth and high latency of the PCI-e bus are
a common bottleneck.
• Many hardware vendors have therefore attempted to remove this
overhead with new architectures, such as the coupled CPU-GPU
architecture.
COUPLED CPU-GPU ARCH.
• The CPU and the GPU are integrated into a single chip, avoiding
the costly data transfer over the PCI-e bus.
• Examples: AMD APU, Intel Ivy Bridge (2012).
These new heterogeneous architectures potentially open up new
optimization opportunities for GPU query co-processing.
There are three types of query co-processing parallelism:
1. Fine-grained
2. Coarse-grained
3. Embarrassing
FINE-GRAINED, COARSE-GRAINED, AND EMBARRASSING
PARALLELISM:
Applications are often classified according to how often their
subtasks need to synchronize or communicate with each other.
An application exhibits fine-grained parallelism if its subtasks
must communicate many times per second; it exhibits coarse-grained
parallelism if they do not communicate many times per second; and it
is embarrassingly parallel if they rarely or never have to
communicate. Embarrassingly parallel applications are considered the
easiest to parallelize.
SO...
- On the discrete CPU-GPU architecture, coarse-grained co-processing
is preferred, to reduce the data transfer over the PCI-e bus;
moreover, the GPU and the CPU have their own memory controllers and
caches.
- On the coupled CPU-GPU architecture, fine-grained co-processing
becomes feasible.
OPEN-CL
Open Computing Language (OpenCL) is a framework for writing programs
that execute across heterogeneous platforms consisting of central
processing units (CPUs) and graphics processing units (GPUs).

The advantage of OpenCL is that the same OpenCL code can run on both
the CPU and the GPU without modification.

Previous studies have shown that OpenCL implementations achieve
performance very close to implementations in native languages such
as CUDA and OpenMP on the GPU and the CPU, respectively.

OpenCL can give an application access to a graphics processing unit
for non-graphical computing (general-purpose computing on graphics
processing units).
GPGPU
General-purpose computing on graphics processing units (GPGPU,
rarely GPGP or GP²U) is the use of a graphics processing unit (GPU),
which typically handles computation only for computer graphics, to
perform computation in applications traditionally handled by the
central processing unit (CPU). Any GPU providing a functionally
complete set of operations on arbitrary bits can compute any
computable value. Additionally, using multiple graphics cards in one
computer, or large numbers of graphics chips, further parallelizes
the already parallel nature of graphics processing.

OpenCL is the currently dominant open general-purpose GPU computing
language. The dominant proprietary framework is NVIDIA's CUDA.
HASH JOIN CO-PROCESSING
On the coupled architecture, co-processing should be fine-grained,
with the workloads carefully scheduled between the CPU and the GPU.

Moreover, we need to consider memory-specific optimizations for the
shared cache architecture and memory systems exposed by OpenCL.
ARCHITECTURE-AWARE HASH JOINS
Hash joins are considered the most efficient join algorithm for
main-memory databases.
There are two main types of hash joins:
1. Simple Hash Join (SHJ).
2. Partitioned Hash Join (PHJ).
FINE-GRAINED STEPS IN HASH JOINS
A hash join operator works on two input relations, R and S. We
assume that |R| < |S|. A typical hash join algorithm has three
phases: partition, build, and probe. The partition phase is
optional; the simple hash join does not have one.
In SHJ, the build phase constructs an in-memory hash table for R.
Then, in the probe phase, for each tuple in S, it looks up the hash
table for matching entries. The build and probe phases are each
divided into four steps: b1 to b4 and p1 to p4, respectively.

A hash table consists of an array of bucket headers. Each bucket
header contains two fields: the total number of tuples within that
bucket and a pointer to a key list. The key list contains all the
unique keys with the same hash value, each of which links to a rid
list storing the IDs of all tuples with the same key.
SHJ ALGORITHM:
Algorithm 1: Fine-grained steps in SHJ
/* build */
for each tuple in R do
  (b1) compute hash bucket number;
  (b2) visit the hash bucket header;
  (b3) visit the hash key lists and create a key header if necessary;
  (b4) insert the record id into the rid list;
/* probe */
for each tuple in S do
  (p1) compute hash bucket number;
  (p2) visit the hash bucket header;
  (p3) visit the hash key lists;
  (p4) visit the matching build tuple to compare keys and produce the
       output tuple;
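As an illustration only (not the paper's OpenCL implementation), the fine-grained steps b1-b4 and p1-p4 can be sketched in Python; the bucket layout and the `N_BUCKETS` value are our own assumptions:

```python
# Simple Hash Join (SHJ) sketch: build a hash table on R, probe with S.
# Each bucket header holds a tuple count and a key list; each key entry
# links a rid list of row IDs sharing that key.

N_BUCKETS = 4  # illustrative; real systems size this to the cache

def shj(R, S):
    """R and S are lists of (key, rid) pairs; returns matching rid pairs."""
    buckets = [{"count": 0, "keys": {}} for _ in range(N_BUCKETS)]
    # build phase
    for key, rid in R:
        b = hash(key) % N_BUCKETS                      # (b1) bucket number
        header = buckets[b]                            # (b2) bucket header
        rid_list = header["keys"].setdefault(key, [])  # (b3) key list
        rid_list.append(rid)                           # (b4) insert rid
        header["count"] += 1
    # probe phase
    out = []
    for key, rid in S:
        b = hash(key) % N_BUCKETS                      # (p1) bucket number
        header = buckets[b]                            # (p2) bucket header
        for r_rid in header["keys"].get(key, []):      # (p3) key list
            out.append((r_rid, rid))                   # (p4) output tuple
    return out
```

Each numbered comment maps one line of the sketch back to the corresponding step in Algorithm 1.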
PHJ ALGORITHM:
Main procedure for PHJ:
/* Partitioning: perform multiple passes if necessary */
Partition(R);
Partition(S);
/* Apply SHJ on each partition pair */
for each partition pair Ri and Si do
  Apply SHJ on Ri and Si;

Procedure Partition(R):
for each tuple in R do
  (n1) compute partition number;
  (n2) visit the partition header;
  (n3) insert the <key, rid> pair into the partition;
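A minimal Python sketch of PHJ, assuming a single partitioning pass and a modulo partitioning function (the helper names `partition` and `phj` are ours, not from the paper):

```python
# Partitioned Hash Join (PHJ) sketch: partition both inputs by key,
# then run a simple hash join on each partition pair.

N_PARTS = 4  # illustrative partition fan-out

def partition(rel):
    """rel is a list of (key, rid) pairs; returns N_PARTS partitions."""
    parts = [[] for _ in range(N_PARTS)]
    for key, rid in rel:
        p = key % N_PARTS            # (n1) compute partition number
        parts[p].append((key, rid))  # (n2)/(n3) insert <key, rid>
    return parts

def phj(R, S):
    out = []
    # apply a dict-based simple hash join on each partition pair
    for Ri, Si in zip(partition(R), partition(S)):
        table = {}
        for key, rid in Ri:                  # build on the R partition
            table.setdefault(key, []).append(rid)
        for key, rid in Si:                  # probe with the S partition
            for r_rid in table.get(key, []):
                out.append((r_rid, rid))
    return out
```

Because matching keys always land in the same partition, joining the pairs independently yields the same result as a single global join.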
REVISITING CO-PROCESSING MECHANISMS
Off-loading (OL):
OL was proposed to off-load heavy operators, such as joins, to the
GPU while other operators in the query remain on the CPU.
The basic idea of OL on a step series is that the GPU acts as a
powerful massively parallel query co-processor, and each step is
evaluated entirely by either the GPU or the CPU.
Query processing continues on the CPU until the off-loaded
computation completes on the GPU, and vice versa. That is, given a
step series s1, ..., sn, we need to decide whether each si is
performed on the CPU or the GPU.
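The OL idea can be sketched as follows; the step functions and the CPU/GPU plan here are purely illustrative, since a real system would choose the plan with a cost model:

```python
# Off-loading (OL) sketch: each step in a step series runs entirely on
# one processor; while one processor works, the other waits.

def run_series(steps, plan, data):
    """steps: list of functions; plan: 'cpu' or 'gpu' per step.
    Returns the final value and the device trace."""
    trace = []
    for step, device in zip(steps, plan):
        trace.append(device)   # the other processor idles meanwhile
        data = step(data)      # the whole step runs on one device
    return data, trace

# hypothetical 3-step series s1..s3 and an assumed assignment
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
result, trace = run_series(steps, ["gpu", "gpu", "cpu"], 5)
```

The trace makes the OL weakness visible: at any moment exactly one device is busy, which motivates the data-dividing scheme below.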
REVISITING CO-PROCESSING MECHANISMS
Data dividing (DD):
Problem: OL can under-utilize the CPU while the off-loaded
computations are being executed on the GPU, and vice versa.
Moreover, as the performance gap between the GPU and the CPU on the
coupled architecture is smaller than on discrete architectures, we
need to keep both the CPU and the GPU busy to further improve
performance.
Solution: We can model the CPU and the GPU as two independent
processors; the problem is then to schedule the workload between
them. This problem has its roots in parallel query processing. One
of the most commonly used schemes is to partition the input data
among processors, perform parallel query processing on individual
processors, and merge the partial results into the final result. We
adopt this scheme as the data-dividing co-processing scheme (DD) on
the coupled architecture.
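A minimal sketch of DD, assuming a hypothetical 60/40 split ratio and a per-element operation standing in for the join work:

```python
# Data dividing (DD) sketch: split the input between two "processors",
# process each share independently, and merge the partial results.
# The 0.6 ratio is an assumed tuning knob, not a value from the paper.

def dd_process(data, ratio=0.6, op=lambda x: x * x):
    """Partition data by ratio, apply op to each share, merge results."""
    cut = int(len(data) * ratio)
    cpu_part, gpu_part = data[:cut], data[cut:]
    cpu_result = [op(x) for x in cpu_part]  # would run on the CPU
    gpu_result = [op(x) for x in gpu_part]  # would run concurrently on the GPU
    return cpu_result + gpu_result          # merge partial results
```

In a real system the ratio would be tuned to the relative speeds of the two devices so that both finish at about the same time.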
PIPELINED EXECUTION (PL)
To address the limitations of OL and DD, we consider fine-grained
workload scheduling between the CPU and the GPU so that we can
capture their performance differences on the same workload. For
example, the GPU is much more efficient than the CPU on b1 and p1,
whereas b3 and p3 are more efficient on the CPU. Meanwhile, we
should keep both processors busy. Therefore, we leverage the concept
of pipelined execution and develop an adaptive fine-grained
co-processing scheme that maximizes the efficiency of co-processing
on the coupled architecture.
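A simplified sketch of the PL scheduling idea, using the observation above that b1 favors the GPU and b3 the CPU; the affinity table and the per-batch step list are our own assumptions, not the paper's adaptive scheduler:

```python
# Pipelined execution (PL) sketch: route each fine-grained step of each
# batch to the processor that handles that step best. While the GPU runs
# b1/b2 of one batch, the CPU can run b3/b4 of an earlier batch.

STEP_AFFINITY = {"b1": "gpu", "b2": "gpu", "b3": "cpu", "b4": "cpu"}

def schedule(batches):
    """Assign each (batch, step) work unit to a processor by affinity."""
    assignments = []
    for batch in batches:
        for step in ["b1", "b2", "b3", "b4"]:
            assignments.append((batch, step, STEP_AFFINITY[step]))
    return assignments
```

A real PL scheduler would measure step costs at run time and adapt the affinity table instead of fixing it statically.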
EVALUATIONS ON DISCRETE ARCHITECTURES
END

Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 

Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture

  • 1. Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture. By Jiong He, Mian Lu, Bingsheng He. Presented by Mohamed Ragab Moawad.
  • 2. Umm.. GPU?? GPU: Graphics Processing Unit. CPU: Central Processing Unit. The GPU complements the CPU, which performs general processing, by handling graphics calculations more efficiently. It can also accelerate video transcoding, image processing, and other complex computations through GPGPU (general-purpose computing on graphics processing units), e.g. with CUDA (Compute Unified Device Architecture).
  • 3. THEN AND NOW Intel ASCI Red/9632: 2,379 GFLOPS - fastest supercomputer (world), 1999. PARAM PADMA: 1,000 GFLOPS - fastest supercomputer (India), 2003. NVIDIA GTX 780 Ti: 5,046 GFLOPS - fastest GPU, 2013. GFLOPS: giga floating-point operations per second, a measure of computer performance.
  • 4. Memory is the boss? NO! GTX 680: 2 GB memory, price Rs. 41,195, computation power 3,090 GFLOPS. GT 430: 4 GB memory, price Rs. 4,520, computation power 269 GFLOPS.
  • 5. GPU VS CPU • A GPU is tailored for highly parallel operation, while a CPU executes programs serially. • GPUs have significantly faster and more advanced memory interfaces, as they need to move much more data than CPUs. • The CPU is optimized for sequential code performance. • The GPU is specialized for compute-intensive, highly parallel computation. • The GPU has evolved into a highly parallel, multithreaded, many-core processor with very high computational horsepower and very high memory bandwidth.
  • 6. CPU-GPU ARCHITECTURES There are two CPU-GPU architectures: discrete and coupled.
  • 7. DISCRETE CPU-GPU ARCH. • The older model. • The GPU is connected to the CPU via a PCI-e bus.
  • 8. PCI-e PCI Express (Peripheral Component Interconnect Express), abbreviated as PCI-e, is a high-speed serial computer expansion bus standard designed to replace the older PCI, PCI-X, and AGP bus standards. PCI-e offers numerous improvements over those standards, including higher maximum system bus throughput, lower I/O pin count, smaller physical footprint, and better performance scaling for bus devices.
  • 9. PROBLEM!!! • The relatively low bandwidth and high latency of the PCI-e bus are usually the bottleneck, so many hardware vendors have attempted to remove this overhead with new architectures, such as the coupled CPU-GPU architecture.
  • 10. COUPLED CPU-GPU ARCH. • The CPU and the GPU are integrated into a single chip, avoiding the costly data transfer over the PCI-e bus. Examples: AMD APU, Intel Ivy Bridge (2012).
  • 11. These new heterogeneous architectures potentially open up new optimization opportunities for GPU query co-processing. Query co-processing can exploit several granularities of parallelism: 1. fine-grained 2. coarse-grained 3. embarrassingly parallel
  • 12. FINE-GRAINED, COARSE-GRAINED, AND EMBARRASSING PARALLELISM: Applications are often classified according to how often their subtasks need to synchronize or communicate with each other. An application exhibits fine-grained parallelism if its subtasks must communicate many times per second; it exhibits coarse-grained parallelism if they do not communicate many times per second; and it is embarrassingly parallel if they rarely or never have to communicate. Embarrassingly parallel applications are considered the easiest to parallelize.
  • 13. SO…. - On the discrete CPU-GPU architecture, coarse-grained co-processing is preferred, to reduce data transfer over the PCI-e bus; moreover, the GPU and the CPU have their own memory controllers and caches. - On the coupled CPU-GPU architecture, fine-grained co-processing becomes feasible.
  • 14. OPENCL Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs) and graphics processing units (GPUs).  The advantage of OpenCL is that the same OpenCL code can run on both the CPU and the GPU without modification. Previous studies have shown that OpenCL implementations achieve performance very close to that of native languages such as CUDA and OpenMP on the GPU and the CPU, respectively. OpenCL can give an application access to a graphics processing unit for non-graphical computing (general-purpose computing on graphics processing units).
  • 15. GPGPU General-purpose computing on graphics processing units (GPGPU, rarely GPGP or GP²U) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). Any GPU providing a functionally complete set of operations on arbitrary bits can compute any computable value. Additionally, using multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. OpenCL is the currently dominant open general-purpose GPU computing language; the dominant proprietary framework is Nvidia's CUDA.
  • 16. HASH JOIN CO-PROCESSING  On the coupled architecture, co-processing should be fine-grained, with the workloads carefully scheduled between the CPU and the GPU.  Moreover, we need to consider memory-specific optimizations for the shared cache architecture and the memory systems exposed by OpenCL.
  • 17. ARCHITECTURE-AWARE HASH JOINS Hash joins are considered the most efficient join algorithm for main-memory databases. There are two main types of hash join: 1. Simple Hash Join (SHJ). 2. Partitioned Hash Join (PHJ).
  • 18. FINE-GRAINED STEPS IN HASH JOINS A hash join operator works on two input relations, R and S; we assume that |R| < |S|. A typical hash join algorithm has three phases: partition, build, and probe. The partition phase is optional, and the simple hash join (SHJ) does not have one. In SHJ, the build phase constructs an in-memory hash table for R. Then, in the probe phase, for each tuple in S, it looks up the hash table for matching entries. The build and probe phases are each divided into four steps, b1 to b4 and p1 to p4, respectively. A hash table consists of an array of bucket headers. Each bucket header contains two fields: the total number of tuples within that bucket and a pointer to a key list. The key list contains all the unique keys with the same hash value, each of which links to a rid list storing the record IDs of all tuples with that key.
  • 19. SHJ ALGORITHM: Algorithm 1 Fine-grained steps in SHJ
    /* build */
    for each tuple in R do
      (b1) compute hash bucket number;
      (b2) visit the hash bucket header;
      (b3) visit the hash key lists and create a key header if necessary;
      (b4) insert the record id into the rid list;
    /* probe */
    for each tuple in S do
      (p1) compute hash bucket number;
      (p2) visit the hash bucket header;
      (p3) visit the hash key lists;
      (p4) visit the matching build tuple to compare keys and produce the output tuple;
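The build steps (b1-b4) and probe steps (p1-p4) above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the bucket count and all names are our assumptions, and for brevity each bucket is a plain dict from key to rid list rather than a separate counter plus key-list pointer.

```python
# Minimal simple hash join (SHJ) sketch: a hash table of buckets,
# each bucket mapping a key to the list of record ids (rids) sharing it.

N_BUCKETS = 8  # illustrative bucket count

def build(R):
    """Build phase: construct an in-memory hash table for relation R."""
    table = [dict() for _ in range(N_BUCKETS)]   # bucket -> {key: rid list}
    for rid, key in enumerate(R):
        b = hash(key) % N_BUCKETS                # (b1) compute hash bucket number
        bucket = table[b]                        # (b2) visit the bucket header
        rids = bucket.setdefault(key, [])        # (b3) visit/create the key entry
        rids.append(rid)                         # (b4) insert rid into the rid list
    return table

def probe(table, S):
    """Probe phase: for each tuple in S, look up matching R tuples."""
    out = []
    for sid, key in enumerate(S):
        b = hash(key) % N_BUCKETS                # (p1) compute hash bucket number
        bucket = table[b]                        # (p2) visit the bucket header
        for rid in bucket.get(key, []):          # (p3) visit the key list
            out.append((rid, sid))               # (p4) match and produce output
    return out
```

Joining R = [1, 2, 3] with S = [2, 3, 3, 4] on equality yields the (rid, sid) pairs (1, 0), (2, 1), and (2, 2).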
  • 20. PHJ ALGORITHM: Main procedure for PHJ:
    /* Partitioning: perform multiple passes if necessary */
    Partition(R);
    Partition(S);
    /* Apply SHJ on each partition pair */
    for each partition pair Ri and Si do
      apply SHJ on Ri and Si;
    Procedure Partition(R):
    for each tuple in R do
      (n1) compute partition number;
      (n2) visit the partition header;
      (n3) insert the <key, rid> pair into the partition;
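The partitioning steps (n1-n3) followed by a per-partition simple hash join can be sketched as below. Again a single-pass illustrative sketch under our own naming, not the paper's code; a real PHJ would perform multiple partitioning passes when partitions exceed cache size.

```python
# Minimal partitioned hash join (PHJ) sketch: one partitioning pass,
# then a simple hash join on each co-partition pair.

N_PART = 4  # illustrative partition count

def partition(rel):
    """Steps n1-n3: scatter <key, rid> pairs into partitions by hash."""
    parts = [[] for _ in range(N_PART)]
    for rid, key in enumerate(rel):
        p = hash(key) % N_PART          # (n1) compute partition number
        parts[p].append((key, rid))     # (n2)/(n3) visit header, insert <key, rid>
    return parts

def phj(R, S):
    """Partition both relations, then join each partition pair."""
    out = []
    for Ri, Si in zip(partition(R), partition(S)):
        # Apply SHJ on the pair: build on Ri, probe with Si.
        table = {}
        for key, rid in Ri:
            table.setdefault(key, []).append(rid)
        for key, sid in Si:
            for rid in table.get(key, []):
                out.append((rid, sid))
    return out
```

Because co-partitions share hash values, only matching pairs ever meet: phj([1, 2, 3], [2, 3, 3, 4]) produces the same matches as the SHJ sketch.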
  • 21. REVISITING CO-PROCESSING MECHANISMS Off-loading (OL): OL was proposed to off-load heavy operators, such as joins, to the GPU while the other operators in the query remain on the CPU. The basic idea of OL on a step series is that the GPU is treated as a powerful massively parallel query co-processor, and each step is evaluated entirely by either the GPU or the CPU. Query processing continues on the CPU until the off-loaded computation completes on the GPU, and vice versa. That is, given a step series s1, ..., sn, we need to decide whether each si is performed on the CPU or the GPU.
  • 22. REVISITING CO-PROCESSING MECHANISMS Data dividing (DD): Problem: OL can under-utilize the CPU while the off-loaded computations are executing on the GPU, and vice versa. Moreover, as the performance gap between the GPU and the CPU on the coupled architecture is smaller than on discrete architectures, we need to keep both the CPU and the GPU busy to further improve performance. So: we can model the CPU and the GPU as two independent processors, and the problem becomes scheduling the workload between them. This problem has its roots in parallel query processing. One of the most commonly used schemes is to partition the input data among the processors, perform the query on each processor in parallel, and merge the partial results from the individual processors into the final result. We adopt this scheme as the data-dividing co-processing scheme (DD) on the coupled architecture.
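The partition/process/merge scheme DD adopts can be sketched as follows. All names here are our illustrative assumptions: `run_join` stands in for whichever join code each device runs, and the split ratio would be tuned to the devices' relative speeds.

```python
# Data-dividing (DD) sketch: split the probe input between the two
# devices by a tuning ratio, run the same join on each share, and
# merge the partial results.

def data_divide(S, gpu_ratio):
    """Split S so the GPU gets gpu_ratio of the tuples, the CPU the rest."""
    cut = int(len(S) * gpu_ratio)
    return S[:cut], S[cut:]

def dd_join(run_join, R, S, gpu_ratio=0.5):
    """Run run_join on each share independently, then merge results."""
    S_gpu, S_cpu = data_divide(S, gpu_ratio)
    part_gpu = run_join(R, S_gpu)    # would execute on the GPU
    part_cpu = run_join(R, S_cpu)    # would execute on the CPU
    return part_gpu + part_cpu       # merge the partial results
```

Because the shares are disjoint and the join is applied unchanged to each, concatenating the partial results reproduces the full join output.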
  • 23. PIPELINED EXECUTION (PL). To address the limitations of OL and DD, we consider fine-grained workload scheduling between the CPU and the GPU, so that we can capture their performance differences in processing the same workload. For example, the GPU is much more efficient than the CPU on b1 and p1, whereas b3 and p3 run more efficiently on the CPU. Meanwhile, we should keep both processors busy. Therefore, we leverage the concept of pipelined execution and develop an adaptive fine-grained co-processing scheme that maximizes the efficiency of co-processing on the coupled architecture.
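The per-step workload split that PL relies on can be sketched as a speed-proportional division; the function name and the speed figures below are illustrative assumptions rather than the paper's adaptive scheduler, which tunes these shares at run time.

```python
# Pipelined-execution (PL) sketch: unlike OL (whole steps on one
# device) or DD (one global split), each step's tuples are divided by
# that step's measured CPU/GPU speed ratio, keeping both processors
# busy while each step runs mostly where it is faster.

def pl_split(n_tuples, cpu_speed, gpu_speed):
    """Give each device a share of this step proportional to its speed."""
    gpu_share = gpu_speed / (cpu_speed + gpu_speed)
    n_gpu = round(n_tuples * gpu_share)
    return n_gpu, n_tuples - n_gpu

# e.g. hashing (b1/p1) is GPU-friendly, key-list traversal (b3/p3)
# CPU-friendly, so the shares flip between steps (made-up speeds):
#   pl_split(1000, cpu_speed=1, gpu_speed=4)  -> (800, 200)
#   pl_split(1000, cpu_speed=3, gpu_speed=1)  -> (250, 750)
```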
  • 24. EVALUATIONS ON DISCRETE ARCHITECTURES
  • 26. END