This document summarizes research on adapting CPU cache optimization techniques for general-purpose graphics processing units (GPGPUs). It first reviews related work on CPU and GPGPU cache architectures and optimization techniques, then presents the conceptual design for selecting CPU techniques and analyzing their adaptation to GPGPUs. Six common CPU techniques (stride-one access, blocking, loop fusion, array padding, array merging, and array transpose) are adapted, and experimental results show their effectiveness on a GPGPU, with blocking providing better performance than non-blocking approaches. The research contributes techniques programmers can use to optimize GPGPU cache performance.
Cache Optimization Techniques for General-Purpose Graphics Processing Units
1. Cache Optimization Techniques for General-Purpose Graphics Processing Units
D.R.V.L.B. Thambawita
Supervised By
Dr. Roshan G. Ragel and Dr. Dhammika Elkaduwe
Department of Computer Engineering
Faculty of Engineering
University of Peradeniya
2. What is this GPU? Is it important?
Figure: CPU vs GPGPU (example application domains: AI, fluid dynamics)
5. Why this research?
How to optimize CPU cache access at the programming stage? Use the many available CPU cache optimization techniques described in the literature.
How to optimize GPGPU cache access at the programming stage? Do we have comparable resources? Hardly any.
Our contribution:
Main: finding suitable cache optimization techniques for GPGPUs (on the programmer's side)
Sub: giving GPGPU architecture designers an idea of the application-level behavior of GPGPU caches
6. GPU configurable cache architecture
Figure: Cache memory hierarchy of CPUs and GPGPUs (Fermi architecture)
7. Outline
1 Related works
2 Conceptual level design
3 Selected CPU Cache Optimization Techniques
4 Experimental Setup
5 Adaptation Process + Results and Discussion
6 Findings and Conclusions
7 Case Study
Introduction - Aho-corasick algorithm
Results and Discussion
Conclusion about the case study
8 Publications
9 Q & A
8. Related works
Related works
J. L. Hennessy and D. A. Patterson, “Computer Architecture: A Quantitative Approach”. Morgan Kaufmann/Elsevier, 2012.
Identifying the main cache optimization techniques in computer architecture.
M. Kowarschik and C. Weiß, “An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms,” LNCS, vol. 2625, pp. 213-232, 2003.
Selecting basic cache optimization techniques.
CUDA Toolkit Documentation
Finding available GPGPU optimization techniques and gathering knowledge for the adaptation process.
C. H. Lin, C. H. Liu, L. S. Chien, and S. C. Chang, “Accelerating Pattern Matching Using a Novel Parallel Algorithm on GPUs,” IEEE Trans. Comput., vol. 62, no. 10, pp. 1906-1916, Oct. 2013.
Identifying a case study for our research.
9. Related works
Challenges!!!
Lack of information about GPGPU cache architecture
Complexity of SIMD architecture
No direct prior research on GPGPU cache optimization techniques for end users
15. Conceptual level design
Conceptual level design
Figure: Conceptual-level design flow: starting from the questions "GPGPU cache optimizations?" and "CPU cache optimizations?", we select CPU cache optimization techniques, analyze them, adapt them from CPU to GPU, analyze them again using the GPU, identify GPU cache optimizations, and validate them in a case study, eventually developing GPU cache optimizations.
16. Selected CPU Cache Optimization Techniques
Common end-user cache optimization techniques
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
19. Selected CPU Cache Optimization Techniques
GPU cache complexity in the adaptation process
SIMD execution model: warps, blocks, grids
Complex memory architecture: shared memory; configurable L1 and L2 (L1 is 16KB or 48KB, or can be disabled); texture memory
21. Experimental Setup
Experimental setups

Table: Intel Core(TM) i5-3230M CPU, 2.6 GHz, Ivy Bridge micro-architecture, 8GB RAM

            Cache size   Cache line size   Associativity   Description
L1 cache    32KB         64 bytes          8-way
L2 cache    256KB        64 bytes          8-way
L3 cache    3072KB       64 bytes          12-way          Shared memory

Table: Tesla C2075 GPGPU cache architecture, 6GB global memory

               Cache size     Cache line size        Associativity   Description
L1 cache       48KB / 16KB    128 bytes              Not mentioned   Can be disabled with the -Xptxas -dlcm=cg compile flag
Shared memory  16KB / 48KB    128 bytes              Not mentioned   Can be used manually
L2 cache       768KB          128 bytes / 32 bytes   Not mentioned   Unified cache
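The L1/shared-memory split in the table above is selected from host code. A minimal sketch, assuming the standard CUDA runtime cache-configuration API (myKernel is a placeholder name, not code from the slides):

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { /* placeholder kernel */ }

    void configureCaches(void)
    {
        // Device-wide preference: 48KB L1 / 16KB shared memory.
        cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

        // Per-kernel preference: 48KB shared memory / 16KB L1.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        // Disabling L1 for global loads is a compile-time choice instead:
        //   nvcc -Xptxas -dlcm=cg app.cu
    }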
22. Adaptation Process + Results and Discussion
One by one
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
23. Adaptation Process + Results and Discussion Stride-one access
Stride-one memory access
Figure: Non-stride access vs stride access of GPGPU
24. Adaptation Process + Results and Discussion Stride-one access
Adaptation - From CPU to GPGPU
Figure: Adaptation process on the CPU: loops, with the loop index as the changing parameter; the L1, L2, and L3 cache line size is 64 bytes.
25. Adaptation Process + Results and Discussion Stride-one access
Adaptation - From CPU to GPGPU
Figure: Adaptation process on the GPGPU: a kernel, with blockDim.x * blockIdx.x + threadIdx.x as the changing parameter; the L1 cache line size is 128 bytes and L2 transactions are 128 or 32 bytes.
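A minimal sketch of the kind of kernel this adaptation yields (our own micro-benchmark-style illustration, not the slides' code; array and parameter names are assumptions):

    // Each thread accesses an element "stride" elements away from its
    // neighbor's. stride == 1 gives coalesced, cache-line-friendly
    // accesses; larger strides spread one warp across many cache lines.
    __global__ void strideAccess(const float *in, float *out, int n, int stride)
    {
        int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i] + 1.0f;
    }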
26. Adaptation Process + Results and Discussion Stride-one access
Results: Effect of stride-one access on the CPU
Figure: Effect of the stride amount on the CPU (time [ms] vs. stride amount, two test runs), input size = 2867200 (710.9375MB)
Execution time increases continuously with the stride amount.
27. Adaptation Process + Results and Discussion Stride-one access
Results: Effects of stride-one access on the GPGPU
Figure: Stride access effect on the Fermi GPGPU (time [ms] vs. stride amount, two test runs), input size = 2867200, L1 = 16KB (default settings)
Execution time increases with the stride amount.
As on the CPU, the best performance occurs at a stride amount of 1.
The effect of the stride amount is comparatively small once the stride exceeds a cache line.
28. Adaptation Process + Results and Discussion Stride-one access
Results: Effects of stride-one access on the GPGPU
Figure: Stride access effect on the Fermi GPGPU, input size = 2867200, comparing disabled L1, 48KB L1, and 16KB L1
Disabling the L1 cache gives better performance for large stride amounts, because the L2 cache then provides a larger number of cache lines.
The large (48KB) L1 cache performs better than the small one because it holds more cache lines.
29. Adaptation Process + Results and Discussion Stride-one access
One by one
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
30. Adaptation Process + Results and Discussion Blocking technique
Blocking technique
Figure: Two different blocking techniques from two different sources. The first technique uses small blocks from the first matrix and large blocks from the second matrix; the second method uses equal-size blocks from both matrices.
32. Adaptation Process + Results and Discussion Blocking technique
Adaptation
Figure: Adaptation process
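A minimal sketch of the shared-memory tiling this adaptation leads to: the standard tiled matrix multiplication, assuming n is a multiple of TILE and a TILE x TILE thread-block shape (our own illustration, not the authors' exact code):

    #define TILE 16

    // Each thread block stages one TILE x TILE sub-block of A and B in
    // shared memory, so every element fetched from global memory is
    // reused TILE times from on-chip storage.
    __global__ void matMulTiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                  // tile fully loaded
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                  // safe to overwrite the tiles
        }
        C[row * n + col] = acc;
    }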
33. Adaptation Process + Results and Discussion Blocking technique
Results: Effects of blocking technique on the CPU
Figure: Effect of tiling on the CPU (time [s] vs. matrix size, 512X512 to 2048X2048), comparing the default method without tiling, the method from Computer Architecture: A Quantitative Approach, and a method equivalent to the GPGPU tiling method
The method equivalent to the GPGPU tiling method shows the best performance on the CPU as well.
34. Adaptation Process + Results and Discussion Blocking technique
Results: Effects of blocking technique on the GPGPU
Figure: Non-blocking vs. blocking on the GPGPU with L1 disabled (time [ms] vs. matrix size, 512X512 to 3072X3072)
The blocking technique shows better performance than the non-blocking technique.
35. Adaptation Process + Results and Discussion Blocking technique
Results: Effects of blocking technique on the GPGPU
Figure: Non-blocking vs. blocking on the GPGPU with L1 = 16KB (time [ms] vs. matrix size, 512X512 to 3072X3072)
The blocking technique shows better performance than the non-blocking technique.
36. Adaptation Process + Results and Discussion Blocking technique
Results: Effects of blocking technique on the GPGPU
Figure: Non-blocking vs. blocking on the GPGPU with L1 = 48KB (time [ms] vs. matrix size, 512X512 to 3072X3072)
The blocking technique shows better performance than the non-blocking technique.
37. Adaptation Process + Results and Discussion Blocking technique
Results: Effects of blocking technique on the GPGPU
Figure: Non-blocking vs. blocking on the GPGPU across all cache configurations (L1 disabled, 16KB L1, 48KB L1), plus blocking with shared memory
The blocking technique with shared memory shows the best performance among all the GPGPU cache options.
38. Adaptation Process + Results and Discussion Blocking technique
One by one
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
39. Adaptation Process + Results and Discussion Loop fusion
Loop fusion
It is required to match the number of branching conditions in both the fused and non-fused loops.
Common variables are used within the for loops.
On the GPGPU, the counterparts of CPU loops are kernels.
Kernel fusion is the GPGPU technique corresponding to loop fusion on CPUs.
Example
    for (int i = 0; i < n * n; i++) {
        h_array_c[i] = h_array_a[i] * h_array_b[i];
    }
    for (int i = 0; i < n * n; i++) {
        h_array_d[i] = h_array_c[i] * h_array_a[i];  // reuses h_array_a and h_array_c
    }
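A minimal sketch of kernel fusion for the example above (our own illustration; kernel and array names are assumptions):

    // Unfused: two launches; a[] is streamed from global memory twice
    // and c[] makes a round trip through global memory in between.
    __global__ void mulAB(const float *a, const float *b, float *c, int n2) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n2) c[i] = a[i] * b[i];
    }
    __global__ void mulCA(const float *c, const float *a, float *d, int n2) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n2) d[i] = c[i] * a[i];
    }

    // Fused: one launch; a[i] is loaded once and c[i] stays in a register.
    __global__ void mulFused(const float *a, const float *b,
                             float *c, float *d, int n2) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n2) {
            float ai = a[i];
            float ci = ai * b[i];
            c[i] = ci;
            d[i] = ci * ai;
        }
    }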
40. Adaptation Process + Results and Discussion Loop fusion
Adaptation
Figure: Adaptation process: the common data elements shared by the two loops are identified, and loop unrolling is applied to match the iteration counts.
41. Adaptation Process + Results and Discussion Loop fusion
Adaptation
Figure: Adaptation process: Kernel 1 and Kernel 2 are combined into one kernel.
42. Adaptation Process + Results and Discussion Loop fusion
Results:Effect of loop fusion on the CPU
Figure: Effect of loop fusion on the CPU with two common data elements (time [ms] vs. input size, 1024X1024 to 4096X4096), comparing no fusion, no fusion with loop unrolling, and loop fusion
The loop fusion technique shows performance improvements on the CPU.
The improvement is not an effect of the smaller number of iterations (the unrolled, non-fused variant confirms this).
43. Adaptation Process + Results and Discussion Loop fusion
Results:Effect of loop fusion on the GPGPU
Figure: Effect of kernel fusion on the GPGPU with common data accesses (time [ms] vs. input size, 1024X1024 to 4096X4096), comparing no fusion (default settings) with fusion under 16KB L1, 48KB L1, and disabled L1
The kernel fusion technique can be applied to kernels with common data accesses to improve performance.
44. Adaptation Process + Results and Discussion Loop fusion
One by one
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
47. Adaptation Process + Results and Discussion Array padding
Adaptation
Figure: Adaptation to the GPGPU: L1 cache thrashing is studied with one selected warp (32 threads).
48. Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the CPU
Figure: Effect of array padding on cache thrashing on the CPU (time [ms] vs. input size, 0 to 10240), with cache thrashing vs. without cache thrashing (with array padding)
The array padding technique shows a slight performance improvement on the CPU.
49. Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
Figure: Effect of shared-memory bank conflicts on the GPGPU (time [ms] vs. stride amount, 0 to 70)
Shared-memory bank conflicts have a considerable effect on application performance.
50. Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
Figure: Effect of padding on shared-memory bank conflicts on the GPGPU (time [ms] vs. input size, 256X256 to 1536X1536), 8-way bank conflicts with and without padding
Array padding can be used as a solution for shared-memory bank conflicts, particularly when the degree of conflict is high (32-way).
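A minimal sketch of the padding fix, shown in the standard tile-staging pattern (our own illustration, assuming the 32 shared-memory banks of Fermi-class hardware):

    #define TILE 32

    __global__ void stageTilePadded(const float *in, float *out, int n)
    {
        // Without the "+1", the column-wise reads below would put all 32
        // threads of a warp on the same shared-memory bank (a 32-way
        // conflict); the extra column shifts each row to a different bank.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced, row-wise
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;              // swapped block origin
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read
    }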
51. Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
Figure: Effect of padding on shared-memory bank conflicts on the GPGPU (time [ms] vs. input size), 8-way and 16-way bank conflicts with and without padding
Array padding can be used as a solution for shared-memory bank conflicts, particularly when the degree of conflict is high (32-way).
52. Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
Figure: Effect of padding on shared-memory bank conflicts on the GPGPU (time [ms] vs. input size), 8-way, 16-way, and 32-way bank conflicts with and without padding
Array padding can be used as a solution for shared-memory bank conflicts, particularly when the degree of conflict is high (32-way).
53. Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
Figure: L1 cache-thrashing points (clock cycles vs. stride amount, 0 to 500) for L1 sizes of 16KB and 48KB, without padding
L1 cache accesses produce cache-thrashing points at particular stride amounts.
54. Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
Figure: Effect of padding on the thrashing points of the L1 cache (clock cycles vs. stride amount), 48KB L1 without padding, with one padding element, and with two padding elements
Array padding shifts the cache-thrashing points rather than removing them.
55. Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
Figure: Cache thrashing with L1 and L2 vs. L2 only (clock cycles vs. stride amount), with padding, 48KB L1 vs. disabled L1
The cache-thrashing points that are significant for performance are caused by the L1 cache, not by the L2 cache.
56. Adaptation Process + Results and Discussion Array padding
One by one
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
57. Adaptation Process + Results and Discussion Array merging
Array merging
Figure: Basic idea behind the array merging technique: two separate arrays a[1..n] and b[1..n] are interleaved into a single array a[1], b[1], a[2], b[2], ..., a[n], b[n].
59. Adaptation Process + Results and Discussion Array merging
Adaptation
Figure: Adaptation process: the two separate arrays are merged into one interleaved array, which is then located in GPGPU memory.
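A minimal sketch of the merged layout in CUDA (our own illustration; the struct and kernel names are assumptions):

    // Merged layout: a[i] and b[i] share one 8-byte element, so a thread
    // that needs both touches one cache line instead of two.
    struct Pair { float a, b; };

    __global__ void mulSeparate(const float *a, const float *b, float *c, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) c[i] = a[i] * b[i];       // two loads from two distant regions
    }

    __global__ void mulMerged(const Pair *ab, float *c, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) c[i] = ab[i].a * ab[i].b; // one contiguous 8-byte load
    }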
60. Adaptation Process + Results and Discussion Array merging
Results: Effect of array merging on the CPU
Figure: Effect of array merging on the CPU (time [ms] vs. input size, 1024X1024 to 4096X4096), with and without array merging
The array merging technique improves the performance of non-stride accesses on the CPU.
61. Adaptation Process + Results and Discussion Array merging
Results: Effect of array merging on the GPGPU
Figure: Effect of array merging on the GPGPU (time [ms] vs. input size, 512X512 to 3072X3072), with and without array merging, L1 disabled
62. Adaptation Process + Results and Discussion Array merging
Results: Effect of array merging on the GPGPU
Figure: Effect of array merging on the GPGPU (time [ms] vs. input size), with and without array merging, L1 = 16KB
63. Adaptation Process + Results and Discussion Array merging
Results: Effect of array merging on the GPGPU
Figure: Effect of array merging on the GPGPU (time [ms] vs. input size), with and without array merging, L1 = 48KB
The array merging technique can also be used on the GPGPU to improve performance.
Its improvement depends on a sufficiently large cache line size.
64. Adaptation Process + Results and Discussion Array merging
One by one
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
65. Adaptation Process + Results and Discussion Array transpose
Array transpose
Figure: Left: basic matrix multiplication. Right: how a transposed second matrix is used for matrix multiplication.
66. Adaptation Process + Results and Discussion Array transpose
Adaptation
Figure: Adaptation process: the memory access pattern before and after transposing; the transposed array turns column accesses into cache-friendly, consecutive memory locations.
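On the CPU side the technique amounts to the following host-code sketch (our own illustration; names are assumptions, and Bt is B transposed in advance):

    // With Bt[j*n + k] == B[k*n + j], the inner loop walks both operands
    // with stride one instead of striding through B column-wise.
    void matmul_bt(const float *A, const float *Bt, float *C, int n)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < n; ++k)
                    acc += A[i * n + k] * Bt[j * n + k];  // both row-wise
                C[i * n + j] = acc;
            }
    }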
68. Adaptation Process + Results and Discussion Array transpose
Results: Effect of array transpose on the CPU
Figure: Effect of array transpose using matrix multiplication on the CPU (time [s] vs. input size, 512X512 to 2048X2048), comparing the basic method, multiplication with a transposed matrix (without transpose overhead), and the transpose overhead itself
The array transpose technique can be used to improve performance on the CPU.
69. Adaptation Process + Results and Discussion Array transpose
Results: Effect of array transpose on the GPGPU
Figure: Effect of array transpose for matrix multiplication on the GPGPU (time [ms] vs. input size, 1024X1024 to 5120X5120), basic vs. transposed
The array transpose technique is not a good option on GPGPUs.
It increases the number of memory accesses compared with the original access pattern.
70. Findings and Conclusions
Findings and Conclusions about GPGPU cache optimizations
1 Stride-one access is the best case for gaining better performance. However, large non-stride accesses perform better when the L1 cache is disabled.
2 Manually using the cache memory (shared memory) is the best option for gaining better performance with the blocking technique.
3 Better performance with kernel fusion can be achieved if multiple kernels have common data accesses.
4 Array padding has a positive effect on larger shared-memory bank conflicts. L1 cache conflicts can be avoided by applying array padding.
5 Array merging is a good option for improving the performance of overall memory access on CPUs as well as GPGPUs.
6 Transposing 2D arrays is not a good option on GPGPUs for gaining better performance on large data sets.
72. Case Study Introduction - Aho-corasick algorithm
Aho-Corasick algorithm - What is this?
The Aho-Corasick algorithm is a multiple-pattern searching algorithm.
Where can we see the Aho-Corasick algorithm?
Figure: Applications of the Aho-Corasick algorithm: a pattern-matching state machine (states 0-9 with labeled transitions, e.g. over the DNA alphabet A, C, G, T) matched against input streams such as ABGABBEDG...GGABEDG
A parallel GPGPU version of the Aho-Corasick algorithm is the Parallel Failure-less Aho-Corasick (PFAC) algorithm [Lin et al., IEEE Trans. Comput.; see Related works].
73. Case Study Introduction - Aho-corasick algorithm
How did we test our findings?
We implemented our own PFAC for DNA sequence matching (starting from the available GPGPU Aho-Corasick algorithm).
We analyzed the developed source code to find suitable locations for applying the GPGPU optimization techniques; a kernel sketch follows this list.
Analyzing...
1 Stride-one memory access → not possible
2 Blocking → compatible with the input text file → the input text is loaded via shared memory
3 Kernel fusion → only one kernel
4 Array padding → no cache-thrashing points
5 Array merging → compatible with the input pattern file → the two arrays of the input pattern file were merged via texture memory
6 Array transpose
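For orientation, a minimal PFAC-style kernel sketch (our own illustration of the published PFAC idea, not the authors' code; the table layout and names are assumptions):

    #define ALPHABET 4  // DNA: A, C, G, T

    __device__ int charToIndex(char c)
    {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3;  // 'T'
        }
    }

    // One thread starts matching at each input position and follows the
    // goto table until no transition exists; failure links are unneeded
    // because another thread covers every later starting position.
    __global__ void pfacKernel(const int *gotoTab,  // [numStates * ALPHABET], -1 = none
                               const int *finalId,  // pattern id per state, -1 = not final
                               const char *text, int n, int *matches)
    {
        int start = blockDim.x * blockIdx.x + threadIdx.x;
        if (start >= n) return;

        int state = 0;  // root of the pattern trie
        for (int pos = start; pos < n; ++pos) {
            state = gotoTab[state * ALPHABET + charToIndex(text[pos])];
            if (state < 0) break;                 // dead end: this thread is done
            if (finalId[state] >= 0)
                matches[start] = finalId[state];  // pattern found at this start
        }
    }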
74. Case Study Results and Discussion
Results: Comparison between original PFAC and our PFAC
(without cache optimization techniques)
Figure: Performance comparison between the original PFAC and our PFAC implementation without any optimizations (time [s] across pattern sets 1-5)
The performance gain from application-specific adaptation is around 1.27X.
75. Case Study Results and Discussion
Results: Comparison between PFAC implementations
(without and with cache optimization)
Figure: Performance comparison of our PFAC implementation without optimizations vs. with all optimizations (time [s] across pattern sets 1-5)
The performance gain from the application-specific solution to the cache-optimized solution is around 2X.
76. Case Study Conclusion about the case study
Conclusion
The applied application-specific techniques improved the performance of our PFAC implementation.
The applied cache memory optimization techniques also improved the performance of the PFAC implementation.
The worst case of our PFAC implementation (without any optimizations) shows a 1.27X average improvement, while the best case (with all optimizations) is 2.40X faster than the best available GPGPU solution (PFAC).
77. Publications
Publications (up to now)
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. To use or not to use: Graphics processing units (GPUs) for pattern matching algorithms. In 7th IEEE International Conference on Information and Automation for Sustainability, pages 1-4, Dec 2014.
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. An optimized parallel failure-less Aho-Corasick algorithm for DNA sequence matching. In 8th IEEE International Conference on Information and Automation for Sustainability (ICIAFS), Dec 2016.
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. To use or not to use: CPU's cache optimization techniques for GPGPUs. In 8th IEEE International Conference on Information and Automation for Sustainability (ICIAFS), Dec 2016.
V. Thambawita, N. C. Ellepola, R. Ragel, and D. Elkaduwe. GPGPU: To use or not to use? In Peradeniya University Research Sessions (PURSE), 2013.