This document discusses implementing a subset of the SQLite command processor directly on a GPU to accelerate SQL database operations. The author focuses on accelerating SELECT queries by having each CUDA thread execute SQLite opcodes on a single database row. Results show 20-70x speedups compared to CPU execution. This allows SQL queries to leverage the GPU's parallelism with minimal code changes, providing a simpler interface than existing GPU data processing approaches.
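As a rough sketch of that one-thread-per-row idea (a deliberate simplification, not the paper's actual opcode virtual machine; the fixed schema and names below are hypothetical), a CUDA kernel evaluating a single WHERE predicate could look like this:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical fixed-schema row; the real system interprets SQLite opcodes
// against serialized records rather than a hard-coded struct.
struct Row { int id; float price; };

// One thread per row: evaluate "WHERE price > threshold" and flag matches.
__global__ void filterRows(const Row* rows, int n, float threshold, int* match) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) match[i] = (rows[i].price > threshold) ? 1 : 0;
}

int main() {
    const int n = 1 << 20;
    Row* d_rows; int* d_match;
    cudaMalloc(&d_rows, n * sizeof(Row));
    cudaMalloc(&d_match, n * sizeof(int));
    // ... filling d_rows from the database file is omitted here ...
    int threads = 256, blocks = (n + threads - 1) / threads;
    filterRows<<<blocks, threads>>>(d_rows, n, 42.0f, d_match);
    cudaDeviceSynchronize();
    cudaFree(d_rows); cudaFree(d_match);
    return 0;
}

The reported speedups come from running thousands of such threads concurrently, one per row, rather than iterating over rows on the CPU.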
This document provides a summary of the Parallel-D final year project report. It discusses developing a database management system (DBMS) that utilizes a graphics processing unit (GPU) for parallel query processing. The project aims to optimize DBMS performance for processing large datasets. It outlines requirements, design considerations, and a roadmap for further development in the second year of the project. Key aspects summarized include the project objectives, system architecture with CPU and GPU components, and a methodology for partial query processing and data sorting algorithms.
Integrating DBMSs as a read-only execution layer into Hadoop - João Gabriel Lima
The document discusses integrating database management systems (DBMSs) as a read-only execution layer into Hadoop. It proposes a new system architecture that incorporates modified DBMS engines augmented with a customized storage engine capable of directly accessing data from the Hadoop Distributed File System (HDFS) and using global indexes. This allows DBMS to efficiently provide read-only operators while not being responsible for data management. The system is designed to address limitations of HadoopDB by improving performance, fault tolerance, and data loading speed.
The International Journal of Engineering and Science (The IJES) - theijes
The document summarizes research on heterogeneous computing using CPU-GPU integration. It proposes a unified graphics computing architecture (UGCA) that utilizes both CPU and GPU resources efficiently. The UGCA design translates PTX code to LLVM for execution on CPU and GPU. It also introduces a workload distribution module that splits tasks between CPU and GPU kernels based on granularity. Performance comparisons show CUDA providing better speedups than OpenCL due to its coarse-grained warp-level parallelism. The architecture aims to improve resource utilization for heterogeneous multi-core processors.
This document discusses HadoopDB and Apache Hive. HadoopDB aims to combine the scalability of MapReduce with the performance of parallel databases by running Hive queries over data stored in node-local relational databases rather than HDFS. It describes HadoopDB's architecture, which replaces HDFS with local databases, and benchmarks comparing it to MapReduce. It also summarizes Hive's data model, query language and architecture, which provides a SQL interface to MapReduce by translating queries into map and reduce jobs.
HadoopDB is a system that combines the performance of parallel database systems with the flexibility and fault tolerance of Hadoop. It uses Hadoop as the communication layer between multiple single-node database instances running on cluster nodes. Benchmark results showed that HadoopDB's performance was close to parallel databases for structured queries and similar to Hadoop for unstructured queries, while also providing Hadoop's ability to operate in heterogeneous environments and tolerate faults.
This document discusses accelerating deep learning algorithms on different hardware platforms. It explores using CPUs, GPUs, Intel Xeon Phi, FPGAs, and low-power devices. For CPUs, it finds that fixed-point arithmetic with SSE instructions provides 3x speedup over optimized BLAS packages. It also examines using MapReduce on Xeon Phi to reduce thread oversubscription. For FPGAs, it discusses using data parallelism on Hadoop clusters and integrating FPGA modules for acceleration. Overall, the document analyzes optimizations for various hardware to improve deep learning performance.
Team 6 comprises five members: Sourabh Ketkale, Sahil Kaw, Siddhi Pai, Goutham Nekkalapu, and Prince Jacob Chandy. The document discusses several techniques for optimizing neural network performance on different hardware, including 8-bit quantization, the SSE3 and SSE4 instruction sets, batching, lazy evaluation, batched lazy evaluation, and implementing neural networks on the Xeon Phi processor using techniques such as data parallelism and task parallelism. It also discusses using FPGAs and distributed systems to achieve large-scale deep learning.
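For context, 8-bit quantization stores weights as int8 values plus a floating-point scale factor, trading a little precision for large memory and bandwidth savings. A minimal symmetric-quantization sketch (the function and names are illustrative, not taken from the document):

// Host-only C++; compiles with g++ or nvcc.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Symmetric int8 quantization: x ~ scale * q, with q in [-127, 127].
void quantize(const float* x, int8_t* q, int n, float& scale) {
    float maxAbs = 0.0f;
    for (int i = 0; i < n; ++i) maxAbs = std::max(maxAbs, std::fabs(x[i]));
    scale = maxAbs / 127.0f;               // assumes at least one nonzero input
    for (int i = 0; i < n; ++i)
        q[i] = static_cast<int8_t>(std::lround(x[i] / scale));
}

int main() {
    float w[4] = {0.1f, -1.3f, 0.7f, 1.25f};
    int8_t q[4]; float scale;
    quantize(w, q, 4, scale);
    for (int i = 0; i < 4; ++i)
        printf("%f ~ %d * %f = %f\n", w[i], q[i], scale, q[i] * scale);
    return 0;
}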
Hadoop, Evolution of Hadoop, Features of Hadoop - Dr Neelesh Jain
The presentation explains Hadoop, its evolution, and its features, following the syllabus of RGPV, BU, and MCU for BCA, MCA, and B.Tech students.
The document summarizes presentations from a meeting focused on robust file replication. Various projects discussed their approaches to file replication including JLAB, SRB, Globus, GDMP, MAGDA, SAM, STAR and Babar. Key topics included replica catalogs, logical naming, error handling, interfaces, and reliability. There was discussion of defining common components and interfaces to promote interoperability across replication applications and middleware. Future work items were identified around web services, naming, consistency models and performance.
PERFORMANCE EVALUATION OF BIG DATA PROCESSING OF CLOAK-REDUCE - ijdpsjournal
Big Data has introduced the challenge of storing and processing large volumes of data (text, images, and videos). Centralised exploitation of massive data on a single node is now outdated, leading to the emergence of distributed storage, parallel processing, and hybrid distributed-storage and parallel-processing frameworks.
The main objective of this paper is to evaluate the load balancing and task allocation strategy of our hybrid distributed storage and parallel processing framework, CLOAK-Reduce. To achieve this goal, we first took a theoretical approach to the architecture and operation of some DHT-MapReduce systems. Then, we compared the data collected on their load balancing and task allocation strategies by simulation. Finally, the simulation results show that CLOAK-Reduce C5R5 replication provides better load-balancing efficiency and MapReduce job submission with 10% churn or no churn.
Review this presentation to discover the whens and hows of scaling; it provides an overview of multi-session, single-session, and multi-host scaling options and the workloads where each option is appropriate.
Database scaling refers to increasing database throughput. Vertical scaling utilizes additional resources such as I/O, memory, CPU, while horizontal scaling uses additional computers. However, the high concurrency and write requirements of database servers make scaling a challenge. Sometimes scaling is only possible with multiple sessions, while other options require data model adjustments or server configuration changes.
To learn more please visit www.EnterpriseDB.com or contact sales@enterprisedb.com.
This document discusses Greenplum Database on HDFS (GOH). It provides an introduction and overview of GOH's architecture, features, and performance. Key points include that GOH allows Greenplum to use HDFS for storage, provides pluggable storage support, and full transaction support for tables on HDFS. It also notes challenges around supporting many concurrent queries due to limitations of the current Java-based HDFS client, and possibilities for addressing this.
MongoDB on Windows Azure provides two options for deploying the MongoDB database on Microsoft's cloud platform:
1) Windows Azure Virtual Machines allow more control over infrastructure but require more operational effort. Users can choose Windows or Linux and install software themselves.
2) Windows Azure Cloud Services decrease operational effort through automated management but provide less infrastructure control. Only Windows is supported and configurations are pre-defined.
Both options provide scalability and high availability through features like replication and sharding. Developers should evaluate the level of control and effort needed to determine the best deployment model for their application on the Windows Azure cloud.
A comparative survey based on processing network traffic data using hadoop pi... - ijcses
Big data analysis has now become an integral part of many computational and statistical departments. Analysis of petabyte-scale data has taken on heightened importance in the present-day scenario. Big data manipulation is now considered a key area of research in the field of data analytics, and novel techniques are evolving day by day. Thousands of transaction requests are processed every minute by websites related to e-commerce, shopping carts, and online banking. Hence the need for network traffic and weblog analysis, for which Hadoop is a suggested solution: it can efficiently process NetFlow data collected from routers, switches, or even website access logs at fixed intervals.
The document discusses Hadoop, an open-source software framework that allows distributed processing of large datasets across clusters of computers. It describes key features of Hadoop like high scalability, availability, and performance. It explains how Hadoop can process massive amounts of data generated daily by internet companies efficiently using commodity hardware. Hadoop provides a storage system called HDFS and a programming model called MapReduce that together form the backbone of distributed computing infrastructure for many large companies.
The definition of a successful Hadoop solution need not be limited to whether or not the hardware can run the jobs and sort the data. As our tests show, the Dell PowerEdge FX2 was powerful enough to run our Hadoop workload, but more importantly, it scaled well when we added another cluster. Adding a second PowerEdge FX2 chassis complete with four Dell PowerEdge FC430 server nodes and Dell PowerEdge FD332 storage cut the time to run our Hadoop job in half. The all-in-one chassis that brings compute, storage, and networking together can also offer other benefits inherent in its design: the Dell PowerEdge FX2 can sort big data in a small space, which can also deliver space savings and ease the burden of managing the Hadoop solution.
1) Many groups presented file replication systems they have developed and are using in production, including JLAB, SRB, Globus, GDMP, MAGDA, SAM, STAR, and BaBar.
2) The systems utilize various components like replica catalogs, file transfer services, storage interfaces, and scheduling/management layers to provide robust file replication capabilities.
3) Key topics of discussion included interfaces and standards for replication services, error handling, reliability, performance, and experience from different experiments. Groups expressed interest in further collaboration in these areas.
MoreVRP is a database performance monitoring and acceleration tool that gives DBAs real-time monitoring together with resource management and control.
The document discusses MapReduce, including its programming model, internal framework, and improvements. It describes MapReduce as a programming model and framework that allows parallel processing of large datasets across commodity machines. The map function processes input key-value pairs to generate intermediate pairs, and the reduce function combines values for each key. The framework automatically parallelizes jobs and provides fault tolerance.
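Word count is the canonical illustration of that model. The following single-process C++ sketch shows the map, shuffle, and reduce roles in miniature (Hadoop itself distributes these same steps across machines; the function names here are ours):

#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map(): one input record -> intermediate (word, 1) pairs.
std::vector<std::pair<std::string, int>> mapFn(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream in(line);
    std::string w;
    while (in >> w) out.push_back({w, 1});
    return out;
}

// reduce(): all values for one key -> a single aggregated value.
int reduceFn(const std::vector<int>& counts) {
    int sum = 0;
    for (int c : counts) sum += c;
    return sum;
}

int main() {
    std::vector<std::string> records = {"the quick fox", "the lazy dog"};
    std::map<std::string, std::vector<int>> shuffled;  // stands in for the shuffle phase
    for (const auto& r : records)
        for (const auto& kv : mapFn(r)) shuffled[kv.first].push_back(kv.second);
    for (const auto& kv : shuffled)
        printf("%s: %d\n", kv.first.c_str(), reduceFn(kv.second));
    return 0;
}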
The document describes Megastore, a storage system developed by Google to meet the requirements of interactive online services. Megastore blends the scalability of NoSQL databases with the features of relational databases. It uses partitioning and synchronous replication across datacenters using Paxos to provide strong consistency and high availability. Megastore has been widely deployed at Google to handle billions of transactions daily storing nearly a petabyte of data across global datacenters.
This paper discusses implementing NoSQL databases for robotics applications. NoSQL databases are well-suited for robotics because they can store massive amounts of data, retrieve information quickly, and easily scale. The paper proposes using a NoSQL graph database to store robot instructions and relate them according to tasks. MapReduce processing is also suggested to break large robot data problems into parallel pieces. Implementing a NoSQL system would allow building more intelligent humanoid robots that can process billions of objects and learn quickly from massive sensory inputs.
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J... - Cloudera, Inc.
Cisco's Unified Fabric provides an integrated networking solution optimized for big data infrastructures using Hadoop. The document describes Cisco's testing of the Unified Fabric with Hadoop clusters of 128 and 16 nodes running Yahoo's Terasort benchmark on 1TB of data. It found that the Unified Fabric can support the network traffic patterns of Hadoop workloads while efficiently utilizing buffering to absorb bursts of traffic during shuffle and replication phases.
Characterization of hadoop jobs using unsupervised learning - João Gabriel Lima
This document summarizes research characterizing Hadoop jobs using unsupervised learning techniques. The researchers clustered over 11,000 Hadoop jobs from Yahoo production clusters into 8 groups based on job metrics. The centroids of each cluster represent characteristic jobs and show differences in map/reduce tasks and data processed. Identifying common job profiles can help benchmark and optimize Hadoop performance.
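As a toy illustration of the underlying technique (a k-means sketch over made-up 2-D job metrics, not the researchers' actual pipeline or data):

#include <cstdio>
#include <vector>

// Toy k-means over 2-D job metrics (say, map tasks vs. bytes processed):
// jobs cluster around centroids that represent characteristic job profiles.
struct Point { double x, y; };

int nearest(const Point& p, const std::vector<Point>& centroids) {
    int best = 0; double bestD = 1e300;
    for (size_t c = 0; c < centroids.size(); ++c) {
        double dx = p.x - centroids[c].x, dy = p.y - centroids[c].y;
        double d = dx * dx + dy * dy;
        if (d < bestD) { bestD = d; best = static_cast<int>(c); }
    }
    return best;
}

int main() {
    std::vector<Point> jobs = {{1, 2}, {1.5, 1.8}, {8, 8}, {9, 11}, {1, 0.6}, {9, 9}};
    std::vector<Point> centroids = {jobs[0], jobs[2]};    // k = 2, seeded from the data
    for (int iter = 0; iter < 10; ++iter) {
        std::vector<Point> sum(centroids.size(), Point{0, 0});
        std::vector<int> count(centroids.size(), 0);
        for (const auto& p : jobs) {                      // assignment step
            int c = nearest(p, centroids);
            sum[c].x += p.x; sum[c].y += p.y; ++count[c];
        }
        for (size_t c = 0; c < centroids.size(); ++c)     // centroid update step
            if (count[c]) centroids[c] = {sum[c].x / count[c], sum[c].y / count[c]};
    }
    for (const auto& c : centroids) printf("centroid: (%.2f, %.2f)\n", c.x, c.y);
    return 0;
}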
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mohadev Konar, H... - Cloudera, Inc.
The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. In this session, we will be presenting the architecture and design of the next generation of MapReduce and will delve into the details of the architecture that makes it much easier to innovate. The architecture will have built in HA, security and multi-tenancy to support many users on the larger clusters. It will also increase innovation, agility and hardware utilization. We will also be presenting large scale and small scale comparisons on some benchmarks with MRV1.
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base... - IOSR Journals
This document summarizes a research paper that proposes using Google File System (GFS) and MapReduce in a cloud computing environment to improve resource utilization and processing of large datasets. The paper discusses GFS architecture with a master node and chunk servers, and how MapReduce can split large files into chunks and process them in parallel across idle cloud nodes. It also proposes encrypting data for security and using a third party to audit client files. The goal is to provide fault tolerance, optimize workload processing time, and maximize utilization of cloud resources for data-intensive applications.
Challenges and Opportunities of FPGA Acceleration in Big Data - IRJET Journal
This document discusses the challenges and opportunities of using field programmable gate arrays (FPGAs) to accelerate big data applications. It describes how FPGAs can be integrated into big data systems to improve performance and efficiency. The main challenges are ensuring transparency of the FPGA acceleration and efficient integration. Potential benefits include speeding up long-running queries, reducing latency for sensitive applications, and improving energy efficiency. Specific examples discussed include using FPGAs to accelerate Spark SQL queries and deep learning workloads.
This document discusses GPU computing and provides comparisons between CPU and GPU architectures and performance. It begins by introducing hybrid clusters that use accelerators like GPUs and FPGAs to provide high-performance computation. GPUs are discussed as being highly parallel and suitable for general-purpose computations. The document then summarizes GPU architecture and programming models like CUDA and OpenCL that are used to program GPUs. It provides an example GPU hardware architecture and explains how programming models map applications to GPU resources. Benchmark results are mentioned as showing GPUs can provide significantly faster computation times than CPUs for parallel problems.
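To make that mapping concrete, a minimal CUDA vector-addition kernel assigns one array element to each thread, with threads grouped into blocks across the grid (a standard introductory example, not taken from the document):

#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element: global index = block offset + thread offset.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    float *a, *b, *c;                 // unified memory keeps the example short
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2.0f * i; }
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[10] = %f\n", c[10]);    // expect 30.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}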
The document discusses the evolution of GPU architecture and capabilities over time. It describes how GPUs have become massively parallel processors with programmable capabilities beyond just graphics. The document outlines the core components of a GPU including the graphics pipeline and programming model. It also discusses how GPUs are well suited for parallel, data-intensive applications and how their capabilities have expanded into general purpose computing through technologies like CUDA.
An exposition of performance comparison of graphic processing unit virtualiza... - Asif Farooq
As the demand for computing power increases, the number of new and improved methodologies in computer architecture is expanding. With the introduction of the accelerated heterogeneous computing model, compute times for complex algorithms and tasks are reduced significantly as a result of a high degree of data parallelism. GPU-based heterogeneous computing can benefit not only cloud infrastructures but also large-scale distributed computing models, making them more cost-effective by improving resource efficiency and decreasing energy consumption. Implementing such a paradigm on cloud and large-scale infrastructure therefore requires effective GPU virtualization techniques. This survey reviews GPGPU virtualization techniques based on the CUDA programming model, with a detailed performance comparison.
I understand that physics and hardware depend on the use of finite... .pdf - anil0878
I understand that physics and hardware depend on the use of finite element methods to predict fluid flow over airplane wings, and that progress is likely to continue. However, in recent years this progress has been achieved through greatly increased hardware complexity with the rise of multicore and manycore processors, and this is affecting the ability of application developers to achieve the full potential of these systems. Currently, performance is measured on a dense matrix-matrix multiplication test, which has questionable relevance to real applications, despite the incredible advances in processor technology and all of the accompanying aspects of computer system design, such as the memory subsystem and networking.
An embedded system combines both hardware and software into one functional unit; with advanced processor technology, applications must be developed so that they achieve the full potential of such systems.
Hardware
(1) Memory
Advances in memory technology have struggled to keep pace with the phenomenal advances in processors. This difficulty in improving main memory bandwidth led to the development of a cache hierarchy, with data held in different cache levels within the processor. The idea is that instead of fetching the required data from main memory multiple times, it is brought into the cache once and re-used many times. Intel allocates about half of the chip to cache, with the largest LLC (last-level cache) being 30MB in size. IBM's new Power8 CPU has an even larger L3 cache of up to 96MB [4]. By contrast, the largest L2 cache in NVIDIA's GPUs is only 1.5MB. These different hardware design choices are motivated by careful consideration of the range of applications being run by typical users.
One complication which has become more common and more important in the past few years is non-uniform memory access. Ten years ago, most shared-memory multiprocessors would have several CPUs sharing a memory bus to access a single main memory. A final comment on the memory subsystem concerns the energy cost of moving data compared to performing a single floating-point computation.
(2) Processors
CPUs had a single processing core, and the increase in performance came partly from an increase in the number of computational pipelines, but mainly through an increase in clock frequency. Unfortunately, power consumption is approximately proportional to the cube of the frequency, and this led to CPUs with a power consumption of up to 250W. CPUs address memory bandwidth limitations by devoting half or more of the chip to LLC, so that small applications can be held entirely within the cache. They address the 200-cycle latency issue by using very complex cores capable of out-of-order execution. By contrast, GPUs adopt a very different design philosophy because of the different needs of the graphical applications they target. A GPU usually has a number of functional units.
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS - cseij
This document summarizes a survey on GPU systems and their performance on different applications. It discusses how GPUs can be used for general purpose computing due to their high parallel processing capabilities. Several computational intensive applications that achieve speedups when implemented on GPUs are described, including video decoding, matrix multiplication, parallel AES encryption, and password recovery for MS office documents. The GPU architecture and Nvidia's CUDA programming model are also summarized. While GPUs provide significant performance benefits, some limitations for non-graphics applications are noted. The conclusion is that GPUs are a good alternative for computational intensive tasks to reduce CPU load and improve performance compared to CPU-only implementations.
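Matrix multiplication, one of the applications the survey mentions, is the canonical CUDA example. A naive kernel assigns each thread one output element (production implementations add shared-memory tiling; this sketch is ours, not the survey's code):

#include <cuda_runtime.h>

// C = A * B for square n x n matrices; one thread per element of C.
__global__ void matMul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

int main() {
    const int n = 256;
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);   // cover the whole matrix
    matMul<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();                   // each element of C should equal n
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}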
This document discusses using OpenCL to accelerate numerical modeling of gravitational wave sources on hardware accelerators like GPUs and the Cell BE. It summarizes the EMRI Teukolsky Code, which models gravitational waves generated by a compact object orbiting a supermassive black hole by solving the Teukolsky equation. The authors parallelized this code using OpenCL to run on GPUs and the Cell BE, achieving performance comparable to using each vendor's native SDK while only writing code once for both architectures.
AMD Fusion APUs combine CPU and GPU capabilities on a single chip to overcome limitations of previous technologies. By integrating graphics processing directly onto the processor die, APUs eliminate performance bottlenecks between discrete CPU and GPU components. This allows for more balanced processing across CPU and GPU cores. APUs also address software limitations by providing developers with flexibility to utilize scalar and vector processing approaches based on task requirements. The first generation of AMD Fusion APUs launched in 2010 aimed to facilitate adoption of applications harnessing combined CPU and GPU capabilities.
This document provides an overview and comparison of different GPU virtualization techniques using the CUDA programming model. It first reviews several techniques for GPU virtualization, including GViM, vCUDA, gVirtuS, rCUDA, DS-CUDA, LoGV, and Grid CUDA. It then compares these techniques based on factors like the CUDA version compatibility, hypervisor used, and whether they support remote GPU acceleration. Finally, the document provides a performance comparison based on overhead percentages and execution times reported in various studies, with rCUDA having the lowest overhead and fastest execution time on average.
HAWQ: a massively parallel processing SQL engine in Hadoop - BigData Research
HAWQ, developed at Pivotal, is a massively parallel processing SQL engine sitting on top of HDFS. As a hybrid of MPP database and Hadoop, it inherits the merits from both parties. It adopts a layered architecture and relies on the distributed file system for data replication and fault tolerance. In addition, it is standard SQL compliant, and unlike other SQL engines on Hadoop, it is fully transactional. This paper presents the novel design of HAWQ, including query processing, the scalable software interconnect based on UDP protocol, transaction management, fault tolerance, read optimized storage, the extensible framework for supporting various popular Hadoop based data stores and formats, and various optimization choices we considered to enhance the query performance. The extensive performance study shows that HAWQ is about 40x faster than Stinger, which is reported 35x-45x faster than the original Hive.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allows both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientist to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/ and https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document summarizes a presentation by Dr. Christoph Angerer on RAPIDS, an open source library for GPU-accelerated data science. Some key points:
- RAPIDS provides an end-to-end GPU-accelerated workflow for data science using CUDA and popular tools like Pandas, Spark, and XGBoost.
- It addresses challenges with data movement and formats by keeping data on the GPU as much as possible using the Apache Arrow data format.
- Benchmarks show RAPIDS provides significant speedups over CPU for tasks like data preparation, machine learning training, and visualization.
- Future work includes improving cuDF (GPU DataFrame library), adding algorithms to cuML
A SQL implementation on the MapReduce framework - eldariof
Tenzing is a SQL query engine built on top of MapReduce for analyzing large datasets at Google. It provides low latency queries over heterogeneous data sources at petabyte scale. Tenzing supports a full SQL implementation with extensions and optimizations to achieve high performance comparable to commercial parallel databases by leveraging MapReduce and traditional database techniques. It is used by over 1,000 Google employees to query over 1.5 petabytes of data and serve 10,000 queries daily.
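As a hedged illustration of how a SQL aggregation decomposes onto that model (a single-process C++ sketch of the idea, not Tenzing's actual translation):

#include <cstdio>
#include <map>
#include <string>
#include <vector>

// SELECT dept, SUM(salary) FROM emp GROUP BY dept
//   map phase:    emit (dept, salary) for each row
//   shuffle:      group the emitted salaries by dept
//   reduce phase: sum each group
struct EmpRow { std::string dept; int salary; };

int main() {
    std::vector<EmpRow> emp = {{"eng", 100}, {"ops", 80}, {"eng", 120}};
    std::map<std::string, long long> sums;   // shuffle and reduce fused for brevity
    for (const auto& row : emp)              // map: emit (dept, salary)
        sums[row.dept] += row.salary;        // reduce: SUM per key
    for (const auto& kv : sums)
        printf("%s: %lld\n", kv.first.c_str(), kv.second);
    return 0;
}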
1. The document proposes using FPGA for data offloading in cloud datacenter networks to improve quality of service. It discusses how current high-performance processors are not keeping pace with required QoS and how FPGA can help balance workloads.
2. It reviews related works in cloud datacenter networking that have not considered using FPGA for acceleration and quality of service improvement. Most have focused on topology, architecture, flow scheduling, and virtualization but not FPGA-based acceleration.
3. The document argues that an FPGA-based spine-leaf model could be an alternative to traditional network models for enterprise applications like EETACP, offering benefits like lower latency, offloading, scalability, and
Cloud Based Datacenter Network Acceleration Using FPGA for Data-Offloading - Onyebuchi Nosiri
Currently, the high-performance processors in a Spine-Leaf, Mesh, and Router layer-3 (SLMR-3) backend server domain have multiple cores, but data offloading from processor to peripheral is not keeping pace with the Quality of Service (QoS) required to balance the workload on a Warehouse Scaled Computer (WSC) running a developed Enterprise Energy Tracking Analytic Cloud Portal (EETACP) data center network. High-speed, low-latency interconnects between the processors and Field Programmable Gate Arrays (FPGAs) are critical for achieving performance benefits in EETACP deployment. Most of the servers in WSC architectures run at average utilization rates and perform well under peak processing power. These servers are good candidates for FPGA processors in cloud-based data centers owing to their acceleration coherency. This paper makes a strong case for cloud-based support for EETACP. An FPGA-based Spine-Leaf model is proposed as an alternative to traditional network models for EETACP provisioning. The paper analyzes reconfigurable FPGAs and characterizes a simplified process model for a hyperscale FPGA cloud design. To validate the performance, comparisons were made with two similar networks, DCell and BCube, for enterprise application deployments. It is concluded that FPGA-based DCN acceleration for EETACP offers acceptable QoS.
SAP Sybase IQ uses a technique called distributed query processing (DQP) that can improve query performance by breaking queries into pieces and distributing the pieces across multiple SAP Sybase IQ servers. DQP provides both intra-query and inter-query parallelism. It dynamically manages resources to balance workloads and avoid saturating the system. For DQP to be effective, the storage area network must have sufficient performance to support the increased parallelism.
This document discusses accelerating the S3D turbulent reacting flow solver using a GPU. The getrates kernel, which calculates chemical reaction rates, was identified as a bottleneck and ported to the GPU. The GPU version achieved speedups of up to 31.4x for single precision and 17.0x for double precision. However, single precision results in accumulated error over time that impacts variables like temperature and chemical species. Double precision versions better controlled error but had lower performance. Overall, the document examines the GPU acceleration process and tradeoffs between performance and accuracy.
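The precision tradeoff described there can be reproduced in miniature: accumulating many small time-step increments drifts noticeably in single precision long before it does in double (a generic demonstration, not the S3D code):

#include <cstdio>

int main() {
    const int steps = 10000000;       // ten million tiny increments
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < steps; ++i) {
        fsum += 1.0e-4f;              // float: rounding error accumulates
        dsum += 1.0e-4;               // double: error stays negligible here
    }
    printf("float:  %f (exact answer: 1000)\n", fsum);
    printf("double: %f\n", dsum);
    return 0;
}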
CUDA performance study on Hadoop MapReduce Cluster - airbots
This document summarizes a study on using GPUs (CUDA) to accelerate Hadoop MapReduce workloads. It introduces CUDA into Hadoop clusters, evaluates the performance speedup and power efficiency on matrix multiplication and molecular dynamics simulations, and concludes that GPU acceleration provides up to 20x speedup and reduces power consumption by up to 19/20, making it a cost-effective approach compared to CPU-only upgrades. Future work is outlined to port more applications and support heterogeneous GPU/CPU clusters.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month's highlights cover the organization's newly elected president, an updated OpenACC 3.1 specification, upcoming 2021 GPU Hackathons, new resources and more!
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf - Chart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
"Temporal Event Neural Networks: A More Efficient Alternative to the Transfor..." - Edge AI and Vision Alliance
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip, presents the "Temporal Event Neural Networks: A More Efficient Alternative to the Transformer" tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip's Akida neuromorphic hardware IP further enhances TENNs' capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
FREE A4 Cyber Security Awareness Posters - Social Engineering part 3 - Data Hops
Free A4 downloadable and printable cyber security and social engineering safety training posters. Promote security awareness in the home or workplace. Lock them out. From training provider datahops.com.
Taking AI to the Next Level in Manufacturing.pdf - ssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack - shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Main news related to the CCS TSI 2023 (2023/1695) - Jakub Marek
An English translation of a presentation accompanying the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held in the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
leewayhertz.com - AI in predictive maintenance: Use cases, technologies, benefits ... - alexjohnson7307
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
Fueling AI with Great Data with Airbyte Webinar - Zilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers - akankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin... - Tatiana Kojar
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
5th LF Energy Power Grid Model Meet-up Slides - DanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power gridās behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
A Comprehensive Guide to DeFi Development Services in 2024Intelisync
Ā
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of whatās possible in finance.
In summary, DeFi in 2024 is not just a trend; itās a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Ā
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty, is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Ā
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
SAP S/4 HANA sourcing and procurement to Public cloud
Ā
SQL CUDA
Accelerating SQL Database Operations on a GPU with CUDA

Peter Bakkum and Kevin Skadron
Department of Computer Science
University of Virginia, Charlottesville, VA 22904
{pbb7c, skadron}@virginia.edu
ABSTRACT
Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.
This paper focuses on accelerating SELECT queries and describes the considerations in an efficient GPU implementation of the SQLite command processor. Results on an NVIDIA Tesla C1060 achieve speedups of 20-70X depending on the size of the result set.

Categories and Subject Descriptors
D.1.3 [Concurrent Programming]: Parallel Programming; H.2.4 [Database Management]: Parallel Databases

Keywords
GPGPU, CUDA, Databases, SQL

GPGPU-3, March 14, 2010, Pittsburgh, PA, USA. Copyright 2010 ACM 978-1-60558-935-0/10/03 ...$10.00.

1. INTRODUCTION
GPUs, known colloquially as video cards, are the means by which computers render graphical information on a screen. The modern GPU's parallel architecture gives it very high throughput on certain problems, and its near universal use in desktop computers means that it is a cheap and ubiquitous source of processing power. There is a growing interest in applying this power to more general non-graphical problems through frameworks such as NVIDIA's CUDA, an application programming interface developed to give programmers a simple and standard way to execute general purpose logic on NVIDIA GPUs. Programmers often use CUDA and similar interfaces to accelerate computationally intensive data processing operations, often executing them fifty times faster on the GPU [2]. Many of these operations have direct parallels to classic database queries [4, 9].
The GPU's complex architecture makes it difficult for unfamiliar programmers to fully exploit. A productive CUDA programmer must have an understanding of six different memory spaces, a model of how CUDA threads and threadblocks are mapped to GPU hardware, an understanding of CUDA interthread communication, etc. CUDA has brought GPU development closer to the mainstream, but programmers must still write a low-level CUDA kernel for each data processing operation they perform on the GPU, a time-intensive task that frequently duplicates work.
SQL is an industry-standard generic declarative language used to manipulate and query databases. Capable of performing very complex joins and aggregations of data sets, SQL is used as the bridge between procedural programs and structured tables of data. An acceleration of SQL queries would enable programmers to increase the speed of their data processing operations with little or no change to their source code. Despite the demand for GPU program acceleration, no implementation of SQL is capable of automatically accessing a GPU, even though SQL queries have been closely emulated on the GPU to prove the parallel architecture's adaptability to such execution patterns [5, 6, 9].
There exist limitations to current GPU technology that affect the potential users of such a GPU SQL implementation. The two most relevant technical limitations are the GPU memory size and the host to GPU device memory transfer time. Though future graphics cards will almost certainly have greater memory, current NVIDIA cards have a maximum of 4 gigabytes, a fraction of the size of many databases. Transferring memory blocks between the CPU and the GPU remains costly. Consequently, staging data rows to the GPU and staging result rows back requires significant overhead. Despite these constraints, the actual query execution can be run concurrently over the GPU's highly parallel organization, thus outperforming CPU query execution.
There are a number of applications that fit into the domain of this project, despite the limitations described above. Many databases, such as those used for research, modify data infrequently and experience their heaviest loads during read queries. Another set of applications cares much more about the latency of a particular query than strict adherence to presenting the latest data, an example being Internet search engines. Many queries over a large-size dataset only address a subset of the total data, thus inviting staging this subset into GPU memory.
Additionally, though the finite memory size of the GPU is a significant limitation, allocating just half of the 4 gigabytes of a Tesla C1060 to store a data set gives the user room for over 134 million rows of 4 integers.
The contribution of this paper is to implement and demonstrate a SQL interface for GPU data processing. This interface enables a subset of SQL SELECT queries on data that has been explicitly transferred in row-column form to GPU memory. SELECT queries were chosen since they are the most common SQL query, and their read-only characteristic exploits the throughput of the GPU to the highest extent. The project is built upon an existing open-source database, SQLite, enabling switching between CPU and GPU query execution and providing a direct comparison of serial and parallel execution. While previous research has used data processing primitives to approximate the actions of SQL database queries, this implementation is built from the ground up around the parsing of SQL queries, and thus executes with significant differences.
In this context, SQL allows the programmer to drastically change the data processing patterns executed on the GPU with the smallest possible development time, literally producing completely orthogonal queries with a few changes in SQL syntax. Not only does this simplify GPU data processing, but the results of this paper show that executing SQL queries on GPU hardware significantly outperforms serial CPU execution. Of the thirteen SQL queries tested in this paper, the smallest GPU speedup was 20X, with a mean of 35X. These results suggest this will be a very fruitful area for future research and development.

2. RELATED WORK

2.1 GPU Data Mining
There has been extensive research in general data mining on GPUs, thoroughly proving its power and the advantages of offloading processing from the CPU. The research relevant to this paper focuses on demonstrating that certain database operations (i.e. operations that are logically performed within a database during a query execution) can be sped up on GPUs. These projects are implemented using primitives such as Sort and Scatter that can be combined and run in succession on the same data to produce the results of common database queries. One paper divides database queries into predicate evaluation, boolean combination, and aggregation functions [9]. Other primitives include binary searches, p-ary searches [14], tree operations, relational join operations [6], etc. An area where GPUs have proven particularly useful is with sort operations. GPUTeraSort, for example, is an algorithm developed to sort database rows based on keys, and demonstrated significant performance improvements over serial sorting methods [8]. One of the most general of the primitive-based implementations is GPUMiner, a program which implements several algorithms, including k-means, and provides tools to visualize the results [7]. Much of this research was performed on previous generations of GPU hardware, and recent advances can only improve the already impressive results.
One avenue of research directly related to production SQL databases is the development of database procedures that employ GPU hardware. These procedures are written by the user and called through the database to perform a specific function. It has been shown using stored and external procedures on Oracle [1] and PostgreSQL [13] databases that GPU functionality can be exploited to accelerate certain operations. The novelty of this approach is that CUDA kernels are accessed through a database rather than explicitly called by a user program.
The most closely related research is Relational Query Coprocessing on Graphics Processors, by Bingsheng He, et al. [12]. This is a culmination of much of the previous research performed on GPU-based data processing. Its authors design a database, called GDB, accessed through a plethora of individual operations. These operations are divided into operators, access methods, and primitives. The operators include ordering, grouping, and joining functionality. The access methods control how the data is located in the database, and include scanning, trees, and hashing. Finally, the primitives are a set of functional programming operations such as map, reduce, scatter, gather, and split. GDB has a number of similarities to the implementation described in this paper, notably the read-only system and column-row data organization, but lacks direct SQL access. In the paper, several SQL queries are constructed with the primitives and benchmarked, but no parser exists to transform SQL queries into sequences of primitives.
This paper's implementation has similar results to the previous research, but approaches the querying of datasets from an opposing direction. Other research has built GPU computing primitives from the ground up, then built programs with these primitives to compare to other database operations. This paper's research begins with the codebase of a CPU-based database and adapts its computational elements to execute on a GPU. This approach allows a much more direct comparison with traditional databases, and most importantly, allows the computing power of the GPU to be accessed directly through SQL. SQL presents a uniform and standardized interface to the GPU, without knowledge of the specific primitives of a certain implementation, and with the option of choosing between CPU and GPU execution. In other words, the marginal cost of designing data processing queries to be run on a GPU is significantly reduced with a SQL interface.
To our knowledge, no other published research provides this SQL interface to GPU execution. In practical terms, this approach means that a CUDA thread executes a set of SQLite opcodes on a single row before exiting, rather than a host function managing a bundle of primitives as CUDA kernels. It is possible that a SQL interface to the primitives discussed in other research could be created through a parser, but this has not been done, and may or may not be more advantageous for GPU execution. Many primitives such as sort and group have direct analogs in SQL; future research may clarify how an optimal SQL query processor differs when targeting the GPU versus the CPU.

2.2 MapReduce
A new and active area of data mining research is in the MapReduce paradigm. Originally pioneered by Google, it gives the programmer a new paradigm for data mining based on the functional primitives map and reduce [3]. This paradigm has a fundamentally parallel nature, and is used extensively by Google and many other companies for large-scale distributed data processing. Though essentially just a name for using two of the primitives mentioned in the previous section, MapReduce has become a major topic itself.
Research in this area has shown that MapReduce frameworks can be accelerated on multicore machines [16] and on GPUs [11]. Notably, Thrust, a library of algorithms implemented in CUDA intended as a GPU-aware library similar to the C++ Standard Template Library, includes a MapReduce implementation [24].
In some cases, a MapReduce framework has become a replacement for a traditional SQL database, though its use remains limited. The advantage of one over the other remains a hotly debated topic; both are very general methods through which data can be processed. MapReduce requires the programmer to write a specific query procedurally, while SQL's power lies in its simple declarative syntax. Consequently, MapReduce is most useful for handling unstructured data. A key difference is that the simplicity of the MapReduce paradigm makes it simple to implement in CUDA, while no such SQL implementation exists. Additionally, the limited use of MapReduce restricts any GPU implementation to a small audience, particularly given that the memory ceilings of modern GPUs inhibit their use in the huge-scale data processing applications for which MapReduce is known.

2.3 Programming Abstraction
Another notable vector of research is the effort to simplify the process of writing GPGPU applications, CUDA applications in particular. Writing optimal CUDA programs requires an understanding of the esoteric aspects of NVIDIA hardware, specifically the memory hierarchy. Research on this problem has focused on making the hierarchy transparent to the programmer, performing critical optimization during compilation. One such project has programmers write CUDA programs that exclusively use global memory, then chooses the best variables to move to register memory, shared memory, etc. during the compilation phase [17]. Other projects such as CUDA-lite and hiCUDA have the programmer annotate their code for the compiler, which chooses the best memory allocation based on these notes, an approach similar to the OpenMP model [10, 25]. Yet another project directly translates OpenMP code to CUDA, effectively making it possible to migrate parallel processor code to the GPU with no input from the programmer [15]. A common thread in this area is the tradeoff between the difficulty of program development and the optimality of the finished product. Ultimately, programming directly in CUDA remains the only way to ensure a program is taking full advantage of the GPU hardware.
Regardless of the specifics, there is clear interest in providing a simpler interface to GPGPU programming than those that currently exist. The ubiquity of SQL and its pervasive parallelism suggest that a SQL-based GPU interface would be easy for programmers to use and could significantly speed up many applications that have already been developed with databases. Such an interface would not be ideal for all applications, and would lack the fine-grained optimization of the previously discussed interfaces, but could be significantly simpler to use.

3. SQLITE

3.1 Overview
SQLite is a completely open source database developed by a small team supported by several major corporations [20]. Its development team claims that SQLite is the most widely deployed database in the world owing to its use in popular applications, such as Firefox, and on mobile devices, such as the iPhone [22]. SQLite is respected for its extreme simplicity and extensive testing. Unlike most databases, which operate as a server, accessed by separate processes and usually accessed remotely, SQLite is written to be compiled directly into the source code of the client application. SQLite is distributed as a single C source file, making it trivial to add a database with a full SQL implementation to a C/C++ application.

3.2 Architecture
SQLite's architecture is relatively simple, and a brief description is necessary for understanding the CUDA implementation described in this paper. The core of the SQLite infrastructure contains the user interface, the SQL command processor, and the virtual machine [21]. SQLite also contains extensive functionality for handling disk operations, memory allocation, testing, etc., but these areas are less relevant to this project. The user interface consists of a library of C functions and structures to handle operations such as initializing databases, executing queries, and looking at results. The interface is simple and intuitive: it is possible to open a database and execute a query in just two function calls. Function calls that execute SQL queries use the SQL command processor. The command processor functions exactly like a compiler: it contains a tokenizer, a parser, and a code generator. The parser is created with an LALR(1) parser generator called Lemon, very similar to YACC and Bison. The command processor outputs a program in an intermediate language similar to assembly. Essentially, the command processor takes the complex syntax of a SQL query and outputs a set of discrete steps.
Each operation in this intermediate program contains an opcode and up to five arguments. Each opcode refers to a specific operation performed within the database. Opcodes perform operations such as opening a table, loading data from a cell into a register, performing a math operation on a register, and jumping to another opcode [23]. A simple SELECT query works by initializing access to a database table, looping over each row, then cleaning up and exiting. The loop includes opcodes such as Column, which loads data from a column of the current row and places it in a register, ResultRow, which moves the data in a set of registers to the result set of the query, and Next, which moves the program on to the next row.
This opcode program is executed by the SQLite virtual machine. The virtual machine manages the open database and table, and stores information in a set of "registers", which should not be confused with the register memory of CUDA. When executing a program, the virtual machine directs control flow through a large switch statement, which jumps to a block of code based on the current opcode.
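Both the two-call interface and the intermediate opcode representation can be observed directly with the standard SQLite C API. The sketch below uses a hypothetical database file and query; prefixing a query with EXPLAIN makes SQLite return the parsed opcode program itself, one opcode per result row, rather than the query's results.

#include <stdio.h>
#include <sqlite3.h>

/* Callback invoked by sqlite3_exec() once per result row; under
   EXPLAIN, each row describes one opcode of the parsed program. */
static int print_row(void *unused, int ncols, char **vals, char **names) {
    for (int i = 0; i < ncols; i++)
        printf("%s ", vals[i] ? vals[i] : "NULL");
    printf("\n");
    return 0;  /* a nonzero return would abort the query */
}

int main(void) {
    sqlite3 *db;
    sqlite3_open("test.db", &db);            /* call 1: open the database */
    sqlite3_exec(db,
        "EXPLAIN SELECT id, uniformi FROM test WHERE uniformi > 60;",
        print_row, NULL, NULL);              /* call 2: execute a query   */
    sqlite3_close(db);
    return 0;
}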
3.3 Usefulness
SQLite was chosen as a component of this project for a number of reasons. First, using elements of a well-developed database removes the burden of having to implement SQL query processing for the purposes of this project. SQLite was attractive primarily for its simplicity, having been developed from the ground up to be as simple and compact as possible. The source code is very readable, written in a clean style and commented heavily. The serverless design of SQLite also makes it ideal for research use.
It is very easy to modify and add code and recompile quickly to test, and its functionality is much more accessible to someone interested in comparing native SQL query execution to execution on the GPU. Additionally, the SQLite source code is in the public domain, thus there are no licensing requirements or restrictions on use. Finally, the widespread adoption of SQLite makes this project relevant to the industry, demonstrating that many already-developed SQLite applications could improve their performance by investing in GPU hardware and changing a trivial amount of code.
From an architectural standpoint, SQLite is useful for its rigid compartmentalization. Its command processor is entirely separate from the virtual machine, which is entirely separate from the disk i/o code and the memory allocation code, such that any of these pieces can be swapped out for custom code. Critically, this makes it possible to reimplement the virtual machine to run the opcode program on GPU hardware.
A limitation of SQLite is that its serverless design means it is not implemented to take advantage of multiple cores. Because it exists solely as a part of another program's process, threading is controlled entirely outside SQLite, though it has been written to be thread-safe. This limitation means that there is no simple way to compare SQLite queries executed on a single core to SQLite queries optimized for multicore machines. This is an area for future work.

4. IMPLEMENTATION

4.1 Scope
Given the range of both database queries and database applications and the limitations of CUDA development, it is necessary to define the scope of this project. We explicitly target applications that run SELECT queries multiple times on the same mid-size data set. The SELECT query qualification means that the GPU is used for read-only data. This enables the GPU to maximize its bandwidth for this case and predicates storing database rows in row-column form. The "multiple times" qualification means that the project has been designed such that SQL queries are executed on data already resident on the card. A major bottleneck to GPU data processing is the cost of moving data between device and host memory. By moving a block of data into the GPU memory and executing multiple queries, the cost of loading data is effectively amortized as we execute more and more queries, thus the cost is mostly ignored. Finally, a "mid-size data set" is enough data to ignore the overhead of setting up and calling a CUDA kernel but less than the ceiling of total GPU memory. In practice, this project was designed and tested using one and five million row data sets.
This project only implements support for numeric data types. Though string and blob types are certainly very useful elements of SQL, in practice serious data mining on unstructured data is often easier to implement with another paradigm. Strings also break the fixed-column width data arrangement used for this project, and transferring character pointers from the host to device is a tedious operation. The numeric data types supported include 32 bit integers, 32 bit IEEE 754 floating point values, 64 bit integers, and 64 bit IEEE 754 double precision values. Relaxing these restrictions is an area for future work.

4.2 Data Set
As previously described, this project assumes data stays resident on the card across multiple queries and thus neglects the up-front cost of moving data to the GPU. Based on the read-only nature of the SQL queries in this project and the characteristics of the CUDA programming model, data is stored on the GPU in row-column form. SQLite stores its data in a B-Tree, thus an explicit translation step is required. For convenience, this process is performed with a SELECT query in SQLite to retrieve a subset of data from the currently open database.
The Tesla C1060 GPU used for development has 4 gigabytes of global memory, thus setting the upper limit of data set size without moving data on and off the card during query execution. Note that in addition to the data set loaded on the GPU, there must be another memory block allocated to store the result set. Both of these blocks are allocated during the initialization of the program. In addition to allocation, meta data such as the size of the block, the number of rows in the block, the stride of the block, and the size of each column must be explicitly managed.
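A minimal sketch of how such explicitly managed blocks might be laid out is shown below; the structure, names (TableMeta, init_blocks), and fixed column limit are illustrative assumptions rather than the project's actual code.

/* Meta data for a fixed-stride row-column block resident on the GPU:
   both host code and the kernel need the row count, the row stride,
   and the type and offset of each column. */
typedef struct {
    int num_rows;      /* rows resident on the card                       */
    int stride;        /* bytes from the start of one row to the next     */
    int num_cols;      /* columns per row                                 */
    int col_type[8];   /* e.g. 0 = int32, 1 = float, 2 = int64, 3 = double */
    int col_off[8];    /* byte offset of each column within a row         */
} TableMeta;

char *d_data;     /* data set block: filled once, read by many queries */
char *d_results;  /* result set block: reused by each query            */

void init_blocks(const TableMeta *meta, size_t result_bytes) {
    /* Both blocks are allocated once, during program initialization. */
    cudaMalloc((void **)&d_data, (size_t)meta->num_rows * meta->stride);
    cudaMalloc((void **)&d_results, result_bytes);
}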
4.3 Memory Spaces
This project attempts to utilize the memory hierarchy of the CUDA programming model to its full extent, employing register, shared, constant, local, and global memory [19]. Register memory holds thread-specific memory such as offsets in the data and results blocks. Shared memory, memory shared among all threads in the thread block, is used to coordinate threads during the reduction phase of the kernel execution, in which each thread with a result row must emit that row to a unique location in the result data set. Constant memory is particularly useful for this project since it is used to store the opcode program executed by every thread. It is also used to store data set meta information, including column types and widths. Since the program and this data set information is accessed very frequently across all threads, constant memory significantly reduces the overhead that would be incurred if this information was stored in global memory.
Global memory is necessarily used to store the data set on which the query is being performed. Global memory has significantly higher latency than register or constant memory, thus no information other than the entire data set is stored in global memory, with one esoteric exception. Local memory is an abstraction in the CUDA programming model that means memory within the scope of a single thread that is stored in the global memory space. Each CUDA thread block is limited to 16 kilobytes of register memory: when this limit is broken, the compiler automatically places variables in local memory. Local memory is also used for arrays that are accessed by variables not known at compile time. This is a significant limitation since the SQLite virtual machine registers are stored in an array. This limitation is discussed in further detail below.
Note that texture memory is not used for data set access. Texture memory acts as a one to three dimensional cache for accessing global memory and can significantly accelerate certain applications [19]. Experimentation determined that using texture memory had no effect on query performance. There are several reasons for this. First, the global data set is accessed relatively infrequently; data is loaded into SQLite registers before it is manipulated. Next, texture memory is optimized for two dimensional caching, while the data set is accessed as one dimensional data in a single block of memory. Finally, the row-column data format enables most global memory accesses to be coalesced, reducing the need for caching.

4.4 Parsed Queries
As discussed above, SQLite parses a SQL query into an opcode program that resembles assembly code. This project calls the SQLite command processor and extracts the results, removing data superfluous to the subset of SQL queries implemented in this project. A processing phase is also used to ready the opcode program for transfer to the GPU, including dereferencing pointers and storing the target directly in the opcode program. A sample program is printed below, output by the command processor for query 1 in Appendix A.

0: Trace 0 0 0
1: Integer 60 1 0
2: Integer 0 2 0
3: Goto 0 17 0
4: OpenRead 0 2 0
5: Rewind 0 15 0
6: Column 0 1 3
7: Le 1 14 3
8: Column 0 2 3
9: Ge 2 14 3
10: Column 0 0 5
11: Column 0 1 6
12: Column 0 2 7
13: ResultRow 5 3 0
14: Next 0 6 0
15: Close 0 0 0
16: Halt 0 0 0
17: Transaction 0 0 0
18: VerifyCookie 0 1 0
19: TableLock 0 2 0
20: Goto 0 4 0

A virtual machine execution of this opcode procedure iterates sequentially over the entire table and emits result rows. Note that not all of the opcodes are relevant to this project's storage of a single table in GPU memory, and those are thus not implemented. The key to this kind of procedure is that opcodes manipulate the program counter and jump to different locations, thus opcodes are not always executed in order. The Next opcode, for example, advances from one row to the next and jumps to the value of the second argument. An examination of the procedure thus reveals that the block of opcodes 6 through 14 is executed for each row of the table. The procedure is thus inherently parallelizable by assigning each row to a CUDA thread and executing the looped procedure until the Next opcode.
Nearly all opcodes manipulate the array of SQLite registers in some way. The registers are generic memory cells that can store any kind of data and are indexed in an array. The Column opcode is responsible for loading data from a column in the current row into a certain register.
Note the differences between a program of this kind and a procedure of primitives, as implemented in previous research. Primitives are individual CUDA kernels executed serially, while the entire opcode procedure is executed entirely within a kernel. As divergence is created based on the data content of each row, the kernels execute different opcodes. This type of divergence does not occur with a query-plan of primitives.
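A heavily simplified sketch of this per-row execution model is shown below, reusing the hypothetical Op, TableMeta, c_program, and c_meta definitions from the earlier sketches. The opcode numbering, register array, and helper functions are all illustrative inventions; only the overall shape (one thread per row, a private program counter, a switch over opcodes) reflects the design described here.

#define OP_COLUMN    1   /* hypothetical opcode numbering */
#define OP_LE        2
#define OP_RESULTROW 3
#define OP_NEXT      4

__device__ int g_result_count = 0;  /* naive result counter; see the
                                       two-tier reduction in Section 4.5 */

/* Load column col of the given row from the row-column data block. */
__device__ long long load_column(const char *data, int row, int col) {
    return *(const int *)(data + (size_t)row * c_meta.stride
                               + c_meta.col_off[col]);  /* int32 assumed */
}

/* Copy registers first..first+n-1 into a freshly reserved result row. */
__device__ void emit_row(char *results, const long long *reg,
                         int first, int n) {
    int slot = atomicAdd(&g_result_count, 1);
    long long *out = (long long *)results + (size_t)slot * n;
    for (int i = 0; i < n; i++)
        out[i] = reg[first + i];
}

__global__ void vm_kernel(const char *data, char *results) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= c_meta.num_rows) return;

    long long reg[16];  /* SQLite-style registers; dynamically indexed,
                           so placed in local memory (see Section 6.2) */
    int  pc   = 0;      /* per-thread program counter */
    bool live = true;

    while (live) {
        Op op = c_program[pc];
        switch (op.opcode) {              /* divergent switch block */
        case OP_COLUMN:                   /* load column p2 into reg p3 */
            reg[op.p[2]] = load_column(data, row, op.p[1]);
            pc++; break;
        case OP_LE:                       /* jump to p2 if reg[p3] <= reg[p1] */
            pc = (reg[op.p[2]] <= reg[op.p[0]]) ? op.p[1] : pc + 1;
            break;
        case OP_RESULTROW:                /* emit registers p1..p1+p2-1 */
            emit_row(results, reg, op.p[0], op.p[1]);
            pc++; break;
        case OP_NEXT:                     /* this thread's row is finished */
            live = false; break;
        }
    }
}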
4.5 Virtual Machine Infrastructure
The crux of this project is the reimplementation of the SQLite virtual machine with CUDA. The virtual machine is implemented as a CUDA kernel that executes the opcode procedure. The project has implemented around 40 opcodes thus far, which cover the comparison opcodes, such as Ge (greater than or equal), the mathematical opcodes, such as Add, the logical opcodes, such as Or, the bitwise opcodes, such as BitAnd, and several other critical opcodes such as ResultRow. The opcodes are stored in two switch statements.
The first switch statement of the virtual machine allows divergent opcode execution, while the second requires concurrent opcode execution. In other words, the first switch statement allows different threads to execute different opcodes concurrently, and the second does not. When the Next opcode is encountered, signifying the end of the data-dependent parallelism, the virtual machine jumps from the divergent block to the concurrent block. The concurrent block is used for the aggregation functions, where coordination across all threads is essential.
A major piece of the CUDA kernel is the reduction when the ResultRow opcode is called by multiple threads to emit rows of results. Since not every thread emits a row, a reduction operation must be performed to ensure that the result block is a contiguous set of data. This reduction involves inter-thread and inter-threadblock communication, as each thread that needs to emit a row must be assigned a unique area of the result set data block. Although the result set is contiguous, no order of results is guaranteed. This saves the major overhead of completely synchronizing when threads and threadblocks complete execution.
The reduction is implemented using the CUDA atomic operation atomicAdd(), called on two tiers. First, each thread with a result row calls atomicAdd() on a variable in shared memory, thus receiving an assignment within the thread block. The last thread in the block then calls this function on a separate global variable which determines the thread block's position in the memory space, which each thread then uses to determine its exact target row based on the previous assignment within the thread block. Experimentation has found that this method of reduction is faster than others for this particular type of assignment, particularly with sparse result sets.
This project also supports SQL aggregation functions (i.e. COUNT, SUM, MIN, MAX, and AVG), though only for integer values. Significant effort has been made to adhere to the SQLite-parsed query plan without multiple kernel launches. Since inter-threadblock coordination, such as that used for aggregation functions, is difficult without using a kernel launch as a global barrier, atomic functions are used for coordination, but these can only be used with integer values in CUDA. This limitation is expected to be removed in next-generation hardware, and the performance data for integer aggregates is likely a good approximation of future performance for other types.

4.6 Result Set
Once the virtual machine has been executed, the result set of a query still resides on the GPU. Though the speed of query execution can be measured simply by timing the virtual machine, in practice the results must be moved back to the CPU to be useful to the host process. This is implemented as a two-step process. First, the host transfers a block of information about the result set back from the GPU. This information contains the stride of a result row and the number of result rows. The CPU multiplies these values to determine the absolute size of the result block. If there are zero rows then no result memory copy is needed; otherwise a memory copy is used to transfer the result set. Note that because we know exactly how large the result set is, we do not have to transfer the entire block of memory allocated for the result set, saving significant time.
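On the host, this two-step transfer might look like the sketch below; the ResultMeta layout and pointer names are assumptions, but the structure (copy the small descriptor first, then copy only the rows that exist) is exactly the process described above.

typedef struct { int rows; int stride; } ResultMeta;  /* hypothetical */

/* Returns the number of result bytes copied into host_buf. */
size_t fetch_results(void *host_buf, const ResultMeta *d_result_meta,
                     const char *d_results) {
    ResultMeta m;
    /* Step 1: transfer the small result-set descriptor. */
    cudaMemcpy(&m, d_result_meta, sizeof m, cudaMemcpyDeviceToHost);

    /* Step 2: transfer exactly rows * stride bytes, and nothing at
       all when the query produced zero rows. */
    size_t bytes = (size_t)m.rows * m.stride;
    if (bytes > 0)
        cudaMemcpy(host_buf, d_results, bytes, cudaMemcpyDeviceToHost);
    return bytes;
}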
5. PERFORMANCE

5.1 Data Set
The data used for performance testing has five million rows with an id column, three integer columns, and three floating point columns. The data has been generated using the GNU Scientific Library's random number generation functionality. One column of each data type has a uniform distribution in the range [-99.0, 99.0], one column has a normal distribution with a sigma of 5, and the last column has a normal distribution with a sigma of 20. Integer and floating point data types were tested. The random distributions provide unpredictable data processing results and mean that the size of the result set varies based on the criteria of the SELECT query.
To test the performance of the implementation, 13 queries were written, displayed in Appendix A. Five of the thirteen queries test integer values, five test floating point values, and the final three test the aggregation functions. The queries were executed through the CPU SQLite virtual machine, then through the GPU virtual machine, and the running times were compared. Also considered was the time required to transfer the GPU result set from the device to the host. The size of the result set in rows for each query is shown, as this significantly affects query performance. The queries were chosen to demonstrate the flexibility of currently implemented query capabilities and to provide a wide range of computational intensity and result set size.
We have no reason to believe results would change significantly with realistic data sets, since all rows are checked in a select operation, and the performance is strongly correlated with the number of rows returned. The implemented reductions all function such that strange selection patterns, such as selecting every even row, or selecting rows such that only the first threads in a threadblock output a result row, make no difference in performance. Unfortunately, we have not yet been able to set up real data sets to validate this hypothesis, and this is something left for future work, but there is little reason to expect different performance results.
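Generating data with these distributions is straightforward with the GSL; the sketch below matches the distributions described, though the column names and exact code are illustrative.

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

/* Fill one row's three floating point columns; the integer columns
   are generated the same way and truncated. */
void fill_float_columns(gsl_rng *r, float *uniform_col,
                        float *normal5_col, float *normal20_col) {
    *uniform_col  = (float)gsl_ran_flat(r, -99.0, 99.0); /* uniform          */
    *normal5_col  = (float)gsl_ran_gaussian(r, 5.0);     /* normal, sigma 5  */
    *normal20_col = (float)gsl_ran_gaussian(r, 20.0);    /* normal, sigma 20 */
}

/* Usage: gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
          ... call fill_float_columns() five million times ...
          gsl_rng_free(r); */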
5.2 Hardware
The performance results were gathered from an Intel Xeon X5550 machine running Linux 2.6.24. The processor is a 2.66 GHz 64 bit quad-core, supporting eight hardware threads with a maximum throughput of 32 GB/sec. The machine has 5 gigabytes of memory. The graphics card used is an NVIDIA Tesla C1060. The Tesla has 240 streaming multiprocessors, 4 gigabytes of global memory, and supports a maximum throughput of 102 GB/sec.

5.3 Fairness of Comparison
Every effort has been made to produce comparison results that are as conservative as possible.

• Data on the CPU side has been explicitly loaded into memory, thus eliminating mid-query disk accesses. SQLite has functionality to declare a temporary database that exists only in memory. Once initialized, the data set is attached and named (a sketch of this step follows this list). Without this step the GPU implementation is closer to 200X faster, but it makes for a fairer comparison: it means the data is loaded completely into memory for both the CPU and the GPU.

• SQLite has been compiled with the Intel C Compiler version 11.1. It is optimized with the flags -O2, the familiar basic optimization flag, -xHost, which enables processor-specific optimization, and -ipo, which enables optimization across source files. This forces SQLite to be as fast as possible: without optimization SQLite performs significantly worse.

• Directives are issued to SQLite at compile time to omit all thread protection and store all temporary files in memory rather than on disk. These directives reduce overhead on SQLite queries.

• Pinned memory is not used in the comparison. Using pinned memory generally speeds transfers between the host and device by a factor of two. This means that the GPU timing results that include the memory transfer are worse than they would be if this feature was turned on.

• Results from the host query are not saved. In SQLite, results are returned by passing a callback function along with the SQL query. This is set to null, which means that host query results are thrown away while device query results are explicitly saved to memory. This makes the SQLite execution faster.
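The in-memory staging described in the first bullet can be done with standard SQLite SQL; the sketch below is illustrative (file, schema, and table names are hypothetical). The compile-time directives of the third bullet correspond to standard SQLite options such as SQLITE_THREADSAFE=0 and SQLITE_TEMP_STORE=3, though the paper does not list the exact flags used.

void load_in_memory(sqlite3 **out_db) {
    sqlite3 *db;
    sqlite3_open(":memory:", &db);  /* temporary database living in memory */
    /* Attach the on-disk file, copy the table across, and detach, so
       queries never touch the disk afterwards. */
    sqlite3_exec(db, "ATTACH DATABASE 'test.db' AS disk;", NULL, NULL, NULL);
    sqlite3_exec(db, "CREATE TABLE test AS SELECT * FROM disk.test;",
                 NULL, NULL, NULL);
    sqlite3_exec(db, "DETACH DATABASE disk;", NULL, NULL, NULL);
    *out_db = db;
}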
5.4 Results
Table 1 shows the mean results for the five integer queries, the five floating point queries, the three aggregation queries, and all of the queries. The rows column gives the average number of rows output to the result set during a query, which is 1 for the aggregate functions data, because the functions implemented reduce down to a single value across all rows of the data set. The mean speedup across all queries was 50X, which was reduced to 36X when the results transfer time was included. This means that on average, running the queries on the dataset already loaded onto the GPU and transferring the result set back was 36X faster than executing the query on the CPU through SQLite. The numbers for the All row are calculated with the summation of the time columns, and are thus time-weighted.
Table 1: Performance Data by Query Type

Queries       Speedup   Speedup w/ Transfer   CPU time (s)   GPU time (s)   Transfer Time (s)   Rows Returned
Int           42.11     28.89                 2.3843         0.0566         0.0259148           1950104.4
Float         59.16     43.68                 3.5273         0.0596         0.0211238           1951015.8
Aggregation   36.22     36.19                 1.0569         0.0292         0.0000237           1
All           50.85     36.20                 2.2737         0.0447         0.0180920           1500431.08
Figure 1: The speedup of query execution on the GPU for each of the 13 queries considered, both including and excluding the results transfer time
Figure 1 graphically shows the speedup and speedup with transfer time of the tested queries. Odd numbered queries are integer queries, even numbered queries are floating point queries, and the final 3 queries are aggregation calls. The graph shows the significant deviations in speedup values depending on the specific query. The pairing of the two speedup measurements also demonstrates the significant amount of time that some queries, such as query 6, spend transferring the result set. In other queries, such as query 2, there is very little difference. The aggregation queries all had fairly average results but trivial results transfer time, since the aggregation functions used all reduced to a single result. These functions were run over the entire dataset, thus the speedup represents the time it takes to reduce five million rows to a single value.
The time to transfer the data set from the host memory of SQLite to the device memory is around 2.8 seconds. This operation is so expensive because the data is retrieved from SQLite through a query and placed into row-column form, thus it is copied several times. This is necessary because SQLite stores data in B-Tree form, while this project's GPU virtual machine expects data in row-column form. If these two forms were identical, data could be transferred directly from the host to the device with a time comparable to the result transfer time. Note that if this were the case, many GPU queries would be faster than CPU queries even including the data transfer time, query execution time, and the results transfer time. As discussed above, we assume that multiple queries are being performed on the same data set and ignore this overhead, much as we ignore the overhead of loading the database file from disk into SQLite memory.
Interestingly, the floating point queries had a slightly higher speedup than the integer queries. This is likely a result of the GPU's treatment of integers. While the GPU supports IEEE 754 compliant floating point operations, integer math is done with a 24-bit unit, thus 32-bit integer operations are essentially emulated [19]. The resulting difference in performance is nontrivial but not big enough to change the magnitude of the speedup. Next generation NVIDIA hardware is expected to support true 32-bit integer operations.
There are several major factors that affect the results of individual queries, including the difficulty of each operation and output size. Though modern CPUs run at clock speeds in excess of 2 GHz and utilize extremely optimized and deeply pipelined ALUs, the fact that these operations are parallelized over 240 streaming multiprocessors means that the GPU should outperform in this area, despite the fact that the SMs are much less optimized on an individual level. Unfortunately, it is difficult to measure the computational intensity of a query, but it should be noted that queries 7 and 8, which involve multiplication operations, performed on par with the other queries, despite the fact that multiplication is a fairly expensive operation.
A more significant determinant of query speedup was the size of the result set, in other words, the number of rows that a query returned. This matters because a bigger result set increases the overhead of the reduction step, since each thread must call atomicAdd(). It also directly affects how long it takes to copy the result set from device memory to host memory. These factors are illuminated with Figure 2. A set of 21 queries was executed in which rows of data were returned when the uniformi column was less than x, where x was a value in the range [-100, 100] incremented by 10 for each subsequent query. Since the uniformi column contains a uniform distribution of integers between -99 and 99, each step of 10 covers 10/200 of the five million rows, so the expected size of the result set increased by 250,000 with each query, ranging from 0 to 5,000,000.
The most striking trend of this graph is that the speedup of GPU query execution increased along with the size of the result set, despite the reduction overhead. This indicates that the GPU implementation is more efficient at handling a result row than the CPU implementation, probably because of the sheer throughput of the device.
The overhead of transferring the result set back is demonstrated in the second line, which gradually diverges from the first but still trends up, showing that the GPU implementation is still more efficient when the time to transfer a row back is considered. For these tests, the unweighted average time to transfer a single 16 byte row (including meta information and memory copy setup overhead) was 7.67 ns. Note that the data point for 0 returned rows is an outlier. This is because transferring results back is a two step process, as described in the implementation section, and the second step is not needed when there are no result rows. This point thus shows how high the overhead is for using atomic operations in the reduction phase and initiating a memory copy operation in the results transfer phase.

Figure 2: The effect of the result set size on the speedup of GPU query execution, including and excluding the results transfer time

We have not yet implemented a parallel version of the same SQLite functionality for multicore CPUs. This is an important aspect of future work. In the meantime, the potential speedup with multiple cores must be kept in mind when interpreting the GPU speedups we report. Speedup with multicore would have an upper bound of the number of hardware threads supported, 8 on the Xeon X5550 used for testing, and would be reduced by the overhead of coordination, resulting in a speedup less than 8X. The speedups we observed with the GPU substantially exceed these numbers, showing that the GPU has a clear architectural advantage.

6. FURTHER IMPROVEMENT

6.1 Unimplemented Features
By only implementing a subset of SELECT queries on the GPU, the programmer is limited to read-only operations. As discussed, this approach applies speed to the most useful and frequently used area of data processing. Further research could examine the power of the GPU in adding and removing data from the memory-resident data set. Though it is likely that the GPU would outperform the CPU in this area as well, it would be subject to a number of constraints, most importantly the host to device memory transfer bottleneck, that would reduce the usefulness of such an implementation.
The subset of possible SELECT queries implemented thus far precludes several important and frequently used features. First and foremost, this project does not implement the JOIN command, used to join multiple database tables together as part of a SELECT query. The project was designed to give performance improvement for multiple queries run on data that has been moved to the GPU, thus encouraging running an expensive JOIN operation before the data is primed. Indeed, since data is transferred to the GPU with a SELECT query in this implementation, such an operation is trivial. GROUP BY operations are also ignored. Though not as complex as join operations, they are a commonly implemented feature that may be included in future implementations. The SQL standard includes many other operators, both commonly used and largely unimplemented, and this discussion of missing features is far from comprehensive.
Further testing should include a multicore implementation of SQLite for better comparison against the GPU results presented. Such an implementation would be able to achieve a maximum of only n times faster execution on an n-core machine, but a comparison with the overhead of the shared memory model versus the CUDA model would be interesting and valuable. Additionally, further testing should compare these results against other open source and commercial databases that do utilize multiple cores. Anecdotal evidence suggests that SQLite performance is roughly equivalent to other databases on a single core, but further testing would prove this equivalence.

6.2 Hardware Limitations
There exist major limitations of current GPU hardware that significantly limit this project's performance, but they may be reduced in the near future. First, indirect jumps are not allowed. This is significant because each of the 35 SQLite opcodes implemented in the virtual machine exists in a switch block. Since this block is used for every thread for every opcode, comparing the switch argument to the opcode values creates nontrivial overhead. The opcode values are arbitrary, and must only be unique, thus they could be set to the location of the appropriate code, allowing the program to jump immediately for each opcode and effectively removing this overhead. Without indirect jumps, this optimization is impossible.
The next limitation is that dynamically accessed arrays are stored in local memory rather than register memory in CUDA. Local memory is an abstraction that refers to memory in the scope of a single thread that is stored in the global memory of the GPU. Since it has the same latency as global memory, local memory is 100 to 150 times slower than register memory [19]. In CUDA, arrays that are accessed with an index that is unknown at compile time are automatically placed in local memory. In fact, it is impossible to store them in register memory. The database virtual machine is abstract enough that array accesses of this nature are required and very frequent, in this case with the SQLite register array. Even the simplest SQL queries such as query 1 (shown in Appendix A) require around 25 SQLite register accesses, thus not being able to use register memory here is a huge restriction.
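The contrived kernel fragment below illustrates this behavior. Because idx is not known at compile time, nvcc must place reg[] in local memory; compiling with nvcc -Xptxas -v reports the local memory ("lmem") usage.

/* in must hold at least blockDim.x + 16 ints. */
__global__ void dynamic_index_example(const int *in, int *out) {
    int reg[16];                      /* forced into slow local memory */
    for (int i = 0; i < 16; i++)
        reg[i] = in[threadIdx.x + i];
    int idx = in[threadIdx.x] & 15;   /* index known only at run time  */
    out[threadIdx.x] = reg[idx];      /* dynamic access -> local load  */
}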
Finally, atomic functions in CUDA, such as atomicAdd(), are implemented only for integer values. Implementation for other data types would be extremely useful for inter-threadblock communication, particularly given the architecture of this project, and would make implementation of the aggregate functions much simpler.
9. for other data types would be extremely useful for inter- resident on a single host and across multiple hosts.
threadblock communication, particularly given the architec-
ture of this project, and would make implementation of the 7. CONCLUSIONS
aggregate functions much simpler.
All three of these limitations are expected to disappear with Fermi, the next generation of NVIDIA's architecture [18]. Significant efforts are being made to bring the CUDA development environment in line with what the average programmer is accustomed to, such as a unified address space for the memory hierarchy that makes it possible to run true C++ on Fermi GPUs. It is likely that this unified address space will enable dynamic arrays in register memory. Combined with the general performance improvements of Fermi, it is possible that a slightly modified implementation will be significantly faster on this new architecture.

The most important hardware limitation from the standpoint of a database is the relatively small amount of global memory on current-generation NVIDIA GPUs. The current top-of-the-line GPGPU, the NVIDIA Tesla C1060, has four gigabytes of memory. Though this is large enough for literally hundreds of millions of rows of data, in practice many databases are in the terabyte or even petabyte range. This restriction hampers database research on the GPU and limits any enterprise application. Fermi will employ a 40-bit address space, making it possible to address up to a terabyte of memory, though it remains to be seen how much of this space Fermi-based products will actually use.
With the capabilities of CUDA there are two ways around the memory limitation. First, data could be staged (or "paged") between the host and the device during the execution of a query. For example, a query run on a 6 GB database could move 3 GB to the GPU, execute on this block, then move the second half to the GPU and complete execution. The memory transfer time would create significant overhead, and the entire database would have to fit into host memory, since staging from disk would create a huge bottleneck. It is possible that queries executed this way would still outperform CPU execution, but this scheme was not tested in this project.
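The following sketch illustrates the staging scheme just described. It is not code from this project; the hypothetical runQueryKernel() stands in for the project's virtual machine kernel, and merging of per-chunk result sets is omitted.

    // Chunked ("paged") execution: stream a table larger than GPU
    // memory through a fixed-size device buffer.
    #include <cuda_runtime.h>

    // Assumed to exist elsewhere: the query kernel of this project.
    __global__ void runQueryKernel(const char* rows, long numRows);

    void executePagedQuery(const char* hostTable, long numRows,
                           long rowSize, long rowsPerChunk) {
        char* deviceChunk;
        cudaMalloc((void**)&deviceChunk, rowsPerChunk * rowSize);

        for (long start = 0; start < numRows; start += rowsPerChunk) {
            long count = numRows - start < rowsPerChunk
                             ? numRows - start : rowsPerChunk;
            // Copy the next block of rows to the device...
            cudaMemcpy(deviceChunk, hostTable + start * rowSize,
                       count * rowSize, cudaMemcpyHostToDevice);
            // ...and execute the query over just that block.
            runQueryKernel<<<(count + 255) / 256, 256>>>(deviceChunk, count);
            cudaThreadSynchronize();
        }
        cudaFree(deviceChunk);
    }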
The second workaround for the memory limitation is to utilize CUDA's "zero-copy" direct memory access functionality, but this is less feasible than the first option. Not only does this type of DMA have prohibitively low bandwidth, but it also requires that the memory be declared as pinned¹ [19]. In practice, both the GPU and the operating system are likely to limit pinned memory to less than 4 gigabytes, thus undermining the basis of this approach.

¹ This type of memory is also called page-locked, and means that the operating system has relinquished the ability to swap out the page. Thus, once allocated, the memory is guaranteed to be in a certain location.
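For completeness, the zero-copy path would look roughly as follows. The pattern uses only documented CUDA runtime calls [19], but the surrounding function is hypothetical; every row access inside the kernel then travels over the PCIe bus, which is the bandwidth problem noted above.

    // Zero-copy access: the kernel dereferences pinned (page-locked)
    // host memory directly instead of a device-resident copy.
    #include <cuda_runtime.h>

    __global__ void runQueryKernel(const char* rows, long numRows);

    void executeZeroCopyQuery(long numRows, long rowSize) {
        char* hostTable;
        char* deviceView;
        // Allocate pinned, device-mapped host memory [19].
        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc((void**)&hostTable, numRows * rowSize,
                      cudaHostAllocMapped);
        // ... load the table into hostTable here ...
        cudaHostGetDevicePointer((void**)&deviceView, hostTable, 0);
        // Each access in the kernel now crosses the PCIe bus.
        runQueryKernel<<<(numRows + 255) / 256, 256>>>(deviceView, numRows);
        cudaThreadSynchronize();
        cudaFreeHost(hostTable);
    }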
6.3 Multi-GPU Configuration

A topic left unexamined in this paper is the possibility of breaking up a data set and running a query concurrently on multiple GPUs. Though there would certainly be coordination overhead, it is very likely that SQL queries could be further accelerated with such a configuration. Consider the NVIDIA Tesla S1070, a server product which contains 4 Tesla GPUs. This machine has a combined GPU throughput of 408 GB/sec, 960 streaming multiprocessors, and a total of 16 GB of GPU memory. Further research could implement a query mechanism that takes advantage of multiple GPUs resident on a single host and across multiple hosts.
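A sketch of the single-host case follows, assuming a CUDA runtime recent enough (4.0 or later) for one host thread to drive several devices; on the hardware of this paper's era, one CPU thread per GPU would be required instead. The hypothetical runQueryKernel() again stands in for the project's virtual machine kernel, and per-GPU result collection is omitted.

    // Partition rows across all GPUs on one host and run the same
    // query kernel on each partition concurrently.
    #include <cuda_runtime.h>

    __global__ void runQueryKernel(const char* rows, long numRows);

    void executeMultiGpuQuery(const char* hostTable, long numRows,
                              long rowSize) {
        int gpuCount;
        cudaGetDeviceCount(&gpuCount);
        long rowsPerGpu = (numRows + gpuCount - 1) / gpuCount;

        for (int g = 0; g < gpuCount; g++) {
            long start = g * rowsPerGpu;
            long count = numRows - start;
            if (count <= 0) break;
            if (count > rowsPerGpu) count = rowsPerGpu;

            cudaSetDevice(g);  // subsequent calls target GPU g
            char* devRows;
            cudaMalloc((void**)&devRows, count * rowSize);
            cudaMemcpy(devRows, hostTable + start * rowSize,
                       count * rowSize, cudaMemcpyHostToDevice);
            // Launches are asynchronous, so the GPUs run in parallel.
            runQueryKernel<<<(count + 255) / 256, 256>>>(devRows, count);
        }
        // Wait for every device to finish (result collection omitted).
        for (int g = 0; g < gpuCount; g++) {
            cudaSetDevice(g);
            cudaDeviceSynchronize();
        }
    }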
7. CONCLUSIONS

This project simultaneously demonstrates the power of using a generic interface to drive GPU data processing and provides further evidence of the effectiveness of accelerating database operations by offloading queries to a GPU. Though only a subset of all possible SQL queries can be used, the results are promising and there is reason to believe that a full implementation of all possible SELECT queries would achieve similar results. SQL is an excellent interface through which the GPU can be accessed: it is much simpler and more widely used than many alternatives. Using SQL represents a break from the paradigm of previous research, which drove GPU queries through the use of operational primitives such as map, reduce, or sort. Additionally, it dramatically reduces the effort required to employ GPUs for database acceleration. The results of this paper suggest that implementing databases on GPU hardware is a fertile area for future research and commercial development.

The SQLite database was used as a platform for the project, enabling the use of an existing SQL parsing mechanism and switching between CPU and GPU execution. Execution on the GPU was supported by reimplementing the SQLite virtual machine as a CUDA kernel. The queries executed on the GPU were an average of 35X faster than those executed through the serial SQLite virtual machine. The characteristics of each query, the type of data being queried, and the size of the result set were all significant factors in how CPU and GPU execution compared. Despite this variation, the minimum speedup for the 13 queries considered was 20X.

Additionally, the results of this paper are expected to improve with the release of the next generation of NVIDIA GPU hardware. Though further research is needed, clearly native SQL query processing can be significantly accelerated with GPU hardware.

8. ACKNOWLEDGEMENTS

This work was supported in part by NSF grant no. IIS-0612049 and SRC grant no. 1607.001. We would also like to thank the anonymous reviewers for their helpful comments.

9. REFERENCES

[1] N. Bandi, C. Sun, D. Agrawal, and A. El Abbadi. Hardware acceleration in commercial databases: a case study of spatial operations. In VLDB '04: Proceedings of the Thirtieth International Conference on Very Large Data Bases, pages 1021–1032. VLDB Endowment, 2004.
[2] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. A performance study of general-purpose applications on graphics processors using CUDA. J. Parallel Distrib. Comput., 68(10):1370–1380, 2008.
[3] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[4] A. di Blas and T. Kaldeway. Data monster: Why graphics processors will transform database processing. IEEE Spectrum, September 2009.
[5] S. Ding, J. He, H. Yan, and T. Suel. Using graphics processors for high performance IR query processing. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 421–430, New York, NY, USA, 2009. ACM.
[6] R. Fang, B. He, M. Lu, K. Yang, N. K. Govindaraju, Q. Luo, and P. V. Sander. GPUQP: query co-processing using graphics processors. In ACM SIGMOD International Conference on Management of Data, pages 1061–1063, New York, NY, USA, 2007. ACM.
[7] W. Fang, K. K. Lau, M. Lu, X. Xiao, C. K. Lam, P. Y. Yang, B. He, Q. Luo, P. V. Sander, and K. Yang. Parallel data mining on graphics processors. Technical report, Hong Kong University of Science and Technology, 2008.
[8] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In ACM SIGMOD International Conference on Management of Data, pages 325–336, New York, NY, USA, 2006. ACM.
[9] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 206, New York, NY, USA, 2005. ACM.
[10] T. D. Han and T. S. Abdelrahman. hiCUDA: a high-level directive-based language for GPU programming. In GPGPU-2: Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, pages 52–61, New York, NY, USA, 2009. ACM.
[11] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: a MapReduce framework on graphics processors. In PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 260–269, New York, NY, USA, 2008. ACM.
[12] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query coprocessing on graphics processors. ACM Trans. Database Syst., 34(4):1–39, 2009.
[13] T. Hoff. Scaling PostgreSQL using CUDA, May 2009. http://highscalability.com/scaling-postgresql-using-cuda.
[14] T. Kaldeway, J. Hagen, A. Di Blas, and E. Sedlar. Parallel search on video cards. Technical report, Oracle, 2008.
[15] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In PPoPP '09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 101–110, New York, NY, USA, 2009. ACM.
[16] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: a programming model for heterogeneous multi-core systems. In ASPLOS XIII: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 287–296, New York, NY, USA, 2008. ACM.
[17] W. Ma and G. Agrawal. A translation system for enabling data mining applications on GPUs. In ICS '09: Proceedings of the 23rd International Conference on Supercomputing, pages 400–409, New York, NY, USA, 2009. ACM.
[18] NVIDIA. NVIDIA's next generation CUDA compute architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[19] NVIDIA. NVIDIA CUDA Programming Guide, 2.3.1 edition, August 2009. http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf.
[20] SQLite. About SQLite. http://sqlite.org/about.html.
[21] SQLite. The architecture of SQLite. http://sqlite.org/arch.html.
[22] SQLite. Most widely deployed SQL database. http://sqlite.org/mostdeployed.html.
[23] SQLite. SQLite virtual machine opcodes. http://sqlite.org/opcode.html.
[24] Thrust. Thrust homepage. http://code.google.com/p/thrust/.
[25] S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-m. W. Hwu. CUDA-lite: Reducing GPU programming complexity. In LCPC, pages 1–15, 2008.

APPENDIX

A. QUERIES USED

Below are the thirteen queries used in the performance measurements. Note that uniformi, normali5, and normali20 are integer values, while uniformf, normalf5, and normalf20 are floating point values.

1. SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0
2. SELECT id, uniformf, normalf5 FROM test WHERE uniformf > 60 AND normalf5 < 0
3. SELECT id, uniformi, normali5 FROM test WHERE uniformi > -60 AND normali5 < 5
4. SELECT id, uniformf, normalf5 FROM test WHERE uniformf > -60 AND normalf5 < 5
5. SELECT id, normali5, normali20 FROM test WHERE (normali20 + 40) > (uniformi - 10)
6. SELECT id, normalf5, normalf20 FROM test WHERE (normalf20 + 40) > (uniformf - 10)
7. SELECT id, normali5, normali20 FROM test WHERE normali5 * normali20 BETWEEN -5 AND 5
8. SELECT id, normalf5, normalf20 FROM test WHERE normalf5 * normalf20 BETWEEN -5 AND 5
9. SELECT id, uniformi, normali5, normali20 FROM test WHERE NOT uniformi OR NOT normali5 OR NOT normali20
10. SELECT id, uniformf, normalf5, normalf20 FROM test WHERE NOT uniformf OR NOT normalf5 OR NOT normalf20
11. SELECT SUM(normalf20) FROM test
12. SELECT AVG(uniformi) FROM test WHERE uniformi > 0
13. SELECT MAX(normali5), MIN(normali5) FROM test