Dark Secrets: Heterogeneous Memory Models is a document about the challenges of memory consistency models across heterogeneous systems such as OpenCL. It discusses three main issues with current models: coarse-grained access to data, separate addressing between devices and the host, and weak, poorly defined ordering controls. It also covers the basics of the OpenCL 2.0 memory model, which addresses these issues by allowing shared virtual addressing between devices and the host, finer-grained synchronization using events, and atomic operations for ordering. However, memory models still have limitations around data races and forward progress that require careful programming.
2. 2
Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT, and QWI. References to “Qualcomm” may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as applicable.
For more information on Qualcomm, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners.
3. 3
Introduction – why memory consistency
Introduction
Current restrictions
Basics of OpenCL 2.0
Heterogeneity in OpenCL 2.0
Heterogeneous memory ordering
Summary
4. 4
Many programming languages have coped with weak memory models
These are fixed in various ways:
− Platform-specific rules
− Conservative compilation behavior
A clear memory model allows developers to understand their program behavior!
− Or part of it anyway. It’s not the only requirement but it is a start.
Many current models are based on the Data-Race-Free (DRF) work
− Special operations are used to maintain order
− Between special operations, reordering is permitted for efficiency
Weak memory models
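To make the DRF idea concrete – a minimal sketch of mine in C11, not code from the deck: the release/acquire pair below are the “special operations”; the ordinary accesses around them may be freely reordered by compiler and hardware, but the pairing orders the data write before the data read.
#include <assert.h>
#include <stdatomic.h>

int data;            /* ordinary, non-atomic location */
atomic_int flag;     /* special: all ordering flows through this */

void producer(void) {
    data = 42;                                               /* ordinary write */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* special op: publishes the write above */
}

void consumer(void) {
    if (atomic_load_explicit(&flag, memory_order_acquire))   /* special op: pairs with the release */
        assert(data == 42);                                  /* data-race-free: guaranteed to see 42 */
}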
5. 5
Many languages have tried to formalize their memory models
− Java, C++, C…
− Why?
OpenCL and HSA are not exceptions
− Both have developed models based on the DRF work, via C11/C++11
Ben Gaster, Derek Hower and I have collaborated on HRF-Relaxed
− An extension of DRF models for heterogeneous systems
− To appear in ACM TACO
Unfortunately, even a stronger memory model doesn’t solve all your problems…
Better memory models
7. 7
Three specific issues:
− Coarse grained access to data
− Separating addressing between devices and the host process
− Weak and poorly defined ordering controls
Any of these can be worked around with vendor extensions or device knowledge
Weakness in the OpenCL 1.x memory model
13. 13
Separate addressing
We can update memory with a pointer
[Diagram: host and OpenCL device each hold the struct definition below, with buffer entries Data[n] and Data[n+1]; the host performs the writes]
struct Foo {
  Foo *ptr;
  int val;
};
data[n].ptr = &data[n+1];
data[n+1].val = 3;
14. 14
Separate addressing
We try to read it – but what is the value of a?
[Diagram: same layout; the OpenCL device reads through the stored pointer]
struct Foo {
  Foo *ptr;
  int val;
};
int a = data[n].ptr->val;
15. 15
Separate addressing
Unfortunately, the address of data may change
[Diagram: the same read, but the buffer may sit at a different address on the device, so the pointer the host stored in data[n].ptr is not meaningful there.]
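A common OpenCL 1.x workaround (a sketch, not from the slides) is to store buffer-relative offsets instead of raw pointers, so the structure survives the address change:

    /* OpenCL C, 1.x-compatible: an index replaces the Foo *ptr member. */
    struct Foo {
        int next;   /* offset of the linked element within the buffer */
        int val;
    };

    __kernel void read_linked(__global struct Foo *data, __global int *out, int n) {
        /* data[n].next is valid wherever the buffer lands, because it is
           relative to the buffer base rather than an absolute address. */
        out[0] = data[data[n].next].val;
    }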
16. 16
Bounds on visibility
Controlling memory ordering is challenging
[Diagram: WorkGroup 0 and WorkGroup 1, each containing work-items 0–3, both attached to Memory.]
17. 17
Physically, this probably means within a single core
Bounds on visibility
Within a group we can synchronize
[Diagram: as before; a work-item in WorkGroup 0 performs a Write to Memory.]
18. 18
Barrier operations synchronize active threads and constituent work-items
Bounds on visibility
Within a group we can synchronize
[Diagram: a Barrier spans the work-items of WorkGroup 0.]
19. 19
Bounds on visibility
Within a group we can synchronize
[Diagram: another work-item in WorkGroup 0 performs a Read and is guaranteed to see the write.]
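In OpenCL C this intra-group write/barrier/read sequence is the familiar barrier pattern (a minimal, illustrative kernel; work_group_barrier is the OpenCL 2.0 spelling of the 1.x barrier() call):

    __kernel void rotate_left(__global int *out, __local int *scratch) {
        size_t lid = get_local_id(0);
        scratch[lid] = (int)get_global_id(0);        /* Write: each work-item stores a value */
        work_group_barrier(CLK_LOCAL_MEM_FENCE);     /* Barrier: all stores in the group become visible */
        size_t next = (lid + 1) % get_local_size(0);
        out[get_global_id(0)] = scratch[next];       /* Read: safely observes another work-item's write */
    }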
20. 20
Bounds on visibility
Between groups we can’t
[Diagram: a work-item in WorkGroup 1 attempts a Read of WorkGroup 0's write; nothing guarantees it sees it.]
21. 21
Bounds on visibility
What if we use fences?
[Diagram: a work-item in WorkGroup 0 performs a Write.]
22. 22
Ensure the write completes for a given work-item
Bounds on visibility
What if we use fences?
[Diagram: the writing work-item in WorkGroup 0 issues a Fence.]
23. 23
Fence at the other end
Bounds on visibility
What if we use fences?
[Diagram: a work-item in WorkGroup 1 issues a matching Fence.]
24. 24
Ensure a read is after the fence
Bounds on visibility
What if we use fences?
[Diagram: the work-item in WorkGroup 1 performs a Read after its fence.]
25. 25
Probably – seeing a write that was written after a fence guarantees that the fence completed
− The spec is not very clear on this
There is no coherence guarantee
− Need the write ever complete?
− If it doesn’t complete, who can see it, and who can know that the fence happened?
Can the flag be updated without a race?
− For that matter, what is a race?
Spliet et al.: KMA
− Weak ordering differences between platforms due to a poorly defined model
Meaning of fences
When did the fence happen?
// Work-item A:
data[n] = value;
fence(…);
flag = trigger;

// Work-item B, concurrently:
if (flag) {
    fence(…);
    value = data[n];
}
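In OpenCL 2.0 this handoff can be written without the undefined race on the flag (a sketch, assuming flag is an atomic_int visible to both work-items):

    /* Work-item A */
    data[n] = value;                                   /* ordinary store */
    atomic_store_explicit(&flag, 1,
        memory_order_release, memory_scope_device);    /* release: orders the store above */

    /* Work-item B */
    if (atomic_load_explicit(&flag,
            memory_order_acquire, memory_scope_device) == 1) {
        int observed = data[n];                        /* well-defined: happens-after the release */
    }

Note the consumer tests the flag rather than spinning on it, for the forward-progress reasons discussed later.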
27. 27
Sharing virtual addresses
We can update memory with a pointer
[Diagram: Host and OpenCL Device again each see struct Foo { Foo *ptr; int val; }; the host writes data[n].ptr = &data[n+1]; and data[n+1].val = 3;.]
28. 28
Sharing virtual addresses
We try to read it – but what is the value of a?
[Diagram: the device reads int a = data[n].ptr->val;.]
29. 29
Sharing virtual addresses
Now the address does not change!
[Diagram: with shared virtual addressing the device's read int a = data[n].ptr->val; sees the host's pointer unchanged, and assert(a == 3) holds.]
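On the host, shared virtual addressing is set up with the OpenCL 2.0 SVM API (a minimal sketch; error handling is omitted, and ctx, queue, kernel, N, n, and gws are assumed to exist):

    /* Fine-grained SVM: the host may read and write without map/unmap calls. */
    struct Foo *data = (struct Foo *)clSVMAlloc(
        ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
        N * sizeof(struct Foo), 0);

    data[n].ptr = &data[n + 1];   /* a real pointer: the same address is valid on the device */
    data[n + 1].val = 3;

    clSetKernelArgSVMPointer(kernel, 0, data);   /* pass the SVM pointer straight to the kernel */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);

With a coarse-grained SVM buffer the same host accesses would additionally need clEnqueueSVMMap/clEnqueueSVMUnmap around them, as the next slides discuss.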
30. 30
Unmap on the host, event dependency on device
Sharing data – when does the value change?
Coarse grained
[Diagram: the host writes data[n].ptr = &data[n+1]; and data[n+1].val = 3;, then clEnqueueUnmapMemObject hands the buffer over; the device's read int a = data[n].ptr->val; is ordered after it through event e.]
31. 31
Granularity of data race covers the whole buffer
Sharing data – when does the value change?
Coarse grained
[Diagram: the same flow; any concurrent host and device access anywhere in the buffer is a race.]
32. 32
Event dependency on device – caches will flush as necessary
Sharing data – when does the value change?
Fine grained
[Diagram: no unmap; the host writes and the device's read int a = data[n].ptr->val; are ordered only by event e.]
33. 33
Data will be merged – data race at byte granularity
Sharing data – when does the value change?
Fine grained
[Diagram: the same flow; concurrent host and device writes are merged, so only same-byte conflicts constitute a race.]
34. 34
No dispatch-level ordering necessary
Sharing data – when does the value change?
Fine grained with atomic support
[Diagram: the host writes data[n].ptr = &data[n+1]; and atomic-stores data[n+1].val = 3; the device atomic-loads data[n].ptr->val with no event dependency at all.]
35. 35
Races at byte level – avoided using the atomics in the memory model
Sharing data – when does the value change?
Fine grained with atomic support
[Diagram: as above; host and device run concurrently, and the memory model's atomics avoid the byte-level races.]
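Written out, the fine-grained-with-atomics flow might look like this (a sketch; ready is an assumed atomic_int in an allocation created with CL_MEM_SVM_ATOMICS):

    /* Host (C11 atomics over the shared allocation): build, then publish. */
    data[n].ptr = &data[n + 1];
    data[n + 1].val = 3;
    atomic_store_explicit(&ready, 1, memory_order_release);

    /* Device (OpenCL C 2.0): consume at all-SVM-devices scope. */
    if (atomic_load_explicit(&ready, memory_order_acquire,
                             memory_scope_all_svm_devices) == 1) {
        int a = data[n].ptr->val;    /* ordered after the host's writes: a == 3 */
    }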
36. 36
This is an ease of programming concern
− Complex apps with complex data structures can more effectively work with data
− Less work to package and repackage data
It can also improve performance
− Less overhead in updating pointers to convert to offsets
− Less overhead in repacking data
− Lower overhead of data copies when appropriate hardware support present
Sharing virtual addresses
37. 37
Most memory operations are entirely unordered
− The compiler can reorder
− The caches can reorder
Unordered accesses to the same location are races
− Behaviour is undefined
Ordering operations (atomics) may update the same location
− Ordering operations order other operations relative to the ordering operation
Ordering operations default to sequentially consistent semantics
Sharing virtual addresses
SC for Data-Race-Free by default – release consistency for flexibility
38. 38
Commonly tried on OpenCL 1.x
− Not at all portable!
− Due to: weak memory ordering, lack of forward progress
Is the situation any better for OpenCL 2.0?
− Yes, memory ordering is now under control!
− Is forward progress?
So what can we safely do with synchronization
Take a spin wait…
39. 39
Yes and no
− Conceptually similar to a CPU waiting on an event, i.e. for all work-items to complete
− Another app could occupy the device, graphics could consume the device, or the work may interfere with graphics
− Risk of the whole device being occupied elsewhere, so this only works on the assumption that this context owns the device
So what can we safely do with synchronization
CPU thread waits on work-item – works?
OpenCL work-item:
    store-release value to flag
CPU thread:
    while (load-acquire flag != value) {}
    // Do work dependent on flag
Happens-before: release store → acquire load
40. 40
Probably
− The spec doesn’t guarantee this
− How do you know what core your work-item is on?
− All you know is which work-group it is in.
So what can we safely do with synchronization
Work-item on one core waits for work-item on another core – works?
Core 0, OpenCL work-item:
    store-release value to flag
Core 1, OpenCL work-item:
    while (load-acquire flag != value) {}
    // Do work dependent on flag
Happens-before: release store → acquire load
41. 41
Sometimes
− On some architectures this is fine
− On other architectures if both SIMD threads are on the same core: starvation
− Thread 0 may never run to satisfy thread 1’s spin wait
So what can we safely do with synchronization
Work-item on one thread waits for work-item on another thread – works?
{SIMD} thread 0, OpenCL work-item:
    store-release value to flag
{SIMD} thread 1, OpenCL work-item:
    while (load-acquire flag != value) {}
    // Do work dependent on flag
Happens-before: release store → acquire load
42. 42
No (well, sometimes yes, but rarely)
− Fairly widely understood, but sometimes hard for new developers
− If you think about the mapping to SIMD it is fairly obvious
− A single program counter can’t be in two places at once – some architectures can track multiple
So what can we safely do with synchronization
Work-item waits for work-item in the same SIMD thread – works?
{SIMD} thread 0, OpenCL work-item 0:
    store-release value to flag
{SIMD} thread 0, OpenCL work-item 1:
    while (load-acquire flag != value) {}
    // Do work dependent on flag
Happens-before: release store → acquire load
43. 43
Maybe, maybe not
− It depends entirely on where the work-items are mapped in the group
− Same thread – no
− Different threads – maybe
− The developer can often tell, but it isn’t portable and the compiler can easily break it
So what can we safely do with synchronization
Work-items in a work-group
Work-group 0, OpenCL work-item:
    store-release value to flag
Work-group 0, another OpenCL work-item:
    while (load-acquire flag != value) {}
    // Do work dependent on flag
Happens-before: release store → acquire load
44. 44
Maybe, maybe not
− It depends entirely on where the work-groups are placed on the device
− Two work-groups on the same core – you have the thread to thread case
− Two work-groups on different cores – it probably works
− No way to control the mapping!
So what can we safely do with synchronization
Work-groups
Work-group 0, OpenCL work-item:
    store-release value to flag
Work-group 1, OpenCL work-item:
    while (load-acquire flag != value) {}
    // Do work dependent on flag
Happens-before: release store → acquire load
45. 45
Realistically
− Spin waits on other OpenCL work-items are just not portable
− Very limited use of the memory model
So what can you do?
− Communicating that work has passed a certain point
− Updating shared data buffers with flags
− Lock-free FIFO data structures to share data
− OpenCL 2.0’s sub-group extension provides limited but important forward progress guarantees
So what can we safely do with synchronization
Overall, a fairly poor situation
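For instance, a kernel can publish completion through per-item flags that only the host ever spins on (a sketch; it assumes fine-grained SVM with atomics, and compute/use are placeholder functions):

    /* Device: one-way signalling only – no work-item ever waits on another. */
    results[gid] = compute(gid);
    atomic_store_explicit(&done[gid], 1,
        memory_order_release, memory_scope_all_svm_devices);

    /* Host thread: spinning here is safe, since a CPU thread makes
       independent forward progress regardless of device scheduling. */
    while (atomic_load_explicit(&done[i], memory_order_acquire) == 0)
        ;
    use(results[i]);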
47. 47
Even acquire-release consistency can be expensive
In particular, always synchronizing the whole system is expensive
− A discrete GPU does not want to always make data visible across the PCIe interface
− A single core shuffling data in local cache does not want to interfere with DRAM-consuming throughput tasks
OpenCL 2 optimizes this using the concept of synchronization scopes
Lowering implementation cost
48. 48
Not all communication is global – so bound it
Hierarchical memory synchronization
Synchronization is expensive
[Diagram: an OpenCL System containing a CPU Device, a DSP Device, and a GPU Device; each device's cores run work-groups (WG), which contain sub-groups (SG), which contain work-items (WI).]
49. 49
Sub-group scope
Hierarchical memory synchronization
Scopes!
[Diagram: the same hierarchy, with sub-group scope highlighted.]
50. 50
Sub-group scope; Work-group scope
Hierarchical memory synchronization
Scopes!
[Diagram: the same hierarchy, with work-group scope added.]
51. 51
Sub-group scope; Work-group scope; Device scope
Hierarchical memory synchronization
Scopes!
[Diagram: the same hierarchy, with device scope added.]
52. 52
Sub-group scope; Work-group scope; Device scope; All-SVM-Devices scope
Hierarchical memory synchronization
Scopes!
[Diagram: the same hierarchy, with all-SVM-devices scope spanning the whole system.]
53. 53
Hierarchical memory synchronization
Release to the appropriate scope, acquire from the matching scope
[Diagram: store-release work-group-scope x in one work-item pairs with load-acquire work-group-scope x in another work-item of the same work-group.]
54. 54
Hierarchical memory synchronization
Release to the appropriate scope, acquire from the matching scope
[Diagram: store-release device-scope x in one work-group pairs with load-acquire device-scope x in another work-group on the same device.]
55. 55
Hierarchical memory synchronization
If scopes do not reach far enough, this is a race
[Diagram: store-release work-group-scope x in one work-group against load-acquire work-group-scope x in a different work-group – the scope does not reach, so this is a race.]
56. 56
Allows aggressive hardware optimization to coherence traffic
− GPU coherence in particular is often expensive – GPUs generate a lot of memory traffic
The memory model defines synchronization rules in terms of scopes
− Insufficient scope is a race
− Non-matching scopes race in the OpenCL 2.0 model (though this restriction isn’t strictly necessary)
Scoped synchronization
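In OpenCL C 2.0 the scope is an explicit argument on every atomic, so the matching pairs of the previous slides read as follows (a sketch; flag is an assumed atomic_int):

    /* Producer work-item: */
    data[i] = 42;
    atomic_store_explicit(&flag, 1,
        memory_order_release, memory_scope_work_group);   /* visibility bounded to this work-group */

    /* Consumer work-item in the same work-group: */
    if (atomic_load_explicit(&flag, memory_order_acquire,
                             memory_scope_work_group) == 1) {
        int v = data[i];   /* ordered – were the consumer in another work-group, this pair would race */
    }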
57. 57
Four address spaces in OpenCL 2.0
− Constant and private are not relevant for communication
− Global and local maintain separate orders in the memory model
Synchronization, acquire/release behavior, etc. apply only to local OR global, not both
− The global release->acquire order below does not order the updates to a!
Address space orderings
{SIMD} thread 0, OpenCL work-item:
    local-store a = 2
    global-store-release value to flag
{SIMD} thread 1, OpenCL work-item:
    while (global-load-acquire flag != value) {}
    assert local-load a == 2   // NOT guaranteed – local-happens-before is a separate order from the global one
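To order both address spaces at once, a fence must name both; a global-only release leaves local memory unordered (an OpenCL C 2.0 sketch):

    /* Orders prior local AND global accesses before later ones in this work-item: */
    atomic_work_item_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE,
                           memory_order_release, memory_scope_work_group);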
58. 58
Take an example like this:
− void updateRelease(int *flag, int *data);
If I lock the flag, do I safely update data or not?
− The function takes pointers with no address space
− The ordering depends on the address space
− The address space depends on the type of the pointers at the call site, or even earlier!
There are ways to get the fence flags, but it is messy and care must be taken
Multiple orderings in the presence of generic pointers
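One of those messy-but-possible routes is the OpenCL C 2.0 get_fence() builtin, which recovers the right fence flags from a generic pointer at run time (a sketch; the work-group scope is an illustrative choice, and a real flag should itself be an atomic type):

    void updateRelease(int *flag, int *data) {   /* generic address-space pointers */
        *data = 1;
        /* Ask which address space the pointer actually refers to, then fence it: */
        atomic_work_item_fence(get_fence(flag),
                               memory_order_release, memory_scope_work_group);
        *flag = 1;
    }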
60. 60
Sequential consistency aims to be a simple, easy to understand, model
− Behave as if the ordering operations are simply interleaved
In the OpenCL model, sequential consistency is guaranteed only in specific cases:
− All memory_order_seq_cst operations have the scope memory_scope_all_svm_devices and all affected
memory locations are contained in system allocations or fine grain SVM buffers with atomics support
− All memory_order_seq_cst operations have the scope memory_scope_device and all affected memory
locations are not located in system allocated regions or fine-grain SVM buffers with atomics support
Consider what this means…
− You start modifying your app to have more fine-grained sharing of some structures
− Suddenly your atomics are not sequentially consistent at all!
− What about SC operations to local memory?
Limits of sequential consistency
61. 61
These are data-race-free memory models
They only guarantee ordering in the absence of races
− So we only actually order things we can observe! Order can be relaxed between atomics.
Is such a limit necessary?
First, scopes…
[Diagram: a GPU device with two cores; each core runs work-groups (WG) containing sub-groups (SG) and work-items (WI).]
62. 62
In one part of the execution, we have SC operations at work-group scope
− Let’s assume this is valid
Is such a limit necessary?
First, scopes…
[Diagram: one work-group on the first core performs work-group-scope SC operations; among themselves these are ordered SC.]
63. 63
Elsewhere we have another set of SC operations
Is such a limit necessary?
First, scopes…
[Diagram: a second work-group performs its own work-group-scope SC operations, also ordered SC among themselves.]
64. 64
Any access from one work-group across to the other is a race
− It is equivalent to a non-atomic operation
− Therefore it is invalid
Is such a limit necessary?
First, scopes…
[Diagram: each work-group's work-group-scope SC operations are ordered SC internally; an access from one work-group into the other is a race.]
65. 65
In Hower et al.’s work on Heterogeneous-Race-Free memory models this is made explicit
Sequential consistency can be maintained with scopes
Access to an invalid scope
− Is unordered
− Is not observable
− So this is still valid sequential consistency: everything observable, i.e. everything race-free, is SC
Making sequential consistency a partial order
66. 66
We can apply SC semantics here for the same reason
− Accesses to coarse-grained memory are not visible to other clients
− However – coarse buffers don’t fit cleanly in a hierarchical model
In Gaster et al. (to appear ACM TACO) we use the concept of observability as a memory
model extension
− “At any given point in time a given location will be available in a particular set of scope instances out to
some maximum instance and by some set of actors. Only memory operations that are inclusive with that
maximal scope instance will observe changes to those locations.”
Observability can migrate using API actions
− Map, unmap, and event dependencies
Extending to coarse-grained memory
67. 67
Observability
Initial state
[Diagram: an OpenCL System with CPU, DSP, and GPU devices; the memory allocation's observability is bounded at device scope on the GPU.]
68. 68
Observability
Map operation
[Diagram: after the map operation, the allocation's observability moves to device scope on the CPU.]
69. 69
Observability
Unmap
[Diagram: the unmap transfers the observability dependence back to an OpenCL device.]
70. 70
We can cleanly put scopes and coarse memory in the same memory consistency model
− It is more complicated than, say, C++
− It is practical to formalize, and hopefully easy to explain
There is no need for quirky corner cases
We merely rethink slightly
− Instead of a total order S for all SC operations we have a total order Sa for each agent a of all SC operations
observable by that agent such that all these orders are consistent with each other.
− This is a relatively small step from the DRF idea, which is effectively that non-atomic operations are not
observable, and thus do not need to be strictly ordered.
The point
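In rough notation (our paraphrase of the idea in LaTeX, not a quotation from the paper):

    \forall a \in \mathit{Agents}.\ \exists \text{ a total order } S_a \text{ on }
        \{\, op \in \mathit{SC} \mid \mathrm{observable}(op, a) \,\}
    \quad \text{such that} \quad
    \forall a, b.\ S_a \text{ and } S_b \text{ agree on }
        \{\, op \mid \mathrm{observable}(op, a) \wedge \mathrm{observable}(op, b) \,\}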
71. 71
Separate orders for local and global memory are likely to be painful for programmers
Do we need them?
− We have formalized the concept (also in the TACO paper) using multiple happens-before orders that subset
memory operations
− We also describe bridging-synchronization-order as a formal way to allow join points
− It is messy as a formalization. The harm to the developer is probably significant.
Joining address spaces
72. 72
However!
− Hardware really does have different instructions and different timing to access different memories
− Can all hardware efficiently synchronize these memory interfaces?
Joining address spaces
[Diagram: a processing core attached to local memory, an L1 cache feeding an L2 cache and DRAM, and a separate texture cache.]
73. 73
However!
− Hardware really does have different instructions and different timing to access different memories
− Can all hardware efficiently synchronize these memory interfaces?
Joining address spaces
[Diagram: as above, annotated – the local memory and texture cache are entirely different interfaces from the processing core.]
74. 74
However!
− Hardware really does have different instructions and different timing to access different memories
− Can all hardware efficiently synchronize these memory interfaces?
− We will have to see how this plays out
− Consider this a warning to take care if you try to use this aspect of OpenCL 2.0
Joining address spaces
[Diagram: as above.]
76. 76
Like mainstream languages, heterogeneous programming models are adopting firm memory
models
Without fundamental execution model guarantees the usefulness is limited
We are making progress on both these counts
Things are improving
77. 77
Thank you