The past few years have seen a sharp increase in the complexity of rendering algorithms used in modern game engines. Large portions of the rendering work are increasingly written in GPU computing languages, and decoupled from the conventional “one-to-one” pipeline stages for which shading languages were designed. Following Tim Foley’s talk from SIGGRAPH 2016’s Open Problems course on shading language directions, we explore example rendering algorithms that we want to express in a composable, reusable and performance-portable manner. We argue that a few key constraints in GPU computing languages inhibit these goals, some of which are rooted in hardware limitations. We conclude with a call to action detailing specific improvements we would like to see in GPU compute languages, as well as the underlying graphics hardware.
This talk was originally given at SIGGRAPH 2017 by Andrew Lauritzen (EA SEED) for the Open Problems in Real-Time Rendering course.
Your Game Needs Direct3D 11, So Get Started Now! by Johan Andersson
Direct3D 11 will have tessellation for smoother curves and finer details. The new compute shader will make postprocessing faster and easier. You'll need Direct3D 11 to have the best graphics, and this talk will show you how you can get started using current generation hardware.
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization (AMD Developer Central)
Presentation WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by Leo Meyerovich and Matthew Torok at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
Presentation from DICE Coder's Day (2010 November) by Andreas Fredriksson in the Frostbite team.
Goes into detail about Scope Stacks, a systems programming tool for memory layout that provides:
• Deterministic memory map behavior
• Single-cycle allocation speed
• Regular C++ object life cycle for objects that need it
This makes it very suitable for games.
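Those properties can be sketched in a few lines of C++ (a minimal illustration with hypothetical names, not Frostbite's actual implementation): a linear allocator bumps a pointer, giving near single-cycle allocation and a deterministic memory map, while the scope stack records finalizers so objects that need a regular C++ life cycle get their destructors run before the scope rewinds the allocator in one step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Minimal scope-stack sketch. A LinearAllocator hands out memory by bumping a
// pointer; a ScopeStack records finalizers for non-POD objects and rewinds the
// allocator when the scope ends. Error handling is kept to asserts.
class LinearAllocator {
public:
    LinearAllocator(void* memory, std::size_t size)
        : end_(static_cast<std::uint8_t*>(memory) + size),
          cursor_(static_cast<std::uint8_t*>(memory)) {}

    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(cursor_);
        p = (p + align - 1) & ~static_cast<std::uintptr_t>(align - 1);  // align up
        std::uint8_t* result = reinterpret_cast<std::uint8_t*>(p);
        assert(result + size <= end_ && "allocator exhausted");
        cursor_ = result + size;
        return result;
    }

    std::uint8_t* mark() const { return cursor_; }
    void rewind(std::uint8_t* mark) { cursor_ = mark; }  // "free" is one pointer move

private:
    std::uint8_t* end_;
    std::uint8_t* cursor_;
};

class ScopeStack {
public:
    explicit ScopeStack(LinearAllocator& alloc)
        : alloc_(alloc), mark_(alloc.mark()), finalizers_(nullptr) {}

    ~ScopeStack() {
        // Destructors run in reverse allocation order, then one rewind frees all.
        for (Finalizer* f = finalizers_; f; f = f->next) f->destroy(f->object);
        alloc_.rewind(mark_);
    }

    // POD data: no destructor bookkeeping needed.
    void* allocRaw(std::size_t size) { return alloc_.allocate(size); }

    // Objects that need a regular C++ life cycle get a finalizer record.
    template <typename T>
    T* newObject() {
        Finalizer* f = static_cast<Finalizer*>(alloc_.allocate(sizeof(Finalizer)));
        T* obj = new (alloc_.allocate(sizeof(T), alignof(T))) T();
        f->object = obj;
        f->destroy = [](void* p) { static_cast<T*>(p)->~T(); };
        f->next = finalizers_;
        finalizers_ = f;
        return obj;
    }

private:
    struct Finalizer {
        void* object;
        void (*destroy)(void*);
        Finalizer* next;
    };

    LinearAllocator& alloc_;
    std::uint8_t* mark_;
    Finalizer* finalizers_;
};
```

Allocation is a pointer bump and tearing down a scope is a single rewind, which is where the deterministic layout and near single-cycle cost come from.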
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 (Unity Technologies)
This session addresses how we are expanding the scope of the Burst Compiler to enable even the most demanding hand-coded engine and gameplay problems to be expressed in HPC# via direct CPU intrinsics. Andreas shares the reasoning and use cases, and discusses implementation challenges, debugging, and performance, along with comparisons to C++ code.
Speaker: Andreas Fredriksson - Unity
Watch the session on YouTube: https://youtu.be/BpwvXkoFcp8
Accelerating Real Time Applications on Heterogeneous Platforms (IJMER)
In this paper we describe novel implementations of depth estimation from stereo images using feature extraction algorithms that run on the graphics processing unit (GPU), which is suitable for real-time applications such as analyzing video in real-time vision systems. Modern graphics cards contain a large number of parallel processors and high-bandwidth memory for accelerating data computation operations. We give a general idea of how to accelerate real-time applications using heterogeneous platforms, and we propose using the added resources for more computationally involved optimization methods. This proposed approach will indirectly accelerate a database by producing better plan quality.
A description of the next-gen rendering technique called the Triangle Visibility Buffer. It can handle up to 10x-20x more geometry than deferred rendering, at much higher resolutions. It also generally aligns better with the memory access patterns of modern GPUs than deferred lighting approaches such as clustered deferred lighting.
By Kristoffer Benjaminsson, CTO, Easy.
This talk presents the telemetry system used in Battlefield Heroes and how it helps the team make technical decisions in order to provide the best service possible. We will show real life examples of how telemetry helped improve matchmaking, reduce latency for players and help find false alarms from the cheat detection system. We will also discuss how telemetry can be used in development for catching bugs and support game designers in their work.
In this AMD technology presentation from the 2014 Game Developers Conference in San Francisco (March 17-21), Bill explains some of the ways the vertex shader can be used to improve performance by taking a fast path through it rather than generating vertices with other parts of the pipeline. Check out more technical presentations at http://developer.amd.com/resources/documentation-articles/conference-presentations/
Optimize your game with the Profile Analyzer - Unite Copenhagen 2019 (Unity Technologies)
Have you ever needed to compare the difference in performance between two versions of your project? This session will show you how to use Unity's Profile Analyzer to see the impact of an asset or code change, optimization work, settings modification, or Unity version upgrade to verify enhancements.
Speaker: Lyndon Homewood - Unity
Watch the session on YouTube: https://youtu.be/0lzqdDdE9Tc
Efficient occlusion culling in dynamic scenes is a very important topic to the game and real-time graphics community in order to accelerate rendering. We present a novel algorithm inspired by recent advances in depth culling for graphics hardware, but adapted and optimized for SIMD-capable CPUs. Our algorithm has very low memory overhead and is three times faster than previous work, while culling 98% of all triangles culled by a full-resolution depth buffer approach. It supports interleaving occluder rasterization and occlusion queries without penalty, making it easy
A set of mobile game optimization best practices. This presentation extensively covers the PowerVR series of GPUs from Imagination Technologies and iOS; however, the majority of the recommendations can be applied to other GPUs and mobile operating systems.
Game engines have long been at the forefront of taking advantage of the ever-increasing parallel compute power of both CPUs and GPUs. This talk is about how parallel compute is utilized in practice on multiple platforms today in the Frostbite game engine, and how we think the parallel programming models, hardware, and software in the industry should look in the next 5 years to help us make the best games possible.
OpenGL 4.4 provides new features for accelerating scenes with many objects, which are typically found in professional visualization markets. This talk will provide details on the usage of the features and their effect on real-life models. Furthermore we will showcase how more work for rendering a scene can be off-loaded to the GPU, such as efficient occlusion culling or matrix calculations.
Video presentation here: http://on-demand.gputechconf.com/gtc/2014/video/S4379-opengl-44-scene-rendering-techniques.mp4
Presentation & discussion around low-level graphics APIs. This was a quickly made presentation that I put together for a discussion with Intel and fellow ISVs; I thought it could be worth sharing.
Scene Graphs & Component Based Game Engines by Bryan Duggan
A presentation I made at the Fermented Poly meetup in Dublin about Scene Graphs & Component Based Game Engines. Lots of examples from my own game engine BGE - where almost everything is a component. Get the code and the course notes here: https://github.com/skooter500/BGE
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat... (AMD Developer Central)
Presentation HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Matrix Factorization, by Huming Zhu at the AMD Developer Summit (APU13) November 11-13, 2013.
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead by Tristan Lorach
This presentation introduces a new NVIDIA extension called Command-list.
The purpose of this presentation is to explain the basic concepts of how to use it and to show its benefits.
The sample I used for the talk is here: https://github.com/nvpro-samples/gl_commandlist_bk3d_models
The driver to use for trying it out should be PreRelease 347.09:
http://www.nvidia.com/download/driverResults.aspx/80913/en-us
This article contains information about performance optimization of Unity3D games for Android. Different solutions are provided for both CPU and GPU. You can also find a methodology that will help you detect performance problems, analyze them, and perform the appropriate optimizations.
1. Spring 2011 – Beyond Programmable Shading
Graphics with GPU Compute APIs
Mike Houston, AMD / Stanford
Aaron Lefohn, Intel / University of Washington
2. What’s In This Talk?
• Brief review of last lecture
• Advanced usage patterns of GPU compute languages
• Rendering use cases for GPU compute languages
– Histograms (for shadows, tone mapping, etc)
– Deferred rendering
– Writing new graphics pipelines (sort of)
4. Definitions: Execution
• Task
– A logically related set of instructions executed in a single execution context
(aka shader, instance of a kernel, task)
• Concurrent execution
– Multiple tasks that may execute simultaneously
(because they are logically independent)
• Parallel execution
– Multiple tasks whose execution contexts are guaranteed to be live simultaneously
(because you want them to be for locality, synchronization, etc)
5. Synchronization
• Synchronization
– Restricting when tasks are permitted to execute
• Granularity of permitted synchronization determines at which
granularity system allows user to control scheduling
6. GPU Compute Languages Review
• “Write code from within two nested concurrent/parallel loops”
• Abstracts
– Cores, execution contexts, and SIMD ALUs
• Exposes
– Parallel execution contexts on same core
– Fast R/W on-core memory shared by the execution contexts on same core
• Synchronization
– Fine grain: between execution contexts on same core
– Very coarse: between large sets of concurrent work
– No medium-grain synchronization “between function calls” like task
systems provide
7. GPU Compute Pseudocode
void myWorkGroup()
{
parallel_for(i = 0 to NumWorkItems - 1)
{
… GPU Kernel Code … (This is where you write GPU compute code)
}
}
void main()
{
concurrent_for( i = 0 to NumWorkGroups - 1)
{
myWorkGroup();
}
sync;
}
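The nested concurrent/parallel loop structure above can be emulated sequentially on a CPU. This is a minimal sketch, not any real API: `dispatch`, `kernel`, and the parameter names are all illustrative.

```python
# CPU sketch of the GPU compute dispatch model described above.
# All names (dispatch, kernel, num_groups, group_size) are illustrative,
# not part of any real API.

def dispatch(num_groups, group_size, kernel):
    """Run `kernel` once per work-item, organized into work-groups.

    On a GPU the outer loop runs concurrently across cores and the
    inner loop runs in parallel on SIMD lanes; here both are sequential.
    """
    for group_id in range(num_groups):        # concurrent_for on a GPU
        shared = {}                            # per-group on-chip memory
        for local_id in range(group_size):     # parallel_for on a GPU
            kernel(group_id, local_id, shared)
    # implicit `sync` here: all groups have finished

# Example kernel: each work-item records its global index.
results = []
def my_kernel(group_id, local_id, shared):
    global_id = group_id * 4 + local_id
    results.append(global_id)

dispatch(num_groups=2, group_size=4, kernel=my_kernel)
print(results)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The `shared` dictionary stands in for the R/W on-core memory visible only within a work-group; on real hardware it is what makes compute shaders worth using over pixel shaders.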
8. DX CS/OCL/CUDA Execution Model
• Fundamental unit is work-item
– Single instance of “kernel” program
(i.e., “task” using the definitions in this talk)
– Each work-item executes in single SIMD lane
• Work items collected in work-groups
– Work-group scheduled on single core
– Work-items in a work-group
– Execute in parallel
– Can share R/W on-chip scratchpad memory
– Can wait for other work-items in work-group
• Users launch a grid of work-groups
– Spawn many concurrent work-groups
void f(...) {
int x = ...;
...;
...;
if(...) {
...
}
}
Figure by Tim Foley
9. When to Use GPU Compute vs. Pixel Shaders?
• Use GPU compute language if your algorithm needs on-chip memory
– Reduce bandwidth by building local data structures
• Otherwise, use pixel shader
– All mapping, decomposition, and scheduling decisions automatic
– (Easier to reach peak performance)
10. Conventional Thread Parallelism on GPUs
• Also called “persistent threads”
• “Expert” usage model for GPU compute
– Defeat abstractions over cores, execution contexts, and SIMD functional units
– Defeat system scheduler, load balancing, etc.
– Code not portable between architectures
11. Conventional Thread Parallelism on GPUs
• Execution
– Two-level parallel execution model
– Lower level: parallel execution of M identical tasks on M-wide SIMD
functional unit
– Higher level: parallel execution of N different tasks on N execution contexts
• What is abstracted?
– Nothing (other than automatic mapping to SIMD lanes)
• Where is synchronization allowed?
– Lower-level: between any task running on same SIMD functional unit
– Higher-level: between any execution context
12. Why Persistent Threads?
• Enable alternate programming models that require different
scheduling and synchronization rules than the default model
provides
• Example alternate programming models
– Task systems (esp. nested task parallelism)
– Producer-consumer rendering pipelines
– (See references at end of this slide deck for more details)
13. Sample Distribution Shadow Maps
“Sample Distribution Shadow Maps,” Lauritzen et al., SIGGRAPH 2010
(Comparison images: without vs. with Compute Shader)
14. Sample Distribution Shadow Maps
• Algorithm Overview
– Render depth buffer from eye
– Analyze depth buffer with Compute Shader:
– Build histogram of pixel positions
– Analyze histogram to optimize shadow rendering
– Render shadow maps
• Compute Shader to build histogram (from pixel data)
– Sequential and parallel code in single kernel (many fork-joins)
– Liberal use of atomics to shared (on-chip) and global (off-chip) memory
– Gather and scatter to arbitrary locations in off-chip and on-chip memory
• Compute Shader to analyze histogram
– Sequential and parallel code in single kernel (many fork-joins)
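A sequential CPU reference makes the histogram pass above easier to follow before reading the HLSL. Everything here is an illustrative stand-in: `z_to_bin`, the bin count, and the logarithmic binning are one plausible choice, not the talk's exact scheme.

```python
# CPU reference for the histogram pass sketched above: bin each pixel's
# view-space Z and track per-bin statistics. ZToBin's exact form and the
# bin count are illustrative stand-ins for the shader's versions.
import math

NUM_BINS = 64
Z_NEAR, Z_FAR = 0.1, 100.0

def z_to_bin(z):
    # Logarithmic binning between near and far planes (one common choice).
    t = math.log(z / Z_NEAR) / math.log(Z_FAR / Z_NEAR)
    return min(NUM_BINS - 1, max(0, int(t * NUM_BINS)))

def build_histogram(depths):
    # Each bin: count plus min/max of the depths that landed in it.
    bins = [{"count": 0, "min_z": float("inf"), "max_z": float("-inf")}
            for _ in range(NUM_BINS)]
    for z in depths:
        b = bins[z_to_bin(z)]
        b["count"] += 1                  # InterlockedAdd in the CS
        b["min_z"] = min(b["min_z"], z)  # InterlockedMin
        b["max_z"] = max(b["max_z"], z)  # InterlockedMax
    return bins

hist = build_histogram([0.5, 0.6, 42.0, 90.0])
print(sum(b["count"] for b in hist))  # 4
```

On the GPU the loop body runs once per pixel across SIMD lanes, which is why every bin update must be an atomic.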
15. Building Histogram in DX11 CS
[numthreads(BLOCK_DIM, BLOCK_DIM, 1)]
void ScatterHistogram(uint3 groupId : SV_GroupID,
uint3 groupThreadId : SV_GroupThreadID,
uint groupIndex : SV_GroupIndex)
{
// Initialize local histogram in parallel
// Parallelism:
// - Within threadgroup: SIMD lanes map to histogram bins
// - Between threadgroups: Each threadgroup has own histogram
localHistogram[groupIndex] = emptyBin();
GroupMemoryBarrierWithGroupSync();
. . .
16. Building Histogram in DX11 CS
// Build histogram in parallel
// Parallelism:
// - Within threadgroup: SIMD lanes map to pixels in image tile
// - Between threadgroups: Each threadgroup maps to image tile
// Read and compute surface data
uint2 globalCoords = groupId.xy * TILE_DIM + groupThreadId.xy;
SurfaceData data = ComputeSurfaceDataFromGBuffer(globalCoords);
// Bin based on view space Z
// Scatter data to the right bin in our local (on-chip) histogram
int bin = int(ZToBin(data.positionView.z));
InterlockedAdd(localHistogram[bin].count, 1U);
InterlockedMin(localHistogram[bin].bounds.minTexCoordX, data.texCoordX);
InterlockedMax(localHistogram[bin].bounds.maxTexCoordX, data.texCoordX);
//… (more atomic min/max operations for other values in histogram bin) …
GroupMemoryBarrierWithGroupSync();
. . .
17. Building Histogram in DX11 CS
// Atomically merge each threadgroup's on-chip histogram into a
// single histogram in global memory.
// Parallelism
// - Within threadgroup: SIMD lanes map to histogram elements
// - Between threadgroups: Each threadgroup writes to single global histogram
uint i = groupIndex;
if (localHistogram[i].count > 0) {
    InterlockedAdd(gHistogram[i].count, localHistogram[i].count);
    InterlockedMin(gHistogram[i].bounds.minTexCoordX, localHistogram[i].bounds.minTexCoordX);
    InterlockedMin(gHistogram[i].bounds.minTexCoordY, localHistogram[i].bounds.minTexCoordY);
    InterlockedMin(gHistogram[i].bounds.minLightSpaceZ, localHistogram[i].bounds.minLightSpaceZ);
    InterlockedMax(gHistogram[i].bounds.maxTexCoordX, localHistogram[i].bounds.maxTexCoordX);
    InterlockedMax(gHistogram[i].bounds.maxTexCoordY, localHistogram[i].bounds.maxTexCoordY);
    InterlockedMax(gHistogram[i].bounds.maxLightSpaceZ, localHistogram[i].bounds.maxLightSpaceZ);
}
}
18. Optimization: Moving Farther Away from Basic Data-Parallelism
• Problem: 1:1 mapping between workgroups and image tiles
– Flushes local memory to global memory more times than necessary
– Would like larger workgroups but limited to 1024 workitems per group
• Solution
– Use the largest workgroups possible (1024 workitems)
– Launch fewer workgroups. Find the sweet spot that fills all threads on all
cores, maximizing latency hiding while minimizing writes to global memory
– Loop over multiple image tiles within a single compute shader
• Take-away
– “Invoke just enough parallel work to fill the SIMD lanes, threads, and cores of
the machine to achieve sufficient latency hiding”
– The abstraction is broken because this optimization exposes the number of
hardware resources
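The "just enough work" sizing rule above is simple arithmetic once the machine parameters are exposed, which is exactly why it breaks the abstraction. The numbers below are hypothetical; real values would come from the vendor or a tuning pass.

```python
# Sketch of the "just enough parallel work" sizing rule above.
# The machine parameters are hypothetical; real values come from the
# hardware vendor or from empirical tuning.
import math

num_cores = 16          # compute units / SMs on this (made-up) GPU
groups_per_core = 2     # resident workgroups needed to hide latency
num_tiles = 1000        # image tiles to process

# Launch only enough workgroups to fill the machine...
num_groups = num_cores * groups_per_core
# ...and loop over several tiles inside each workgroup, so the local
# histogram is flushed to global memory once per group, not per tile.
tiles_per_group = math.ceil(num_tiles / num_groups)

print(num_groups, tiles_per_group)  # 32 32
```

The cost of this optimization is portability: `num_cores` and `groups_per_core` differ per architecture, so the launch size must be re-derived on every target.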
19. Building Histogram in DX11 CS
// Build histogram in parallel
// Parallelism:
// - Within threadgroup: SIMD lanes map to pixels in image tile
// - Between threadgroups: Each threadgroup maps to image tile
uint2 tileStart = groupId.xy * TILE_DIM + groupThreadId.xy;
for (uint tileY = 0; tileY < TILE_DIM; tileY += BLOCK_DIM) {
    for (uint tileX = 0; tileX < TILE_DIM; tileX += BLOCK_DIM) {
        // Read and compute surface data
        uint2 globalCoords = tileStart + uint2(tileX, tileY);
        SurfaceData data = ComputeSurfaceDataFromGBuffer(globalCoords);
        // Bin based on view space Z
        // Scatter data to the right bin in our local (on-chip) histogram
        int bin = int(ZToBin(data.positionView.z));
        InterlockedAdd(localHistogram[bin].count, 1U);
        InterlockedMin(localHistogram[bin].bounds.minTexCoordX, data.texCoordX);
        // … (more atomic min/max ops for other values in histogram bin) …
    }
}
GroupMemoryBarrierWithGroupSync();
. . .
20. SW Pipeline 1: Particle Rasterizer
• Mock-up particle rendering pipeline with render-target-read
– Written by 2 people over the course of 1 week
– Runs ~2x slower than D3D rendering pipeline (but has glass jaws)
(Comparison images: without vs. with volumetric shadows)
21. Tiled Particle Rasterizer in DX11 CS
[numthreads(RAST_THREADS_X, RAST_THREADS_Y, 1)]
void RasterizeParticleCS(uint3 groupId : SV_GroupID,
uint3 groupThreadId : SV_GroupThreadID,
uint groupIndex : SV_GroupIndex)
{
uint i = 0; // For all particles..
while (i < mParticleCount) {
GroupMemoryBarrierWithGroupSync();
const uint particlePerIter = min(mParticleCount - i, RAST_THREADS_X * RAST_THREADS_Y);
// Vertex shader and primitive assembly
// Parallelism: SIMD lanes map over particles.
if (groupIndex < particlePerIter) {
const uint particleIndex = i + groupIndex;
// … read vertex data for this particle from memory,
// construct screen-facing quad, test if particle intersects tile,
// use atomics to on-chip memory to append to list of particles
}
GroupMemoryBarrierWithGroupSync();
. . .
22. Tiled Particle Rasterizer in DX11 CS
// Find all particles that intersect this pixel
// Parallelism: SIMD lanes map over pixels in image tile
for (uint n = 0; n < gVisibleParticlePerIter; n++) {
if (ParticleIntersectsPixel(gParticles[n], fragmentPos)) {
float dx, dy;
ComputeInterpolants(gParticles[n], fragmentPos, dx, dy);
float3 viewPos = BilinearInterp3(gParticles[n].viewPos, dx, dy);
float3 entry, exit, t;
if (IntersectParticle(viewPos, gParticles[n], entry, exit, t)) {
// Run pixel shader on this particle
// Read-modify-write framebuffer held in global off-chip memory
}
}
}
i += particlePerIter;
}
23. SW Pipeline 1: Particle Rasterizer
• Usage
– Atomics to on-chip memory
– Gather/scatter to on-chip and off-chip memory
– Latency hiding of off-chip memory accesses
• Lesson learned
– The programmer productivity of these programming models is impressive
– This pipeline is statically scheduled (from a SW perspective) but underlying
hardware scheduler is dynamically scheduling threadgroups
– Dynamic SW scheduling is needed to achieve more stable and higher
performance
24. Software Pipeline 2: OptiX
• NVIDIA interactive ray-tracing library
– Started as research project, product announced 2009
• Custom rendering pipeline
– Implemented entirely with CUDA using “persistent threads” usage pattern
– Users define geometry, materials, lights in C-for-CUDA
– PTX intermediate layer used to synthesize optimized code
• Other custom pipelines in the research world
– Stochastic rasterization, particles, conservative rasterization, …
25. Deferred Rendering
(Slides by Andrew Lauritzen)
(Possibly the most important use of ComputeShader)
27. Forward Shading
• Do everything we need to shade a pixel
– for each light
– Shadow attenuation (sampling shadow maps)
– Distance attenuation
– Evaluate lighting and accumulate
• Multi-pass requires resubmitting scene geometry
– Not a scalable solution
28. Forward Shading Problems
• Ineffective light culling
– Object space at best
– Trade-off with shader permutations/batching
• Memory footprint of all inputs
– Everything must be resident at the same time (!)
• Shading small triangles is inefficient
– Covered earlier in this course: [Fatahalian 2010]
29. Conventional Deferred Shading
• Store lighting inputs in memory (G-buffer)
– for each light
– Use rasterizer to scatter light volume and cull
– Read lighting inputs from G-buffer
– Compute lighting
– Accumulate lighting with additive blending
• Reorders computation to extract coherence
30. Modern Implementation
• Cull with screen-aligned quads
– Cover light extents with axis-aligned bounding box
– Full light meshes (spheres, cones) are generally overkill
– Can use oriented bounding box for narrow spot lights
– Use conservative single-direction depth test
– Two-pass stencil is more expensive than it is worth
– Depth bounds test on some hardware, but not batch-friendly
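The screen-aligned quad culling above can be sketched as follows: bound the light's view-space sphere with an axis-aligned box and project the box's corners. This is deliberately cruder than a tight sphere projection but never misses pixels. All names and parameters here are illustrative, not from the talk's demo code.

```python
# Conservative sketch of screen-aligned quad culling for a point light.
# Bound the sphere with an axis-aligned box, project the box corners.

def project_interval(lo, hi, z0, z1, focal):
    # x/z is monotone in x and in z (for z > 0), so the projected
    # extremes of the box [lo,hi] x [z0,z1] occur at its corners.
    corners = [focal * lo / z0, focal * lo / z1,
               focal * hi / z0, focal * hi / z1]
    return min(corners), max(corners)

def light_quad(cx, cy, cz, radius, focal, near=0.1):
    # Clamp the box's depth range in front of the near plane.
    z0 = max(near, cz - radius)
    z1 = max(near, cz + radius)
    x_min, x_max = project_interval(cx - radius, cx + radius, z0, z1, focal)
    y_min, y_max = project_interval(cy - radius, cy + radius, z0, z1, focal)
    # z0 doubles as the conservative single-direction depth bound.
    return (x_min, y_min, x_max, y_max), z0

quad, z_bound = light_quad(cx=0.0, cy=0.0, cz=10.0, radius=2.0, focal=1.0)
print(quad, z_bound)  # (-0.25, -0.25, 0.25, 0.25) 8.0
```

The returned `z_bound` is what feeds the conservative single-direction depth test mentioned above: any pixel nearer than the sphere's closest point cannot be lit by it.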
31. Lit Scene (256 Point Lights)
32. Deferred Shading Problems
• Bandwidth overhead when lights overlap
– for each light
– Use rasterizer to scatter light volume and cull
– Read lighting inputs from G-buffer (overhead)
– Compute lighting
– Accumulate lighting with additive blending (overhead)
• Not doing enough work to amortize overhead
33. Improving Deferred Shading
• Reduce G-buffer overhead
– Access fewer things inside the light loop
– Deferred lighting / light pre-pass
• Amortize overhead
– Group overlapping lights and process them together
– Tile-based deferred shading
34. Tile-Based Deferred Rendering
parallel_for over lights:
    atomically append lights that affect this tile to a shared list
barrier
parallel_for over pixels in tile:
    evaluate all selected lights at each pixel
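A sequential CPU reference of that per-tile loop, with geometry simplified to 2D screen positions; the names, tile size, and light format are illustrative, not from any real engine.

```python
# Sequential CPU reference for tile-based deferred shading: per tile,
# cull the light list, then shade every pixel against surviving lights.
# Lights are (x, y, radius, intensity) in screen space for simplicity.

TILE = 16

def lights_for_tile(tile_x, tile_y, lights):
    # Stand-in for the tile-frustum test: keep lights whose circle of
    # influence overlaps the tile's screen rectangle.
    x0, y0 = tile_x * TILE, tile_y * TILE
    x1, y1 = x0 + TILE, y0 + TILE
    kept = []
    for lx, ly, radius, intensity in lights:
        # Distance from the light center to the tile rectangle.
        dx = max(x0 - lx, 0, lx - x1)
        dy = max(y0 - ly, 0, ly - y1)
        if dx * dx + dy * dy <= radius * radius:
            kept.append((lx, ly, radius, intensity))
    return kept

def shade_tile(tile_x, tile_y, lights):
    culled = lights_for_tile(tile_x, tile_y, lights)  # the shared list
    out = {}
    for py in range(tile_y * TILE, tile_y * TILE + TILE):
        for px in range(tile_x * TILE, tile_x * TILE + TILE):
            # Evaluate only the culled lights at each pixel.
            total = 0.0
            for lx, ly, radius, intensity in culled:
                d2 = (px - lx) ** 2 + (py - ly) ** 2
                if d2 <= radius * radius:
                    total += intensity
            out[(px, py)] = total
    return out

lights = [(8, 8, 4, 1.0), (200, 200, 4, 1.0)]  # 2nd light misses tile 0,0
img = shade_tile(0, 0, lights)
print(img[(8, 8)], len(lights_for_tile(0, 0, lights)))  # 1.0 1
```

The bandwidth win comes from the structure, not the math: the G-buffer (here, just pixel positions) is read once per tile, while all overlapping lights are accumulated in registers instead of through additive blending to memory.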
35. Tile-Based Deferred Shading
• Goal: amortize overhead
– Large reduction in bandwidth requirements
• Use screen tiles to group lights
– Use tight tile frusta to cull non-intersecting lights
– Reduces number of lights to consider
– Read G-buffer once and evaluate all relevant lights
– Reduces bandwidth of overlapping lights
• See [Andersson 2009] for more details
36. Lit Scene (1024 Point Lights)
41. Anti-aliasing
• Multi-sampling with deferred rendering requires some work
– Regular G-buffer couples visibility and shading
• Handle multi-frequency shading in user space
– Store G-buffer at sample frequency
– Only apply per-sample shading where necessary
– Offers additional flexibility over forward rendering
42. Identifying Edges
• Forward MSAA causes redundant work
– It applies to all triangle edges, even for continuous, tessellated surfaces
• Want to find surface discontinuities
– Compare sample depths to depth derivatives
– Compare (shading) normal deviation over samples
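The discontinuity test above can be sketched for one pixel: flag per-sample shading when a sample's depth deviates from the plane predicted by the depth derivatives, or when the shading normals diverge. The thresholds and 4-sample layout are illustrative choices, not the talk's tuned values.

```python
# Per-pixel edge test: depth-plane deviation or normal divergence flags
# the pixel for per-sample shading. Thresholds are illustrative.

def needs_per_sample(sample_depths, sample_offsets, ddx, ddy,
                     sample_normals, depth_eps=0.01, normal_min_dot=0.9):
    z0 = sample_depths[0]
    for z, (ox, oy) in zip(sample_depths, sample_offsets):
        # Depth a continuous surface would have at this sample position.
        predicted = z0 + ox * ddx + oy * ddy
        if abs(z - predicted) > depth_eps:
            return True                    # depth discontinuity
    n0 = sample_normals[0]
    for n in sample_normals[1:]:
        dot = sum(a * b for a, b in zip(n0, n))
        if dot < normal_min_dot:
            return True                    # shading normal deviates
    return False

offsets = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
flat = needs_per_sample([1.0, 1.05, 1.05, 1.10], offsets, 0.1, 0.1,
                        [(0, 0, 1)] * 4)
edge = needs_per_sample([1.0, 1.05, 1.05, 5.0], offsets, 0.1, 0.1,
                        [(0, 0, 1)] * 4)
print(flat, edge)  # False True
```

The derivative comparison is what lets tessellated but continuous surfaces pass: their samples lie on the predicted plane, so only true silhouettes and creases pay for per-sample shading.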
44. Deferred Rendering Conclusions
• Deferred shading is a useful rendering tool
– Decouples shading from visibility
– Allows efficient user-space scheduling and culling
• Tile-based methods win going forward
– ComputeShader/OpenCL/CUDA implementations save a lot of bandwidth
– Fastest and most flexible
– Enable efficient MSAA
45. Summary for GPU Compute Languages
• GPU compute languages
– “Easy” way to exploit compute capability of GPUs (easier than 3D APIs)
– The performance benefit over pixel shaders comes when using on-core
R/W memory to save off-chip bandwidth
– Increasingly used as “just another tool in the real-time graphics
programmer’s toolkit”
– Deferred rendering
– Shadows
– Post-processing
– …
– The current languages have a lot of rough edges and limitations.
47. Future Work
• Hierarchical light culling
– Straightforward but would need lots of small lights
• Improve MSAA memory usage
– Irregular/compressed sample storage?
– Revisit binning pipelines?
– Sacrifice higher resolutions for better AA?
48. Acknowledgements
• Microsoft and Crytek for the scene assets
• Johan Andersson from DICE
• Craig Kolb, Matt Pharr, and others in the Advanced Rendering
Technology team at Intel
• Nico Galoppo, Anupreet Kalra and Mike Burrows from Intel
49. References
• [Andersson 2009] Johan Andersson, “Parallel Graphics in Frostbite - Current & Future”, http://s09.idav.ucdavis.edu/
• [Fatahalian 2010] Kayvon Fatahalian, “Evolving the Direct3D Pipeline for Real-Time Micropolygon Rendering”, http://bps10.idav.ucdavis.edu/
• [Hoffman 2009] Naty Hoffman, “Deferred Lighting Approaches”, http://www.realtimerendering.com/blog/deferred-lighting-approaches/
• [Stone 2009] Adrian Stone, “Deferred Shading Shines. Deferred Lighting? Not So Much.”, http://gameangst.com/?p=141
50. Questions?
• Full source and demo available at:
– http://visual-computing.intel-research.net/art/publications/deferred_rendering/
52. Deferred Lighting / Light Pre-Pass
• Goal: reduce G-buffer overhead
• Split diffuse and specular terms
– Common concession is monochromatic specular
• Factor out constant terms from summation
– Albedo, specular amount, etc.
• Sum inner terms over all lights
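One common form of that factoring, written for a Blinn-Phong-style model; the symbols here are generic stand-ins, not the talk's exact shading function:

```latex
% Per-surface constants (diffuse albedo k_d, specular amount k_s) move
% outside the per-light sum, so only the two inner sums are accumulated
% in the lighting pass and combined in the resolve pass:
L \;=\; k_d \sum_i c_i \,(n \cdot l_i)\,\mathrm{att}_i
  \;+\; k_s \sum_i c_i \,(n \cdot h_i)^{m}\,\mathrm{att}_i
```

Here \(c_i\) is the light color, \(l_i\) and \(h_i\) the light and half vectors, and \(\mathrm{att}_i\) the combined shadow and distance attenuation; the monochromatic-specular concession means storing the second sum as a scalar rather than a color.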
53. Deferred Lighting / Light Pre-Pass
• Resolve pass combines factored components
– Still best to store all terms in G-buffer up front
– Better SIMD efficiency
• Incremental improvement for some hardware
– Relies on pre-factoring lighting functions
– Ability to vary resolve pass is not particularly useful
• See [Hoffman 2009] and [Stone 2009]
54. MSAA with Quad-Based Methods
• Mark pixels for per-sample shading
– Stencil still faster than branching on most hardware
– Probably gets scheduled better
• Shade in two passes: per-pixel and per-sample
– Unfortunately, duplicates culling work
– Scheduling is still a problem
56. MSAA with Tile-Based Methods
• Handle per-pixel and per-sample in one pass
– Avoids duplicate culling work
– Can use branching, but incurs scheduling problems
– Instead, reschedule per-sample pixels
– Shade sample 0 for the whole tile
– Pack a list of pixels that require per-sample shading
– Redistribute threads to process additional samples
– Scatter per-sample shaded results
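The rescheduling steps above can be sketched as a packing pass: after shading sample 0 for the whole tile, collect the flagged pixels into a list and enumerate the remaining (pixel, sample) pairs so any thread can pick up any entry. The tile size, sample count, and edge mask here are illustrative.

```python
# Sketch of per-sample rescheduling in tile-based deferred MSAA.

TILE = 4          # tiny tile for the example
MSAA = 4          # samples per pixel

def reschedule(edge_mask):
    # Phase 1 equivalent: every pixel shades sample 0 (not shown here).
    # Pack pixels flagged for per-sample shading - the shared-memory
    # append in the compute shader version.
    packed = [(x, y) for y in range(TILE) for x in range(TILE)
              if edge_mask[y][x]]
    # Phase 2: enumerate the remaining (pixel, sample) work so threads
    # can be redistributed over it - this is the rescheduling step.
    work = [(px, py, s) for (px, py) in packed for s in range(1, MSAA)]
    return packed, work

mask = [[False] * TILE for _ in range(TILE)]
mask[1][2] = mask[3][0] = True          # two edge pixels in the tile
packed, work = reschedule(mask)
print(len(packed), len(work))  # 2 6
```

The point of the packed list is SIMD efficiency: instead of a few divergent lanes branching into per-sample loops, the flagged work is dense, so every lane in the follow-up pass does useful shading.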