This document discusses a lecture on GPU architecture given by Mark Kilgard at the University of Texas on March 6, 2012. The lecture covers the architecture of graphics processing units and how they have evolved over the past six years. It also includes an in-class quiz, information about homework and projects, and the professor's office hours.
OpenGL 4.4 provides new features for accelerating scenes with many objects, which are typically found in professional visualization markets. This talk will provide details on the usage of the features and their effect on real-life models. Furthermore we will showcase how more work for rendering a scene can be off-loaded to the GPU, such as efficient occlusion culling or matrix calculations.
Video presentation here: http://on-demand.gputechconf.com/gtc/2014/video/S4379-opengl-44-scene-rendering-techniques.mp4
Taking Killzone Shadow Fall Image Quality Into The Next Generation - Guerrilla
This talk focuses on the technical side of Killzone Shadow Fall, the platform exclusive launch title for PlayStation 4.
We present the details of several new techniques that were developed in the quest for next-generation image quality, using key locations from the game as examples. We discuss interesting aspects of the new content pipeline, the next-gen lighting engine, the usage of indirect lighting, and various shadow rendering optimizations. We also describe the details of volumetric lighting, the real-time reflections system, and the new anti-aliasing solution, and include some details about the image-quality-driven streaming system. A common and very important theme of the talk is temporal coherency and how it was utilized to reduce aliasing and improve rendering quality and image stability above the baseline 1080p resolution seen in other games.
Game engines have long been at the forefront of exploiting the ever-increasing parallel compute power of both CPUs and GPUs. This talk covers how parallel compute is used in practice on multiple platforms today in the Frostbite game engine, and how we think the industry's parallel programming models, hardware, and software should look in the next 5 years to help us make the best games possible.
Introduction to the DAOS Scale-out object store (HLRS Workshop, April 2017) - Johann Lombardi
DAOS is an open-source storage stack designed from the ground up to address many of the problems that arise when scaling out storage. DAOS takes advantage of next-generation non-volatile memory technologies while presenting a rich and scalable storage interface providing features such as transactional non-blocking list I/O, data resiliency on top of commodity hardware, fine-grained data control, and storage tiering to optimize performance and cost. Check out https://github.com/daos-stack for more information.
Embedded Recipes 2019 - PipeWire: a new foundation for embedded multimedia - Anne Nicolas
PipeWire is an open source project that aims to greatly improve audio and video handling under Linux. Utilising a fresh design, it bridges use cases that have previously been addressed by different tools, or not addressed at all, providing the ground for building complex yet secure and efficient multimedia systems.
In this talk, Julien is going to present the PipeWire project and the concepts that make up its design. In addition, he is going to give an update on the current and future work going on around PipeWire, both upstream and in Automotive Grade Linux, an early adopter that Julien is actively working on.
Julian Bouzas
Accelerating Virtual Machine Access with the Storage Performance Development ... - Michelle Holley
Abstract: Although new non-volatile media inherently offers very low latency, remote access using protocols such as NVMe-oF and presenting the data to VMs via virtualized interfaces such as virtio adds considerable software overhead. One way to reduce the overhead is to use the Storage Performance Development Kit (SPDK), an open-source software project that provides building blocks for scalable and efficient storage applications with breakthrough performance. Comparing the software paths for virtualizing block storage I/O illustrates the advantages of the SPDK-based approach. Empirical data shows that using SPDK can improve CPU efficiency by up to 10x and reduce latency by up to 50% over existing methods. Future enhancements for SPDK will make its advantages even greater.
Speaker Bio: Anu Rao is a product line manager for storage software in the Data Center Group. She helps customers ease into and adopt open-source storage software like the Storage Performance Development Kit (SPDK) and the Intelligent Storage Acceleration Library (ISA-L).
Checkerboard Rendering in Dark Souls: Remastered by QLOC - QLOC
This is a talk on checkerboard rendering that Markus & Andreas held at Digital Dragons 2019.
In it, they quickly go through the history of checkerboard rendering before taking a deep dive into how it works and how it is implemented in Dark Souls: Remastered. Lastly, they present the quality and performance improvements they achieved with it and their conclusions.
PS: The PDF file includes useful in-depth notes from both authors.
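The core reconstruction idea can be sketched in a few lines. This is a toy illustration under simplifying assumptions (no motion vectors, occlusion handling, or depth-based rejection, which a shipping implementation needs); it is not QLOC's actual code:

```python
# Toy sketch of checkerboard reconstruction: each frame shades only half
# the pixels in a checkerboard pattern, and the holes are filled from the
# previous frame's result.

def checkerboard_merge(prev, curr, frame_parity, width, height):
    """prev and curr are row-major lists of shaded values; curr holds
    valid values only on the checkerboard half given by frame_parity."""
    out = []
    for y in range(height):
        for x in range(width):
            i = y * width + x
            if (x + y) % 2 == frame_parity:
                out.append(curr[i])   # freshly shaded this frame
            else:
                out.append(prev[i])   # reused from the previous frame
    return out

# 2x2 frame: zeros left over from last frame, ones shaded this frame
print(checkerboard_merge([0] * 4, [1] * 4, 0, 2, 2))   # [1, 0, 0, 1]
```

The parity flips every frame, so over two frames every pixel is shaded once; the hard part in practice is deciding when a reused pixel is still valid.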
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead - Tristan Lorach
This presentation introduces a new NVIDIA extension called Command-List.
It explains the basic concepts of how to use it and shows its benefits.
The sample I used for the talk is here: https://github.com/nvpro-samples/gl_commandlist_bk3d_models
To try it, use the PreRelease 347.09 driver:
http://www.nvidia.com/download/driverResults.aspx/80913/en-us
Epic Games Japan held a meeting named "Lightmass Deep Dive" on July 30, 2016.
Osamu Satio of Square Enix Osaka gave a presentation about their Lightmass Operation for Large Console Games. EGJ translated the slides into English and published them.
The slides contain some movies, so we recommend downloading them.
Tales from the Optimization Trenches - Unite Copenhagen 2019 - Unity Technologies
In this talk, you'll learn about the tools and techniques that Unity's Consulting and Development team uses to identify and fix performance issues. The team travels the world visiting customers and conducting Project Reviews, in-depth engagements to locate and resolve performance bottlenecks. This session is designed to help you apply their knowledge to your Unity projects, so you'll see examples of real-life performance problems, their solutions, and receive up-to-date best practice advice.
Speaker: Ignacio Liverotti – Unity
Watch the session on YouTube: https://youtu.be/GuODu4-cXXQ
SIGGRAPH 2016 - The Devil is in the Details: idTech 666 - Tiago Sousa
A behind-the-scenes look at the latest renderer technology powering the critically acclaimed DOOM. The lecture covers how the technology was designed to balance good visual quality against performance. Numerous topics are covered, among them details of the lighting solution, techniques for decoupling cost and frequency, and GCN-specific approaches.
Optimizing Servers for High-Throughput and Low-Latency at Dropbox - ScyllaDB
I'm going to discuss the efficiency/performance optimizations of different layers of the system. Starting from the lowest levels like hardware and drivers: these tunings can be applied to pretty much any high-load server. Then we’ll move to Linux kernel and its TCP/IP stack: these are the knobs you want to try on any of your TCP-heavy boxes. Finally, we’ll discuss library and application-level tunings, which are mostly applicable to HTTP servers in general and nginx/envoy specifically.
For each potential area of optimization I’ll try to give some background on latency/throughput tradeoffs (if any), monitoring guidelines, and, finally, suggest tunings for different workloads.
Also, I'll cover more theoretical approaches to performance analysis and the newly developed tooling like `bpftrace` and new `perf` features.
SIGGRAPH 2016 - Vulkan and NVIDIA: the essentials - Tristan Lorach
This presentation introduces Vulkan's components: what you must know to start using this new API, and what you must know when using it on NVIDIA hardware.
Optimizing the Graphics Pipeline with Compute, GDC 2016 - Graham Wihlidal
With further advancement in the current console cycle, new tricks are being learned to squeeze the maximum performance out of the hardware. This talk will present how the compute power of the console and PC GPUs can be used to improve the triangle throughput beyond the limits of the fixed function hardware. The discussed method shows a way to perform efficient "just-in-time" optimization of geometry, and opens the way for per-primitive filtering kernels and procedural geometry processing.
Takeaway:
Attendees will learn how to preprocess geometry on-the-fly per frame to improve rendering performance and efficiency.
Intended Audience:
This presentation is targeting seasoned graphics developers. Experience with DirectX 12 and GCN is recommended, but not required.
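One concrete instance of the per-primitive filtering kernels mentioned above is back-face and degenerate-triangle culling in a compute pre-pass, so the fixed-function rasterizer never receives triangles that cannot contribute. Here is a hedged CPU sketch of the test such a kernel applies (illustrative only, not the talk's actual GCN implementation):

```python
# Back-face and zero-area culling via the signed area of the
# screen-space triangle.

def is_backfacing(p0, p1, p2):
    """2D screen-space vertices; non-positive signed area means the
    triangle is back-facing (or degenerate) under a counter-clockwise
    front-face convention."""
    area2 = (p1[0] - p0[0]) * (p2[1] - p0[1]) - \
            (p2[0] - p0[0]) * (p1[1] - p0[1])
    return area2 <= 0   # zero area also catches degenerate triangles

tris = [((0, 0), (1, 0), (0, 1)),   # counter-clockwise: kept
        ((0, 0), (0, 1), (1, 0))]   # clockwise: filtered out
visible = [t for t in tris if not is_backfacing(*t)]
# visible == [((0, 0), (1, 0), (0, 1))]
```

A real compute pass runs this (plus small-triangle and frustum tests) over index buffers in parallel and compacts the survivors before drawing.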
There are many systems that handle heavy UDP transactions, such as DNS and RADIUS servers. Nowadays 10G Ethernet NICs are widely deployed, and even 40G and 100G NICs are available. This makes it difficult for a single server to achieve enough performance to consume the link bandwidth with short-packet transactions. Since Linux is by default not tuned for dedicated UDP servers, we are investigating ways to boost such UDP transaction performance.
This talk will show how we analyzed the bottlenecks and share the tips we found to improve performance. We also discuss the challenges of improving it even further.
This presentation was given at LinuxCon Japan 2016 by Toshiaki Makita
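For intuition on why short-packet workloads are hard, a back-of-the-envelope calculation (using standard Ethernet framing constants, not figures from the talk) gives the packet rate needed to saturate a 10G link with minimum-size frames:

```python
# Back-of-the-envelope packet-rate math using standard Ethernet framing
# constants.

LINK_BPS = 10_000_000_000   # 10 Gbit/s
MIN_FRAME = 64              # minimum Ethernet frame size, bytes
OVERHEAD = 8 + 12           # preamble + SFD, plus inter-frame gap, bytes

bits_per_packet = (MIN_FRAME + OVERHEAD) * 8   # 672 bits on the wire
pps = LINK_BPS / bits_per_packet
print(f"{pps / 1e6:.2f} Mpps")   # 14.88 Mpps per direction
```

At roughly 14.88 million packets per second, a single core would have only about 67 ns of budget per packet, which is why default per-packet kernel costs dominate.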
Dominant Resource Fairness: Fair Allocation of Multiple Resource Types - anet18
The DRF algorithm provides fair resource allocation in a system containing different resource types. Dominant Resource Fairness (DRF) is a generalization of max-min fairness to multiple resource types. The algorithm is used for resource allocation in Hadoop clusters.
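A minimal sketch of DRF's progressive-filling allocation, assuming the whole-task demand model from the original paper (resource names and demand vectors below are invented for illustration):

```python
# Illustrative sketch of Dominant Resource Fairness (DRF): repeatedly
# launch one task for the user with the smallest dominant share.

def drf_allocate(capacity, demands, rounds=100000):
    users = list(demands)
    used = {r: 0.0 for r in capacity}
    tasks = {u: 0 for u in users}

    def dominant_share(u):
        # Share of the user's most-demanded ("dominant") resource.
        return max(tasks[u] * demands[u][r] / capacity[r] for r in capacity)

    for _ in range(rounds):
        # Users whose next task still fits in the remaining capacity.
        fits = [u for u in users
                if all(used[r] + demands[u][r] <= capacity[r] for r in capacity)]
        if not fits:
            break
        u = min(fits, key=dominant_share)   # lowest dominant share goes next
        tasks[u] += 1
        for r in capacity:
            used[r] += demands[u][r]
    return tasks

# Example from the DRF paper: 9 CPUs, 18 GB memory; user A's tasks need
# <1 CPU, 4 GB>, user B's need <3 CPUs, 1 GB>.
tasks = drf_allocate({"cpu": 9, "mem": 18},
                     {"A": {"cpu": 1, "mem": 4},
                      "B": {"cpu": 3, "mem": 1}})
# tasks == {"A": 3, "B": 2}
```

Both users end with a dominant share of 2/3: A is memory-dominant (12 of 18 GB) and B is CPU-dominant (6 of 9 CPUs), which is exactly the equalization DRF aims for.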
Creating a game using C++, OpenGL and Qt - guestd5d4ce
A short presentation I did at work showing a game I am making in my spare time. Most of this presentation is about the tools and techniques. The game itself is located on SourceForge and is still under development.
Regards,
Jostein Topland
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W... - AMD Developer Central
Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13) November 11-13, 2013.
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F... - Storti Mario
In this article we compare the results obtained with an implementation of the Finite Volume method for structured meshes on GPGPUs with experimental results, and also with a Finite Element code using a boundary-fitted strategy. The example is a fully submerged spherical buoy immersed in a cubic water recipient. The recipient undergoes a harmonic linear motion imposed with a shake table. The experiment is recorded with a high-speed camera, and the displacement of the buoy is obtained from the video with a MoCap (Motion Capture) algorithm. The amplitude and phase of the resulting motion allow the added mass and drag of the sphere to be determined indirectly.
NVIDIA CEO Jen-Hsun Huang introduces NVLink and shares a roadmap of the GPU. Primary topics also include an introduction of the GeForce GTX Titan Z, CUDA for machine learning, and Iray VCA.
Video replay: http://nvidia.fullviewmedia.com/siggraph2012/ondemand/SS104.html
Date: Wednesday, August 8, 2012
Time: 11:50 AM - 12:50 PM
Location: SIGGRAPH 2012, Los Angeles
Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForce GPUs. Learn about the new features in OpenGL 4.3, particularly Compute Shaders. Other topics include bindless graphics; Linux improvements; and how to best use the modern OpenGL graphics pipeline. Learn how your application can benefit from NVIDIA's leadership driving OpenGL as a cross-platform, open industry standard.
Get OpenGL 4.3 beta drivers for NVIDIA GPUs from http://www.nvidia.com/content/devzone/opengl-driver-4.3.html
SIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and Beyond - Mark Kilgard
Location: Conference Hall K, Singapore EXPO
Date: Thursday, November 29, 2012
Time: 11:00 AM - 11:50 AM
Presenter: Mark Kilgard (Principal Software Engineer, NVIDIA, Austin, Texas)
Abstract: Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForce GPUs. Learn about the new features in OpenGL 4.3, particularly Compute Shaders. Other topics include bindless graphics; Linux improvements; and how to best use the modern OpenGL graphics pipeline. Learn how your application can benefit from NVIDIA's leadership driving OpenGL as a cross-platform, open industry standard.
Topic Areas: Computer Graphics; Development Tools & Libraries; Visualization; Image and Video Processing
Level: Intermediate
Presented September 30, 2009 in San Jose, California at GPU Technology Conference.
Describes the new features of OpenGL 3.2 and NVIDIA's extensions beyond 3.2 such as bindless graphics, direct state access, separate shader objects, copy image, texture barrier, and Cg 2.2.
Presented as a pre-conference tutorial at the GPU Technology Conference in San Jose on September 20, 2010.
Learn about NVIDIA's OpenGL 4.1 functionality available now on Fermi-based GPUs.
The next generation of GPU APIs for Game Engines - Pooya Eimandar
A demonstration of the new pipeline of GPU APIs for developing a real-time game engine.
Developing for DirectX 12, Vulkan, or Metal requires a redesign of the game engine. Developers can achieve key benefits like reduced power consumption, optimized CPU and GPU usage, and multi-threading across multiple GPU devices.
NVIDIA engineers will talk about the latest OpenGL API features for programmable GPUs. This session will cover the “hows, whys, and whens” of programming the fourth generation shader hardware. It will match detailed information on the API with practical examples. Topics covered will include geometry shaders, transformation feedback, instancing, and advanced texture formats.
Improving Shadows and Reflections via the Stencil Buffer - Mark Kilgard
This 1999 tutorial explains how to use the stencil buffer to achieve realistic shadows and reflections. This tutorial was included in the "Advanced OpenGL Development" course presented at the 1999 Computer Game Developer Conference (now GDC).
This tutorial predates subsequent work by Cass Everitt and me to develop a truly robust Z-fail stencil shadow volume algorithm.
Your Game Needs Direct3D 11, So Get Started Now! - Johan Andersson
Direct3D 11 will have tessellation for smoother curves and finer details. The new compute shader will make postprocessing faster and easier. You'll need Direct3D 11 to have the best graphics, and this talk will show you how you can get started using current generation hardware.
D11: a high-performance, protocol-optional, transport-optional, window system... - Mark Kilgard
Consider the dual pressures toward a more tightly integrated workstation window system: 1) the need to efficiently handle high bandwidth services such as video, audio, and three-dimensional graphics; and 2) the desire to achieve the under-realized potential for local window system performance in X11.
This paper proposes a new window system architecture called D11 that seeks higher performance while preserving compatibility with the industry-standard X11 window system. D11 reinvents the X11 client/server architecture using a new operating system facility similar in concept to the Unix kernel's traditional implementation but designed for user-level execution. This new architecture allows local D11 programs to execute within the D11 window system kernel without compromising the window system's integrity. This scheme minimizes context switching, eliminates protocol packing and unpacking, and greatly reduces data copying. D11 programs fall back to the X11 protocol when running remote or connecting to an X11 server. A special D11 program acts as an X11 protocol translator to allow X11 programs to utilize a D11 window system.
[The described system was never implemented.]
NVIDIA OpenGL and Vulkan Support for 2017 - Mark Kilgard
Learn how NVIDIA continues improving both Vulkan and OpenGL for cross-platform graphics and compute development. This high-level talk is intended for anyone wanting to understand the state of Vulkan and OpenGL in 2017 on NVIDIA GPUs. For OpenGL, the latest standard update maintains the compatibility and feature-richness you expect. For Vulkan, NVIDIA has enabled the latest NVIDIA GPU hardware features and now provides explicit support for multiple GPUs. And for either API, NVIDIA's SDKs and Nsight tools help you develop and debug your application faster.
NVIDIA booth theater presentation at SIGGRAPH in Los Angeles, August 1, 2017.
http://www.nvidia.com/object/siggraph2017-schedule.html?id=sig1732
Get your SIGGRAPH driver release with OpenGL 4.6 and the latest Vulkan functionality from
https://developer.nvidia.com/opengl-driver
EXT_window_rectangles extends OpenGL with a new per-fragment test, the "window rectangles test", for use with FBOs; it provides 8 or more inclusive or exclusive rectangles for rasterized fragments. Applications of this functionality include web browsers and virtual reality.
Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi... - Mark Kilgard
Slides for SIGGRAPH paper presentation of "Accelerating Vector Graphics Rendering using the Graphics Hardware Pipeline".
Presented by Vineet Batra (Adobe) on Thursday, August 13, 2015 at 2:00 pm - 3:30 pm, Los Angeles Convention Center, Room 150/151.
Accelerating Vector Graphics Rendering using the Graphics Hardware Pipeline - Mark Kilgard
SIGGRAPH 2015 paper.
We describe our successful initiative to accelerate Adobe Illustrator with the graphics hardware pipeline of modern GPUs. Relying on OpenGL 4.4 plus recent OpenGL extensions for advanced blend modes and first-class GPU-accelerated path rendering, we accelerate the Adobe Graphics Model (AGM) layer responsible for rendering sophisticated Illustrator scenes. Illustrator documents render in either an RGB or CMYK color mode. While GPUs are designed and optimized for RGB rendering, we orchestrate OpenGL rendering of vector content in the proper CMYK color space and accommodate the 5+ color components required. We support both non-isolated and isolated transparency groups, knockout, patterns, and arbitrary path clipping. We harness GPU tessellation to shade paths smoothly with gradient meshes. We do all this and render complex Illustrator scenes 2 to 6x faster than CPU rendering at Full HD resolutions; and 5 to 16x faster at Ultra HD resolutions.
NV_path_rendering is an OpenGL extension for GPU-accelerated path rendering. Recent functionality improvements provide better performance, better typography, rounded rectangles, conics, and OpenGL ES support. This functionality is available today with NVIDIA's 337.88 drivers.
The latest NV_path_rendering specification documents these new functional improvements:
https://www.opengl.org/registry/specs/NV/path_rendering.txt
You can find sample code here:
https://github.com/markkilgard/NVprSDK
Presented at SIGGRAPH 2014 in Vancouver during NVIDIA's "Best of GTC" sponsored sessions.
http://www.nvidia.com/object/siggraph2014-best-gtc.html
Watch the replay that includes a demo of GPU-accelerated Illustrator and several OpenGL 4 demos running on NVIDIA's Tegra Shield tablet.
http://www.ustream.tv/recorded/51255959
Find out more about the OpenGL examples for GameWorks:
https://developer.nvidia.com/gameworks-opengl-samples
SIGGRAPH Asia 2012: GPU-accelerated Path Rendering - Mark Kilgard
Presented at SIGGRAPH Asia 2012 in Singapore on Friday, 30 November 14:15 - 16:00 during the "Points and Vectors" session.
Find the paper at http://developer.nvidia.com/game/gpu-accelerated-path-rendering or on Slideshare.
For thirty years, resolution-independent 2D standards (e.g. PostScript, SVG) have relied largely on CPU-based algorithms for the filling and stroking of paths. Learn about our approach to accelerate path rendering with our GPU-based "Stencil, then Cover" programming interface. We've built and productized our OpenGL-based system.
Preprint for SIGGRAPH Asia 2012
Copyright ACM, 2012
For thirty years, resolution-independent 2D standards (e.g. PostScript, SVG) have depended on CPU-based algorithms for the filling and stroking of paths. However, advances in graphics hardware have largely ignored the problem of accelerating resolution-independent 2D graphics rendered from paths.
Our work builds on prior work to re-factor the path rendering task to leverage existing capabilities of modern pipelined and massively parallel GPUs. We introduce a two-step "Stencil, then Cover" (StC) paradigm that explicitly decouples path rendering into one GPU step to determine a path's filled or stenciled coverage and a second step to rasterize conservative geometry intended to test and reset the coverage determinations of the first step while shading color samples within the path. Our goals are completeness, correctness, quality, and performance, but we go further to unify path rendering with OpenGL's established 3D rendering pipeline. We have built and productized our approach to accelerate path rendering as an OpenGL extension.
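The "Stencil" step's coverage determination is, conceptually, a per-sample winding-number computation under the path's fill rule. A minimal CPU analogue of the nonzero-rule test for a polygonal path, as an illustrative sketch only (the GPU version accumulates the same count via stencil buffer increments and decrements, not a loop like this):

```python
# Nonzero winding-number coverage test for a polygonal path.

def winding_number(point, polygon):
    """Signed count of polygon-edge crossings of a +x ray from `point`."""
    px, py = point
    wn = 0
    for i in range(len(polygon)):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % len(polygon)]
        side = (x1 - x0) * (py - y0) - (y1 - y0) * (px - x0)
        if y0 <= py < y1 and side > 0:     # upward crossing, point left of edge
            wn += 1
        elif y1 <= py < y0 and side < 0:   # downward crossing, point right of edge
            wn -= 1
    return wn

square = [(0, 0), (4, 0), (4, 4), (0, 4)]   # counter-clockwise
assert winding_number((2, 2), square) != 0  # inside: sample gets shaded
assert winding_number((5, 2), square) == 0  # outside: sample is discarded
```

The "Cover" step then draws conservative bounding geometry, and the stencil test keeps only the samples whose accumulated count satisfies the fill rule.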
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering - Mark Kilgard
Video replay: http://nvidia.fullviewmedia.com/siggraph2012/ondemand/SS106.html
Location: West Hall Meeting Room 503, Los Angeles Convention Center
Date: Wednesday, August 8, 2012
Time: 2:40 PM – 3:40 PM
The future of GPU-based visual computing integrates the web, resolution-independent 2D graphics, and 3D to maximize interactivity and quality while minimizing consumed power. See what NVIDIA is doing today to accelerate resolution-independent 2D graphics for web content. This presentation explains NVIDIA's unique "stencil, then cover" approach to accelerating path rendering with OpenGL and demonstrates the wide variety of web content that can be accelerated with this approach.
More information: http://developer.nvidia.com/nv-path-rendering
Presented at the GPU Technology Conference 2012 in San Jose, California.
Tuesday, May 15, 2012.
Standards such as Scalable Vector Graphics (SVG), PostScript, TrueType outline fonts, and immersive web content such as Flash depend on a resolution-independent 2D rendering paradigm that GPUs have not traditionally accelerated. This tutorial explains a new opportunity to greatly accelerate vector graphics, path rendering, and immersive web standards using the GPU. By attending, you will learn how to write OpenGL applications that accelerate the full range of path rendering functionality. Not only will you learn how to render sophisticated 2D graphics with OpenGL, you will learn to mix such resolution-independent 2D rendering with 3D rendering and do so at dynamic, real-time rates.
Presented at the GPU Technology Conference 2012 in San Jose, California.
Monday, May 14, 2012.
Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForce GPUs. Topics covered include the latest advances available for Cg 3.1, the OpenGL Shading Language (GLSL); programmable tessellation; improved support for Direct3D conventions; integration with Direct3D and CUDA resources; bindless graphics; and more. When you utilize the latest OpenGL innovations from NVIDIA in your graphics applications, you benefit from NVIDIA's leadership driving OpenGL as a cross-platform, open industry standard.
2. CS 354 2
Today’s material
In-class quiz
Lecture topic
Architecture of Graphics Processing Units (GPUs)
Course work
Homework #4 due today
Review textbook reading
Chapters 5, 6, and 7
Project #2 on texturing, shading, & lighting is coming
Remember: Midterm in-class on March 8
3. CS 354 3
My Office Hours
Tuesday, before class
Painter (PAI) 5.35
8:45 a.m. to 9:15 a.m.
Thursday, after class
ACE 6.302
11:00 a.m. to 12:00 p.m.
Randy’s office hours
Monday & Wednesday
11:00 a.m. to 12:00 p.m.
Painter (PAI) 5.33
4. CS 354 4
Last time, this time
Last lecture, we discussed
Programmable shading
Graphics hardware shading languages
This lecture
How do GPUs work?
5. CS 354 5
On a sheet of paper
Daily Quiz
• Write your EID, name, and date
• Write #1, #2, #3, #4 followed by its answer
Pick the best choice: Shade trees are
a) fractal trees with shadows
b) OpenGL commands
c) hierarchical arrangements of shading computations
d) fractal patterns of all sorts
Multiple choice: The GLSL standard has built-in data types for
a) vectors
b) matrices
c) texture samplers
d) floating-point values
e) pointers to malloc’ed memory
f) a through e
g) a through d
Name one general purpose programming language that GLSL borrows from.
6. CS 354 6
Key Trend in OpenGL Evolution
From fixed-function to programmable: simple configurability gave way to complex configurability, and ultimately to shaders written in high-level languages.
Direct3D follows the same trend
Also reflects the trend in GPU architecture
API and hardware co-evolving
7. CS 354 7
Programming Shaders inside GPU
Multiple programmable domains within the GPU, each programmable in high-level languages: Cg, HLSL, or the OpenGL Shading Language (GLSL).
Pipeline (OpenGL 3.3): 3D application or game → OpenGL API → (CPU–GPU boundary) → GPU front end → vertex assembly → vertex shader → primitive assembly → geometry shader → clipping, setup, and rasterization → fragment shader → raster operations. The vertex, geometry, and fragment shaders are programmable; the remaining stages are fixed-function. Attribute fetch, parameter buffer reads, texture fetches, and framebuffer access all go through the memory interface.
9. CS 354 9
Six Years of GPU Architecture
2000, GeForce 256 (OpenGL 1.3, DX7): hardware transform & lighting, configurable fixed-point shading, cube maps, texture compression, anisotropic texture filtering
2001, GeForce3 (OpenGL 1.4, DX8): programmable vertex transformation, 4 texture units, dependent textures, 3D textures, shadow maps, multisampling, occlusion queries
2002, GeForce4 Ti 4600 (OpenGL 1.4, DX8.1): early Z culling, dual-monitor
2003, GeForce FX (OpenGL 1.5, DX9): vertex program branching, floating-point fragment programs, 16 texture units, limited floating-point textures, color and depth compression
2004, GeForce 6800 Ultra (OpenGL 2.0, DX9c): vertex textures, structured fragment branching, non-power-of-two textures, generalized floating-point textures, floating-point texture filtering and blending
2005, GeForce 7800 GTX (OpenGL 2.0, DX9c): transparency antialiasing
10. CS 354 10
GeForce Peak Vertex Processing Trends
[Chart: millions of vertices per second, 0 to 1,400, for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Note: the peak rate for trivial 4x4 vertex transforms exceeds peak setup rates, which allows excess vertex processing.]
Vertex units: 1, 1, 2, 3, 6, 8
11. CS 354 11
GeForce Peak Memory Bandwidth Trends
[Chart: gigabytes per second, 0 to 200, for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Plots raw bandwidth versus effective raw bandwidth with compression, each with an exponential trend line. The first four GPUs use a 128-bit memory interface; the last two use a 256-bit interface.]
12. CS 354 12
Effective GPU Memory Bandwidth
Compression schemes
Lossless depth and color compression (when multisampling)
Lossy texture compression (S3TC / DXTC), typically assumed to give 4:1 compression
Avoidance of useless work
Early killing of fragments (Z cull)
Avoiding useless blending and texture fetches
Very clever memory controller designs
Combining memory accesses for improved coherency
Caches for texture fetches
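As a rough illustration of how these schemes stretch a fixed bus rate, the effective bandwidth can be estimated by scaling each class of memory traffic by its compression ratio. The traffic mix below is an assumption for illustration; the 38.4 GB/s raw rate follows from the 7800 GTX's 256-bit interface at 600 MHz DDR.

```python
# Illustrative estimate of effective memory bandwidth gained from
# compression; the traffic mix and ratios below are assumptions,
# not measured figures from the slides.

def effective_bandwidth(raw_gb_s, traffic):
    """traffic: list of (fraction_of_logical_traffic, compression_ratio).

    A compression ratio of 4.0 means 4 bytes of logical data move as
    1 byte over the bus, so that traffic class costs 4x less bandwidth.
    """
    assert abs(sum(f for f, _ in traffic) - 1.0) < 1e-9
    # Bus bytes actually transferred per logical byte of traffic:
    bus_cost = sum(f / ratio for f, ratio in traffic)
    return raw_gb_s / bus_cost

# Hypothetical mix: 50% texture fetches at 4:1 (S3TC/DXTC),
# 30% color/depth at 2:1 lossless, 20% uncompressed.
mix = [(0.5, 4.0), (0.3, 2.0), (0.2, 1.0)]
print(round(effective_bandwidth(38.4, mix), 1))  # effective GB/s
```

With this made-up mix, the 38.4 GB/s raw interface behaves like roughly twice that in logical traffic, which is the kind of win the slide's compression schemes are after.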
13. CS 354 13
GeForce Core and Memory Clock Rates
[Chart: megahertz (MHz), 0 to 1,400, core clock versus memory clock for Riva ZX, Riva TNT2, GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Note: at the DDR memory transition, memory rates double the physical clock rate.]
14. CS 354 14
GeForce Peak Triangle Setup Trends
[Chart: millions of triangles per second, 0 to 300, for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Note: assumes 50% face culling.]
15. CS 354 15
GeForce Peak Texture Fetch Trends
[Chart: millions of texture fetches per second, 0 to 12,000, for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Note: assumes no texture cache misses.]
Texture units: 2×4, 2×4, 2×4, 2×4, 16, 24
16. CS 354 16
GeForce Peak Depth/Stencil-only Fill
[Chart: millions of depth/stencil pixel updates per second, 0 to 18,000, for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Note: assumes double-speed depth/stencil-only read-modify-write updates.]
Raster Op units: 4, 4, 4, 4+4, 16+16, 16+16
17. CS 354 17
GeForce Transistor Count and Semiconductor Process
[Chart: millions of transistors, 0 to 450, for Riva ZX, Riva TNT2, GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX.]
Process (µm): 0.35, 0.22, 0.18, 0.18, 0.15, 0.13, 0.13, 0.11
18. CS 354 18
Hardware units by GPU:
Unit                           GeForce FX 5900   GeForce 6800 Ultra   GeForce 7800 GTX
Vertex                         3                 6                    8
Fragment / 2nd texture fetch   4+4               16                   24
Raster color / raster depth    4+4               16+16                16+16
19. CS 354 19
GeForce 7800 GTX Board Details
SLI connector; single-slot cooling
sVideo TV out; DVI × 2
256 MB / 256-bit DDR3 at 600 MHz (8 pieces of 8M×32)
16x PCI-Express
20. CS 354 20
GeForce 7800 GTX
GPU Details
302 million transistors
430 MHz core clock
256-bit memory interface
Notable Functionality
• Non-power-of-two textures with mipmaps
• Floating-point (fp16) blending and filtering
• sRGB color space texture filtering and
frame buffer blending
• Vertex textures
• 16x anisotropic texture filtering
• Dynamic vertex and fragment branching
• Double-rate depth/stencil-only rendering
• Early depth/stencil culling
• Transparency antialiasing
22. CS 354 22
GeForce Graphics Pipeline
Separate dedicated units
Pipeline: CPU → Vertex Engine → Setup → Raster (with Z Cull) → Fragment Shader (with Texture) → Raster Ops → Frame Buffer
23. CS 354 23
GeForce Graphics Pipeline
Vertex Engine
Vertex pulling
Vector floating-point instructions
Dynamic branching
Vertex texture
Vertex stream frequency
24. CS 354 24
GeForce Graphics Pipeline
Setup
Prepares the triangle for rasterization
215M triangles/sec setup
25. CS 354 25
GeForce Graphics Pipeline
Raster
Compute coverage
Points, lines, and triangles
Rotated grid multisampling
26. CS 354 26
GeForce Graphics Pipeline
Z Cull
Discard fragments early based on Z
Up to 64 pixels/clock
Multisampled: 256 samples/clock
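The slide's per-clock culling rates translate into per-second rates once a core clock is assumed. A quick sanity check, assuming the 430 MHz GeForce 7800 GTX core clock given on the GPU-details slide:

```python
# Convert Z Cull's per-clock discard rates into per-second rates.
# 430 MHz is the GeForce 7800 GTX core clock from the GPU-details
# slide; the per-clock rates come from this slide.

CORE_CLOCK_HZ = 430e6

pixels_per_clock = 64      # pixels culled per clock
samples_per_clock = 256    # multisampled: samples per clock

pixels_per_sec = pixels_per_clock * CORE_CLOCK_HZ
samples_per_sec = samples_per_clock * CORE_CLOCK_HZ

print(f"{pixels_per_sec / 1e9:.2f} Gpixels/s culled")
print(f"{samples_per_sec / 1e9:.2f} Gsamples/s culled (multisampled)")
```

That is roughly 27.5 billion pixels per second rejected before any shading work is done, which is why early Z cull counts as bandwidth avoidance rather than computation.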
27. CS 354 27
GeForce Graphics Pipeline
Fragment Shader
User-programmed fragment coloring
Dynamic branching
Long shaders
Multiple render targets
fp16 and fp32 vectors
28. CS 354 28
GeForce Graphics Pipeline
Texture
fp16 and sRGB filtering
16x anisotropic filtering
Non-power-of-two mipmapping
Shadow maps, cube maps, and 3D
Floating-point textures
29. CS 354 29
GeForce Graphics Pipeline
Raster Operations
2x and 4x multisampling
fp16 and sRGB blending
Multiple render targets
Color and depth compression
Double-speed depth/stencil only
30. CS 354 30
Single GeForce 7800 Vertex Unit
Vertex Processing Engine:
• MIMD architecture
• Dual issue
• Low-penalty branching
• Shader Model 3.0
• 32 vector registers
• 512 static instructions per program
• Indexed input and output registers
Datapath: primitive assembly and attribute processing feed an FP32 scalar unit and an FP32 vector unit, with a branch unit.
Vertex Texture Fetch (backed by a texture cache):
• Non-stalling
• Up to 4 texture units
• Unlimited fetches
• Mipmapping, no filtering
Results pass through viewport processing to Setup.
31. CS 354 31
Vertex Texturing Example
A flat tessellated mesh plus a height-field texture, displaced by a vertex program, yields a displaced mesh.
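A minimal CPU-side sketch of what the vertex program does per vertex: sample the height field and push the vertex along the flat mesh's normal. Plain Python stands in for GLSL/Cg here, and the nearest-neighbor sampler and +Z normal are illustrative simplifications.

```python
# Sketch of vertex-texture displacement: each vertex of a flat,
# tessellated z=0 grid samples a height field and is displaced
# along +Z. A pure-Python stand-in for the GPU vertex shader.

def sample_height(heights, u, v):
    """Nearest-neighbor sample of a 2D height field at (u, v) in [0, 1]."""
    rows, cols = len(heights), len(heights[0])
    r = min(int(v * rows), rows - 1)
    c = min(int(u * cols), cols - 1)
    return heights[r][c]

def displace(vertices, heights, scale=1.0):
    """vertices: list of (x, y, u, v) on a flat z=0 mesh -> (x, y, z)."""
    return [(x, y, scale * sample_height(heights, u, v))
            for x, y, u, v in vertices]

field = [[0.0, 0.5],
         [0.5, 1.0]]
mesh = [(0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.99, 0.0),
        (0.0, 1.0, 0.0, 0.99), (1.0, 1.0, 0.99, 0.99)]
print(displace(mesh, field, scale=2.0))
```

On the GPU this per-vertex fetch is exactly what the non-stalling vertex texture unit of the previous slide provides.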
33. CS 354 33
Vertex Textures to Drive Particle Systems
Render-to-texture: the simulation runs in a floating-point frame buffer that is also usable as a texture.
Vertex textures: each particle's location is determined with a vertex texture fetch.
34. CS 354 34
Single GeForce 7800 Fragment Shader Pipeline
Texture Processor (16 texture units):
• 1 texture fetch at full speed
• Bilinear or tri-linear filtering
• 16x anisotropic filtering
• Floating-point (fp16) texture filtering
Shader Unit 1 (FP32):
• 4 MULs + RCP
• Dual issue
• Texture address calculation
• Fast fp16 normalize
• Free: negate, abs, condition codes
Shader Unit 2 (FP32):
• 4 MADs or DP4
• Dual issue
• Free: negate, abs, condition codes
A branch processor and a fixed-function fog unit round out the pipeline; a texture cache feeds the texture processor, and shaded fragments are output.
35. CS 354 35
Operations Per Fragment Shader Pass
Shader Unit 1: 4 components × 1 op per component = 4 ops per fragment per pass
Texture: 1 texture fetch per fragment at full speed per pass
Shader Unit 2: 4 components × 1 op per component = 4 ops per fragment per pass
Total: 8 operations per fragment per pass
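Combining this slide with the 24 fragment pipelines and the 430 MHz core clock given elsewhere in the deck, peak programmable shading throughput works out as a simple product. This is a back-of-the-envelope estimate that ignores texture fetches and co-issue packing:

```python
# Back-of-the-envelope peak shader throughput for a GeForce
# 7800 GTX-class part: ops/fragment/pass from this slide,
# pipeline count and clock from earlier slides.

OPS_PER_FRAGMENT_PER_PASS = 8   # shader unit 1 + shader unit 2
FRAGMENT_PIPES = 24
CORE_CLOCK_HZ = 430e6

peak_ops_per_sec = OPS_PER_FRAGMENT_PER_PASS * FRAGMENT_PIPES * CORE_CLOCK_HZ
print(f"{peak_ops_per_sec / 1e9:.2f} billion shader ops/s")
```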
36. CS 354 36
Fragment Shader Component Co-issue
Use the 4 components in various ways:
• RGBA all together
• RGB and A
• RG and BA
Both shader units can co-issue, with two operations per shader unit:
• Shader Unit 1: operation 1 on one component group, operation 2 on the other
• Shader Unit 2: operation 3 and operation 4, likewise
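A toy scheduler illustrates the idea: two independent operations can share one shader unit's 4-wide datapath in a single pass when their component masks are disjoint and together form one of the allowed splits. The masks and pairing rules here are a simplification for illustration, not the hardware's exact issue rules.

```python
# Toy model of component co-issue: two ops share one 4-wide shader
# unit per pass if their write masks are disjoint and together form
# an allowed split (RGBA alone, RGB+A, or RG+BA). Simplified.

ALLOWED_SPLITS = [
    (frozenset("rgba"), frozenset()),
    (frozenset("rgb"), frozenset("a")),
    (frozenset("rg"), frozenset("ba")),
]

def can_coissue(mask1, mask2):
    m1, m2 = frozenset(mask1), frozenset(mask2)
    if m1 & m2:
        return False  # both ops want the same component
    return any((m1, m2) in ((a, b), (b, a)) for a, b in ALLOWED_SPLITS)

print(can_coissue("rgb", "a"))   # 3+1 split: co-issues
print(can_coissue("rg", "ba"))   # 2+2 split: co-issues
print(can_coissue("rg", "gb"))   # overlap on g: cannot
```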
37. CS 354 37
Single GeForce 7800 Raster Operations Pipeline
Dataflow: input shaded fragment data → pixel crossbar interconnect → multisample antialiasing → depth compression and color compression → depth raster operations and color raster operations → memory frame buffer partition.
Functionality:
• OpenEXR floating-point blending
• sRGB blending
• 4x rotated-grid multisampling
• Lossless color and depth compression
• Multiple render targets
39. CS 354 39
Scalable Link Interface (SLI)
Gang two GeForce 6600, 6800, or 7800
graphics boards together
Can almost double your performance
[Photo: two GeForce 6800 Ultras joined by the SLI connector]
40. CS 354 40
SLI Rendering Modes
Split Frame Rendering (SFR)
One GPU renders top of screen; other renders the bottom
Scales fragment processing but not vertex processing
Alternate Frame Rendering (AFR)
Scales both vertex and fragment processing
Adds frame latency
Rendering must be free of CPU synchronization
SLI Antialiasing: SLI8x and SLI16x
Better antialiasing quality rather than performance
Each card renders with slightly different sub-pixel offset
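A small model makes the SFR/AFR trade-off concrete. Assume (illustratively) a frame whose time splits into a vertex-bound part and a fragment-bound part: SFR halves only the fragment part, while AFR alternates whole frames across the two GPUs. The 4 ms / 6 ms split below is an assumption, not a measured figure.

```python
# Toy frame-time model for two-GPU SLI scaling.

def sfr_frame_time(vertex_ms, fragment_ms):
    # Split Frame Rendering: each GPU still processes all vertices
    # (the split is in screen space), but fragment work is halved.
    return vertex_ms + fragment_ms / 2

def afr_throughput_frame_time(vertex_ms, fragment_ms):
    # Alternate Frame Rendering: whole frames alternate between
    # GPUs, so throughput doubles (at the cost of a frame of latency).
    return (vertex_ms + fragment_ms) / 2

v, f = 4.0, 6.0   # ms of vertex and fragment work per frame (assumed)
print(sfr_frame_time(v, f))             # 7.0 ms: only fragments scale
print(afr_throughput_frame_time(v, f))  # 5.0 ms per frame in steady state
```

The model shows why AFR scales better (both workloads halve) yet adds latency, exactly as the slide states.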
41. CS 354 41
PC Graphics Hardware Evolution
Viable economics: 650 million GeForce GPUs since 1999
1,000x complexity since 1995; Moore’s Law at work
[Timeline, 1997 to 2010: RIVA 128, 3M transistors; GeForce 256, 23M; GeForce 3, 60M; GeForce FX, 125M; GeForce 8800, 681M; GeForce 580 GTX, 3B transistors]
45. CS 354 45
Streaming Multiprocessor (SM)
Multi-processor execution unit
32 scalar processor cores
A warp is a unit of thread execution of up to 32 threads
Two workloads:
• Graphics: vertex shader, tessellation, geometry shader, fragment shader
• Compute
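Since a warp executes up to 32 threads in lockstep, the number of warps a workload occupies is a ceiling division. A generic sketch, not tied to any particular API:

```python
# Warps are units of up to 32 threads executing together on an SM.
# Launching N threads therefore occupies ceil(N / 32) warps; the
# last warp may be only partially full.

WARP_SIZE = 32

def warps_needed(num_threads):
    return -(-num_threads // WARP_SIZE)   # ceiling division

for n in (32, 33, 1000):
    full, rem = divmod(n, WARP_SIZE)
    print(n, "threads ->", warps_needed(n), "warps",
          f"(last warp has {rem or WARP_SIZE} threads)")
```

A partially full last warp still occupies a whole warp slot, which is one reason thread counts are usually chosen as multiples of 32.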
46. CS 354 46
OpenGL Pipeline Programmable Domains run on Unified Hardware
Unified Streaming Processor Array (SPA) architecture means the same capabilities for all domains, plus tessellation and compute (not shown below).
Pipeline: GPU front end → vertex assembly → vertex program → primitive assembly → primitive program → clipping, setup, and rasterization → fragment program → raster operations. Attribute fetch, parameter buffer reads, texture fetches, and framebuffer access all go through the memory interface. The vertex, primitive, and fragment programs can all be unified hardware!
48. CS 354 48
Shader or CUDA Core,
Same Unit but Two Personalities
Execution unit
Scalar floating-point
Scalar integer
49. CS 354 49
Levels of Caching in Fermi GPU
12 KB L1 texture cache per texture unit
64 KB per-SM cache, split into a dedicated 16 KB or 48 KB load/store cache and 48 KB or 16 KB of shared memory
768 KB L2 cache: unifies the texture cache, raster operation cache, and internal buffering of the prior generation; read/write and fully coherent
50. CS 354 50
Cache Use Strategies
in Fermi GPU
Pipeline stages can communicate efficiently through
GPU’s L1 and L2 caches
Buffering between stages stays all on chip
Only vertex, texel, and pixel read/writes need to go to DRAM
51. CS 354 51
Vertex and Tessellation
Processing Tasks
Fixed-function graphics engines
Pull attributes and assemble vertex
Manage tessellation control and domain shader evaluation
Viewport transform
Attribute setup of plane equations for rasterization
Stream out vertices into buffers
52. CS 354 52
Rasterization Tasks
Turns primitives into fragments
Computes edge equations
Two-stage rasterization
Coarse raster finds tiles the primitive could be in
Fine raster evaluates sample positions within tiles
Zcull efficiently eliminates occluded fragments
56. CS 354 56
GPUs as Compute Nodes
The architecture of the GPU has evolved into a high-performance, high-bandwidth compute node with a small form factor.
Form factors: integrated CPU-GPU, OEM workstations, CPU servers and blades, and 1U compute servers with 2 to 4 Tesla GPUs.
57. CS 354 57
Compute Programming Model
Cooperative Thread Array (CTA)
Single Program, Multiple Data
Organized in 1D, 2D, or 3D
Programming APIs
CUDA, OpenCL, DirectCompute
APIs + language = parallel processing system
OpenGL or Direct3D through shaders
Cg, HLSL, GLSL
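The CTA model maps naturally onto nested index arithmetic: every thread derives a unique global coordinate from its block index and its thread index within the block. A hedged sketch of 2D SPMD indexing in plain Python (generic, not any particular API's built-in variables):

```python
# Sketch of SPMD indexing for a 2D grid of 2D cooperative thread
# arrays (CTAs): each thread computes a unique global coordinate
# as block_index * block_dim + thread_index.

def global_ids(grid_dim, block_dim):
    """grid_dim, block_dim: (x, y) sizes. Yields one global (x, y)
    coordinate per thread in the launch."""
    gx, gy = grid_dim
    bx, by = block_dim
    for block_y in range(gy):
        for block_x in range(gx):
            for ty in range(by):
                for tx in range(bx):
                    yield (block_x * bx + tx, block_y * by + ty)

ids = list(global_ids((2, 2), (4, 4)))
print(len(ids))        # 64 threads total
print(len(set(ids)))   # every coordinate is unique
```

The same single program runs at every one of these coordinates, which is the Single Program, Multiple Data organization the slide names.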
58. CS 354 58
Now in World’s Fastest
Supercomputers
Tianhe-1A
2.507 petaflops
7,168 Tesla M2050 GPUs
National Supercomputing Center in Tianjin
59. CS 354 59
Opposite direction:
Consumer mobile devices
60. CS 354 60
Low-power Mobile
System on a Chip (SoC)
Complete system on a chip
4 ARM cores
Integrated graphics
OpenGL ES 2.0
Power <1W
61. CS 354 61
Mid-term Next Class
Mid-term
Similar in format to the homeworks
15% of your final grade
Arrive on time
Open textbook. Open notes, including lecture slides.
Calculators allowed/encouraged.
No smart phones, no computers, no Internet access.
Show your work to justify your answer and provide a basis for partial
credit.
What to study
All material in lecture slides
Review in-class quiz questions
Study homeworks
Responsible for textbook readings
Have a relaxing spring break
Next lecture: Shadows
Come back to Project 2
Editor's Notes
The technology of graphics processors has evolved amazingly over the last 15 years or so. I’ve been at NVIDIA for more than 10 years and have seen a lot of this first hand. As the hardware increases in performance, the visual quality improves. This is driven by Moore’s law, which says that the number of transistors able to fit on a piece of silicon doubles roughly every 18 months. The great thing about graphics is that it has an insatiable appetite for computation. We’re clearly not at photo-realistic quality yet and still have a long way to go.
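The "1,000x complexity since 1995" figure on the evolution slide is consistent with that doubling rate, as a quick check shows:

```python
# Moore's-law sanity check: doubling transistor count every 18
# months over 15 years gives ten doublings, i.e. about 1,000x,
# matching the "1,000x complexity since 1995" slide.

months_per_doubling = 18
years = 15

doublings = years * 12 / months_per_doubling
growth = 2 ** doublings
print(doublings, growth)   # 10.0 doublings -> 1024.0x
```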
World’s fastest known supercomputer today; the official Top500 list comes out next month. Peta = 10^15 = a thousand trillion floating-point operations per second.