What Machine Learning professionals should know about GPUs!
Brief outline of the deck:
* GPU architecture explained with simple images
* memory bandwidth cheat sheets for common hardware configurations
* overview of the GPU programming model
* an under-the-hood peek at the main building block of ML - matrix multiplication
* effect of mini-batch size on performance
Originally I gave this talk at the internal Machine Learning Workshop at Unity Seattle.
HIGH-QUALITY PDF slides: http://bit.ly/2iQxm7X (on Dropbox)
2. Despite architectural differences between CPU & GPU, what dominates the speed of training a Convolutional Neural Net is the raw TFLOPs of a given chip!
5. SIMD vs SIMT
SIMD: Single Instruction, Multiple Data.
SIMT: Single Instruction, Multiple Threads.
In both, one instruction is applied to all lanes at once; the data spans multiple lanes, aka a vector.
6. SIMT is almost the same as SIMD, but much wider
SIMD: 4 lanes (SSE), 8 lanes (AVX).
SIMT: 16 lanes (mobile GPU), 32 lanes (NVIDIA), 64 lanes (AMD).
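To make the contrast concrete, here is a minimal sketch (my own illustration, not from the deck) of the same element-wise add written both ways: with SIMD the 4-float lane width is explicit in the host code, while with SIMT you write a plain scalar kernel and the hardware runs it 32 lanes (one warp) at a time.

```cuda
// SIMD vs SIMT on the same element-wise add (illustrative sketch).
#include <xmmintrin.h>    // SSE intrinsics: 4 x FP32 lanes
#include <cuda_runtime.h>

// SIMD (CPU/SSE): one thread, one instruction over 4 explicit lanes.
// Assumes n is a multiple of 4.
void add_simd(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}

// SIMT (GPU/CUDA): the kernel is written for a single scalar "lane";
// the hardware executes it as warps of 32 lanes in lockstep.
__global__ void add_simt(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
```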
7. Out Of Order vs Warp (Wavefront)
Out Of Order (CPU): executes "any" instruction that has all source operands ready.
Warp or Wavefront (GPU): a warp is 32 threads working in sync, and the threads must share the same program!
1 core* can keep 64** warps in flight and executes whichever warp has all source operands ready.
Switching between warps is very lightweight; the aim is to hide memory latency.
*) By core here I mean a Streaming Multiprocessor (SM) on NVIDIA or a Compute Unit (CU) on AMD GPUs, but not a "CUDA core". While SM and CU are comparable to a CPU core, "CUDA core" is more of a marketing term than a full-fledged core.
**) Depends on the chip, actually - 40 on AMD, 128 on GP100…
8. SIMT in one picture
Thread = Lane: 1 thread is mapped to 1 hardware lane.
Threads are grouped into warps, and each core maintains many warps in flight.
9. Warp
A warp is 32* threads working in lockstep.
All threads execute the same program.
Each thread is mapped to a single SIMT lane.
The processor will "juggle" warps to hide memory latency.
*) The number of threads per warp is actually hardware dependent and ranges from 16 on mobile GPUs to 64 on AMD.
10. Take Away
Need lots of parallel work - 1000s* of threads.
An IF-ELSE block executes both IF and ELSE**.
If a SIMT lane is not doing work, it is wasted!
*) Threads needed = number of cores (SM/CU) × number of warps in flight × number of threads per warp. For example, an optimised matrix multiplication kernel would need at least 10240 threads to saturate a GTX 1080: 20 SMs × 16 warps × 32 threads.
**) Unless of course all threads in the warp agree on the result of the conditional statement. In that case only one path needs to be executed, either IF or ELSE.
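A minimal sketch (my own, not from the deck) of what the IF-ELSE point means in practice: when lanes of one warp disagree on a branch, the warp executes both paths with part of its lanes masked off, while a warp-uniform branch executes only one path.

```cuda
// Warp divergence sketch. Assumes the launch grid exactly covers the data.
__global__ void divergent(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {                 // even/odd lanes of the same warp disagree,
        data[i] = data[i] * 2.0f;     // so the warp runs this path first...
    } else {
        data[i] = data[i] + 1.0f;     // ...and then this one: roughly 2x the cost.
    }
}

__global__ void uniform(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0) {        // all 32 lanes of a warp agree on this
        data[i] = data[i] * 2.0f;     // condition, so only one path executes
    } else {                          // per warp - no divergence penalty
        data[i] = data[i] + 1.0f;
    }
}
```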
13. LDS - Local Data Share
A piece of small but ultra-fast memory - up to 8 TB/s!
An explicitly programmable "cache".
Fast data exchange between threads.
NVIDIA calls it "Shared Memory"*.
*) But "Shared Memory" is such a generic term, let's be more specific and call it Local Data Share for the rest of the slides.
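As a small illustration (my own sketch, assuming 256-thread blocks), here is the classic use of LDS in CUDA: staging data in __shared__ memory so the threads of a block can exchange partial results without round-tripping through VRAM.

```cuda
// Per-block sum reduction through LDS ("shared memory" in CUDA terms).
// Launch with 256 threads per block.
__global__ void block_sum(const float* in, float* block_sums, int n) {
    __shared__ float lds[256];               // explicitly programmable "cache"
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    lds[tid] = (i < n) ? in[i] : 0.0f;       // stage one element per thread
    __syncthreads();                         // make it visible to the whole block

    // Threads exchange partial sums through LDS instead of VRAM.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) lds[tid] += lds[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = lds[0];
}
```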
14. One-dimensional vs multidimensional view
One-dimensional: memory and execution are sequential.
Multidimensional view: some instructions see memory as 2D/3D blocks, and work is scheduled in groups.
16. Desktop (GTX 1080 + i7-7700)
GTX 1080 ↔ VRAM: 320 GB/s
i7-7700 ↔ system RAM: 38* GB/s
CPU ↔ GPU over PCIe 3.0 x16: 16 GB/s
*) Numbers provided here and onward are for peak bandwidth. Practical achievable bandwidth is usually 75% of peak and can be even lower for CPU.
19. Take Away
MYTH: PCIe is very slow.
CPU ↔ GPU (PCIe) speed is not too terrible compared with common CPU memory bandwidth - roughly a 1:3 ratio.
PCIe speed is mostly an issue when training with a multi-GPU setup.
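If you want to verify the PCIe figure on your own machine, a minimal sketch (mine, not from the deck) using CUDA events looks like this; on PCIe 3.0 x16 it should report roughly 12 GB/s, i.e. about 75% of the 16 GB/s peak quoted above.

```cuda
// Measure host->device (PCIe) bandwidth with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;                 // 256 MiB payload
    float *host, *device;
    cudaMallocHost((void**)&host, bytes);              // pinned memory for full speed
    cudaMalloc((void**)&device, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);            // milliseconds
    printf("H2D bandwidth: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));
    return 0;
}
```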
20. Take Away
You need to access the same data several times on the GPU to make the transfer worth it.
FLOPs-per-byte is the metric to watch.
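A rough worked example (my numbers, not from the slides): an element-wise add c = a + b performs 1 FLOP per 12 bytes moved (two 4-byte reads, one write), about 0.08 FLOPs/byte, so it can never pay back the transfer on its own. An N×N matrix multiplication performs about 2N³ FLOPs over roughly 12N² bytes of matrix data, i.e. N/6 FLOPs per byte; for N = 4096 that is around 680 FLOPs/byte, which easily justifies moving the data to the GPU.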
21. Take Away
Getting results back from the GPU can be slow, but NOT because of PCIe speed.
Rather it is due to latency - the CPU & GPU work asynchronously!
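A minimal sketch (my own illustration) of where that latency comes from: the kernel launch returns immediately, and it is the tiny copy of the result that forces the CPU to sit and wait for all queued GPU work to finish.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel standing in for "real" GPU work.
__global__ void busy(float* out) {
    float x = 0.0f;
    for (int i = 0; i < 1000000; ++i) x += 1e-6f;
    *out = x;
}

int main() {
    float *d_out, result;
    cudaMalloc((void**)&d_out, sizeof(float));

    busy<<<1, 1>>>(d_out);   // returns immediately: kernel launches are asynchronous
    // ... the CPU is free to prepare the next batch of work here ...

    // The 4-byte copy itself is cheap, but it implicitly waits for the kernel
    // to finish - that latency, not PCIe bandwidth, is what you pay for.
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", result);
    return 0;
}
```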
22. Mobile/Console: Unified Memory Architecture
CPU & GPU share the same memory (RAM).
PS4 Pro — 218 GB/s, iPad Pro — 51 GB/s.
Take Away: the CPU & GPU exchange data much faster, but the overall bandwidth is limited.
23. Multi-GPU (Tesla V100)
Each V100 ↔ its own VRAM: 900 GB/s
Each V100 ↔ its own LDS: 7.8 TB/s
Each V100 ↔ system RAM over PCIe: 8..16* GB/s
*) The motherboard might not have enough PCIe lanes to provide 16 dedicated lanes per GPU.
28. Programming model is scalar!
The compiler maps 32 scalar "threads" to 1 SIMT instruction.
Memory accesses across the 32 "threads" are grouped as well…
The 32 threads are executed in lockstep; each step is 1 SIMT instruction.
29. Memory accesses are grouped… wait, what?
Accesses are grouped into large transactions to maximize memory bandwidth.
A single memory transaction is 256 bits wide.
Imagine a single memory read worth of 32 floats, or a single memory write worth of 32 floats.
30. Memory accesses are grouped
Called "Coalesced Access to Memory".
Accesses will automatically map to as few cache-line accesses as possible!
(Diagram: 32 threads reading consecutive 4-byte addresses #0, #4, #8, … #80 get combined into a handful of wide transactions.)
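A minimal sketch (mine, not from the deck) of what coalescing means for kernel code: in the first kernel, consecutive lanes of a warp read consecutive floats and the hardware merges them into a few wide transactions; in the second, a strided index pattern scatters the lanes across many cache lines and wastes bandwidth.

```cuda
// Coalesced vs. strided global memory access (illustrative sketch).
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                       // lane k of a warp reads element i+k
}

__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);  // neighbouring lanes end up far
        out[i] = in[j];                              // apart: many narrow transactions
    }
}
```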
48. CUDA pipeline
CUDA C → PTX → cubin / SASS
CUDA C — that is what you usually write.
cubin / SASS — that is what actually runs on the GPU.
49. CUDA
CUDA C → PTX → cubin / SASS
PTX — bytecode for the GPU. An intermediate step; NVIDIA only, but architecture* independent.
cubin / SASS — binary for the GPU. NVIDIA only and architecture* specific.
*) By saying 'architecture' here, I mean: Volta, Pascal, Maxwell, Kepler, etc.
50. CUDA
CUDA C → PTX: nvcc — NVIDIA's open-source LLVM-based compiler.
PTX → cubin / SASS: ptxas — NVIDIA's closed-source assembler.
51. OpenCL
Doesn't integrate well with a cross-platform engine:
DirectX + OpenCL = ?
PS4 + OpenCL = ?
Mobile + OpenCL = ?
Code is compiled by the OpenCL run-time ("driver"), so the result might be hit-or-miss in terms of performance.
Performance is not portable.
52. Platform-specific Zoo
DirectCompute — Windows, XBOX
Metal Compute — iOS, MacOS
GLSL Compute — PS4, Linux, Android ES3.1
Vulkan Compute is cross-platform, but not widely implemented yet!
All integrate well with the rendering engine.
Asynchronous compute: graphics and compute workloads run simultaneously.
53. Unity Compute ☛ bit.ly/unity-compute-docs
Cross-compiles to platform-specific compute:
DirectCompute — Windows, XBOX
Metal Compute — iOS, MacOS
GLSL Compute — PS4, Linux, Android ES3.1
Vulkan Compute — Android, …
Integrates well with the rendering engine.
Performance is not portable.
55. Large Matrix Multiplication
The workhorse of Machine Learning:
• a Fully Connected layer is a matrix multiplication
• a Convolutional layer has a lot in common with matrix multiplication
Known as SGEMM in BLAS.
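In practice you rarely hand-write this kernel on NVIDIA hardware; a hedged sketch of what "SGEMM in BLAS" looks like on the GPU is a single cuBLAS call (my example, assuming square column-major matrices already resident in VRAM and an initialized handle).

```cuda
#include <cublas_v2.h>

// C = alpha * A * B + beta * C for N x N column-major matrices in VRAM.
void sgemm_square(cublasHandle_t handle, int n,
                  const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,   // no transposes
                n, n, n,                    // m, n, k
                &alpha,
                dA, n,                      // A and its leading dimension
                dB, n,
                &beta,
                dC, n);
}
```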
57. Matrix multiplication - classical solution
Math ops = O(N³), memory ops = O(N² + N²).
Work in blocks aka tiles!
Load source tiles from memory into the cache (LDS).
Accumulate the multiplication result in the cache.
Store the accumulator tile back to memory.
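Below is a minimal sketch of that classical tiled scheme in CUDA (my own illustration, assuming N is a multiple of the 16×16 tile and a launch grid of N/16 × N/16 blocks of 16×16 threads): every element fetched from VRAM into LDS is reused 16 times before the accumulator tile is written back.

```cuda
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];         // source tile of A cached in LDS
    __shared__ float Bs[TILE][TILE];         // source tile of B cached in LDS

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                        // accumulator lives in a register

    for (int t = 0; t < N / TILE; ++t) {
        // Load one tile of A and one tile of B from VRAM (coalesced reads).
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply the two cached tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;                  // store the accumulator tile back
}
```

Production kernels such as those in cuBLAS/cuDNN push this much further with the register-level 4x4 tiling described on slide 66.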
63. More ALU than LoaD/STore
The GPU can issue more MultiplyAdd instructions than memory reads per cycle.
The GPU packs more arithmetic units (ALU) than memory access units (LD/ST); 4:1 is a common ratio.
65. GTX 1080 memory hierarchy
VRAM — 320 GB/s
LDS — 4.1 TB/s
Registers — 30+ TB/s
A file of 16K scalar registers is shared by up to 16 warps, with up to 256 scalar (FP32) registers per thread.
66. Matrix multiplication #2
Cache big tiles in LDS to minimize VRAM bandwidth.
Cache small tiles (4x4) in registers to minimize LD/ST issue.
Perform the arithmetic operations on the 4x4 blocks.
(Data flow: VRAM → LDS → registers → ALU.)
67. Take away
Reaching the best performance requires the data dimensions to be a multiple of the tile size.
Minibatch size is especially crucial for Fully Connected layers.
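A trivial sketch of what that means in practice (a hypothetical helper of mine, not part of any library): round the minibatch, or any matrix dimension, up to the next multiple of the tile size so no tile runs partially empty.

```cuda
// Round n up to the next multiple of the kernel's tile size.
inline int pad_to_tile(int n, int tile) {
    return ((n + tile - 1) / tile) * tile;   // e.g. pad_to_tile(50, 16) == 64
}
```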
68. Take away
The sweet spot for Convolutional layers is a minibatch between 16 and 32.
Preferably keep all dimensions divisible by 4 or 8 - the size of the small tile loaded into registers, and what the Volta TensorCore operates on.
70. Despite architectural differences between CPU & GPU, what dominates the speed of training a Convolutional Neural Net is the raw TFLOPs of a given chip!*
Xeon E5 v4 @ 2.2 GHz × 20 cores — 0.7 TFLOPs**
i7-7700 @ 3.6 GHz × 4 cores — 0.36 TFLOPs
iPad Pro, A9X CPU @ 2.26 GHz × 2 cores — 0.08 TFLOPs
Tesla V100 @ 1.4 GHz × 80 SM cores — 14.9 TFLOPs***
GTX 1080 Ti @ 1.5 GHz × 28 SM cores — 11.34 TFLOPs
iPad Pro, A9X PVR-7XT GPU @ 0.45 GHz × 12 — 0.35 TFLOPs
*) Given that reasonably optimised code is used - like the cuDNN lib for the GPU and Intel MKL-DNN for the CPU.
**) CPU numbers here are measured and do not completely agree with theoretical - some errors might have crept in ;)
***) Numbers for both CPU & GPU are specified at full FP32 precision.
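As a sanity check on the GPU rows (my arithmetic, not from the slides): peak FP32 throughput ≈ SMs × FP32 units per SM × 2 (a fused multiply-add counts as two FLOPs) × clock. For the GTX 1080 Ti that is 28 SMs × 128 × 2 × ~1.58 GHz ≈ 11.3 TFLOPs, matching the 11.34 TFLOPs quoted above.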