A set of mobile game optimization best practices. This presentation extensively covers the PowerVR series of GPUs from Imagination Technologies and iOS; however, the majority of the recommendations can be applied to other GPUs and mobile operating systems.
2. Mobile GPU architectures
• There are 3 major mobile GPU architectures on the market:
• IMR (Immediate Mode Renderer)
• TBR (Tile Based Renderer)
• TBDR (Tile Based Deferred Renderer)
3. IMR
• Renders anything sent to the GPU immediately. It makes no assumption about what is going to be submitted next.
• The application has to sort opaque geometry front to back.
• It is basically brute force.
• Examples: Nvidia, AMD.
4. TBR
• Improves on IMR, but is still an IMR.
• Bandwidth is a precious resource on mobile devices, and TBR tries to reduce data transfers as much as possible.
• Your geometry is split into tiles and then processed per tile. Each tile has a small amount of on-chip memory for the colour and depth/stencil buffers, so there is no need to transfer data from/to system memory while the tile is being rendered.
• Qualcomm Adreno, ARM Mali
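To make the tiling arithmetic concrete, here is a minimal sketch in C. The function names are made up for illustration, and the 32x32 tile size in the note below is an assumption: the slides do not state a tile size and real hardware varies.

```c
#include <stddef.h>

/* Number of tiles needed to cover a framebuffer. Partial tiles at the
   right and bottom edges still count as whole tiles. */
size_t tile_count(size_t fb_width, size_t fb_height, size_t tile_size)
{
    size_t tiles_x = (fb_width + tile_size - 1) / tile_size;
    size_t tiles_y = (fb_height + tile_size - 1) / tile_size;
    return tiles_x * tiles_y;
}

/* On-chip bytes one tile needs for its colour and depth/stencil values. */
size_t tile_memory_bytes(size_t tile_size,
                         size_t colour_bytes_per_pixel,
                         size_t depth_stencil_bytes_per_pixel)
{
    return tile_size * tile_size *
           (colour_bytes_per_pixel + depth_stencil_bytes_per_pixel);
}
```

With a hypothetical 32x32 tile, a 1024x768 framebuffer splits into 768 tiles, and a tile with 4 bytes of colour and 4 bytes of depth/stencil per pixel needs 8 KB of on-chip memory.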
5. TBDR
• It is deferred, i.e. all the graphics are drawn at some later point.
• And this is where all the magic happens!
• The GPU is aware of context: it knows what is going to be drawn in the future, and this allows it to employ some awesome optimisations and reduce power consumption, bandwidth and fill rate.
• Imagination PowerVR.
9. What you might know
• Pixel-perfect HSR (Hidden Surface Removal). Adreno and ARM Mali do not feature this.
• But you still need to sort transparent geometry!
• Avoid alpha test. Use alpha blend instead.
10. What you might not know
• HSR still requires vertices to be processed!
• …thus don’t forget to cull your geometry on the CPU!
• Prefer the stencil test over the scissor test.
– The stencil test is performed in hardware on PowerVR GPUs.
– The stencil mask is stored in fast on-chip memory.
– A stencil can have any shape, in contrast to the rectangular scissor.
11. What you might not know
• Why no alpha test?!
o Alpha test/discard requires the fragment shader to run before visibility for the current fragment can be determined. This removes the benefits of HSR.
o Even more! If shader code contains discard, then any geometry rendered with this shader will suffer from the alpha test drawbacks. Even if the keyword is under a condition, USSE (PVR’s shader engine) assumes that the condition may be hit.
o Move discard into a separate shader.
o Draw opaque geometry first, then alpha-tested geometry, and alpha-blended geometry last.
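The draw order above can be sketched as a sort over a draw list. This is a minimal illustration in C, not code from the presentation: the DrawItem layout, the field names and the mesh_id tie-break are assumptions. Blended items are ordered back to front, which is the usual way to satisfy the earlier point that transparent geometry still has to be sorted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Pass values are ordered the way the passes should be drawn. */
typedef enum {
    PASS_OPAQUE = 0,
    PASS_ALPHA_TEST = 1,
    PASS_ALPHA_BLEND = 2
} RenderPass;

typedef struct {
    RenderPass pass;
    float view_depth; /* distance from the camera (hypothetical field) */
    int mesh_id;      /* hypothetical handle, also used as a tie-break */
} DrawItem;

static int compare_draw_items(const void *a, const void *b)
{
    const DrawItem *x = (const DrawItem *)a;
    const DrawItem *y = (const DrawItem *)b;

    if (x->pass != y->pass)
        return (x->pass > y->pass) - (x->pass < y->pass);

    /* Blended geometry: farthest first (back to front). */
    if (x->pass == PASS_ALPHA_BLEND && x->view_depth != y->view_depth)
        return (x->view_depth < y->view_depth) - (x->view_depth > y->view_depth);

    /* Otherwise keep a deterministic order. */
    return (x->mesh_id > y->mesh_id) - (x->mesh_id < y->mesh_id);
}

void sort_draw_items(DrawItem *items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_draw_items);
}
```

Opaque items are not depth-sorted here because HSR resolves their visibility per pixel; on an IMR you would additionally sort that pass front to back.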
12. What you might know
• Bandwidth matters
1. Use constant colour per object, instead of per
vertex
2. Simplify your models. Use smaller data types.
3. Use indexed triangles or non-indexed triangle
strips
4. Use VBO instead of client arrays
5. Use VAO
13. What you might not know
• VBO allocations are aligned to the 4KB page size. That means your small buffer for just a couple of triangles will occupy 4KB in memory; a large number of small VBOs can fragment and waste your memory.
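As a rough illustration of that overhead, the sketch below rounds a requested VBO size up to whole 4KB pages. The function names are made up for this example, and the real allocation is done by the driver, so treat this as arithmetic only.

```c
#include <stddef.h>

#define VBO_PAGE_SIZE 4096u /* page size quoted on the slide */

/* Bytes actually occupied once the request is rounded up to whole pages. */
size_t vbo_allocated_bytes(size_t requested_bytes)
{
    if (requested_bytes == 0)
        return 0;
    return (requested_bytes + VBO_PAGE_SIZE - 1) / VBO_PAGE_SIZE * VBO_PAGE_SIZE;
}

/* Bytes lost to page rounding for a single VBO. */
size_t vbo_wasted_bytes(size_t requested_bytes)
{
    return vbo_allocated_bytes(requested_bytes) - requested_bytes;
}
```

For example, a 96-byte VBO wastes 4000 bytes, so packing many small meshes into one shared VBO avoids paying that cost per mesh.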
14. What you might not know
• Updating your VBO data each frame:
1. glBufferSubData. If it is used to update a big part of the original data, it will harm performance. Try to avoid updating buffers that are currently in use.
2. glBufferData. It’s OK to completely overwrite the original data. The old data will be orphaned by the driver and new data storage will be allocated.
3. glMapBuffer with a triple-buffered VBO is the preferred way to update your data.
• Use EXT_map_buffer_range (iOS 6+ only) when you need to update only a subset of a buffer object.
15. What you might not know
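int bufferID = 0; // index of the VBO to write this frame
// Allocate storage for the 3 VBOs only; passing NULL uploads no data.
// bufferSize (assumed name) is the byte size of one frame's vertex data.
for (int i = 0; i < 3; ++i)
{
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
    glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW);
}
//...
// Each frame: map the current VBO and write directly into it.
glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
// update data here by writing through ptr
glUnmapBufferOES(GL_ARRAY_BUFFER);
// ...issue draw calls that read from vertexBuffer[bufferID]...
bufferID = (bufferID + 1) % 3; // cycle through the 3 buffers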
16. What you might not know
• This scheme will give you the best performance possible: no blocking of the CPU or GPU, no redundant memcpy operations and lower CPU load, at the cost of extra memory (note that you need no extra temporary buffer to store your data before sending it to the VBO). This is ideal for dynamic batching of sprites.
update(1), draw(1), gpuworking(..............)
update(2), draw(2), gpuworking(..............)
update(3), draw(3), gpuworking(..............)
17. What you might not know
• Float is the GPU’s native type
• …that means any other type will be converted
to float by USSE
• …costing a few additional cycles
• So it’s your choice of tradeoff between
bandwidth/storage and additional cycles
18. What you might know
• Use interleaved vertex data
– Align each vertex attribute to a 4-byte boundary
19. What you might not know
• If you don’t align your data, the driver will do it
for you.
• …resulting in slower performance.
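A minimal sketch of an interleaved layout that respects the 4-byte rule. The Vertex struct, its attribute choice and the helper name are hypothetical, picked only to show where explicit padding is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved vertex: every attribute starts on a 4-byte boundary. */
typedef struct {
    float    position[3]; /* offset 0, 12 bytes */
    int16_t  normal[3];   /* offset 12, 6 bytes of data */
    int16_t  normal_pad;  /* explicit padding so the next attribute starts at 20 */
    uint8_t  color[4];    /* offset 20, 4 bytes */
    uint16_t uv[2];       /* offset 24, 4 bytes */
} Vertex;

/* Returns 1 when an attribute offset is 4-byte aligned. */
int attribute_is_aligned(size_t offset)
{
    return offset % 4 == 0;
}
```

With this layout, sizeof(Vertex) would be passed as the stride to glVertexAttribPointer and offsetof(Vertex, ...) as the attribute offset. Without normal_pad the colour attribute would start at offset 18 and the driver would have to realign it.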
20. What you might not know
• The PowerVR SGX 5XT GPU series has a vertex
cache for the last 12 vertex indices. Optimise your
indexed geometry for this cache size.
• PowerVR Series 6 (XT) has 16k of vertex cache
• Take a look at optimisers that use Tom
Forsyth’s algorithm:
http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
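To check how well an index list suits a 12-entry cache, you can count misses offline. The sketch below assumes a simple FIFO replacement policy, which is a modelling assumption and not a documented property of the hardware; the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 12 /* cache entries, from the slide's SGX 5XT figure */

/* Counts vertex cache misses for an index list, modelling the cache as a FIFO. */
size_t count_cache_misses(const unsigned short *indices, size_t index_count)
{
    unsigned short cache[CACHE_SIZE];
    size_t used = 0;   /* number of valid entries */
    size_t next = 0;   /* slot overwritten on the next miss */
    size_t misses = 0;

    assert(indices != NULL || index_count == 0);
    for (size_t i = 0; i < index_count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) { hit = 1; break; }
        }
        if (!hit) {
            cache[next] = indices[i];
            next = (next + 1) % CACHE_SIZE;
            if (used < CACHE_SIZE) ++used;
            ++misses;
        }
    }
    return misses;
}
```

Dividing the miss count by the triangle count gives the average number of vertices transformed per triangle. A reordering tool should push that number down.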
21. What you might know
• Split your vertex data into two parts:
1. Static VBO: the one that will never be changed
2. Dynamic VBO: the one that needs to be
updated frequently
• Split your vertex data into several VBOs when several
meshes share the same set of attributes
23. What you might know
• Bandwidth matters
1. Use lower precision formats - RGBA4444,
RGBA5551
2. Use PVRTC compressed textures
3. Use atlases
4. Use mipmaps. They improve texture cache
efficiency and quality.
24. What you might not know
• Avoid the RGB8 format: texture data has to be
aligned, so the driver will pad RGB8 to RGBA8.
• Try to replace it with RGB565
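If your source art is RGB8, the conversion can be done offline or at load time. A minimal sketch follows; the function names are made up, and the conversion simply drops the low bits of each channel without dithering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packs 8-bit channels into RGB565 by keeping the top 5/6/5 bits of each. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Converts a tightly packed RGB8 image to RGB565; dst must hold pixel_count entries. */
void convert_rgb8_to_rgb565(const uint8_t *src, uint16_t *dst, size_t pixel_count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixel_count; ++i) {
        dst[i] = pack_rgb565(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
    }
}
```

The result can be uploaded with format GL_RGB and type GL_UNSIGNED_SHORT_5_6_5, at 2 bytes per texel instead of the 4 that padded RGB8 would take.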
25. What you might not know
• Why PVRTC?
1. PVRTC provides great compression, resulting in
smaller texture size, better cache use, saved
bandwidth and decreased power consumption
2. PVRTC stores pixel data in the GPU’s native order, i.e.
BGRA instead of RGBA, in blocks optimised for the
data access pattern.
26. What you might not know
• It doesn’t matter whether your textures are in
RGBA or BGRA format: the driver will still do
internal processing on the texture data to
improve memory access locality and cache
efficiency.
27. What you might not know
• On PVR 6 (XT) the driver reserves memory for both
the texture and its mip map chain, but commits
memory only for mip level 0.
• If you decide to generate mip maps, the driver
commits the pages reserved for the mip chain.
• That is the expected behaviour.
28. What you might not know
• On PVR 5/5MP (tested on iOS versions 4 to 7.1.1)
the driver will ALWAYS commit memory for mip maps,
regardless of whether you requested them or
not.
• That means you’ll waste 33% of memory!
• In most cases you don’t need mip maps for 2D
games, but you are forced to pay this overhead.
• That’s too bad for 2D games. However, there is one
workaround: make your textures NPOT (non-power
of two).
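The 33% figure can be checked with a few lines of arithmetic. This sketch assumes an uncompressed format with a fixed number of bytes per pixel; the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a full mip chain, from the base level down to 1x1. */
size_t mip_chain_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    size_t total = 0;

    assert(width > 0 && height > 0);
    for (;;) {
        total += width * height * bytes_per_pixel;
        if (width == 1 && height == 1) break;
        if (width > 1) width /= 2;   /* each level halves both dimensions */
        if (height > 1) height /= 2;
    }
    return total;
}
```

For a 1024x1024 RGBA8 texture the base level is 4194304 bytes and the full chain is 5592404 bytes, so the mip levels add 1398100 bytes, about a third of the base level.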
29. What you might not know
• Luckily, there is one solution to this problem.
• Core OpenGL ES 2.0 doesn’t support mip maps
for NPoT (non power of two) textures, so if
you make your textures NPoT, you will
not pay this memory overhead.
30. What you might not know
• Interesting notes:
• The glTexImage2D driver implementation has a
function CheckFastPath. When you upload a
PoT texture you hit this fast path. NPoT
textures miss it.
• When you upload a lot of textures your
VRAM gets fragmented, so the driver will
remap memory, i.e. it will create one big
buffer for a few small textures and move
them into that buffer
31. What you might not know
• Let’s take a look at the texture upload process.
• The usual way to do this:
1. Load the texture into a temporary buffer in RAM
1. Decode the texture if it is stored in a compressed file format
(JPG/PNG)
2. Feed this buffer to glTexImage2D
3. Draw!
• Looks simple, but is it the fastest way?
32. What you might not know
• …NO!
void* buf = malloc(TEXTURE_SIZE); // 4 MB for an RGBA8 1024x1024 texture
// LoadTexture is the slide's placeholder loader; assumed here to decode the file into buf
LoadTexture(textureName, buf);
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, buf);
// buf is copied into an internal buffer created by the driver (that's obvious)
free(buf); // the buffer can be freed immediately after glTexImage2D
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
// the driver does some additional work to fully upload the texture the first time it is actually used!
• A lot of redundant work!
33. What you might not know
• Jedi way to upload textures:
int fileHandle = open(filename, O_RDONLY); // the file must hold raw pixel data in the upload format
void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); // file mapping
close(fileHandle); // the mapping stays valid after the descriptor is closed
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
// the driver does some additional work to fully upload the texture the first time it is actually used!
munmap(ptr, TEXTURE_SIZE);
• File mapping does not copy your file data into RAM up front! It
loads file data page by page, as it is accessed.
• Thus we eliminate one redundant copy, dramatically
decrease texture upload time and decrease memory
fragmentation
34. What you might not know
• Keep in mind that textures are finally wired only
when they are used for the first time. So draw them
off screen immediately after glTexImage2D,
otherwise it will take too long to render the
first frame and it will be nearly impossible to
track down the cause.
35. What you might not know
• NPOT textures work only with the
GL_CLAMP_TO_EDGE wrap mode
• POT textures are preferable, they give you the best
performance possible
• Use NPOT textures with dimensions that are multiples of
32 pixels for best performance
• The driver will pad the data of your NPOT texture to
match the size of the closest POT values.
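A small sketch of the two roundings mentioned above, useful when budgeting memory for NPOT textures. The function names are made up, and the padding rule is taken from the slide, not from driver documentation.

```c
#include <assert.h>

/* Smallest power of two that is >= v (v must be in 1..2^31). */
unsigned next_pot(unsigned v)
{
    unsigned p = 1;

    assert(v > 0 && v <= 0x80000000u);
    while (p < v) p <<= 1;
    return p;
}

/* Rounds a dimension up to the next multiple of 32 pixels. */
unsigned round_up_32(unsigned v)
{
    return (v + 31u) & ~31u;
}

/* Texel count after padding both dimensions to the next power of two. */
unsigned padded_texels(unsigned width, unsigned height)
{
    return next_pot(width) * next_pot(height);
}
```

Under that rule a 600x400 texture is stored as 1024x512, that is 524288 texels for 240000 texels of real data.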
36. What you might not know
• Prefer OES_texture_half_float over
OES_texture_float
• A texture read fetches only 32 bits per texel, so an RGBA float
texture results in 4 texture reads
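The read count follows directly from the texel size. A tiny sketch, assuming the 32-bits-per-read figure from the slide; the function name is made up.

```c
#include <assert.h>

/* Number of 32-bit fetches needed for one texel. */
unsigned reads_per_texel(unsigned channels, unsigned bits_per_channel)
{
    unsigned bits = channels * bits_per_channel;

    assert(channels > 0 && bits_per_channel > 0);
    return (bits + 31u) / 32u; /* round up to whole 32-bit reads */
}
```

By this count RGBA float needs 4 reads, RGBA half float needs 2 and RGBA8 needs 1.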
37. What you might not know
• Always use glClear at the beginning of the
frame…
• … and EXT_discard_framebuffer at the end.
• PVR GPUs have a fast on-chip
depth/stencil buffer for each tile. If you forget
to clear/discard the depth buffer, it will be
uploaded from HW to SW
38. What you might know
• Prefer multi texturing instead of multiple
passes
• Configure texture parameters before feeding
image data to driver
40. What you might know
• Be wise with precision hints
• Avoid branching
• Eliminate loops
• Avoid discard. If you must use it, place the discard
instruction as early as possible to avoid useless
computations
41. What you might not know
• Code inside a dynamic branch (one whose condition is
a non-constant value) will be executed anyway,
and then the result will be thrown away if the condition is
false
42. What you might not know
• highp – represents 32 bit floating point value
• mediump – represents 16 bit floating point
value in range of [-65520, 65520]
• lowp – 10 bit fixed point values in range of [-2,
2] with step of 1/256
• Try to give the same precision to all your
operands, because conversion takes some time
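To see what lowp does to a value, here is a CPU-side model of the format as the slide describes it: step 1/256, clamped to [-2, 2]. It is only an illustration of the precision loss and not how the GPU implements it; the function name is made up.

```c
#include <assert.h>

/* Models lowp per the slide: fixed point, step 1/256, range [-2, 2]. */
float quantize_lowp(float x)
{
    int steps;

    if (x > 2.0f) x = 2.0f;   /* values outside the range saturate */
    if (x < -2.0f) x = -2.0f;
    /* round to the nearest multiple of 1/256 */
    steps = (int)(x * 256.0f + (x >= 0.0f ? 0.5f : -0.5f));
    assert(steps >= -512 && steps <= 512);
    return (float)steps / 256.0f;
}
```

Colours in [0, 1] survive this well, but anything larger than 2, such as a texture coordinate on a big atlas, does not fit in lowp at all.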
43. What you might not know
• highp values are calculated on a scalar
processor only on USSE1 (that is, PVR 5):
highp vec4 v1, v2;
highp float s1, s2;
v2 = (v1 * s1) * s2;
// the scalar processor executes v1 * s1 (4 operations), then that result is multiplied by s2 on
// the scalar processor again (4 more operations): 8 in total
v2 = v1 * (s1 * s2);
// s1 * s2 is 1 operation on the scalar processor; result * v1 is 4 operations: 5 in total
45. What you might know
• Typical CPU found in mobile devices:
1. ARMv7/ARMv8 architecture
2. Cortex AX/Krait/Swift or Cyclone
3. Up to 2300 MHz
4. Up to 8 cores
5. Thumb-2 instruction set
46. What you might not know
• ARMv7 does not guarantee hardware integer
division (Cortex A8 and A9 have none)
• VFPv3, VFPv4 FPU
• NEON SIMD engine
• Unaligned access is done in software on Cortex
A8. That means it is a hundred times slower
• Cortex A8 is an in-order CPU. Cortex A9+ are out
of order
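Where the divisor is known, division can often be replaced by shifts and adds. The sketch below shows three common cases; the function names are made up. Compilers usually do this on their own for constant divisors, so it matters mostly when the divisor is hidden from the compiler.

```c
#include <assert.h>
#include <stdint.h>

/* x / 2^shift for unsigned values: a shift instead of a division. */
uint32_t div_pow2(uint32_t x, unsigned shift)
{
    assert(shift < 32);
    return x >> shift;
}

/* x % 2^shift for unsigned values: a mask instead of a remainder. */
uint32_t mod_pow2(uint32_t x, unsigned shift)
{
    assert(shift < 32);
    return x & ((UINT32_C(1) << shift) - 1u);
}

/* Exact x / 255 for 0 <= x <= 255 * 255, a common step in 8-bit colour blending. */
uint32_t div255(uint32_t x)
{
    assert(x <= 65025u);
    return (x + 1u + (x >> 8)) >> 8;
}
```

div255 is meant for the product of two 8-bit channels, which is why its input is limited to 255 * 255.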
47. What you might not know
• A Cortex A9+ core has a full VFPv3 FPU, while
Cortex A8 has VFPLite. That means float
operations take 1 cycle on A9 and 10 cycles on
A8!
48. What you might not know
• NEON has 16 registers, each 128 bits wide.
It supports operations on 8-, 16-, 32- and 64-bit
integers and 32-bit float values
• NEON can be used for:
– Software geometry instancing;
– Skinning;
– As a general vertex processor;
– Other, typical, applications for SIMD.
49. What you might not know
• There are 3 ways to use the NEON engine in your
code:
1. Intrinsics
1.1 GLKMath
2. Handwritten NEON assembly
3. Autovectorization. Add -mllvm -vectorize
-mllvm -bb-vectorize-aligned-only to Other C/C++
Flags in project settings and you are ready to go.
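As an illustration of the intrinsics route, here is a minimal sketch of a scale-and-add loop. The function name and the HAVE_NEON macro are made up; the intrinsics come from arm_neon.h. The code falls back to plain C when NEON is not available, so it builds on any target.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* dst[i] = a[i] * scale + b[i] for i in [0, count). */
void scale_add(float *dst, const float *a, const float *b, float scale, size_t count)
{
    size_t i = 0;

    assert(dst != NULL && a != NULL && b != NULL);
#if HAVE_NEON
    /* Process 4 floats per iteration in one 128-bit NEON register. */
    for (; i + 4 <= count; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vr = vmlaq_n_f32(vb, va, scale); /* vb + va * scale */
        vst1q_f32(dst + i, vr);
    }
#endif
    /* Scalar tail, and the whole loop when NEON is unavailable. */
    for (; i < count; ++i) {
        dst[i] = a[i] * scale + b[i];
    }
}
```

The same pattern (vector body plus scalar tail) carries over to skinning or sprite batching, where each vertex goes through the same arithmetic.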
53. What you might not know
• Summary:
                     Running time, ms   CPU usage, %
Intrinsics           2764               19
Assembly             3664               20
FPU                  6209               25-28
FPU autovectorized   5028               22-24
• Intrinsics gave me a 25% speedup over assembly.
• Note that the speed of code generated from
intrinsics varies from compiler to compiler.
Modern compilers are really good at this.
54. What you might not know
• Intrinsics advantages over assembly:
– Higher level code;
– Much simpler;
– No need to manage registers;
– You can vectorize basic blocks and build a
solution for every new problem from these
blocks. With assembly, you have to
solve each new problem from scratch;
55. What you might not know
• Assembly advantages over intrinsics:
– Code generated from intrinsics varies from
compiler to compiler and can show a really
big difference in speed. Assembly code will
always be the same.
59. What you might not know
• For a detailed explanation of
intrinsics/assembly see:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491e/CIHJBEFE.html
Editor's Notes
In this presentation I am going to talk mostly about Imagination Technologies GPUs. They account for at least 50% of the market. I did all tests on iOS, but I assume you’ll get the same behaviour on Android.
This presentation consists of several parts, each dedicated to optimisation problems in one area.