Enabling Low-level Intrinsics in Burst
Who this talk is for
3
— You’re interested in CPU performance
— You’re considering porting engine systems to HPC#
— You just go to technical talks because it’s cool
Talk Contents
4
— Quick introduction to SIMD topics
— Options for SIMD programming in HPC# today
— The case for intrinsics and typeless SIMD
— Case studies for intrinsics
— Q & A
SIMD: Some background
5
What is SIMD?
6
— Single Instruction, Multiple Data
– Doing more than one thing at a time
— Available on essentially all hardware today in some form
– Capabilities vary, but a few families exist
— ARM Neon
— x86/64 SSE and AVX
SIMD Analogy: Chopping Veggies
7
Input Data
Output Data
Instruction
Preprocessed!
Why is SIMD important?
8
— It’s more efficient to do more with fewer instructions
— There is dedicated hardware for this stuff
— Often the only way you can get the max cache bandwidth
Dedicated Hardware: Skylake Example
9
Cache Bandwidth
10
— L1 Caches can deliver N bits every cycle
– N typically much larger than 64
– 128 or 256 bits per cycle common in most CPUs today
— Without using SIMD instructions, only get a fraction of this
– Important part of not leaving performance on the table
Cache Bandwidth
11
— This matters when you are processing in cache
– Which is what we hope to do most of the time
— Processing floats (4 bytes, 32 bits) one at a time with 128-bit cache bandwidth
– Each load uses 32 of the 128 bits, wasting 75% of the bandwidth from the cache
— Better: Process 4 floats at a time
– Full cache utilization
The vector fallacy
12
— SIMD and mathematical vectors are mostly unrelated
— Lots of confusion around this issue
— False: Using a vector math library is somehow SIMD
— True: Working with arrays of data can lead to opportunities for SIMD (but not always)
— It’s especially problematic with 3-component vectors as we’ll see
Quick smell test for float SIMD-ness (x86)
13
— xxxps instructions feeding into each other?
– It’s probably SIMD code
— xxxss instructions?
– Scalar code
— Occasional xxxps instructions with infrastructure?
– Mix of SIMD and scalar, red flag!
Example: 3D dot products
14
public static float DotExample(float3 a, float3 b)
{
    return math.dot(a, b);
}
mulps xmm0, xmmword ptr [rdx]   ; 4-wide mul, only 3 lanes valid
movshdup xmm1, xmm0             ; shuffle overhead
addss xmm1, xmm0                ; scalar addition
movhlps xmm0, xmm0              ; shuffle overhead
addss xmm0, xmm1                ; scalar addition
1 SIMD op, 2 infrastructure, 2 scalar ops => 1 dot product
Example: 3D dot products
15
— But wait a minute, what is a dot product?
— For 3D:
– a.x * b.x + a.y * b.y + a.z * b.z
— What if we go back to basics and base our code on this?
Example: 3D dot products (back to basics)
16
public static float DotExample1(
    float ax, float ay, float az,
    float bx, float by, float bz)
{
    return ax * bx + ay * by + az * bz;
}
mulss xmm0, xmm3                    ; mul
mulss xmm1, dword ptr [rsp + 40]    ; mul
addss xmm0, xmm1                    ; add
mulss xmm2, dword ptr [rsp + 48]    ; mul
addss xmm0, xmm2                    ; add
0 SIMD ops, 0 infrastructure, 5 scalar ops => 1 dot product
Example: 3D dot products (SIMD)
17
public static float4 DotExample4(
    float4 ax, float4 ay, float4 az,
    float4 bx, float4 by, float4 bz)
{
    return ax * bx + ay * by + az * bz;
}
mulps xmm2, xmmword ptr [r9]    ; 4-wide mul
mulps xmm1, xmmword ptr [rcx]   ; 4-wide mul
addps xmm1, xmm2                ; 4-wide add
mulps xmm0, xmmword ptr [rax]   ; 4-wide mul
addps xmm0, xmm1                ; 4-wide add
5 SIMD ops, 0 infrastructure, 0 scalar ops => 4 dot products
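The vertical layout behind the 4-wide version can also be sketched without any SIMD types. The plain C# below is an illustration only: `DotSketch.DotSoA` and its parameter names are hypothetical and not part of Unity.Mathematics, and it assumes the x, y and z components already live in separate arrays of equal length. Each iteration touches only element i of every array, which is the independent-lane shape that maps onto 4-wide multiplies and adds.

```csharp
public static class DotSketch
{
    // Computes one dot product per index from structure-of-arrays inputs.
    // No horizontal adds are needed: result[i] depends only on element i
    // of each input array.
    public static void DotSoA(
        float[] ax, float[] ay, float[] az,
        float[] bx, float[] by, float[] bz,
        float[] result)
    {
        for (int i = 0; i < result.Length; ++i)
        {
            result[i] = ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i];
        }
    }
}
```

Whether a compiler turns this loop into SIMD code depends on the rules the auto-vectorizer applies, so the generated code still needs to be inspected.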
SIMD mindset
18
— Important not to think in terms of abstractions
— Don’t think about float4 as “a 4-D vector”
— Better: 4 floats is the width of the vector unit on this CPU
— Get used to the idea of 128 or 256 bit blocks of data
– Divide into whatever size is convenient
— What fits in 128 bits?
– 16 bytes
– 8 shorts
– 4 floats or ints
– 2 doubles or longs
SIMD mindset, contd.
19
— Try to find opportunities to compute independent values
– Like the 4 independent dot products we just saw
— Fight the urge to think of vectors as horizontal values
– Horizontal operations often go against the grain of SIMD instructions
— Typically scalar code without abstractions vectorizes well
– float3, float2 etc can be convenient but often get in the way
So how do you get SIMD code?
20
— In HPC# we’ve had two options so far:
– LLVM auto-vectorization
– Unity.Mathematics explicit SIMD
LLVM’s auto vectorizer
21
— Simple mode: Write scalar code, get SIMD code out
— For simple loops, LLVM is often able to generate SIMD code
— Checklist to look at before expecting SIMD:
– Data ranges must not alias
– Data must be contiguous in memory (for wide loads)
– Data types must be integer or float with fast-math
– Branches are kept to a minimum
– There is no cross-element interference
LLVM’s auto vectorizer
22
— Pros
– Simpler code to read/write (at face value)
– Often gives you a speedup where you didn’t expect one
— Cons
– Need to learn a bunch of rules to get SIMD code from loops
– No way to tell when you’ve stopped getting SIMD
– (We’re looking at ways to make this a compile error if desired)
– Hard to reinterpret data types
– Often surprising what will not vectorize
Example of successful vectorization
23
[BurstCompile]
public struct VectorizeDemo : IJob
{
    public NativeArray<int> Inputs;
    public NativeArray<int> Outputs;
    public void Execute()
    {
        for (int i = 0; i < Inputs.Length; ++i)
        {
            if (Inputs[i] >= 0)
            {
                Outputs[i] = Inputs[i];
            }
            else
            {
                Outputs[i] = 0;
            }
        }
    }
}
.LBB0_7:
vpmaxsd ymm1, ymm0, ymmword ptr [r10 + 4*rdx]
vpmaxsd ymm2, ymm0, ymmword ptr [r10 + 4*rdx + 32]
vpmaxsd ymm3, ymm0, ymmword ptr [r10 + 4*rdx + 64]
vpmaxsd ymm4, ymm0, ymmword ptr [r10 + 4*rdx + 96]
vmovdqu ymmword ptr [rcx + 4*rdx], ymm1
vmovdqu ymmword ptr [rcx + 4*rdx + 32], ymm2
vmovdqu ymmword ptr [rcx + 4*rdx + 64], ymm3
vmovdqu ymmword ptr [rcx + 4*rdx + 96], ymm4
add rdx, 32
cmp rax, rdx
jne .LBB0_7
Example of unsuccessful vectorization
24
[BurstCompile]
public struct VectorizeDemo : IJob
{
    public NativeArray<int> Inputs;
    public NativeArray<int> Outputs;
    public void Execute()
    {
        for (int i = 0; i < Inputs.Length; ++i)
        {
            if (Inputs[i] >= 0)
            {
                Outputs[i] = Inputs[i] * 2;
            }
            else
            {
                Outputs[i] = 0;
            }
        }
    }
}
.LBB0_2:
mov edx, dword ptr [r10 + 4*rax]
lea ecx, [rdx + rdx]
test edx, edx
cmovs ecx, r8d
mov dword ptr [r11 + 4*rax], ecx
inc rax
cmp r9, rax
jne .LBB0_2
Explicit SIMD with Unity.Mathematics
25
— Use e.g. float4, int4 vertically (as in dot product example)
— Maps directly to LLVM vector types, you will get vector code
— Checklist:
– Avoid branches, use select/mask idioms
– Use native arrays, with ReinterpretLoad/Store as needed
– Handle end-of-array cases manually
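The checklist above implies a particular loop structure: a main loop over whole 4-element blocks, a mask-based select instead of a branch, and a separate tail for leftover elements. The following is a minimal sketch of that structure in plain C#, with a hand-written 4-lane inner loop standing in for an `int4`. The names `ClampSketch`, `ClampNegativesToZero` and `SelectNonNegative` are made up for this illustration, and the sketch assumes `output` is at least as long as `input`.

```csharp
public static class ClampSketch
{
    // Writes max(input[i], 0) to output[i] without branching on the data.
    public static void ClampNegativesToZero(int[] input, int[] output)
    {
        int n = input.Length;
        int blockEnd = n - (n % 4); // number of elements covered by whole blocks of 4

        // Main loop: whole blocks of 4, the part an int4 would handle.
        for (int i = 0; i < blockEnd; i += 4)
        {
            for (int lane = 0; lane < 4; ++lane)
            {
                output[i + lane] = SelectNonNegative(input[i + lane]);
            }
        }

        // Tail: the 0 to 3 leftover elements are handled one at a time.
        for (int i = blockEnd; i < n; ++i)
        {
            output[i] = SelectNonNegative(input[i]);
        }
    }

    // Mask idiom: v >> 31 is all ones for negative v, so ~(v >> 31)
    // keeps non-negative values and zeroes negative ones.
    private static int SelectNonNegative(int v)
    {
        int keepMask = ~(v >> 31);
        return v & keepMask;
    }
}
```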
Explicit Unity.Mathematics SIMD Example
26
public static IntersectResult Intersect2(NativeArray<PlanePacket4> cullingPlanePackets, AABB a)
{
    // …
    int4 outCounts = 0;
    int4 inCounts = 0;
    for (int i = 0; i < cullingPlanePackets.Length; i++) {
        var p = cullingPlanePackets[i];
        float4 distances = dot4(p.Xs, p.Ys, p.Zs, mx, my, mz) + p.Distances;
        float4 radii = dot4(ex, ey, ez, math.abs(p.Xs), math.abs(p.Ys), math.abs(p.Zs));
        outCounts += (int4) (distances + radii <= 0);
        inCounts += (int4) (distances > radii);
    }
    int inCount = math.csum(inCounts);
    int outCount = math.csum(outCounts);
    if (outCount != 0)
        return IntersectResult.Out;
    else
        return (inCount == 4 * cullingPlanePackets.Length) ? IntersectResult.In : IntersectResult.Partial;
}
The Case For Intrinsics
27
The need for typeless SIMD
28
— In the engine space it’s frequently useful to reinterpret data
— Want control over instruction selection for particular HW
— Want to leverage tricks that compilers don’t use
Data reinterpretation
29
— Work with float bits using integer operations
— Example: Converting small integers to floats
ushort x = ...;
uint y = x | 0x4b000000u;                 // 0x4b000000 is the bit pattern of 8388608.0f (2^23)
float f = math.asfloat(y) - 8388608.0f;   // subtract 2^23 to leave x as a float
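The trick works because 0x4b000000 is the bit pattern of 2^23, the magnitude at which a float's mantissa counts in steps of exactly one, so OR-ing a small integer into the low mantissa bits yields 2^23 + x. A minimal sketch in standard .NET follows; it uses `BitConverter.Int32BitsToSingle` for the reinterpretation rather than any Burst or Unity.Mathematics call, and `ReinterpretSketch.SmallIntToFloat` is a name chosen for this illustration.

```csharp
using System;

public static class ReinterpretSketch
{
    // Converts a 16-bit unsigned integer to float using only a bitwise OR,
    // a reinterpretation of the bits and one float subtraction.
    public static float SmallIntToFloat(ushort x)
    {
        // x lands in the low mantissa bits of the float 8388608.0f (2^23).
        int bits = x | 0x4b000000;
        return BitConverter.Int32BitsToSingle(bits) - 8388608.0f;
    }
}
```

For any `ushort` input the result should equal a plain `(float)x` cast, since every value involved is exactly representable.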
Instruction selection
30
— Often useful to base core engine loops around specific h/w
— Example: x86 pmulhrsw
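For context, pmulhrsw (the `_mm_mulhrs_epi16` intrinsic in C) multiplies packed signed 16-bit integers and returns the rounded high part of each product, which makes it a good fit for Q15 fixed-point math. The sketch below models a single 16-bit lane in scalar C#; the real instruction does this for 8 lanes of a 128-bit register at once. `MulHrsSketch.MulHrs` is a hypothetical helper written for this illustration, not a Burst API.

```csharp
public static class MulHrsSketch
{
    // Scalar model of one 16-bit lane of pmulhrsw.
    public static short MulHrs(short a, short b)
    {
        int product = a * b;                // full 32-bit signed product
        int rounded = (product >> 14) + 1;  // keep the top 18 bits, add the rounding bit
        return unchecked((short)(rounded >> 1)); // bits [16:1] of the rounded value
    }
}
```

In Q15 terms, `MulHrs(16384, 16384)` multiplies 0.5 by 0.5 and yields 8192, which is 0.25. The one overflow case is -32768 times -32768, which wraps back to -32768.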
Leveraging data tricks
31
— Many tricks are not in the repertoire of most compilers
— Example: Quickly generating mask from sign of float data
float x = ...;
uint mask = (uint)(math.asint(x) >> 31);   // arithmetic shift: all ones if the sign bit is set, else zero
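The mask is useful because it can drive a branch-free select. The sketch below shows both halves in standard .NET, using `BitConverter.SingleToInt32Bits` for the reinterpretation; `SignMaskSketch`, `SignMask` and `Select` are names invented for this illustration and are not part of Burst or Unity.Mathematics.

```csharp
using System;

public static class SignMaskSketch
{
    // All ones when the sign bit of x is set (negative values and -0.0f),
    // otherwise zero. The shift is arithmetic because the operand is a signed int.
    public static uint SignMask(float x)
    {
        return unchecked((uint)(BitConverter.SingleToInt32Bits(x) >> 31));
    }

    // Branch-free select: takes bits from ifSet where mask bits are one,
    // and from ifClear where they are zero.
    public static uint Select(uint mask, uint ifSet, uint ifClear)
    {
        return (ifSet & mask) | (ifClear & ~mask);
    }
}
```

For example, `Select(SignMask(x), a, b)` picks `a` for negative `x` and `b` otherwise, without a comparison or a jump.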
Burst Intrinsics
32
What we’re working on
33
— Typeless SIMD library of intrinsics
— Start with x86, with ARM to come
— Good C# integration with debugging considerations
Typeless?
34
— Types are mostly an annoyance for real world SIMD
— Often need to reinterpret float/int
— Often need to deal with masks, which are unclearly typed
— Canonical example: comparisons
– _mm_cmpeq_ps – returns a mask of all ones when equal
– So… is that a float? Or an int?
Do what the hardware does
35
— The hardware just has registers, not types (obviously)
— That’s what we expose in our intrinsics API
— m128 – 128 bit SIMD register
— m256 – 256 bit SIMD register
— Instructions determine how the register contents are interpreted
API Usage Example
36
using static Burst.Compiler.IL.x86;
// …
m128 a, b = …;
m128 mask = cmpeq_ps(a, b);
// …
API Extract
37
C# Reference Implementation
// _mm_cmpeq_ps
/// <summary> Compare packed single-precision (32-bit)
/// floating-point elements in "a" and "b" for equality,
/// and store the results in "dst". </summary>
[X86InstructionFamily(InstructionFamily.SSE)]
[DebuggerStepThrough]
public static m128 cmpeq_ps(m128 a, m128 b)
{
    m128 dst = default(m128);
    dst.UInt0 = a.Float0 == b.Float0 ? ~0u : 0;
    dst.UInt1 = a.Float1 == b.Float1 ? ~0u : 0;
    dst.UInt2 = a.Float2 == b.Float2 ? ~0u : 0;
    dst.UInt3 = a.Float3 == b.Float3 ? ~0u : 0;
    return dst;
}
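The reference implementation reads the same register as floats and writes it as uints, which only works if both views share storage. One way to model that in plain C# is an explicit-layout struct whose float and uint fields overlap. This is a sketch under that assumption: `Reg128` and `Reg128Ops` are hypothetical stand-ins, and nothing here is claimed to match how Burst's `m128` is actually defined.

```csharp
using System.Runtime.InteropServices;

// Hypothetical stand-in for a typeless 128-bit register: the same 16 bytes
// can be read as four floats or as four uints.
[StructLayout(LayoutKind.Explicit, Size = 16)]
public struct Reg128
{
    [FieldOffset(0)]  public float Float0;
    [FieldOffset(4)]  public float Float1;
    [FieldOffset(8)]  public float Float2;
    [FieldOffset(12)] public float Float3;

    [FieldOffset(0)]  public uint UInt0;
    [FieldOffset(4)]  public uint UInt1;
    [FieldOffset(8)]  public uint UInt2;
    [FieldOffset(12)] public uint UInt3;

    public static Reg128 FromFloats(float f0, float f1, float f2, float f3)
    {
        Reg128 r = default(Reg128);
        r.Float0 = f0; r.Float1 = f1; r.Float2 = f2; r.Float3 = f3;
        return r;
    }
}

public static class Reg128Ops
{
    // Per-lane float equality producing an all-ones or all-zeros mask.
    public static Reg128 CmpEqPs(Reg128 a, Reg128 b)
    {
        Reg128 dst = default(Reg128);
        dst.UInt0 = a.Float0 == b.Float0 ? ~0u : 0u;
        dst.UInt1 = a.Float1 == b.Float1 ? ~0u : 0u;
        dst.UInt2 = a.Float2 == b.Float2 ? ~0u : 0u;
        dst.UInt3 = a.Float3 == b.Float3 ? ~0u : 0u;
        return dst;
    }

    // Bitwise AND of the raw bits; the lanes are never interpreted as numbers.
    public static Reg128 AndPs(Reg128 a, Reg128 b)
    {
        Reg128 dst = default(Reg128);
        dst.UInt0 = a.UInt0 & b.UInt0;
        dst.UInt1 = a.UInt1 & b.UInt1;
        dst.UInt2 = a.UInt2 & b.UInt2;
        dst.UInt3 = a.UInt3 & b.UInt3;
        return dst;
    }
}
```

AND-ing a float register with a comparison mask keeps the lanes that compared equal and zeroes the rest, which answers the "is that a float or an int?" question: it is whichever the next instruction treats it as.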
A more complete example
38
A more complete example
39
For each door:
    open = 0
    For each player position:
        if player in range and correct team:
            open = 1
    store open state for door
A more complete example
40
— Basic N vs M test
— N doors, M players
public struct Door
{
    public float3 Pos;
    public float RadiusSquared;
    public int Team;
}
public struct DoorTestPos
{
    public float3 Pos;
    public int Team;
}
Reference version
41
[BurstCompile]
public struct DoorTest_Reference : IJob
{
    public NativeArray<Door> Doors;
    public NativeArray<DoorTestPos> TestPos;
    public NativeArray<int> DoorOpenStates;
    public void Execute() {
        for (int j = 0; j < Doors.Length; ++j) {
            bool shouldOpen = false;
            for (int i = 0; i < TestPos.Length; ++i) {
                float3 delta = TestPos[i].Pos - Doors[j].Pos;
                float dsq = math.csum(delta * delta);
                if (dsq < Doors[j].RadiusSquared && Doors[j].Team == TestPos[i].Team) {
                    shouldOpen = true;
                    break;
                }
            }
            DoorOpenStates[j] = shouldOpen ? 1 : 0;
        }
    }
}
Reference disassembly
42
.LBB0_6:
vmovsd xmm2, qword ptr [rsi - 12]
vinsertps xmm2, xmm2, dword ptr [rsi - 4], 32
vsubps xmm2, xmm2, xmm0
vmulps xmm2, xmm2, xmm2
vmovshdup xmm3, xmm2
vpermilpd xmm4, xmm2, 1
vaddss xmm3, xmm3, xmm4
vaddss xmm2, xmm2, xmm3
vucomiss xmm2, xmm1
jae .LBB0_10 ; not inside radius?
mov ebx, dword ptr [rdx]
cmp ebx, dword ptr [rsi]
je .LBB0_8 ; break out of loop
.LBB0_10:
inc rdi
add rsi, 16
cmp rdi, rax
jl .LBB0_6
Let’s lose the branches
43
public void Execute() {
    for (int j = 0; j < Doors.Length; ++j) {
        bool shouldOpen = false;
        for (int i = 0; i < TestPos.Length; ++i) {
            float3 delta = TestPos[i].Pos - Doors[j].Pos;
            float dsq = math.csum(delta * delta);
            bool inRadius = dsq < Doors[j].RadiusSquared;
            bool teamMatches = Doors[j].Team == TestPos[i].Team;
            // Non-short-circuit & and |= keep the loop body free of branches
            shouldOpen |= inRadius & teamMatches;
        }
        DoorOpenStates[j] = shouldOpen ? 1 : 0;
    }
}
Branch-free disassembly
44
.LBB0_4:
vmovsd xmm2, qword ptr [rdi - 12]
vinsertps xmm2, xmm2, dword ptr [rdi - 4], 32
vsubps xmm2, xmm2, xmm0
vmulps xmm2, xmm2, xmm2
vmovshdup xmm3, xmm2
vpermilpd xmm4, xmm2, 1
vaddss xmm3, xmm3, xmm4
vaddss xmm2, xmm2, xmm3
vucomiss xmm2, xmm1
setb al
cmp ebp, dword ptr [rdi]
sete dl
and dl, al
movzx eax, dl
or esi, eax
add rdi, 16
dec rbx
jne .LBB0_4
Explicit SIMD with Unity Mathematics
45
public struct DoorGroup
{
    public float4 Xs;
    public float4 Ys;
    public float4 Zs;
    public float4 RadiiSquared;
    public int4 Teams;
}
public NativeArray<DoorGroup> Doors;
Explicit SIMD with Unity Mathematics
46
for (int j = 0; j < Doors.Length; ++j) {
    bool4 openMask = false;
    for (int i = 0; i < TestPos.Length; ++i) {
        float4 xdeltas = TestPos[i].X - Doors[j].Xs;
        float4 ydeltas = TestPos[i].Y - Doors[j].Ys;
        float4 zdeltas = TestPos[i].Z - Doors[j].Zs;
        float4 xdsq = xdeltas * xdeltas;
        float4 ydsq = ydeltas * ydeltas;
        float4 zdsq = zdeltas * zdeltas;
        float4 dsq = xdsq + ydsq + zdsq;
        bool4 rangeMask = dsq < Doors[j].RadiiSquared;
        bool4 teamMask = TestPos[i].Team == Doors[j].Teams;
        openMask |= teamMask & rangeMask;
    }
    DoorOpenStates[j] = math.select(new int4(0), new int4(1), openMask);
}
Explicit Math version disassembly
47
.LBB0_2:
vbroadcastss xmm0, dword ptr [rdx - 12]
vsubps xmm0, xmm0, xmm11
vbroadcastss xmm2, dword ptr [rdx - 8]
vsubps xmm2, xmm2, xmm4
vbroadcastss xmm3, dword ptr [rdx - 4]
vsubps xmm3, xmm3, xmm5
vmulps xmm0, xmm0, xmm0
vmulps xmm2, xmm2, xmm2
vmulps xmm3, xmm3, xmm3
vaddps xmm0, xmm0, xmm3
vaddps xmm0, xmm2, xmm0
vcmpltps xmm0, xmm0, xmm7
vpcmpeqd xmm2, xmm1, xmmword ptr [rdx]
vpand xmm0, xmm2, xmm0
vpsrld xmm0, xmm0, 31
vpor xmm6, xmm6, xmm0
add rdx, 28
dec rsi
jne .LBB0_2
Explicit SIMD with Burst Intrinsics
48
public struct Door4
{
    public m128 Xs;
    public m128 Ys;
    public m128 Zs;
    public m128 RadiiSquared;
    public m128 Teams;
}
Explicit SIMD with Burst Intrinsics
49
for (int j = 0; j < Doors.Length; ++j) {
    // Start with all lanes closed; an all-ones start would make the OR below a no-op
    m128 openMask = new m128(0u);
    for (int i = 0; i < TestPos.Length; ++i) {
        m128 tx = new m128(TestPos[i].X);
        m128 ty = new m128(TestPos[i].Y);
        m128 tz = new m128(TestPos[i].Z);
        m128 tt = new m128(TestPos[i].Team);
        m128 xdeltas = sub_ps(Doors[j].Xs, tx);
        m128 ydeltas = sub_ps(Doors[j].Ys, ty);
        m128 zdeltas = sub_ps(Doors[j].Zs, tz);
        m128 xdsq = mul_ps(xdeltas, xdeltas);
        m128 ydsq = mul_ps(ydeltas, ydeltas);
        m128 zdsq = mul_ps(zdeltas, zdeltas);
        m128 dsq = add_ps(xdsq, add_ps(ydsq, zdsq));
        m128 rangeMask = cmple_ps(dsq, Doors[j].RadiiSquared);
        rangeMask = and_ps(rangeMask, cmpeq_epi32(Doors[j].Teams, tt));
        openMask = or_ps(openMask, rangeMask);
    }
    DoorOpenStates.ReinterpretStore(j * 4, openMask);
}
Explicit SIMD Disassembly
50
.LBB1_3:
vbroadcastss xmm4, dword ptr [rax - 12]
vbroadcastss xmm5, dword ptr [rax - 8]
vbroadcastss xmm6, dword ptr [rax - 4]
vpbroadcastd xmm7, dword ptr [rax]
vpcmpeqd xmm7, xmm3, xmm7
vsubps xmm4, xmm1, xmm4
vsubps xmm5, xmm1, xmm5
vsubps xmm6, xmm1, xmm6
vmulps xmm4, xmm4, xmm4
vmulps xmm5, xmm5, xmm5
vaddps xmm4, xmm5, xmm4
vmulps xmm5, xmm6, xmm6
vaddps xmm4, xmm5, xmm4
vcmpleps xmm4, xmm4, xmm2
vpand xmm4, xmm7, xmm4
vpor xmm0, xmm4, xmm0
inc rsi
add rax, 16
cmp rsi, rdx
jl .LBB1_3
Guidelines for SIMD with Burst
51
— Become familiar with the Burst inspector
— Eliminate branches (typically a good idea)
— Prefer wider batches of input data
— Use Unity.Mathematics vertically (as in this example)
— SIMD intrinsics give you the fewest surprises, but require the most effort
What about System.Numerics?
52
— We might consider supporting this API at a later stage
— We want complete control and easy porting of C++ intrinsic code to HPC#
— Similar to the approach we took with HLSL code for Math
Summary
53
— Intrinsics are coming
— Be careful with abstractions
— Adopt a SIMD mindset with Unity.Mathematics today
— Independent values are your friends
— Get familiar with the Burst inspector
— Go forth and compute more things quickly!
Thank you!
54
— Q & A
— Forum feedback welcome
— Twitter: @deplinenoise

More Related Content

What's hot

The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
Philip Hammer
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
AMD Developer Central
 
Best Practices for Shader Graph
Best Practices for Shader GraphBest Practices for Shader Graph
Best Practices for Shader Graph
Unity Technologies
 
Secrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologySecrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics Technology
Tiago Sousa
 
Parallel Futures of a Game Engine
Parallel Futures of a Game EngineParallel Futures of a Game Engine
Parallel Futures of a Game Engine
Johan Andersson
 
Masked Occlusion Culling
Masked Occlusion CullingMasked Occlusion Culling
Masked Occlusion Culling
Intel® Software
 
Beyond porting
Beyond portingBeyond porting
Beyond porting
Cass Everitt
 
Speed up your asset imports for big projects - Unite Copenhagen 2019
Speed up your asset imports for big projects - Unite Copenhagen 2019Speed up your asset imports for big projects - Unite Copenhagen 2019
Speed up your asset imports for big projects - Unite Copenhagen 2019
Unity Technologies
 
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahGS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
AMD Developer Central
 
A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
guest11b095
 
Developing and optimizing a procedural game: The Elder Scrolls Blades- Unite ...
Developing and optimizing a procedural game: The Elder Scrolls Blades- Unite ...Developing and optimizing a procedural game: The Elder Scrolls Blades- Unite ...
Developing and optimizing a procedural game: The Elder Scrolls Blades- Unite ...
Unity Technologies
 
Visibility Optimization for Games
Visibility Optimization for GamesVisibility Optimization for Games
Visibility Optimization for Games
Umbra
 
Physically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in FrostbitePhysically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in Frostbite
Electronic Arts / DICE
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)
Johan Andersson
 
Taking Killzone Shadow Fall Image Quality Into The Next Generation
Taking Killzone Shadow Fall Image Quality Into The Next GenerationTaking Killzone Shadow Fall Image Quality Into The Next Generation
Taking Killzone Shadow Fall Image Quality Into The Next Generation
Guerrilla
 
FrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteFrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in Frostbite
Electronic Arts / DICE
 
Practical Guide for Optimizing Unity on Mobiles
Practical Guide for Optimizing Unity on MobilesPractical Guide for Optimizing Unity on Mobiles
Practical Guide for Optimizing Unity on Mobiles
Valentin Simonov
 
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
Unity Technologies
 
Approaching zero driver overhead
Approaching zero driver overheadApproaching zero driver overhead
Approaching zero driver overhead
Cass Everitt
 
Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2
Philip Hammer
 

What's hot (20)

The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Best Practices for Shader Graph
Best Practices for Shader GraphBest Practices for Shader Graph
Best Practices for Shader Graph
 
Secrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologySecrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics Technology
 
Parallel Futures of a Game Engine
Parallel Futures of a Game EngineParallel Futures of a Game Engine
Parallel Futures of a Game Engine
 
Masked Occlusion Culling
Masked Occlusion CullingMasked Occlusion Culling
Masked Occlusion Culling
 
Beyond porting
Beyond portingBeyond porting
Beyond porting
 
Speed up your asset imports for big projects - Unite Copenhagen 2019
Speed up your asset imports for big projects - Unite Copenhagen 2019Speed up your asset imports for big projects - Unite Copenhagen 2019
Speed up your asset imports for big projects - Unite Copenhagen 2019
 
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahGS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
 
A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
 
Developing and optimizing a procedural game: The Elder Scrolls Blades- Unite ...
Developing and optimizing a procedural game: The Elder Scrolls Blades- Unite ...Developing and optimizing a procedural game: The Elder Scrolls Blades- Unite ...
Developing and optimizing a procedural game: The Elder Scrolls Blades- Unite ...
 
Visibility Optimization for Games
Visibility Optimization for GamesVisibility Optimization for Games
Visibility Optimization for Games
 
Physically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in FrostbitePhysically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in Frostbite
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)
 
Taking Killzone Shadow Fall Image Quality Into The Next Generation
Taking Killzone Shadow Fall Image Quality Into The Next GenerationTaking Killzone Shadow Fall Image Quality Into The Next Generation
Taking Killzone Shadow Fall Image Quality Into The Next Generation
 
FrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteFrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in Frostbite
 
Practical Guide for Optimizing Unity on Mobiles
Practical Guide for Optimizing Unity on MobilesPractical Guide for Optimizing Unity on Mobiles
Practical Guide for Optimizing Unity on Mobiles
 
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
Optimizing HDRP with NVIDIA Nsight Graphics – Unite Copenhagen 2019
 
Approaching zero driver overhead
Approaching zero driver overheadApproaching zero driver overhead
Approaching zero driver overhead
 
Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2
 

Similar to Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019

SIMD.pptx
SIMD.pptxSIMD.pptx
SIMD.pptx
dk03006
 
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Fedor Polyakov - Optimizing computer vision problems on mobile platforms Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Eastern European Computer Vision Conference
 
Designing C++ portable SIMD support
Designing C++ portable SIMD supportDesigning C++ portable SIMD support
Designing C++ portable SIMD support
Joel Falcou
 
JVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, WixJVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, Wix
Codemotion Tel Aviv
 
8871077.ppt
8871077.ppt8871077.ppt
8871077.ppt
ssuserc28b3c
 
Medical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUsMedical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUs
Daniel Blezek
 
lec2 - Modern Processors - SIMD.pptx
lec2 - Modern Processors - SIMD.pptxlec2 - Modern Processors - SIMD.pptx
lec2 - Modern Processors - SIMD.pptx
Rakesh Pogula
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMD
Wei-Ta Wang
 
Simd programming introduction
Simd programming introductionSimd programming introduction
Simd programming introduction
Champ Yen
 
SIMD Processing Using Compiler Intrinsics
SIMD Processing Using Compiler IntrinsicsSIMD Processing Using Compiler Intrinsics
SIMD Processing Using Compiler Intrinsics
Richard Thomson
 
Java Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey KovalenkoJava Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey Kovalenko
Valeriia Maliarenko
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
Sergey Platonov
 
Peddle the Pedal to the Metal
Peddle the Pedal to the MetalPeddle the Pedal to the Metal
Peddle the Pedal to the Metal
C4Media
 
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
DefconRussia
 
100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects
PVS-Studio
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
Haris456
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
Intel® Software
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]
Aleksei Voitylov
 
12 virtualmachine
12 virtualmachine12 virtualmachine
12 virtualmachine
The World of Smalltalk
 
100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects 100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects
Andrey Karpov
 

Similar to Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 (20)

SIMD.pptx
SIMD.pptxSIMD.pptx
SIMD.pptx
 
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Fedor Polyakov - Optimizing computer vision problems on mobile platforms Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
 
Designing C++ portable SIMD support
Designing C++ portable SIMD supportDesigning C++ portable SIMD support
Designing C++ portable SIMD support
 
JVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, WixJVM Memory Model - Yoav Abrahami, Wix
JVM Memory Model - Yoav Abrahami, Wix
 
8871077.ppt
8871077.ppt8871077.ppt
8871077.ppt
 
Medical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUsMedical Image Processing Strategies for multi-core CPUs
Medical Image Processing Strategies for multi-core CPUs
 
lec2 - Modern Processors - SIMD.pptx
lec2 - Modern Processors - SIMD.pptxlec2 - Modern Processors - SIMD.pptx
lec2 - Modern Processors - SIMD.pptx
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMD
 
Simd programming introduction
Simd programming introductionSimd programming introduction
Simd programming introduction
 
SIMD Processing Using Compiler Intrinsics
SIMD Processing Using Compiler IntrinsicsSIMD Processing Using Compiler Intrinsics
SIMD Processing Using Compiler Intrinsics
 
Java Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey KovalenkoJava Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey Kovalenko
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
Peddle the Pedal to the Metal
Peddle the Pedal to the MetalPeddle the Pedal to the Metal
Peddle the Pedal to the Metal
 
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
Nikita Abdullin - Reverse-engineering of embedded MIPS devices. Case Study - ...
 
100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]
 
12 virtualmachine
12 virtualmachine12 virtualmachine
12 virtualmachine
 
100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects 100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects
 


Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019

  • 3. Who this talk is for
    — You’re interested in CPU performance
    — You’re considering porting engine systems to HPC#
    — You just go to technical talks because it’s cool
  • 4. Talk Contents
    — Quick introduction to SIMD topics
    — Options for SIMD programming in HPC# today
    — The case for intrinsics and typeless SIMD
    — Case studies for intrinsics
    — Q & A
  • 6. What is SIMD?
    — Single Instruction, Multiple Data
      – Doing more than one thing at a time
    — Available on essentially all hardware today in some form
      – Capabilities vary, but a few families exist
    — ARM Neon
    — x86/64 SSE and AVX
  • 7. SIMD Analogy: Chopping Veggies
    (Figure labels: Input Data, Instruction, Output Data, Preprocessed!)
  • 8. Why is SIMD important?
    — It’s more efficient to do more with fewer instructions
    — There is dedicated hardware for this stuff
    — Often the only way you can get the max cache bandwidth
  • 10. Cache Bandwidth
    — L1 caches can deliver N bits every cycle
      – N typically much larger than 64
      – 128 or 256 bits per cycle common in most CPUs today
    — Without using SIMD instructions, you only get a fraction of this
      – Important part of not leaving performance on the table
  • 11. Cache Bandwidth
    — This matters when you are processing in cache
      – Which is what we hope to do most of the time
    — Processing floats (4 bytes, 32 bits) with 128-bit cache bandwidth
      – Each scalar load uses 32 of 128 bits, wasting 75% of the bandwidth from the cache
    — Better: Process 4 floats at a time
      – Full cache utilization
  • 12. The vector fallacy
    — SIMD and mathematical vectors are mostly unrelated
    — Lots of confusion around this issue
    — False: Using a vector math library is somehow SIMD
    — True: Working with arrays of data can lead to opportunities for SIMD (but not always)
    — It’s especially problematic with 3-component vectors as we’ll see
  • 13. Quick smell test for float SIMD-ness (x86)
    — xxxps instructions feeding into each other?
      – It’s probably SIMD code
    — xxxss instructions?
      – Scalar code
    — Occasional xxxps instructions with infrastructure?
      – Mix of SIMD and scalar, red flag!
  • 14. Example: 3D dot products
    public static float DotExample(float3 a, float3 b)
    {
        return math.dot(a, b);
    }

    mulps    xmm0, xmmword ptr [rdx]   ; 4-wide mul, only 3 lanes valid
    movshdup xmm1, xmm0                ; shuffle overhead
    addss    xmm1, xmm0                ; scalar addition
    movhlps  xmm0, xmm0                ; shuffle overhead
    addss    xmm0, xmm1                ; scalar addition

    1 SIMD op, 2 infrastructure, 2 scalar ops => 1 dot product
  • 15. Example: 3D dot products
    — But wait a minute, what is a dot product?
    — For 3D:
      – a.x * b.x + a.y * b.y + a.z * b.z
    — What if we go back to basics and base our code on this?
  • 16. Example: 3D dot products (back to basics)
    public static float DotExample1(
        float ax, float ay, float az,
        float bx, float by, float bz)
    {
        return ax * bx + ay * by + az * bz;
    }

    mulss xmm0, xmm3                   ; mul
    mulss xmm1, dword ptr [rsp + 40]   ; mul
    addss xmm0, xmm1                   ; add
    mulss xmm2, dword ptr [rsp + 48]   ; mul
    addss xmm0, xmm2                   ; add

    0 SIMD ops, 0 infrastructure, 5 scalar ops => 1 dot product
  • 17. Example: 3D dot products (SIMD)
    public static float4 DotExample4(
        float4 ax, float4 ay, float4 az,
        float4 bx, float4 by, float4 bz)
    {
        return ax * bx + ay * by + az * bz;
    }

    mulps xmm2, xmmword ptr [r9]    ; 4-wide mul
    mulps xmm1, xmmword ptr [rcx]   ; 4-wide mul
    addps xmm1, xmm2                ; 4-wide add
    mulps xmm0, xmmword ptr [rax]   ; 4-wide mul
    addps xmm0, xmm1                ; 4-wide add

    5 SIMD ops, 0 infrastructure, 0 scalar ops => 4 dot products
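The vertical layout on slide 17 can be tried outside Unity as well. The following is a minimal sketch, assuming `System.Numerics.Vector4` as a stand-in for `float4`; the class and method names are hypothetical and not part of the talk. Each `Vector4` carries one component (all x, all y, or all z) of four different 3D vectors, so a single expression yields four independent dot products.

```csharp
using System.Numerics;

public static class DotSketch
{
    // Scalar reference: one 3D dot product from six floats.
    public static float Dot1(float ax, float ay, float az,
                             float bx, float by, float bz)
    {
        return ax * bx + ay * by + az * bz;
    }

    // Four independent dot products at once. Lane i of every argument
    // belongs to the i-th pair of 3D vectors; Vector4 * and + are
    // element-wise, so no lane ever reads a neighbouring lane.
    public static Vector4 Dot4(Vector4 ax, Vector4 ay, Vector4 az,
                               Vector4 bx, Vector4 by, Vector4 bz)
    {
        return ax * bx + ay * by + az * bz;
    }
}
```

Lane 0 of the result equals `Dot1` applied to lane 0 of each input, and likewise for the other lanes. Whether the runtime emits packed instructions for this depends on the JIT and the CPU.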
  • 18. SIMD mindset
    — Important not to think in terms of abstractions
    — Don’t think about float4 as “a 4-D vector”
    — Better: 4 floats is the width of the vector unit on this CPU
    — Get used to the idea of 128 or 256 bit blocks of data
      – Divide into whatever size is convenient
    — What fits in 128 bits?
      – 16 bytes
      – 8 shorts
      – 4 floats or ints
      – 2 doubles or longs
  • 19. SIMD mindset, contd.
    — Try to find opportunities to compute independent values
      – Like the 4 independent dot products we just saw
    — Fight the urge to think of vectors as horizontal values
      – Horizontal operations often go against the grain of SIMD instructions
    — Typically scalar code without abstractions vectorizes well
      – float3, float2 etc. can be convenient but often get in the way
  • 20. So how do you get SIMD code?
    — In HPC# we’ve had two options so far:
      – LLVM auto-vectorization
      – Unity.Mathematics explicit SIMD
  • 21. LLVM’s auto vectorizer
    — Simple mode: Write scalar code, get SIMD code out
    — For simple loops, LLVM is often able to generate SIMD code
    — Checklist to look at before expecting SIMD:
      – Data ranges must not alias
      – Data must be contiguous in memory (for wide loads)
      – Data types must be integer or float with fast-math
      – Branches are kept to a minimum
      – There is no cross-element interference
  • 22. LLVM’s auto vectorizer
    — Pros
      – Simpler code to read/write (at face value)
      – Often gives you a speedup where you didn’t expect one
    — Cons
      – Need to learn a bunch of rules to get SIMD code from loops
      – No way to tell when you’ve stopped getting SIMD
      – (We’re looking at ways to make this a compile error if desired)
      – Hard to reinterpret data types
      – Often surprising what will not vectorize
  • 23. Example of successful vectorization
    [BurstCompile]
    public struct VectorizeDemo : IJob
    {
        public NativeArray<int> Inputs;
        public NativeArray<int> Outputs;

        public void Execute()
        {
            for (int i = 0; i < Inputs.Length; ++i)
            {
                if (Inputs[i] >= 0) { Outputs[i] = Inputs[i]; }
                else { Outputs[i] = 0; }
            }
        }
    }

    .LBB0_7:
        vpmaxsd ymm1, ymm0, ymmword ptr [r10 + 4*rdx]
        vpmaxsd ymm2, ymm0, ymmword ptr [r10 + 4*rdx + 32]
        vpmaxsd ymm3, ymm0, ymmword ptr [r10 + 4*rdx + 64]
        vpmaxsd ymm4, ymm0, ymmword ptr [r10 + 4*rdx + 96]
        vmovdqu ymmword ptr [rcx + 4*rdx], ymm1
        vmovdqu ymmword ptr [rcx + 4*rdx + 32], ymm2
        vmovdqu ymmword ptr [rcx + 4*rdx + 64], ymm3
        vmovdqu ymmword ptr [rcx + 4*rdx + 96], ymm4
        add rdx, 32
        cmp rax, rdx
        jne .LBB0_7
  • 24. Example of unsuccessful vectorization
    [BurstCompile]
    public struct VectorizeDemo : IJob
    {
        public NativeArray<int> Inputs;
        public NativeArray<int> Outputs;

        public void Execute()
        {
            for (int i = 0; i < Inputs.Length; ++i)
            {
                if (Inputs[i] >= 0) { Outputs[i] = Inputs[i] * 2; }
                else { Outputs[i] = 0; }
            }
        }
    }

    .LBB0_2:
        mov   edx, dword ptr [r10 + 4*rax]
        lea   ecx, [rdx + rdx]
        test  edx, edx
        cmovs ecx, r8d
        mov   dword ptr [r11 + 4*rax], ecx
        inc   rax
        cmp   r9, rax
        jne   .LBB0_2
  • 25. Explicit SIMD with Unity.Mathematics
    — Use e.g. float4, int4 vertically (as in dot product example)
    — Maps directly to LLVM vector types, you will get vector code
    — Checklist:
      – Avoid branches, use select/mask idioms
      – Use native arrays, with ReinterpretLoad/Store as needed
      – Handle end-of-array cases manually
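The checklist on slide 25 can be illustrated on the loop from slide 24 (`Inputs[i] >= 0 ? Inputs[i] * 2 : 0`). This is a sketch only, assuming plain .NET `System.Numerics.Vector<int>` in place of Unity.Mathematics and managed arrays in place of native arrays; `ClampDoubleSketch` is a hypothetical name. It shows a compare that produces a mask, a select driven by that mask, and a manual scalar loop for the end of the array.

```csharp
using System;
using System.Numerics;

public static class ClampDoubleSketch
{
    // output[i] = input[i] >= 0 ? input[i] * 2 : 0, without per-element branches.
    public static void Run(int[] input, int[] output)
    {
        if (output.Length < input.Length)
            throw new ArgumentException("output is shorter than input");

        int width = Vector<int>.Count;        // lanes per vector, hardware dependent
        Vector<int> zero = Vector<int>.Zero;
        int i = 0;

        // Main loop: one full vector of elements per iteration.
        for (; i <= input.Length - width; i += width)
        {
            var v = new Vector<int>(input, i);
            // All ones in lanes where v >= 0, all zeros elsewhere.
            Vector<int> mask = Vector.GreaterThanOrEqual(v, zero);
            // Take v + v where the mask is set and 0 where it is not.
            Vector<int> r = Vector.ConditionalSelect(mask, v + v, zero);
            r.CopyTo(output, i);
        }

        // End-of-array case: the leftover elements are handled one at a time.
        for (; i < input.Length; ++i)
        {
            int x = input[i];
            output[i] = x >= 0 ? unchecked(x * 2) : 0;
        }
    }
}
```

Both arms of the select are computed for every lane, which is the usual price of the mask idiom; it pays off when the arms are cheap, as they are here.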
  • 26. Explicit Unity.Mathematics SIMD Example
    static public IntersectResult Intersect2(NativeArray<PlanePacket4> cullingPlanePackets, AABB a)
    {
        // …
        int4 outCounts = 0;
        int4 inCounts = 0;

        for (int i = 0; i < cullingPlanePackets.Length; i++)
        {
            var p = cullingPlanePackets[i];
            float4 distances = dot4(p.Xs, p.Ys, p.Zs, mx, my, mz) + p.Distances;
            float4 radii = dot4(ex, ey, ez, math.abs(p.Xs), math.abs(p.Ys), math.abs(p.Zs));
            outCounts += (int4) (distances + radii <= 0);
            inCounts += (int4) (distances > radii);
        }

        int inCount = math.csum(inCounts);
        int outCount = math.csum(outCounts);

        if (outCount != 0)
            return IntersectResult.Out;
        else
            return (inCount == 4 * cullingPlanePackets.Length) ? IntersectResult.In : IntersectResult.Partial;
    }
  • 27. The Case For Intrinsics
  • 28. The need for typeless SIMD
    — In the engine space it’s frequently useful to reinterpret data
    — Want control over instruction selection for particular HW
    — Want to leverage tricks that compilers don’t use
  • 29. Data reinterpretation
    — Work with float bits using integer operations
    — Example: Converting small integers to floats
    ushort x = ...;
    uint y = x | 0x4b000000u;
    float f = as_float(y) - 8388608.0f;
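The conversion on slide 29 works because `0x4b000000` is the bit pattern of 8388608.0f (2^23), where one step of the mantissa is exactly 1.0. A minimal runnable sketch follows, assuming `BitConverter.Int32BitsToSingle` as the stand-in for the slide's `as_float`; the class name is hypothetical.

```csharp
using System;

public static class SmallIntToFloat
{
    public static float Convert(ushort x)
    {
        // OR the 16-bit value into the low mantissa bits of 2^23.
        // The resulting float is exactly 8388608 + x.
        uint bits = (uint)x | 0x4B000000u;
        float biased = BitConverter.Int32BitsToSingle(unchecked((int)bits));

        // Removing the bias leaves x as a float, with no int-to-float
        // conversion instruction involved.
        return biased - 8388608.0f;
    }
}
```

The same idea holds for any unsigned value below 2^23, since that is how many mantissa bits are available.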
  • 30. Instruction selection
    — Often useful to base core engine loops around specific h/w
    — Example: x86 pmulhrsw
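For readers who have not met `pmulhrsw`: it multiplies packed signed 16-bit values and keeps a rounded, scaled high part of each product, which makes it a natural fit for Q15 fixed-point math. The sketch below is a scalar model of one lane, written from the instruction's documented behaviour as I understand it; the class and method names are hypothetical and this is not Burst API.

```csharp
public static class MulHrsSketch
{
    // One 16-bit lane of pmulhrsw: 32-bit product, shift right by 14,
    // add 1 for rounding, then drop the lowest bit and keep 16 bits.
    public static short MulHighRoundScale(short a, short b)
    {
        int product = a * b;
        int rounded = (product >> 14) + 1;
        return unchecked((short)(rounded >> 1));
    }
}
```

In Q15 terms, 16384 represents 0.5, and `MulHighRoundScale(16384, 16384)` gives 8192, which represents 0.25. Getting this rounding behaviour from portable code typically takes several operations per lane, which is the argument for selecting the instruction directly.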
  • 31. Leveraging data tricks
    — Many tricks are not in the repertoire of most compilers
    — Example: Quickly generating mask from sign of float data
    float x = ...;
    uint mask = (uint)(as_int(x) >> 31);
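The sign-mask trick relies on an arithmetic right shift of a signed integer replicating the top bit. A small runnable sketch, assuming `BitConverter.SingleToInt32Bits` in place of the slide's `as_int` and a hypothetical class name:

```csharp
using System;

public static class SignMaskSketch
{
    // Returns 0xFFFFFFFF when the sign bit of x is set, 0 otherwise.
    public static uint SignMask(float x)
    {
        int bits = BitConverter.SingleToInt32Bits(x);
        // Shifting a signed int right by 31 copies the sign bit into
        // every bit position.
        return unchecked((uint)(bits >> 31));
    }
}
```

Because the test looks only at the sign bit, negative zero and NaNs with the sign bit set also produce an all-ones mask, which differs from a floating-point comparison against zero.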
  • 33. What we’re working on
    — Typeless SIMD library of intrinsics
    — Start with x86, with ARM to come
    — Good C# integration with debugging considerations
  • 34. Typeless?
    — Types are mostly an annoyance for real world SIMD
    — Often need to reinterpret float/int
    — Often need to deal with masks, which are unclearly typed
    — Canonical example: comparisons
      – _mm_cmpeq_ps – returns a mask of all ones when equal
      – So… is that a float? Or an int?
  • 35. Do what the hardware does
    — The hardware just has registers, not types (obviously)
    — That’s what we expose in our intrinsics API
    — m128 – 128 bit SIMD register
    — m256 – 256 bit SIMD register
    — Instructions determine how the register contents are interpreted
  • 36. API Usage Example
    using static Burst.Compiler.IL.x86;
    // …
    m128 a, b = …;
    m128 mask = cmpeq_ps(a, b);
    // …
  • 37. API Extract (C# Reference Implementation)
    // _mm_cmpeq_ps
    /// <summary> Compare packed single-precision (32-bit)
    /// floating-point elements in "a" and "b" for equality,
    /// and store the results in "dst". </summary>
    [X86InstructionFamily(InstructionFamily.SSE)]
    [DebuggerStepThrough]
    public static m128 cmpeq_ps(m128 a, m128 b)
    {
        m128 dst = default(m128);
        dst.UInt0 = a.Float0 == b.Float0 ? ~0u : 0;
        dst.UInt1 = a.Float1 == b.Float1 ? ~0u : 0;
        dst.UInt2 = a.Float2 == b.Float2 ? ~0u : 0;
        dst.UInt3 = a.Float3 == b.Float3 ? ~0u : 0;
        return dst;
    }
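The "is that a float or an int?" question from slide 34 becomes concrete once the mask is used. The sketch below models a single 32-bit lane in plain .NET, with hypothetical names that are not part of the Burst API: the compare yields integer-looking bits, and the select applies those bits to float data with AND, AND-NOT and OR.

```csharp
using System;

public static class MaskSelectSketch
{
    // One lane of a compare-equal: all ones when equal, all zeros otherwise.
    public static uint CmpEqLane(float a, float b)
    {
        return a == b ? ~0u : 0u;
    }

    // One lane of a bitwise select: the bits of a where the mask is set,
    // the bits of b where it is clear.
    public static float SelectLane(uint mask, float a, float b)
    {
        uint aBits = unchecked((uint)BitConverter.SingleToInt32Bits(a));
        uint bBits = unchecked((uint)BitConverter.SingleToInt32Bits(b));
        uint result = (aBits & mask) | (bBits & ~mask);
        return BitConverter.Int32BitsToSingle(unchecked((int)result));
    }
}
```

In a typed API each step needs a cast between float and integer vectors; with a single register type the same bits simply flow from the compare into the logic operations.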
  • 38. A more complete example
  • 39. A more complete example
    For each door:
        open = 0
        For each player position:
            if player in range and correct team:
                open = 1
        store open state for door
  • 40. A more complete example
    — Basic N vs M test
    — N doors, M players
    public struct Door
    {
        public float3 Pos;
        public float RadiusSquared;
        public int Team;
    }

    public struct DoorTestPos
    {
        public float3 Pos;
        public int Team;
    }
  • 41. Reference version
    [BurstCompile]
    public struct DoorTest_Reference : IJob
    {
        public NativeArray<Door> Doors;
        public NativeArray<DoorTestPos> TestPos;
        public NativeArray<int> DoorOpenStates;

        public void Execute()
        {
            for (int j = 0; j < Doors.Length; ++j)
            {
                bool shouldOpen = false;
                for (int i = 0; i < TestPos.Length; ++i)
                {
                    float3 delta = TestPos[i].Pos - Doors[j].Pos;
                    float dsq = math.csum(delta * delta);
                    if (dsq < Doors[j].RadiusSquared && Doors[j].Team == TestPos[i].Team)
                    {
                        shouldOpen = true;
                        break;
                    }
                }
                DoorOpenStates[j] = shouldOpen ? 1 : 0;
            }
        }
    }
  • 42. Reference disassembly
    .LBB0_6:
        vmovsd    xmm2, qword ptr [rsi - 12]
        vinsertps xmm2, xmm2, dword ptr [rsi - 4], 32
        vsubps    xmm2, xmm2, xmm0
        vmulps    xmm2, xmm2, xmm2
        vmovshdup xmm3, xmm2
        vpermilpd xmm4, xmm2, 1
        vaddss    xmm3, xmm3, xmm4
        vaddss    xmm2, xmm2, xmm3
        vucomiss  xmm2, xmm1
        jae       .LBB0_10          ; not inside radius?
        mov       ebx, dword ptr [rdx]
        cmp       ebx, dword ptr [rsi]
        je        .LBB0_8           ; break out of loop
    .LBB0_10:
        inc       rdi
        add       rsi, 16
        cmp       rdi, rax
        jl        .LBB0_6
  • 43. Let’s lose the branches 43 public void Execute() { for (int j = 0; j < Doors.Length; ++j) { bool shouldOpen = false; for (int i = 0; i < TestPos.Length; ++i) { float3 delta = TestPos[i].Pos - Doors[j].Pos; float dsq = math.csum(delta * delta); bool inRadius = dsq < Doors[j].RadiusSquared; bool teamMatches = Doors[j].Team == TestPos[i].Team; shouldOpen |= (inRadius & teamMatches) ? true : false; } DoorOpenStates[j] = shouldOpen ? 1 : 0; } } }
  • 44. Branch-free disassembly 44 .LBB0_4: vmovsd xmm2, qword ptr [rdi - 12] vinsertps xmm2, xmm2, dword ptr [rdi - 4], 32 vsubps xmm2, xmm2, xmm0 vmulps xmm2, xmm2, xmm2 vmovshdup xmm3, xmm2 vpermilpd xmm4, xmm2, 1 vaddss xmm3, xmm3, xmm4 vaddss xmm2, xmm2, xmm3 vucomiss xmm2, xmm1 setb al cmp ebp, dword ptr [rdi] sete dl and dl, al movzx eax, dl or esi, eax add rdi, 16 dec rbx jne .LBB0_4
  • 45. Explicit SIMD with Unity Mathematics

      public struct DoorGroup
      {
          public float4 Xs;
          public float4 Ys;
          public float4 Zs;
          public float4 RadiiSquared;
          public int4 Teams;
      }

      public NativeArray<DoorGroup> Doors;
  • 46. Explicit SIMD with Unity Mathematics

      for (int j = 0; j < Doors.Length; ++j)
      {
          bool4 openMask = false;
          for (int i = 0; i < TestPos.Length; ++i)
          {
              float4 xdeltas = TestPos[i].X - Doors[j].Xs;
              float4 ydeltas = TestPos[i].Y - Doors[j].Ys;
              float4 zdeltas = TestPos[i].Z - Doors[j].Zs;
              float4 xdsq = xdeltas * xdeltas;
              float4 ydsq = ydeltas * ydeltas;
              float4 zdsq = zdeltas * zdeltas;
              float4 dsq = xdsq + ydsq + zdsq;
              bool4 rangeMask = dsq < Doors[j].RadiiSquared;
              bool4 teamMask = TestPos[i].Team == Doors[j].Teams;
              openMask |= teamMask & rangeMask;
          }
          DoorOpenStates[j] = math.select(new int4(0), new int4(1), openMask);
      }
  • 47. Explicit Math version disassembly

      .LBB0_2:
          vbroadcastss xmm0, dword ptr [rdx - 12]
          vsubps       xmm0, xmm0, xmm11
          vbroadcastss xmm2, dword ptr [rdx - 8]
          vsubps       xmm2, xmm2, xmm4
          vbroadcastss xmm3, dword ptr [rdx - 4]
          vsubps       xmm3, xmm3, xmm5
          vmulps       xmm0, xmm0, xmm0
          vmulps       xmm2, xmm2, xmm2
          vmulps       xmm3, xmm3, xmm3
          vaddps       xmm0, xmm0, xmm3
          vaddps       xmm0, xmm2, xmm0
          vcmpltps     xmm0, xmm0, xmm7
          vpcmpeqd     xmm2, xmm1, xmmword ptr [rdx]
          vpand        xmm0, xmm2, xmm0
          vpsrld       xmm0, xmm0, 31
          vpor         xmm6, xmm6, xmm0
          add          rdx, 28
          dec          rsi
          jne          .LBB0_2
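The `DoorGroup` layout presumes the doors have already been regrouped four at a time, with one field per lane set. The talk does not show that step, so the following is only a sketch of one way to do it in plain C#, using `float[4]` and `int[4]` arrays as stand-ins for `float4` and `int4`. `DoorGroupSoA`, `DoorPacking.Pack`, and the padding choice (a negative squared radius so that unused lanes can never pass the range test) are assumptions made for this illustration.

```csharp
using System;

public struct Door
{
    public float X, Y, Z;          // stands in for float3 Pos (assumption)
    public float RadiusSquared;
    public int Team;
}

// Plain-array stand-in for DoorGroup: four doors, one array per field.
public sealed class DoorGroupSoA   // hypothetical type for illustration
{
    public readonly float[] Xs = new float[4];
    public readonly float[] Ys = new float[4];
    public readonly float[] Zs = new float[4];
    public readonly float[] RadiiSquared = new float[4];
    public readonly int[] Teams = new int[4];
}

public static class DoorPacking    // hypothetical helper, not part of Burst
{
    public static DoorGroupSoA[] Pack(Door[] doors)
    {
        int groupCount = (doors.Length + 3) / 4;   // round up to whole groups
        var groups = new DoorGroupSoA[groupCount];
        for (int g = 0; g < groupCount; ++g)
        {
            var group = new DoorGroupSoA();
            for (int lane = 0; lane < 4; ++lane)
            {
                int src = g * 4 + lane;
                if (src < doors.Length)
                {
                    group.Xs[lane] = doors[src].X;
                    group.Ys[lane] = doors[src].Y;
                    group.Zs[lane] = doors[src].Z;
                    group.RadiiSquared[lane] = doors[src].RadiusSquared;
                    group.Teams[lane] = doors[src].Team;
                }
                else
                {
                    // Padding lane: a squared distance is never below -1,
                    // so this lane can never report "open".
                    group.RadiiSquared[lane] = -1f;
                }
            }
            groups[g] = group;
        }
        return groups;
    }
}
```

Padding the last group keeps the inner loop free of tail handling, at the cost of a few wasted lanes when the door count is not a multiple of four.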
  • 48. Explicit SIMD with Burst Intrinsics

      public struct Door4
      {
          public m128 Xs;
          public m128 Ys;
          public m128 Zs;
          public m128 RadiiSquared;
          public m128 Teams;
      }
  • 49. Explicit SIMD with Burst Intrinsics

      for (int j = 0; j < Doors.Length; ++j)
      {
          // Start with no lanes set; matches are ORed in below.
          m128 openMask = new m128(0u);
          for (int i = 0; i < TestPos.Length; ++i)
          {
              m128 tx = new m128(TestPos[i].X);
              m128 ty = new m128(TestPos[i].Y);
              m128 tz = new m128(TestPos[i].Z);
              m128 tt = new m128(TestPos[i].Team);
              m128 xdeltas = sub_ps(Doors[j].Xs, tx);
              m128 ydeltas = sub_ps(Doors[j].Ys, ty);
              m128 zdeltas = sub_ps(Doors[j].Zs, tz);
              m128 xdsq = mul_ps(xdeltas, xdeltas);
              m128 ydsq = mul_ps(ydeltas, ydeltas);
              m128 zdsq = mul_ps(zdeltas, zdeltas);
              m128 dsq = add_ps(xdsq, add_ps(ydsq, zdsq));
              m128 rangeMask = cmple_ps(dsq, Doors[j].RadiiSquared);
              rangeMask = and_ps(rangeMask, cmpeq_epi32(Doors[j].Teams, tt));
              openMask = or_ps(openMask, rangeMask);
          }
          DoorOpenStates.ReinterpretStore(j * 4, openMask);
      }
  • 50. Explicit SIMD Disassembly

      .LBB1_3:
          vbroadcastss xmm4, dword ptr [rax - 12]
          vbroadcastss xmm5, dword ptr [rax - 8]
          vbroadcastss xmm6, dword ptr [rax - 4]
          vpbroadcastd xmm7, dword ptr [rax]
          vpcmpeqd     xmm7, xmm3, xmm7
          vsubps       xmm4, xmm1, xmm4
          vsubps       xmm5, xmm1, xmm5
          vsubps       xmm6, xmm1, xmm6
          vmulps       xmm4, xmm4, xmm4
          vmulps       xmm5, xmm5, xmm5
          vaddps       xmm4, xmm5, xmm4
          vmulps       xmm5, xmm6, xmm6
          vaddps       xmm4, xmm5, xmm4
          vcmpleps     xmm4, xmm4, xmm2
          vpand        xmm4, xmm7, xmm4
          vpor         xmm0, xmm4, xmm0
          inc          rsi
          add          rax, 16
          cmp          rsi, rdx
          jl           .LBB1_3
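The intrinsics loop on slide 49 stores the raw compare mask into `DoorOpenStates`. Assuming the accumulator starts at zero, an open lane therefore holds all ones (-1 when read as `int`) rather than the 1 written by the reference version. If callers need 0/1 flags, a logical shift right by 31 converts each lane, which is what the `vpsrld xmm0, xmm0, 31` in the Unity.Mathematics disassembly does. The sketch below shows the scalar equivalent; `MaskUtil` and its methods are hypothetical names for this illustration.

```csharp
using System;

public static class MaskUtil       // hypothetical helper, not part of Burst
{
    // Logical shift right by 31 keeps only the top bit:
    // an all-ones lane becomes 1, an all-zero lane stays 0.
    public static int MaskToFlag(int laneMask)
    {
        return (int)(unchecked((uint)laneMask) >> 31);
    }

    // Converts a whole array of lane masks to 0/1 flags in place.
    public static void MasksToFlags(int[] states)
    {
        for (int i = 0; i < states.Length; ++i)
            states[i] = MaskToFlag(states[i]);
    }
}
```

Testing the stored values against zero instead avoids the conversion entirely, which may be preferable when the flags are only ever used as booleans.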
  • 51. Guidelines for SIMD with Burst
    — Become familiar with the Burst inspector
    — Eliminate branches (typically a good idea)
    — Prefer wider batches of input data
    — Use Unity.Mathematics vertically (as in this example)
    — SIMD intrinsics give you the fewest surprises, but require the most effort
  • 52. What about System.Numerics?
    — We might consider supporting this API at a later stage
    — We want complete control and easy porting of C++ intrinsic code to HPC#
    — Similar to the approach we took with HLSL code for Math
  • 53. Summary
    — Intrinsics are coming
    — Be careful with abstractions
    — Adopt a SIMD mindset with Unity.Mathematics today
    — Independent values are your friends
    — Get familiar with the Burst inspector
    — Go forth and compute more things quickly!
  • 54. Thank you!
    — Q & A
    — Forum feedback welcome
    — Twitter: @deplinenoise