Best Practices aren't static -- as Unity's underlying architecture evolves to support Data-Oriented Design, the old tricks might no longer be the best ways to squeeze performance out of the engine. In this talk, we'll discuss how Unity has changed between Unity 5, Unity 2017 and Unity 2018 and how to take advantage of these changes.
Ian Dundore (Unity Technologies)
Unity's Evolving Best Practices
1. This slide deck was presented at Unite Berlin 2018.
This offline version includes numerous additional slides, cut
from the original presentation for brevity and/or time.
These extra slides contain more examples and data, but are
not essential for understanding the presentation.
2. Optimization & Best Practices:
Through The Ages
Ian Dundore
Unity Technologies
9. oops.
• In the specific case of String.Equals, that advice is wrong!
• From a performance perspective, at least.
• For all other string comparisons, it’s right!
• Compare, StartsWith, EndsWith, IndexOf, etc.
• Again, from a performance perspective.
• (Psst! This is documented!)
https://docs.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings#common-string-comparison-methods-in-net
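A minimal sketch of the advice on this slide: keep the plain `String.Equals` overload for equality, and pass `StringComparison.Ordinal` to the other comparison methods. The class name, method names and the sample paths are illustrative assumptions, not code from the talk.

```csharp
using System;

public static class StringComparisonSketch
{
    // Plain equality: the overloads without a StringComparison argument
    // already compare ordinally, so no comparison type is passed here.
    public static bool SameTag(string a, string b)
    {
        return string.Equals(a, b);
    }

    // Other comparison methods default to culture-sensitive behaviour,
    // so the ordinal comparison type is requested explicitly.
    public static bool IsPlayerPrefab(string path)
    {
        return path.StartsWith("Player/", StringComparison.Ordinal)
            && path.EndsWith(".prefab", StringComparison.Ordinal);
    }

    // Returns the index of the extension, or -1 when it is absent.
    public static int ExtensionIndex(string path)
    {
        return path.IndexOf(".prefab", StringComparison.Ordinal);
    }
}
```

For example, `StringComparisonSketch.ExtensionIndex("Player/Hero.prefab")` returns 11.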
11. Testing Considerations
• How does the code path differ with different inputs?
• What is the environment around the executing code?
• Runtime
• IL2CPP/Mono? .Net version?
• Hardware
• Pipeline depth, cache size, cache-line length
• # of cores, core affinity settings on threads, throttling
• What exactly is your test measuring?
12. Your Test Harness Matters!
// Test A: more loop overhead (one call per loop iteration).
Profiler.BeginSample("Test A");
for (int i=0; i<NUM_TESTS; ++i) {
    DoAThing(i);
}
Profiler.EndSample();

// Test B: less loop overhead (loop body manually unrolled).
int i = 0;
Profiler.BeginSample("Test B");
DoAThing(0);
while (i<NUM_TESTS) {
    DoAThing(++i);
    DoAThing(++i);
    DoAThing(++i);
    // … repeat a lot …
    DoAThing(++i);
}
Profiler.EndSample();
13. public bool Equals(String value) {
if (this == null)
throw new NullReferenceException();
if (value == null)
return false;
if (Object.ReferenceEquals(this, value))
return true;
if (this.Length != value.Length)
return false;
return EqualsHelper(this, value);
}
Mono’s String.cs (1)
14. What does EqualsHelper do?
• Uses unsafe code to pin strings to memory addresses.
• C-style integer comparison of raw bytes of the strings.
• Core is a special cache-friendly loop.
• 64-bit: Step through strings with a stride of 12 characters (24 bytes).
while (length >= 12)
{
if (*(long*)a != *(long*)b) return false;
if (*(long*)(a+4) != *(long*)(b+4)) return false;
if (*(long*)(a+8) != *(long*)(b+8)) return false;
a += 12; b += 12; length -= 12;
}
15. public bool Equals(String value, StringComparison comparisonType) {
if (comparisonType < StringComparison.CurrentCulture ||
comparisonType > StringComparison.OrdinalIgnoreCase)
throw new ArgumentException(…);
Contract.EndContractBlock();
if ((Object)this == (Object)value) {
return true;
}
if ((Object)value == null) {
return false;
}
Mono’s String.cs (2)
16. switch (comparisonType) {
case StringComparison.CurrentCulture:
return (CultureInfo.CurrentCulture.CompareInfo.Compare(this,
value, CompareOptions.None) == 0);
case StringComparison.CurrentCultureIgnoreCase:
return (CultureInfo.CurrentCulture.CompareInfo.Compare(this,
value, CompareOptions.IgnoreCase) == 0);
case StringComparison.InvariantCulture:
return (CultureInfo.InvariantCulture.CompareInfo.Compare(this,
value, CompareOptions.None) == 0);
case StringComparison.InvariantCultureIgnoreCase:
return (CultureInfo.InvariantCulture.CompareInfo.Compare(this,
value, CompareOptions.IgnoreCase) == 0);
Mono’s String.cs (3)
18. But wait!
• For non-matching strings, length will often differ.
• But for length-invariant strings, first character usually differs.
• This optimization is found in CompareOrdinal, but not Equals.
public static int CompareOrdinal(String strA, String strB) {
if ((Object)strA == (Object)strB)
return 0;
if (strA == null)
return -1;
if (strB == null)
return 1;
// Most common case, first character is different.
if ((strA.m_firstChar - strB.m_firstChar) != 0)
return strA.m_firstChar - strB.m_firstChar;
return CompareOrdinalHelper(strA, strB);
}
19. This is getting silly.
public static int CompareOrdinal(String strA, int indexA,
String strB, int indexB, int length) {
if (strA == null || strB == null) {
if ((Object)strA==(Object)strB) { //they're both null;
return 0;
}
return (strA==null)? -1 : 1; //-1 if A is null, 1 if B is null.
}
return nativeCompareOrdinalEx(strA, indexA, strB, indexB, length);
}
An overload that goes almost directly to native code!
20. Test Design: 4 cases
• Case 1: Two identical strings.
• Case 2: Two strings of random characters of same length.
• Case 3: Two strings of random characters of same length.
• First characters identical, to bypass check in Compare.
• Case 4: Two strings of random characters, different lengths.
• Comparison’s worst case is bounded by the shorter string.
• Constrained range to 15-25 characters to be similar to above tests.
21. Mono 3.5
Method | Identical Content, Identical Length | Random Content, Identical Length | First Char Equal, Identical Length | Random Content, Random Length
String.Equals | 2.97 | 1.75 | 1.73 | 1.30
String.Equals with Ordinal type | 5.87 | 3.46 | 3.56 | 3.39
String.Compare | 37.52 | 33.29 | 64.66 | 31.35
String.Compare with Ordinal type | 6.23 | 3.35 | 3.35 | 3.26
CompareOrdinal | 5.68 | 3.10 | 3.18 | 2.99
CompareOrdinal with Indices | 5.53 | 3.30 | 3.42 | 3.95
Simple Hand-Coded | 5.46 | 1.75 | 2.18 | 1.40
100,000 comparisons. Timings in milliseconds.
Unity 2018.1.0f2, Windows Standalone, Mono 3.5, Core i7-3500K
22. Mono 4.6
Method | Identical Content, Identical Length | Random Content, Identical Length | First Char Equal, Identical Length | Random Content, Random Length
String.Equals | 3.23 | 1.80 | 1.82 | 1.21
String.Equals with Ordinal type | 3.84 | 2.13 | 2.03 | 1.38
String.Compare | 34.72 | 28.70 | 63.03 | 29.74
String.Compare with Ordinal type | 5.16 | 1.75 | 2.68 | 1.65
CompareOrdinal | 4.93 | 1.55 | 2.21 | 1.40
CompareOrdinal with Indices | 4.77 | 3.59 | 3.59 | 4.41
Simple Hand-Coded | 4.40 | 1.66 | 1.95 | 1.28
100,000 comparisons. Timings in milliseconds.
Unity 2018.1.0f2, Windows Standalone, Mono 4.6, Core i7-3500K
23. IL2CPP
Method | Identical Content, Identical Length | Random Content, Identical Length | First Char Equal, Identical Length | Random Content, Random Length
String.Equals | 2.61 | 1.26 | 1.27 | 0.95
String.Equals with Ordinal type | 5.38 | 3.80 | 3.84 | 3.66
String.Compare | 39.12 | 29.32 | 60.56 | 28.01
String.Compare with Ordinal type | 4.84 | 3.58 | 3.62 | 3.52
CompareOrdinal | 4.78 | 3.55 | 3.58 | 3.51
CompareOrdinal with Indices | 4.93 | 3.71 | 3.72 | 4.17
Simple Hand-Coded | 13.83 | 3.52 | 3.93 | 2.16
100,000 comparisons. Timings in milliseconds.
Unity 2018.1.0f2, Windows Standalone, IL2CPP 3.5, Core i7-6700K
24. IL2CPP
Method | Identical Content, Identical Length | Random Content, Identical Length | First Char Equal, Identical Length | Random Content, Random Length
String.Equals | 2.64 | 1.92 | 1.93 | 0.96
String.Equals with Ordinal type | 2.94 | 2.26 | 2.73 | 1.49
String.Compare | 40.98 | 30.61 | 60.82 | 29.26
String.Compare with Ordinal type | 3.18 | 1.46 | 2.29 | 1.32
CompareOrdinal | 2.99 | 1.18 | 2.06 | 1.12
CompareOrdinal with Indices | 5.56 | 3.93 | 4.08 | 4.41
Simple Hand-Coded | 14.14 | 3.78 | 4.14 | 2.35
100,000 comparisons. Timings in milliseconds.
Unity 2018.1.0f2, Windows Standalone, IL2CPP 4.6, Core i7-6700K
27. Conclusions & more questions
• String.Equals clearly wins for plain string comparison.
• .NET 4.6 has improvements for String.Compare variants.
• Ordinal comparisons clearly win on culture-sensitive APIs.
• Use String.CompareOrdinal instead of String.Compare.
• Use StringComparison.Ordinal on other String APIs.
• How does this map across platforms?
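As a hedged illustration of the ordering-related conclusions, the sketch below sorts internal identifiers with `StringComparer.Ordinal` and orders two strings with `String.CompareOrdinal`. The class and method names are assumptions made for this example.

```csharp
using System;

public static class OrdinalOrderingSketch
{
    // Sorts in place by UTF-16 code unit value, with no culture lookups.
    public static void SortIds(string[] ids)
    {
        Array.Sort(ids, StringComparer.Ordinal);
    }

    // True when a sorts strictly before b in ordinal order.
    public static bool ComesBefore(string a, string b)
    {
        return string.CompareOrdinal(a, b) < 0;
    }
}
```

Ordinal order is not alphabetical order as a user would expect it: all uppercase ASCII letters sort before lowercase ones. That makes it a reasonable fit for internal keys and identifiers, and a poor fit for lists shown to players.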
28. IL2CPP
Method | Identical Content, Identical Length | Random Content, Identical Length | First Char Equal, Identical Length | Random Content, Random Length
String.Equals | 13.48 | 5.08 | 5.01 | 5.26
String.Equals with Ordinal type | 25.42 | 19.46 | 19.85 | 14.16
String.Compare | 118.80 | 128.69 | 254.30 | 124.81
String.Compare with Ordinal type | 24.23 | 11.49 | 11.57 | 10.95
CompareOrdinal | 23.92 | 11.09 | 11.54 | 10.75
CompareOrdinal with Indices | 23.79 | 14.76 | 18.62 | 15.05
Simple Hand-Coded | 58.02 | 12.04 | 21.86 | 8.13
100,000 comparisons. Timings in milliseconds.
Unity 2018.1.0f2, iOS, IL2CPP 3.5, iPad Mini 3
30. Another tip!
• See a lot of time going to NullCheck in IL2CPP builds?
• Disable these checks in release builds!
• Works on types, methods & properties.
• Code is in IL2CppSetOptionAttribute.cs, under Unity install folder
[Il2CppSetOption(Option.NullChecks, false)]
public bool MyEquals(String strA, String strB) {
// …
}
31. IL2CPP
Configuration | Identical Content, Identical Length | Random Content, Random Length
Normal | 58.02 | 8.13
NullCheck Disabled | 53.02 | 7.03
100,000 comparisons. Timings in milliseconds.
Unity 2018.1.0f2, iOS, IL2CPP 3.5, iPad Mini 3
Small, but helpful.
34. OnTransformChanged
• Internal message, broadcast each time a Transform changes
• Position, rotation, scale, parent, sibling order, etc.
• Tells other components to update their internal state
• PhysX/Box2D, UnityUI, Renderers (AABBs), etc.
• Repeated messages can cause performance problems
• Use Transform.SetPositionAndRotation (5.6+)
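A small sketch of the last bullet, assuming a component that copies another Transform's pose every frame. The `FollowTarget` class and its `target` field are hypothetical names for illustration.

```csharp
using UnityEngine;

// Hypothetical example component: snaps this object to a target's pose.
public class FollowTarget : MonoBehaviour
{
    public Transform target; // assigned in the Inspector (assumption)

    void LateUpdate()
    {
        if (target == null)
            return;

        // One combined call instead of assigning transform.position and
        // transform.rotation separately, so the Transform changes once.
        transform.SetPositionAndRotation(target.position, target.rotation);
    }
}
```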
36. Enter the Dispatch
• TransformChangeDispatch was first introduced in 5.4
• Other systems migrated to use it, slowly.
• Renderers in 5.6
• Animators in 2017.1
• Physics in 2017.2
• RectTransforms in 2017.3
• OnTransformChanged was removed entirely in 2018.1
37. How Transforms are structured
• 1 TransformHierarchy structure represents a root Transform
• Contains buffers tracking data of all transforms in hierarchy
• TRS, indices for parents & siblings
• Interest bitmask & dirty bitmask
• Internal systems register interest & track state via specific bits
• Physics is one bit, renderer is another bit, etc.
• System walks affected parts of TransformHierarchy structure
• dirtyMask |= -1 & interestMask
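The interest/dirty bitmask idea can be sketched in plain C#. This is only an illustration of the bit arithmetic on this slide: the bit assignments, the method names and the `ClearFor` step are assumptions, and none of it is Unity's internal code.

```csharp
using System;

public static class TransformMaskSketch
{
    // Hypothetical bit assignments: one bit per interested system.
    public const int PhysicsBit  = 1 << 0;
    public const int RendererBit = 1 << 1;
    public const int AnimatorBit = 1 << 2;

    // On a change, set the dirty bit of every system that registered
    // interest. Mirrors "dirtyMask |= -1 & interestMask" from the slide.
    public static int MarkChanged(int dirtyMask, int interestMask)
    {
        return dirtyMask | (-1 & interestMask);
    }

    // A system checks only its own bit.
    public static bool IsDirtyFor(int dirtyMask, int systemBit)
    {
        return (dirtyMask & systemBit) != 0;
    }

    // Assumed step: a system clears its bit after consuming the change.
    public static int ClearFor(int dirtyMask, int systemBit)
    {
        return dirtyMask & ~systemBit;
    }
}
```

With physics and renderer interest registered, a change sets both of those bits and leaves the animator bit clear, so a system that never registered interest has nothing to process.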
38. When are async changes applied?
• TCD keeps a list of dirty TransformHierarchy pointers
• Systems request list of changed Transforms before running.
• e.g. Before FixedUpdate, before rendering, before animating.
• Use list to update internal system state.
• TCD iterates over list of dirty TransformHierarchies.
• Iterates over all Transforms to check each dirty bit.
40. Quick Reminder
• Buffer size: Transform.hierarchyCapacity
• Set before mass reparenting operations!
• Reparent & reposition during instantiate!
• GameObject.Instantiate( prefab, parent );
• GameObject.Instantiate( prefab, position, rotation, parent );
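A rough sketch of both reminders, assuming a spawner that fills one root with many children. The `BulkSpawner` class and its fields are hypothetical, and the capacity estimate assumes each prefab instance is a single Transform with no children.

```csharp
using UnityEngine;

// Hypothetical example: pre-sizes the hierarchy buffer, then spawns
// children that are parented and positioned at instantiation time.
public class BulkSpawner : MonoBehaviour
{
    public GameObject prefab;   // assumed: a single-Transform prefab
    public Transform root;      // the hierarchy that receives the children
    public int count = 1000;

    public void SpawnAll()
    {
        // Grow the buffer once up front instead of letting it resize
        // repeatedly while children are added.
        int needed = root.hierarchyCount + count;
        if (root.hierarchyCapacity < needed)
            root.hierarchyCapacity = needed;

        for (int i = 0; i < count; ++i)
        {
            Vector3 position = new Vector3(i, 0f, 0f);
            // Parent and pose are set during instantiation, avoiding a
            // separate reparent and move afterwards.
            Instantiate(prefab, position, Quaternion.identity, root);
        }
    }
}
```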
41. Split your hierarchies.
• Changing any Transform marks the whole Hierarchy dirty.
• Dirty hierarchies must be fully examined for change bits.
• Smaller hierarchies = more granular Hierarchy tracking.
• Smaller hierarchies = fewer Transforms to check.
• Fewer roots = more Transforms to check for changes.
• Change checks are jobified, but operate on roots.
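One possible way to act on this, assuming some objects are parented purely for organisation in the Editor and do not need to move with their parent at runtime. The `DetachAtRuntime` component is a hypothetical helper, not something from the talk.

```csharp
using UnityEngine;

// Hypothetical helper: turns this object into its own root at startup,
// so changes to it no longer dirty a large shared hierarchy.
public class DetachAtRuntime : MonoBehaviour
{
    void Awake()
    {
        // worldPositionStays = true keeps the object's world-space pose.
        transform.SetParent(null, true);
    }
}
```

This only makes sense for objects that never rely on inheriting their parent's movement; anything that must follow its parent has to stay in the hierarchy.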
43. A welcome effect.
Parented Unparented
Main Thread 553 ms 32 ms
Worker Threads 139 ms 14 ms
100 Rotating Cubes, 100k Empty GameObjects.
iPad Mini 3. CPU time used over 10 seconds.
44. This is just checking hierarchies!
Parented Unparented
Main Thread 1.77 ms 0.11 ms
100 Rotating Cubes, 100k Empty GameObjects.
iPad Mini 3. CPU time used over 10 seconds.
“PostLateUpdate.UpdateAllRenderers”
45. Transforms & Physics: 2017.2+
• 2017.1/older: Physics components were synced to Transforms.
• Each Transform change = expensive update of Physics scene.
• 2017.2/newer: Updates can be delayed to next FixedUpdate.
• Update Physics entities from set of changed Transforms.
• Re-indexing computations are batched.
• This could have side-effects!
• Move a Collider + immediately Raycast towards it? No bueno.
46. Physics.autoSyncTransforms
• When true, forces legacy behavior.
• Colliders/Rigidbodies check for syncs on every Physics call.
• Yes, every Raycast, Spherecast, etc.
• Huge performance regression, if you’re not batching updates.
• When false, uses delayed-update behavior.
• Can force updates: Physics.SyncTransforms
• Default value is true in 2017.2 through 2018.2
• 2018.3 is the first version where the default is false.
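A minimal sketch of opting into the delayed-update behaviour and forcing a sync only where a query needs fresh data. The scripting property is spelled `Physics.autoSyncTransforms`; the `MoveThenQuery` class and its `movers` field are assumptions for illustration.

```csharp
using UnityEngine;

// Hypothetical example: batch Transform changes, sync once, then query.
public class MoveThenQuery : MonoBehaviour
{
    public Transform[] movers; // objects with Colliders (assumption)

    void Awake()
    {
        // Opt out of the legacy "sync on every physics call" behaviour.
        Physics.autoSyncTransforms = false;
    }

    void Update()
    {
        float rotAmt = 2f * Time.deltaTime;
        for (int i = 0; i < movers.Length; ++i)
            movers[i].Rotate(Vector3.up, rotAmt);

        // Push all pending Transform changes to the physics scene once,
        // so the raycasts below see the new Collider poses.
        Physics.SyncTransforms();

        for (int i = 0; i < movers.Length; ++i)
            Physics.Raycast(Vector3.zero, movers[i].forward);
    }
}
```

If the queries can tolerate poses from the previous FixedUpdate, the explicit `Physics.SyncTransforms()` call can be dropped as well.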
47. A test.
void Update()
{
    float rotAmt = 2f * Time.deltaTime;
    Vector3 up = Vector3.up;
    if (batched)
    {
        // "Batched": rotate everything first, then do all the raycasts.
        for(int i = 0; i < NUM_PARENTS; ++i)
            rotators[i].Rotate(up, rotAmt);
        for(int i = 0; i < NUM_PARENTS; ++i)
            Physics.Raycast(Vector3.zero, Random.insideUnitSphere);
    }
    else
    {
        // "Immediate": raycast right after each rotation.
        for (int i = 0; i < NUM_PARENTS; ++i)
        {
            rotators[i].Rotate(up, rotAmt);
            Physics.Raycast(Vector3.zero, Random.insideUnitSphere);
        }
    }
}
48. Seriously, a big effect.
Category | Parented, Immediate | Unparented, Immediate | Parented, Batched | Unparented, Batched
Script | 4450 ms | 4270 ms | 1980 ms | 882 ms
Physics | 1410 ms | 1820 ms | 1840 ms | 1770 ms
100 Rotating Cubes, Rigidbodies, Trigger Box Colliders. 100k Empty GameObjects.
App Framerate: 30. Physics Timestep 0.04 sec.
iPad Mini 3. CPU time used over 10 seconds.
50. The Basics
• Unity uses FMOD internally.
• Audio decoding & playback occurs on separate threads.
• Unity supports a handful of codecs.
• PCM
• ADPCM
• Vorbis
• MP3
51. Audio “Load Type” Setting
• Decompress On Load
• Decoding & file I/O happen at load time only.
• Compressed In Memory
• Decoding happens during playback.
• Streamed
• File I/O & decoding happen during playback.
52. Every frame…
• Unity iterates over all active Audio Sources.
• Calculates distance to Listener(s).
• FMOD mixes active Audio Sources (“voices”).
• True volume = Volume setting * distance to listener * clip.
• If the clip is compressed, FMOD must decode audio data
• Chooses X loudest voices to mix together.
• X = “Real Voices” audio setting.
53. Everything is done in software.
• Decoding & mixing are done entirely in software.
• Mixing occurs on the FMOD thread.
• Decoding occurs at loading time or on the FMOD thread.
• All playing voices are evaluated and mixed.
• Max number of voices is controlled by Audio settings.
54. A trap.
This voice is Muted.
This voice is Active.
This voice will not be heard,
but the Clip must be processed.
55. A warning.
• AudioSystem.Update is Unity updating the AudioSources which are submitted to FMOD for playback.
• Audio decoding does not show up in the Unity CPU Profiler.
56. Check both places!
• Decoding & mixing audio is in the details of the Audio profiler.
60. ~Test time~ <(^^<) (>^^)>
• Identical 4 minute audio clip, copied 4 times.
• Once per codec under test.
• Varying number of AudioSources.
• Captured CPU time on main & FMOD threads
• Sum of CPU time consumed over 10 seconds real-time
61. Again.
10 Clips 100 Clips 500 Clips
PCM 95 ms 467 ms 2040 ms
ADPCM 89 ms 474 ms 2070 ms
MP3 84 ms 469 ms 2030 ms
Vorbis 93 ms 473 ms 1990 ms
CPU time on main thread, 10 seconds real-time.
62. With intensity.
10 Voices 100 Voices 500 Voices
PCM 214 ms 451 ms 634 ms
ADPCM 485 ms 1391 ms 1591 ms
MP3 1058 ms 4061 ms 4167 ms
Vorbis 1161 ms 3408 ms 3629 ms
CPU time on all FMOD threads, 10 seconds real-time.
63. Principles.
• Avoid having many audio sources set to Mute.
• Disable/Stop instead of Mute, if possible.
• If you can afford the memory overhead, Decompress on Load.
• Best for short clips that are frequently played.
• Avoid playing lots of compressed Clips, especially on mobile.
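A short sketch of the "Stop instead of Mute" principle. The `SilenceableSource` class and its `source` field are hypothetical names for this example.

```csharp
using UnityEngine;

// Hypothetical example: silences a source by stopping it, not muting it.
public class SilenceableSource : MonoBehaviour
{
    public AudioSource source; // assigned in the Inspector (assumption)

    public void Silence()
    {
        // Setting source.mute = true would leave the voice playing, so
        // its clip would still be processed. Stopping ends playback.
        source.Stop();
    }

    public void Resume()
    {
        if (!source.isPlaying)
            source.Play();
    }
}
```

Note that `Play()` after `Stop()` restarts the clip from the beginning, so this swap suits one-shots and loops where the playback position does not matter.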
64. Or clamp the voice count.
10 Playing Clips 100 Playing Clips 500 Playing Clips
512 VV 318 ms 923 ms 2708 ms
100 VV 304 ms 905 ms 1087 ms
10 VV 315 ms 350 ms 495 ms
1 VV 173 ms 210 ms 361 ms
PCM. CPU time on FMOD + Main threads, 10 seconds real-time.
65. How, you ask?
public void SetNumVoices(int nv) {
var config = AudioSettings.GetConfiguration();
if(config.numVirtualVoices == nv)
return;
config.numVirtualVoices = nv;
config.numRealVoices = Mathf.Clamp(config.numRealVoices,
1, config.numVirtualVoices);
AudioSettings.Reset(config);
}
Just an example! Probably too simple for real use.
67. Animator
• Formerly called Mecanim.
• Graph of logical states.
• Blends between states.
• States contain animation clips and/or blend trees.
• Animator component attached to GameObject.
• AnimatorController referenced by Animator component.
68. Playables
• Technology underlying Animator & Timeline.
• Generic framework for “stuff that can be played back”.
• Animation clips, audio clips, video clips, etc.
• Docs: https://docs.unity3d.com/Manual/Playables.html
69. Animation
• Unity’s original animation system.
• Custom code
• Not based on Playables.
• Very simple: plays an animation clip.
• Can crossfade, loop.
71. The Test
• 100 GameObjects with Animator or Animation component
• Animator uses simple AnimatorController: 1 state, looping
• Animation plays back 1 AnimationClip, looping
72. [Chart: time per frame (0 to 30 ms) vs. curve count (1 to 800), Animation vs. Animator. 100 Components, Variable Curve Count, iPad Mini 3.]
74. [Chart: time per frame (0 to 3 ms) vs. curve count (1 to 800), Animation vs. Animator, with the iPad Mini 3 crossover point marked. 100 Components, Variable Curve Count, Win10/Core i7.]
75. [Chart: time per frame (0 to 3 ms) vs. component count (1 to 500), Animation vs. Animator. 100 Curves, Variable Component Count, Win10/Core i7.]
76. Scaling Factors
• Performance is heavily dependent on curve & core count.
• Fewer cores: Animation retains advantage longer.
• More cores: Animator rapidly outperforms Animation.
• Both systems scale linearly as number of Components rises.
• “Best” system determined by target hardware vs curve count.
• Use Animation for simple animations.
• Use Animators for high curve counts or complex scenarios.
77. [Chart: time per frame (0 to 40 ms) vs. component count (1 to 500), Animation vs. Animator. 100 Curves, Variable Component Count, iPad Mini 3.]
78. What about “constant” curves?
• Still interpolated at runtime.
• No measurable impact on CPU usage.
• Significant memory/file savings.
• Example: 11kb vs. 3.7kb for 100 position curves (XYZ)
80. Be careful with Layers!
• The active state on each layer will be evaluated once per frame.
• Layer Weight does not matter.
• Weight=0? Still evaluated!
• This is to ensure that state is correct.
• Zero-weight layers = wasted work
• Use layers sparingly!
(Yes, the docs are wrong.)
81. The Cost of Layering
1 Layer 2 Layers 3 Layers 4 Layers 5 Layers
Aggregate 1966 ms 2260 ms 2510 ms 2690 ms 2890 ms
Per Frame 10.08 ms 11.77 ms 12.86 ms 14.31 ms 17.65 ms
50 x “Ellen” from 3D Gamekit. Unity 2018.1.0f2.
Main Thread CPU time consumed during 10 Seconds Realtime.
iPad Mini 3.
83. Nope.
50 x “Ellen” from 3D Gamekit. Layers 2-5 Masked.
Main Thread CPU time consumed during 10 Seconds Realtime.
Unity 2018.1.0f2. iPad Mini 3.
1 Layer 2 Layers 3 Layers 4 Layers 5 Layers
Unmasked 1966 ms 2260 ms 2510 ms 2690 ms 2890 ms
60/108 Masked 1992 ms 2230 ms 2530 ms 2740 ms 2920 ms
84. Use the right rig!
• The Humanoid rig runs IK & retargeting calculations.
• The Generic rig does not.
1 Layer 2 Layers 3 Layers 4 Layers 5 Layers
Generic 1966 ms 2260 ms 2510 ms 2690 ms 2890 ms
Humanoid 2775 ms 3210 ms 3510 ms 3730 ms 4020 ms
Identical test to previous slide, different Rig import settings.
85. The pooling problem
• Animators reset their state when their GameObject is disabled.
• The only workaround? Disable Animator component, not GameObject.
• Leads to messy side effects, like having to manage other components (e.g. Colliders/Rigidbodies) manually.
• This made Animator-driven objects difficult to pool.
86. There’s an API to fix it, now!
• Animator.keepAnimatorControllerStateOnDisable
• Available in 2018.1+
• If true, Animators do not discard data buffers when their GameObject is disabled.
• Awesome for pooling!
• Careful of the higher memory usage of disabled Animators!
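A hedged sketch of how a pooled object might use this. It assumes the property is spelled `keepAnimatorControllerStateOnDisable`, as Unity's scripting reference lists it for 2018.x; the `PooledCharacter` class and its pool methods are hypothetical.

```csharp
using UnityEngine;

// Hypothetical example: a pooled object whose Animator keeps its state
// while the GameObject is deactivated.
public class PooledCharacter : MonoBehaviour
{
    Animator animator;

    void Awake()
    {
        animator = GetComponent<Animator>();
        if (animator != null)
        {
            // Assumed property name (2018.1+): retain controller state
            // across SetActive(false) / SetActive(true).
            animator.keepAnimatorControllerStateOnDisable = true;
        }
    }

    // Called by a pool (not shown) when the object is released.
    public void ReturnToPool()
    {
        gameObject.SetActive(false);
    }

    // Called by a pool (not shown) when the object is reused.
    public void TakeFromPool()
    {
        gameObject.SetActive(true);
    }
}
```

Because the retained buffers stay allocated while the object sits in the pool, the pool size is worth capping on memory-constrained devices.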