Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
How to add an optimization for C#
to JIT compiler
@EgorBo
Engineer at Microsoft
RuyJIT
From C# to machine code
2
C# IL IR ASM
float y = x / 2;
float y = x * 0.5f;
Let’s start with a simple optimization
3
vdivss (Latency: 20, R.Throughput: 14)
vmuls...
X / C
4
double y = x / 2;
float y = x / 2;
double y = x / 10;
double y = x / 8;
double y = x / -0.5;
float x = 48.665f;
Co...
X / C – let me optimize it in Roslyn!
5
static float GetSpeed(float distance, float time)
{
return distance / time;
}
...
...
Where to implement my custom optimization?
6
• Roslyn
+ No time constraints
+ It’s written in C# - easy to add optimizatio...
RuyJIT: phases
7
RuyJIT: where to morph my X/C?
8
GenTree
GenTreeStmt
GenTree
(GenTreeOp)
BasicBlock
GenTree
(GenTreeDblCon)
GenTree
(GenTreeDblCon)
GenTree
(GenTreeCall)
G...
Dump IR via COMPlus_JitDump
Back to X/C: Morph IR Tree
static float Calculate(float distance, float v0)
{
return distance / 2 + v0;
}
ADD
DIV
LCL_VAR
...
Implementing the opt in morph.cpp
DIV
op2op1
Inspired by GT_ROL
13
/// <summary>
/// Rotates the specified value left by the specified number of bits.
/// </summary>
p...
15
"Hello".Length -> 5
static void Append(string str)
{
if (str.Length <= 2)
QuickAppend(str);
else
SlowAppend(str);
}
......
16
"Hello".Length => 5
ARR_LENGTH
CNS_STR
(ScpHnd, SconCPX)
CNS_INT
(5)
VM
case GT_ARR_LENGTH:
{
if (op1->OperIs(GT_CNS_ST...
17
bool Test1(int x) => x % 4 == 0; bool Test1(int x) => x & 3 == 0;
MOD
EQ/NE
CNS_INT
(0)
CNS_INT
(4)
anything
op1 op2
op...
Roslyn and “!=“ operator
(unexpected optimization)
18
public static bool Test1(int x) => x != 42;
public static bool Test2...
19
Math.Pow(x, 2)
Math.Pow(x, 1)
Math.Pow(x, 4)
Math.Pow(x, -1)
// can be added:
Math.Pow(42, 3)
Math.Pow(1, x)
Math.Pow(2...
Range check elimination
20
Range check elimination
21
public static void Test(int[] a)
{
a[0] = 4;
a[1] = 2;
for (int i = 0; i < a.Length; i++)
{
a[i...
Range check elimination
22
public static void Test(int[] a)
{
if (a.Length <= 0) throw new IndexOutOfRangeException();
a[0...
Range check elimination
23
public static void Test(int[] a)
{
if (a.Length <= 0) throw new IndexOutOfRangeException();
a[0...
Range check elimination
24
public static void Test(int[] a)
{
if (a.Length <= 1) throw new IndexOutOfRangeException();
a[1...
Range check elimination
25
public static void Test(int[] a)
{
if (a.Length <= 1) throw new IndexOutOfRangeException();
a[1...
rangecheck.cpp (simplified)
26
void RangeCheck::OptimizeRangeCheck(GenTreeBoundsChk* bndsChk)
{
// Get the range for this ...
rangecheck.cpp (simplified)
27
void RangeCheck::OptimizeRangeCheck(GenTreeBoundsChk* bndsChk)
{
// Get the range for this ...
rangecheck.cpp (simplified)
28
void RangeCheck::OptimizeRangeCheck(GenTreeBoundsChk* bndsChk)
{
// Get the range for this ...
Byte array
29
private static readonly byte[] _data =
new byte[256] { 1, 2, 3, … };
public static byte GetByte(int i)
{
ret...
Byte array: Roslyn hack (new feature)
30
private static ReadOnlySpan<byte> _data =>
new byte[256] { 1, 2, 3, … };
public s...
Byte array: byte index (my PR)
31
private static ReadOnlySpan<byte> _data =>
new byte[256] { 1, 2, 3, … };
public static b...
rangecheck.cpp (simplified)
32
void RangeCheck::OptimizeRangeCheck(GenTreeBoundsChk* bndsChk)
{
// Get the range for this ...
Loop optimizations
33
Loop Invariant Code Hoisting
34
public static bool Test(int[] a, int c)
{
for (int i = 0; i < a.Length; i++)
{
if (a[i] ==...
Loop Invariant Code Hoisting
35
public static bool Test(int[] a, int c)
{
int tmp = c + 44;
for (int i = 0; i < a.Length; ...
NYI: Loop-unrolling
36
public static int Test(int[] a)
{
int sum = 0;
for (int i = 0; i < a.Length; i++)
{
sum += a[i];
}
...
NYI: Loop-unrolling
37
public static int Test(int[] a)
{
int sum = 0;
for (int i = 0; i < a.Length - 3; i += 4)
{
sum += a...
NYI: Loop-unswitch
38
public static int Test(int[] a, bool condition)
{
int agr = 0;
for (int i = 0; i < a.Length; i++)
{
...
NYI: Loop-unswitch
39
public static int Test (int[] a, bool condition)
{
int agr = 0;
if (condition)
for (int i = 0; i < a...
NYI: Loop-deletion
40
public static void Test()
{
for (int i = 0; i < 10; i++) { }
}
; Method Program:Test()
G_M44187_IG01...
NYI: Loop-deletion
41
public static void Test()
{
}
; Method Program:DeadLoop()
ret
; Total bytes of code: 1
NYI: InductiveRangeCheckElimination
42
public static void Zero1000Elements(int[] array)
{
for (int i = 0; i < 1000; i++)
a...
NYI: InductiveRangeCheckElimination
43
public static void Zero1000Elements(int[] array)
{
int limit = Math.Min(array.Lengt...
NYI: InductiveRangeCheckElimination
44
public static void Zero1000Elements(int[] array)
{
int limit = Math.Min(array.Lengt...
NYI: InductiveRangeCheckElimination
45
public static void Zero1000Elements(int[] array)
{
int limit = Math.Min(array.Lengt...
Q&A
Twitter: EgorBo
Upcoming SlideShare
Loading in …5
×

How to add an optimization for C# to RyuJIT

904 views

Published on

How to add an optimization for C# to RyuJIT

Published in: Software
  • Be the first to comment

How to add an optimization for C# to RyuJIT

  1. 1. How to add an optimization for C# to JIT compiler @EgorBo Engineer at Microsoft
  2. 2. RuyJIT From C# to machine code 2 C# IL IR ASM
  3. 3. float y = x / 2; float y = x * 0.5f; Let’s start with a simple optimization 3 vdivss (Latency: 20, R.Throughput: 14) vmulss (Latency: 5, R.Throughput: 0.5) ^ for my MacBook (Haswell)
  4. 4. X / C 4 double y = x / 2; float y = x / 2; double y = x / 10; double y = x / 8; double y = x / -0.5; float x = 48.665f; Console.WriteLine(x / 10f); // 4.8665 Console.WriteLine(x * 0.1f); // 4.8665004 = x * 0.5 = x * 0.5f = x * 0.1 = x * 0.125 = x * -2
  5. 5. X / C – let me optimize it in Roslyn! 5 static float GetSpeed(float distance, float time) { return distance / time; } ... float speed = GetSpeed(distance, 2); Does Roslyn see “X/C” here? NO! It doesn’t inline methods
  6. 6. Where to implement my custom optimization? 6 • Roslyn + No time constraints + It’s written in C# - easy to add optimizations, easy to debug and experiment - No cross-assembly optimizations - No CPU-dependent optimizations (IL is cross-platform) - Doesn’t know how the code will look like after inlining, CSE, loop optimizations, etc. - F# doesn’t use Roslyn • JIT + Inlining, CSE, Loop opts, etc phases create more opportunities for optimizations + Knows everything about target platform, CPU capabilities - Written in C++, difficult to experiment - Time constraints for optimizations (probably not that important with Tiering) • R2R (AOT) + No time constraints (some optimizations are really time consuming, e.g. full escape analysis) - No CPU-dependent optimizations - Will be most likely re-jitted anyway? • ILLink Custom Step + Cross-assembly IL optimizations + Written in C# + We can manually de-virtualize types/methods/calls (if we know what we are doing) - Still no inlining, CSE, etc..
  7. 7. RuyJIT: phases 7
  8. 8. RuyJIT: where to morph my X/C? 8
  9. 9. GenTree GenTreeStmt GenTree (GenTreeOp) BasicBlock GenTree (GenTreeDblCon) GenTree (GenTreeDblCon) GenTree (GenTreeCall) GenTree (GenTreeLclVar) static float Test(float x) { return Foo(x, 3.14) * 2; } Compiler MUL 2.0 3.14x Foo
  10. 10. Dump IR via COMPlus_JitDump
  11. 11. Back to X/C: Morph IR Tree static float Calculate(float distance, float v0) { return distance / 2 + v0; } ADD DIV LCL_VAR (v0) CNS_DBL (2.0) LCL_VAT (distance) ADD MUL LCL_VAR (v0) CNS_DBL (0.5) LCL_VAR (distance) Morph
  12. 12. Implementing the opt in morph.cpp DIV op2op1
  13. 13. Inspired by GT_ROL 13 /// <summary> /// Rotates the specified value left by the specified number of bits. /// </summary> public static uint RotateLeft(uint value, int offset) => (value << offset) | (value >> (32 - offset)); OR ROL / / / / LSH RSZ => x y / / x AND x AND / / y 31 ADD 31 / NEG 32 | y rol eax, cl
  14. 14. 15 "Hello".Length -> 5 static void Append(string str) { if (str.Length <= 2) QuickAppend(str); else SlowAppend(str); } ... builder.Append("/>"); builder.QuickAppend("/>"); Inline, remove if (2 <= 2)
  15. 15. 16 "Hello".Length => 5 ARR_LENGTH CNS_STR (ScpHnd, SconCPX) CNS_INT (5) VM case GT_ARR_LENGTH: { if (op1->OperIs(GT_CNS_STR)) { GenTreeStrCon* strCon = op1->AsStrCon(); int len = info.compCompHnd->getStringLength( strCon->gtScpHnd, strCon->gtSconCPX); return gtNewIconNode(len); } break; } JIT <-> VM Interface op1 Access VM’s data from JIT
  16. 16. 17 bool Test1(int x) => x % 4 == 0; bool Test1(int x) => x & 3 == 0; MOD EQ/NE CNS_INT (0) CNS_INT (4) anything op1 op2 op2op1 AND EQ/NE CNS_INT (0) CNS_INT (3) anything op1 op2 op2op1
  17. 17. Roslyn and “!=“ operator (unexpected optimization) 18 public static bool Test1(int x) => x != 42; public static bool Test2(int x) => x != 0; IL_0000: ldarg.0 IL_0001: ldc.i4.42 IL_0002: ceq IL_0004: ldc.i4.0 IL_0005: ceq IL_0007: ret IL_0000: ldarg.0 IL_0001: ldc.i4.0 IL_0002: cgt.un IL_0004: ret return (uint)x > 0 return (x == 42) == false JIT: ok, it’s GT_NE JIT: what?.. ok, GT_GT
  18. 18. 19 Math.Pow(x, 2) Math.Pow(x, 1) Math.Pow(x, 4) Math.Pow(x, -1) // can be added: Math.Pow(42, 3) Math.Pow(1, x) Math.Pow(2, x) Math.Pow(x, 0) Math.Pow(x, 0.5) | x * x | x | x * x * x * x | 1 / x | 74088 | 1 | exp2(x) | 1 | sqrt(x)
  19. 19. Range check elimination 20
  20. 20. Range check elimination 21 public static void Test(int[] a) { a[0] = 4; a[1] = 2; for (int i = 0; i < a.Length; i++) { a[i] = 0; } a[1] = 2; }
  21. 21. Range check elimination 22 public static void Test(int[] a) { if (a.Length <= 0) throw new IndexOutOfRangeException(); a[0] = 4; if (a.Length <= 1) throw new IndexOutOfRangeException(); a[1] = 2; for (int i = 0; i < a.Length; i++) { if (a.Length <= i) throw new IndexOutOfRangeException(); a[i] = 0; } if (a.Length <= 2) throw new IndexOutOfRangeException(); a[1] = 2; }
  22. 22. Range check elimination 23 public static void Test(int[] a) { if (a.Length <= 0) throw new IndexOutOfRangeException(); a[0] = 4; if (a.Length <= 1) throw new IndexOutOfRangeException(); a[1] = 2; for (int i = 0; i < a.Length; i++) { if (a.Length <= i) throw new IndexOutOfRangeException(); a[i] = 0; } if (a.Length <= 2) throw new IndexOutOfRangeException(); a[1] = 2; }
  23. 23. Range check elimination 24 public static void Test(int[] a) { if (a.Length <= 1) throw new IndexOutOfRangeException(); a[1] = 2; if (a.Length <= 1) throw new IndexOutOfRangeException(); a[0] = 4; for (int i = 0; i < a.Length; i++) { if (a.Length <= i) throw new IndexOutOfRangeException(); a[i] = 0; } if (a.Length <= 2) throw new IndexOutOfRangeException(); a[1] = 2; }
  24. 24. Range check elimination 25 public static void Test(int[] a) { if (a.Length <= 1) throw new IndexOutOfRangeException(); a[1] = 2; a[0] = 4; for (int i = 0; i < a.Length; i++) { a[i] = 0; } a[1] = 2; }
  25. 25. rangecheck.cpp (simplified) 26 void RangeCheck::OptimizeRangeCheck(GenTreeBoundsChk* bndsChk) { // Get the range for this index. Range range = GetRange(...); // If upper or lower limit is unknown, then return. if (range.UpperLimit().IsUnknown() || range.LowerLimit().IsUnknown()) { return; } // Is the range between the lower and upper bound values. if (BetweenBounds(range, 0, bndsChk->gtArrLen)) { m_pCompiler->optRemoveRangeCheck(treeParent, stmt); } return; }
  26. 26. rangecheck.cpp (simplified) 27 void RangeCheck::OptimizeRangeCheck(GenTreeBoundsChk* bndsChk) { // Get the range for this index. Range range = GetRange(...); // If upper or lower limit is unknown, then return. if (range.UpperLimit().IsUnknown() || range.LowerLimit().IsUnknown()) { return; } // Is the range between the lower and upper bound values. if (BetweenBounds(range, 0, bndsChk->gtArrLen)) { m_pCompiler->optRemoveRangeCheck(treeParent, stmt); } return; }
  27. 27. rangecheck.cpp (simplified) 28 void RangeCheck::OptimizeRangeCheck(GenTreeBoundsChk* bndsChk) { // Get the range for this index. Range range = GetRange(...); // If upper or lower limit is unknown, then return. if (range.UpperLimit().IsUnknown() || range.LowerLimit().IsUnknown()) { return; } // Is the range between the lower and upper bound values. if (BetweenBounds(range, 0, bndsChk->gtArrLen)) { m_pCompiler->optRemoveRangeCheck(treeParent, stmt); } return; }
  28. 28. Byte array 29 private static readonly byte[] _data = new byte[256] { 1, 2, 3, … }; public static byte GetByte(int i) { return _data[i]; }
  29. 29. Byte array: Roslyn hack (new feature) 30 private static ReadOnlySpan<byte> _data => new byte[256] { 1, 2, 3, … }; public static byte GetByte(int i) { return _data[i]; } 256
  30. 30. Byte array: byte index (my PR) 31 private static ReadOnlySpan<byte> _data => new byte[256] { 1, 2, 3, … }; public static byte GetByte(int i) { return _data[(byte)i]; } Byte indexer will never go out of bounds!
  31. 31. rangecheck.cpp (simplified) 32 void RangeCheck::OptimizeRangeCheck(GenTreeBoundsChk* bndsChk) { // Get the range for this index. Range range = GetRange(...); // If upper or lower limit is unknown, then return. if (range.UpperLimit().IsUnknown() || range.LowerLimit().IsUnknown()) { return; } // Is the range between the lower and upper bound values. if (BetweenBounds(range, 0, bndsChk->gtArrLen)) { m_pCompiler->optRemoveRangeCheck(treeParent, stmt); } return; } [Byte.MinValue ... Byte.MaxValue] ArrLen = 256
  32. 32. Loop optimizations 33
  33. 33. Loop Invariant Code Hoisting 34 public static bool Test(int[] a, int c) { for (int i = 0; i < a.Length; i++) { if (a[i] == c + 44) return false; } return true; }
  34. 34. Loop Invariant Code Hoisting 35 public static bool Test(int[] a, int c) { int tmp = c + 44; for (int i = 0; i < a.Length; i++) { if (a[i] == tmp) return false; } return true; }
  35. 35. NYI: Loop-unrolling 36 public static int Test(int[] a) { int sum = 0; for (int i = 0; i < a.Length; i++) { sum += a[i]; } return sum; }
  36. 36. NYI: Loop-unrolling 37 public static int Test(int[] a) { int sum = 0; for (int i = 0; i < a.Length - 3; i += 4) { sum += a[i]; sum += a[i+1]; sum += a[i+2]; sum += a[i+3]; } return sum; }
  37. 37. NYI: Loop-unswitch 38 public static int Test(int[] a, bool condition) { int agr = 0; for (int i = 0; i < a.Length; i++) { if (condition) agr += a[i]; else agr *= a[i]; } return agr; }
  38. 38. NYI: Loop-unswitch 39 public static int Test (int[] a, bool condition) { int agr = 0; if (condition) for (int i = 0; i < a.Length; i++) agr += a[i]; else for (int i = 0; i < a.Length; i++) agr *= a[i]; return agr; }
  39. 39. NYI: Loop-deletion 40 public static void Test() { for (int i = 0; i < 10; i++) { } } ; Method Program:Test() G_M44187_IG01: G_M44187_IG02: xor eax, eax G_M44187_IG03: inc eax cmp eax, 10 jl SHORT G_M44187_IG03 G_M44187_IG04: ret ; Total bytes of code: 10
  40. 40. NYI: Loop-deletion 41 public static void Test() { } ; Method Program:DeadLoop() ret ; Total bytes of code: 1
  41. 41. NYI: InductiveRangeCheckElimination 42 public static void Zero1000Elements(int[] array) { for (int i = 0; i < 1000; i++) array[i] = 0; // bound checks will be inserted here }
  42. 42. NYI: InductiveRangeCheckElimination 43 public static void Zero1000Elements(int[] array) { int limit = Math.Min(array.Length, 1000); for (int i = 0; i < limit; i++) array[i] = 0; // bound checks are not needed here! for (int i = limit; i < 1000; i++) array[i] = 0; // bound checks are needed here // so at least we could "zero" first `limit` elements without bound checks } NOTE: this LLVM optimization pass is not enabled by default in `opt –O2`. Contributed by Azul developers (LLVM for JVM)
  43. 43. NYI: InductiveRangeCheckElimination 44 public static void Zero1000Elements(int[] array) { int limit = Math.Min(array.Length, 1000); for (int i = 0; i < limit - 3; i += 4) { array[i] = 0; array[i+1] = 0; array[i+2] = 0; array[i+3] = 0; } for (int i = limit; i < 1000; i++) array[i] = 0; // bound checks are needed here // so at least we could "zero" first `limit` elements without bound checks } Now we can even unroll the first loop!
  44. 44. NYI: InductiveRangeCheckElimination 45 public static void Zero1000Elements(int[] array) { int limit = Math.Min(array.Length, 1000); memset(array, 0, limit); for (int i = limit; i < 1000; i++) array[i] = 0; // bound checks are needed here // so at least we could "zero" first `limit` elements without bound checks } Or just replace with memset call
  45. 45. Q&A Twitter: EgorBo

×