This document discusses .NET systems programming and garbage collection. It covers garbage collection generations, modes, and considerations for minimizing allocations. It also discusses eliminating delegates to reduce allocations, using value types appropriately, avoiding empty collections, and optimizing for thread locality to reduce context switching overhead. Data structures and synchronization techniques are discussed, emphasizing the importance of choosing lock-free data structures when possible to improve performance.
About Me
• .NET developer since 2005 (college internship)
• Built large-scale SaaS on top of .NET
• Creator and maintainer of Akka.NET since 2013
  • Canonical actor model implementation in .NET
  • Highly concurrent, low-latency, and distributed
  • Used to build mission-critical real-time applications
  • Performance is a feature
GC Generations
The higher the generation, the more expensive the GC:
• Memory is more fragmented (access is random, not contiguous)
• Compaction takes longer (bigger gaps, more stuff to move, longer GC pauses)
6.
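To see promotion in action, here is a small sketch (my own illustration, not from the talk) that uses `GC.GetGeneration` to watch an object age toward Gen 2 as it survives forced collections; exact promotion timing can vary with GC configuration:

```csharp
using System;

class GenDemo
{
    // Returns the object's generation after each forced, blocking collection.
    public static int[] Generations()
    {
        var obj = new object();
        var gens = new int[3];
        gens[0] = GC.GetGeneration(obj); // freshly allocated: gen 0
        GC.Collect();
        gens[1] = GC.GetGeneration(obj); // survived one collection: promoted
        GC.Collect();
        gens[2] = GC.GetGeneration(obj); // survived two: typically gen 2
        GC.KeepAlive(obj);               // keep the object rooted through the reads
        return gens;
    }

    static void Main() => Console.WriteLine(string.Join(",", Generations()));
}
```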
.NET Memory Model
private readonly Random myRandom = Random.Shared;

private void DoThing()
{
    var i = myRandom.Next();
    var j = myRandom.Next(i);
    var b = i + j;
    var str = b.ToString();
    Console.WriteLine(str);
}
Stack
0xAEDC DoThing_vtable
0xFFBD ref(Random.Shared)
0x11CD i = 10;
0x11CE j = 5;
0x11CF b = 15;
0xADDE ref(string)
Managed Heap
0xAEDC class Thing_DoThing mthd
0xFFBD Random.Shared [1024b]
…
0xADDE string "15"
7.
GC Considerations
• If you can: keep allocations in Gen 0 / 1
  • Value types (no GC)
  • Less memory fragmentation, compaction
  • Less impact on latency, throughput
• If you can’t: keep Gen 2 objects in Gen 2 forever
  • No GC if they’re still rooted!
8.
GC Practice: Object Pools
• Microsoft.Extensions.ObjectPool<T> - great option for long-lived Gen 2 objects
• Best candidates are “reusable” types
  • StringBuilder
  • byte[] (there are separate MemoryPool types for this)
• Use pre-allocated objects, return to pool upon completion
• Doesn’t cause allocations so long as pool capacity isn’t exceeded
9.
GC Practice: Object Pools
StringBuilder sb = null;
try
{
    // Rent an instance from the ObjectPool<StringBuilder>
    sb = _sbPool.Get();
    using (var tw = new StringWriter(sb, CultureInfo.InvariantCulture))
    {
        var ser = JsonSerializer.CreateDefault(Settings);
        ser.Formatting = Formatting.None;
        using (var jw = new JsonTextWriter(tw))
        {
            // Do our work
            ser.Serialize(jw, obj);
        }
        return Encoding.UTF8.GetBytes(tw.ToString());
    }
}
finally
{
    if (sb != null)
    {
        // Return to the pool
        _sbPool.Return(sb);
    }
}
10.
GC Practice: Object Pools
• Pooling StringBuilder inside Newtonsoft.Json
  • ~30% memory savings, eliminated 100% of Gen 1 GC
  • ~28% throughput improvement in concurrent use cases
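The pattern can be sketched with a minimal, hand-rolled pool (hypothetical `SimplePool<T>`; the real `Microsoft.Extensions.ObjectPool<T>` adds pluggable policies and a pre-built `StringBuilder` policy):

```csharp
using System;
using System.Collections.Concurrent;
using System.Text;

// Minimal sketch of the object-pool pattern using only the BCL.
public sealed class SimplePool<T> where T : class
{
    private readonly ConcurrentQueue<T> _items = new ConcurrentQueue<T>();
    private readonly Func<T> _factory;
    private readonly Action<T> _reset;
    private readonly int _maxRetained;

    public SimplePool(Func<T> factory, Action<T> reset, int maxRetained = 64)
    {
        _factory = factory;
        _reset = reset;
        _maxRetained = maxRetained;
    }

    // Reuse a pooled instance if available; otherwise allocate a new one.
    public T Get() => _items.TryDequeue(out var item) ? item : _factory();

    public void Return(T item)
    {
        if (_items.Count < _maxRetained)
        {
            _reset(item); // e.g. StringBuilder.Clear()
            _items.Enqueue(item);
        }
        // Over capacity: drop the item and let the GC collect it.
    }
}

// Usage:
// var pool = new SimplePool<StringBuilder>(() => new StringBuilder(), sb => sb.Clear());
// var sb = pool.Get();  ...  pool.Return(sb);
```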
Allocations: Delegates and Closures
/// <summary>
/// Processes the contents of the mailbox
/// </summary>
public void Run()
{
    try
    {
        if (!IsClosed()) // Volatile read, needed here
        {
            // Critical path of actor msg processing:
            // closes over ‘this’, allocates a delegate each time
            Actor.UseThreadContext(() =>
            {
                ProcessAllSystemMessages(); // First, deal with any system messages
                ProcessMailbox(); // Then deal with messages
            });
        }
    }
    finally
    {
        SetAsIdle(); // Volatile write, needed here
        Dispatcher.RegisterForExecution(this, false, false); // schedule to run again if there are more messages, possibly
    }
}
16.
Eliminate Delegate: Inlining
/// <summary>
/// Processes the contents of the mailbox
/// </summary>
public void Run()
{
    try
    {
        if (!IsClosed()) // Volatile read, needed here
        {
            var tmp = InternalCurrentActorCellKeeper.Current;
            InternalCurrentActorCellKeeper.Current = Actor;
            try
            {
                ProcessAllSystemMessages(); // First, deal with any system messages
                ProcessMailbox(); // Then deal with messages
            }
            finally
            {
                // ensure we set back the old context
                InternalCurrentActorCellKeeper.Current = tmp;
            }
        }
    }
    finally
    {
        SetAsIdle(); // Volatile write, needed here
        Dispatcher.RegisterForExecution(this, false, false); // schedule to run again if there are more messages, possibly
    }
}
Eliminate delegate by inlining the function
From 21kb & 203kb to ~1kb
Throughput improvement of ~10%
17.
Other Delegate Allocation Removal Methods
• C# 9: declare `static` delegates
• Cache delegates / use expression compiler
• Value Delegates
18.
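As a sketch of the first technique (my own example, hypothetical `Worker` type): a `static` lambda (C# 9) cannot capture `this` or locals, so the compiler can cache a single delegate instance rather than allocating one per call:

```csharp
using System;

public class Worker
{
    public int Calls;

    public void Run()
    {
        // `static` lambda: captures are forbidden, so the compiler can
        // cache one delegate instance and reuse it across all calls.
        Invoke(static () => Console.WriteLine("work"));

        // Non-static lambda capturing `this`: a closure/delegate may be
        // allocated on each call through this path.
        Invoke(() => Calls++);
    }

    private static void Invoke(Action action) => action();
}
```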
Value Delegates
private readonly struct RequestWorkerTask : IRunnable
{
    private readonly DedicatedThreadPoolTaskScheduler _scheduler;

    public RequestWorkerTask(DedicatedThreadPoolTaskScheduler scheduler)
    {
        _scheduler = scheduler;
    }

    public void Run()
    {
        // do work
    }
}

private void RequestWorker()
{
    _pool.QueueUserWorkItem(new RequestWorkerTask(this));
}
Implement our “delegate interface” using a value type
Runs just the same as a reference type
Execute the work (might cause a boxing allocation!)
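The boxing caveat can be avoided with a generic constraint: calling through `T : IRunnable` dispatches on the struct directly, while passing it as a plain interface parameter boxes it. A minimal sketch (hypothetical `CountTask`/`Executor` names, standalone `IRunnable` to keep it self-contained):

```csharp
using System;

public interface IRunnable { void Run(); }

// A "value delegate": a struct implementing the work interface.
public readonly struct CountTask : IRunnable
{
    private readonly int[] _counter;
    public CountTask(int[] counter) => _counter = counter;
    public void Run() => _counter[0]++;
}

public static class Executor
{
    // Generic constraint: the struct stays unboxed; the call is
    // dispatched directly with no heap allocation.
    public static void Execute<T>(T runnable) where T : IRunnable
        => runnable.Run();

    // Taking the interface type instead boxes the struct onto the heap.
    public static void ExecuteBoxed(IRunnable runnable) => runnable.Run();
}
```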
Reference Type: FSM Events
public sealed class Event<TD> : INoSerializationVerificationNeeded
{
public Event(object fsmEvent, TD stateData)
{
StateData = stateData;
FsmEvent = fsmEvent;
}
public object FsmEvent { get; }
public TD StateData { get; }
public override string ToString()
{
return $"Event: <{FsmEvent}>, StateData: <{StateData}>";
}
}
We allocate millions of these per second in busy networks
23.
public readonly struct Event<TD> : INoSerializationVerificationNeeded
{
public Event(object fsmEvent, TD stateData)
{
StateData = stateData;
FsmEvent = fsmEvent;
}
public object FsmEvent { get; }
public TD StateData { get; }
public override string ToString()
{
return $"Event: <{FsmEvent}>, StateData: <{StateData}>";
}
}
Value Type: FSM Events
Change to value type
Reduction of ~30mb; minor throughput improvement
24.
Value Types: Boxing Allocations
• Boxing occurs implicitly – when a struct is cast to an object
  • The struct is wrapped in an object and placed on the managed heap
• Unboxing happens explicitly – when the object is cast back to its associated value type
• Can create a lot of allocations!
StateName is usually an enum (value type) – is the object.Equals call boxing?
25.
Value Types: Boxing Allocations
// avoid boxing
if (!EqualityComparer<TState>.Default.Equals(_currentState.StateName, nextState.StateName) || nextState.Notifies)
{
_nextState = nextState;
HandleTransition(_currentState.StateName, nextState.StateName);
Listeners.Gossip(new Transition<TState>(Self, _currentState.StateName, nextState.StateName));
_nextState = default;
}
Used a generic comparer to avoid casting value types to object – removed 100% of boxing allocations at this callsite.
26.
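A minimal sketch of the difference, assuming a hypothetical enum state type:

```csharp
using System;
using System.Collections.Generic;

enum State { Idle, Busy }

static class BoxingDemo
{
    // Casting the enum to object boxes it (and the argument is boxed
    // implicitly when passed to Equals(object)).
    public static bool CompareBoxed(State a, State b)
        => ((object)a).Equals(b);

    // EqualityComparer<T>.Default compares the values directly: no boxing.
    public static bool CompareUnboxed(State a, State b)
        => EqualityComparer<State>.Default.Equals(a, b);

    static void Main()
    {
        Console.WriteLine(CompareBoxed(State.Idle, State.Idle));   // True
        Console.WriteLine(CompareUnboxed(State.Idle, State.Busy)); // False
    }
}
```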
Value Type: MessageEnvelope
/// <summary>
/// Envelope class, represents a message and the sender of the message.
/// </summary>
public readonly struct Envelope
{
public Envelope(object message, IActorRef sender)
{
Message = message;
Sender = sender;
}
public IActorRef Sender { get; }
public object Message { get; }
}
Used millions of times per second in Akka.NET
readonly struct? Value type? Should be “zero allocations”
27.
Reference Type: MessageEnvelope
/// <summary>
/// Envelope class, represents a message and the sender of the message.
/// </summary>
public sealed class Envelope
{
public Envelope(object message, IActorRef sender)
{
Message = message;
Sender = sender;
}
public IActorRef Sender { get; }
public object Message { get; }
}
What if we change to a reference type? Will this reduce allocations?
struct → class:
394 kb → 264 kb
3.15 mb → 2.1 mb
215 us → 147 us
1860 us → 1332 us
28.
Value Type Pitfalls
• Copy-by-value
  • Referencing value types from other scopes requires copying
  • ref parameters can work, but in narrowly defined contexts
  • Excessive copying can be more expensive than allocating a reference
• Use reference types when semantics are “referential”
• Value types are not magic – they work best in “tight” scopes
• Use the right tool for the job
30.
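A quick sketch of the copy semantics (hypothetical `Point` struct):

```csharp
using System;

struct Point { public int X; }

class CopyDemo
{
    static void Bump(Point p) => p.X++;        // mutates a copy; caller unaffected
    static void BumpRef(ref Point p) => p.X++; // `ref` avoids the copy

    public static int[] Run()
    {
        var p = new Point { X = 1 };
        Bump(p);                 // p.X is still 1: the method got a copy
        var afterCopy = p.X;
        BumpRef(ref p);          // p.X is now 2: the method got a reference
        return new[] { afterCopy, p.X };
    }

    static void Main() => Console.WriteLine(string.Join(",", Run()));
}
```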
Reference Type: MessageEnvelope
• What happens when we benchmark with significantly increased cross-thread message traffic?
• Now if we convert Envelope back into a struct again…
• Thread access makes a difference!
ThreadStatic and ThreadLocal<T>
• Allocates objects directly into thread-local storage
  • Objects stay there and are available each time the thread is used
• Ideal for caching and pooling
• No synchronization
  • Data and work all performed adjacent to stack memory
• Downside: thread-local data structures aren’t synchronized
• Variants!
33.
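A small sketch of both variants (hypothetical buffer/counter names):

```csharp
using System;
using System.Threading;

class TlsDemo
{
    // ThreadLocal<T>: lazily allocates one buffer per thread via the factory.
    private static readonly ThreadLocal<char[]> Buffer =
        new ThreadLocal<char[]>(() => new char[256]);

    // [ThreadStatic]: one field slot per thread; no initialization logic runs.
    [ThreadStatic] private static int _callCount;

    public static bool ReusesBufferOnSameThread()
    {
        _callCount++;                  // safe: only this thread sees it
        var first = Buffer.Value;
        var second = Buffer.Value;     // same array: no lock, no new allocation
        return ReferenceEquals(first, second);
    }

    static void Main() => Console.WriteLine(ReusesBufferOnSameThread()); // True
}
```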
Thread Local Storage & Context Switching
• Reference types passed between threads often age into older generations of GC
• Value types passed between threads are copied (no GC)
• Thread-local state is typically copied into the CPU’s L1/L2 cache from memory during execution
• Context switching occurs when threads get scheduled onto different CPUs or work moves onto different threads
34.
Thread Locality & Context Switching
Each thread gets ~30ms of execution time before yielding
35.
Thread Locality & Context Switching
Current quantum is over – time for other threads to have a turn
36.
Thread Locality & Context Switching
Context switch! Thread 0 now executing on CPU 1 – memory and state will have to be transferred.
37.
Context Switching: High Latency Impact
/// <summary>
/// An asynchronous operation will be executed by a <see cref="MessageDispatcher"/>.
/// </summary>
#if NETSTANDARD
public interface IRunnable
#else
public interface IRunnable : IThreadPoolWorkItem
#endif
{
    /// <summary>
    /// Executes the task.
    /// </summary>
    void Run();
}

// use native .NET 6 APIs here to reduce allocations
// preferLocal to help reduce context switching
ThreadPool.UnsafeQueueUserWorkItem(run, true);

IThreadPoolWorkItem interface added in .NET Core 3.0 – avoids delegate allocations for executing on the ThreadPool
Consume IThreadPoolWorkItem with preferLocal=true – tells the ThreadPool to attempt to reschedule work on the current thread / CPU.
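A self-contained sketch of queueing an `IThreadPoolWorkItem` without a delegate allocation (hypothetical `SignalWork` type; requires .NET Core 3.0 or later):

```csharp
using System;
using System.Threading;

// A reusable work item: implementing IThreadPoolWorkItem means no
// delegate or closure is allocated when it is queued.
public sealed class SignalWork : IThreadPoolWorkItem
{
    private readonly ManualResetEventSlim _done;
    public SignalWork(ManualResetEventSlim done) => _done = done;

    public void Execute() => _done.Set();
}

public static class Program
{
    public static void Main()
    {
        var done = new ManualResetEventSlim();
        // preferLocal: true queues onto the current thread's local queue,
        // making it likely the work runs on the same thread/CPU.
        ThreadPool.UnsafeQueueUserWorkItem(new SignalWork(done), preferLocal: true);
        Console.WriteLine(done.Wait(5000)); // True once the item has run
    }
}
```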
Thread Locality w/o Context Switching
No context switch – the same thread will have a chance to execute on the same CPU. Might be able to benefit from L1/L2 cache, locality of memory access, etc.
40.
Data Structures & Synchronization
/// <summary> An unbounded mailbox message queue. </summary>
public class UnboundedMessageQueue : IMessageQueue, IUnboundedMessageQueueSemantics
{
    private readonly ConcurrentQueue<Envelope> _queue = new ConcurrentQueue<Envelope>();

    /// <inheritdoc cref="IMessageQueue"/>
    public bool HasMessages
    {
        get { return !_queue.IsEmpty; }
    }

    /// <inheritdoc cref="IMessageQueue"/>
    public int Count
    {
        get { return _queue.Count; }
    }
    // ...
}
Could, in theory, improve memory performance by replacing with a LinkedList (no array segment allocations from resizing)
41.
Data Structures & Synchronization
/// <summary> An unbounded mailbox message queue. </summary>
public class UnboundedMessageQueue : IMessageQueue, IUnboundedMessageQueueSemantics
{
    private readonly object s_lock = new object();
    private readonly LinkedList<Envelope> _linkedList = new LinkedList<Envelope>();

    public bool HasMessages
    {
        get { return Count > 0; }
    }

    public int Count
    {
        get
        {
            lock (s_lock)
            {
                return _linkedList.Count;
            }
        }
    }
    // ...
}
Not a thread-safe data structure, has to be synchronized with a lock
Should offer better memory performance than ConcurrentQueue<T>
Wooooooof 🤮
42.
Data Structures & Synchronization
• What went wrong there?
• ConcurrentQueue<T> is lock-free
  • Uses volatile and atomic compare-and-swap operations, i.e. Interlocked.CompareExchange
  • Significantly less expensive, even on a single thread, than lock
• LinkedList<T> may not be all that memory efficient
  • Internal data structure allocations per-insert rather than array block allocations
  • Better off rolling your own, probably
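To illustrate the compare-and-swap primitive those bullets refer to, here is a minimal lock-free counter sketch (my own example, not from ConcurrentQueue<T> internals):

```csharp
using System;
using System.Threading;

public static class CasCounter
{
    private static int _count;

    // Lock-free add: retry the compare-and-swap until no other thread
    // changed the value between our read and our write.
    public static void Add(int delta)
    {
        int current;
        do
        {
            current = Volatile.Read(ref _count);
        } while (Interlocked.CompareExchange(ref _count, current + delta, current) != current);
    }

    public static int Run()
    {
        var threads = new Thread[4];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() => { for (int j = 0; j < 1000; j++) Add(1); });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return _count; // no increments lost, and no lock was ever taken
    }

    public static void Main() => Console.WriteLine(Run());
}
```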
32 bytes adds up when you allocate millions of these per-second
https://stackoverflow.com/questions/16131641/memory-usage-of-an-empty-list-or-dictionary
public class List<T> : IList<T>, ICollection<T>, IList, ICollection, IReadOnlyList<T>, IReadOnlyCollection<T>, IEnumerable<T>, IEnumerable
{
    private T[] _items; // 4 bytes for x86, 8 for x64
    private int _size; // 4 bytes
    private int _version; // 4 bytes
    [NonSerialized]
    private object _syncRoot; // 4 bytes for x86, 8 for x64
    private static readonly T[] _emptyArray; // one per type
    private const int _defaultCapacity = 4; // one per type
    ...
}
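One way to avoid paying that cost for collections that are usually empty is to defer the `List<T>` allocation and expose a cached empty array instead. A sketch with a hypothetical `Node` type:

```csharp
using System;
using System.Collections.Generic;

class Node
{
    private List<Node> _children; // stays null until a child is actually added

    // Array.Empty<Node>() returns one shared, cached instance per type,
    // so empty nodes allocate nothing for their child collection.
    public IReadOnlyList<Node> Children =>
        (IReadOnlyList<Node>)_children ?? Array.Empty<Node>();

    public void AddChild(Node child) =>
        (_children ??= new List<Node>()).Add(child);
}

class Program
{
    static void Main()
    {
        var n = new Node();
        Console.WriteLine(n.Children.Count); // 0, and no List was allocated
        n.AddChild(new Node());
        Console.WriteLine(n.Children.Count); // 1
    }
}
```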