This document discusses various techniques for enhancing the performance of .NET applications, including:
1) Implementing value types correctly by overriding Equals, GetHashCode, and IEquatable<T> to avoid boxing;
2) Applying precompilation techniques like NGen to improve startup time;
3) Using unsafe code and pointers for high-performance scenarios like reading structures from streams at over 100 million structures per second;
4) Choosing appropriate collection types like dictionaries for fast lookups or linked lists for fast insertions/deletions.
3. Content
• Implementing value types correctly
• Applying precompilation
• Using unsafe code and pointers
• Choosing a collection
• Making your code as parallel as necessary
5. Two Categories of Types
• Reference types
  – Offer a set of managed services: locks, inheritance, and more
• Value types
  – Do not offer these services
• Additional superficial differences
  – Parameter passing
  – Equality
6. Object Layout
• Heap objects (reference types) have two header fields
• Stack objects (value types) don't have headers
• Why two kinds of types and object layouts?
7. Using Value Types
• Use value types when performance is critical
  – Creating a large number of objects
  – Creating a large collection of objects
13. GetHashCode
• Used by Dictionary, HashSet, and other collections
• Declared by System.Object, overridden by System.ValueType
• Must be consistent with Equals:
  A.Equals(B) ⟹ A.GetHashCode() == B.GetHashCode()
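The rules above can be sketched in a small value type. This is an illustrative example (Point2D is not a type from the talk): the IEquatable&lt;T&gt; overload avoids boxing, and GetHashCode agrees with Equals.

```csharp
using System;

// Illustrative value type implemented "correctly": collections such as
// Dictionary and HashSet can compare instances without boxing and without
// the slow reflection-based System.ValueType.Equals.
public struct Point2D : IEquatable<Point2D>
{
    public readonly int X, Y;

    public Point2D(int x, int y) { X = x; Y = y; }

    // Strongly-typed overload -- no boxing when called through IEquatable<T>
    public bool Equals(Point2D other) => X == other.X && Y == other.Y;

    // The object overload must agree with the typed one
    public override bool Equals(object obj) => obj is Point2D p && Equals(p);

    // Consistent with Equals: equal points produce equal hash codes
    public override int GetHashCode() => unchecked(X * 397 ^ Y);
}
```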
14. • Use value types in high-performance scenarios
  – Tight loops, large collections
• Implement value types correctly
  – Equals, IEquatable<T>, GetHashCode
15. Applying Precompilation
• Improving startup time
• Precompilation
  – NGen
  – Serialization assemblies
  – Regular expressions
• Other ways of improving startup time
  – Multi-core background JIT
  – MPGO
17. Improving Startup Time with NGen
• NGen precompiles .NET assemblies to native code

> ngen install MyApp.exe

  – Includes dependencies
  – Precompiled assemblies are stored in C:\Windows\Assembly\NativeImages_*
  – Falls back to the original assembly if the native image is stale
• Automatic NGen in Windows 8 and CLR 4.5
18. Multi-Core Background JIT
• Usually, methods are compiled to native code when first invoked
• Multi-core background JIT in CLR 4.5
  – Opt in using the System.Runtime.ProfileOptimization class

using System.Runtime;
ProfileOptimization.SetProfileRoot(folderName);
ProfileOptimization.StartProfile(profileName);

• Relies on profile information generated at runtime
  – Can use multiple profiles
19. RyuJIT
• A rewrite of the JIT compiler
  – Faster compilation (throughput)
  – Better code (quality)
20. Managed Profile-Guided Optimization (MPGO)
• Introduced in .NET 4.5
  – Improves precompiled assemblies' disk layout
  – Places hot code and data closer together on disk
• Relies on profile information collected at runtime
21. Improving Cold Startup
• I/O costs are the #1 thing to improve
• ILMerge (Microsoft Research)
• Executable packers
• Placing strong-named assemblies in the GAC
• Windows SuperFetch
22. Precompiling Serialization Assemblies
• Serialization often creates dynamic methods on first use
• These methods can be precompiled
  – SGen.exe creates precompiled serialization assemblies for XmlSerializer
  – Protobuf-net has a precompilation tool
23. Precompiling Regexes
• By default, the Regex class interprets the regular expression when you match it
• Regex can generate IL code instead of using interpretation
• Even better, you can precompile regular expressions to an assembly
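A minimal sketch of the first two options (the date pattern is illustrative, not from the talk):

```csharp
using System.Text.RegularExpressions;

// Interpreted (the default): the pattern is re-interpreted on every match.
var interpreted = new Regex(@"\d{4}-\d{2}-\d{2}");

// RegexOptions.Compiled: the pattern is compiled to IL once, trading a
// slower first construction for faster subsequent matching.
var compiled = new Regex(@"\d{4}-\d{2}-\d{2}", RegexOptions.Compiled);

bool ok = compiled.IsMatch("2013-06-01");
```

The third option, precompiling to an assembly, uses Regex.CompileToAssembly on .NET Framework, which writes the generated IL into a DLL you can reference at build time.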
25. Pointers? In C#?
• Raw pointers are part of the C# syntax
• Interoperability with Win32 and other DLLs
• Performance in specific scenarios
26. Pointers and Pinning
• We want to go from byte[] to byte*
• When getting a pointer to a heap object, what if the GC moves it?
• Pinning is required

byte[] source = ...;
fixed (byte* p = source)
{
    ...
}

• Directly manipulate memory

*p = (byte)12;
int x = *(int*)p;

• Requires an unsafe block and "Allow unsafe code"
27. Copying Memory Using Pointers
• Mimicking Array.Copy or Buffer.BlockCopy
• Better to copy more than one byte per iteration

fixed (byte* p = src)
fixed (byte* q = dst)
{
    long* pSrc = (long*)p;
    long* pDst = (long*)q;
    for (int i = 0; i < dst.Length / 8; ++i)
    {
        *pDst = *pSrc;
        ++pDst; ++pSrc;
    }
}

• Might be interesting to unroll the loop
28. Reading Structures
• Read structures from a potentially infinite stream

struct TcpHeader
{
    public uint SrcIP, DstIP;
    public ushort SrcPort, DstPort;
}

• Do it fast – several GB/s, >100M structures/second
  – We will look at multiple approaches and measure them
29. The Pointer-Free Approach

TcpHeader Read(byte[] data, int offset)
{
    MemoryStream ms = new MemoryStream(data, offset, data.Length - offset);
    BinaryReader br = new BinaryReader(ms);
    TcpHeader result = new TcpHeader();
    result.SrcIP = br.ReadUInt32();
    result.DstIP = br.ReadUInt32();
    result.SrcPort = br.ReadUInt16();
    result.DstPort = br.ReadUInt16();
    return result;
}
30. Marshal.PtrToStructure
• System.Runtime.InteropServices.Marshal is designed for interoperability scenarios
• Marshal.PtrToStructure seems useful

object PtrToStructure(IntPtr ptr, Type structureType)

• GCHandle can pin an object in memory and give us a pointer to it

GCHandle handle = GCHandle.Alloc(obj, GCHandleType.Pinned);
try
{
    IntPtr address = handle.AddrOfPinnedObject();
    ...
}
finally
{
    handle.Free();
}
31. Using Pointers
• Pointers can help by casting

fixed (byte* p = &data[offset])
{
    TcpHeader* pHeader = (TcpHeader*)p;
    return *pHeader;
}

• Very simple, doesn't require helper routines
32. A Generic Approach
• Unfortunately, T* doesn't work – T must be blittable

unsafe T Read<T>(byte[] data, int offset)
{
    fixed (byte* p = &data[offset])
    {
        return *(T*)p;
    }
}

• We can generate a method for each T and call it when necessary
  – Reflection.Emit
  – CSharpCodeProvider
  – Roslyn
34. Collection Considerations
• There are many built-in collection classes
  – There are even more in third-party libraries like C5
• Fundamental operations: insert, delete, find
• Evaluation criteria: the time complexity of the fundamental operations, and the memory overhead per element
35. Example: LinkedList<T>
• Doubly-linked list, lots of memory overhead per node
• Insertion and deletion are very fast – O(1)
• Lookup is slow – O(n)
36. Arrays
• Flat, sequential, statically sized
• Very fast access to elements
• No per-element overhead
• Foundation for many other collection classes
37. List<T>
• Dynamic (resizable) array
  – Doubles its size with each expansion
  – For 100,000,000 insertions: ⌈log₂ 100,000,000⌉ = 27 expansions
• Insertions not at the end are very expensive
  – Good for append-only data
• No specialized lookup facility
• Still no per-element overhead
38. LinkedList<T>
• Doubly-linked list
• Very flexible collection for insertions/deletions
• Still requires linear time – O(n) – for lookup
• Very big space overhead per element
39. Trees
• SortedDictionary<K,V> and SortedSet<T> are implemented with a balanced binary search tree
  – Efficient lookup by key
  – Sorted by key
• All fundamental operations take O(log n) time
  – For example, log₂ 100,000,000 is less than 27
  – Great for storing dynamic data that is queried often
• Big space overhead per element (several additional fields)
40. Associative Collections
• Dictionary<K,V> and HashSet<T> use hashing to arrange the elements
• Insertion, deletion, and lookup work in constant time – O(1)
  – GetHashCode must be well-distributed for this to happen
• Medium memory overhead
  – Combination of arrays and linked lists
  – Smaller than trees in most cases
42. Scenarios
• Word frequency in a large body of text
  – Dictionary<string, uint>
• Queue of orders in a restaurant
  – LinkedList<Order>
• Buffer of continuous log messages
  – List<LogMessage>
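The first scenario can be sketched in a few lines (the sample sentence is illustrative): each word costs one O(1) hash lookup, which is why Dictionary is the right fit here.

```csharp
using System;
using System.Collections.Generic;

// Word frequency with Dictionary<string, uint>: one constant-time
// lookup-and-update per word, regardless of how many distinct words exist.
var frequencies = new Dictionary<string, uint>();
string[] words = "the quick brown fox jumps over the lazy dog the".Split(' ');
foreach (string word in words)
{
    uint count;
    frequencies.TryGetValue(word, out count);  // count stays 0 if absent
    frequencies[word] = count + 1;             // insert or update
}
Console.WriteLine(frequencies["the"]);
```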
44. Tries
• A text editor needs to store a dictionary of words
  – "run", "dolphin", "regard", but also "running", "dolphins", "regardless"
  – Offers spell checking and automatic word completion
• HashSet
  – Super-fast spell checking
  – Not sorted, so automatic completion by prefix is O(n)
• SortedSet
  – Still fast spell checking
  – Sorted, but access to predecessor/successor is not exposed
• Enter: the trie
45. Trie Internals
• Very compact
  – Shared prefixes are only stored once
• Finding all words with a prefix is "by design"
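A minimal trie sketch, assuming a Dictionary per node for the children (not necessarily the talk's implementation): spell checking costs O(word length) regardless of dictionary size, and completion starts at the prefix's node instead of scanning all words.

```csharp
using System.Collections.Generic;

// Each node stores its children keyed by character; shared prefixes
// ("run"/"running") therefore share the same initial path of nodes.
class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord;  // true if a word ends at this node
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = _root;
        foreach (char c in word)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.IsWord = true;
    }

    // Spell check: walk one path, O(length of word)
    public bool Contains(string word)
    {
        TrieNode node = Find(word);
        return node != null && node.IsWord;
    }

    // Completion "by design": everything below Find(prefix) shares the prefix
    public bool HasPrefix(string prefix) => Find(prefix) != null;

    private TrieNode Find(string s)
    {
        TrieNode node = _root;
        foreach (char c in s)
            if (!node.Children.TryGetValue(c, out node))
                return null;
        return node;
    }
}
```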
46. Union-Find
• Tracking which nodes are in each connected component in a graph
  – Connected component = set of nodes that are connected
• Need to support fast insertion of new edges
• Basic operations required:
  – Find the connected component to which a node belongs
  – Unify two connected components into one
• Using a list of nodes per component makes merging expensive
• Enter: the disjoint set forest
47. Disjoint Set Forest
• Each node has a reference to its parent
  – The node without a parent is the representative of the set
• Union and find:
  – The representative identifies the connected component
  – Merging means updating representatives
• Problem: find could be O(n), fixed by:
  – Attaching the smaller tree to the larger one when merging
  – Flattening the hierarchy while running find
• O(α(n)) running time – α(n) is less than 5 for all practical values of n
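The two fixes above can be sketched as follows (an illustrative array-based implementation, using union by size and path halving as the flattening step):

```csharp
// Disjoint set forest: _parent[x] == x marks a representative.
class DisjointSets
{
    private readonly int[] _parent;
    private readonly int[] _size;

    public DisjointSets(int n)
    {
        _parent = new int[n];
        _size = new int[n];
        for (int i = 0; i < n; ++i) { _parent[i] = i; _size[i] = 1; }
    }

    // Find the representative, flattening the path as we walk up
    public int Find(int x)
    {
        while (_parent[x] != x)
        {
            _parent[x] = _parent[_parent[x]];  // path halving
            x = _parent[x];
        }
        return x;
    }

    // Attach the smaller tree under the larger one
    public void Union(int a, int b)
    {
        int ra = Find(a), rb = Find(b);
        if (ra == rb) return;
        if (_size[ra] < _size[rb]) { int t = ra; ra = rb; rb = t; }
        _parent[rb] = ra;
        _size[ra] += _size[rb];
    }
}
```

Adding a graph edge (u, v) is just Union(u, v); asking whether two nodes are connected is Find(u) == Find(v).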
49. Garbage Collection
• Garbage collection means we don't have to free memory manually
• Garbage collection isn't free and has performance trade-offs
  – Questionable on real-time systems, mobile devices, etc.
• The CLR garbage collector (GC) is an almost-concurrent, parallel, compacting, mark-and-sweep, generational, tracing GC
50. Mark and Sweep
• Mark: identify all live objects
• Sweep: reclaim dead objects
• Compact: shift live objects together
• Objects that can still be used must be kept alive
51. Roots
• Starting points for the garbage collector
• Static variables
• Local variables
  – More tricky than they appear
• Finalization queue, f-reachable queue, GC handles, etc.
• Roots can cause memory leaks
52. Workstation GC
• There are multiple garbage collection flavors
• Workstation GC is "kind of" suitable for client apps
  – The default for almost all .NET applications
• GC runs on a single thread
• Concurrent workstation GC
  – Special GC thread
  – Runs concurrently with application threads; only short suspensions
• Non-concurrent workstation GC
  – One of the app threads does the GC
  – All threads are suspended during GC
• Workstation GC doesn't use all CPU cores
53. Server GC
• One GC thread per logical processor, all working at once
• Separate heap area for each logical processor
• Until CLR 4.5, server GC was non-concurrent
• In CLR 4.5, server GC becomes concurrent
  – Now a reasonable default for many high-memory apps
54. Switching GC Flavors
• Configure the preferred flavor in app.config
  – Ignored if invalid (e.g., concurrent GC on CLR 2.0)
• Can't switch flavors at runtime
  – But can query the current flavor using the GCSettings class
55. Generational Garbage Collection
• A full GC is expensive and inefficient
• Divide the heap into regions and perform small collections often
  – Modern server apps can't live with frequent full GCs
  – Frequently-touched regions should have many dead objects
• New objects die fast; old objects stay alive
  – Typical behavior for many applications, although exceptions exist
56. .NET Generations
• Three heap regions (generations)
• Gen 0 and gen 1 are typically quite small
• A high allocation rate leads to many fast gen 0 collections
• Survivors from gen 0 are promoted to gen 1, and so on
• Make sure your temporary objects die young and avoid frequent promotions to generation 2
57. The Large Object Heap
• Large objects are stored in a separate heap region (LOH)
• Large means larger than 85,000 bytes, or an array of more than 1,000 doubles
• The GC doesn't compact the LOH
  – This may cause fragmentation
• The LOH is considered part of generation 2
  – Temporary large objects are a common GC performance problem
58. Explicit LOH Compaction
• LOH fragmentation leads to wasted memory
• .NET 4.5.1 introduces LOH compaction
  – You can test for LOH fragmentation using the !dumpheap -stat SOS command
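Requesting the compaction looks like this: the .NET 4.5.1 GCSettings.LargeObjectHeapCompactionMode property arms a one-time compaction that happens during the next blocking full collection, after which the setting resets to Default.

```csharp
using System;
using System.Runtime;

// One-time LOH compaction (available from .NET 4.5.1):
GCSettings.LargeObjectHeapCompactionMode =
    GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();  // blocking full GC; the LOH is compacted as part of it
```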
59. Foreground and Background GC
• In concurrent GC, application threads continue to run during a full GC
• What happens if an application thread allocates during a GC?
  – In CLR 2.0, the application thread waits for the full GC to complete
• In CLR 4.0, the application thread launches a foreground GC
• In server concurrent GC, there are special foreground GC threads
• Background/foreground GC is only available as part of concurrent GC
60. Resource Cleanup
• The GC only takes care of memory, not all reclaimable resources
  – Sockets, file handles, database transactions, etc.
  – When a database transaction object dies, it has to abort the transaction and close the network connection
• C++ has destructors: deterministic cleanup
• The .NET GC doesn't release objects deterministically
61. Finalization
• The CLR runs a finalizer after the object becomes unreachable
• Let's design the finalization mechanism:
  – A finalization queue for potentially "finalizable" objects
  – Identifying candidates for finalization
  – Selecting a thread for finalization: the finalizer thread
  – An f-reachable queue for finalization candidates
  – Objects removed from the f-reachable queue can be GC'd
• This is pretty much how CLR finalization works!
62. Performance Problems with Finalization
• Finalization extends object lifetime
• The f-reachable queue might fill up faster than the finalizer thread can drain it
  – Can be addressed by deterministic finalization (Dispose)
• It's possible for a finalizer to run while an instance method hasn't returned yet
63. The Dispose Pattern
• Stay away from finalization and use deterministic cleanup
  – No performance problems
  – You're responsible for resource management
• The Dispose pattern
• Can combine Dispose with finalization
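The standard combination of Dispose with finalization can be sketched as follows (FileWrapper and its handle field are illustrative): Dispose performs the cleanup deterministically and suppresses the finalizer, which remains only as a safety net.

```csharp
using System;

// The Dispose pattern: deterministic cleanup via Dispose(), with a
// finalizer as a fallback for callers that forget to dispose.
class FileWrapper : IDisposable
{
    private IntPtr _handle;   // stand-in for an unmanaged resource
    private bool _disposed;

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);  // already cleaned up: skip finalization
    }

    protected virtual void Dispose(bool disposing)
    {
        if (_disposed) return;      // safe to call Dispose more than once
        // release the unmanaged handle here; if disposing is true,
        // managed resources may be released as well
        _handle = IntPtr.Zero;
        _disposed = true;
    }

    ~FileWrapper()                  // safety net only; avoid relying on it
    {
        Dispose(false);
    }
}
```

Calling GC.SuppressFinalize in Dispose is what avoids the finalization costs described on the previous slide for correctly disposed objects.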
64. Resurrection and Object Pooling
• Resurrection: bring an object back to life from the finalizer
• Can be used to implement an object pool
  – A cache of objects, like DB connections, that are expensive to initialize
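Resurrection can be sketched like this (PooledConnection is an illustrative name, and a real pool would also hand objects back out and stop re-registering at shutdown): the finalizer makes the dying object reachable again by adding it to a static pool.

```csharp
using System;
using System.Collections.Concurrent;

// Resurrection: instead of dying, the object returns itself to a pool
// from its finalizer and re-arms the finalizer for the next cycle.
class PooledConnection
{
    public static readonly ConcurrentBag<PooledConnection> Pool =
        new ConcurrentBag<PooledConnection>();

    ~PooledConnection()
    {
        GC.ReRegisterForFinalize(this);  // arm the finalizer again
        Pool.Add(this);                  // resurrect: reachable once more
    }
}
```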
66. Kinds of Parallelism
• Parallelism – running multiple threads in parallel
• Concurrency – doing multiple things at once
• Asynchrony – not blocking the caller's thread
70. I/O-Bound Workloads and Asynchronous I/O
• Data parallelism is suited for CPU-bound workloads
  – CPUs aren't good at sitting and waiting for I/O
• Asynchronous I/O operations
  – Asynchronous file read
  – Asynchronous HTTP POST
• Multiple outstanding I/O operations per thread
72. Awaiting Tasks and IAsyncOperation
• await support
  – The TPL Task class
  – The IAsyncOperation Windows Runtime interface

// In System.Net.Http.HttpClient
public Task<string> GetStringAsync(string requestUri);

// In Windows.Web.Http.HttpClient
public IAsyncOperationWithProgress<String, HttpProgress> GetStringAsync(Uri uri);
73. Parallelizing I/O Requests
• Start a few outstanding I/O operations and then:
  – Wait-All: process results when all operations are done
  – Wait-Any: process each operation's results when available
74. Task.WhenAll
Task<string>[] tasks = new Task<string>[] {
m_http.GetStringAsync(url1),
m_http.GetStringAsync(url2),
m_http.GetStringAsync(url3)
};
Task<string[]> all = Task.WhenAll(tasks);
string[] results = await all;
// Process the results
75. Task.WhenAny
List<Task<string>> tasks = new List<Task<string>> {
    m_http.GetStringAsync(url1),
    m_http.GetStringAsync(url2),
    m_http.GetStringAsync(url3)
};
while (tasks.Count > 0)
{
    Task<Task<string>> any = Task.WhenAny(tasks);
    Task<string> completed = await any;
    // Process the result in completed.Result
    tasks.Remove(completed);
}
76. Synchronization and Amdahl's Law
• When using parallelism, shared resources require synchronization
• Amdahl's Law
  – If a fraction P of the application is serialized (requires synchronization), the maximum possible speedup is 1/P
  – E.g., for P = 0.5 (50%), the maximum speedup is 2x
• Scalability is critical as the number of CPUs increases
77. Concurrent Data Structures
• Thread-safe data structures in the TPL
• Use them instead of a lock around the standard collections
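For example (an illustrative word-counting sketch): ConcurrentDictionary replaces the lock-around-Dictionary idiom, because AddOrUpdate performs the read-modify-write atomically.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Parallel counting without an explicit lock: AddOrUpdate either inserts
// the value 1 or atomically applies the update delegate to the old value.
var counts = new ConcurrentDictionary<string, int>();
string[] words = { "gc", "jit", "gc", "clr", "jit", "gc" };
Parallel.ForEach(words, word =>
    counts.AddOrUpdate(word, 1, (key, old) => old + 1));
```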
78. Aggregation
• Collect intermediate results into thread-local structures

Parallel.For(
    from,
    to,
    () => /* produce thread-local state */,
    (i, loopState, local) => /* do work and return new local state */,
    local => /* combine local states into the global state */
);
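Filling in the three delegates (the sum-of-squares workload is illustrative): each thread accumulates into its own local sum, and threads synchronize only once each, when their local state is folded into the global total.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Thread-local aggregation: no shared writes inside the loop body.
long total = 0;
Parallel.For(
    0, 100000,
    () => 0L,                                   // per-thread initial state
    (i, loopState, local) => local + (long)i * i,  // accumulate locally
    local => Interlocked.Add(ref total, local)     // combine once per thread
);
```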
79. Lock-Free Operations
• Atomic hardware primitives from the Interlocked class
  – Interlocked.Increment, Interlocked.Decrement, Interlocked.Add, etc.
• Especially useful: Interlocked.CompareExchange

// Performs "shared *= x" atomically
static void AtomicMultiply(ref int shared, int x)
{
    int old, result;
    do
    {
        old = shared;
        result = old * x;
    }
    while (old != Interlocked.CompareExchange(
        ref shared, result, old));
}