Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Multithreading Fundamentals

3,318 views

Published on

Everybody knows the lock keyword, but how does it implemented? What are its performance characteristics. Gael Fraiteur scratches the surface of multithreaded programming in .NET and goes deep through the Windows Kernel down to CPU microarchitecture.

Published in: Technology, Education

Multithreading Fundamentals

  1. 1. @gfraiteur a deep investigation on how .NET multithreading primitives map to hardware and Windows Kernel You Thought You Understood Multithreading Gaël Fraiteur PostSharp Technologies Founder & Principal Engineer
  2. 2. @gfraiteur Hello! My name is GAEL. No, I don’t think my accent is funny. my twitter
  3. 3. @gfraiteur Agenda 1. Hardware 2. Windows Kernel 3. Application
  4. 4. @gfraiteur Microarchitecture Hardware
  5. 5. @gfraiteurThere was no RAM in Colossus, the world’s first electronic programmable computing device (19 http://en.wikipedia.org/wiki/File:Colossus.jpg
  6. 6. @gfraiteur Hardware Processor microarchitecture Intel Core i7 980x
  7. 7. @gfraiteur Hardware Hardware latency 26.2 8.58 3 1.2 0.3 0 5 10 15 20 25 30 DRAM L3 Cache L2 Cache L1 Cache CPU Cycle ns Latency Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with SiSoftware Sandra
  8. 8. @gfraiteur Hardware Cache Hierarchy • Distributed into caches • L1 (per core) • L2 (per core) • L3 (per die) Die 1 Core 4 Processor Core 5 Processor Core 7 Processor L2 Cache Core 6 Processor L2 Cache Memory Store (DRAM) Die 0 Core 1 Processor Core 0 Processor Core 2 Processor L2 Cache Core 3 Processor L2 Cache L2 Cache L3 Cache L2 CacheL2 Cache L3 Cache L2 Cache Data Transfer Core Interconnec t Core Interconnec t Processor Interconnect Synchronization • Synchronized through an asynchronous “bus”: • Request/response • Message queues • Buffers
  9. 9. @gfraiteur Hardware Memory Ordering • Issues due to implementation details: • Asynchronous bus • Caching & Buffering • Out-of-order execution
  10. 10. @gfraiteur Hardware Memory Ordering Processor 0 Processor 1 Store A := 1 Load A Store B := 1 Load B Possible values of (A, B) for loads? Processor 0 Processor 1 Load A Load B Store B := 1 Store A := 1 Possible values of (A, B) for loads? Processor 0 Processor 1 Store A := 1 Store B := 1 Load B Load A Possible values of (A, B) for loads?
  11. 11. @gfraiteur Hardware Memory Models: Guarantees Type Alpha ARMv7 PA-RISC POWER SPARC RMO SPARC PSO SPARCTSO x86 x86 oostore AMD64 IA64 zSeries Loads reordered after Loads Y Y Y Y Y Y Y Loads reordered after Stores Y Y Y Y Y Y Y Stores reordered after Stores Y Y Y Y Y Y Y Y Stores reordered after Loads Y Y Y Y Y Y Y Y Y Y Y Y Atomic reordered with Loads Y Y Y Y Y Atomic reordered with Stores Y Y Y Y Y Y Dependent Loads reordered Y Incoherent Instruction cache pipeline Y Y Y Y Y Y Y Y Y Y http://en.wikipedia.org/wiki/Memory_ordering X84,AMD64: strong orderingARM: weak ordering
  12. 12. @gfraiteur Hardware Memory Models Compared Processor 0 Processor 1 Store A := 1 Load A Store B := 1 Load B Processor 0 Processor 1 LoadA Load B Store B := 1 Store A := 1
  13. 13. @gfraiteur Hardware Compiler Memory Ordering • The compiler (CLR) can: • Cache (memory into register), • Reorder • Coalesce writes • volatile keyword disables compiler optimizations. And that’s all!
  14. 14. @gfraiteur Hardware Thread.MemoryBarrier() Processor 0 Processor 1 Store A := 1 Store B := 1 Thread.MemoryBarrier() Thread.MemoryBarrier() Load B Load A Barriers serialize access to memory.
  15. 15. @gfraiteur Hardware Atomicity Processor 0 Processor 1 Store A := (long) 0xFFFFFFFF Load A • Up to native word size (IntPtr) • Properly aligned fields (attention to struct fields!)
  16. 16. @gfraiteur Hardware Interlocked access • The only locking mechanism at hardware level. • System.Threading.Interlocked • Add Add(ref x, int a ) { x += a; return x; } • Increment Inc(ref x ) { x += 1; return x; } • Exchange Exc( ref x, int v ) { int t = x; x = a; return t; } • CompareExchange CAS( ref x, int v, int c ) { int t = x; if ( t == c ) x = a; return t; } • Read(ref long) • Implemented by L3 cache and/or interconnect messages. • Implicit memory barrier
  17. 17. @gfraiteur Hardware Cost of interlocked 9.75 8.58 5.5 1.87 1.2 0 2 4 6 8 10 12 Interlocked increment (contended, same socket) L3 Cache Interlocked increment (non-contended) Non-interlocked increment L1 Cache ns Latency Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with C# code. Cache latencies measured with SiSoftware Sandra.
  18. 18. @gfraiteur Hardware Cost of cache coherency
  19. 19. @gfraiteur Multitasking • No thread, no process at hardware level • There no such thing as “wait” • One core never does more than 1 “thing” at the same time (except HyperThreading) • Task-State Segment: • CPU State • Permissions (I/O, IRQ) • Memory Mapping • A task runs until interrupted by hardware or OS
  20. 20. @gfraiteur Lab: Non-Blocking Algorithms • Non-blocking stack (4.5 MT/s) • Ring buffer (13.2 MT/s)
  21. 21. @gfraiteur Windows Kernel Operating System
  22. 22. @gfraiteur SideKick (1988). http://en.wikipedia.org/wiki/SideKi
  23. 23. @gfraiteur • Program (Sequence of Instructions) • CPU State • Wait Dependencies • (other things) • Virtual Address Space • Resources • At least 1 thread • (other things) Processes and Threads Process Thread Windows Kernel
  24. 24. @gfraiteur Windows Kernel Processed and Threads 8 Logical Processors (4 hyper-threaded cores) 2384 threads 155 processes
  25. 25. @gfraiteur Windows Kernel Dispatching threads to CPU Executing CPU 0 CPU 1 CPU 2 CPU 3 Thread H Thread G Thread I Thread J Mutex Semaphore Event Timer Thread A Thread C Thread B Wait Thread D Thread E Thread F Ready
  26. 26. @gfraiteur • Live in the kernel • Sharable among processes • Expensive! • Event • Mutex • Semaphore • Timer • Thread Dispatcher Objects (WaitHandle) Kinds of dispatcher objects Characteristics Windows Kernel
  27. 27. @gfraiteur Windows Kernel Cost of kernel dispatching 2 6 10 295 2,093 1 10 100 1000 10000 Normal increment Non-contented interlocked increment Contented interlocked increment Access kernel dispatcher object Create+dispose kernel dispatcher object Duration (ns) Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI)
  28. 28. @gfraiteur Windows Kernel Wait Methods • Thread.Sleep • Thread.Yield • Thread.Join • WaitHandle.Wait • (Thread.SpinWait)
  29. 29. @gfraiteur Windows Kernel SpinWait: smart busy loops 1. Thread.SpinWait (Hardware awareness, e.g. Intel HyperThreading) 2. Thread.Sleep(0) (give up to threads of same priority) 3. Thread.Yield() (give up to threads of any priority) 4. Thread.Sleep(1) (sleeps for 1 ms) Processor 0 Processor 1 A = 1; B = 1; SpinWait w = new SpinWait(); int localB; if ( A == 1 ) { while ( ((localB = B) == 0 ) w.SpinOnce(); }
  30. 30. @gfraiteur .NET Framework Application Programming
  31. 31. @gfraiteur Application Programming Design Objectives • Ease of use • High performance from applications!
  32. 32. @gfraiteur Application Programming Instruments from the platform Hardware • Interlocked • SpinWait • Volatile Operating System • Thread • WaitHandle (mutex, event, semaphore, …) • Timer
  33. 33. @gfraiteur Application Programming New Instruments Concurrency & Synchronization •Monitor (Lock), SpinLock •ReaderWriterLock •Lazy, LazyInitializer •Barrier, CountdownEvent Data Structures •BlockingCollection •Concurrent(Queue|Stack|Bag|Dictionary) •ThreadLocal,ThreadStatic Thread Dispatching •ThreadPool •Dispatcher, ISynchronizeInvoke •BackgroundWorker Frameworks •PLINQ •Tasks •Async/Await
  34. 34. @gfraiteur Concurrency & Synchronization Application Programming
  35. 35. @gfraiteur Concurrency & Synchronization Monitor: the lock keyword 1. Start with interlocked operations (no contention) 2. Continue with “spin wait”. 3. Create kernel event and wait  Good performance if low contention
  36. 36. @gfraiteur Concurrency & Synchronization ReaderWriterLock • Concurrent reads allowed • Writes require exclusive access
  37. 37. @gfraiteur Concurrency & Synchronization Lab: ReaderWriterLock
  38. 38. @gfraiteur Data Structures Application Programming
  39. 39. @gfraiteur Data Structures BlockingCollection • Use case: pass messages between threads • TryTake: wait for an item and take it • Add: wait for free capacity (if limited) and add item to queue • CompleteAdding: no more item
  40. 40. @gfraiteur Data Structures Lab: BlockingCollection
  41. 41. @gfraiteur Data Structures ConcurrentStack node top Producer A Producer B Producer C Consumer A Consumer B Consumer C Consumers and producers compete for the top of the stack node
  42. 42. @gfraiteur Data Structures ConcurrentQueue node node tail Producer A Producer B Producer C Consumer A Consumer B Consumer C head Consumers compete for the head. Producers compete for the tail. Both compete for the same node if the queue is empty.
  43. 43. @gfraiteur Data Structures ConcurrentBag Thread A Bag Thread Local Storage A Thread Local Storage B node node node node Producer A Consumer A Thread B Producer B Consumer B • Threads preferentially consume nodes they produced themselves. • No ordering
  44. 44. @gfraiteur Data Structures ConcurrentDictionary • TryAdd • TryUpdate (CompareExchange) • TryRemove • AddOrUpdate • GetOrAdd
  45. 45. @gfraiteur Thread Dispatching Application Programming
  46. 46. @gfraiteur Thread Dispatching ThreadPool • Problems solved: • Execute a task asynchronously (QueueUserWorkItem) • Waiting for WaitHandle (RegisterWaitForSingleObject) • Creating a thread is expensive • Good for I/O intensive operations • Bad for CPU intensive operations
  47. 47. @gfraiteur Thread Dispatching Dispatcher • Execute a task in the UI thread • Synchronously: Dispatcher.Invoke • Asynchonously: Dispatcher.BeginInvoke • Under the hood, uses the HWND message loop
  48. 48. @gfraiteur Thread Dispatching
  49. 49. @gfraiteur Q&A Gael Fraiteur gael@postsharp.net @gfraiteur
  50. 50. @gfraiteur Summary
  51. 51. @gfraiteur http://www.postsharp.net/ gael@postsharp.net

×