Lock-Free Algorithms For Ultimate Performance
Video and slides synchronized, mp3 and slide download available at http://bit.ly/116q7FA.

Martin Thompson discusses the need to measure what’s going on at the hardware level in order to be able to create high-performing lock-free algorithms. Filmed at qconsf.com.

Martin Thompson is a high-performance and low-latency specialist, with experience gained over two decades working with large scale transactional and big-data domains, including automotive, gaming, financial, mobile, and content management. Martin was the co-founder and CTO of LMAX, until he left to specialize in helping other people achieve great performance with their software. Twitter: @mjpt777

  1. Lock-Free Algorithms For Ultimate Performance
     Martin Thompson - @mjpt777
  2. Watch the video with slide synchronization on InfoQ.com!
     http://www.infoq.com/presentations/Lock-Free-Algorithms
     InfoQ.com: News & Community Site
     • 750,000 unique visitors/month
     • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
     • Post content from our QCon conferences
     • News 15-20 / week
     • Articles 3-4 / week
     • Presentations (videos) 12-15 / week
     • Interviews 2-3 / week
     • Books 1 / month
  3. Presented at QCon San Francisco
     www.qconsf.com
     Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation
     Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
     - speakers and topics driving the evolution and innovation
     - connecting and catalyzing the influencers and innovators
     Highlights
     - attended by more than 12,000 delegates since 2007
     - held in 9 cities worldwide
  4. Modern Hardware Overview
  5. Modern Hardware (Intel Sandy Bridge)
     (Cache-hierarchy diagram with approximate access costs per level:)
     Registers/Buffers: <1ns
     L1: ~3-4 cycles, ~1ns
     L2: ~10-12 cycles, ~3ns
     L3: ~40-45 cycles, ~15ns
     QPI (cross-socket): ~40ns
     DRAM: ~65ns
  6. Memory Ordering
     (Per-core diagram: Registers and Execution Units feed a Store Buffer and Load Buffer via the MOB;
      LF/WC Buffers sit in front of each core's private L1 and L2, with a shared L3.)
  7. Cache Structure & Coherence
     (Diagram: MOB; L0(I) – 1.5k µops; 64-byte “cache-lines”; L1(D) 32K and L1(I) 32K with TLBs and
      pre-fetchers; L2 256K; L3 8-20MB on a ring bus with the Memory Controller, memory channels,
      QPI and System Agent; coherence via the MESI+F state model.)
  8. Memory Models
  9. Hardware Memory Models
     Memory consistency models describe how threads may interact through shared memory consistently.
     • Program Order (PO) for a single thread
     • Sequential Consistency (SC) [Lamport 1979]
       > What you expect a program to do! (for race-free programs)
     • Strict Consistency (Linearizability)
       > Some special instructions
     • Total Store Order (TSO)
       > SPARC model, slightly weaker than SC (a store may be reordered after a later load)
     • x86/64 is TSO + (Total Lock Order & Causal Consistency)
       > http://www.youtube.com/watch?v=WUfvvFD5tAA
     • Other processors can have weaker models
  10. Intel x86/64 Memory Model
      http://www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf
      http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
      1. Loads are not reordered with other loads.
      2. Stores are not reordered with other stores.
      3. Stores are not reordered with older loads.
      4. Loads may be reordered with older stores to different locations but not with older stores to the same location.
      5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
      6. In a multiprocessor system, stores to the same location have a total order.
      7. In a multiprocessor system, locked instructions have a total order.
      8. Loads and stores are not reordered with locked instructions.
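     Rule 4 is the one that usually bites: a store followed by a load of a different location can be
     observed out of order. A minimal Java sketch of the classic store-buffering litmus test (class and
     field names are illustrative, not from the deck):

     // Store-buffering litmus test: with plain fields both r1 and r2 can be 0,
     // because each core's store may still sit in its store buffer when the other
     // core loads. Declaring x and y volatile forces a full fence (a LOCK-prefixed
     // instruction on x86) and rules that outcome out.
     public class StoreBufferingLitmus
     {
         static volatile int x = 0, y = 0;   // remove volatile to allow the r1 == r2 == 0 outcome

         public static void main(String[] args) throws InterruptedException
         {
             final int[] results = new int[2];

             Thread t1 = new Thread(() -> { x = 1; results[0] = y; });
             Thread t2 = new Thread(() -> { y = 1; results[1] = x; });

             t1.start(); t2.start();
             t1.join(); t2.join();

             System.out.println("r1=" + results[0] + " r2=" + results[1]);
         }
     }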
  11. Language/Runtime Memory Models
      Some languages/runtimes have a well-defined memory model for portability:
      • Java Memory Model (Java 5)
      • C++11
      • Erlang
      • Go
      For most other languages we are at the mercy of the compiler:
      • Instruction reordering
      • C “volatile” is inadequate
      • Register allocation for caching values
      • No mapping to the hardware memory model
      • Fences/Barriers need to be applied
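     One concrete reason the JMM matters: with a plain field the JIT is free to hoist the read out of a
     loop (register allocation for caching values), so a reader may spin forever. A minimal sketch, not
     from the slides:

     // Without volatile, the reader's compiled loop may cache 'running' in a
     // register and never observe the writer's update. Marking the field volatile
     // makes the write visible and orders it with surrounding accesses.
     public class VisibilityExample
     {
         private static volatile boolean running = true;

         public static void main(String[] args) throws InterruptedException
         {
             Thread reader = new Thread(() ->
             {
                 while (running)
                 {
                     // spin until the main thread clears the flag
                 }
                 System.out.println("reader observed running == false");
             });

             reader.start();
             Thread.sleep(100);
             running = false;   // with a plain field this update might never be seen
             reader.join();
         }
     }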
  12. Measuring What Is Going On
  13. Model Specific Registers (MSR)
      Many and varied uses
      • Timestamp Invariant Counter
      • Memory Type Range Registers
      Performance Counters!!!
      • L2/L3 Cache Hits/Misses
      • TLB Hits/Misses
      • QPI Transfer Rates
      • Instruction and Cycle Counts
      • Lots of others....
  14. Accessing MSRs
      void rdmsr(uint32_t msr, uint32_t* lo, uint32_t* hi)
      {
          asm volatile("rdmsr" : "=a"(*lo), "=d"(*hi) : "c"(msr));
      }

      void wrmsr(uint32_t msr, uint32_t lo, uint32_t hi)
      {
          asm volatile("wrmsr" : : "c"(msr), "a"(lo), "d"(hi));
      }
  15. Accessing MSRs On Linux
      RandomAccessFile f = new RandomAccessFile("/dev/cpu/0/msr", "rw");
      FileChannel ch = f.getChannel();
      ByteBuffer buffer = ByteBuffer.allocate(8).order(ByteOrder.nativeOrder());
      ch.read(buffer, msrNumber);   // the MSR number is used as the file offset
      long value = buffer.getLong(0);
  16. Accessing MSRs Made Easy!
      Intel VTune
      • http://software.intel.com/en-us/intel-vtune-amplifier-xe
      Linux “perf stat”
      • http://linux.die.net/man/1/perf-stat
      likwid - Lightweight Performance Counters
      • http://code.google.com/p/likwid/
  17. Biggest Performance Enemy
      “Contention!”
  18. Contention
      • Managing Contention
        > Locks
        > CAS Techniques
      • Little’s & Amdahl’s Laws (see the worked sketch below)
        > L = λW
        > Sequential Component Constraint
      • Single Writer Principle
      • Shared Nothing Designs
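     To make the two laws concrete, a small back-of-the-envelope sketch with illustrative numbers (not
     from the talk): Little’s Law says the mean number of items in the system is the arrival rate times
     the time each item spends in the system, and Amdahl’s Law bounds speedup by the sequential fraction.

     // Illustrative only: plugs example numbers into Little's Law (L = lambda * W)
     // and Amdahl's Law (speedup = 1 / ((1 - p) + p / n)).
     public class ContentionLaws
     {
         static double littlesLaw(double arrivalRatePerSec, double timeInSystemSec)
         {
             return arrivalRatePerSec * timeInSystemSec;   // mean items in the system
         }

         static double amdahlSpeedup(double parallelFraction, int processors)
         {
             return 1.0 / ((1.0 - parallelFraction) + parallelFraction / processors);
         }

         public static void main(String[] args)
         {
             // 100,000 requests/sec each spending 2ms in the system => ~200 in flight
             System.out.println(littlesLaw(100_000, 0.002));

             // 5% sequential work caps speedup below 20x no matter how many cores
             System.out.println(amdahlSpeedup(0.95, 16));     // ~9.1x
             System.out.println(amdahlSpeedup(0.95, 1024));   // ~19.6x
         }
     }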
  19. Locks
  20. Software Locks
      • Mutex, Semaphore, Critical Section, etc.
        > What happens when un-contended?
        > What happens when contention occurs?
        > What if we need condition variables?
        > What are the costs of software locks?
        > Can they be optimised?
  21. Hardware Locks
      • Atomic Instructions (see the CAS sketch below)
        > Compare And Swap/Set
        > Lock instructions on x86
          – LOCK XADD is a bit special
      • Used to update sequences and pointers
      • What are the costs of these operations?
      • Guess how software locks are created?
      • TSX (Transactional Synchronization Extensions)
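     A minimal sketch (not from the slides) of both flavours in Java: compareAndSet maps to a
     CAS/CMPXCHG-style instruction that can fail and be retried, while incrementAndGet on modern JVMs
     can be emitted as LOCK XADD, which always succeeds.

     import java.util.concurrent.atomic.AtomicLong;

     // Claiming a sequence number two ways: a CAS retry loop, and a fetch-and-add.
     public class SequenceExample
     {
         private final AtomicLong sequence = new AtomicLong(0);

         // CAS loop: read, compute, attempt to swap; retry if another thread won the race.
         public long nextViaCas()
         {
             long current;
             do
             {
                 current = sequence.get();
             }
             while (!sequence.compareAndSet(current, current + 1));

             return current + 1;
         }

         // Fetch-and-add: on x86 this can be a single LOCK XADD, no retry loop needed.
         public long nextViaXadd()
         {
             return sequence.incrementAndGet();
         }
     }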
  22. Let’s Look At A Lock-Free Algorithm
  23. OneToOneQueue – Take 1
      public final class OneToOneConcurrentArrayQueue<E>
          implements Queue<E>
      {
          private final E[] buffer;
          private volatile long tail = 0;
          private volatile long head = 0;

          public OneToOneConcurrentArrayQueue(int capacity)
          {
              buffer = (E[])new Object[capacity];
          }
  24. OneToOneQueue – Take 1
      public boolean offer(final E e)
      {
          final long currentTail = tail;
          final long wrapPoint = currentTail - buffer.length;
          if (head <= wrapPoint)
          {
              return false;
          }

          buffer[(int)(currentTail % buffer.length)] = e;
          tail = currentTail + 1;

          return true;
      }
  25. OneToOneQueue – Take 1
      public E poll()
      {
          final long currentHead = head;
          if (currentHead >= tail)
          {
              return null;
          }

          final int index = (int)(currentHead % buffer.length);
          final E e = buffer[index];
          buffer[index] = null;
          head = currentHead + 1;

          return e;
      }
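     For orientation, a minimal single-producer/single-consumer harness of the kind the throughput
     numbers on the next slide imply. This harness is an assumption for illustration (the deck does not
     show its benchmark code) and it presumes the queue class above is completed beyond the constructor,
     offer() and poll() shown on the slides.

     // One producer thread offers a fixed number of items while one consumer polls
     // them back; the queue is only safe for exactly one producer and one consumer.
     public class SpscHarness
     {
         public static void main(String[] args) throws InterruptedException
         {
             final int count = 10_000_000;
             final OneToOneConcurrentArrayQueue<Integer> queue =
                 new OneToOneConcurrentArrayQueue<>(1024);

             Thread producer = new Thread(() ->
             {
                 for (int i = 0; i < count; i++)
                 {
                     while (!queue.offer(i))
                     {
                         // queue full: busy spin until the consumer catches up
                     }
                 }
             });

             Thread consumer = new Thread(() ->
             {
                 for (int i = 0; i < count; i++)
                 {
                     while (queue.poll() == null)
                     {
                         // queue empty: busy spin until the producer publishes
                     }
                 }
             });

             long start = System.nanoTime();
             producer.start(); consumer.start();
             producer.join(); consumer.join();
             long durationNs = System.nanoTime() - start;

             System.out.printf("%.1f million ops/sec%n", (count * 1_000.0) / durationNs);
         }
     }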
  26. Concurrent Queue Performance Results
                                Ops/Sec (Millions)   Mean Latency (ns)
      LinkedBlockingQueue              4.3            ~32,000 / ~500
      ArrayBlockingQueue               3.5            ~32,000 / ~600
      ConcurrentLinkedQueue             13                 NA / ~180
      ConcurrentArrayQueue              13                 NA / ~150
      Note: None of these tests are run with thread affinity set, Sandy Bridge 2.4 GHz
      Latency: Blocking - put() & take() / Non-Blocking - offer() & poll()
  27. Let’s Apply Some “Mechanical Sympathy”
  28. Mechanical Sympathy In Action
      Knowing the cost of operations
      • Remainder operation (see the index-masking sketch below)
      • Volatile writes and lock instructions
      Why so many cache misses?
      • False sharing
      • Algorithm opportunities
        > “Smart Batching”
      • Memory layout
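     One example of knowing the cost of operations: an integer division/remainder costs tens of cycles,
     while a bitwise AND costs one, so sizing the ring buffer to a power of two lets the index wrap with
     a mask, as the Take 2 queue below does. A quick sketch; the method names are illustrative, and the
     last method shows one plausible way the findNextPositivePowerOfTwo helper used in Take 2 could work.

     // Wrapping a sequence onto a ring-buffer index: the remainder form needs an
     // integer division, the mask form is a single AND but requires the capacity
     // to be a power of two.
     public final class IndexMasking
     {
         static int indexWithRemainder(long sequence, int capacity)
         {
             return (int)(sequence % capacity);
         }

         static int indexWithMask(long sequence, int capacityPowerOfTwo)
         {
             return (int)sequence & (capacityPowerOfTwo - 1);
         }

         // Rounds a positive value up to the next power of two.
         static int findNextPositivePowerOfTwo(int value)
         {
             return 1 << (32 - Integer.numberOfLeadingZeros(value - 1));
         }
     }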
  29. Operation Costs
  30. Signalling
      // Lock
      pthread_mutex_lock(&lock);
      sequence = i;
      pthread_cond_signal(&condition);
      pthread_mutex_unlock(&lock);

      // Soft Barrier
      asm volatile("" ::: "memory");
      sequence = i;

      // Fence
      asm volatile("" ::: "memory");
      sequence = i;
      asm volatile("lock addl $0x0,(%rsp)");
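     Rough Java counterparts, added for illustration (not from the slides): the lock variant maps to
     synchronized plus notify, the fence to a volatile write (followed by a LOCK-prefixed instruction on
     x86), and the soft barrier to AtomicLong.lazySet, which orders stores but does not wait for the
     store buffer to drain.

     import java.util.concurrent.atomic.AtomicLong;

     // Three ways to publish a sequence value to another thread, roughly in
     // decreasing order of cost.
     public class Signalling
     {
         private final Object lock = new Object();
         private long lockedSequence;
         private volatile long volatileSequence;
         private final AtomicLong lazySequence = new AtomicLong();

         // "Lock": mutual exclusion plus a condition signal.
         public void publishWithLock(long i)
         {
             synchronized (lock)
             {
                 lockedSequence = i;
                 lock.notify();
             }
         }

         // "Fence": a volatile store, which HotSpot follows with a LOCK-prefixed instruction on x86.
         public void publishWithFence(long i)
         {
             volatileSequence = i;
         }

         // "Soft barrier": an ordered store that prevents store-store reordering
         // without draining the store buffer.
         public void publishSoft(long i)
         {
             lazySequence.lazySet(i);
         }
     }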
  31. Signalling Costs
                           Lock      Fence      Soft
      Million Ops/Sec       9.4       45.7     108.1
      L2 Hit Ratio        17.26      28.17     13.32
      L3 Hit Ratio         0.78      29.60     27.99
      Instructions      12846 M      906 M     801 M
      CPU Cycles        28278 M     5808 M    1475 M
      Ins/Cycle            0.45       0.16      0.54
  32. OneToOneQueue – Take 2
      public final class OneToOneConcurrentArrayQueue2<E>
          implements Queue<E>
      {
          private final int mask;
          private final E[] buffer;
          private final AtomicLong tail = new AtomicLong(0);
          private final AtomicLong head = new AtomicLong(0);

          public OneToOneConcurrentArrayQueue2(int capacity)
          {
              capacity = findNextPositivePowerOfTwo(capacity);
              mask = capacity - 1;
              buffer = (E[])new Object[capacity];
          }
  33. OneToOneQueue – Take 2
      public boolean offer(final E e)
      {
          final long currentTail = tail.get();
          final long wrapPoint = currentTail - buffer.length;
          if (head.get() <= wrapPoint)
          {
              return false;
          }

          buffer[(int)currentTail & mask] = e;
          tail.lazySet(currentTail + 1);

          return true;
      }
  34. OneToOneQueue – Take 2
      public E poll()
      {
          final long currentHead = head.get();
          if (currentHead >= tail.get())
          {
              return null;
          }

          final int index = (int)currentHead & mask;
          final E e = buffer[index];
          buffer[index] = null;
          head.lazySet(currentHead + 1);

          return e;
      }
  35. Concurrent Queue Performance Results
                                Ops/Sec (Millions)   Mean Latency (ns)
      LinkedBlockingQueue              4.3            ~32,000 / ~500
      ArrayBlockingQueue               3.5            ~32,000 / ~600
      ConcurrentLinkedQueue             13                 NA / ~180
      ConcurrentArrayQueue              13                 NA / ~150
      ConcurrentArrayQueue2             45                 NA / ~120
      Note: None of these tests are run with thread affinity set, Sandy Bridge 2.4 GHz
      Latency: Blocking - put() & take() / Non-Blocking - offer() & poll()
  36. Cache Misses
  37. False Sharing and Cache Lines
      (Diagram: two hot fields, *address1 written by thread a and *address2 written by thread b;
       unpadded they share one 64-byte “cache-line”, padded they occupy separate lines.)
  38. False Sharing Testing
      int64_t* address = seq->address;
      for (int i = 0; i < ITERATIONS; i++)
      {
          int64_t value = *address;
          ++value;
          *address = value;
          asm volatile("lock addl $0x0,(%rsp)");
      }
  39. False Sharing Test Results
                          Unpadded    Padded
      Million Ops/sec        12.4      104.9
      L2 Hit Ratio          1.16%     23.05%
      L3 Hit Ratio          2.51%     39.18%
      Instructions         4559 M     4508 M
      CPU Cycles          63480 M     7551 M
      Ins/Cycle Ratio        0.07       0.60
  40. OneToOneQueue – Take 3
      public final class OneToOneConcurrentArrayQueue3<E>
          implements Queue<E>
      {
          private final int capacity;
          private final int mask;
          private final E[] buffer;

          private final AtomicLong tail = new PaddedAtomicLong(0);
          private final AtomicLong head = new PaddedAtomicLong(0);

          public static class PaddedLong
          {
              public long value = 0, p1, p2, p3, p4, p5, p6;
          }

          private final PaddedLong tailCache = new PaddedLong();
          private final PaddedLong headCache = new PaddedLong();
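     PaddedAtomicLong is used above but not shown in the deck; a plausible sketch of such a class (an
     assumption, mirroring the padding trick in PaddedLong) extends AtomicLong and appends unused long
     fields so the hot counter gets a 64-byte cache-line largely to itself.

     import java.util.concurrent.atomic.AtomicLong;

     // Pads an AtomicLong so its value does not share a cache-line with other hot
     // fields. The extra method only exists to discourage the JIT from eliminating
     // the otherwise-unused padding fields.
     public class PaddedAtomicLong extends AtomicLong
     {
         public volatile long p1, p2, p3, p4, p5, p6 = 7;

         public PaddedAtomicLong(final long initialValue)
         {
             super(initialValue);
         }

         public long sumPaddingToPreventOptimisation()
         {
             return p1 + p2 + p3 + p4 + p5 + p6;
         }
     }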
  41. OneToOneQueue – Take 3
      public boolean offer(final E e)
      {
          final long currentTail = tail.get();
          final long wrapPoint = currentTail - capacity;
          if (headCache.value <= wrapPoint)
          {
              headCache.value = head.get();
              if (headCache.value <= wrapPoint)
              {
                  return false;
              }
          }

          buffer[(int)currentTail & mask] = e;
          tail.lazySet(currentTail + 1);

          return true;
      }
  42. OneToOneQueue – Take 3
      public E poll()
      {
          final long currentHead = head.get();
          if (currentHead >= tailCache.value)
          {
              tailCache.value = tail.get();
              if (currentHead >= tailCache.value)
              {
                  return null;
              }
          }

          final int index = (int)currentHead & mask;
          final E e = buffer[index];
          buffer[index] = null;
          head.lazySet(currentHead + 1);

          return e;
      }
  43. Concurrent Queue Performance Results
                                Ops/Sec (Millions)   Mean Latency (ns)
      LinkedBlockingQueue              4.3            ~32,000 / ~500
      ArrayBlockingQueue               3.5            ~32,000 / ~600
      ConcurrentLinkedQueue             13                 NA / ~180
      ConcurrentArrayQueue              13                 NA / ~150
      ConcurrentArrayQueue2             45                 NA / ~120
      ConcurrentArrayQueue3            150                 NA / ~100
      Note: None of these tests are run with thread affinity set, Sandy Bridge 2.4 GHz
      Latency: Blocking - put() & take() / Non-Blocking - offer() & poll()
  44. How Far Can We Go With Lock-Free Algorithms?
  45. Further Adventures With Lock-Free Algorithms
      • State Machines
      • CAS operations
      • Wait-Free in addition to Lock-Free algorithms
      • Thread Affinity
      • x86 and busy spinning and back off (see the sketch below)
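     A minimal sketch of busy spinning with back-off of the kind the last bullet hints at. This is an
     assumption for illustration, not code from the talk; Thread.onSpinWait requires Java 9+, and the
     spin/yield thresholds are arbitrary.

     import java.util.concurrent.locks.LockSupport;
     import java.util.function.BooleanSupplier;

     // Spin hard for a while (lowest latency), then yield, then park briefly,
     // trading latency for CPU as the wait grows longer.
     public final class BackoffIdleStrategy
     {
         public static void idleUntil(final BooleanSupplier condition)
         {
             int spins = 0;
             while (!condition.getAsBoolean())
             {
                 if (spins < 1_000)
                 {
                     Thread.onSpinWait();          // hint to the CPU that we are busy-waiting
                 }
                 else if (spins < 2_000)
                 {
                     Thread.yield();               // give up the time slice
                 }
                 else
                 {
                     LockSupport.parkNanos(1_000); // back off for ~1µs
                 }
                 ++spins;
             }
         }
     }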
  46. Questions?
      Blog: http://mechanical-sympathy.blogspot.com/
      Code: https://github.com/mjpt777/examples
      Twitter: @mjpt777
      “The most amazing achievement of the computer software industry is its continuing cancellation of
       the steady and staggering gains made by the computer hardware industry.” - Henry Petroski
