Understanding the Disruptor


A Beginner's Guide to Hardcore Concurrency
Why is concurrency so difficult?
Ordering

Program Order:      Execution Order (maybe):

int w = 10;         int x = 20;
int x = 20;         int y = 30;
int y = 30;         int b = x * y;
int z = 40;         int w = 10;
int a = w + z;      int z = 40;
int b = x * y;      int a = w + z;
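
The reordering becomes observable once a second thread is watching. Below is a minimal, self-contained sketch (my own illustration, not from the slides): with no synchronization, the reader may see ready == true while still seeing the stale value.

public class ReorderingSketch {

  static int value = 0;
  static boolean ready = false;

  // Writer thread: program order is value first, then ready.
  static void writer() {
    value = 42;
    ready = true;
  }

  // Reader thread: with no synchronization the Java Memory Model allows
  // these writes to become visible out of program order, so this may
  // legally print 0.
  static void reader() {
    if (ready) {
      System.out.println(value);
    }
  }

  public static void main(String[] args) {
    new Thread(ReorderingSketch::writer).start();
    new Thread(ReorderingSketch::reader).start();
  }
}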
Visibility
Why should we care about the details?
Increment a Counter


static long foo = 0;

private static void increment() {
  for (long l = 0; l < 500000000L; l++) {
    foo++;
  }
}
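
The timing slides that follow also quote a volatile variant which is never shown as code. A minimal sketch of what that benchmark presumably looks like (the field is simply declared volatile; everything else is unchanged):

static volatile long foo = 0;

private static void increment() {
  for (long l = 0; l < 500000000L; l++) {
    foo++;   // still a read-modify-write, now with a volatile store each time
  }
}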
Using a Lock

public static long foo = 0;
public static Lock lock = new ReentrantLock();

private static void increment() {
  for (long l = 0; l < 500000000L; l++) {
    lock.lock();
    try {
        foo++;
    } finally {
        lock.unlock();
    }
  }
}
Using an AtomicLong


static AtomicLong foo = new AtomicLong(0);

private static void increment() {
  for (long l = 0; l < 500000000L; l++) {
    foo.getAndIncrement();
  }
}
The Cost of Contention
         Increment a counter 500 000 000 times.

● One Thread             :     300 ms
● One Thread (volatile)  :   4 700 ms (15x)
● One Thread (Atomic)    :   5 700 ms (19x)
● One Thread (Lock)      :  10 000 ms (33x)
● Two Threads (Atomic)   :  30 000 ms (100x)
● Two Threads (Lock)     : 224 000 ms (746x)
                           ^^^^^^^^
                           ~4 minutes!!!
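
The slides don't show the benchmark harness itself; the sketch below is my own rough reconstruction of how such numbers could be gathered for the two-thread Atomic case (class and method names are assumptions, and absolute times depend entirely on the hardware):

import java.util.concurrent.atomic.AtomicLong;

public class ContentionBenchmark {

  static final long ITERATIONS = 500000000L;
  static final AtomicLong foo = new AtomicLong(0);

  static void increment(long count) {
    for (long l = 0; l < count; l++) {
      foo.getAndIncrement();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    int threads = 2;   // both threads hammer the same AtomicLong
    long start = System.currentTimeMillis();

    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> increment(ITERATIONS / threads));
      workers[i].start();
    }
    for (Thread worker : workers) {
      worker.join();
    }

    System.out.println("Took " + (System.currentTimeMillis() - start) + " ms");
  }
}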
Parallel v. Serial - String Splitting

Guy Steele @ Strange Loop:

http://www.infoq.com/presentations/Thinking-Parallel-Programming

Scala Implementation and Brute Force version in Java:

https://github.com/mikeb01/folklore/
Performance Test



Parallel (Scala) :  440 ops/sec
Serial (Java)    : 1768 ops/sec
CPUs Are Getting Faster

     Single threaded string split on different CPUs
What problem were we trying to solve?
Classic Approach to the Problem
The Problems We Found
Why Queues Suck
Why Queues Suck - Linked List
Contention Free Design
Now our Pipeline Looks Like...
How Fast Is It - Throughput
How Fast Is It - Latency

                                       ABQ    Disruptor

 Min Latency (ns)                      145           29
 Mean Latency (ns)                  32 757           52
 99 Percentile Latency (ns)      2 097 152          128
 99.99 Percentile Latency (ns)   4 194 304        8 192
 Max Latency (ns)                5 069 086      175 567
How does it all work?
Ordering and Visibility

 public class RingBuffer {

   private static final int SIZE = 32;
   private final Object[] data = new Object[SIZE];
   private volatile long sequence = -1;
   private long nextValue = -1;

   // Single writer: store the data first, then publish it by writing the
   // volatile sequence, which orders the stores and makes them visible.
   public void publish(Object value) {
     long index = ++nextValue;
     data[(int)(index % SIZE)] = value;
     sequence = index;
   }

   // Reader: the volatile read of sequence guarantees that if the index
   // has been published, the matching write to data[] is visible too.
   public Object get(long index) {
     if (index <= sequence) {
        return data[(int)(index % SIZE)];
     }
     return null;
   }
 }
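
For context, here is a minimal sketch of driving this ring buffer from one producer and one consumer thread (the spin-wait loop and thread setup are my own illustration and ignore wrap-around, which the real Disruptor handles with consumer sequences):

public class RingBufferDemo {

  public static void main(String[] args) throws InterruptedException {
    RingBuffer buffer = new RingBuffer();

    Thread producer = new Thread(() -> {
      for (int i = 0; i < 10; i++) {
        buffer.publish("event-" + i);
      }
    });

    Thread consumer = new Thread(() -> {
      long next = 0;
      while (next < 10) {
        Object value = buffer.get(next);
        if (value != null) {          // null means not published yet
          System.out.println(value);
          next++;
        }                             // otherwise spin until it is visible
      }
    });

    consumer.start();
    producer.start();
    producer.join();
    consumer.join();
  }
}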
Ordering and Visibility - Store

mov $0x1,%ecx
add 0x18(%rsi),%rcx ;*ladd
;...
lea (%r12,%r8,8),%r11 ;*getfield data
;...
mov %r12b,(%r11,%r10,1)
mov %rcx,0x10(%rsi)
lock addl $0x0,(%rsp) ;*ladd
Ordering and Visibility - Load

mov %eax,-0x6000(%rsp)
push %rbp
sub $0x20,%rsp       ;*synchronization entry
             ; - RingBuffer::get@-1 (line 17)
mov 0x10(%rsi),%r10 ;*getfield sequence
             ; - RingBuffer::get@2 (line 17)
cmp %r10,%rdx
jl 0x00007ff92505f22d ;*iflt
             ; - RingBuffer::get@6 (line 17)
mov %edx,%r11d ;*l2i ; - RingBuffer::get@14 (line 19)
Look Ma' No Memory Barrier


AtomicLong sequence = new AtomicLong(-1);

public void publish(Object value) {
  long index = ++nextValue;
  data[(int)(index % SIZE)] = value;
  sequence.lazySet(index);
}
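
Only the store side changes; reads are unaffected. A sketch of the matching get, assuming the same data[] and SIZE fields as in the earlier RingBuffer: AtomicLong.get() is an ordinary volatile read, while lazySet drops the trailing StoreLoad barrier (the lock addl) seen in the store assembly from the publish path.

public Object get(long index) {
  if (index <= sequence.get()) {   // plain volatile-read semantics
    return data[(int)(index % SIZE)];
  }
  return null;
}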
False Sharing - Hidden Contention
Cache Line Padding

public class PaddedAtomicLong extends AtomicLong {

    public volatile long p1, p2, p3, p4, p5, p6 = 7L;

    //... lines omitted

    public long sumPaddingToPreventOptimisation() {
      return p1 + p2 + p3 + p4 + p5 + p6;
    }
}
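
As a usage sketch (my own illustration, not from the slides): two threads each incrementing their own counter no longer contend once the counters cannot share a cache line, which is exactly what the padding buys. This assumes the PaddedAtomicLong above is on the classpath.

public class FalseSharingSketch {

  // With plain AtomicLongs these two fields could sit on the same cache
  // line and falsely share; the padded version keeps them apart.
  static final PaddedAtomicLong counterA = new PaddedAtomicLong();
  static final PaddedAtomicLong counterB = new PaddedAtomicLong();

  public static void main(String[] args) throws InterruptedException {
    Thread t1 = new Thread(() -> {
      for (long l = 0; l < 500000000L; l++) {
        counterA.incrementAndGet();
      }
    });
    Thread t2 = new Thread(() -> {
      for (long l = 0; l < 500000000L; l++) {
        counterB.incrementAndGet();
      }
    });
    t1.start(); t2.start();
    t1.join(); t2.join();
  }
}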
In Summary

● Concurrency is a tool
● Ordering and visibility are the key challenges
● For performance the details matter
● Don't believe everything you read
   ○ Come up with your own theories and test them!
Q&A

recruitment@lmax.com
