Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concurrency | Trisha Gee & Mike Barker

2011-11-02 | 05:45 PM - 06:35 PM | Victoria
The Disruptor is a new open-source concurrency framework, designed as a high-performance mechanism for inter-thread messaging. It was developed at LMAX as part of our efforts to build the world's fastest financial exchange. Using the Disruptor as an example, this talk will explain some of the more detailed and less understood areas of concurrency, such as memory barriers and cache coherency. These concepts are often regarded as scary, complex magic only accessible to wizards like Doug Lea and Cliff Click. Our talk will try to demystify them and show that concurrency can be understood by us mere mortal programmers.

Speaker Notes
  • (Trish) Introduce ourselves, mention the award :) "Duke's Choice Award for Innovative Programming Framework". Introduce what we're going to cover:
      - concurrency/performance
      - deep & narrow
      - contradictory - going to argue against abstractions
  • (Trish) Ask the audience:
      - Who works with concurrent code daily?
      - Who finds concurrency difficult?
      - Who cares about performance?
  • (Trish) Compilers and CPUs are allowed to reorder instructions as long as program semantics are maintained. Without any explicit requests, those correctness semantics are limited to observers in the same thread. Different CPUs reorder instructions to varying degrees, e.g. Intel x86 not much, DEC Alpha lots, Intel Atom not at all. Unless explicit instructions are used to ensure ordering, observers in another thread will see different results; i.e. a separate thread can't assume that because z = 40 is true, x = 20 is also true - it may not have happened yet. That's if the other thread can even see the data; throw to next slide on visibility.
      1) Compilers and CPUs are free to reorder instructions.
      2) Different CPUs reorder different amounts - Intel x86 not much.
      3) Unless otherwise specified, you can only guarantee ordering within the same thread.
      4) Another thread can't rely on the order: x is not necessarily 20, if it's even visible.
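    A minimal sketch of the hazard just described (hypothetical class name; the printed outcome depends on the JIT and CPU):

        // Without synchronisation, Thread 2 may observe z == 40 while x is
        // still 0: neither the compiler nor the CPU has to preserve Thread 1's
        // program order for observers on another thread.
        public class ReorderingExample {
            static int x = 0;
            static int z = 0;

            public static void main(String[] args) {
                new Thread(() -> {          // Thread 1
                    x = 20;
                    z = 40;
                }).start();
                new Thread(() -> {          // Thread 2
                    if (z == 40) {
                        System.out.println("x = " + x);  // may legally print 0
                    }
                }).start();
            }
        }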
  • (Trish) Memory on modern CPUs consists of multiple layers of buffering and caching. Storing a value does not mean that it is immediately visible to all threads running on all cores; explicit instructions are needed to make data visible to other threads. This is the crux of why concurrency is hard: the logical order of your program is not maintained when observed from another thread (but sometimes it is). There are tools to help reason about concurrent programs, the main one being the memory model. Memory models exist at multiple levels - languages, VMs and CPUs can all have them. Java fortunately has a good one which is portable; C++ only introduced one in the most recent spec, so C++ programmers often had to think about the CPU's memory model (though there are helpful libraries and compiler intrinsics too). Reordering and smart use of caching are the result of many years of hardware engineering applied to the significant performance mismatch between the CPU and main memory. However, correctness is not the only concern - a lack of understanding of the detail can lead to other problems... (throw to Mike).
      1) Different layers of storage on a modern CPU (explain diagram).
      2) The different levels exist because main memory is slow.
      3) Data for your instruction could be at any of these levels; threads on a different CPU might not see it.
      4) The Java Memory Model is a good tool to reason about concurrent programming, and it's cross-platform.
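    A minimal follow-up sketch (hypothetical class name) of how the Java Memory Model restores ordering and visibility: making z volatile creates a happens-before edge from the write of x to any read that sees z == 40.

        public class VisibilityExample {
            static int x = 0;
            static volatile int z = 0;

            static void writer() {      // Thread 1
                x = 20;                 // ordinary write...
                z = 40;                 // ...published by the volatile write
            }

            static void reader() {      // Thread 2
                if (z == 40) {
                    assert x == 20;     // guaranteed by happens-before
                }
            }
        }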
  • (Mike) Are we taking a step back (Martijn)? Yes, necessarily so. Parallelism and concurrency are means to an end, not an end in their own right - the goal is performance. Performance generally covers throughput, latency and scalability; I'd throw in a fourth, energy efficiency. Concurrent code can bring with it a number of performance surprises - let's look at an example.
  • (Mike) Up to three orders of magnitude difference in performance between best and worst case. It's important to understand what's happening at the lower layers to understand why code can perform poorly. Understanding the details of how concurrency works at the machine level makes understanding higher-level concurrency models (e.g. Clojure STM, queues, actors) much easier. Locks require kernel arbitration. The Disruptor looks to remove as much contention as possible.
  • (Mike) Don't be scared of putting code into a single thread. Ported Guy Steele's algorithm, written in Fortress, to Scala and compared it to a brute-force single-threaded implementation. The Scala version uses the new parallel collections library. Not too much detail on the algorithm - it's based on a divide-and-conquer model to fit easily with fork/join.
  • (Mike) Tested with a copy of Alice in Wonderland. Guess how many cores were used to get 440 ops/sec? 8 cores with hyper-threading - 16 concurrent threads. While eventually the Scala version would be faster given enough cores, it is horribly inefficient in its use of energy, which is likely to become more of an issue as we move into the future. Don't take this as a negative regarding Scala's parallel collections.
  • (Mike) CPU performance is more complicated than a simple measurement of GHz or number of cores. Many other factors come into play: cache size, cache speed, bus architecture, data path sizes, number of caches...
  • (Trish) Intro to LMAX - real-world problems: DR/replication, high availability.
  • (Trish) SEDA architecture. A real enterprise solution needs more than just business logic: DR and journalling give you reliability. This is a single service; the business logic is the interesting thing.
  • (Trish) Testing showed each queue had its own latency overhead. When you add it all up, even including the IO for replication and journalling, queueing is a big chunk of the overall latency - the business logic takes such a tiny amount of time.
  • (Trish) Introduce the ring buffer, and talk about how it is the basis of the Disruptor. All event processors own their own sequence numbers: the producer writes to the ring buffer, which updates its sequence number; a consumer reads from the ring buffer and writes to its own sequence number (see the sketch below).
      1) RingBuffer/Disruptor intro.
      2) Producer writes, consumer reads.
      3) Individual sequence numbers.
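    A simplified single-producer/single-consumer sketch of that idea (illustrative only - the real Disruptor API is different): each side owns its own sequence, and the only coordination is reading the other side's counter.

        public class TinyRingBuffer {
            private static final int SIZE = 32;            // power of two
            private final Object[] ring = new Object[SIZE];
            private volatile long producerSequence = -1;   // written only by the producer
            private volatile long consumerSequence = -1;   // written only by the consumer

            public void publish(Object value) {            // producer thread only
                long next = producerSequence + 1;
                while (next - consumerSequence > SIZE) {   // ring full: wait for the consumer
                    Thread.yield();
                }
                ring[(int) (next % SIZE)] = value;
                producerSequence = next;                   // volatile store makes the slot visible
            }

            public Object take() {                         // consumer thread only
                long next = consumerSequence + 1;
                while (next > producerSequence) {          // nothing published yet: wait
                    Thread.yield();
                }
                Object value = ring[(int) (next % SIZE)];
                consumerSequence = next;                   // frees the slot for reuse
                return value;
            }
        }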
  • (Trish) Note that things can now be parallelised.
  • Diving in deep! If assembly code scares you, go get beer now.
  • (Mike) Intel doesn't need a special instruction for volatile reads - it just ensures the value is not cached in a register, and the write takes care of the cache invalidation. Reads are not reordered with respect to each other; Intel has a strong memory model. Other CPUs would require fence instructions on the read too.
    Notes from the brown bag: friendlier comment/Java code; talk about where cache lines are flushed; talk about the number of cycles.
  • (Trish) Explain cache lines (64 bytes): thread 2 reads tail, thread 1 writes head, and thread 2 then needs to reload tail.
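    A hedged demonstration of that problem (hypothetical class name; field layout is up to the JVM, but two adjacent longs will usually share a 64-byte line): two threads update two different counters, yet each write invalidates the other core's cached copy of the shared line.

        public class FalseSharingDemo {
            static volatile long head = 0;
            static volatile long tail = 0;   // likely on the same cache line as head

            public static void main(String[] args) throws InterruptedException {
                Thread t1 = new Thread(() -> {
                    for (long i = 0; i < 100_000_000L; i++) head++;
                });
                Thread t2 = new Thread(() -> {
                    for (long i = 0; i < 100_000_000L; i++) tail++;
                });
                long start = System.nanoTime();
                t1.start(); t2.start();
                t1.join();  t2.join();
                System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
            }
        }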
  • (Trish) (Still not sure about this - it doesn't make it clear where the sequence number is.) Create a bunch of empty longs to pad out the cache line, and add them to a public method so they don't get optimised away. See our shiny method names.
  • Concurrency is a good way to make your code slower and more complex.

Transcript

  • 1. Trisha Gee & Michael Barker / LMAX - The Disruptor: A Beginner's Guide to Hardcore Concurrency
  • 2. Why Is Concurrency So Difficult?
  • 3. Program Order:          Execution Order (maybe):
       int w = 10;              int x = 20;
       int x = 20;              int y = 30;
       int y = 30;              int b = x * y;
       int z = 40;              int w = 10;
       int a = w + z;           int z = 40;
       int b = x * y;           int a = w + z;
  • 4. Why Should We Care About the Details?
  • 5. static long foo = 0;

       private static void increment() {
           for (long l = 0; l < 500000000L; l++) {
               foo++;
           }
       }
  • 6. public static long foo = 0;
       public static Lock lock = new ReentrantLock(); // Lock is an interface; a concrete implementation is needed

       private static void increment() {
           for (long l = 0; l < 500000000L; l++) {
               lock.lock();
               try {
                   foo++;
               } finally {
                   lock.unlock();
               }
           }
       }
  • 7. static AtomicLong foo = new AtomicLong(0);

       private static void increment() {
           for (long l = 0; l < 500000000L; l++) {
               foo.getAndIncrement();
           }
       }
  • 8.-13. Cost of Contention (built up over six slides): increment a counter 500,000,000 times.
       - One Thread:            300 ms
       - One Thread (volatile): 4,700 ms   (15x)
       - One Thread (Atomic):   5,700 ms   (19x)
       - One Thread (Lock):     10,000 ms  (33x)
       - Two Threads (Atomic):  30,000 ms  (100x)
       - Two Threads (Lock):    224,000 ms (746x - roughly 4 minutes!)
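    A sketch of a harness for reproducing the single-threaded numbers above (our own scaffolding, not the speakers' benchmark; absolute timings vary by machine, only the relative costs matter):

        import java.util.concurrent.atomic.AtomicLong;
        import java.util.concurrent.locks.Lock;
        import java.util.concurrent.locks.ReentrantLock;

        public class ContentionBenchmark {
            static final long ITERATIONS = 500_000_000L;

            static long plain = 0;
            static volatile long vol = 0;
            static final AtomicLong atomic = new AtomicLong(0);
            static long locked = 0;
            static final Lock lock = new ReentrantLock();

            public static void main(String[] args) {
                time("plain   ", () -> { for (long l = 0; l < ITERATIONS; l++) plain++; });
                time("volatile", () -> { for (long l = 0; l < ITERATIONS; l++) vol++; });
                time("atomic  ", () -> { for (long l = 0; l < ITERATIONS; l++) atomic.getAndIncrement(); });
                time("lock    ", () -> {
                    for (long l = 0; l < ITERATIONS; l++) {
                        lock.lock();
                        try { locked++; } finally { lock.unlock(); }
                    }
                });
            }

            static void time(String name, Runnable task) {
                long start = System.nanoTime();
                task.run();
                System.out.println(name + ": " + (System.nanoTime() - start) / 1_000_000 + " ms");
            }
        }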
  • 14. Parallel v. Serial - String Split. Guy Steele @ Strange Loop: http://www.infoq.com/presentations/Thinking-Parallel-Programming. Scala implementation and brute-force version in Java: https://github.com/mikeb01/folklore/
  • 15. [Chart: String Split throughput (ops/sec), Parallel (Scala) v. Serial (Java); higher is better]
  • 16. CPUs Are Getting Faster
  • 17. Ya Rly! [Chart: String Split ops/sec on a P8600 (Core 2 Duo), E5620 (Nehalem EP), i7 2667M (Sandy Bridge ULV) and i7 2720QM (Sandy Bridge)]
  • 18. What Problem We're Trying To Solve?
  • 19. [Image-only slide]
  • 20. [Image-only slide]
  • 21. Why Queues Suck - Array Backed
  • 22.-23. Why Queues Suck - Linked List
  • 24. Contention Free Design
  • 25. [Image-only slide]
  • 26. How Fast Is It - Throughput: [Chart: ops/sec for ABQ v. Disruptor in Unicast and Diamond configurations]
  • 27. How Fast Is It - Latency:
                            ABQ          Disruptor
       Min                  145          29
       Mean                 32,757       52
       99 Percentile        2,097,152    128
       99.99 Percentile     4,194,304    8,192
       Max                  5,069,086    175,567
  • 28. How Does It Work?
  • 29. Ordering and Visibility:

       private static final int SIZE = 32;
       private final Object[] data = new Object[SIZE];
       private volatile long sequence = -1;
       private long nextValue = -1;

       public void publish(Object value) {
           long index = ++nextValue;
           data[(int) (index % SIZE)] = value;
           sequence = index;
       }

       public Object get(long index) {
           if (index <= sequence) {
               return data[(int) (index % SIZE)];
           }
           return null;
       }
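    A hypothetical usage of the snippet above, assuming it lives in a class called RingBuffer (the name the JIT output on the next two slides uses). The volatile store to sequence in publish() is what makes the preceding array write visible to the consumer's get():

        public class RingBufferDemo {
            public static void main(String[] args) {
                RingBuffer buffer = new RingBuffer();
                new Thread(() -> {                       // single producer
                    for (long i = 0; i < 10; i++) buffer.publish("event-" + i);
                }).start();
                new Thread(() -> {                       // single consumer
                    for (long i = 0; i < 10; i++) {
                        Object value;
                        while ((value = buffer.get(i)) == null) { /* spin until published */ }
                        System.out.println(value);
                    }
                }).start();
            }
        }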
  • 30. Ordering and Visibility - Store:

       mov     $0x1,%ecx
       add     0x18(%rsi),%rcx         ;*ladd
       ;...
       lea     (%r12,%r8,8),%r11       ;*getfield data
       ;...
       mov     %r12b,(%r11,%r10,1)
       mov     %rcx,0x10(%rsi)
       lock addl $0x0,(%rsp)           ;*ladd
  • 31. Ordering and Visibility - Load:

       mov     %eax,-0x6000(%rsp)
       push    %rbp
       sub     $0x20,%rsp              ;*synchronization entry
                                       ; - RingBuffer::get@-1
       mov     0x10(%rsi),%r10         ;*getfield sequence
                                       ; - RingBuffer::get@2
       cmp     %r10,%rdx
       jl      0x00007ff92505f22d      ;*iflt
                                       ; - RingBuffer::get@6
       mov     %edx,%r11d              ;*l2i
                                       ; - RingBuffer::get@14
  • 32. Look Ma' No Memory Barrier:

       AtomicLong sequence = new AtomicLong(-1);

       public void publish(Object value) {
           long index = ++nextValue;
           data[(int)(index % SIZE)] = value;
           sequence.lazySet(index);
       }
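    The design choice here: AtomicLong.lazySet performs an ordered store - earlier writes cannot be reordered past it - but, unlike the plain volatile write on slide 29, it does not emit the lock addl full fence seen on slide 30. The store may become visible to readers slightly later, a safe trade when there is a single writer.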
  • 33. False Sharing - Hidden Contention
  • 34. Cache Line Padding:

       public class PaddedAtomicLong extends AtomicLong {
           public volatile long p1, p2, p3, p4, p5, p6 = 7L;
           //... lines omitted

           public long sumPaddingToPreventOptimisation() {
               return p1 + p2 + p3 + p4 + p5 + p6;
           }
       }
  • 35. Summary
       - Concurrency is a tool
       - Ordering and visibility are the key challenges
       - For performance, the details matter
       - Don't believe everything you read: come up with your own theories and test them!
  • 36. Q&A
       recruitment@lmax.com