Java and the machine - Martijn Verburg and Kirk Pepperdine


In Terminator 3: Rise of the Machines, bare metal comes back to haunt humanity, ruthlessly crushing all resistance. This keynote is here to warn you that the same thing is happening to Java and the JVM! Java was designed in a world where there was a wide range of hardware platforms to support. Its premise of Write Once Run Anywhere (WORA) proved to be one of the compelling reasons behind Java's dominance (even if the reality didn't quite meet the marketing hype). However, this WORA property means that Java and the JVM struggle to utilise specialist hardware and operating system features that could make a massive difference to the performance of your application. This problem has recently gotten much, much worse. Due to the rise of multi-core processors, massive increases in main memory and enhancements to other major hardware components (e.g. SSDs), the JVM is now further than ever from fully utilising that hardware, causing some major performance and scalability issues! Kirk Pepperdine and Martijn Verburg will take you through the complexities of where Java meets the machine and loses. They'll share some of their hard-won insights on how to work around these issues so that you can plan to avoid termination, unlike some of the poor souls that ran into the T-800...

  • This is who we are, this is what we do, Google the rest
  • Kirk after too much sun in Crete
  • i7-3770
  • Intel 4004, the first commercial microprocessor (1971)
  • * Don't know what this is? Get out.
  • Talk about critical sections / points of execution serialisation
  • Amdahl's law is named after Gene Amdahl and is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors. It was presented at the AFIPS Spring Joint Computer Conference in 1967.
  • Moore's law is about the capacity of the CPU. Clock speed is about throughput. There is a finite amount of space on the edge of the chip and a fixed width to the bus.
  • If you have one server per customer in a perfect parallel world... Talk to your sys admins / devops folks, buy more hardware ;-) Utilisation is an aggregation - look at the system as a whole
  • The calculus and assembly programming are in the next sections
  • <TODO: Is it really Hardware threads?> Hardware thread = core / hyperthreaded core
  • Thread as an abstraction helps you deal with sequential computation; the problem lies in when you pluralise this.
  • copy ~= clone
  • Even when there are multiple threads there is very little chance of corruption, as forks sharing a hardware thread all accessed the same cache
  • Here's your takeaway point - locks are f%^&in slow
  • <TODO: Maybe kill horizontal scalability?> * A hypervisor, also called a virtual machine manager (VMM) - e.g. KVM or Xen. * Containers are like chroot on steroids: change what you think is root and partition off a chunk of the Linux O/S from there, e.g. LXC
  • Kirk --> Martijn --> Ask the audience! Administration and business data centre reasoning
  • * A process is an artificial abstraction that we use
  • And there's no such thing as a thread either
  • * Early cloud had network bottlenecks (think dial-up modems!) * NAS is a problem (S3 for example) - it turns a disk problem into a network problem. Hypervisors interact with the scheduler (look at the recent Linux kernel work)
  • <TODO: Keep slide hidden?> The cost savings aren't always better
  • * Bosses talk $£, not 1's and 0's
  • It's early 1990's tech that's in maintenance
  • Instruction fetch; instruction decode and register fetch; execute; memory access; register write back
  • Unlike hypervisors, which can spin up a thread on a single core
  • The JVM's responsibility is to execute byte code, the CPU's to execute instruction pipelines
  • Talk to concurrent GC and STW. live == pinned, so leaked memory is live memory by definition
  • * This is an extra resource, but it's a specialised one for you number crunchers
  • Utilise pseudo thread affinity. NUMA might/can/does do that, but perhaps a developer could do better? Do we need a CPULocal?
  • <TODO: weak section>
  • Lambdas are a neat way of encapsulating behaviour and data that you will likely want to run in a parallel fashion in the future...

    1. Java and the Machine - Kirk Pepperdine (@kcpeppe), Martijn Verburg (@karianna)
    2. This guy is coming for you
    3. The developer version of the T-800
    4. Hey Martijn, do you know what this is?
    5. A relic of a simpler era!!!
    6. Intel 4004 - Shipped 1971; 108kHz-740kHz; 10µm process (1/10th of a human hair); 16-pin DIP; 2,300 transistors; shared address and data bus; 46,300 or 92,600 instructions / second; instruction cycle of 10.8 or 21.6 µs (8 clock cycles / instruction cycle); 4-bit bus (12-bit addressing, 8-bit instruction, 4-bit word); 46 instructions (41 8-bit, 5 16-bit), described in a 9-page doc!!!
    7. Intel i7 - Shipped 2008; 3.3GHz (4,200x); 45nm process (~220x smaller); 64-bit instructions; LGA1366 socket; 774 million transistors (~336,000x); 50,000x less expensive, 350,000x faster, 5,000x less power / transistor; 31-stage instruction pipeline (instructions retired @ 1 / clock) - adjacent and nearly adjacent instructions run at the same time; packaging doc is 102 pages; MSR documentation is well over 7,000 pages
    8. Lots of challenges! • Back to basics • Challenge: Hardware Threads • Challenge: Virtualisation • Challenge: The JVM • Challenge: Java • Conclusion
    9. It's 18:45 on conference day 1. You have a comforting beverage in your hand. This talk is going to hurt. T-800 don't care.
    10. Back to Basics • Moore's Law • Little's Law • Amdahl's Law • It's the hardware, stupid! • What can I do?
    11. Moore's Law • It's about transistors, not clocks
    12. Little's Law: λ = 1/µ
    13. Little's Law: λ = 1/µ (Throughput = 1 / Service Time)
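The throughput relationship on the slide above can be sketched in a few lines of Java. This is an illustrative sketch only: the class name, method name and the 20 ms service time below are ours, not from the talk.

```java
// A minimal sketch of λ = 1/µ: throughput is the reciprocal of service time.
// The service time used in main() is hypothetical, chosen only to show the maths.
public class LittlesLaw {
    // With a service time of µ seconds per request, one server can
    // complete at most 1/µ requests per second.
    static double throughput(double serviceTimeSeconds) {
        return 1.0 / serviceTimeSeconds;
    }

    public static void main(String[] args) {
        double serviceTime = 0.020; // 20 ms per request (made-up number)
        System.out.printf("Max throughput: %.0f req/s%n", throughput(serviceTime));
    }
}
```

Halve the service time and you double the throughput, which is why the later slides care so much about anything (like a lock) that inflates service time.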
    14. Amdahl's Law
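Amdahl's law is simple enough to code up directly. A throwaway sketch (the class and parameter names are ours) showing why even a small serial fraction dominates as core counts grow:

```java
// Amdahl's law: speedup(n) = 1 / ((1 - p) + p/n), where p is the
// parallelisable fraction of the work and n is the processor count.
public class Amdahl {
    static double speedup(double parallelFraction, int processors) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / processors);
    }

    public static void main(String[] args) {
        // Even with 95% of the work parallelisable, infinite cores cap out at 20x.
        for (int n : new int[] {1, 2, 8, 64, 1_000_000}) {
            System.out.printf("n=%d -> %.2fx%n", n, speedup(0.95, n));
        }
    }
}
```

The 5% serial remainder is the "region of serialisation" the JVM slides return to later: no amount of hardware buys it back.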
    15. It's the hardware, stupid! • Software often thinks there are no limitations – If you're Turing complete you can do anything! • The reality is that it's fighting over finite resources – You remember hardware, right? • These all have finite capacities and throughputs – CPU – RAM – Disk – Network – Bus
    16. Mechanical Sympathy: "Mechanical Sympathy - Hardware and Software working together" — Martin Thompson
    17. What can I do? • Don't panic • Learn the laws – Have them up next to your monitor • Code the laws – See them in action! • Amdahl's law says "don't serialise!" – With no serialisation, no need to worry about Little's law! • Understand what hardware you have – Understand capacities and throughputs – Learn how to measure, not guess, those numbers
    18. Still with us? Good.
    19. Hardware Threads • Threads are non-deterministic • Single threaded programs • Multi core programs • What can I do? • Protect your code
    20. Threads are non-deterministic: "Threads seem to be a small step from sequential computation... In fact, they represent a huge step." — The Problem with Threads, Edward A. Lee, UC Berkeley, 2006
    21. In the beginning there was fork • It was a copy operation – No common memory space – So we had to work hard to share state between processes • Everything was local to us – We had no race conditions – Hardware caching was transparent, we didn't care – And we could cache anything at any time without fear! • Life was simpler – Ordering was guaranteed
    22. In case you've forgotten
    23. In case you've forgotten
    24. Then we had multi-thread / single core • new Thread() is like a light-weight fork – Memory is now shared – You don't work hard to share state – Instead you work hard to guarantee order • Guaranteeing order is a much harder problem – We started with volatile and synchronized – They were fence builders, to help guarantee order – It made earlier processors do more work • Hardware, O/S & JVM guys were playing leap frog!
    25. Now we have multiple hardware threads • new Thread() is still a light-weight fork – Memory is still shared – Now we have to work hard to share state due to caching – We still have to work hard to guarantee order • Guaranteed order is now a performance hit – volatile and synchronized force draining and refreshing of caches – Some hardware support to get around this (CAS) – Some software support to get around this (NUMA) • Hardware, O/S & JVM guys are still playing leap frog!
    26. What can I do? "We protect code with the hope of protecting data." — Gil Tene, Azul Systems, 2008
    27. Protect your data • You can use a bunch of different techniques – synchronized – volatile – atomics, e.g. AtomicInteger – Explicit Java 5 locks, e.g. ReentrantLock • But these all have a performance hit • Let's see examples of these in action
    28. This fence was atomic
    29. The following source code can be read later. Don't worry, there is a point to this.
    30. Running Naked

        private int counter;

        public void execute(int threadCount) {
          init();
          Thread[] threads = new Thread[threadCount];
          for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
              public void run() {
                long iterations = 0;
                try {
                  while (running) {
                    counter++;
                    counter--;
                    iterations++;
                  }
                } finally {
                  totalIterations += iterations;
                }
              }
            });
          }
          ...

    31. Using volatile

        private volatile int counter;

        public void execute(int threadCount) {
          init();
          Thread[] threads = new Thread[threadCount];
          for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
              public void run() {
                long iterations = 0;
                try {
                  while (running) {
                    counter++;
                    counter--;
                    iterations++;
                  }
                } finally {
                  totalIterations += iterations;
                }
              }
            });
          }
          ...

    32. Using synchronized

        private int counter;
        private final Object lock = new Object();

        public void execute(int threadCount) {
          init();
          Thread[] threads = new Thread[threadCount];
          for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
              public void run() {
                long iterations = 0;
                try {
                  while (running) {
                    synchronized (lock) {
                      counter++;
                      counter--;
                    }
                    iterations++;
                  }
                } finally {
                  totalIterations += iterations;
                }
              }
            });
          }
          ...

    33. Using fully explicit Locks

        private int counter;
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        public void execute(int threadCount) {
          init();
          Thread[] threads = new Thread[threadCount];
          for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
              public void run() {
                long iterations = 0;
                try {
                  while (running) {
                    try {
                      lock.writeLock().lock();
                      counter++;
                      counter--;
                    } finally {
                      lock.writeLock().unlock();
                    }
                    iterations++;
                  }
                } finally {
                  totalIterations += iterations;
                }
              }
            });
          }
          ...

    34. Using AtomicInteger

        private final AtomicInteger counter = new AtomicInteger(0);

        public void execute(int threadCount) {
          init();
          Thread[] threads = new Thread[threadCount];
          for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
              public void run() {
                long iterations = 0;
                try {
                  while (running) {
                    counter.getAndIncrement();
                    counter.getAndDecrement();
                    iterations++;
                  }
                } finally {
                  totalIterations += iterations;
                }
              }
            });
          }
          ...
    35. A small benchmark comparison
    36. A small benchmark comparison
    37. So the point is that locking kills performance!
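The code slides above elide the timing and thread-management plumbing (init(), totalIterations, how running is toggled). Here is a hedged, self-contained sketch of such a harness, assuming a simple sleep-based timer; thread count, run length and the per-variant bodies are our choices, not the original benchmark.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LockCost {
    static volatile boolean running;  // volatile so workers see the stop signal

    // Spin `threads` workers that run `body` in a loop for `millis` ms and
    // return the total iterations completed across all of them.
    static long run(Runnable body, int threads, long millis) throws InterruptedException {
        final long[] iterations = new long[threads];
        Thread[] workers = new Thread[threads];
        running = true;
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                while (running) { body.run(); iterations[id]++; }
            });
            workers[i].start();
        }
        Thread.sleep(millis);
        running = false;
        long total = 0;
        for (int i = 0; i < threads; i++) { workers[i].join(); total += iterations[i]; }
        return total;  // higher = cheaper body
    }

    static int naked;
    static int locked;
    static final Object lock = new Object();
    static final AtomicInteger atomic = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        int n = Runtime.getRuntime().availableProcessors();
        System.out.println("naked : " + run(() -> { naked++; naked--; }, n, 500));
        System.out.println("atomic: " + run(() -> { atomic.getAndIncrement(); atomic.getAndDecrement(); }, n, 500));
        System.out.println("sync  : " + run(() -> { synchronized (lock) { locked++; locked--; } }, n, 500));
    }
}
```

On most multi-core boxes the naked counter completes the most iterations and the synchronized variant the fewest, which is the slide's point. Note the naked version is also the only one that can corrupt the counter, which is the previous section's point.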
    38. Challenge: Virtualisation • Better utilisation of hardware? • You can't virtualise into more hardware • What can I do?
    39. Virtualisation: "Why do we do it?" — Martijn Verburg and Kirk Pepperdine, JAX London 2012
    40. Better utilisation of hardware? • Could be utopia, because we waste hardware – Most hardware is idle – Load averages are far less than 10% on many systems • But why are our systems under-utilised? – Often because we can't feed them (especially CPU) – Throughput in a system is often limited by a single resource • Trisha Gee is talking more about this tomorrow! – 14:25 - Java Performance FAQ
    41. T-800 is incapable of seeing your process: "A Process is defined as a locus of control, an instruction sequence and an abstraction entity which moves through the instructions of a procedure, as the procedure is executed by a processor" — Jack B Dennis, Earl C Van Horn, 1965
    42. There's no such thing as a process as far as the hardware is concerned
    43. You can't virtualise into more hardware! • Throughput is often limited by a single resource – That bottleneck can starve everybody else! • Going from least problematic to most problematic – Network, Disk, RAM, CPU – Throughput overwhelms the wire – NAS means more pressure on the network – CPU/RAM speed gap growing (8% per year) – Many thread scheduling issues (including cache!) – The list goes on and on and on.......... • Virtual machines sharing hardware cause resource conflicts – You need more than core affinity
    44. What can I do? • Build your financial and capacity use cases – TCO for virtual vs bare metal – What does your growth curve look like? • CPU pinning trick / processor affinity • Memory pinning trick • Move back to bare metal! – It's OK - really!
    45. Final Thought: "There are many reasons to move to the cloud - performance isn't necessarily one of them." — Kirk Pepperdine, random rant, 2010
    46. Challenge: The JVM • WORA • The cost of a strong memory model • GC scalability • GPU • File systems • What can I do?
    47. Brian? You've got a lot of work to do
    48. WORA costs • CPU model differences – When and where are you allowed to cache? – When and where are you allowed to reorder? – AMD vs Intel vs Sparc vs ARM vs ..... • File system differences – O/S level support • Display devices - argh! – Impossible to keep up with consumer trends – Aspect ratios, resolution, colour depth etc. • Power – From unlimited to extremely limited
    49. The cost of the strong memory model • The JVM is ultra conservative – It rightly ensures correctness over performance • Locks enforce correctness – But in a very pessimistic way, which is expensive • Locks delineate regions of serialisation – Serialisation? – Uh oh! Remember Little's and Amdahl's laws?
    50. The visibility mismatch • Unit of visibility on a CPU != that in the JVM – CPU - cache line – JVM - where the memory fences in the instruction pipeline are positioned • False sharing is an example of this visibility mismatch – volatile creates a fence which flushes the cache • We think that volatile is a modifier on a field – In reality it's a modifier on a cache line • It can make other fields accidentally volatile
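One common mitigation for the accidental-volatile effect just described is to pad the hot field onto its own cache line. This is only a sketch: the seven surrounding longs assume a 64-byte line, the JVM is free to reorder fields, and Java 8 later added @sun.misc.Contended (behind -XX:-RestrictContended) to do this properly; treat it as an illustration, not a guarantee.

```java
// Hand-padded counter: the padding aims to keep `value` from sharing a
// cache line with any neighbouring hot field. Assumes a 64-byte cache line;
// the JVM may reorder or elide unused fields, so this is illustrative only.
public class PaddedCounter {
    @SuppressWarnings("unused")
    private long p1, p2, p3, p4, p5, p6, p7;   // padding before the hot field
    private volatile long value;               // the field threads hammer on
    @SuppressWarnings("unused")
    private long q1, q2, q3, q4, q5, q6, q7;   // padding after the hot field

    public void increment() { value++; }       // visible, but NOT atomic
    public long get() { return value; }
}
```

With two such counters side by side, each thread's volatile writes no longer invalidate the other thread's line, which is exactly the false-sharing cost the slide is describing.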
    51. GC scalability • GC cost is dominated by the # of live objects in the heap – Larger heap ~= more live objects • Mark identifies live objects – Takes longer with more objects • Sweep deals with live objects – Compaction only deals with live objects – Evacuation only deals with live objects • G1, Zing and Balance are no different! – They still have to mark and sweep live objects – Mark for Zing is more efficient
    52. GPU • Can't utilise easily today – Java does not have normalised access to the GPU – The GPU does not have access to the Java heap • Today - third party libraries – That translate byte code to GPU RISC instructions – Bundle the data and the instructions together – Then push data and instructions through the GPU • Project Sumatra on its way – – Aparapi backs onto OpenCL • CPU and GPU memory to converge?
    53. What can I do? • Not using java.util.concurrent? – You're probably doing it wrong • Avoid locks where possible! – They make it difficult to run processes concurrently – But don't forget integrity! • Use Unsafe to implement lockless concurrency – Dangerous!! – Allows non-blocking / wait-free implementations – Gives us access to the pointer system – Gives us access to the CAS instruction • Cliff Click's awesome lib –
    54. What can I do? • Work around GC with SLAB allocators – Hiding the objects from the GC in native memory – It's a slippery slope to writing your own GC • We're looking at you, Cassandra! • Work around GC using a near cache – Storing the objects in another JVM • Peter Lawrey's thread affinity OSS project – • Join Adopt OpenJDK –
    55. Challenge: Java • java.lang.Thread is low level • File systems • What can I do?
    56. java.lang.Thread is low level • Has low level co-ordination – wait() – notify(), notifyAll() • Also has dangerous operations – stop() unlocks all the monitors, leaving data in an inconsistent state • java.util.concurrent helps – But managing your threads is still manual
    57. File Systems • It was difficult to do asynchronous I/O – You would get stalled on large file interaction – You would get stalled on multi-casting • There was no native file system support – NIO brought in pipe and selector support – NIO.2 brought in symbolic link support
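For the asynchronous I/O point above, here is a minimal NIO.2 read using AsynchronousFileChannel (Java 7+). The helper class and method names are ours, not from the talk, and a production version would loop until the whole file is read rather than assume one read() suffices.

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

public class AsyncRead {
    // Kick off an asynchronous read, then wait on the Future. In real code
    // you would do useful work (or use a CompletionHandler callback) between
    // read() and get() instead of blocking immediately.
    public static String readAll(Path path) throws Exception {
        try (AsynchronousFileChannel channel =
                 AsynchronousFileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
            Future<Integer> pending = channel.read(buffer, 0); // returns immediately
            int bytesRead = pending.get();                     // block for completion
            return new String(buffer.array(), 0, bytesRead);
        }
    }
}
```

The calling thread is no longer stalled for the duration of the disk operation, which is exactly the "stalled on large file interaction" problem the slide describes.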
    58. What can I do? • Read Brian's book – Java Concurrency in Practice • Move to java.util.concurrent – People smarter than us have thought about this – Use Runnables, Callables and ExecutorServices • Use thread safe collections – ConcurrentHashMap – CopyOnWriteArrayList • Move to Java 7 – NIO.2 gives you asynchronous I/O! – Fork and Join helps with Amdahl's law
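As a concrete instance of the "use Runnables, Callables and ExecutorServices" advice, here is a small sketch that farms a summation out to a fixed thread pool. The task and every name in it are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SumPool {
    // Split `data` into `chunks` ranges, sum each range on a pool thread via
    // a Callable, and combine the partial results through Futures.
    public static long parallelSum(int[] data, int chunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            int chunkSize = (data.length + chunks - 1) / chunks;
            List<Future<Long>> partials = new ArrayList<>();
            for (int c = 0; c < chunks; c++) {
                final int from = c * chunkSize;
                final int to = Math.min(data.length, from + chunkSize);
                partials.add(pool.submit(() -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) sum += data[i];
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> partial : partials) total += partial.get();
            return total;
        } finally {
            pool.shutdown();  // the pool, not us, manages thread lifecycle
        }
    }
}
```

No wait()/notify(), no hand-rolled Thread management: the pool owns the threads and Future.get() does the co-ordination, which is the point of the slide.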
    59. Conclusion • Learn about hardware again – Check your capacity and throughput • Virtualise with caution – It can work for and against you • The JVM needs to evolve, and so do you – OpenJDK is adapting, hopefully fast enough! • Learn to use parallelism in code – Challenge each other in peer reviews!
    60. Don't become extinct
    61. Be like Sarah
    62. Acknowledgements • The jClarity Team • Gil Tene - Azul Systems • Warner Bros Pictures • Lionsgate Films • Dan Hardiker - Adaptivist • Trisha Gee - 10gen • James Gough - Stackthread
    63. Enjoy the community night! Kirk Pepperdine (@kcpeppe) Martijn Verburg (@karianna)