Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
TextJava and the Machine   Kirk Pepperdine (@kcpeppe)   Martijn Verburg (@karianna)
This guy is coming for you
The developer version of the T-800
Hey Martijn, do you know what this is?
A relic simpler era!!!
Intel 4004Shipped 1971108khz-740khz10µm die (1/10th of human hair)16 pin DIP2,300 transistorsshared address and data bus46...
Intel i7Shipped 20083.3Ghz (4200x)2nm die (5000x smaller)64 bit instructions1370 landings774 million transistors (~336,000...
Lots of challenges!• Back to basics• Challenge: Hardware Threads• Challenge: Virtualisation• Challenge: The JVM• Challenge...
Its 18:45 on conference day 1You have a comforting beverage in your hand    This talk is going to hurt             T-800 d...
Back to Basics• Moores Law• Littles Law• Amdahls Law• Its the hardware stupid!• What can I do?
Moores Law• Its about transistors not clocks
Littles Lawλ=1/µ
Littles Law        λ=1/µThroughput = 1 / Service Time
Amdahls Law
Its the Hardware stupid!• Software often thinks there are no limitations    – If youre Turing complete you can do anything...
Mechanical Sympathy             "Mechanical Sympathy                      -   Hardware and Software working             to...
What can I do?• Dont panic• Learn the laws    – Have them up next to your monitor• Code the laws    – See them in action!•...
Still with us?   Good.                 17
Hardware Threads• Threads are non deterministic• Single threaded programs• Multi core programs• What can I do?• Protect yo...
Threads are non-deterministic  "Threads seem to be a small step    from sequential computation"      "In fact, they repres...
In the beginning there was fork• It was a copy operation    – No common memory space    – So we had to work hard to share ...
In case youve forgotten
In case youve forgotten
Then we had multi-thread / single core   • new Thread() is like a light-weight fork       – Memory is now shared       – Y...
Now we have multiple hardware threads    • new Thread() is still a light-weight fork        – Memory is still shared      ...
What can I do?    "We protect code with the hope                   of protecting data."— Gil Tene, Azul Systems, 2008
Protect your data• You can use a bunch of different techniques    – synchronized    – volatile    – atomics, e.g. AtomicIn...
This fence was atomic
The following source code can be            read laterDont worry there is a point to this                               28
Running Naked    private int counter;public void execute( int threadCount) {     init();     Thread[] threads = new Thread...
Using volatileprivate volatile int counter;public void execute( int threadCount) {    init();    Thread[] threads = new Th...
Using synchronizedprivate int counter;private final Object lock = new Object();public void execute( int threadCount) {    ...
Using fully explicit Locksprivate int counter;private final ReentrantReadWriteLock lock =          new ReentrantReadWriteL...
Using AtomicIntegerprivate AtomicInteger counter = new AtomicInteger(0);public void execute( int threadCount) {    init();...
A small benchmark comparison
A small benchmark comparison
So the point is that locking kills         performance!                               35
Challenge: Virtualisation• Better utilisation of hardware?• You cant virtualise into more hardware• What can I do?
Virtualisation                    "Why do we do it?"— Martijn Verburg and Kirk Pepperdine, JAX London 2012
Better utilisation of hardware?• Could be utopia because we waste hardware?    – Most hardware is idle    – Load averages ...
T-800 is incapable of seeing your process        "A Process is defined as a locus            of control, an instruction   ...
Theres no such thing as aprocess as far as the hardware         is concerned                          40
You cant virtualise into more hardware!    • Throughput is often limited by a single resource        – That bottleneck can...
What can I do?• Build your financial and capacity use cases    – TCO for virtual vs bare metal    – What does your growth ...
Final Thought  "There are many reasons to move   to the cloud - performance isnt      necessarily one of them."— Kirk Pepp...
Challenge: The JVM• WORA• The cost of a strong memory model• GC scalability• GPU• File Systems• What can I do?
Brian?Youve got a lot of work to do                           45
WORA costs• CPU Model differences    – When are where are you allowed to cache?    – When and where are you allowed to reo...
The cost of the strong memory model   • The JVM is ultra conservative       – It rightly ensures correctness over performa...
The visibility mismatch• Unit of visibility on a CPU != that in the JVM    – CPU - Cache Line    – JVM - Where the memory ...
GC Scalability• GC cost is dominated by the # of live objects in heap    – Larger Heap ~= more live objects• Mark identifi...
GPU• Cant utilise easily today    – Java does not have normalised access to the GPU    – The GPU does not have access to J...
What can I do?• Not using java.util.concurrent?    – Youre probably doing it wrong• Avoid locks where possible!    – They ...
What can I do?• Workaround GC with SLAB allocators    – Hiding the objects from the GC in native memory    – Its a slipper...
Challenge: Java• java.lang.Thread is low level• File systems• What can I do?
java.lang.Thread is low level• Has low level co-ordination    – wait()    – notify(), notifyAll()• Also has dangerous oper...
File Systems• It was difficult to do asynchronous I/O    – You would get stalled on large file interaction    – You would ...
What can I do?• Read Brians book    – Java Concurrency in Practice• Move to java.util.concurrent    – People smarter than ...
Conclusion• Learn about hardware again   – Check your capacity and throughput• Virtualise with caution   – It can work for...
Dont become extinct
Be like Sarah
Acknowledgements• The jClarity Team• Gil Tene - Azul Systems• Warner Bros pictures• Lionsgate Films• Dan Hardiker - Adapti...
Enjoy the community night!      Kirk Pepperdine (@kcpeppe)      Martijn Verburg (@karianna)           www.jclarity.com
Upcoming SlideShare
Loading in …5
×

Java and the machine - Martijn Verburg and Kirk Pepperdine

1,099 views

Published on

In Terminator 3 - Rise of the Machines, bare metal comes back to haunt humanity, ruthlessly crushing all resistance. This keynote is here to warn you that the same thing is happening to Java and the JVM! Java was designed in a world where there were a wide range of hardware platforms to support. Its premise of Write Once Run Anywhere (WORA) proved to be one of the compelling reasons behind Java's dominance (even if the reality didn't quite meet the marketing hype). However, this WORA property means that Java and the JVM struggled to utilise specialist hardware and operating system features that could make a massive difference in the performance of your application. This problem has recently gotten much, much worse. Due to the rise of multi-core processors, massive increases in main memory and enhancements to other major hardware components (e.g. SSD), the JVM is now distant from utilising that hardware, causing some major performance and scalability issues! Kirk Pepperdine and Martijn Verburg will take you through the complexities of where Java meets the machine and loses. They'll give up some of their hard-won insights on how to work around these issues so that you can plan to avoid termination, unlike some of the poor souls that ran into the T-800...

  • Be the first to comment

  • Be the first to like this

Java and the machine - Martijn Verburg and Kirk Pepperdine

  1. 1. TextJava and the Machine Kirk Pepperdine (@kcpeppe) Martijn Verburg (@karianna)
  2. 2. This guy is coming for you
  3. 3. The developer version of the T-800
  4. 4. Hey Martijn, do you know what this is?
  5. 5. A relic simpler era!!!
  6. 6. Intel 4004Shipped 1971108khz-740khz10µm die (1/10th of human hair)16 pin DIP2,300 transistorsshared address and data bus46300 or 92600 instructions / secondInstruction cycle of 10.8 or 21.6 µs (8 clock cycles / instruction cycle)4 bit bus (12 bit addressing, 8 bit instruction, 4 bit word)46 instructions (41-8 bit, 5-16 bit), described in a 9 page doc!!!
  7. 7. Intel i7Shipped 20083.3Ghz (4200x)2nm die (5000x smaller)64 bit instructions1370 landings774 million transistors (~336,000x)50000x less expensive, 350,000x faster, 5000x less power / transistor31 stage instruction pipeline (instructions retired @ 1 / clock) • adjacent and nearly adjacent instructions run at the same timepackaging doc is 102 pagesMSR documentation is well over 7000 pages
  8. 8. Lots of challenges!• Back to basics• Challenge: Hardware Threads• Challenge: Virtualisation• Challenge: The JVM• Challenge: Java• Conclusion
  9. 9. Its 18:45 on conference day 1You have a comforting beverage in your hand This talk is going to hurt T-800 dont care. 9
  10. 10. Back to Basics• Moores Law• Littles Law• Amdahls Law• Its the hardware stupid!• What can I do?
  11. 11. Moores Law• Its about transistors not clocks
  12. 12. Littles Lawλ=1/µ
  13. 13. Littles Law λ=1/µThroughput = 1 / Service Time
  14. 14. Amdahls Law
  15. 15. Its the Hardware stupid!• Software often thinks there are no limitations – If youre Turing complete you can do anything!• The reality is that its fighting over finite resources – You remember hardware right?• These all have finite capacities and throughputs – CPU – RAM – Disk – Network – Bus
  16. 16. Mechanical Sympathy "Mechanical Sympathy - Hardware and Software working together"— Martin Thompson
  17. 17. What can I do?• Dont panic• Learn the laws – Have them up next to your monitor• Code the laws – See them in action!• Amdahls law says "dont serialise!" – With no serialisation, no need to worry abut Littles law!• Understand what hardware you have – Understand capacities and throughputs – Learn how to measure, not guess those numbers
  18. 18. Still with us? Good. 17
  19. 19. Hardware Threads• Threads are non deterministic• Single threaded programs• Multi core programs• What can I do?• Protect your code
  20. 20. Threads are non-deterministic "Threads seem to be a small step from sequential computation" "In fact, they represent a huge step."— The Problem with Threads, Edward A. Lee, UC Berkeley, 2006
  21. 21. In the beginning there was fork• It was a copy operation – No common memory space – So we had to work hard to share state between processes• Everything was local to us – We had no race conditions – Hardware caching was transparent, we didnt care – And we could cache anything at anytime without fear!• Life was simpler – Ordering was guaranteed
  22. 22. In case youve forgotten
  23. 23. In case youve forgotten
  24. 24. Then we had multi-thread / single core • new Thread() is like a light-weight fork – Memory is now shared – You dont work hard to share state – Instead you work hard to guarantee order • Guaranteeing order is a much harder problem – We started with volatile and synchronized – They were fence builders, to help guarantee order – It made earlier processors do more work • Hardware, O/S & JVM guys were playing leap frog!
  25. 25. Now we have multiple hardware threads • new Thread() is still a light-weight fork – Memory is still shared – Now we have to work hard to share state due to caching – We still have to work hard to guarantee order • Guaranteed order is now a performance hit – volatile and synchronized force draining and refreshing of caches – Some hardware support to get around this (CAS) – Some software support to get around this (NUMA) • Hardware, O/S & JVM guys are still playing leap frog!
  26. 26. What can I do? "We protect code with the hope of protecting data."— Gil Tene, Azul Systems, 2008
  27. 27. Protect your data• You can use a bunch of different techniques – synchronized – volatile – atomics, e.g. AtomicInteger – Explicit Java 5 Locks, e.g. ReentrantLock• But these all have a performance hit• Lets see examples of these in action
  28. 28. This fence was atomic
  29. 29. The following source code can be read laterDont worry there is a point to this 28
  30. 30. Running Naked private int counter;public void execute( int threadCount) { init(); Thread[] threads = new Thread[ threadCount]; for ( int i = 0; i < threads.length; i++) { threads[i] = new Thread( new Runnable() { public void run() { long iterations = 0; try { while ( running) { counter++; counter--; iterations++; } } finally { totalIterations += iterations; } } }); } ...
  31. 31. Using volatileprivate volatile int counter;public void execute( int threadCount) { init(); Thread[] threads = new Thread[ threadCount]; for ( int i = 0; i < threads.length; i++) { threads[i] = new Thread( new Runnable() { public void run() { long iterations = 0; try { while ( running) { counter++; counter--; iterations++; } } finally { totalIterations += iterations; } } }); } ...
  32. 32. Using synchronizedprivate int counter;private final Object lock = new Object();public void execute( int threadCount) { init(); Thread[] threads = new Thread[ threadCount]; for ( int i = 0; i < threads.length; i++) { threads[i] = new Thread( new Runnable() { public void run() { long iterations = 0; try { while ( running) { synchronized (lock) { counter++; counter--; } iterations++; } } finally { totalIterations += iterations; } } }); } ...
  33. 33. Using fully explicit Locksprivate int counter;private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();public void execute( int threadCount) { init(); Thread[] threads = new Thread[ threadCount]; for ( int i = 0; i < threads.length; i++) { threads[i] = new Thread( new Runnable() { public void run() { long iterations = 0; try { while ( running) { try { lock.writeLock.lock(); counter++; counter--; } finally { lock.writeLock.unlock(); } iterations++; } } finally { totalIterations += iterations; } ...
  34. 34. Using AtomicIntegerprivate AtomicInteger counter = new AtomicInteger(0);public void execute( int threadCount) { init(); Thread[] threads = new Thread[ threadCount]; for ( int i = 0; i < threads.length; i++) { threads[i] = new Thread( new Runnable() { public void run() { long iterations = 0; try { while ( running) { counter.getAndIncrement(); counter.getAndDecrement(); iterations++; } } finally { totalIterations += iterations; } ...
  35. 35. A small benchmark comparison
  36. 36. A small benchmark comparison
  37. 37. So the point is that locking kills performance! 35
  38. 38. Challenge: Virtualisation• Better utilisation of hardware?• You cant virtualise into more hardware• What can I do?
  39. 39. Virtualisation "Why do we do it?"— Martijn Verburg and Kirk Pepperdine, JAX London 2012
  40. 40. Better utilisation of hardware?• Could be utopia because we waste hardware? – Most hardware is idle – Load averages are far less than 10% on many systems• But why are our systems under utilised? – Often because we cant feed them (especially CPU) – Throughput in a system is often limited by a single resource• Trisha Gee is talking more about this tomorrow! – 14:25 - Java Performance FAQ
  41. 41. T-800 is incapable of seeing your process "A Process is defined as a locus of control, an instruction sequence and an abstraction entity which moves through the instructions of a procedure, as the procedure is executed by a processor" — Jack B Dennis, Earl C Van Horn, 1965
  42. 42. Theres no such thing as aprocess as far as the hardware is concerned 40
  43. 43. You cant virtualise into more hardware! • Throughput is often limited by a single resource – That bottleneck can starve everybody else! • Going from most problematic to most problematic – Network, Disk, RAM, CPU – Throughput overwhelms the wire – NAS means more pressure on network – CPU/RAM speed gap growing (8% per year) – Many thread scheduling issues (including cache!) – The list goes on and on and on.......... • Virtual machines sharing hardware causes resource conflict – You need more than core affinity
  44. 44. What can I do?• Build your financial and capacity use cases – TCO for virtual vs bare metal – What does your growth curve look like?• CPU Pinning trick / Processor Affinity• Memory Pinning trick• Move back to bare metal! – Its OK - really!
  45. 45. Final Thought "There are many reasons to move to the cloud - performance isnt necessarily one of them."— Kirk Pepperdine - random rant - 2010
  46. 46. Challenge: The JVM• WORA• The cost of a strong memory model• GC scalability• GPU• File Systems• What can I do?
  47. 47. Brian?Youve got a lot of work to do 45
  48. 48. WORA costs• CPU Model differences – When are where are you allowed to cache? – When and where are you allowed to reorder? – AMD vs Intel vs Sparc vs ARM vs .....• File system differences – O/S Level support• Display Devices - Argh! – Impossible to keep up with the consumer trends – Aspect ratios, resolution, colour depth etc• Power – From unlimited to extremely limited
  49. 49. The cost of the strong memory model • The JVM is ultra conservative – It rightly ensures correctness over performance • Locks enforce correctness – But in a very pessimistic way, which is expensive • Locks delineate regions of serialisation – Serialisation? – Uh oh! Remember Littles and Amdahls laws?
  50. 50. The visibility mismatch• Unit of visibility on a CPU != that in the JVM – CPU - Cache Line – JVM - Where the memory fences in the instruction pipeline are positioned• False sharing is an example of this visibility mismatch – volatile creates a fence which flushes the cache• We think that volatile is a modifier on a field – In reality its a modifier on a cache line• It can make other fields accidentally volatile
  51. 51. GC Scalability• GC cost is dominated by the # of live objects in heap – Larger Heap ~= more live objects• Mark identifies live objects – Takes longer with more objects• Sweep deals with live objects – Compaction only deals with live objects – Evacuation only deals with live objects• G1, Zing, and Balance are no different! – They still have to mark and sweep live objects – Mark for Zing is more efficient
  52. 52. GPU• Cant utilise easily today – Java does not have normalised access to the GPU – The GPU does not have access to Java heap• Today - third party libraries – That translate byte code to GPU RISC instructions – Bundle the data and the instructions together – Then push data and instructions through the GPU• Project Sumatra on its way – http://openjdk.java.net/projects/sumatra/ – Aparapi backs onto OpenCL• CPU and GPU memory to converge?
  53. 53. What can I do?• Not using java.util.concurrent? – Youre probably doing it wrong• Avoid locks where possible! – They make it difficult to run processes concurrently – But dont forget integrity!• Use Unsafe to implement lockless concurrency – Dangerous!! – Allows non-blocking / wait free implementations – Gives us access to the pointer system – Gives us access to the CAS instruction• Cliff Clicks awesome lib – http://sourceforge.net/projects/high-scale-lib/
  54. 54. What can I do?• Workaround GC with SLAB allocators – Hiding the objects from the GC in native memory – Its a slippery slope to writing your own GC • Were looking at you Cassandra!• Workaround GC using near cache – Storing the objects in another JVM• Peter Lawreys thread affinity OSS project – https://github.com/peter-lawrey/Java-Thread-Affinity• Join Adopt OpenJDK – http://www.java.net/projects/adoptopenjdk
  55. 55. Challenge: Java• java.lang.Thread is low level• File systems• What can I do?
  56. 56. java.lang.Thread is low level• Has low level co-ordination – wait() – notify(), notifyAll()• Also has dangerous operations – stop() unlocks all the monitors - leaving data in an inconsistent state• java.util.concurrent helps – But managing your threads is still manual
  57. 57. File Systems• It was difficult to do asynchronous I/O – You would get stalled on large file interaction – You would get stalled on multi-casting• There was no native file system support – NIO bought in pipe and selector support – NIO.2 bought in symbolic links support
  58. 58. What can I do?• Read Brians book – Java Concurrency in Practice• Move to java.util.concurrent – People smarter than us have thought about this – Use Runnables, Callables and ExecutorServices• Use thread safe collections – ConcurrentHashMap – CopyOnWriteArrayList• Move to Java 7 – NIO.2 gives you asynchronous I/O ! – Fork and Join helps with Amdahls law
  59. 59. Conclusion• Learn about hardware again – Check your capacity and throughput• Virtualise with caution – It can work for and against you• The JVM needs to evolve and so do you – OpenJDK is adapting, hopefully fast enough!• Learn to use parallelism in code – Challenge each other in peer reviews!
  60. 60. Dont become extinct
  61. 61. Be like Sarah
  62. 62. Acknowledgements• The jClarity Team• Gil Tene - Azul Systems• Warner Bros pictures• Lionsgate Films• Dan Hardiker - Adaptivist• Trisha Gee - 10gen• James Gough - Stackthread
  63. 63. Enjoy the community night! Kirk Pepperdine (@kcpeppe) Martijn Verburg (@karianna) www.jclarity.com

×