Highly Scalable Java Programming
      for Multi-Core System

        Zhi Gan (ganzhi@gmail.com)

        http://ganzhi.bl...
This is a list of Java programming techniques that can be used to improve the scalability of Java applications.

  • What if none of the preceding best practices meets your needs and you want to optimize your application by hand?
  • MSDK: this tool can be used for detailed performance analysis of concurrent Java applications. It performs an in-depth analysis of the complete execution stack, from the hardware up to the application layer. Information is gathered from all four layers of the stack: hardware, operating system, JVM and application.
  • For a multi-threaded application, the lock-free approach differs from the lock-based approach in several respects. When accessing a shared resource, the lock-based approach allows only one thread to enter the critical section while the others wait for it. The lock-free approach, by contrast, allows every thread to attempt to modify the shared state. Only one of the threads can succeed; all the others learn that their attempt failed, so they retry or choose another action.
  • The real difference shows when something bad happens to the running thread. If a running thread is paused by the OS scheduler, the two approaches behave differently. Lock-based approach: all other threads wait for this thread, and none of them can make progress. Lock-free approach: other threads remain free to perform any operation, and the paused thread might fail its current operation. From this difference we can see that lock-free has the greater advantage in a multi-core environment. It scales better because threads do not wait for each other. It does waste some CPU cycles under contention, but in most cases this is not a problem, since there is more than enough CPU resource.
  • Transcript of "Highly Scalable Java Programming for Multi-Core System"

    1. Highly Scalable Java Programming for Multi-Core System
       Zhi Gan (ganzhi@gmail.com)
       http://ganzhi.blogspot.com
    2. Agenda
       • Software Challenges
       • Profiling Tools Introduction
       • Best Practice for Java Programming
       • Rocket Science: Lock-Free Programming
    3. Software challenges
       • Parallelism
         – More threads per system = more parallelism needed to achieve high utilization
         – Thread-to-thread affinity (shared code and/or data)
       • Memory management
         – Sharing of cache and memory bandwidth across more threads = greater need for memory efficiency
         – Thread-to-memory affinity (execute thread closest to associated data)
       • Storage management
         – Allocate data across DRAM, Disk & Flash according to access frequency and patterns
    4. Typical Scalability Curve
    5. The 1st Step: Profiling Parallel Application
    6. Important Profiling Tools
       • Java Lock Monitor (JLM)
         – understand the usage of locks in applications
         – similar tool: Java Lock Analyzer (JLA)
       • Multi-core SDK (MSDK)
         – in-depth analysis of the complete execution stack
       • AIX Performance Tools
         – Simple Performance Lock Analysis Tool (SPLAT)
         – XProfiler
         – prof, tprof and gprof
    7. Tprof and VPA tool
    8. Java Lock Monitor
       • %MISS : 100 * SLOW / NONREC
       • GETS : Lock Entries
       • NONREC : Non Recursive Gets
       • SLOW : Non Recursives that Wait
       • REC : Recursive Gets
       • TIER2 : SMP: Total try-enter spin loop cnt (middle for 3 tier)
       • TIER3 : SMP: Total yield spin loop cnt (outer for 3 tier)
       • %UTIL : 100 * Hold-Time / Total-Time
       • AVER-HTM : Hold-Time / NONREC
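The derived columns of the JLM report are simple ratios of the raw counters, so they are easy to recompute by hand. The sketch below only restates the three formulas from the slide; the class name `JlmMetrics` and the counter values in `main` are made up for illustration and do not come from a real report. With 1,000 non-recursive gets of which 25 had to wait, %MISS comes out at 2.5.

```java
/** Derived columns of a JLM report, recomputed from the raw counters (illustrative helper). */
final class JlmMetrics {
    private JlmMetrics() {}

    /** %MISS = 100 * SLOW / NONREC */
    static double missPercent(long slow, long nonRec) {
        return 100.0 * slow / nonRec;
    }

    /** %UTIL = 100 * Hold-Time / Total-Time */
    static double utilPercent(long holdTime, long totalTime) {
        return 100.0 * holdTime / totalTime;
    }

    /** AVER-HTM = Hold-Time / NONREC */
    static double averageHoldTime(long holdTime, long nonRec) {
        return (double) holdTime / nonRec;
    }

    public static void main(String[] args) {
        // Hypothetical counters for a single monitor, not taken from a real report.
        long nonRec = 1_000;
        long slow = 25;
        long holdTime = 50_000;
        long totalTime = 200_000;

        System.out.println("%MISS    = " + missPercent(slow, nonRec));
        System.out.println("%UTIL    = " + utilPercent(holdTime, totalTime));
        System.out.println("AVER-HTM = " + averageHoldTime(holdTime, nonRec));
    }
}
```

A high %MISS together with a high AVER-HTM points at a lock that is both contended and held for a long time, which is the case the later slides try to reduce.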
    9. Multi-core SDK
       • Dead Lock View
       • Synchronization View
    10. Best Practice for High Scalable Java Programming
    11. What Is Lock Contention?
        From JLM tool website
    12. Lock Operation Itself Is Expensive
        • CAS operations are predominantly used for locking
        • locking takes up a big part of the execution time
    13. Reduce Locking Scope
        Whole method synchronized. Execution time: 16106 milliseconds

            public synchronized void foo1(int k) {
                String key = Integer.toString(k);
                String value = key + "value";
                if (null == key) {
                    return;
                } else {
                    maph.put(key, value);
                }
            }

        Only the map update synchronized. Execution time: 12157 milliseconds (25% less)

            public void foo2(int k) {
                // Building key and value touches no shared state, so it stays outside the lock.
                String key = Integer.toString(k);
                String value = key + "value";
                if (null == key) {
                    return;
                } else {
                    synchronized (this) {
                        maph.put(key, value);
                    }
                }
            }

    14. Results from JLM report
        Reduced AVER_HTM
    15. Lock Splitting
        One lock (this) guards both collections. Execution time: 12981 milliseconds

            public synchronized void addUser1(String u) {
                users.add(u);
            }

            public synchronized void addQuery1(String q) {
                queries.add(q);
            }

        One lock per collection. Execution time: 4797 milliseconds (64% less)

            public void addUser2(String u) {
                synchronized (users) {
                    users.add(u);
                }
            }

            public void addQuery2(String q) {
                synchronized (queries) {
                    queries.add(q);
                }
            }

    16. Result from JLM report
        Reduced lock tries
    17. Lock Striping
        One lock for the whole array. Execution time: 5536 milliseconds

            public synchronized void put1(int indx, String k) {
                share[indx] = k;
            }

        N_LOCKS locks, chosen by index. Execution time: 1857 milliseconds (66% less)

            public void put2(int indx, String k) {
                // Slots that map to different locks can be written concurrently.
                synchronized (locks[indx % N_LOCKS]) {
                    share[indx] = k;
                }
            }

    18. Result from JLM report
        More locks with less AVER_HTM
    19. Split Hot Points : Scalable Counter
        – ConcurrentHashMap maintains an independent counter for each segment of the hash map, and uses a lock for each counter
        – get the global counter by summing all independent counters
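The same idea can be applied to a stand-alone counter. What follows is a minimal sketch, not ConcurrentHashMap's actual code: the class `StripedCounter`, its random choice of stripe, and the `runDemo` helper are all assumptions made for illustration. Each stripe has its own count and its own lock, and the global value is obtained by summing the stripes.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative striped counter: one count and one lock per stripe. */
class StripedCounter {
    private final long[] counts;
    private final Object[] locks;

    StripedCounter(int stripes) {
        counts = new long[stripes];
        locks = new Object[stripes];
        for (int i = 0; i < stripes; i++) {
            locks[i] = new Object();
        }
    }

    /** Increments one randomly chosen stripe, so threads rarely compete for the same lock. */
    void increment() {
        int i = ThreadLocalRandom.current().nextInt(counts.length);
        synchronized (locks[i]) {
            counts[i]++;
        }
    }

    /** Sums all stripes. While writers are still running this is not an atomic snapshot. */
    long sum() {
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            synchronized (locks[i]) {
                total += counts[i];
            }
        }
        return total;
    }

    /** Hypothetical driver: several threads increment concurrently, then the total is read. */
    static long runDemo(int threads, int incrementsPerThread) {
        StripedCounter counter = new StripedCounter(8);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int n = 0; n < incrementsPerThread; n++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 10_000));
    }
}
```

On Java 8 and later, `java.util.concurrent.atomic.LongAdder` provides a ready-made counter built on the same splitting idea and is usually the simpler choice.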
    20. Alternatives of Exclusive Lock
        • Duplicate shared resource if possible
        • Atomic variables
          – counter, sequential number generator, head pointer of linked-list
        • Concurrent container
          – java.util.concurrent package, Amino lib
        • Read-Write Lock
          – java.util.concurrent.locks.ReadWriteLock
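Of these alternatives, the read-write lock is the one whose usage pattern is least obvious from its name. A small sketch, assuming a read-mostly map of settings (the class `ReadMostlySettings` is hypothetical): any number of readers may hold the read lock at once, while a writer needs the write lock exclusively.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative read-mostly map guarded by a ReadWriteLock. */
class ReadMostlySettings {
    private final Map<String, String> values = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Readers share the read lock, so concurrent lookups do not block each other. */
    String get(String key) {
        lock.readLock().lock();
        try {
            return values.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** A writer takes the write lock, which excludes both readers and other writers. */
    void put(String key, String value) {
        lock.writeLock().lock();
        try {
            values.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        ReadMostlySettings settings = new ReadMostlySettings();
        settings.put("mode", "fast");
        System.out.println(settings.get("mode"));
    }
}
```

This tends to pay off only when reads clearly outnumber writes and the work done under the read lock is not trivial; for very short critical sections the bookkeeping of the read-write lock can cost more than a plain exclusive lock.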
    21. Example of AtomicLongArray
        Synchronized access to a plain array. Execution time: 23550 milliseconds

            public synchronized void set1(int idx, long val) {
                d[idx] = val;
            }

            public synchronized long get1(int idx) {
                long ret = d[idx];
                return ret;
            }

        AtomicLongArray without locks. Execution time: 842 milliseconds (96% less)

            private final AtomicLongArray a;

            public void set2(int idx, long val) {
                // addAndGet adds val to the element, whereas set1 overwrites it;
                // a.set(idx, val) would be the exact equivalent of set1.
                a.addAndGet(idx, val);
            }

            public long get2(int idx) {
                long ret = a.get(idx);
                return ret;
            }
    22. Using Concurrent Container
        • java.util.concurrent package
          – since Java 1.5
          – ConcurrentHashMap, ConcurrentLinkedQueue, CopyOnWriteArrayList, etc.
        • Amino Lib is another good choice
          – LockFreeList, LockFreeStack, LockFreeQueue, etc.
        • Thread-safe container
        • Optimized for common operations
        • High performance and scalability for multi-core platform
        • Drawback: not all features are supported
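As a small illustration of a concurrent container replacing an explicit lock, the sketch below counts page hits in a `ConcurrentHashMap`. The class `HitCounter` and the `runDemo` helper are invented for this example, and `merge` requires Java 8 or later, which is newer than the Java 1.5 baseline named on the slide.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative hit counter built on ConcurrentHashMap, with no explicit locking. */
class HitCounter {
    private final ConcurrentHashMap<String, Integer> hits = new ConcurrentHashMap<>();

    /** merge performs the read-modify-write for one key atomically. */
    void record(String page) {
        hits.merge(page, 1, Integer::sum);
    }

    int hits(String page) {
        return hits.getOrDefault(page, 0);
    }

    /** Hypothetical driver: several threads record hits on the same page. */
    static int runDemo(int threads, int hitsPerThread) {
        HitCounter counter = new HitCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int n = 0; n < hitsPerThread; n++) {
                    counter.record("/home");
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter.hits("/home");
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 5_000));
    }
}
```

Writing `hits.put(page, hits.get(page) + 1)` instead would not be safe even on a concurrent map, because the read and the write are two separate operations.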
    23. Using Immutable and Thread Local data
        • Immutable data
          – remains unchanged during its life cycle
          – always thread-safe
        • Thread Local data
          – used only by a single thread
          – not shared among different threads
          – can replace a global waiting queue or object pool
          – used in work-stealing schedulers
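A minimal sketch of the immutable-data point, using a hypothetical `ImmutablePoint` class: all fields are final, there are no setters, and a "modification" returns a new object. Instances can therefore be shared between threads without any locking.

```java
/** Illustrative immutable value class: state is fixed at construction time. */
final class ImmutablePoint {
    private final int x;
    private final int y;

    ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int x() {
        return x;
    }

    int y() {
        return y;
    }

    /** Returns a new point instead of changing this one. */
    ImmutablePoint withX(int newX) {
        return new ImmutablePoint(newX, y);
    }

    public static void main(String[] args) {
        ImmutablePoint p = new ImmutablePoint(1, 2);
        ImmutablePoint q = p.withX(5);
        // p is unchanged; q is a separate object.
        System.out.println(p.x() + "," + p.y() + " and " + q.x() + "," + q.y());
    }
}
```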
    24. Reduce Memory Allocation
        • JVM: two levels of memory allocation
          – first from a thread-local buffer
          – then from a global buffer
        • The thread-local buffer is exhausted quickly if the allocation frequency is high
        • The ThreadLocal class may be helpful if a temporary object is needed in a loop
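One way the last point might look in practice is sketched below. The class `KeyFormatter` is hypothetical, and `ThreadLocal.withInitial` assumes Java 8 or later. Each thread keeps one `StringBuilder` and resets it on every call, so a hot loop does not allocate a new builder per iteration.

```java
/** Illustrative reuse of a per-thread temporary object via ThreadLocal. */
final class KeyFormatter {
    // One builder per thread, created lazily on first use by that thread.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(StringBuilder::new);

    private KeyFormatter() {}

    static String format(String prefix, int id) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0); // discard whatever the previous call left behind
        sb.append(prefix).append(':').append(id);
        return sb.toString(); // the result String is still a new object
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(format("user", i));
        }
    }
}
```

Whether this actually helps depends on the JVM and the workload, since modern JVMs can often eliminate short-lived allocations on their own, so the change is worth measuring before it is kept.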
    25. Rocket Science: Lock-Free Programming
    26. Using Lock-Free/Wait-Free Algorithm
        • Lock-free algorithms allow concurrent updates of shared data structures without using any locking mechanisms
          – solves some of the basic problems associated with using locks in the code
          – helps create algorithms that show good scalability
        • Highly scalable and efficient
        • Amino Lib
    27. Why Lock-Free Often Means Better Scalability? (I)
        • Lock: all threads wait for one
        • Lock-free: no waiting, but only one can succeed; other threads need to retry
    28. Why Lock-Free Often Means Better Scalability? (II)
        • Lock: all threads wait for one
        • Lock-free: no waiting, but only one can succeed; other threads often need to retry
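The "only one can succeed, the others retry" behaviour is easiest to see in a compare-and-set loop. The sketch below is a classic Treiber-style stack written for illustration; it is not Amino Lib's LockFreeStack, and the class name `TreiberStack` is an assumption. Every thread reads the current head, prepares its update, and publishes it with `compareAndSet`; a thread whose CAS fails simply loops and tries again with the new head.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative lock-free stack (Treiber-style), built on AtomicReference. */
class TreiberStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;

        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead;
        Node<T> newHead;
        do {
            oldHead = head.get();
            newHead = new Node<>(value, oldHead);
            // The CAS fails if another thread changed head in the meantime; then retry.
        } while (!head.compareAndSet(oldHead, newHead));
    }

    /** Returns the most recently pushed value, or null if the stack is empty. */
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) {
                return null;
            }
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }

    public static void main(String[] args) {
        TreiberStack<Integer> stack = new TreiberStack<>();
        stack.push(1);
        stack.push(2);
        System.out.println(stack.pop());
    }
}
```

No thread ever blocks here: a thread paused by the scheduler in the middle of `push` holds nothing that others need, which is the property the two slides above describe.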
    29. Performance of A Lock-Free Stack
        Picture from: http://www.infoq.com/articles/scalable-java-components
    30. References
        • Amino Lib
          – http://amino-cbbs.sourceforge.net/
        • MSDK
          – http://www.alphaworks.ibm.com/tech/msdk
        • JLA
          – http://www.alphaworks.ibm.com/tech/jla
    31. Backup