A Study of the Scalability of Stop-the-World
Garbage Collectors on Multicores
Aliya Ibragimova
University of Fribourg
Agenda
•   Overview
•   Problem Statement
•   Parallel Scavenge description
•   Identifying bottlenecks
•   Methods and solutions
•   Results
•   Conclusion
Overview
• A stop-the-world collector performs garbage
  collection while the application is completely
  stopped
• A parallel collector uses multiple threads to
  perform garbage collection

Example: Parallel Scavenge, available in
OpenJDK 7
Problem Statement
   The stop-the-world (STW) algorithm degrades badly beyond
8 cores on a 48-core NUMA machine with OpenJDK 7:

  – Does the stop-the-world design have intrinsic
    limitations?
  – If not, what are the limitations of the STW approach?
  – How can we improve the current design?
Parallel Scavenge
Contended locks: GC monitor’s lock
Beginning of parallel phase

[Figure: GC threads, GC monitor’s lock, GC task queue]

Solution: use Michael-Scott lock-free queue
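
To make the proposed solution concrete, here is a minimal sketch of a Michael-Scott lock-free queue in Java. It is an illustration only, not the code used in the study: the class name `MSQueue` and its methods are hypothetical, and the real GC task queue lives inside the JVM itself. The sketch relies on Java's garbage collector to avoid the ABA problem, which a native implementation would have to handle explicitly.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative Michael-Scott lock-free FIFO queue (hypothetical names). */
final class MSQueue<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    // head always points at a dummy node; the first real element is head.next.
    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    MSQueue() {
        Node<T> dummy = new Node<>(null);
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    /** Appends a task without taking any lock. */
    void enqueue(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last != tail.get()) continue;        // snapshot changed, retry
            if (next == null) {
                // Try to link the new node after the current last node.
                if (last.next.compareAndSet(null, node)) {
                    tail.compareAndSet(last, node);  // swing tail; may fail harmlessly
                    return;
                }
            } else {
                tail.compareAndSet(last, next);      // help a lagging enqueuer
            }
        }
    }

    /** Removes and returns the oldest task, or null if the queue is empty. */
    T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first != head.get()) continue;       // snapshot changed, retry
            if (next == null) return null;           // queue is empty
            if (first == last) {
                tail.compareAndSet(last, next);      // tail is lagging, help it
                continue;
            }
            T value = next.value;
            if (head.compareAndSet(first, next)) return value;
        }
    }
}
```

With such a queue, GC threads fetching tasks at the beginning of the parallel phase retry a compare-and-set on contention instead of blocking on a single monitor lock.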
Contended locks: GC monitor’s lock
The end of parallel phase

[Figure: GC monitor’s lock, global counter]

Solution: remove redundant synchronization;
          use timestamps to avoid race conditions
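
The slide does not spell out how the timestamps are used, so the following is only one possible reading, sketched in Java under stated assumptions: the global counter of still-running GC threads is updated with compare-and-set rather than under the monitor's lock, and it is tagged with a phase timestamp so that a late arrival from an earlier phase cannot corrupt the counter of the current one. The class `PhaseBarrier` and all of its methods are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical lock-free end-of-phase barrier tagged with a phase timestamp. */
final class PhaseBarrier {
    // High 32 bits: phase timestamp. Low 32 bits: workers still running.
    private final AtomicLong state = new AtomicLong(0);

    /** Starts a new phase with the given number of workers; returns its timestamp. */
    long begin(int workers) {
        long timestamp = (state.get() >>> 32) + 1;
        state.set((timestamp << 32) | workers);
        return timestamp;
    }

    /**
     * Called by a worker when it finishes its share of the phase.
     * Returns true only for the last worker of the phase named by the timestamp.
     */
    boolean arrive(long timestamp) {
        while (true) {
            long s = state.get();
            if ((s >>> 32) != timestamp) return false; // stale phase, ignore
            int remaining = (int) s;                   // low 32 bits
            if (remaining == 0) return false;          // phase already complete
            long next = (timestamp << 32) | (remaining - 1);
            if (state.compareAndSet(s, next)) return remaining == 1;
        }
    }

    /** True once every worker of the given phase has arrived. */
    boolean done(long timestamp) {
        long s = state.get();
        return (s >>> 32) == timestamp && (int) s == 0;
    }
}
```

In this sketch only the last arriving worker observes `true`, so only that worker needs to wake whoever waits for the phase to end.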
Contended locks: GC monitor’s lock
Idea: remove GC monitor’s lock

1. Task queue
     Use lock-free task queue

2. Barrier at the end of parallel phase
     Remove redundant synchronization

3. Condition variable of the GC monitor
     Replace the condition variable with Linux’s
     futex_wait calls.
Lack of NUMA-awareness

[Figure: NUMA nodes, each with its own memory and CPUs]

NUMA: Non-Uniform Memory Access

• Memory access imbalance
• Memory locality
Lack of NUMA-awareness
 • Interleaved spaces
     – pages are mapped from different nodes with a
       round-robin policy
 • Fragmented spaces
     – a thread allocates memory from the fragment
       associated with the node where it is executing
 • Segregated spaces
     – a fragmented space that may only be accessed
       by GC threads running on the same node
Best performance: fragmented spaces for the young space, interleaved
spaces for the others
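
The three placement policies can be compared with a small Java model that maps a page index to a NUMA node. This is a sketch for illustration only: the class `NumaPlacement` and its methods are hypothetical, and it assumes that a fragmented space is split into one contiguous, equally sized fragment per node, which the slides do not state.

```java
/** Hypothetical model of the three NUMA space layouts. */
final class NumaPlacement {
    /** Interleaved: consecutive pages go to consecutive nodes (round robin). */
    static int interleavedNode(long pageIndex, int nodes) {
        return (int) (pageIndex % nodes);
    }

    /** Fragmented: assumes one contiguous fragment per node. */
    static int fragmentedNode(long pageIndex, long totalPages, int nodes) {
        long pagesPerFragment = (totalPages + nodes - 1) / nodes; // round up
        return (int) (pageIndex / pagesPerFragment);
    }

    /** Segregated: a GC thread may touch a page only if it lives on the thread's node. */
    static boolean segregatedMayAccess(int threadNode, long pageIndex,
                                       long totalPages, int nodes) {
        return fragmentedNode(pageIndex, totalPages, nodes) == threadNode;
    }
}
```

In this model, interleaving spreads accesses evenly across nodes and so addresses the access imbalance, while fragmenting keeps a thread's allocations on its local node and so addresses locality.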
Results
Resulting GC: NAPS (NUMA-Aware Parallel Scavenge)

Effect of the optimizations on 3 benchmarks:
      • SPECjbb2005
      • SPECjvm2008
      • DaCapo
Test machine: 8 memory nodes, 48 cores, 96 GB RAM, Linux 3.0 64-bit
Results
• NAPS improves performance and scalability over
  Parallel Scavenge in almost all cases
• NAPS performance continues to increase up to 48
  cores
• NAPS reduces pause time by up to 2.8 times in the
  best case
• NAPS improves responsiveness of applications
Conclusion

• This slide is about next steps…
Questions
If you have any questions you are welcome to ask.
