Fast and effective memory management is crucial for many applications, including web servers, database managers, and scientific codes. However, current memory managers do not provide adequate support for these applications on modern architectures, severely limiting their performance, scalability, and robustness.

In this talk, I describe how to design memory managers that support high-performance applications. I first address the software engineering challenges of building efficient memory managers. I then show how current general-purpose memory managers do not scale on multiprocessors, cause false sharing of heap objects, and systematically leak memory. I describe a fast, provably scalable general-purpose memory manager called Hoard (available at www.hoard.org) that solves these problems, improving performance by up to a factor of 60.


Hoard: A Scalable Memory Allocator for Multithreaded Applications

1. Hoard: A Scalable Memory Allocator for Multithreaded Applications
   Emery Berger, Kathryn McKinley*, Robert Blumofe, Paul Wilson
   Department of Computer Sciences; *Department of Computer Science
2. Motivation
   - Parallel multithreaded programs becoming prevalent
     - web servers, search engines, database managers, etc.
     - run on SMPs for high performance
     - often embarrassingly parallel
   - Memory allocation is a bottleneck
     - prevents scaling with the number of processors
3. Assessment Criteria for Multiprocessor Allocators
   - Speed
     - competitive with uniprocessor allocators on one processor
   - Scalability
     - performance linear with the number of processors
   - Fragmentation (= max memory allocated / max memory in use)
     - competitive with uniprocessor allocators
       - worst-case and average-case
4. Uniprocessor Allocators on Multiprocessors
   - Fragmentation: Excellent
     - Very low for most programs [Wilson & Johnstone]
   - Speed & Scalability: Poor
     - Heap contention
       - a single lock protects the heap (see the sketch after this slide)
     - Can exacerbate false sharing
       - different processors can share cache lines
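To make the single-lock bottleneck concrete, here is a minimal C sketch, not taken from the slides or from any real allocator: a serial heap guarded by one global mutex, with the system malloc standing in for the serial heap and the wrapper names locked_malloc/locked_free invented for illustration. Every thread's allocations and frees serialize on the one lock, which is exactly the contention the slide describes.

    #include <pthread.h>
    #include <stdlib.h>

    /* One global lock protects the (serial) heap. */
    static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

    void *locked_malloc(size_t size) {
        pthread_mutex_lock(&heap_lock);    /* every thread queues here     */
        void *p = malloc(size);            /* stand-in for the serial heap */
        pthread_mutex_unlock(&heap_lock);
        return p;
    }

    void locked_free(void *p) {
        pthread_mutex_lock(&heap_lock);
        free(p);
        pthread_mutex_unlock(&heap_lock);
    }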
5. Allocator-Induced False Sharing
   - Allocators cause false sharing!
   - Cache lines can end up spread across a number of processors
   - Practically all allocators do this
   [Diagram: processor 1 and processor 2 each call malloc(s); x1 and x2 land on the same cache line, which then thrashes between the two processors. A program exercising this pattern follows.]
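The following is a hypothetical C/pthreads program, not from the slides, that exercises the pattern in the diagram: two threads each allocate a small object and update it in a tight loop. If the allocator packs x1 and x2 into one cache line, every write invalidates the other processor's copy of that line, even though the threads share no data.

    #include <pthread.h>
    #include <stdlib.h>

    enum { ITERS = 100000000 };

    /* Each thread repeatedly writes its own one-byte counter. */
    static void *worker(void *arg) {
        volatile char *counter = arg;
        for (long i = 0; i < ITERS; i++)
            (*counter)++;
        return NULL;
    }

    int main(void) {
        char *x1 = malloc(8);   /* two small objects: a typical allocator may */
        char *x2 = malloc(8);   /* place both inside one 64-byte cache line   */

        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, x1);
        pthread_create(&t2, NULL, worker, x2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        free(x1);
        free(x2);
        return 0;
    }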
6. Existing Multiprocessor Allocators
   - Speed:
     - One concurrent heap (e.g., a concurrent B-tree): too expensive
       - too many locks/atomic updates
       - O(log n) cost per memory operation
     - ⇒ Fast allocators use multiple heaps
   - Scalability:
     - Allocator-induced false sharing and other bottlenecks
   - Fragmentation: P-fold increase or even unbounded
7. Multiprocessor Allocator I: Pure Private Heaps
   - Pure private heaps: one heap per processor
     - malloc gets memory from the processor's heap or the system
     - free puts memory on the processor's heap (see the sketch after this slide)
   - Avoids heap contention
     - Examples: STL, ad hoc (e.g., Cilk 4.1)
   [Diagram: processors 1 and 2 each malloc and free blocks x1-x4; legend: allocated by heap 1 / free, on heap 2.]
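A minimal sketch of the pure-private-heaps policy, assuming (as in the slide's example) that every request is for the same size s; the names pp_malloc and pp_free, the fixed size S, and the thread-local free list are illustrative assumptions, not code from STL, Cilk, or any real allocator.

    #include <stdlib.h>

    #define S 64   /* the single request size s used in the slide example (assumed) */

    typedef struct block { struct block *next; } block_t;

    /* One private heap (free list) per thread; __thread is a GCC/Clang extension. */
    static __thread block_t *my_heap = NULL;

    void *pp_malloc(void) {
        if (my_heap != NULL) {          /* reuse a block from my own heap       */
            block_t *b = my_heap;
            my_heap = b->next;
            return b;
        }
        return malloc(S);               /* otherwise get memory from the system */
    }

    void pp_free(void *p) {
        block_t *b = p;                 /* the block lands on *my* heap, even   */
        b->next = my_heap;              /* if another thread allocated it       */
        my_heap = b;
    }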
8. How to Break Pure Private Heaps: Fragmentation
   - Pure private heaps:
     - memory consumption can grow without bound!
   - Producer-consumer:
     - processor 1 allocates
     - processor 2 frees (a simulation of this pattern follows this slide)
   [Diagram: processor 1 calls malloc(s) for x1, x2, x3, ...; processor 2 frees each block.]
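Here is a toy simulation, not real allocator code, of the producer-consumer pattern on this slide under the pure-private-heaps policy: processor 1 allocates one block per round and processor 2 frees it. Because each free lands on the freeing processor's own heap, processor 1 never sees the memory again and keeps requesting more from the system, even though only one block is ever live.

    #include <stdio.h>

    int main(void) {
        long heap1_free = 0;   /* free blocks cached on processor 1's heap */
        long heap2_free = 0;   /* free blocks cached on processor 2's heap */
        long from_os    = 0;   /* total blocks obtained from the system    */

        for (int round = 0; round < 1000; round++) {
            /* processor 1: malloc(s) -- take from its own heap, else the system */
            if (heap1_free > 0) heap1_free--;
            else                from_os++;

            /* processor 2: free(x) -- the block goes onto *its* heap, not heap 1 */
            heap2_free++;
        }

        printf("blocks live at any time: 1\n");
        printf("blocks obtained from the system: %ld\n", from_os);          /* 1000 */
        printf("blocks stranded on processor 2's heap: %ld\n", heap2_free); /* 1000 */
        return 0;
    }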
9. Multiprocessor Allocator II: Private Heaps with Ownership
   - Private heaps with ownership: free puts memory back on the originating processor's heap
   - Avoids unbounded memory consumption
     - Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan]
   [Diagram: x1 = malloc(s), free(x1) and x2 = malloc(s), free(x2) across processors 1 and 2; each free returns the block to the heap that allocated it.]
10. How to Break Private Heaps with Ownership: Fragmentation
   - Private heaps with ownership: memory consumption can blow up by a factor of P
   - Round-robin producer-consumer:
     - processor i allocates
     - processor i+1 frees (see the sketch after this slide)
   - This really happens (NDS).
   [Diagram: processors 1, 2, and 3 allocate x1, x2, x3 and free them round-robin.]
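A toy simulation, not real allocator code, of the round-robin pattern above under private heaps with ownership: each round, one processor allocates K blocks, the next processor frees them, and ownership sends the freed blocks back to the allocating processor's heap. The next round's producer is a different processor that never looks at that cached memory, so after P rounds every heap holds K idle blocks and total consumption is P times the program's footprint. The constants P and K here are arbitrary.

    #include <stdio.h>

    enum { P = 8, K = 1000 };   /* P processors; K blocks live at any one time */

    int main(void) {
        long heap_free[P] = {0};   /* free blocks cached on each processor's heap */
        long from_os = 0;          /* total blocks obtained from the system       */

        for (int round = 0; round < P; round++) {
            int i = round;         /* processor i produces this round */

            /* processor i allocates K blocks: reuse its own cache, else the system */
            long reused = heap_free[i] < K ? heap_free[i] : K;
            heap_free[i] -= reused;
            from_os += K - reused;

            /* processor i+1 frees the K blocks; ownership returns them to heap i,
             * where the next round's (different) producer never looks */
            heap_free[i] += K;
        }

        printf("blocks live at any time: %d\n", K);
        printf("blocks obtained from the system: %ld (= P * K)\n", from_os);
        return 0;
    }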
11. So What Do We Do Now?
12. The Hoard Multiprocessor Memory Allocator
   - Manages memory in page-sized superblocks of same-sized objects
     - Avoids false sharing by not carving up cache lines
     - Avoids heap contention: local heaps allocate & free small blocks from their set of superblocks
   - Adds a global heap that is a repository of superblocks
   - When the fraction of free memory exceeds the empty fraction, moves superblocks to the global heap (see the sketch after this slide)
     - Avoids blowup in memory consumption
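A simplified sketch of the empty-fraction rule on this slide: after a free, a per-processor heap releases a mostly-empty superblock to the global heap once more than the empty fraction f of its memory is free and the slack exceeds a few superblocks' worth. The constants, the heap_stats_t struct, and the function name below are illustrative assumptions, not Hoard's actual code or defaults.

    #include <stddef.h>
    #include <stdio.h>

    /* Tunables; the values are assumptions, not Hoard's shipped defaults. */
    #define EMPTY_FRACTION  0.25    /* f: the empty fraction                      */
    #define SUPERBLOCK_SIZE 8192    /* S: bytes per (page-sized) superblock       */
    #define SLACK_BLOCKS    2       /* K: superblocks of slack tolerated per heap */

    typedef struct {
        size_t in_use;      /* u: bytes handed out to the program from this heap */
        size_t allocated;   /* a: bytes held by this heap's superblocks          */
    } heap_stats_t;

    /* Returns nonzero if this heap should move a superblock to the global heap. */
    static int should_release_superblock(const heap_stats_t *h) {
        int too_empty    = (double)h->in_use <
                           (1.0 - EMPTY_FRACTION) * (double)h->allocated;
        int enough_slack = h->allocated - h->in_use > SLACK_BLOCKS * SUPERBLOCK_SIZE;
        return too_empty && enough_slack;
    }

    int main(void) {
        heap_stats_t busy = { .in_use = 90000, .allocated = 100000 };
        heap_stats_t idle = { .in_use = 10000, .allocated = 100000 };
        printf("busy heap releases? %d\n", should_release_superblock(&busy)); /* 0 */
        printf("idle heap releases? %d\n", should_release_superblock(&idle)); /* 1 */
        return 0;
    }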
13. Hoard Example
   - Hoard: one heap per processor + a global heap
     - malloc gets memory from a superblock on its heap
     - free returns memory to its superblock; if the heap is "too empty", it moves a superblock to the global heap
   [Diagram: processor 1's heap and the global heap, with empty fraction = 1/3; x1 = malloc(s), some mallocs, some frees, free(x7).]
14. Summary of Analytical Results
   - Worst-case memory consumption:
     - O(n log M/m + P) [instead of O(P n log M/m)]
       - n = memory required
       - M = biggest object size
       - m = smallest object size
       - P = number of processors
     - Best possible: O(n log M/m) [Robson]
   - Provably low synchronization in most cases
15. Experiments
   - Run on a dedicated 14-processor Sun Enterprise
     - 300 MHz UltraSparc, 1 GB of RAM
     - Solaris 2.7
   - All programs compiled with g++ version 2.95.1
   - Allocators:
     - Hoard version 2.0.2
     - Solaris (system allocator)
     - Ptmalloc (GNU libc; private heaps with ownership)
     - mtmalloc (Sun's "MT-hot" allocator)
16. Performance: threadtest
    speedup(x, P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)
17. Performance: Larson
    Server-style benchmark with sharing
18. Performance: false sharing
    Each thread reads & writes heap data
19. Fragmentation Results
   - On most standard uniprocessor benchmarks, Hoard's fragmentation was low:
     - p2c (Pascal-to-C): 1.20; espresso: 1.47
     - LRUsim: 1.05; Ghostscript: 1.15
     - Within 20% of Lea's allocator
   - On the multiprocessor benchmarks and other codes:
     - Fragmentation was between 1.02 and 1.24 for all but one anomalous benchmark (shbench: 3.17).
20. Hoard Conclusions
   - Speed: Excellent
     - As fast as a uniprocessor allocator on one processor
       - amortized O(1) cost
       - 1 lock for malloc, 2 for free
   - Scalability: Excellent
     - Scales linearly with the number of processors
     - Avoids false sharing
   - Fragmentation: Very good
     - Worst case is provably close to ideal
     - Actual observed fragmentation is low
21. Hoard Heap Details
   - "Segregated size class" allocator
     - Size classes are logarithmically spaced (a lookup sketch follows this slide)
     - Superblocks hold objects of one size class
       - empty superblocks are "recycled"
   - Approximately radix-sorted:
     - Allocate from mostly-full superblocks
     - Fast removal of mostly-empty superblocks
   [Diagram: size-class bins (8, 16, 24, 32, 40, 48) pointing to radix-sorted superblock lists, ordered emptiest to fullest.]
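To illustrate the segregated-size-class idea, here is a hedged sketch of a size-class lookup that rounds requests up to the next power of two, one common logarithmically-spaced scheme. Hoard's actual class spacing and minimum size are not given on the slide, so MIN_SIZE, the power-of-two rule, and the size_class name are assumptions for illustration only.

    #include <stdio.h>
    #include <stddef.h>

    #define MIN_SIZE 8   /* smallest size class; an assumed value */

    /* Map a request size to a size-class index: class k serves requests of
     * up to MIN_SIZE << k bytes, and superblocks are segregated by this index. */
    static size_t size_class(size_t size) {
        size_t limit = MIN_SIZE;
        size_t k = 0;
        while (limit < size) {
            limit <<= 1;
            k++;
        }
        return k;
    }

    int main(void) {
        size_t sizes[] = { 1, 8, 9, 64, 100, 4096 };
        for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
            printf("request %4zu bytes -> size class %zu\n",
                   sizes[i], size_class(sizes[i]));
        return 0;
    }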