Memory Management for High-Performance Applications
Fast and effective memory management is crucial for many applications, including web servers, database managers, and scientific codes. However, current memory managers do not provide adequate support for these applications on modern architectures, severely limiting their performance, scalability, and robustness.

In this talk, I describe how to design memory managers that support high-performance applications. I first address the software engineering challenges of building efficient memory managers. I then show how current general-purpose memory managers do not scale on multiprocessors, cause false sharing of heap objects, and systematically leak memory. I describe a fast, provably scalable general-purpose memory manager called Hoard (available at www.hoard.org) that solves these problems, improving performance by up to a factor of 60.

Memory Management for High-Performance Applications: Presentation Transcript

  • Memory Management for High-Performance Applications. Emery Berger, University of Massachusetts Amherst.
  • High-Performance Applications: web servers, search engines, and scientific codes, written in C or C++ and run on one server box or a cluster of them (each with multiple CPUs, RAM, and RAID drives). These applications need support at every level: compiler, runtime system, operating system, and hardware.
  • New Applications, Old Memory Managers: applications and hardware have changed. Multiprocessors are now commonplace, and programs are object-oriented and multithreaded, which increases pressure on the memory manager (malloc, free). But memory managers have not kept up: they provide inadequate support for modern applications.
  • Current Memory Managers Limit Scalability: as we add processors, the program slows down, caused by heap contention. [Chart: ideal vs. actual speedup for 1-14 processors; Larson server benchmark on a 14-processor Sun.]
  • The Problem: current memory managers are inadequate for high-performance applications on modern architectures. They limit scalability and application performance.
  • This Talk: building memory managers (the Heap Layers framework); problems with current memory managers (contention, false sharing, space); a solution, the provably scalable general-purpose memory manager Hoard; and Reap, an extended memory manager for servers.
  • Implementing Memory Managers: memory managers must be space efficient and very fast, so they are typically heavily-optimized C code with hand-unrolled loops, macros, and monolithic functions. That makes them hard to write, reuse, or extend.
  • Real Code: DLmalloc 2.7.2

        #define chunksize(p)          ((p)->size & ~(SIZE_BITS))
        #define next_chunk(p)         ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))
        #define prev_chunk(p)         ((mchunkptr)(((char*)(p)) - ((p)->prev_size)))
        #define chunk_at_offset(p, s) ((mchunkptr)(((char*)(p)) + (s)))
        #define inuse(p) ((((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size) & PREV_INUSE)
        #define set_inuse(p)   ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size |= PREV_INUSE
        #define clear_inuse(p) ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size &= ~(PREV_INUSE)
        #define inuse_bit_at_offset(p, s)     (((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)
        #define set_inuse_bit_at_offset(p, s) (((mchunkptr)(((char*)(p)) + (s)))->size |= PREV_INUSE)
        #define MALLOC_ZERO(charp, nbytes)                                    \
        do {                                                                  \
          INTERNAL_SIZE_T* mzp = (INTERNAL_SIZE_T*)(charp);                   \
          CHUNK_SIZE_T mctmp = (nbytes) / sizeof(INTERNAL_SIZE_T);            \
          long mcn;                                                           \
          if (mctmp < 8) mcn = 0; else { mcn = (mctmp - 1) / 8; mctmp %= 8; } \
          switch (mctmp) {                                                    \
            case 0: for (;;) { *mzp++ = 0;                                    \
            case 7:            *mzp++ = 0;                                    \
            case 6:            *mzp++ = 0;                                    \
            case 5:            *mzp++ = 0;                                    \
            case 4:            *mzp++ = 0;                                    \
            case 3:            *mzp++ = 0;                                    \
            case 2:            *mzp++ = 0;                                    \
            case 1:            *mzp++ = 0; if (mcn <= 0) break; mcn--; }      \
          }                                                                   \
        } while (0)
  • Programming Language Support: classes have overhead and a rigid hierarchy; mixins have no overhead and a flexible hierarchy. Sounds great...
  • A Heap Layer: a C++ mixin with malloc & free methods, e.g. a GreenHeapLayer layered on top of a RedHeapLayer:

        template <class SuperHeap>
        class GreenHeapLayer : public SuperHeap { ... };
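    As an illustration of the idea on this slide, here is a minimal heap-layer sketch; the layer names below (MallocHeap, ZeroSizeHeap) are hypothetical and not taken from the Heap Layers distribution.

        #include <cstdlib>
        #include <cstddef>

        // Bottom layer: obtains memory from the system allocator.
        class MallocHeap {
        public:
          void* malloc(std::size_t sz) { return std::malloc(sz); }
          void  free(void* ptr)        { std::free(ptr); }
        };

        // A mixin layer: adds a small policy, then delegates to its superheap.
        template <class SuperHeap>
        class ZeroSizeHeap : public SuperHeap {
        public:
          void* malloc(std::size_t sz) {
            if (sz == 0) { sz = 1; }          // normalize zero-byte requests
            return SuperHeap::malloc(sz);     // delegate to the layer below
          }
          // free is inherited unchanged from SuperHeap.
        };

        // Layers compose at compile time, so there is no virtual-call overhead.
        typedef ZeroSizeHeap<MallocHeap> MyHeap;

        int main() {
          MyHeap heap;
          void* p = heap.malloc(0);   // becomes a 1-byte request
          heap.free(p);
        }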
  • Example: Thread-Safe Heap Layer: LockedHeap protects its superheap with a lock. For instance, LockedMallocHeap is a LockedHeap layered over mallocHeap.
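    A minimal sketch of such a locking layer, reusing the hypothetical MallocHeap bottom layer from the previous sketch; the real LockedHeap in Heap Layers may differ in detail.

        #include <cstdlib>
        #include <cstddef>
        #include <mutex>

        class MallocHeap {
        public:
          void* malloc(std::size_t sz) { return std::malloc(sz); }
          void  free(void* ptr)        { std::free(ptr); }
        };

        // Protect the superheap with a lock: every malloc/free is serialized.
        template <class SuperHeap>
        class LockedHeap : public SuperHeap {
        public:
          void* malloc(std::size_t sz) {
            std::lock_guard<std::mutex> guard(theLock);
            return SuperHeap::malloc(sz);
          }
          void free(void* ptr) {
            std::lock_guard<std::mutex> guard(theLock);
            SuperHeap::free(ptr);
          }
        private:
          std::mutex theLock;
        };

        // Composition, as on the slide.
        typedef LockedHeap<MallocHeap> LockedMallocHeap;

        int main() {
          LockedMallocHeap heap;
          void* p = heap.malloc(16);
          heap.free(p);
        }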
  • Empirical Results: Heap Layers vs. the original allocators: KingsleyHeap vs. the BSD (Kingsley) allocator, and LeaHeap vs. DLmalloc 2.7. Both achieve competitive runtime and memory efficiency. [Charts: runtime and space, normalized to the Lea allocator, for cfrac, espresso, lindsay, LRUsim, perl, roboop, and their average.]
  • Overview: building memory managers (the Heap Layers framework); problems with memory managers (contention, space, false sharing); a solution, the provably scalable allocator Hoard; and Reap, an extended memory manager for servers.
  • Problems with General-Purpose Memory Managers: previous work for multiprocessors falls into two camps. Concurrent single heaps [Bigler et al. 85, Johnson 91, Iyengar 92] are impractical. Multiple heaps [Larson 98, Gloger 99] reduce contention but, as we show, cause other problems: a P-fold or even unbounded increase in space, and allocator-induced false sharing.
  • Multiple Heap Allocator: Pure Private Heaps. One heap per processor: malloc gets memory from the processor's local heap, and free puts memory back on its local heap. Used by STL, Cilk, and ad hoc allocators. [Diagram: processors 0 and 1 each malloc and free objects on their own heaps; key: in use vs. free, on a heap.]
  • Problem: Unbounded Memory Consumption. In a producer-consumer pattern, processor 0 allocates and processor 1 frees. With pure private heaps, each freed object lands on processor 1's heap, where processor 0 never reuses it, so memory consumption grows without bound. Crash!
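    For concreteness, here is a sketch of the producer-consumer allocation pattern described above (not the actual benchmark from the talk): every object is malloc'ed by one thread and freed by another, which is exactly the pattern that makes a pure-private-heaps allocator grow without bound.

        #include <cstdlib>
        #include <thread>
        #include <queue>
        #include <mutex>
        #include <condition_variable>

        std::queue<void*> q;
        std::mutex m;
        std::condition_variable cv;

        void producer(int n) {
          for (int i = 0; i < n; ++i) {
            void* p = std::malloc(1);                 // allocated on the producer's heap
            { std::lock_guard<std::mutex> g(m); q.push(p); }
            cv.notify_one();
          }
        }

        void consumer(int n) {
          for (int i = 0; i < n; ++i) {
            std::unique_lock<std::mutex> g(m);
            cv.wait(g, []{ return !q.empty(); });
            void* p = q.front(); q.pop();
            g.unlock();
            std::free(p);                             // freed onto the consumer's heap
          }
        }

        int main() {
          std::thread t1(producer, 1000000), t2(consumer, 1000000);
          t1.join(); t2.join();
        }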
  • Multiple Heap Allocator: Private Heaps with Ownership. free returns memory to the heap it was originally allocated from. This bounds memory consumption for producer-consumer: no crash. Used by "Ptmalloc" (Linux) and LKmalloc.
  • Problem: P-fold Memory Blowup. Occurs in practice with a round-robin producer-consumer pattern: processor i mod P allocates and processor (i+1) mod P frees, so every heap ends up holding a copy's worth of freed memory. In the three-processor example, the footprint is 1 (2GB) but the space consumed is 3 (6GB), which exceeds a 32-bit address space: crash!
  • Problem: Allocator-Induced False Sharing. False sharing puts non-shared objects on the same cache line; it is the bane of parallel applications and has been extensively studied. All of these allocators cause false sharing: if processor 0 calls x1 = malloc(1) and processor 1 calls x2 = malloc(1), the two objects can share a cache line, and the CPUs thrash as the line ping-pongs across the bus.
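    A sketch of the kind of access pattern that suffers: two threads repeatedly write to two separately malloc'ed one-byte objects. Whether the objects actually share a cache line depends on the allocator and the cache geometry, so this is illustrative rather than a guaranteed reproduction.

        #include <cstdlib>
        #include <thread>

        void hammer(char* p, long iters) {
          for (long i = 0; i < iters; ++i) {
            ++(*p);                       // repeated writes to a single byte
          }
        }

        int main() {
          // Two tiny objects allocated back to back may land on one cache line.
          char* x1 = static_cast<char*>(std::malloc(1));
          char* x2 = static_cast<char*>(std::malloc(1));
          std::thread t1(hammer, x1, 100000000L);
          std::thread t2(hammer, x2, 100000000L);  // if the line is shared, it ping-pongs between CPUs
          t1.join();
          t2.join();
          std::free(x1);
          std::free(x2);
        }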
  • So What Do We Do Now? Where do we put free memory? On a central heap: heap contention. On our own heap (pure private heaps): unbounded memory consumption. On the original heap (private heaps with ownership): P-fold blowup. And how do we avoid false sharing?
  • Overview: building memory managers (the Heap Layers framework); problems with memory managers (contention, space, false sharing); a solution, the provably scalable allocator Hoard; and Reap, an extended memory manager for servers.
  • Hoard: Key Insights. Bound local memory consumption: explicitly track utilization and move free memory to a global heap; this provably bounds memory consumption. Manage memory in large chunks: this avoids false sharing and reduces heap contention.
  • Overview of Hoard: manage memory in page-sized heap blocks (avoids false sharing); allocate from the local heap block (avoids heap contention); on low utilization, move a heap block from the local heap to the global heap (avoids space blowup). [Diagram: a global heap above per-processor heaps 0 through P-1.]
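    A schematic sketch of the free-path policy this slide describes: free to the local heap block, and when the local heap's utilization drops, hand a now-empty block back to the global heap. The structure names, fields, and threshold below are illustrative only, not Hoard's actual implementation.

        #include <cstddef>

        struct HeapBlock {                // one page-sized chunk owned by a single heap
          std::size_t inUse = 0;          // bytes currently allocated from this block
          std::size_t capacity = 4096;
          HeapBlock*  next = nullptr;
        };

        struct GlobalHeap {
          HeapBlock* blocks = nullptr;
          void acquire(HeapBlock* b) {    // take ownership of a block another heap released
            b->next = blocks;
            blocks = b;
          }
        };

        struct PerProcessorHeap {
          std::size_t inUse = 0, capacity = 0;
          GlobalHeap* global = nullptr;

          void freeObject(HeapBlock* b, std::size_t sz) {
            b->inUse -= sz;
            inUse    -= sz;
            // Low utilization: return an empty block so other processors can reuse it.
            const double EMPTY_FRACTION = 0.25;   // illustrative threshold
            if (b->inUse == 0 && inUse < EMPTY_FRACTION * capacity) {
              capacity -= b->capacity;
              global->acquire(b);
            }
          }
        };

        int main() {
          GlobalHeap global;
          PerProcessorHeap local;
          local.global = &global;

          HeapBlock block;
          local.capacity += block.capacity;
          block.inUse = 64;  local.inUse = 64;    // simulate one live 64-byte object
          local.freeObject(&block, 64);           // freeing it empties the block,
                                                  // which migrates to the global heap
        }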
  • Summary of Analytical Results. Space consumption is near the optimal worst case: Hoard uses O(n log(M/m) + P) space (with P ≪ n), versus the optimal O(n log(M/m)) [Robson 70] and O(P n log(M/m)) for private heaps with ownership, where n is the memory required, M the biggest object size, m the smallest object size, and P the number of processors. Hoard also has provably low synchronization.
  • Empirical Results: runtime measured on a 14-processor Sun. Allocators compared: Solaris (the system allocator), Ptmalloc (GNU libc), and mtmalloc (Sun's "MT-hot" allocator). Micro-benchmarks: threadtest (no sharing), Larson (sharing, server-style), and cache-scratch (mostly reads & writes; tests for false sharing). Real application experience is similar.
  • Runtime Performance: threadtest. Many threads, no sharing; Hoard achieves linear speedup. Here speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors).
  • Runtime Performance: Larson. Many threads, sharing (server-style); Hoard achieves linear speedup.
  • Runtime Performance: false sharing (cache-scratch). Many threads, mostly reads & writes of heap data; Hoard achieves linear speedup.
  • Hoard in the "Real World": open source code at www.hoard.org, with 13,000 downloads; runs on Solaris, Linux, Windows, IRIX, and more. Widely used in industry (AOL, British Telecom, Novell, Philips), with reports of 2x-10x, "impressive" improvements in performance for search servers, telecom billing systems, scene rendering, real-time messaging middleware, text-to-speech engines, telephony, and a JVM. A scalable general-purpose memory manager.
  • Overview: building memory managers (the Heap Layers framework); problems with memory managers (contention, space, false sharing); a solution, the provably scalable allocator Hoard; and Reap, an extended memory manager for servers.
  • Custom Memory Allocation: replace new/delete, bypassing the general-purpose allocator, to reduce runtime (often), expand functionality (sometimes), or reduce space (rarely). It is a very common practice (Apache, gcc, lcc, STL, database servers…), with language-level support in C++, and "use custom allocators" is standard advice.
  • The Reality: the Lea allocator (DLmalloc) is often as fast or faster than the custom allocators; custom allocation is ineffective, except for regions. [OOPSLA 2002] [Chart: runtime of Custom, Win32, and DLmalloc allocators, normalized, for the non-region benchmarks (197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc), the region benchmarks (apache, lcc, mudlle), and the averages.]
  • Overview of Regions: separate areas of memory, deletion only en masse (regioncreate(r), regionmalloc(r, sz), regiondelete(r)). Pros: fast, pointer-bumping allocation; deletion of chunks; convenient; one call frees all memory. Cons: risky; accidental deletion; too much space.
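    A minimal bump-pointer sketch of the regioncreate / regionmalloc / regiondelete interface shown above; the chunk size, alignment handling, and internal names are illustrative, not any particular region library.

        #include <cstdlib>
        #include <cstddef>

        const std::size_t CHUNK_BYTES = 4096;    // illustrative chunk size

        struct Chunk {                           // chunks are chained and freed together
          Chunk*      next;
          std::size_t used;
          char        data[CHUNK_BYTES];
        };

        struct Region { Chunk* head = nullptr; };

        Region* regioncreate() { return new Region(); }

        // Pointer-bumping allocation (assumes sz <= CHUNK_BYTES and ignores alignment).
        void* regionmalloc(Region* r, std::size_t sz) {
          if (r->head == nullptr || r->head->used + sz > CHUNK_BYTES) {
            Chunk* c = static_cast<Chunk*>(std::malloc(sizeof(Chunk)));
            c->next = r->head;
            c->used = 0;
            r->head = c;
          }
          void* p = r->head->data + r->head->used;
          r->head->used += sz;
          return p;
        }

        // One call frees all memory: walk the chunk list and release everything.
        void regiondelete(Region* r) {
          Chunk* c = r->head;
          while (c != nullptr) {
            Chunk* next = c->next;
            std::free(c);
            c = next;
          }
          delete r;
        }

        int main() {
          Region* r = regioncreate();
          void* a = regionmalloc(r, 100);
          void* b = regionmalloc(r, 200);
          (void)a; (void)b;                      // no individual frees: deletion only en masse
          regiondelete(r);
        }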
  • Why Regions? They are apparently faster and more space-efficient, and servers need memory management support: avoiding resource leaks, and tearing down all memory associated with terminated connections or transactions. The current approach (e.g., Apache) is regions.
  • Drawbacks of Regions: memory within a region cannot be reclaimed, which is a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs: unbounded memory consumption. The current situation for Apache: vulnerable to denial-of-service, limits the runtime of connections, and limits module programming.
  • Reap Hybrid Allocator: Reap = region + heap. It adds individual object deletion and a heap (reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r)), so it can reduce memory consumption. It is fast, adapts to how it is used (region or heap style), and has cheap deletion.
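    A usage sketch of the interface named on the slide, with one reap per connection. The declarations below are assumed from the slide's call names and may not match the exact signatures in the Reap distribution, so treat this as a sketch of the API shape rather than working client code.

        #include <cstddef>

        // Assumed declarations, following the call names on the slide; link against
        // a Reap implementation (or use its real header) to actually run this.
        void* reapcreate();
        void* reapmalloc(void* reap, std::size_t sz);
        void  reapfree(void* reap, void* ptr);
        void  reapdelete(void* reap);

        void handle_connection() {
          void* r = reapcreate();

          char* header = static_cast<char*>(reapmalloc(r, 512));
          char* body   = static_cast<char*>(reapmalloc(r, 8192));
          (void)header;

          reapfree(r, body);   // unlike a region, individual objects can be freed early...

          reapdelete(r);       // ...and one call still tears down everything that remains.
        }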
  • Using Reap as Regions: Reap performance nearly matches regions. [Chart: normalized runtime of Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap on the region-based benchmarks lcc and mudlle.]
  • Reap: Best of Both Worlds. Combining new/delete with regions is usually impossible: the APIs are incompatible and code is hard to rewrite. With Reap, we incorporated new/delete code into Apache as "mod_bc" (an arbitrary-precision calculator), changing 20 lines out of 8,000. On a benchmark computing the 1000th prime, memory consumption was 240K with Reap versus 7.4MB without it.
  • Summary: building memory managers with the Heap Layers framework [PLDI 2001]; problems with current memory managers (contention, false sharing, space); a solution, the provably scalable memory manager Hoard [ASPLOS-IX]; and Reap, an extended memory manager for servers [OOPSLA 2002].
  • Current Projects: CRAMM (Cooperative Robust Automatic Memory Management): garbage collection without paging, automatic heap sizing. SAVMM: Scheduler-Aware Virtual Memory Management. Markov: a programming language for building high-performance servers. COLA (Customizable Object Layout Algorithms): improving locality in Java.
  • www.cs.umass.edu/~plasma
  • Looking Forward: "new" programming languages (increasing use of Java = garbage collection); new architectures (NUMA, SMT/CMP ("hyperthreading")); technology trends (the memory hierarchy).
  • The Ever-Steeper Memory Hierarchy: higher = smaller, faster, closer to the CPU. A real desktop machine (mine): registers: 8 integer, 8 floating-point, 1-cycle latency; L1 cache: 8K data & instructions, 2-cycle latency; L2 cache: 512K, 7-cycle latency; RAM: 1GB, 100-cycle latency; disk: 40GB, 38,000,000-cycle latency (!).
  • Swapping & Throughput: once the heap exceeds available memory, throughput plummets.
  • Why Manage Memory At All? Just buy more! It simplifies memory management (though you still have to collect garbage eventually), and if the workload fits in RAM there is no more swapping. Sounds great...
  • Memory Prices Over Time: "Soon it will be free…" [Chart: RAM prices over time (1977 dollars per GB), 1977-2005, for conventional DRAM parts from 2K to 8M.]
  • Memory Prices: Inflection Point! [Chart: the same RAM price trend (1977 dollars per GB, 1977-2005), now including SDRAM, RDRAM, DDR, and Chipkill parts up to 512M and 1G, showing an inflection point in the trend.]
  • Memory Is Actually Expensive. Desktops: most ship with 256MB; 1GB costs 50% more (70% for laptops, where possible at all), and capacity is limited. Servers: buy 4GB, get 1 CPU free! On a Sun Enterprise 10000, an extra 8GB costs $150,000 (8GB of Sun RAM = 1 Ferrari Modena). Fast RAM depends on new technologies, and then there are cosmic rays…
  • Key Problem: Paging. Garbage collectors are VM-oblivious: GC disrupts the LRU queue and touches non-resident pages. Virtual memory managers are GC-oblivious: they are likely to evict pages the GC needs. Paging costs orders of magnitude more time than RAM: a big hit in performance and long pauses.
  • Cooperative Robust Automatic Memory Management (CRAMM): the garbage collector and the virtual memory manager cooperate ("I'm a cooperative application!"). Coarse-grained (heap-level): the VM tracks per-process and overall memory utilization and notifies the collector of changes in memory pressure; the collector adjusts its heap size to the new target. Fine-grained (page-level): the VM selects victim pages and notifies the collector, which evacuates those pages before page replacement. Joint work with Eliot Moss (UMass) and Scott Kaplan (Amherst College).
  • Fine-Grained Cooperative GC: the VM notifies the collector of impending page evictions; the collector evacuates the victim pages before they are replaced. Goal: GC triggers no additional paging. Key ideas: adapt the collection strategy on the fly, page-oriented memory management, and exploiting detailed page information from the VM.
  • Summary: building memory managers (the Heap Layers framework); problems with memory managers (contention, space, false sharing); a solution, the provably scalable allocator Hoard; and future directions.
  • If You Have to Spend $$... more Ferraris: good; more memory: bad.
  • www.cs.umass.edu/~emery/plasma
  • This Page Intentionally Left Blank
  • Virtual Memory Manager Support: a new VM is required to provide detailed page-level information. A "segmented queue" (unprotected and protected segments) keeps the overhead low, and LRU order is maintained per process rather than globally (gLRU, as in Linux). Complementary to the SAVM work ("Scheduler-Aware Virtual Memory manager"). Under development as a modified Linux kernel.
  • Current Work: Robust Performance. Currently there is no VM-GC communication, which leads to bad interactions under memory pressure. Our approach (with Eliot Moss, Scott Kaplan): Cooperative Robust Automatic Memory Management, in which the virtual memory manager reports memory pressure and LRU-queue information to the garbage collector/allocator, which returns empty pages, reducing the impact of paging.
  • Current Work: Predictable VMM. Recent work on scheduling for QoS (e.g., proportional-share) overlooks that, under memory pressure, the VMM is the scheduler: paged-out processes may never recover, and intermittent processes may wait a long time. Scheduler-faithful virtual memory (with Scott Kaplan, Prashant Shenoy) is based on page value rather than page order.
  • Conclusion: memory management for high-performance applications. The Heap Layers framework [PLDI 2001]: reusable components with no runtime cost. The Hoard scalable memory manager [ASPLOS-IX]: high-performance, provably scalable and space-efficient. The Reap hybrid memory manager [OOPSLA 2002]: speed and robustness for server applications. Current work: robust memory management for multiprogramming.
  • The Obligatory URL Slide: http://www.cs.umass.edu/~emery
  • If You Can Read This, I Went Too Far
  • Hoard: Under the Hood. [Diagram of Hoard's layered design: a SelectSizeHeap selects a heap based on size; large objects (> 4K) go to a MallocOrFreeHeap, while smaller ones go to a PerProcessorHeap of HeapBlockManagers (each a LockedHeap) that malloc from the local heap block and free to the owning heap block (FreeToHeapBlock); empty heap blocks come from a SuperblockHeap (also a LockedHeap), which gets memory from and returns it to the global heap and the system heap.]
  • Custom Memory Allocation: replace new/delete, bypassing the general-purpose allocator, to reduce runtime (often), expand functionality (sometimes), or reduce space (rarely). It is a very common practice (Apache, gcc, lcc, STL, database servers…), with language-level support in C++, and "use custom allocators" is standard advice.
  • Drawbacks of Custom Allocators: avoiding the memory manager means more code to maintain & debug and no memory debuggers. It is not modular or robust: mix memory from custom and general-purpose allocators and you get a crash. It increases the burden on programmers.
  • Overview: introduction; perceived benefits and drawbacks; three main kinds of custom allocators; comparison with general-purpose allocators; advantages and drawbacks of regions; Reaps, a generalization of regions & heaps.
  • (I) Per-Class Allocators: recycle freed objects of a class from a free list (e.g., after a = new Class1; ...; delete a; the object goes on Class1's free list, and the next new Class1 reuses it). Pros: fast (linked-list operations), simple, identical semantics, C++ language support. Con: possibly space-inefficient.
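    A minimal per-class free-list sketch in the spirit of this slide; it is illustrative only (not thread-safe, and memory is never returned to the system).

        #include <cstdlib>
        #include <cstddef>
        #include <new>

        class Node {
        public:
          static void* operator new(std::size_t sz) {
            if (freeList != nullptr) {               // recycle a previously freed object
              Node* n = freeList;
              freeList = n->nextFree;
              return n;
            }
            return std::malloc(sz);                  // otherwise fall back to malloc
          }
          static void operator delete(void* p) noexcept {
            Node* n = static_cast<Node*>(p);         // push the storage onto the free list
            n->nextFree = freeList;
            freeList = n;
          }

          int value = 0;

        private:
          Node* nextFree = nullptr;                  // link used only while on the free list
          static Node* freeList;
        };

        Node* Node::freeList = nullptr;

        int main() {
          Node* a = new Node;   // malloc'ed the first time
          delete a;             // goes onto the class-wide free list
          Node* b = new Node;   // recycles a's storage: no call to malloc
          delete b;
        }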
  • (II) Custom Patterns: allocators tailor-made to fit the program's allocation pattern. Example: 197.parser (a natural language parser) carves objects out of a fixed char[MEMORY_LIMIT] array with xalloc/xfree, tracking only an end_of_array pointer. Pros: fast, pointer-bumping allocation. Cons: brittle, fixed memory size, and requires stack-like object lifetimes.
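    A sketch of the 197.parser-style pattern this slide describes: a fixed array, pointer-bumping xalloc, and a stack-like xfree that simply rolls back the high-water mark. MEMORY_LIMIT and the function names follow the slide; the bodies are illustrative.

        #include <cstddef>
        #include <cstdio>

        const std::size_t MEMORY_LIMIT = 1 << 20;   // fixed memory size (brittle)
        static char memory[MEMORY_LIMIT];
        static std::size_t end_of_array = 0;        // high-water mark

        void* xalloc(std::size_t sz) {
          if (end_of_array + sz > MEMORY_LIMIT) { return nullptr; }
          void* p = memory + end_of_array;
          end_of_array += sz;                       // pointer-bumping allocation
          return p;
        }

        // Stack-like lifetimes: freeing an object discards it and everything
        // allocated after it, by rolling the mark back to that object.
        void xfree(void* p) {
          end_of_array = static_cast<char*>(p) - memory;
        }

        int main() {
          void* a = xalloc(8);
          void* b = xalloc(16);
          void* c = xalloc(8);
          xfree(c);                                 // release c...
          xfree(b);                                 // ...then b, in LIFO order
          std::printf("%zu bytes still in use\n", end_of_array);   // prints 8 (just a)
          (void)a;
        }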
  • (III) Regions: separate areas of memory, deletion only en masse (regioncreate(r), regionmalloc(r, sz), regiondelete(r)). Pros: fast, pointer-bumping allocation; deletion of chunks; convenient; one call frees all memory. Cons: risky; accidental deletion; too much space.
  • Overview: introduction; perceived benefits and drawbacks; three main kinds of custom allocators; comparison with general-purpose allocators; advantages and drawbacks of regions; Reaps, a generalization of regions & heaps.
  • Custom Allocators Are Faster… [Chart: runtime of Custom vs. Win32 allocators, normalized, on the custom-allocator benchmarks: the non-region benchmarks 197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, the region benchmarks apache, lcc, mudlle, and the averages.]
  • Not So Fast… [Chart: the same runtime comparison with DLmalloc added alongside Custom and Win32.]
  • The Lea Allocator (DLmalloc 2.7.0): optimized for common allocation patterns. Per-size quicklists (≈ per-class allocation), deferred coalescing (combining adjacent free objects), a highly-optimized fastpath, and space efficiency.
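    A toy sketch of the per-size quicklist idea mentioned above: small requests are rounded to a size class, and freed objects are kept on a per-class list for immediate reuse (coalescing deferred). This is purely illustrative; DLmalloc's real bins and boundary tags are far more involved, and this toy version requires the caller to pass the size to free.

        #include <cstdlib>
        #include <cstddef>

        const std::size_t NUM_CLASSES = 32;
        const std::size_t GRANULARITY = 8;              // small size classes in 8-byte steps

        struct FreeObject { FreeObject* next; };
        static FreeObject* quicklist[NUM_CLASSES];      // one free list per size class

        void* quick_malloc(std::size_t sz) {
          if (sz == 0) { sz = 1; }
          std::size_t c = (sz + GRANULARITY - 1) / GRANULARITY;
          if (c >= NUM_CLASSES) { return std::malloc(sz); }   // large objects bypass the lists
          if (quicklist[c] != nullptr) {                      // fastpath: pop a recycled object
            FreeObject* f = quicklist[c];
            quicklist[c] = f->next;
            return f;
          }
          return std::malloc(c * GRANULARITY);                // allocate the rounded-up size
        }

        void quick_free(void* p, std::size_t sz) {
          if (sz == 0) { sz = 1; }
          std::size_t c = (sz + GRANULARITY - 1) / GRANULARITY;
          if (c >= NUM_CLASSES) { std::free(p); return; }
          FreeObject* f = static_cast<FreeObject*>(p);        // defer coalescing: just push it
          f->next = quicklist[c];
          quicklist[c] = f;
        }

        int main() {
          void* a = quick_malloc(24);
          quick_free(a, 24);              // lands on its size class's quicklist
          void* b = quick_malloc(20);     // same class: recycled without touching malloc
          quick_free(b, 20);
        }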
  • Space Consumption Results. [Chart: space of the original custom allocators vs. DLmalloc, normalized, on 197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, mudlle, and the non-region, region, and overall averages.]
  • Overview: introduction; perceived benefits and drawbacks; three main kinds of custom allocators; comparison with general-purpose allocators; advantages and drawbacks of regions; Reaps, a generalization of regions & heaps.
  • Why Regions? They are apparently faster and more space-efficient, and servers need memory management support: avoiding resource leaks, and tearing down all memory associated with terminated connections or transactions. The current approach (e.g., Apache) is regions.
  • Drawbacks of Regions: memory within a region cannot be reclaimed, which is a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs: unbounded memory consumption. The current situation for Apache: vulnerable to denial-of-service, limits the runtime of connections, and limits module programming.
  • Reap Hybrid Allocator: Reap = region + heap. It adds individual object deletion and a heap (reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r)), so it can reduce memory consumption. It is fast, adapts to how it is used (region or heap style), and has cheap deletion.
  • Using Reap as Regions: Reap performance nearly matches regions. [Chart: normalized runtime of Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap on the region-based benchmarks lcc and mudlle.]
  • Reap: Best of Both Worlds. Combining new/delete with regions is usually impossible: the APIs are incompatible and code is hard to rewrite. With Reap, we incorporated new/delete code into Apache as "mod_bc" (an arbitrary-precision calculator), changing 20 lines out of 8,000. On a benchmark computing the 1000th prime, memory consumption was 240K with Reap versus 7.4MB without it.
  • Conclusion: an empirical study of custom allocators. The Lea allocator is often as fast or faster; custom allocation is ineffective, except for regions. Reaps nearly match region performance without the other drawbacks. Take-home message: stop using custom memory allocators!
  • Software: http://www.cs.umass.edu/~emery (part of the Heap Layers distribution)
  • Experimental Methodology: comparing to general-purpose allocators. When the semantics are the same, there is no problem (e.g., disable per-class allocators). When the semantics differ, we use an emulator that relies on the general-purpose allocator but adds bookkeeping: regionfree frees all associated objects, and other functionality (nesting, obstacks) is handled similarly.
  • Use Custom Allocators? Strongly recommended by practitioners, but there is little hard data on performance/space improvements. The only previous study [Zorn 1992] focused on just one type of allocator and concluded that custom allocators are a waste of time: small gains, bad allocators. Are different allocators better? What are the trade-offs?
  • Kinds of Custom Allocators: three basic types. Per-class: fast. Custom patterns: fast, but very special-purpose. Regions: fast, possibly more space-efficient, and convenient; variants include nested regions and obstacks.
  • Optimization Opportunity. [Chart: percentage of runtime spent in memory operations vs. other work for 197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, mudlle, and the average.]
  • Custom Memory Allocation: programmers often replace malloc/free to try to increase performance, to provide extra functionality (e.g., for servers), or, rarely, to reduce space. Our empirical study of custom allocators found that the Lea allocator is often as fast or faster and that custom allocation is ineffective, except for regions. [OOPSLA 2002]
  • Overview of Regions: separate areas of memory, deletion only en masse (regioncreate(r), regionmalloc(r, sz), regiondelete(r)). Pros: fast, pointer-bumping allocation; deletion of chunks; convenient; one call frees all memory. Cons: risky; accidental deletion; too much space.
  • Why Regions? They are apparently faster and more space-efficient, and servers need memory management support: avoiding resource leaks, and tearing down all memory associated with terminated connections or transactions. The current approach (e.g., Apache) is regions.
  • Drawbacks of Regions: memory within a region cannot be reclaimed, which is a problem for long-running computations, producer-consumer patterns, and off-the-shelf "malloc/free" programs: unbounded memory consumption. The current situation for Apache: vulnerable to denial-of-service, limits the runtime of connections, and limits module programming.
  • Reap Hybrid Allocator: Reap = region + heap. It adds individual object deletion and a heap (reapcreate(r), reapmalloc(r, sz), reapfree(r, p), reapdelete(r)), so it can reduce memory consumption. It is fast, adapts to how it is used (region or heap style), and has cheap deletion.
  • Using Reap as Regions: Reap performance nearly matches regions. [Chart: normalized runtime of Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap on the region-based benchmarks lcc and mudlle.]
  • Reap: Best of Both Worlds. Combining new/delete with regions is usually impossible: the APIs are incompatible and code is hard to rewrite. With Reap, we incorporated new/delete code into Apache as "mod_bc" (an arbitrary-precision calculator), changing 20 lines out of 8,000. On a benchmark computing the 1000th prime, memory consumption was 240K with Reap versus 7.4MB without it.