Memory Management for High-Performance Applications - Presentation Transcript
Memory Management
for High-Performance Applications
Emery Berger
University of Massachusetts Amherst
UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science
AMHERST
High-Performance Applications
Web servers, search engines, scientific codes
C or C++
Run on one or a cluster of server boxes
Needs support at every level: software, compiler, runtime system, operating system, hardware
[Diagram: server boxes, each with several CPUs, RAM, and RAID drives]
New Applications,
Old Memory Managers
Applications and hardware have changed
Multiprocessors now commonplace
Object-oriented, multithreaded
Increased pressure on memory manager
(malloc, free)
But memory managers have not kept up
Inadequate support for modern applications
Current Memory Managers
Limit Scalability
As we add processors, program slows down
Caused by heap contention
[Figure: Runtime Performance. Speedup (0 to 14) vs. Number of Processors (1 to 14), Ideal vs. Actual. Larson server benchmark on 14-processor Sun]
The Problem
Current memory managers
inadequate for high-performance
applications on modern architectures
Limit scalability & application
performance
This Talk
Building memory managers
Heap Layers framework
Problems with current memory managers
Contention, false sharing, space
Solution: provably scalable memory manager
Hoard
Extended memory manager for servers
Reap
Implementing Memory Managers
Memory managers must be
Space efficient
Very fast
Heavily-optimized C code
Hand-unrolled loops
Macros
Monolithic functions
Hard to write, reuse, or extend
Real Code: DLmalloc 2.7.2
#define chunksize(p) ((p)->size & ~(SIZE_BITS))
#define next_chunk(p) ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))
#define prev_chunk(p) ((mchunkptr)(((char*)(p)) - ((p)->prev_size)))
#define chunk_at_offset(p, s) ((mchunkptr)(((char*)(p)) + (s)))
#define inuse(p) \
  ((((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size) & PREV_INUSE)
#define set_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size |= PREV_INUSE
#define clear_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size &= ~(PREV_INUSE)
#define inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)
#define set_inuse_bit_at_offset(p, s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size |= PREV_INUSE)
#define MALLOC_ZERO(charp, nbytes) \
do { \
  INTERNAL_SIZE_T* mzp = (INTERNAL_SIZE_T*)(charp); \
  CHUNK_SIZE_T mctmp = (nbytes)/sizeof(INTERNAL_SIZE_T); \
  long mcn; \
  if (mctmp < 8) mcn = 0; else { mcn = (mctmp-1)/8; mctmp %= 8; } \
  switch (mctmp) { \
    case 0: for(;;) { *mzp++ = 0; \
    case 7: *mzp++ = 0; \
    case 6: *mzp++ = 0; \
    case 5: *mzp++ = 0; \
    case 4: *mzp++ = 0; \
    case 3: *mzp++ = 0; \
    case 2: *mzp++ = 0; \
    case 1: *mzp++ = 0; if (mcn <= 0) break; mcn--; } \
  } \
} while (0)
Programming Language Support
Classes: overhead, rigid hierarchy
Mixins: no overhead, flexible hierarchy
Sounds great...
A Heap Layer
C++ mixin with malloc & free methods
template <class SuperHeap>
class GreenHeapLayer :
  public SuperHeap {…};
[Diagram: RedHeapLayer and GreenHeapLayer stacked as layers]
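The mixin pattern on this slide can be sketched as follows. This is only an illustration: `MallocHeap` and `CountingHeap` are hypothetical names invented here, not classes confirmed by the slides, and the counting behavior is an arbitrary example of what a layer might add.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical bottom layer: forwards to the C library allocator.
class MallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void free(void* p) { std::free(p); }
};

// Hypothetical mixin layer: counts live objects, then defers to SuperHeap.
template <class SuperHeap>
class CountingHeap : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    void* p = SuperHeap::malloc(sz);
    if (p != nullptr) ++live_;
    return p;
  }
  void free(void* p) {
    if (p == nullptr) return;
    --live_;
    SuperHeap::free(p);
  }
  std::size_t live() const { return live_; }

private:
  std::size_t live_ = 0;
};

// Layers compose by template instantiation, so the compiler can inline
// the calls between them instead of paying for virtual dispatch.
using CountedMallocHeap = CountingHeap<MallocHeap>;
```

Because the superclass is a template parameter, the same layer can sit on top of any heap that exposes `malloc` and `free`, which is the flexibility the slide contrasts with a rigid class hierarchy.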
Example: Thread-Safe Heap Layer
LockedHeap
protect the superheap with a lock
[Diagram: LockedMallocHeap = LockedHeap layered over mallocHeap]
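A minimal sketch of a locking layer in the style this slide describes, assuming a `std::mutex` as the lock. The `MallocHeap` base class is a hypothetical stand-in for the slide's mallocHeap; the actual Heap Layers implementation may differ.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Hypothetical bottom layer standing in for the slide's mallocHeap.
class MallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void free(void* p) { std::free(p); }
};

// Sketch of a locking layer: every call into the superheap is
// serialized by one mutex held for the duration of the call.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    std::lock_guard<std::mutex> guard(lock_);
    return SuperHeap::malloc(sz);
  }
  void free(void* p) {
    std::lock_guard<std::mutex> guard(lock_);
    SuperHeap::free(p);
  }

private:
  std::mutex lock_;
};

// The composition shown in the slide's diagram.
using LockedMallocHeap = LockedHeap<MallocHeap>;
```

A single lock around the whole heap is simple but is also the source of the heap contention discussed later in the talk.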
Empirical Results
Heap Layers vs. originals:
KingsleyHeap vs. BSD allocator
LeaHeap vs. DLmalloc 2.7
Competitive runtime and memory efficiency
[Figure: Runtime (normalized to Lea allocator) for Kingsley, KingsleyHeap, Lea, and LeaHeap on benchmarks cfrac, espresso, lindsay, LRUsim, perl, roboop, and Average]
[Figure: Space (normalized to Lea allocator) for the same allocators and benchmarks]
Overview
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Extended memory manager for servers
Reap
Problems with General-Purpose
Memory Managers
Previous work for multiprocessors
Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]
Impractical
Multiple heaps [Larson 98, Gloger 99]
Reduce contention but cause other problems:
P-fold or even unbounded increase in space
we show
Allocator-induced false sharing
Multiple Heap Allocator:
Pure Private Heaps
One heap per processor:
malloc gets memory from its local heap
free puts memory on its local heap
STL, Cilk, ad hoc
[Key: one marker = in use, processor 0; another = free, on heap 1]
[Diagram: across processor 0 and processor 1: x1 = malloc(1), x2 = malloc(1), free(x1), free(x2), x4 = malloc(1), x3 = malloc(1), free(x3), free(x4)]
Multiple Heap Allocator:
Private Heaps with Ownership
free returns memory to original heap
Bounded memory consumption
No crash!
“Ptmalloc” (Linux), LKmalloc
[Diagram: across processor 0 and processor 1: x1 = malloc(1), free(x1), x2 = malloc(1), free(x2)]
Problem:
P-fold Memory Blowup
Occurs in practice
Round-robin producer-consumer
processor i mod P allocates
processor (i+1) mod P frees
Footprint = 1 (2GB), but space = 3 (6GB)
Exceeds 32-bit address space: Crash!
[Diagram: across processor 0, processor 1, and processor 2: x1 = malloc(1), free(x1), x2 = malloc(1), free(x2), x3 = malloc(1), free(x3)]
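The blowup can be reproduced with a small simulation. The sketch below is an assumption-laden model, not any real allocator: `OwnershipHeaps` and `simulateRoundRobin` are hypothetical names, each heap only counts its free blocks, and every object has the same size.

```cpp
#include <cstddef>
#include <vector>

// Toy model of private heaps with ownership: a freed block always
// returns to the heap that allocated it, whoever calls free.
struct OwnershipHeaps {
  std::vector<std::size_t> freeBlocks;  // free blocks held by each heap
  std::size_t fromOS = 0;               // blocks ever obtained from the OS

  explicit OwnershipHeaps(std::size_t processors)
      : freeBlocks(processors, 0) {}

  // Allocate on processor `proc`; returns the owning heap's index.
  std::size_t allocate(std::size_t proc) {
    if (freeBlocks[proc] > 0) {
      --freeBlocks[proc];  // reuse a block this heap already holds
    } else {
      ++fromOS;            // local heap is empty: grow from the OS
    }
    return proc;
  }

  // Free a block: it goes back to its owner's heap.
  void release(std::size_t owner) { ++freeBlocks[owner]; }
};

// Round-robin producer-consumer: processor i mod P allocates and
// processor (i+1) mod P frees, so only one object is live at a time.
std::size_t simulateRoundRobin(std::size_t P, std::size_t rounds) {
  OwnershipHeaps heaps(P);
  for (std::size_t i = 0; i < rounds; ++i) {
    std::size_t owner = heaps.allocate(i % P);
    heaps.release(owner);  // performed by processor (i + 1) % P
  }
  return heaps.fromOS;     // total space held, in blocks
}
```

With P = 3 the live footprint never exceeds one block, yet each of the three heaps ends up holding its own block, matching the slide's "footprint = 1, space = 3".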
Problem:
Allocator-Induced False Sharing
False sharing
Non-shared objects on same cache line
Bane of parallel applications
Extensively studied
All these allocators cause false sharing!
[Diagram: CPU 0 and CPU 1, each with its own cache, connected by a bus. processor 0 runs x1 = malloc(1), processor 1 runs x2 = malloc(1); both objects share one cache line, which thrashes between the caches]
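Whether two objects can falsely share depends only on their addresses. The helper below is a hypothetical illustration, assuming a 64-byte cache line (a common size, but hardware-specific); it checks whether two pointers fall on the same line.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLineSize = 64;

// Returns true if the bytes at `a` and `b` lie on the same cache line.
inline bool sameCacheLine(const void* a, const void* b,
                          std::size_t lineSize = kCacheLineSize) {
  const auto ia = reinterpret_cast<std::uintptr_t>(a);
  const auto ib = reinterpret_cast<std::uintptr_t>(b);
  return ia / lineSize == ib / lineSize;
}
```

An allocator that hands adjacent small objects to different processors makes this check true for x1 and x2, so writes by one processor invalidate the line in the other's cache even though the objects are unrelated.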
So What Do We Do Now?
Where do we put free memory?
on central heap: heap contention
on our own heap (pure private heaps): unbounded memory consumption
on the original heap (private heaps with ownership): P-fold blowup
How do we avoid false sharing?
Overview
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Extended memory manager for servers
Reap
Hoard: Key Insights
Bound local memory consumption
Explicitly track utilization
Move free memory to a global heap
Provably bounds memory consumption
Manage memory in large chunks
Avoids false sharing
Reduces heap contention
Overview of Hoard
Manage memory in heap blocks
Page-sized
Avoids false sharing
Allocate from local heap block
Avoids heap contention
Low utilization
Move heap block to global heap
Avoids space blowup
[Diagram: a global heap above per-processor heaps for processor 0 … processor P-1]
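The "low utilization, move heap block to global heap" step might look roughly like the sketch below. Everything here is an assumption for illustration: the `HeapBlock` and `Heap` types, the `minUtilization` threshold, and the choice of the least-used block as the one to move are not taken from the slides, and locking is omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping for one heap block (e.g. one page).
struct HeapBlock {
  std::size_t capacity;  // bytes the block can hold
  std::size_t inUse;     // bytes currently allocated from it
};

// Hypothetical heap: just a collection of heap blocks.
struct Heap {
  std::vector<HeapBlock> blocks;

  std::size_t allocated() const {
    std::size_t total = 0;
    for (const HeapBlock& b : blocks) total += b.capacity;
    return total;
  }
  std::size_t inUse() const {
    std::size_t total = 0;
    for (const HeapBlock& b : blocks) total += b.inUse;
    return total;
  }
};

// If the local heap's utilization is below minUtilization, move its
// least-used block to the global heap. Returns true if a block moved.
bool releaseIfUnderutilized(Heap& local, Heap& global, double minUtilization) {
  if (local.blocks.empty()) return false;
  const double used = static_cast<double>(local.inUse());
  const double held = static_cast<double>(local.allocated());
  if (used >= minUtilization * held) return false;
  auto victim = std::min_element(
      local.blocks.begin(), local.blocks.end(),
      [](const HeapBlock& x, const HeapBlock& y) { return x.inUse < y.inUse; });
  global.blocks.push_back(*victim);
  local.blocks.erase(victim);
  return true;
}
```

Running this check after each free keeps the memory a processor holds but does not use bounded, while other processors can pick up the released blocks from the global heap.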
Summary of Analytical Results
Space consumption: near optimal worst-case
Hoard: O(n log M/m + P) {P « n}
Optimal: O(n log M/m) [Robson 70]
Private heaps with ownership: O(P n log M/m)
n = memory required
M = biggest object size
m = smallest object size
P = processors
Provably low synchronization
Empirical Results
Measure runtime on 14-processor Sun
Allocators
Solaris (system allocator)
Ptmalloc (GNU libc)
mtmalloc (Sun’s “MT-hot” allocator)
Micro-benchmarks
Threadtest: no sharing
Larson: sharing (server-style)
Cache-scratch: mostly reads & writes
(tests for false sharing)
Real application experience similar
Runtime Performance:
threadtest
Many threads, no sharing
Hoard achieves linear speedup
speedup(x,P) = runtime(Solaris allocator, one processor)
/ runtime(x on P processors)
Runtime Performance:
Larson
Many threads, sharing (server-style)
Hoard achieves linear speedup
Runtime Performance:
false sharing
Many threads, mostly reads & writes of heap data
Hoard achieves linear speedup
Hoard in the “Real World”
Open source code
www.hoard.org
13,000 downloads
Solaris, Linux, Windows, IRIX, …
Widely used in industry
AOL, British Telecom, Novell, Philips
Reports: 2x-10x, “impressive” improvement in performance
Search server, telecom billing systems, scene rendering,
real-time messaging middleware, text-to-speech engine,
telephony, JVM
Scalable general-purpose memory manager
Overview
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Extended memory manager for servers
Reap
Custom Memory Allocation
Replace new/delete, bypassing general-purpose allocator
Reduce runtime: often
Expand functionality: sometimes
Reduce space: rarely
Very common practice
Apache, gcc, lcc, STL, database servers…
Language-level support in C++
“Use custom allocators”
The Reality
Lea allocator often as fast or faster
Custom allocation ineffective, except for regions. [OOPSLA 2002]
[Figure: Runtime - Custom Allocator Benchmarks. Normalized runtime of Custom, Win32, and DLmalloc, with benchmarks grouped into non-regions, regions, and averages]
Overview of Regions
Separate areas, deletion only en masse
regioncreate(r)
regionmalloc(r, sz)
regiondelete(r)
+ Fast
  + Pointer-bumping allocation
  + Deletion of chunks
+ Convenient
  + One call frees all memory
- Risky
  - Accidental deletion
  - Too much space
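A region allocator of the kind summarized above can be sketched in a few lines. The `Region` class, its method names, and the 4096-byte default chunk size are assumptions made for this example and are not the API of Apache or any other system named in the talk.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Sketch of a region: pointer-bumping allocation out of large chunks,
// with no way to free a single object, only the whole region.
class Region {
public:
  explicit Region(std::size_t chunkSize = 4096) : chunkSize_(chunkSize) {}
  ~Region() { deleteAll(); }
  Region(const Region&) = delete;
  Region& operator=(const Region&) = delete;

  // Counterpart of regionmalloc(r, sz): bump a pointer.
  void* malloc(std::size_t sz) {
    sz = roundUp(sz == 0 ? 1 : sz);
    if (sz > remaining_) {
      // Current chunk is exhausted: get a new one (its tail is wasted).
      const std::size_t chunk = sz > chunkSize_ ? sz : chunkSize_;
      char* mem = static_cast<char*>(std::malloc(chunk));
      if (mem == nullptr) return nullptr;
      chunks_.push_back(mem);
      bump_ = mem;
      remaining_ = chunk;
    }
    void* p = bump_;
    bump_ += sz;
    remaining_ -= sz;
    return p;
  }

  // Counterpart of regiondelete(r): one call frees all memory.
  void deleteAll() {
    for (char* c : chunks_) std::free(c);
    chunks_.clear();
    bump_ = nullptr;
    remaining_ = 0;
  }

  std::size_t chunkCount() const { return chunks_.size(); }

private:
  // Round sizes up so every returned pointer stays suitably aligned.
  static std::size_t roundUp(std::size_t sz) {
    const std::size_t a = alignof(std::max_align_t);
    return (sz + a - 1) / a * a;
  }

  std::size_t chunkSize_;
  std::vector<char*> chunks_;
  char* bump_ = nullptr;
  std::size_t remaining_ = 0;
};
```

The sketch shows both sides of the slide's list: allocation is a bounds check and an addition, but memory used by dead objects stays held until `deleteAll`.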
Why Regions?
Apparently faster, more space-efficient
Servers need memory management support:
Avoid resource leaks
Tear down memory associated with terminated
connections or transactions
Current approach (e.g., Apache): regions
Drawbacks of Regions
Can’t reclaim memory within regions
Problem for long-running computations,
producer-consumer patterns,
off-the-shelf “malloc/free” programs
unbounded memory consumption
Current situation for Apache:
vulnerable to denial-of-service
limits runtime of connections
limits module programming
Reap Hybrid Allocator
Reap = region + heap
Adds individual object deletion & heap
reapcreate(r)
reapmalloc(r, sz)
reapfree(r,p)
reapdelete(r)
Can reduce memory consumption
Fast
Adapts to use (region or heap style)
Cheap deletion
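To make "region + heap" concrete, here is a deliberately simplified sketch. It is not the Reap implementation: the `Reap` class below, its per-object size header, and its exact-size free lists are assumptions chosen to keep the example short.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <map>
#include <vector>

// Simplified reap: bump allocation like a region, plus per-object free
// that feeds size-indexed free lists, plus whole-reap deletion.
class Reap {
public:
  explicit Reap(std::size_t chunkSize = 4096) : chunkSize_(chunkSize) {}
  ~Reap() { deleteAll(); }
  Reap(const Reap&) = delete;
  Reap& operator=(const Reap&) = delete;

  void* malloc(std::size_t sz) {  // counterpart of reapmalloc(r, sz)
    sz = roundUp(sz == 0 ? 1 : sz);
    // Heap style: reuse a previously freed object of the same size.
    auto it = freeLists_.find(sz);
    if (it != freeLists_.end() && !it->second.empty()) {
      void* p = it->second.back();
      it->second.pop_back();
      return p;
    }
    // Region style: bump a pointer, leaving room for a size header.
    const std::size_t total = kHeader + sz;
    if (total > remaining_) {
      const std::size_t chunk = total > chunkSize_ ? total : chunkSize_;
      char* mem = static_cast<char*>(std::malloc(chunk));
      if (mem == nullptr) return nullptr;
      chunks_.push_back(mem);
      bump_ = mem;
      remaining_ = chunk;
    }
    std::memcpy(bump_, &sz, sizeof sz);  // record the object's size
    void* p = bump_ + kHeader;
    bump_ += total;
    remaining_ -= total;
    return p;
  }

  void free(void* p) {  // counterpart of reapfree(r, p)
    if (p == nullptr) return;
    std::size_t sz;
    std::memcpy(&sz, static_cast<char*>(p) - kHeader, sizeof sz);
    freeLists_[sz].push_back(p);
  }

  void deleteAll() {  // counterpart of reapdelete(r)
    for (char* c : chunks_) std::free(c);
    chunks_.clear();
    freeLists_.clear();
    bump_ = nullptr;
    remaining_ = 0;
  }

  std::size_t chunkCount() const { return chunks_.size(); }

private:
  static constexpr std::size_t kAlign = alignof(std::max_align_t);
  static constexpr std::size_t kHeader = kAlign;  // header keeps alignment
  static_assert(kAlign >= sizeof(std::size_t), "header must hold a size");
  static std::size_t roundUp(std::size_t sz) {
    return (sz + kAlign - 1) / kAlign * kAlign;
  }

  std::size_t chunkSize_;
  std::vector<char*> chunks_;
  std::map<std::size_t, std::vector<void*>> freeLists_;
  char* bump_ = nullptr;
  std::size_t remaining_ = 0;
};
```

A program that never calls `free` gets pure region behavior; one that does call it gets memory reuse, which is the adaptation the slide refers to.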
Using Reap as Regions
Reap performance nearly matches regions
[Figure: Runtime - Region-Based Benchmarks. Normalized runtime of Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap on lcc and mudlle; y-axis 0 to 2.5, with one off-scale bar labeled 4.08]
Reap: Best of Both Worlds
Combining new/delete with regions
usually impossible:
Incompatible APIs
Hard to rewrite code
Use Reap: Incorporate new/delete code into Apache
“mod_bc” (arbitrary-precision calculator)
Changed 20 lines (out of 8000)
Benchmark: compute 1000th prime
With Reap: 240K
Without Reap: 7.4MB
Summary
Building memory managers
Heap Layers framework [PLDI 2001]
Problems with current memory managers
Contention, false sharing, space
Solution: provably scalable memory manager
Hoard [ASPLOS-IX]
Extended memory manager for servers
Reap [OOPSLA 2002]
Current Projects
CRAMM: Cooperative Robust Automatic Memory
Management
Garbage collection without paging
Automatic heap sizing
SAVMM: Scheduler-Aware Virtual Memory Management
Markov:
Programming language for building high-performance servers
COLA: Customizable Object Layout Algorithms
Improving locality in Java
www.cs.umass.edu/~plasma
Looking Forward
“New” programming languages
Increasing use of Java = garbage collection
New architectures
NUMA: SMT/CMP (“hyperthreading”)
Technology trends
Memory hierarchy
The Ever-Steeper
Memory Hierarchy
Higher = smaller, faster, closer to CPU
A real desktop machine (mine)
registers 8 integer, 8 floating-point; 1-cycle latency
L1 cache 8K data & instructions; 2-cycle latency
L2 cache 512K; 7-cycle latency
RAM 1GB; 100 cycle latency
Disk 40 GB; 38,000,000 cycle latency (!)
Swapping & Throughput
Heap > available memory - throughput plummets
Why Manage Memory At All?
Just buy more!
Simplifies memory management
Still have to collect garbage eventually…
Workload fits in RAM = no more swapping!
Sounds great…
Memory Prices Over Time
“Soon it will be free…”
[Figure: RAM Prices Over Time (1977 dollars). Dollars per GB on a log scale from $0.01 to $10,000.00, by year from 1977 to 2005, for conventional DRAM chip generations 2K through 8M]
Memory Prices: Inflection Point!
[Figure: RAM Prices Over Time (1977 dollars). Dollars per GB on a log scale from $0.01 to $10,000.00, by year from 1977 to 2005, for chip generations 2K through 1G; labels distinguish conventional DRAM from SDRAM, RDRAM, DDR, and Chipkill]
Memory Is Actually Expensive
Desktops:
Most ship with 256MB
1GB = 50% more $$
Laptops = 70%, if possible
Limited capacity
Servers:
Buy 4GB, get 1 CPU free!
Sun Enterprise 10000: 8GB extra = $150,000!
8GB Sun RAM = 1 Ferrari Modena
Fast RAM: new technologies
Cosmic rays…
Key Problem: Paging
Garbage collectors: VM oblivious
GC disrupts LRU queue
Touches non-resident pages
Virtual memory managers: GC oblivious
Likely to evict pages needed by GC
Paging
Orders of magnitude more time than RAM
BIG hit in performance and LONG pauses
Cooperative Robust Automatic
Memory Management (CRAMM)
“I’m a cooperative application!”
Coarse-grained (heap-level):
Virtual memory manager: tracks per-process, overall memory utilization; signals change in memory pressure
Garbage collector: adjusts heap size; reports new heap size
Fine-grained (page-level):
Virtual memory manager: page replacement, selects victim pages; sends page eviction notification
Garbage collector: evacuates pages; returns victim page(s)
Joint work: Eliot Moss (UMass), Scott Kaplan (Amherst College)
Fine-Grained Cooperative GC
Virtual memory manager: page replacement, selects victim pages; sends fine-grained page eviction notification
Garbage collector: evacuates pages; returns victim page(s)
Goal: GC triggers no additional paging
Key ideas:
Adapt collection strategy on-the-fly
Page-oriented memory management
Exploit detailed page information from VM
Summary
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Future directions
If You Have to Spend $$...
more Ferraris: good
more memory: bad
www.cs.umass.edu/~emery/plasma
This Page Intentionally Left Blank
Virtual Memory Manager Support
New VM required: detailed page-level information
“Segmented queue” (unprotected and protected segments) for low overhead
Local LRU order per-process, not gLRU (Linux)
Complementary to SAVM work:
“Scheduler-Aware Virtual Memory manager”
Under development – modified Linux kernel
Current Work: Robust Performance
Currently: no VM-GC communication
BAD interactions under memory pressure
Our approach (with Eliot Moss, Scott Kaplan):
Cooperative Robust Automatic Memory Management
[Diagram: virtual memory manager and garbage collector / allocator exchanging LRU queue information, memory pressure, and empty pages, for reduced impact]
Current Work: Predictable VMM
Recent work on scheduling for QoS
E.g., proportional-share
Under memory pressure, VMM is scheduler
Paged-out processes may never recover
Intermittent processes may wait long time
Scheduler-faithful virtual memory
(with Scott Kaplan, Prashant Shenoy)
Based on page value rather than order
Conclusion
Memory management for high-performance applications
Heap Layers framework [PLDI 2001]
Reusable components, no runtime cost
Hoard scalable memory manager [ASPLOS-IX]
High-performance, provably scalable & space-efficient
Reap hybrid memory manager [OOPSLA 2002]
Provides speed & robustness for server applications
Current work: robust memory management for
multiprogramming
The Obligatory URL Slide
http://www.cs.umass.edu/~emery
If You Can Read This,
I Went Too Far
Hoard: Under the Hood
[Diagram of Hoard built from heap layers. Layers shown: SystemHeap, SuperblockHeap, HeapBlockManager, LockedHeap, PerProcessorHeap, FreeToHeapBlock, MallocOrFreeHeap, SelectSizeHeap, plus a pool of empty heap blocks]
get or return memory to global heap
malloc from local heap, free to heap block
Large objects (> 4K)
select heap based on size
Drawbacks of Custom Allocators
Avoiding memory manager means:
More code to maintain & debug
Can’t use memory debuggers
Not modular or robust:
Mix memory from custom
and general-purpose allocators → crash!
Increased burden on programmers
Overview
Introduction
Perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps – generalization of regions & heaps
(I) Per-Class Allocators
Recycle freed objects from a free list
a = new Class1;
b = new Class1;
c = new Class1;
delete a;
delete b;
delete c;
a = new Class1;
b = new Class1;
c = new Class1;
[Diagram: Class1 free list holding a, b, c]
+ Fast
  + Linked list operations
+ Simple
  + Identical semantics
  + C++ language support
- Possibly space-inefficient
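A per-class allocator along these lines can be written with class-level `operator new` and `operator delete`. This is a sketch under stated assumptions: the `payload` member and the `FreeNode` struct are invented for the example, the free list is not thread-safe, and memory placed on it is never returned to the general-purpose allocator.

```cpp
#include <cstddef>
#include <new>

class Class1 {
public:
  // Take a recycled object from the free list if one is available.
  static void* operator new(std::size_t sz) {
    if (sz == sizeof(Class1) && freeList_ != nullptr) {
      FreeNode* node = freeList_;
      freeList_ = node->next;
      return node;
    }
    return ::operator new(sz);
  }

  // Push the freed object onto the free list instead of releasing it.
  static void operator delete(void* p, std::size_t sz) noexcept {
    if (p == nullptr) return;
    if (sz != sizeof(Class1)) {  // e.g. a derived class: do not recycle
      ::operator delete(p);
      return;
    }
    FreeNode* node = static_cast<FreeNode*>(p);
    node->next = freeList_;
    freeList_ = node;
  }

  int payload[4] = {};  // hypothetical object contents

private:
  // A freed object's storage is reused to hold the list link.
  struct FreeNode { FreeNode* next; };
  static inline FreeNode* freeList_ = nullptr;  // requires C++17
};

static_assert(sizeof(Class1) >= sizeof(void*),
              "object must be large enough to hold a free-list link");
```

Client code keeps writing `new Class1` and `delete a`, which is the "identical semantics" point; the cost is that freed Class1 storage can only ever be reused for Class1 objects.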
(II) Custom Patterns
Tailor-made to fit allocation patterns
Example: 197.parser (natural language parser)
char[MEMORY_LIMIT]
a = xalloc(8);
b = xalloc(16);
c = xalloc(8);
xfree(b);
xfree(c);
d = xalloc(8);
[Diagram: objects a, b, c, d laid out in the array, with end_of_array moving after each call]
+ Fast
  + Pointer-bumping allocation
- Brittle
  - Fixed memory size
  - Requires stack-like lifetimes
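The stack-like behavior in this trace can be modeled as below. This is a guess at the general technique, not the code from 197.parser: the `StackAllocator` class, the side table of block records, and the 1024-byte `MEMORY_LIMIT` are all assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t MEMORY_LIMIT = 1024;  // hypothetical fixed size

// Sketch of a stack-like custom allocator over one fixed array.
class StackAllocator {
public:
  // Bump end_of_array; fail when the fixed array is exhausted.
  void* xalloc(std::size_t sz) {
    sz = roundUp(sz == 0 ? 1 : sz);
    if (sz > MEMORY_LIMIT - endOfArray_) return nullptr;
    blocks_.push_back({endOfArray_, false});
    void* p = memory_ + endOfArray_;
    endOfArray_ += sz;
    return p;
  }

  // Mark the block freed; space is reclaimed only once every block
  // above it has been freed too (stack-like lifetimes).
  void xfree(void* p) {
    const std::size_t off =
        static_cast<std::size_t>(static_cast<char*>(p) - memory_);
    for (auto it = blocks_.rbegin(); it != blocks_.rend(); ++it) {
      if (it->offset == off) {
        it->freed = true;
        break;
      }
    }
    while (!blocks_.empty() && blocks_.back().freed) {
      endOfArray_ = blocks_.back().offset;  // pop the freed top block
      blocks_.pop_back();
    }
  }

  std::size_t endOfArray() const { return endOfArray_; }

private:
  struct Block {
    std::size_t offset;  // where the block starts in memory_
    bool freed;          // freed but not yet popped
  };
  static std::size_t roundUp(std::size_t sz) {
    const std::size_t a = alignof(std::max_align_t);
    return (sz + a - 1) / a * a;
  }

  alignas(std::max_align_t) char memory_[MEMORY_LIMIT];
  std::size_t endOfArray_ = 0;
  std::vector<Block> blocks_;
};
```

In the slide's trace, `xfree(b)` reclaims nothing because c is still above it; `xfree(c)` then pops both, so d lands where b was. A program whose frees do not follow this order would hold on to dead memory until the array runs out.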
(III) Regions
Separate areas, deletion only en masse
regioncreate(r)
regionmalloc(r, sz)
regiondelete(r)
+ Fast
  + Pointer-bumping allocation
  + Deletion of chunks
+ Convenient
  + One call frees all memory
- Risky
  - Accidental deletion
  - Too much space
Overview
Introduction
Perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps – generalization of regions & heaps
Custom Allocators Are Faster…
[Figure: Runtime - Custom Allocator Benchmarks. Normalized runtime of Custom vs. Win32, with benchmarks grouped into non-regions, regions, and averages]
Not So Fast…
[Figure: Runtime - Custom Allocator Benchmarks. Normalized runtime of Custom, Win32, and DLmalloc, with benchmarks grouped into non-regions, regions, and averages]
The Lea Allocator (DLmalloc 2.7.0)
Optimized for common allocation patterns
Per-size quicklists ≈ per-class allocation
Deferred coalescing
(combining adjacent free objects)
Highly-optimized fastpath
Space-efficient
Space Consumption Results
[Figure: Space - Custom Allocator Benchmarks. Normalized space of Original vs. DLmalloc, with benchmarks grouped into non-regions, regions, and averages]
Overview
Introduction
Perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps – generalization of regions & heaps
Conclusion
Empirical study of custom allocators
Lea allocator often as fast or faster
Custom allocation ineffective,
except for regions
Reaps:
Nearly matches region performance
without other drawbacks
Take-home message:
Stop using custom memory allocators!
Software
http://www.cs.umass.edu/~emery
(part of Heap Layers distribution)
Experimental Methodology
Comparing to general-purpose allocators
Same semantics: no problem
E.g., disable per-class allocators
Different semantics: use emulator
Uses general-purpose allocator
but adds bookkeeping
regionfree: Free all associated objects
Other functionality (nesting, obstacks)
Use Custom Allocators?
Strongly recommended by practitioners
Little hard data on performance/space
improvements
Only one previous study [Zorn 1992]
Focused on just one type of allocator
Custom allocators: waste of time
Small gains, bad allocators
Different allocators better? Trade-offs?
Kinds of Custom Allocators
Three basic types of custom allocators
Per-class
Fast
Custom patterns
Fast, but very special-purpose
Regions
Fast, possibly more space-efficient
Convenient
Variants: nested, obstacks
Optimization Opportunity
[Figure: Time Spent in Memory Operations. Percentage of runtime (0 to 100) split between Memory Operations and Other, per benchmark and on average]
Custom Memory Allocation
Programmers often replace malloc/free
Attempt to increase performance
Provide extra functionality (e.g., for servers)
Reduce space (rarely)
Empirical study of custom allocators
Lea allocator often as fast or faster
Custom allocation ineffective,
except for regions. [OOPSLA 2002]
Fast and effective memory management is crucial for many applications, including web servers, database managers, and scientific codes. However, current memory managers do not provide adequate support for these applications on modern architectures, severely limiting their performance, scalability, and robustness.
In this talk, I describe how to design memory managers that support high-performance applications. I first address the software engineering challenges of building efficient memory managers. I then show how current general-purpose memory managers do not scale on multiprocessors, cause false sharing of heap objects, and systematically leak memory. I describe a fast, provably scalable general-purpose memory manager called Hoard (available at www.hoard.org) that solves these problems, improving performance by up to a factor of 60.