Profcompact Presentation Transcript

  • 1. Profile-Guided Code Compression. Saumya Debray, Department of Computer Science, University of Arizona, Tucson, AZ 85721, debray@cs.arizona.edu; William Evans, Department of Computer Science, University of British Columbia, Vancouver B.C., Canada, V6T 1Z4, will@cs.ubc.ca. (Presented 2010.05.17.)
    ABSTRACT: As computers are increasingly used in contexts where the amount of available memory is limited, it becomes important to devise techniques that reduce the memory footprint of application programs while leaving them in an executable form. This paper describes an approach to applying data compression techniques to reduce the size of infrequently executed portions of a program. The compressed code is decompressed dynamically (via software) if needed, prior to execution. The use of data compression techniques increases the amount of code size reduction that can be obtained; their application to infrequently executed code limits the overhead due to dynamic decompression; and the use of …
    1. INTRODUCTION: In recent years there has been an increasing trend toward the incorporation of computers into a wide variety of devices, such as palm-tops, telephones, embedded controllers, etc. In many of these devices, the amount of memory available is limited, due to considerations such as space, weight, power consumption, or price. For example, the widely used TMS320-C5x DSP processor from Texas Instruments has only 64 Kwords of program memory for executable code [23]. At the same time, there is an increasing desire for more and more sophisticated software in such devices, such as encryption software in telephones, speech/image processing software in palm-tops, fault diagnosis software in embedded processors, … Since these devices typically have no secondary storage, an …
  • 2. Citation Count. A bar chart of citation counts per year, 2002–2009; venues noted: DAC, ASPDAC, and IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
  • 3. Related areas: profile-directed optimization; runtime code generation/modification; program compression. Code size reduction: 13.7% (θ = 0.0) to 18.8% (θ = 0.00005); execution time: +Δ (θ = 0.0) to −27% (θ = 0.00005).
  • 4. The Basic Organization. Figure 1: Code Organization, Before and After Compression. (a) Original: frequently executed code (C1–C6) interleaved with the infrequently executed functions f, g, and h and their call sites. (b) Compressed: a never-compressed part holding the frequently executed code, the stubs f.stub, g.stub, and h.stub at the call sites, the decompressor, and a function offset table ([0] f, [1] g, [2] h); the compressed code for f, g, and h; and a runtime buffer into which a compressed function is decompressed before it runs. (A minimal illustrative sketch of this stub-and-buffer scheme appears after the transcript.)
    2. THE BASIC APPROACH. 2.1 Overview: Figure 1 shows the basic organization of code in our system: a program with three infrequently executed functions f, g, … 2.2 Buffer Management: The scheme described above is conceptually fairly straightforward but fails to mention several issues whose resolution determines its performance. The most important of these is the handling of function calls in the compressed code. Suppose that in Figure …
  • 5. Managing Function Calls Out of the Runtime Buffer. Figure 2: (a) Original: function f, shown with instruction offsets; at offset 97 it calls g (bsr $ra, g) and resumes at offset 98. (b) Transformed, during runtime, after CreateStub has created RestoreStub(f, 98): the never-compressed part holds f's entry stub (EntryStub: bsr $ra, Decompress; <index(f), 0>); in the runtime buffer the call to g goes through bsr $ra, CreateStub; <index(f), 98>; br g; and the runtime stub list holds RestoreStub(f, 98): bsr $ra, Decompress; <index(f), 98>; <count>, which re-decompresses f and resumes it at offset 98 when g returns. (See the restore-stub sketch after the transcript.)
    … executable code, and only discards it to prevent the system from running out of memory. The main drawback with this approach is that the runtime buffer must be made large enough to hold all of the decompressed functions that can possibly coexist on the call stack.
    … after f's call to g. This stub obviously cannot be placed in the runtime buffer, since it may be overwritten there; it must be placed in the never-compressed portion of the program. Since every call from a compressed function requires its own stub, these restore stubs amount to a large fraction of the final executable's size (e.g., …
  • 6. Compression & Decompression: the splitting-streams approach [9], encoding each field using its own Huffman code; canonical Huffman encoding. (A canonical Huffman construction sketch appears after the transcript.)
  • 7. Choosing the Compressible Region. For instance, function calls from within a compressed region are still handled as discussed in Section 2. We now face the problem of how to choose regions to compress. We want these regions to be reasonably small so that the runtime buffer can be small, yet we want few control transfers between different regions so that the number of entry stubs is small. This is an optimization problem. The input is a control flow graph for a program, in which a vertex represents a basic block and has size equal to the number of instructions in the block, and an edge represents a control transfer between blocks; in addition, the input specifies a subset of the vertices that can be compressed. The output is a partition of a subset of the compressible vertices into regions so that the following cost is minimized: for each region, the size of the region after compression, plus a constant number of words for each block requiring an entry stub (i.e., a block that is the target of a control transfer from outside the region), plus a term for the number of external function calls within the region. (A symbolic rendering of this cost is sketched after the transcript.)
    Accompanying figure labels: never-compressed code, compressed code stream, function offset table, entry stubs, runtime buffer.
  • 8. Compressible Regions. … is the number of external function calls within the region … choosing a small value for the bound yields a large number of small compressible regions …
    Figure 3: Effect of Buffer Size Bound on Code Size. Panels (a)–(c): normalized code size versus the buffer size bound (32 to 4096) at thresholds 0.0, 0.00001, and 0.00005; panel (d): the geometric mean. Key: a: adpcm, b: epic, c: g721 dec, d: g721 enc, e: gsm, f: jpeg dec, g: jpeg enc, h: mpeg2dec, i: mpeg2enc, j: pgp, k: rasta. The upper bound on the runtime buffer is taken to be K = 512.
  • 9. Cold Code. Figure 4: Amount of Cold and Compressible Code (Normalized): the geometric mean, over our programs, of the relative amount of cold code and of compressible code, plotted as a fraction of the code (0.0 to 1.0) against the threshold (0.0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0), with one curve for cold code and one for compressible code. (The geometric-mean definition is recalled after the transcript.)
  • 10. Optimizations: Buffer-Safe Functions; Unswitching; indirect jump.
  • 11. Inputs used for profiling and timing runs (Figure 5):
    Program   | Profiling input (size, KB)    | Timing input (size, KB)
    adpcm     | clinton.pcm (295.0),          | mlk IHaveADream.pcm (1475.2),
              | clinton.adpcm (73.8)          | mlk IHaveADream.adpcm (182.1)
    epic      | baboon.tif (262.4)            | baboon.tif (262.4), lena.tif (262.4)
    g721 dec  | clinton.g721 (73.8)           | mlk IHaveADream.g721 (368.8)
    g721 enc  | clinton.pcm (295.0)           | mlk IHaveADream.pcm (1475.2)
    gsm       | clinton.pcm (295.0)           | mlk IHaveADream.pcm (1475.2)
    jpeg dec  | testimg.jpg (5.8)             | roses17.jpg (25.1)
    jpeg enc  | testimg.ppm (101.5)           | roses17.ppm (681.1)
    mpeg2dec  | sarnoff2.m2v (102.5)          | tceh v2.m2v (2310.7)
    mpeg2enc  | sarnoff2.m2v (102.5)          | tceh v2.m2v (2310.7)
    pgp       | compression.ps (717.2)        | TI-320-user-manual.ps (8456.6)
    rasta     | ex5 c1.wav (17.0)             | phone.pcmle.wav (83.7)
  • 12. Figure 6: Code Size Reduction due to Profile-Guided Code Compression at Different Thresholds. Code size reduction (%) for each benchmark at thresholds 0.0, 0.00001, 0.0001, 0.001, 0.01, 0.1, and 1.0. Key: a: adpcm, b: epic, c: g721 dec, d: g721 enc, e: gsm, f: jpeg dec, g: jpeg enc, h: mpeg2dec, i: mpeg2enc, j: pgp, k: rasta, M: geometric mean.
    The squeezed input binaries have already been space optimized by about 30% on average; squash, using the runtime decompression scheme outlined in this paper, compacts the squeezed binaries by about another 14–19% on average. The profiling inputs refer to those used to obtain the execution profiles that were used to carry out compression, while the timing inputs refer to the inputs used to generate execution time data for the uncompressed …
  • 13. However, as θ is increased, the runtime overhead associated with repeated dynamic decompression of code quickly begins to make itself felt. Our experience with this set of programs (and others) indicates that beyond … the runtime overhead becomes quite noticeable. To obtain a reasonable balance between code size improvements and execution speed, we focus on values of θ up to 0.00005.
    Execution time data were obtained on a workstation with a 667 MHz Compaq Alpha 21264 (EV67) processor with a split two-way set-associative primary cache (64 Kbytes each of instruction and data cache) and 512 MB of main memory, running Tru64 Unix. In each case, the execution time was obtained as the smallest of 10 runs of an executable on an otherwise unloaded system. (A sketch of this best-of-N timing methodology appears after the transcript.)
    Figure 7 examines the performance of our programs, both in terms of size and speed, for θ ranging from 0.0 to 0.00005; the final set of bars in each chart shows the mean values across adpcm, epic, g721_dec, g721_enc, gsm, jpeg_dec, jpeg_enc, mpeg2dec, mpeg2enc, pgp, and rasta. (a) Code size: geometric-mean reductions of 13.7% (θ = 0.0), 16.8% (θ = 0.00001), and 18.8% (θ = 0.00005). (b) Execution time (normalized): geometric means of 1.00, 1.04, and 1.24 at the same three thresholds.
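
The sketches below expand on several slides. They are illustrative models only, not the authors' implementation (which rewrites Alpha machine code). First, slide 4's stub-and-buffer scheme: each compressed ("cold") function is reached through a stub that asks a decompressor to materialize it in a shared runtime buffer before running it. The names cold_call, decompress, and compressed_image are invented for this sketch, and the "function bodies" are plain strings so the program actually runs.

    #include <stdio.h>
    #include <string.h>

    /* Toy model of Figure 1 (slide 4), not the paper's implementation.
     * "Compressed" cold-function bodies are strings; the decompressor copies
     * one into a shared runtime buffer, and the stub placed at each original
     * call site makes sure the body is resident before "running" it. */

    #define BUFFER_SIZE    128
    #define NUM_COLD_FUNCS 3

    static const char *compressed_image[NUM_COLD_FUNCS] = {  /* offset-table analogue */
        "body of cold function f",
        "body of cold function g",
        "body of cold function h",
    };

    static char runtime_buffer[BUFFER_SIZE];
    static int  resident = -1;           /* which cold function is in the buffer */

    /* Stand-in for the real decompressor (canonical Huffman in the paper). */
    static void decompress(int index, char *dst) {
        strcpy(dst, compressed_image[index]);  /* a real system would decode here */
    }

    /* The stub that replaces a cold function at its call sites. */
    static void cold_call(int index) {
        if (resident != index) {           /* not resident: decompress it */
            decompress(index, runtime_buffer);
            resident = index;
        }
        printf("running: %s\n", runtime_buffer);
    }

    int main(void) {
        cold_call(0);    /* f: decompressed on first use */
        cold_call(0);    /* f again: already resident    */
        cold_call(2);    /* h: overwrites the buffer     */
        return 0;
    }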
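Slide 5's issue is that a cold function resident in the runtime buffer can be evicted while one of its callees runs, so the return must go through a restore stub kept outside the buffer that re-decompresses the caller and resumes it at the saved offset. The toy below models only that idea; RestoreStub, run_cold, and the one-character "instructions" are inventions for the sketch, and a '.' instruction stands for a call to the never-compressed function g.

    #include <stdio.h>
    #include <string.h>

    /* Toy model of Figure 2 (slide 5): the caller may be evicted from the
     * runtime buffer while g runs, so a restore stub (kept outside the buffer)
     * records where to resume and re-decompresses the caller on return. */

    #define BUFFER_SIZE 128

    static const char *compressed_image[2] = {
        "AB.C",   /* cold function 0: "instructions" A, B, a call ('.'), then C */
        "XY",     /* cold function 1: "instructions" X, Y                       */
    };

    static char runtime_buffer[BUFFER_SIZE];
    static int  resident = -1;

    static void decompress(int index) {              /* stand-in decompressor */
        strcpy(runtime_buffer, compressed_image[index]);
        resident = index;
    }

    typedef struct { int func; int offset; } RestoreStub;  /* lives outside the buffer */

    static void g(void) {                 /* a callee that clobbers the buffer */
        decompress(1);
        printf("g runs cold function 1: %s\n", runtime_buffer);
    }

    /* Run cold function `func` starting at `offset`; a '.' instruction calls g
     * through a restore stub, which then restores and resumes the caller. */
    static void run_cold(int func, int offset) {
        if (resident != func) decompress(func);
        for (int i = offset; runtime_buffer[i] != '\0'; i++) {
            if (runtime_buffer[i] == '.') {
                RestoreStub stub = { func, i + 1 };  /* continuation point */
                g();                                 /* may evict the caller */
                run_cold(stub.func, stub.offset);    /* restore and resume   */
                return;
            }
            printf("executing %c of cold function %d\n", runtime_buffer[i], func);
        }
    }

    int main(void) {
        run_cold(0, 0);
        return 0;
    }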
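Slide 6 mentions that each instruction field is encoded with its own Huffman code, using canonical Huffman encoding; canonical codes can be rebuilt from code lengths alone, which keeps the decoder's tables small. The sketch below is a generic illustration of canonical codeword assignment, not the paper's encoder; the five symbols and their code lengths are made-up examples.

    #include <stdio.h>

    /* Canonical Huffman code assignment from code lengths alone: the first
     * codeword of length L is ((first of length L-1) + (count of length L-1)) << 1,
     * and symbols of equal length receive consecutive codewords. */

    #define NUM_SYMBOLS 5
    #define MAX_LEN     8

    int main(void) {
        const int length[NUM_SYMBOLS] = { 2, 2, 3, 3, 2 };   /* example lengths */

        int count[MAX_LEN + 1] = { 0 };
        for (int s = 0; s < NUM_SYMBOLS; s++) count[length[s]]++;

        unsigned first[MAX_LEN + 1] = { 0 };   /* smallest codeword per length */
        unsigned code = 0;
        for (int len = 1; len <= MAX_LEN; len++) {
            code = (code + count[len - 1]) << 1;
            first[len] = code;
        }

        unsigned next[MAX_LEN + 1];
        for (int len = 0; len <= MAX_LEN; len++) next[len] = first[len];

        for (int s = 0; s < NUM_SYMBOLS; s++) {     /* assign and print codewords */
            unsigned cw = next[length[s]]++;
            printf("symbol %d: length %d, codeword ", s, length[s]);
            for (int b = length[s] - 1; b >= 0; b--)
                putchar(((cw >> b) & 1) ? '1' : '0');
            putchar('\n');
        }
        return 0;
    }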
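Slide 7's cost expression lost its symbols in extraction. One plausible rendering, under notation that is entirely assumed here (R_1, ..., R_m are the chosen regions, c(R) the size of R after compression, E(R) the set of blocks of R needing an entry stub, x(R) the number of external function calls made from R, and s and w the per-stub costs in words), is:

    \[
      \mathrm{cost} \;=\; \sum_{i=1}^{m} \Bigl( c(R_i) \;+\; s\,\lvert E(R_i)\rvert \;+\; w\,x(R_i) \Bigr)
    \]

presumably subject to each region fitting within the runtime-buffer size bound K examined on slide 8.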
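Figures 4, 6, and 7 report geometric means over the benchmark suite, the conventional average for normalized ratios. For reference, the standard definition (not spelled out on the slides) is:

    \[
      \mathrm{GM}(x_1,\dots,x_n) \;=\; \Bigl( \prod_{i=1}^{n} x_i \Bigr)^{1/n}
    \]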
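Slide 13 reports each execution time as the smallest of 10 runs on an otherwise unloaded system. The snippet below only illustrates that best-of-N methodology; workload() is a placeholder, not anything from the paper, and the real measurements timed whole executables rather than a function.

    #define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
    #include <stdio.h>
    #include <time.h>

    #define RUNS 10

    static void workload(void) {
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 10000000UL; i++) x += i;   /* dummy work */
    }

    int main(void) {
        double best = -1.0;
        for (int r = 0; r < RUNS; r++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            workload();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double secs = (double)(t1.tv_sec - t0.tv_sec)
                        + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
            if (best < 0.0 || secs < best) best = secs;   /* keep the minimum */
        }
        printf("best of %d runs: %.6f s\n", RUNS, best);
        return 0;
    }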