MILEPOST GCC: machine learning based research compiler

Grigori Fursin, Cupertino Miranda, Olivier Temam (INRIA Saclay, France)
Mircea Namolaru, Elad Yom-Tov, Ayal Zaks, Bilha Mendelson (IBM Haifa, Israel)
Edwin Bonilla, John Thomson, Hugh Leather, Chris Williams, Michael O'Boyle (University of Edinburgh, UK)
Phil Barnard, Elton Ashton (ARC International, UK)
Eric Courtois, Francois Bodin (CAPS Enterprise, France)

Contact: grigori.fursin@inria.fr

Abstract

Tuning hardwired compiler optimizations for rapidly evolving hardware makes porting an optimizing compiler for each new platform extremely challenging. Our radical approach is to develop a modular, extensible, self-optimizing compiler that automatically learns the best optimization heuristics based on the behavior of the platform. In this paper we describe MILEPOST GCC (1), a machine-learning-based compiler that automatically adjusts its optimization heuristics to improve the execution time, code size, or compilation time of specific programs on different architectures. Our preliminary experimental results show that it is possible to considerably reduce execution time of the MiBench benchmark suite on a range of platforms entirely automatically.

1 Introduction

Current architectures and compilers continue to evolve bringing higher performance, lower power and smaller size while attempting to keep time to market as short as possible. Typical systems may now have multiple heterogeneous reconfigurable cores and a great number of compiler optimizations available, making manual compiler tuning increasingly infeasible. Furthermore, static compilers often fail to produce high-quality code due to a simplistic model of the underlying hardware.

The difficulty of achieving portable compiler performance has led to iterative compilation [11, 16, 15, 26, 33, 19, 30, 24, 25, 18, 20, 22] being proposed as a means of overcoming the fundamental problem of static modeling. The compiler's static model is replaced by a search of the space of compilation strategies to find the one which, when executed, best improves the program. Little or no knowledge of the current platform is needed so programs can be adapted to different architectures. It is currently used in library generators and by some existing adaptive tools [36, 28, 31, 8, 1, 3]. However it is largely limited to searching for combinations of global compiler optimization flags and tweaking a few fine-grain transformations within relatively narrow search spaces. The main barrier to its wider use is the currently excessive compilation and execution time needed in order to optimize each program. This prevents its wider adoption in general purpose compilers.

Our approach is to use machine learning which has the potential to reuse knowledge across iterative compilation runs, gaining the benefits of iterative compilation while reducing the number of executions needed.

The MILEPOST project's [4] objective is to develop compiler technology that can automatically learn how to best optimize programs for configurable heterogeneous embedded processors using machine learning. It aims to dramatically reduce the time to market of configurable systems. Rather than developing a specialised compiler by hand for each configuration, MILEPOST aims to produce optimizing compilers automatically.

(1) MILEPOST - MachIne Learning for Embedded PrOgramS opTimization [4]
A key goal of the project is to make machine learning based compilation a realistic technology for general-purpose compilation. Current approaches [29, 32, 10, 14] are highly preliminary; limited to global compiler flags or simple transformations considered in isolation.

GCC was selected as the compiler infrastructure for MILEPOST as it is currently the most stable and robust open-source compiler. It supports multiple architectures and has multiple aggressive optimizations, making it a natural vehicle for our research. In addition, each new version usually features new transformations, demonstrating the need for a system to automatically retune its optimization heuristics.

In this paper we present early experimental results showing that it is possible to improve the performance of the well-known MiBench [23] benchmark suite on a range of platforms including x86 and IA64. We ported our tools to the new ARC GCC 4.2.1 that targets ARC International's configurable core family. Using MILEPOST GCC, after a few weeks of training, we were able to learn a model that automatically improves the execution time of the MiBench benchmark by 11%, demonstrating the use of our machine learning based compiler.

This paper is organized as follows: the next section describes the overall MILEPOST framework and is, itself, followed by a section detailing our implementation of the Interactive Compilation Interface for GCC that enables dynamic manipulation of optimization passes. Section 4 describes machine learning techniques used to predict good optimization passes for programs using static program features and optimization knowledge reuse. Section 5 provides experimental results and is followed by concluding remarks.

2 MILEPOST Framework

The MILEPOST project uses a number of components, at the heart of which is the machine learning enabled MILEPOST GCC, shown in Figure 1. MILEPOST GCC currently proceeds in two distinct phases, in accordance with typical machine learning practice: training and deployment.

Training. During the training phase we need to gather information about the structure of programs and record how they behave when compiled under different optimization settings. Such information allows machine learning tools to correlate aspects of program structure, or features, with optimizations, building a strategy that predicts a good combination of optimizations.

In order to learn a good strategy, machine learning tools need a large number of compilations and executions as training examples. These training examples are generated by a tool, the Continuous Collective Compilation Framework [2] (CCC), which evaluates different compilation optimizations, storing execution time, code size and other metrics in a database. The features of the program are extracted from MILEPOST GCC via a plugin and are also stored in the database. Plugins allow fine-grained control and examination of the compiler, driven externally through shared libraries.

Deployment. Once sufficient training data is gathered, a model is created using machine learning modeling. The model is able to predict good optimization strategies for a given set of program features and is built as a plugin so that it can be re-inserted into MILEPOST GCC. On encountering a new program the plugin determines the program's features, passing them to the model which determines the optimizations to be applied.

Framework. In this paper we use a new version of the Interactive Compilation Interface (ICI) for GCC which controls the internal optimization decisions and their parameters using external plugins. It now allows the complete substitution of default internal optimization heuristics as well as the order of transformations.

We use the Continuous Collective Compilation Framework [2] to produce a training set for machine learning models to learn how to optimize programs for the best performance, code size, power consumption and any other objective function needed by the end-user. This framework allows knowledge of the optimization space to be reused among different programs, architectures and data sets.

Together with additional routines needed for machine learning, such as program feature extraction, this forms MILEPOST GCC. MILEPOST GCC transforms the compiler suite into a powerful research tool suitable for adaptive computing.

The next section describes the new ICI structure and explains how program features can be extracted for later machine learning in Section 4.
[Figure 1 (diagram): during training, Program1..ProgramN are compiled by MILEPOST GCC (with ICI and ML routines); IC plugins record pass sequences and extract static program features, which the CCC Framework drivers use for iterative compilation and model training. During deployment, MILEPOST GCC extracts the static features of a new program and selects "good" passes predicted to improve execution time, code size and compilation time.]

Figure 1: Framework to automatically tune programs and improve default optimization heuristics using machine learning techniques, MILEPOST GCC with Interactive Compilation Interface (ICI) and program features extractor, and Continuous Collective Compilation Framework to train ML model and predict good optimization passes

3 Interactive Compilation Interface

This section describes the Interactive Compilation Interface (ICI). The ICI provides opportunities for external control and examination of the compiler. Optimization settings at a fine-grained level, beyond the capabilities of command line options or pragmas, can be managed through external shared libraries, leaving the compiler uncluttered.

The first version of ICI [21] was reactive and required minimal changes to GCC. It was, however, unable to modify the order of optimization passes within the compiler and so large opportunities for speedup were closed to it. The new version of ICI [9] expands on the capabilities of its predecessor, permitting the pass order to be modified. This version of ICI is used in MILEPOST GCC to automatically learn good sequences of optimization passes. In replacing default optimization heuristics, execution time, code size and compilation time can be improved.

3.1 Internal structure

To avoid the drawbacks of the first version of the ICI, we designed a new version, as shown in Figure 2. This version can now transparently monitor execution of passes or replace the GCC Controller (Pass Manager), if desired. Passes can be selected by an external plugin which may choose to drive them in a very different order to that currently used in GCC, even choosing different pass orderings for each and every function in the program being compiled. Furthermore, the plugin can provide its own passes, implemented entirely outside of GCC.

In an additional set of enhancements, a coherent event and data passing mechanism enables external plugins to discover the state of the compiler and to be informed as it changes. At various points in the compilation process events (IC Event) are raised indicating decisions about transformations. Auxiliary data (IC Data) is registered if needed.

Since plugins now extend GCC through external shared libraries, experiments can be built with no further modifications to the underlying compiler. Modifications for different analysis, optimization and monitoring scenarios proceed in a tight engineering environment. These plugins communicate with external drivers and can allow both high-level scripting and communication with machine learning frameworks such as MILEPOST GCC.
[Figure 2 (diagram): a) original GCC, where the GCC Controller (Pass Manager) detects the optimization flags and drives Pass 1..Pass N over the GCC Data Layer (AST, CFG, etc.); b) GCC with ICI, where IC Events and IC Data expose the controller and data layer to dynamically linked plugin libraries that select pass sequences and extract static program features, driven by high-level scripting (Java, Python, etc.) and by the CCC Framework with its ML drivers to optimize programs and tune compiler optimization heuristics.]

Figure 2: GCC Interactive Compilation Interface: a) original GCC, b) GCC with ICI and plugins

Note that it is not the goal of this project to develop a fully fledged plugin system. Rather, we show the utility of such approaches for iterative compilation and machine learning in compilers. We may later utilize GCC plugin systems currently in development, for example [7] and [13].

Figure 3 shows some of the modifications needed to enable ICI in GCC with an example of a passive plugin to monitor executed passes. The plugin is invoked by the new -fici GCC flag or by setting the ICI_USE environment variable to 1 (to enable non-intrusive optimizations without changes to Makefiles). When GCC detects these options, it loads a plugin (dynamic library) with a name specified by the ICI_PLUGIN environment variable and checks for two functions, start and stop, as shown in Figure 3a.

The start function of the example plugin registers an event handler function executed_pass on an IC-Event called pass_execution.

Figure 3c shows simple modifications in the GCC Controller (Pass Manager) to enable monitoring of executed passes. When the GCC Controller function execute_one_pass is invoked, we register an IC-Parameter called pass_name giving the real name of the executed pass and trigger an IC-Event pass_execution. This in turn invokes the plugin function executed_pass, where we can obtain the current name of the compiled function using ici_get_feature("function_name") and the pass name using ici_get_parameter("pass_name").

IC-Features provide read-only data about the compilation state. IC-Parameters, on the other hand, can be dynamically changed by plugins to change the subsequent behavior of the compiler. Such behavior modification is demonstrated in the next subsection using an example with an avoid_gate parameter needed for dynamic pass manipulation.

Since we use the name field from the GCC pass structure to identify passes, we have had to ensure that each pass has a unique name. Previously, some passes had no name at all, and we suggest that in the future a good development practice of always having unique names would be sensible.
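The overall shape of such a passive plugin can be sketched in C as follows. Since the Figure 3 listings are not legible in this copy, this is only an illustration reconstructed from the description above: start, stop, executed_pass, the pass_execution event, ici_get_feature and ici_get_parameter are the names used in the text, while the header name ici.h, the registration call ici_register_event and the string-valued return types are assumptions.

/* Sketch of a passive ICI plugin that monitors executed passes.
   Illustrative only: "ici.h" and ici_register_event are assumed names;
   the documented entry points are start()/stop(), and the handler is
   attached to the "pass_execution" IC-Event.  */
#include <stdio.h>
#include "ici.h"                        /* assumed ICI plugin header */

/* Called each time the GCC Controller raises the "pass_execution"
   IC-Event from execute_one_pass.  */
static void executed_pass (void)
{
  const char *function = ici_get_feature ("function_name");
  const char *pass     = ici_get_parameter ("pass_name");

  fprintf (stderr, "function %s: executed pass %s\n",
           function ? function : "?", pass ? pass : "?");
}

/* GCC looks up these two symbols after loading the shared library named
   by ICI_PLUGIN when -fici (or ICI_USE=1) is given.  */
void start (void)
{
  ici_register_event ("pass_execution", executed_pass);   /* assumed API */
}

void stop (void)
{
  /* Nothing to clean up in this sketch.  */
}

Built as a shared library, such a plugin would be activated along the lines of ICI_PLUGIN=./monitor-passes.so gcc -fici -O2 foo.c, or with ICI_USE=1 set in the environment when Makefiles cannot be changed.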
[Figure 3 (code listings, not legible in this copy)]

Figure 3: Some GCC modifications to enable ICI and an example of a plugin to monitor executed passes: a) IC Framework within GCC, b) IC Plugin to monitor executed passes, c) GCC Controller (pass manager) modification

3.2 Dynamic Manipulation of GCC Passes

Previous research shows a great potential to improve program execution time or reduce code size by carefully selecting global compiler flags or transformation parameters using iterative compilation. The quality of generated code can also be improved by selecting different optimization orders as shown in [16, 15, 17, 26]. Our approach combines the selection of optimal optimization orders and the tuning of transformation parameters at the same time.

The new version of ICI enables arbitrary selection of legal optimization passes and has a mechanism to change parameters or transformations within passes. Since GCC currently does not provide enough information about dependencies between passes to detect legal orders, and the optimization space is too large to check all possible combinations, we focused on detecting influential passes and legal orders of optimizations. We examined the pass orders generated by compiler flags that improved program execution time or code size using iterative compilation.

Before we attempt to learn good optimization settings and pass orders, we first confirmed that there is indeed performance to be gained within GCC from such actions, otherwise there is no point in trying to learn. By using the Continuous Collective Compilation Framework [2] to randomly search through the optimization flag space (50% probability of selecting each optimization flag) with MILEPOST GCC 4.2.2 on an AMD Athlon64 3700+ and an Intel Xeon 2800MHz, we could improve the execution time of susan_corners by around 16%, compile time by 22% and code size by 13% using Pareto optimal points as described in previous work [24, 25]. Note that the same combination of flags degrades the execution time of this benchmark on an Itanium-2 1.3GHz by 80%, thus demonstrating the importance of adapting compilers to each new architecture. Figure 4a shows the combination of flags found for this benchmark on the AMD platform, while Figures 4b,c show the passes invoked and monitored by MILEPOST GCC for the default -O3 level and for the best combination of flags respectively.

Given that there is good performance to be gained by searching for good compiler flags, we now wish to automatically select good optimization passes and transformation parameters. These should enable fine-grained program and compiler tuning as well as non-intrusive continuous program optimizations without modifications to Makefiles, etc.
[Figure 4 (flag and pass listings, not legible in this copy)]

Figure 4: a) Selection of compiler flags found using CCC Framework with uniform random search strategy that improve execution and compilation time for susan_corners benchmark over -O3, b) recorded compiler passes for -O3 using ICI, c) recorded compiler passes for the good selection of flags (a)

The current version of ICI allows passes to be called directly using the ici_run_pass function, which in turn invokes the GCC function execute_one_pass. Therefore, we can circumvent the default GCC Pass Manager and execute good sequences of passes previously found by the CCC Framework, as shown in Figure 2b, or search for new good orders of optimizations. However, we leave the determination of their interaction and dependencies for future work.

To verify that we can change the default optimization pass orders using ICI, we recompiled the same benchmark with the -O3 flag but selecting the passes shown in Figure 4c. However, note that the GCC internal function execute_one_pass shown in Figure 3c has gate control (pass->gate()) to execute the pass only if the associated optimization flag is selected. To avoid this gate control we use the IC-Parameter gate_status and the IC-Event avoid_gate so that we can set gate_status to TRUE within plugins and thus force its execution. The execution of the generated binary shows that we improve its execution time by 13% instead of 16%; the reason is that some compiler flags not only invoke an associated pass, such as -funroll-loops, but also select specific fine-grain transformation parameters and influence code generation in other passes. Thus, at this point we recompile programs with such flags always enabled, and in the future plan to add support for such cases explicitly.
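A plugin fragment using this mechanism might look like the sketch below. Only gate_status, avoid_gate, ici_run_pass and execute_one_pass are names taken from the text; ici_register_event, ici_set_parameter, the signature of ici_run_pass and the pass names in the replayed sequence are illustrative assumptions.

/* Sketch: bypass the pass gate in execute_one_pass and replay a pass
   sequence previously recorded by the CCC Framework.  Registration and
   setter calls, as well as the pass names, are assumed for illustration. */
#include "ici.h"                         /* assumed ICI plugin header */

/* Handler for the avoid_gate IC-Event raised from execute_one_pass:
   forcing the gate_status IC-Parameter to TRUE runs the pass even when
   its associated optimization flag was not selected.  */
static void on_avoid_gate (void)
{
  ici_set_parameter ("gate_status", "TRUE");      /* assumed setter */
}

/* A "good" sequence found during training (pass names illustrative).  */
static const char *good_sequence[] = { "fre", "copyprop", "dce", "unroll", NULL };

void start (void)
{
  int i;

  ici_register_event ("avoid_gate", on_avoid_gate);    /* assumed API */
  for (i = 0; good_sequence[i]; i++)
    ici_run_pass (good_sequence[i]);    /* in turn calls execute_one_pass */
}

void stop (void)
{
}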
3.3 Adding program feature extractor pass

Our machine learnt model predicts the best GCC optimization to apply to an input program based on its program structure or program features. The program features are typically a summary of the internal program representation and characterize the essential aspects of a program needed by the model to distinguish between good and bad optimizations.

The current version of ICI allows invoking auxiliary passes that are not a part of the default GCC compiler. These passes can monitor and profile the compilation process or extract data structures needed to generate program features.

During the compilation, the program is represented by several data structures, implementing the intermediate representation (tree-SSA, RTL, etc.), the control flow graph (CFG), def-use chains, the loop hierarchy, etc. The data structures available depend on the compilation pass currently being performed. For statistical machine learning, the information about these data structures is encoded as a constant-size vector of numbers (i.e. features); this process is called feature extraction and is needed to enable optimization knowledge reuse among different programs.

Therefore, we implemented an additional GCC pass, ml-feat, to extract static program features. This pass is not invoked during default compilation but can be called using an extract_program_static_features plugin after any arbitrary pass starting from FRE, when all the GCC data necessary to produce features is ready.

In MILEPOST GCC, the feature extraction is performed in two stages. In the first stage, a relational representation of the program is extracted; in the second stage, the vector of features is computed from this representation.

In the first stage, the program is considered to be characterized by a number of entities and relations over these entities. The entities are a direct mapping of similar entities defined by the language reference, or generated during the compilation. Examples of such entities are variables, types, instructions, basic blocks, temporary variables, etc.

A relation over a set of entities is a subset of their Cartesian product. The relations specify properties of the entities or the connections between them. For describing the relations we used a notation based on logic: Datalog is a Prolog-like language but with a simpler semantics, suitable for expressing relations and operations between them [35, 34].

For extracting the relational representation of the program, we used a simple method based on the examination of the include files. The compiler's main data structures are struct data types, having a number of fields. Each such struct data type may introduce an entity, and its fields may introduce relations over the entity representing the including struct data type and the entity representing the data type of the field. This data is collected using the ml-feat pass.

In the second stage, we provide a Prolog program defining the features to be computed from the Datalog relations extracted from the compiler internal data structures in the first stage. The extract_program_static_features plugin invokes a Prolog compiler to execute this program, the result being the vector of features shown in Table 1 that can later be used by the Continuous Collective Compilation Framework to build machine learning models and predict the best sequences of passes for new programs. This example shows the flexibility and capabilities of the new version of ICI.
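As a concrete illustration of the first stage, consider a simplified, GCC-style data structure (not the exact GCC definition); the comments indicate the entity and relations it would contribute to the relational representation consumed by the Prolog feature program.

/* Simplified illustration of how a struct data type found in the
   compiler headers maps to entities and relations in the first
   extraction stage.  Not the exact GCC definition.  */
struct edge_def
{
  struct basic_block_def *src;   /* field -> relation edge_src(Edge, BasicBlock)  */
  struct basic_block_def *dest;  /* field -> relation edge_dest(Edge, BasicBlock) */
};

/* The struct itself introduces the entity "edge"; each field introduces a
   relation between that entity and the entity of the field's data type
   (here, basic blocks).  The ml-feat pass collects such relations as
   facts, and the Prolog program invoked by the
   extract_program_static_features plugin computes the feature vector of
   Table 1 from them (e.g. counts of basic blocks, CFG edges or
   instructions per function).  */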
4 Using Machine Learning to Select Good Optimization Passes

The previous sections have described the infrastructure necessary to build a learning compiler. In this section we describe how this infrastructure is used in building a model.

Our approach to selecting good passes for programs is based upon the construction of a probabilistic model on a set of M training programs and the use of this model in order to make predictions of "good" optimization passes on unseen programs.

Our specific machine learning method is similar to that of [10] where a probability distribution over "good" solutions (i.e. optimization passes or compiler flags) is learnt across different programs. This approach has been referred to in the literature as Predictive Search Distributions (PSD) [12]. However, unlike [10, 12] where such a distribution is used to focus the search of compiler optimizations on a new program, we use the learned distribution to make one-shot predictions on unseen programs. Thus we do not search for the best optimization, we automatically predict it.

4.1 The Machine Learning Model

Given a set of training programs T_1, ..., T_M, which can be described by (vectors of) features t_1, ..., t_M, and for which we have evaluated different sequences of optimization passes (x) and their corresponding execution times (or speed-ups y), so that we have for each program T_j an associated dataset D_j = {(x_i, y_i)}_{i=1}^{N_j}, with j = 1, ..., M, our goal is to predict a good sequence of optimization passes x* when a new program T* is presented.

We approach this problem by learning the mapping from the features of a program t to a distribution over good solutions q(x | t, θ), where θ are the parameters of the distribution. Once this distribution has been learnt, prediction on a new program T* is straightforward: it is achieved by sampling at the mode of the distribution. In other words, we obtain the predicted sequence of passes by computing

    x* = argmax_x q(x | t, θ).    (1)

4.2 Continuous Collective Compilation Framework

We used the Continuous Collective Compilation Framework [2] and MILEPOST GCC, shown in Figure 1, to generate a training set of programs together with compiler flags selected uniformly at random, the associated sequences of passes, program features and speedups (as well as code size and compilation time), which is stored in the externally accessible Global Optimization Database. We use this training set to build the machine learning model described in the next section, which in turn is used to predict the best sequence of passes for a new program given its feature vector. The current version of the CCC Framework requires minimal changes to the Makefile to pass optimization flags or sequences of passes, and has support to verify the correctness of the binary by comparing program output with the reference one to avoid illegal combinations of optimizations.

4.3 Learning and Predicting

In order to learn the model it is necessary to fit a distribution over good solutions to each training program beforehand. These solutions can be obtained, for example, by using uniform sampling or by running an estimation of distribution algorithm (EDA, see [27] for an overview) on each of the training programs. In our experiments we use uniform sampling and we choose the set of good solutions to be those optimization settings that achieve at least 98% of the maximum speed-up available in the corresponding program-dependent dataset.

Let us denote the distribution over good solutions on each training program by P(x | T_j) with j = 1, ..., M. In principle, these distributions can belong to any parametric family. However, in our experiments we use an IID model where each of the elements of the sequence is considered independently. In other words, the probability of a "good" sequence of passes is simply the product of the individual probabilities corresponding to how likely each pass is to belong to a good solution:

    P(x | T_j) = ∏_{l=1}^{L} P(x_l | T_j),    (2)

where L is the length of the sequence.

As proposed in [12], once the individual training distributions P(x | T_j) have been obtained, the predictive distribution q(x | t, θ) can be learnt by maximization of the conditional likelihood or by using k-nearest-neighbor methods. In our experiments we use a 1-nearest-neighbor approach. In other words, we set the predictive distribution q(x | t, θ) to be the distribution corresponding to the training program that is closest in feature space to the new (test) program.

Note that we currently predict "good" sequences of optimization passes that are associated with the best combination of compiler flags. We will investigate in future work the application of our models to the general problem of determining the "optimal" order of optimization passes for programs.
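A self-contained toy version of this prediction step, with made-up feature vectors and per-pass probabilities standing in for the real training data, is sketched below: it picks the training program nearest in feature space (the 1-nearest-neighbor step) and then samples at the mode of the IID model by emitting every pass whose probability of belonging to a good solution exceeds 0.5.

/* Toy sketch of the prediction step: 1-nearest-neighbor in feature
   space, then sample at the mode of the IID per-pass distribution.
   All data below is made up for illustration.  */
#include <stdio.h>

#define M 3   /* training programs      */
#define F 4   /* features per program   */
#define L 5   /* passes in a sequence   */

static const double train_features[M][F] = {
  { 12, 40, 3, 7 }, { 80, 10, 9, 2 }, { 33, 25, 5, 5 }
};

/* Estimated P(pass_l belongs to a good solution | T_j), taken from the
   runs that achieved at least 98% of the best speed-up for program j.  */
static const double pass_prob[M][L] = {
  { 0.9, 0.2, 0.7, 0.1, 0.8 },
  { 0.1, 0.8, 0.6, 0.9, 0.3 },
  { 0.5, 0.5, 0.9, 0.4, 0.6 }
};

static const char *pass_name[L] = { "fre", "copyprop", "dce", "unroll", "gcse" };

static double dist2 (const double *a, const double *b)
{
  double d = 0.0;
  int k;

  for (k = 0; k < F; k++)
    d += (a[k] - b[k]) * (a[k] - b[k]);
  return d;
}

int main (void)
{
  const double t[F] = { 30, 28, 4, 6 };   /* features of the new program */
  int j, best = 0, l;

  for (j = 1; j < M; j++)                 /* 1-nearest neighbor */
    if (dist2 (t, train_features[j]) < dist2 (t, train_features[best]))
      best = j;

  printf ("predicted passes:");
  for (l = 0; l < L; l++)                 /* mode of the IID model */
    if (pass_prob[best][l] > 0.5)
      printf (" %s", pass_name[l]);
  printf ("\n");
  return 0;
}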
In our • IA64 – a server with Itanium2 processor running at experiments we use uniform sampling and we choose 1.3GHz the set of good solutions to be those optimization set- tings that achieve at least 98% of the maximum speed- • ARC – FPGA implementation of the ARC 725D up available in the corresponding program-dependent processor running GNU/Linux with a 2.4.29 ker- dataset. nel. 8
[Figure 5 (bar charts of speedup per MiBench benchmark): a) after iterative compilation on AMD, IA32 and IA64; b) iterative compilation versus optimization passes predicted using ML and MILEPOST GCC on ARC, including the per-benchmark average]

Figure 5: Experimental results when using iterative compilation with random search strategy (500 iterations; 50% probability to select each flag; AMD, IA32, IA64) and when predicting best optimization passes based on program features and ML model (ARC)

In all cases the compiler used is MILEPOST GCC 4.2.x with ICI version 0.9.6. We decided to use the open-source MiBench benchmark suite with MiDataSets [20, 6] (dataset No. 1 in all cases) due to its applicability to both general purpose and embedded domains.

5.1 Generating Training Data

In order to build a machine learning model, we need training data. This was generated by a random exploration of a vast optimization search space using the CCC Framework. It generated 500 random sequences of flags, each either turned on or off. These flag settings can be readily associated with different sequences of optimization passes. Although such a number of runs is very small with respect to the optimisation space, we have shown that sufficient information can be gleaned from this to allow significant speedup. Indeed, the size of the space left unexplored serves to highlight our lack of knowledge in this area, and the need for further work.

Firstly, features for each benchmark were extracted from the programs using the new pass within MILEPOST GCC, and these features were then sent to the Global Optimization Database within the CCC Framework. An ML model for each benchmark was built using the execution time gathered from 500 separate runs with different random sequences of passes and a fixed data set. Each run was repeated 5 times so that speedups were not caused by cache priming etc. After each run, the experimental results, including execution time, compilation time, code size and program features, are sent to the database, where they are stored for future reference.
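One training point of this random exploration can be sketched in a few lines of C; the candidate flag list below is illustrative only (the real CCC drivers work over the full set of GCC optimization flags), and the -O3 baseline is just an example.

/* Toy sketch of training-data generation: each candidate flag is turned
   on with 50% probability, as in the random search strategy above.  The
   flag list and the -O3 baseline are illustrative only.  */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static const char *flags[] = {
  "-funroll-loops", "-fgcse", "-ftree-vectorize",
  "-finline-functions", "-fpeel-loops"
};

int main (void)
{
  int i, n = (int) (sizeof flags / sizeof flags[0]);

  srand ((unsigned) time (NULL));
  printf ("gcc -O3");
  for (i = 0; i < n; i++)
    if (rand () % 2)                     /* 50% probability per flag */
      printf (" %s", flags[i]);
  printf (" benchmark.c\n");             /* handed to the CCC build/run driver */
  return 0;
}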
Figure 5a shows that considerable speedups can already be obtained after iterative compilation on all platforms. However, this is a time-consuming process, and the different speedups across different platforms motivate the use of machine learning to automatically build specialized compilers and predict the best optimization flags or sequences of passes for different architectures.

5.2 Evaluating Model Performance

Once a model has been built for each of our benchmarks, we can evaluate the results by introducing a new program to the system and measuring how well the prediction provided by our model performs. In this work we did not use a separate testing benchmark suite due to time constraints, so leave-one-out cross-validation was used instead. Using this method, all training data relating to the benchmark being tested is excluded from the training process, and the models are rebuilt. When a second benchmark is tested, the training data pertaining to the first benchmark is returned to the training set, that of the second benchmark excluded, and so on. In this way we ensure that each benchmark is tested as a new program entering the system for the first time; of course, in real-world usage, this process is unnecessary.

When a new program is compiled, features are first generated using MILEPOST GCC. These features are then sent to our ML model within the CCC Framework (implemented as a MATLAB server), which processes them and returns a predicted sequence of passes which should improve execution time, reduce code size, or both. We then evaluate the prediction by compiling the program with the suggested sequence of passes, measuring the execution time and comparing it with the original time for the default '-O3' optimization level. It is important to note that only one compilation occurs at evaluation; there is no search involved. Figure 5b shows these results for the ARC725D. It demonstrates that, except for a few pathological cases where the predicted flags degraded performance (whose analysis we leave for future work), using the CCC Framework, MILEPOST GCC and machine learning models we can improve on the original ARC GCC by around 11%.

This suggests that our techniques and tools can be efficient for building future iterative adaptive specialized compilers.
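In code, the leave-one-out protocol is a small variation of the nearest-neighbor search used in the earlier prediction sketch: when a benchmark is evaluated, its own training data is simply skipped. A minimal sketch (reusing F and dist2 from that example) is:

/* Sketch of leave-one-out cross-validation: when predicting for the
   held-out benchmark, exclude its own training data from the
   nearest-neighbor search.  Reuses F and dist2() from the earlier
   prediction sketch.  */
static int nearest_excluding (const double *t, const double train[][F],
                              int m, int exclude)
{
  int j, best = -1;

  for (j = 0; j < m; j++)
    {
      if (j == exclude)
        continue;                 /* skip the benchmark under test */
      if (best < 0 || dist2 (t, train[j]) < dist2 (t, train[best]))
        best = j;
    }
  return best;                    /* training program whose model is reused */
}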
6 Conclusions and Future Work

In this paper we have shown that MILEPOST GCC has significant potential in the automatic tuning of GCC optimization. We plan to use these techniques and tools to further investigate the automatic selection of optimal orders of optimization passes and fine-grain tuning of transformation parameters. The overall framework will also allow the analysis of interactions between optimizations and investigation of the influence of program inputs and run-time state on program optimizations. Future work will also include fine-grain run-time adaptation for multiple program inputs on heterogeneous multi-core architectures.

7 Acknowledgments

The authors would like to thank Sebastian Pop for the help with modifying and debugging GCC and John W. Lockhart for the help with editing this paper. This research is supported by the European Project MILEPOST (MachIne Learning for Embedded PrOgramS opTimization) [4] and by the HiPEAC European Network of Excellence (High-Performance Embedded Architecture and Compilation) [5].

References

[1] ACOVEA: Using Natural Selection to Investigate Software Complexities. http://www.coyotegulch.com/products/acovea.
[2] Continuous Collective Compilation Framework. http://cccpf.sourceforge.net.
[3] ESTO: Expert System for Tuning Optimizations. http://www.haifa.ibm.com/projects/systems/cot/esto/index.html.
[4] EU MILEPOST project (MachIne Learning for Embedded PrOgramS opTimization). http://www.milepost.eu.
[5] European Network of Excellence on High-Performance Embedded Architecture and Compilation (HiPEAC). http://www.hipeac.net.
[6] MiDataSets SourceForge development site. http://midatasets.sourceforge.net.
[7] Plugin project. http://libplugin.sourceforge.net.
[8] QLogic PathScale EKOPath Compilers. http://www.pathscale.com.
[9] GCC Interactive Compilation Interface. http://gcc-ici.sourceforge.net, 2006.
[10] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. O'Boyle, J. Thomson, M. Toussaint, and C. Williams. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2006.
[11] F. Bodin, T. Kisuki, P. Knijnenburg, M. O'Boyle, and E. Rohou. Iterative compilation in a non-linear optimisation space. In Proceedings of the Workshop on Profile and Feedback Directed Compilation, 1998.
[12] E. V. Bonilla, C. K. I. Williams, F. V. Agakov, J. Cavazos, J. Thomson, and M. F. P. O'Boyle. Predictive search distributions. In W. W. Cohen and A. Moore, editors, Proceedings of the 23rd International Conference on Machine Learning, pages 121–128, New York, NY, USA, 2006. ACM.
[13] S. Callanan, D. J. Dean, and E. Zadok. Extending GCC with modular GIMPLE optimizations. In Proceedings of the GCC Developers' Summit, 2007.
[14] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. O'Boyle, and O. Temam. Rapidly selecting good compiler optimizations using performance counters. In Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO), March 2007.
[15] K. Cooper, A. Grosul, T. Harvey, S. Reeves, D. Subramanian, L. Torczon, and T. Waterman. ACME: adaptive compilation made efficient. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2005.
[16] K. Cooper, P. Schielke, and D. Subramanian. Optimizing for reduced code space using genetic algorithms. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 1–9, 1999.
[17] B. Elliston. Studying optimisation sequences in the GNU Compiler Collection. Technical Report ZITE8199, School of Information Technology and Electrical Engineering, Australian Defence Force Academy, 2005.
[18] B. Franke, M. O'Boyle, J. Thomson, and G. Fursin. Probabilistic source-level optimisation of embedded programs. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2005.
[19] G. Fursin. Iterative Compilation and Performance Prediction for Numerical Applications. PhD thesis, University of Edinburgh, United Kingdom, 2004.
[20] G. Fursin, J. Cavazos, M. O'Boyle, and O. Temam. MiDataSets: Creating the conditions for a more realistic evaluation of iterative optimization. In Proceedings of the International Conference on High Performance Embedded Architectures and Compilers (HiPEAC 2007), January 2007.
[21] G. Fursin and A. Cohen. Building a practical iterative interactive compiler. In 1st Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation (SMART'07), colocated with the HiPEAC 2007 conference, January 2007.
[22] G. Fursin, A. Cohen, M. O'Boyle, and O. Temam. A practical method for quickly evaluating program optimizations. In Proceedings of the International Conference on High Performance Embedded Architectures and Compilers (HiPEAC 2005), pages 29–46, November 2005.
[23] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, December 2001.
[24] K. Heydemann and F. Bodin. Iterative compilation for two antagonistic criteria: Application to code size and performance. In Proceedings of the 4th Workshop on Optimizations for DSP and Embedded Systems, colocated with CGO, 2006.
[25] K. Hoste and L. Eeckhout. COLE: Compiler optimization level exploration. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2008.
[26] P. Kulkarni, W. Zhao, H. Moon, K. Cho, D. Whalley, J. Davidson, M. Bailey, Y. Paek, and K. Gallivan. Finding effective optimization phase sequences. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 12–23, 2003.
[27] P. Larrañaga and J. A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, Norwell, MA, USA, 2001.
[28] F. Matteo and S. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 1381–1384, Seattle, WA, May 1998.
[29] A. Monsifrot, F. Bodin, and R. Quiniou. A machine learning approach to automatic production of compiler heuristics. In Proceedings of the International Conference on Artificial Intelligence: Methodology, Systems, Applications, LNCS 2443, pages 41–50, 2002.
[30] Z. Pan and R. Eigenmann. Fast and effective orchestration of compiler optimizations for automatic performance tuning. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 319–332, 2006.
[31] B. Singer and M. Veloso. Learning to predict performance from formula modeling and training data. In Proceedings of the Conference on Machine Learning, 2000.
[32] M. Stephenson and S. Amarasinghe. Predicting unroll factors using supervised classification. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 123–134, 2005.
[33] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. August. Compiler optimization-space exploration. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 204–215, 2003.
[34] J. Ullman. Principles of Database and Knowledge Systems, volume 1. Computer Science Press, 1988.
[35] J. Whaley and M. S. Lam. Cloning based context sensitive pointer alias analysis using binary decision diagrams. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 2004.
[36] R. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proc. Alliance, 1998.
