Static Neural Compiler Optimization via Deep Reinforcement Learning
Rahim Mammadli1,2, Ali Jannesari3, Felix Wolf1,2
1 Graduate School of Computational Engineering, TU Darmstadt
2 Laboratory for Parallel Programming, TU Darmstadt
3 Software Analytics and Pervasive Parallelism Lab, Iowa State University
Agenda
• Phase-ordering problem
• Motivation to use reinforcement learning
• Approach
• Evaluation
• Conclusion
Phase-ordering problem
• Which pass to run?
• How to set parameters?
[Diagram: a pipeline of passes (Pass1, Pass2, Pass3, Pass4), each with tunable parameters (Parameter1, Parameter2, Parameter3)]
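To make this concrete, the minimal sketch below runs the same IR file through LLVM's `opt` with the same passes in two different orders; the file names and pass selection are illustrative, and the legacy pass-manager flag syntax of older LLVM releases (as used in this work) is assumed.

```python
# Sketch: apply two different orderings of the same passes to one IR file.
# File names and pass choices are placeholders; legacy `opt` flag syntax assumed.
import subprocess

def run_passes(ir_in, passes, ir_out):
    # Run `opt`, applying the given passes in the given order, and emit textual IR.
    cmd = ["opt", *[f"-{p}" for p in passes], ir_in, "-S", "-o", ir_out]
    subprocess.run(cmd, check=True)

# Same passes, different order: the resulting IR (and its runtime) generally differs.
run_passes("input_ir.ll", ["inline", "loop-unroll", "gvn"], "order_a.ll")
run_passes("input_ir.ll", ["gvn", "loop-unroll", "inline"], "order_b.ll")
```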
Challenges (1/2)
• Large number of optimizations
• 56 transformation passes in LLVM 10
• More when considering parameters
• Example: the loop-unroll pass alone accepts 21 parameters (see below)
• Effectiveness of optimizations depends on:
• Code
• Hardware
• Problem size
Example loop-unroll parameters: unroll-threshold, unroll-count, unroll-max-count, unroll-runtime, …
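These parameters are exposed as `opt` command-line flags; a small sketch below (flag values chosen arbitrarily) shows how a single pass already spans a large configuration space.

```python
# Sketch: tuning a few of the loop-unroll flags; the values here are arbitrary.
import subprocess

unroll_flags = [
    "-unroll-threshold=300",
    "-unroll-count=4",
    "-unroll-max-count=8",
    "-unroll-runtime",
]
subprocess.run(
    ["opt", "-loop-unroll", *unroll_flags, "input_ir.ll", "-S", "-o", "unrolled.ll"],
    check=True,
)
```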
Challenges (2/2)
• A trade-off must be struck between:
• compilation time
• program size
• performance
Current approach in compilers
• Predefined optimization sequences
• -O0, -O1, -O2, -O3, -Ofast, -Os, -Oz, -Og, -O, -O4
• Disadvantages:
• Rigid structure (pre-defined pass ordering)
• Input sensitivity[1] (imperfect cost function)
• Maintenance effort
Motivation for reinforcement learning
• The IR resembles a state
• Passes resemble actions
• The OS + LLVM optimizer resemble the environment
• Rewards come directly from measured runtime (see the sketch below)
[Diagram: IR states with measured runtimes (IR0: 100 ms, IR1: 25 ms, IR2: 200 ms), connected by pass applications (Pass1, Pass2)]
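A minimal sketch of this mapping as an environment interface; the class and helper names are assumptions made for illustration, not the paper's implementation.

```python
# Sketch of the RL view of phase ordering (names and structure are illustrative).
import math
import os
import subprocess

class PhaseOrderingEnv:
    def __init__(self, ir_file, passes, measure_runtime):
        self.ir_file = ir_file          # state: the IR after the passes applied so far
        self.passes = passes            # action space: list of LLVM pass names
        self.measure = measure_runtime  # callable: compiles the IR and returns runtime in seconds

    def step(self, action):
        """Apply one pass (the action) and reward the resulting runtime improvement."""
        t_before = self.measure(self.ir_file)
        tmp = self.ir_file + ".next"
        subprocess.run(["opt", f"-{self.passes[action]}", self.ir_file, "-S", "-o", tmp],
                       check=True)
        os.replace(tmp, self.ir_file)   # the transformed IR becomes the next state
        t_after = self.measure(self.ir_file)
        reward = math.log(t_before / t_after)  # positive if the pass made the program faster
        return self.ir_file, reward
```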
Approach
[Flowchart: starting from input_ir.ll, the agent predicts a value for each action (e.g. action 1: +0.3, action 2: +1.5, action 3: -2.3, action 4: -0.9); if the maximum predicted value is > 0, the LLVM optimizer applies the corresponding pass and the chosen action is recorded in the history together with its observed reward (e.g. +1.52); otherwise the loop ends]
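Read as pseudocode, the loop in the diagram could look roughly as follows; `predict_values`, `apply_pass`, and `measure_reward` are assumed helpers, not part of the paper.

```python
# Sketch of the greedy optimization loop from the diagram: keep applying the action
# with the highest predicted value while that value is positive.
def optimize(ir_file, predict_values, apply_pass, measure_reward, max_steps=16):
    history = []                                   # recorded (action, reward) pairs
    for _ in range(max_steps):
        values = predict_values(ir_file, history)  # one predicted value per action
        best = max(range(len(values)), key=values.__getitem__)
        if values[best] <= 0:                      # no action is expected to help: stop
            break
        apply_pass(ir_file, best)                  # LLVM optimizer applies the chosen pass
        reward = measure_reward(ir_file)           # e.g. log of the runtime improvement
        history.append((best, reward))
    return history
```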
Problem definition
• Learn the action-value function Q:
• The reward is computed from measured runtimes:
$$Q(s_t, a_t, w) = R(s_t, a_t) + \gamma \max_{a \in A} Q(s_{t+1}, a, w)$$

$$R = \ln \frac{T(s_t)}{T(s_{t+1})}$$
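A small worked example of both formulas; the runtimes are taken from the earlier diagram (100 ms before, 25 ms after), while the next-state Q-values are invented for illustration.

```python
# Sketch: reward from runtimes and the one-step Q-learning target, per the equations above.
import math

def reward(t_before, t_after):
    # R = ln(T(s_t) / T(s_{t+1})): positive when the program got faster.
    return math.log(t_before / t_after)

def td_target(r, next_q_values, gamma=0.9):
    # Target for Q(s_t, a_t): R + gamma * max_a Q(s_{t+1}, a)
    return r + gamma * max(next_q_values)

r = reward(0.100, 0.025)                          # ln(4) ≈ 1.386
print(td_target(r, [0.3, -0.2, 1.1], gamma=0.9))  # ≈ 2.376
```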
State
• IR is encoded using NCC[2] embeddings
• Action history is one-hot encoded
[Diagram: the IR statements (statement-1 … statement-n) are embedded with NCC, the action history (action-1 … action-m) is one-hot encoded, and both feed into the state's vector representation]
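A minimal sketch of assembling the state vector; averaging the NCC statement embeddings and padding the history to a fixed length are assumptions.

```python
# Sketch: state = aggregated NCC statement embeddings + one-hot-encoded action history.
import numpy as np

def encode_state(statement_embeddings, action_history, num_actions, max_history=16):
    # statement_embeddings: (n_statements, embedding_dim) array produced by NCC.
    ir_part = np.asarray(statement_embeddings).mean(axis=0)

    # One-hot encode each past action, padded to a fixed history length.
    hist = np.zeros((max_history, num_actions))
    for i, action in enumerate(action_history[:max_history]):
        hist[i, action] = 1.0

    return np.concatenate([ir_part, hist.ravel()])
```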
Action spaces
Three levels of abstraction:
[Diagram: at the high level, one action applies a whole group of passes (e.g. Action #1 runs Pass #1, Pass #2, Pass #3); at the medium level, each action applies a single pass (Action #1: Pass #1, Action #2: Pass #2, Action #3: Pass #3); at the low level, actions additionally select individual parameter settings (Action #1: Pass #1, Action #2: Parameter #1, Action #3: Parameter #2)]
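One way to picture the three granularities in code; the concrete pass names and groupings below are placeholders, not the action spaces used in the evaluation.

```python
# Sketch of the three action-space granularities; pass names and groupings are illustrative.
high_level = {            # Space H: one action applies a whole group of passes
    0: ["sroa", "early-cse", "instcombine"],
    1: ["inline", "gvn", "licm"],
}
medium_level = {          # Space M: one action = one LLVM pass
    0: "inline",
    1: "loop-unroll",
    2: "gvn",
}
low_level = {             # Space L: actions also pick individual parameter settings
    0: ("loop-unroll", None),
    1: ("loop-unroll", "-unroll-count=4"),
    2: ("loop-unroll", "-unroll-threshold=300"),
}
```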
CORL Framework
[Architecture diagram: CORL is split into a server and clients. The server-side Learner covers exploration, exploitation, evaluation, optimization, and visualization, backed by a Replay Memory, Logs, Source Codes, an SQL DB, and a Visualizer; a Manager distributes tasks asynchronously to client-side Workers, which handle benchmarking, logging, and persistence]
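A rough sketch of the asynchronous task distribution suggested by the diagram; the queue-based protocol and all names are assumptions, not the actual CORL implementation.

```python
# Sketch: a manager hands optimization tasks to workers, which benchmark asynchronously.
import asyncio
import random

async def worker(tasks, results):
    # Client-side worker: take a task, benchmark it, report the result.
    while True:
        ir_file, pass_sequence = await tasks.get()
        runtime = random.uniform(0.01, 0.2)   # placeholder for real compilation + benchmarking
        await results.put((ir_file, pass_sequence, runtime))
        tasks.task_done()

async def main():
    tasks, results = asyncio.Queue(), asyncio.Queue()
    workers = [asyncio.create_task(worker(tasks, results)) for _ in range(4)]
    for seq in (["inline"], ["gvn", "licm"]):  # tasks issued by the server-side learner
        await tasks.put(("input_ir.ll", seq))
    await tasks.join()
    print(results.qsize(), "benchmark results ready for the learner")
    for w in workers:
        w.cancel()

asyncio.run(main())
```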
Experimental Setup
• Dataset
• 109 single-source benchmarks from the LLVM test suite
• 4:1 split between training and validation (sketch below)
• Maximum optimization sequence length: 16
• Workers: Lichtenberg HPC cluster, Phases I and II
• LLVM 3.8
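A minimal sketch of the reported split; only the 4:1 ratio and the benchmark count come from the slide, while the shuffle and seed are assumptions.

```python
# Sketch: 4:1 train/validation split over the 109 single-source benchmarks.
import random

benchmarks = [f"benchmark_{i}" for i in range(109)]  # placeholder names
random.Random(0).shuffle(benchmarks)
cut = len(benchmarks) * 4 // 5
train, validation = benchmarks[:cut], benchmarks[cut:]
print(len(train), len(validation))  # 87 22
```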
Metrics
• Aggregate performance of:
• the model
• the highest observed result
• -O3
• Individual (per-benchmark) performance
Space H
• 8 actions
• ɣ = 0.9
• τ = 400
• Very good fit
Action 0: Passes [0-7]
Action 1: Passes [8-19]
Action 2: Passes [20-29]
Action 3: Passes [30-32]
Action 4: Passes [33-37]
Action 5: Passes [38-54]
Action 6: Passes [55-65]
Action 7: Passes [66-73]
Space H – Aggregate Results
[Plots: aggregate results on the training and validation sets]
             Agent    O3
Training     2.24x    2.17x
Validation   2.38x    2.67x
Space M
• 42 actions
• Action = LLVM Pass
• ɣ = 0.5
• τ = 1000
• Average fit after ~6 days of training
Space M – Aggregate Results
[Plots: aggregate results on the training and validation sets]
Space L
• 187 actions
• 42 passes
• 145 parameter selections
• High bias (underfit)
Space L – Aggregate Results
[Plots: aggregate results on the training and validation sets]
Conclusion
• Phase ordering framed as a reinforcement-learning problem
• Fully automatic approach
• CORL framework
• Agent-optimized code up to 1.32x faster than -O3
Shortcomings / Outlook
• Small dataset
• Quality of embeddings
• More suitable NN architecture (Graph NNs)
• Dependence on a particular system configuration
• Data efficiency
References
[1] Gong, Zhangxiaowen, et al. "An empirical study of the effect of source-level loop transformations on compiler stability." Proceedings of the ACM on Programming Languages 2.OOPSLA (2018): 1-29.
[2] Ben-Nun, Tal, Alice Shoshana Jakobovits, and Torsten Hoefler. "Neural code comprehension: A learnable representation of code semantics." Advances in Neural Information Processing Systems. 2018.
Acknowledgements
This work is supported by the Graduate School CE within
the Centre for Computational Engineering at Technische
Universität Darmstadt and by the Hessian LOEWE initiative
within the Software-Factory 4.0 project. The calculations for
this research were conducted on the Lichtenberg Cluster of
TU Darmstadt.
Thank you!
Questions?