고급컴파일러구성론_개레_230303.pptx

CODE TRANSLATION WITH COMPILER
REPRESENTATIONS
(Accepted as a conference paper at ICLR 2023)
Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, Gabriel Synnaeve
(Meta AI)
Presented by: Gebremedhin G. Maru
Kangwon National University
Programming Language & Machine Learning Lab
March 2, 2023

Presentation Outline
1) Introduction.
2) Intermediate Representations In Compilers
3) Training Objectives.
4) Data.
5) Results.
6) Discussion.
7) Conclusion.
2

1) Introduction
• Automatic code translation makes it possible to port old codebases to new frameworks.
• Limitation of existing neural machine translation (NMT) for programming languages (PL): unreliability.
 It often fails to translate the semantics of the input program accurately.
3

Intuition Behind the Proposed Work
• Leverages information from compiler toolchains (LLVM).
• Uses compilers' intermediate representations (IR).
• An IR is language-agnostic pseudocode that describes the semantics of the program.
• Benefits of IR:
 Helps align embeddings across different programming languages.
 Improves the semantic understanding of the code.
4

Motivational Example
Figure 1: Improvements over TransCoder.
5

Contributions of the Paper
• IR-augmented translation (using LLVM).
 Average improvement of 5.5%.
• Useful in low-data situations.
 E.g., 29.7% and 25.6% improvements when translating to and from Rust.
• Extends the TransCoder test set of 852 functions (Roziere et al., 2020) with 343 Go functions and 280 Rust functions.
• Achieves 78% accuracy when decompiling LLVM IR to C++.
6

2) Intermediate Representations in Compilers
• Compilers translate source code into machine-specific executables (machine code).
Figure 2: Compiler toolchain with LLVM.
7

Why Use an IR?
• It meets the analysis and synthesis requirements of translation.
• It enables machine-independent representations and optimizations.
• Low-resource programming languages can benefit from IR-augmented code representations.
8

3) Training Objectives.
• Unsupervised NMT.
 Learning multilingual sequence embeddings.
 Aligning the embeddings and generating an output from these
embeddings.
• Source sentence $x = x_1 \dots x_{N_{so}}$,
• Corresponding IR $z^{(x)} = z^{(x)}_1 \dots z^{(x)}_{N_{ir}}$,
• Target sentence $y = y_1 \dots y_{N_{ta}}$.
• We define the machine translation (seq2seq) loss function as follows:
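The equation on this slide was an image and did not survive extraction. A standard way to write a sequence-to-sequence negative log-likelihood loss with the symbols defined above is sketched here; the exact form and indexing used on the original slide are an assumption.

```latex
\mathcal{L}_{MT}(x, y) = -\sum_{i=1}^{N_{ta}} \log P\left(y_i \mid x,\; y_1 \dots y_{i-1}\right)
```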
9

3.1 Common Objective Functions
• Masked Language Modeling (MLM):
 Trains an encoder to predict randomly masked inputs.
 Here mask(x) is the masked version of the code sentence x, and enc(t) is the encoder output.
• Denoising Auto-Encoding (AE):
 Recovers the original sequence from a corrupted version.
 Here noise(x) denotes the corrupted version of x.
• Back-Translation (BT):
 Generates a noisy translation of the input sentence, then recovers the original input from that translation.
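The mask(x) and noise(x) operations above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `<mask>` symbol, the whitespace tokenization, and the specific corruption scheme (token dropping plus a local shuffle) are assumptions.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the slides


def mask_tokens(tokens, rate, rng):
    """mask(x): replace a fraction `rate` of the tokens with MASK."""
    n_mask = min(len(tokens), max(1, round(len(tokens) * rate)))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in hidden else t for i, t in enumerate(tokens)]


def add_noise(tokens, drop_rate, rng, window=3):
    """noise(x): drop some tokens, then shuffle the rest locally."""
    kept = [t for t in tokens if rng.random() >= drop_rate] or tokens[:1]
    # Each token moves at most about `window` positions from its place.
    keys = [i + rng.uniform(0, window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]


rng = random.Random(0)
x = "int add ( int a , int b ) { return a + b ; }".split()
masked = mask_tokens(x, 0.15, rng)  # MLM input; the target is x
noisy = add_noise(x, 0.10, rng)     # AE input; the target is x
```

In both cases the training target is the original sequence x; only the input is degraded.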
10

3.2 IR for Code Representations
• The IR adds information about the code to be translated to the training data, through three new objective functions.
• Translation Language Modeling (TLM):
 Generates common representations for parallel sentences in different languages.
• Translation Auto-Encoding (TAE):
 Transposes the TLM objective into a denoising auto-encoder.
• IR Generation (MT):
 Trains the model to translate the source code into the corresponding IR.
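A rough sketch of how (input, target) pairs for the three objectives could be built from a function and its IR. The `<mask>` and `<sep>` tokens, the use of masking as the TAE corruption, and the toy Rust/IR strings are assumptions made for illustration only.

```python
import random

MASK, SEP = "<mask>", "<sep>"  # hypothetical special tokens


def mask_tokens(tokens, rate, rng):
    """Replace a fraction `rate` of the tokens with MASK."""
    n = min(len(tokens), max(1, round(len(tokens) * rate)))
    hidden = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in hidden else t for i, t in enumerate(tokens)]


def tlm_example(src, ir, rng, rate=0.15):
    """TLM: mask the concatenated source and IR; predict the masked tokens."""
    full = src + [SEP] + ir
    return mask_tokens(full, rate, rng), full


def tae_example(src, ir, rng, rate=0.20):
    """TAE: corrupt source and IR, then decode the clean concatenation."""
    corrupted = mask_tokens(src, rate, rng) + [SEP] + mask_tokens(ir, rate, rng)
    return corrupted, src + [SEP] + ir


def ir_generation_example(src, ir):
    """IR generation (MT): the source is the input, its IR is the target."""
    return src, ir


rng = random.Random(0)
src = "fn inc ( x : i32 ) -> i32 { x + 1 }".split()
ir = "define i32 @inc ( i32 %x ) { %r = add i32 %x , 1 ret i32 %r }".split()  # toy IR

tlm_in, tlm_out = tlm_example(src, ir, rng)
tae_in, tae_out = tae_example(src, ir, rng)
mt_in, mt_out = ir_generation_example(src, ir)
```

Because the source and its IR appear in the same sequence for TLM and TAE, the model can use one to fill in the other, which is what encourages shared representations.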
11

Figure 3: IR for code representation objectives.
12

3.3 Additional Losses: IR Decompilation and Pivot
• This study also uses the IR in two alternative ways:
I. IR decompilation:
 Predicts source code from the IR, reversing the compiler's task.
II. IR pivot translation:
 Decompiles the common IR format into the target language.
 Uses a neural decompiler.
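Pivot translation is a composition of two steps: compile the source to IR, then decompile the IR into the target language. The sketch below shows only that composition; `compile_to_ir` and `decompile_ir` are hypothetical callables, and the toy lookup tables stand in for a real compiler and a trained neural decompiler.

```python
from typing import Callable


def pivot_translate(source: str,
                    compile_to_ir: Callable[[str], str],
                    decompile_ir: Callable[[str], str]) -> str:
    """Translate by going through the IR: source -> IR -> target."""
    ir = compile_to_ir(source)
    return decompile_ir(ir)


# Toy stand-ins (made-up strings, not real compiler output).
TOY_COMPILER = {"return a + b;": "%r = add i32 %a, %b"}
TOY_DECOMPILER = {"%r = add i32 %a, %b": "a + b"}


def toy_compile(source: str) -> str:
    return TOY_COMPILER[source]


def toy_decompile(ir: str) -> str:
    return TOY_DECOMPILER[ir]


translated = pivot_translate("return a + b;", toy_compile, toy_decompile)
```

Note that the target side never sees the source text, only the IR, which is why this method relies purely on IR-level information.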
13

4) Data.
4.1 Training Data.
• Google BigQuery
 Indexes over 2.8 million open-source repositories from GitHub.
 Extracted all individual C++, Java, Rust and Go functions.
• CodeNet dataset
 Repository of 14 million competitive programming solutions in 55 languages.
 Used for IR decompilation.
14

4.2 Generating Intermediate
Representations
• clang: LLVM C++ compilation toolchain.
• JLang: Java.
• Gollvm: Go.
• rustc: Rust.
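For two of these toolchains, textual LLVM IR can be requested with well-known flags (`-S -emit-llvm` for clang, `--emit=llvm-ir` for rustc). The helper below only builds the command lines and does not run them; the function name, file names, and the default optimization level are assumptions, since the slides do not state the flags actually used. JLang and Gollvm are left out because their command lines are not given here.

```python
def ir_command(language, src_path, out_path, opt_level="-O0"):
    """Build a command line that emits textual LLVM IR for one source file."""
    if language == "cpp":
        # clang++: -S with -emit-llvm writes human-readable LLVM IR.
        return ["clang++", "-S", "-emit-llvm", opt_level, src_path, "-o", out_path]
    if language == "rust":
        # rustc: --emit=llvm-ir writes LLVM IR; compile as a library crate
        # so that no `main` function is required.
        return ["rustc", "--emit=llvm-ir", "--crate-type=lib", src_path, "-o", out_path]
    raise ValueError(f"no command sketched for language: {language}")


cpp_cmd = ir_command("cpp", "func.cpp", "func.ll")
rust_cmd = ir_command("rust", "func.rs", "func.ll")
```

The resulting lists can be passed to `subprocess.run` on a machine where the corresponding compiler is installed.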
15

4.3 Evaluation
• The computational accuracy test suite from TransCoder (Roziere et al., 2020) is reused and extended.
 852 parallel functions in C++, Java and Python from Roziere et al. (2020).
 This work adds 280 Rust functions and 343 Go functions as test sets.
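Computational accuracy counts a translation as correct only if it produces the expected outputs on the unit tests. A minimal sketch of that metric, with Python callables standing in for translated functions (in practice the translations would be compiled and run in the target language); the function names and test cases are made up.

```python
def computational_accuracy(translations, tests):
    """Fraction of functions whose translation passes all of its unit tests."""
    passed = 0
    for name, fn in translations.items():
        if all(fn(*args) == expected for args, expected in tests[name]):
            passed += 1
    return passed / len(translations)


# Toy stand-ins: one correct translation and one buggy translation.
translations = {
    "add": lambda a, b: a + b,
    "square": lambda v: v * 2,  # bug: should be v * v
}
tests = {
    "add": [((1, 2), 3), ((0, 0), 0)],
    "square": [((3,), 9)],
}

ca = computational_accuracy(translations, tests)
```

Here `ca` is 0.5, since only `add` passes every test.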
16

5) Results
5.1 Experimental Details.
• The model has 12 layers (6 in the encoder and 6 in the decoder), 8 attention heads, and a dimension of 1024.
• 15% of tokens are masked in the MLM and TLM objectives.
• 20% of tokens are masked in the AE and TAE objectives.
• All objectives except MLM are trained at the function level.
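The hyperparameters above can be collected in a small configuration object. The class and field names are hypothetical; the per-head dimension assumes the usual even split of the model dimension across attention heads, which the slides do not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    encoder_layers: int = 6
    decoder_layers: int = 6
    attention_heads: int = 8
    model_dim: int = 1024
    mlm_tlm_mask_rate: float = 0.15  # MLM and TLM objectives
    ae_tae_mask_rate: float = 0.20   # AE and TAE objectives

    @property
    def total_layers(self) -> int:
        return self.encoder_layers + self.decoder_layers

    @property
    def head_dim(self) -> int:
        # Assumes the model dimension is split evenly across heads.
        return self.model_dim // self.attention_heads


config = ModelConfig()
```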
17

Translation Results
Table 2: Translation performance (CA@1), for greedy decoding and beam size 5.
18

Cont’d
Table 4: Translation results with different beam sizes.
19

Decompilation Results
Table 5: Performance of LLVM IR decompilation; the model outperforms RetDec on C++.
20

5.2 IR-Augmented Code Representations for Translation
• Leveraging the IR gives the best average performance (Table 2).
• Average improvement of 5.5% over the TransCoder baseline.
• Translations from and into Rust (a language with less data) improved by 25.6%.
• Translation with the IR-augmented objectives (TLM, TAE and MT) performs well, whereas the IR pivot method performs relatively poorly.
• The model generates embeddings that better capture token semantics (see Figure 10 on slide 23).
21

Figure 6: Java to Rust translation examples.
The Java bitwise complement operator ~ corresponds to ! in Rust.
A signed int in Java corresponds to i32 in Rust.
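Both Java's `~` on an `int` and Rust's `!` on an `i32` flip all 32 bits of a two's-complement value. The sketch below reproduces that behaviour in Python, whose integers are unbounded and therefore need explicit wrapping to 32 bits; the helper names are made up for illustration.

```python
def to_i32(value):
    """Wrap a Python int to a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def complement_i32(value):
    """Flip all 32 bits, as Java `~x` (int) and Rust `!x` (i32) do."""
    return to_i32(~value)


results = [complement_i32(v) for v in (5, -1, 2147483647)]
```

For any 32-bit value the result equals `-value - 1`, so `results` is `[-6, 0, -2147483648]`.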
22

Figure 10: Token similarities. Rank and token similarity with u32 for this
model (right) and the baseline model (left).
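A ranking like the one in this figure can be produced by sorting tokens by their similarity to a query token's embedding. The sketch below assumes cosine similarity as the measure; the three-dimensional vectors are made up and are not taken from either model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, embeddings):
    """Return (token, similarity) pairs, most similar to `query` first."""
    q = embeddings[query]
    scores = [(tok, cosine(q, vec)) for tok, vec in embeddings.items() if tok != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (invented values for illustration only).
toy_embeddings = {
    "u32": [1.0, 0.0, 0.2],
    "uint32_t": [0.9, 0.1, 0.2],
    "int": [0.6, 0.5, 0.1],
    "String": [0.0, 1.0, 0.0],
}

ranking = rank_by_similarity("u32", toy_embeddings)
```

With these toy vectors, `uint32_t` ranks first and `String` last.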
23

Table 7: Reduction of Rust error types.
24

6) Discussion
• Different IRs and interpreted languages:
 Although the 4 languages studied (C++, Java, Go and Rust) are compiled, IRs are available for interpreted languages too. The front-ends of a language pair should target the same IR.
• Pivot vs. embedding:
 The pivot method learns to translate using only IR-level similarities; it uses the source code only to compute the IR.
 Adding the TLM, TAE and MT objectives to the 3 UNMT objectives enables the model to learn multilingual representations of source code from similarities in the IR and in the source code itself.
• Using the model at inference time:
 The TLM, TAE and MT objectives are used only during training, to improve multilingual code representations; at test time the process is the same as for TransCoder.
25

Pivot Method Issues: IR Dialects
Solution:
• One decoder per target language.
• Use back-translation so that the model can translate from any IR dialect to any language:
I. Use an embedding for every IR dialect (IR-C++, IR-Go, IR-Java, IR-Rust, one per source language).
II. Generate noisy translations (e.g., IR-Go, IR-Java and IR-Rust for every C++ sequence).
III. Train the model to regenerate the C++ sequences from the noisy translations.
26

7) Conclusion
• LLVM IRs are used to improve code translation.
• The IR provides a semantically rich representation of compiled languages.
• Three objectives (TLM, TAE and MT) are proposed, leading to a 5.5% average translation improvement.
• A seq2seq transformer proves effective for decompilation.
• The approach can be extended to any pair of languages that share a common IR.
• In future work, IR could be generated by compiling entire projects, to address the current limitation on source and target sequences.
27
