CODE TRANSLATION WITH COMPILER
REPRESENTATIONS
(Accepted as a conference paper at ICLR 2023)
Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, Gabriel Synnaeve
(Meta AI)
Presented by: Gebremedhin G. Maru
Kangwon National University
Programming Language & Machine Learning Lab
March 2, 2023
Presentation Outline
1) Introduction.
2) Intermediate Representations In Compilers
3) Training Objectives.
4) Data.
5) Results.
6) Discussion.
7) Conclusion.
1) Introduction
• Automatic code translation makes it possible to port old codebases to new frameworks.
• Limitation of existing neural machine translation (NMT) for programming languages (PL): unreliability.
 ➢ Models often fail to translate the semantics of the input program accurately.
Intuition About the Proposed Work
• Leverages information from compiler toolchains (LLVM).
• Uses compilers' intermediate representations (IR).
• IR is language-agnostic pseudocode that describes the semantics of the program.
• Benefits of IR:
 ➢ Helps align embeddings across different programming languages.
 ➢ Improves the semantic understanding of the code.
Motivational Example
Figure 1: Improvements over TransCoder.
Contributions of the Paper
• IR-augmented translation (using LLVM).
 ➢ Average improvement of 5.5%.
• Useful in low-data situations.
 ➢ E.g., 29.7% and 25.6% improvements when translating to and from Rust, respectively.
• Extends the TransCoder test set of 852 functions (Roziere et al., 2020) with 343 Go functions and 280 Rust functions.
• Achieves 78% accuracy when decompiling LLVM IR to C++.
2) Intermediate Representations in Compilers.
• Compilers translate source code into a machine-specific executable (machine code).
Fig 2: Compiler toolchain with LLVM.
Why use an IR?
• Translation involves both analysis and synthesis steps.
• To create machine-independent representations and optimizations.
• Low-resource programming languages can benefit from IR-augmented code representations.
3) Training Objectives.
• Unsupervised NMT:
 ➢ Learning multilingual sequence embeddings.
 ➢ Aligning the embeddings and generating an output from them.
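• Source sentence: $x = x_1 \dots x_{N_{so}}$
• Corresponding IR: $z^{(x)} = z^{(x)}_1 \dots z^{(x)}_{N_{ir}}$
• Target sentence: $y = y_1 \dots y_{N_{ta}}$
• We define the machine translation loss (seq2seq loss) as the summed negative log-likelihood of the target tokens:
$$\mathcal{L}_{MT}(x, y) = -\sum_{i=1}^{N_{ta}} \log P(y_i \mid x, y_1 \dots y_{i-1})$$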
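A seq2seq loss of this kind is conventionally the summed negative log-likelihood of the target tokens. The sketch below assumes the per-token probabilities have already been produced by some model; the function name `seq2seq_loss` and its input format are illustrative, not taken from the paper.

```python
import math


def seq2seq_loss(token_probs):
    """Negative log-likelihood of one target sequence.

    token_probs[i] stands for P(y_i | x, y_1 ... y_{i-1}), i.e. the
    probability a (hypothetical) model assigns to the correct i-th target
    token given the source and the previous target tokens.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# A confident model (probabilities near 1) yields a loss near 0.
confident = seq2seq_loss([0.9, 0.95, 0.99])
uncertain = seq2seq_loss([0.5, 0.25])
```

Lower values mean the model assigns more probability mass to the reference translation.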
3.1 Common Objective Functions
• Masked Language Modeling (MLM):
 ➢ Trains an encoder to predict randomly masked inputs.
 ➢ Here mask(x) is the masked version of the code sentence x, and enc(t) is the encoder output.
• Denoising Auto-Encoding (AE):
 ➢ Recovers an original sequence from a corrupted version.
 ➢ Here noise(x) denotes the corrupted version of x.
• Back-Translation (BT):
 ➢ Generates a noisy translation of the input sentence, then recovers the original input from that translation.
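The corruption functions mask(x) and noise(x) can be sketched as follows. This is a minimal illustration on token lists: the `<mask>` token, the drop-or-mask noise scheme and the function signatures are assumptions made here, and the paper's actual noise function may differ.

```python
import random

MASK = "<mask>"  # assumed placeholder token


def mask(tokens, rate, rng):
    """MLM-style corruption: replace each token by MASK with probability `rate`."""
    return [MASK if rng.random() < rate else token for token in tokens]


def noise(tokens, rate, rng):
    """AE-style corruption (assumed scheme): each token is corrupted with
    probability `rate`; a corrupted token is either masked or dropped."""
    corrupted = []
    for token in tokens:
        if rng.random() >= rate:
            corrupted.append(token)   # keep the token unchanged
        elif rng.random() < 0.5:
            corrupted.append(MASK)    # hide the token
        # otherwise the token is dropped entirely
    return corrupted


rng = random.Random(0)
tokens = ["int", "x", "=", "a", "+", "b", ";"]
masked = mask(tokens, 0.15, rng)   # same length as the input
noisy = noise(tokens, 0.20, rng)   # possibly shorter than the input
```

Masking keeps the sequence length, so the encoder can predict each hidden position; noising may change the length, which is why recovering the original is framed as a sequence-to-sequence task.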
3.2 IR for Code Representations
• IR provides additional information about the code to be translated; three new objective functions exploit it during training.
• Translation Language Modeling (TLM):
 ➢ Generates common representations for parallel sentences in different languages.
• Translation Auto-Encoding (TAE):
 ➢ Transposes the TLM objective into a denoising auto-encoder.
• IR Generation (MT):
 ➢ Trains the model to translate the source code into the corresponding IR.
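One way to picture the three IR objectives is by the (input, target) pairs each one trains on. The sketch below is illustrative only: the `<sep>` token, the helper names and the toy token lists are assumptions, and the corruption functions are passed in as parameters rather than fixed.

```python
SEP = "<sep>"  # assumed separator between source tokens and IR tokens


def tlm_example(source, ir, mask_fn):
    """TLM: mask the concatenation of source and IR; predict the original."""
    target = source + [SEP] + ir
    return mask_fn(target), target


def tae_example(source, ir, noise_fn):
    """TAE: corrupt source and IR, concatenate, and decode the clean pair."""
    return noise_fn(source) + [SEP] + noise_fn(ir), source + [SEP] + ir


def mt_example(source, ir):
    """IR generation: translate the source code into its IR."""
    return source, ir


source = ["return", "1", ";"]
ir = ["ret", "i32", "1"]
# Toy corruption that hides every second token, for illustration only.
hide_even = lambda toks: ["<mask>" if i % 2 == 0 else t for i, t in enumerate(toks)]
tlm_input, tlm_target = tlm_example(source, ir, hide_even)
```

In all three cases the source code and its IR appear in the same training example, which is what lets the model relate tokens across the two.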
Figure 3: IR for code representation objectives.
3.3 Additional Losses: IR Decompilation and Pivot
• The study uses IR in two alternative ways:
I. IR decompilation:
 ➢ Predicts source code from IR, reversing the compiler's task.
II. IR pivot translation:
 ➢ Compiles the source program to IR, then decompiles that IR into the target language.
 ➢ Uses a neural decompiler.
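The pivot idea is a composition of two steps, which the sketch below makes explicit. Both callables are hypothetical stand-ins: in the paper's setting the first would be a real compiler front end and the second a trained neural decompiler.

```python
from typing import Callable


def pivot_translate(
    source: str,
    compile_to_ir: Callable[[str], str],
    decompile_to_target: Callable[[str], str],
) -> str:
    """Translate by compiling the source to IR, then decompiling the IR
    into the target language."""
    ir = compile_to_ir(source)
    return decompile_to_target(ir)


# Toy stand-ins that only tag their input, to show the data flow.
fake_compiler = lambda code: "IR(" + code + ")"
fake_decompiler = lambda ir: "rust<" + ir + ">"
result = pivot_translate("int f() { return 1; }", fake_compiler, fake_decompiler)
```

Because the decompiler sees only the IR, anything the compiler discards (names, comments, idioms) is unavailable to it.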
4) Data.
4.1 Training Data.
• Google BigQuery
 ➢ Indexes over 2.8 million open-source repositories from GitHub.
 ➢ All individual C++, Java, Rust and Go functions were extracted.
• CodeNet dataset
 ➢ Repository of 14 million competitive programming solutions in 55 languages.
 ➢ Used for IR decompilation.
4.2 Generating Intermediate
Representations
• clang: LLVM C++ compilation toolchain.
• JLang: Java.
• Gollvm: Go.
• rustc: Rust.
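For two of these toolchains, emitting textual LLVM IR is a matter of standard command-line flags. The helper below only builds the command lines; the function name and the file-extension dispatch are assumptions for illustration, and the exact flags used in the paper (for example optimization levels) are not stated on the slide. JLang and Gollvm are left out because their invocations are not given here.

```python
from pathlib import Path


def ir_command(source_path):
    """Return a command that emits textual LLVM IR (.ll) for one file.

    Covers only clang++ (C++) and rustc (Rust).
    """
    path = Path(source_path)
    output = str(path.with_suffix(".ll"))
    if path.suffix in {".cpp", ".cc"}:
        # -S with -emit-llvm makes clang write human-readable IR.
        return ["clang++", "-S", "-emit-llvm", str(path), "-o", output]
    if path.suffix == ".rs":
        # --crate-type=lib lets a file without main() compile on its own.
        return ["rustc", "--emit=llvm-ir", "--crate-type=lib", str(path), "-o", output]
    raise ValueError("unsupported source file: " + str(path))


cpp_cmd = ir_command("example.cpp")
rust_cmd = ir_command("example.rs")
```

The returned list can be passed to `subprocess.run` on a machine where the corresponding compiler is installed.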
4.3 Evaluation
• The computational accuracy test suite from TransCoder (Roziere et al., 2020) is reused and extended.
 ➢ 852 parallel functions in C++, Java and Python from Roziere et al. (2020).
 ➢ This work adds 280 Rust functions and 343 Go functions as test sets.
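Computational accuracy counts a translation as correct when it produces the expected outputs on unit tests. A minimal sketch, assuming each translated function is available as a Python callable and each test case is an `(args, expected)` pair (both representations are choices made here, not the format of the actual test suite):

```python
def passes_all(candidate, cases):
    """True if the candidate returns the expected value on every test case."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed test
    return True


def computational_accuracy(candidates, test_suite):
    """Fraction of functions whose single (top-1) translation passes its tests."""
    if not test_suite:
        raise ValueError("empty test suite")
    passed = sum(passes_all(candidates[name], cases) for name, cases in test_suite.items())
    return passed / len(test_suite)


candidates = {
    "add": lambda a, b: a + b,   # correct translation
    "absolute": lambda x: x,     # wrong for negative inputs
}
test_suite = {
    "add": [((1, 2), 3), ((0, 0), 0)],
    "absolute": [((3,), 3), ((-1,), 1)],
}
score = computational_accuracy(candidates, test_suite)
```

With one candidate per function this corresponds to CA@1; with beam search, a function would count as solved if any of the returned candidates passes.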
5) Results.
5.1 Experimental Details.
• The model has 12 layers (6 in the encoder and 6 in the decoder), 8 attention heads, and a dimension of 1024.
• 15% of tokens are masked in the MLM and TLM objectives.
• 20% of tokens are masked in the AE and TAE objectives.
• All objectives except MLM are trained at the function level.
Translation Results
Table 2: Translation performance (CA@1), for greedy decoding and beam size 5.
Cont’d
Table 4: Translation results with different beam sizes.
Decompilation Results
Table 5: Performance of LLVM IR decompilation; the model outperforms RetDec on C++.
5.2 IR-Augmented Code Representations for Translation.
• Leveraging IR gives the best average performance (Table 2).
• Compared with the TransCoder baseline, performance improves by 5.5% on average.
• Translations from and into Rust (the language with the least data) improved by 25.6%.
• Translations using the IR-augmented objectives (TLM, TAE and MT) perform well, but the IR pivot method performs relatively poorly.
• The model generates embeddings that better capture token semantics (see Figure 10).
Figure 6: Java to Rust translation examples.
Java's bitwise complement operator ~ corresponds to ! in Rust.
Java's signed int corresponds to i32 in Rust.
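The choice of integer type matters for an operator like bitwise complement. The sketch below emulates 32-bit arithmetic in Python to show that complementing a signed 32-bit value and an unsigned 32-bit value give different numbers; the helper names are illustrative and the code is an emulation, not Java or Rust.

```python
def to_i32(value):
    """Wrap a Python int into the signed 32-bit range (two's complement)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def complement_i32(x):
    """Emulates Java's ~x on an int, or Rust's !x on an i32."""
    return to_i32(~x)


def complement_u32(x):
    """Emulates Rust's !x on a u32."""
    return ~x & 0xFFFFFFFF


signed_result = complement_i32(5)     # -6
unsigned_result = complement_u32(5)   # 4294967290
```

A translation that picked an unsigned type for a Java `int` would therefore change the result of `~x`, even though the operator itself was mapped correctly.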
Figure 10: Token similarities. Rank and token similarity with u32 for this
model (right) and the baseline model (left).
Table 7: Reduction of Rust error types.
6) Discussion.
• Different IRs and interpreted languages:
 ➢ The four languages studied (C++, Java, Go and Rust) are compiled, but IRs are available for interpreted languages too. The front ends for a language pair should target the same IR.
• Pivot vs. embedding:
 ➢ The pivot method learns to translate using only IR-level similarities; it uses the source code only to compute the IR.
 ➢ Adding the TLM, TAE and MT objectives to the three unsupervised NMT objectives lets the model learn multilingual representations of source code from similarities in both the IR and the source code itself.
• Using the model at inference time:
 ➢ The TLM, TAE and MT objectives are used only during training to improve the multilingual code representation; at test time the process is the same as for TransCoder.
Pivot Method Issues: IR Dialects
Solution:
• One decoder per target language.
• Use back-translation so that the model can translate from any IR dialect to any language:
I. Create an embedding for every IR dialect (IR-C++, IR-Go, IR-Java, IR-Rust, one per source language).
II. Generate noisy translations (e.g., IR-Go, IR-Java and IR-Rust for every C++ sequence).
III. Train the model to regenerate the C++ sequences from the noisy translations.
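Steps II and III amount to building (noisy IR, original source) training pairs, one per foreign IR dialect. In the sketch below, `translate` is a hypothetical stand-in for the current model, and the dialect labels are plain strings chosen for illustration.

```python
def back_translation_pairs(cpp_sequence, dialects, translate):
    """Build (noisy IR, original C++) pairs for back-translation.

    `translate(sequence, dialect)` stands for the current model producing
    a possibly noisy rendering of the sequence in the given IR dialect.
    """
    return [(translate(cpp_sequence, dialect), cpp_sequence) for dialect in dialects]


# Toy model that only tags its input with the requested dialect.
toy_translate = lambda seq, dialect: "[" + dialect + "] " + seq
pairs = back_translation_pairs(
    "int f() { return 1; }",
    ["IR-Go", "IR-Java", "IR-Rust"],
    toy_translate,
)
```

Training the decoder on the first element of each pair to reproduce the second is what exposes it to every dialect, not only the one its own language compiles to.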
7) Conclusion.
• LLVM IR is used to improve code translation.
• IR gives a semantically rich view of compiled languages.
• The paper introduces three objectives (TLM, TAE and MT), which lead to a 5.5% average translation improvement.
• A seq2seq transformer also proves effective for decompilation.
• The approach can be extended to any pair of languages that share a common IR.
• In future work, IR could be generated by compiling entire projects, to address the current limitations on source and target sequences.
Thank You & Questions?


Editor's Notes

  1. A compiler consists of three parts. Front end: takes source code as input, lexes (tokenizes) and parses the program to produce an AST, then translates the AST to IR. Middle end: performs optimizations on the IR that are independent of the source language and the target machine, such as constant folding, dead-code analysis and storage reduction. Back end: converts the IR into machine-specific executable (binary) code.
  2. IRs make it possible to build retargetable compilers. We can build new back ends for an existing front end (making the source language more portable across machines), or a new front end for an existing back end (so a new machine can quickly get a set of compilers for different source languages). We only have to write 2n half-compilers instead of n(n−1) full compilers. (Though this might be a bit of an exaggeration in practice!)
  3. IR decompilation consists in recovering source code corresponding to a given IR. In practice, it reverses the computations performed by the compiler. IR Pivot is a translation method built upon IR decompilation. Since LLVM can compile many languages (C++, Java, Rust, Go) into the same IR, an obvious approach to code translation consists in decompiling the IR generated from the source language into code in the target language.
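Constant folding, listed in note 1 as a middle-end optimization, can be illustrated on a small scale. The sketch below folds integer arithmetic in a Python syntax tree using the standard `ast` module; it works on Python source rather than LLVM IR, needs Python 3.9 or later for `ast.unparse`, and only handles `+`, `-` and `*`, so it is an analogy for the technique rather than a description of LLVM's pass.

```python
import ast
import operator

# Operators this sketch knows how to evaluate at "compile time".
_FOLDABLE = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ConstantFolder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the operands first (bottom-up)
        op = _FOLDABLE.get(type(node.op))
        if (
            op is not None
            and isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and isinstance(node.left.value, int)
            and isinstance(node.right.value, int)
        ):
            folded = ast.Constant(value=op(node.left.value, node.right.value))
            return ast.copy_location(folded, node)
        return node


def fold(source):
    """Return the source with constant integer sub-expressions evaluated."""
    tree = ConstantFolder().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


folded_source = fold("x = 2 * 3 + y")
```

Here the sub-expression `2 * 3` is replaced by `6` before the program runs, while the part that depends on `y` is left untouched.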