SlideShare a Scribd company logo
1 of 15
Unsupervised Translation of
Programming Languages
Transcompiler is a system that converts source code from one high-
level programming language (such as C++ or Python) to another.
TransCoder
TransCoder is a sequence-to-sequence
(seq2seq) model with attention [2],
composed of an encoder and a decoder
with a transformer architecture.
A single shared model is used for all programming languages.
‘TransCoder’ is trained using 3 principles of unsupervised machine
translation
1. Cross Programming Language Model pretraining (Initialization)
2. Denoising auto-encoding (Language modeling)
3. Back-translation
1. Cross Programming Language Model pretraining
(Initialization)
• The first principle initializes the model with cross-lingual masked
language model pretraining.
• Pretraining is a key ingredient of unsupervised machine translation.
• It ensures that sequences with a similar meaning are mapped to the
same latent representation, regardless of their languages.
• The cross-lingual nature of the resulting model comes from the
significant number of common tokens (anchor points) that exist
across languages.
• In the context of English-French translation, the ‘anchor point’
consists essentially of digits and city and people names.
• In programming languages, these ‘anchor points’ come from common
keywords (e.g. for, while, if, try), and also digits, mathematical
operators, and English strings that appear in the source code.
• ‘TransCoder’ follows the pretraining strategy where a Cross-lingual
Language Model (XLM) is pretrained with a masked language
modeling objective on monolingual source code datasets.
• For the masked language modeling (MLM) objective, at each iteration
we consider an input stream of source code sequences, randomly
mask out some of the tokens, and train ‘TransCoder’ to predict the
tokens that have been masked out based on their contexts.
2. Denoising auto-encoding (Language modeling)
• Denoising auto-encoding, the second principle, trains the decoder to
always generate valid sequences, even when fed with noisy data, and
increases the encoder robustness to input noise.
• XLM pretraining allows the Encoder-Decoder model to generate high
quality representations of input sequences.
• Model is trained to encode and decode sequences with a Denoising
Auto-Encoding (DAE) objective.
• DAE objective operates like a supervised machine translation
algorithm, where the model is trained to predict a sequence of tokens
given a corrupted version of that sequence.
3. Back-translation
• Back-translation, the last principle, allows the model to generate
parallel data which can be used for training.
• Back-translation(feedback) is one of the most effective methods to
leverage monolingual data in a weakly(semi)-supervised scenario.
• In the unsupervised setting, a forward source-to-target model is
coupled with a backward target-to-source model trained in parallel.
• The target-to-source model is used to translate target sequences into
the source language, producing noisy source sequences.
• The source-to-target model is then trained in a weakly supervised
manner to reconstruct the target sequences from the noisy source
sequences generated by the target-to-source model, and vice versa.
• The two models are trained in parallel until convergence.
Experiments
Training data:
• GitHub public dataset available on Google BigQuery is used as
dataset.
• C++, Java, and Python project files are selected.
• ‘TransCoder’ translates at function level.
• ‘TransCoder’ pretrains on all source code available, and train the
denoising auto-encoding and back-translation objectives on functions
only.
• Comments in the source code increases the number of anchor points
across languages, which results in a better overall performance.
Preprocessing:
• Unlike multilingual natural language processing, universal tokenizer
would be suboptimal in translation of programming languages, as
different languages use different patterns and keywords.
• The logical operators && and || exist in C++ where they should be
tokenized as a single token, but not in Python.
• The indentations are critical in Python as they define the code
structure, but have no meaning in languages like C++ or Java.
• Tokenizer used:
• javalang5 tokenizer for Java
• Python6 tokenizer for Python
• clang7 tokenizer for C++
Evaluation:
• Coding problems and solutions in several programming languages
from ‘GeeksforGeeks’ are extracted, to create validation and test
sets.
• These functions not only return the same output, but also compute
the result with similar algorithm.
Observations
• ‘TransCoder’ successfully understands the syntax specific to each
language, learns data structures and their methods, and correctly
aligns libraries across programming languages.
• ‘TransCoder’ successfully map tokens with similar meaning to the
same latent representation, regardless of their languages.
• ‘TransCoder’ can adapt to small modifications.

More Related Content

What's hot

Programing paradigm & implementation
Programing paradigm & implementationPrograming paradigm & implementation
Programing paradigm & implementationBilal Maqbool ツ
 
Compiler design Introduction
Compiler design IntroductionCompiler design Introduction
Compiler design IntroductionAman Sharma
 
Language processor
Language processorLanguage processor
Language processorAbha Damani
 
Lecture 01 introduction to compiler
Lecture 01 introduction to compilerLecture 01 introduction to compiler
Lecture 01 introduction to compilerIffat Anjum
 
Declare Your Language: What is a Compiler?
Declare Your Language: What is a Compiler?Declare Your Language: What is a Compiler?
Declare Your Language: What is a Compiler?Eelco Visser
 
Chap 1-language processor
Chap 1-language processorChap 1-language processor
Chap 1-language processorshindept123
 
System Programming Unit III
System Programming Unit IIISystem Programming Unit III
System Programming Unit IIIManoj Patil
 
Introduction to course
Introduction to courseIntroduction to course
Introduction to coursenikit meshram
 
Language processors
Language processorsLanguage processors
Language processorseShikshak
 
Chapter One
Chapter OneChapter One
Chapter Onebolovv
 
Cs6660 compiler design
Cs6660 compiler designCs6660 compiler design
Cs6660 compiler designhari2010
 
Structure of the compiler
Structure of the compilerStructure of the compiler
Structure of the compilerSudhaa Ravi
 

What's hot (19)

Programing paradigm & implementation
Programing paradigm & implementationPrograming paradigm & implementation
Programing paradigm & implementation
 
1 cc
1 cc1 cc
1 cc
 
Compiler design Introduction
Compiler design IntroductionCompiler design Introduction
Compiler design Introduction
 
Language processor
Language processorLanguage processor
Language processor
 
Paradigms
ParadigmsParadigms
Paradigms
 
Lecture 01 introduction to compiler
Lecture 01 introduction to compilerLecture 01 introduction to compiler
Lecture 01 introduction to compiler
 
Declare Your Language: What is a Compiler?
Declare Your Language: What is a Compiler?Declare Your Language: What is a Compiler?
Declare Your Language: What is a Compiler?
 
Chap 1-language processor
Chap 1-language processorChap 1-language processor
Chap 1-language processor
 
Compiler design
Compiler designCompiler design
Compiler design
 
System Programming Unit III
System Programming Unit IIISystem Programming Unit III
System Programming Unit III
 
Introduction to course
Introduction to courseIntroduction to course
Introduction to course
 
Language processors
Language processorsLanguage processors
Language processors
 
Introduction to programming c
Introduction to programming cIntroduction to programming c
Introduction to programming c
 
Compiler construction
Compiler constructionCompiler construction
Compiler construction
 
Chapter One
Chapter OneChapter One
Chapter One
 
Compiler unit 1
Compiler unit 1Compiler unit 1
Compiler unit 1
 
Prgramming paradigms
Prgramming paradigmsPrgramming paradigms
Prgramming paradigms
 
Cs6660 compiler design
Cs6660 compiler designCs6660 compiler design
Cs6660 compiler design
 
Structure of the compiler
Structure of the compilerStructure of the compiler
Structure of the compiler
 

Similar to Trans coder

고급컴파일러구성론_개레_230303.pptx
고급컴파일러구성론_개레_230303.pptx고급컴파일러구성론_개레_230303.pptx
고급컴파일러구성론_개레_230303.pptxssuser1e7611
 
Introduction to compiler
Introduction to compilerIntroduction to compiler
Introduction to compilerAbha Damani
 
Compiler Construction
Compiler ConstructionCompiler Construction
Compiler ConstructionAhmed Raza
 
Compiler Design Introduction
Compiler Design Introduction Compiler Design Introduction
Compiler Design Introduction Thapar Institute
 
Compiler an overview
Compiler  an overviewCompiler  an overview
Compiler an overviewamudha arul
 
Pros and cons of c as a compiler language
  Pros and cons of c as a compiler language  Pros and cons of c as a compiler language
Pros and cons of c as a compiler languageAshok Raj
 
week 2 - INTRO TO PROGRAMMING.pptx
week 2 - INTRO TO PROGRAMMING.pptxweek 2 - INTRO TO PROGRAMMING.pptx
week 2 - INTRO TO PROGRAMMING.pptxnuruddinnnaim
 
Concept of compiler in details
Concept of compiler in detailsConcept of compiler in details
Concept of compiler in detailskazi_aihtesham
 
Phases of Compiler.pptx
Phases of Compiler.pptxPhases of Compiler.pptx
Phases of Compiler.pptxssuser3b4934
 
Computer Programming In C.pptx
Computer Programming In C.pptxComputer Programming In C.pptx
Computer Programming In C.pptxchouguleamruta24
 
Transpilers(Source-to-Source Compilers)
Transpilers(Source-to-Source Compilers)Transpilers(Source-to-Source Compilers)
Transpilers(Source-to-Source Compilers)Shivang Bajaniya
 
Python-L1.pptx
Python-L1.pptxPython-L1.pptx
Python-L1.pptxDukeCalvin
 
Chapter1pdf__2021_11_23_10_53_20.pdf
Chapter1pdf__2021_11_23_10_53_20.pdfChapter1pdf__2021_11_23_10_53_20.pdf
Chapter1pdf__2021_11_23_10_53_20.pdfDrIsikoIsaac
 
Compiler Construction Lecture One .pptx
Compiler Construction Lecture One  .pptxCompiler Construction Lecture One  .pptx
Compiler Construction Lecture One .pptxانشال عارف
 
Compiler Construction
Compiler ConstructionCompiler Construction
Compiler ConstructionSarmad Ali
 

Similar to Trans coder (20)

고급컴파일러구성론_개레_230303.pptx
고급컴파일러구성론_개레_230303.pptx고급컴파일러구성론_개레_230303.pptx
고급컴파일러구성론_개레_230303.pptx
 
Introduction to compiler
Introduction to compilerIntroduction to compiler
Introduction to compiler
 
Compiler Construction
Compiler ConstructionCompiler Construction
Compiler Construction
 
Presentation1
Presentation1Presentation1
Presentation1
 
Compiler Design Introduction
Compiler Design Introduction Compiler Design Introduction
Compiler Design Introduction
 
Compiler an overview
Compiler  an overviewCompiler  an overview
Compiler an overview
 
Pros and cons of c as a compiler language
  Pros and cons of c as a compiler language  Pros and cons of c as a compiler language
Pros and cons of c as a compiler language
 
The Phases of a Compiler
The Phases of a CompilerThe Phases of a Compiler
The Phases of a Compiler
 
week 2 - INTRO TO PROGRAMMING.pptx
week 2 - INTRO TO PROGRAMMING.pptxweek 2 - INTRO TO PROGRAMMING.pptx
week 2 - INTRO TO PROGRAMMING.pptx
 
Chapter 1.pptx
Chapter 1.pptxChapter 1.pptx
Chapter 1.pptx
 
Concept of compiler in details
Concept of compiler in detailsConcept of compiler in details
Concept of compiler in details
 
Phases of Compiler.pptx
Phases of Compiler.pptxPhases of Compiler.pptx
Phases of Compiler.pptx
 
Computer Programming In C.pptx
Computer Programming In C.pptxComputer Programming In C.pptx
Computer Programming In C.pptx
 
COMPILER DESIGN PPTS.pptx
COMPILER DESIGN PPTS.pptxCOMPILER DESIGN PPTS.pptx
COMPILER DESIGN PPTS.pptx
 
CD U1-5.pptx
CD U1-5.pptxCD U1-5.pptx
CD U1-5.pptx
 
Transpilers(Source-to-Source Compilers)
Transpilers(Source-to-Source Compilers)Transpilers(Source-to-Source Compilers)
Transpilers(Source-to-Source Compilers)
 
Python-L1.pptx
Python-L1.pptxPython-L1.pptx
Python-L1.pptx
 
Chapter1pdf__2021_11_23_10_53_20.pdf
Chapter1pdf__2021_11_23_10_53_20.pdfChapter1pdf__2021_11_23_10_53_20.pdf
Chapter1pdf__2021_11_23_10_53_20.pdf
 
Compiler Construction Lecture One .pptx
Compiler Construction Lecture One  .pptxCompiler Construction Lecture One  .pptx
Compiler Construction Lecture One .pptx
 
Compiler Construction
Compiler ConstructionCompiler Construction
Compiler Construction
 

Recently uploaded

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Recently uploaded (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Trans coder

  • 2. Transcompiler is a system that converts source code from one high- level programming language (such as C++ or Python) to another.
  • 3. TransCoder TransCoder is a sequence-to-sequence (seq2seq) model with attention [2], composed of an encoder and a decoder with a transformer architecture. A single shared model is used for all programming languages.
  • 4. ‘TransCoder’ is trained using 3 principles of unsupervised machine translation 1. Cross Programming Language Model pretraining (Initialization) 2. Denoising auto-encoding (Language modeling) 3. Back-translation
  • 5. 1. Cross Programming Language Model pretraining (Initialization) • The first principle initializes the model with cross-lingual masked language model pretraining. • Pretraining is a key ingredient of unsupervised machine translation. • It ensures that sequences with a similar meaning are mapped to the same latent representation, regardless of their languages.
  • 6. • The cross-lingual nature of the resulting model comes from the significant number of common tokens (anchor points) that exist across languages. • In the context of English-French translation, the ‘anchor point’ consists essentially of digits and city and people names. • In programming languages, these ‘anchor points’ come from common keywords (e.g. for, while, if, try), and also digits, mathematical operators, and English strings that appear in the source code.
  • 7. • ‘TransCoder’ follows the pretraining strategy where a Cross-lingual Language Model (XLM) is pretrained with a masked language modeling objective on monolingual source code datasets. • For the masked language modeling (MLM) objective, at each iteration we consider an input stream of source code sequences, randomly mask out some of the tokens, and train ‘TransCoder’ to predict the tokens that have been masked out based on their contexts.
  • 8. 2. Denoising auto-encoding (Language modeling) • Denoising auto-encoding, the second principle, trains the decoder to always generate valid sequences, even when fed with noisy data, and increases the encoder robustness to input noise. • XLM pretraining allows the Encoder-Decoder model to generate high quality representations of input sequences.
  • 9. • Model is trained to encode and decode sequences with a Denoising Auto-Encoding (DAE) objective. • DAE objective operates like a supervised machine translation algorithm, where the model is trained to predict a sequence of tokens given a corrupted version of that sequence.
  • 10. 3. Back-translation • Back-translation, the last principle, allows the model to generate parallel data which can be used for training. • Back-translation(feedback) is one of the most effective methods to leverage monolingual data in a weakly(semi)-supervised scenario. • In the unsupervised setting, a forward source-to-target model is coupled with a backward target-to-source model trained in parallel.
  • 11. • The target-to-source model is used to translate target sequences into the source language, producing noisy source sequences. • The source-to-target model is then trained in a weakly supervised manner to reconstruct the target sequences from the noisy source sequences generated by the target-to-source model, and vice versa. • The two models are trained in parallel until convergence.
  • 12. Experiments Training data: • GitHub public dataset available on Google BigQuery is used as dataset. • C++, Java, and Python project files are selected. • ‘TransCoder’ translates at function level. • ‘TransCoder’ pretrains on all source code available, and train the denoising auto-encoding and back-translation objectives on functions only. • Comments in the source code increases the number of anchor points across languages, which results in a better overall performance.
  • 13. Preprocessing: • Unlike multilingual natural language processing, universal tokenizer would be suboptimal in translation of programming languages, as different languages use different patterns and keywords. • The logical operators && and || exist in C++ where they should be tokenized as a single token, but not in Python. • The indentations are critical in Python as they define the code structure, but have no meaning in languages like C++ or Java. • Tokenizer used: • javalang5 tokenizer for Java • Python6 tokenizer for Python • clang7 tokenizer for C++
  • 14. Evaluation: • Coding problems and solutions in several programming languages from ‘GeeksforGeeks’ are extracted, to create validation and test sets. • These functions not only return the same output, but also compute the result with similar algorithm.
  • 15. Observations • ‘TransCoder’ successfully understands the syntax specific to each language, learns data structures and their methods, and correctly aligns libraries across programming languages. • ‘TransCoder’ successfully map tokens with similar meaning to the same latent representation, regardless of their languages. • ‘TransCoder’ can adapt to small modifications.