The document presents a review of large language models (LLMs) for code generation. It discusses different types of LLMs, including left-to-right, masked, and encoder-decoder models, and compares existing code-generation models such as Codex, GPT-Neo, GPT-J, and CodeParrot. A new model called PolyCoder, with 2.7 billion parameters and trained on 12 programming languages, is introduced. Evaluation results show that PolyCoder performs less well than comparably sized models on HumanEval, but outperforms all models, including Codex, on the C language. In general, performance improves with larger models and longer training, though training solely on code can be sufficient or even advantageous for some languages.
1. A Comprehensive Review of Large Language Models for Code Generation
Presented By: Sai Pragna Kancheti
2. INTRODUCTION:
ChatGPT-like chatbots have become popular in recent times. These chatbots are natural
language processing tools developed for general-purpose use; they apply artificial
intelligence to generate text after a user enters a prompt.
Although these chatbots are built for general-purpose use, they are also good at
generating code from user prompts by relying on large language models.
In this presentation, we systematically review large language models for code
generation based on user prompts.
At the end, based on the results, we present some insights for further research in
this direction.
3. What are LLMs?
A large language model is a more advanced kind of language model that is trained on
vast volumes of text data using deep learning techniques.
These models can generate human-like text and perform a variety of natural language
processing tasks.
The complexity of a language model can range from simple n-gram models to more
complex neural network models.
Examples: GPT-3 (Generative Pretrained Transformer 3), BERT (Bidirectional Encoder
Representations from Transformers), RoBERTa (Robustly Optimized BERT Approach), etc.
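As a concrete illustration of how such a model generates text from a prompt, here is a minimal sketch. It assumes the Hugging Face transformers library and the small "gpt2" checkpoint purely as a stand-in; neither is one of the code models reviewed later.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-in checkpoint; any causal LM checkpoint works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "A large language model is"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedily generate a short continuation of the prompt.
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))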
4. LLMs for code generation
Recent models excel at tasks such as code completion and code synthesis from natural
language descriptions.
One promising recent effort, Austin et al. (2021), has demonstrated significant
progress toward AI-based programming assistance.
One of the largest of these models, Codex (Chen et al., 2021), has been deployed as
an in-IDE developer assistant that automatically generates code based on the user's
context, in the real-world production tool GitHub Copilot.
Despite the enormous success of large language models of code, the most powerful
models are not publicly accessible.
5. LLMs for code generation
Some of the existing models of code, their sizes, and their availability (open-source
or not) are shown in the figure.
6. Challenges with the available LLMs for code generation
Although these models can show good performance for code generation based on user
prompts, the following challenges need to be addressed for further development in
this area:
There was no large open-source language model trained almost exclusively on code
from multiple programming languages.
Powerful models are not publicly accessible.
Access to the models' internals is unavailable.
This prohibits these models from being applied to code generation tasks and inhibits
research in this field for low-resource organizations.
9. Left-to-Right Language Models
Auto-regressive, left-to-right language models predict the likelihood of a token
conditioned on the sequence of tokens that come before it.
These models' sequential, left-to-right operation is especially useful for
program-generation tasks such as code auto-completion.
However, because code is not typically written in a single left-to-right pass,
utilizing context that appears "after" the point of generation is difficult.
Examples: CodeParrot, GPT-Neo, GPT-J (6B), Codex (12B), GPT-NeoX (20B), and Google's
137B model (Austin et al., 2021).
These are the types of models considered in this review.
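The left-to-right, token-by-token behaviour can be made concrete with a small sketch. It assumes PyTorch, the Hugging Face transformers library, and the "gpt2" checkpoint as a stand-in for the code models listed above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# A code prefix; the model only ever conditions on tokens to its left.
ids = tokenizer("def add(a, b):", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(8):  # extend the prefix by 8 tokens, one at a time
        logits = model(ids).logits          # shape: [1, seq_len, vocab]
        next_id = logits[0, -1].argmax()    # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))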
10. Masked Language Models
While auto-regressive language models are powerful for modeling the probability of
sequences, their unidirectional nature makes them less suitable for producing
effective whole-sequence representations for downstream tasks such as classification.
One popular bidirectional objective function widely used in representation learning
is masked language modeling, where the aim is to predict masked text pieces based on
the surrounding context.
Examples: CodeBERT (125M) and CuBERT (345M).
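A minimal sketch of the masked-prediction objective applied to code, assuming the microsoft/codebert-base-mlm checkpoint (a CodeBERT variant released with a masked-LM head) is available on the Hugging Face Hub:

from transformers import pipeline

# The model predicts the masked token from both left and right context.
fill = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

code = "def max(a, b): return a if a <mask> b else b"
for prediction in fill(code, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))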
11. Encoder-decoder Models
An encoder-decoder model first uses an encoder to encode an input
sequence, and then uses a left-to-right LM to decode an output sequence
conditioned on the input sequence.
Popular pretraining objectives include masked span prediction, where the input
sequence is randomly masked with multiple sentinel masks and the output sequence is
the masked contents in order, and denoising sequence reconstruction, where the input
is a corrupted sequence and the output is the original sequence.
These pretrained models are useful in many sequence-to-sequence tasks.
Examples: CodeT5 (220M) and PLBART (406M).
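A sketch of masked span prediction with such a model, assuming the Salesforce/codet5-base checkpoint from the Hugging Face Hub: the encoder reads the input with a sentinel token in place of the masked span, and the decoder generates the missing contents.

from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# <extra_id_0> marks the masked span the decoder should fill in.
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))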
13. Existing Models
Codex: Codex is a large language model that has been fine-tuned on publicly available
Python code from GitHub.
It builds on GPT-3 because of GPT-3's substantial proficiency in generating Python
programs. Despite being considerably smaller than GPT-3, with 12 billion parameters,
Codex still exhibits remarkable performance.
GPT-Neo: GPT-Neo is a family of large language models trained on the Pile dataset.
These models, similar to GPT-3, are available in different sizes, including 125M,
1.3B, and 2.7B parameter versions.
The GPT-Neo 2.7B version, in particular, is a transformer model based on EleutherAI's
recreation of the GPT-3 architecture.
14. Existing Models
GPT-J: GPT-J, developed by EleutherAI, is an open-source model with 6 billion
parameters, trained on the Pile dataset.
It largely adheres to the GPT-2 architecture and stands out as the highest-performing
publicly available transformer language model in terms of zero-shot performance on a
range of downstream tasks.
CodeParrot: CodeParrot is a model based on GPT-2, with 1.5 billion parameters, that
has been fine-tuned on publicly accessible GitHub code for the purpose of generating
Python code.
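Because these models are publicly released, they can be used directly for code completion. A hedged usage sketch, assuming the codeparrot/codeparrot checkpoint name on the Hugging Face Hub; the same pattern applies to GPT-Neo or GPT-J.

from transformers import pipeline

# Checkpoint name is an assumption; substitute e.g. "EleutherAI/gpt-neo-2.7B".
generator = pipeline("text-generation", model="codeparrot/codeparrot")

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
completion = generator(prompt, max_new_tokens=48, do_sample=False)
print(completion[0]["generated_text"])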
15. Introduced model: PolyCoder
To overcome the challenges of available LLMs for code generation, a new model,
PolyCoder, is introduced. It has 2.7 billion parameters and is trained on a diverse
range of repositories sourced from GitHub, encompassing 12 distinct programming
languages, as shown in the table.
16. PolyCoder’s Training
PolyCoder uses the GPT-2 model architecture.
To investigate the effect of model-size scaling, it was trained at three different
model sizes: 2.7 billion, 400 million, and 160 million parameters, with the largest
2.7B model matching GPT-Neo's capacity to allow a fair comparison.
The 2.7 billion parameter model is a 32-layer, 2,560-dimensional Transformer with a
maximum context window of 2,048 tokens, and it was trained with a batch size of 128
sequences (262K tokens) for a total of 150K steps.
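For clarity, the stated hyperparameters can be written out as a GPT-2-style configuration. This is an illustrative sketch only: the head count is an assumption not given in the slides, and the original training used a GPT-NeoX-based setup rather than this exact config object.

from transformers import GPT2Config

polycoder_2_7b = GPT2Config(
    n_layer=32,        # 32 Transformer layers (from the slide)
    n_embd=2560,       # 2,560-dimensional hidden states (from the slide)
    n_positions=2048,  # maximum context window of 2,048 tokens (from the slide)
    n_head=32,         # assumed head count, not stated in the slides
)

# Stated training scale: 128 sequences per batch (~262K tokens) for 150K steps,
# i.e. roughly 128 * 2048 * 150,000 ≈ 39B tokens processed in total.
tokens_processed = 128 * 2048 * 150_000
print(f"{tokens_processed / 1e9:.1f}B tokens")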
17. PolyCoder’s Training
The following table compares design decisions and hyperparameters used in training
different models of code.
18. PolyCoder’s Training
The following figure shows the training and validation loss during the 150K-step
training process.
20. Results of Extrinsic Evaluations:
Among the current models, PolyCoder performs less effectively than the comparably
sized GPT-Neo and even the smaller Codex 300M. Overall, PolyCoder ranks after Codex
and GPT-Neo/J but outperforms CodeParrot.
Despite being trained exclusively on code, PolyCoder lags behind GPT-Neo 2.7B, a
model of similar size trained on the Pile, a mix of both code and natural language
texts.
This finding implies that future studies could profit from mixing code from diverse
programming languages with natural language text.
21. Results of Extrinsic Evaluations:
The following table shows the results of different models on the HumanEval benchmark,
along with the number of different types of tokens seen during training.
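HumanEval results are conventionally reported as pass@k. The sketch below shows the unbiased estimator from Chen et al. (2021): given n sampled completions per problem, of which c pass the unit tests, it estimates the probability that at least one of k samples would pass. The numbers in the example are made up for illustration, not results from this review.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0
    # Equivalent to 1 - C(n - c, k) / C(n, k), computed stably.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 100 samples per problem, 7 of them pass the tests.
print(round(pass_at_k(n=100, c=7, k=1), 2))   # ~0.07
print(round(pass_at_k(n=100, c=7, k=10), 2))  # ~0.53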
22. Results of Intrinsic Evaluations
Interestingly, PolyCoder surpasses Codex and all other models on the C language.
Considering only open-source models, PolyCoder outperforms the similarly sized
GPT-Neo 2.7B in C, JavaScript, Rust, Scala, and TypeScript.
In the remaining 11 languages other than C, all other open-source models, including
the newly introduced PolyCoder, exhibit significantly worse performance (higher
perplexity) than Codex.
This observation could imply that for languages where larger models do not yield
extra benefits, training solely on code might be sufficient or even slightly more
advantageous than training on a combination of natural language and code.
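The intrinsic metric here is perplexity: the exponential of the mean per-token negative log-likelihood on held-out code (lower is better). A minimal sketch of how it is computed, assuming PyTorch, the Hugging Face transformers library, and "gpt2" as a stand-in model; the actual evaluation used unseen code files for each of the 12 languages.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

code = 'int main(void) { printf("hello\\n"); return 0; }'
ids = tokenizer(code, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy over tokens.
    loss = model(ids, labels=ids).loss

print("perplexity:", torch.exp(loss).item())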
23. Conclusions
We have presented the results of a systematic evaluation of large language models for
code. The findings generally indicate that performance improves with larger models
and longer training.
Based on the results, we infer that GPT-Neo's superior performance over PolyCoder in
certain languages suggests that training on both natural language text and code can
enhance code modeling.
However, it is noteworthy that for the C programming language, PolyCoder outperforms
all models, including Codex, by achieving lower perplexity.