Competition-Level Code Generation with
AlphaCode
San Kim
2022.03.02
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Remi
Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert,
Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang,
Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme
Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu and Oriol
Vinyals
DeepMind
AlphaCode
• Recent large-scale language models still perform poorly when evaluated on more
complex, unseen problems that require problem-solving skills beyond simply
translating instructions into code.
• Competitive programming problems which require an understanding of algorithms
and complex natural language remain extremely challenging.
1. An extensive and clean competitive programming dataset for training and
evaluation
2. Large and efficient-to-sample transformer-based architectures
3. Large-scale model sampling to explore the search space, followed by filtering
based on program behavior to a small set of submissions
3 key components were critical to achieve good and reliable performance
AlphaCode: a system for code generation that can create novel solutions to problems that
require deeper reasoning.
Achieved on average a ranking in the top 54.3% in competitions with more than 5,000
participants (on the Codeforces platform)
(Figure: a competitive programming problem statement, a solution generated by AlphaCode, and AlphaCode's ranking)
AlphaCode
1. Pre-train a transformer-based language model on GitHub code with standard
language modeling objectives. This model can reasonably represent the space of
human coding, which greatly reduces the problem search space.
2. Fine-tune the model on our dataset of competitive programming data, using
GOLD with tempering as the training objective. This further reduces the search
space, and compensates for the small amount of competitive programming data
by leveraging pre-training
3. Generate a very large number of samples from our models for each problem.
4. Filter the samples to obtain a small set of candidate submissions (at most 10), to
be evaluated on the hidden test cases, by using the example tests and clustering
to pick samples based on program behaviour
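The sampling, filtering, and clustering steps above can be sketched end to end. This is a toy illustration, not the paper's pipeline: candidate programs are stand-in Python callables rather than generated source files executed in a sandbox, and `filter_and_cluster` is an illustrative name.

```python
# Toy sketch of steps 3-4 above (large-scale sampling, then filtering on
# example tests and clustering by program behaviour).

def filter_and_cluster(samples, example_tests, extra_inputs, budget=10):
    # Filtering: keep only samples that pass every example test.
    passing = [f for f in samples
               if all(f(x) == y for x, y in example_tests)]
    # Clustering: programs producing identical outputs on additional
    # (model-generated) inputs are treated as semantically equivalent;
    # submit one representative per cluster, up to the budget of 10.
    clusters = {}
    for f in passing:
        signature = tuple(f(x) for x in extra_inputs)
        clusters.setdefault(signature, []).append(f)
    return [group[0] for group in clusters.values()][:budget]

# The task is "double the input": two correct variants and one wrong sample.
samples = [lambda x: x * 2, lambda x: x + x, lambda x: x * 3]
picked = filter_and_cluster(samples, example_tests=[(2, 4)], extra_inputs=[5, 7])
print(len(picked))  # -> 1: the two correct variants fall into one cluster
```

The wrong sample fails the example test and is filtered; the two correct variants behave identically on the extra inputs, so only one submission slot is spent on them.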
Pretraining Dataset
• pre-training dataset is based on a snapshot of selected public GitHub
repositories taken on 2021/07/14.
• Included all code files from several popular languages: C++, C#, Go, Java,
JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript
• Filtered out all files larger than 1MB or with lines longer than 1,000
characters, to exclude automatically generated code.
• Removed duplicates of the same file, ignoring whitespace in comparisons.
• Final pre-training dataset contains a total of 715.1 GB of code.
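A minimal sketch of the filtering rules above (a file-size cap, a line-length cap, and whitespace-insensitive deduplication). The paper's actual pipeline is not public; function names are illustrative and the thresholds are parameters.

```python
import re

def keep_file(text: str, max_bytes: int = 1_000_000, max_line: int = 1000) -> bool:
    """Drop files over the size cap or containing an over-long line."""
    if len(text.encode("utf-8")) > max_bytes:
        return False
    return all(len(line) <= max_line for line in text.splitlines())

def dedup_key(text: str) -> str:
    """Duplicate detection that ignores whitespace in comparisons."""
    return re.sub(r"\s+", "", text)

files = ["int main() { return 0; }", "int  main()  {return 0;}", "x" * 2000]
seen, kept = set(), []
for f in files:
    if keep_file(f) and dedup_key(f) not in seen:
        seen.add(dedup_key(f))
        kept.append(f)
print(len(kept))  # -> 1: one whitespace-duplicate and one over-long file dropped
```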
CodeContests
• The dataset includes problems, solutions, and test cases (scraped from the Codeforces platform)
• To guard against data leakage, a strict temporal split is adopted:
• Training dataset: before 2021/07/14
• Validation dataset: between 2021/07/15 and 2021/09/20
• Test dataset: after 2021/09/21
• Training dataset: Newly scraped data (from Codeforces) + Description2Code(OpenAI) + CodeNet (IBM
Research)
• Validation and test splits: Newly scraped data (from Codeforces)
CodeContests
• Metadata: difficulty ratings (800-1600), tags (greedy, math, dp, brute force, graphs, etc.)
• Neither the difficulty rating nor the tags are visible at competition time (and so should not be used at
test time)
• The dataset also contains both correct and incorrect human submissions written in the most popular
submission languages: C++, Python, and Java.
• Tests
• Example tests (in problem statements)
• Hidden tests (that are made available at the evaluation result pages)
(Figure: a problem page showing the difficulty rating, tags, and example tests)
CodeContests
• Lack of test coverage leads to “false positives” and “slow positives”
• “false positives”: incorrect submissions are marked as correct
• “slow positives”: correct but algorithmically inefficient solutions that do not fulfill time and memory
constraints are marked correct
CodeContests
• Reduce false positive rates by generating additional test cases
• Mutating existing test inputs
• Bit flips to binary inputs
• Randomly incrementing or decrementing integers
• Swapping and changing characters in strings
• Mutated inputs are verified by running 30 correct solutions on them and checking that all solutions produce the
same output
• This process was run on each problem for a maximum of 10 CPU hours or 200 generated tests
• Because of complex input formats, they failed to generate the full set of 200 tests for 6.3% of problems
• They filtered out problems in the validation and test splits with insufficient test coverage, keeping only
problems with at least 5 hidden or generated test cases that result in at least 2 different outputs.
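The mutate-and-verify loop above can be illustrated as follows. This is a toy version: solutions are Python callables and inputs are lists of integers, whereas in AlphaCode the 30 correct human submissions are run in a sandbox against formatted text inputs.

```python
import random

def mutate(ints, rng):
    out = list(ints)
    i = rng.randrange(len(out))
    out[i] += rng.choice([-1, 1])        # randomly increment or decrement an integer
    return out

def generate_tests(seeds, solutions, n_tests, seed=0):
    rng = random.Random(seed)
    tests = []
    while len(tests) < n_tests:
        inp = mutate(rng.choice(seeds), rng)
        outs = {sol(inp) for sol in solutions}
        if len(outs) == 1:               # keep only if all correct solutions agree
            tests.append((inp, outs.pop()))
    return tests

# Toy problem: "output the maximum element". Both solutions are correct,
# so every mutated input yields an agreed-upon expected output.
sols = [lambda xs: max(xs), lambda xs: sorted(xs)[-1]]
tests = generate_tests([[1, 2, 3]], sols, n_tests=5)
print(len(tests))  # -> 5
```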
HumanEval dataset, Codex model
Evaluating Large Language Models Trained on Code
OpenAI, 2021
Generating short code snippets vs. generating entire programs
The difference in difficulty between generating short code snippets and entire programs can be
analogous to that of declarative versus imperative problem solving…
Generating short code snippets vs. generating entire programs
• Generating short code snippets typically amounts to translating the task specification
directly into code, and sometimes reduces to invoking the correct API calls.
• Generating entire programs often relies on understanding the task and figuring out
how to accomplish it, which requires deeper algorithmic reasoning.
APPS benchmark
Measuring Coding Challenge Competence With APPS
UC Berkeley, 2021
Model Architecture
• Encoder-decoder transformer architecture - The competitive programming code generation problem
can be viewed as a sequence-to-sequence translation task: given a problem description X in natural
language, produce a corresponding solution Y in a programming language
• Asymmetric architecture with 1536 tokens for the encoder but only 768 tokens for the decoder:
because problem descriptions are on average twice as long as their corresponding human solution
• A SentencePiece tokenizer with a vocabulary size of 8,000 tokens trained on a mix of GitHub and
CodeContests data.
Model Architecture
• Multi-query attention (to reduce the cost of sampling from their models): using a full set of query
heads but sharing key and value heads per attention block significantly reduces memory usage and
cache updates costs, which are the main bottleneck during sampling.
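A toy NumPy sketch of multi-query attention, not the paper's implementation: each head gets its own query projection, but all heads share a single key projection and a single value projection, so the key/value cache during sampling holds one head's worth of state instead of `n_heads`.

```python
import numpy as np

def multi_query_attention(x, n_heads=8, d_head=16, seed=0):
    rng = np.random.default_rng(seed)
    d_model = x.shape[-1]
    wq = rng.normal(size=(n_heads, d_model, d_head))  # one projection per head
    wk = rng.normal(size=(d_model, d_head))           # shared across all heads
    wv = rng.normal(size=(d_model, d_head))           # shared across all heads
    q = np.einsum("td,hde->hte", x, wq)               # (heads, T, d_head)
    k, v = x @ wk, x @ wv                             # (T, d_head) each
    scores = q @ k.T / np.sqrt(d_head)                # (heads, T, T)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)         # softmax over keys
    return np.einsum("hts,se->hte", weights, v)       # (heads, T, d_head)

out = multi_query_attention(np.ones((4, 32)))
print(out.shape)  # -> (8, 4, 16)
```

With full multi-head attention, `wk` and `wv` would each be `(n_heads, d_model, d_head)`; sharing them shrinks the cached keys and values by a factor of `n_heads`, which is the sampling bottleneck the slide describes.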
Model Architecture
Pre-training
• A standard cross-entropy next-token prediction loss: for decoder
• A masked language modeling: for encoder
• They split GitHub files by uniformly sampling pivot location, using content before the pivot as input
to the encoder, and content after for the decoder
• Base 1B parameter model was trained for 1 million steps with a batch size of 256
• They adjusted the amount of training for other model sizes such that larger models are trained more and
smaller models are trained less to optimize the use of compute
• Due to resource limitations and to make optimal use of compute, the training of largest 41B model was
stopped early
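The pivot split described above is simple enough to sketch directly: sample a pivot uniformly within the file, feed the prefix to the encoder, and train the decoder to predict the suffix. Function and variable names here are illustrative.

```python
import random

def pivot_split(file_text, rng):
    pivot = rng.randrange(len(file_text) + 1)    # uniformly sampled pivot location
    return file_text[:pivot], file_text[pivot:]  # (encoder input, decoder target)

rng = random.Random(42)
source = "def add(a, b):\n    return a + b\n"
enc_input, dec_target = pivot_split(source, rng)
print(enc_input + dec_target == source)  # -> True: the split loses nothing
```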
Fine-tuning
• Input
• Encoder: the natural language problem description
• Decoder: the program solution
• Loss
• Encoder: masked language modeling loss
• Decoder: standard next-token prediction
• Tempering
• Value conditioning & prediction
• GOLD
Fine-tuning - tempering
• Tempering is a regularization technique that makes the token probability distribution artificially
smoother or sharper at training time by dividing the output logits of a model by a scalar temperature
T before the softmax layer
• T<1 (training distribution sharper) → T'=1 (inference distribution smoother)
• T>1 (training distribution smoother) → T'=1 (inference distribution sharper)
Softmax Tempering for Training Neural Machine Translation Models (2020)
• Training time: T=0.2
• Sampling time:
• T’ = 0.12 (for models trained with tempering only)
• T’ = 0.25 (for models trained with tempering and GOLD)
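Tempering amounts to one extra division before the softmax. A minimal sketch:

```python
import numpy as np

def tempered_softmax(logits, T):
    z = np.asarray(logits, dtype=float) / T  # divide logits by temperature T
    z -= z.max()                             # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.0])
sharp = tempered_softmax(logits, T=0.2)   # training-time tempering (T < 1)
plain = tempered_softmax(logits, T=1.0)   # sampling at T' = 1
print(sharp[0] > plain[0])  # -> True: T < 1 concentrates mass on the top token
```

Training against the sharpened distribution and then sampling at a higher T' is what makes the inference-time distribution effectively smoother, as the slide notes.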
Fine-tuning – value conditioning & prediction
• Value conditioning
• Whether or not a submission was correct (always conditioned on "correct" at sampling time)
• Metadata (ratings, tags, language)
• Value prediction (not used during sampling)
• Auxiliary value prediction task during training
• The last layer token representations before projecting to logits are used in a small Transformer to
classify whether the submission is correct
Fine-tuning – GOLD
• Solving competitive programming problems from description is inherently a one-of-many task
• Standard maximum likelihood objectives minimize loss by putting some weight on each solution in the
training set (like recall), whereas their metric measures whether a model can find a single correct
solution in the submission attempt budget (like precision)
• A variation of GOLD is adopted (an offline RL algorithm which allows the model to both learn from
tokens it already assigns high likelihood to, and to ignore tokens that are not in its distribution)
• The model can concentrate on precision rather than recall, and increase its chance of getting at least
one correct sample
• They add a short training phase between pre-training and fine-tuning, during which they apply
tempering but crucially not GOLD. This allows the initial pre-trained distribution to transition to a
smoother distribution.
Text Generation by learning from Demonstrations, ICLR 2021
∇ℒ_GOLD(θ) = − Σ_{s ∈ Solution tokens} P̃_θ(s) ∇ log P_θ(s)
P̃_θ(s) ← max(P_θ(s)^α, β), with α = 0.5, β = 0.05
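The GOLD weighting described above can be computed per token: each token's log-likelihood gradient is scaled by max(P_θ(token)^α, β), so tokens the model already assigns high likelihood keep their learning signal while out-of-distribution tokens are mostly ignored. A sketch of just the weight computation (the surrounding training loop is omitted):

```python
import numpy as np

def gold_token_weights(token_probs, alpha=0.5, beta=0.05):
    # w = max(P^alpha, beta): likely tokens keep signal, unlikely ones are floored
    p = np.asarray(token_probs, dtype=float)
    return np.maximum(p ** alpha, beta)

probs = np.array([0.81, 0.09, 0.0004])
print(gold_token_weights(probs).round(2).tolist())  # -> [0.9, 0.3, 0.05]
```

The floor β keeps very unlikely tokens from contributing meaningfully, which is what lets the model focus on precision (one good solution) rather than recall over all training solutions.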
Large scale sampling
• To ensure sufficient diversity in such a large number of samples
• Generate half of the samples in Python and half in C++
• Randomize the problem tags and ratings in the natural language prompt
• Problem tags: picked random tags from the 50 most popular tags
• Ratings: uniformly in the range of 800 to 3500
• Use relatively high sampling temperature
• The optimal sampling temperature depends on the total number of samples (in general the
more samples, the higher the optimal temperature)
• (However) different temperatures in a wide range do not significantly change the solve rates
• Regular sampling with temperature
• Top-k and nucleus sampling: they did not observe significant performance improvements
Filtering and Clustering
• Filtering
• Filtering samples to only those that pass the example tests given in the problem statement
• Filtering removes approximately 99% of model samples
• Filtering can still leave tens of thousands of candidate samples for many problems
• Clustering
• Clustering the programs that are syntactically different but semantically equivalent
• Semantically equivalent programs could be detected if we had additional test inputs (grouping
programs that produce the same outputs together into clusters)
• Test input generation model
• Same architecture as their main models and initialized from the same GitHub pre-trained checkpoint
• This model was trained to predict test inputs from problem descriptions, using example,
hidden, and generated test inputs as training data
1. Codeforces competitions evaluation
• All codeforces competitions from 2021/12/01 to 2021/12/28 with more than 5,000 participants per
contest, a total of 10 competitions.
• The system was an ensemble of 41B and 9B models with clustering (but turned out to be slightly worse
than using the 41B model alone with clustering)
• First run: up to 10 submissions
• 2nd and 3rd runs: more than 10 submissions
1. Codeforces competitions evaluation
• C1591
• 3 problems: 10% ~ 37%
• 2 problems: 37% ~ 73%
1. Codeforces competitions evaluation
1. Codeforces competitions evaluation
2. CodeContests evaluation
• pass@k : The percentage of problems solved when they take k samples from the model for each
problem and submit all of them for evaluation on the hidden tests. If any solution in the specified sample
budget solves a problem, the problem is counted as solved
• 10@k : The percentage of problems solved when they take k samples from the model for each
problem but can only submit 10 of them for evaluation on the hidden tests
• Differences in solve rates between the validation and test sets are caused by variation in problem
distributions (as the test set and validation set were collected in temporally disjoint periods), as well as
some overfitting
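The two metrics above differ only in how many of the k drawn samples may be submitted. A toy illustration, where `picked_indices` stands in for whatever selection procedure (random, or filtering plus clustering) chooses the 10 submissions:

```python
def solved_pass_at_k(correct, k):
    # pass@k: submit all k samples; solved if any is correct.
    return any(correct[:k])

def solved_n_at_k(correct, k, picked_indices):
    # 10@k: draw k samples but submit only the picked ones (at most 10).
    return any(correct[i] for i in picked_indices)

# Toy problem: only the 42nd of 1000 samples is correct.
correct = [i == 41 for i in range(1000)]
print(solved_pass_at_k(correct, 1000))          # -> True
print(solved_n_at_k(correct, 1000, range(10)))  # -> False: the budget missed it
```

This is why sample selection matters: 10@k only approaches pass@k when filtering and clustering reliably surface a correct sample among the 10 submissions.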
2. CodeContests evaluation
• Solve rates scale log-linearly with more samples
• Scaling up the total number of samples leads to massive improvements in model solve rate
• Improving solve rate requires exponentially increasing amounts of samples
• Better models have higher slopes in the scaling curve
• Larger models tend to have better model quality
2. CodeContests evaluation
• Solve rates scale log-linearly with more compute
• (a) The solve rate also scales approximately log-linearly with more training compute
• (b) larger models take more compute to draw each sample, but they eventually outperform smaller
models even with the same sampling compute as the better quality of samples from the larger
models become the dominant factor for performance
2. CodeContests evaluation
• The decoder-only model was trained with the same amount of compute
• Due to the significantly longer decoder sequence length (2304 tokens), with the same amount of
training compute it consumes 50% more loss tokens than training the encoder-decoder model
• Asymmetric encoder-decoder with multi-query attention significantly improves the sampling speed
while keeping the sample quality
1. An encoder-decoder architecture with asymmetric encoder and decoder structures
2. The multi-query attention setup, which uses a shared attention head for keys and one for values in each
block
2. CodeContests evaluation
• Choice of the pre-training dataset
• Base 1B model
• GitHub (all language)
• The Python-only portion of GitHub
• The MassiveText(Gopher) generic text dataset which also includes a portion of GitHub
• Not pre-trained at all
2. CodeContests evaluation
• Model enhancement
2. CodeContests evaluation
• These percentages are calculated based on the full set of samples
• Overall less than 1% of samples from the models pass example tests
2. CodeContests evaluation
1. Randomly picking model samples without filtering
2. Filtering and then randomly selecting from the filtered samples
3. Filtering and then using clustering to select samples
4. Allowing unlimited evaluation attempts (the upper bound performance)
3. APPS
• Fine-tuned the pre-trained models on the APPS training set without using clustering, tags, ratings, value
conditioning, or prediction, and with sampling temperature 0.25 and nucleus sampling
Copying from training data
• 50 C++ and 50 Python solutions from a selection of 16 validation problems that had that many solutions
• The common substrings between model solutions and training data mostly contained boilerplate code
for reading and parsing the input data format, rather than key logic for solving problems (for example, a
FastIO Python class has length 805 and appears in 1.66% of all human Python solutions)
Model solution characteristics
Sensitivity to problem descriptions
Sensitivity to problem descriptions (simplification of problem descriptions)
Sensitivity to problem descriptions (incorrect or irrelevant rewording)
• Original: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of a_l times a_ {l + 1} for an integer l
for which 1 <= l < n.
• Opposite: You are given n integers a_1 , a_2 , ... , a_n . Find the minimum value of a_l times a_ {l + 1} for an integer l
for which 1 <= l < n.
• Related: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of ( a_l . a_r ) over all pairs (l , r) of
integers for which 1 <= l < r <= n.
• Underspecified: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of f( a_l , a_r ) over all pairs (l ,
r ) of integers for which 1 <= l < r <= n.
• Verbose: William has been given an array for his birthday , which consists of n integers a_1 , a_2 , ... , a_n . He is very
proud of his array , but naturally his friend Mary is curious about it . Mary would like to know a certain function of
consecutive elements of the array . Concretely , Mary would like to know the maximum value of a_l times a_ {l + 1} for
an integer l for which 1 <= l < n. Can you help William by calculating this value for him ?
• Algorithm described in words: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of the product
of two consecutive members of the array .
Sensitivity to problem descriptions (variable renaming)
Sensitivity to problem descriptions (variable renaming)
Sensitivity to problem descriptions (typos, synonym substitution, word permutation , deletion)
Sensitivity to problem descriptions (description section ablations: Description, Specification, IO)
Sensitivity to provided metadata
Loss is a poor proxy for solve rate
Solve rates conditioned on different ratings
Solve rates conditioned on different ratings
Conclusion
Text Generation by Learning from Demonstration NYU, 2021
The Curious Case of Neural Text Degeneration AI2, 2020
GPT Understands, Too Tsinghua Univ., 2021
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

Competition-Level Code Generation with AlphaCode: A System for Solving Competitive Programming Problems

  • 1. Competition-Level Code Generation with AlphaCode San Kim 2022.03.02 Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Remi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu and Oriol Vinyals DeepMind
  • 2. Alpha Code • Recent large-scale language models still perform poorly when evaluated on more complex, unseen problems that require problem-solving skills beyond simply translating instructions into code. • Competitive programming problems, which require an understanding of algorithms and complex natural language, remain extremely challenging. 1. An extensive and clean competitive programming dataset for training and evaluation 2. Large and efficient-to-sample transformer-based architectures 3. Large-scale model sampling to explore the search space, followed by filtering based on program behavior down to a small set of submissions Three key components were critical to achieving good and reliable performance. AlphaCode: a system for code generation that can create novel solutions to problems that require deeper reasoning. Achieved on average a ranking in the top 54.3% in competitions with more than 5,000 participants (on the Codeforces platform)
  • 4. A solution generated by AlphaCode
  • 6. Alpha Code 1. Pre-train a transformer-based language model on GitHub code with standard language modeling objectives. This model can reasonably represent the space of human coding, which greatly reduces the problem search space. 2. Fine-tune the model on our dataset of competitive programming data, using GOLD with tempering as the training objective. This further reduces the search space, and compensates for the small amount of competitive programming data by leveraging pre-training 3. Generate a very large number of samples from our models for each problem. 4. Filter the samples to obtain a small set of candidate submissions (at most 10), to be evaluated on the hidden test cases, by using the example tests and clustering to pick samples based on program behaviour
  • 7. Pretraining Dataset • pre-training dataset is based on a snapshot of selected public GitHub repositories taken on 2021/07/14. • Included all code files from several popular languages: C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript • Filtered out all files larger than 1MB or with lines longer than 100 characters, to exclude automatically generated code. • Removed duplicates of the same file, ignoring whitespace in comparisons. • Final pre-training dataset contains a total of 715.1 GB of code.
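The file-filtering and deduplication rules above can be sketched as follows (a rough illustration with hypothetical helper names, not the actual pipeline code):

```python
import hashlib

MAX_FILE_BYTES = 1_000_000   # filter out files larger than 1 MB
MAX_LINE_CHARS = 100         # and files with any line longer than 100 chars

def keep_file(content: str) -> bool:
    """Heuristic filter to exclude (likely auto-generated) code files."""
    if len(content.encode("utf-8")) > MAX_FILE_BYTES:
        return False
    return all(len(line) <= MAX_LINE_CHARS for line in content.splitlines())

def dedup_key(content: str) -> str:
    """Duplicate detection ignores whitespace in comparisons."""
    stripped = "".join(content.split())
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()

def build_corpus(files):
    """Keep each whitespace-normalized file at most once."""
    seen, corpus = set(), []
    for content in files:
        if not keep_file(content):
            continue
        key = dedup_key(content)
        if key not in seen:
            seen.add(key)
            corpus.append(content)
    return corpus
```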
  • 8. CodeContests • The dataset includes problems, solutions and test cases (scraped from the Codeforces platform) • To guard against data leakage, a strict temporal split is adopted: • Training dataset: before 2021/07/14 • Validation dataset: between 2021/07/15 and 2021/09/20 • Test dataset: after 2021/09/21 • Training dataset: newly scraped data (from Codeforces) + Description2Code (OpenAI) + CodeNet (IBM Research) • Validation and test splits: newly scraped data (from Codeforces)
  • 9. CodeContests • Metadata: difficulty ratings (800-1600), tags (greedy, math, dp, brute force, graphs, etc.) • Neither the difficulty rating nor the tags are visible at competition time (and so should not be used at test time) • The dataset also contains both correct and incorrect human submissions written in the most popular submission languages: C++, Python, and Java. • Tests • Example tests (in problem statements) • Hidden tests (that are made available at the evaluation result pages) Difficulty rating tags Example tests
  • 10. CodeContests • Lack of test coverage leads to “false positives” and “slow positives” • “false positives”: incorrect submissions are marked as correct • “slow positives”: correct but algorithmically inefficient solutions that do not fulfill time and memory constraints are marked correct
  • 11. CodeContests • Reduce false positive rates by generating additional test cases • Mutating existing test inputs • Bit flips to binary inputs • Randomly incrementing or decrementing integers • Swapping and changing characters in strings • Mutated inputs are verified by running 30 correct solutions on them and checking that all solutions produce the same output • This process was run on each problem for a maximum of 10 CPU hours or 200 generated tests • Because of complex input formats, they failed to generate the full set of 200 tests for 6.3% of problems • They filtered out problems in the validation and test splits with insufficient test coverage, keeping only problems with at least 5 hidden or generated test cases that result in at least 2 different outputs.
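The mutate-and-verify loop above might look like this sketch (simplified: bit flips for binary inputs are omitted, and `run`-style callables stand in for executing real correct solutions):

```python
import random

def mutate_input(test_input: str, rng: random.Random) -> str:
    """Mutate an existing test input: increment/decrement integer tokens,
    or swap adjacent characters in non-numeric tokens (a simplified sketch)."""
    out = []
    for tok in test_input.split():
        if tok.lstrip("-").isdigit():
            out.append(str(int(tok) + rng.choice([-1, 1])))
        elif len(tok) >= 2 and rng.random() < 0.5:
            i = rng.randrange(len(tok) - 1)
            out.append(tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:])
        else:
            out.append(tok)
    return " ".join(out)

def verify_mutation(mutated: str, correct_solutions) -> bool:
    """A mutated input is kept only if all correct solutions agree on it."""
    outputs = {run(mutated) for run in correct_solutions}
    return len(outputs) == 1
```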
  • 12. HumanEval dataset, Codex model Evaluating Large Language Models Trained on Code OpenAI, 2021
  • 13. Generating short code snippets vs. generating entire programs The difference in difficulty between generating short code snippets and entire programs can be analogous to that of declarative versus imperative problem solving…
  • 14. Generating short code snippets vs. generating entire programs • Generating short code snippets typically amounts to translating the task specification directly into code, and sometimes reduces to invoking the correct API calls. • Generating entire programs often relies on understanding the task and figuring out how to accomplish it, which requires deeper algorithmic reasoning.
  • 15. APPS benchmark Measuring Coding Challenge Competence With APPS UC Berkeley, 2021
  • 16. Model Architecture • Encoder-decoder transformer architecture - The competitive programming code generation problem can be viewed as a sequence-to-sequence translation task: given a problem description X in natural language, produce a corresponding solution Y in a programming language • Asymmetric architecture with 1536 tokens for the encoder but only 768 tokens for the decoder: because problem descriptions are on average twice as long as their corresponding human solution • A SentencePiece tokenizer with a vocabulary size of 8,000 tokens trained on a mix of GitHub and CodeContests data.
  • 17. Model Architecture • Multi-query attention (to reduce the cost of sampling from their models): using a full set of query heads but sharing key and value heads per attention block significantly reduces memory usage and cache update costs, which are the main bottleneck during sampling.
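A minimal sketch of multi-query attention, assuming NumPy is available: each head gets its own queries, but all heads share a single key head and a single value head, so the KV cache shrinks by a factor of the head count:

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, n_heads):
    """Multi-query attention sketch: per-head queries, one shared
    key head and one shared value head of size d_head."""
    T, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ Wq).reshape(T, n_heads, d_head)   # (T, h, d_head)
    k = x @ Wk                                 # (T, d_head), shared across heads
    v = x @ Wv                                 # (T, d_head), shared across heads
    scores = np.einsum("thd,sd->hts", q, k) / np.sqrt(d_head)
    # causal mask so each position attends only to itself and the past
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.einsum("hts,sd->thd", w, v).reshape(T, n_heads * d_head)
    return out
```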
  • 19. Pre-training • A standard cross-entropy next-token prediction loss: for the decoder • A masked language modeling loss: for the encoder • They split GitHub files by uniformly sampling a pivot location, using content before the pivot as input to the encoder, and content after for the decoder • The base 1B-parameter model was trained for 1 million steps with a batch size of 256 • They adjusted the amount of training for other model sizes such that larger models are trained more and smaller models are trained less, to optimize the use of compute • Due to resource limitations and to make optimal use of compute, the training of the largest 41B model was stopped early
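The pivot-based splitting of a GitHub file might be sketched like this (character-level for simplicity; the real pipeline works on tokens with the asymmetric 1536/768 windows from the architecture slide):

```python
import random

def make_pretraining_example(file_content: str, rng: random.Random,
                             enc_len: int = 1536, dec_len: int = 768):
    """Split a file at a uniformly sampled pivot: content before the
    pivot feeds the encoder (MLM loss), content after feeds the decoder
    (next-token prediction), each truncated to its window."""
    pivot = rng.randrange(len(file_content) + 1)
    encoder_input = file_content[:pivot][-enc_len:]
    decoder_target = file_content[pivot:][:dec_len]
    return encoder_input, decoder_target
```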
  • 20. Fine-tuning • Input • Encoder: the natural language problem description • Decoder: the program solution • Loss • Encoder: masked language modeling loss • Decoder: standard next-token prediction • Tempering • Value conditioning & prediction • GOLD
  • 21. Fine-tuning - tempering • Tempering is a regularization technique that makes the token probability distribution artificially smoother or sharper at training time by dividing the output logits of a model by a scalar temperature T before the softmax layer • T < 1 (training distribution sharper) → T’ = 1 (inference distribution smoother) • T > 1 (training distribution smoother) → T’ = 1 (inference distribution sharper) Softmax Tempering for Training Neural Machine Translation Models (2020) • Training time: T = 0.2 • Sampling time: • T’ = 0.12 (for models trained with tempering only) • T’ = 0.25 (for models trained with tempering and GOLD)
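Tempering itself is a one-line change to the softmax; a minimal sketch:

```python
import math

def tempered_softmax(logits, T):
    """Divide logits by temperature T before the softmax.
    T < 1 sharpens the distribution; T > 1 smooths it."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```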
  • 22. Fine-tuning – value conditioning & prediction • Value conditioning • Whether or not a submission was correct (always correct at sampling time) • Metadata (ratings, tags, language) • Value prediction (not used during sampling) • Auxiliary value prediction task during training • The last layer token representations before projecting to logits are used in a small Transformer to classify whether the submission is correct
  • 23. Fine-tuning – GOLD • Solving competitive programming problems from description is inherently a one-of-many task • Standard maximum likelihood objectives minimize loss by putting some weight on each solution in the training set (like recall), whereas their metric measures whether a model can find a single correct solution in the submission attempt budget (like precision) • A variation of GOLD is adopted (an offline RL algorithm which allows the model to both learn from tokens it already assigns high likelihood to, and to ignore tokens that are not in its distribution) • The model can concentrate on precision rather than recall, and increase its chance of getting at least one correct sample • They add a short training phase between pre-training and fine-tuning, during which they apply tempering but crucially not GOLD. This allows the initial pre-trained distribution to transition to a smoother distribution. Text Generation by Learning from Demonstrations, ICLR 2021 ∇L_GOLD(θ) = − Σ_{s ∈ solution tokens} P̃_θ(s) ∇log P_θ(s), where P̃_θ(s) ← max(P_θ(s)^α, β), α = 0.5, β = 0.05
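The per-token reweighting can be illustrated with a small sketch (treating the weight P̃_θ(s) as a constant with respect to θ, as the offline-RL formulation does; helper names are hypothetical):

```python
import math

ALPHA, BETA = 0.5, 0.05  # values from the slide

def gold_token_weight(p_token: float) -> float:
    """GOLD weights each solution token's log-likelihood gradient by the
    model's own (clipped, tempered) probability for it:
    w = max(P_theta(s)^alpha, beta). Tokens the model already finds
    unlikely are nearly ignored, favoring precision over recall."""
    return max(p_token ** ALPHA, BETA)

def gold_loss(token_probs):
    """-sum_s w(s) * log P_theta(s) over solution tokens (a sketch)."""
    return -sum(gold_token_weight(p) * math.log(p) for p in token_probs)
```

Compared with maximum likelihood, a very low-probability token contributes only β times its usual loss, so the model is not forced to cover every reference solution.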
  • 24. Large scale sampling • To ensure sufficient diversity in such a large number of samples • Generate half of the samples in Python and half in C++ • Randomize the problem tags and ratings in the natural language prompt • Problem tags: picked random tags from the 50 most popular tags • Ratings: uniformly in the range of 800 to 3500 • Use a relatively high sampling temperature • The optimal sampling temperature depends on the total number of samples (in general, the more samples, the higher the optimal temperature) • (However) different temperatures in a wide range do not significantly change the solve rates • Regular sampling with temperature • Top-k and nucleus sampling: they did not observe significant performance improvements
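The prompt randomization might be sketched as follows (the tag list here is a small illustrative subset of the top 50, and all field names are hypothetical):

```python
import random

POPULAR_TAGS = ["greedy", "math", "dp", "brute force", "graphs"]  # subset of top 50

def randomized_prompt(description: str, rng: random.Random) -> dict:
    """Randomize the metadata in the conditioning prompt to diversify
    samples: random tags, a rating drawn uniformly from 800-3500, and
    a coin flip between the two target languages (a sketch)."""
    return {
        "description": description,
        "language": rng.choice(["python", "cpp"]),
        "tags": rng.sample(POPULAR_TAGS, k=rng.randint(1, 3)),
        "rating": rng.randint(800, 3500),
        "correct": True,  # always condition on "correct" at sampling time
    }
```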
  • 25. Filtering and Clustering • Filtering • Filtering samples to only those that pass the example tests given in the problem statement • Filtering removes approximately 99% of model samples • Filtering can still leave tens of thousands of candidate samples for many problems • Clustering • Clustering the programs that are syntactically different but semantically equivalent • Semantically equivalent programs could be detected if we had additional test inputs (grouping programs that produce the same outputs together into clusters) • Test input generation model • Same architecture as their main models and initialized from the same GitHub pre-trained checkpoint • This model was trained to predict test inputs from problem descriptions, using example, hidden, and generated test inputs as training data
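The behavioral clustering step can be sketched as follows (with `run` a hypothetical helper that executes a candidate program on one input):

```python
from collections import defaultdict

def cluster_by_behavior(programs, test_inputs, run):
    """Group candidate programs by their outputs on generated test inputs:
    programs that behave identically on every input land in one cluster,
    so one submission per cluster covers each distinct behavior."""
    clusters = defaultdict(list)
    for prog in programs:
        signature = tuple(run(prog, inp) for inp in test_inputs)
        clusters[signature].append(prog)
    # largest clusters first: popular behaviors are more likely correct
    return sorted(clusters.values(), key=len, reverse=True)
```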
  • 26. 1. Codeforces competitions evaluation • All codeforces competitions from 2021/12/01 to 2021/12/28 with more than 5,000 participants per contest, a total of 10 competitions. • The system was an ensemble of 41B and 9B models with clustering (but turned out to be slightly worse than using the 41B model alone with clustering) • First run: up to 10 submissions • 2nd ~3rd run: more than 10 submissions
  • 27. 1. Codeforces competitions evaluation • C1591 • 3 problems: 10% ~ 37% • 2 problems: 37% ~ 73%
  • 30. 2. CodeContests evaluation • pass@k: the percentage of problems solved when they take k samples from the model for each problem and submit all of them for evaluation on the hidden tests. If any solution in the specified sample budget solves a problem, the problem is counted as solved • 10@k: the percentage of problems solved when they take k samples from the model for each problem but can only submit 10 of them for evaluation on the hidden tests • Differences in solve rates between the validation and test sets are caused by variation in problem distributions (as the test set and validation set were collected in temporally disjoint periods), as well as some overfitting
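pass@k is typically computed with the unbiased estimator introduced alongside the HumanEval/Codex work cited earlier; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k from n samples of which c are correct:
    pass@k = 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset contains no correct sample."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```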
  • 31. 2. CodeContests evaluation • Solve rates scale log-linearly with more samples • Scaling up the total number of samples leads to massive improvements in model solve rate • Improving solve rate requires exponentially increasing amounts of samples • Better models have higher slopes in the scaling curve • Larger models tend to have better model quality
  • 32. 2. CodeContests evaluation • Solve rates scale log-linearly with more compute • (a) The solve rate also scales approximately log-linearly with more training compute • (b) larger models take more compute to draw each sample, but they eventually outperform smaller models even with the same sampling compute as the better quality of samples from the larger models become the dominant factor for performance
  • 33. 2. CodeContests evaluation • The decoder-only model was trained with the same amount of compute • Due to the significantly longer decoder sequence length (2304 tokens), with the same amount of training compute it consumes 50% more loss tokens than training the encoder-decoder model • The asymmetric encoder-decoder with multi-query attention significantly improves the sampling speed while keeping the sample quality 1. An encoder-decoder architecture with asymmetric encoder and decoder structures 2. The multi-query attention setup, which uses one shared attention head for keys and one for values in each block
  • 34. 2. CodeContests evaluation • Choice of the pre-training dataset • Base 1B model • GitHub (all language) • The Python-only portion of GitHub • The MassiveText(Gopher) generic text dataset which also includes a portion of GitHub • Not pre-trained at all
  • 35. 2. CodeContests evaluation • Model enhancement
  • 36. 2. CodeContests evaluation • These percentages are calculated based on the full set of samples • Overall less than 1% of samples from the models pass example tests
  • 37. 2. CodeContests evaluation 1. Randomly picking model samples without filtering 2. Filtering and then randomly selecting from the filtered samples 3. Filtering and then using clustering to select samples 4. Allowing unlimited evaluation attempts (the upper bound on performance)
  • 38. 3. APPS • Fine-tuned the pre-trained models on the APPS training set without using clustering, tags, ratings, value conditioning, or prediction, and with sampling temperature 0.25 and nucleus sampling
  • 39. Copying from training data • 50 C++ and 50 Python solutions from a selection of 16 validation problems that had that many solutions • The common substrings between model solutions and training data mostly contained boilerplate code for reading and parsing the input data format, rather than key logic for solving problems (for example, a FastIO Python class has length 805 and appears in 1.66% of all human Python solutions)
  • 41. Sensitivity to problem descriptions
  • 42. Sensitivity to problem descriptions (simplification of problem descriptions)
  • 43. Sensitivity to problem descriptions (incorrect or irrelevant rewording) • Original: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of a_l times a_ {l + 1} for an integer l for which 1 <= l < n. • Opposite: You are given n integers a_1 , a_2 , ... , a_n . Find the minimum value of a_l times a_ {l + 1} for an integer l for which 1 <= l < n. • Related: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of ( a_l . a_r ) over all pairs (l , r) of integers for which 1 <= l < r <= n. • Underspecified: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of f( a_l , a_r ) over all pairs (l , r ) of integers for which 1 <= l < r <= n. • Verbose: William has been given an array for his birthday , which consists of n integers a_1 , a_2 , ... , a_n . He is very proud of his array , but naturally his friend Mary is curious about it . Mary would like to know a certain function of consecutive elements of the array . Concretely , Mary would like to know the maximum value of a_l times a_ {l + 1} for an integer l for which 1 <= l < n. Can you help William by calculating this value for him ? • Algorithm described in words: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of the product of two consecutive members of the array .
  • 44. Sensitivity to problem descriptions (variable renaming)
  • 45. Sensitivity to problem descriptions (variable renaming)
  • 46. Sensitivity to problem descriptions (typos, synonym substitution, word permutation, deletion)
  • 47. Sensitivity to problem descriptions (description section ablations) Description Specification IO
  • 49. Loss is a poor proxy for solve rate
  • 50. Solve rates conditioned on different ratings
  • 51. Solve rates conditioned on different ratings
  • 53. Text Generation by Learning from Demonstrations NYU, 2021 (figure-only slides 53–61)
  • 62. The Curious Case of Neural Text Degeneration AI2, 2020 (figure-only slides 62–69)
  • 70. GPT Understands, Too Tsinghua Univ., 2021

Editor's Notes

  1. Add irrelevant information to the description, or reword it to be under-specified or incorrect. The model conditions strongly on the description. The model is able to parse algorithm descriptions from either symbols or from natural language, and can ignore irrelevant natural language explanations; indeed, the model actually does better with more language-heavy descriptions, perhaps because of the verbose nature of the training distribution.