Competition-Level Code Generation with
AlphaCode
San Kim
2022.03.02
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Remi
Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert,
Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang,
Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme
Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu and Oriol
Vinyals
DeepMind
AlphaCode
• Recent large-scale language models still perform poorly when evaluated on more
complex, unseen problems that require problem-solving skills beyond simply
translating instructions into code.
• Competitive programming problems which require an understanding of algorithms
and complex natural language remain extremely challenging.
1. An extensive and clean competitive programming dataset for training and
evaluation
2. Large and efficient-to-sample transformer-based architectures
3. Large-scale model sampling to explore the search space, followed by filtering
based on program behavior to a small set of submissions
3 key components were critical to achieve good and reliable performance
AlphaCode: a system for code generation that can create novel solutions to problems that
require deeper reasoning.
Achieved on average a ranking in the top 54.3% in competitions with more than 5,000
participants (on the Codeforces platform)
Competitive programming problem statement
A solution generated by AlphaCode
AlphaCode’s ranking
AlphaCode
1. Pre-train a transformer-based language model on GitHub code with standard
language modeling objectives. This model can reasonably represent the space of
human coding, which greatly reduces the problem search space.
2. Fine-tune the model on our dataset of competitive programming data, using
GOLD with tempering as the training objective. This further reduces the search
space, and compensates for the small amount of competitive programming data
by leveraging pre-training.
3. Generate a very large number of samples from our models for each problem.
4. Filter the samples to obtain a small set of candidate submissions (at most 10), to
be evaluated on the hidden test cases, by using the example tests and clustering
to pick samples based on program behaviour
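The last two steps above (filtering on example tests, then clustering by program behaviour) can be sketched end to end. This is a minimal sketch in which generated "programs" are stand-in Python callables rather than generated source executed in a sandbox, and the selection heuristic (one submission from each of the largest clusters) is an assumption about the details, not the paper's exact procedure:

```python
from collections import defaultdict

def pick_submissions(samples, example_tests, clustering_inputs, budget=10):
    # Step 4a: keep only samples that pass the example tests from the statement.
    # example_tests is a list of (input, expected_output) pairs.
    passing = [p for p in samples
               if all(p(x) == y for x, y in example_tests)]
    # Step 4b: cluster semantically equivalent programs, i.e. programs that
    # produce identical outputs on extra (generated) test inputs.
    clusters = defaultdict(list)
    for p in passing:
        signature = tuple(p(x) for x in clustering_inputs)
        clusters[signature].append(p)
    # Submit one representative from each of the largest clusters, up to budget.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:budget]]
```

Two syntactically different but behaviourally identical samples end up in one cluster, so only one of them spends a submission slot.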
Pretraining Dataset
• pre-training dataset is based on a snapshot of selected public GitHub
repositories taken on 2021/07/14.
• Included all code files from several popular languages: C++, C#, Go, Java,
JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript
• Filtered out all files larger than 1MB or with lines longer than 100
characters, to exclude automatically generated code.
• Removed duplicates of the same file, ignoring whitespace in comparisons.
• Final pre-training dataset contains a total of 715.1 GB of code.
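The stated filters translate directly into code. A minimal sketch; the constant names and the exact 1 MiB byte cutoff are assumptions:

```python
import re

MAX_FILE_BYTES = 1 * 1024 * 1024   # "larger than 1MB" (exact cutoff assumed)
MAX_LINE_CHARS = 100               # "lines longer than 100 characters"

def keep_file(text: str) -> bool:
    """True if a source file survives the size and line-length filters."""
    if len(text.encode("utf-8")) > MAX_FILE_BYTES:
        return False
    if any(len(line) > MAX_LINE_CHARS for line in text.splitlines()):
        return False
    return True

def dedup(files):
    """Drop duplicates of the same file, ignoring whitespace in comparisons."""
    seen, kept = set(), []
    for f in files:
        key = re.sub(r"\s+", "", f)  # whitespace-insensitive fingerprint
        if key not in seen:
            seen.add(key)
            kept.append(f)
    return kept
```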
CodeContests
• The dataset includes problems, solutions, and test cases (scraped from the Codeforces platform)
• To guard against data leakage, a strict temporal split is adopted:
• Training dataset: before 2021/07/14
• Validation dataset: between 2021/07/15 and 2021/09/10
• Test dataset: after 2021/09/21
• Training dataset: Newly scraped data (from Codeforces) + Description2Code(OpenAI) + CodeNet (IBM
Research)
• Validation and test splits: Newly scraped data (from Codeforces)
CodeContests
• Metadata: difficulty ratings (800-1600), tags (greedy, math, dp, brute force, graphs, etc.)
• Neither the difficulty rating nor the tags are visible at competition time (and so should not be used at
test time)
• The dataset also contains both correct and incorrect human submissions written in the most popular
submission languages: C++, Python, and Java.
• Tests
• Example tests (in problem statements)
• Hidden tests (that are made available at the evaluation result pages)
Difficulty rating
tags
Example tests
CodeContests
• Lack of test coverage leads to “false positives” and “slow positives”
• “false positives”: incorrect submissions are marked as correct
• “slow positives”: correct but algorithmically inefficient solutions that do not fulfill time and memory
constraints are marked correct
CodeContests
• Reduce false positive rates by generating additional test cases
• Mutating existing test inputs
• Bit flips to binary inputs
• Randomly incrementing or decrementing integers
• Swapping and changing characters in strings
• Mutated inputs are verified by running 30 correct solutions on them and checking that all solutions produce the
same output
• This process was run on each problem for a maximum of 10 CPU hours or 200 generated tests
• Because of complex input formats, they failed to generate the full set of 200 tests for 6.3% of problems
• They filtered out problems in the validation and test splits with insufficient test coverage, keeping only
problems with at least 5 hidden or generated test cases that result in at least 2 different outputs.
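The mutation-and-verification loop above can be sketched for integer and string tokens (bit flips on binary inputs are omitted for brevity). `mutate_input` and the callable "solutions" are illustrative assumptions, not the paper's implementation:

```python
import random

def mutate_input(tokens, rng):
    """One random mutation of a whitespace-tokenised test input:
    increment/decrement an integer, or swap two characters in a string."""
    tokens = list(tokens)
    i = rng.randrange(len(tokens))
    t = tokens[i]
    if t.lstrip("-").isdigit():
        tokens[i] = str(int(t) + rng.choice([-1, 1]))
    elif len(t) > 1:
        a, b = rng.sample(range(len(t)), 2)
        chars = list(t)
        chars[a], chars[b] = chars[b], chars[a]
        tokens[i] = "".join(chars)
    return tokens

def verified(mutated_input, correct_solutions):
    """Keep a mutated input only if all correct solutions agree on its output
    (here, solutions are Python callables standing in for real submissions)."""
    outputs = {sol(mutated_input) for sol in correct_solutions}
    return len(outputs) == 1
```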
HumanEval dataset, Codex model
Evaluating Large Language Models Trained on Code
OpenAI, 2021
Generating short code snippets vs. generating entire programs
The difficulty gap between generating short code snippets and generating entire programs is
analogous to the difference between declarative and imperative problem solving.
Generating short code snippets vs. generating entire programs
• Generating short code snippets typically amounts to translating the task specification
directly into code, and sometimes reduces to invoking the correct API calls.
• Generating entire programs often relies on understanding the task and figuring out
how to accomplish it, which requires deeper algorithmic reasoning.
APPS benchmark
Measuring Coding Challenge Competence With APPS
UC Berkeley, 2021
Model Architecture
• Encoder-decoder transformer architecture - The competitive programming code generation problem
can be viewed as a sequence-to-sequence translation task: given a problem description X in natural
language, produce a corresponding solution Y in a programming language
• Asymmetric architecture with 1536 tokens for the encoder but only 768 tokens for the decoder:
because problem descriptions are on average twice as long as their corresponding human solutions
• A SentencePiece tokenizer with a vocabulary size of 8,000 tokens trained on a mix of GitHub and
CodeContests data.
Model Architecture
• Multi-query attention (to reduce the cost of sampling from their models): using a full set of query
heads but sharing key and value heads per attention block significantly reduces memory usage and
cache update costs, which are the main bottleneck during sampling.
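A minimal NumPy sketch of multi-query attention (single sequence, no masking or incremental caching; the weight shapes are illustrative assumptions). The point is that keys and values are projected once, to a single head shared by all query heads, so the K/V cache per token shrinks from `num_heads * d_head` to `d_head`:

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, num_heads):
    """x: (T, d_model); Wq: (d_model, d_model); Wk, Wv: (d_model, d_head)."""
    T, d_model = x.shape
    d_head = d_model // num_heads
    q = (x @ Wq).reshape(T, num_heads, d_head)   # full set of query heads
    k = x @ Wk                                   # ONE shared key head (T, d_head)
    v = x @ Wv                                   # ONE shared value head (T, d_head)
    scores = np.einsum("thd,sd->ths", q, k) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # softmax over source positions
    out = np.einsum("ths,sd->thd", weights, v)   # every head reads the same V
    return out.reshape(T, num_heads * d_head)
```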
Model Architecture
Pre-training
• A standard cross-entropy next-token prediction loss for the decoder
• A masked language modeling loss for the encoder
• They split GitHub files by uniformly sampling pivot location, using content before the pivot as input
to the encoder, and content after for the decoder
• Base 1B parameter model was trained for 1 million steps with a batch size of 256
• They adjusted the amount of training for other model sizes such that larger models are trained more and
smaller models are trained less to optimize the use of compute
• Due to resource limitations and to make optimal use of compute, the training of largest 41B model was
stopped early
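The pivot-based split described above (content before a uniformly sampled pivot goes to the encoder, content after it to the decoder) can be sketched as:

```python
import random

def pivot_split(document: str, rng: random.Random):
    """Split a GitHub file at a uniformly sampled pivot location:
    encoder input = text before the pivot, decoder target = text after."""
    pivot = rng.randrange(len(document) + 1)
    return document[:pivot], document[pivot:]
```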
Fine-tuning
• Input
• Encoder: the natural language problem description
• Decoder: the program solution
• Loss
• Encoder: masked language modeling loss
• Decoder: standard next-token prediction
• Tempering
• Value conditioning & prediction
• GOLD
Fine-tuning - tempering
• Tempering is a regularization technique that makes the token probability distribution artificially
smoother or sharper at training time by dividing the output logits of a model by a scalar temperature
T before the softmax layer
• T < 1 (training distribution sharper) → T’ = 1 (inference distribution smoother)
• T > 1 (training distribution smoother) → T’ = 1 (inference distribution sharper)
Softmax Tempering for Training Neural Machine Translation Models (2020)
• Training time: T=0.2
• Sampling time:
• T’ = 0.12 (for models trained with tempering only)
• T’ = 0.25 (for models trained with tempering and GOLD)
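Tempering, as described, is just a division of the logits by a scalar T before the softmax; a minimal sketch:

```python
import numpy as np

def tempered_softmax(logits, T):
    """Softmax over logits divided by temperature T.
    T < 1 sharpens the distribution; T > 1 smooths it."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()
```

With the paper's training temperature T = 0.2, the same logits yield a much more peaked distribution than the untempered T = 1 softmax.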
Fine-tuning – value conditioning & prediction
• Value conditioning
• Whether or not a submission was correct (always conditioned on “correct” at sampling time)
• Metadata (ratings, tags, language)
• Value prediction (not used during sampling)
• Auxiliary value prediction task during training
• The last layer token representations before projecting to logits are used in a small Transformer to
classify whether the submission is correct
Fine-tuning – GOLD
• Solving competitive programming problems from description is inherently a one-of-many task
• Standard maximum likelihood objectives minimize loss by putting some weight on each solution in the
training set (like recall), whereas their metric measures whether a model can find a single correct
solution in the submission attempt budget (like precision)
• A variation of GOLD is adopted (an offline RL algorithm which allows the model to both learn from
tokens it already assigns high likelihood to, and to ignore tokens that are not in its distribution)
• The model can concentrate on precision rather than recall, and increase its chance of getting at least
one correct sample
• They add a short training phase between pre-training and fine-tuning, during which they apply
tempering but crucially not GOLD. This allows the initial pre-trained distribution to transition to a
smoother distribution.
Text Generation by learning from Demonstrations, ICLR 2021
∇ℒ_GOLD(θ) = − Σ_{s ∈ Solution tokens} P̃_θ(s) ∇ log P_θ(s)
P̃_θ(s) ← max(P_θ(s)^α, β), α = 0.5, β = 0.05
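In code, the GOLD reweighting amounts to scaling each token's −∇ log P_θ term by max(P_θ^α, β) instead of weighting every target token equally as maximum likelihood does. A sketch under the assumption that per-token probabilities are given as an array:

```python
import numpy as np

ALPHA, BETA = 0.5, 0.05  # values from the slide

def gold_token_weights(token_probs):
    """Per-token weights for the GOLD gradient: tokens the model already
    assigns high probability get weight ~ P^alpha, while low-probability
    (out-of-distribution) tokens are floored at beta, i.e. largely ignored.
    Standard MLE corresponds to a constant weight of 1 for every token."""
    p = np.asarray(token_probs, dtype=float)
    return np.maximum(p ** ALPHA, BETA)
```

This is why the objective trades recall for precision: the model stops being penalised for solutions it was never going to generate.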
Large scale sampling
• To ensure sufficient diversity in such a large number of samples
• Generate half of the samples in Python and half in C++
• Randomize the problem tags and ratings in the natural language prompt
• Problem tags: picked random tags from the 50 most popular tags
• Ratings: sampled uniformly in the range 800 to 3500
• Use relatively high sampling temperature
• The optimal sampling temperature depends on the total number of samples (in general the
more samples, the higher the optimal temperature)
• (However) different temperatures in a wide range do not significantly change the solve rates
• Regular sampling with temperature
• Top-k and nucleus sampling: they did not observe significant performance improvements
Filtering and Clustering
• Filtering
• Filtering samples to only those that pass the example tests given in the problem statement
• Filtering removes approximately 99% of model samples
• Filtering can still leave tens of thousands of candidate samples for many problems
• Clustering
• Clustering the programs that are syntactically different but semantically equivalent
• Semantically equivalent programs could be detected if we had additional test inputs (grouping
programs that produce the same outputs together into clusters)
• Test input generation model
• Same architecture as their main models and initialized from the same GitHub pre-trained ckpt
• This model was trained to predict test inputs from problem descriptions, using example,
hidden, and generated test inputs as training data
1. Codeforces competitions evaluation
• All codeforces competitions from 2021/12/01 to 2021/12/28 with more than 5,000 participants per
contest, a total of 10 competitions.
• The system was an ensemble of 41B and 9B models with clustering (which turned out to be slightly worse
than using the 41B model alone with clustering)
• First run: up to 10 submissions
• 2nd–3rd runs: more than 10 submissions
1. Codeforces competitions evaluation
• C1591
• 3 problems: 10% ~ 37%
• 2 problems: 37% ~ 73%
1. Codeforces competitions evaluation
2. CodeContests evaluation
• pass@k : The percentage of problems solved when they take k samples from the model for each
problem and submit all of them for evaluation on the hidden tests. If any solution in the specified sample
budget solves a problem, the problem is counted as solved
• 10@k : The percentage of problems solved when they take k samples from the model for each
problem but can only submit 10 of them for evaluation on the hidden tests
• Differences in solve rates between the validation and test sets are caused by variation in problem
distributions (as the test set and validation set were collected in temporally disjoint periods), as well as
some overfitting
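The two metrics can be computed from per-sample pass/fail records. A toy sketch; the 10@k submission choice here is the simplest baseline (the first samples that pass the example tests), not AlphaCode's filtering-plus-clustering pipeline:

```python
def pass_at_k(hidden_results, k):
    """hidden_results: one list of booleans per problem, one bool per sample
    (True = passes the hidden tests). A problem counts as solved if any of
    its first k samples is correct."""
    solved = sum(any(r[:k]) for r in hidden_results)
    return solved / len(hidden_results)

def ten_at_k(hidden_results, example_results, k, budget=10):
    """10@k: draw k samples but submit at most `budget`; submissions are
    chosen as the first samples that pass the example tests (a simplifying
    assumption for this sketch)."""
    solved = 0
    for hidden, example in zip(hidden_results, example_results):
        chosen = [h for h, e in zip(hidden[:k], example[:k]) if e][:budget]
        solved += any(chosen)
    return solved / len(hidden_results)
```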
2. CodeContests evaluation
• Solve rates scale log-linearly with more samples
• Scaling up the total number of samples leads to massive improvements in model solve rate
• Improving solve rate requires exponentially increasing amounts of samples
• Better models have higher slopes in the scaling curve
• Larger models tend to have better model quality
2. CodeContests evaluation
• Solve rates scale log-linearly with more compute
• (a) The solve rate also scales approximately log-linearly with more training compute
• (b) larger models take more compute to draw each sample, but they eventually outperform smaller
models even with the same sampling compute, as the better quality of samples from the larger
models becomes the dominant factor for performance
2. CodeContests evaluation
• The decoder-only model was trained with the same amount of compute
• Due to the significantly longer decoder sequence length (2304 tokens), with the same amount of
training compute it consumes 50% more loss tokens than training the encoder-decoder model
• Asymmetric encoder-decoder with multi-query attention significantly improves the sampling speed
while keeping the sample quality
1. An encoder-decoder architecture with asymmetric encoder and decoder structures
2. The multi-query attention setup, which uses a single shared attention head for keys and another for
values in each block
2. CodeContests evaluation
• Choice of the pre-training dataset
• Base 1B model
• GitHub (all language)
• The Python-only portion of GitHub
• The MassiveText(Gopher) generic text dataset which also includes a portion of GitHub
• Not pre-trained at all
2. CodeContests evaluation
• Model enhancement
2. CodeContests evaluation
• These percentages are calculated based on the full set of samples
• Overall less than 1% of samples from the models pass example tests
2. CodeContests evaluation
1. Randomly picking model samples without filtering
2. Filtering and then randomly selecting from the filtered samples
3. Filtering and then using clustering to select samples
4. Allowing unlimited evaluation attempts (the upper bound performance)
3. APPS
• Fine-tuned the pre-trained models on the APPS training set without using clustering, tags, ratings, value
conditioning, or prediction, and with sampling temperature 0.25 and nucleus sampling
Copying from training data
• 50 C++ and 50 Python solutions from a selection of 16 validation problems that each had at least that many solutions
• The common substrings between model solutions and training data mostly contained boilerplate code
for reading and parsing the input data format, rather than key logic for solving problems (for example, a
FastIO Python class has length 805 and appears in 1.66% of all human Python solutions)
Model solution characteristics
Sensitivity to problem descriptions
Sensitivity to problem descriptions (simplification of problem descriptions)
Sensitivity to problem descriptions (incorrect or irrelevant rewording)
• Original: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of a_l times a_ {l + 1} for an integer l
for which 1 <= l < n.
• Opposite: You are given n integers a_1 , a_2 , ... , a_n . Find the minimum value of a_l times a_ {l + 1} for an integer l
for which 1 <= l < n.
• Related: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of ( a_l . a_r ) over all pairs (l , r) of
integers for which 1 <= l < r <= n.
• Underspecified: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of f( a_l , a_r ) over all pairs (l ,
r ) of integers for which 1 <= l < r <= n.
• Verbose: William has been given an array for his birthday , which consists of n integers a_1 , a_2 , ... , a_n . He is very
proud of his array , but naturally his friend Mary is curious about it . Mary would like to know a certain function of
consecutive elements of the array . Concretely , Mary would like to know the maximum value of a_l times a_ {l + 1} for
an integer l for which 1 <= l < n. Can you help William by calculating this value for him ?
• Algorithm described in words: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of the product
of two consecutive members of the array .
Sensitivity to problem descriptions (variable renaming)
Sensitivity to problem descriptions (typos, synonym substitution, word permutation , deletion)
Sensitivity to problem descriptions (description section ablations)
Description
Specification
IO
Sensitivity to provided metadata
Loss is a poor proxy for solve rate
Solve rates conditioned on different ratings
Conclusion
Text Generation by Learning from Demonstration NYU, 2021
The Curious Case of Neural Text Degeneration AI2, 2020
GPT Understands, Too Tsinghua Univ., 2021
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 

Recently uploaded (20)

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 

Competition-Level Code Generation with AlphaCode.pptx

  • 1. Competition-Level Code Generation with AlphaCode San Kim 2022.03.02 Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu and Oriol Vinyals Google DeepMind
  • 2. Alpha Code • Recent large-scale language models still perform poorly when evaluated on more complex, unseen problems that require problem-solving skills beyond simply translating instructions into code. • Competitive programming problems, which require an understanding of algorithms and complex natural language, remain extremely challenging. 1. An extensive and clean competitive programming dataset for training and evaluation 2. Large and efficient-to-sample transformer-based architectures 3. Large-scale model sampling to explore the search space, followed by filtering based on program behavior to a small set of submissions 3 key components were critical to achieving good and reliable performance AlphaCode: a system for code generation that can create novel solutions to problems that require deeper reasoning. Achieved on average a ranking of top 54.3% in competitions with more than 5,000 participants (on the Codeforces platform)
  • 4. A solution generated by AlphaCode
  • 6. Alpha Code 1. Pre-train a transformer-based language model on GitHub code with standard language modeling objectives. This model can reasonably represent the space of human coding, which greatly reduces the problem search space. 2. Fine-tune the model on our dataset of competitive programming data, using GOLD with tempering as the training objective. This further reduces the search space, and compensates for the small amount of competitive programming data by leveraging pre-training 3. Generate a very large number of samples from our models for each problem. 4. Filter the samples to obtain a small set of candidate submissions (at most 10), to be evaluated on the hidden test cases, by using the example tests and clustering to pick samples based on program behaviour
  • 7. Pretraining Dataset • pre-training dataset is based on a snapshot of selected public GitHub repositories taken on 2021/07/14. • Included all code files from several popular languages: C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript • Filtered out all files larger than 1MB or with lines longer than 100 characters, to exclude automatically generated code. • Removed duplicates of the same file, ignoring whitespace in comparisons. • Final pre-training dataset contains a total of 715.1 GB of code.
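The file-level filters above can be sketched in a few lines. This is a minimal illustration with assumed helper names (the exact DeepMind pipeline is not public); only the thresholds (1 MB, 100-character lines) and the whitespace-insensitive deduplication come from the slide.

```python
# Sketch of the pre-training file filters described above.
# keep_file / dedup_key are illustrative names, not from the paper.

def keep_file(size_bytes: int, text: str) -> bool:
    """Drop files larger than 1 MB or with any line over 100 characters,
    which tends to exclude automatically generated code."""
    if size_bytes > 1_000_000:
        return False
    if any(len(line) > 100 for line in text.splitlines()):
        return False
    return True

def dedup_key(text: str) -> str:
    """Duplicate files are detected ignoring whitespace differences."""
    return "".join(text.split())
```

Two files that differ only in indentation or line breaks map to the same `dedup_key`, so only one copy survives.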
  • 8. CodeContests • The dataset includes problems, solution and test cases (scraped from the Codeforces platform) • To guard against data leakage, a strict temporal split is adopted: • Training dataset: before 2021/07/14 • Validation dataset: between 2021/07/15 and 2021/09/10 • Test dataset: after 2021/09/21 • Training dataset: Newly scraped data (from Codeforces) + Description2Code(OpenAI) + CodeNet (IBM Research) • Validation and test splits: Newly scraped data (from Codeforces)
  • 9. CodeContests • Metadata: difficulty ratings (800-1600), tags (greedy, math, dp, brute force, graphs, etc.) • Neither the difficulty rating nor the tags are visible at competition time (and so should not be used at test time) • The dataset also contains both correct and incorrect human submissions written in the most popular submission languages: C++, Python, and Java. • Tests • Example tests (in problem statements) • Hidden tests (that are made available at the evaluation result pages) Difficulty rating tags Example tests
  • 10. CodeContests • Lack of test coverage leads to “false positives” and “slow positives” • “false positives”: incorrect submissions are marked as correct • “slow positives”: correct but algorithmically inefficient solutions that do not fulfill time and memory constraints are marked correct
  • 11. CodeContests • Reduce false positive rates by generating additional test cases • Mutating existing test inputs • Bit flips to binary inputs • Randomly incrementing or decrementing integers • Swapping and changing characters in strings • Mutated inputs are verified by running 30 correct solutions on them, and checking that all solutions produce the same output • This process was run on each problem for a maximum of 10 CPU hours or 200 generated tests • Because of complex input formats, they failed to generate the full set of 200 tests for 6.3% of problems • They filtered out problems in the validation and test splits with insufficient test coverage, keeping only problems with at least 5 hidden or generated test cases that result in at least 2 different outputs.
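The mutation-and-verify loop above can be sketched as follows. The operator and function names are mine, and the verification step is simplified to "all correct solutions agree on the mutated input"; only the operator types (integer perturbation, character swaps/changes) come from the slide.

```python
import random

# Illustrative test-input mutation operators (names are assumptions).

def mutate_int_token(tok: str, rng: random.Random) -> str:
    """Randomly increment or decrement an integer token."""
    return str(int(tok) + rng.choice([-1, 1]))

def mutate_string_token(tok: str, rng: random.Random) -> str:
    """Swap two positions or change one character in a string token."""
    chars = list(tok)
    if len(chars) >= 2 and rng.random() < 0.5:
        i, j = rng.sample(range(len(chars)), 2)
        chars[i], chars[j] = chars[j], chars[i]
    else:
        i = rng.randrange(len(chars))
        chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def verify_mutation(mutated_input, correct_solutions) -> bool:
    """Keep a mutated input only if all correct solutions agree on it."""
    outputs = {run(mutated_input) for run in correct_solutions}
    return len(outputs) == 1
```

If any of the 30 correct solutions disagrees on a mutated input, the input is malformed (or ambiguous) and is discarded.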
  • 12. HumanEval dataset, Codex model Evaluating Large Language Models Trained on Code OpenAI, 2021
  • 13. Generating short code snippets vs. generating entire programs The difference in difficulty between generating short code snippets and entire programs can be analogous to that of declarative versus imperative problem solving…
  • 14. Generating short code snippets vs. generating entire programs • Generating short code snippets typically amounts to translating the task specification directly into code, and sometimes reduces to invoking the correct API calls. • Generating entire programs often relies on understanding the task and figuring out how to accomplish it, which requires deeper algorithmic reasoning.
  • 15. APPS benchmark Measuring Coding Challenge Competence With APPS UC Berkeley, 2021
  • 16. Model Architecture • Encoder-decoder transformer architecture - The competitive programming code generation problem can be viewed as a sequence-to-sequence translation task: given a problem description X in natural language, produce a corresponding solution Y in a programming language • Asymmetric architecture with 1536 tokens for the encoder but only 768 tokens for the decoder: because problem descriptions are on average twice as long as their corresponding human solution • A SentencePiece tokenizer with a vocabulary size of 8,000 tokens trained on a mix of GitHub and CodeContests data.
  • 17. Model Architecture • Multi-query attention (to reduce the cost of sampling from their models): using a full set of query heads but sharing key and value heads per attention block significantly reduces memory usage and cache updates costs, which are the main bottleneck during sampling.
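A minimal single-layer sketch of multi-query attention, under assumed shapes and with no causal mask or learned output projection. The point it illustrates is the one on the slide: queries stay per-head, but there is a single shared key head and a single shared value head, so the K/V cache is `n_heads` times smaller than in standard multi-head attention.

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, n_heads):
    """Multi-query attention sketch: per-head queries, one shared
    key head and one shared value head per block.
    x: (T, d); Wq: (d, n_heads*head_dim); Wk, Wv: (d, head_dim)."""
    T, d = x.shape
    head_dim = Wk.shape[1]
    q = (x @ Wq).reshape(T, n_heads, head_dim)  # per-head queries
    k = x @ Wk                                  # shared keys   (T, head_dim)
    v = x @ Wv                                  # shared values (T, head_dim)
    scores = np.einsum("thd,sd->ths", q, k) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)   # softmax over key positions
    out = np.einsum("ths,sd->thd", weights, v)  # (T, n_heads, head_dim)
    return out.reshape(T, n_heads * head_dim)
```

During incremental sampling, only `k` and `v` need caching, which is the memory-and-bandwidth saving the slide refers to.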
  • 19. Pre-training • A standard cross-entropy next-token prediction loss: for decoder • A masked language modeling: for encoder • They split GitHub files by uniformly sampling pivot location, using content before the pivot as input to the encoder, and content after for the decoder • Base 1B parameter model was trained for 1 million steps with a batch size of 256 • They adjusted the amount of training for other model sizes such that larger models are trained more and smaller models are trained less to optimize the use of compute • Due to resource limitations and to make optimal use of compute, the training of largest 41B model was stopped early
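The pivot-split construction above is simple enough to sketch directly; the function name is mine, and the real pipeline operates on tokens rather than characters.

```python
import random

# Hedged sketch of the pre-training example construction: split a file
# at a uniformly sampled pivot; the prefix feeds the encoder (masked-LM
# loss) and the suffix becomes the decoder target (next-token loss).

def make_pretraining_example(file_text: str, rng: random.Random):
    pivot = rng.randrange(len(file_text) + 1)
    encoder_input = file_text[:pivot]
    decoder_target = file_text[pivot:]
    return encoder_input, decoder_target
```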
  • 20. Fine-tuning • Input • Encoder: the natural language problem description • Decoder: the program solution • Loss • Encoder: masked language modeling loss • Decoder: standard next-token prediction • Tempering • Value conditioning & prediction • GOLD
  • 21. Fine-tuning - tempering • Tempering is a regularization technique that makes the token probability distribution artificially smoother or sharper at training time by dividing the output logits of a model by a scalar temperature T before the softmax layer • T<1 (training distribution sharper) → T'=1 (inference distribution smoother) • T>1 (training distribution smoother) → T'=1 (inference distribution sharper) Softmax Tempering for Training Neural Machine Translation Models (2020) • Training time: T=0.2 • Sampling time: • T' = 0.12 (for models trained with tempering only) • T' = 0.25 (for models trained with tempering and GOLD)
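The mechanism is just a temperature-scaled softmax, sketched here in plain Python. With T = 0.2 the training-time distribution is much sharper than the untempered one, so sampling later at a higher T' looks relatively smoother.

```python
import math

# Minimal sketch of softmax tempering: divide logits by T before softmax.

def tempered_softmax(logits, T):
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```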
  • 22. Fine-tuning – value conditioning & prediction • Value conditioning • Whether or not a submission was correct (always correct at sampling time) • Metadata (Ratings, Tags, Language) • Value prediction (not used during sampling) • Auxiliary value prediction task during training • The last layer token representations before projecting to logits are used in a small Transformer to classify whether the submission is correct
  • 23. Fine-tuning – GOLD • Solving competitive programming problems from description is inherently a one-of-many task • Standard maximum likelihood objectives minimize loss by putting some weight on each solution in the training set (like recall), whereas their metric measures whether a model can find a single correct solution in the submission attempt budget (like precision) • A variation of GOLD is adopted (an offline RL algorithm which allows the model to both learn from tokens it already assigns high likelihood to, and to ignore tokens that are not in its distribution) • The model can concentrate on precision rather than recall, and increase its chance of getting at least one correct sample • They add a short training phase between pre-training and fine-tuning, during which they apply tempering but crucially not GOLD. This allows the initial pre-trained distribution to transition to a smoother distribution. Text Generation by Learning from Demonstrations, ICLR 2021 ∇L_GOLD(θ) = −Σ_{s ∈ solution tokens} P̃_θ(s) ∇log P_θ(s), where P̃_θ(s) ← max(P_θ(s)^α, β), α = 0.5, β = 0.05
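The practical effect of GOLD here is a per-token weight on the log-likelihood gradient: instead of weighting every solution token equally, each is scaled by max(P_θ(s)^α, β). A minimal sketch of that weight (the function name is mine):

```python
# Sketch of the GOLD per-token gradient weighting: tokens the model
# already assigns high probability get near-full weight, while
# out-of-distribution tokens are floored at beta and largely ignored,
# shifting the objective from recall toward precision.

def gold_token_weight(p: float, alpha: float = 0.5, beta: float = 0.05) -> float:
    return max(p ** alpha, beta)
```

A token with P_θ = 0.25 is weighted 0.5, while a token with P_θ near zero is weighted only 0.05, so the model is free to drop solutions it was never going to produce.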
  • 24. Large scale sampling • To ensure sufficient diversity in such a large number of samples • Generate half of the samples in Python and half in C++ • Randomize the problem tags and ratings in the natural language prompt • Problem tags: picked random tags from most popular 50 tag • Ratings: uniformly in the range of 800 to 3500 • Use relatively high sampling temperature • The optimal sampling temperature depends on the total number of samples (in general the more samples, the higher the optimal temperature) • (However) different temperatures in a wide range do not significantly change the solve rates • Regular sampling with temperature • Top-k and nucleus sampling: they did not observe significant performance improvements
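The metadata randomization above can be sketched as prompt construction. The tag list and prompt format here are illustrative stand-ins (the slide says the real system samples from the 50 most popular tags); only the language split and the 800–3500 rating range come from the slide.

```python
import random

# Hedged sketch of per-sample prompt randomization for diversity.

POPULAR_TAGS = ["greedy", "math", "dp", "brute force", "graphs"]  # top-50 in reality

def randomized_prompt(description: str, rng: random.Random) -> str:
    language = rng.choice(["PYTHON", "CPP"])   # half Python, half C++
    tags = rng.sample(POPULAR_TAGS, k=2)       # random tags each sample
    rating = rng.randint(800, 3500)            # uniform over the range
    return f"{language} TAGS: {','.join(tags)} RATING: {rating}\n{description}"
```

Because each sample sees a different (language, tags, rating) combination, the model is nudged toward different regions of its solution distribution even at a fixed temperature.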
  • 25. Filtering and Clustering • Filtering • Filtering samples to only those that pass the example tests given in the problem statement • Filtering removes approximately 99% of model samples • Filtering can still leave tens of thousands of candidate samples for many problems • Clustering • Clustering programs that are syntactically different but semantically equivalent • Semantically equivalent programs can be detected given additional test inputs (grouping programs that produce the same outputs into clusters) • Test input generation model • Same architecture as their main models, initialized from the same GitHub pre-trained checkpoint • This model was trained to predict test inputs from problem descriptions, using example, hidden, and generated test inputs as training data
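The clustering step reduces to grouping programs by their behaviour signature on the generated test inputs. A minimal sketch, where `run_program(program, test_input)` is an assumed execution helper:

```python
from collections import defaultdict

# Sketch of behaviour-based clustering: programs producing identical
# outputs on the generated test inputs land in the same cluster, so
# one submission per large cluster covers many syntactic variants.

def cluster_by_behaviour(programs, test_inputs, run_program):
    clusters = defaultdict(list)
    for prog in programs:
        signature = tuple(run_program(prog, t) for t in test_inputs)
        clusters[signature].append(prog)
    # Picking from the largest clusters first tends to favour the
    # behaviours the model produces most consistently.
    return sorted(clusters.values(), key=len, reverse=True)
```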
  • 26. 1. Codeforces competitions evaluation • All Codeforces competitions from 2021/12/01 to 2021/12/28 with more than 5,000 participants per contest, a total of 10 competitions. • The system was an ensemble of 41B and 9B models with clustering (but this turned out to be slightly worse than using the 41B model alone with clustering) • First run: up to 10 submissions • Second and third runs: more than 10 submissions
  • 27. 1. Codeforces competitions evaluation • C1591 • 3 problems: 10% ~ 37% • 2 problems: 37% ~ 73%
  • 30. 2. CodeContests evaluation • pass@k : The percentage of problems solved when they take k samples from the model for each problem and submit all of them for evaluation on the hidden tests. If any solution in the specified sample budget solves a problem, the problem is counted as solved • 10@k : The percentage of problems solved when they take k samples from the model for each problem but can only submit 10 of them for evaluation on the hidden tests • Differences in solve rates between the validation and test sets are caused by variation in problem distributions (as the test set and validation set were collected in temporally disjoint periods), as well as some overfitting
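For reference, pass@k is usually computed with the unbiased estimator from the Codex paper (referenced earlier in this deck): with n samples per problem of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k), averaged over problems.

```python
from math import comb

# Unbiased per-problem pass@k estimator (Codex paper).

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn (without
    replacement) from n total samples, c of them correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The 10@k metric differs in that only 10 of the k samples may be submitted, so the selection step (filtering and clustering) matters, not just raw sample quality.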
  • 31. 2. CodeContests evaluation • Solve rates scale log-linearly with more samples • Scaling up the total number of samples leads to massive improvements in model solve rate • Improving solve rate requires exponentially increasing amounts of samples • Better models have higher slopes in the scaling curve • Larger models tend to have better model quality
  • 32. 2. CodeContests evaluation • Solve rates scale log-linearly with more compute • (a) The solve rate also scales approximately log-linearly with more training compute • (b) larger models take more compute to draw each sample, but they eventually outperform smaller models even with the same sampling compute as the better quality of samples from the larger models become the dominant factor for performance
  • 33. 2. CodeContests evaluation • The decoder-only model was trained with the same amount of compute • Due to the significantly longer decoder sequence length (2304 tokens), with the same amount of training compute it consumes 50% more loss tokens than training the encoder-decoder model • The asymmetric encoder-decoder with multi-query attention significantly improves sampling speed while keeping sample quality 1. An encoder-decoder architecture with asymmetric encoder and decoder structures 2. The multi-query attention setup, which uses one shared attention head for keys and one for values in each block
  • 34. 2. CodeContests evaluation • Choice of the pre-training dataset • Base 1B model • GitHub (all language) • The Python-only portion of GitHub • The MassiveText(Gopher) generic text dataset which also includes a portion of GitHub • Not pre-trained at all
  • 35. 2. CodeContests evaluation • Model enhancement
  • 36. 2. CodeContests evaluation • These percentages are calculated based on the full set of samples • Overall less than 1% of samples from the models pass example tests
  • 37. 2. CodeContests evaluation 1. Randomly picking model samples without filtering 2. Filtering and then randomly selecting from the filtered samples 3. Filtering and then using clustering to select samples 4. Allowing unlimited evaluation attempts (the upper bound performance)
  • 38. 3. APPS • Fine-tuned the pre-trained models on the APPS training set without using clustering, tags, ratings, value conditioning, or prediction, and with sampling temperature 0.25 and nucleus sampling
  • 39. Copying from training data • 50 C++ and 50 Python solutions from a selection of 16 validation problems that had that many solutions • The common substrings between model solutions and training data mostly contained boilerplate code for reading and parsing the input data format, rather than key logic for solving problems (for example, a FastIO Python class has length 805 and appears in 1.66% of all human Python solutions)
  • 41. Sensitivity to problem descriptions
  • 42. Sensitivity to problem descriptions (simplification of problem descriptions)
  • 43. Sensitivity to problem descriptions (incorrect or irrelevant rewording) • Original: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of a_l times a_ {l + 1} for an integer l for which 1 <= l < n. • Opposite: You are given n integers a_1 , a_2 , ... , a_n . Find the minimum value of a_l times a_ {l + 1} for an integer l for which 1 <= l < n. • Related: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of ( a_l . a_r ) over all pairs (l , r) of integers for which 1 <= l < r <= n. • Underspecified: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of f( a_l , a_r ) over all pairs (l , r ) of integers for which 1 <= l < r <= n. • Verbose: William has been given an array for his birthday , which consists of n integers a_1 , a_2 , ... , a_n . He is very proud of his array , but naturally his friend Mary is curious about it . Mary would like to know a certain function of consecutive elements of the array . Concretely , Mary would like to know the maximum value of a_l times a_ {l + 1} for an integer l for which 1 <= l < n. Can you help William by calculating this value for him ? • Algorithm described in words: You are given n integers a_1 , a_2 , ... , a_n . Find the maximum value of the product of two consecutive members of the array .
  • 44. Sensitivity to problem descriptions (variable renaming)
  • 45. Sensitivity to problem descriptions (variable renaming)
  • 46. Sensitivity to problem descriptions (typos, synonym substitution, word permutation , deletion)
  • 47. Sensitivity to problem descriptions (description section ablations) Description Specification IO
  • 49. Loss is a poor proxy for solve rate
  • 50. Solve rates conditioned on different ratings
  • 51. Solve rates conditioned on different ratings
  • 53. Text Generation by Learning from Demonstration NYU, 2021
  • 54. Text Generation by Learning from Demonstration NYU, 2021
  • 55. Text Generation by Learning from Demonstration NYU, 2021
  • 56. Text Generation by Learning from Demonstration NYU, 2021
  • 57. Text Generation by Learning from Demonstration NYU, 2021
  • 58. Text Generation by Learning from Demonstration NYU, 2021
  • 59. Text Generation by Learning from Demonstration NYU, 2021
  • 60. Text Generation by Learning from Demonstration NYU, 2021
  • 61. Text Generation by Learning from Demonstration NYU, 2021
  • 62. The Curious Case of Neural Text Degeneration AI2, 2020
  • 63. The Curious Case of Neural Text Degeneration AI2, 2020
  • 64. The Curious Case of Neural Text Degeneration AI2, 2020
  • 65. The Curious Case of Neural Text Degeneration AI2, 2020
  • 66. The Curious Case of Neural Text Degeneration AI2, 2020
  • 67. The Curious Case of Neural Text Degeneration AI2, 2020
  • 68. The Curious Case of Neural Text Degeneration AI2, 2020
  • 69. The Curious Case of Neural Text Degeneration AI2, 2020
  • 70. GPT Understands, Too Tsinghua Univ., 2021

Editor's Notes

  1. Add irrelevant information to the description, or reword it to be under-specified or incorrect. The model conditions strongly on the description. The model is able to parse algorithm descriptions from either symbols or natural language, and can ignore irrelevant natural language explanations; indeed, the model actually does better with more language-heavy descriptions, perhaps because of the verbose nature of the training distribution.