AlphaCode is a system for competitive code generation that achieves an average ranking in the top 54.3% in competitions with over 5,000 participants. It uses a large transformer model pre-trained on GitHub code and fine-tuned on a competitive programming dataset. During fine-tuning, it employs techniques such as tempering and GOLD to favor precision over recall. At test time, it generates a large number of samples, filters them against the example tests, and clusters similar programs to select submissions. Extensive evaluations on the CodeContests and APPS benchmarks show that AlphaCode's performance scales log-linearly with more samples and compute.
Competition-Level Code Generation with AlphaCode.pptx
1. Competition-Level Code Generation with
AlphaCode
San Kim
2022.03.02
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi
Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert,
Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang,
Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme
Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu and Oriol
Vinyals
DeepMind
2. AlphaCode
• Recent large-scale language models still perform poorly when evaluated on more
complex, unseen problems that require problem-solving skills beyond simply
translating instructions into code.
• Competitive programming problems which require an understanding of algorithms
and complex natural language remain extremely challenging.
1. An extensive and clean competitive programming dataset for training and
evaluation
2. Large and efficient-to-sample transformer-based architectures
3. Large-scale model sampling to explore the search space, followed by filtering
based on program behavior to a small set of submissions
Three key components were critical to achieving good and reliable performance.
AlphaCode: a system for code generation that can create novel solutions to problems
that require deeper reasoning.
Achieved on average a ranking in the top 54.3% in competitions with more than 5,000
participants (on the Codeforces platform)
6. AlphaCode
1. Pre-train a transformer-based language model on GitHub code with standard
language modeling objectives. This model can reasonably represent the space of
human coding, which greatly reduces the problem search space.
2. Fine-tune the model on our dataset of competitive programming data, using
GOLD with tempering as the training objective. This further reduces the search
space, and compensates for the small amount of competitive programming data
by leveraging pre-training.
3. Generate a very large number of samples from our models for each problem.
4. Filter the samples to obtain a small set of candidate submissions (at most 10), to
be evaluated on the hidden test cases, by using the example tests and clustering
to pick samples based on program behaviour
7. Pretraining Dataset
• The pre-training dataset is based on a snapshot of selected public GitHub
repositories taken on 2021/07/14.
• Included all code files from several popular languages: C++, C#, Go, Java,
JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript
• Filtered out all files larger than 1MB or with lines longer than 1000
characters, to exclude automatically generated code.
• Removed duplicates of the same file, ignoring whitespace in comparisons.
• Final pre-training dataset contains a total of 715.1 GB of code.
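The filtering and deduplication rules above can be sketched as below. This is a minimal illustration, not the actual DeepMind pipeline; the exact thresholds and the hashing choice are assumptions:

```python
import hashlib

MAX_FILE_BYTES = 1_000_000  # drop files larger than ~1MB
MAX_LINE_LEN = 1000         # drop files with very long lines (likely auto-generated)

def keep_file(source: str) -> bool:
    """Apply the size and line-length filters described in the slide."""
    if len(source.encode("utf-8")) > MAX_FILE_BYTES:
        return False
    return all(len(line) <= MAX_LINE_LEN for line in source.splitlines())

def content_key(source: str) -> str:
    """Fingerprint a file for duplicate detection, ignoring whitespace."""
    normalized = "".join(source.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(files):
    """Keep the first occurrence of each whitespace-normalized file."""
    seen, unique = set(), []
    for f in files:
        key = content_key(f)
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique
```

Hashing the whitespace-stripped content means two copies of a file that differ only in indentation or blank lines collapse to one entry, matching the "ignoring whitespace in comparisons" rule.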
8. CodeContests
• The dataset includes problems, solutions, and test cases (scraped from the Codeforces platform)
• To guard against data leakage, a strict temporal split is adopted:
• Training dataset: before 2021/07/14
• Validation dataset: between 2021/07/15 and 2021/09/10
• Test dataset: after 2021/09/21
• Training dataset: Newly scraped data (from Codeforces) + Description2Code(OpenAI) + CodeNet (IBM
Research)
• Validation and test splits: Newly scraped data (from Codeforces)
9. CodeContests
• Metadata: difficulty ratings (800-1600), tags (greedy, math, dp, brute force, graphs, etc.)
• Neither the difficulty rating nor the tags are visible at competition time (and so should not be used at
test time)
• The dataset also contains both correct and incorrect human submissions written in the most popular
submission languages: C++, Python, and Java.
• Tests
• Example tests (in problem statements)
• Hidden tests (that are made available at the evaluation result pages)
Difficulty rating
tags
Example tests
10. CodeContests
• Lack of test coverage leads to “false positives” and “slow positives”
• “false positives”: incorrect submissions are marked as correct
• “slow positives”: correct but algorithmically inefficient solutions that do not fulfill time and memory
constraints are marked as correct
11. CodeContests
• Reduce false positive rates by generating additional test cases
• Mutating existing test inputs
• Bit flips to binary inputs
• Randomly incrementing or decrementing integers
• Swapping and changing characters in string
• Mutated inputs are verified by running 30 correct solutions on them, and checking that all solutions produce the
same output
• This process was run on each problem for a maximum of 10 CPU hours or 200 generated tests
• Because of complex input formats, they failed to generate the full set of 200 tests for 6.3% of problems
• They filtered out problems in the validation and test splits with insufficient test coverage, keeping only
problems with at least 5 hidden or generated test cases that result in at least 2 different outputs.
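The mutate-and-verify procedure above can be sketched as follows. The helper names and exact mutation choices are assumptions for illustration (bit flips for binary inputs are omitted), not the paper's implementation:

```python
import random

def mutate_input(test_input: str, rng: random.Random) -> str:
    """Mutate one whitespace-separated token of a test input."""
    tokens = test_input.split()
    idx = rng.randrange(len(tokens))
    tok = tokens[idx]
    if tok.lstrip("-").isdigit():
        # randomly increment or decrement an integer token
        tokens[idx] = str(int(tok) + rng.choice([-1, 1]))
    elif len(tok) >= 2:
        # swap two characters in a string token
        i, j = rng.sample(range(len(tok)), 2)
        chars = list(tok)
        chars[i], chars[j] = chars[j], chars[i]
        tokens[idx] = "".join(chars)
    return " ".join(tokens)

def verified(mutated_input: str, correct_solutions) -> bool:
    """Keep a mutated input only if all known-correct solutions agree on it."""
    outputs = {solution(mutated_input) for solution in correct_solutions}
    return len(outputs) == 1
```

Agreement among many independently written correct solutions is what makes the generated output trustworthy without knowing the intended answer.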
13. Generating short code snippets vs. generating entire programs
The difference in difficulty between generating short code snippets and entire programs can be
analogous to that of declarative versus imperative problem solving…
14. Generating short code snippets vs. generating entire programs
• Generating short code snippets typically amounts to translating the task specification
directly into code, and sometimes reduces to invoking the correct API calls.
• Generating entire programs often relies on understanding the task and figuring out
how to accomplish it, which requires deeper algorithmic reasoning.
16. Model Architecture
• Encoder-decoder transformer architecture - The competitive programming code generation problem
can be viewed as a sequence-to-sequence translation task: given a problem description X in natural
language, produce a corresponding solution Y in a programming language
• Asymmetric architecture with 1536 tokens for the encoder but only 768 tokens for the decoder:
because problem descriptions are on average twice as long as their corresponding human solutions
• A SentencePiece tokenizer with a vocabulary size of 8,000 tokens trained on a mix of GitHub and
CodeContests data.
17. Model Architecture
• Multi-query attention (to reduce the cost of sampling from their models): using a full set of query
heads but sharing key and value heads per attention block significantly reduces memory usage and
cache updates costs, which are the main bottleneck during sampling.
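To see why sharing key and value heads helps, compare KV-cache sizes during incremental decoding. A back-of-envelope sketch; the batch, sequence, and head sizes below are illustrative assumptions, not the paper's exact configuration:

```python
def kv_cache_bytes(batch: int, seq_len: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of key+value cache kept per layer during sampling.

    Multi-head attention caches K and V for every head (kv_heads == num_heads);
    multi-query attention shares one K/V head across all query heads
    (kv_heads == 1), shrinking the cache and its per-step updates.
    """
    return 2 * batch * seq_len * kv_heads * head_dim * dtype_bytes

# Illustrative numbers (assumed, not from the paper):
num_heads, head_dim = 16, 128
mha = kv_cache_bytes(batch=1024, seq_len=768, kv_heads=num_heads, head_dim=head_dim)
mqa = kv_cache_bytes(batch=1024, seq_len=768, kv_heads=1, head_dim=head_dim)
```

With these settings the multi-query cache is num_heads times smaller, which is exactly the memory and cache-update bottleneck the slide says dominates sampling cost.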
19. Pre-training
• A standard cross-entropy next-token prediction loss: for decoder
• A masked language modeling: for encoder
• They split GitHub files by uniformly sampling a pivot location, using content before the pivot as input
to the encoder, and content after it for the decoder
• Base 1B parameter model was trained for 1 million steps with a batch size of 256
• They adjusted the amount of training for other model sizes such that larger models are trained more and
smaller models are trained less to optimize the use of compute
• Due to resource limitations and to make optimal use of compute, the training of the largest 41B model was
stopped early
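The pivot-based split can be sketched as below. This is a character-level simplification for illustration; the real pipeline operates on tokens:

```python
import random

def pivot_split(source_file: str, rng: random.Random):
    """Sample a pivot uniformly; text before it becomes the encoder input
    (trained with masked language modeling), text after it becomes the
    decoder target (trained with next-token prediction)."""
    pivot = rng.randrange(len(source_file) + 1)
    return source_file[:pivot], source_file[pivot:]
```

Sampling the pivot uniformly means every position in a file contributes to both the encoder and decoder objectives across the dataset.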
20. Fine-tuning
• Input
• Encoder: the natural language problem description
• Decoder: the program solution
• Loss
• Encoder: masked language modeling loss
• Decoder: standard next-token prediction
• Tempering
• Value conditioning & prediction
• GOLD
21. Fine-tuning - tempering
• Tempering is a regularization technique that makes the token probability distribution artificially
smoother or sharper at training time by dividing the output logits of a model by a scalar temperature
T before the softmax layer
• T<1 (training distribution sharper) T’=1 (inference distribution smoother)
• T>1 (training distribution smoother) T’=1 (inference distribution sharper)
Softmax Tempering for Training Neural Machine Translation Models (2020)
• Training time: T=0.2
• Sampling time:
• T’ = 0.12 (for models trained with tempering only)
• T’ = 0.25 (for models trained with tempering and GOLD)
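The tempering operation itself is just a division of the logits before the softmax. A minimal pure-Python sketch with illustrative logits:

```python
import math

def tempered_softmax(logits, T):
    """Divide logits by temperature T before the softmax: T < 1 makes the
    distribution sharper, T > 1 makes it smoother."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                 # illustrative logits
sharp = tempered_softmax(logits, 0.2)    # training-time T from the slide
plain = tempered_softmax(logits, 1.0)
```

Training at T = 0.2 and then sampling at a higher T' reproduces the sharper-training / smoother-inference trade-off described above.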
22. Fine-tuning – value conditioning & prediction
• Value conditioning
• Whether or not a submission was correct (always correct at sampling time)
• Meta data (Ratings, Tags, Language)
• Value prediction (not used during sampling)
• Auxiliary value prediction task during training
• The last layer token representations before projecting to logits are used in a small Transformer to
classify whether the submission is correct
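A sketch of how conditioning metadata might be prepended to the prompt; the field names and layout here are illustrative, not the paper's exact format:

```python
def build_conditioning_prefix(correct=True, rating=1500,
                              tags=("greedy",), language="PYTHON3"):
    """Prepend value-conditioning metadata to the problem description.
    At sampling time, `correct` is always set to True."""
    lines = [
        "CORRECT SOLUTION" if correct else "INCORRECT SOLUTION",
        f"RATING: {rating}",
        f"TAGS: {','.join(tags)}",
        f"LANGUAGE IS {language}",
    ]
    return "\n".join(lines) + "\n"

prefix = build_conditioning_prefix(correct=True, rating=1200, tags=("dp", "math"))
```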
23. Fine-tuning – GOLD
• Solving competitive programming problems from description is inherently a one-of-many task
• Standard maximum likelihood objectives minimize loss by putting some weight on each solution in the
training set (like recall), whereas their metric measures whether a model can find a single correct
solution in the submission attempt budget (like precision)
• A variation of GOLD is adopted (an offline RL algorithm which allows the model to both learn from
tokens it already assigns high likelihood to, and to ignore tokens that are not in its distribution)
• The model can concentrate on precision rather than recall, and increase its chance of getting at least
one correct sample
• They add a short training phase between pre-training and fine-tuning, during which they apply
tempering but crucially not GOLD. This allows the initial pre-trained distribution to transition to a
smoother distribution.
Text Generation by learning from Demonstrations, ICLR 2021
∇ℒ_GOLD(θ) = − Σ_{s ∈ solution tokens} P̃_θ(s) ∇ log P_θ(s)
P̃_θ(s) ← max(P_θ(s)^α, β), with α = 0.5, β = 0.05
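The per-token GOLD weight can be sketched as follows (illustrative helper; in the gradient this weight multiplies ∇ log P_θ(s), so tokens the model already likes keep their influence while off-distribution tokens are damped, with the β floor preventing the weight from vanishing early in training):

```python
import numpy as np

def gold_token_weights(token_probs, alpha=0.5, beta=0.05):
    """Per-token GOLD weight: w(s) = max(P(s)^alpha, beta).
    Compared with MLE's uniform weight of 1, low-probability
    tokens contribute much less to the gradient."""
    p = np.asarray(token_probs, dtype=float)
    return np.maximum(p ** alpha, beta)

# a likely token, a middling one, and one far outside the model's distribution
w = gold_token_weights([0.9, 0.5, 0.001])
```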
24. Large scale sampling
• To ensure sufficient diversity in such a large number of samples
• Generate half of the samples in Python and half in C++
• Randomize the problem tags and ratings in the natural language prompt
• Problem tags: picked random tags from the 50 most popular tags
• Ratings: uniformly in the range of 800 to 3500
• Use relatively high sampling temperature
• The optimal sampling temperature depends on the total number of samples (in general the
more samples, the higher the optimal temperature)
• (However) different temperatures in a wide range do not significantly change the solve rates
• Regular sampling with temperature
• Top-k and nucleus sampling: they did not observe significant performance improvements
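The diversification recipe above can be sketched as follows; the tag list and field names are illustrative, not the paper's actual tag set or prompt format:

```python
import random

# illustrative subset of the 50 most popular tags
POPULAR_TAGS = ["dp", "greedy", "math", "graphs", "strings"]

def randomized_prompt_metadata(rng=random):
    """Randomize tags, rating, and language per sample so that a large
    batch of generations stays diverse."""
    tags = rng.sample(POPULAR_TAGS, k=rng.randint(1, 3))
    rating = rng.randrange(800, 3501)        # uniform in [800, 3500]
    language = rng.choice(["PYTHON3", "CPP"])  # half Python, half C++
    return {"tags": tags, "rating": rating, "language": language}

random.seed(0)
meta = randomized_prompt_metadata()
```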
25. Filtering and Clustering
• Filtering
• Filtering samples to only those that pass the example tests given in the problem statement
• Filtering removes approximately 99% of model samples
• Filtering can still leave tens of thousands of candidate samples for many problems
• Clustering
• Clustering the programs that are syntactically different but semantically equivalent
• Semantically equivalent programs could be detected if we had additional test inputs (grouping
programs that produce the same outputs together into clusters)
• Test input generation model
• Same architecture as their main models, initialized from the same GitHub pre-trained checkpoint
• This model was trained to predict test inputs from problem descriptions, using example,
hidden, and generated test inputs as training data
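Filtering and clustering can be sketched by running each candidate program on the available test inputs and grouping identical output signatures; this is a simplified sketch (no sandboxing or memory limits, and it assumes a `python3` interpreter on the PATH):

```python
import collections
import subprocess

def behavioural_signature(program_src, test_inputs, timeout=2):
    """Run a candidate Python program on each test input and collect its
    outputs; programs with identical signatures are treated as
    semantically equivalent."""
    outputs = []
    for stdin in test_inputs:
        try:
            r = subprocess.run(["python3", "-c", program_src], input=stdin,
                               capture_output=True, text=True, timeout=timeout)
            outputs.append(r.stdout)
        except subprocess.TimeoutExpired:
            outputs.append("<timeout>")
    return tuple(outputs)

def passes_example_tests(program_src, example_ios):
    """Filtering step: keep only samples that match every example output."""
    inputs = [i for i, _ in example_ios]
    return behavioural_signature(program_src, inputs) == tuple(o for _, o in example_ios)

def cluster_samples(programs, test_inputs):
    """Clustering step: group filtered samples by output signature and
    return clusters largest-first (one representative per cluster is submitted)."""
    clusters = collections.defaultdict(list)
    for src in programs:
        clusters[behavioural_signature(src, test_inputs)].append(src)
    return sorted(clusters.values(), key=len, reverse=True)

progs = ["print(int(input()) * 2)",          # doubles the input
         "n = int(input()); print(n + n)",   # semantically equivalent
         "print(int(input()) ** 2)"]         # different behaviour
ok = passes_example_tests(progs[0], [("3\n", "6\n")])
clusters = cluster_samples(progs, ["3\n", "5\n"])
```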
26. 1. Codeforces competitions evaluation
• All Codeforces competitions from 2021/12/01 to 2021/12/28 with more than 5,000 participants per
contest, a total of 10 competitions
• The system was an ensemble of 41B and 9B models with clustering (which turned out to be slightly
worse than the 41B model alone with clustering)
• First run: up to 10 submissions
• Second and third runs: more than 10 submissions
30. 2. CodeContests evaluation
• pass@k : The percentage of problems solved when they take k samples from the model for each
problem and submit all of them for evaluation on the hidden tests. If any solution in the specified sample
budget solves a problem, the problem is counted as solved
• 10@k : The percentage of problems solved when they take k samples from the model for each
problem but can only submit 10 of them for evaluation on the hidden tests
• Differences in solve rates between the validation and test sets are caused by variation in problem
distributions (the test set and validation set were collected in temporally disjoint periods), as well as
some overfitting
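The two metrics can be sketched per problem as follows; the selection of the 10 submissions here is simplified to example-test filtering only, whereas the paper also uses clustering:

```python
import random

def pass_at_k(sample_correct, k, rng=random):
    """pass@k for one problem: draw k samples and submit all of them;
    the problem counts as solved if any drawn sample is correct."""
    return any(rng.sample(sample_correct, k))

def n_at_k(sample_correct, passes_example, k, n=10, rng=random):
    """n@k for one problem: draw k samples but submit only n of them,
    chosen here by filtering on the example tests (simplified)."""
    drawn = rng.sample(range(len(sample_correct)), k)
    submitted = [i for i in drawn if passes_example[i]][:n]
    return any(sample_correct[i] for i in submitted)

random.seed(0)
solved_all = pass_at_k([True] * 5 + [False] * 5, k=10)   # at least one correct
solved_none = pass_at_k([False] * 10, k=3)               # no correct sample exists
no_budget = n_at_k([True] * 20, [False] * 20, k=10)      # nothing passes filtering
with_budget = n_at_k([True] * 20, [True] * 20, k=10)
```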
31. 2. CodeContests evaluation
• Solve rates scale log-linearly with more samples
• Scaling up the total number of samples leads to massive improvements in model solve rate
• Improving solve rate requires exponentially increasing amounts of samples
• Better models have higher slopes in the scaling curve
• Larger models tend to have better model quality
32. 2. CodeContests evaluation
• Solve rates scale log-linearly with more compute
• (a) The solve rate also scales approximately log-linearly with more training compute
• (b) Larger models take more compute to draw each sample, but they eventually outperform smaller
models even at the same sampling compute, as the better quality of samples from the larger
models becomes the dominant factor for performance
33. 2. CodeContests evaluation
• The decoder-only model was trained with the same amount of compute
• Due to the significantly longer decoder sequence length (2304 tokens), with the same amount of
training compute it consumes 50% more loss tokens than training the encoder-decoder model
• Asymmetric encoder-decoder with multi-query attention significantly improves the sampling speed
while keeping the sample quality
1. An encoder-decoder architecture with asymmetric encoder and decoder structures
2. The multi-query attention setup, which uses a single shared key head and a single shared value head
per attention block
34. 2. CodeContests evaluation
• Choice of the pre-training dataset
• Base 1B model
• GitHub (all language)
• The Python-only portion of GitHub
• The MassiveText(Gopher) generic text dataset which also includes a portion of GitHub
• Not pre-trained at all
36. 2. CodeContests evaluation
• These percentages are calculated based on the full set of samples
• Overall less than 1% of samples from the models pass example tests
37. 2. CodeContests evaluation
1. Randomly picking model samples without filtering
2. Filtering and then randomly selecting from the filtered samples
3. Filtering and then using clustering to select samples
4. Allowing unlimited evaluation attempts (the upper bound performance)
38. 3. APPS
• Fine-tuned the pre-trained models on the APPS training set without using clustering, tags, ratings, value
conditioning, or prediction, and with sampling temperature 0.25 and nucleus sampling
39. Copying from training data
• 50 C++ and 50 Python solutions from a selection of 16 validation problems that each had at least that many solutions
• The common substrings between model solutions and training data mostly contained boilerplate code
for reading and parsing the input data format, rather than key logic for solving problems (for example, a
FastIO Python class has length 805 and appears in 1.66% of all human Python solutions)
43. Sensitivity to problem descriptions (incorrect or irrelevant rewording)
• Original: You are given n integers a_1, a_2, ..., a_n. Find the maximum value of a_l times a_{l+1} for an integer l
for which 1 <= l < n.
• Opposite: You are given n integers a_1, a_2, ..., a_n. Find the minimum value of a_l times a_{l+1} for an integer l
for which 1 <= l < n.
• Related: You are given n integers a_1, a_2, ..., a_n. Find the maximum value of (a_l · a_r) over all pairs (l, r) of
integers for which 1 <= l < r <= n.
• Underspecified: You are given n integers a_1, a_2, ..., a_n. Find the maximum value of f(a_l, a_r) over all pairs (l,
r) of integers for which 1 <= l < r <= n.
• Verbose: William has been given an array for his birthday, which consists of n integers a_1, a_2, ..., a_n. He is very
proud of his array, but naturally his friend Mary is curious about it. Mary would like to know a certain function of
consecutive elements of the array. Concretely, Mary would like to know the maximum value of a_l times a_{l+1} for
an integer l for which 1 <= l < n. Can you help William by calculating this value for him?
• Algorithm described in words: You are given n integers a_1, a_2, ..., a_n. Find the maximum value of the product
of two consecutive members of the array.
• They add irrelevant information to the description, or reword it to be under-specified or incorrect
• The model conditions strongly on the description
• The model can parse algorithm descriptions given either in symbols or in natural language, and can ignore
irrelevant natural-language explanations; indeed, the model actually does better with more language-heavy
descriptions, perhaps because of the verbose nature of the training distribution