1. The document introduces Codex, a GPT language model fine-tuned on publicly available code from GitHub. It evaluates Codex's ability to generate standalone Python functions from docstrings on HumanEval, a new hand-written dataset of 164 programming problems, each with accompanying unit tests.
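To make the task format concrete, below is a sketch of a HumanEval-style problem. The prompt is a function signature plus docstring, and the model must complete the body; the completion shown here is illustrative, not actual model output.

```python
# HumanEval-style problem: the prompt is the signature and docstring;
# the body below stands in for a sampled completion.
def has_close_elements(numbers, threshold):
    """Check if any two numbers in the list are closer to each other
    than the given threshold."""
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False

# Hidden unit tests decide the functional correctness of the completion.
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0], 0.3) is True
    assert candidate([1.0, 2.0, 3.0], 0.05) is False

check(has_close_elements)
```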
2. With a single sample per problem, Codex solves 28.8% of HumanEval problems, compared to 0% for GPT-3 and 11.4% for GPT-J. Repeated sampling is highly effective: generating 100 samples per problem and counting a problem as solved if any sample passes its unit tests raises this to 70.2%.
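These numbers are reported with the paper's pass@k metric: draw n ≥ k samples per problem, count the number c that pass the unit tests, and compute an unbiased estimate of the probability that at least one of k samples is correct. A minimal sketch of that estimator, in the numerically stable product form the paper gives:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples drawn, c = samples passing the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# The reported score is pass_at_k averaged over all 164 problems.
```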
3. The document discusses the limitations of match-based code generation metrics such as BLEU and instead evaluates functional correctness: a generated sample is correct if it passes the problem's unit tests. It also describes the sandbox used to safely execute untrusted model-generated code.
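As a sketch of the functional-correctness check only (not of the paper's actual sandbox, which is described as container-based isolation with network access disabled), one could run each completion together with its unit tests in a child process under a timeout. The helper name and arguments below are hypothetical:

```python
import subprocess
import sys
import tempfile

def passes_tests(candidate_code: str, test_code: str, timeout: float = 3.0) -> bool:
    """Hypothetical helper: execute a generated completion plus its unit tests
    in a child process and report whether all assertions pass.
    NOTE: a plain subprocess is not a security boundary; the paper's sandbox
    additionally relies on container-level isolation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0  # nonzero exit => an assertion failed
    except subprocess.TimeoutExpired:
        return False  # hung or looping samples count as failures
```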