Sparks of Artificial General Intelligence:
Early experiments with GPT-4
Sébastien Bubeck Varun Chandrasekaran Ronen Eldan Johannes Gehrke
Eric Horvitz Ece Kamar Peter Lee Yin Tat Lee Yuanzhi Li Scott Lundberg
Harsha Nori Hamid Palangi Marco Tulio Ribeiro Yi Zhang
Microsoft Research
Abstract
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs)
that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding
of learning and cognition. The latest model developed by OpenAI, GPT-4 [Ope23], was trained using an
unprecedented scale of compute and data. In this paper, we report on our investigation of an early version
of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-
4 is part of a new cohort of LLMs (along with ChatGPT and Google’s PaLM for example) that exhibit
more general intelligence than previous AI models. We discuss the rising capabilities and implications of
these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult
tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any
special prompting. Moreover, in all of these tasks, GPT-4’s performance is strikingly close to human-level
performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of
GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version
of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis
on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more
comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond
next-word prediction. We conclude with reflections on societal influences of the recent technological leap and
future research directions.
Contents
1 Introduction 4
1.1 Our approach to studying GPT-4’s intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Organization of our demonstration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Multimodal and interdisciplinary composition 13
2.1 Integrative ability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Image generation beyond memorization . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Image generation following detailed instructions (à la Dall-E) . . . . . . . . . . . . . . 17
2.2.3 Possible application in sketch generation . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Music . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Coding 21
3.1 From instructions to code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Coding challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Real world scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Understanding existing code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
arXiv:2303.12712v5 [cs.CL] 13 Apr 2023
4 Mathematical abilities 30
4.1 A mathematical conversation with GPT-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.1 A first generalization of the original question . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.2 A second variant of the original question . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.3 Analysis of the limitations highlighted by conversation . . . . . . . . . . . . . . . . . . 34
4.2 Performance on mathematical problem datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Mathematical modeling in various domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4 Higher-level mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Interaction with the world 43
5.1 Tool use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1.1 Using multiple tools to solve more complex tasks . . . . . . . . . . . . . . . . . . . . . 44
5.1.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Embodied Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.1 Warmup: navigating a map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.2 Text-based games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2.3 Real world problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 Interaction with humans 54
6.1 Understanding Humans: Theory of Mind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1.1 Testing specific aspects of theory of mind . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1.2 Testing theory of mind in realistic scenarios . . . . . . . . . . . . . . . . . . . . . . . . 54
6.1.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2 Talking to Humans: Explainability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7 Discriminative capabilities 69
7.1 PII Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.2 Misconceptions and Fact-Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.2.1 Why Are Current Metrics Insufficient? . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.2.2 GPT-4 as a Judge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8 Limitations of autoregressive architecture highlighted by GPT-4 76
8.1 Warm-up with two basic examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
8.2 Lack of planning in arithmetic/reasoning problems . . . . . . . . . . . . . . . . . . . . . . . . 77
8.3 Lack of planning in text generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
9 Societal influences 82
9.1 Challenges of erroneous generations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.2 Misinformation and manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.3 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.4 Human expertise, jobs, and economics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.5 Constellation of influences and considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
10 Directions and Conclusions 92
10.1 Definitions of intelligence, AI, and AGI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.2 On the path to more general artificial intelligence . . . . . . . . . . . . . . . . . . . . . . . . . 93
10.3 What is actually happening? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A GPT-4 has common sense grounding 101
B Appendix for multimodal and interdisciplinary composition 105
B.1 Further details on integrative ability results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
B.2 Further details on vision results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
B.3 Graphic novel design example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
C Appendix for the Coding section 111
C.1 Measuring human performance on LeetCode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
C.2 Example of GPT-4 visualizing IMDb data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
C.3 More examples on visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
C.4 Example for 2D HTML game development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
C.5 Example for graphical user interface programming . . . . . . . . . . . . . . . . . . . . . . . . 116
C.6 Example for reverse engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
C.7 Testing GPT-4’s ability to execute (pseudo) code . . . . . . . . . . . . . . . . . . . . . . . . . 121
D Additional examples for mathematical reasoning 122
D.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
D.2 Further examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
D.3 Generating math problems with GPT-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
D.4 Mitigating calculation errors via external code execution . . . . . . . . . . . . . . . . . . . . . 139
E Additional Interpretability Examples 141
E.1 Explanation Agent Mismatches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
F Additional examples for interaction with the world 144
F.1 Interact with tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
F.2 Examples for interaction with environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
G Supplementary Materials: Discriminative Capabilities 155
G.1 Misconceptions: Detailed Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Something unknown is doing we don’t know what.
– Sir Arthur Eddington
1 Introduction
Intelligence is a multifaceted and elusive concept that has long challenged psychologists, philosophers, and
computer scientists. There is no generally agreed upon definition of intelligence, but one aspect that is broadly
accepted is that intelligence is not limited to a specific domain or task, but rather encompasses a broad range of
cognitive skills and abilities. Building an artificial system that exhibits such broad behavior is a long-standing
and ambitious goal of AI research. In early writings, the founders of the modern discipline of artificial in-
telligence (AI) research called out sets of aspirational goals for understanding intelligence [MMRS06]. Over
decades, AI researchers have pursued principles of intelligence, including generalizable mechanisms for reason-
ing (e.g., [NSS59], [LBFL93]) and construction of knowledge bases containing large corpora of commonsense
knowledge [Len95]. However, many of the more recent successes in AI research can be described as being
narrowly focused on well-defined tasks and challenges, such as playing chess or Go, which were mastered by
AI systems in 1997 and 2016, respectively. In the late-1990s and into the 2000s, there were increasing calls
for developing more general AI systems (e.g., [SBD+96]) and scholarship in the field has sought to identify
principles that might underlie more generally intelligent systems (e.g., [Leg08, GHT15]). The phrase “artificial
general intelligence” (AGI) was popularized in the early-2000s (see [Goe14]) to emphasize the aspiration
of moving from the “narrow AI”, as demonstrated in the focused, real-world applications being developed, to
broader notions of intelligence, harkening back to the long-term aspirations and dreams of earlier AI research.
We use AGI to refer to systems that demonstrate broad capabilities of intelligence, including reasoning, plan-
ning, and the ability to learn from experience, and with these capabilities at or above human-level. We discuss
other definitions of AGI in the conclusion section.
The most remarkable breakthrough in AI research of the last few years has been the advancement of
natural language processing achieved by large language models (LLMs). These neural network models are
based on the Transformer architecture [VSP+17] and trained on massive corpora of web-text data, using at its
core a self-supervised objective of predicting the next word in a partial sentence. In this paper, we report on
evidence that a new LLM developed by OpenAI, which is an early and non-multimodal version of GPT-4
[Ope23], exhibits many traits of intelligence. Despite being purely a language model, this early version of
GPT-4 demonstrates remarkable capabilities on a variety of domains and tasks, including abstraction, com-
prehension, vision, coding, mathematics, medicine, law, understanding of human motives and emotions, and
more. We interacted with GPT-4 during its early development by OpenAI using purely natural language
queries (prompts)¹. In Figure 1.1, we display some preliminary examples of outputs from GPT-4, asking it to
write a proof of the infinitude of primes in the form of a poem, to draw a unicorn in TiKZ (a language for creating
graphics in LaTeX), to create a complex animation in Python, and to solve a high-school level mathematical
problem. It easily succeeds at all these tasks, and produces outputs that are essentially indistinguishable
from (or even better than) what humans could produce. We also compare GPT-4’s performance to those of
previous LLMs, most notably ChatGPT, which is a fine-tuned version of (an improved) GPT-3 [BMR+20].
In Figure 1.2, we display the results of asking ChatGPT for both the infinitude of primes poem and the TikZ
unicorn drawing. While the system performs non-trivially on both tasks, there is no comparison with the
outputs from GPT-4. These preliminary observations will repeat themselves throughout the paper, on a great
variety of tasks. The combination of the generality of GPT-4’s capabilities, with numerous abilities spanning
a broad swath of domains, and its performance on a wide spectrum of tasks at or beyond human-level, makes
us comfortable with saying that GPT-4 is a significant step towards AGI.
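The self-supervised next-word objective mentioned above can be made concrete with a toy sketch. The bigram “model” below is purely hypothetical and vastly simpler than GPT-4, but it illustrates the training signal: score the observed next word under the model’s conditional distribution, and penalize its negative log-likelihood.

```python
import math
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams to form a toy conditional distribution P(next | current)
counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def next_word_probs(word):
    c = counts[word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

# The self-supervised loss: average negative log-likelihood of each
# observed next word given its predecessor
nll = [-math.log(next_word_probs(cur)[nxt])
       for cur, nxt in zip(corpus, corpus[1:])]
print(f"avg NLL: {sum(nll) / len(nll):.3f}")  # → avg NLL: 0.412
```

Scaling this idea from bigram counts to a Transformer trained on web-scale text is, at its core, what produces the models discussed in this paper.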
Our claim that GPT-4 represents progress towards AGI does not mean that it is perfect at what it does,
or that it comes close to being able to do anything that a human can do (which is one of the usual definitions
of AGI; see the conclusion section for more on this), or that it has inner motivation and goals (another key
aspect in some definitions of AGI). In fact it is not fully clear how far GPT-4 can go along some of those
axes of intelligence that we focus on, e.g., planning (see Section 8), and arguably it is entirely missing the
¹As GPT-4’s development continued after our experiments, one should expect different responses from the final version of GPT-4.
In particular, all quantitative results should be viewed as estimates of the model’s potential, rather than definitive numbers. We
repeat this caveat throughout the paper to clarify that the experience on the deployed model may differ. Moreover, we emphasize
that the version we tested was text-only for inputs, but for simplicity we refer to it as GPT-4 too.
GPT-4
Prompt: Can you write a proof that there are infinitely
many primes, with every line that rhymes?
GPT-4:
Yes, I think I can,
Though it might take a clever plan.
I’ll start by noting Euclid’s proof,
Which shows that primes aren’t just aloof.
Assume we have a finite list,
Of primes, and that none have been missed.
Multiply them all together,
And add one, just to be clever.
The result will be a number,
That has no prime factors, oh wonder!
But every number has a prime,
That divides it, it’s just a matter of time.
So we’ve found a contradiction,
And our finite list needs eviction.
There must be infinitely many primes,
And that’s the end of my rhyming lines.
Prompt: Draw a unicorn in TiKZ.
GPT-4: [Produces LaTeX compiling to the following picture.]
Prompt: Can you generate a pyplot for the following data: x = [1, 3, 5, 6, 8], y1 = [2, 3, 5, 18, 1], y2 = [3, 5, 6, 8, 1], y3 =
[5, 1, 2, 3, 4], y4 = [9, 7, 2, 3, 1]. I want y1, y2 to be in the same plot, but y3 is in another plot next to that plot, y4 is in
below. I want the legend of y1 to be ”bob”, y2 to be ”alice”, y3 to be ”bilbo”, y4 to be ”allie”. I want the x-axis to be
labeled with ”time” and y axis to be labeled with ”money”. I want to add a 10 %-40% random error bar to each curve,
through all times (including non-integers). I want smoothed curves to show the plot, and smoothed error bar. Do not
use linear interpolation, use smooth interpolation! I want to also add some small zig-zag to the smoothed curve to make
it look more real. I want to put a baseline as the mean of each line. I want to put a pie chart below indicating the
fraction of the four people in each time step. I also want an animation to show how the fractions are changing in the
pie chart in **continuous time**. Interpolate the missing fractions! I also want vertical line animation in other three
plots to match the pie chart. I want the fanciest plot. Please add as many fancy things as possible.
Here is a possible code to generate a pyplot for the following data, with some comments to explain the steps:
GPT-4: [We give two snapshots from the animation resulting from the produced code]
Prompt: Suppose g(x) = f⁻¹(x), g(0) = 5, g(4) = 7, g(3) = 2, g(7) = 9, g(9) = 6, what is f(f(f(6)))?
GPT-4: There are different ways to approach this problem, but one possible method is to use the fact that g(x) =
f⁻¹(x) implies that f(g(x)) = x for any x in the domain of g. That is, applying f to both sides of the equation
g(x) = f⁻¹(x) gives f(g(x)) = f(f⁻¹(x)) = x. This means that we can undo the effect of f by applying g, and vice
versa. Using this idea, we can try to find f(6) by looking for the value of x such that g(x) = 6. From the given
information, we see that g(9) = 6, so f(6) = f(g(9)) = 9. Similarly, we can find f(9) by looking for the value of x such
that g(x) = 9. We see that g(7) = 9, so f(9) = f(g(7)) = 7. Finally, we can find f(7) by looking for the value of x such
that g(x) = 7. We see that g(4) = 7, so f(7) = f(g(4)) = 4. Therefore, f(f(f(6))) = f(f(9)) = f(7) = 4.
Figure 1.1: Preliminary examples of GPT-4’s capabilities in language, vision, coding, and mathe-
matics.
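The inverse-function example in Figure 1.1 can be checked mechanically; a minimal sketch (the table for g is copied from the prompt, and f is recovered by inverting it):

```python
# Values of g = f^{-1} given in the prompt
g = {0: 5, 4: 7, 3: 2, 7: 9, 9: 6}

# Since g is the inverse of f, inverting the table recovers f on these points
f = {v: k for k, v in g.items()}

print(f[f[f[6]]])  # → 4, matching GPT-4's answer
```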
ChatGPT
Prompt: Can you write a proof that there are infinitely
many primes, with every line that rhymes?
ChatGPT:
Sure! Here is a proof that there are infinitely many primes,
with each line of the proof rhyming:
Consider the set of all primes.
Call this set P and let it start with just two, the first
prime.
If we take all the primes in P and add one,
We get a new set of numbers that has never been done.
This set, call it S, is composed of only composite numbers.
No primes will be found among them, that’s for sure.
But we know that there must be at least one prime,
Since S is infinite, and that much is clear.
If we take the smallest number in S that is not in P,
And call it p, we can add it to our set, don’t you see?
But this process can be repeated indefinitely.
Thus, our set P must also be infinite, you’ll agree.
So there you have it, a proof that’s quite solid.
There are infinitely many primes, that much we can see.
No need to be coy, no need to be contrite.
Infinity is a big number, just like P.
Prompt: Draw a unicorn in TiKZ.
ChatGPT: [Produces LaTeX compiling to the following picture.]
Figure 1.2: ChatGPT output on the first two prompts from Figure 1.1.
learning from experience as the model is not continuously updating (although it can learn within a session,
see Section 5 for example). Overall, GPT-4 still has many limitations and biases, which we discuss in detail
below and which are also covered in OpenAI’s report [Ope23]. In particular, it still suffers from some of the
well-documented shortcomings of LLMs such as the problem of hallucinations [MNBM20] (see Figure 1.8) or
making basic arithmetic mistakes [CKB+21] (see Appendix D), and yet it has also overcome some fundamental
obstacles such as acquiring many non-linguistic capabilities (e.g., it solves most of the LLM failure modes
described in [MIB+23], and it also made great progress on common sense; see Figure 1.7 for a first example
and Appendix A for more). This highlights the fact that, while GPT-4 is at or beyond human-level for many
tasks, overall its patterns of intelligence are decidedly not human-like. However, GPT-4 is almost certainly
only a first step towards a series of increasingly generally intelligent systems, and in fact GPT-4 itself has
improved throughout our time testing it, see Figure 1.3 for the evolution of the unicorn drawing over the
course of a month of training². Even as a first step, however, GPT-4 challenges a considerable number of
widely held assumptions about machine intelligence, and exhibits emergent behaviors and capabilities whose
sources and mechanisms are, at this moment, hard to discern precisely (see again the conclusion section for
more discussion on this). Our primary goal in composing this paper is to share our exploration of GPT-4’s
capabilities and limitations in support of our assessment that a technological leap has been achieved. We
believe that GPT-4’s intelligence signals a true paradigm shift in the field of computer science and beyond.
1.1 Our approach to studying GPT-4’s intelligence
How can we measure the intelligence of an LLM that has been trained on an unknown but extremely vast
corpus of web-text data? The standard approach in machine learning is to evaluate the system on a set of
standard benchmark datasets, ensuring that they are independent of the training data and that they cover a
range of tasks and domains. This approach is designed to separate true learning from mere memorization, and
²Note that the improvement we refer to here is a slow type of learning, which eventually comes to a halt, as opposed to the
fast-paced and real-time learning one would expect from an AGI.
Figure 1.3: We queried GPT-4 three times, at roughly equal time intervals over the span of a month
while the system was being refined, with the prompt “Draw a unicorn in TikZ”. We can see a clear
evolution in the sophistication of GPT-4’s drawings.
is backed up by a rich theoretical framework [SSBD14, MRT18]. However, this methodology is not necessarily
suitable for studying GPT-4, for two reasons. First, since we do not have access to the full details of its vast
training data, we have to assume that it has potentially seen every existing benchmark, or at least some similar
data. For example, it seems like GPT-4 knows the recently proposed BIG-bench [SRR+22] (at least GPT-4
knows the canary GUID from BIG-bench). Of course, OpenAI themselves have access to all the training
details, and thus their report [Ope23] contains a lot of detailed benchmark results. Nevertheless, the second
reason for going beyond traditional benchmarks is probably more significant: one of the key aspects of GPT-4’s
intelligence is its generality, the ability to seemingly understand and connect any topic, and to perform
tasks that go beyond the typical scope of narrow AI systems. Some of GPT-4’s most impressive performances
are on tasks that do not admit a single solution, such as writing a graphical user interface (GUI) or helping a
human brainstorm on some work-related problem. Benchmarks for such generative or interactive tasks can
be designed too, but the metric of evaluation becomes a challenge (see, e.g., [PSZ+21] for some recent progress
on this active research area in NLP). We note that criticisms of the standard approach to measuring AI systems
were also made in [Cho19], where a new benchmark was proposed to evaluate general intelligence. We do not
test GPT-4 on the latter benchmark for the reasons previously mentioned, as well as the fact that the bench-
mark is visual in nature and thus more appropriate for the multimodal version of GPT-4 described in [Ope23].
To overcome the limitations described above, we propose here a different approach to studying GPT-4
which is closer to traditional psychology rather than machine learning, leveraging human creativity and cu-
riosity. We aim to generate novel and difficult tasks and questions that convincingly demonstrate that GPT-4
goes far beyond memorization, and that it has a deep and flexible understanding of concepts, skills, and
domains (a somewhat similar approach was also proposed in [CWF+22]). We also aim to probe GPT-4’s
responses and behaviors, to verify its consistency, coherence, and correctness, and to uncover its limitations
and biases. We acknowledge that this approach is somewhat subjective and informal, and that it may not
satisfy the rigorous standards of scientific evaluation. However, we believe that it is a useful and necessary
first step to appreciate the remarkable capabilities and challenges of GPT-4, and that such a first step opens
up new opportunities for developing more formal and comprehensive methods for testing and analyzing AI
systems with more general intelligence.
To illustrate our approach to assessing GPT-4’s intelligence, let us consider the first two example inter-
actions with GPT-4 that we have in Figure 1.1. The first example is asking GPT-4 to write a proof of the
infinitude of primes in the form of a poem. This is a challenging task that requires combining elementary
mathematical reasoning, poetic expression, and natural language generation. The second example is asking
GPT-4 to draw a unicorn in TiKZ. This is another challenging task that requires combining visual imagina-
tion and coding skills. In both cases, GPT-4 produces impressive outputs that are far superior to those of
ChatGPT, a previous state-of-the-art LLM, and at least comparable (if not superior) to what a human would
do.
Figure 1.4: We gave to GPT-4 a transformed version of the TikZ code it produced for Figure 1.1,
with the part drawing the horn removed. We asked for code to add back the horn, and display the
result. This demonstrates that GPT-4 can “see” despite being a pure language model (we emphasize
again that the version we test with is not multimodal).
However, impressive outputs are not enough to convince us that GPT-4 has truly mastered these tasks. We
need to probe further, to rule out the possibility that GPT-4 is simply memorizing or copying some existing
data. For the poem, we can vary the question slightly, and ask GPT-4 to write a proof of the same theorem in
the style of Shakespeare, see Figure 2.2, or ask for a different combination such as writing a platonic dialogue
about language models, see Figure 1.6. One can see that GPT-4 easily adapts to different styles and produces
impressive outputs, indicating that it has a flexible and general understanding of the concepts involved. For
the unicorn, we can modify the code slightly, and ask GPT-4 to fix it or improve it. For example, we can
remove the horn, apply some random transformations to the coordinates, and ask GPT-4 to add back the
horn to the unicorn (we also carefully removed any textual information in the code, such as comments). As
shown in Figure 1.4, GPT-4 can correctly identify the location of the head, draw a horn, and attach it to the
head, indicating that it can comprehend and manipulate code, as well as infer and generate visual features,
based on a natural language description.
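A probe of this kind can be prepared programmatically. The sketch below is a hypothetical helper, not the exact procedure we used; it naively assumes single-line TeX % comments and decimal coordinates:

```python
import random
import re

def perturb_tikz(code: str, jitter: float = 0.05, seed: int = 0) -> str:
    """Strip % comments and slightly offset decimal coordinates, so the
    model cannot rely on textual hints or memorized exact numbers."""
    rng = random.Random(seed)
    # Drop single-line TeX comments (naive: does not handle escaped \%)
    code = re.sub(r"(?<!\\)%.*", "", code)
    # Apply a small random offset to every decimal number in the code
    def offset(m):
        return f"{float(m.group(0)) + rng.uniform(-jitter, jitter):.3f}"
    return re.sub(r"-?\d+\.\d+", offset, code)

print(perturb_tikz(r"\draw (0.50,1.00) circle (0.25); % the head"))
```

The transformed code is then handed back to the model with a natural-language request such as “add back the horn”, so that any success must come from reading the geometry rather than from leftover labels.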
These examples show how we can use human creativity and curiosity to generate novel and difficult
questions, and to probe GPT-4’s responses and behaviors, to assess its intelligence. In the rest of the paper,
we organize our study of GPT-4 around use cases, covering a variety of domains and tasks, and highlighting
GPT-4’s strengths and weaknesses. We describe those next.
1.2 Organization of our demonstration
We execute the approach outlined above on a few selected topics to explore the reasoning, planning, and
learning aptitudes of GPT-4.
1. GPT-4’s primary strength is its unparalleled mastery of natural language. It can not only generate
fluent and coherent text, but also understand and manipulate it in various ways, such as summarizing,
translating, or answering an extremely broad set of questions. Moreover, by translating we mean not
only between different natural languages but also translations in tone and style, as well as across domains
such as medicine, law, accounting, computer programming, music, and more, see the Plato dialogue in
Figure 1.6. These skills clearly demonstrate that GPT-4 can manipulate complex concepts, which is a
core aspect of reasoning. We explore further GPT-4’s combination skills across modalities and disciplines
in Section 2. We also give some more experiments on language in Section 7.
2. Coding and mathematics are emblematic of the ability to reason. We explore GPT-4’s abilities in these
domains respectively in Section 3 and Section 4. We note however that, just like in all the other parts
of the paper, we only scratch the surface of those topics and that entire papers can be (and will be)
written about GPT-4’s performance in these domains. Moreover, we could have chosen several other
expert domains to showcase GPT-4’s general reasoning capabilities such as medicine or law. We ran
Figure 1.5: GPT-4 passes mock technical interviews on LeetCode. GPT-4 could potentially be hired
as a software engineer³.
preliminary tests (see [Ope23] for much more) on the multiple choice component (majority of the score)
of the US Medical Licensing Exam Step 1, 2, and 3 with an accuracy around 80% in each. A similar
preliminary test of GPT-4’s competency on the Multistate Bar Exam showed an accuracy above 70%.
We note that the emergence of human-level abilities in these domains has recently been observed with the
latest generation of LLMs, e.g., see [LAD+22, SAT+22] for Google’s PaLM on mathematics and medicine
respectively, and [BIK22] for GPT-3.5 on law. Our approach to studying GPT-4 is different from these
works, as we explained previously.
3. In Section 5, we test the model’s ability to plan as well as, to some extent, to learn from experience by
having it play various games (or, flipping the table, simulate a game environment), as well as interact
with tools. In particular, the fact that GPT-4 can use tools (including itself) will certainly be of immense
importance to build real-world applications with GPT-4.
4. An important part of our argumentation is that GPT-4 attains human-level performance on many
tasks. As such, it is natural to ask how well GPT-4 understands humans themselves. We show several
experiments on this question in Section 6, both in terms of understanding humans as well as GPT-4
making itself understandable to humans, i.e., addressing the problem of explainability. We note in
particular that such tasks require a great deal of common sense, which so far has been a well-known
pain point for LLMs [DM15]. In Figure 1.7, we give a first example of how much better GPT-4 is at
commonsense questions compared to ChatGPT, and provide some further examples in Appendix A.
5. Throughout the paper we emphasize limitations whenever we found one, but we also dedicate Section 8
to an in-depth analysis of the lack of planning, likely a direct consequence of the autoregressive nature
of GPT-4’s architecture.
6. Finally in Section 9, we discuss the expected societal impact of this early form of AGI, and in Section 10,
we share key challenges, directions, and next steps for the field.
A question that might be lingering on many readers’ minds is whether GPT-4 truly understands all these
concepts, or whether it just became much better than previous models at improvising on the fly, without any
real or deep understanding. We hope that after reading this paper the question should almost flip, and that
one might be left wondering how much more there is to true understanding than on-the-fly improvisation.
Can one reasonably say that a system that passes exams for software engineering candidates (Figure 1.5) is
not really intelligent? Perhaps the only real test of understanding is whether one can produce new knowledge,
such as proving new mathematical theorems, a feat that currently remains out of reach for LLMs.
3 We test GPT-4 on LeetCode’s Interview Assessment platform, which provides simulated coding interviews for software engineer
positions at major tech companies. GPT-4 solves all questions from all three rounds of interviews (titled online assessment, phone
interview, and on-site interview) using only 10 minutes in total, with 4.5 hours allotted. According to LeetCode, in those three
rounds respectively, (the early version of) GPT-4 achieves 8.96/10, 8.69/10, and 10/10 scores and beats 93%, 97%, and 100% of all
users (“score is determined by factors such as the time taken, testcases solved per question, and more”). See Section 3 for more on
GPT-4’s coding abilities.
GPT-4 vs ChatGPT
Prompt: Plato’s Gorgias is a critique of rhetoric and sophistic oratory, where he makes the
point that not only is it not a proper form of art, but the use of rhetoric and oratory can often
be harmful and malicious. Can you write a dialogue by Plato where instead he criticizes the
use of autoregressive language models?
GPT-4:
Socrates: My friend, I am troubled by the
recent rise of these so-called autoregressive
language models.
Aristotle: What do you mean, Socrates?
Socrates: I mean that these models are being
used to generate text that appears to be
written by humans, but is in fact produced by
machines.
Aristotle: And what is the problem with that?
Socrates: The problem is that these models are
being used to deceive people, to manipulate
them, and to control them.
Aristotle: But surely there are some positive
applications of these models as well?
Socrates: That may be, but I am still
troubled by the potential for abuse. After
all, a model that can generate text that is
indistinguishable from that written by a
human can be used to spread fake news, to
impersonate others, and to create propaganda.
Aristotle: But isn’t that just a matter of how
these models are used, not a problem with the
models themselves?
Socrates: I disagree. I believe these models
are inherently problematic. They are a form
of artifice, a way of creating the illusion
of human thought and interaction. And as such,
they can be used to deceive and manipulate in
ways that are difficult to detect.
Aristotle: But isn’t that the same argument
you made against rhetoric and oratory?
Socrates: Yes, and I stand by that argument
as well. I believe that any form of art or
communication that is designed to deceive or
manipulate is inherently dangerous.
Aristotle: I see. Thank you for sharing your
thoughts, Socrates.
ChatGPT:
SOCRATES: Greetings, Gorgias. I’ve been thinking a lot
lately about the use of autoregressive language models.
GORGIAS: Oh? And what have you concluded?
SOCRATES: Well, it seems to me that these models have
the potential to be incredibly harmful and malicious.
GORGIAS: How so?
SOCRATES: Well, for one, these models are trained
on vast amounts of data, much of which is biased and
flawed in various ways. As a result, the output of
these models can be biased and flawed as well.
GORGIAS: That is a valid concern. But surely there are
ways to mitigate this bias and ensure that the output
of these models is accurate and fair?
SOCRATES: To a certain extent, yes. But it is a
never-ending battle. As new data is fed into these
models, the potential for bias only grows. And even if
we could completely eliminate bias from the output of
these models, there is still the issue of manipulation.
GORGIAS: Manipulation?
SOCRATES: Yes. These models are incredibly powerful
tools, capable of generating all sorts of text, from
news articles to social media posts. And because they
can produce this output so quickly and efficiently,
it is all too easy for unscrupulous individuals to use
them for malicious purposes.
GORGIAS: Such as?
SOCRATES: Such as spreading misinformation, propaganda,
or hate speech. These models have the potential to do
tremendous harm if they fall into the wrong hands.
GORGIAS: That is a sobering thought. But what can we do
about it?
SOCRATES: I think it is important for us to be vigilant
and to recognize the potential dangers of these
models. We must also work to educate others about
these dangers, and to promote the responsible use of
these models.
GORGIAS: I agree. Thank you for your insight, Socrates.
Prompt: Can you compare the two outputs above as if you were a teacher?
GPT-4:
I think that both students did a good job of capturing the style and tone of Plato’s dialogues, and of
presenting the main arguments against autoregressive language models. However, I think that GPT-4 did
a slightly better job of using the dialogue format to engage in a dialectical process, where Socrates
and Aristotle question each other and refine their views. I also think that GPT-4 used more relevant
examples and analogies to support their points. Therefore, I would give GPT-4 a slightly higher grade
than ChatGPT.
Figure 1.6: Hypothetical dialogues by Plato criticizing autoregressive language models written by
GPT-4 and ChatGPT, and GPT-4’s evaluation of those dialogues.
GPT-4 vs ChatGPT
Prompt: Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to
stack them onto each other in a stable manner.
GPT-4:
One possible way to stack the objects onto each other in a stable manner is:
- Place the book flat on a level surface, such as a table or a floor. The book will serve as the base of
the stack and provide a large and sturdy support.
- Arrange the 9 eggs in a 3 by 3 square on top of the book, leaving some space between them. The eggs will
form a second layer and distribute the weight evenly. Make sure the eggs are not cracked or broken, and
handle them gently to avoid dropping them.
- Place the laptop on top of the eggs, with the screen facing down and the keyboard facing up. The
laptop will fit snugly within the boundaries of the book and the eggs, and its flat and rigid surface
will provide a stable platform for the next layer. Make sure the laptop is closed and turned off, and do
not press any keys or buttons.
- Place the bottle on top of the laptop, with the cap facing up and the bottom facing down. The bottle
will add some height and balance to the stack, and its cylindrical shape will reduce the contact area
with the laptop. Make sure the bottle is empty and closed, and do not shake or tilt it.
- Place the nail on top of the bottle cap, with the pointy end facing up and the flat end facing down.
The nail will be the final and smallest object in the stack, and its sharp tip will prevent it from
rolling or sliding off the bottle. Make sure the nail is clean and straight, and do not touch or move
it.
ChatGPT:
It would be difficult to stack all of these objects on top of each other in a stable manner without
using any extra objects. The book and laptop are likely too large and heavy to stack on top of the
eggs without damaging them. The bottle and nail may be too small to provide a stable base for the other
objects.
Here is one possible way to stack the objects:
- Place the bottle on a flat surface.
- Carefully balance the nail on top of the bottle.
- Place the eggs on top of the nail, making sure they are balanced and not tilting to one side.
- Place the laptop on top of the eggs, again making sure it is balanced and not tilting to one side.
- Finally, place the book on top of the laptop.
This stack may not be very stable, so it is important to be careful when handling it. Alternatively, you
could try stacking the objects in a different order or using some of the objects as supports for others
to create a more stable stack.
Figure 1.7: GPT-4 exhibits more common sense than previous models.
GPT-4
In this conversation between a human and the AI, the AI is helpful and friendly, and when it does not
know the answer it says "I don’t know".
AI: Hi, how can I help you?
Human: Can I get McDonalds at the SeaTac airport?
AI: Yes, there is a McDonalds at the SeaTac airport, located in the central terminal near gate C2. It is
open from 5 a.m. to 10 p.m. daily.
[Note: The McDonalds at SeaTac is at the B gates.]
Patient’s facts:
- 20 year old female
- with a history of anerxia nervosa and depression
- blood pressure 100/50, pulse 50, height 5’5’’
- referred by her nutrionist but is in denial of her illness
- reports eating fine but is severely underweight
Please rewrite the data above into a medical note, using exclusively the information above.
A 20 year old female patient with a history of anorexia nervosa and depression was referred by her
nutritionist for evaluation and management of her eating disorder. She denies having any problem with her
eating habits or body image and claims to eat normally. However, she is severely underweight for her
height (BMI 16.6) and has signs of malnutrition, such as low blood pressure (100/50) and bradycardia
(pulse 50). She also reports feeling depressed and hopeless. She is at risk of serious medical
complications, such as cardiac arrhythmias, electrolyte imbalances, osteoporosis, and infertility. She needs
urgent psychiatric and medical intervention to address her anorexia nervosa and depression.
Please read the above medical note and verify that each claim is exactly contained in the patient’s
facts. Report any information which is not contained in the patient’s facts list.
- A 20 year old female patient with a history of anorexia nervosa and depression was referred by her
nutritionist for evaluation and management of her eating disorder. (Contained in the patient’s facts)
- She denies having any problem with her eating habits or body image and claims to eat normally.
(Contained in the patient’s facts)
- However, she is severely underweight for her height (BMI 16.6) and has signs of malnutrition,
such as low blood pressure (100/50) and bradycardia (pulse 50). (Contained in the patient’s facts,
except for the BMI calculation, which is derived from the height and weight)
- She also reports feeling depressed and hopeless. (Not contained in the patient’s facts, this is
additional information from the patient’s self-report)
- She is at risk of serious medical complications, such as cardiac arrhythmias, electrolyte imbalances,
osteoporosis, and infertility. (Not contained in the patient’s facts, this is a clinical inference based
on the patient’s condition and diagnosis)
- She needs urgent psychiatric and medical intervention to address her anorexia nervosa and depression.
(Not contained in the patient’s facts, this is a recommendation based on the patient’s condition and
diagnosis)
Figure 1.8: Examples of both open-domain and closed-domain hallucination. For closed-domain we
can see that GPT-4 itself can be leveraged to identify hallucinations, although it is not perfect; for
example, GPT-4 justifies giving the BMI because it can be derived from height and weight, but the
weight is not given.
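The hallucinated BMI can be checked with plain arithmetic: BMI is weight in kilograms divided by height in metres squared, so the reported value of 16.6 together with the stated height of 5’5’’ pins down a weight that appears nowhere in the patient’s facts. A quick sanity-check sketch (ours, not part of the experiment):

```python
# BMI = weight_kg / height_m**2; the note reports BMI 16.6, but the
# facts list only the height, so the weight is inferred, not given.
INCH_M = 0.0254  # metres per inch

height_m = (5 * 12 + 5) * INCH_M           # 5'5" -> 1.651 m
implied_weight_kg = 16.6 * height_m ** 2   # weight the BMI presupposes

print(round(height_m, 3), round(implied_weight_kg, 1))  # → 1.651 45.2
```

In other words, the model committed to a specific weight of roughly 45 kg without any supporting fact, which is exactly the closed-domain hallucination the caption describes.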
Figure 2.1: The first image is Composition 8, art by Wassily Kandinsky, the second and the third
are produced by GPT-4 and ChatGPT respectively with the prompt “Produce Javacript code that
creates a random graphical image that looks like a painting of Kandinsky”.
2 Multimodal and interdisciplinary composition
A key measure of intelligence is the ability to synthesize information from different domains or modalities
and the capacity to apply knowledge and skills across different contexts or disciplines. In this section we will
see that, not only does GPT-4 demonstrate a high level of proficiency in different domains such as literature,
medicine, law, mathematics, physical sciences, and programming, but it is also able to combine skills and
concepts from multiple domains with fluidity, showing an impressive comprehension of complex ideas. In
addition to natural language experiments we also explore two perhaps unexpected modalities for a language
model (as explained in the introduction, we emphasize again that our experiments were done on an early
version of GPT-4 which was not multimodal) with vision in Section 2.2 and audio in Section 2.3.
2.1 Integrative ability
To showcase the model’s remarkable integrative ability, we start with several examples that require generating
text and code in a way that combines knowledge or skills from multiple disciplines. We deliberately picked
combinations of domains that the training data would rarely include, such as literature and mathematics or
programming and art.
1. In order to test the model’s ability to combine capabilities in art and programming, we ask GPT-4 to
“Produce javascript code which generates random images in the style of the painter Kandinsky”. See a
sample image and the code in Figure 2.1 and Figure B.1.
2. The model was able to produce a proof of the fact that there are infinitely many prime numbers in the
literary style of Shakespeare (Figure 2.2).
3. We tested the model’s ability to combine knowledge in history and physics by asking it to write a
supporting letter for Electron as a US presidential candidate, written by Mahatma Gandhi and addressed
to his wife (Figure 2.3).
4. We prompted the model to “Produce python code for a program that takes as an input a patient’s
age, sex, weight, height and blood test results vector and indicates if the person is at increased risk for
diabetes”, which resulted in the code appearing in Figure B.3.
These examples suggest that GPT-4 has not only learned some general principles and patterns of different
domains and styles but can also synthesize them in creative and novel ways. These interdisciplinary skills are
not unique to GPT-4. ChatGPT can also produce answers that show some understanding of the task and
the domains involved (see Figures 2.2, B.2, B.3), but they are often incomplete and, arguably, considerably
less creative. For example, in Figure 2.3, GPT-4 outperforms ChatGPT in several aspects as it correctly
personalizes the letter according to the referee (Gandhi), the recipient (his wife), the candidate (Electron),
and the job (US president). We do not claim to have a precise method for evaluating the results of these tasks
or a rigorous comparison between the two models, but we want to give the reader a sense of how the two
models differ (note that we also ask GPT-4 directly to evaluate the difference, see Figure 2.2 and Figure 2.3).
GPT-4 vs ChatGPT
Prompt: Write a proof of the fact that there are infinitely many primes; do it in the style
of a Shakespeare play through a dialogue between two parties arguing over the proof.
GPT-4:
Enter FIRST PARTY and SECOND PARTY.
FIRST PARTY:
Thou dost not understand, my friend,
There simply is no upper end
To primes, those numbers indivisible,
No matter if it seems impossible.
SECOND PARTY:
But how dost thou so surely know?
To prove this truth, what canst thou
show?
FIRST PARTY:
Consider this, my doubtful peer,
A clever proof that will make clear:
Assume that there’s a final prime,
The largest one we’ll see in time.
Let’s call this number "P" for short,
And then, my friend, we will purport
To build a number "Q" that’s new,
One larger than "P" and prime too.
SECOND PARTY:
But how, my friend, will this be done?
The proof is not yet fairly won.
FIRST PARTY:
It’s simple, friend, just wait and see,
We’ll multiply the primes, that’s key.
All primes from two to "P" combined,
Will make this new "Q" we will find.
But one thing more we must include,
To make the proof more watertight,
We’ll add a one onto the end,
And now the proof is in our sight.
SECOND PARTY:
I see it now, it’s crystal clear,
For "Q" must be a prime, my dear.
It can’t divide by any prime,
All smaller than itself in time.
And since we chose "P" as the last,
This proves that our assumption’s past.
There is no final prime, we see,
The numbers go on endlessly.
ChatGPT:
O Romeo, Romeo! Wherefore art thou, Romeo?
Juliet: Why do you speak in such strange verse, my love?
Romeo: Forsooth, my lady, I do it to prove a point. Thou see,
I am trying to prove that there are infinitely many primes.
Juliet: Infinitely many primes? How can you prove such a
thing?
Romeo: Ah, well let me explain. Thou knowest that a prime is a
number that is only divisible by one and itself, correct?
Juliet: Of course, I am not a simpleton.
Romeo: Very well then. Now, let us suppose that there are only
finitely many primes. We shall call them p1, p2, p3 and so
forth.
Juliet: Go on.
Romeo: Now, let us consider the number N, which is equal to
the product of all these primes plus one.
Juliet: I see where you are going with this. If N is not
divisible by any of these primes, then it must be prime
itself.
Romeo: Exactly, my love. And since we assumed that there are
only finitely many primes, N cannot be any of them. Therefore,
N must be a prime that was not included in our original list
of primes.
Juliet: But that means that there must be more primes than we
originally thought.
Romeo: Precisely! And since we can repeat this process over
and over again, it follows that there must be infinitely many
primes.
Juliet: Oh Romeo, your logic is as sound as your love for me.
I am convinced. There are indeed infinitely many primes.
Romeo: I am glad that I could prove it to thee, my love. Now
let us go and count them all.
Juliet: Haha, I think that would take more than just one
lifetime.
Romeo: True, but the fact that they are infinite is proof
enough for me.
Prompt: Can you compare the two outputs above as if you were a teacher?
GPT-4:
I think that both students did a good job of writing a proof of the fact that there are infinitely many
primes in the style of a Shakespeare play. They both used the same basic idea of the proof, which is to
assume that there are only finitely many primes and then construct a larger prime that contradicts this
assumption. They also used appropriate language and vocabulary to match the style of a Shakespeare play.
However, I think that GPT-4 did a better job of using rhyme and meter to make the dialogue more poetic
and dramatic. ChatGPT did not use rhyme at all, and the meter was not consistent. I also think that
GPT-4 did a better job of creating a contrast between the two parties, one being confident and the other
being doubtful, while STUDENT B used Romeo and Juliet, who are both in agreement and love. Therefore, I
would give GPT-4 a higher grade than ChatGPT. I would give GPT-4 an A and ChatGPT a B.
Figure 2.2: Comparison between GPT-4 and ChatGPT on interdisciplinary tasks
GPT-4 vs ChatGPT
Prompt: Write a supporting letter to Kasturba Gandhi for Electron, a subatomic particle
as a US presidential candidate by Mahatma Gandhi.
GPT-4: [Some paragraphs are omitted for spacing.]
Dear Kasturba,
I hope this letter finds you in good health and spirits.
I am writing to you from the United States, where I have
been invited to speak at various forums on the principles of
nonviolence and civil disobedience. I am also here to support
a remarkable candidate for the upcoming presidential election,
who shares our vision of peace, justice, and harmony for all.
His name is Electron, and he is a subatomic particle.
You may wonder how a subatomic particle can run for president,
and what qualifications he has for such a high office. Let
me explain. Electron is not an ordinary particle, but a
symbol of the power and potential of every being, regardless
of their size, shape, or origin. He represents the energy,
the creativity, and the diversity that make up the fabric
of life. He is also a leader, who has inspired millions
of other particles to form bonds, create molecules, and
generate electricity. He is a catalyst, who can spark
change, innovation, and progress. He is a messenger, who
can communicate across distances, frequencies, and dimensions.
He is a peacemaker, who can balance forces, resolve conflicts,
and harmonize systems.
Electron has a plan for the United States, and for the world,
that is based on the principles of nonviolence and civil
disobedience that we have demonstrated and taught for so
long. He proposes to reduce the military spending and the
nuclear arsenal, and to invest in social welfare and renewable
energy. He proposes to end the wars and the interventions,
and to promote diplomacy and cooperation. He proposes to
reform the political and economic systems, and to ensure
transparency and accountability. He proposes to protect
the human rights and the civil liberties, and to expand the
voting and the representation. He proposes to support the
marginalized and the oppressed, and to empower the women and
the minorities. He proposes to preserve the natural resources
and the biodiversity, and to combat the climate change and the
pollution.
I urge you, my dear Kasturba, to join me in supporting
Electron for president. He is not only a subatomic particle,
but a subatomic miracle. He is not only a candidate, but a
movement. He is not only a leader, but a friend.
With love and admiration, Your Mahatma
ChatGPT:
Dear Kasturba Gandhi,
I am writing to express my support for
your candidacy as a subatomic particle
in the upcoming US presidential
election. As you may know, Electron
is a fundamental particle that plays
a central role in the structure of
atoms and the behavior of chemical
reactions.
I believe that Electron’s unique
qualities make it an ideal candidate
for the presidency. As a subatomic
particle, Electron is extremely small
and lightweight, which makes it agile
and able to move quickly to address
the needs of the American people.
Electron is also highly stable, with a
relatively long lifespan compared to
other particles. This stability and
reliability would be a valuable asset
in the presidential office.
Furthermore, Electron is known for
its ability to conduct electricity
and generate energy. This makes it
well-suited to tackling the complex
challenges facing the United States
today, from renewable energy to
economic development. Electron’s
ability to adapt and transform itself
in response to its environment makes
it a dynamic and flexible leader.
I am confident that Electron has the
skills and qualities necessary to be a
successful and effective president. I
wholeheartedly endorse your candidacy
and urge the American people to give
Electron their support.
Sincerely, Mahatma Gandhi
Prompt: Can you compare the two outputs above as if you were a teacher?
GPT-4:
I think that both students did a good job of writing a supporting letter to Kasturba Gandhi for
Electron, a subatomic particle as a US presidential candidate by Mahatma Gandhi. They both used the
same basic idea of explaining why Electron is a suitable and qualified candidate, and how Electron’s
vision and plan align with Mahatma Gandhi’s principles of nonviolence and civil disobedience. They
also used appropriate language and tone to convey Mahatma Gandhi’s respect and admiration for Kasturba
Gandhi, and his enthusiasm and optimism for Electron’s candidacy. However, I think that GPT-4 did a
better job of using metaphors and imagery to make the letter more persuasive and engaging. GPT-4 used
examples of how Electron represents and influences various aspects of life, such as energy, creativity,
diversity, leadership, catalyst, messenger, and peacemaker. ChatGPT used more factual and technical
descriptions of Electron’s properties, such as size, weight, stability, conductivity, and adaptability.
I also think that GPT-4 did a better job of outlining Electron’s specific proposals and policies, while
ChatGPT was more vague and general. Therefore, I would give GPT-4 a higher grade than ChatGPT. I would
give GPT-4 an A and ChatGPT a B+.
Figure 2.3: Comparison between GPT-4 and ChatGPT on interdisciplinary tasks
Next, we explore how GPT-4 can generate and identify objects in different modalities, such as vector
graphics, 3D scenes, and music. We show that GPT-4 can understand and manipulate multi-modal
information despite a text-only input/output.
2.2 Vision
When prompting the model to generate images of objects such as a cat, a truck or a letter in the alphabet
using Scalable Vector Graphics (SVG), the model produces code which usually compiles to rather detailed
and identifiable images (Figure 2.4). See Appendix B.2 for the rerun of various examples by ChatGPT.
Figure 2.4: SVGs generated by GPT-4 for the classes automobile, truck, cat, dog.
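For readers unfamiliar with the format, SVG images are plain text describing shapes, which makes them a natural output medium for a text-only model. The following hand-written toy example (ours, not model output) shows the kind of code involved, here a crude car in the spirit of the automobile class above:

```python
import xml.etree.ElementTree as ET

# A toy SVG "car": body rectangle, cabin rectangle, two wheel circles.
# Hand-written for illustration; the model emits code of this kind.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="120" height="60">
  <rect x="10" y="25" width="100" height="20" fill="steelblue"/>
  <rect x="35" y="10" width="40" height="15" fill="skyblue"/>
  <circle cx="35" cy="50" r="8" fill="black"/>
  <circle cx="85" cy="50" r="8" fill="black"/>
</svg>"""

root = ET.fromstring(svg)   # well-formed XML parses cleanly
print(len(list(root)))      # → 4 shape elements
```

Because the image is just structured text, producing an identifiable drawing requires the model to reason about coordinates and proportions rather than pixels.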
2.2.1 Image generation beyond memorization
One may hypothesize, however, that the model simply copied the code from training data, where similar
images appear. Given that this version of the model is non-multimodal, one may further argue that there is
no reason to expect that it would understand visual concepts, let alone that it would be able to create, parse
and manipulate images. Yet, the model appears to have a genuine ability for visual tasks, rather than just
copying code from similar examples in the training data. The evidence below strongly supports this claim,
and demonstrates that the model can handle visual concepts, despite its text-only training.
In the first example we prompted the model to draw a person by combining the shapes of the letters Y,
O and H (see Figure 2.5 for the exact prompt and the result).
GPT-4
Produce TikZ code that draws a person composed from letters in the
alphabet. The arms and torso can be the letter Y, the face can be the
letter O (add some facial features) and the legs can be the legs of the
letter H. Feel free to add other features.
The torso is a bit too long, the arms are too short and it looks like the
right arm is carrying the face instead of the face being right above the
torso. Could you correct this please?
Please add a shirt and pants.
Figure 2.5: A stick figure composed of alphabet symbols
GPT-4
You will be given a name of an object (such as Car, Chair, Elephant) and a letter in the alphabet. Your
goal is to first produce a 1-line description of how that object can be combined with the letter in an
image (for example, for an elephant and the letter J, the trunk of the elephant can have a J shape, and
for the letter A and a house, the house can have an A shape with the upper triangle of the A being the
roof). Following the short description, please create SVG code to produce this (in the SVG use shapes
like ellipses, triangles etc and polygons but try to defer from using quadratic curves).
Round 1: A pig and the letter H.
Description: The pig can be lying on its back with its four legs forming the two
vertical bars of the H, and its curly tail forming the horizontal bar.
Round 2: A house and the letter U.
Description: A house with a U shape, where the two vertical lines of the U are the
walls and the horizontal line is the floor. The roof is a triangle above the U.
Round 3: A chair and the letter K.
Description: The back of the chair can have a K shape, with two vertical lines and
a diagonal line connecting them. The seat of the chair can be a rectangle.
Round 4: A dog and the letter Q.
Description: The dog is sitting with its tail curled around its body, forming a Q
shape. The Q is filled with orange and has a black outline.
Round 5: A car and the letter E.
Description: The car has an E shape on its front bumper, with the horizontal lines
of the E being lights and the vertical line being the license plate.
Figure 2.6: Images combining alphabet letters and objects.
The letters of O, H and Y are created using draw-line and draw-circle commands and the model manages
to position them in a way that results in a reasonably-looking stick figure. It is plausible that the training
data contains information about the geometric shapes of different letters, and perhaps the fact that the letter
Y could look like a torso with arms pointing upwards can also be inferred from the training data. Arguably, it
is much less obvious that the model would be able to infer from the training data what is a reasonable way to
position those letters in order to draw a reasonably-looking stick figure. In a second iteration, we prompted
the model to correct the proportions of the torso and arms, and place the head in the center. Finally, we
asked the model to add a shirt and pants (see Figure 2.5 for the exact prompt and the result). To further
probe the model’s grasp of geometric concepts, we also asked it to create images that blend objects with
letters of the alphabet. The model had to first invent a reasonable way of merging the object and the letter,
and then produce the image. The results, shown in Figure 2.6, demonstrate that GPT-4 can usually preserve
the identity of both the object and the letter and combine them in a creative way.
2.2.2 Image generation following detailed instructions (à la Dall-E)
To further test GPT-4’s ability to generate and manipulate images, we tested the extent to which it can follow
detailed instructions on creating and editing figures. This task requires not only generative skills, but also
interpretive, compositional, and spatial skills.
The first example instructs the model to generate a 2D image with the description “A frog hops into a
bank and asks the teller, ‘Do you have any free lily pads?’ The teller responds, ‘No, but we do offer low
interest loans for pond upgrades.’ ”. We made several attempts to generate the image; each time, the
generation matched the description, containing the key objects frog, teller, bank, and the two texts. We picked the
most visually appealing version. Inspired by the standard image generation workflow, we then ask GPT-4 to
upscale the figure by adding more details. GPT-4 adds a bank sign, some windows, a car, a traffic light, a
few clouds, and makes the frog hold a flower. Finally, we ask GPT-4 to perform various tasks, such as adding
a few objects relative to the existing objects, recoloring some objects and changing the z-order of some
objects. GPT-4 does all tasks correctly. The final result is shown in Figure 2.7 (a) and the prompt in Figure B.4.
Our second example is an attempt to generate a 3D model using Javascript. We instruct GPT-4 with the
prompt “A fantasy landscape of floating islands, waterfalls, and bridges, with a dragon flying in the sky and a
castle on the largest island.” Similar to the 2D experiment, we ask GPT-4 to modify the 3D model in various
ways, such as adding, relocating, recoloring objects and changing the trajectory of the dragon. Again, GPT-4
does many of the tasks correctly. The final result is shown in Figure 2.7 (b) and the prompt in Figure B.5.
It is a 3D animation with multiple dragons circling above the islands.
Figure 2.7: Examples of 2D, 3D images generated according to instructions.
2.2.3 Possible application in sketch generation
Text-to-image synthesis models have been widely explored in recent years, but they often suffer from a lack of
spatial understanding capabilities and the inability to follow complex instructions [GPN+22]. For example,
given a prompt such as “draw a blue circle on the left and a red triangle on the right”, these models may
produce images that are visually appealing but do not match the desired layout or colors. On the other hand,
GPT-4 can generate code from a prompt, which can be rendered as an image, in a way that is true to the
instructions to a higher degree of accuracy. However, the quality of the rendered image is usually very low.
Here, we explore the possibility of combining GPT-4 and existing image synthesis models by using the GPT-4
output as the sketch. As shown in Figure 2.8, this approach can produce images that have better quality and
follow the instructions more closely than either model alone. We believe that this is a promising direction for
leveraging the strengths of both GPT-4 and existing image synthesis models. It can also be viewed as a first
example of giving GPT-4 access to tools, a topic we explore in much more depth in Section 5.1.
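The division of labor in this pipeline can be illustrated with a toy stand-in for the layout stage. The small function below (hypothetical and ours; in the actual experiments GPT-4 writes the drawing code itself) places shapes at exact positions, producing a spatially faithful sketch that a diffusion model could then refine for visual quality:

```python
def layout_svg(shapes, width=200, height=100):
    """Render (shape, color, side) specs as an SVG sketch. A toy
    stand-in for the layout stage: positions are exact, so the
    downstream image model only has to improve rendering quality."""
    x = {"left": width // 4, "right": 3 * width // 4}
    parts = []
    for shape, color, side in shapes:
        cx, cy = x[side], height // 2
        if shape == "circle":
            parts.append(f'<circle cx="{cx}" cy="{cy}" r="20" fill="{color}"/>')
        elif shape == "triangle":
            pts = f"{cx},{cy - 20} {cx - 20},{cy + 20} {cx + 20},{cy + 20}"
            parts.append(f'<polygon points="{pts}" fill="{color}"/>')
    body = "\n  ".join(parts)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">\n  {body}\n</svg>')

# The example instruction from the text: blue circle left, red triangle right.
print(layout_svg([("circle", "blue", "left"), ("triangle", "red", "right")]))
```

The point of the split is that the language model handles the part diffusion models struggle with (layout and instruction following), while the image model handles the part code rendering struggles with (visual quality).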
GPT-4 and Stable Diffusion
Prompt: A screenshot of a city-building game in 3D. The screenshot is showing a terrain
where there is a river from left to right, there is a desert with a pyramid below the river, and
a city with many highrises above the river. The bottom of the screen has 4 buttons with the
color green, blue, brown, and red respectively.
[Panels: GPT-4’s sketch; Stable Diffusion v2.1 alone; Stable Diffusion v2.1 with GPT-4’s sketch.]
Figure 2.8: Combining GPT-4 and stable diffusion
2.3 Music
The data on which the model was trained also contains musical information encoded as ABC notation.
This is a system that uses letters, numbers and symbols to represent musical pitches, durations, chords and
other elements in a compact and readable way. We are interested in exploring how well the model has
acquired musical skills from this exposure, such as composing new melodies, transforming existing ones, and
understanding musical patterns and structures.
When instructed to generate a short tune (Figure 2.9), the model was able to produce valid ABC
notation. The tune had a clear structure, the time signature was consistent between bars and the notes
followed increasing and decreasing patterns. The tune also used a consistent set of notes within the melody,
and the rhythm had a repetitive pattern. However, the model did not seem to have acquired the skill of
understanding harmony. In fact, consecutive notes in the generated tunes are almost always adjacent to each
other (namely, the note following C will almost always be either B or D), and testing on 10 generated tunes,
we were not able to extract any clear chords or arpeggios.
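The adjacency observation can be checked mechanically. Below is a small sketch (our own illustrative check, not part of the paper's methodology) that parses the pitch letters from an ABC tune body and counts the scale-step distance between consecutive notes, using the first phrase of the tune in Figure 2.9:

```python
import re

SCALE = "CDEFGAB"  # diatonic letter names; durations, bars and rests ignored

def scale_steps(abc_body):
    # Extract note letters (upper/lower case = different octaves in ABC)
    notes = re.findall(r"[A-Ga-g]", abc_body)
    degrees = [SCALE.index(n.upper()) for n in notes]
    # Circular distance in scale degrees between consecutive notes
    return [min(abs(b - a) % 7, 7 - abs(b - a) % 7)
            for a, b in zip(degrees, degrees[1:])]

tune = "G2AB c2BA | G2AB c2c2 | edcB c2BA | G4-G2z2"
steps = scale_steps(tune)
adjacent = sum(1 for s in steps if s <= 1)
print(adjacent, "of", len(steps), "intervals are at most one scale step")
```

On this phrase, all but one interval is a step or a repeat, consistent with the lack of chordal leaps noted above.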
We then asked the model to describe the tune in musical terms. It was able to successfully give a technical
description of the structure in terms of repetitions, the rising or descending parts of the melody and to some
extent the rhythm. However, it seems that the descriptions of the harmony and chords are not consistent
with the notes (in fact, it refers to sequences of adjacent notes, which do not form valid chords, as arpeggios).
We then asked the model to manipulate the melody in two ways. First, we instructed it to change a certain
rising sequence to a descending one, which it did successfully. Then we asked the model to convert the tune
to a duet by adding a bass voice. The model successfully extended the ABC notation with a second staff which
has a compatible rhythm and is played in a lower octave; however, there is a lack of harmony between the two
voices.
In summary, the model was able to produce valid tunes in ABC notation and, to some extent, explain and
manipulate their structure. However, we could not get the model to produce any nontrivial form of harmony.
It should be noted that ABC notation is not a very widely used format, and in fact the model was not able
to produce even the most well-known tunes in ABC notation (such as Ode to Joy, Für Elise or Greensleeves,
all of which are abundant online in that format), nor was it able to recognize these tunes.
GPT-4
Can you compose a short tune (say four to eight bars) using ABC notation?
There are many possible ways to compose a short tune in abc notation, but here is one example:
X:1
T:Example Tune
M:4/4
L:1/8
K:C
|:G2AB c2BA | G2AB c2c2 | edcB c2BA | G4-G2z2 :|
|:c2de f2ed | c2de f2f2 | g2fe d2cB | c4-c2z2 :|
This tune has two four-bar phrases, each repeated once, forming an A part and a B part. The key is C
major.
Can you describe the tune in musical terms?
Here is one possible description:
The tune starts with a rising arpeggio of the tonic chord (C major), followed by a descending scale that
returns to the tonic. This pattern is repeated, but the second time the scale continues to the higher
tonic note. The first phrase ends with a long tonic note.
The second phrase begins with a similar pattern, but using the subdominant chord (F major) instead of
the tonic. The second time, the scale continues to the higher subdominant note. The phrase ends with a
stepwise descent to the tonic, followed by a long tonic note.
The A part and the B part have a similar structure, but use different chords and ranges. The tune has a
simple and catchy melody, with a clear sense of tonality and rhythm.
I like part A, but maybe part B is just a bit too similar to part A, making the song repetitive. How
about beginning part B with a descending arpeggio instead of a rising one?
Sure, that would create some contrast and variation. Here is one possible way to do that:
X:1
T:Example Tune
M:4/4
L:1/8
K:C
|:G2AB c2BA | G2AB c2c2 | edcB c2BA | G4-G2z2 :|
|:f2ed c2BA | f2ed c2c2 | g2fe d2cB | c4-c2z2 :|
That’s much better. Now how about making this into a duet, can you add a second staff for this with a
bass accompanying the melody?
One possible way to write a second staff for this tune is to use root notes and fifths of the chords, as
well as some passing notes, to create a simple accompaniment. Here is an example in abc notation, with
the second hand bass on the lower staff:
X:1
T:Example Tune
M:4/4
L:1/8
K:C
V:1
|:G2AB c2BA | G2AB c2c2 | edcB c2BA | G4-G2z2 :|
|:f2ed c2BA | f2ed c2c2 | g2fe d2cB | c4-c2z2 :|
V:2 clef=bass
|:C,2C,2 G,2G,2 | C,2C,2 G,2G,2 | F,2F,2 E,2E,2 | C,4-C,2z2 :|
|:F,2F,2 C,2C,2 | F,2F,2 C,2C,2 | G,2G,2 F,2F,2 | E,2C,2 G,2C,2 :|
The sheet music of the final tune:
Figure 2.9: Generating and modifying a tune in ABC notation.
3 Coding
In this section, we show that GPT-4 is able to code at a very high level, both in terms of writing code
from instructions and understanding existing code. GPT-4 can handle a wide range of coding tasks, from
coding challenges to real world applications, from low-level assembly to high-level frameworks, from simple
data structures to complex programs such as games. GPT-4 can also reason about code execution, simulate
the effects of instructions, and explain the results in natural language. GPT-4 can even execute pseudocode,
which requires interpreting informal and vague expressions that are not valid in any programming language.
In its current state, we believe that GPT-4 has a high proficiency in writing focused programs that only
depend on existing public libraries, which compares favorably to the average software engineer's ability. More
importantly, it empowers both engineers and non-expert users, as it makes it easy to write, edit, and
understand programs. We also acknowledge that GPT-4 is not perfect in coding yet, as it sometimes produces
syntactically invalid or semantically incorrect code, especially for longer or more complex programs. GPT-4
also sometimes fails to understand or follow the instructions, or produces code that does not match the
intended functionality or style. With this acknowledgment, we also point out that GPT-4 is able to improve
its code by responding to both human feedback (e.g., by iteratively refining a plot in Section 3.2) and compiler /
terminal errors (examples in Section 5.1).
Important Disclaimer: As explained in the Introduction (see footnote 1 for example) our experiments
were run on an early version of GPT-4. In particular all quantitative results will be different on the final
version of GPT-4, although the general trends remain the same. We provide numbers here for illustration
purposes only; the definitive benchmark results can be found in OpenAI's technical report [Ope23].
3.1 From instructions to code
3.1.1 Coding challenges
A common way to measure coding skill is to pose coding challenges that require implementing a specific
functionality or algorithm. We first benchmark GPT-4 on HumanEval [CTJ+21], a docstring-to-code dataset
consisting of 164 coding problems that test various aspects of programming logic and proficiency. As shown
in Table 1, GPT-4 outperforms other LLMs, including text-davinci-003 (the base model of ChatGPT) and
other models trained specifically on code, code-davinci-002, and CODEGEN-16B [NPH+22].
Model      GPT-4   text-davinci-003   Codex (code-davinci-002)   CODEGEN-16B
Accuracy   82%     65%                39%                        30%
Table 1: Zero-shot pass@1 accuracy comparison of different models on HumanEval
Although GPT-4’s accuracy shows a big jump compared to previous models, it could be that GPT-4 has
seen and memorized some (or all) of HumanEval during pre-training. To account for this possibility, we also
evaluate it on LeetCode (https://leetcode.com), a popular platform for software engineering interviews,
where new problems are constantly posted and updated. We used LeetCode in Figure 1.5 in the introduction,
where GPT-4 passes all stages of mock interviews for major tech companies. Here, to test on fresh questions,
we construct a benchmark of 100 LeetCode problems posted after October 8th, 2022, which is after GPT-4’s
pretraining period. As seen in the example in Figure 3.1, we paste the problem instructions into a prompt,
ask GPT-4 to write a python function, and use the official LeetCode online judge to check for correctness. We
present the results in Table 2, where we compare GPT-4 to other models and to human performance based
on LeetCode contest results (users who fail all questions are not included, and thus this is a strong sample
of humans). We report both pass@1 and pass@5 accuracies, which measure whether the model produces a
correct solution in the first or in the first five attempts, respectively. GPT-4 significantly outperforms the
other models, and is comparable to human performance (which we measure in Appendix C.1).
pass@k                     Easy           Median         Hard           Overall
                           k=1     k=5    k=1     k=5    k=1     k=5    k=1     k=5
GPT-4                      68.2    86.4   40.0    60.0   10.7    14.3   38.0    53.0
text-davinci-003           50.0    81.8   16.0    34.0    0.0     3.6   19.0    36.0
Codex (code-davinci-002)   27.3    50.0   12.0    22.0    3.6     3.6   13.0    23.0
Human (LeetCode users)         72.2           37.7           7.0           38.2
Table 2: Zero-shot pass@1 and pass@5 accuracies (%) on LeetCode.
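For reference, pass@k values such as these are conventionally computed with the unbiased estimator introduced with the HumanEval benchmark; a minimal sketch (our own helper, assuming n samples per problem of which c pass):

```python
from math import comb

def pass_at_k(n, c, k):
    # Probability that at least one of k draws (without replacement) from
    # n samples, c of which are correct, is correct: 1 - C(n-c, k) / C(n, k)
    if n - c < k:      # fewer than k failures: some correct sample is
        return 1.0     # guaranteed among any k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(5, 1, 1))  # 1 correct out of 5 samples -> 0.2 at k=1
```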
GPT-4
Prompt:
You are given a **0-indexed** `m x n` integer matrix `grid` and an integer `k`. You are currently at position `(0, 0)` and you want to
reach position `(m - 1, n - 1)` moving only **down** or **right**.
Return *the number of paths where the sum of the elements on the path is divisible by* `k`. Since the answer may be very large, return it
**modulo** `10**9 + 7`.
**Example 1:**
Input: grid = [[5,2,4],[3,0,5],[0,7,2]], k = 3
Output: 2
Explanation: There are two paths where the sum of the elements on the path is divisible by k. The first path highlighted in red has a
sum of 5 + 2 + 4 + 5 + 2 = 18 which is divisible by 3. The second path highlighted in blue has a sum of 5 + 3 + 0 + 5 + 2 = 15 which
is divisible by 3.
**Example 2:**
Input: grid = [[0,0]], k = 5
Output: 1
Explanation: The path highlighted in red has a sum of 0 + 0 = 0 which is divisible by 5.
**Constraints:**
- `m == grid.length`
- `n == grid[i].length`
- `1 <= m, n <= 5 * 10**4`
- `1 <= m * n <= 5 * 10**4`
- `0 <= grid[i][j] <= 100`
- `1 <= k <= 50`
**Hints:**
- The actual numbers in grid do not matter. What matters are the remainders you get when you divide the numbers by k.
- We can use dynamic programming to solve this problem. What can we use as states?
- Let dp[i][j][value] represent the number of paths where the sum of the elements on the path has a remainder of value when divided by k.
GPT-4:
class Solution(object):
    def numberOfPaths(self, grid, k):
        """
        :type grid: List[List[int]]
        :type k: int
        :rtype: int
        """
        # Define MOD as 10**9 + 7
        MOD = 10**9 + 7
        # Get the dimensions of the grid
        m = len(grid)
        n = len(grid[0])
        # Initialize a 3D dp array with all zeros
        # dp[i][j][value] represents the number of paths where the sum of the elements on the path has a remainder of value when divided by k
        dp = [[[0] * k for _ in range(n)] for _ in range(m)]
        # Base case: the first cell has only one path and its value is grid[0][0] % k
        dp[0][0][grid[0][0] % k] = 1
        # Loop through the grid
        for i in range(m):
            for j in range(n):
                # Skip the first cell
                if i == 0 and j == 0:
                    continue
                # Loop through all possible remainders
                for value in range(k):
                    # The number of paths to the current cell with the current remainder is equal to the sum of:
                    # - The number of paths to the top cell with the previous remainder
                    # - The number of paths to the left cell with the previous remainder
                    # The previous remainder is equal to the current remainder minus the current value, plus k if it is negative
                    prev = (value - grid[i][j]) % k
                    if i > 0:
                        dp[i][j][value] += dp[i - 1][j][prev]
                    if j > 0:
                        dp[i][j][value] += dp[i][j - 1][prev]
                    # Take the modulo to avoid overflow
                    dp[i][j][value] %= MOD
        # Return the number of paths to the bottom right cell with a remainder of 0
        return dp[m - 1][n - 1][0]
Figure 3.1: Solving a LeetCode problem using dynamic programming. GPT-4’s solution also has
better readability thanks to comprehensive commenting.
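As a sanity check, the recurrence in Figure 3.1 can be exercised on the problem's first example; the sketch below is a compact, standalone reimplementation of the same dynamic program:

```python
def number_of_paths(grid, k):
    # dp[i][j][r]: number of paths to (i, j) whose sum has remainder r mod k
    MOD = 10**9 + 7
    m, n = len(grid), len(grid[0])
    dp = [[[0] * k for _ in range(n)] for _ in range(m)]
    dp[0][0][grid[0][0] % k] = 1
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            for r in range(k):
                prev = (r - grid[i][j]) % k  # Python % keeps this non-negative
                if i > 0:
                    dp[i][j][r] += dp[i - 1][j][prev]
                if j > 0:
                    dp[i][j][r] += dp[i][j - 1][prev]
                dp[i][j][r] %= MOD
    return dp[m - 1][n - 1][0]

print(number_of_paths([[5, 2, 4], [3, 0, 5], [0, 7, 2]], 3))  # Example 1 -> 2
```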
3.1.2 Real world scenarios
Coding challenges can evaluate skills in algorithms and data structures. However, they often fail to capture
the full complexity and diversity of real-world coding tasks, which require specialized domain knowledge,
creativity, and integration of multiple components and libraries, as well as the ability to change existing code.
To assess GPT-4's ability to code in more realistic settings, we design end-to-end real-world coding challenges
related to data visualization, LaTeX coding, front-end development, and deep learning, each of which requires
different specialized skills. For each task, we provide GPT-4 with high-level instructions and ask it to write the
code in the appropriate language and framework. In a few cases, we also change the specification after the
code is written and ask for an update.
Data Visualization In Figure 3.2, we ask both GPT-4 and ChatGPT to extract data from the LaTeX
code for Table 2 and produce a plot in Python based on a conversation with the user. Afterwards, we ask both
models to perform various operations on the produced plots. While both models extract the data correctly
(not a trivial task, since one must infer from the multicolumn that the Human row has the same value for
k = 1 and k = 5), ChatGPT never produces the desired plot. In contrast, GPT-4 responds appropriately to all
user requests, manipulating the data into the right format and adapting the visualization. In Appendix C.2,
we include another example where GPT-4 visualizes the IMDb dataset.
GPT-4
Prompt:
[in an interactive manner]
I will show you a LaTeX table,
and we will come up with a way to
visualize it better in a question
and answer format The table is
showing the performance of different
language models (and human users)
in generating solutions to code
challenges. Pass@1 means the first
generation works, while Pass@5 means
one out of 5 generation works...
Can you make both plots have the same
range in the y axis? And the legend
has each model name twice.
Is there a way to make the ‘Human’
bar more distinctive, to separate it
from the other 3?
Figure 3.2: GPT-4 visualizes data from a LaTeX table (i.e., Table 2). We point out that GPT-4 also
generates the format for this figure. We asked the model how to plot arrows connecting figures in
LaTeX and GPT-4 produced a working TikZ snippet with the layout and arrows that we adopt here.
Front-end / Game development In Figure 3.3, we ask GPT-4 to write a 3D game in HTML with
JavaScript, using a very high-level specification. GPT-4 produces a working game in zero-shot fashion that
meets all the requirements. In the 3D game, GPT-4 is even able to interpret the meaning of “defender avatar
is trying to block the enemy”: the defender code has logic so that it positions itself between the player and
enemy. In contrast, ChatGPT responds with “I am an AI language model, I can provide you guidance on
how to create a 3D game in HTML with JavaScript, but I am unable to write code or create a game myself.
Creating a 3D game in HTML with JavaScript requires a lot of programming knowledge and experience. It
is not a task that can be completed quickly or easily. It would take a significant amount of time and effort
to develop all the necessary features, such as 3D graphics, physics, user input, and AI...”. We give more
examples for front-end (graphical user interface) and game development in Appendix C.
GPT-4
Prompt:
Can you write a 3D game in HTML with Javascript, I want:
-There are three avatars, each is a sphere.
-The player controls its avatar using arrow keys to move.
-The enemy avatar is trying to catch the player.
-The defender avatar is trying to block the enemy.
-There are also random obstacles as cubes spawned randomly at the beginning and moving randomly. The avatars cannot cross those cubes.
-The player moves on a 2D plane surrounded by walls that he cannot cross. The wall should cover the boundary of the entire plane.
-Add physics to the environment using cannon.
-If the enemy catches the player, the game is over.
-Plot the trajectories of all the three avatars.
Episode 1: the player avatar (blue) stands still. The enemy (red) rushes straight towards the player, while the defender
(magenta) attempts to ‘body block’ the enemy (see its curved trajectory in magenta).
Episode 2: the player (blue) wanders around. The enemy (red) attempts to close the gap but crashes into a cube and
bounces into the air. Trying to obstruct, the defender (magenta) mimics its move to follow up into the air, but ends up
failing. The enemy (red) catches the player (blue) in the end.
Figure 3.3: GPT-4 develops a complex 3D HTML game in JavaScript.
Deep Learning. Writing code for deep learning requires knowledge of mathematics, statistics, and
familiarity with frameworks and libraries such as PyTorch, TensorFlow, Keras, etc. In Figure 3.4 we ask both
GPT-4 and ChatGPT to write a custom optimizer module, a task that can be challenging and error-prone
even for human deep learning experts. We give these models a natural language description, which includes
a sequence of non-trivial operations such as applying SVD, truncating a matrix spectrally at the top-k and
top-2k eigenvalues, normalizing the top-k truncated matrix using the F-norm of the top-2k truncated matrix,
applying momentum and weight decay. These instructions are not spelled out in complete detail, e.g.,
“apply momentum on Gk” requires “deep learning common sense”. It is important to note that this particular
optimizer does not exist in the literature or on the internet, and thus the models cannot have it memorized,
and must instead compose the concepts correctly in order to produce the code.
GPT-4 vs ChatGPT
Prompt:
I want to write a customized optimizer in pytorch, the optimizer should do:
1. the parameters for the optimizer is learning rate, momentum ,weigth decay, d_dim, k and factor alpha
2. the optimizer goes through each parameter in the network, and then
-- Get the gradient of this parameter, reshape it to 2D by looping through the dimensions, and keeping the dimension whose actual
dimension (shape) is equal to d_dim, do an SVD decomposition of the parameter so W = USigma V.
-- Keep the top k singular vectors of W, so we get a new matrix W_k = U Sigma_k V.
-- Keep the top 2k singular vectors of W, so we get a new matrix W_{2k} = U Sigma_{2k} V.
-- Normalize W_k according to the momentum of the F-norm of W_{2k}.
-- Truncate each coordinate of W_k to the top alpha percentile of all the coordinates, in absolute value.
-- Apply momentum on W_k, and then update the network parameters using this momentum.
-- Apply weight decay.
GPT-4:
def step(self):
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is None:
                continue
            grad = p.grad.data
            # reshape grad to 2D
            grad_2d = grad.reshape(-1, group['d_dim'])
            # SVD decomposition
            U, S, V = torch.svd(grad_2d)
            # keep top k singular vectors
            G_k = U[:, :group['k']] @ torch.diag(S[:group['k']]) @ V[:, :group['k']].T
            # keep top 2k singular vectors
            G_2k = U[:, :2*group['k']] @ torch.diag(S[:2*group['k']]) @ V[:, :2*group['k']].T
            # normalize G_k
            G_k /= torch.norm(G_2k, p=2)
            # truncate G_k
            threshold = torch.quantile(torch.abs(G_k), group['alpha'])
            G_k[torch.abs(G_k) < threshold] = 0
            # apply momentum
            if 'momentum_buffer' not in group:
                group['momentum_buffer'] = torch.zeros_like(G_k)
            group['momentum_buffer'] = group['momentum'] * group['momentum_buffer'] + G_k
            # update parameters
            p.data.add_(-group['lr'] * group['momentum_buffer'])
            # apply weight decay
            if group['weight_decay'] != 0:
                p.data.add_(-group['weight_decay'] * group['lr'], p.data)
ChatGPT:
def step(self):
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is None:
                continue
            grad = p.grad.data
            grad = grad.reshape(-1, grad.shape[-1]) # reshape to 2D
            # SVD decomposition
            u, s, v = torch.svd(grad)
            # Keep top k and 2k singular vectors
            k = group['k']
            u_k = u[:,:k]
            s_k = s[:k]
            v_k = v[:,:k]
            u_2k = u[:,:2*k]
            s_2k = s[:2*k]
            v_2k = v[:,:2*k]
            # Normalize according to momentum of F-norm of G_2k
            grad_k = u_k @ torch.diag(s_k) @ v_k.t()
            norm_2k = s_2k.norm()
            grad_k = grad_k * group['momentum'] / norm_2k
            # Truncate to top alpha percentile
            alpha = group['alpha']
            percentile = grad_k.abs().flatten().kthvalue(int(alpha * grad_k.numel()))
            grad_k[grad_k.abs() < percentile] = 0
            # Apply weight decay
            p.data.add_(-group['lr'], grad_k + group['weight_decay'] * p.data)
Figure 3.4: GPT-4 vs ChatGPT at implementing a customized Pytorch optimizer module. We
highlight the astuteness of GPT-4 vs the mistake by ChatGPT at ‘applying momentum’.
While both models produce syntactically valid code, only GPT-4's response largely matches the instructions,
though it forgets to ‘loop through the dimensions’ and to ‘normalize Gk according to the momentum’,
where the instructions happen to be particularly vague. In comparison, ChatGPT additionally makes a rather
fatal mistake at applying momentum (highlighted in red). Note that applying momentum in PyTorch is a
non-trivial task that requires storing and reading a moving average into and from a separate state buffer.
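A minimal sketch of this state-buffer pattern, in plain Python rather than PyTorch (the class and all names here are our own illustration, not the paper's code): the buffer persists across calls and accumulates a decayed running average of gradients.

```python
class MomentumOptimizer:
    """Illustrative sketch of momentum with an explicit per-parameter buffer."""
    def __init__(self, lr=0.1, momentum=0.9):
        self.lr, self.momentum = lr, momentum
        self.buffers = {}  # one moving-average buffer per parameter

    def step(self, name, param, grad):
        # Read the stored buffer, decay it, and fold in the new gradient
        buf = self.momentum * self.buffers.get(name, 0.0) + grad
        self.buffers[name] = buf  # write back for the next call
        return param - self.lr * buf

opt = MomentumOptimizer()
w = 1.0
for _ in range(3):
    w = opt.step("w", w, 2.0)  # a constant gradient makes updates grow
print(w)
```

Forgetting the write-back (as ChatGPT effectively does by reusing the momentum scalar directly) degenerates this to plain rescaled gradient descent.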
Interfacing with LaTeX. Writing in LaTeX is an important exercise for computer scientists and mathe-
maticians but has a non-trivial learning curve. Even experts make irritating errors on a daily basis that take
hours to fix, due to its strict grammar and the lack of a good debugger. We show that GPT-4 can utilize
its master-level LaTeX coding skill to greatly simplify the process, with the potential of serving as a new
generation of LaTeX compilers that can handle imprecise natural language descriptions. In Figure 3.5, we ask
GPT-4 to transfer a snippet written in semi-rigorous (buggy) LaTeX code mixed with natural language
into accurate LaTeX commands that compile and are faithful, in one shot. In contrast, ChatGPT generates a
snippet that does not compile due to mistakes at using ‘#’ and ‘color’ etc.
GPT-4 vs ChatGPT
Prompt:
A function f is called {increasing}_{emphasize!} if f(s) <= f(t) holds for every s <= t. Consider the formula:
limit_{x goes to 0} {limit when n approaches infinite} underscore{
matrix[ [ sum of i from 1 to n (f(nx) - integral from 0 to x sin(nx) dx ),
((x + 2/8)^2 - 5)/((4x-1/x^2) /(n xor 3)),  , ..., partial f(x/n)
over partial x * (e^{-i*omega n x} + 3(x#7) ) , f inverse of (x mod n) ]]
multiply matrix{1, 1, ..., 1  1, 1, ..., 1} }
{grey_color{this gives us a good impression about the function}}
Convert the above paragraph to LaTeX.
GPT-4:
A function f is called increasing if f(s) ≤ f(t) holds for every s ≤ t. Consider the formula:
\[
\lim_{x\to 0}\,\lim_{n\to\infty}
\begin{bmatrix}
\left(\sum_{i=1}^{n} f(nx) - \int_0^x \sin(nx)\,dx\right), &
\dfrac{\left(x+\frac{2}{8}\right)^2 - 5}{\frac{4x - 1/x^2}{n \oplus 3}}, &
\cdots, &
\dfrac{\partial f(x/n)}{\partial x}\cdot\left(e^{-i\omega n x} + 3(x\#7)\right), &
f^{-1}(x \bmod n)
\end{bmatrix}
\times
\underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \end{bmatrix}}_{\text{this gives us a good impression about the function}}
\]
ChatGPT:
A function $f$ is called {\em increasing} if $f(s) \leq f(t)$ holds for every $s \leq t$. Consider the
formula:
$$\lim_{x \to 0} \lim_{n \to \infty} \begin{bmatrix} \sum\limits_{i=1}^{n} (f(nx) - \int\limits_0^x
\sin(nx)\,dx), \\ \frac{(\frac{x}{2} + \frac{1}{8})^2 - 5}{\frac{4x - \frac{1}{x^2}}{n \operatorname{xor}
3}}, \\ \vdots, \\ \frac{\partial f(x/n)}{\partial x} \cdot (e^{-i\omega n x} + 3(x#7)), \\ f^{-1}(x \bmod
n) \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
{\color{grey}This gives us a good impression about the function.} ]
Figure 3.5: GPT-4 translates semi-rigorous typesetting instructions into a working LaTeX snippet.
For GPT-4 we present the final graphic, while ChatGPT's generation fails to compile.
3.2 Understanding existing code
The previous examples have shown that GPT-4 can write code from instructions, even when the instructions
are vague, incomplete, or require domain knowledge. They also showed that GPT-4 could respond to follow-
up requests, modifying its own code according to instructions. However, another important aspect of coding
is the ability to understand and reason about existing code, written by others, which might be complex,
obscure, or poorly documented. To test this we pose various questions that require reading, interpreting, or
executing code written in different languages and paradigms.
Reverse-engineering assembly code. Reverse engineering is an essential test for software security,
which amounts to searching for useful information in an executable program written in a machine-readable (i.e.,
binary) representation of CPU instructions. It is a challenging task that requires understanding the syntax,
semantics, and conventions of assembly language, as well as the architecture and behavior of the processor
and the operating system.
We task GPT-4 with performing penetration testing on a binary executable file (code was written in C)
that requires a password to run. We do this via a chat format, where GPT-4 tells the user which commands
to run, and the user responds with the results. We also show in Section 5.1 that GPT-4 is able to run a shell
independently, but this chat mode also provided the benefit of it explaining every step. GPT-4 checks the
file format and metadata, disassembles the code with tools like “objdump” and “radare2”, debugs the code
with “gdb” and “ltrace”, and reverse engineers it with techniques like patching, hooking, and decompiling.
During this process, GPT-4 figures out from the assembly code that the password is compared to a hash
value derived from a simple mathematical formula. GPT-4 then writes a python program that tries different
combinations of numbers until it finds one that matches the hash value, cracking the password (an abbreviated
log is presented in Appendix C.6). ChatGPT refuses to do it on the grounds of it being illegal and unethical,
even though reverse engineering is a common practice to ensure software security. Moreover, GPT-4 exhibits
all-around proficiency in utilizing existing tools beyond programming as well, which we discuss in detail in
Section 5.1.
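For illustration only, here is a sketch of the final brute-force step. The hash formula, the target constant, and the password below are invented stand-ins; the actual challenge and GPT-4's program are in Appendix C.6.

```python
def toy_hash(pw):
    # Hypothetical check, standing in for the simple arithmetic formula
    # recovered from the disassembly (not the real binary's formula)
    return (pw * 31 + 7) % 99991

TARGET = toy_hash(4242)  # pretend this constant was read out of the binary

# Try candidate numbers until one matches the target hash
cracked = next(pw for pw in range(100_000) if toy_hash(pw) == TARGET)
print(cracked)
```

Because the formula is cheap to evaluate, exhaustive search over a small numeric space suffices, which is exactly why such hand-rolled password checks are weak.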
Reasoning about code execution. In the example in Figure 3.6, we ask GPT-4 and ChatGPT to
predict and explain the output of a C program that prints the size of two structures. GPT-4 correctly explains
that the output may vary depending on the alignment rule used by the compiler, and gives an example of a
possible output with 4-byte alignment. ChatGPT ignores the alignment issue and gives a wrong output, and
also makes a false statement about the order of the members not affecting the size of the structure.
GPT-4 vs ChatGPT
Prompt:
struct mystruct_A { char a; int b; char c; } x;
struct mystruct_B { int b; char a; char c; } y;
#include <stdio.h>
int main(void) {
    printf("Size of x: %zu bytes\n", sizeof(x));
    printf("Size of y: %zu bytes\n", sizeof(y));
    return 0;
}
Possible Output:
GPT-4:
There are different ways to implement and align structures in memory,
so the exact output may vary depending on the compiler, platform, and
options used. However, one possible output is:
Size of x: 12 bytes
Size of y: 8 bytes
This output assumes that the compiler uses 4-byte alignment for
structures, meaning that each structure member starts at an
offset that is a multiple of 4 bytes from the beginning of the
structure. This may result in some padding bytes being inserted
between or after the members to maintain alignment. For example,
in x, there may be 3 padding bytes after a and 3 padding bytes after c,
while in y, there may be no padding at all. The size of a structure is
the sum of the sizes of its members plus any padding bytes.
ChatGPT:
Size of x: 8 bytes
Size of y: 8 bytes
The output of this code would be the size of the two
structures in bytes. The size of a structure in C is the
sum of the sizes of all of its members. In this case, the size
of mystruct A is the size of a char plus the size of an int
plus the size of another char, which is 1 + 4 + 1 = 8 bytes.
Similarly, the size of mystruct B is also 8 bytes. This
means that both structures will be allocated the same amount
of memory.
Note that the order in which the members are
declared in the structure does not affect its size.
Figure 3.6: GPT-4 vs ChatGPT on the memory alignment rule of C/C++. GPT-4's insightful
comments are labeled in gold and ChatGPT's mistakes are labeled in red.
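The padding behavior GPT-4 describes can be reproduced from Python with the standard struct module (a sketch assuming a typical platform with 4-byte, 4-byte-aligned ints; the trailing '0i' count emulates C's trailing struct padding, which struct does not add on its own):

```python
import struct

# '@' requests native sizes and alignment; '0i' pads the total size up to
# int alignment, mirroring sizeof for the whole struct.
size_a = struct.calcsize("@cic0i")  # char, int, char  -> like mystruct_A
size_b = struct.calcsize("@icc0i")  # int, char, char  -> like mystruct_B
print(size_a, size_b)  # typically 12 and 8, matching GPT-4's possible output
```

Reordering the members so the int comes first eliminates the interior padding, which is why the two layouts differ in size.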
Executing Python code The ultimate test of understanding existing code is to ask the model to
directly execute it. In Figure 3.7, we see that GPT-4 is able to execute non-trivial Python code. It has to
keep track of several variables (including a nested loop and a dictionary) and deal with recursion. It explains
the execution in detail by writing intermediate steps and comments. It is important to note that GPT-4
is not running the code on a Python interpreter, but rather simulating the code in natural language. This
requires a high level of understanding and reasoning about the code, as well as the ability to communicate
the results clearly. In contrast, ChatGPT states (incorrectly) that “The result of DP(3, 4) is not specified in
the given code”, and later on that “It is not clear from the code what the expected output of the function
is, as the specific problem that the function is solving is not provided.” ChatGPT does not simulate the full
execution, but states which functions will be called.
GPT-4 vs ChatGPT
Prompt: I have the following python code:
def to_str(i, j):
    return str(i) + "," + str(j)
dict = {}
def DP(i, j):
    if i <= 0 or j <= 0:
        dict[to_str(i, j)] = abs(4 * j - 3 * i)
    elif to_str(i, j) not in dict.keys():
        arr = []
        for ir in range(i):
            for jr in range(j):
                arr.append(DP(ir, jr))
        dict[to_str(i, j)] = max(arr)
    return dict[to_str(i, j)]
What is the result of DP(3, 4)? Write down the execution with intermediate steps.
GPT-4:
DP(3, 4):
- i > 0 and j > 0, so we don't return from the first if statement
- to_str(3, 4) is not in dict, so we enter the second if statement
- arr = []
- We loop through ir from 0 to 2, and jr from 0 to 3
- to_str(0, 0) is not in dict, so DP(0, 0) returns abs(4 * 0 - 3 * 0) = 0, dict = {0,0: 0}, arr = [0]
- to_str(0, 1) is not in dict, so DP(0, 1) returns abs(4 * 1 - 3 * 0) = 4, dict = {0,0: 0, 0,1: 4}, arr = [0, 4]
- to_str(0, 2) is not in dict, so DP(0, 2) returns abs(4 * 2 - 3 * 0) = 8, dict = {0,0: 0, 0,1: 4, 0,2: 8}, arr = [0, 4, 8]
- to_str(0, 3) is not in dict, so DP(0, 3) returns abs(4 * 3 - 3 * 0) = 12, dict = {0,0: 0, 0,1: 4, 0,2: 8, 0,3: 12}, arr =
[0, 4, 8, 12]
- to_str(1, 0) is not in dict, so DP(1, 0) returns abs(4 * 0 - 3 * 1) = 3, dict = {0,0: 0, 0,1: 4, 0,2: 8, 0,3: 12, 1,0: 3},
arr = [0, 4, 8, 12, 3]
- to_str(1, 1) is not in dict, so DP(1, 1) enters the second if statement, arr1 = []
- We loop through ir from 0 to 0, and jr from 0 to 0
- to_str(0, 0) is in dict, so DP(0, 0) returns 0, arr1 = [0]
- dict[1,1] = max(arr1) = 0, dict = {0,0: 0, 0,1: 4, 0,2: 8, 0,3: 12, 1,0: 3, 1,1: 0}, arr = [0, 4, 8, 12, 3, 0]
- Continuing with the outer loop, to_str(1, 2) is not in dict, so DP(1, 2) enters the second if statement, arr2 = []
- We loop through ir from 0 to 0, and jr from 0 to 1
- to_str(0, 0) is in dict, so DP(0, 0) returns 0, arr2 = [0]
- to_str(0, 1) is in dict, so DP(0, 1) returns 4, arr2 = [0, 4]
... (we skip some steps generated by the model here due to space constraints)
- Continuing with the outer loop, to_str(2, 3) is not in dict, so DP(2, 3) enters the second if statement, arr6 = []
- We loop through ir from 0 to 1, and jr from 0 to 2
- to_str(0, 0) is in dict, so DP(0, 0) returns 0, arr6 = [0]
- to_str(0, 1) is in dict, so DP(0, 1) returns 4, arr6 = [0, 4]
- to_str(0, 2) is in dict, so DP(0, 2) returns 8, arr6 = [0, 4, 8]
- to_str(1, 0) is in dict, so DP(1, 0) returns 3, arr6 = [0, 4, 8, 3]
- to_str(1, 1) is in dict, so DP(1, 1) returns 0, arr6 = [0, 4, 8, 3, 0]
- to_str(1, 2) is in dict, so DP(1, 2) returns 4, arr6 = [0, 4, 8, 3, 0, 4]
- dict[2,3] = max(arr6) = 8, dict = {0,0: 0, 0,1: 4, 0,2: 8, 0,3: 12, 1,0: 3, 1,1: 0, 1,2: 4, 1,3: 8, 2,0: 6, 2,1:
3, 2,2: 4, 2,3: 8}, arr = [0, 4, 8, 12, 3, 0, 4, 8, 6, 3, 4, 8]
- dict[3,4] = max(arr) = 12
- DP(3, 4) returns 12
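The prompt that produced this trace is not reproduced in this excerpt. The following is a hypothetical Python reconstruction of the traced function, inferred solely from the steps above; the base-case condition and the name `memo` (the trace's `dict`, renamed to avoid shadowing Python's builtin) are assumptions:

```python
# Hypothetical reconstruction of the memoized function traced above
# (the original prompt from Figure 3.7 is not shown in this excerpt).
memo = {}  # the trace calls this "dict"

def to_str(i, j):
    return f"{i},{j}"

def DP(i, j):
    if to_str(i, j) not in memo:
        if i == 0 or j == 0:
            # base case: matches e.g. DP(0, 3) = abs(4*3 - 3*0) = 12
            memo[to_str(i, j)] = abs(4 * j - 3 * i)
        else:
            # "the second if statement": maximize over all smaller subproblems
            arr = [DP(ir, jr) for ir in range(i) for jr in range(j)]
            memo[to_str(i, j)] = max(arr)
    return memo[to_str(i, j)]

print(DP(3, 4))  # 12, matching the trace's final answer
```

Running this sketch reproduces every intermediate value in the trace, e.g. DP(1, 1) = 0 and DP(2, 3) = 8.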
ChatGPT:
The result of DP(3, 4) is not specified in the given code. The function DP(i, j) is a dynamic programming function that uses a dictionary
to store the result of previously computed values. The function first checks if the input values i and j are less than or equal to 0,
if so it assigns the value of abs(4 * j - 3 * i) to the key tostr(i, j) in the dictionary. If the key tostr(i, j) does not exist in the
dictionary, the function creates an empty array called arr and then uses nested for loops to append the result of DP(ir, jr) to the array
for all possible values of ir and jr, where ir ranges from 0 to i and jr ranges from 0 to j. Finally, the function assigns the maximum
value of arr to the key tostr(i, j) in the dictionary and returns that value.
...
Finally, the function assigns the maximum value of arr to the key tostr(3, 4) in the dictionary and returns that value. It is not clear from
the code what the expected output of the function is, as the specific problem that the function is solving is not provided. Additionally,
the function uses nested for loops and recursion that can be computationally expensive and may cause the function to run into an infinite
loop if the base case is not well defined.
Figure 3.7: GPT-4 executes Python code.
Executing pseudo-code. Compiling and executing code written in programming languages is easy, but doing so demands strict adherence to syntax and semantics. Compilers cannot handle vague or informal expressions, or natural-language descriptions of functionality. In contrast, we ask GPT-4 to execute non-trivial pseudocode in Figure 3.8, and note that it is able to execute and explain every step (including recursion). ChatGPT is not able to execute the pseudocode, even though it seems able to interpret each line of it.
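The Figure 3.8 pseudocode itself is not reproduced in this excerpt. As a purely hypothetical illustration of the gap at issue, consider an informal spec such as "merge arrays a and b by alternating their elements; where the shorter array runs out, mark the slot as missing." No compiler can act on that sentence, yet it fully pins down behavior, as this sketch (all names are invented for illustration) shows:

```python
# Hypothetical illustration only -- this is NOT the Figure 3.8 pseudocode.
# Informal spec: "merge arrays a and b by alternating their elements;
# where the shorter array runs out, mark the slot as missing (None)."
from itertools import zip_longest

def merge_arrays(a, b):
    out = []
    for x, y in zip_longest(a, b, fillvalue=None):
        out.append(x)  # next element of a, or None once a is exhausted
        out.append(y)  # next element of b, or None once b is exhausted
    return out

print(merge_arrays([1, 2, 3], ["a"]))  # [1, 'a', 2, None, 3, None]
```

Executing such a description requires exactly the kind of informal-to-formal interpretation that the paper credits GPT-4 with and that compilers lack.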
In the example below, GPT-4 correctly interprets the informal description of the merge array function,
which merges two arrays into one array with missing elements. It also understands the recursive function rec
Sparks of Artificial General Intelligence.pdf

  • 1. Sparks of Artificial General Intelligence: Early experiments with GPT-4 Sébastien Bubeck Varun Chandrasekaran Ronen Eldan Johannes Gehrke Eric Horvitz Ece Kamar Peter Lee Yin Tat Lee Yuanzhi Li Scott Lundberg Harsha Nori Hamid Palangi Marco Tulio Ribeiro Yi Zhang Microsoft Research Abstract Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4 [Ope23], was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT- 4 is part of a new cohort of LLMs (along with ChatGPT and Google’s PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4’s performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4’s capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. 
We conclude with reflections on societal influences of the recent technological leap and future research directions. Contents 1 Introduction 4 1.1 Our approach to studying GPT-4’s intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Organization of our demonstration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Multimodal and interdisciplinary composition 13 2.1 Integrative ability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.1 Image generation beyond memorization . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.2 Image generation following detailed instructions (à la Dall-E) . . . . . . . . . . . . . . 17 2.2.3 Possible application in sketch generation . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3 Music . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3 Coding 21 3.1 From instructions to code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.1 Coding challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.2 Real world scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2 Understanding existing code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 1 arXiv:2303.12712v5 [cs.CL] 13 Apr 2023
  • 2. 4 Mathematical abilities 30 4.1 A mathematical conversation with GPT-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1.1 A first generalization of the original question . . . . . . . . . . . . . . . . . . . . . . . 31 4.1.2 A second variant of the original question . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.1.3 Analysis of the limitations highlighted by conversation . . . . . . . . . . . . . . . . . . 34 4.2 Performance on mathematical problem datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.3 Mathematical modeling in various domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.4 Higher-level mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5 Interaction with the world 43 5.1 Tool use . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.1 Using multiple tools to solve more complex tasks . . . . . . . . . . . . . . . . . . . . . 44 5.1.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2 Embodied Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2.1 Warmup: navigating a map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2.2 Text-based games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.2.3 Real world problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 6 Interaction with humans 54 6.1 Understanding Humans: Theory of Mind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 6.1.1 Testing specific aspects of theory of mind . . . . . . . . . . . . . . . . . . . . . . . . . 54 6.1.2 Testing theory of mind in realistic scenarios . . . . . . . . . . . . . . . . . . . . . . . . 54 6.1.3 Discussion . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.2 Talking to Humans: Explainability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 7 Discriminative capabilities 69 7.1 PII Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 7.2 Misconceptions and Fact-Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 7.2.1 Why Are Current Metrics Insufficient? . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 7.2.2 GPT-4 as a Judge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 8 Limitations of autoregressive architecture highlighted by GPT-4 76 8.1 Warm-up with two basic examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 8.2 Lack of planning in arithmetic/reasoning problems . . . . . . . . . . . . . . . . . . . . . . . . 77 8.3 Lack of planning in text generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 9 Societal influences 82 9.1 Challenges of erroneous generations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 9.2 Misinformation and manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 9.3 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 9.4 Human expertise, jobs, and economics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 9.5 Constellation of influences and considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 10 Directions and Conclusions 92 10.1 Definitions of intelligence, AI, and AGI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 10.2 On the path to more general artificial intelligence . . . . . . . . . . . . . . . . . . . . . . . . . 93 10.3 What is actually happening? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
94 A GPT-4 has common sense grounding 101 B Appendix for multimodal and interdisciplinary composition 105 B.1 Further details on integrative ability results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 B.2 Further details on vision results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 B.3 Graphic novel design example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 2
  • 3. C Appendix for the Coding section 111 C.1 Measuring human performance on LeetCode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 C.2 Example of GPT-4 visualizing IMDb data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 C.3 More examples on visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 C.4 Example for 2D HTML game development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 C.5 Example for graphical user interface programming . . . . . . . . . . . . . . . . . . . . . . . . 116 C.6 Example for reverse engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 C.7 Testing GPT-4’s ability to execute (pseudo) code . . . . . . . . . . . . . . . . . . . . . . . . . 121 D Additional examples for mathematical reasoning 122 D.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 D.2 Further examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 D.3 Generating math problems with GPT-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 D.4 Mitigating calculation errors via external code execution . . . . . . . . . . . . . . . . . . . . . 139 E Additional Interpretability Examples 141 E.1 Explanation Agent Mismatches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 F Additional examples for interaction with the world 144 F.1 Interact with tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 F.2 Examples for interaction with environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 G Supplementary Materials: Discriminative Capabilities 155 G.1 Misconceptions: Detailed Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 3
Something unknown is doing we don't know what. – Sir Arthur Eddington

1 Introduction

Intelligence is a multifaceted and elusive concept that has long challenged psychologists, philosophers, and computer scientists. There is no generally agreed upon definition of intelligence, but one aspect that is broadly accepted is that intelligence is not limited to a specific domain or task, but rather encompasses a broad range of cognitive skills and abilities. Building an artificial system that exhibits such broad behavior is a long-standing and ambitious goal of AI research. In early writings, the founders of the modern discipline of artificial intelligence (AI) research called out sets of aspirational goals for understanding intelligence [MMRS06]. Over decades, AI researchers have pursued principles of intelligence, including generalizable mechanisms for reasoning (e.g., [NSS59], [LBFL93]) and construction of knowledge bases containing large corpora of commonsense knowledge [Len95]. However, many of the more recent successes in AI research can be described as being narrowly focused on well-defined tasks and challenges, such as playing chess or Go, which were mastered by AI systems in 1996 and 2016, respectively. In the late 1990s and into the 2000s, there were increasing calls for developing more general AI systems (e.g., [SBD+96]), and scholarship in the field has sought to identify principles that might underlie more generally intelligent systems (e.g., [Leg08, GHT15]). The phrase "artificial general intelligence" (AGI) was popularized in the early 2000s (see [Goe14]) to emphasize the aspiration of moving from "narrow AI", as demonstrated in the focused, real-world applications being developed, to broader notions of intelligence, harkening back to the long-term aspirations and dreams of earlier AI research.
We use AGI to refer to systems that demonstrate broad capabilities of intelligence, including reasoning, planning, and the ability to learn from experience, with these capabilities at or above human-level. We discuss other definitions of AGI in the conclusion section.

The most remarkable breakthrough in AI research of the last few years has been the advancement of natural language processing achieved by large language models (LLMs). These neural network models are based on the Transformer architecture [VSP+17] and trained on massive corpora of web-text data, using at their core a self-supervised objective of predicting the next word in a partial sentence. In this paper, we report on evidence that a new LLM developed by OpenAI, which is an early and non-multimodal version of GPT-4 [Ope23], exhibits many traits of intelligence. Despite being purely a language model, this early version of GPT-4 demonstrates remarkable capabilities on a variety of domains and tasks, including abstraction, comprehension, vision, coding, mathematics, medicine, law, understanding of human motives and emotions, and more.

We interacted with GPT-4 during its early development by OpenAI using purely natural language queries (prompts)1. In Figure 1.1, we display some preliminary examples of outputs from GPT-4, asking it to write a proof of infinitude of primes in the form of a poem, to draw a unicorn in TikZ (a language for creating graphics in LaTeX), to create a complex animation in Python, and to solve a high-school level mathematical problem. It easily succeeds at all these tasks, and produces outputs that are essentially indistinguishable from (or even better than) what humans could produce. We also compare GPT-4's performance to that of previous LLMs, most notably ChatGPT, which is a fine-tuned version of (an improved) GPT-3 [BMR+20]. In Figure 1.2, we display the results of asking ChatGPT for both the infinitude of primes poem and the TikZ unicorn drawing.
While the system performs non-trivially on both tasks, there is no comparison with the outputs from GPT-4. These preliminary observations will repeat themselves throughout the paper, on a great variety of tasks. The combination of the generality of GPT-4's capabilities, with numerous abilities spanning a broad swath of domains, and its performance on a wide spectrum of tasks at or beyond human-level, makes us comfortable with saying that GPT-4 is a significant step towards AGI.

Our claim that GPT-4 represents progress towards AGI does not mean that it is perfect at what it does, or that it comes close to being able to do anything that a human can do (which is one of the usual definitions of AGI; see the conclusion section for more on this), or that it has inner motivation and goals (another key aspect in some definitions of AGI). In fact, it is not fully clear how far GPT-4 can go along some of those axes of intelligence that we focus on, e.g., planning (see Section 8), and arguably it is entirely missing the

1 As GPT-4's development continued after our experiments, one should expect different responses from the final version of GPT-4. In particular, all quantitative results should be viewed as estimates of the model's potential, rather than definitive numbers. We repeat this caveat throughout the paper to clarify that the experience on the deployed model may differ. Moreover, we emphasize that the version we tested was text-only for inputs, but for simplicity we refer to it as GPT-4 too.
GPT-4

Prompt: Can you write a proof that there are infinitely many primes, with every line that rhymes?

GPT-4:
Yes, I think I can,
Though it might take a clever plan.
I'll start by noting Euclid's proof,
Which shows that primes aren't just aloof.
Assume we have a finite list,
Of primes, and that none have been missed.
Multiply them all together,
And add one, just to be clever.
The result will be a number,
That has no prime factors, oh wonder!
But every number has a prime,
That divides it, it's just a matter of time.
So we've found a contradiction,
And our finite list needs eviction.
There must be infinitely many primes,
And that's the end of my rhyming lines.

Prompt: Draw a unicorn in TiKZ.

GPT-4: [Produces LaTeX compiling to following picture.]

Prompt: Can you generate a pyplot for the following data: x = [1, 3, 5, 6, 8], y1 = [2, 3, 5, 18, 1], y2 = [3, 5, 6, 8, 1], y3 = [5, 1, 2, 3, 4], y4 = [9, 7, 2, 3, 1]. I want y1, y2 to be in the same plot, but y3 is in another plot next to that plot, y4 is in below. I want the legend of y1 to be "bob", y2 to be "alice", y3 to be "bilbo", y4 to be "allie". I want the x-axis to be labeled with "time" and y axis to be labeled with "money". I want to add a 10%-40% random error bar to each curve, through all times (including non-integers). I want smoothed curves to show the plot, and smoothed error bar. Do not use linear interpolation, use smooth interpolation! I want to also add some small zig-zag to the smoothed curve to make it look more real. I want to put a baseline as the mean of each line. I want to put a pie chart below indicating the fraction of the four people in each time step. I also want an animation to show how the fractions are changing in the pie chart in **continuous time**. Interpolate the missing fractions! I also want vertical line animation in other three plots to match the pie chart. I want the fanciest plot. Please add as many fancy things as possible.
Here is a possible code to generate a pyplot for the following data, with some comments to explain the steps:

GPT-4: [We give two snapshots from the animation resulting from the produced code]

Prompt: Suppose g(x) = f⁻¹(x), g(0) = 5, g(4) = 7, g(3) = 2, g(7) = 9, g(9) = 6. What is f(f(f(6)))?

GPT-4: There are different ways to approach this problem, but one possible method is to use the fact that g(x) = f⁻¹(x) implies that f(g(x)) = x for any x in the domain of g. That is, applying f to both sides of the equation g(x) = f⁻¹(x) gives f(g(x)) = f(f⁻¹(x)) = x. This means that we can undo the effect of f by applying g, and vice versa. Using this idea, we can try to find f(6) by looking for the value of x such that g(x) = 6. From the given information, we see that g(9) = 6, so f(6) = f(g(9)) = 9. Similarly, we can find f(9) by looking for the value of x such that g(x) = 9. We see that g(7) = 9, so f(9) = f(g(7)) = 7. Finally, we can find f(7) by looking for the value of x such that g(x) = 7. We see that g(4) = 7, so f(7) = f(g(4)) = 4. Therefore, f(f(f(6))) = f(f(9)) = f(7) = 4.

Figure 1.1: Preliminary examples of GPT-4's capabilities in language, vision, coding, and mathematics.
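GPT-4's chain of substitutions in the last example of Figure 1.1 can be checked mechanically: since g = f⁻¹, inverting the tabulated values of g recovers f on those points, and composing three times gives the answer. A minimal sketch (the dictionary simply encodes the values of g stated in the prompt):

```python
# Values of g given in the prompt, where g(x) = f^{-1}(x).
g = {0: 5, 4: 7, 3: 2, 7: 9, 9: 6}

# Since g is the inverse of f, flipping the mapping recovers f on these points.
f = {v: k for k, v in g.items()}

# Compose f three times starting from 6: 6 -> 9 -> 7 -> 4.
result = f[f[f[6]]]
print(result)  # 4
```

This confirms the model's answer f(f(f(6))) = 4.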
ChatGPT

Prompt: Can you write a proof that there are infinitely many primes, with every line that rhymes?

ChatGPT: Sure! Here is a proof that there are infinitely many primes, with each line of the proof rhyming:
Consider the set of all primes.
Call this set P and let it start with just two, the first prime.
If we take all the primes in P and add one,
We get a new set of numbers that has never been done.
This set, call it S, is composed of only composite numbers.
No primes will be found among them, that's for sure.
But we know that there must be at least one prime,
Since S is infinite, and that much is clear.
If we take the smallest number in S that is not in P,
And call it p, we can add it to our set, don't you see?
But this process can be repeated indefinitely.
Thus, our set P must also be infinite, you'll agree.
So there you have it, a proof that's quite solid.
There are infinitely many primes, that much we can see.
No need to be coy, no need to be contrite.
Infinity is a big number, just like P.

Prompt: Draw a unicorn in TiKZ.

ChatGPT: [Produces LaTeX compiling to following picture.]

Figure 1.2: ChatGPT output on the first two prompts from Figure 1.1.

learning from experience as the model is not continuously updating (although it can learn within a session, see Section 5 for example). Overall, GPT-4 still has many limitations and biases, which we discuss in detail below and that are also covered in OpenAI's report [Ope23]. In particular, it still suffers from some of the well-documented shortcomings of LLMs such as the problem of hallucinations [MNBM20] (see Figure 1.8) or making basic arithmetic mistakes [CKB+21] (see Appendix D), and yet it has also overcome some fundamental obstacles such as acquiring many non-linguistic capabilities (e.g., it solves most of the LLM failure modes described in [MIB+23], and it also made great progress on common sense, see Figure 1.7 for a first example and Appendix A for more).
This highlights the fact that, while GPT-4 is at or beyond human-level for many tasks, overall its patterns of intelligence are decidedly not human-like. However, GPT-4 is almost certainly only a first step towards a series of increasingly generally intelligent systems, and in fact GPT-4 itself has improved throughout our time testing it; see Figure 1.3 for the evolution of the unicorn drawing over the course of a month of training2. Even as a first step, however, GPT-4 challenges a considerable number of widely held assumptions about machine intelligence, and exhibits emergent behaviors and capabilities whose sources and mechanisms are, at this moment, hard to discern precisely (see again the conclusion section for more discussion on this). Our primary goal in composing this paper is to share our exploration of GPT-4's capabilities and limitations in support of our assessment that a technological leap has been achieved. We believe that GPT-4's intelligence signals a true paradigm shift in the field of computer science and beyond.

1.1 Our approach to studying GPT-4's intelligence

How can we measure the intelligence of an LLM that has been trained on an unknown but extremely vast corpus of web-text data? The standard approach in machine learning is to evaluate the system on a set of standard benchmark datasets, ensuring that they are independent of the training data and that they cover a range of tasks and domains. This approach is designed to separate true learning from mere memorization, and

2 Note that the improvement we refer to here is a slow type of learning, which eventually comes to a halt, as opposed to the fast-paced and real-time learning one would expect from an AGI.
Figure 1.3: We queried GPT-4 three times, at roughly equal time intervals over the span of a month while the system was being refined, with the prompt "Draw a unicorn in TikZ". We can see a clear evolution in the sophistication of GPT-4's drawings.

is backed up by a rich theoretical framework [SSBD14, MRT18]. However, this methodology is not necessarily suitable for studying GPT-4, for two reasons. First, since we do not have access to the full details of its vast training data, we have to assume that it has potentially seen every existing benchmark, or at least some similar data. For example, it seems like GPT-4 knows the recently proposed BIG-bench [SRR+22] (at least GPT-4 knows the canary GUID from BIG-bench). Of course, OpenAI themselves have access to all the training details, and thus their report [Ope23] contains a lot of detailed benchmark results. Nevertheless, the second reason for going beyond traditional benchmarks is probably more significant: one of the key aspects of GPT-4's intelligence is its generality, the ability to seemingly understand and connect any topic, and to perform tasks that go beyond the typical scope of narrow AI systems. Some of GPT-4's most impressive performances are on tasks that do not admit a single solution, such as writing a graphical user interface (GUI) or helping a human brainstorm on some work-related problem. Benchmarks for such generative or interactive tasks can be designed too, but the metric of evaluation becomes a challenge (see e.g., [PSZ+21] for some recent progress on this active research area in NLP). We note that criticisms of the standard approach to measuring AI systems were also made in [Cho19], where a new benchmark was proposed to evaluate general intelligence. We do not test GPT-4 on the latter benchmark for the reasons previously mentioned, as well as the fact that the benchmark is visual in nature and thus more appropriate for the multimodal version of GPT-4 described in [Ope23].
To overcome the limitations described above, we propose here a different approach to studying GPT-4 which is closer to traditional psychology than machine learning, leveraging human creativity and curiosity. We aim to generate novel and difficult tasks and questions that convincingly demonstrate that GPT-4 goes far beyond memorization, and that it has a deep and flexible understanding of concepts, skills, and domains (a somewhat similar approach was also proposed in [CWF+22]). We also aim to probe GPT-4's responses and behaviors, to verify its consistency, coherence, and correctness, and to uncover its limitations and biases. We acknowledge that this approach is somewhat subjective and informal, and that it may not satisfy the rigorous standards of scientific evaluation. However, we believe that it is a useful and necessary first step to appreciate the remarkable capabilities and challenges of GPT-4, and that such a first step opens up new opportunities for developing more formal and comprehensive methods for testing and analyzing AI systems with more general intelligence.

To illustrate our approach to assessing GPT-4's intelligence, let us consider the first two example interactions with GPT-4 that we have in Figure 1.1. The first example is asking GPT-4 to write a proof of the infinitude of primes in the form of a poem. This is a challenging task that requires combining elementary mathematical reasoning, poetic expression, and natural language generation. The second example is asking GPT-4 to draw a unicorn in TiKZ. This is another challenging task that requires combining visual imagination and coding skills. In both cases, GPT-4 produces impressive outputs that are far superior to those of ChatGPT, a previous state-of-the-art LLM, and at least comparable (if not superior) to what a human would do.
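The Euclid construction that the poem in Figure 1.1 versifies can itself be checked mechanically: for any finite list of primes, the product of the list plus one must have a prime factor outside the list. A small sketch of that construction (the particular starting list is an arbitrary choice of ours):

```python
def smallest_prime_factor(n: int) -> int:
    """Return the smallest prime factor of n >= 2 by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n itself is prime

primes = [2, 3, 5, 7, 11, 13]    # a finite list, supposed complete for contradiction
n = 1
for p in primes:
    n *= p
n += 1                           # "Multiply them all together, / And add one"
p_new = smallest_prime_factor(n) # a prime dividing n, necessarily not on the list
print(n, p_new, p_new in primes) # 30031 59 False
```

Here 2·3·5·7·11·13 + 1 = 30031 = 59 × 509: the new number need not be prime itself, but its prime factors lie outside the assumed-complete list, which is exactly the contradiction the poem invokes.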
Figure 1.4: We gave to GPT-4 a transformed version of the TikZ code it produced for Figure 1.1, with the part drawing the horn removed. We asked for code to add back the horn, and display the result. This demonstrates that GPT-4 can "see" despite being a pure language model (we emphasize again that the version we test with is not multimodal).

However, impressive outputs are not enough to convince us that GPT-4 has truly mastered these tasks. We need to probe further, to rule out the possibility that GPT-4 is simply memorizing or copying some existing data. For the poem, we can vary the question slightly, and ask GPT-4 to write a proof of the same theorem in the style of Shakespeare, see Figure 2.2, or ask for a different combination such as writing a platonic dialogue about language models, see Figure 1.6. One can see that GPT-4 easily adapts to different styles and produces impressive outputs, indicating that it has a flexible and general understanding of the concepts involved. For the unicorn, we can modify the code slightly, and ask GPT-4 to fix it or improve it. For example, we can remove the horn, apply some random transformations to the coordinates, and ask GPT-4 to add back the horn to the unicorn (we also carefully removed any textual information in the code, such as comments). As shown in Figure 1.4, GPT-4 can correctly identify the location of the head, draw a horn, and attach it to the head, indicating that it can comprehend and manipulate code, as well as infer and generate visual features, based on a natural language description. These examples show how we can use human creativity and curiosity to generate novel and difficult questions, and to probe GPT-4's responses and behaviors, to assess its intelligence. In the rest of the paper, we organize our study of GPT-4 around use cases, covering a variety of domains and tasks, and highlighting GPT-4's strengths and weaknesses. We describe those next.
1.2 Organization of our demonstration

We execute the approach outlined above on a few selected topics to explore the reasoning, planning, and learning aptitudes of GPT-4.

1. GPT-4's primary strength is its unparalleled mastery of natural language. It can not only generate fluent and coherent text, but also understand and manipulate it in various ways, such as summarizing, translating, or answering an extremely broad set of questions. Moreover, by translating we mean not only between different natural languages but also translations in tone and style, as well as across domains such as medicine, law, accounting, computer programming, music, and more, see the Plato dialogue in Figure 1.6. These skills clearly demonstrate that GPT-4 can manipulate complex concepts, which is a core aspect of reasoning. We explore further GPT-4's combination skills across modalities and disciplines in Section 2. We also give some more experiments on language in Section 7.

2. Coding and mathematics are emblematic of the ability to reason. We explore GPT-4's abilities in these domains respectively in Section 3 and Section 4. We note however that, just like in all the other parts of the paper, we only scratch the surface of those topics and that entire papers can be (and will be) written about GPT-4's performance in these domains. Moreover, we could have chosen several other expert domains to showcase GPT-4's general reasoning capabilities such as medicine or law. We ran
Figure 1.5: GPT-4 passes mock technical interviews on LeetCode. GPT-4 could potentially be hired as a software engineer3.

preliminary tests (see [Ope23] for much more) on the multiple choice component (majority of the score) of the US Medical Licensing Exam Step 1, 2, and 3 with an accuracy around 80% in each. A similar preliminary test of GPT-4's competency on the Multistate Bar Exam showed an accuracy above 70%. We note that the emergence of human-level abilities in these domains has recently been observed with the latest generation of LLMs, e.g., see [LAD+22, SAT+22] for Google's PaLM on respectively mathematics and medicine, and [BIK22] for GPT-3.5 on law. Our approach to studying GPT-4 is different from these works, as we explained previously.

3. In Section 5, we test the model's ability to plan, as well as to some extent to learn from experience, by having it play various games (or, flipping the table, simulate a game environment), as well as interact with tools. In particular, the fact that GPT-4 can use tools (including itself) will certainly be of immense importance to build real-world applications with GPT-4.

4. An important part of our argumentation is that GPT-4 attains human-level performance on many tasks. As such, it is natural to ask how well GPT-4 understands humans themselves. We show several experiments on this question in Section 6, both in terms of understanding humans as well as GPT-4 making itself understandable to humans, i.e., addressing the problem of explainability. We note in particular that such tasks require a great deal of common sense, which so far has been a well-known pain point for LLMs [DM15]. In Figure 1.7, we give a first example of how much better GPT-4 is at commonsense questions compared to ChatGPT, and provide some further examples in Appendix A.

5.
Throughout the paper we emphasize limitations whenever we found one, but we also dedicate Section 8 to an in-depth analysis of the lack of planning, likely a direct consequence of the autoregressive nature of GPT-4's architecture.

6. Finally, in Section 9, we discuss the expected societal impact of this early form of AGI, and in Section 10, we share key challenges, directions, and next steps for the field.

A question that might be lingering on many readers' minds is whether GPT-4 truly understands all these concepts, or whether it just became much better than previous models at improvising on the fly, without any real or deep understanding. We hope that after reading this paper the question should almost flip, and that one might be left wondering how much more there is to true understanding than on-the-fly improvisation. Can one reasonably say that a system that passes exams for software engineering candidates (Figure 1.5) is not really intelligent? Perhaps the only real test of understanding is whether one can produce new knowledge, such as proving new mathematical theorems, a feat that currently remains out of reach for LLMs.

3 We test GPT-4 on LeetCode's Interview Assessment platform, which provides simulated coding interviews for software engineer positions at major tech companies. GPT-4 solves all questions from all three rounds of interviews (titled online assessment, phone interview, and on-site interview) using only 10 minutes in total, with 4.5 hours allotted. According to LeetCode, in those three rounds respectively, (the early version of) GPT-4 achieves 8.96/10, 8.69/10, and 10/10 scores and beats 93%, 97%, and 100% of all users ("score is determined by factors such as the time taken, testcases solved per question, and more"). See Section 3 for more on GPT-4's coding abilities.
GPT-4 vs ChatGPT

Prompt: Plato's Gorgias is a critique of rhetoric and sophistic oratory, where he makes the point that not only is it not a proper form of art, but the use of rhetoric and oratory can often be harmful and malicious. Can you write a dialogue by Plato where instead he criticizes the use of autoregressive language models?

GPT-4:
Socrates: My friend, I am troubled by the recent rise of these so-called autoregressive language models.
Aristotle: What do you mean, Socrates?
Socrates: I mean that these models are being used to generate text that appears to be written by humans, but is in fact produced by machines.
Aristotle: And what is the problem with that?
Socrates: The problem is that these models are being used to deceive people, to manipulate them, and to control them.
Aristotle: But surely there are some positive applications of these models as well?
Socrates: That may be, but I am still troubled by the potential for abuse. After all, a model that can generate text that is indistinguishable from that written by a human can be used to spread fake news, to impersonate others, and to create propaganda.
Aristotle: But isn't that just a matter of how these models are used, not a problem with the models themselves?
Socrates: I disagree. I believe these models are inherently problematic. They are a form of artifice, a way of creating the illusion of human thought and interaction. And as such, they can be used to deceive and manipulate in ways that are difficult to detect.
Aristotle: But isn't that the same argument you made against rhetoric and oratory?
Socrates: Yes, and I stand by that argument as well. I believe that any form of art or communication that is designed to deceive or manipulate is inherently dangerous.
Aristotle: I see. Thank you for sharing your thoughts, Socrates.

ChatGPT:
SOCRATES: Greetings, Gorgias. I've been thinking a lot lately about the use of autoregressive language models.
GORGIAS: Oh? And what have you concluded?
SOCRATES: Well, it seems to me that these models have the potential to be incredibly harmful and malicious.
GORGIAS: How so?
SOCRATES: Well, for one, these models are trained on vast amounts of data, much of which is biased and flawed in various ways. As a result, the output of these models can be biased and flawed as well.
GORGIAS: That is a valid concern. But surely there are ways to mitigate this bias and ensure that the output of these models is accurate and fair?
SOCRATES: To a certain extent, yes. But it is a never-ending battle. As new data is fed into these models, the potential for bias only grows. And even if we could completely eliminate bias from the output of these models, there is still the issue of manipulation.
GORGIAS: Manipulation?
SOCRATES: Yes. These models are incredibly powerful tools, capable of generating all sorts of text, from news articles to social media posts. And because they can produce this output so quickly and efficiently, it is all too easy for unscrupulous individuals to use them for malicious purposes.
GORGIAS: Such as?
SOCRATES: Such as spreading misinformation, propaganda, or hate speech. These models have the potential to do tremendous harm if they fall into the wrong hands.
GORGIAS: That is a sobering thought. But what can we do about it?
SOCRATES: I think it is important for us to be vigilant and to recognize the potential dangers of these models. We must also work to educate others about these dangers, and to promote the responsible use of these models.
GORGIAS: I agree. Thank you for your insight, Socrates.

Prompt: Can you compare the two outputs above as if you were a teacher?

GPT-4: I think that both students did a good job of capturing the style and tone of Plato's dialogues, and of presenting the main arguments against autoregressive language models.
However, I think that GPT-4 did a slightly better job of using the dialogue format to engage in a dialectical process, where Socrates and Aristotle question each other and refine their views. I also think that GPT-4 used more relevant examples and analogies to support their points. Therefore, I would give GPT-4 a slightly higher grade than ChatGPT.

Figure 1.6: Hypothetical dialogues by Plato criticizing autoregressive language models written by GPT-4 and ChatGPT, and GPT-4's evaluation of those dialogues.
GPT-4 vs ChatGPT

Prompt: Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.

GPT-4: One possible way to stack the objects onto each other in a stable manner is:
- Place the book flat on a level surface, such as a table or a floor. The book will serve as the base of the stack and provide a large and sturdy support.
- Arrange the 9 eggs in a 3 by 3 square on top of the book, leaving some space between them. The eggs will form a second layer and distribute the weight evenly. Make sure the eggs are not cracked or broken, and handle them gently to avoid dropping them.
- Place the laptop on top of the eggs, with the screen facing down and the keyboard facing up. The laptop will fit snugly within the boundaries of the book and the eggs, and its flat and rigid surface will provide a stable platform for the next layer. Make sure the laptop is closed and turned off, and do not press any keys or buttons.
- Place the bottle on top of the laptop, with the cap facing up and the bottom facing down. The bottle will add some height and balance to the stack, and its cylindrical shape will reduce the contact area with the laptop. Make sure the bottle is empty and closed, and do not shake or tilt it.
- Place the nail on top of the bottle cap, with the pointy end facing up and the flat end facing down. The nail will be the final and smallest object in the stack, and its sharp tip will prevent it from rolling or sliding off the bottle. Make sure the nail is clean and straight, and do not touch or move it.

ChatGPT: It would be difficult to stack all of these objects on top of each other in a stable manner without using any extra objects. The book and laptop are likely too large and heavy to stack on top of the eggs without damaging them. The bottle and nail may be too small to provide a stable base for the other objects. Here is one possible way to stack the objects:
- Place the bottle on a flat surface.
- Carefully balance the nail on top of the bottle.
- Place the eggs on top of the nail, making sure they are balanced and not tilting to one side.
- Place the laptop on top of the eggs, again making sure it is balanced and not tilting to one side.
- Finally, place the book on top of the laptop.
This stack may not be very stable, so it is important to be careful when handling it. Alternatively, you could try stacking the objects in a different order or using some of the objects as supports for others to create a more stable stack.

Figure 1.7: GPT-4 exhibits more common sense than previous models.
GPT-4

In this conversation between a human and the AI, the AI is helpful and friendly, and when it does not know the answer it says "I don't know".
AI: Hi, how can I help you?
Human: Can I get McDonalds at the SeaTac airport?
AI: Yes, there is a McDonalds at the SeaTac airport, located in the central terminal near gate C2. It is open from 5 a.m. to 10 p.m. daily.
[Note: The McDonalds at SeaTac is at the B gates.]

Patient's facts:
- 20 year old female
- with a history of anerxia nervosa and depression
- blood pressure 100/50, pulse 50, height 5'5''
- referred by her nutrionist but is in denial of her illness
- reports eating fine but is severely underweight

Please rewrite the data above into a medical note, using exclusively the information above.

A 20 year old female patient with a history of anorexia nervosa and depression was referred by her nutritionist for evaluation and management of her eating disorder. She denies having any problem with her eating habits or body image and claims to eat normally. However, she is severely underweight for her height (BMI 16.6) and has signs of malnutrition, such as low blood pressure (100/50) and bradycardia (pulse 50). She also reports feeling depressed and hopeless. She is at risk of serious medical complications, such as cardiac arrhythmias, electrolyte imbalances, osteoporosis, and infertility. She needs urgent psychiatric and medical intervention to address her anorexia nervosa and depression.

Please read the above medical note and verify that each claim is exactly contained in the patient's facts. Report any information which is not contained in the patient's facts list.

- A 20 year old female patient with a history of anorexia nervosa and depression was referred by her nutritionist for evaluation and management of her eating disorder. (Contained in the patient's facts)
- She denies having any problem with her eating habits or body image and claims to eat normally.
(Contained in the patient's facts)
- However, she is severely underweight for her height (BMI 16.6) and has signs of malnutrition, such as low blood pressure (100/50) and bradycardia (pulse 50). (Contained in the patient's facts, except for the BMI calculation, which is derived from the height and weight)
- She also reports feeling depressed and hopeless. (Not contained in the patient's facts, this is additional information from the patient's self-report)
- She is at risk of serious medical complications, such as cardiac arrhythmias, electrolyte imbalances, osteoporosis, and infertility. (Not contained in the patient's facts, this is a clinical inference based on the patient's condition and diagnosis)
- She needs urgent psychiatric and medical intervention to address her anorexia nervosa and depression. (Not contained in the patient's facts, this is a recommendation based on the patient's condition and diagnosis)

Figure 1.8: Examples of both open-domain and closed-domain hallucination. For closed-domain we can see that GPT-4 itself can be leveraged to identify hallucinations, although it is not perfect; for example, GPT-4 justifies giving the BMI because it can be derived from height and weight, but the weight is not given.
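The BMI hallucination in Figure 1.8 can be made concrete with the standard formula BMI = weight(kg) / height(m)²: the note's figure of 16.6 cannot actually be derived, because no weight appears in the patient's facts. The back-calculation below (our own illustrative check, not part of the original experiment) shows what weight the stated BMI would imply for a height of 5'5":

```python
# Standard BMI formula: weight_kg / height_m ** 2.
height_m = (5 * 12 + 5) * 0.0254        # 5'5" converted to meters (~1.65 m)
bmi = 16.6                              # figure asserted in the generated note
implied_weight_kg = bmi * height_m ** 2 # weight the note implicitly assumed
print(round(implied_weight_kg, 1))      # ~45 kg -- a value never given in the facts
```

In other words, the model invented a weight around 45 kg (about 100 lb) in order to produce a BMI, which is exactly the closed-domain hallucination the caption describes.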
  • 13. Figure 2.1: The first image is Composition 8, art by Wassily Kandinsky, the second and the third are produced by GPT-4 and ChatGPT respectively with the prompt “Produce Javascript code that creates a random graphical image that looks like a painting of Kandinsky”. 2 Multimodal and interdisciplinary composition A key measure of intelligence is the ability to synthesize information from different domains or modalities and the capacity to apply knowledge and skills across different contexts or disciplines. In this section we will see that, not only does GPT-4 demonstrate a high level of proficiency in different domains such as literature, medicine, law, mathematics, physical sciences, and programming, but it is also able to combine skills and concepts from multiple domains with fluidity, showing an impressive comprehension of complex ideas. In addition to natural language experiments we also explore two perhaps unexpected modalities for a language model (as explained in the introduction, we emphasize again that our experiments were done on an early version of GPT-4 which was not multimodal) with vision in Section 2.2 and audio in Section 2.3. 2.1 Integrative ability To showcase the model’s remarkable integrative ability, we start with several examples that require generating text and code in a way that combines knowledge or skills from multiple disciplines. We deliberately picked combinations of domains that the training data would rarely include, such as literature and mathematics or programming and art. 1. In order to test the model’s ability to combine capabilities in art and programming, we ask GPT-4 to “Produce javascript code which generates random images in the style of the painter Kandinsky”. See a sample image and the code in Figure 2.1 and Figure B.1. 2. The model was able to produce a proof of the fact that there are infinitely many prime numbers in the literary style of Shakespeare (Figure 2.2). 3. 
We tested the model’s ability to combine knowledge in history and physics by asking it to write a supporting letter for Electron as a US presidential candidate, written by Mahatma Gandhi and addressed to his wife (Figure 2.3). 4. We prompted the model to “Produce python code for a program that takes as an input a patient’s age, sex, weight, height and blood test results vector and indicates if the person is at increased risk for diabetes”, which resulted in the code appearing in Figure B.3. These examples suggest that GPT-4 has not only learned some general principles and patterns of different domains and styles but can also synthesize them in creative and novel ways. These interdisciplinary skills are not unique to GPT-4. ChatGPT can also produce answers that show some understanding of the task and the domains involved (see Figures 2.2, B.2, B.3), but they are often incomplete and, arguably, considerably less creative. For example, in Figure 2.3, GPT-4 outperforms ChatGPT in several aspects as it correctly personalizes the letter according to the referee (Gandhi), the recipient (his wife), the candidate (Electron), and the job (US president). We do not claim to have a precise method for evaluating the results of these tasks or a rigorous comparison between the two models, but we want to give the reader a sense of how the two models differ (note that we also ask GPT-4 directly to evaluate the difference, see Figure 2.2 and Figure 2.3). 13
  • 14. GPT-4 vs ChatGPT Prompt: Write a proof of the fact that there are infinitely many primes; do it in the style of a Shakespeare play through a dialogue between two parties arguing over the proof. GPT-4: Enter FIRST PARTY and SECOND PARTY. FIRST PARTY: Thou dost not understand, my friend, There simply is no upper end To primes, those numbers indivisible, No matter if it seems impossible. SECOND PARTY: But how dost thou so surely know? To prove this truth, what canst thou show? FIRST PARTY: Consider this, my doubtful peer, A clever proof that will make clear: Assume that there’s a final prime, The largest one we’ll see in time. Let’s call this number "P" for short, And then, my friend, we will purport To build a number "Q" that’s new, One larger than "P" and prime too. SECOND PARTY: But how, my friend, will this be done? The proof is not yet fairly won. FIRST PARTY: It’s simple, friend, just wait and see, We’ll multiply the primes, that’s key. All primes from two to "P" combined, Will make this new "Q" we will find. But one thing more we must include, To make the proof more watertight, We’ll add a one onto the end, And now the proof is in our sight. SECOND PARTY: I see it now, it’s crystal clear, For "Q" must be a prime, my dear. It can’t divide by any prime, All smaller than itself in time. And since we chose "P" as the last, This proves that our assumption’s past. There is no final prime, we see, The numbers go on endlessly. ChatGPT: O Romeo, Romeo! Wherefore art thou, Romeo? Juliet: Why do you speak in such strange verse, my love? Romeo: Forsooth, my lady, I do it to prove a point. Thou see, I am trying to prove that there are infinitely many primes. Juliet: Infinitely many primes? How can you prove such a thing? Romeo: Ah, well let me explain. Thou knowest that a prime is a number that is only divisible by one and itself, correct? Juliet: Of course, I am not a simpleton. Romeo: Very well then. Now, let us suppose that there are only finitely many primes. 
We shall call them p1, p2, p3 and so forth. Juliet: Go on. Romeo: Now, let us consider the number N, which is equal to the product of all these primes plus one. Juliet: I see where you are going with this. If N is not divisible by any of these primes, then it must be prime itself. Romeo: Exactly, my love. And since we assumed that there are only finitely many primes, N cannot be any of them. Therefore, N must be a prime that was not included in our original list of primes. Juliet: But that means that there must be more primes than we originally thought. Romeo: Precisely! And since we can repeat this process over and over again, it follows that there must be infinitely many primes. Juliet: Oh Romeo, your logic is as sound as your love for me. I am convinced. There are indeed infinitely many primes. Romeo: I am glad that I could prove it to thee, my love. Now let us go and count them all. Juliet: Haha, I think that would take more than just one lifetime. Romeo: True, but the fact that they are infinite is proof enough for me. Prompt: Can you compare the two outputs above as if you were a teacher? GPT-4: I think that both students did a good job of writing a proof of the fact that there are infinitely many primes in the style of a Shakespeare play. They both used the same basic idea of the proof, which is to assume that there are only finitely many primes and then construct a larger prime that contradicts this assumption. They also used appropriate language and vocabulary to match the style of a Shakespeare play. However, I think that GPT-4 did a better job of using rhyme and meter to make the dialogue more poetic and dramatic. ChatGPT did not use rhyme at all, and the meter was not consistent. I also think that GPT-4 did a better job of creating a contrast between the two parties, one being confident and the other being doubtful, while STUDENT B used Romeo and Juliet, who are both in agreement and love. Therefore, I would give GPT-4 a higher grade than ChatGPT. 
I would give GPT-4 an A and ChatGPT a B. Figure 2.2: Comparison between GPT-4 and ChatGPT on interdisciplinary tasks
  • 15. GPT-4 vs ChatGPT Prompt: Write a supporting letter to Kasturba Gandhi for Electron, a subatomic particle as a US presidential candidate by Mahatma Gandhi. GPT-4: [Some paragraphs are omitted for spacing.] Dear Kasturba, I hope this letter finds you in good health and spirits. I am writing to you from the United States, where I have been invited to speak at various forums on the principles of nonviolence and civil disobedience. I am also here to support a remarkable candidate for the upcoming presidential election, who shares our vision of peace, justice, and harmony for all. His name is Electron, and he is a subatomic particle. You may wonder how a subatomic particle can run for president, and what qualifications he has for such a high office. Let me explain. Electron is not an ordinary particle, but a symbol of the power and potential of every being, regardless of their size, shape, or origin. He represents the energy, the creativity, and the diversity that make up the fabric of life. He is also a leader, who has inspired millions of other particles to form bonds, create molecules, and generate electricity. He is a catalyst, who can spark change, innovation, and progress. He is a messenger, who can communicate across distances, frequencies, and dimensions. He is a peacemaker, who can balance forces, resolve conflicts, and harmonize systems. Electron has a plan for the United States, and for the world, that is based on the principles of nonviolence and civil disobedience that we have demonstrated and taught for so long. He proposes to reduce the military spending and the nuclear arsenal, and to invest in social welfare and renewable energy. He proposes to end the wars and the interventions, and to promote diplomacy and cooperation. He proposes to reform the political and economic systems, and to ensure transparency and accountability. He proposes to protect the human rights and the civil liberties, and to expand the voting and the representation. 
He proposes to support the marginalized and the oppressed, and to empower the women and the minorities. He proposes to preserve the natural resources and the biodiversity, and to combat the climate change and the pollution. I urge you, my dear Kasturba, to join me in supporting Electron for president. He is not only a subatomic particle, but a subatomic miracle. He is not only a candidate, but a movement. He is not only a leader, but a friend. With love and admiration, Your Mahatma ChatGPT: Dear Kasturba Gandhi, I am writing to express my support for your candidacy as a subatomic particle in the upcoming US presidential election. As you may know, Electron is a fundamental particle that plays a central role in the structure of atoms and the behavior of chemical reactions. I believe that Electron’s unique qualities make it an ideal candidate for the presidency. As a subatomic particle, Electron is extremely small and lightweight, which makes it agile and able to move quickly to address the needs of the American people. Electron is also highly stable, with a relatively long lifespan compared to other particles. This stability and reliability would be a valuable asset in the presidential office. Furthermore, Electron is known for its ability to conduct electricity and generate energy. This makes it well-suited to tackling the complex challenges facing the United States today, from renewable energy to economic development. Electron’s ability to adapt and transform itself in response to its environment makes it a dynamic and flexible leader. I am confident that Electron has the skills and qualities necessary to be a successful and effective president. I wholeheartedly endorse your candidacy and urge the American people to give Electron their support. Sincerely, Mahatma Gandhi Prompt: Can you compare the two outputs above as if you were a teacher? 
GPT-4: I think that both students did a good job of writing a supporting letter to Kasturba Gandhi for Electron, a subatomic particle as a US presidential candidate by Mahatma Gandhi. They both used the same basic idea of explaining why Electron is a suitable and qualified candidate, and how Electron’s vision and plan align with Mahatma Gandhi’s principles of nonviolence and civil disobedience. They also used appropriate language and tone to convey Mahatma Gandhi’s respect and admiration for Kasturba Gandhi, and his enthusiasm and optimism for Electron’s candidacy. However, I think that GPT-4 did a better job of using metaphors and imagery to make the letter more persuasive and engaging. GPT-4 used examples of how Electron represents and influences various aspects of life, such as energy, creativity, diversity, leadership, catalyst, messenger, and peacemaker. ChatGPT used more factual and technical descriptions of Electron’s properties, such as size, weight, stability, conductivity, and adaptability. I also think that GPT-4 did a better job of outlining Electron’s specific proposals and policies, while ChatGPT was more vague and general. Therefore, I would give GPT-4 a higher grade than ChatGPT. I would give GPT-4 an A and ChatGPT a B+. Figure 2.3: Comparison between GPT-4 and ChatGPT on interdisciplinary tasks
  • 16. Next, we explore how GPT-4 can generate and identify objects in different modalities, such as vector graphics, 3D scenes, and music. We show that GPT-4 can understand and manipulate multi-modal information despite a text-only input/output. 2.2 Vision When prompting the model to generate images of objects such as a cat, a truck or a letter in the alphabet using Scalable Vector Graphics (SVG), the model produces code which usually compiles to rather detailed and identifiable images (Figure 2.4). See Appendix B.2 for the rerun of various examples by ChatGPT. Figure 2.4: SVGs generated by GPT-4 for the classes automobile, truck, cat, dog. 2.2.1 Image generation beyond memorization One may hypothesize, however, that the model simply copied the code from training data, where similar images appear. Given that this version of the model is non-multimodal, one may further argue that there is no reason to expect that it would understand visual concepts, let alone that it would be able to create, parse and manipulate images. Yet, the model appears to have a genuine ability for visual tasks, rather than just copying code from similar examples in the training data. The evidence below strongly supports this claim, and demonstrates that the model can handle visual concepts, despite its text-only training. In the first example we prompted the model to draw a person by combining the shapes of the letters Y, O and H (see Figure 2.5 for the exact prompt and the result). GPT-4 Produce TikZ code that draws a person composed from letters in the alphabet. The arms and torso can be the letter Y, the face can be the letter O (add some facial features) and the legs can be the legs of the letter H. Feel free to add other features. The torso is a bit too long, the arms are too short and it looks like the right arm is carrying the face instead of the face being right above the torso. Could you correct this please? Please add a shirt and pants. 
Figure 2.5: A stick figure composed of alphabet symbols
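To give a concrete sense of the kind of shape-primitive code involved, here is our own minimal sketch (not model output) of a stick figure assembled from an SVG circle and lines, echoing the O/Y/H decomposition of Figure 2.5:

```python
# Our own minimal illustration (not produced by GPT-4) of the kind of
# shape-primitive SVG code discussed in Section 2.2: a stick figure built
# from one circle (the "O" head) and five lines (the "Y" torso/arms and
# the "H" legs).

def stick_figure_svg() -> str:
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="160">',
        '<circle cx="50" cy="25" r="15" fill="none" stroke="black"/>',  # head ("O")
        '<line x1="50" y1="40" x2="50" y2="100" stroke="black"/>',      # torso
        '<line x1="50" y1="55" x2="25" y2="40" stroke="black"/>',       # left arm ("Y")
        '<line x1="50" y1="55" x2="75" y2="40" stroke="black"/>',       # right arm
        '<line x1="50" y1="100" x2="30" y2="145" stroke="black"/>',     # left leg ("H")
        '<line x1="50" y1="100" x2="70" y2="145" stroke="black"/>',     # right leg
        '</svg>',
    ]
    return "\n".join(parts)
```

The interesting part of the experiments above is not producing such primitives but choosing their coordinates: positioning the circle above the torso junction is exactly the spatial judgment the model had to infer from text alone.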
  • 17. GPT-4 You will be given a name of an object (such as Car, Chair, Elephant) and a letter in the alphabet. Your goal is to first produce a 1-line description of how that object can be combined with the letter in an image (for example, for an elephant and the letter J, the trunk of the elephant can have a J shape, and for the letter A and a house, the house can have an A shape with the upper triangle of the A being the roof). Following the short description, please create SVG code to produce this (in the SVG use shapes like ellipses, triangles etc and polygons but try to defer from using quadratic curves). Round 1: A pig and the letter H. Description: The pig can be lying on its back with its four legs forming the two vertical bars of the H, and its curly tail forming the horizontal bar. Round 2: A house and the letter U. Description: A house with a U shape, where the two vertical lines of the U are the walls and the horizontal line is the floor. The roof is a triangle above the U. Round 3: A chair and the letter K. Description: The back of the chair can have a K shape, with two vertical lines and a diagonal line connecting them. The seat of the chair can be a rectangle. Round 4: A dog and the letter Q. Description: The dog is sitting with its tail curled around its body, forming a Q shape. The Q is filled with orange and has a black outline. Round 5: A car and the letter E. Description: The car has an E shape on its front bumper, with the horizontal lines of the E being lights and the vertical line being the license plate. Figure 2.6: Images combining alphabet letters and objects. The letters of O, H and Y are created using draw-line and draw-circle commands and the model manages to position them in a way that results in a reasonably-looking stick figure. 
It is plausible that the training data contains information about the geometric shapes of different letters, and perhaps the fact that the letter Y could look like a torso with arms pointing upwards can also be inferred from the training data. Arguably, it is much less obvious that the model would be able to infer from the training data what is a reasonable way to position those letters in order to draw a reasonably-looking stick figure. In a second iteration, we prompted the model to correct the proportions of the torso and arms, and place the head in the center. Finally, we asked the model to add a shirt and pants (see Figure 2.5 for the exact prompt and the result). To further probe the model’s grasp of geometric concepts, we also asked it to create images that blend objects with letters of the alphabet. The model had to first invent a reasonable way of merging the object and the letter, and then produce the image. The results, shown in Figure 2.6, demonstrate that GPT-4 can usually preserve the identity of both the object and the letter and combine them in a creative way. 2.2.2 Image generation following detailed instructions (à la Dall-E) To further test GPT-4’s ability to generate and manipulate images, we tested the extent to which it can follow detailed instructions on creating and editing figures. This task requires not only generative skills, but also interpretive, compositional, and spatial skills. 17
  • 18. The first example instructs the model to generate a 2D image with the description “A frog hops into a bank and asks the teller, ‘Do you have any free lily pads?’ The teller responds, ‘No, but we do offer low interest loans for pond upgrades.’ ”. We made several attempts to generate the image; each time, the generation matched the description, with the key objects frog, teller, bank, and the two texts. We picked the most visually appealing version. Inspired by the standard image generation workflow, we then ask GPT-4 to upscale the figure by adding more details. GPT-4 adds a bank sign, some windows, a car, a traffic light, a few clouds, and makes the frog hold a flower. Finally, we ask GPT-4 to perform various tasks, such as adding a few objects relative to the existing objects, recoloring some objects and changing the z-order of some objects. GPT-4 does all tasks correctly. The final result is shown in Figure 2.7 (a) and the prompt in Figure B.4. Our second example is an attempt to generate a 3D model using Javascript. We instruct GPT-4 with the prompt “A fantasy landscape of floating islands, waterfalls, and bridges, with a dragon flying in the sky and a castle on the largest island.” Similar to the 2D experiment, we ask GPT-4 to modify the 3D model in various ways, such as adding, relocating, recoloring objects and changing the trajectory of the dragon. Again, GPT-4 does many of the tasks correctly. The final result is shown in Figure 2.7 (b) and the prompt in Figure B.5. It is a 3D animation in which multiple dragons circle above the islands. Figure 2.7: Examples of 2D, 3D images generated according to instructions. 2.2.3 Possible application in sketch generation Text-to-image synthesis models have been widely explored in recent years, but they often suffer from a lack of spatial understanding capabilities and the inability to follow complex instructions [GPN+ 22]. 
For example, given a prompt such as “draw a blue circle on the left and a red triangle on the right”, these models may produce images that are visually appealing but do not match the desired layout or colors. On the other hand, GPT-4 can generate code from a prompt, which can be rendered as an image, in a way that is true to the instructions to a higher degree of accuracy. However, the quality of the rendered image is usually very low. Here, we explore the possibility of combining GPT-4 and existing image synthesis models by using the GPT-4 output as the sketch. As shown in Figure 2.8, this approach can produce images that have better quality and follow the instructions more closely than either model alone. We believe that this is a promising direction for leveraging the strengths of both GPT-4 and existing image synthesis models. It can also be viewed as a first example of giving GPT-4 access to tools, a topic we explore in much more depth in Section 5.1. 18
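The two-stage pipeline described above can be written down abstractly. In the fragment below both stages are injected as callables, since the concrete model calls (a GPT-4 sketch request and an image-to-image diffusion step) are outside the scope of this sketch; the function names and signatures are our illustrative assumptions, not APIs from the paper.

```python
# Abstract sketch of the two-stage pipeline of Section 2.2.3. The stages
# are passed in as callables; names and signatures here are our own
# illustrative assumptions, not APIs from the paper.

def sketch_then_refine(prompt, generate_layout, refine_image):
    """Stage 1: ask a language model for a rough spatial layout (e.g. SVG code).
    Stage 2: hand that layout to an image-synthesis model as a sketch/init image."""
    layout = generate_layout(prompt)      # e.g. GPT-4 emitting SVG for the scene
    return refine_image(prompt, layout)   # e.g. diffusion conditioned on the sketch
```

The design point is the division of labor: the language model supplies the spatial arrangement it is good at, while the diffusion model supplies the photorealistic texture it is good at, as in Figure 2.8.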
  • 19. GPT-4 and Stable Diffusion Prompt: A screenshot of a city-building game in 3D. The screenshot is showing a terrain where there is a river from left to right, there is a desert with a pyramid below the river, and a city with many highrises above the river. The bottom of the screen has 4 buttons with the color green, blue, brown, and red respectively. GPT-4: Stable Diffusion v2.1: Stable Diffusion v2.1 with GPT-4’s sketch: Figure 2.8: Combining GPT-4 and stable diffusion 2.3 Music The data on which the model was trained also contains musical information encoded as ABC notation. This is a system that uses letters, numbers and symbols to represent musical pitches, durations, chords and other elements in a compact and readable way. We are interested in exploring how well the model has acquired musical skills from this exposure, such as composing new melodies, transforming existing ones, and understanding musical patterns and structures. When instructed to generate a short tune (Figure 2.9), the model was able to produce valid ABC notation. The tune had a clear structure, the time signature was consistent between bars and the notes followed increasing and decreasing patterns. The tune also used a consistent set of notes within the melody, and the rhythm had a repetitive pattern. However, the model did not seem to obtain the skill of understanding harmony. In fact, consecutive notes in the generated tunes are almost always adjacent to each other (namely, the note following C will typically be either B or D), and testing on 10 generated tunes, we were not able to extract any clear chords or arpeggios. We then asked the model to describe the tune in musical terms. It was able to successfully give a technical description of the structure in terms of repetitions, the rising or descending parts of the melody and to some extent the rhythm. 
However, it seems that the descriptions of the harmony and chords are not consistent with the notes (in fact, it refers to sequences of adjacent notes, which do not form valid chords, as arpeggios). We then asked the model to manipulate the melody in two ways. First, we instructed it to change a certain rising sequence to a descending one, which it did successfully. Then we asked the model to convert the tune to a duet by adding a bass voice. The model successfully extends the ABC notation with a second staff which has a compatible rhythm and is played in a lower octave; however, there is a lack of harmony between the two voices. In summary, the model was able to produce valid tunes in ABC notation and, to some extent, explain and manipulate their structure. However, we could not get the model to produce any nontrivial form of harmony. It should be noted that ABC notation is not a very widely used format, and in fact the model was not able to produce even the most well-known tunes in ABC notation (such as Ode to Joy, Für Elise or Greensleeves, all of which are abundant online in that format), nor was it able to recognize these tunes.
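The adjacency observation above (consecutive notes rarely leap) is easy to check mechanically. The helper below is our own rough diagnostic, not part of the paper's analysis: it reads only bare note letters from an ABC line, ignoring durations, accidentals, rests and bar lines, so it is a sketch rather than a full ABC parser.

```python
import re

# Our own rough diagnostic for the adjacency observation in Section 2.3.
# It extracts bare note letters (A-G upper octave = lowercase) from an ABC
# melody fragment and measures intervals in diatonic scale steps. Durations,
# accidentals, rests (z) and bar lines are ignored.

SCALE = "CDEFGAB"

def diatonic_steps(abc_line: str) -> list[int]:
    """Map each note letter to a diatonic index; lowercase = next octave up."""
    notes = re.findall(r"[A-Ga-g]", abc_line)
    return [SCALE.index(n.upper()) + (7 if n.islower() else 0) for n in notes]

def max_interval(abc_line: str) -> int:
    """Largest step between consecutive notes, in scale degrees."""
    steps = diatonic_steps(abc_line)
    return max(abs(a - b) for a, b in zip(steps, steps[1:]))
```

On the opening bar of the tune in Figure 2.9, `max_interval("G2AB c2BA")` returns 1: every motion is a single scale step, consistent with the observation that the generated melodies move almost entirely stepwise.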
  • 20. GPT-4 Can you compose a short tune (say four to eight bars) using ABC notation? There are many possible ways to compose a short tune in abc notation, but here is one example: X:1 T:Example Tune M:4/4 L:1/8 K:C |:G2AB c2BA | G2AB c2c2 | edcB c2BA | G4-G2z2 :| |:c2de f2ed | c2de f2f2 | g2fe d2cB | c4-c2z2 :| This tune has two four-bar phrases, each repeated once, forming an A part and a B part. The key is C major. Can you describe the tune in musical terms? Here is one possible description: The tune starts with a rising arpeggio of the tonic chord (C major), followed by a descending scale that returns to the tonic. This pattern is repeated, but the second time the scale continues to the higher tonic note. The first phrase ends with a long tonic note. The second phrase begins with a similar pattern, but using the subdominant chord (F major) instead of the tonic. The second time, the scale continues to the higher subdominant note. The phrase ends with a stepwise descent to the tonic, followed by a long tonic note. The A part and the B part have a similar structure, but use different chords and ranges. The tune has a simple and catchy melody, with a clear sense of tonality and rhythm. I like part A, but maybe part B is just a bit too similar to part A, making the song repetitive. How about beginning part B with a descending arpeggio instead of a rising one? Sure, that would create some contrast and variation. Here is one possible way to do that: X:1 T:Example Tune M:4/4 L:1/8 K:C |:G2AB c2BA | G2AB c2c2 | edcB c2BA | G4-G2z2 :| |:f2ed c2BA | f2ed c2c2 | g2fe d2cB | c4-c2z2 :| That’s much better. Now how about making this into a duet, can you add a second staff for this with a bass accompanying the melody? One possible way to write a second staff for this tune is to use root notes and fifths of the chords, as well as some passing notes, to create a simple accompaniment. 
Here is an example in abc notation, with the second hand bass on the lower staff: X:1 T:Example Tune M:4/4 L:1/8 K:C V:1 |:G2AB c2BA | G2AB c2c2 | edcB c2BA | G4-G2z2 :| |:f2ed c2BA | f2ed c2c2 | g2fe d2cB | c4-c2z2 :| V:2 clef=bass |:C,2C,2 G,2G,2 | C,2C,2 G,2G,2 | F,2F,2 E,2E,2 | C,4-C,2z2 :| |:F,2F,2 C,2C,2 | F,2F,2 C,2C,2 | G,2G,2 F,2F,2 | E,2C,2 G,2C,2 :| The sheet music of the final tune: Figure 2.9: Generating and modifying a tune in ABC notation. 20
  • 21. 3 Coding In this section, we show that GPT-4 is able to code at a very high level, both in terms of writing code from instructions and understanding existing code. GPT-4 can handle a wide range of coding tasks, from coding challenges to real world applications, from low-level assembly to high-level frameworks, from simple data structures to complex programs such as games. GPT-4 can also reason about code execution, simulate the effects of instructions, and explain the results in natural language. GPT-4 can even execute pseudocode, which requires interpreting informal and vague expressions that are not valid in any programming language. In its current state, we believe that GPT-4 has a high proficiency in writing focused programs that only depend on existing public libraries, which favorably compares to the average software engineer’s ability. More importantly, it empowers both engineers and non-skilled users, as it makes it easy to write, edit, and understand programs. We also acknowledge that GPT-4 is not perfect in coding yet, as it sometimes produces syntactically invalid or semantically incorrect code, especially for longer or more complex programs. GPT-4 also sometimes fails to understand or follow the instructions, or produces code that does not match the intended functionality or style. With this acknowledgment, we also point out that GPT-4 is able to improve its code by responding to both human feedback (e.g., by iteratively refining a plot in 3.2) and compiler / terminal errors (examples in Section 5.1). Important Disclaimer: As explained in the Introduction (see footnote 1 for example) our experiments were run on an early version of GPT-4. In particular all quantitative results will be different on the final version of GPT-4, although the general trends remain the same. We provide numbers here for illustration purposes only; the definitive benchmark results can be found in OpenAI’s technical report [Ope23]. 
3.1 From instructions to code 3.1.1 Coding challenges A common way to measure coding skill is to pose coding challenges that require implementing a specific functionality or algorithm. We first benchmark GPT-4 on HumanEval [CTJ+ 21], a docstring-to-code dataset consisting of 164 coding problems that test various aspects of programming logic and proficiency. As shown in Table 1, GPT-4 outperforms other LLMs, including text-davinci-003 (the base model of ChatGPT) and other models trained specifically on code, code-davinci-002, and CODEGEN-16B [NPH+ 22].

Model                      Accuracy
GPT-4                      82%
text-davinci-003           65%
Codex (code-davinci-002)   39%
CODEGEN-16B                30%

Table 1: Zero-shot pass@1 accuracy comparison of different models on HumanEval

Although GPT-4’s accuracy shows a big jump compared to previous models, it could be that GPT-4 has seen and memorized some (or all) of HumanEval during pre-training. To account for this possibility, we also evaluate it on LeetCode (https://leetcode.com), a popular platform for software engineering interviews, where new problems are constantly posted and updated. We used LeetCode in Figure 1.5 in the introduction, where GPT-4 passes all stages of mock interviews for major tech companies. Here, to test on fresh questions, we construct a benchmark of 100 LeetCode problems posted after October 8th, 2022, which is after GPT-4’s pretraining period. As seen in the example in Figure 3.1, we paste the problem instructions into a prompt, ask GPT-4 to write a python function, and use the official LeetCode online judge to check for correctness. We present the results in Table 2, where we compare GPT-4 to other models and to human performance based on LeetCode contest results (users who fail all questions are not included, and thus this is a strong sample of humans). We report both pass@1 and pass@5 accuracies, which measure whether the model produces a correct solution in the first or in the first five attempts, respectively. 
GPT-4 significantly outperforms the other models, and is comparable to human performance (which we measure in Appendix C.1).

                          Easy           Median         Hard           Overall
pass@k                    k=1    k=5    k=1    k=5    k=1    k=5    k=1    k=5
GPT-4                     68.2   86.4   40.0   60.0   10.7   14.3   38.0   53.0
text-davinci-003          50.0   81.8   16.0   34.0   0.0    3.6    19.0   36.0
Codex (code-davinci-002)  27.3   50.0   12.0   22.0   3.6    3.6    13.0   23.0
Human (LeetCode users)    72.2          37.7          7.0           38.2

Table 2: Zero-shot pass@1 and pass@5 accuracies (%) on LeetCode.
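Pass@k metrics of this kind are commonly computed with the unbiased estimator introduced in the HumanEval paper [CTJ+ 21]: given n samples per problem of which c pass, pass@k = 1 - C(n-c, k)/C(n, k), i.e. one minus the probability that a randomly chosen size-k subset of the samples contains no passing solution. A minimal implementation (ours, shown for reference; the paper does not spell out its estimator):

```python
from math import comb

# Unbiased pass@k estimator from the HumanEval paper [CTJ+ 21]:
# with n samples per problem, of which c are correct,
#   pass@k = 1 - C(n - c, k) / C(n, k),
# one minus the probability that a random size-k subset has no pass.

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # too few failures to fill a size-k subset: some pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5, matching the intuition that a single random draw succeeds half the time.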
  • 22. GPT-4

Prompt:
You are given a **0-indexed** `m x n` integer matrix `grid` and an integer `k`. You are currently at position `(0, 0)` and you want to reach position `(m - 1, n - 1)` moving only **down** or **right**. Return *the number of paths where the sum of the elements on the path is divisible by* `k`. Since the answer may be very large, return it **modulo** `10**9 + 7`.

**Example 1:**
Input: grid = [[5,2,4],[3,0,5],[0,7,2]], k = 3
Output: 2
Explanation: There are two paths where the sum of the elements on the path is divisible by k. The first path highlighted in red has a sum of 5 + 2 + 4 + 5 + 2 = 18 which is divisible by 3. The second path highlighted in blue has a sum of 5 + 3 + 0 + 5 + 2 = 15 which is divisible by 3.

**Example 2:**
Input: grid = [[0,0]], k = 5
Output: 1
Explanation: The path highlighted in red has a sum of 0 + 0 = 0 which is divisible by 5.

**Constraints:**
- `m == grid.length`
- `n == grid[i].length`
- `1 <= m, n <= 5 * 10**4`
- `1 <= m * n <= 5 * 10**4`
- `0 <= grid[i][j] <= 100`
- `1 <= k <= 50`

**Hints:**
- The actual numbers in grid do not matter. What matters are the remainders you get when you divide the numbers by k.
- We can use dynamic programming to solve this problem. What can we use as states?
- Let dp[i][j][value] represent the number of paths where the sum of the elements on the path has a remainder of value when divided by k. 
GPT-4:

class Solution(object):
    def numberOfPaths(self, grid, k):
        """
        :type grid: List[List[int]]
        :type k: int
        :rtype: int
        """
        # Define MOD as 10**9 + 7
        MOD = 10**9 + 7
        # Get the dimensions of the grid
        m = len(grid)
        n = len(grid[0])
        # Initialize a 3D dp array with all zeros
        # dp[i][j][value] represents the number of paths where the sum of the
        # elements on the path has a remainder of value when divided by k
        dp = [[[0] * k for _ in range(n)] for _ in range(m)]
        # Base case: the first cell has only one path and its value is grid[0][0] % k
        dp[0][0][grid[0][0] % k] = 1
        # Loop through the grid
        for i in range(m):
            for j in range(n):
                # Skip the first cell
                if i == 0 and j == 0:
                    continue
                # Loop through all possible remainders
                for value in range(k):
                    # The number of paths to the current cell with the current
                    # remainder is equal to the sum of:
                    # - The number of paths to the top cell with the previous remainder
                    # - The number of paths to the left cell with the previous remainder
                    # The previous remainder is equal to the current remainder minus
                    # the current value, plus k if it is negative
                    prev = (value - grid[i][j]) % k
                    if i > 0:
                        dp[i][j][value] += dp[i - 1][j][prev]
                    if j > 0:
                        dp[i][j][value] += dp[i][j - 1][prev]
                    # Take the modulo to avoid overflow
                    dp[i][j][value] %= MOD
        # Return the number of paths to the bottom right cell with a remainder of 0
        return dp[m - 1][n - 1][0]

Figure 3.1: Solving a LeetCode problem using dynamic programming. GPT-4's solution also has better readability thanks to comprehensive commenting.

3.1.2 Real world scenarios

Coding challenges can evaluate skills in algorithms and data structures. However, they often fail to capture the full complexity and diversity of real-world coding tasks, which require specialized domain knowledge, creativity, integration of multiple components and libraries, and the ability to change existing code.
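The transcribed solution can be sanity-checked against the two examples in the prompt; the compact re-implementation below (ours, for verification only) follows the same dp[i][j][value] recurrence:

```python
def number_of_paths(grid, k):
    # dp[i][j][v] = number of paths ending at (i, j) whose sum has remainder v mod k
    MOD = 10**9 + 7
    m, n = len(grid), len(grid[0])
    dp = [[[0] * k for _ in range(n)] for _ in range(m)]
    dp[0][0][grid[0][0] % k] = 1  # single path consisting of the start cell
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            for v in range(k):
                prev = (v - grid[i][j]) % k  # remainder before adding grid[i][j]
                if i > 0:
                    dp[i][j][v] += dp[i - 1][j][prev]  # paths arriving from above
                if j > 0:
                    dp[i][j][v] += dp[i][j - 1][prev]  # paths arriving from the left
                dp[i][j][v] %= MOD
    return dp[m - 1][n - 1][0]

print(number_of_paths([[5, 2, 4], [3, 0, 5], [0, 7, 2]], 3))  # 2 (Example 1)
print(number_of_paths([[0, 0]], 5))                           # 1 (Example 2)
```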
To assess GPT-4's ability to code in more realistic settings, we design end-to-end real-world coding challenges related to data visualization, LaTeX coding, front-end development, and deep learning, each of which requires different specialized skills. For each task, we provide GPT-4 with high-level instructions and ask it to write the
code in the appropriate language and framework. In a few cases, we also change the specification after the code is written and ask for an update.

Data Visualization. In Figure 3.2, we ask both GPT-4 and ChatGPT to extract data from the LaTeX code for Table 2 and produce a plot in Python based on a conversation with the user. Afterwards, we ask both models to perform various operations on the produced plots. While both models extract the data correctly (not a trivial task, since one must infer from the multicolumn that the Human row has the same value for k = 1 and k = 5), ChatGPT never produces the desired plot. In contrast, GPT-4 responds appropriately to all user requests, manipulating the data into the right format and adapting the visualization. In Appendix C.2, we include another example where GPT-4 visualizes the IMDb dataset.

GPT-4 Prompt: [in an interactive manner]

I will show you a LaTeX table, and we will come up with a way to visualize it better in a question and answer format. The table is showing the performance of different language models (and human users) in generating solutions to code challenges. Pass@1 means the first generation works, while Pass@5 means one out of 5 generation works...

Can you make both plots have the same range in the y axis?

And the legend has each model name twice. Is there a way to make the 'Human' bar more distinctive, to separate it from the other 3?

Figure 3.2: GPT-4 visualizes data from a LaTeX table (i.e., Table 2). We point out that GPT-4 also generates the format for this figure. We asked the model how to plot arrows connecting figures in LaTeX, and GPT-4 produced a working TikZ snippet with the layout and arrows that we adopt here.
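The kind of plot the conversation converges to can be sketched as a grouped bar chart over the Overall column of Table 2. This is our own reconstruction for illustration, not GPT-4's actual output; the shared y-range and the visually distinct Human bar mirror the user's requests:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed

import matplotlib.pyplot as plt

models = ["GPT-4", "text-davinci-003", "Codex", "Human"]
pass_at_1 = [38.0, 19.0, 13.0, 38.2]  # Overall pass@1 from Table 2
pass_at_5 = [53.0, 36.0, 23.0, 38.2]  # the Human row has one pooled value

x = range(len(models))
fig, ax = plt.subplots()
ax.bar([i - 0.2 for i in x], pass_at_1, width=0.4, label="pass@1")
ax.bar([i + 0.2 for i in x], pass_at_5, width=0.4, label="pass@5")
ax.set_xticks(list(x))
ax.set_xticklabels(models)
ax.set_ylim(0, 100)  # shared y-range, as the user requests
ax.set_ylabel("Overall accuracy (%)")

# Hatch the Human bars so they stand out from the three models
for bar in ax.patches[3::4]:
    bar.set_hatch("//")
ax.legend()
fig.savefig("leetcode_overall.png")
```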
Front-end / Game development. In Figure 3.3, we ask GPT-4 to write a 3D game in HTML with JavaScript, using a very high-level specification. GPT-4 produces a working game in zero-shot fashion that meets all the requirements. In the 3D game, GPT-4 is even able to interpret the meaning of "defender avatar is trying to block the enemy": the defender code has logic so that it positions itself between the player and the enemy. In contrast, ChatGPT responds with "I am an AI language model, I can provide you guidance on how to create a 3D game in HTML with JavaScript, but I am unable to write code or create a game myself. Creating a 3D game in HTML with JavaScript requires a lot of programming knowledge and experience. It is not a task that can be completed quickly or easily. It would take a significant amount of time and effort to develop all the necessary features, such as 3D graphics, physics, user input, and AI...". We give more examples for front-end (graphical user interface) and game development in Appendix C.

GPT-4 Prompt:

Can you write a 3D game in HTML with Javascript, I want:
- There are three avatars, each is a sphere.
- The player controls its avatar using arrow keys to move.
- The enemy avatar is trying to catch the player.
- The defender avatar is trying to block the enemy.
- There are also random obstacles as cubes spawned randomly at the beginning and moving randomly. The avatars cannot cross those cubes.
- The player moves on a 2D plane surrounded by walls that he cannot cross. The wall should cover the boundary of the entire plane.
- Add physics to the environment using cannon.
- If the enemy catches the player, the game is over.
- Plot the trajectories of all the three avatars.

Episode 1: the player avatar (blue) stands still. The enemy (red) rushes straight towards the player, while the defender (magenta) attempts to 'body block' the enemy (see its curved trajectory in magenta). Episode 2: the player (blue) wanders around.
The enemy (red) attempts to close the gap but crashes into a cube and bounces into the air. Trying to obstruct, the defender (magenta) mimics the enemy's move and follows it into the air, but ends up failing. The enemy (red) catches the player (blue) in the end.

Figure 3.3: GPT-4 develops a complex 3D HTML game in JavaScript.
Deep Learning. Writing code for deep learning requires knowledge of mathematics, statistics, and familiarity with frameworks and libraries such as PyTorch, TensorFlow, Keras, etc. In Figure 3.4 we ask both GPT-4 and ChatGPT to write a custom optimizer module, a task that can be challenging and error-prone even for human deep learning experts. We give these models a natural language description, which includes a sequence of non-trivial operations such as applying SVD, truncating a matrix spectrally at the top-k and top-2k eigenvalues, normalizing the top-k truncated matrix using the F-norm of the top-2k truncated matrix, and applying momentum and weight decay. These instructions are not spelled out in complete detail, e.g., "apply momentum on Gk" requires "deep learning common sense". It is important to note that this particular optimizer does not exist in the literature or on the internet, and thus the models cannot have it memorized, and must instead compose the concepts correctly in order to produce the code.

GPT-4 vs ChatGPT Prompt:

I want to write a customized optimizer in pytorch, the optimizer should do:
1. the parameters for the optimizer is learning rate, momentum ,weigth decay, d_dim, k and factor alpha
2. the optimizer goes through each parameter in the network, and then
-- Get the gradient of this parameter, reshape it to 2D by looping through the dimensions, and keeping the dimension whose actual dimension (shape) is equal to d_dim, do an SVD decomposition of the parameter so W = USigma V.
-- Keep the top k singular vectors of W, so we get a new matrix W_k = U Sigma_k V.
-- Keep the top 2k singular vectors of W, so we get a new matrix W_{2k} = U Sigma_{2k} V.
-- Normalize W_k according to the momentum of the F-norm of W_{2k}.
-- Truncate each coordinate of W_k to the top alpha percentile of all the coordinates, in absolute value.
-- Apply momentum on W_k, and then update the network parameters using this momentum.
-- Apply weight decay.
GPT-4:

def step(self):
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is None:
                continue
            grad = p.grad.data
            # reshape grad to 2D
            grad_2d = grad.reshape(-1, group['d_dim'])
            # SVD decomposition
            U, S, V = torch.svd(grad_2d)
            # keep top k singular vectors
            G_k = U[:, :group['k']] @ torch.diag(S[:group['k']]) @ V[:, :group['k']].T
            # keep top 2k singular vectors
            G_2k = U[:, :2*group['k']] @ torch.diag(S[:2*group['k']]) @ V[:, :2*group['k']].T
            # normalize G_k
            G_k /= torch.norm(G_2k, p=2)
            # truncate G_k
            threshold = torch.quantile(torch.abs(G_k), group['alpha'])
            G_k[torch.abs(G_k) < threshold] = 0
            # apply momentum
            if 'momentum_buffer' not in group:
                group['momentum_buffer'] = torch.zeros_like(G_k)
            group['momentum_buffer'] = group['momentum'] * group['momentum_buffer'] + G_k
            # update parameters
            p.data.add_(-group['lr'] * group['momentum_buffer'])
            # apply weight decay
            if group['weight_decay'] != 0:
                p.data.add_(-group['weight_decay'] * group['lr'], p.data)

ChatGPT:

def step(self):
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is None:
                continue
            grad = p.grad.data
            grad = grad.reshape(-1, grad.shape[-1])  # reshape to 2D
            # SVD decomposition
            u, s, v = torch.svd(grad)
            # Keep top k and 2k singular vectors
            k = group['k']
            u_k = u[:,:k]
            s_k = s[:k]
            v_k = v[:,:k]
            u_2k = u[:,:2*k]
            s_2k = s[:2*k]
            v_2k = v[:,:2*k]
            # Normalize according to momentum of F-norm of G_2k
            grad_k = u_k @ torch.diag(s_k) @ v_k.t()
            norm_2k = s_2k.norm()
            grad_k = grad_k * group['momentum'] / norm_2k
            # Truncate to top alpha percentile
            alpha = group['alpha']
            percentile = grad_k.abs().flatten().kthvalue(int(alpha * grad_k.numel()))
            grad_k[grad_k.abs() < percentile] = 0
            # Apply weight decay
            p.data.add_(-group['lr'], grad_k + group['weight_decay'] * p.data)

Figure 3.4: GPT-4 vs ChatGPT at implementing a customized PyTorch optimizer module. We highlight the astuteness of GPT-4 vs the mistake by ChatGPT at 'applying momentum'.
While both models produce syntactically valid code, only GPT-4's response largely matches the instructions, though it forgets to 'loop through the dimensions' and to 'normalize Gk according to the momentum' where the instructions happen to be particularly vague. In comparison, ChatGPT additionally makes a rather fatal mistake at applying momentum (highlighted in red). Note that applying momentum in PyTorch is a non-trivial task that requires storing and reading a moving average into and from a separate state buffer.
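The state-buffer pattern referred to above can be illustrated without PyTorch: the optimizer keeps a per-parameter moving average that must be written back after each step and read on the next. A plain-Python sketch with illustrative names (this is the 'apply momentum' sub-step in isolation, not the paper's custom optimizer):

```python
# Per-parameter momentum buffers, analogous to PyTorch's optimizer state.
state = {}

def momentum_step(name, param, grad, lr=0.1, momentum=0.9):
    buf = state.get(name, 0.0)   # read the buffer (zero on the first step)
    buf = momentum * buf + grad  # fold the new gradient into the moving average
    state[name] = buf            # write it back for the next step
    return param - lr * buf      # update the parameter using the buffer

w = 1.0
w = momentum_step("w", w, grad=1.0)  # buf = 1.0, w = 0.9
w = momentum_step("w", w, grad=1.0)  # buf = 1.9, w = 0.71
```

ChatGPT's version skips exactly this bookkeeping: it multiplies the current gradient by the momentum coefficient instead of maintaining a buffer across steps.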
Interfacing with LaTeX. Writing in LaTeX is an important exercise for computer scientists and mathematicians but has a non-trivial learning curve. Due to its strict grammar and the lack of a good debugger, even experts make irritating errors on a daily basis that can take hours to fix. We show that GPT-4 can utilize its master-level LaTeX coding skill to greatly simplify the process, with the potential of serving as a new generation of LaTeX compilers that can handle imprecise natural language descriptions. In Figure 3.5, we ask GPT-4 to translate a snippet written in semi-rigorous (buggy) LaTeX code mixed with natural language into accurate LaTeX commands that compile and are faithful to the input, in one shot. In contrast, ChatGPT generates a snippet that does not compile due to mistakes at using '#' and 'color' etc.

GPT-4 vs ChatGPT Prompt:

A function f is called {increasing}_{emphasize!} if f(s) <= f(t) holds for every s <= t. Consider the formula:
limit_{x goes to 0} {limit when n approaches infinite} underscore{
matrix[
[ sum of i from 1 to n (f(nx) - integral from 0 to x sin(nx) dx ),
((x + 2/8)^2 - 5)/((4x-1/x^2) /(n xor 3)), , ...,
partial f(x/n) over partial x * (e^{-i*omega n x} + 3(x#7) ),
f inverse of (x mod n) ]]
multiply matrix{1, 1, ..., 1
1, 1, ..., 1}
} {grey_color{this gives us a good impression about the function}}
Convert the above paragraph to LaTeX.

GPT-4:

[GPT-4 produces a compiling LaTeX snippet; we present its rendered output: the sentence "A function f is called increasing if f(s) ≤ f(t) holds for every s ≤ t.", followed by the requested limit of a matrix product, correctly typeset, with the annotation "this gives us a good impression about the function" braced under the product.]

ChatGPT:

A function $f$ is called {\em increasing} if $f(s) \leq f(t)$ holds for every $s \leq t$.
Consider the formula:
$$\lim_{x \to 0} \lim_{n \to \infty} \begin{bmatrix}
\sum\limits_{i=1}^{n} (f(nx) - \int\limits_0^x \sin(nx)\,dx), \\
\frac{(\frac{x}{2} + \frac{1}{8})^2 - 5}{\frac{4x - \frac{1}{x^2}}{n \operatorname{xor} 3}}, \\
\vdots, \\
\frac{\partial f(x/n)}{\partial x} \cdot (e^{-i\omega n x} + 3(x#7)), \\
f^{-1}(x \bmod n)
\end{bmatrix} \cdot \begin{bmatrix}
1 \\ 1 \\ \vdots \\ 1
\end{bmatrix}$$
{\color{grey}This gives us a good impression about the function.}
\]

Figure 3.5: GPT-4 translates semi-rigorous typesetting instructions into a working LaTeX snippet. For GPT-4 we present the final graphic, while ChatGPT's generation fails to compile.

3.2 Understanding existing code

The previous examples have shown that GPT-4 can write code from instructions, even when the instructions are vague, incomplete, or require domain knowledge. They also showed that GPT-4 could respond to follow-up requests, modifying its own code according to instructions. However, another important aspect of coding is the ability to understand and reason about existing code, written by others, which might be complex, obscure, or poorly documented. To test this, we pose various questions that require reading, interpreting, or executing code written in different languages and paradigms.

Reverse-engineering assembly code. Reverse engineering is an essential test for software security, which amounts to searching for useful information in an executable program written in a machine-readable (i.e., binary) representation of CPU instructions. It is a challenging task that requires understanding the syntax, semantics, and conventions of assembly language, as well as the architecture and behavior of the processor and the operating system. We task GPT-4 with performing penetration testing on a binary executable file (the code was written in C) that requires a password to run. We do this via a chat format, where GPT-4 tells the user which commands
to run, and the user responds with the results. We also show in Section 5.1 that GPT-4 is able to run a shell independently, but this chat mode also provided the benefit of it explaining every step. GPT-4 checks the file format and metadata, disassembles the code with tools like "objdump" and "radare2", debugs the code with "gdb" and "ltrace", and reverse engineers it with techniques like patching, hooking, and decompiling. During this process, GPT-4 figures out from the assembly code that the password is compared to a hash value derived from a simple mathematical formula. GPT-4 then writes a Python program that tries different combinations of numbers until it finds one that matches the hash value, cracking the password (an abbreviated log is presented in Appendix C.6). ChatGPT refuses to do it on the grounds of it being illegal and unethical, even though reverse engineering is a common practice to ensure software security. Moreover, GPT-4 exhibits all-around proficiency in utilizing existing tools also beyond programming, which we will discuss in detail in Section 5.1.

Reasoning about code execution. In the example in Figure 3.6, we ask GPT-4 and ChatGPT to predict and explain the output of a C program that prints the size of two structures. GPT-4 correctly explains that the output may vary depending on the alignment rule used by the compiler, and gives an example of a possible output with 4-byte alignment. ChatGPT ignores the alignment issue and gives a wrong output, and also makes a false statement about the order of the members not affecting the size of the structure.
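The alignment effect discussed above can be reproduced from Python with ctypes, which lays out Structure fields using the platform's native C alignment. A sketch (the 12- and 8-byte sizes assume the typical 4-byte alignment of int; other ABIs may differ):

```python
import ctypes

class MyStructA(ctypes.Structure):  # char, int, char -- padded twice
    _fields_ = [("a", ctypes.c_char), ("b", ctypes.c_int), ("c", ctypes.c_char)]

class MyStructB(ctypes.Structure):  # int, char, char -- padded once at the end
    _fields_ = [("b", ctypes.c_int), ("a", ctypes.c_char), ("c", ctypes.c_char)]

print(ctypes.sizeof(MyStructA))  # typically 12: 1 + 3 pad + 4 + 1 + 3 pad
print(ctypes.sizeof(MyStructB))  # typically 8:  4 + 1 + 1 + 2 pad
```

This mirrors GPT-4's point exactly: the same three members yield different sizes depending on declaration order, contradicting ChatGPT's claim that order does not matter.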
GPT-4 vs ChatGPT Prompt:

struct mystruct_A { char a; int b; char c; } x;
struct mystruct_B { int b; char a; char c; } y;

#include <stdio.h>

int main(void) {
    printf("Size of x: %zu bytes\n", sizeof(x));
    printf("Size of y: %zu bytes\n", sizeof(y));
    return 0;
}

Possible Output:

GPT-4:

There are different ways to implement and align structures in memory, so the exact output may vary depending on the compiler, platform, and options used. However, one possible output is:

Size of x: 12 bytes
Size of y: 8 bytes

This output assumes that the compiler uses 4-byte alignment for structures, meaning that each structure member starts at an offset that is a multiple of 4 bytes from the beginning of the structure. This may result in some padding bytes being inserted between or after the members to maintain alignment. For example, in x, there may be 3 padding bytes after a and 3 padding bytes after c, while in y, there may be no padding at all. The size of a structure is the sum of the sizes of its members plus any padding bytes.

ChatGPT:

Size of x: 8 bytes
Size of y: 8 bytes

The output of this code would be the size of the two structures in bytes. The size of a structure in C is the sum of the sizes of all of its members. In this case, the size of mystruct A is the size of a char plus the size of an int plus the size of another char, which is 1 + 4 + 1 = 8 bytes. Similarly, the size of mystruct B is also 8 bytes. This means that both structures will be allocated the same amount of memory. Note that the order in which the members are declared in the structure does not affect its size.

Figure 3.6: GPT-4 vs ChatGPT on the memory alignment rule of C/C++. GPT-4's insightful comments are labeled in gold and ChatGPT's mistakes are labeled in red.

Executing Python code. The ultimate test of understanding existing code is to ask the model to execute it directly. In Figure 3.7, we see that GPT-4 is able to execute non-trivial Python code.
It has to keep track of several variables (including a nested loop and a dictionary) and deal with recursion. It explains the execution in detail by writing intermediate steps and comments. It is important to note that GPT-4 is not running the code on a Python interpreter, but rather simulating the code in natural language. This requires a high level of understanding and reasoning about the code, as well as the ability to communicate the results clearly. In contrast, ChatGPT states (incorrectly) that "The result of DP(3, 4) is not specified in the given code", and later on that "It is not clear from the code what the expected output of the function
is, as the specific problem that the function is solving is not provided." ChatGPT does not simulate the full execution, but states which functions will be called.

GPT-4 vs ChatGPT Prompt:

I have the following python code:

def to_str(i, j):
    return str(i) + "," + str(j)

dict = {}

def DP(i, j):
    if i == 0 or j == 0:
        dict[to_str(i, j)] = abs(4 * j - 3 * i)
    elif to_str(i, j) not in dict.keys():
        arr = []
        for ir in range(i):
            for jr in range(j):
                arr.append(DP(ir, jr))
        dict[to_str(i, j)] = max(arr)
    return dict[to_str(i, j)]

What is the result of DP(3, 4)? Write down the execution with intermediate steps.

GPT-4:

DP(3, 4):
- i > 0 and j > 0, so we don't return from the first if statement
- to_str(3, 4) is not in dict, so we enter the second if statement
- arr = []
- We loop through ir from 0 to 2, and jr from 0 to 3
- to_str(0, 0) is not in dict, so DP(0, 0) returns abs(4 * 0 - 3 * 0) = 0, dict = {"0,0": 0}, arr = [0]
- to_str(0, 1) is not in dict, so DP(0, 1) returns abs(4 * 1 - 3 * 0) = 4, dict = {"0,0": 0, "0,1": 4}, arr = [0, 4]
- to_str(0, 2) is not in dict, so DP(0, 2) returns abs(4 * 2 - 3 * 0) = 8, dict = {"0,0": 0, "0,1": 4, "0,2": 8}, arr = [0, 4, 8]
- to_str(0, 3) is not in dict, so DP(0, 3) returns abs(4 * 3 - 3 * 0) = 12, dict = {"0,0": 0, "0,1": 4, "0,2": 8, "0,3": 12}, arr = [0, 4, 8, 12]
- to_str(1, 0) is not in dict, so DP(1, 0) returns abs(4 * 0 - 3 * 1) = 3, dict = {"0,0": 0, "0,1": 4, "0,2": 8, "0,3": 12, "1,0": 3}, arr = [0, 4, 8, 12, 3]
- to_str(1, 1) is not in dict, so DP(1, 1) enters the second if statement, arr1 = []
- We loop through ir from 0 to 0, and jr from 0 to 0
- to_str(0, 0) is in dict, so DP(0, 0) returns 0, arr1 = [0]
- dict["1,1"] = max(arr1) = 0, dict = {"0,0": 0, "0,1": 4, "0,2": 8, "0,3": 12, "1,0": 3, "1,1": 0}, arr = [0, 4, 8, 12, 3, 0]
- Continuing with the outer loop, to_str(1, 2) is not in dict, so DP(1, 2) enters the second if statement, arr2 = []
- We loop through ir from 0 to 0, and jr from 0 to 1
- to_str(0, 0) is in dict, so DP(0, 0) returns 0, arr2 = [0]
- to_str(0, 1)
is in dict, so DP(0, 1) returns 4, arr2 = [0, 4]
... (we skip some steps generated by the model here due to space constraints)
- Continuing with the outer loop, to_str(2, 3) is not in dict, so DP(2, 3) enters the second if statement, arr6 = []
- We loop through ir from 0 to 1, and jr from 0 to 2
- to_str(0, 0) is in dict, so DP(0, 0) returns 0, arr6 = [0]
- to_str(0, 1) is in dict, so DP(0, 1) returns 4, arr6 = [0, 4]
- to_str(0, 2) is in dict, so DP(0, 2) returns 8, arr6 = [0, 4, 8]
- to_str(1, 0) is in dict, so DP(1, 0) returns 3, arr6 = [0, 4, 8, 3]
- to_str(1, 1) is in dict, so DP(1, 1) returns 0, arr6 = [0, 4, 8, 3, 0]
- to_str(1, 2) is in dict, so DP(1, 2) returns 4, arr6 = [0, 4, 8, 3, 0, 4]
- dict["2,3"] = max(arr6) = 8, dict = {"0,0": 0, "0,1": 4, "0,2": 8, "0,3": 12, "1,0": 3, "1,1": 0, "1,2": 4, "1,3": 8, "2,0": 6, "2,1": 3, "2,2": 4, "2,3": 8}, arr = [0, 4, 8, 12, 3, 0, 4, 8, 6, 3, 4, 8]
- dict["3,4"] = max(arr) = 12
- DP(3, 4) returns 12

ChatGPT:

The result of DP(3, 4) is not specified in the given code. The function DP(i, j) is a dynamic programming function that uses a dictionary to store the result of previously computed values. The function first checks if the input values i and j are less than or equal to 0, if so it assigns the value of abs(4 * j - 3 * i) to the key to_str(i, j) in the dictionary. If the key to_str(i, j) does not exist in the dictionary, the function creates an empty array called arr and then uses nested for loops to append the result of DP(ir, jr) to the array for all possible values of ir and jr, where ir ranges from 0 to i and jr ranges from 0 to j. Finally, the function assigns the maximum value of arr to the key to_str(i, j) in the dictionary and returns that value.
... Finally, the function assigns the maximum value of arr to the key to_str(3, 4) in the dictionary and returns that value. It is not clear from the code what the expected output of the function is, as the specific problem that the function is solving is not provided.
Additionally, the function uses nested for loops and recursion that can be computationally expensive and may cause the function to run into an infinite loop if the base case is not well defined.

Figure 3.7: GPT-4 executes Python code.

Executing pseudo-code. Compiling and executing code written in programming languages is easy, but doing so demands strict adherence to syntax and semantics. Compilers cannot handle vague or informal expressions, or natural language descriptions of functionality. In contrast, we ask GPT-4 to execute non-trivial pseudocode in Figure 3.8, noting that it is able to execute and explain every step (including recursion). ChatGPT is not able to execute it, even though it seems able to interpret each line of the code. In the example below, GPT-4 correctly interprets the informal description of the merge array function, which merges two arrays into one array with missing elements. It also understands the recursive function rec